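I've been practicing Python web scraping lately. The post below, by an author in mainland China, does the job with regular expressions:
http://cuiqingcai.com/1001.html
Python爬虫实战四之抓取淘宝MM照片 (Python Crawler in Practice, Part 4: Scraping Taobao MM Photos) --> I swear I didn't pick this code to rework because of the title!
Then I saw 大數學堂 use BeautifulSoup to scrape PTT, so I decided to rewrite the mainland author's example the same way: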
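import requests
from bs4 import BeautifulSoup

# verify=False skips TLS certificate verification
res = requests.get('http://mm.taobao.com/json/request_top_list.htm', verify=False)
# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(res.text, 'html.parser')
for entry in soup.select('.list-item'):
    # The name link, then the first <strong> and first <span> of each entry
    print('%s %s %s' % (entry.select('.lady-name')[0].text,
                        entry.select('strong')[0].text,
                        entry.select('span')[0].text))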
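The result is shown in the screenshot below.
So easy! With the regular expression version I had to stare at it for ages before I understood what it was doing.
BeautifulSoup really is a great tool.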
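To try the selector logic without hitting the network, here is a minimal sketch that runs the same three selectors against a small inline HTML string. It assumes Python 3 with bs4 installed. The SAMPLE_HTML markup, its values, and the extract_entries helper are all made up for illustration; the real page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, shaped only to fit the selectors used in this post.
SAMPLE_HTML = """
<div class="list-item">
  <a class="lady-name" href="#">Alice</a>
  <strong>27</strong>
  <span>Hangzhou</span>
</div>
<div class="list-item">
  <a class="lady-name" href="#">Bella</a>
  <strong>25</strong>
  <span>Shanghai</span>
</div>
"""


def extract_entries(html):
    """Return one (name, strong_text, span_text) tuple per .list-item block."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for entry in soup.select(".list-item"):
        name = entry.select_one(".lady-name")
        strong = entry.select_one("strong")
        span = entry.select_one("span")
        # Skip blocks that are missing any of the three elements
        # instead of raising on a None result.
        if name is None or strong is None or span is None:
            continue
        rows.append((name.get_text(strip=True),
                     strong.get_text(strip=True),
                     span.get_text(strip=True)))
    return rows


for row in extract_entries(SAMPLE_HTML):
    print(" ".join(row))
```

Using select_one with a None check, rather than select(...)[0], is just one way to keep a single malformed entry from stopping the whole loop with an IndexError.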
Regular expressions with re work too, they're just a bit more hassle:
import urllib2
import re

url = 'http://mm.taobao.com/json/request_top_list.htm'
try:
    request = urllib2.Request(url)
    response = urllib2.urlopen(request)
    # Decode the raw response bytes as GBK before matching
    content = response.read().decode('gbk')
    # re.S lets "." match newlines, so the pattern can span a whole list-item block
    pattern = re.compile('<div class="list-item".*?<a class="lady-name.*?>(.*?)</a>.*?<strong>(.*?)</strong>.*?<span>(.*?)</span>', re.S)
    items = re.findall(pattern, content)
    for item in items:
        print('%s %s %s' % (item[0], item[1], item[2]))
except urllib2.URLError as e:
    # HTTPError carries a status code; a plain URLError only has a reason
    if hasattr(e, "code"):
        print(e.code)
    if hasattr(e, "reason"):
        print(e.reason)
The result is shown in the screenshot below.
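It took me a while to read that pattern, so here is a small offline sketch of how it behaves. It assumes Python 3 and uses only the standard library. The SAMPLE_HTML string and the extract_with_regex helper are hypothetical stand-ins, not the real page. The pattern itself is the one from the code above, split across lines for readability.

```python
import re

# Hypothetical markup, shaped only to fit the pattern used in this post.
SAMPLE_HTML = """
<div class="list-item">
  <a class="lady-name" href="#">Alice</a>
  <strong>27</strong>
  <span>Hangzhou</span>
</div>
<div class="list-item">
  <a class="lady-name" href="#">Bella</a>
  <strong>25</strong>
  <span>Shanghai</span>
</div>
"""

# Each ".*?" is non-greedy: it consumes as little as possible before the
# next literal, so a match stays inside one list-item block.
# Each "(.*?)" captures the text between a pair of tags.
PATTERN_TEXT = (
    r'<div class="list-item".*?'
    r'<a class="lady-name.*?>(.*?)</a>.*?'
    r'<strong>(.*?)</strong>.*?'
    r'<span>(.*?)</span>'
)


def extract_with_regex(html):
    """Return one tuple of the three captured groups per matched block."""
    # re.S makes "." match newlines too, which multi-line markup requires.
    matches = re.findall(PATTERN_TEXT, html, re.S)
    return [tuple(field.strip() for field in match) for match in matches]


for row in extract_with_regex(SAMPLE_HTML):
    print(" ".join(row))

# Without re.S the same pattern cannot cross the line breaks in the sample.
print(re.findall(PATTERN_TEXT, SAMPLE_HTML))
```

On this sample the last call finds nothing, because "." stops at every newline unless re.S is set.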
As my automatic control professor used to say: assume it works, and all roads lead to the same place!!!