Ches拔's Study Notes: Scraping the CNBC Technology Section with Python

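When I read tech news to keep up with what's new, flipping through it one page at a time feels far too slow.

I would rather use a crawler to grab all the headlines at once and then Ctrl+F for whatever I want to read.

For example, Ctrl+F for "google". Wouldn't that be faster!?

Since the idea came up, let's get it done in Python.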

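First install IPython Notebook, then install the BeautifulSoup package (the code below also imports requests).

Then take a look at the page...

Inspecting CNBC's technology page with the Chrome extension InfoLite shows that the news sits in two blocks.

The first block is inside #feature.

The second block is inside #pipeline.

The headlines themselves all turn out to be a elements nested inside div elements (div a).

So, after fetching the page with requests.get, loop over #pipeline and use select to pull out everything inside it that is an a element.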

import requests
from bs4 import BeautifulSoup

res = requests.get("http://www.cnbc.com/technology/")
soup = BeautifulSoup(res.text)
for item in soup.select('#pipeline'):
    print item.select('a')

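This scrapes every a element inside the block, HTML tags and all.

But I only want the text, so I appended [0].text after print item.select('a'). That way only the text of the first headline is printed.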
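To make the difference between select('a') and [0].text concrete, here is a minimal sketch run against a made-up HTML snippet instead of the live site. The SAMPLE_HTML markup and the link_texts helper are assumptions for illustration only; the real CNBC page may be structured differently. It is written with print() calls so it runs on Python 3, while the snippets in this post use Python 2 print statements.

```python
from bs4 import BeautifulSoup

# Made-up HTML that only imitates the structure described above (assumption).
SAMPLE_HTML = """
<div id="pipeline">
  <div class="headline"><a href="/one">First headline</a></div>
  <div class="headline"><a href="/two">Second headline</a></div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")


def link_texts(soup, container="#pipeline"):
    """Return the text of every <a> element inside the given container."""
    texts = []
    for block in soup.select(container):
        for link in block.select("a"):
            texts.append(link.get_text(strip=True))
    return texts


# select() returns a list of tags; [0].text is the text of the first one only.
first_only = soup.select("#pipeline")[0].select("a")[0].text
all_texts = link_texts(soup)

print(first_only)
print(all_texts)
```

Looping over the result of select('a') and reading the text of each element gives every headline, not just the first.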

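Later I noticed that when switching pages, the Request URL becomes http://www.cnbc.com/technology/?page=2

At that point I wrote the first loop. The URL is a string and the page number i is an integer, so I wrapped i in str() before appending it.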

import requests
from bs4 import BeautifulSoup

i = 1
while i < 10:
    res = requests.get("http://www.cnbc.com/technology/?page=" + str(i))
    soup = BeautifulSoup(res.text)
    for item in soup.select('#pipeline'):
        print item.select('.headline')[3].text
    i += 1
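The str() step can be checked on its own without touching the network. This is a small sketch, not part of the original crawler: BASE_URL and page_urls are names made up for illustration, and it uses Python 3 print() calls.

```python
BASE_URL = "http://www.cnbc.com/technology/?page="


def page_urls(first, last):
    """Build the URL for every page from first to last, inclusive."""
    return [BASE_URL + str(page) for page in range(first, last + 1)]


urls = page_urls(1, 3)
for url in urls:
    print(url)

# Without str() the concatenation fails: a string and an integer cannot be added.
try:
    BASE_URL + 2
except TypeError as error:
    print("TypeError:", error)
```

Building the list of URLs first also makes it easy to see exactly how many pages a loop will request.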

Then I wrote the second loop. Here is the finished version:

import requests
from bs4 import BeautifulSoup

i = 1
while i <= 10:  # I want to crawl 10 pages!!!
    res = requests.get("http://www.cnbc.com/technology/?page=" + str(i))
    soup = BeautifulSoup(res.text)
    for item in soup.select('#pipeline'):
        headlines = item.select('.headline')
        j = 0
        # each page has at most 20 headlines; stop early if there are fewer
        while j < min(20, len(headlines)):
            print headlines[j].text
            j += 1
    i += 1
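A counter-based loop has to stay within the number of headlines actually found, because indexing past the end of a list raises IndexError. One alternative, sketched here with made-up data rather than real scraped headlines, is to slice the list and iterate over it directly. The first_n helper is hypothetical, and the sketch uses Python 3 print() calls.

```python
def first_n(items, limit):
    """Return at most `limit` items; slicing never runs past the end of the list."""
    return items[:limit]


# Made-up stand-ins for the text of the scraped .headline elements.
headlines = ["Headline A", "Headline B", "Headline C"]

# Asking for 20 is safe even though only 3 items exist.
for text in first_n(headlines, 20):
    print(text)
```

The same slice works on the list that select('.headline') returns, so the inner counter j is not strictly needed.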

Next, save the headlines to a txt file.

import requests
from bs4 import BeautifulSoup

f = open(r'C:\crawler_test.txt', 'w')
i = 1
while i < 4:
    res = requests.get("http://www.cnbc.com/technology/?page=" + str(i))
    soup = BeautifulSoup(res.text)
    for item in soup.select('#pipeline'):
        headlines = item.select('.headline')
        j = 0
        while j < min(10, len(headlines)):
            data = headlines[j].text
            print data  # also print to the screen for testing
            f.write(data + '\n')
            j += 1
    i += 1
f.close()

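Something was off. Why did it raise an error?

After asking the almighty Google, I learned that other people solve this by encoding the text as UTF-8.

So the complete version looks like this, crawling 20 pages in one go: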

import requests
from bs4 import BeautifulSoup

f = open(r'C:\crawler_test.txt', 'w')
i = 1
while i < 21:
    res = requests.get("http://www.cnbc.com/technology/?page=" + str(i))
    soup = BeautifulSoup(res.text)
    for item in soup.select('#pipeline'):
        headlines = item.select('.headline')
        j = 0
        while j < min(10, len(headlines)):
            data = headlines[j].text
            print data  # print to the screen to confirm there is output
            f.write(data.encode("utf-8") + '\n')  # must be encoded as UTF-8, otherwise the write fails
            j += 1
    i += 1
f.close()
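The post does not show the error itself. A likely cause is that Python 2 implicitly encodes unicode text as ASCII when writing to a file opened with plain open(), which fails on characters such as a curly apostrophe. As an alternative to calling encode("utf-8") on every line, the sketch below opens the file with an explicit encoding through io.open, which exists on both Python 2.6+ and Python 3. The save_headlines helper, the sample headlines, and the temp-file path are assumptions for illustration, not the author's code.

```python
import io
import os
import tempfile


def save_headlines(headlines, path):
    """Write one headline per line; the file object handles the UTF-8 encoding."""
    with io.open(path, "w", encoding="utf-8") as handle:
        for text in headlines:
            handle.write(text + u"\n")


# Made-up headlines; the first contains a non-ASCII curly apostrophe (U+2019).
sample = [u"Google\u2019s new chip", u"Apple earnings beat"]

# Hypothetical output location in the system temp directory.
path = os.path.join(tempfile.gettempdir(), "crawler_test_sketch.txt")
save_headlines(sample, path)

# Read the file back with the same encoding to confirm nothing was lost.
with io.open(path, encoding="utf-8") as handle:
    saved = handle.read().splitlines()
print(saved)
```

The with block also closes the file automatically, so a separate f.close() is not needed.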

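Next, open crawler_test.txt and Ctrl+F for the Google headlines.

But this semi-manual Ctrl+F turns out to be a bit silly, so:

1. I don't want the weekday and time that trail each headline. This needs to be handled during crawling (or some other way??)

2. Think about whether hadoop/spark could help pick out the Google headlines (or some other way??)

3. Written this way, the news in #feature is not crawled. What I really want is for the computer to find the headlines I care about automatically, so the whole program needs reworking.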
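For the first two to-do items, plain Python may already be enough before reaching for hadoop/spark. The sketch below is only a guess at one approach: the regular expression assumes the trailing stamp looks like "Tuesday, 17 Nov 2015 | 9:32 AM ET", which the post does not confirm, and clean, find, and the sample lines are all made up for illustration. It uses Python 3 print() calls.

```python
import re

# Assumption: the unwanted suffix is "<Weekday>, <day> <Mon> <year>" plus
# whatever follows it (for example a time) up to the end of the line.
TRAILING_DATE = re.compile(
    r"\s*(?:Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday)"
    r",\s+\d{1,2}\s+[A-Za-z]{3}\s+\d{4}.*$"
)


def clean(line):
    """Remove the assumed trailing weekday/date/time stamp from one line."""
    return TRAILING_DATE.sub("", line).strip()


def find(lines, keyword):
    """Return the cleaned headlines that contain keyword, ignoring case."""
    keyword = keyword.lower()
    cleaned = [clean(line) for line in lines]
    return [text for text in cleaned if keyword in text.lower()]


# Made-up lines imitating what crawler_test.txt might contain.
lines = [
    "Google unveils new phone Tuesday, 17 Nov 2015 | 9:32 AM ET",
    "Apple hits record high Monday, 16 Nov 2015 | 4:05 PM ET",
    "Why Black Friday matters to Google Sunday, 15 Nov 2015 | 1:00 PM ET",
]

matches = find(lines, "google")
for text in matches:
    print(text)
```

Requiring a comma and a date after the weekday keeps a headline that merely mentions "Friday" from being cut short. The real stamp format should be checked against the actual output before relying on this pattern.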

Tuesday, November 17, 2015
