Python Crawler Basics

Dec 2, 2014 1 min read dashayu.tk

This post was converted from the original HTML to Markdown by a tool, so the formatting may be off.

Basic fetching

import urllib
# Python 2: fetch the page and read the raw response body
content = urllib.urlopen('http://www.x.com').read()
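The call above is Python 2 only. As a rough sketch, the same fetch on Python 3 could go through `urllib.request` and decode the bytes into text. The helper name `fetch`, the 10-second timeout, and the UTF-8 fallback are assumptions made for illustration, not part of the original post.

```python
import urllib.request


def fetch(url, timeout=10, fallback_charset='utf-8'):
    """Fetch url and return the body decoded to text (hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        raw = resp.read()
        # Use the charset from the Content-Type header when the server sends one
        charset = resp.headers.get_content_charset() or fallback_charset
    # Undecodable bytes are replaced instead of raising an exception
    return raw.decode(charset, errors='replace')
```

It would be called as `content = fetch('http://www.x.com')`.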

Using a proxy server

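import urllib2
# Route HTTP requests through the proxy at host:port
proxy = urllib2.ProxyHandler({'http': 'http://host:port'})
opener = urllib2.build_opener(proxy, urllib2.HTTPHandler)
# Install globally so that plain urllib2.urlopen() calls use the proxy
urllib2.install_opener(opener)
content = urllib2.urlopen('http://www.xxxx.com').read()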
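`install_opener` changes the behaviour of every later `urlopen` call in the process. If only some requests should go through the proxy, one option is to keep the opener local and call it directly. Below is a minimal Python 3 sketch of that idea. The helper name `make_proxy_opener` and the address `127.0.0.1:8080` are placeholders, not values from the original post.

```python
import urllib.request


def make_proxy_opener(proxy_url):
    """Return an opener that sends HTTP and HTTPS traffic through proxy_url."""
    proxy = urllib.request.ProxyHandler({'http': proxy_url, 'https': proxy_url})
    # No install_opener() here: only calls made through this opener use the proxy
    return urllib.request.build_opener(proxy)


# 127.0.0.1:8080 is a placeholder; substitute a real proxy address
opener = make_proxy_opener('http://127.0.0.1:8080')
# content = opener.open('http://www.xxxx.com', timeout=10).read()
```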

Handling cookies

import urllib2, cookielib
# Store cookies from responses in an in-memory CookieJar and send them back
cookie = urllib2.HTTPCookieProcessor(cookielib.CookieJar())
opener = urllib2.build_opener(cookie, urllib2.HTTPHandler)
urllib2.install_opener(opener)
content = urllib2.urlopen('http://www.xxx.com').read()
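In the CookieJar snippet above the jar is created inline, so there is no handle left to inspect what the server set. A possible Python 3 variant keeps a reference to the jar. Note that `cookielib` was renamed to `http.cookiejar` in Python 3. The helper name `make_cookie_opener` is made up for this sketch.

```python
import http.cookiejar
import urllib.request


def make_cookie_opener():
    """Return (opener, jar); cookies set by responses are stored in jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


opener, jar = make_cookie_opener()
# opener.open('http://www.xxx.com').read()
# for cookie in jar:
#     print(cookie.name, cookie.value)
```

Requests made through the same `opener` share the jar, which is what keeps a login session alive across calls.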

POSTing data

Say we need to POST the data name='liluo', age='21', blog='http://liluo.org' to the http://www.xxx.com/post/ endpoint.

First, prepare the data:

import urllib
# Encode the form fields as application/x-www-form-urlencoded
data = urllib.urlencode({
    'name': 'liluo',
    'age' : '21',
    'blog': 'http://liluo.org'
})

Then build and send the HTTP request:

import urllib2
# Passing data makes urllib2 send a POST instead of a GET
req = urllib2.Request(url='http://www.xxx.com/post/', data=data)
ret = urllib2.urlopen(req).read()
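On Python 3 the two steps look much the same, with one catch: the request body has to be bytes, so the urlencoded string needs an explicit `.encode()`. A minimal sketch, where `build_post_request` is a hypothetical helper name:

```python
import urllib.parse
import urllib.request


def build_post_request(url, fields):
    """Build a POST request carrying form-encoded fields (hypothetical helper)."""
    # On Python 3 the request body must be bytes, so encode after urlencode()
    body = urllib.parse.urlencode(fields).encode('ascii')
    return urllib.request.Request(url, data=body)


req = build_post_request('http://www.xxx.com/post/', {
    'name': 'liluo',
    'age': '21',
    'blog': 'http://liluo.org',
})
# ret = urllib.request.urlopen(req).read()
```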

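Masquerading as a browser

Many sites dislike crawlers (Qiushibaike, for example) and reject their requests. In that case we can modify the HTTP headers to masquerade as a browser: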

import urllib2
# Send a browser-like User-Agent instead of the default Python-urllib one
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.168 Safari/535.19'
}
req = urllib2.Request(
    url = 'http://www.xxx.com',
    headers = headers
)
ret = urllib2.urlopen(req).read()
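Passing `headers` to every `Request` gets tedious once a crawler makes many calls. One alternative, sketched here for Python 3, is to set the default headers on an opener once through its `addheaders` attribute. The helper name `make_browser_opener` is an assumption for this example. The User-Agent string is the same one used above.

```python
import urllib.request

# Same User-Agent string as in the snippet above
USER_AGENT = ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) '
              'AppleWebKit/535.19 (KHTML, like Gecko) '
              'Chrome/18.0.1025.168 Safari/535.19')


def make_browser_opener(user_agent=USER_AGENT):
    """Return an opener whose requests carry user_agent by default."""
    opener = urllib.request.build_opener()
    # addheaders replaces the default "Python-urllib/x.y" User-Agent
    opener.addheaders = [('User-Agent', user_agent)]
    return opener


opener = make_browser_opener()
# ret = opener.open('http://www.xxx.com').read()
```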

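Bypassing hotlink protection

Some sites (Qiushibaike again) put so-called hotlink protection on their images. All it does is check whether the Referer in the HTTP request headers points to the site itself, so we only need to change the headers: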

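import urllib2
# The Referer must point at the site that hosts the images
headers = {'Referer': 'http://www.qiushibaike.com'}
req = urllib2.Request(
    url = 'http://qiushibaike.com/',
    headers = headers
)
ret = urllib2.urlopen(req).read()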
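In practice the protected resource is usually an image, and the Referer is the page that embeds it. A small Python 3 sketch of building such a request follows. The helper name `build_image_request` and the image path are invented for illustration, and whether a given site accepts the request depends on its own checks.

```python
import urllib.request


def build_image_request(image_url, page_url):
    """Build a request for image_url that claims to come from page_url."""
    return urllib.request.Request(image_url, headers={'Referer': page_url})


# The image path is a made-up placeholder, not a real URL from the site
req = build_image_request('http://qiushibaike.com/placeholder.jpg',
                          'http://qiushibaike.com/')
# with open('placeholder.jpg', 'wb') as f:
#     f.write(urllib.request.urlopen(req).read())
```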

huiren