Note:
This article follows the classic dmoz spider example from the Scrapy tutorial. That example is quite old, and the page structure of dmoz.org has changed since then, so the XPath expressions here have been updated accordingly.
Overview:
This article presents three introductory Scrapy scenarios:
- Crawl a single page.
- Crawl all the pages linked from a directory page.
- Crawl the first page, then follow its link to the next page, and so on until the end.
Scenarios 2 and 3 can both be considered link following (following links).
The defining feature of link following is that, at the end of the parse method, you must yield an instance of the Request class with a callback function.
Environment: Windows 7 (64-bit) + Python 3.5 (64-bit) + Scrapy 1.2
Scenario 1
Description:
Crawl the content of a single page.
Example code:
import scrapy
from tutorial.items import DmozItem

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        for div in response.xpath('//div[@class="title-and-desc"]'):
            item = DmozItem()
            item['title'] = div.xpath('a/div/text()').extract_first().strip()
            item['link'] = div.xpath('a/@href').extract_first()
            item['desc'] = div.xpath('div[@class="site-descr "]/text()').extract_first().strip()
            yield item
Scenario 2
Description:
- ① Enter the directory page and extract the links.
- ② Crawl the content of each page those links point to.
In ①, the callback of the yielded scrapy.Request points to ②.
As the Scrapy tutorial puts it:
...extract the links for the pages you are interested, follow them and then extract the data you want for all of them.
Example code:
import scrapy
from tutorial.items import DmozItem

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        'http://www.dmoz.org/Computers/Programming/Languages/Python/'  # this is the directory page
    ]

    def parse(self, response):
        for a in response.xpath('//section[@id="subcategories-section"]//div[@class="cat-item"]/a'):
            url = response.urljoin(a.xpath('@href').extract_first().split('/')[-2])
            yield scrapy.Request(url, callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        for div in response.xpath('//div[@class="title-and-desc"]'):
            item = DmozItem()
            item['title'] = div.xpath('a/div/text()').extract_first().strip()
            item['link'] = div.xpath('a/@href').extract_first()
            item['desc'] = div.xpath('div[@class="site-descr "]/text()').extract_first().strip()
            yield item
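The expression response.urljoin(... .split('/')[-2]) deserves a note: it takes the last path segment of an href such as /Computers/Programming/Languages/Python/Books/ and resolves it against the directory URL. response.urljoin behaves like the standard library's urllib.parse.urljoin applied against response.url, so the step can be checked offline (the href below is a hypothetical subcategory link):

```python
from urllib.parse import urljoin

# response.url of the directory page in the example above.
base = 'http://www.dmoz.org/Computers/Programming/Languages/Python/'
# A hypothetical href extracted from a subcategory <a> element.
href = '/Computers/Programming/Languages/Python/Books/'

segment = href.split('/')[-2]  # last non-empty path segment: 'Books'
url = urljoin(base, segment)   # resolved against the directory URL
print(url)                     # http://www.dmoz.org/Computers/Programming/Languages/Python/Books
```

Because base ends with a slash, urljoin simply appends the segment; passing the full href to urljoin directly would also work and is less fragile.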
Scenario 3
Description:
- ① Enter a page, crawl its content, and extract the link to the next page.
- ② Crawl the content of the page the next-page link points to.
In ①, the callback of the yielded scrapy.Request points back to ① itself.
As the Scrapy documentation puts it:
A common pattern is a callback method that extracts some items, looks for a link to follow to the next page and then yields a Request with the same callback for it
Example code:
import scrapy
from myproject.items import MyItem

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/1.html',
        'http://www.example.com/2.html',
        'http://www.example.com/3.html',
    ]

    def parse(self, response):
        for h3 in response.xpath('//h3').extract():
            yield MyItem(title=h3)

        for url in response.xpath('//a/@href').extract():
            yield scrapy.Request(url, callback=self.parse)
Note:
The third scenario has not been tested!
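One caveat for this untested scenario: //a/@href frequently returns relative URLs, and scrapy.Request expects absolute ones, so the hrefs should be resolved with response.urljoin(url) before being yielded. response.urljoin is equivalent to the standard library's urllib.parse.urljoin applied against response.url, which the following standalone sketch (with made-up URLs) illustrates:

```python
from urllib.parse import urljoin

# Stand-in for response.url: the page the links were extracted from.
response_url = 'http://www.example.com/1.html'
# Hypothetical hrefs as they might come out of //a/@href.
hrefs = ['2.html', '/3.html', 'http://www.example.com/4.html']

# What response.urljoin(href) would produce for each.
absolute = [urljoin(response_url, href) for href in hrefs]
for url in absolute:
    print(url)
```

Relative and root-relative hrefs are resolved against the page URL, while already-absolute hrefs pass through unchanged.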