Steps and Methods for Crawling Web Pages in Python with CrawlSpider

Published: 2023-12-23 20:36:06

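Crawling web pages with the CrawlSpider class involves the following steps:

Step 1: Import the required modules

First, import the relevant parts of Scrapy: the CrawlSpider class, the Rule class, and the LinkExtractor class.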

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

Step 2: Create the spider class

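Next, create a spider class that inherits from CrawlSpider and define the spider's name, the domains it is allowed to crawl, and its start URLs. Other attributes can be set as well. Note that Scrapy does not read a plain headers attribute on the spider class, so a custom User-Agent is set here through custom_settings (the spider's user_agent attribute works too).

class MySpider(CrawlSpider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']
    # Per-spider settings; USER_AGENT sets the User-Agent header sent with each request
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36',
    }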

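Step 3: Define the rules

Inside the spider class, define rules that tell the spider which links to follow and how to parse the pages it reaches. Each rule combines two components: a LinkExtractor and a Rule.

LinkExtractor defines how links are extracted from a page: which links are allowed and which are denied. These rules are set with its allow, deny, and restrict_xpaths parameters: allow and deny take regular expressions matched against each link's URL, and restrict_xpaths limits extraction to the page regions selected by the given XPath expressions.

Rule defines how extracted links are followed and how the resulting pages are parsed. Its link_extractor parameter takes the LinkExtractor object, callback names the function that parses each response downloaded from an extracted link, and follow controls whether links are extracted in turn from those responses. If follow is omitted, it defaults to True when no callback is given and to False otherwise.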

    rules = (
        Rule(LinkExtractor(allow=()), callback='parse_item', follow=True),
    )

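Step 4: Write the parse callback

Next, write the callback that processes the data extracted from each page. By convention its name starts with parse_, it receives a response object as its argument, and it uses XPath or CSS selectors to pull the required data out of the response. Do not name the callback parse: CrawlSpider uses the parse method internally to apply its rules, and overriding it stops the spider from working.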

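    def parse_item(self, response):
        # Extract data with XPath; .get() returns the first match, or None if nothing matches
        title = response.xpath('//h1/text()').get()
        content = response.xpath('//p/text()').get()

        # Alternatively, extract the same data with CSS selectors.
        # These assignments overwrite the XPath results above; keep only one style in real code.
        title = response.css('h1::text').get()
        content = response.css('p::text').get()

        # Process the extracted data
        # ...

        # Yield the extracted data as an item
        yield {
            'title': title,
            'content': content
        }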

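Step 5: Run the spider

Finally, run the spider from the command line inside the Scrapy project directory, using the scrapy crawl command followed by the spider's name.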

$ scrapy crawl example

Usage example:

The following simple example shows how to use the CrawlSpider class to crawl the title and content of web pages.

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class MySpider(CrawlSpider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']
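    # Per-spider settings; USER_AGENT sets the User-Agent header sent with each request
    # (Scrapy does not read a plain `headers` class attribute)
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36',
    }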

    rules = (
        Rule(LinkExtractor(allow=()), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        title = response.xpath('//h1/text()').get()
        content = response.xpath('//p/text()').get()
        
        yield {
            'title': title,
            'content': content
        }

Run the following command on the command line to start the spider:

$ scrapy crawl example

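The spider follows links within example.com and extracts the title and the first paragraph of text from each page it visits. By default, Scrapy writes each extracted item to the log; to save the items to a file, add the -o option, for example scrapy crawl example -o items.json.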