Writing a Simple Web Crawler in Python

Published: 2023-12-04 19:15:58

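A crawler is a program that retrieves information from web pages. It can fetch the data on a page, analyze the page's structure, extract the required information, and then process and store it.

Taking Python as an example, a simple crawler program looks like this: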

import requests
from bs4 import BeautifulSoup

# Set the request headers so the request looks like it comes from a browser
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}

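# Send an HTTP GET request and return the page content as text
def get_html(url):
    try:
        # A timeout keeps the call from hanging if the server never responds
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
        # Use the encoding detected from the body, since the header may be missing or wrong
        response.encoding = response.apparent_encoding
        return response.text
    except requests.RequestException as e:
        print('Failed to request the page:', e)
        return None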

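# Parse the page content and extract the required information
def parse_html(html):
    try:
        soup = BeautifulSoup(html, 'html.parser')
        # Using the Baidu home page as an example, extract the name attribute
        # of the search box and the value attribute of the button
        search_input = soup.find('input', {'id': 'kw'}).get('name')
        search_button = soup.find('input', {'type': 'submit'}).get('value')
        return search_input, search_button
    except AttributeError as e:
        # find() returns None when no tag matches, so .get() raises AttributeError
        print('Failed to parse the page:', e)
        return None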

# Main function: call the functions above to run the whole crawl
def main():
    url = 'https://www.baidu.com'
    html = get_html(url)
    if html:
        result = parse_html(html)
        if result:
            print('name attribute of the search box:', result[0])
            print('value attribute of the button:', result[1])

if __name__ == '__main__':
    main()

The code above uses two third-party libraries: requests sends HTTP requests, and BeautifulSoup (imported from the bs4 package) parses HTML.

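First, the headers dictionary sets a User-Agent so that the request looks like it comes from a browser, and the get_html function sends an HTTP GET request to fetch the page content. Its url parameter is the address of the page to crawl.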
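
To check which headers a request will carry before anything goes over the network, the request can be built and prepared without being sent. The following is a minimal sketch; the shortened User-Agent string is only a placeholder for illustration.

```python
import requests

# Placeholder header for illustration; any User-Agent string works the same way.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}

# Build the request object and prepare it, but do not send it.
request = requests.Request('GET', 'https://www.baidu.com', headers=headers)
prepared = request.prepare()

# The prepared request exposes the method and headers that would be sent.
print(prepared.method)
print(prepared.headers['User-Agent'])
```

Nothing is transmitted here, so this only confirms how the request is constructed, not how the server responds.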

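Then BeautifulSoup parses the page content and extracts the required information. Taking the Baidu home page as an example, the find method locates the input tag of the search box and reads its name attribute; find is then used again to locate the input tag of the button and read its value attribute.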
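
Because find returns None when no tag matches, it can help to check the result before reading an attribute. The sketch below runs on a small inline HTML string instead of a live page; the markup and the helper name get_attribute are assumptions made for illustration, not the real structure of any site.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that imitates a search form; a real page may differ.
SAMPLE_HTML = """
<form>
  <input id="kw" name="wd" type="text">
  <input id="su" type="submit" value="Search">
</form>
"""

def get_attribute(soup, tag, attrs, attribute):
    """Return one attribute of the first matching tag, or None if no tag matches."""
    element = soup.find(tag, attrs)
    if element is None:
        return None
    return element.get(attribute)

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
search_name = get_attribute(soup, 'input', {'id': 'kw'}, 'name')
button_value = get_attribute(soup, 'input', {'type': 'submit'}, 'value')
missing = get_attribute(soup, 'input', {'id': 'no-such-id'}, 'name')
print(search_name, button_value, missing)
```

With this helper, a missing element shows up as None rather than as an exception, which makes it easier to tell which lookup failed.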

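Finally, the main function calls the functions above to run the whole crawler. The URL can be replaced with that of another page, and the parsing logic adapted to extract different content.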
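
As one illustration of parsing different content, the sketch below collects every link on a page with find_all and resolves relative addresses against the page URL. The function name extract_links, the sample HTML, and the example.com addresses are hypothetical.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def extract_links(html, base_url):
    """Return (text, absolute_url) pairs for every <a> tag that has an href."""
    soup = BeautifulSoup(html, 'html.parser')
    links = []
    # href=True skips anchors that have no href attribute at all.
    for anchor in soup.find_all('a', href=True):
        text = anchor.get_text(strip=True)
        links.append((text, urljoin(base_url, anchor['href'])))
    return links

# Hypothetical page content and address, used only for demonstration.
SAMPLE_HTML = (
    '<a href="/news">News</a> '
    '<a href="https://example.org/about">About</a> '
    '<a>No link</a>'
)
links = extract_links(SAMPLE_HTML, 'https://example.com/index.html')
for text, url in links:
    print(text, url)
```

In a real run, the html argument would be the text returned by get_html and base_url the address that was requested.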

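In summary, this simple crawler consists of three steps: sending an HTTP request to fetch the page content, parsing the content to extract the required information, and finally processing and storing the extracted information (in this example, the results are simply printed). Similar programs can be written to crawl other pages of interest and collect their data, provided the data is present in the HTML that the server returns.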
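
The storage step could be sketched with the standard csv module, as below. The helper name save_rows, the column names, and the row values are placeholders for illustration, and an in-memory buffer stands in for a file so the example runs without touching the disk.

```python
import csv
import io

def save_rows(rows, fieldnames, file_obj):
    """Write a list of dictionaries to file_obj as CSV with a header row."""
    writer = csv.DictWriter(file_obj, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)

# Placeholder results; in practice these would come from the parsing step.
rows = [
    {'field': 'search_input_name', 'value': 'wd'},
    {'field': 'search_button_value', 'value': 'Search'},
]

# newline='' lets the csv module control line endings itself.
buffer = io.StringIO(newline='')
save_rows(rows, ['field', 'value'], buffer)
print(buffer.getvalue())
```

To write to disk instead, the same function can be given a file opened with open('result.csv', 'w', newline='', encoding='utf-8'), where the file name is again only an example.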