Web Crawlers: Fetching Information from Web Pages with Python Functions

Published: 2023-06-25 05:07:41

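A web crawler is a program that automatically collects information from the internet. It can retrieve many kinds of content from web pages, such as text, images, video, and audio. Python is a widely used programming language with mature libraries for web crawling. This article shows how to use Python functions to fetch information from web pages.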

1. Getting Started with Crawlers

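A web crawler typically works in four steps: sending a request, receiving the response, parsing the response, and storing the data. Of these, sending the request and receiving the response are the most critical, since nothing can be parsed or stored without them. In Python, the requests library handles both.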
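The four steps can be pictured as a small pipeline. The sketch below is only an illustration: the function names crawl, fake_fetch, and split_fields are hypothetical, and each stage is passed in as a callable so that the example runs without network access.

```python
def crawl(url, fetch, parse, store):
    """Run the four crawler stages in order and return the parsed records."""
    raw = fetch(url)       # steps 1 and 2: send the request, receive the response
    records = parse(raw)   # step 3: extract the wanted pieces from the response
    store(records)         # step 4: persist the extracted data
    return records


# Stand-in stages (hypothetical) so the sketch runs offline.
def fake_fetch(url):
    return "alpha,beta,gamma"


def split_fields(raw):
    return raw.split(",")


saved = []  # plays the role of the storage backend
result = crawl("https://example.com/", fake_fetch, split_fields, saved.extend)
print(result)
```

In a real crawler, fetch would wrap an HTTP library, parse would wrap an HTML parser, and store would write to a file or database.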

First, install the requests library with pip:

pip install requests

Once it is installed, the get() function in requests sends a request to a given URL:

import requests

url = 'https://www.baidu.com/'
response = requests.get(url)
print(response.text)

In this example, we send a request to the Baidu search home page and receive a response. response.text holds the body of the response, which here is the page's HTML.
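A bare get() call waits indefinitely if the server never answers, and it does not complain when the server returns an error page. A minimal sketch of a more defensive wrapper follows. The function name fetch_html and its session parameter are hypothetical; timeout, headers, and raise_for_status() are part of requests.

```python
import requests


def fetch_html(url, headers=None, timeout=10.0, session=None):
    """Return the body of `url` as text, raising on HTTP error statuses."""
    # A requests.Session can be passed in to reuse connections;
    # otherwise fall back to the module-level requests.get().
    http = session if session is not None else requests
    response = http.get(url, headers=headers, timeout=timeout)
    # Raises requests.HTTPError for 4xx and 5xx responses.
    response.raise_for_status()
    return response.text
```

Calling fetch_html('https://www.baidu.com/') would then return the same HTML as the example above, or raise an exception instead of silently returning an error page.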

2. Parsing HTML

Once we have the HTML, we need to extract the information we want from it. This process is called parsing the HTML. In Python, the BeautifulSoup library can do this.

First, install BeautifulSoup with pip:

pip install beautifulsoup4

Once it is installed, the prettify() method of a BeautifulSoup object formats the HTML so that it is easier to read:

import requests
from bs4 import BeautifulSoup

url = 'https://www.baidu.com/'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.prettify())

In this example, prettify() prints the fetched HTML with indentation that reflects its nesting, which makes it easier to read.

Next, BeautifulSoup provides a range of methods for navigating the parsed HTML. For example, find() looks up a particular tag:

import requests
from bs4 import BeautifulSoup

url = 'https://www.baidu.com/'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.find('title'))

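In this example, find() returns the first <title> tag in the HTML, and print() outputs the whole tag, including its markup. To get only the text inside the tag, call its get_text() method.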
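To see find() next to its companions find_all() and get_text() without touching the network, here is a minimal sketch that parses an inline HTML string. The markup and variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up markup standing in for a downloaded page.
html = """
<html><head><title>Example page</title></head>
<body>
  <a href="/first">First link</a>
  <a href="/second">Second link</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching tag; get_text() returns its text only.
title_text = soup.find("title").get_text()

# find_all() returns every matching tag; attributes are read like dict keys.
links = [(a.get_text(), a["href"]) for a in soup.find_all("a")]

print(title_text)
print(links)
```

find() returns None when nothing matches, so real code should check the result before calling get_text() on it.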

3. Storing Data

Finally, the parsed data needs to be stored. In Python, the csv module can write the data to a CSV file, and the pandas library can write it to an Excel file.

The csv module is part of Python's standard library, so it needs no installation. It is not installed with pip, and import csv works out of the box.
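Because csv ships with Python, it can be tried immediately. The sketch below writes a few made-up rows to an in-memory buffer instead of a file, then reads them back, to show that the writer takes care of quoting fields that contain commas.

```python
import csv
import io

# Made-up sample rows; the second movie title contains a comma.
rows = [
    ["title", "rating"],
    ["Movie A", "9.1"],
    ["Movie B, Part 2", "8.7"],
]

# newline="" leaves line endings to the csv module, as its docs recommend.
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back yields the original rows, comma included.
read_back = list(csv.reader(io.StringIO(csv_text, newline="")))
print(read_back)
```

The same writer calls work unchanged on a file opened with open(..., 'w', newline='').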

We can then use the writerow() method of a csv writer object to write each row to a CSV file:

import requests
from bs4 import BeautifulSoup
import csv

url = 'https://movie.douban.com/top250'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
# utf-8-sig writes a byte order mark so that Excel detects the encoding
with open('douban_top250.csv', 'w', newline='', encoding='utf-8-sig') as f:
    writer = csv.writer(f)
    # Header row
    writer.writerow(['Title', 'Director', 'Cast', 'Release date', 'Rating', 'Number of ratings', 'Quote'])
    # Each movie sits in a <div class="info"> block
    for item in soup.find_all('div', {'class': 'info'}):
        name = item.find_all('span', {'class': 'title'})[0].text
        # Split the first <p> into lines; the indices and slice offsets
        # below depend on the page's markup
        lines = item.find_all('p')[0].text.split('\n')
        director = lines[1][4:]
        actors = lines[2][3:]
        date = lines[3][5:]
        rating = item.find_all('span', {'class': 'rating_num'})[0].text
        comments = item.find_all('span', {'class': 'comment'})[0].text
        quote = item.find_all('p', {'class': 'quote'})[0].text.strip()
        writer.writerow([name, director, actors, date, rating, comments, quote])

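In this example, we use requests to send a request to the Douban Movie Top 250 page, parse the response with BeautifulSoup, and then use the csv module to save the movie information as a CSV file.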
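Indexing with [0], as the loop above does, raises an IndexError as soon as one entry lacks an element, for example a movie without a quote. A minimal sketch of a more forgiving lookup follows. The helper name text_or_default and the sample HTML are hypothetical and only illustrate the idea; they are not the real page's markup.

```python
from bs4 import BeautifulSoup


def text_or_default(parent, name, css_class, default=""):
    """Return the stripped text of the first matching tag, or `default`."""
    tag = parent.find(name, class_=css_class)
    return tag.get_text(strip=True) if tag is not None else default


# Made-up markup: the second entry has no quote paragraph.
sample_html = """
<div class="info"><span class="title">Movie A</span><p class="quote"> A memorable line. </p></div>
<div class="info"><span class="title">Movie B</span></div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
rows = [
    [text_or_default(item, "span", "title"), text_or_default(item, "p", "quote")]
    for item in soup.find_all("div", class_="info")
]
print(rows)
```

With a helper like this, a missing field becomes an empty cell in the output instead of stopping the whole crawl.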

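Besides the csv module, the pandas library can store the data as an Excel file. Install pandas with pip. Writing .xlsx files also needs an Excel engine such as openpyxl, which is installed separately:

pip install pandas openpyxl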

We can then build a DataFrame from the data with DataFrame() and write it to an Excel file with the to_excel() method:

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://movie.douban.com/top250'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
data = []  # one list per movie
# Each movie sits in a <div class="info"> block
for item in soup.find_all('div', {'class': 'info'}):
    name = item.find_all('span', {'class': 'title'})[0].text
    # Split the first <p> into lines; the indices and slice offsets
    # below depend on the page's markup
    lines = item.find_all('p')[0].text.split('\n')
    director = lines[1][4:]
    actors = lines[2][3:]
    date = lines[3][5:]
    rating = item.find_all('span', {'class': 'rating_num'})[0].text
    comments = item.find_all('span', {'class': 'comment'})[0].text
    quote = item.find_all('p', {'class': 'quote'})[0].text.strip()
    data.append([name, director, actors, date, rating, comments, quote])
df = pd.DataFrame(data, columns=['Title', 'Director', 'Cast', 'Release date', 'Rating', 'Number of ratings', 'Quote'])
# index=False leaves the DataFrame's row index out of the spreadsheet
df.to_excel('douban_top250.xlsx', index=False)

In this example, pandas holds the parsed movie information in a DataFrame, and to_excel() writes that DataFrame to an Excel file.
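Values scraped from HTML arrive as strings, so numeric columns usually need converting before any analysis. The sketch below uses made-up rows and column names, and writes CSV text rather than an Excel file so that it runs without an Excel engine installed.

```python
import pandas as pd

# Made-up rows; every value is a string, as it would be after scraping.
rows = [
    ["Movie A", "9.1", "1200"],
    ["Movie B", "8.7", "950"],
]
df = pd.DataFrame(rows, columns=["title", "rating", "votes"])

# Convert the numeric columns from text to numbers.
df["rating"] = pd.to_numeric(df["rating"])
df["votes"] = pd.to_numeric(df["votes"])

# With real numbers, sorting and aggregation behave as expected.
top = df.sort_values("rating", ascending=False).iloc[0]
print(df.dtypes)
print(top["title"])

# With no path argument, to_csv() returns the CSV as a string.
csv_text = df.to_csv(index=False)
print(csv_text)
```

Without the conversion, sorting the rating column would compare the values as text rather than as numbers.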

Summary

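This article showed how to use Python functions to fetch information from web pages. With the requests and BeautifulSoup libraries, we can easily fetch HTML and extract information from it. With the csv module and the pandas library, we can save the parsed data as CSV or Excel files for later analysis and processing. Web crawling is an interesting technique, and we hope this article has given readers a grounding in its basics.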