Parsing Web Page Data with HTMLParser in Python

Published: 2023-12-26 03:13:57

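HTMLParser is a class in the html.parser module of Python's standard library, used to parse HTML. It offers a convenient way to extract tags, text, and other information from HTML, which makes it useful for cleaning and extracting web page data.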

To use HTMLParser, first define a subclass of it and override its handler methods to process the HTML. The following example parses a small page:

from html.parser import HTMLParser

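# Define a subclass of HTMLParser
class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        """Handle a start tag; attrs is a list of (name, value) tuples."""
        print("Start tag:", tag)
        for attr in attrs:
            print("Attribute:", attr)

    def handle_endtag(self, tag):
        """Handle an end tag."""
        print("End tag :", tag)

    def handle_data(self, data):
        """Handle text data, skipping the whitespace-only runs between tags."""
        if data.strip():
            print("Data     :", data)

    def handle_comment(self, data):
        """Handle a comment."""
        print("Comment  :", data)

    def handle_entityref(self, name):
        """Handle a named entity reference (called only when convert_charrefs=False)."""
        print("Entity   :", name)

    def handle_charref(self, name):
        """Handle a numeric character reference (called only when convert_charrefs=False)."""
        print("Char ref :", name)

# Create an instance of the parser subclass
parser = MyHTMLParser()

# Parse the HTML code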
html_code = '''
<html>
<body>
<h1>Title</h1>
<!-- This is a comment -->
<p>Paragraph 1</p>
<p>Paragraph 2</p>
<a href="http://www.example.com">Link</a>
</body>
</html>
'''

parser.feed(html_code)

Running the code above produces the following output:

Start tag: html
Start tag: body
Start tag: h1
Data     : Title
End tag : h1
Comment  :  This is a comment 
Start tag: p
Data     : Paragraph 1
End tag : p
Start tag: p
Data     : Paragraph 2
End tag : p
Start tag: a
Attribute: ('href', 'http://www.example.com')
Data     : Link
End tag : a
End tag : body
End tag : html

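In the code above, we define a subclass MyHTMLParser that overrides several methods of the parent class HTMLParser to handle start and end tags, text data, comments, entity references, and character references. We then create an instance of MyHTMLParser named parser and call its feed method to parse the HTML. Note that since Python 3.5 the convert_charrefs argument defaults to True, so handle_entityref and handle_charref are called only if the parser is constructed with convert_charrefs=False; otherwise references such as &amp; are decoded and passed to handle_data as ordinary text.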
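The effect of convert_charrefs on which handlers fire can be checked with a small sketch like the one below. RefLogger and parse_events are hypothetical names introduced here for illustration and are not part of the original example; the handlers record events in a list instead of printing them.

```python
from html.parser import HTMLParser


class RefLogger(HTMLParser):
    """Hypothetical helper: records text and reference events in a list."""

    def __init__(self, convert_charrefs):
        super().__init__(convert_charrefs=convert_charrefs)
        self.events = []

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_entityref(self, name):
        # Reached only when convert_charrefs is False
        self.events.append(("entityref", name))

    def handle_charref(self, name):
        # Reached only when convert_charrefs is False
        self.events.append(("charref", name))


def parse_events(html, convert_charrefs):
    """Parse html and return the list of recorded events."""
    logger = RefLogger(convert_charrefs)
    logger.feed(html)
    logger.close()
    return logger.events


sample = "<p>a &amp; b &#62; c</p>"
print(parse_events(sample, convert_charrefs=True))
print(parse_events(sample, convert_charrefs=False))
```

With the default setting the references should arrive already decoded inside the text data, while with convert_charrefs=False they are reported separately by name or number.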

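During parsing, whenever the parser encounters a start tag, an end tag, text data, and so on, it calls the corresponding method, which prints the relevant information.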
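Because the handlers are called in document order, a subclass can keep a little state between calls to know where it currently is in the page. The following is a minimal sketch of that idea, assuming we only want the text inside p elements; ParagraphText and paragraph_texts are hypothetical names used for illustration.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Hypothetical helper: collects the text found inside <p> elements."""

    def __init__(self):
        super().__init__()
        self.in_paragraph = False  # True while between <p> and </p>
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_paragraph = True
            self.paragraphs.append("")  # start a new paragraph buffer

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_paragraph = False

    def handle_data(self, data):
        # Text inside nested tags such as <b> is still appended
        if self.in_paragraph:
            self.paragraphs[-1] += data


def paragraph_texts(html):
    """Return the stripped text of every <p> element in html."""
    collector = ParagraphText()
    collector.feed(html)
    collector.close()
    return [text.strip() for text in collector.paragraphs]


print(paragraph_texts("<h1>Title</h1><p>Paragraph 1</p><p>Paragraph <b>2</b></p>"))
```

The flag approach is deliberately simple and does not handle paragraphs that are left unclosed; real pages may need more careful bookkeeping.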

This is only a simple example; you can customize the subclass's behavior as needed to implement your own logic for parsing web page data.
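As one possible customization, the sketch below collects link targets instead of printing every event. LinkCollector and extract_links are hypothetical names chosen for this illustration, and the sample markup is made up.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: gathers the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs is a list of (name, value) tuples; value may be None
        for name, value in attrs:
            if name == "href" and value is not None:
                self.links.append(value)


def extract_links(html):
    """Return the href values of all <a> tags in html, in document order."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


sample = '<p>See <a href="http://www.example.com">Link</a> and <a name="x">anchor</a>.</p>'
print(extract_links(sample))
```

Storing results on the instance rather than printing them makes the parser easier to reuse from other code, since the caller decides what to do with the collected values.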