Extracting HTML tags and content in Python with html.parser.HTMLParser

Published: 2024-01-03 08:16:55

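In Python, we can parse HTML tags and content with the HTMLParser class from the standard library's html.parser module (in Python 2 the module was named HTMLParser). The class provides handler methods that the parser calls when it encounters a start tag, an end tag, or text data.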

Here is an example that uses HTMLParser to extract HTML tags and their content:

from html.parser import HTMLParser

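# Subclass HTMLParser and override the handlers we care about
class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Start tag:", tag)
        # attrs is a list of (name, value) tuples; print each attribute
        for name, value in attrs:
            print("     Attribute:", name, "=", value)

    def handle_endtag(self, tag):
        print("End tag:", tag)

    def handle_data(self, data):
        # Skip whitespace-only text, such as the newlines and
        # indentation between tags
        if data.strip():
            print("Data:", data)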

# Create the parser object
parser = MyHTMLParser()

# The HTML to parse
html = """
<html>
<head>
    <title>My Website</title>
</head>
<body>
    <h1>Welcome to My Website</h1>
    <p>This is a paragraph.</p>
    <ul>
        <li>Item 1</li>
        <li>Item 2</li>
        <li>Item 3</li>
    </ul>
</body>
</html>
"""

# Parse the HTML
parser.feed(html)

The output is as follows (whitespace-only text between tags is omitted):

Start tag: html
Start tag: head
Start tag: title
Data: My Website
End tag: title
End tag: head
Start tag: body
Start tag: h1
Data: Welcome to My Website
End tag: h1
Start tag: p
Data: This is a paragraph.
End tag: p
Start tag: ul
Start tag: li
Data: Item 1
End tag: li
Start tag: li
Data: Item 2
End tag: li
Start tag: li
Data: Item 3
End tag: li
End tag: ul
End tag: body
End tag: html

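By overriding handle_starttag() and handle_endtag(), we are notified when each tag opens and closes, and by overriding handle_data() we receive the text inside the tags. When handling a start tag, the attrs parameter gives us the tag's attributes as a list of (name, value) tuples.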
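To make the attrs parameter more concrete, here is a minimal sketch that collects the href value of every <a> tag. The class name LinkCollector and the sample markup are made up for illustration and are not part of the original example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs is a list of (name, value) pairs; dict() makes lookup easier.
        href = dict(attrs).get("href")
        if href is not None:
            self.links.append(href)


collector = LinkCollector()
collector.feed(
    '<p><a href="/home">Home</a> '
    '<a class="x" href="/about">About</a> '
    '<a>no link</a></p>'
)
print(collector.links)  # ['/home', '/about']
```

Storing results on the instance instead of printing them makes the parser easier to reuse from other code.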

In the example above, we define a custom MyHTMLParser class that inherits from HTMLParser and overrides handle_starttag(), handle_endtag(), and handle_data(). We then create a MyHTMLParser instance and pass the HTML to its feed() method to parse it.
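feed() does not need the whole document at once. The sketch below, which assumes the HTML arrives in several pieces (for example, while reading from a network connection), feeds each piece in turn and calls close() at the end. Because the parser may deliver one run of text through several handle_data() calls, the hypothetical TextCollector class joins the pieces rather than relying on how they are split.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: accumulates all text passed to handle_data()."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # A single text node may arrive in more than one call.
        self.parts.append(data)

    def text(self):
        return "".join(self.parts)


collector = TextCollector()
# The pieces deliberately break in the middle of the text.
for piece in ["<h1>Welcome to ", "My Website</h1><p>This is a ", "paragraph.</p>"]:
    collector.feed(piece)
collector.close()  # process any data still buffered

print(collector.text())
```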

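Note that in real projects we can also override the other handler methods HTMLParser provides, such as handle_comment(), handle_startendtag(), and handle_decl(), to implement more complex behavior.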
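As a rough illustration of those other handlers, the following sketch records document type declarations, comments, and self-closing tags by overriding handle_decl(), handle_comment(), and handle_startendtag(). The class name ExtraEventsParser and the input string are assumptions chosen for the demonstration.

```python
from html.parser import HTMLParser


class ExtraEventsParser(HTMLParser):
    """Hypothetical helper: records events beyond start tag, end tag, and data."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_decl(self, decl):
        # Called for declarations such as <!DOCTYPE html>
        self.events.append(("decl", decl))

    def handle_comment(self, data):
        # Called with the text between <!-- and -->
        self.events.append(("comment", data.strip()))

    def handle_startendtag(self, tag, attrs):
        # Called for XHTML-style empty tags such as <br/>
        self.events.append(("startend", tag))


parser = ExtraEventsParser()
parser.feed("<!DOCTYPE html><!-- navigation --><p>Line one<br/>Line two</p>")
parser.close()

for kind, value in parser.events:
    print(kind, value)
```

Handlers that are not overridden, such as handle_starttag() here, keep their default behavior of doing nothing, so the <p> tag and the text are silently ignored.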

Hopefully this example helps you understand how to use HTMLParser to extract HTML tags and content.