Parsing Nested HTML Tags in Python with html.parser.HTMLParser

Published: 2024-01-03 08:17:20

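HTMLParser is a class in the html.parser module of Python's standard library, used for parsing HTML documents. To use it, subclass HTMLParser and override its callback methods, which the parser invokes as it encounters tags and text in the document.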

The following example uses HTMLParser to parse nested HTML tags:

from html.parser import HTMLParser

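class MyHTMLParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.tags = []  # stack of currently open tags

    def handle_starttag(self, tag, attrs):
        # Push the tag so we know which elements are open.
        self.tags.append(tag)
        print("Start tag:", tag)

    def handle_endtag(self, tag):
        # Pop the most recently opened tag (assumes well-formed nesting).
        if self.tags:
            self.tags.pop()
        print("End tag:", tag)

    def handle_data(self, data):
        # Skip the whitespace-only text between tags (indentation, newlines).
        text = data.strip()
        if self.tags and text:
            print("Data:", text)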

html = '''
<html>
    <head>
        <title>HTML Parser Example</title>
    </head>
    <body>
        <h1>Python HTML Parser</h1>
        <p>This is a paragraph.</p>
        <div>
            <p>This is a nested paragraph.</p>
        </div>
        <p>This is another paragraph.</p>
    </body>
</html>
'''

parser = MyHTMLParser()
parser.feed(html)

The output is:

Start tag: html
Start tag: head
Start tag: title
Data: HTML Parser Example
End tag: title
End tag: head
Start tag: body
Start tag: h1
Data: Python HTML Parser
End tag: h1
Start tag: p
Data: This is a paragraph.
End tag: p
Start tag: div
Start tag: p
Data: This is a nested paragraph.
End tag: p
End tag: div
Start tag: p
Data: This is another paragraph.
End tag: p
End tag: body
End tag: html

In this example, we create MyHTMLParser, a subclass of HTMLParser, and define three callback methods in it: handle_starttag, handle_endtag, and handle_data.

- handle_starttag(tag, attrs) is called when the parser encounters a start tag. Here we push the tag name onto a list that serves as a stack of open tags, then print it.

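- handle_endtag(tag) is called when the parser encounters an end tag. Here we pop the most recently opened tag off the list and print the end tag's name. The pop does not check that the two names match, so it relies on the HTML being properly nested.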

- handle_data(data) is called for text between tags. Here we print the text, but only while at least one tag is open.
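The same tag stack can also tell us where in the nesting each piece of text sits. The following is a minimal sketch rather than a definitive implementation: PathParser, VOID_ELEMENTS, and the records attribute are hypothetical names introduced for illustration. It assumes that void elements such as br and img should never be pushed, since they have no end tag, and that a stray end tag can simply be ignored.

```python
from html.parser import HTMLParser

# Elements that never take an end tag in HTML; pushing them would corrupt the stack.
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
                 "link", "meta", "source", "track", "wbr"}


class PathParser(HTMLParser):
    """Hypothetical helper: records (path, text) for every non-blank text node."""

    def __init__(self):
        super().__init__()
        self.stack = []    # currently open tags, outermost first
        self.records = []  # collected (path, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_ELEMENTS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching start tag; ignore end tags that were never opened.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            self.records.append(("/".join(self.stack), text))


parser = PathParser()
parser.feed("<body><div><p>nested<br>text</p></div><p>top</p></body>")
parser.close()
for path, text in parser.records:
    print(path, "->", text)
```

Popping back to the matching tag, instead of popping exactly one entry, keeps the stack consistent when an inner element is left unclosed.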

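In the main program, we create a MyHTMLParser instance named parser and call its feed method to parse the html string. As it parses, HTMLParser calls the corresponding callback each time it encounters a start tag, an end tag, or text data.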
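feed does not need the whole document at once: it can be called repeatedly with successive chunks, and close processes whatever is still buffered. The sketch below illustrates this under the assumption that the input arrives in two pieces. TextCollector and its attributes are hypothetical names. Because text may be delivered to handle_data in more than one call, the fragments are collected and joined at the end.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: gathers start tags and text fragments."""

    def __init__(self):
        super().__init__()
        self.start_tags = []
        self.fragments = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)

    def handle_data(self, data):
        # Text may arrive in several pieces, so collect now and join later.
        self.fragments.append(data)


collector = TextCollector()
for chunk in ("<div><p>Hello", " world</p></div>"):
    collector.feed(chunk)
collector.close()  # process anything still buffered

text = "".join(collector.fragments)
print(collector.start_tags, repr(text))
```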

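As this example shows, HTMLParser makes it straightforward to parse nested HTML tags: the callbacks fire in document order, so a simple stack is enough to track the nesting. You can add whatever processing logic you need to the callbacks to work with the parsed content.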
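As one illustration of adding logic to a callback, the sketch below uses the attrs argument of handle_starttag, which is a list of (name, value) tuples, to collect link targets at any nesting depth. LinkCollector and its links attribute are hypothetical names, and the markup is made up for the example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects href values from <a> tags at any depth."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples; value is None for bare attributes.
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


collector = LinkCollector()
collector.feed(
    '<ul><li><a href="/docs">Docs</a></li>'
    '<li><a name="x">No link</a></li></ul>'
)
collector.close()
print(collector.links)
```

Only the first anchor carries an href, so only its target is collected.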