Frequently Asked Questions About ALLOWED_TAGS in Python

Published: 2024-01-10 23:09:03

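Question 1: What is ALLOWED_TAGS?

ANS: ALLOWED_TAGS is the conventional name for an allowlist of HTML tag names that should survive filtering. It is not a built-in BeautifulSoup filter or constructor argument: BeautifulSoup parses the markup, and your own code then walks the tree and removes every tag that is not on the list. (The bleach library ships a constant with this name, bleach.sanitizer.ALLOWED_TAGS, which is where many readers first meet it.)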

Example 1:

from bs4 import BeautifulSoup

html = "<div><p>This is a paragraph</p><a>This is a link</a><img src='image.jpg' alt='This is an image'></div>"

allowed_tags = ['p', 'a']  # keep only p and a tags

soup = BeautifulSoup(html, 'html.parser')

for tag in soup.find_all(True):  # True matches every tag in the tree
    if tag.name not in allowed_tags:
        tag.unwrap()  # remove the tag itself but keep its contents

print(soup)

Output:

<p>This is a paragraph</p><a>This is a link</a>
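The same loop can be wrapped in a small reusable helper. The following is a minimal sketch, not part of BeautifulSoup: the function name keep_only_tags is hypothetical, and it assumes that disallowed tags should be unwrapped (contents kept) rather than deleted.

```python
from bs4 import BeautifulSoup


def keep_only_tags(html, allowed_tags, parser="html.parser"):
    """Return html with every tag not in allowed_tags unwrapped.

    Hypothetical helper for illustration; BeautifulSoup has no
    built-in ALLOWED_TAGS option.
    """
    allowed = set(allowed_tags)  # set membership tests are fast
    soup = BeautifulSoup(html, parser)
    # find_all(True) returns a list of every tag, so editing the tree
    # inside the loop does not disturb the iteration.
    for tag in soup.find_all(True):
        if tag.name not in allowed:
            tag.unwrap()  # drop the tag, keep its children in place
    return str(soup)


if __name__ == "__main__":
    sample = "<div><p>This is a paragraph</p><a>This is a link</a><img src='image.jpg'></div>"
    print(keep_only_tags(sample, ["p", "a"]))
```

This sketch filters tag names only. Attributes such as href or onclick pass through unchanged, so on its own it is not a complete sanitizer for untrusted HTML.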

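Question 2: What is the syntax of ALLOWED_TAGS?

ANS: ALLOWED_TAGS is a plain Python list (a set or tuple works equally well) containing the names of the tags you want to keep, written as strings without angle brackets.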

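Example 2:

from bs4 import BeautifulSoup

html = "<div><h1>This is a heading</h1><p>This is a paragraph</p><a>This is a link</a></div>"

allowed_tags = ['h1', 'p']  # keep only h1 and p tags

soup = BeautifulSoup(html, 'html.parser')

for tag in soup.find_all(True):  # True matches every tag in the tree
    if tag.name not in allowed_tags:
        tag.unwrap()  # remove the tag itself but keep its contents

print(soup)

Output:

<h1>This is a heading</h1><p>This is a paragraph</p>This is a link

The a tag itself is removed, but its text stays because unwrap() keeps a tag's contents.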

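Question 3: Does ALLOWED_TAGS support wildcards?

ANS: No. With a plain membership test (tag.name not in allowed_tags), each entry is compared literally, so '*' is treated as a tag name rather than as "all tags" and matches nothing in ordinary HTML.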

Example 3:

from bs4 import BeautifulSoup

html = "<div><h1>This is a heading</h1><p>This is a paragraph</p><a>This is a link</a></div>"

allowed_tags = ['*']  # '*' is a literal name here, not "all tags"

soup = BeautifulSoup(html, 'html.parser')

for tag in soup.find_all(True):  # True matches every tag in the tree
    if tag.name not in allowed_tags:
        tag.unwrap()  # no tag is named '*', so every tag is unwrapped

print(soup)

Output:

This is a headingThis is a paragraphThis is a link
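If pattern matching is wanted, it has to be added by the filtering code. One possible sketch uses the standard-library fnmatch module to compare tag names against shell-style patterns; the helper name is_allowed and the pattern list are assumptions made for illustration.

```python
from fnmatch import fnmatchcase


def is_allowed(tag_name, patterns):
    """Return True if tag_name matches any shell-style pattern.

    Hypothetical helper: patterns such as 'h[1-6]' or '*' are an
    extension added here, not something BeautifulSoup understands.
    """
    # fnmatchcase compares case-sensitively on every platform.
    return any(fnmatchcase(tag_name, pattern) for pattern in patterns)


if __name__ == "__main__":
    patterns = ["h[1-6]", "p"]
    for name in ["h1", "h3", "p", "a", "header"]:
        print(name, is_allowed(name, patterns))
```

In a filtering loop, the test tag.name not in allowed_tags would then become not is_allowed(tag.name, patterns). A narrow pattern such as 'h[1-6]' is safer than 'h*', which would also match html, head, hr, and header.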

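Question 4: Is ALLOWED_TAGS case-sensitive?

ANS: The comparison itself is case-sensitive, because it is an ordinary Python string comparison. However, BeautifulSoup's HTML parsers convert tag names to lowercase while parsing, so <H1> is stored as h1. Write the entries in lowercase: 'h1' matches both <h1> and <H1> in the source, while 'H1' matches nothing.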

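Example 4:

from bs4 import BeautifulSoup

html = "<div><H1>This is a heading</H1><P>This is a paragraph</p><a>This is a link</a></div>"

allowed_tags = ['h1', 'p']  # lowercase entries also match <H1> and <P>

soup = BeautifulSoup(html, 'html.parser')

for tag in soup.find_all(True):  # parsed tag names are already lowercase
    if tag.name not in allowed_tags:
        tag.unwrap()  # remove the tag itself but keep its contents

print(soup)

Output:

<h1>This is a heading</h1><p>This is a paragraph</p>This is a link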
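BeautifulSoup's HTML parsers store tag names in lowercase, so one way to avoid mismatches is to normalize the list once before filtering. A minimal sketch, using the hypothetical helper name normalize_tags:

```python
def normalize_tags(tags):
    """Return a frozenset of lowercase, whitespace-stripped tag names.

    Hypothetical helper for illustration; it only cleans the allowlist
    and does not touch any HTML.
    """
    return frozenset(tag.strip().lower() for tag in tags)


if __name__ == "__main__":
    allowed_tags = normalize_tags(["H1", " P ", "a"])
    print(sorted(allowed_tags))  # ['a', 'h1', 'p']
```

Returning a frozenset also prevents the list from being modified by accident after it has been defined.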

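Question 5: How are tags that are not in ALLOWED_TAGS handled?

ANS: That depends on which method the filtering code calls. tag.unwrap() removes the tag but leaves its text and child tags in place; tag.decompose() removes the tag together with everything inside it. Calling decompose() on an outer container such as div would also discard any allowed tags nested inside it, so unwrap() is usually the safer default.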

Example 5:

from bs4 import BeautifulSoup

html = "<div><h1>This is a heading</h1><p>This is a paragraph</p><a>This is a link</a></div>"

allowed_tags = ['p']  # keep only p tags

soup = BeautifulSoup(html, 'html.parser')

for tag in soup.find_all(True):  # True matches every tag in the tree
    if tag.name not in allowed_tags:
        tag.unwrap()  # the tag is removed, its text is kept

print(soup)

Output:

This is a heading<p>This is a paragraph</p>This is a link
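Some tags, such as script and style, are usually better removed together with their contents, while layout tags such as div are better unwrapped. The sketch below combines both policies; the function name filter_tags and the drop_tags parameter are hypothetical, chosen for this illustration.

```python
from bs4 import BeautifulSoup, Tag


def filter_tags(html, allowed_tags, drop_tags=("script", "style")):
    """Unwrap tags not in allowed_tags; delete tags in drop_tags entirely."""
    allowed = set(allowed_tags)
    dropped = set(drop_tags)
    soup = BeautifulSoup(html, "html.parser")

    def clean(node):
        # Copy the child list first because the loop edits the tree.
        for child in list(node.children):
            if not isinstance(child, Tag):
                continue  # text nodes and comments are left alone
            if child.name in dropped:
                child.decompose()  # remove the tag and its contents
                continue
            clean(child)  # filter descendants before the tag itself
            if child.name not in allowed:
                child.unwrap()  # remove the tag, keep its contents

    clean(soup)
    return str(soup)


if __name__ == "__main__":
    sample = "<div><p>Hi</p><script>alert(1)</script><b>bold</b></div>"
    print(filter_tags(sample, ["p"]))  # <p>Hi</p>bold
```

Walking the tree recursively, rather than looping over find_all(True), avoids touching tags that were already destroyed when one of their ancestors was decomposed.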

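Question 6: Does ALLOWED_TAGS work with parsers other than 'html.parser'?

ANS: Yes. Because the filtering runs on the parsed tree, it works with any parser BeautifulSoup supports, such as lxml or html5lib (each must be installed separately). Note that these two parsers wrap a fragment in html and body tags (html5lib also adds head); the wrappers are unwrapped like any other tag that is not on the list.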

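Example 6:

from bs4 import BeautifulSoup

html = "<div><p>This is a paragraph</p></div>"

allowed_tags = ['p']  # keep only p tags

soup = BeautifulSoup(html, 'lxml')  # requires the lxml package

for tag in soup.find_all(True):  # includes the html and body tags lxml adds
    if tag.name not in allowed_tags:
        tag.unwrap()  # remove the tag itself but keep its contents

print(soup)

Output:

<p>This is a paragraph</p>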
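Optional parsers may not be installed on every machine. The following is a hedged sketch that runs the same filter with several parsers and skips the missing ones; the function name filter_with_parsers is hypothetical, while FeatureNotFound is the exception BeautifulSoup raises when a requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def filter_with_parsers(html, allowed_tags, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser_name: filtered_html} for each parser that is installed."""
    allowed = set(allowed_tags)
    results = {}
    for parser in parsers:
        try:
            soup = BeautifulSoup(html, parser)
        except FeatureNotFound:
            continue  # this parser is not installed; skip it
        for tag in soup.find_all(True):
            if tag.name not in allowed:
                tag.unwrap()  # also strips html/head/body wrappers
        results[parser] = str(soup)
    return results


if __name__ == "__main__":
    outputs = filter_with_parsers("<div><p>This is a paragraph</p></div>", ["p"])
    for name, output in outputs.items():
        print(name, "->", output)
```

Comparing the outputs side by side is a quick way to confirm that a given allowlist behaves the same regardless of which parser is available.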