How to count word frequencies in a text in Python with a words() function

Published: 2024-01-20 20:08:39

In Python, word frequencies in a text can be counted with the Counter class from the collections module. The steps are as follows:

1. Import the Counter class:

from collections import Counter

2. Take the text and split it into words:

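def words(text):
    # Replace commas, periods, colons, and semicolons with spaces
    text = text.replace(",", " ").replace(".", " ").replace(":", " ").replace(";", " ")
    # Convert the text to lowercase so that "This" and "this" count as one word
    text = text.lower()
    # Split on whitespace to get a list of words
    word_list = text.split()
    return word_list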
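
The words() function above replaces only four punctuation marks, so a token such as "words!" or "(text)" would keep its punctuation. As a minimal sketch of one alternative, a regular expression can extract the tokens directly. The name words_regex is hypothetical and not part of the original code, and the pattern assumes that a word is a run of ASCII letters, digits, and apostrophes.

```python
import re

def words_regex(text):
    # Lowercase first, then extract runs of letters, digits, and apostrophes.
    # Assumption: this pattern defines what counts as a "word"; other
    # languages or hyphenated words would need a different pattern.
    return re.findall(r"[a-z0-9']+", text.lower())

print(words_regex("Hello, world! (Hello again?)"))
# ['hello', 'world', 'hello', 'again']
```

Because it returns a plain list of strings, this variant could be passed to Counter in the same way as the output of words().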

3. Count the word frequencies:

def count_words(text):
    # Call words() to split the text into a list of words
    word_list = words(text)
    # Use Counter to count how many times each word occurs
    word_counts = Counter(word_list)
    return word_counts
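
The functions here take a string. When the text lives in a file, it has to be read into a string first. The following is a rough sketch under that assumption: count_words_in_file is a hypothetical helper, it assumes the file fits in memory, and for brevity it splits on whitespace only, without the punctuation handling shown in words().

```python
from collections import Counter
from pathlib import Path

def count_words_in_file(path, encoding="utf-8"):
    # Hypothetical helper: read the whole file into memory at once.
    text = Path(path).read_text(encoding=encoding)
    # Lowercase and split on whitespace; punctuation is left attached to words.
    return Counter(text.lower().split())
```

A call such as count_words_in_file("sample.txt") would return a Counter for that file, where "sample.txt" is only a placeholder name.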

4. Usage example:

text = "This is a sample text. It contains several words. This sample text is used for example purposes."
word_counts = count_words(text)
print(word_counts)

Output:

Counter({'this': 2, 'is': 2, 'sample': 2, 'text': 2, 'a': 1, 'it': 1, 'contains': 1, 'several': 1, 'words': 1, 'used': 1, 'for': 1, 'example': 1, 'purposes': 1})

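The code above converts the input string to lowercase, replaces commas, periods, colons, and semicolons with spaces, and then counts how many times each word occurs. The result is a Counter object, which is a subclass of dict, so the count for any word can be looked up by key.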
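
Since Counter is a dict subclass with a few extra methods, the most frequent words can be retrieved without sorting by hand. The snippet below is a small illustration on a made-up word list, not on the sample text above.

```python
from collections import Counter

# Made-up word list used only for illustration
word_counts = Counter(["this", "is", "a", "sample", "this", "sample", "this"])

# most_common(n) returns the n most frequent items as (word, count) pairs
top_two = word_counts.most_common(2)
print(top_two)  # [('this', 3), ('sample', 2)]

# Looking up a word that never appeared returns 0 instead of raising KeyError
print(word_counts["missing"])  # 0
```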