Word Frequency Counting and Analysis for Chinese Corpora with gensim

Published: 2023-12-24 09:20:48

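When working with a Chinese corpus, gensim is a common choice for word frequency counting and analysis. gensim is an open-source natural language processing toolkit built around the vector space model. It supports text similarity computation and topic modeling, and the document vectors it produces can be used for tasks such as text clustering and text classification.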

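The sections below walk through some common word-frequency techniques for Chinese text used with gensim, each with a usage example. We assume that the Chinese corpus has already been segmented into words and loaded as corpus, a list of documents in which each document is a list of word tokens.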
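To make the expected shape of corpus concrete, here is a minimal sketch using a small made-up corpus that has been segmented by hand. The sentences are illustrative assumptions, not real data. In practice a word segmenter such as jieba would typically produce the tokens.

```python
# Hypothetical, hand-segmented toy corpus: one inner list of tokens per document.
corpus = [
    ['我', '喜欢', '自然', '语言', '处理'],
    ['自然', '语言', '处理', '是', '人工', '智能', '的', '一个', '分支'],
    ['我', '在', '学习', '人工', '智能', '和', '机器', '学习'],
]

# Basic sanity checks on the shape the later examples rely on.
num_documents = len(corpus)
num_tokens = sum(len(document) for document in corpus)
print(num_documents, num_tokens)
```

Any object with this list-of-token-lists shape should work with the examples that follow.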

1. Count the occurrences of each word:

from collections import defaultdict

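# Initialize a word-frequency dictionary (missing keys default to 0)
word_count = defaultdict(int)

# Count word occurrences across the whole corpus
for document in corpus:
    for word in document:
        word_count[word] += 1

# Sort by frequency in descending order
sorted_words = sorted(word_count.items(), key=lambda x: x[1], reverse=True)

# Print the N most frequent words and their counts
N = 10
for word, count in sorted_words[:N]:
    print(word, count)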
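The same count can be written more compactly with collections.Counter from the standard library. The sketch below is one possible alternative, not the only way to do it, and the three-document corpus is a made-up example.

```python
from collections import Counter
from itertools import chain

# Hypothetical pre-segmented corpus, used only for illustration.
corpus = [
    ['我', '喜欢', '苹果'],
    ['我', '喜欢', '香蕉'],
    ['我', '吃', '苹果'],
]

# Flatten all documents into one token stream and count in a single pass.
word_count = Counter(chain.from_iterable(corpus))

# most_common(n) returns the n highest counts as (word, count) pairs.
top_words = word_count.most_common(2)
for word, count in top_words:
    print(word, count)
```

Counter also supports addition, so counts from separate batches of documents can be merged with the + operator.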

2. Plot a word-frequency chart:

import matplotlib.pyplot as plt

# Extract the counts (already sorted in descending order in step 1)
word_freq = [count for word, count in sorted_words]

# Draw a bar chart of frequency against word rank
plt.bar(range(len(word_freq)), word_freq)
plt.xlabel('Words')
plt.ylabel('Frequency')
plt.show()

3. Remove stop words:

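# Note: gensim.parsing.preprocessing.remove_stopwords works on English strings,
# so for a tokenized Chinese corpus we filter the token lists directly.

# Define a stop-word list
stopwords = ['的', '是', '在', '了', '和']

# Remove stop words from every document
corpus_without_stopwords = [[word for word in document if word not in stopwords] for document in corpus]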
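Real stop-word lists often contain hundreds or thousands of entries, in which case testing membership against a set is faster than against a list. The sketch below wraps the filtering in a helper. The name filter_stopwords and the sample corpus are assumptions made for illustration.

```python
def filter_stopwords(corpus, stopwords):
    """Return a copy of corpus with every stop word removed from each document."""
    # A frozenset gives constant-time membership tests on average.
    stopword_set = frozenset(stopwords)
    return [
        [word for word in document if word not in stopword_set]
        for document in corpus
    ]


# Hypothetical pre-segmented corpus and stop-word list.
corpus = [
    ['我', '在', '学习', '自然', '语言', '处理'],
    ['这', '是', '我', '的', '书'],
]
stopwords = ['的', '是', '在', '了', '和']

corpus_without_stopwords = filter_stopwords(corpus, stopwords)
print(corpus_without_stopwords)
```

The original corpus is left unchanged, so the filtered and unfiltered frequencies can be compared side by side.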

4. Count word frequencies within each document:

from collections import Counter

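# Count word frequencies separately for each document
document_word_count = [Counter(document) for document in corpus]

# Print the words of document N, most frequent first, with their counts
N = 0  # index of the document (0 is the first document)
sorted_words_in_document = sorted(document_word_count[N].items(), key=lambda x: x[1], reverse=True)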
for word, count in sorted_words_in_document:
    print(word, count)

5. Compute word weights with TF-IDF:

from gensim import corpora
from gensim.models import TfidfModel

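# Build the dictionary (maps each word to an integer id)
dictionary = corpora.Dictionary(corpus)

# Convert each document to its bag-of-words representation
corpus_bow = [dictionary.doc2bow(document) for document in corpus]

# Fit the TF-IDF model
tfidf_model = TfidfModel(corpus_bow)

# Compute the TF-IDF weight of every word in every document
document_tfidf = tfidf_model[corpus_bow]

# Print the words of document N, highest TF-IDF weight first, with their weights
N = 0  # index of the document (0 is the first document)
sorted_words_tfidf = sorted(document_tfidf[N], key=lambda x: x[1], reverse=True)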
for word_id, weight in sorted_words_tfidf:
    print(dictionary[word_id], weight)
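To see what these weights mean, the following sketch recomputes them in plain Python. It assumes TfidfModel's default settings (raw term counts, a base-2 logarithm for the inverse document frequency, and normalization of each document vector to unit length) and is meant as an illustration of the formula, not as a replacement for gensim. The function name tfidf_weights and the sample corpus are made up for this example.

```python
import math
from collections import Counter


def tfidf_weights(corpus):
    """Return one {word: weight} dict per document.

    Assumed scheme: weight = count * log2(num_docs / doc_freq),
    followed by L2 normalization of each document vector.
    """
    num_docs = len(corpus)

    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter()
    for document in corpus:
        doc_freq.update(set(document))

    weighted = []
    for document in corpus:
        term_freq = Counter(document)
        raw = {
            word: count * math.log2(num_docs / doc_freq[word])
            for word, count in term_freq.items()
        }
        # Words that occur in every document get weight 0 and are dropped.
        raw = {word: value for word, value in raw.items() if value > 0}
        norm = math.sqrt(sum(value * value for value in raw.values()))
        weighted.append({word: value / norm for word, value in raw.items()} if norm else {})
    return weighted


# Hypothetical pre-segmented corpus.
corpus = [
    ['苹果', '好吃', '苹果'],
    ['香蕉', '好吃'],
    ['苹果', '香蕉', '西瓜'],
]

weights = tfidf_weights(corpus)
for word, weight in sorted(weights[2].items(), key=lambda x: x[1], reverse=True):
    print(word, round(weight, 3))
```

In the third document, the word that appears in no other document receives the largest weight, which is the behavior TF-IDF is designed to produce.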

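These examples show common ways to count and analyze word frequencies in a Chinese corpus with gensim. Choose the methods that fit your specific needs. gensim also provides other features, such as word-vector training, topic modeling, and text similarity computation, that can help you process and analyze Chinese text data further.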