An Automatic Text Summarizer Written in Python, with ROUGE as the Evaluation Metric

Published: 2023-12-24 20:28:49

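An automatic text summarizer uses machine learning and natural language processing techniques to produce a short, condensed summary of an input text. A common evaluation metric is ROUGE (Recall-Oriented Understudy for Gisting Evaluation), which judges the quality of a summarizer by measuring how much a generated summary overlaps with a reference summary. In this post I present a simple Python-based summarizer together with a usage example.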
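
To make the metric concrete, here is a minimal sketch of ROUGE-N computed as clipped n-gram overlap. The function name `rouge_n` and the regex-based tokenization are assumptions made for illustration; this is not the official ROUGE toolkit, which also handles stemming, multiple references, and other options.

```python
import re
from collections import Counter


def rouge_n(candidate, reference, n=1):
    """Return (recall, precision, f1) for n-gram overlap between two strings."""
    def ngrams(text):
        # Simplified tokenization (an assumption): lowercase alphanumeric runs
        tokens = re.findall(r"\w+", text.lower())
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    # Counter intersection keeps the minimum count of each shared n-gram
    overlap = sum((cand & ref).values())
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return recall, precision, f1


recall, precision, f1 = rouge_n("the cat sat on the mat", "the cat lay on the mat")
print(f"ROUGE-1 recall={recall:.2f} precision={precision:.2f} f1={f1:.2f}")
```

Five of the six reference unigrams also appear in the candidate, so recall is 5/6. Passing `n=2` gives the bigram variant of the same score.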

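First, install NLTK (the Natural Language Toolkit) for Python. The example code below also imports scikit-learn, so install both with the following command:

pip install nltk scikit-learn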

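Once installation finishes, download NLTK's stop-word list. Stop words are words that occur very frequently in natural language processing tasks but contribute little to the meaning, such as "the" and "is". Download them by running the following in Python: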

import nltk
nltk.download('stopwords')
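
As a quick illustration of what stop-word removal does, the sketch below filters a sentence against a tiny hand-written stop-word set. Both `SAMPLE_STOP_WORDS` and `remove_stop_words` are hypothetical names used only for this example; NLTK's English list is much larger.

```python
# A tiny hand-written stop-word set, for illustration only (an assumption);
# NLTK's English stop-word list contains many more entries.
SAMPLE_STOP_WORDS = {"the", "is", "a", "of", "on", "and", "to"}


def remove_stop_words(sentence, stop_words=SAMPLE_STOP_WORDS):
    """Lowercase the sentence, split on whitespace, and drop stop words."""
    return [word for word in sentence.lower().split() if word not in stop_words]


print(remove_stop_words("The cat is on the mat"))  # ['cat', 'mat']
```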

Next, we can write the code for the summarizer. Here is a simple example:

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.metrics.pairwise import cosine_similarity

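# Download and load the stop words, plus the tokenizer models that
# sent_tokenize and word_tokenize rely on
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab')  # required by newer NLTK releases
stop_words = set(stopwords.words('english'))

# Initialize the stemmer
stemmer = PorterStemmer()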

# Define the summary-generation function
def generate_summary(text, num_sentences):
    # Preprocess each sentence separately: lowercase, tokenize,
    # drop stop words and punctuation, then stem
    sentences = nltk.sent_tokenize(text)
    processed = []
    for sentence in sentences:
        word_tokens = nltk.word_tokenize(sentence.lower())
        filtered_tokens = [stemmer.stem(word) for word in word_tokens
                           if word.isalnum() and word not in stop_words]
        processed.append(" ".join(filtered_tokens))

    # Build one TF-IDF vector per sentence (rows = sentences, not tokens)
    counts = CountVectorizer().fit_transform(processed)
    tfidf_matrix = TfidfTransformer().fit_transform(counts)

    # Compute pairwise sentence similarity
    similarity_matrix = cosine_similarity(tfidf_matrix)

    # Score each sentence by its total similarity to all sentences, highest first
    sentence_scores = list(enumerate(similarity_matrix.sum(axis=1)))
    sentence_scores.sort(key=lambda x: x[1], reverse=True)

    # Keep the top num_sentences sentences, restored to their original order
    top_indices = sorted(i for i, _ in sentence_scores[:num_sentences])
    summary = " ".join(sentences[i] for i in top_indices)

    return summary

# Sample text
text = "Natural language processing (NLP) is a field of AI that focuses on the interaction between humans and computers using natural language. The ultimate objective of NLP is to read, decipher, understand and make sense of human language in valuable ways. NLP has a wide range of uses, including machine translation, speech recognition, and sentiment analysis. Auto summarization is a common application of NLP, which aims to generate a concise summary of a longer text. Automatic text summarization has the potential to greatly enhance various NLP applications by reducing manual effort in sifting through large amounts of text and extracting relevant information."

# Generate the summary
summary = generate_summary(text, 2)
print(summary)

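In the code above, we first load the stop words and initialize the stemmer. We then define a generate_summary function that preprocesses the input text, computes the similarity between sentences, ranks the sentences by score, and returns the requested number of sentences as the summary.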
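
The scoring idea can also be sketched without scikit-learn, to show what happens under the hood: each sentence becomes an L2-normalized TF-IDF vector, and a sentence's score is the sum of its cosine similarities to every sentence. The function name `rank_sentences` and the pre-tokenized input are assumptions for illustration. The smoothed IDF formula mirrors scikit-learn's default, but this is not the library's own implementation.

```python
import math
from collections import Counter


def rank_sentences(tokenized_sentences):
    """Return sentence indices ordered from most to least central."""
    n = len(tokenized_sentences)
    # Document frequency: the number of sentences each term appears in
    df = Counter(term for tokens in tokenized_sentences for term in set(tokens))

    vectors = []
    for tokens in tokenized_sentences:
        tf = Counter(tokens)
        # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
        vec = {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})

    def cosine(u, v):
        # Vectors are already L2-normalized, so the dot product is the cosine
        return sum(w * v.get(t, 0.0) for t, w in u.items())

    scores = [sum(cosine(u, v) for v in vectors) for u in vectors]
    return sorted(range(n), key=lambda i: scores[i], reverse=True)


docs = [["cat", "sat", "mat"], ["cat", "ate", "fish"], ["dog", "bark"]]
print(rank_sentences(docs))
```

The third sentence shares no terms with the other two, so it is only similar to itself and ranks last. This is the sense in which the summarizer prefers sentences that overlap most with the rest of the text.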

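Finally, here is the usage example. We pass in the sample text and ask for a two-sentence summary. The output reported for this example is shown below; which sentences are actually selected depends on the preprocessing and on the resulting TF-IDF scores: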

Auto summarization is a common application of NLP, which aims to generate a concise summary of a longer text. Automatic text summarization has the potential to greatly enhance various NLP applications by reducing manual effort in sifting through large amounts of text and extracting relevant information.

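This example shows how to write a simple extractive summarizer in Python whose output can be evaluated with ROUGE. You can change the input text and the number of summary sentences to suit your needs. Comparing the result against a reference summary with ROUGE is one way to judge which setting works better.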