Sentiment Analysis of Chinese Text Using Word2Vec in Python

Published: 2024-01-10 14:58:34

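Sentiment analysis uses machine learning and natural language processing to identify the sentiment expressed in a text. Word2Vec trains a "word embedding" model: it maps each word to a dense vector so that words used in similar contexts end up with similar vectors. For sentiment analysis, we can use Word2Vec to produce word vectors and then judge a text's sentiment by comparing its words' vectors with the vectors of known sentiment words.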
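To make "comparing word vectors" concrete, here is a minimal sketch of cosine similarity, the measure commonly used for this comparison. It uses only the standard library, and the three-dimensional vectors are made up for illustration; they do not come from a trained model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors, chosen by hand for this illustration
toy_vectors = {
    "好看": [0.9, 0.1, 0.0],
    "精彩": [0.8, 0.2, 0.1],
    "无聊": [-0.7, 0.3, 0.1],
}

sim_close = cosine_similarity(toy_vectors["好看"], toy_vectors["精彩"])
sim_far = cosine_similarity(toy_vectors["好看"], toy_vectors["无聊"])
print(f"好看 vs 精彩: {sim_close:.3f}")
print(f"好看 vs 无聊: {sim_far:.3f}")
```

Vectors pointing in nearly the same direction score close to 1, while vectors pointing in opposite directions score below 0.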

The following example uses Word2Vec in Python to perform sentiment analysis on Chinese text:

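First, install and import the required libraries. This example uses gensim for Word2Vec, jieba for Chinese word segmentation, and NumPy for vector arithmetic. All three can be installed with pip install gensim jieba numpy.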

import jieba
from gensim.models import Word2Vec
import numpy as np


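def preprocess_text(text):
    # Segment the Chinese text into a list of words with jieba
    words = jieba.lcut(text)
    return words


def generate_word_embeddings(texts):
    # Tokenize every text
    sentences = [preprocess_text(text) for text in texts]

    # Train the Word2Vec model (gensim 4.x names the dimension parameter
    # vector_size; gensim 3.x called it size)
    model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)

    # Return the KeyedVectors object that maps each word to its vector
    word_vectors = model.wv

    return word_vectors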


def cosine_similarity(vec_a, vec_b):
    # Cosine of the angle between two vectors
    return np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b))


def calculate_sentiment_score(word_embeddings, positive_words, negative_words, text):
    # Tokenize the text to be analyzed
    words = preprocess_text(text)

    # Keep only the sentiment words that are in the model's vocabulary
    positive_vectors = [word_embeddings[w] for w in positive_words if w in word_embeddings]
    negative_vectors = [word_embeddings[w] for w in negative_words if w in word_embeddings]

    # Start from a neutral score
    sentiment_score = 0.0

    # Accumulate a signed score over all words in the text
    for word in words:
        if word not in word_embeddings:
            continue
        word_vector = word_embeddings[word]

        # Similarity of this word to every positive and negative sentiment word
        positive_scores = [cosine_similarity(word_vector, v) for v in positive_vectors]
        negative_scores = [cosine_similarity(word_vector, v) for v in negative_vectors]

        # Closeness to a positive word raises the score;
        # closeness to a negative word lowers it
        if positive_scores:
            sentiment_score += max(positive_scores)
        if negative_scores:
            sentiment_score -= max(negative_scores)

    return float(sentiment_score)


# Sample texts
texts = [
    "这个电影太好看了！",
    "这个餐馆的食物非常美味。",
    "我非常喜欢这个城市的氛围。",
    "这本书有点无聊。",
    "我对这个产品感到失望。"
]

# Positive sentiment words
positive_words = ["好看", "美味", "喜欢"]

# Negative sentiment words
negative_words = ["无聊", "失望"]

# Train the word embedding model
word_embeddings = generate_word_embeddings(texts)

# Compute the sentiment score of each text
for text in texts:
    sentiment_score = calculate_sentiment_score(word_embeddings, positive_words, negative_words, text)
    if sentiment_score > 0:
        print(f"{text}: positive")
    elif sentiment_score < 0:
        print(f"{text}: negative")
    else:
        print(f"{text}: neutral")

In the example above, two functions do most of the work: generate_word_embeddings tokenizes the texts and trains the Word2Vec model, and calculate_sentiment_score computes a sentiment score for a single text. The final loop scores each text and labels it according to the sign of its score.

Note that calculate_sentiment_score compares word vectors using cosine similarity. For each word in the text, it takes the highest similarity to any positive sentiment word and subtracts the highest similarity to any negative sentiment word. Positive and negative sentiment words can overlap in vector space, because they tend to appear in similar contexts, so a similarity to one list alone says little; the difference between the two sides is what gives the score its sign. Keep in mind that a model trained on only five short sentences has barely moved from its random initialization, so the labels printed by this toy run should not be taken as reliable.
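As a small worked example of how similarities can be turned into a signed score, the sketch below applies a "best positive match minus best negative match" rule to a single word. It assumes that rule for illustration; the two-dimensional vectors and the helper name word_contribution are made up, and a trained model would supply the real vectors.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def word_contribution(word_vector, positive_vectors, negative_vectors):
    """Signed contribution of one word: above 0 leans positive, below 0 leans negative."""
    best_positive = max(cosine_similarity(word_vector, v) for v in positive_vectors)
    best_negative = max(cosine_similarity(word_vector, v) for v in negative_vectors)
    return best_positive - best_negative


# Hypothetical hand-made vectors: positive words point right, negative words point left
positive_vectors = [[1.0, 0.0], [0.8, 0.6]]
negative_vectors = [[-1.0, 0.0], [-0.6, 0.8]]

leaning_positive = word_contribution([0.9, 0.1], positive_vectors, negative_vectors)
leaning_negative = word_contribution([-0.9, 0.1], positive_vectors, negative_vectors)
print(f"word near the positive words: {leaning_positive:+.3f}")
print(f"word near the negative words: {leaning_negative:+.3f}")
```

Summing these contributions over all words in a text gives a score whose sign reflects which group of sentiment words the text sits closer to.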

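This is only a simple example. Real applications will likely need a more sophisticated model and a much larger corpus to reach useful accuracy, and a larger sentiment lexicon would cover a wider range of emotional expressions.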
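One way to widen the lexicon is to merge several word lists into the positive_words and negative_words lists. The sketch below assumes a hypothetical format of one word per line, with lines starting with # treated as comments; real sentiment lexicons vary in format and would need their own parsing. Words that show up on both sides are dropped, since they would pull the score in both directions.

```python
def parse_lexicon(text):
    """Return the set of words in a lexicon given as one word per line."""
    words = set()
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comment lines
        if line and not line.startswith("#"):
            words.add(line)
    return words


def merge_lexicons(positive_sources, negative_sources):
    """Merge several lexicons per polarity and drop words found in both polarities."""
    positive = set().union(*(parse_lexicon(s) for s in positive_sources))
    negative = set().union(*(parse_lexicon(s) for s in negative_sources))
    ambiguous = positive & negative
    return sorted(positive - ambiguous), sorted(negative - ambiguous)


# Hypothetical lexicon contents; in practice these would be read from files
lexicon_a_positive = "# sample positive list\n好看\n美味\n"
lexicon_b_positive = "喜欢\n美味\n一般\n"
lexicon_negative = "无聊\n失望\n一般\n"

merged_positive, merged_negative = merge_lexicons(
    [lexicon_a_positive, lexicon_b_positive],
    [lexicon_negative],
)
print(merged_positive)
print(merged_negative)
```

The merged lists can be passed to calculate_sentiment_score in place of the short hand-written ones.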