
Implementing a GRU Network for Chinese Sentiment Analysis in Python

Published: 2023-12-12 07:59:53

Chinese sentiment analysis can be built on a variety of deep learning models, including the Gated Recurrent Unit (GRU) network, a variant of the recurrent neural network (RNN). In this article we will implement a simple GRU network for Chinese sentiment analysis using Python and the PyTorch library.
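Before building the full model, it helps to see how `nn.GRU` consumes a batch of sequences. The following standalone sketch (independent of the sentiment model below; all sizes are made-up toy values) shows the input and output shapes when `batch_first=True`:

```python
import torch
import torch.nn as nn

# A toy 2-layer GRU: 8-dimensional inputs, 16-dimensional hidden state
gru = nn.GRU(input_size=8, hidden_size=16, num_layers=2, batch_first=True)

x = torch.randn(4, 10, 8)  # batch of 4 sequences, 10 time steps, 8 features each
output, hidden = gru(x)

print(tuple(output.shape))  # (4, 10, 16): top layer's hidden state at every step
print(tuple(hidden.shape))  # (2, 4, 16): final hidden state of each of the 2 layers
```

Note that `output` covers every time step of the top layer only, while `hidden` holds the final state of every layer; the classifier below reads the latter.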

First, we need to install PyTorch. The code below also uses torchtext, jieba, and pandas, so install them together:

pip install torch torchtext jieba pandas

Next, we will use a Chinese sentiment analysis dataset such as THUCNews, which contains a collection of news texts together with their sentiment labels.

First, we load and arrange the dataset with pandas:

import pandas as pd

# Load the dataset
data = pd.read_csv('thucnews.csv', encoding='utf-8')

# Shuffle the dataset
data = data.sample(frac=1).reset_index(drop=True)

# Split into training and test sets
train_data = data[:8000]
test_data = data[8000:]
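The torchtext loaders in the next step read from `thucnews_train.csv` and `thucnews_test.csv`, so the split produced above has to be written back to disk first. A minimal, self-contained sketch of that step (using a toy DataFrame in place of the real THUCNews data) looks like this:

```python
import pandas as pd

# Toy stand-in for the real DataFrame: a 'label' column and a 'text' column
data = pd.DataFrame({
    'label': [0, 1] * 10,
    'text': ['样本文本 %d' % i for i in range(20)],
})

# Shuffle, then split 80/20, mirroring the steps above
data = data.sample(frac=1, random_state=42).reset_index(drop=True)
split = int(len(data) * 0.8)
train_data, test_data = data[:split], data[split:]

# Write the splits to the CSV files that TabularDataset will load
train_data.to_csv('thucnews_train.csv', index=False, encoding='utf-8')
test_data.to_csv('thucnews_test.csv', index=False, encoding='utf-8')

print(len(train_data), len(test_data))  # 16 4
```

With the real data, the same `to_csv` calls applied to the 8000/rest split above produce the two files used below.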

Next, we build a vocabulary and convert the texts into sequences of word indices. We use jieba for Chinese word segmentation and torchtext to build the vocabulary and the data loaders. Note that Field, TabularDataset, and BucketIterator belong to torchtext's legacy API: in torchtext 0.9–0.11 they live under torchtext.legacy.data, and they were removed in 0.12, so an older torchtext version is needed to run this code as written.

import jieba
from torchtext.data import Field, TabularDataset, BucketIterator

# Initialize jieba's dictionary (optional; it is also loaded lazily on first use)
jieba.initialize()

# Tokenization function: segment Chinese text into a list of words
def tokenize(text):
    return list(jieba.cut(text))

# Define the Fields; batch_first=True matches the batch-first GRU model below
TEXT = Field(sequential=True, tokenize=tokenize, lower=True, batch_first=True, include_lengths=True)
LABEL = Field(sequential=False, use_vocab=False)

# Load the train/test CSV files (label column first, then text)
train_datafields = [('label', LABEL), ('text', TEXT)]
train = TabularDataset(path='thucnews_train.csv', format='csv', fields=train_datafields, skip_header=True)

test_datafields = [('label', LABEL), ('text', TEXT)]
test = TabularDataset(path='thucnews_test.csv', format='csv', fields=test_datafields, skip_header=True)

# Build the vocabulary from the training set
TEXT.build_vocab(train)

# Create the batch iterators
train_loader, test_loader = BucketIterator.splits(
    (train, test), batch_size=32, sort_key=lambda x: len(x.text), repeat=False)

Next, we define the GRU model with torch.nn: a simple two-layer GRU that embeds the token indices, runs the GRU over the packed sequences, and classifies from the final hidden state.

import torch
import torch.nn as nn

class GRUNet(nn.Module):
    def __init__(self, input_size, hidden_size, output_size, num_layers):
        super(GRUNet, self).__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        
        self.embedding = nn.Embedding(input_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)
        
    def forward(self, x, lengths):
        # Map token indices to embedding vectors before packing
        embedded = self.embedding(x)
        packed = torch.nn.utils.rnn.pack_padded_sequence(
            embedded, lengths, batch_first=True, enforce_sorted=False)
        _, hidden = self.gru(packed)
        
        # hidden[-1] is the top layer's final hidden state for each sequence;
        # unlike output[:, -1, :] on the padded output, it is not affected by padding
        output = self.fc(hidden[-1])
        
        return output

Before training, we define a few hyperparameters and initialize the model, optimizer, and loss function.

vocab_size = len(TEXT.vocab)
input_size = vocab_size
hidden_size = 128
output_size = 2
num_layers = 2
learning_rate = 0.001
num_epochs = 10

model = GRUNet(input_size, hidden_size, output_size, num_layers)
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
criterion = nn.CrossEntropyLoss()

Now we can train and evaluate the model.

# Train the model
total_step = len(train_loader)
for epoch in range(num_epochs):
    for i, batch in enumerate(train_loader):
        labels = batch.label
        texts = batch.text[0]
        lengths = batch.text[1]  # pack_padded_sequence expects lengths on the CPU

        outputs = model(texts, lengths)
        loss = criterion(outputs, labels)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if (i+1) % 100 == 0:
            print('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}'
                  .format(epoch+1, num_epochs, i+1, total_step, loss.item()))
            
# Evaluate the model on the test set
model.eval()
with torch.no_grad():
    correct = 0
    total = 0

    for batch in test_loader:
        labels = batch.label
        texts = batch.text[0]
        lengths = batch.text[1]

        outputs = model(texts, lengths)
        _, predicted = torch.max(outputs.data, 1)

        total += labels.size(0)
        correct += (predicted == labels).sum().item()

    print('Test Accuracy: {} %'.format(100 * correct / total))

Once training is complete, we can use the trained model to classify the sentiment of new samples:

def predict_sentiment(model, sentence):
    model.eval()
    device = next(model.parameters()).device  # run on the same device as the model
    tokenized = tokenize(sentence)
    indexed = [TEXT.vocab.stoi[t] for t in tokenized]
    length = torch.LongTensor([len(indexed)])  # lengths stay on the CPU for packing
    
    tensor = torch.LongTensor(indexed).unsqueeze(0).to(device)
    
    with torch.no_grad():
        prediction = model(tensor, length)
    probabilities = nn.functional.softmax(prediction, dim=1)
    probability, predicted = torch.max(probabilities.squeeze(), 0)
    
    return probability.item(), predicted.item()
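The softmax-and-max step at the end of predict_sentiment can be checked in isolation with dummy logits (a standalone sketch; the logit values here are made up and simply stand in for the model's output):

```python
import torch
import torch.nn as nn

# Dummy logits for one sample over 2 classes, standing in for the model output
prediction = torch.tensor([[1.0, 3.0]])

# softmax turns logits into class probabilities; max returns the top one
probabilities = nn.functional.softmax(prediction, dim=1)
probability, predicted = torch.max(probabilities.squeeze(), 0)

print(predicted.item())              # 1: the class with the larger logit
print(round(probability.item(), 3))  # 0.881: its softmax probability
```

The `.squeeze()` drops the batch dimension of size 1, so `torch.max(..., 0)` runs over the class axis and returns both the winning probability and its class index.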

With the steps above, we have implemented a simple GRU network for Chinese sentiment analysis. The model can be trained and evaluated on the THUCNews dataset and then used to classify the sentiment of new samples.