Implementing a GRU Network for Chinese Sentiment Analysis in Python
Published: 2023-12-12 07:59:53
Chinese sentiment analysis can be built on a variety of deep learning models, including the Gated Recurrent Unit (GRU) network, a variant of the recurrent neural network (RNN). In this article, we will use Python and the PyTorch library to implement a simple GRU network for Chinese sentiment analysis.
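As a quick refresher on what a GRU actually computes: each step combines a reset gate r, an update gate z, and a candidate state, then interpolates between the old and candidate hidden states. The sketch below uses scalar state and made-up illustrative weights (not from any trained model) purely to show the update rule:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x, h, w):
    # w holds illustrative scalar weights: (W_z, U_z, W_r, U_r, W_h, U_h)
    W_z, U_z, W_r, U_r, W_h, U_h = w
    z = sigmoid(W_z * x + U_z * h)                 # update gate
    r = sigmoid(W_r * x + U_r * h)                 # reset gate
    h_tilde = math.tanh(W_h * x + U_h * (r * h))   # candidate state
    return (1.0 - z) * h + z * h_tilde             # blend old state and candidate

# Run a tiny input sequence through the cell; the state stays bounded in (-1, 1).
h = 0.0
for x in [1.0, -0.5, 0.3]:
    h = gru_step(x, h, (0.5, 0.5, 0.5, 0.5, 1.0, 1.0))
```

In the real network below, PyTorch's nn.GRU applies this same update with weight matrices instead of scalars, over every position of every sequence in the batch.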
First, we need to install the PyTorch library. It can be installed in a Python environment with the following command:
pip install torch
Next, we will use a Chinese sentiment analysis dataset such as THUCNews, which contains a collection of news texts with corresponding sentiment labels.
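The loading code below assumes each CSV row has a `label` column (0/1 sentiment) and a `text` column. As a minimal sketch of that assumed layout, here is how such a file could be written with the standard library (the two sample rows and the file name are invented for illustration):

```python
import csv

# Hypothetical two-row dataset in the assumed (label, text) layout.
rows = [
    {"label": 1, "text": "这部电影非常精彩"},    # positive example
    {"label": 0, "text": "剧情拖沓,令人失望"},  # negative example
]

with open("thucnews_sample.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["label", "text"])
    writer.writeheader()
    writer.writerows(rows)
```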
We start by loading the dataset, using the pandas library to load and shuffle the data:
import pandas as pd
# Load the dataset
data = pd.read_csv('thucnews.csv', encoding='utf-8')
# Shuffle the dataset
data = data.sample(frac=1).reset_index(drop=True)
# Split into training and test sets
train_data = data[:8000]
test_data = data[8000:]
Next, we need to build a vocabulary and convert the texts into index sequences. We will use the jieba library for word segmentation and the torchtext library to build the vocabulary and data loaders. (Note that Field, TabularDataset, and BucketIterator belong to torchtext's legacy API; in newer torchtext releases they moved to torchtext.legacy before being removed entirely.)
import jieba
from torchtext.data import Field, TabularDataset, BucketIterator
# Initialize the tokenizer
jieba.initialize()

# Define the tokenization function
def tokenize(text):
    return list(jieba.cut(text))

# Define the Fields; batch_first=True matches the GRU model defined below
TEXT = Field(sequential=True, tokenize=tokenize, lower=True, include_lengths=True, batch_first=True)
LABEL = Field(sequential=False, use_vocab=False)

# Load the datasets
train_datafields = [('label', LABEL), ('text', TEXT)]
train = TabularDataset(path='thucnews_train.csv', format='csv', fields=train_datafields, skip_header=True)
test_datafields = [('label', LABEL), ('text', TEXT)]
test = TabularDataset(path='thucnews_test.csv', format='csv', fields=test_datafields, skip_header=True)

# Build the vocabulary
TEXT.build_vocab(train)

# Create the data iterators
train_loader, test_loader = BucketIterator.splits((train, test), batch_size=32, sort_key=lambda x: len(x.text), repeat=False)
Next, we define the GRU model. Using torch.nn, we build a simple two-layer GRU network.
import torch
import torch.nn as nn
class GRUNet(nn.Module):
    def __init__(self, input_size, hidden_size, output_size, num_layers):
        super(GRUNet, self).__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.embedding = nn.Embedding(input_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x, lengths):
        # Map token indices to embeddings before packing
        embedded = self.embedding(x)
        packed = torch.nn.utils.rnn.pack_padded_sequence(embedded, lengths, batch_first=True, enforce_sorted=False)
        _, hidden = self.gru(packed)
        # hidden[-1] holds the last layer's final hidden state for each sequence,
        # taken at each sequence's true length rather than at the padded end
        return self.fc(hidden[-1])
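One way to sanity-check an architecture like this is to count its parameters. In PyTorch's GRU, each layer fuses three gates (reset, update, candidate), each with an input-to-hidden matrix, a hidden-to-hidden matrix, and two bias vectors. A quick sketch for the sizes used below (hidden_size=128, two layers, where the GRU's input size equals hidden_size because it follows the embedding):

```python
def gru_layer_params(input_size, hidden_size):
    # 3 gates, each with W_ih (input x hidden), W_hh (hidden x hidden),
    # and two bias vectors of length hidden (PyTorch's b_ih and b_hh).
    return 3 * (input_size * hidden_size + hidden_size * hidden_size + 2 * hidden_size)

hidden_size = 128
# Layer 0 sees the embedding output; layer 1 sees layer 0's hidden states.
total = gru_layer_params(hidden_size, hidden_size) + gru_layer_params(hidden_size, hidden_size)
# total == 198144 GRU parameters, excluding the embedding and the final linear layer
```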
Before training, we define a few hyperparameters and initialize the model, optimizer, and loss function.
vocab_size = len(TEXT.vocab)
input_size = vocab_size
hidden_size = 128
output_size = 2
num_layers = 2
learning_rate = 0.001
num_epochs = 10

model = GRUNet(input_size, hidden_size, output_size, num_layers)
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
criterion = nn.CrossEntropyLoss()
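CrossEntropyLoss takes the model's raw logits, applies log-softmax internally, and averages the negative log-probability of the true class. A hand-computed sketch for a single two-class example (the logit values are invented for illustration):

```python
import math

def cross_entropy(logits, target):
    # Softmax over the logits, then negative log of the true class's probability.
    exps = [math.exp(v) for v in logits]
    total = sum(exps)
    probs = [v / total for v in exps]
    return -math.log(probs[target])

# A confident, correct prediction yields a small loss (about 0.20 here).
loss = cross_entropy([2.0, 0.5], target=0)
```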
Now we can train and evaluate the model.
# Train the model
total_step = len(train_loader)
for epoch in range(num_epochs):
    for i, batch in enumerate(train_loader):
        labels = batch.label
        texts, lengths = batch.text
        outputs = model(texts, lengths)
        loss = criterion(outputs, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if (i + 1) % 100 == 0:
            print('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}'
                  .format(epoch + 1, num_epochs, i + 1, total_step, loss.item()))
# Evaluate the model
model.eval()
with torch.no_grad():
    correct = 0
    total = 0
    for batch in test_loader:
        labels = batch.label
        texts, lengths = batch.text
        outputs = model(texts, lengths)
        _, predicted = torch.max(outputs, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()
    print('Test Accuracy: {} %'.format(100 * correct / total))
After training, we can use the trained model to classify the sentiment of new samples:
def predict_sentiment(model, sentence):
    model.eval()
    tokenized = tokenize(sentence)
    indexed = [TEXT.vocab.stoi[t] for t in tokenized]
    tensor = torch.LongTensor(indexed).unsqueeze(0)   # shape: (1, seq_len)
    length_tensor = torch.LongTensor([len(indexed)])
    with torch.no_grad():
        prediction = model(tensor, length_tensor)
    probabilities = nn.functional.softmax(prediction, dim=1)
    probability, predicted = torch.max(probabilities.squeeze(), 0)
    return probability.item(), predicted.item()
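The softmax call in `predict_sentiment` turns the two logits into probabilities that sum to 1. For reference, here is a numerically stable pure-Python version, which subtracts the maximum logit before exponentiating (a standard trick that leaves the result unchanged but avoids overflow on large logits):

```python
import math

def softmax(logits):
    m = max(logits)                          # shift for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]

probs = softmax([3.2, -1.1])  # two-class example with invented logit values
```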
With the steps above, we have implemented a simple GRU network for Chinese sentiment analysis. The model can be trained and evaluated on the THUCNews dataset and then used to classify the sentiment of new samples.
