Using Python to Build a GRU Model for Predicting the Sentiment Polarity of Chinese Sentences
Published: 2023-12-12 07:50:28
Below is an example of a GRU model written in Python that predicts the sentiment polarity of Chinese sentences:
import numpy as np
import pandas as pd
import jieba
from keras.preprocessing.sequence import pad_sequences  # on newer Keras versions: from keras.utils import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, Embedding, GRU
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Load the dataset
data = pd.read_csv('sentiment_dataset.csv')

# Segment each sentence into words
def tokenize(sentence):
    return ' '.join(jieba.cut(sentence))

data['tokenized'] = data['sentence'].apply(tokenize)
# Build the vocabulary, reserving index 0 for the padding token
word2idx = {'<PAD>': 0}
idx = 1
for tokenized_sentence in data['tokenized']:
    for word in tokenized_sentence.split():
        if word not in word2idx:
            word2idx[word] = idx
            idx += 1
vocab_size = len(word2idx)
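The vocabulary-building loop can be checked on a toy corpus (the two pre-segmented sentences below are made up for illustration):

```python
# Build a word-to-index mapping from pre-segmented sentences,
# reserving index 0 for the padding token.
tokenized_sentences = ['我 喜欢 这部 电影', '我 不 喜欢 它']  # hypothetical data

word2idx = {'<PAD>': 0}
idx = 1
for sentence in tokenized_sentences:
    for word in sentence.split():
        if word not in word2idx:
            word2idx[word] = idx
            idx += 1

print(word2idx['<PAD>'])  # 0
print(word2idx['我'])     # 1 (first word encountered)
print(len(word2idx))      # 7 unique tokens, including <PAD>
```

Indices are assigned in order of first appearance, so repeated words ('我', '喜欢') keep the index they were given the first time.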
# Convert each sentence into a sequence of word indices
def to_sequence(sentence):
    tokens = sentence.split()
    seq = []
    for token in tokens:
        seq.append(word2idx[token])
    return seq

data['sequence'] = data['tokenized'].apply(to_sequence)
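One caveat: to_sequence raises a KeyError for any word not seen when the vocabulary was built, which matters if the model is later applied to new sentences. A defensive variant (a sketch, using a hypothetical '<UNK>' entry and a toy vocabulary) maps unknown words to a dedicated index instead:

```python
# OOV-safe variant of to_sequence: unknown words map to a hypothetical
# '<UNK>' index instead of raising KeyError.
word2idx = {'<PAD>': 0, '<UNK>': 1, '喜欢': 2, '电影': 3}  # toy vocabulary

def to_sequence_safe(sentence):
    unk = word2idx['<UNK>']
    return [word2idx.get(token, unk) for token in sentence.split()]

print(to_sequence_safe('喜欢 电影'))  # [2, 3]
print(to_sequence_safe('讨厌 电影'))  # [1, 3] -- '讨厌' is out of vocabulary
```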
# Pad the sequences to a uniform length
max_seq_len = max(data['sequence'].apply(len))
data['padded_sequence'] = pad_sequences(data['sequence'], maxlen=max_seq_len, padding='post').tolist()
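With padding='post', pad_sequences appends the padding value (0, the '<PAD>' index) to the end of each sequence until all sequences reach maxlen. The effect can be reproduced in plain Python (a minimal sketch, not the Keras implementation):

```python
# Reproduce the effect of pad_sequences(..., padding='post') in plain Python:
# append zeros until every sequence has the same length, truncating any longer ones.
def pad_post(sequences, maxlen, value=0):
    return [seq[:maxlen] + [value] * (maxlen - len(seq)) for seq in sequences]

seqs = [[1, 2, 3], [4, 5], [6]]
print(pad_post(seqs, maxlen=3))  # [[1, 2, 3], [4, 5, 0], [6, 0, 0]]
```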
# Build the training and test sets
X = np.array(data['padded_sequence'].to_list())
y = np.array(data['label'].to_list())
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Build the model
model = Sequential()
model.add(Embedding(vocab_size, 100, input_length=max_seq_len))
model.add(GRU(units=32, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
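At each timestep, the GRU layer blends its previous hidden state with a candidate state via learned gates. The numpy sketch below shows the per-timestep computation with randomly initialized weights (the dimensions match the layer sizes above, but the weight shapes and the gate convention follow the common textbook form, not necessarily Keras's internal layout):

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, units = 100, 32  # matches the embedding size and GRU units above

# Randomly initialized weights (a real layer learns these during training)
W = rng.normal(0, 0.1, (3, units, input_dim))  # input kernels for z, r, h~
U = rng.normal(0, 0.1, (3, units, units))      # recurrent kernels
b = np.zeros((3, units))                       # biases

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev):
    z = sigmoid(W[0] @ x + U[0] @ h_prev + b[0])              # update gate
    r = sigmoid(W[1] @ x + U[1] @ h_prev + b[1])              # reset gate
    h_tilde = np.tanh(W[2] @ x + U[2] @ (r * h_prev) + b[2])  # candidate state
    return z * h_prev + (1.0 - z) * h_tilde                   # blend old and new

h = np.zeros(units)
for _ in range(5):  # run a few timesteps of random stand-in "embeddings"
    h = gru_step(rng.normal(size=input_dim), h)
print(h.shape)  # (32,)
```

Because each new state is a convex combination of the previous state and a tanh-bounded candidate, every component of h stays strictly inside (-1, 1).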
# Train the model
model.fit(X_train, y_train, batch_size=32, epochs=10)
# Evaluate the model on the test set
# Note: Sequential.predict_classes was removed in recent Keras versions;
# threshold the predicted probabilities instead.
y_pred = (model.predict(X_test) > 0.5).astype(int)
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy:', accuracy)
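Because the output layer is a single sigmoid unit, model.predict returns probabilities in [0, 1], and class labels come from thresholding at 0.5. A minimal numpy illustration (the probability values are made up):

```python
import numpy as np

# model.predict on a sigmoid output returns one probability per sample;
# thresholding at 0.5 converts them to hard 0/1 labels.
probs = np.array([[0.12], [0.91], [0.50], [0.73]])  # made-up predictions
labels = (probs > 0.5).astype(int).ravel()
print(labels)  # [0 1 0 1]
```

Note that a probability of exactly 0.50 fails the strict `>` comparison and is labeled 0.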
The code above implements a GRU-based sentiment polarity classifier. First, we load a dataset of Chinese sentences paired with sentiment polarity labels. We then segment the sentences with the jieba library and build a vocabulary and word-index sequences. Next, we pad the sequences to a uniform length with pad_sequences and split the data into training and test sets. We then use Keras to build a model consisting of an embedding layer, a GRU layer, and an output layer, compiled with the Adam optimizer and binary cross-entropy loss. Finally, we train the model and evaluate its performance on the test set.
