Implementing a GRU Model for Chinese Named Entity Recognition in Python
Chinese named entity recognition (NER) is the task of identifying and classifying named entities in Chinese text. Named entities are spans with a specific meaning and well-defined boundaries, such as person names, place names, and organization names. NER plays an important role in natural language processing tasks such as information extraction, question answering, and machine translation. In this article, we implement a Chinese NER model based on a GRU (Gated Recurrent Unit) in Python and walk through a usage example.
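To make the task concrete, here is a minimal sketch of the token-level framing used throughout this article (the sentence, segmentation, and labels are illustrative assumptions): each segmented token gets a binary label, 1 if it lies inside an entity and 0 otherwise.
# Illustrative only: binary token labels marking the entity 《中国和世界》
tokens = ['我', '读', '了', '《', '中国', '和', '世界', '》']
labels = [0, 0, 0, 1, 1, 1, 1, 1]
for tok, lab in zip(tokens, labels):
    print(tok, lab)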
1. Data Preparation
First, we need data for training and testing. As an example, we use an article from the publicly available Chinese website 《中国和世界》 (https://www.qianmu.org/).
import requests
from bs4 import BeautifulSoup
import re

# Extract the article body from a web page
def extract_article(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    content = soup.find('div', class_='article_content').find_all('p')
    article = ''.join([p.get_text() for p in content])
    return article

# Extract entity annotations: titles enclosed in 《...》
def extract_entities(article):
    entities = re.findall('(《[\u4e00-\u9fa5]+?》)', article)
    return entities

# Prepare the training data
def prepare_data():
    url = 'https://www.qianmu.org/china-and-world/trump-vs-trudeau/'
    article = extract_article(url)
    entities = extract_entities(article)
    return article, entities

article, entities = prepare_data()
print('Article:', article)
print('Entities:', entities)
Running the code above prints the article extracted from the page and the entity annotations found in it.
2. Feature Extraction
For Chinese NER we can use word segmentation, part-of-speech tagging, and similar techniques to extract word-level features. Here we use the jieba library for segmentation:
import jieba

# Segment the article into words
def word_segmentation(article):
    words = jieba.cut(article)
    return list(words)

words = word_segmentation(article)
print('Words:', words)
Running the code above prints the list of segmented words.
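As a quick illustration of what jieba produces, the snippet below segments a short sample sentence (the sentence is an assumption borrowed from jieba's documentation; the exact output can vary with jieba's version and dictionary):
import jieba

# Output may vary by jieba version and dictionary
sample = '我来到北京清华大学'
print(list(jieba.cut(sample)))  # e.g. ['我', '来到', '北京', '清华大学']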
3. Data Preprocessing
Next we convert the data into a format the model can consume. First, build a vocabulary and assign each word an index:
# Build the vocabulary; index 0 is reserved for the <pad> token
vocab = set(words)
word2idx = {word: idx+1 for idx, word in enumerate(vocab)}
word2idx['<pad>'] = 0
idx2word = {idx: word for word, idx in word2idx.items()}
print('Vocab:', len(vocab))
Then we map each word to its index and convert the entity annotations into start/end index spans over the word sequence:
# Map words to their indices
def convert_words_to_idx(words, word2idx):
    return [word2idx[word] for word in words]

word_idx = convert_words_to_idx(words, word2idx)

# Convert entity annotations into [entity, start, end] index spans by
# locating each entity's segmented tokens in the word sequence
def convert_entities_to_idx(entities, words):
    entities_idx = []
    for entity in entities:
        entity_words = list(jieba.cut(entity))
        n = len(entity_words)
        for i in range(len(words) - n + 1):
            if words[i:i+n] == entity_words:
                entities_idx.append([entity, i, i + n - 1])
                break
    return entities_idx

entities_idx = convert_entities_to_idx(entities, words)
print('Word Index:', word_idx)
print('Entities Index:', entities_idx)
Running the code above prints the index of every word and the start/end span of every entity.
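As a sanity check (a minimal sketch, assuming at least one entity span was located), a span can be decoded back to text through idx2word:
# Decode the first located entity span back to its surface text
if entities_idx:
    _, start, end = entities_idx[0]
    recovered = ''.join(idx2word[i] for i in word_idx[start:end+1])
    print('Recovered entity:', recovered)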
4. Building the Model
Next, we build the NER model around a GRU. Since NER is a token-level labeling task, the model emits one prediction per input token rather than a single prediction per sequence:
import torch
import torch.nn as nn

# Model definition: embedding -> GRU -> per-token linear classifier
class GRUNER(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim):
        super(GRUNER, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.gru = nn.GRU(embedding_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        embedded = self.embedding(x.long())
        output, _ = self.gru(embedded)   # per-token hidden states
        out = self.fc(output)            # (batch, seq_len, output_dim)
        return out

# Model initialization; vocab_size counts the <pad> token as well
vocab_size = len(word2idx)
embedding_dim = 128
hidden_dim = 256
output_dim = 2
model = GRUNER(vocab_size, embedding_dim, hidden_dim, output_dim)
print(model)
Running the code above prints the model's structure.
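Before training, a dummy forward pass (a minimal sketch; the batch and sequence sizes are arbitrary) confirms that the model emits one logit pair per token:
# Dummy batch: 2 sequences of 5 random token indices each
dummy = torch.randint(0, vocab_size, (2, 5))
with torch.no_grad():
    print(model(dummy).shape)  # expected: torch.Size([2, 5, 2])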
5. Training the Model
Next we train the model on the training data, using the Adam optimizer and cross-entropy loss:
import torch.optim as optim
import numpy as np

# Split the token stream and entity spans into train/test portions;
# each entity is assigned to the split that contains its span
def split_data(word_idx, entities_idx, split_ratio):
    split_point = int(len(word_idx) * split_ratio)
    train_words = word_idx[:split_point]
    test_words = word_idx[split_point:]
    train_entities = [e for e in entities_idx if e[2] < split_point]
    test_entities = [[ent, s - split_point, e - split_point]
                     for ent, s, e in entities_idx if s >= split_point]
    return train_words, train_entities, test_words, test_entities

split_ratio = 0.8
train_words, train_entities, test_words, test_entities = split_data(word_idx, entities_idx, split_ratio)

# Training setup
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
train_X = torch.tensor(train_words, dtype=torch.long)

# Token-level labels: 1 for tokens inside any entity span, 0 otherwise
labels = np.zeros(len(train_words), dtype=np.int64)
for _, start, end in train_entities:
    labels[start:end+1] = 1
train_y = torch.tensor(labels, dtype=torch.long)
def train_model(model, train_X, train_y, criterion, optimizer, num_epochs=10, batch_size=64):
    model.train()
    for epoch in range(num_epochs):
        running_loss = 0.0
        num_batches = 0
        for i in range(0, len(train_X), batch_size):
            # Treat each slice of the token stream as one sequence of
            # shape (1, seq_len) so the GRU receives a batched 3-D input
            inputs = train_X[i:i+batch_size].unsqueeze(0)
            labels = train_y[i:i+batch_size]
            optimizer.zero_grad()
            outputs = model(inputs).squeeze(0)   # (seq_len, output_dim)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            running_loss += loss.item()
            num_batches += 1
        epoch_loss = running_loss / num_batches
        print('Epoch [{}/{}], Loss: {:.4f}'.format(epoch+1, num_epochs, epoch_loss))

train_model(model, train_X, train_y, criterion, optimizer)
Running the code above prints the loss for each training epoch.
6. Prediction
Finally, we run the model on the test data:
# Predict a 0/1 label for every token in the test stream
def predict(model, test_words):
    model.eval()
    test_X = torch.tensor(test_words, dtype=torch.long).unsqueeze(0)
    with torch.no_grad():
        logits = model(test_X).squeeze(0)   # (seq_len, output_dim)
        predicted = torch.argmax(logits, dim=1)
    return predicted

test_y_pred = predict(model, test_words)
print('Predicted:', test_y_pred)
Running the code above prints the model's per-token predictions.
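Raw predictions are hard to judge by eye, so token-level precision, recall, and F1 can be computed against gold labels built the same way as the training labels (a minimal sketch, assuming the shifted test_entities spans produced by split_data above):
# Build gold labels for the test tokens and score the predictions
gold = np.zeros(len(test_words), dtype=np.int64)
for _, start, end in test_entities:
    gold[start:end+1] = 1
pred = test_y_pred.numpy()
tp = int(((pred == 1) & (gold == 1)).sum())
fp = int(((pred == 1) & (gold == 0)).sum())
fn = int(((pred == 0) & (gold == 1)).sum())
precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
print('P: {:.3f}, R: {:.3f}, F1: {:.3f}'.format(precision, recall, f1))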
With the steps above we have implemented a GRU-based Chinese named entity recognition model and trained and evaluated it on real data. The model is a reasonable starting point for NER tasks, but to make it more accurate and robust you could use more training data, adjust the model's architecture and hyperparameters, and so on.
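As one example of such an architectural adjustment (a minimal sketch, not part of the pipeline in this article), making the GRU bidirectional lets each token's prediction use both left and right context; note that the input size of the output layer doubles:
import torch.nn as nn

# Bidirectional variant: the classifier sees forward and backward states
class BiGRUNER(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim):
        super(BiGRUNER, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.gru = nn.GRU(embedding_dim, hidden_dim,
                          batch_first=True, bidirectional=True)
        self.fc = nn.Linear(hidden_dim * 2, output_dim)

    def forward(self, x):
        embedded = self.embedding(x.long())
        output, _ = self.gru(embedded)   # (batch, seq_len, 2 * hidden_dim)
        return self.fc(output)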
