Implementing a BERT Model for Chinese NER in Python
Published: 2023-12-27 12:21:39
Chinese NER (Named Entity Recognition) is the task of identifying named entities in Chinese text, such as person names, place names, and organization names. BERT (Bidirectional Encoder Representations from Transformers) is a pretrained language model that can be applied to many natural language processing tasks. Below is example code for implementing a BERT model for Chinese NER in Python.
First, install the required libraries, including transformers, torch, and seqeval:
```shell
pip install transformers torch seqeval
```
Next, download the pretrained BERT weights, e.g. "bert-base-chinese", and import the relevant libraries:
```python
import torch
from transformers import BertTokenizer, BertForTokenClassification
from seqeval.metrics import f1_score
```
Then load the pretrained BERT model and its tokenizer. Note that `num_labels` must match the size of the tag set used in the training data (here three tags: O, B-LOC, I-LOC), otherwise the classification head defaults to two labels:
```python
model_name = 'bert-base-chinese'
tokenizer = BertTokenizer.from_pretrained(model_name)
# num_labels must equal the number of distinct tags in the training data
model = BertForTokenClassification.from_pretrained(model_name, num_labels=3)
```
The next step is to prepare the training data for the NER task. The training data should be a list in which each element represents one sample and consists of two parallel lists: a list of tokens and the corresponding list of labels in the BIO tagging scheme. For example:
```python
train_data = [
    (['我', '爱', '北京', '天安门', '。'], ['O', 'O', 'B-LOC', 'I-LOC', 'O']),
    (['今天', '天气', '很', '好', '。'], ['O', 'O', 'O', 'O', 'O']),
    ...
]
```
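As an aside, BIO tags can be decoded back into entity spans, which is what you ultimately want from an NER system. A minimal sketch (the helper name `bio_to_spans` is illustrative, not part of the original code):

```python
def bio_to_spans(tokens, tags):
    """Decode a BIO tag sequence into (entity_type, text) spans."""
    spans = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith('B-'):
            # A B- tag always starts a new entity, closing any open one.
            if current_type is not None:
                spans.append((current_type, ''.join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith('I-') and current_type == tag[2:]:
            # An I- tag of the same type continues the open entity.
            current_tokens.append(token)
        else:
            # An O tag (or a mismatched I-) closes any open entity.
            if current_type is not None:
                spans.append((current_type, ''.join(current_tokens)))
            current_type, current_tokens = None, []
    if current_type is not None:
        spans.append((current_type, ''.join(current_tokens)))
    return spans

print(bio_to_spans(['我', '爱', '北京', '天安门', '。'],
                   ['O', 'O', 'B-LOC', 'I-LOC', 'O']))
# → [('LOC', '北京天安门')]
```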
Next, preprocess the training data: convert each token into its ids, add the special "[CLS]" and "[SEP]" markers, and map the string tags to integer label ids. Positions that should not contribute to the loss ([CLS], [SEP], and trailing sub-word pieces) are marked with -100, the index that PyTorch's cross-entropy loss ignores by default.
```python
def preprocess_data(data, tokenizer, label2id):
    input_ids = []
    labels = []
    for tokens, tags in data:
        token_ids = [tokenizer.cls_token_id]
        label_ids = [-100]  # [CLS] carries no tag; -100 is ignored by the loss
        for token, tag in zip(tokens, tags):
            pieces = tokenizer.tokenize(token) or [tokenizer.unk_token]
            token_ids += tokenizer.convert_tokens_to_ids(pieces)
            # Only the first sub-word piece keeps the tag; the rest are ignored,
            # so a B- tag is never duplicated across sub-words.
            label_ids += [label2id[tag]] + [-100] * (len(pieces) - 1)
        token_ids.append(tokenizer.sep_token_id)
        label_ids.append(-100)  # [SEP] carries no tag either
        input_ids.append(token_ids)
        labels.append(label_ids)
    return input_ids, labels

# Tag-to-id mapping; it must match the model's num_labels
label2id = {'O': 0, 'B-LOC': 1, 'I-LOC': 2}
train_input_ids, train_labels = preprocess_data(train_data, tokenizer, label2id)
```
Next, convert the training data into PyTorch tensors:
```python
train_tensors = [torch.tensor(ids) for ids in train_input_ids]
train_labels_tensors = [torch.tensor(ids) for ids in train_labels]
```
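Because the samples have different lengths, they must later be padded into a rectangular batch before being fed to the model; `torch.nn.utils.rnn.pad_sequence` handles this. A small sketch with made-up token ids:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Two sequences of different lengths (the ids are illustrative, not real vocab ids)
seqs = [torch.tensor([101, 2769, 102]), torch.tensor([101, 791, 1921, 102])]

# Pads every sequence to the length of the longest one with padding_value
padded = pad_sequence(seqs, batch_first=True, padding_value=0)
print(padded.shape)       # torch.Size([2, 4])
print(padded[0].tolist())  # [101, 2769, 102, 0]
```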
Next, define the training hyperparameters, such as the batch size, learning rate, and number of epochs, together with the optimizer and learning-rate scheduler. Note that BertForTokenClassification computes the cross-entropy loss internally when `labels` are passed (ignoring positions labelled -100), so no separate loss function is needed:
```python
batch_size = 8
num_epochs = 5
learning_rate = 1e-5
optimizer = torch.optim.AdamW(params=model.parameters(), lr=learning_rate)
# One scheduler step per optimizer step, so count batches, not samples
steps_per_epoch = (len(train_tensors) + batch_size - 1) // batch_size
total_steps = steps_per_epoch * num_epochs
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=learning_rate, total_steps=total_steps)
```
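OneCycleLR warms the learning rate up towards `max_lr` and then anneals it back down over `total_steps` optimizer steps, which is why `total_steps` must match the number of training batches. A toy model can illustrate the shape of the schedule (the `Linear` layer here is just a stand-in to drive the scheduler):

```python
import torch

model = torch.nn.Linear(4, 2)  # stand-in model, illustrative only
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=1e-5, total_steps=10)

lrs = []
for _ in range(10):
    optimizer.step()
    scheduler.step()
    lrs.append(optimizer.param_groups[0]['lr'])

# The learning rate rises towards max_lr early on, then anneals towards zero.
```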
Finally, we can train the model:
```python
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)
id2label = {i: label for label, i in label2id.items()}

model.train()
for epoch in range(num_epochs):
    for i in range(0, len(train_tensors), batch_size):
        # Pad each batch to a rectangle; labels are padded with -100 so the
        # padded positions are ignored by the loss.
        batch_input_ids = torch.nn.utils.rnn.pad_sequence(
            train_tensors[i:i + batch_size], batch_first=True,
            padding_value=tokenizer.pad_token_id).to(device)
        batch_labels = torch.nn.utils.rnn.pad_sequence(
            train_labels_tensors[i:i + batch_size], batch_first=True,
            padding_value=-100).to(device)
        attention_mask = (batch_input_ids != tokenizer.pad_token_id).long()
        optimizer.zero_grad()
        outputs = model(input_ids=batch_input_ids,
                        attention_mask=attention_mask,
                        labels=batch_labels)
        outputs.loss.backward()
        optimizer.step()
        scheduler.step()

    # Compute the F1 score on the training set
    model.eval()
    pred_labels, true_labels = [], []
    with torch.no_grad():
        for ids, label_ids in zip(train_tensors, train_labels_tensors):
            logits = model(input_ids=ids.unsqueeze(0).to(device)).logits
            preds = logits.argmax(dim=-1).squeeze(0).cpu()
            mask = label_ids != -100  # skip [CLS], [SEP] and sub-word pieces
            pred_labels.append([id2label[int(p)] for p in preds[mask]])
            true_labels.append([id2label[int(t)] for t in label_ids[mask]])
    f1 = f1_score(true_labels, pred_labels)
    print("Epoch: {}/{}".format(epoch + 1, num_epochs), "F1 Score:", f1)
    model.train()
```
That concludes the example code for implementing a BERT model for Chinese NER in Python. By loading the model, preprocessing the data, defining the training parameters, and training step by step, you can perform named entity recognition on Chinese text. Note that this code is only an example; in practice it may need to be adjusted and optimized for your specific requirements.
