Evaluating PyTorch's pretrained BertModel() for Chinese sentiment analysis

Published: 2023-12-16 11:36:05

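In Chinese sentiment analysis, a pretrained BertModel() can extract text features that are then used for sentiment classification. The example below shows how to evaluate PyTorch's pretrained BertModel() on a Chinese sentiment analysis task.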

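First, we need to prepare the dataset. Suppose we have a dataset of Chinese reviews and their labels; we can use the Pandas library to read and process it. Assume the dataset has the following format: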

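| 评论 (review) | 标签 (label) |
| ----------- | ------ |
| “这部电影真的太好看了！” (This movie is really great!) | 积极 (positive) |
| “但是我觉得结局有点令人失望。” (But I found the ending a bit disappointing.) | 消极 (negative) |
| ... | ... |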

Given this format, we can load and preprocess the data as follows:

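import pandas as pd
from sklearn.model_selection import train_test_split

# Load the dataset
data = pd.read_csv('data.csv', encoding='utf-8')
# Map the string labels to integer class ids (0 = negative, 1 = positive)
label_map = {'消极': 0, '积极': 1}
data['标签'] = data['标签'].map(label_map)
# Split into training and test sets; plain lists keep positional indexing valid after shuffling
train_data, test_data, train_labels, test_labels = train_test_split(
    data['评论'].tolist(), data['标签'].tolist(), test_size=0.2, random_state=42)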

Next, we need a tokenizer to split and encode the text so that it can be fed into BertModel(). The BertTokenizer provided by Hugging Face's transformers library does this:

from transformers import BertTokenizer

# Load the Chinese BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')

Then we need to convert the text into BERT's input format. BERT's input has three parts: input ids, attention masks, and token type ids. We can define a function that converts the data into this format:

import torch

def preprocess(texts, labels, tokenizer, max_len):
    input_ids = []
    attention_masks = []
    token_type_ids = []
    encoded_labels = []

    # zip iterates by position, so this works for lists and for pandas Series
    # whose index was shuffled by train_test_split
    for text, label in zip(texts, labels):
        # Tokenize and encode the text, padding or truncating to max_len
        encoded_text = tokenizer(
            text,
            truncation=True,
            max_length=max_len,
            padding='max_length',
            return_tensors='pt'
        )

        # Collect the encoded inputs and the label
        input_ids.append(encoded_text['input_ids'])
        attention_masks.append(encoded_text['attention_mask'])
        token_type_ids.append(encoded_text['token_type_ids'])
        # Labels must already be integer class ids
        encoded_labels.append(torch.tensor(label, dtype=torch.long))

    # Concatenate the per-example (1, max_len) tensors into (N, max_len) tensors
    input_ids = torch.cat(input_ids, dim=0)
    attention_masks = torch.cat(attention_masks, dim=0)
    token_type_ids = torch.cat(token_type_ids, dim=0)
    encoded_labels = torch.stack(encoded_labels)

    return input_ids, attention_masks, token_type_ids, encoded_labels

# Set the maximum sequence length
max_len = 128
# Preprocess the data
train_input_ids, train_attention_masks, train_token_type_ids, train_labels = preprocess(train_data, train_labels, tokenizer, max_len)
test_input_ids, test_attention_masks, test_token_type_ids, test_labels = preprocess(test_data, test_labels, tokenizer, max_len)
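To make the three inputs concrete, here is a minimal pure-Python sketch of what a single-sentence encoding looks like after padding. It is a toy illustration, not the real tokenizer: `TOY_VOCAB` and `toy_encode` are hypothetical names, the eight-entry vocabulary is made up, and the text is simply split per character, which only approximates how the Chinese BERT tokenizer treats CJK text.

```python
# Toy illustration only: a hypothetical 8-entry vocabulary, not the real bert-base-chinese vocab.
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "电": 4, "影": 5, "好": 6, "看": 7}


def toy_encode(text, max_len):
    """Mimic the shape of a single-sentence BERT encoding with character-level tokens."""
    # Reserve two positions for [CLS] and [SEP], then truncate the characters.
    chars = list(text)[: max_len - 2]
    tokens = ["[CLS]"] + chars + ["[SEP]"]
    input_ids = [TOY_VOCAB.get(tok, TOY_VOCAB["[UNK]"]) for tok in tokens]
    # 1 marks a real token, 0 marks padding that the model should ignore.
    attention_mask = [1] * len(input_ids)
    pad = max_len - len(input_ids)
    input_ids += [TOY_VOCAB["[PAD]"]] * pad
    attention_mask += [0] * pad
    # Single-sentence input: every position belongs to segment 0.
    token_type_ids = [0] * max_len
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "token_type_ids": token_type_ids,
    }


encoded = toy_encode("电影好看", max_len=8)
print(encoded["input_ids"])       # [2, 4, 5, 6, 7, 3, 0, 0]
print(encoded["attention_mask"])  # [1, 1, 1, 1, 1, 1, 0, 0]
print(encoded["token_type_ids"])  # [0, 0, 0, 0, 0, 0, 0, 0]
```

For sentiment classification each example is a single sentence, so the token type ids are all zero; they only differ from zero for the second segment of sentence-pair inputs.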

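Next, we can load the pretrained BertModel() and build a classifier on top of it to classify sentiment. The BertForSequenceClassification model provided by the transformers library does this by wrapping BertModel with a classification layer: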

from transformers import BertForSequenceClassification

# Load the pretrained BertForSequenceClassification model
model = BertForSequenceClassification.from_pretrained('bert-base-chinese', num_labels=2)
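Conceptually, the classifier on top of BERT is a linear layer that turns the pooled sentence representation into one logit per class. The sketch below shows that step in plain Python with made-up numbers: `linear_head`, the 3-dimensional `pooled` vector, and the weights are all hypothetical, whereas a real bert-base model uses a 768-dimensional vector and learned weights.

```python
# Toy numbers only: a real bert-base model uses a 768-dimensional pooled vector.
def linear_head(pooled, weights, bias):
    """Compute one logit per class: logits[c] = dot(weights[c], pooled) + bias[c]."""
    return [sum(w * x for w, x in zip(row, pooled)) + b for row, b in zip(weights, bias)]


pooled = [0.5, -1.0, 2.0]        # hypothetical pooled [CLS] representation
weights = [[1.0, 0.0, -1.0],     # class 0 (negative)
           [0.0, 1.0, 1.0]]      # class 1 (positive)
bias = [0.0, 0.5]

logits = linear_head(pooled, weights, bias)
# The predicted class is the index of the largest logit.
predicted = max(range(len(logits)), key=logits.__getitem__)
print(logits)     # [-1.5, 1.5]
print(predicted)  # 1
```

With `num_labels=2` the head produces two logits per example, which is why the evaluation step later takes an argmax over dimension 1.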

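After loading the pretrained model, we pair it with an optimizer and a loss, then train it iteratively. When `labels` are passed, BertForSequenceClassification computes the cross-entropy loss internally and returns it as `outputs.loss`: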

from torch.utils.data import TensorDataset, DataLoader
from torch.optim import AdamW

# Use a GPU if one is available and move the model to it
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

# Create the training data loader
train_dataset = TensorDataset(train_input_ids, train_attention_masks, train_token_type_ids, train_labels)
train_dataloader = DataLoader(train_dataset, batch_size=32, shuffle=True)

# Define the optimizer; the loss is computed inside the model when labels are passed
optimizer = AdamW(model.parameters(), lr=2e-5)

# Train the model
epochs = 10
for epoch in range(epochs):
    model.train()

    for batch in train_dataloader:
        # Move the batch to the same device as the model
        batch_input_ids = batch[0].to(device)
        batch_attention_masks = batch[1].to(device)
        batch_token_type_ids = batch[2].to(device)
        batch_labels = batch[3].to(device)

        optimizer.zero_grad()
        outputs = model(
            input_ids=batch_input_ids,
            attention_mask=batch_attention_masks,
            token_type_ids=batch_token_type_ids,
            labels=batch_labels
        )
        loss = outputs.loss
        loss.backward()
        optimizer.step()
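For intuition about the `outputs.loss` value minimized above: for single-label classification it is the cross-entropy between the softmax of the logits and the true class, averaged over the batch. The following is a minimal pure-Python sketch for one example; `softmax`, `cross_entropy`, and the logits are illustrative and not part of the training code.

```python
import math


def softmax(logits):
    """Convert logits to probabilities, subtracting the max for numerical stability."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Negative log-probability that the model assigns to the true class."""
    return -math.log(softmax(logits)[label])


logits = [-1.5, 1.5]  # made-up logits for one review: [negative, positive]
print(softmax(logits))
print(cross_entropy(logits, 1))  # small loss: the true class already has high probability
print(cross_entropy(logits, 0))  # large loss: the true class has low probability
```

Each optimizer step nudges the weights so that the logit of the true class grows relative to the other, which lowers this value.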

Finally, we can evaluate the model on the test set by computing metrics such as accuracy, precision, recall, and F1-score.

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Create the test data loader
test_dataset = TensorDataset(test_input_ids, test_attention_masks, test_token_type_ids, test_labels)
test_dataloader = DataLoader(test_dataset, batch_size=32, shuffle=False)

# Evaluate the model
model.eval()
predictions = []
true_labels = []

for batch in test_dataloader:
    batch_input_ids = batch[0].to(device)
    batch_attention_masks = batch[1].to(device)
    batch_token_type_ids = batch[2].to(device)
    batch_labels = batch[3]

    with torch.no_grad():
        outputs = model(
            input_ids=batch_input_ids,
            attention_mask=batch_attention_masks,
            token_type_ids=batch_token_type_ids
        )

    logits = outputs.logits
    batch_predictions = torch.argmax(logits, dim=1)
    predictions.extend(batch_predictions.cpu().numpy().tolist())
    true_labels.extend(batch_labels.numpy().tolist())

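# Compute the metrics (weighted by the number of true examples in each class)
accuracy = accuracy_score(true_labels, predictions)
precision = precision_score(true_labels, predictions, average='weighted')
recall = recall_score(true_labels, predictions, average='weighted')
f1 = f1_score(true_labels, predictions, average='weighted')
print(f'accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}')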
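To show what `average='weighted'` means, here is a small pure-Python sketch that computes the same four metrics by hand: each class gets its own precision, recall, and F1, and these are averaged with weights proportional to how many true examples the class has. The `weighted_metrics` helper and the toy labels are made up for illustration, and classes with no predictions are scored as 0 here.

```python
from collections import Counter


def weighted_metrics(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall, and F1 over all true classes."""
    n = len(y_true)
    support = Counter(y_true)  # number of true examples per class
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n
    precision = recall = f1 = 0.0
    for cls, count in support.items():
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        predicted = sum(p == cls for p in y_pred)
        p_cls = tp / predicted if predicted else 0.0
        r_cls = tp / count
        f_cls = 2 * p_cls * r_cls / (p_cls + r_cls) if (p_cls + r_cls) else 0.0
        weight = count / n
        precision += weight * p_cls
        recall += weight * r_cls
        f1 += weight * f_cls
    return accuracy, precision, recall, f1


y_true = [1, 1, 1, 0, 0, 1]  # toy labels: 1 = positive, 0 = negative
y_pred = [1, 0, 1, 0, 1, 1]
print([round(v, 3) for v in weighted_metrics(y_true, y_pred)])  # [0.667, 0.667, 0.667, 0.667]
```

Note that support-weighted recall always equals accuracy, so on an imbalanced dataset it can be worth also looking at per-class or macro-averaged scores.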

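In this example, we fine-tuned the pretrained BertModel() and evaluated it on a Chinese sentiment analysis task. Depending on your needs, you can adjust the model parameters, the number of training epochs, the optimizer, and so on to get better performance.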