
Text Classification with the PyTorch_Pretrained_BERT Modeling Module

Published: 2024-01-15 09:09:34

PyTorch_Pretrained_BERT (pytorch_pretrained_bert, the predecessor of Hugging Face's transformers package) is an open-source toolkit for natural language processing. Its modeling module provides pretrained BERT models that can be adapted to downstream tasks such as sentiment analysis and other forms of text classification. The walkthrough below uses sentiment analysis as the running example.

First, install the library:

pip install pytorch-pretrained-bert

Next, import the required libraries and modules:

import torch
from torch import nn
from torch.utils.data import TensorDataset, DataLoader
from pytorch_pretrained_bert import BertTokenizer, BertModel

Then, load the pretrained BERT model and its vocabulary:

model_name = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name)
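
To sanity-check that the model and vocabulary loaded correctly, you can tokenize a sample sentence. The calls below (tokenize and convert_tokens_to_ids) are the tokenizer methods this library exposes; the printed tokens are what bert-base-uncased's lowercasing WordPiece vocabulary typically produces:

sample = "I love this movie."
tokens = tokenizer.tokenize(sample)            # e.g. ['i', 'love', 'this', 'movie', '.']
ids = tokenizer.convert_tokens_to_ids(tokens)  # vocabulary indices for each token
print(tokens)
print(ids)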

Next, define a text classification model on top of BERT:

class Classifier(nn.Module):
    def __init__(self, bert, hidden_size, num_labels):
        super(Classifier, self).__init__()
        self.bert = bert
        self.dropout = nn.Dropout(0.1)
        self.linear = nn.Linear(hidden_size, num_labels)

    def forward(self, input_ids, attention_mask=None, token_type_ids=None):
        # BertModel.forward takes (input_ids, token_type_ids, attention_mask),
        # so pass the masks by keyword to avoid swapping them.
        _, pooled_output = self.bert(input_ids,
                                     token_type_ids=token_type_ids,
                                     attention_mask=attention_mask,
                                     output_all_encoded_layers=False)
        pooled_output = self.dropout(pooled_output)
        logits = self.linear(pooled_output)
        return logits

In this example, we take BERT's sentence-level pooled_output and feed it through a dropout layer and a linear layer to produce one logit per class. Note that the model returns raw logits rather than probabilities: nn.CrossEntropyLoss applies log-softmax internally, so adding a Softmax layer before the loss would be incorrect.
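
As a quick shape check (a minimal sketch, not part of the training pipeline), you can run the classifier on a dummy batch and confirm the output is [batch_size, num_labels]:

clf = Classifier(model, hidden_size=768, num_labels=2)
dummy_ids = torch.zeros((2, 128), dtype=torch.long)   # batch of 2, all [PAD] tokens
dummy_mask = torch.ones((2, 128), dtype=torch.long)
with torch.no_grad():
    out = clf(dummy_ids, attention_mask=dummy_mask)
print(out.shape)  # torch.Size([2, 2])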

Next, prepare the dataset. The example dataset lives in a folder with two files: one containing the texts and labels for training, the other for testing. Each line of each file holds one sample, with the text and its label separated by a tab (a sketch for generating toy files follows the sample lines below). For example:

I love this movie.    positive
This book is boring.    negative
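
If you just want to exercise the pipeline end to end before wiring up a real dataset, a minimal sketch that writes a handful of toy examples to train.txt and test.txt could look like this (the sentences are placeholders, not real data):

train_samples = [
    ("I love this movie.", "positive"),
    ("This book is boring.", "negative"),
    ("What a fantastic performance!", "positive"),
    ("The plot makes no sense.", "negative"),
]
test_samples = [
    ("A truly enjoyable film.", "positive"),
    ("I fell asleep halfway through.", "negative"),
]

def write_tsv(path, samples):
    with open(path, 'w', encoding='utf-8') as f:
        for text, label in samples:
            f.write(text + '\t' + label + '\n')

write_tsv("train.txt", train_samples)
write_tsv("test.txt", test_samples)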

Load the dataset and convert the texts into BERT's input format:

def load_data(file_path):
    # Each line: "<text>\t<label>"
    texts, labels = [], []
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            text, label = line.strip().split('\t')
            texts.append(text)
            labels.append(label)
    return texts, labels

def preprocess_data(texts, max_length=128):
    # pytorch_pretrained_bert's tokenizer has no encode_plus (that API belongs
    # to the later transformers library), so build the inputs manually.
    input_ids, attention_masks, token_type_ids = [], [], []
    for text in texts:
        tokens = tokenizer.tokenize(text)[:max_length - 2]  # leave room for [CLS]/[SEP]
        tokens = ['[CLS]'] + tokens + ['[SEP]']
        ids = tokenizer.convert_tokens_to_ids(tokens)
        mask = [1] * len(ids)
        padding = [0] * (max_length - len(ids))
        input_ids.append(ids + padding)
        attention_masks.append(mask + padding)
        token_type_ids.append([0] * max_length)  # single sentence: segment 0 throughout
    return input_ids, attention_masks, token_type_ids
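
To see what a single encoded example looks like (the exact ids depend on the vocabulary, but the shapes are fixed by max_length):

ids, masks, segs = preprocess_data(["I love this movie."])
print(len(ids[0]))    # 128: every example is padded to max_length
print(masks[0][:10])  # 1s over real tokens (including [CLS]/[SEP]), 0s over padding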

train_texts, train_labels = load_data("train.txt")
test_texts, test_labels = load_data("test.txt")

train_input_ids, train_attention_masks, train_token_type_ids = preprocess_data(train_texts)
test_input_ids, test_attention_masks, test_token_type_ids = preprocess_data(test_texts)

Then, convert the dataset into PyTorch tensors. The labels are strings, so map them to integer class ids first:

label_map = {'negative': 0, 'positive': 1}

train_input_ids = torch.tensor(train_input_ids)
train_attention_masks = torch.tensor(train_attention_masks)
train_token_type_ids = torch.tensor(train_token_type_ids)
train_labels = torch.tensor([label_map[label] for label in train_labels])

test_input_ids = torch.tensor(test_input_ids)
test_attention_masks = torch.tensor(test_attention_masks)
test_token_type_ids = torch.tensor(test_token_type_ids)
test_labels = torch.tensor([label_map[label] for label in test_labels])

train_data = TensorDataset(train_input_ids, train_attention_masks, train_token_type_ids, train_labels)
train_dataloader = DataLoader(train_data, batch_size=32, shuffle=True)

test_data = TensorDataset(test_input_ids, test_attention_masks, test_token_type_ids, test_labels)
test_dataloader = DataLoader(test_data, batch_size=32, shuffle=False)
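
Before training, it is worth pulling one batch and checking the tensor shapes (a quick sanity check, assuming the 128-token max_length used above):

input_ids, attention_masks, token_type_ids, labels = next(iter(train_dataloader))
print(input_ids.shape)  # torch.Size([32, 128]) for a full batch
print(labels.shape)     # torch.Size([32])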

Next, define the training and evaluation functions:

def train_model(model, train_dataloader, optimizer, criterion):
    model.train()
    total_loss = 0.0
    for batch in train_dataloader:
        input_ids, attention_masks, token_type_ids, labels = batch
        optimizer.zero_grad()
        logits = model(input_ids, attention_masks, token_type_ids)
        loss = criterion(logits, labels)
        total_loss += loss.item()
        loss.backward()
        optimizer.step()
    return total_loss / len(train_dataloader)

def evaluate_model(model, test_dataloader, criterion):
    model.eval()
    total_correct = 0
    total_samples = 0
    total_loss = 0.0
    with torch.no_grad():
        for batch in test_dataloader:
            input_ids, attention_masks, token_type_ids, labels = batch
            logits = model(input_ids, attention_masks, token_type_ids)
            _, predicted_labels = torch.max(logits, 1)
            total_correct += (predicted_labels == labels).sum().item()
            total_samples += labels.size(0)
            loss = criterion(logits, labels)
            total_loss += loss.item()
    accuracy = total_correct / total_samples
    return accuracy, total_loss / len(test_dataloader)
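
The loops above run on the CPU. If a GPU is available, the usual pattern is to move the model once and each batch inside the loops; a minimal sketch (the classifier itself is instantiated in the next step):

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# After building the classifier below, move it once:
#     classifier.to(device)
# and inside train_model / evaluate_model, move each batch right after unpacking:
#     batch = tuple(t.to(device) for t in batch)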

Instantiate the classifier and define the loss and optimizer:

hidden_size = 768  # hidden size of bert-base models
num_labels = 2
classifier = Classifier(model, hidden_size, num_labels)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-5)
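
Plain Adam works, but pytorch_pretrained_bert also ships BertAdam, which implements the warmup-then-linear-decay schedule used in the original BERT training. A sketch of swapping it in (the 0.1 warmup fraction and 2e-5 learning rate are common choices, not values this example prescribes):

from pytorch_pretrained_bert import BertAdam

num_train_steps = len(train_dataloader) * 10  # 10 epochs, matching the loop below
optimizer = BertAdam(classifier.parameters(),
                     lr=2e-5,
                     warmup=0.1,              # fraction of steps spent warming up
                     t_total=num_train_steps)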

Train and evaluate the model:

epochs = 10
for epoch in range(epochs):
    train_loss = train_model(classifier, train_dataloader, optimizer, criterion)
    accuracy, test_loss = evaluate_model(classifier, test_dataloader, criterion)
    print("Epoch {}/{}: Train Loss: {:.4f}, Test Loss: {:.4f}, Accuracy: {:.2f}%".format(
        epoch+1, epochs, train_loss, test_loss, accuracy*100))

In this example, the model trains for 10 epochs; after each epoch it reports the average training loss, the test loss, and the test accuracy.
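
Once training finishes, you will usually want to persist the weights. A minimal sketch using PyTorch's standard state_dict mechanism (the file name classifier.pt is arbitrary):

torch.save(classifier.state_dict(), "classifier.pt")

# Later, rebuild the same architecture and load the weights back:
restored = Classifier(model, hidden_size, num_labels)
restored.load_state_dict(torch.load("classifier.pt"))
restored.eval()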

Finally, the trained model can be used for prediction:

def predict_text(text):
    classifier.eval()
    input_ids, attention_masks, token_type_ids = preprocess_data([text])
    input_ids = torch.tensor(input_ids)
    attention_masks = torch.tensor(attention_masks)
    token_type_ids = torch.tensor(token_type_ids)
    with torch.no_grad():
        logits = classifier(input_ids, attention_masks, token_type_ids)
    _, predicted_label = torch.max(logits, 1)
    return predicted_label.item()

text = "I like this movie."
predicted_label = predict_text(text)
print("Predicted label for '{}' is: {}".format(text, predicted_label))

This example shows how to build a text classifier with PyTorch_Pretrained_BERT's modeling module. You can modify and extend it as needed.