Text Classification with the PyTorch_Pretrained_BERT Modeling Module
Published: 2024-01-15 09:09:34
PyTorch_Pretrained_BERT is an open-source toolkit for natural language processing tasks. Its modeling module provides the building blocks for text classification, letting you apply a BERT model to tasks such as sentiment analysis or topic classification. The walkthrough below uses sentiment analysis as the example.
First, install the PyTorch_Pretrained_BERT library with the following command:
pip install pytorch_pretrained_bert
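If you want to confirm the installation before continuing, a one-line import check is enough (it only verifies the package is importable; the actual model weights are downloaded later, on first use):

python -c "from pytorch_pretrained_bert import BertTokenizer, BertModel; print('ok')"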
Next, import the required libraries and modules:
import torch
from torch import nn
from pytorch_pretrained_bert import BertTokenizer, BertModel
Then, load the pretrained BERT model and its vocabulary:
model_name = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(model_name)
bert = BertModel.from_pretrained(model_name)
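Before wiring anything together, it can help to see what the tokenizer produces; this is just a sanity check with an arbitrary sentence (the from_pretrained calls above download the vocabulary and weights on first run):

tokens = tokenizer.tokenize("BERT makes text classification easy.")
print(tokens)                                   # lowercased WordPiece tokens
print(tokenizer.convert_tokens_to_ids(tokens))  # their ids in the BERT vocabulary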
Next, define a text classifier model:
class Classifier(nn.Module):
    def __init__(self, bert, hidden_size, num_labels):
        super(Classifier, self).__init__()
        self.bert = bert
        self.dropout = nn.Dropout(0.1)
        self.linear = nn.Linear(hidden_size, num_labels)

    def forward(self, input_ids, attention_mask=None, token_type_ids=None):
        # pytorch_pretrained_bert's BertModel expects token_type_ids *before*
        # attention_mask, so pass both by keyword to avoid silently swapping them
        _, pooled_output = self.bert(input_ids,
                                     token_type_ids=token_type_ids,
                                     attention_mask=attention_mask,
                                     output_all_encoded_layers=False)
        pooled_output = self.dropout(pooled_output)
        # return raw logits: nn.CrossEntropyLoss applies log-softmax internally,
        # so an explicit softmax layer here would distort the loss
        logits = self.linear(pooled_output)
        return logits
In this example, BERT's sentence-level pooled_output is passed through dropout and a linear layer to produce one logit per class. The forward pass deliberately returns raw logits rather than softmax probabilities: nn.CrossEntropyLoss, used below, applies log-softmax internally, so putting an explicit softmax layer in front of it would be a bug.
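As a quick sanity check (assuming the bert object loaded above; 30522 is the vocabulary size of bert-base-uncased), a random dummy batch can be pushed through a throwaway classifier instance:

check = Classifier(bert, hidden_size=768, num_labels=2)
dummy_ids = torch.randint(0, 30522, (2, 16))  # fake batch: 2 sequences of 16 token ids
with torch.no_grad():
    print(check(dummy_ids).shape)  # torch.Size([2, 2]) -- one row of logits per sample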
Next, prepare the dataset. The example dataset can live in a folder with two files: one containing the texts and labels for training the model, the other containing the texts and labels for testing it. Each line of each file holds a single sample, with the text and its label separated by a tab. For example:
I love this movie.	positive
This book is boring.	negative
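If you just want to smoke-test the pipeline end to end, a few made-up lines can be written out as train.txt and test.txt (the file names match the ones loaded below; the sentences are placeholders, not a real dataset):

samples = [
    ("I love this movie.", "positive"),
    ("This book is boring.", "negative"),
    ("What a fantastic performance!", "positive"),
    ("The plot made no sense at all.", "negative"),
]
for path, rows in [("train.txt", samples), ("test.txt", samples[:2])]:
    with open(path, "w") as f:
        for text, label in rows:
            f.write(text + "\t" + label + "\n")  # tab-separated text/label pairs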
Load the dataset and convert the texts into BERT's input format:
def load_data(file_path):
    texts, labels = [], []
    with open(file_path, 'r') as f:
        for line in f:
            text, label = line.strip().split('\t')
            texts.append(text)
            labels.append(label)
    return texts, labels
def preprocess_data(texts, max_length=128):
    # pytorch_pretrained_bert's tokenizer has no encode_plus (that API belongs
    # to the newer transformers library), so build the [CLS] ... [SEP] sequence,
    # pad it, and create the masks by hand
    input_ids, attention_masks, token_type_ids = [], [], []
    for text in texts:
        tokens = ['[CLS]'] + tokenizer.tokenize(text)[:max_length - 2] + ['[SEP]']
        ids = tokenizer.convert_tokens_to_ids(tokens)
        mask = [1] * len(ids)
        padding = [0] * (max_length - len(ids))
        input_ids.append(ids + padding)          # pad token id is 0 for BERT
        attention_masks.append(mask + padding)   # 1 for real tokens, 0 for padding
        token_type_ids.append([0] * max_length)  # single-sentence input: all zeros
    return input_ids, attention_masks, token_type_ids
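To see what the preprocessing produces, encode a single sentence and inspect it; in the bert-base-uncased vocabulary, 101 is [CLS], 102 is [SEP], and 0 is the padding id:

ids, masks, types = preprocess_data(["I love this movie."])
print(ids[0][:10])    # [101, ...wordpiece ids..., 102, 0, 0, ...] -- [CLS] first, 0-padded
print(sum(masks[0]))  # number of real (non-padding) tokens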
train_texts, train_labels = load_data("train.txt")
test_texts, test_labels = load_data("test.txt")
train_input_ids, train_attention_masks, train_token_type_ids = preprocess_data(train_texts)
test_input_ids, test_attention_masks, test_token_type_ids = preprocess_data(test_texts)
Then, map the string labels to integer class ids and convert everything into PyTorch tensors and DataLoaders:
label_map = {'negative': 0, 'positive': 1}  # string labels -> class ids

train_input_ids = torch.tensor(train_input_ids)
train_attention_masks = torch.tensor(train_attention_masks)
train_token_type_ids = torch.tensor(train_token_type_ids)
train_labels = torch.tensor([label_map[l] for l in train_labels])

test_input_ids = torch.tensor(test_input_ids)
test_attention_masks = torch.tensor(test_attention_masks)
test_token_type_ids = torch.tensor(test_token_type_ids)
test_labels = torch.tensor([label_map[l] for l in test_labels])

train_data = torch.utils.data.TensorDataset(train_input_ids, train_attention_masks, train_token_type_ids, train_labels)
train_dataloader = torch.utils.data.DataLoader(train_data, batch_size=32, shuffle=True)
test_data = torch.utils.data.TensorDataset(test_input_ids, test_attention_masks, test_token_type_ids, test_labels)
test_dataloader = torch.utils.data.DataLoader(test_data, batch_size=32, shuffle=False)
Next, define the training and evaluation functions:
def train_model(model, train_dataloader, optimizer, criterion):
    model.train()
    total_loss = 0.0
    for batch in train_dataloader:
        input_ids, attention_masks, token_type_ids, labels = batch
        optimizer.zero_grad()
        logits = model(input_ids, attention_masks, token_type_ids)
        loss = criterion(logits, labels)  # CrossEntropyLoss on raw logits
        total_loss += loss.item()
        loss.backward()
        optimizer.step()
    return total_loss / len(train_dataloader)
def evaluate_model(model, test_dataloader, criterion):
    model.eval()
    total_correct = 0
    total_loss = 0.0
    with torch.no_grad():
        for batch in test_dataloader:
            input_ids, attention_masks, token_type_ids, labels = batch
            logits = model(input_ids, attention_masks, token_type_ids)
            _, predicted_labels = torch.max(logits, 1)  # argmax over the class dimension
            total_correct += (predicted_labels == labels).sum().item()
            loss = criterion(logits, labels)
            total_loss += loss.item()
    accuracy = total_correct / len(test_dataloader.dataset)
    return accuracy, total_loss / len(test_dataloader)
Define the model hyperparameters, instantiate the classifier, and set up the loss function and optimizer (the optimizer must be built over the classifier's parameters, which include BERT's, rather than over the bare BertModel):
hidden_size = 768  # hidden size of bert-base-uncased
num_labels = 2     # positive / negative
classifier = Classifier(bert, hidden_size, num_labels)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-5)
Train and evaluate the model:
epochs = 10
for epoch in range(epochs):
    train_loss = train_model(classifier, train_dataloader, optimizer, criterion)
    accuracy, test_loss = evaluate_model(classifier, test_dataloader, criterion)
    print("Epoch {}/{}: Train Loss: {:.4f}, Test Loss: {:.4f}, Accuracy: {:.2f}%".format(
        epoch + 1, epochs, train_loss, test_loss, accuracy * 100))
In this example, the model trains for 10 epochs; after each epoch, the training loss is reported alongside the loss and accuracy on the test set.
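Once training is done, the classifier's weights can be persisted and restored with standard torch.save / load_state_dict calls; the file name classifier.pt here is just a placeholder:

torch.save(classifier.state_dict(), 'classifier.pt')  # persist the trained weights

# later: rebuild the same architecture, then restore the weights
restored = Classifier(bert, hidden_size, num_labels)
restored.load_state_dict(torch.load('classifier.pt'))
restored.eval()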
Finally, the trained model can be used for prediction:
def predict_text(text):
    input_ids, attention_masks, token_type_ids = preprocess_data([text])
    input_ids = torch.tensor(input_ids)
    attention_masks = torch.tensor(attention_masks)
    token_type_ids = torch.tensor(token_type_ids)
    classifier.eval()
    with torch.no_grad():
        logits = classifier(input_ids, attention_masks, token_type_ids)
        _, predicted_label = torch.max(logits, 1)
    return predicted_label.item()
text = "I like this movie."
predicted_label = predict_text(text)
print("Predicted label for '{}' is: {}".format(text, predicted_label))
This example shows how to use the PyTorch_Pretrained_BERT modeling module for text classification; you can modify and extend it to fit your own task.
