Clustering and Classifying Chinese Short Texts with Faiss in Python

Published: 2024-01-10 07:46:55

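Faiss is an open-source library for efficient similarity search and clustering, and it is particularly well suited to large-scale vector data. In Python, we can use Faiss to cluster and classify Chinese short texts once each text has been converted into a vector.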

Below is an example of using Faiss to cluster Chinese short texts:

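import faiss
import numpy as np
import torch
from transformers import BertTokenizer, BertModel

# Load a pretrained encoder; BERT is used here as an example
model_name = 'bert-base-chinese'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name)

# Suppose we have some Chinese text data
# ("I love China", "China is a beautiful country",
#  "Beijing is the capital of China", "The Great Wall of China has a long history")
texts = ["我爱中国", "中国是一个美丽的国家", "北京是中国的首都", "中国的长城历史悠久"]

# Tokenize as one padded batch: the texts have different lengths, so they
# must be padded before they can be stacked into a single tensor
inputs = tokenizer(texts, padding=True, truncation=True, max_length=128, return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)
    cls_vectors = outputs[0][:, 0, :].numpy()  # use the [CLS] vector as the text representation

# Faiss expects contiguous float32 arrays of shape (n, d)
text_embeddings = np.ascontiguousarray(cls_vectors, dtype='float32')
d = text_embeddings.shape[1]  # vector dimension (768 for bert-base-chinese)

# Cluster with k-means. A nearest-neighbor search over the indexed vectors is
# not clustering: each vector's nearest neighbor would simply be itself.
k = 2  # number of clusters
kmeans = faiss.Kmeans(d, k, niter=20, min_points_per_centroid=1)  # tiny demo set
kmeans.train(text_embeddings)

# Assign each text to its nearest centroid; search returns (distances, indices)
_, I = kmeans.index.search(text_embeddings, 1)
for i in range(k):
    cluster = [texts[j] for j in range(len(texts)) if I[j][0] == i]
    print("Cluster {}: {}".format(i, cluster))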

The code above loads the pretrained Chinese BERT model, converts each text into a vector representation, and then uses Faiss to group those vectors into clusters.

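The output lists the texts assigned to each of the two clusters. With only four short, closely related sentences, which texts end up together depends on the embeddings, so the result is best read as a demonstration of the workflow rather than as a meaningful grouping.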

Next, let's look at an example of using Faiss to classify Chinese short texts:

import faiss
import numpy as np
import torch
from transformers import BertTokenizer, BertModel

# Load a pretrained encoder; BERT is used here as an example
model_name = 'bert-base-chinese'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name)

def embed(batch):
    # Tokenize as one padded batch (texts differ in length) and
    # return the [CLS] vectors as a contiguous float32 array
    inputs = tokenizer(batch, padding=True, truncation=True, max_length=128, return_tensors='pt')
    with torch.no_grad():
        outputs = model(**inputs)
    return np.ascontiguousarray(outputs[0][:, 0, :].numpy(), dtype='float32')

# Suppose we have some Chinese text data
# ("I love China", "China is a beautiful country",
#  "Beijing is the capital of China", "The Great Wall of China has a long history")
texts = ["我爱中国", "中国是一个美丽的国家", "北京是中国的首都", "中国的长城历史悠久"]
text_embeddings = embed(texts)

# Build the index and add the text vectors
d = text_embeddings.shape[1]  # vector dimension (768 for bert-base-chinese)
index = faiss.IndexFlatL2(d)  # exact search with L2 distance
index.add(text_embeddings)

# Classify
query = "我喜欢旅行"  # text to classify ("I like traveling")
query_embedding = embed([query])

# search returns (distances, indices), each of shape (n_queries, k)
_, I = index.search(query_embedding, 1)  # find the most similar text
nearest_text = texts[I[0][0]]
print("Nearest text to query: {}".format(nearest_text))

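This code is similar to the previous example: it loads the Chinese BERT model, builds a Faiss index, converts the texts into vectors, and adds them to the index. It then encodes the text to be classified as a query and searches the index for the most similar text.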

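The output shows the indexed text that is most similar to the query. On its own this is a nearest-neighbor lookup; to use it for classification, each indexed text needs a known category label, and the query is assigned the label of its nearest neighbor.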

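With Faiss, clustering and classifying Chinese short texts takes only a small amount of code. The text encoder is interchangeable, so a different pretrained model can be substituted to produce the vector representations as needed. These examples cover only basic usage; Faiss also offers more powerful features and parameters that can be used and tuned to fit specific requirements.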