Methods and Techniques for Text Data Preprocessing in Python

Published: 2023-12-24 03:31:54

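Text preprocessing is a key step in natural language processing (NLP): it converts raw text into a form that machine learning algorithms and models can consume. Python offers a rich set of libraries and tools for this. The sections below walk through some commonly used methods and techniques, each with a usage example.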

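1. Tokenization: splitting a piece of text into individual words or tokens. nltk and jieba are commonly used tokenization libraries; nltk is typically used for English text and jieba for Chinese text.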

import nltk
from nltk.tokenize import word_tokenize

# One-time setup: nltk.download('punkt')  (newer NLTK releases use 'punkt_tab')

text = "Natural language processing (NLP) is a subfield of artificial intelligence."

tokens = word_tokenize(text)
print(tokens)

Output:

['Natural', 'language', 'processing', '(', 'NLP', ')', 'is', 'a', 'subfield', 'of', 'artificial', 'intelligence', '.']
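To make the idea of tokenization concrete without any third-party dependency, here is a minimal regex-based sketch. The helper name `simple_tokenize` is hypothetical and is not part of nltk; it only approximates what `word_tokenize` does and happens to give the same result for this sentence.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation marks.

    Hypothetical helper for illustration only; it is not part of NLTK.
    """
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

text = "Natural language processing (NLP) is a subfield of artificial intelligence."
print(simple_tokenize(text))
# ['Natural', 'language', 'processing', '(', 'NLP', ')', 'is', 'a', 'subfield', 'of', 'artificial', 'intelligence', '.']
```

A pattern this simple breaks contractions such as "don't" into three pieces, which is one reason a dedicated tokenizer is usually the better choice for real data.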

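2. Stop word removal: removing very common words that contribute little to the meaning of a text, such as "and", "the", and "is". The stop word list that ships with nltk can be used for this.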

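from nltk.corpus import stopwords

# One-time setup: nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

# Compare in lowercase, because the NLTK stop word list is all lowercase
filtered_tokens = [token for token in tokens if token.lower() not in stop_words]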
print(filtered_tokens)

Output:

['Natural', 'language', 'processing', '(', 'NLP', ')', 'subfield', 'artificial', 'intelligence', '.']
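Note that the punctuation tokens survive the stop word filter. The sketch below shows one possible way to drop them in the same pass. The function name and the tiny `SAMPLE_STOP_WORDS` set are assumptions made for illustration, so that the example runs without nltk; in practice the nltk list would be passed in instead.

```python
# Illustrative subset only; NLTK's English stop word list is much longer.
SAMPLE_STOP_WORDS = {"a", "an", "and", "is", "of", "the"}

def remove_stop_words(tokens, stop_words=SAMPLE_STOP_WORDS, drop_punctuation=True):
    """Drop stop words and, optionally, tokens without any letter or digit."""
    kept = []
    for token in tokens:
        if token.lower() in stop_words:
            continue  # stop word, matched case-insensitively
        if drop_punctuation and not any(ch.isalnum() for ch in token):
            continue  # pure punctuation such as '(' or '.'
        kept.append(token)
    return kept

tokens = ['Natural', 'language', 'processing', '(', 'NLP', ')', 'is', 'a',
          'subfield', 'of', 'artificial', 'intelligence', '.']
print(remove_stop_words(tokens))
# ['Natural', 'language', 'processing', 'NLP', 'subfield', 'artificial', 'intelligence']
```

Whether punctuation should be dropped depends on the task; for sentiment analysis, for instance, exclamation marks may carry signal, which is why it is left as an option here.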

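3. Stemming: reducing words to their stem by stripping affixes. nltk's Porter stemmer can be used for this. Note that a stem is not necessarily a dictionary word, and that the Porter stemmer lowercases its input by default.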

from nltk.stem import PorterStemmer

ps = PorterStemmer()

stemmed_tokens = [ps.stem(token) for token in filtered_tokens]
print(stemmed_tokens)

Output:

['natur', 'languag', 'process', '(', 'nlp', ')', 'subfield', 'artifici', 'intellig', '.']

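4. Part-of-speech tagging: labeling each word in a text with its part of speech, such as noun, verb, or adjective. nltk's part-of-speech tagger can be used for this; by default it uses the Penn Treebank tag set.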

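from nltk import pos_tag

# One-time setup: nltk.download('averaged_perceptron_tagger')
# (newer NLTK releases use 'averaged_perceptron_tagger_eng')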

pos_tags = pos_tag(tokens)
print(pos_tags)

Output:

[('Natural', 'JJ'), ('language', 'NN'), ('processing', 'NN'), ('(', '('), ('NLP', 'NNP'), (')', ')'), ('is', 'VBZ'), ('a', 'DT'), ('subfield', 'NN'), ('of', 'IN'), ('artificial', 'JJ'), ('intelligence', 'NN'), ('.', '.')]
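A common use of these tags is to keep only certain word classes. The following is a small sketch of that idea; the helper `select_by_tag_prefix` is a hypothetical name, and the tagged tokens are copied from the output above so that the example runs without nltk.

```python
# Tagged tokens copied from the output above, so this sketch runs without NLTK.
pos_tags = [('Natural', 'JJ'), ('language', 'NN'), ('processing', 'NN'),
            ('(', '('), ('NLP', 'NNP'), (')', ')'), ('is', 'VBZ'), ('a', 'DT'),
            ('subfield', 'NN'), ('of', 'IN'), ('artificial', 'JJ'),
            ('intelligence', 'NN'), ('.', '.')]

def select_by_tag_prefix(tagged_tokens, prefixes):
    """Return the words whose tag starts with one of the given prefixes."""
    return [word for word, tag in tagged_tokens if tag.startswith(tuple(prefixes))]

# In the Penn Treebank tag set, noun tags start with 'NN' and adjective tags with 'JJ'.
nouns = select_by_tag_prefix(pos_tags, ["NN"])
adjectives = select_by_tag_prefix(pos_tags, ["JJ"])
print(nouns)       # ['language', 'processing', 'NLP', 'subfield', 'intelligence']
print(adjectives)  # ['Natural', 'artificial']
```

Matching on a prefix rather than an exact tag keeps plural and proper nouns (NNS, NNP, NNPS) together with singular nouns.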

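5. Text vectorization: converting text into numeric vectors so that machine learning algorithms can work with it. Two common approaches are the bag-of-words model and TF-IDF.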

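- Bag of words: counts how many times each word occurs in each document and builds a term-count vector. By default, CountVectorizer lowercases the text and ignores single-character tokens, which is why "I" is absent from the vocabulary.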

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'I love natural language processing.',
    'I love machine learning.'
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())  # get_feature_names() was removed in scikit-learn 1.2
print(X.toarray())

Output:

['language' 'learning' 'love' 'machine' 'natural' 'processing']
[[1 0 1 0 1 1]
 [0 1 1 1 0 0]]
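To show what the bag-of-words model computes, here is a dependency-free sketch that rebuilds the count matrix with the standard library. It is not how scikit-learn is implemented; it only mirrors the documented defaults (lowercasing, tokens of at least two word characters, alphabetically sorted vocabulary). The `tokenize` helper is a hypothetical name.

```python
import re
from collections import Counter

corpus = [
    'I love natural language processing.',
    'I love machine learning.'
]

def tokenize(doc):
    # Lowercase, then keep tokens of two or more word characters,
    # mirroring CountVectorizer's default token pattern.
    return re.findall(r"\b\w\w+\b", doc.lower())

# Vocabulary: sorted unique tokens across all documents.
vocabulary = sorted({token for doc in corpus for token in tokenize(doc)})

# One row per document, one column per vocabulary entry.
matrix = []
for doc in corpus:
    counts = Counter(tokenize(doc))
    matrix.append([counts[term] for term in vocabulary])

print(vocabulary)  # ['language', 'learning', 'love', 'machine', 'natural', 'processing']
print(matrix)      # [[1, 0, 1, 0, 1, 1], [0, 1, 1, 1, 0, 0]]
```

Each column corresponds to one vocabulary term in sorted order, so "language" comes before "learning".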

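- TF-IDF: weights each term's frequency in a document by its inverse document frequency across the whole corpus, giving a measure of how important the term is to that document.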

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'I love natural language processing.',
    'I love machine learning.'
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())  # get_feature_names() was removed in scikit-learn 1.2
print(X.toarray())

Output:

['language' 'learning' 'love' 'machine' 'natural' 'processing']
[[0.53404633 0.         0.37997836 0.         0.53404633 0.53404633]
 [0.         0.6316672  0.44943642 0.6316672  0.         0.        ]]
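The weighting can be reproduced by hand. The sketch below assumes the defaults documented for TfidfVectorizer: raw term counts, a smoothed inverse document frequency of ln((1 + n) / (1 + df)) + 1, and L2 normalization of each row. It is a pure-Python illustration of the formula, not scikit-learn's implementation, and the `tokenize` helper is a hypothetical name.

```python
import math
import re
from collections import Counter

corpus = [
    'I love natural language processing.',
    'I love machine learning.'
]

def tokenize(doc):
    # Lowercase and keep tokens of two or more word characters.
    return re.findall(r"\b\w\w+\b", doc.lower())

docs = [Counter(tokenize(doc)) for doc in corpus]
vocabulary = sorted({term for counts in docs for term in counts})
n_docs = len(docs)

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
idf = {}
for term in vocabulary:
    df = sum(1 for counts in docs if term in counts)
    idf[term] = math.log((1 + n_docs) / (1 + df)) + 1

tfidf = []
for counts in docs:
    row = [counts[term] * idf[term] for term in vocabulary]
    # L2-normalize each row; fall back to 1.0 for an empty document.
    norm = math.sqrt(sum(value * value for value in row)) or 1.0
    tfidf.append([value / norm for value in row])

for row in tfidf:
    print(" ".join(f"{value:.4f}" for value in row))
# 0.5340 0.0000 0.3800 0.0000 0.5340 0.5340
# 0.0000 0.6317 0.4494 0.6317 0.0000 0.0000
```

"love" appears in both documents, so its IDF is the minimum value of 1 and it receives a lower weight than the terms that are unique to one document.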

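The methods above are common steps in text preprocessing. Depending on the task and its requirements, further processing may be needed, such as removing punctuation, handling missing values, or handling special characters. Combining these techniques makes text data easier to process and use.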
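As a rough sketch of those extra steps, the function below normalizes one raw string before tokenization. The name `clean_text` and the specific choices (lowercasing, mapping a missing value to an empty string, replacing punctuation with spaces) are assumptions for illustration; the right choices depend on the task.

```python
import re
import string

# Map every ASCII punctuation character to a space so that words stay separated.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def clean_text(text):
    """Normalize one raw string before tokenization (hypothetical helper)."""
    if text is None:  # treat a missing value as an empty document
        return ""
    text = str(text).lower()
    text = text.translate(PUNCT_TO_SPACE)
    # Drop remaining special characters that are neither word characters nor whitespace.
    text = re.sub(r"[^\w\s]", "", text)
    # Collapse runs of whitespace into single spaces.
    return re.sub(r"\s+", " ", text).strip()

# "\u2665" is a heart symbol, standing in for a special character.
raw_docs = ["  Hello, World!!  ", None, "NLP\t(natural-language processing) \u2665"]
print([clean_text(doc) for doc in raw_docs])
# ['hello world', '', 'nlp natural language processing']
```

Running a step like this before tokenization and vectorization keeps the vocabulary free of case variants and stray symbols.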