How to Implement a Spam Classifier in Python

Published: 2024-01-07 10:42:04

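A spam classifier is an algorithm that automatically decides whether a given email is spam. It helps filter out large volumes of junk mail so that only the messages we care about remain.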

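Before implementing a spam classifier, we need a suitable dataset of email samples that are already labeled as spam or non-spam (ham). You can use your own known spam and ham messages, or a public corpus such as the one published by SpamAssassin. Next, we will implement a simple spam classifier in Python that uses the naive Bayes algorithm for classification.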
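As background, the usual multinomial naive Bayes decision rule with add-one (Laplace) smoothing can be written as below. The symbols are introduced here for illustration and do not appear in the original text: $w_1, \dots, w_n$ are the words of the email, $\mathrm{count}(w, c)$ is how often word $w$ occurs in training emails of class $c$, $N_c$ is the total number of words seen in class $c$, and $|V|$ is the vocabulary size across both classes.

```latex
\hat{c} = \arg\max_{c \in \{\text{spam},\, \text{ham}\}}
  \left[ \log P(c) + \sum_{i=1}^{n} \log P(w_i \mid c) \right],
\qquad
P(w \mid c) = \frac{\mathrm{count}(w, c) + 1}{N_c + |V|}
```

Working with sums of logarithms rather than products of probabilities avoids numeric underflow on long emails.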

import os
import re
import math
from collections import defaultdict

class SpamClassifier:
    def __init__(self):
        self.spam_word_counts = defaultdict(int)
        self.ham_word_counts = defaultdict(int)
        self.total_spam = 0
        self.total_ham = 0
    
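    def train(self, directory):
        # Files whose names start with "spam" count as spam; all other files count as ham
        for filename in os.listdir(directory):
            filepath = os.path.join(directory, filename)
            if not os.path.isfile(filepath):
                continue  # skip subdirectories, which cannot be opened as emails
            if filename.startswith("spam"):
                self.total_spam += 1
                self._count_words(filepath, self.spam_word_counts)
            else:
                self.total_ham += 1
                self._count_words(filepath, self.ham_word_counts)
    
    def _count_words(self, filepath, word_counts):
        # Lowercase each line, split it into word tokens, and tally every token
        with open(filepath, 'r', errors='ignore') as file:
            for line in file:
                words = re.findall(r'\w+', line.lower())
                for word in words:
                    word_counts[word] += 1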
    
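    def classify(self, email):
        if self.total_spam == 0 or self.total_ham == 0:
            raise ValueError("train() must see at least one spam and one ham email first")
        total_emails = self.total_spam + self.total_ham
        # Vocabulary size across both classes, used for add-one (Laplace) smoothing
        vocab_size = len(set(self.spam_word_counts) | set(self.ham_word_counts))
        spam_score = self._calculate_log_score(
            email, self.spam_word_counts, self.total_spam / total_emails, vocab_size)
        ham_score = self._calculate_log_score(
            email, self.ham_word_counts, self.total_ham / total_emails, vocab_size)
        
        # Ties are resolved in favor of "spam"
        if spam_score >= ham_score:
            return "spam"
        else:
            return "ham"
    
    def _calculate_log_score(self, email, word_counts, prior, vocab_size):
        # log P(class) + sum of log P(word | class) over the words of the email
        score = math.log(prior)
        total_words = sum(word_counts.values())  # computed once, not once per word
        words = re.findall(r'\w+', email.lower())
        for word in words:
            # .get() avoids inserting unseen words into the defaultdict
            score += math.log((word_counts.get(word, 0) + 1) / (total_words + vocab_size))
        
        return score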

# Usage example
classifier = SpamClassifier()
classifier.train("./data")  # assumes the spam and ham emails are stored under "./data"
email = """
    Dear friend,

    You have won a lottery! Claim your prize now!

    Regards,
    John
"""

classification = classifier.classify(email)
print("Classification:", classification)

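In the code above, we first define a SpamClassifier class that contains the training and classification methods. The train method walks through the email files in the given directory and counts how often each word appears in spam and in ham. The classify method scores the email against the spam word counts and against the ham word counts by summing the log probabilities of its words, then returns the class with the higher score.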
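To make the scoring step concrete, here is a minimal standalone sketch of the same idea on a toy example. The training sentences, the tokenize and log_likelihood helpers, and the variable names are all hypothetical and only serve the illustration; class priors are left out to keep the arithmetic short.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Same tokenization as the classifier: lowercase, then runs of word characters
    return re.findall(r"\w+", text.lower())

# Hypothetical toy training data, not taken from any real dataset
spam_counts = Counter(tokenize("win a prize now win cash"))
ham_counts = Counter(tokenize("meeting notes for the project"))
vocab_size = len(set(spam_counts) | set(ham_counts))

def log_likelihood(text, counts):
    # Sum of log P(word | class) with add-one (Laplace) smoothing
    total = sum(counts.values())
    return sum(math.log((counts.get(w, 0) + 1) / (total + vocab_size))
               for w in tokenize(text))

message = "win a prize"
spam_ll = log_likelihood(message, spam_counts)
ham_ll = log_likelihood(message, ham_counts)
print("spam" if spam_ll > ham_ll else "ham")
```

Every word of the message occurs in the toy spam text and none occurs in the ham text, so the spam score comes out higher. The add-one term keeps unseen words from producing log(0).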

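In the usage example, we first create a SpamClassifier instance and train it by passing the train method the path of the folder that holds the spam and ham emails. We then define an email as a string and classify it with the classify method. Finally, we print the classification result.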
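The train method decides each file's label from its name alone, so the layout of the data folder matters. The sketch below builds a small temporary folder and labels its files by the same prefix check. The file names and message bodies are hypothetical and chosen only to match the startswith("spam") test in the code.

```python
import os
import tempfile

# Hypothetical sample messages; the names are an assumption chosen to match
# the startswith("spam") check used during training
samples = {
    "spam_001.txt": "You have won a lottery! Claim your prize now!",
    "spam_002.txt": "Cheap loans, act now",
    "ham_001.txt": "Agenda for Monday's meeting attached",
}

data_dir = tempfile.mkdtemp()
for name, body in samples.items():
    with open(os.path.join(data_dir, name), "w", encoding="utf-8") as f:
        f.write(body)

# Label each file by its file-name prefix
labels = {name: ("spam" if name.startswith("spam") else "ham")
          for name in sorted(os.listdir(data_dir))}
print(labels)
```

Because every file without the "spam" prefix is counted as ham, stray files in the folder (for example editor backups) would be counted as ham emails too, so it is worth keeping the folder clean.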

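Note that the code above is only a simple example; a real-world spam classifier can be considerably more complex. Depending on your requirements, you can use more advanced algorithms or techniques to improve the classifier's accuracy and performance.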
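Before trying a more advanced technique, it helps to have a way to measure whether it actually improves accuracy. The following is a minimal sketch of such a check. The evaluate helper and the sample predictions are hypothetical and are not part of the classifier above; in practice the predictions would come from running classify on emails held out from training.

```python
def evaluate(predicted, actual, positive="spam"):
    # Count true positives, false positives, and false negatives for the positive class
    pairs = list(zip(predicted, actual))
    tp = sum(p == positive and a == positive for p, a in pairs)
    fp = sum(p == positive and a != positive for p, a in pairs)
    fn = sum(p != positive and a == positive for p, a in pairs)
    correct = sum(p == a for p, a in pairs)
    accuracy = correct / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}

# Hypothetical predictions and true labels, for illustration only
predicted = ["spam", "ham", "spam", "ham"]
actual = ["spam", "ham", "ham", "spam"]
metrics = evaluate(predicted, actual)
print(metrics)
```

Precision and recall are reported alongside accuracy because, for spam filtering, marking a legitimate email as spam is usually a more costly error than letting one spam message through.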