Implementing the K-means Machine Learning Algorithm with Python Functions

Published: 2023-06-05 21:47:37

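K-means is a widely used clustering algorithm. Its core idea is to partition the samples into k clusters so that samples within a cluster are highly similar while samples in different clusters are dissimilar. This article implements the K-means algorithm in Python.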

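K-means algorithm steps:

1. Choose k initial centroids, either at random or by some other selection method.

2. Group the samples by assigning each one to the cluster of its nearest centroid, keeping track of the number of samples and the coordinate sum of each cluster.

3. Compute the center of each cluster: the mean of the coordinates of all samples in the cluster becomes the new centroid.

4. Repeat steps 2 and 3 until the centroids stop moving or a preset number of iterations is reached.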
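Steps 2 and 3 can be traced by hand on a tiny dataset. The following is a minimal sketch, not the class developed in this article: the helper names `assign` and `update` and the four sample points are assumptions made for illustration, and the initial centroids are hand-picked rather than random so that the result is reproducible. It relies on `math.dist`, available in Python 3.8 and later.

```python
import math


def assign(points, centroids):
    """Step 2: put each point into the cluster of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        distances = [math.dist(p, c) for c in centroids]
        clusters[distances.index(min(distances))].append(p)
    return clusters


def update(clusters, centroids):
    """Step 3: move each centroid to the mean of its cluster."""
    new_centroids = []
    for members, old in zip(clusters, centroids):
        if not members:
            # An empty cluster has no mean, so it keeps its old centroid.
            new_centroids.append(list(old))
            continue
        dims = len(members[0])
        new_centroids.append(
            [sum(p[d] for p in members) / len(members) for d in range(dims)]
        )
    return new_centroids


points = [[0, 0], [0, 2], [10, 0], [10, 2]]
centroids = [[0, 0], [10, 0]]  # step 1, hand-picked for this illustration
clusters = assign(points, centroids)
centroids = update(clusters, centroids)
print(clusters)   # [[[0, 0], [0, 2]], [[10, 0], [10, 2]]]
print(centroids)  # [[0.0, 1.0], [10.0, 1.0]]
```

Running `assign` again with the updated centroids yields the same grouping, which is the stopping condition described in step 4.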

Code implementation:

First define a KMeans class with an initialization function and the clustering functions:

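import math
import random


class KMeans:
    def __init__(self, k=2, max_iter=300):
        self.k = k                # number of clusters
        self.max_iter = max_iter  # upper bound on the number of iterations

    def fit(self, data):
        # Use k distinct samples as the initial centroids.
        centroids = random.sample(data, self.k)
        clusters = self.cluster_points(data, centroids)
        for i in range(self.max_iter):
            new_centroids = self.move_centroids(clusters, centroids)
            # Stop as soon as no centroid has moved.
            if centroids == new_centroids:
                break
            centroids = new_centroids
            # Reassign so the returned clusters match the returned centroids.
            clusters = self.cluster_points(data, centroids)
        return clusters, centroids

    def cluster_points(self, data, centroids):
        # Map each cluster index to the list of points assigned to it.
        clusters = {}
        for i in range(self.k):
            clusters[i] = []
        for point in data:
            distances = [self.euclidean(point, centroid) for centroid in centroids]
            index = distances.index(min(distances))
            clusters[index].append(point)
        return clusters

    def move_centroids(self, clusters, centroids):
        new_centroids = []
        for key in clusters:
            if clusters[key]:
                new_centroids.append(self.mean(clusters[key]))
            else:
                # An empty cluster has no mean; keep its previous centroid.
                new_centroids.append(centroids[key])
        return new_centroids

    def euclidean(self, point, centroid):
        # Euclidean distance: square root of the summed squared differences.
        return math.sqrt(sum((p - c) ** 2 for p, c in zip(point, centroid)))

    def mean(self, points):
        # Coordinate-wise average of a non-empty list of points.
        result = []
        for i in range(len(points[0])):
            result.append(sum(point[i] for point in points) / len(points))
        return result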

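Here, the fit function calls cluster_points and move_centroids in turn until a stopping condition is met.

The cluster_points function assigns every point in the dataset to the cluster of its nearest centroid and returns clusters, a dictionary that holds the samples belonging to each cluster.

The move_centroids function computes the new position of each centroid.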

In practice, the KMeans class can be initialized and trained by calling fit as follows:

data = [[1, 1], [1, 2], [2, 2], [10, 10], [10, 11], [11, 11]]
kmeans = KMeans(k=2, max_iter=300)
clusters, centroids = kmeans.fit(data)
print(clusters, centroids)

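A sample run produces the following output (the initial centroids are chosen at random, so the order of the clusters may differ between runs):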

{0: [[1, 1], [1, 2], [2, 2]], 1: [[10, 10], [10, 11], [11, 11]]} [[1.3333333333333333, 1.6666666666666667], [10.333333333333334, 10.666666666666666]]

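The code above tests the KMeans class on a simple dataset. In this example the data is split into two clusters of three points each. The output shows that both clusters and their centroids are identified correctly.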
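One way to put a number on how tight the clusters are is the within-cluster sum of squared distances (SSE). The sketch below is an addition for illustration, not part of the KMeans class: the function name `sse` is hypothetical, and the clusters and centroids are copied from the sample output above, with the centroids written as exact fractions.

```python
def sse(clusters, centroids):
    """Sum of squared distances from each point to its cluster's centroid."""
    total = 0.0
    for index, members in clusters.items():
        center = centroids[index]
        for point in members:
            total += sum((x - c) ** 2 for x, c in zip(point, center))
    return total


# Values taken from the sample output above.
clusters = {0: [[1, 1], [1, 2], [2, 2]], 1: [[10, 10], [10, 11], [11, 11]]}
centroids = [[4 / 3, 5 / 3], [31 / 3, 32 / 3]]

print(round(sse(clusters, centroids), 3))  # 2.667, that is 8/3
```

A lower SSE for the same k indicates a tighter clustering, which makes the value useful for comparing runs that started from different random centroids.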

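Summary:

This article walked through a Python implementation of the K-means algorithm. The code is short and easy to follow, and can serve as a practical reference for beginners. Note that in real use, the value of k and the number of iterations have to be chosen to suit the specific dataset.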
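One common heuristic for choosing k, which this article does not cover, is the elbow method: run K-means for several values of k and look for the point where the within-cluster sum of squared distances stops dropping sharply. The sketch below is self-contained and only illustrative. The function names `sq_dist` and `kmeans_sse`, the fixed `seed`, and the range of k values are all assumptions, and it expects `max_iter` to be at least 1.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_sse(data, k, max_iter=300, seed=0):
    """Run a compact K-means and return the final within-cluster SSE."""
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    centroids = rng.sample(data, k)
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for each point.
        clusters = [[] for _ in range(k)]
        for p in data:
            nearest = min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Update step: cluster means; an empty cluster keeps its centroid.
        new_centroids = [
            [sum(col) / len(members) for col in zip(*members)] if members else list(old)
            for members, old in zip(clusters, centroids)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return sum(
        sq_dist(p, c)
        for members, c in zip(clusters, new_centroids)
        for p in members
    )


data = [[1, 1], [1, 2], [2, 2], [10, 10], [10, 11], [11, 11]]
sse_by_k = {k: kmeans_sse(data, k) for k in range(1, 5)}
for k, value in sse_by_k.items():
    print(k, round(value, 3))
```

With k=1 every point belongs to a single cluster, so the SSE is simply the total spread of the data around its mean. A sharp drop followed by only small further gains suggests that the smaller k already captures the structure of the data.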