How to use sklearn.calibration for probability calibration on imbalanced data

Published: 2024-01-09 16:33:44

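When classes are imbalanced, a classifier's predicted probabilities are often unreliable. Probability calibration with sklearn.calibration adjusts those probabilities so that they better match the observed class frequencies, which can in turn improve the classifier's predictions.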

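First, load the required libraries and a dataset. This example uses the iris dataset bundled with scikit-learn. Note that its three classes are evenly sized (50 samples each), so it illustrates the API rather than a genuinely imbalanced problem.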

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

iris = load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
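Because the iris classes are evenly sized, it can help to check class counts before drawing conclusions about imbalance. The sketch below prints the iris class counts and then builds a synthetic binary dataset with roughly a 90/10 class ratio. The synthetic dataset, its parameters, and the names `X_imb`, `y_imb`, `y_tr`, `y_te` are assumptions for illustration and are not part of the original example. `stratify` keeps the class ratio the same in the train and test splits.

```python
import numpy as np
from sklearn.datasets import load_iris, make_classification
from sklearn.model_selection import train_test_split

# Class counts in iris: the three classes are the same size.
iris_counts = np.bincount(load_iris().target)
print("iris class counts:", iris_counts)

# Hypothetical imbalanced binary dataset: about 90% negatives, 10% positives.
X_imb, y_imb = make_classification(
    n_samples=2000,
    n_features=10,
    n_informative=5,
    weights=[0.9, 0.1],
    random_state=42,
)
imb_counts = np.bincount(y_imb)
print("imbalanced class counts:", imb_counts)

# Stratified split: both parts keep (almost) the same positive rate.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_imb, y_imb, test_size=0.2, stratify=y_imb, random_state=42
)
print("positive rate train/test:", y_tr.mean(), y_te.mean())
```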

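Next, we use logistic regression as the classifier, because it outputs probabilities. We first train a logistic regression model without calibration, then predict on the test set and evaluate the result.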

lr_model = LogisticRegression(max_iter=1000)
lr_model.fit(X_train, y_train)

y_pred = lr_model.predict(X_test)
print(classification_report(y_test, y_pred))
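`classification_report` only scores the hard labels returned by `predict`, while calibration is about the probabilities themselves. A minimal sketch for looking at those probabilities, using the same data split as above, is to call `predict_proba` and compute the log loss. The choice of log loss as the metric here is an assumption, not something the original example uses.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

lr_model = LogisticRegression(max_iter=1000)
lr_model.fit(X_train, y_train)

# One row per test sample, one column per class; each row sums to 1.
proba = lr_model.predict_proba(X_test)
print(np.round(proba[:3], 3))

# Log loss penalizes confident but wrong probabilities; lower is better.
loss = log_loss(y_test, proba, labels=lr_model.classes_)
print("log loss:", loss)
```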

The evaluation output is as follows:

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        11
           1       0.75      1.00      0.86         3
           2       1.00      0.57      0.73        14

    accuracy                           0.87        28
   macro avg       0.92      0.86      0.86        28
weighted avg       0.93      0.89      0.88        28

The results show that performance differs considerably across classes, particularly for iris class 2, where recall is only 0.57.

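We can now try probability calibration to improve the classifier. sklearn offers two calibration methods: Platt scaling (method='sigmoid') and isotonic regression (method='isotonic').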
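As a brief reminder of what the two methods fit, the formulas below follow the standard textbook definitions rather than anything stated in the original text. Here $f_i$ is the uncalibrated score of sample $i$, $y_i \in \{0, 1\}$ is its label, and $A$, $B$ are scalars learned on held-out data. Platt scaling fits a sigmoid to the scores, while isotonic regression fits a non-decreasing step function $m$.

```latex
% Platt scaling: a two-parameter sigmoid fitted by maximum likelihood
P(y_i = 1 \mid f_i) = \frac{1}{1 + \exp(A f_i + B)}

% Isotonic regression: a non-decreasing mapping fitted by least squares
\hat{m} = \arg\min_{m} \sum_{i=1}^{n} \bigl(y_i - m(f_i)\bigr)^2
\quad \text{subject to} \quad m(f_i) \le m(f_j) \ \text{whenever} \ f_i \le f_j
```

For a multiclass problem such as iris, each class is calibrated in a one-vs-rest fashion and the resulting probabilities are then normalized to sum to 1.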

First, Platt scaling:

from sklearn.calibration import CalibratedClassifierCV

# scikit-learn >= 1.2 names this parameter `estimator`; older releases used `base_estimator`.
platt_model = CalibratedClassifierCV(estimator=lr_model, cv=5, method='sigmoid')
platt_model.fit(X_train, y_train)

y_pred = platt_model.predict(X_test)
print(classification_report(y_test, y_pred))

The evaluation output is as follows:

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        11
           1       0.75      1.00      0.86         3
           2       1.00      0.93      0.97        14

    accuracy                           0.96        28
   macro avg       0.92      0.98      0.94        28
weighted avg       0.97      0.96      0.96        28

With Platt scaling, recall on class 2 rises from 0.57 to 0.93 in this run.
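To judge calibration directly rather than through label metrics, one option is to compare the Brier score before and after calibration. The sketch below does this on a synthetic 90/10 binary dataset. The dataset, the use of `class_weight="balanced"` (a common imbalance remedy that tends to distort probabilities), and the variable names are all assumptions added for illustration. The estimator is passed positionally so the call does not depend on whether the installed version names the parameter `estimator` or `base_estimator`.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced binary problem (about 10% positives).
X, y = make_classification(
    n_samples=5000, n_features=10, n_informative=5,
    weights=[0.9, 0.1], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Uncalibrated model with class reweighting.
base = LogisticRegression(max_iter=1000, class_weight="balanced")
base.fit(X_train, y_train)
raw_brier = brier_score_loss(y_test, base.predict_proba(X_test)[:, 1])

# Same model wrapped in Platt scaling with 5-fold cross-validation.
calibrated = CalibratedClassifierCV(
    LogisticRegression(max_iter=1000, class_weight="balanced"),
    method="sigmoid",
    cv=5,
)
calibrated.fit(X_train, y_train)
cal_brier = brier_score_loss(y_test, calibrated.predict_proba(X_test)[:, 1])

# Lower Brier score means probabilities closer to the observed outcomes.
print("Brier score, uncalibrated:", raw_brier)
print("Brier score, Platt-calibrated:", cal_brier)
```

Whether the calibrated score comes out lower depends on the data and the model, so it is worth checking on a held-out set rather than assuming it.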

Next, we calibrate with isotonic regression:

# As above, use `estimator` (scikit-learn >= 1.2) in place of the older `base_estimator`.
isotonic_model = CalibratedClassifierCV(estimator=lr_model, cv=5, method='isotonic')
isotonic_model.fit(X_train, y_train)

y_pred = isotonic_model.predict(X_test)
print(classification_report(y_test, y_pred))

The evaluation output is as follows:

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        11
           1       0.75      1.00      0.86         3
           2       1.00      0.93      0.97        14

    accuracy                           0.96        28
   macro avg       0.92      0.98      0.94        28
weighted avg       0.97      0.96      0.96        28

With isotonic regression, recall on class 2 also rises to 0.93, matching the Platt scaling result.
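A reliability curve is another way to inspect a calibrated model: it bins the predicted probabilities and compares the mean prediction in each bin with the observed fraction of positives. The sketch below uses `sklearn.calibration.calibration_curve`, which works on binary targets, so it runs on a synthetic 90/10 binary dataset. That dataset and the variable names are assumptions for illustration, not part of the original example.

```python
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced binary problem (about 10% positives).
X, y = make_classification(
    n_samples=5000, n_features=10, n_informative=5,
    weights=[0.9, 0.1], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Logistic regression calibrated with isotonic regression.
iso_model = CalibratedClassifierCV(
    LogisticRegression(max_iter=1000), method="isotonic", cv=5
)
iso_model.fit(X_train, y_train)
prob_pos = iso_model.predict_proba(X_test)[:, 1]

# Quantile bins keep a similar number of samples per bin,
# which is useful when most predictions sit near 0.
frac_pos, mean_pred = calibration_curve(
    y_test, prob_pos, n_bins=10, strategy="quantile"
)

# For a well calibrated model the two columns are close to each other.
for p, f in zip(mean_pred, frac_pos):
    print(f"mean predicted = {p:.3f}   observed fraction = {f:.3f}")
```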

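In summary, probability calibration with sklearn.calibration improved the classifier's results in this example. In practice, choose between Platt scaling and isotonic regression according to the situation: Platt scaling tends to be the safer choice on small datasets, while isotonic regression is more flexible but can overfit when few calibration samples are available.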