scikit-learn Notes - Unsupervised Learning

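Clustering

Clustering is the task of partitioning a dataset into groups (clusters) so that samples in the same cluster are more similar to each other than to samples in other clusters.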

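K-Means

The K-Means clustering algorithm proceeds as follows:

Initialize: choose the number of clusters and an initial centroid for each cluster
Assign: assign each sample to the cluster whose centroid is nearest
Update: recompute each centroid as the mean of the samples assigned to it
Repeat the assign and update steps until the assignments stop changing.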
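The steps above can be sketched in plain NumPy. This is a minimal illustration, not the scikit-learn implementation: the function name kmeans_sketch, the sample data, and the choice of initializing from randomly picked samples are all assumptions made for the example.

```python
import numpy as np


def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Plain K-Means loop (hypothetical helper); returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Initialize: pick k distinct samples as the starting centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assign: index of the nearest centroid for every sample.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update: mean of the samples in each cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                new_centroids[j] = members.mean(axis=0)
        # Stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels


# Illustrative data: two well-separated groups of 2-D points.
X = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 0.5],
              [8.0, 8.0], [9.0, 9.5], [8.5, 9.0]])
centroids, labels = kmeans_sketch(X, k=2)
print(labels)
```

Which group gets label 0 and which gets label 1 depends on the random initialization; only the grouping itself is meaningful.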

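import numpy as np
from sklearn.cluster import KMeans

# Example data (illustrative): two well-separated groups of 2-D points
X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])

# Compute k-means clustering
kmeans = KMeans(n_clusters=2)
kmeans.fit(X)

# Predict the closest cluster each sample in X belongs to.
print(kmeans.predict(X))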

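Limitations of K-Means:

The result depends heavily on the initialization, and the algorithm can easily converge to a local optimum.

For this reason K-Means is usually run several times (the n_init parameter in sklearn), each time with a fresh random initialization, and the run with the lowest cost is kept.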
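As a rough illustration of the effect of n_init, the sketch below fits the same synthetic data once with a single random initialization and once with ten, then compares the final cost, which scikit-learn exposes as inertia_. The blob data and the parameter values are assumptions chosen for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three synthetic blobs (illustrative data, not from the notes).
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(50, 2)),
    rng.normal(loc=(0, 5), scale=0.5, size=(50, 2)),
])

# One random initialization versus the best of ten.
single = KMeans(n_clusters=3, init="random", n_init=1, random_state=0).fit(X)
multi = KMeans(n_clusters=3, init="random", n_init=10, random_state=0).fit(X)

# inertia_ is the within-cluster sum of squared distances (the cost).
print(single.inertia_, multi.inertia_)
```

On easy data like this the two costs are often identical; the difference tends to show up on harder data where a single initialization can get stuck in a poor local optimum.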

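Dimensionality Reduction

Dimensionality reduction compresses the data (reduces the number of features) while preserving as much useful information as possible.

Principal Component Analysis (PCA)

PCA reduces the dimensionality of the data by changing the coordinate system of the feature space (a translation and a rotation) and then keeping only the leading axes.

The goal of PCA is to find a line represented by a vector (or a plane or subspace spanned by a set of vectors): the principal components. When all samples are projected onto the principal components, the projection error should be as small as possible, so that as much information from the original data as possible is retained.

In effect, PCA automatically combines the original input features into a smaller number of new features (the principal components) that retain as much of the information in the original features as possible.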

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# training features
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])

# PCA
pca = PCA(n_components=2)
pca.fit(X)
transformed_X = pca.transform(X)
print(transformed_X)

# Percentage of variance explained by each of the selected components.
print(pca.explained_variance_ratio_)

# Principal axes in feature space, representing the directions of maximum variance in the data.
print(pca.components_)

# same as pca.transform(X): center the data, then project onto the principal axes
print(np.dot(X - pca.mean_, pca.components_.T))

# plot components
for origin, transformed in zip(X, transformed_X):
    # first component
    prj0 = [0, 0]
    prj0[0] = pca.components_[0][0] * transformed[0]
    prj0[1] = pca.components_[0][1] * transformed[0]
    plt.scatter(prj0[0], prj0[1], color = 'r')
    
    # second component
    prj1 = [0, 0]
    prj1[0] = pca.components_[1][0] * transformed[1]
    prj1[1] = pca.components_[1][1] * transformed[1]
    plt.scatter(prj1[0], prj1[1], color = 'c')
    
    plt.scatter(origin[0], origin[1], color = 'b')
    plt.scatter(transformed[0], transformed[1], color = 'g')
plt.show()
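The example above keeps both components, so it rotates the data without actually reducing its dimension. The following sketch, on the same six points, keeps only the first principal component and measures the projection error by mapping the reduced data back to 2-D with inverse_transform. The variable names are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]], dtype=float)

# Keep only the first principal component: 2 features -> 1 feature.
pca = PCA(n_components=1)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)  # (6, 1)

# Map the 1-D representation back into the original 2-D space.
X_restored = pca.inverse_transform(X_reduced)

# Mean squared distance between each sample and its projection.
projection_error = np.mean(np.sum((X - X_restored) ** 2, axis=1))
print(projection_error)
print(pca.explained_variance_ratio_)
```

A small projection error together with an explained variance ratio close to 1 indicates that a single component captures almost all of the information in this dataset.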

The following example uses PCA to extract facial features (eigenfaces) and combines them with an SVM for face recognition:
Faces recognition example using PCA and SVM
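
The same PCA-then-SVM idea can be tried on a small scale without downloading a face dataset. The sketch below is not the eigenfaces example itself: it substitutes the 8x8 digit images bundled with scikit-learn for the face images, and the number of components and the SVM settings are arbitrary assumptions.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Stand-in data: 8x8 digit images (64 features each), not face images.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# PCA compresses the 64 pixel features into 20 components, then an SVM
# classifies in the reduced space. Both settings are arbitrary choices.
model = make_pipeline(PCA(n_components=20, whiten=True),
                      SVC(kernel="rbf", C=10))
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print(accuracy)
```

Wrapping both steps in a pipeline ensures that PCA is fitted on the training split only and then applied unchanged to the test split.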

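Handling Outliers

Possible causes of outliers:

Data entry errors
Sensor malfunctions
Freak events (genuinely unusual occurrences)

Outliers from the first two causes should be ignored or discarded, whereas outliers from the last cause sometimes deserve special attention, as in fraud detection.

Outlier detection and removal:

Train a model on the original training set
Remove the samples in the training set with the largest errors
Retrain on the cleaned training set
The removal and retraining steps can be repeated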

#!/usr/bin/python

def outlierCleaner(pred, features, actual):
    """
        Clean away the 10% of points that have the largest
        residual errors (difference between the prediction
        and the actual).

        Return a list of tuples where 
        each tuple is of the form (age, actual, error).
    """
    
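    # Number of points to discard: 10% of the data (integer division,
    # so the result can be used as a slice index).
    m = len(features) // 10
    data = [(f, a, (a - p) ** 2) for f, a, p in zip(features, actual, pred)]
    # Sort by squared error, largest first, then drop the top m.
    sorted_data = sorted(data, key=lambda item: item[2], reverse=True)
    cleaned_data = sorted_data[m:]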
    
    return cleaned_data

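import numpy

# Assumes reg is an already fitted regressor and cleaned_data is the list
# returned by outlierCleaner(reg.predict(features), features, actual).
if len(cleaned_data) > 0:
    # refit your cleaned data!
    features, actual, errors = zip(*cleaned_data)
    features = numpy.reshape(numpy.array(features), (len(features), 1))
    actual = numpy.reshape(numpy.array(actual), (len(actual), 1))
    reg.fit(features, actual)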
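
For a self-contained view of the whole train, clean, retrain cycle, here is a minimal sketch on synthetic data. The data, the injected outliers, and the use of LinearRegression are assumptions made for illustration; the 10% cutoff follows the function above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
# Synthetic data: y = 2x + noise, with 5 corrupted targets.
features = rng.uniform(0, 10, size=(100, 1))
actual = 2.0 * features[:, 0] + rng.normal(scale=0.5, size=100)
actual[:5] += 40.0  # injected outliers

# Step 1: train on the original data.
reg = LinearRegression().fit(features, actual)
print("slope before cleaning:", reg.coef_[0])

# Step 2: squared residual of every sample under the first fit;
# keep the 90% of samples with the smallest residuals.
errors = (actual - reg.predict(features)) ** 2
n_keep = len(features) - len(features) // 10
keep = np.argsort(errors)[:n_keep]

# Step 3: retrain on the cleaned data.
reg_clean = LinearRegression().fit(features[keep], actual[keep])
print("slope after cleaning:", reg_clean.coef_[0])
```

After cleaning, the fitted slope should sit close to the true value of 2 used to generate the data, since the corrupted samples carry by far the largest residuals and are removed first.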