scikit-learn Notes - Supervised Learning

Linear Regression

A linear model has the following mathematical form:

$\hat{y}(w, x) = w_0 + w_1 x_1 + \dots + w_p x_p$

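Linear regression is usually solved with the method of least squares, which picks the coefficients that minimize the residual sum of squares:

$\min_w \|Xw - y\|_2^2$

In sklearn, a fitted linear regression exposes its coefficients as coef_ and intercept_: coef_ holds the vector $w = (w_1, \dots, w_p)$ and intercept_ holds the intercept $w_0$.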

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# create linear regression model
reg = LinearRegression()

# fit the model on the training set
reg.fit(X_train, y_train)

# use the trained model to predict result for the test set
pred = reg.predict(X_test)

# The coefficients
print('Coefficients: ', reg.coef_)
print('Intercept: ', reg.intercept_)

# The mean squared error
print("Mean squared error: %.2f"
      % mean_squared_error(y_test, pred))

# R^2 (coefficient of determination): 1 is perfect prediction
print('R^2 score: %.2f' % r2_score(y_test, pred))
# the same value can also be obtained with:
# reg.score(X_test, y_test)
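To connect the least-squares objective to coef_ and intercept_, the sketch below fits LinearRegression on synthetic data and solves the same problem directly with NumPy. The data, the names true_w and X_aug, and the noise level are assumptions made for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (hypothetical): y = 4 + X @ true_w + small noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = 4.0 + X @ true_w + 0.1 * rng.normal(size=100)

# Fit with scikit-learn
reg = LinearRegression().fit(X, y)

# Solve min_w ||X_aug w - y||^2 directly; the leading column of ones
# plays the role of the intercept w_0
X_aug = np.hstack([np.ones((X.shape[0], 1)), X])
w_full, *_ = np.linalg.lstsq(X_aug, y, rcond=None)

print("intercept:", w_full[0], reg.intercept_)
print("coefficients:", w_full[1:], reg.coef_)
```

Both approaches should agree up to floating-point error, since they minimize the same objective.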

R^2 Score
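The R^2 score (coefficient of determination) compares the residual sum of squares of the model with the total sum of squares around the mean of the targets: $R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$. The sketch below computes it by hand and checks it against r2_score. The four target values and predictions are made-up numbers, used only for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical targets and predictions
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# Residual sum of squares and total sum of squares
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1.0 - ss_res / ss_tot
r2_sklearn = r2_score(y_true, y_pred)
print(r2_manual, r2_sklearn)
```

A score of 1 means the predictions match the targets exactly, while a model that always predicts the mean of the targets scores 0.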

Visualizing the linear regression model

import matplotlib.pyplot as plt

plt.scatter(X, Y)
plt.plot(X, reg.predict(X), color='red', linewidth=2)
plt.xlabel("x")
plt.ylabel("y")
plt.show()

Naive Bayes

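Naive Bayes classifiers are a family of simple probabilistic classifiers based on Bayes' theorem together with a strong (naive) assumption that the features are independent of one another. They are simple to implement, can handle very large feature spaces, and train quickly. Bayes' theorem is given below.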

$P(A|B) = \frac{P(B|A) P(A)}{P(B)}$

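P(A|B) is the conditional probability of A given that B has occurred. Because it is derived from the observed value of B, it is called the posterior probability of A.
P(B|A) is the conditional probability of B given that A has occurred. Viewed as a function of A, it is called the likelihood.
P(A) is the prior (or marginal) probability of A. It is called the "prior" because it does not take any information about B into account.
P(B) is the prior or marginal probability of B.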
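As a worked instance of the formula, the sketch below computes a posterior from a prior and two conditional probabilities. The scenario (A = "has the condition", B = "test is positive") and all three input numbers are hypothetical, chosen only to show the arithmetic.

```python
# Hypothetical inputs
p_a = 0.01              # P(A): prior probability of A
p_b_given_a = 0.90      # P(B|A): likelihood of B when A holds
p_b_given_not_a = 0.05  # P(B|not A): probability of B when A does not hold

# P(B) via the law of total probability
p_b = p_b_given_a * p_a + p_b_given_not_a * (1.0 - p_a)

# Bayes' theorem: P(A|B) = P(B|A) P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b

print("P(B)   =", round(p_b, 4))
print("P(A|B) =", round(p_a_given_b, 4))
```

Even with a fairly reliable test, the posterior stays around 15% here because the prior P(A) is small.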

from sklearn.naive_bayes import GaussianNB

# create classifier
clf = GaussianNB()

# fit the classifier on the training features and labels
clf.fit(features_train, labels_train)

# use the trained classifier to predict labels for the test features
pred = clf.predict(features_test)

# calculate the accuracy on the test data
accuracy = clf.score(features_test, labels_test)

# also can use the below code to calculate the accuracy
# from sklearn.metrics import accuracy_score
# accuracy = accuracy_score(labels_test, pred)
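The snippets in these notes assume that features_train, labels_train, features_test and labels_test already exist. One possible way to create them, sketched here with the iris dataset bundled with scikit-learn, is train_test_split. The dataset, the 70/30 split and random_state=0 are assumptions for illustration, not part of the original notes.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Load a small built-in dataset and hold out 30% of it for testing
features, labels = load_iris(return_X_y=True)
features_train, features_test, labels_train, labels_test = train_test_split(
    features, labels, test_size=0.3, random_state=0, stratify=labels
)

# Train and predict exactly as in the snippet above
clf = GaussianNB()
clf.fit(features_train, labels_train)
pred = clf.predict(features_test)

# Two equivalent ways to compute the accuracy
acc_from_score = clf.score(features_test, labels_test)
acc_from_metric = accuracy_score(labels_test, pred)
print(acc_from_score, acc_from_metric)
```

The same four variables can be reused with the other classifiers in these notes.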

Support Vector Machine (SVM)

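A support vector machine (SVM) is sometimes called a maximum margin classifier. Its goal is to find a separating line such that the distance (the margin) between that line and the nearest data points of each class is as large as possible.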

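The SVM hyperparameter C is used to guard against overfitting:

When C is too large, the regularization term carries too little weight, which may lead to overfitting.
When C is too small, the regularization term carries too much weight, which may lead to underfitting.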
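The effect of C can be observed by training the same model with several values and comparing training and test accuracy. The sketch below does this on a synthetic, deliberately noisy dataset. The dataset parameters, the RBF kernel and the three values of C are assumptions picked for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic data with 10% of the labels flipped at random (label noise)
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=4, flip_y=0.1, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Train one classifier per value of C and record (train, test) accuracy
results = {}
for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="rbf", C=C).fit(X_train, y_train)
    results[C] = (clf.score(X_train, y_train), clf.score(X_test, y_test))
    print("C=%g  train=%.3f  test=%.3f" % (C, results[C][0], results[C][1]))
```

A large gap between training and test accuracy at high C suggests overfitting, while low accuracy on both at small C suggests underfitting.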

Different kernel functions make it possible to build complex nonlinear classifiers. Many kernels are available, for example:

Radial basis function (RBF) kernel
Polynomial kernel
Sigmoid kernel
….
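To see why the choice of kernel matters, the sketch below trains SVC with several kernels on two concentric circles, a dataset that no straight line can separate. The make_circles parameters and the split are assumptions for illustration. The kernel names are the string values accepted by SVC.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric circles: not linearly separable
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Compare test accuracy across kernels, all with default settings
scores = {}
for kernel in ("linear", "poly", "rbf", "sigmoid"):
    clf = SVC(kernel=kernel).fit(X_train, y_train)
    scores[kernel] = clf.score(X_test, y_test)
    print("%-8s accuracy: %.3f" % (kernel, scores[kernel]))
```

On this data the RBF kernel should separate the classes far better than the linear kernel.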

Drawbacks of SVM:

Training is slow when the training set is very large.
SVM may perform poorly when the data contains noise.

from sklearn.svm import SVC

# create classifier
clf = SVC(kernel="linear")

# fit the classifier on the training features and labels
clf.fit(features_train, labels_train)

# use the trained classifier to predict labels for the test features
pred = clf.predict(features_test)

# calculate the accuracy on the test data
from sklearn.metrics import accuracy_score
acc = accuracy_score(labels_test, pred)

# also can use the below code to calculate the accuracy
# acc = clf.score(features_test, labels_test)

Decision Tree

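Decision trees are typically used for problems that can be broken down into several linear questions. The target function is represented as a tree: each node tests the value of one attribute and performs a linear split, and each branch corresponds to one outcome for that attribute.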

from sklearn.tree import DecisionTreeClassifier

# create classifier
clf = DecisionTreeClassifier()

# fit the classifier on the training features and labels
clf.fit(features_train, labels_train)

# use the trained classifier to predict labels for the test features
pred = clf.predict(features_test)

# calculate the accuracy on the test data
acc = clf.score(features_test, labels_test)

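When the model overfits or underfits, try different hyperparameters, such as the minimum number of samples required to split a node (a node holding fewer samples than this threshold is not split any further) or the maximum depth.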

clf = DecisionTreeClassifier(min_samples_split=50)
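A quick way to see what min_samples_split does is to compare the size of the resulting trees. The sketch below trains a default tree and a constrained tree on synthetic noisy data. The dataset parameters and random_state values are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with some label noise, so a fully grown tree can overfit
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=4, flip_y=0.1, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Default tree: keeps splitting until the leaves are pure
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
# Constrained tree: nodes with fewer than 50 samples are not split
small_tree = DecisionTreeClassifier(
    min_samples_split=50, random_state=0
).fit(X_train, y_train)

for name, tree in (("default", full_tree), ("min_samples_split=50", small_tree)):
    print(name,
          "depth:", tree.get_depth(),
          "leaves:", tree.get_n_leaves(),
          "train acc: %.3f" % tree.score(X_train, y_train),
          "test acc: %.3f" % tree.score(X_test, y_test))
```

The default tree typically reaches perfect training accuracy with many leaves, while the constrained tree is much smaller.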

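The ID3 decision tree algorithm uses a statistic called information gain to choose the attribute tested at each node. Computing information gain relies on entropy.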

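Entropy

Entropy governs how a decision tree decides where to split the data. It is a measure of the impurity in a set of samples.

$Entropy = -\sum_{k=1}^{K} p_k \log_2(p_k)$

In this formula, p_k is the proportion of samples that belong to class k.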

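For example, suppose the samples at a node fall into 2 classes. If each class makes up exactly half of the samples (p = 0.5), the entropy is 1, the highest impurity. If all samples belong to a single class, the entropy is 0, the lowest impurity.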
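The two cases above can be checked numerically. The following is a minimal sketch of the entropy formula. The function name entropy and the toy label lists are assumptions for illustration.

```python
import numpy as np

def entropy(labels):
    """Entropy (in bits) of a collection of class labels."""
    # Class proportions p_k; np.unique only returns classes that occur,
    # so every p_k is strictly positive and log2 is well defined
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    # Adding 0.0 turns a floating-point -0.0 into 0.0 for pure nodes
    return float(-np.sum(p * np.log2(p))) + 0.0

print(entropy([0, 0, 1, 1]))      # two classes, half each
print(entropy(["a", "a", "a"]))   # a single class
print(entropy([0, 0, 0, 1]))      # an uneven two-class mix
```

The first call gives the maximum for two classes, the second gives zero, and the uneven mix falls in between.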

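Information gain

Information gain is the expected reduction in entropy caused by splitting a sample set on a particular attribute. It quantifies how well that attribute separates the training samples with respect to the target labels, and is computed as follows: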

$Gain(L, X^i) = Entropy(L) - \sum_{v \in values(X^i)} \frac{|L_v|}{|L|}Entropy(L_v)$

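The information gain from splitting a node's training subset L on the values of attribute X equals the entropy of the node minus the weighted average of the entropies of the child nodes produced by the split.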
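The formula can be sketched directly in code. The helper names entropy and information_gain, and the four-sample toy dataset (labels speed, attributes grade and limit), are hypothetical and used only to illustrate the calculation.

```python
import numpy as np

def entropy(labels):
    """Entropy (in bits) of a collection of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p))) + 0.0

def information_gain(labels, attribute_values):
    """Entropy of the node minus the weighted entropy of its children."""
    labels = np.asarray(labels)
    attribute_values = np.asarray(attribute_values)
    weighted_child_entropy = 0.0
    for v in np.unique(attribute_values):
        subset = labels[attribute_values == v]                # L_v
        weight = len(subset) / len(labels)                    # |L_v| / |L|
        weighted_child_entropy += weight * entropy(subset)
    return entropy(labels) - weighted_child_entropy

# Toy data: four samples, one label column and two candidate attributes
speed = ["slow", "slow", "fast", "fast"]     # target labels
grade = ["steep", "steep", "flat", "steep"]  # imperfect attribute
limit = ["yes", "yes", "no", "no"]           # perfectly separating attribute

gain_grade = information_gain(speed, grade)
gain_limit = information_gain(speed, limit)
print("gain(grade) = %.4f" % gain_grade)
print("gain(limit) = %.4f" % gain_limit)
```

An algorithm such as ID3 would pick limit for this node, since it yields the larger gain.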

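Pros and cons

Advantages of decision trees: they are easy to use and, to some extent, can explain the data graphically, which makes them easy to understand. Decision trees are also convenient to use inside ensemble methods.

Disadvantages of decision trees: when the data contains a large number of features, a complex decision tree overfits easily.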

k-NN (k nearest neighbors)

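The k-NN (k-nearest neighbors) algorithm is arguably the simplest of all machine learning algorithms. It computes the distance between the feature values of a new data point and those of the training data, then picks the k (k >= 1) nearest neighbors to perform classification or regression.

Sometimes a simple k-NN algorithm performs very well on well-chosen features. With well-chosen parameters (mainly the distance metric), it can also perform very well on regression problems.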
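The distance-then-vote procedure for classification can be sketched in a few lines of NumPy. This is a minimal illustration assuming Euclidean distance and a simple majority vote. The function name knn_predict and the toy points are hypothetical, and this is not how KNeighborsClassifier is implemented internally.

```python
from collections import Counter

import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest neighbors."""
    # Euclidean distance from x_new to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Majority vote over the neighbors' labels
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy training set: two small clusters
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y_train = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_train, y_train, np.array([0.15, 0.15])))
print(knn_predict(X_train, y_train, np.array([0.95, 1.0])))
```

For regression, the vote would be replaced by an average of the neighbors' target values.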

from sklearn import metrics
from sklearn.neighbors import KNeighborsClassifier

### create classifier
clf = KNeighborsClassifier()

### fit the classifier on the training features and labels
clf.fit(features_train, labels_train)

### use the trained classifier to predict labels for the test features
predicted = clf.predict(features_test)

### calculate the accuracy on the test data
accuracy = clf.score(features_test, labels_test)

### summarize the fit of the classifier
print(metrics.classification_report(labels_test, predicted))
print(metrics.confusion_matrix(labels_test, predicted))

AdaBoost

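AdaBoost is an ensemble learning method.

AdaBoost is short for "Adaptive Boosting". It is an iterative algorithm that improves performance by combining many other machine learning algorithms: each iteration adds a new weak classifier (some machine learning algorithm) until a predetermined, sufficiently small error rate is reached.

Every training sample is assigned a weight that indicates the probability of it being selected into the training set of a classifier. If a sample has already been classified correctly, its probability of being selected for the next training set is lowered. Conversely, if a sample has not been classified correctly, its weight is raised. In this way, AdaBoost focuses on the samples that are harder to classify.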
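The reweighting idea can be illustrated with a single round of the classic discrete AdaBoost update for labels in {-1, +1}. This is a sketch of the textbook rule, not necessarily the exact variant that AdaBoostClassifier runs. The five labels and predictions are made up for illustration.

```python
import numpy as np

# Hypothetical true labels and the predictions of one weak classifier
y_true = np.array([1, 1, -1, -1, 1])
y_pred = np.array([1, 1, -1, -1, -1])   # only the last sample is wrong

# Start from uniform sample weights
weights = np.full(len(y_true), 1.0 / len(y_true))

# Weighted error of the weak classifier
miss = y_pred != y_true
err = np.sum(weights[miss])

# Weight of this classifier in the final ensemble
alpha = 0.5 * np.log((1.0 - err) / err)

# Raise the weights of misclassified samples, lower the others, renormalize
new_weights = weights * np.exp(np.where(miss, alpha, -alpha))
new_weights /= new_weights.sum()

print("error:", err, "alpha:", alpha)
print("new weights:", new_weights)
```

After the update, the single misclassified sample carries as much weight as the four correctly classified samples together, so the next weak classifier is pushed to get it right.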

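AdaBoost is sensitive to noisy data and outliers. On some problems, however, it is less prone to overfitting than most other learning algorithms.

AdaBoost (with decision trees as the weak classifiers) is often referred to as the best out-of-the-box classifier.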

from sklearn.ensemble import AdaBoostClassifier

# create classifier
clf = AdaBoostClassifier(n_estimators=20, learning_rate = 2.0)

# fit the classifier on the training features and labels
clf.fit(features_train, labels_train)

# use the trained classifier to predict labels for the test features
predicted = clf.predict(features_test)

# calculate the accuracy on the test data
accuracy = clf.score(features_test, labels_test)

print("accuracy:", accuracy)

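Random Forest

A random forest is an algorithm that combines many trees using the idea of ensemble learning. Its basic unit is the decision tree.

A random forest builds a forest in a randomized way. The forest consists of many decision trees, and each decision tree is a classifier. For an input sample, N trees produce N classification results, and the class that receives the most votes is the final output. This is Bagging in its simplest form.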
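The voting scheme described above can be sketched by hand: train each tree on a bootstrap sample and take the majority vote. This is an illustration of the idea under stated assumptions (synthetic binary data, 25 trees, max_features="sqrt"), not a reimplementation of RandomForestClassifier, which combines its trees by averaging their predicted class probabilities rather than by counting hard votes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (labels are 0 and 1)
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

rng = np.random.default_rng(0)
n_trees = 25
trees = []
for _ in range(n_trees):
    # Bootstrap sample: draw len(X_train) indices with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    tree = DecisionTreeClassifier(
        max_features="sqrt",                       # random feature subset per split
        random_state=int(rng.integers(1_000_000)),
    )
    trees.append(tree.fit(X_train[idx], y_train[idx]))

# Each row holds one tree's predictions: shape (n_trees, n_test_samples)
all_votes = np.stack([t.predict(X_test) for t in trees])

# Majority vote per test sample
majority = np.array(
    [np.bincount(col, minlength=2).argmax() for col in all_votes.T]
)
accuracy = np.mean(majority == y_test)
print("majority-vote accuracy: %.3f" % accuracy)
```

Using an odd number of trees avoids ties in a two-class vote.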

from sklearn.ensemble import RandomForestClassifier

# create classifier
clf = RandomForestClassifier(n_estimators=10)

# fit the classifier on the training features and labels
clf.fit(features_train, labels_train)

# use the trained classifier to predict labels for the test features
pred = clf.predict(features_test)

# calculate the accuracy on the test data
acc = clf.score(features_test, labels_test)