
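The scikit-learn library

scikit-learn is one of the most popular machine learning libraries in use today.

It can be used to solve both classification and regression problems.

Using the iris dataset as an example, this chapter gives a brief tour of the scikit-learn implementations of eight classic machine learning classification algorithms.

For the theory and derivations behind these algorithms, see 《统计学习方法》 (Statistical Learning Methods) or the 《西瓜书》 (the "Watermelon Book").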


Preparing the dataset

Downloading the dataset

import seaborn as sns
iris = sns.load_dataset("iris")

Inspecting the dataset

type(iris)
pandas.core.frame.DataFrame
iris.shape
(150, 5)
iris.head()
# Columns: sepal length, sepal width, petal length, petal width, iris species

sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
iris.info()
# The data is clean: there are no missing values
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   species       150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB
iris.describe()

sepal_length sepal_width petal_length petal_width
count 150.000000 150.000000 150.000000 150.000000
mean 5.843333 3.057333 3.758000 1.199333
std 0.828066 0.435866 1.765298 0.762238
min 4.300000 2.000000 1.000000 0.100000
25% 5.100000 2.800000 1.600000 0.300000
50% 5.800000 3.000000 4.350000 1.300000
75% 6.400000 3.300000 5.100000 1.800000
max 7.900000 4.400000 6.900000 2.500000
# Count how many rows carry each label
iris.species.value_counts()
virginica     50
versicolor    50
setosa        50
Name: species, dtype: int64
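# pairplot shows the pairwise relationships between the variables, with each variable's distribution on the diagonal

sns.pairplot(data=iris,hue="species")
<seaborn.axisgrid.PairGrid at 0x1aac4ed3850>

[Figure: pair plot of the four iris features, colored by species]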

Data cleaning

To simplify the problem, we classify the irises using only two variables: petal length and petal width.

iris_simple = iris.drop(["sepal_length", "sepal_width"], axis=1)
iris_simple.head()

petal_length petal_width species
0 1.4 0.2 setosa
1 1.4 0.2 setosa
2 1.3 0.2 setosa
3 1.5 0.2 setosa
4 1.4 0.2 setosa

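Label encoding

from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()  # create an instance
iris_simple["species"] = encoder.fit_transform(iris_simple["species"])
# LabelEncoder is a scikit-learn utility that converts categorical data (such as text labels) into numeric data

fit_transform:

This commonly used method combines two steps:

  • fit: learn the mapping for the categorical data (that is, assign each category an integer label).
  • transform: convert the original categorical data into the corresponding numeric labels.

    Shortcut: fit_transform is equivalent to calling fit and then transform, but is more convenient.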
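The two steps can be pictured with a small pure-Python sketch. It assumes categories are numbered in sorted order, which is consistent with the 0/1/2 codes for setosa/versicolor/virginica in this chapter. The helper functions fit, transform and inverse_transform below are illustrative stand-ins, not scikit-learn's code.

```python
# Conceptual sketch only: NOT scikit-learn's actual implementation.

def fit(labels):
    """Learn the mapping: the sorted unique categories (a category's code is its index)."""
    return sorted(set(labels))

def transform(classes, labels):
    """Replace each label by its integer code."""
    code_of = {name: code for code, name in enumerate(classes)}
    return [code_of[label] for label in labels]

def inverse_transform(classes, codes):
    """Map integer codes back to the original category names."""
    return [classes[code] for code in codes]

species = ["setosa", "virginica", "versicolor", "setosa"]
classes = fit(species)
codes = transform(classes, species)

print(classes)                            # ['setosa', 'versicolor', 'virginica']
print(codes)                              # [0, 2, 1, 0]
print(inverse_transform(classes, codes))  # the original labels again
```

Keeping the learned mapping around (here the classes list, in the chapter the encoder object) is what makes it possible to turn predictions back into names later.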
iris_simple

petal_length petal_width species
0 1.4 0.2 0
1 1.4 0.2 0
2 1.3 0.2 0
3 1.5 0.2 0
4 1.4 0.2 0
... ... ... ...
145 5.2 2.3 2
146 5.0 1.9 2
147 5.2 2.0 2
148 5.4 2.3 2
149 5.1 1.8 2

150 rows × 3 columns

Standardizing the dataset (the features here are on similar scales, so standardization is not actually applied in the rest of this chapter)

from sklearn.preprocessing import StandardScaler
import pandas as pd
trans = StandardScaler()
_iris_simple = trans.fit_transform(iris_simple[["petal_length","petal_width"]])
_iris_simple = pd.DataFrame(_iris_simple,columns=["petal_length","petal_width"])
_iris_simple.head()

petal_length petal_width
0 -1.340227 -1.315444
1 -1.340227 -1.315444
2 -1.397064 -1.315444
3 -1.283389 -1.315444
4 -1.340227 -1.315444
_iris_simple.describe()

petal_length petal_width
count 1.500000e+02 1.500000e+02
mean -8.652338e-16 -4.662937e-16
std 1.003350e+00 1.003350e+00
min -1.567576e+00 -1.447076e+00
25% -1.226552e+00 -1.183812e+00
50% 3.364776e-01 1.325097e-01
75% 7.627583e-01 7.906707e-01
max 1.785832e+00 1.712096e+00
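The std row reads 1.003350 rather than exactly 1. StandardScaler scales by the population standard deviation (dividing by n), while describe() reports the sample standard deviation (dividing by n - 1), so the two differ by a factor of sqrt(n / (n - 1)). The sketch below reproduces this with the standard library only; the values list is made up for illustration.

```python
import math
import statistics

# Made-up petal lengths, for illustration only
values = [1.4, 1.3, 1.5, 4.7, 4.5, 5.1, 6.0, 5.9]
n = len(values)

mean = statistics.fmean(values)
population_std = statistics.pstdev(values)   # divides by n

# z-score standardization: subtract the mean, divide by the standard deviation
z = [(v - mean) / population_std for v in values]

# Measuring the result with the sample formula (divides by n - 1)
# gives sqrt(n / (n - 1)) rather than exactly 1
sample_std_of_z = statistics.stdev(z)
print(sample_std_of_z, math.sqrt(n / (n - 1)))

# With the 150 iris rows the same factor is about 1.00335
print(math.sqrt(150 / 149))
```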

Building the training and test sets (ignoring a validation set for now)

from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(iris_simple, test_size=0.2)
test_set.head()

petal_length petal_width species
15 1.5 0.4 0
92 4.0 1.2 1
46 1.6 0.2 0
37 1.4 0.1 0
84 4.5 1.5 1
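Because the split above is random and unseeded, the rows that land in the test set (15, 92, 46, ... here) will differ from run to run. The sketch below illustrates the idea of a seeded shuffle-and-split in plain Python. The function shuffle_split is a hypothetical helper, not scikit-learn's implementation; with the real train_test_split, passing a random_state argument serves the same purpose.

```python
import random

def shuffle_split(n_rows, test_size=0.2, seed=None):
    """Shuffle the row indices, then cut off the first test_size fraction as the test set."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_size)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = shuffle_split(150, test_size=0.2, seed=42)
print(len(train_idx), len(test_idx))    # 120 30

# The same seed reproduces exactly the same split
again = shuffle_split(150, test_size=0.2, seed=42)
print(again == (train_idx, test_idx))   # True
```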
iris_x_train = train_set[["petal_length","petal_width"]]
iris_x_train.head()

petal_length petal_width
38 1.3 0.2
52 4.9 1.5
49 1.4 0.2
55 4.5 1.3
117 6.7 2.2

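iris_x_train does not need copy():

Here we only select the columns (petal_length, petal_width) and make no further modifications.

If the features are used only for modeling and are never modified, it is safe to skip .copy().

iris_y_train uses copy():

The goal is to extract the target variable species as an independent copy. It might be re-encoded later (for example with LabelEncoder), so .copy() is used to avoid unintentionally modifying the original data.

iris_y_train = train_set["species"].copy()
iris_y_train.head()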
38     0
52     1
49     0
55     1
117    2
Name: species, dtype: int32
iris_x_test = test_set[["petal_length","petal_width"]]
iris_x_test.head()

petal_length petal_width
15 1.5 0.4
92 4.0 1.2
46 1.6 0.2
37 1.4 0.1
84 4.5 1.5
iris_y_test = test_set["species"].copy()
iris_y_test.head()
15    0
92    1
46    0
37    0
84    1
Name: species, dtype: int32

k-nearest neighbors

Basic idea: predict the class of a query point as the most common class among its k nearest neighbors.
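The idea fits in a few lines of plain Python. This is a minimal sketch, assuming Euclidean distance and a simple majority vote; knn_predict is a hypothetical helper, and the nine sample rows are (petal_length, petal_width, class code) values taken from the tables in this chapter.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=5):
    """Majority vote among the k training points closest to `query`."""
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# (petal_length, petal_width) pairs and their class codes
points = [(1.4, 0.2), (1.3, 0.2), (1.5, 0.4),
          (4.5, 1.5), (4.0, 1.2), (4.7, 1.4),
          (5.5, 1.8), (6.1, 2.3), (5.6, 2.4)]
labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]

print(knn_predict(points, labels, (1.6, 0.3), k=3))  # 0
print(knn_predict(points, labels, (4.3, 1.3), k=3))  # 1
print(knn_predict(points, labels, (5.8, 2.2), k=3))  # 2
```

A library implementation avoids sorting the whole training set for every query, but the prediction rule is the same.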

scikit-learn implementation

from sklearn.neighbors import KNeighborsClassifier
# Build the classifier object
clf=KNeighborsClassifier()
clf
KNeighborsClassifier()
# Train
clf.fit(iris_x_train,iris_y_train)
KNeighborsClassifier()
# Predict
res = clf.predict(iris_x_test)
print(res)
print(iris_y_test.values)
[0 1 0 0 1 2 1 1 2 1 2 0 2 2 1 0 0 2 0 0 2 2 2 1 1 1 1 2 1 0]
[0 1 0 0 1 2 1 1 2 1 2 0 2 2 1 0 0 2 0 0 2 2 2 1 1 1 1 2 1 0]


# Inverse transform: map the numeric class labels back to the original class names
encoder.inverse_transform(res)
array(['setosa', 'versicolor', 'setosa', 'setosa', 'versicolor',
       'virginica', 'versicolor', 'versicolor', 'virginica', 'versicolor',
       'virginica', 'setosa', 'virginica', 'virginica', 'versicolor',
       'setosa', 'setosa', 'virginica', 'setosa', 'setosa', 'virginica',
       'virginica', 'virginica', 'versicolor', 'versicolor', 'versicolor',
       'versicolor', 'virginica', 'versicolor', 'setosa'], dtype=object)
# Evaluate
accuracy = clf.score(iris_x_test,iris_y_test)
print("Prediction accuracy: {:.0%}".format(accuracy))
Prediction accuracy: 100%
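For a classifier, score returns the mean accuracy, that is, the fraction of test samples whose predicted label equals the true label. A minimal sketch of that computation follows; fraction_correct is a hypothetical helper and the two label lists are made up, with one deliberate mistake.

```python
def fraction_correct(y_true, y_pred):
    """Share of positions where the prediction equals the true label."""
    matches = sum(1 for truth, guess in zip(y_true, y_pred) if truth == guess)
    return matches / len(y_true)

y_true = [0, 1, 0, 0, 1, 2, 1, 1, 2, 1]
y_pred = [0, 1, 0, 0, 1, 2, 1, 2, 2, 1]   # the eighth prediction is wrong

print("Prediction accuracy: {:.0%}".format(fraction_correct(y_true, y_pred)))  # 90%
```

With only 30 test rows, each misclassified sample moves the accuracy by more than three percentage points, so a single split gives only a rough estimate.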


# Save the results
out = iris_x_test.copy()
out["y"] =iris_y_test
out["pre"]=res
out

petal_length petal_width y pre
15 1.5 0.4 0 0
92 4.0 1.2 1 1
46 1.6 0.2 0 0
37 1.4 0.1 0 0
84 4.5 1.5 1 1
136 5.6 2.4 2 2
61 4.2 1.5 1 1
76 4.8 1.4 1 1
137 5.5 1.8 2 2
87 4.4 1.3 1 1
116 5.5 1.8 2 2
19 1.5 0.3 0 0
139 5.4 2.1 2 2
110 5.1 2.0 2 2
85 4.5 1.6 1 1
2 1.3 0.2 0 0
16 1.3 0.4 0 0
135 6.1 2.3 2 2
45 1.4 0.3 0 0
13 1.1 0.1 0 0
130 6.1 1.9 2 2
103 5.6 1.8 2 2
126 4.8 1.8 2 2
65 4.4 1.4 1 1
57 3.3 1.0 1 1
51 4.5 1.5 1 1
50 4.7 1.4 1 1
144 5.7 2.5 2 2
71 4.0 1.3 1 1
17 1.4 0.3 0 0
out.to_csv("iris_predict.csv")
# Visualization
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt

def draw(clf):
    # Build a grid covering the range of both features
    M,N=500,500
    x1_min,x2_min = iris_simple[["petal_length","petal_width"]].min(axis=0)
    x1_max,x2_max = iris_simple[["petal_length","petal_width"]].max(axis=0)
    t1 = np.linspace(x1_min,x1_max,M)
    t2 = np.linspace(x2_min,x2_max,N)
    x1,x2 = np.meshgrid(t1,t2)

    # Predict the class of every grid point
    x_show = np.stack((x1.flat,x2.flat),axis=1)
    y_predict = clf.predict(x_show)

    # Color maps: light colors for the regions, dark colors for the data points
    cm_light = mpl.colors.ListedColormap(["#A0FFA0","#FFA0A0","#A0A0FF"])
    cm_dark = mpl.colors.ListedColormap(["g","r","b"])

    # Draw the predicted regions
    plt.figure(figsize=(10,6))
    plt.pcolormesh(t1,t2,y_predict.reshape(x1.shape),cmap=cm_light,shading="auto")

    # Draw the original data points
    plt.scatter(iris_simple["petal_length"],iris_simple["petal_width"],label=None,
                c=iris_simple["species"],cmap=cm_dark,marker='o',edgecolors='k')
    plt.xlabel("petal_length")
    plt.ylabel("petal_width")

    # Draw the legend; the order must match the LabelEncoder codes 0, 1, 2
    color=["g","r","b"]
    species = ["setosa","versicolor","virginica"]
    for i in range(3):
        plt.scatter([],[],c=color[i],s=40,label=species[i])  # empty scatter plots, used only to create legend entries

    plt.legend(loc="best")
    plt.title('iris_classifier')


draw(clf)

[Figure: decision regions of the k-nearest-neighbors classifier over petal_length and petal_width, with the iris data points overlaid]

Naive Bayes

Basic idea:
Given that X=(x1,x2) is observed, which class yk has the highest probability?
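Written out, this question is the usual naive Bayes decision rule. The formula below is the standard textbook form, stated under the "naive" assumption that x1 and x2 are conditionally independent given the class; the notation is chosen here for illustration.

```latex
\hat{y}
  = \arg\max_{y_k} P(y_k \mid x_1, x_2)
  = \arg\max_{y_k} \frac{P(y_k)\, P(x_1 \mid y_k)\, P(x_2 \mid y_k)}{P(x_1, x_2)}
  = \arg\max_{y_k} P(y_k)\, P(x_1 \mid y_k)\, P(x_2 \mid y_k)
```

The denominator does not depend on y_k, so it can be dropped from the comparison. GaussianNB models each P(x_i | y_k) as a normal distribution whose mean and variance are estimated per class from the training data.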


scikit-learn implementation

from sklearn.naive_bayes import GaussianNB
# Build the classifier object
clf = GaussianNB()
clf
GaussianNB()
# Train
clf.fit(iris_x_train,iris_y_train)
GaussianNB()
# Predict
res=clf.predict(iris_x_test)
print(res)
print(iris_y_test.values)
[0 1 0 0 1 2 1 1 2 1 2 0 2 2 1 0 0 2 0 0 2 2 2 1 1 1 1 2 1 0]
[0 1 0 0 1 2 1 1 2 1 2 0 2 2 1 0 0 2 0 0 2 2 2 1 1 1 1 2 1 0]
# Evaluate
accuracy = clf.score(iris_x_test,iris_y_test)
print("Prediction accuracy: {:.0%}".format(accuracy))
Prediction accuracy: 100%
# Visualization
draw(clf)

[Figure: decision regions of the Gaussian naive Bayes classifier over petal_length and petal_width, with the iris data points overlaid]

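Decision trees

Basic idea
The CART algorithm: at each step, use a single feature to split the data into two groups that are as pure as possible, then recurse on each group.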
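"As pure as possible" is commonly measured with the Gini impurity, which is also the default criterion of DecisionTreeClassifier. The sketch below searches for the best threshold on one feature. It illustrates a single split only, not scikit-learn's implementation; gini and best_threshold are hypothetical helpers, and the sample rows are petal lengths and class codes taken from the tables in this chapter.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0 for a pure group, larger for a more mixed one."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_threshold(values, labels):
    """Try a threshold between each pair of adjacent distinct values and keep
    the one with the lowest size-weighted Gini impurity of the two sides."""
    pairs = sorted(zip(values, labels))
    best = (float("inf"), None)
    for (left_value, _), (right_value, _) in zip(pairs, pairs[1:]):
        if left_value == right_value:
            continue
        threshold = (left_value + right_value) / 2
        left = [label for value, label in pairs if value <= threshold]
        right = [label for value, label in pairs if value > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        best = min(best, (score, threshold))
    return best

petal_length = [1.4, 1.3, 1.5, 4.5, 4.0, 4.7, 5.5, 6.1, 5.6]
species_code = [0, 0, 0, 1, 1, 1, 2, 2, 2]

score, threshold = best_threshold(petal_length, species_code)
print(score, threshold)
```

On this sample the chosen threshold of 2.75 puts every class 0 row on one side; a full tree would then recurse on the other side to separate classes 1 and 2.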

from sklearn.tree import DecisionTreeClassifier
# Build the classifier object
clf=DecisionTreeClassifier()

# Train
clf.fit(iris_x_train,iris_y_train)

# Predict
res=clf.predict(iris_x_test)
print(res)
print(iris_y_test.values)

# Evaluate
accuracy = clf.score(iris_x_test,iris_y_test)
print("Prediction accuracy: {:.0%}".format(accuracy))
# Visualization
draw(clf)

Logistic regression


# scikit-learn implementation
from sklearn.linear_model import LogisticRegression


# Build the classifier object
clf=LogisticRegression(solver="saga",max_iter=1000)

# Train
clf.fit(iris_x_train,iris_y_train)

# Predict
res=clf.predict(iris_x_test)
print(res)
print(iris_y_test.values)

# Evaluate
accuracy = clf.score(iris_x_test,iris_y_test)
print("Prediction accuracy: {:.0%}".format(accuracy))

# Visualization
draw(clf)

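Support vector machines

Basic idea

Take binary classification as an example and assume the two classes are perfectly separable:

find a hyperplane that separates the two classes completely while maximizing the distance from the closest points to the hyperplane.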
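In symbols, this is the standard hard-margin formulation. The notation is an assumption chosen here for illustration: labels y_i take values in {-1, +1}, the hyperplane is w^T x + b = 0, and there are N training points.

```latex
\min_{w,\,b}\ \frac{1}{2}\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i \left( w^{\top} x_i + b \right) \ge 1, \qquad i = 1, \dots, N
```

Under these constraints the closest points lie at distance 1 / ||w|| from the hyperplane, so minimizing ||w|| maximizes that distance. When the classes are not perfectly separable, the constraints are relaxed with slack variables (the soft-margin form).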

from sklearn.svm import SVC
# Build the classifier object
clf=SVC()

# Train
clf.fit(iris_x_train,iris_y_train)

# Predict
res=clf.predict(iris_x_test)
print(res)
print(iris_y_test.values)

# Evaluate
accuracy = clf.score(iris_x_test,iris_y_test)
print("Prediction accuracy: {:.0%}".format(accuracy))

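Ensemble methods: random forest

Basic idea

From a training set of m samples, draw m samples at random with replacement to form one sampled set; repeat this to obtain n sampled sets.

Train one weak classifier on each of the n sampled sets; the weak classifiers are usually decision trees or neural networks.

Combine the n weak classifiers into a strong classifier.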
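The sampling and combining steps can be sketched in plain Python. This is a toy illustration with hypothetical helpers (bootstrap_samples, majority_vote) and no actual model training. A real random forest additionally considers only a random subset of the features at each split.

```python
import random
from collections import Counter

def bootstrap_samples(rows, n_groups, seed=0):
    """Draw n_groups sampled sets, each of len(rows) rows picked with replacement."""
    rng = random.Random(seed)
    m = len(rows)
    return [[rng.choice(rows) for _ in range(m)] for _ in range(n_groups)]

def majority_vote(predictions):
    """Combine the weak classifiers' predictions for one point."""
    return Counter(predictions).most_common(1)[0][0]

rows = list(range(10))                 # stand-ins for 10 training rows
groups = bootstrap_samples(rows, n_groups=3)
for group in groups:
    # Sampling with replacement repeats some rows and leaves others out
    print(sorted(group), "distinct rows:", len(set(group)))

# Three weak classifiers voted 2, 1, 2 for some point: the ensemble answers 2
print(majority_vote([2, 1, 2]))
```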

from sklearn.ensemble import RandomForestClassifier
# Build the classifier object
clf=RandomForestClassifier()

# Train
clf.fit(iris_x_train,iris_y_train)

# Predict
res=clf.predict(iris_x_test)
print(res)
print(iris_y_test.values)

# Evaluate
accuracy = clf.score(iris_x_test,iris_y_test)
print("Prediction accuracy: {:.0%}".format(accuracy))

Ensemble methods: AdaBoost (adaptive boosting)


# scikit-learn implementation
from sklearn.ensemble import AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB

# Build the classifier object
clf=AdaBoostClassifier()

# Train
clf.fit(iris_x_train,iris_y_train)

# Predict
res=clf.predict(iris_x_test)
print(res)
print(iris_y_test.values)

# Evaluate
accuracy = clf.score(iris_x_test,iris_y_test)
print("Prediction accuracy: {:.0%}".format(accuracy))

Ensemble methods: gradient boosted decision trees (GBDT)


# scikit-learn implementation

from sklearn.ensemble import GradientBoostingClassifier

# Build the classifier object
clf=GradientBoostingClassifier()

# Train
clf.fit(iris_x_train,iris_y_train)

# Predict
res=clf.predict(iris_x_test)
print(res)
print(iris_y_test.values)

# Evaluate
accuracy = clf.score(iris_x_test,iris_y_test)
print("Prediction accuracy: {:.0%}".format(accuracy))

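Heavy hitters

  1. xgboost

    GBDT approximates the loss using only the negative gradient, that is, a first-order Taylor expansion.
    XGBoost uses a second-order Taylor expansion of the loss, which approximates it more accurately and tends to converge faster.
  2. lightgbm

    Proposed by Microsoft: a fast, distributed, high-performance gradient boosting framework based on decision tree algorithms. It is faster still.
  3. stacking

    Stacking, also called model fusion.

    Several simple models are trained first; a second-level learner is then trained on the predictions of those first-level models.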
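The difference between the first-order and second-order expansions mentioned under xgboost can be written out as follows. The notation is an assumption introduced here for illustration: F_{t-1} is the model after t-1 rounds, f_t is the tree added in round t, and g_i and h_i are the first and second derivatives of the loss L with respect to the current prediction for sample i.

```latex
% First-order (gradient) approximation, as used by GBDT
L\bigl(y_i,\, F_{t-1}(x_i) + f_t(x_i)\bigr)
  \approx L\bigl(y_i,\, F_{t-1}(x_i)\bigr) + g_i\, f_t(x_i)

% Second-order approximation, as used by XGBoost
L\bigl(y_i,\, F_{t-1}(x_i) + f_t(x_i)\bigr)
  \approx L\bigl(y_i,\, F_{t-1}(x_i)\bigr) + g_i\, f_t(x_i) + \tfrac{1}{2}\, h_i\, f_t(x_i)^{2}

% where
g_i = \frac{\partial L\bigl(y_i, F(x_i)\bigr)}{\partial F(x_i)}\bigg|_{F = F_{t-1}},
\qquad
h_i = \frac{\partial^{2} L\bigl(y_i, F(x_i)\bigr)}{\partial F(x_i)^{2}}\bigg|_{F = F_{t-1}}
```

The extra h_i term carries curvature information, which is what allows each new tree to take a better-scaled step.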