Methods for Optimized Selection of Data Features (feature_selection)

博集华仿

December 14, 2019, 20:01

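Abstract: The data a learner is trained on may contain too many features. Removing the unnecessary features and keeping the necessary ones can both speed up training and improve the model's predictive performance. This is the job of feature selection.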

00 Variance method

import numpy as np
import sklearn.feature_selection as selection
x=np.array([[100,1,2,3],[100.5,4,5,6],[101,7,8,9]])

selectvar=selection.VarianceThreshold(threshold=1)
selectvar.fit(x)
x_var=selectvar.transform(x)

The one column whose variance is below 1 is removed, so the number of features drops from 4 to 3.

selectvar.variances_
Out[8]: array([0.16666667, 6. , 6. , 6. ])
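The variances above can be checked by hand. The following is a minimal NumPy sketch, not scikit-learn's own code: it assumes the population variance (ddof=0) is used and that a column is kept only when its variance exceeds the threshold. The names `keep_mask` and `threshold` are illustrative.

```python
import numpy as np

x = np.array([[100, 1, 2, 3], [100.5, 4, 5, 6], [101, 7, 8, 9]])

# Population variance of each column (NumPy's default, ddof=0)
variances = x.var(axis=0)

# Assumption: keep only the columns whose variance exceeds the threshold
threshold = 1
keep_mask = variances > threshold
x_var = x[:, keep_mask]

print(variances)
print(x_var.shape)
```

Only the first column varies by less than the threshold, which matches the `variances_` output shown in the session.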

01 Scoring-function method

import numpy as np
import sklearn.feature_selection as selection
x=np.array([[1,2,3,4,5],[5,4,3,2,1],[3,3,3,3,3],[1,1,1,1,1]])
y=[0,1,0,1]

selectkbest=selection.SelectKBest(selection.f_classif,k=4)
selectkbest.fit(x,y)
x_kbest=selectkbest.transform(x)

The second column of x is removed, so the 5 features become 4.

List the score of each column of x:

selectkbest.scores_
Out[24]: array([0.2, 0. , 1. , 8. , 9. ], dtype=float32)

List the column indices of the retained features:

selectkbest.get_support(True)
Out[25]: array([0, 2, 3, 4], dtype=int64)
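To see where these scores come from, here is a sketch that computes a one-way ANOVA F statistic per column with NumPy alone. The helper name `anova_f` is hypothetical and this is a hand-written illustration, not the library implementation; it assumes every column has nonzero within-class variation, which holds for this toy data.

```python
import numpy as np

x = np.array([[1, 2, 3, 4, 5],
              [5, 4, 3, 2, 1],
              [3, 3, 3, 3, 3],
              [1, 1, 1, 1, 1]], dtype=float)
y = np.array([0, 1, 0, 1])


def anova_f(x, y):
    """One-way ANOVA F statistic for each column of x, grouped by label y."""
    classes = np.unique(y)
    n_samples, n_classes = x.shape[0], classes.size
    grand_mean = x.mean(axis=0)
    ss_between = np.zeros(x.shape[1])
    ss_within = np.zeros(x.shape[1])
    for c in classes:
        group = x[y == c]
        group_mean = group.mean(axis=0)
        # Variation of the class mean around the overall mean
        ss_between += group.shape[0] * (group_mean - grand_mean) ** 2
        # Variation of the samples around their own class mean
        ss_within += ((group - group_mean) ** 2).sum(axis=0)
    return (ss_between / (n_classes - 1)) / (ss_within / (n_samples - n_classes))


f_scores = anova_f(x, y)
# Indices of the 4 highest-scoring columns, in ascending column order
kept = np.sort(np.argsort(f_scores)[-4:])
print(f_scores)
print(kept)
```

The second column scores 0 because both classes have the same mean there, which is why it is the one dropped when k=4.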

Switch to a different scoring function:

selectkbest=selection.SelectKBest(selection.chi2,k=4)
selectkbest.fit(x,y)
x_kbest=selectkbest.transform(x)
selectkbest.scores_
Out[26]: array([0.4, 0. , 0.4, 1.6, 3.6])
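The chi-squared scores can also be reproduced by hand. The sketch below assumes the usual construction for this kind of test: the observed value is the per-class sum of each feature, and the expected value is the class frequency times the feature total. It is an illustration in NumPy, not the library's code, and it relies on the features being non-negative.

```python
import numpy as np

x = np.array([[1, 2, 3, 4, 5],
              [5, 4, 3, 2, 1],
              [3, 3, 3, 3, 3],
              [1, 1, 1, 1, 1]], dtype=float)
y = np.array([0, 1, 0, 1])

classes = np.unique(y)

# Observed: sum of each feature within each class, shape (n_classes, n_features)
observed = np.vstack([x[y == c].sum(axis=0) for c in classes])

# Expected: class frequency times the overall feature total
class_prob = np.array([(y == c).mean() for c in classes])
expected = np.outer(class_prob, x.sum(axis=0))

chi2_scores = ((observed - expected) ** 2 / expected).sum(axis=0)
print(chi2_scores)
```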


selectpercen=selection.SelectPercentile(selection.chi2,percentile=60)
selectpercen.fit(x,y)
selectpercen.scores_
Out[36]: array([0.4, 0. , 0.4, 1.6, 3.6])

x_kbest=selectpercen.transform(x)
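As a small usage sketch, the same selection can be written with `fit_transform` and the retained column indices read back with `get_support`. The variable names `selector`, `x_pct` and `kept_columns` are illustrative. With 5 features and percentile=60, the expectation is that 3 columns survive; which of the tied columns is kept depends on the library's tie handling, so it is printed rather than assumed.

```python
import numpy as np
from sklearn.feature_selection import SelectPercentile, chi2

x = np.array([[1, 2, 3, 4, 5],
              [5, 4, 3, 2, 1],
              [3, 3, 3, 3, 3],
              [1, 1, 1, 1, 1]])
y = [0, 1, 0, 1]

# Keep the top 60 percent of features ranked by chi-squared score
selector = SelectPercentile(chi2, percentile=60)
x_pct = selector.fit_transform(x, y)

# Column indices of the features that were kept
kept_columns = selector.get_support(indices=True)
print(kept_columns)
print(x_pct.shape)
```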


02 Weight-based method

Load the iris dataset bundled with sklearn:

import numpy as np
from sklearn import datasets, svm
import sklearn.feature_selection as selection
iris=datasets.load_iris()
x=iris.data
y=iris.target

selectrfe=selection.RFE(estimator=svm.LinearSVC(max_iter=10000),n_features_to_select=2)
selectrfe.fit(x,y)
x_rfe=selectrfe.transform(x)


selectrfe.support_
Out[149]: array([False, True, False, True])

selectrfe.ranking_
Out[150]: array([3, 1, 2, 1])
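The `ranking_` array is easier to read once it is mapped back to feature names. This sketch hard-codes the ranking printed above and the four names reported by `load_iris().feature_names`; rank 1 marks a selected feature, and a larger rank means the feature was eliminated earlier.

```python
import numpy as np

# Feature names as reported by datasets.load_iris().feature_names
feature_names = np.array(["sepal length (cm)", "sepal width (cm)",
                          "petal length (cm)", "petal width (cm)"])

# Copied from the selectrfe.ranking_ output in the session above
ranking = np.array([3, 1, 2, 1])

# Rank 1 means the feature was selected
selected = feature_names[ranking == 1]

# Remaining features, listed from earliest to latest elimination
order = np.argsort(-ranking, kind="stable")
eliminated = [str(feature_names[i]) for i in order if ranking[i] > 1]

print(selected)
print(eliminated)
```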

selectrfe.score(x,y)
Out[151]: 0.9466666666666667

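Feature selection does not necessarily improve predictive performance. For example, keeping all the features gives a higher score here (both scores are measured on the training data):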

classi=svm.LinearSVC(max_iter=10000)
classi.fit(x,y)
classi.score(x,y)
Out[152]: 0.9666666666666667
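Because both scores above are computed on the data used for fitting, a held-out comparison may be more informative. The sketch below is one possible way to do it, not the author's procedure: it wraps RFE and the classifier in a pipeline so that selection is redone inside each fold, and uses 5-fold cross-validation (the fold count is an arbitrary choice). No result is claimed here; the numbers depend on the library version.

```python
from sklearn import datasets
from sklearn.feature_selection import RFE
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

x, y = datasets.load_iris(return_X_y=True)

# Baseline: the classifier trained on all four features
full_model = LinearSVC(max_iter=10000)

# RFE keeps two features, then a fresh classifier is trained on them
rfe_model = make_pipeline(
    RFE(estimator=LinearSVC(max_iter=10000), n_features_to_select=2),
    LinearSVC(max_iter=10000),
)

full_scores = cross_val_score(full_model, x, y, cv=5)
rfe_scores = cross_val_score(rfe_model, x, y, cv=5)

print(full_scores.mean())
print(rfe_scores.mean())
```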

selectrfecv=selection.RFECV(estimator=svm.LinearSVC(max_iter=10000),cv=3)
selectrfecv.fit(x,y)
x_rfecv=selectrfecv.transform(x)
selectrfecv.support_
Out[153]: array([ True, True, True, True])

selectrfecv.ranking_
Out[154]: array([1, 1, 1, 1])

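# grid_scores_ was removed in scikit-learn 1.2; on newer versions use selectrfecv.cv_results_['mean_test_score']
selectrfecv.grid_scores_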
Out[155]: array([0.91421569, 0.94689542, 0.95383987, 0.96691176])
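The four cross-validated scores explain why RFECV kept every feature. Assuming the default step of 1, so that entry i corresponds to i+1 retained features, the best feature count is the position of the maximum. The sketch uses the scores printed above rather than refitting the model.

```python
import numpy as np

# Copied from the grid_scores_ output in the session above
grid_scores = np.array([0.91421569, 0.94689542, 0.95383987, 0.96691176])

# Assumption: entry i is the score obtained with i + 1 features
n_features = np.arange(1, grid_scores.size + 1)
best_n = int(n_features[np.argmax(grid_scores)])

print(best_n)  # 4
```

The score rises with every added feature, so all four are kept, which agrees with `support_` and `ranking_`.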

03 Threshold method

import numpy as np
from sklearn import datasets,svm
import sklearn.feature_selection as selection
digits=datasets.load_digits()
x=digits.data
y=digits.target
selectfm=selection.SelectFromModel(
    estimator=svm.LinearSVC(penalty='l1',dual=False,max_iter=10000),
    threshold='mean')
selectfm.fit(x,y)
selectfm.transform(x)
selectfm.get_support(True)
Out[164]:
array([ 2, 3, 4, 5, 6, 9, 14, 16, 18, 19, 20, 21, 22, 24, 25, 30, 33,
36, 38, 41, 42, 43, 44, 45, 51, 54, 55, 57, 58], dtype=int64)

selectfm.threshold_
Out[165]: 0.8611951902929874
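The threshold can be traced back to the fitted coefficients. The sketch below rests on an assumption about the default behavior: for a multi-class linear model, each feature's importance is the L1 norm of its coefficients across classes, `threshold='mean'` is the mean of those importances, and a feature is kept when its importance is at least the threshold. The names `manual_threshold` and `manual_mask` are illustrative, and the two printed checks show whether that assumption holds on the installed version.

```python
import numpy as np
from sklearn import datasets
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC

x, y = datasets.load_digits(return_X_y=True)

selector = SelectFromModel(
    estimator=LinearSVC(penalty="l1", dual=False, max_iter=10000),
    threshold="mean",
).fit(x, y)

# Coefficients of the fitted estimator, shape (n_classes, n_features)
coef = selector.estimator_.coef_

# Assumed importance: L1 norm of each feature's coefficients over classes
importances = np.linalg.norm(coef, ord=1, axis=0)
manual_threshold = importances.mean()
manual_mask = importances >= manual_threshold

print(np.isclose(manual_threshold, selector.threshold_))
print(np.array_equal(manual_mask, selector.get_support()))
```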
