
Titanic Introduction to Ensembling or Stacking

Source: Internet. Collected by: 自由互联. Published: 2021-06-16

Document: Titanic Introduction to Ensembling o...
Link: http://note.youdao.com/noteshare?id=b9faf92632044f057a8c948789bbe69c&sub=60829C84153046DBB45DCBC7B764E1CA

kaggle

Data processing and cleaning are covered in Titanic best working Classifier.

Data

train.head()
   Survived  Pclass  Sex  Age  Parch  Fare  Embarked  Name_length  Has_Cabin  FamilySize  IsAlone  Title
0         0       3    1    1      0     0         0           23          0           2        0      1
1         1       1    0    2      0     3         1           51          1           2        0      3
2         1       3    0    1      0     1         0           22          0           1        1      2
3         1       1    0    2      0     3         0           44          1           2        0      3
4         0       3    1    2      0     1         0           24          0           1        1      1

Building the model templates

1. Class template

# Some useful parameters which will come in handy later on
from sklearn.model_selection import KFold

ntrain = train.shape[0]
ntest = test.shape[0]
SEED = 0    # for reproducibility
NFOLDS = 5  # set folds for out-of-fold prediction
# scikit-learn >= 0.18 API (older releases used KFold(ntrain, n_folds=NFOLDS, random_state=SEED));
# random_state only takes effect when shuffle=True
kf = KFold(n_splits=NFOLDS, shuffle=True, random_state=SEED)

# Class to extend the Sklearn classifier
class SklearnHelper(object):
    def __init__(self, clf, seed=0, params=None):
        params = dict(params or {})  # copy, so the caller's dict is not modified
        params['random_state'] = seed
        self.clf = clf(**params)

    def train(self, x_train, y_train):
        self.clf.fit(x_train, y_train)

    def predict(self, x):
        return self.clf.predict(x)

    def fit(self, x, y):
        return self.clf.fit(x, y)

    # relative feature importances
    def feature_importances(self, x, y):
        print(self.clf.fit(x, y).feature_importances_)

# Class to extend XGboost classifier

2. Parameter templates

Reference links

sklearn quick reference

RandomForest

AdaBoostClassifier

ExtraTreesClassifier

# Random Forest parameters
rf_params = {
    'n_jobs': -1,           # number of processors the estimator may use; -1 means no limit, 1 means a single processor
    'n_estimators': 500,    # number of weak learners; too few tends to underfit, too many costs more computation; the default is 100
    'warm_start': True,
    # 'max_features': 0.2,  # maximum number of features per split; the default (named "auto" in older scikit-learn releases) is usually fine
    'max_depth': 6,         # maximum depth; if omitted, subtree depth is not limited while the tree is built; values of 10-100 are common
    'min_samples_leaf': 2,  # minimum samples per leaf, default 1; an integer count or a fraction of all samples;
                            # can be ignored for small datasets, worth increasing for very large ones
    'max_features': 'sqrt',
    'verbose': 0
    # oob_score: whether to use out-of-bag samples to evaluate the model. Default False. Setting it to True is
    #   recommended, because the out-of-bag score reflects how well the fitted model generalizes.
    # min_weight_fraction_leaf: default 0, i.e. sample weights are ignored. It matters when sample weights are
    #   introduced, for example with many missing values or a strongly imbalanced class distribution.
}

# Extra Trees parameters
et_params = {
    'n_jobs': -1,
    'n_estimators': 500,
    # 'max_features': 0.5,
    'max_depth': 8,
    'min_samples_leaf': 2,
    'verbose': 0  # logging: 0 = no output, 1 = brief output, 2 = more detailed output
}

# AdaBoost parameters
ada_params = {
    'n_estimators': 500,    # number of boosting rounds, default 50; too large tends to overfit, too small tends to underfit
    'learning_rate': 0.75   # shrinks the contribution of each classifier, default 1; too large may overshoot the optimum,
                            # too small converges slowly; it trades off against n_estimators
                            # (a smaller learning rate usually needs more estimators)
    # other parameters: random_state, algorithm, loss
}

# Gradient Boosting parameters
gb_params = {
    'n_estimators': 500,    # number of base decision trees (default 100); GBDT is fairly robust to overfitting,
                            # so a larger value usually helps, within the computation budget
    # 'max_features': 0.2,
    'max_depth': 5,
    'min_samples_leaf': 2,
    'verbose': 0
}

# Support Vector Classifier parameters
svc_params = {
    'kernel': 'linear',
    # kernel: a string naming the kernel function.
    #   'linear': linear kernel, K(x, z) = x . z
    #   'poly': polynomial kernel, K(x, z) = (gamma * (x . z) + r)^p, with p set by degree, gamma by gamma, r by coef0
    #   'rbf': the default, Gaussian kernel, K(x, z) = exp(-gamma * ||x - z||^2), with gamma set by gamma
    #   'sigmoid': K(x, z) = tanh(gamma * (x . z) + r), with gamma set by gamma, r by coef0
    #   'precomputed': a kernel matrix is supplied; a callable that computes the kernel matrix can also be passed
    'C': 0.025  # a float, the penalty coefficient; the larger C is, the heavier the penalty on misclassification
}
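The kernel formulas listed in the svc_params comments can be checked numerically. The sketch below is only an illustration: the two vectors and the values of gamma, r and p are arbitrary assumptions, and the helpers from sklearn.metrics.pairwise are used purely as a cross-check.

```python
import numpy as np
from sklearn.metrics.pairwise import (linear_kernel, polynomial_kernel,
                                      rbf_kernel, sigmoid_kernel)

# Two arbitrary sample vectors (illustrative values, not Titanic data)
x = np.array([[1.0, 2.0]])
z = np.array([[0.5, -1.0]])
gamma, r, p = 0.5, 1.0, 3  # gamma, coef0 and degree, chosen arbitrarily

dot = (x @ z.T).item()            # inner product x . z
sq_dist = np.sum((x - z) ** 2)    # squared distance ||x - z||^2

k_linear = dot                           # K(x, z) = x . z
k_poly = (gamma * dot + r) ** p          # K(x, z) = (gamma * (x . z) + r)^p
k_rbf = np.exp(-gamma * sq_dist)         # K(x, z) = exp(-gamma * ||x - z||^2)
k_sigmoid = np.tanh(gamma * dot + r)     # K(x, z) = tanh(gamma * (x . z) + r)

# Compare the hand-written formulas with scikit-learn's implementations
print(np.isclose(k_linear, linear_kernel(x, z)[0, 0]))
print(np.isclose(k_poly, polynomial_kernel(x, z, degree=p, gamma=gamma, coef0=r)[0, 0]))
print(np.isclose(k_rbf, rbf_kernel(x, z, gamma=gamma)[0, 0]))
print(np.isclose(k_sigmoid, sigmoid_kernel(x, z, gamma=gamma, coef0=r)[0, 0]))
```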

3. Creating the model objects

from sklearn.ensemble import (RandomForestClassifier, AdaBoostClassifier,
                              GradientBoostingClassifier, ExtraTreesClassifier)
from sklearn.svm import SVC

rf = SklearnHelper(clf=RandomForestClassifier, seed=SEED, params=rf_params)

4. Getting each model's predictions

train: out-of-fold predictions obtained from the K folds
test: predictions averaged over the K folds
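The two kinds of prediction above can be sketched as follows. This is a minimal illustration, not the notebook's own code: the function name get_oof, its signature, and the synthetic data are assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold


def get_oof(clf, x_train, y_train, x_test, n_folds=5, seed=0):
    """Return out-of-fold train predictions and fold-averaged test predictions."""
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=seed)
    oof_train = np.zeros(x_train.shape[0])
    oof_test_folds = np.empty((n_folds, x_test.shape[0]))
    for i, (train_idx, valid_idx) in enumerate(kf.split(x_train)):
        # Fit on K-1 folds, predict the held-out fold and the whole test set
        clf.fit(x_train[train_idx], y_train[train_idx])
        oof_train[valid_idx] = clf.predict(x_train[valid_idx])
        oof_test_folds[i] = clf.predict(x_test)
    # Average the K test predictions, so the values may be fractional
    oof_test = oof_test_folds.mean(axis=0)
    return oof_train.reshape(-1, 1), oof_test.reshape(-1, 1)


# Synthetic stand-in data (assumption: not the Titanic set)
rng = np.random.RandomState(0)
x_train_demo = rng.rand(40, 3)
y_train_demo = (x_train_demo[:, 0] > 0.5).astype(int)
x_test_demo = rng.rand(10, 3)

rf_demo = RandomForestClassifier(n_estimators=20, random_state=0)
rf_oof_train, rf_oof_test = get_oof(rf_demo, x_train_demo, y_train_demo, x_test_demo)
print(rf_oof_train.shape, rf_oof_test.shape)
```

Each training row is predicted by a model that never saw it, which is what makes these columns safe to use as inputs for a second-level model.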

Selecting the best model

1. Relative importance of each feature in each model

feature_dataframe = pd.DataFrame({
    'features': cols,
    'Random Forest feature importances': rf_features,
    'Extra Trees feature importances': et_features,
    'AdaBoost feature importances': ada_features,
    'Gradient Boost feature importances': gb_features
})

2. Mean of the relative importances

# axis=1 computes the mean row-wise; numeric_only=True skips the text column 'features'
feature_dataframe['mean'] = feature_dataframe.mean(axis=1, numeric_only=True)
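The row-wise mean can be tried out on a tiny hand-made frame. The importance values below are made up for illustration, and numeric_only=True is passed on the assumption that a recent pandas version is in use, where a text column would otherwise raise an error.

```python
import pandas as pd

# Made-up importances for three features and two models (illustrative only)
feature_dataframe_demo = pd.DataFrame({
    'features': ['Pclass', 'Sex', 'Age'],
    'Random Forest feature importances': [0.2, 0.5, 0.3],
    'Extra Trees feature importances': [0.4, 0.4, 0.2],
})

# Mean across the model columns for each feature (row-wise)
feature_dataframe_demo['mean'] = feature_dataframe_demo.mean(axis=1, numeric_only=True)
print(feature_dataframe_demo[['features', 'mean']])
```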

3. Combining the predictions obtained from train and test

x_train = np.concatenate((et_oof_train, rf_oof_train, ada_oof_train, gb_oof_train, svc_oof_train), axis=1)
x_test = np.concatenate((et_oof_test, rf_oof_test, ada_oof_test, gb_oof_test, svc_oof_test), axis=1)  # contains fractional values

4. Using XGBoost to get the final prediction

xgboost

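import xgboost as xgb

gbm = xgb.XGBClassifier(
    # booster: the model used at each iteration; gbtree (tree-based, the default) or gblinear (linear model)
    # learning_rate=0.02,
    n_estimators=2000,
    max_depth=4,            # the larger max_depth is, the more specific and local the patterns the model learns; default 6
    min_child_weight=2,
    # gamma=1,
    gamma=0.9,              # a node is split only if the split lowers the loss; gamma is the minimum loss reduction
                            # required. Larger values make the algorithm more conservative. It depends on the loss
                            # function, so it needs tuning.
    subsample=0.8,          # default 1; same as subsample in GBM: the fraction of rows sampled for each tree.
                            # Lower values are more conservative and help avoid overfitting, but too low a value
                            # may underfit. Typical values: 0.5-1
    colsample_bytree=0.8,   # default 1; similar to max_features in GBM: the fraction of columns (features)
                            # sampled for each tree. Typical values: 0.5-1
    objective='binary:logistic',
    # objective: the loss function to minimize.
    #   binary:logistic  logistic regression for binary classification, outputs probabilities (not classes)
    #   multi:softmax    multiclass with softmax, outputs classes (not probabilities); also requires num_class
    #   multi:softprob   same as multi:softmax, but outputs the probability of each class for every sample
    n_jobs=-1,              # number of threads (called nthread in older xgboost releases); -1 uses all cores
    scale_pos_weight=1      # default 1; with strongly imbalanced classes, a positive value can speed up convergence
    # reg_alpha: L1 regularization term on the weights (similar to Lasso regression); useful in very high
    #   dimensions, where it makes the algorithm faster
).fit(x_train, y_train)
predictions = gbm.predict(x_test)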

5. Generating the submission file

StackingSubmission = pd.DataFrame({
    'PassengerId': PassengerId,
    'Survived': predictions
})
StackingSubmission.to_csv("StackingSubmission.csv", index=False)

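Splitting the dataset

Dataset splitting in sklearn

K-fold validation

n_splits: the number of equal parts to split the data into

shuffle: whether to shuffle the data before splitting

① If False, the split is identical on every run, the same effect as fixing random_state to an integer

② If True, the samples are shuffled and drawn at random, so the split differs on every run (unless random_state is fixed)

random_state: the random seed

Methods:

① get_n_splits(X=None, y=None, groups=None): returns the value of the n_splits parameter

② split(X, y=None, groups=None): splits the dataset into training and test sets and returns a generator of indices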
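The two methods listed above can be seen on a toy array. This is a minimal sketch with made-up data; the variable names are chosen for the example only.

```python
import numpy as np
from sklearn.model_selection import KFold

X_demo = np.arange(12).reshape(6, 2)  # six samples, two features (toy data)
kf_demo = KFold(n_splits=3)           # shuffle defaults to False

n_splits_demo = kf_demo.get_n_splits(X_demo)

# split() yields (train indices, test indices) for each fold
folds_demo = [(train_idx.tolist(), test_idx.tolist())
              for train_idx, test_idx in kf_demo.split(X_demo)]
for train_idx, test_idx in folds_demo:
    print("train:", train_idx, "test:", test_idx)
```

Without shuffling, the test folds are consecutive blocks of indices, and every sample lands in exactly one test fold.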

Work through a small example with different parameter values and observe the results

① Set shuffle=False and run twice: the two results are identical

import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.array([[1, 2, 3, 4],
              [11, 12, 13, 14],
              [21, 22, 23, 24],
              [31, 32, 33, 34],
              [41, 42, 43, 44],
              [51, 52, 53, 54],
              [61, 62, 63, 64],
              [71, 72, 73, 74]])
y = np.array([1, 1, 0, 0, 1, 1, 0, 0])

# random_state is omitted: recent scikit-learn versions reject it when shuffle=False
stratified_folder = StratifiedKFold(n_splits=4, shuffle=False)
for train_index, test_index in stratified_folder.split(X, y):
    print("Stratified Train Index:", train_index)
    print("Stratified Test Index:", test_index)
    print("Stratified y_train:", y[train_index])
    print("Stratified y_test:", y[test_index], '\n')

Output:

Stratified Train Index: [1 3 4 5 6 7]
Stratified Test Index: [0 2]
Stratified y_train: [1 0 1 1 0 0]
Stratified y_test: [1 0]

Stratified Train Index: [0 2 4 5 6 7]
Stratified Test Index: [1 3]
Stratified y_train: [1 0 1 1 0 0]
Stratified y_test: [1 0]

Stratified Train Index: [0 1 2 3 5 7]
Stratified Test Index: [4 6]
Stratified y_train: [1 1 0 0 1 0]
Stratified y_test: [1 0]

Stratified Train Index: [0 1 2 3 4 6]
Stratified Test Index: [5 7]
Stratified y_train: [1 1 0 0 1 0]
Stratified y_test: [1 0]
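The complementary case, shuffle=True, can be sketched in the same way. The helper stratified_test_folds and the toy arrays are assumptions for this example; it illustrates that a fixed random_state makes a shuffled split repeatable while each test fold still holds one sample of each class.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

X_demo = np.arange(16).reshape(8, 2)          # toy features
y_demo = np.array([1, 1, 0, 0, 1, 1, 0, 0])   # same labels as the example above


def stratified_test_folds(seed):
    """Return the test indices of each fold for a shuffled, seeded split."""
    skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=seed)
    return [test_idx.tolist() for _, test_idx in skf.split(X_demo, y_demo)]


run_a = stratified_test_folds(0)
run_b = stratified_test_folds(0)
print(run_a == run_b)  # the same seed reproduces the same shuffled split

# Stratification is preserved: each test fold contains one 0 and one 1
labels_per_fold = [sorted(y_demo[fold].tolist()) for fold in run_a]
print(labels_per_fold)
```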

Code notes

feature_importances_

clf.fit(x, y).feature_importances_
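A small sketch of reading feature_importances_ after fitting. The data is synthetic and built so that only the first feature determines the label; the estimator settings are arbitrary choices for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (assumption): the label depends on feature 0 only
rng = np.random.RandomState(0)
X_demo = rng.rand(200, 3)
y_demo = (X_demo[:, 0] > 0.5).astype(int)

clf_demo = RandomForestClassifier(n_estimators=50, random_state=0)
importances = clf_demo.fit(X_demo, y_demo).feature_importances_

print(importances)        # one value per feature
print(importances.sum())  # importances are normalized to sum to 1
```

fit() returns the fitted estimator itself, which is why the attribute can be chained directly onto the call.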

concatenate: the default is axis=0

x_train = np.concatenate((et_oof_train, rf_oof_train, ada_oof_train, gb_oof_train, svc_oof_train), axis=1)
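The difference between the default axis=0 and axis=1 shows up in the resulting shapes. The two column vectors below are made-up stand-ins for the out-of-fold prediction columns of two models.

```python
import numpy as np

# Two (3, 1) prediction columns (illustrative values)
a = np.array([[0.0], [1.0], [1.0]])
b = np.array([[1.0], [1.0], [0.0]])

stacked_rows = np.concatenate((a, b))          # default axis=0: rows are appended
stacked_cols = np.concatenate((a, b), axis=1)  # axis=1: columns sit side by side

print(stacked_rows.shape)  # (6, 1)
print(stacked_cols.shape)  # (3, 2)
```

With axis=1 each model contributes one feature column per sample, which is the layout a second-level model expects.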

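Plotting

figure parameters

colormap color bars

heatmap

import matplotlib.pyplot as plt
import seaborn as sns  # high-level plotting

colormap = plt.cm.RdBu  # color map
plt.figure(figsize=(16, 10))  # figsize: width and height in inches
plt.title('Pearson Correlation of Features', y=1.05, size=15)
sns.heatmap(
    train.astype(float).corr(),  # pairwise correlation between all columns
    linewidths=0.1,
    cmap=colormap,  # in older seaborn releases the default was the cubehelix map (sequential data) or RdBu_r (diverging data)
    annot=True      # default False; when True, the value is written into each cell of the heatmap
)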

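pairplot

sns.pairplot(
    train[list(train.columns)],
    hue='Survived',            # variable used to color the points by class
    palette='seismic',         # color palette
    height=1.2,                # size of each (square) facet; this argument was named size before seaborn 0.9
    diag_kind='kde',           # plot style on the diagonal
    diag_kws=dict(fill=True),  # extra arguments for the diagonal plots; older seaborn used shade=True
    plot_kws=dict(s=10))       # extra arguments for the off-diagonal plots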

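plt.bar

1. left: the sequence of x positions, usually generated with arange (this argument is named x in recent Matplotlib versions);
2. height: the sequence of y values, i.e. the bar heights, usually the data to be displayed;
3. alpha: transparency;
4. width: the bar width, usually 0.8;
5. color or facecolor: the fill color of the bars;
6. edgecolor: the color of the bar edges;
7. label: the legend text describing what the bars represent;
8. linewidth or linewidths or lw: the width of the edges or lines

plt.bar(pos,feature_importances[index_sorted],align='center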
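The parameters listed above can be combined in a short bar chart of sorted importances. This is only a sketch: the feature names and importance values are made up, and the Agg backend is selected on the assumption that the script may run without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is required
import matplotlib.pyplot as plt
import numpy as np

# Made-up importances (illustrative only)
feature_names = np.array(['Pclass', 'Sex', 'Age', 'Fare'])
feature_importances = np.array([0.15, 0.45, 0.25, 0.15])

index_sorted = np.argsort(feature_importances)[::-1]  # most important first
pos = np.arange(len(index_sorted))                    # x positions of the bars

plt.figure(figsize=(6, 4))
bars = plt.bar(pos, feature_importances[index_sorted],
               width=0.8, align='center', alpha=0.7,
               color='steelblue', edgecolor='black', linewidth=1,
               label='relative importance')
plt.xticks(pos, feature_names[index_sorted])
plt.legend()
```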
