Document: Titanic best working Classifier.md
Link: http://note.youdao.com/noteshare?id=6e2847e19cd533c02d3658eca63e4f06&sub=6D7F43835A564D59B27AF30BF2FB09F9
- PassengerId => passenger ID
- Pclass => ticket class (1st/2nd/3rd)
- Name => passenger name
- Sex => sex
- Age => age
- SibSp => number of siblings/spouses aboard
- Parch => number of parents/children aboard
- Ticket => ticket number
- Fare => fare
- Cabin => cabin number
- Embarked => port of embarkation
Titanic best working Classifier
kaggle link
```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
None
```
Feature Engineering
Examine each attribute's effect on the survival rate:
- Pclass
- Sex
- SibSp, Parch ==> SibSp + Parch + 1 = FamilySize
IsAlone
```python
dataset['IsAlone'] = 0
dataset.loc[dataset['FamilySize'] == 1, 'IsAlone'] = 1
```
| IsAlone | Survived |
| ------- | -------- |
| 0       | 0.505650 |
| 1       | 0.303538 |
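The FamilySize/IsAlone derivation above can be sketched end to end on a toy frame (the four-row DataFrame below is a made-up stand-in for the Titanic training set; column names match the real data):

```python
import pandas as pd

# Hypothetical mini-DataFrame standing in for the Titanic training set.
df = pd.DataFrame({
    "SibSp":    [1, 0, 0, 1],
    "Parch":    [0, 0, 2, 1],
    "Survived": [1, 0, 1, 0],
})

# FamilySize = SibSp + Parch + 1 (the passenger themselves)
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1

# IsAlone = 1 when the passenger travels with no family aboard
df["IsAlone"] = 0
df.loc[df["FamilySize"] == 1, "IsAlone"] = 1

print(df[["FamilySize", "IsAlone"]])
```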
Filling missing data
- Embarked ==> 'S' (the most common port)
- Fare ==> fillna(Fare.median())
- Age ==> np.random.randint(mean - std, mean + std, size=null_size)
- Name ==> Title (extract the title from the name)
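The three fill strategies in the list above can be combined in one pass. The toy frame below is a made-up example (column names match the Titanic data, the values do not):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps, standing in for the Titanic data.
dataset = pd.DataFrame({
    "Embarked": ["S", "C", None, "Q"],
    "Fare":     [7.25, np.nan, 8.05, 71.28],
    "Age":      [22.0, np.nan, 26.0, np.nan],
})

# Embarked: fill with the most common port, 'S'
dataset["Embarked"] = dataset["Embarked"].fillna("S")

# Fare: fill with the median fare
dataset["Fare"] = dataset["Fare"].fillna(dataset["Fare"].median())

# Age: random integers in [mean - std, mean + std) for each missing entry
age_avg, age_std = dataset["Age"].mean(), dataset["Age"].std()
null_size = dataset["Age"].isnull().sum()
fill = np.random.randint(age_avg - age_std, age_avg + age_std, size=null_size)
dataset.loc[dataset["Age"].isnull(), "Age"] = fill

print(dataset)
```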
Data Cleaning
Convert features to numbers:
1. Direct mapping, e.g. 'female': 0, 'male': 1
2. Binning into intervals, e.g.:
```python
dataset.loc[ dataset['Age'] <= 16, 'Age'] = 0
dataset.loc[(dataset['Age'] > 16) & (dataset['Age'] <= 32), 'Age'] = 1
dataset.loc[(dataset['Age'] > 32) & (dataset['Age'] <= 48), 'Age'] = 2
dataset.loc[(dataset['Age'] > 48) & (dataset['Age'] <= 64), 'Age'] = 3
dataset.loc[ dataset['Age'] > 64, 'Age'] = 4
```
Drop features that are no longer needed
train = train.drop(drop_elements, axis = 1)
```
   Survived  Pclass  Sex  Age  Fare  Embarked  IsAlone  Title
0         0       3    1    1     0         0        0      1
1         1       1    0    2     3         1        0      3
2         1       3    0    1     1         0        1      2
3         1       1    0    2     3         0        0      3
4         0       3    1    2     1         0        1      1
5         0       3    1    0     1         2        1      1
6         0       1    1    3     3         0        1      1
7         0       3    1    0     2         0        0      4
8         1       3    0    1     1         0        0      3
9         1       2    0    0     2         1        0      3
```
Classifier Selection
Pick the classifier with the highest test score.
```python
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.metrics import accuracy_score, log_loss
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
```
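A minimal sketch of "pick the classifier with the best held-out accuracy" using a subset of those imports. The synthetic dataset from `make_classification` is an assumption standing in for the processed Titanic features; the comparison loop is the technique the note describes, not the kernel's exact code:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the cleaned Titanic feature matrix.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

classifiers = [
    KNeighborsClassifier(3),
    DecisionTreeClassifier(random_state=0),
    RandomForestClassifier(random_state=0),
    GaussianNB(),
    LogisticRegression(max_iter=1000),
]

# Average accuracy over 5 stratified shuffle splits per classifier.
sss = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
scores = {}
for clf in classifiers:
    accs = []
    for train_idx, test_idx in sss.split(X, y):
        clf.fit(X[train_idx], y[train_idx])
        accs.append(accuracy_score(y[test_idx], clf.predict(X[test_idx])))
    scores[clf.__class__.__name__] = np.mean(accs)

# Keep the classifier with the highest mean test accuracy.
best = max(scores, key=scores.get)
print(best, scores[best])
```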
Code Notes
groupby
```python
train[["Sex", "Survived"]].groupby(['Sex'], as_index=False).mean()
# ['Sex']: group the rows by the Sex column
# as_index=False: keep Sex as an ordinary column rather than the index
```
| Sex    | Survived |
| ------ | -------- |
| female | 0.742038 |
| male   | 0.188908 |
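The groupby call above can be reproduced on a tiny made-up frame (four rows, not the real data) to see what `as_index=False` does:

```python
import pandas as pd

# Hypothetical mini training frame.
train = pd.DataFrame({
    "Sex":      ["female", "male", "female", "male"],
    "Survived": [1, 0, 1, 1],
})

# Mean survival rate per Sex group; as_index=False keeps Sex as a column.
rates = train[["Sex", "Survived"]].groupby(["Sex"], as_index=False).mean()
print(rates)
```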
fillna
```python
dataset['Embarked'] = dataset['Embarked'].fillna('S')  # fill NaN in Embarked with 'S'
```
isnull, isnan, randint
```python
age_null_count = dataset['Age'].isnull().sum()
age_null_random_list = np.random.randint(age_avg - age_std, age_avg + age_std, size=age_null_count)
# assign through .loc rather than dataset['Age'][...] = ... to avoid
# chained-assignment (SettingWithCopyWarning) problems
dataset.loc[np.isnan(dataset['Age']), 'Age'] = age_null_random_list
```
cut, qcut
cut splits the value range into equal-width bins.
qcut splits the data at quantiles, so each bin holds roughly the same number of samples.
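The difference shows up clearly on skewed data such as fares. A small made-up series with one outlier:

```python
import pandas as pd

fares = pd.Series([0, 1, 2, 3, 4, 100])

# cut: 4 equal-width bins over the value range 0..100, so the five small
# fares crowd into the first bin and the outlier sits alone in the last.
width_bins = pd.cut(fares, 4)

# qcut: 4 quantile bins, so each bin holds roughly the same count.
quantile_bins = pd.qcut(fares, 4)

print(width_bins.value_counts(sort=False).tolist())
print(quantile_bins.value_counts(sort=False).tolist())
```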
apply
```python
dataset['Title'] = dataset['Name'].apply(get_title)  # apply the get_title function to every Name
```
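The note does not show `get_title` itself; a plausible regex-based sketch (the function body is an assumption, not the kernel's exact code) applied to two sample Titanic-style names:

```python
import re
import pandas as pd

# Hypothetical get_title: pull the title (Mr, Mrs, Miss, ...) out of a
# Titanic-style "Last, Title. First" name string.
def get_title(name):
    match = re.search(r" ([A-Za-z]+)\.", name)
    return match.group(1) if match else ""

dataset = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"],
})
dataset["Title"] = dataset["Name"].apply(get_title)
print(dataset["Title"].tolist())
```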
replace, drop
```python
dataset['Title'] = dataset['Title'].replace(['Lady', 'Countess', 'Capt', 'Col', 'Don', 'Dr',
                                             'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
```

```python
train = train.drop(drop_elements, axis=1)  # without axis=1, drop removes rows by default; with it, columns
```
Tricky Points
1. Filling Fare

```python
data['Fare'] = data['Fare'].fillna(data['Fare'].median())
```

2. Parentheses around the comparisons

```python
data.loc[(data['Age'] > 32) & (data['Age'] <= 48), 'Age'] = 2
```
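Why point 2 matters: `&` binds tighter than `>` and `<=` in Python, so without parentheses the expression parses as `data['Age'] > (32 & data['Age']) <= 48`, a chained comparison on Series, which raises an error. A small made-up frame demonstrates the correct form:

```python
import pandas as pd

# Hypothetical ages; only the middle one falls in (32, 48].
data = pd.DataFrame({"Age": [25, 40, 50]})

# Parenthesize each comparison before combining with & .
data.loc[(data["Age"] > 32) & (data["Age"] <= 48), "Age"] = 2
print(data["Age"].tolist())
```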