
Titanic best working Classifier

Source: Internet  Collected by: 自由互联  Published: 2021-06-16

Document: Titanic best working Classifier.md
Link: http://note.youdao.com/noteshare?id=6e2847e19cd533c02d3658eca63e4f06&sub=6D7F43835A564D59B27AF30BF2FB09F9

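  • PassengerId => passenger ID
  • Pclass => ticket class (1st/2nd/3rd)
  • Name => passenger name
  • Sex => sex
  • Age => age
  • SibSp => number of siblings/spouses aboard
  • Parch => number of parents/children aboard
  • Ticket => ticket information
  • Fare => fare paid
  • Cabin => cabin
  • Embarked => port of embarkation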


kaggle link

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
None

Feature engineering

Check how each attribute affects the survival rate

  • Pclass
  • Sex
  • SibSp, Parch ==> SibSp + Parch + 1 = FamilySize

IsAlone

dataset['IsAlone'] = 0
dataset.loc[dataset['FamilySize'] == 1, 'IsAlone'] = 1

   IsAlone  Survived
0        0  0.505650
1        1  0.303538
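
The FamilySize and IsAlone steps above can be sketched on a tiny hand-made frame. The six rows are made-up illustration data, not the Kaggle training set, so the rates printed here will not match the table above.

```python
import pandas as pd

# Made-up rows for illustration only (not the real Titanic data).
df = pd.DataFrame({
    "SibSp":    [1, 0, 0, 1, 0, 3],
    "Parch":    [0, 0, 0, 2, 0, 1],
    "Survived": [1, 0, 1, 1, 0, 0],
})

# Family size counts the passenger as well, hence the +1.
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1

# IsAlone is 1 only when the passenger travels without family.
df["IsAlone"] = 0
df.loc[df["FamilySize"] == 1, "IsAlone"] = 1

# The mean of a 0/1 Survived column is the survival rate of each group.
rates = df[["IsAlone", "Survived"]].groupby(["IsAlone"], as_index=False).mean()
print(rates)
```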

Filling in missing data

  • Embarked ==> 'S'
  • Fare ==> fillna(Fare.median())
  • Age ==> np.random.randint(mean - std, mean + std, size = null_size)
  • Name ==> Title (a new feature extracted from the name)

Data cleaning

Converting features to numbers

1. Direct mapping, e.g. 'female': 0, 'male': 1
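
A direct mapping like this is commonly written with Series.map. The snippet below is a minimal sketch on a made-up four-row frame; the dictionary name sex_codes is only illustrative.

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})

# Hypothetical name for the lookup table; the codes follow the note above.
sex_codes = {"female": 0, "male": 1}

# map() looks each value up in the dict; values missing from the dict become NaN,
# in which case the astype(int) call would fail and expose the problem early.
df["Sex"] = df["Sex"].map(sex_codes).astype(int)
print(df)
```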

2. Mapping by equal-width bins

e.g.

dataset.loc[ dataset['Age'] <= 16, 'Age'] = 0
dataset.loc[(dataset['Age'] > 16) & (dataset['Age'] <= 32), 'Age'] = 1
dataset.loc[(dataset['Age'] > 32) & (dataset['Age'] <= 48), 'Age'] = 2
dataset.loc[(dataset['Age'] > 48) & (dataset['Age'] <= 64), 'Age'] = 3
dataset.loc[ dataset['Age'] > 64, 'Age'] = 4
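
The same five age bands can also be produced in one call with pd.cut and explicit edges. This is a sketch of an alternative, not the code from the original notebook; the ages are made-up values chosen to sit on and around the thresholds.

```python
import numpy as np
import pandas as pd

# Made-up ages placed on and around the band boundaries.
ages = pd.Series([4, 16, 17, 32, 40, 48, 50, 64, 70], dtype=float)

# pd.cut uses right-closed intervals by default:
# (-inf, 16], (16, 32], (32, 48], (48, 64], (64, inf)
edges = [-np.inf, 16, 32, 48, 64, np.inf]

# labels=False returns the integer position of the bin instead of an interval.
age_codes = pd.cut(ages, bins=edges, labels=False)
print(age_codes.tolist())
```

Unlike the in-place .loc version, this leaves the original ages untouched, so a later band can never match a value that an earlier line has already overwritten with a small code.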

Dropping features that are not needed

train = train.drop(drop_elements, axis = 1)

   Survived  Pclass  Sex  Age  Fare  Embarked  IsAlone  Title
0         0       3    1    1     0         0        0      1
1         1       1    0    2     3         1        0      3
2         1       3    0    1     1         0        1      2
3         1       1    0    2     3         0        0      3
4         0       3    1    2     1         0        1      1
5         0       3    1    0     1         2        1      1
6         0       1    1    3     3         0        1      1
7         0       3    1    0     2         0        0      4
8         1       3    0    1     1         0        0      3
9         1       2    0    0     2         1        0      3

Choosing a classifier

Pick the classifier with the highest test score

from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.metrics import accuracy_score, log_loss
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
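
One way to compare classifiers by test score is to average their accuracy over several stratified shuffle splits. The sketch below makes several assumptions: the data is synthetic (make_classification), only four of the imported classifiers are tried to keep it short, and the split settings and hyperparameters are arbitrary choices rather than values taken from the original notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the cleaned Titanic features (assumption).
X, y = make_classification(n_samples=300, n_features=7, random_state=0)

classifiers = {
    "KNeighborsClassifier": KNeighborsClassifier(3),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=0),
    "GaussianNB": GaussianNB(),
    "LogisticRegression": LogisticRegression(max_iter=1000),
}

# Each split keeps the class ratio the same in the train and test parts.
splitter = StratifiedShuffleSplit(n_splits=5, test_size=0.1, random_state=0)

scores = {name: [] for name in classifiers}
for train_idx, test_idx in splitter.split(X, y):
    for name, clf in classifiers.items():
        clf.fit(X[train_idx], y[train_idx])
        predictions = clf.predict(X[test_idx])
        scores[name].append(accuracy_score(y[test_idx], predictions))

# Average accuracy per classifier, then take the highest one.
mean_scores = {name: float(np.mean(vals)) for name, vals in scores.items()}
best = max(mean_scores, key=mean_scores.get)
print(mean_scores)
print("best:", best)
```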

Code notes

groupby

train[["Sex", "Survived"]].groupby(['Sex'], as_index=False).mean()
# ['Sex']: group the rows by Sex
# as_index=False: keep Sex as an ordinary column instead of using it as the index

      Sex  Survived
0  female  0.742038
1    male  0.188908

fillna

dataset['Embarked'] = dataset['Embarked'].fillna('S')  # fill NaN values in Embarked with 'S'

isnull, isnan, randint

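age_avg, age_std = dataset['Age'].mean(), dataset['Age'].std()
age_null_count = dataset['Age'].isnull().sum()
# randint expects integer bounds, so mean - std and mean + std are truncated
age_null_random_list = np.random.randint(int(age_avg - age_std), int(age_avg + age_std), size=age_null_count)
# assign through .loc rather than chained indexing, which may write to a copy
dataset.loc[np.isnan(dataset['Age']), 'Age'] = age_null_random_list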

cut, qcut

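cut splits the value range into equal-width intervals, so the bin edges are evenly spaced.

qcut splits at quantiles, so each bin holds roughly the same number of samples.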
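
The difference shows up clearly on skewed data. In this sketch the eight values are made up, with one large outlier standing in for an expensive fare.

```python
import pandas as pd

# Made-up values with one outlier.
fares = pd.Series([1, 2, 3, 4, 5, 6, 7, 100])

# cut: four intervals of equal width over the range 1..100.
width_bins = pd.cut(fares, 4, labels=False)

# qcut: four intervals chosen so that each holds about the same number of values.
count_bins = pd.qcut(fares, 4, labels=False)

print(width_bins.tolist())   # the outlier sits alone in the last bin
print(count_bins.tolist())   # two values per bin
```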

apply

dataset['Title'] = dataset['Name'].apply(get_title)  # call get_title on every name
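
The notes do not show get_title itself. A plausible version, given here as an assumption rather than the original code, picks the first word that follows a space and ends with a period, which matches names written as "Surname, Title. Given names".

```python
import re

import pandas as pd


def get_title(name):
    """Return the first word that follows a space and ends with a period, or ''."""
    match = re.search(r" ([A-Za-z]+)\.", name)
    return match.group(1) if match else ""


# Sample names in the dataset's "Surname, Title. Given names" format.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
])

# apply() calls get_title once per element and collects the results.
titles = names.apply(get_title)
print(titles.tolist())
```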

replace, drop

dataset['Title'] = dataset['Title'].replace(['Lady', 'Countess', 'Capt', 'Col', 'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
train = train.drop(drop_elements, axis = 1)  # without axis = 1, drop removes rows by default; with it, columns are removed
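
The effect of axis can be checked on a small frame. The frame and the contents of drop_elements below are made up for illustration; the notes do not say which columns the original list held.

```python
import pandas as pd

# Made-up frame for illustration only.
df = pd.DataFrame({
    "Name":   ["a", "b"],
    "Ticket": ["t1", "t2"],
    "Fare":   [7.25, 71.28],
})
drop_elements = ["Name", "Ticket"]  # assumed contents

# axis=1: the labels are column names.
cols_dropped = df.drop(drop_elements, axis=1)

# Equivalent spelling that avoids the axis argument.
same = df.drop(columns=drop_elements)

# Default axis=0: the labels are row index values.
rows_dropped = df.drop([0])

print(list(cols_dropped.columns), rows_dropped.index.tolist())
```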

Tricky points

1. Filling in missing Fare values

data['Fare'] = data['Fare'].fillna(data['Fare'].median())

2. Parentheses around conditions combined with logical operators

data.loc[(data['Age'] > 32) & (data['Age'] <= 48), 'Age'] = 2
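
The parentheses matter because & binds more tightly than > and <=. Without them Python reads the expression as age > (32 & age) <= 48, a chained comparison that pandas cannot evaluate. A small sketch on made-up integer ages:

```python
import pandas as pd

# Made-up ages for illustration only.
age = pd.Series([20, 40, 60])

# Correct: each comparison is evaluated first, then the boolean masks are combined.
mask = (age > 32) & (age <= 48)

# Without parentheses the expression is parsed as age > (32 & age) <= 48.
try:
    bad = age > 32 & age <= 48
    failed = False
except (TypeError, ValueError):
    # pandas refuses to reduce a whole Series to a single True/False.
    failed = True

print(mask.tolist(), failed)
```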