Playing with data from the UCI Machine Learning Repository (and elsewhere) (3): credit card approval data

Last time in this series I ended up using a dataset that wasn't from the UCI repository, which rather defeated the purpose (lol), so this time I'll use a UCI dataset. Here it is.

Credit Approval Data Set

The data set description says the following. I've excerpted only the parts that look important.

4. Relevant Information:

This file concerns credit card applications. All attribute names and values have been changed to meaningless symbols to protect confidentiality of the data.

This dataset is interesting because there is a good mix of attributes -- continuous, nominal with small numbers of values, and nominal with larger numbers of values. There are also a few missing values.

7. Attribute Information:

A1: b, a.
A2: continuous.
A3: continuous.
A4: u, y, l, t.
A5: g, p, gg.
A6: c, d, cc, i, j, k, m, r, q, w, x, e, aa, ff.
A7: v, h, bb, j, n, z, dd, ff, o.
A8: continuous.
A9: t, f.
A10: t, f.
A11: continuous.
A12: t, f.
A13: g, p, s.
A14: continuous.
A15: continuous.
A16: +,- (class attribute)

8. Missing Attribute Values:

37 cases (5%) have one or more missing values. The missing values from particular attributes are:

A1: 12
A2: 12
A4: 6
A5: 6
A6: 9
A7: 9
A14: 13

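The key features of this data are:

The task is binary classification
The explanatory variables are a full mix of binary, categorical, and continuous
Missing values turn up in seven of the attributes (37 cases in all)

so it's a dataset that really rewards preprocessing effort (nosebleed). And because it's multivariate with variables on different measurement scales, it's hard to visualize no matter what you do... So, if you can, tackling this dataset directly would make for good practice, I think (eyes rolled back).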

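Saying that would be cruel, though, so I've put a roughly preprocessed version of the data in my GitHub repository. I've also done a rough holdout split myself, so it comes as separate training and test sets.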

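So, download card_train.csv and card_test.csv and import them into R under names like train and test... is what I'd like to say, but mismatches in data structure may crop up and be a hassle, so just load the card_dataset.Rdata workspace from the same folder as is.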
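For anyone who does want to start from the raw file, here is a minimal sketch of one way to deal with the missing values. It is not the preprocessing I actually applied. The six rows are made up purely to mimic the raw layout (16 comma-separated fields, with `?` assumed to mark a missing value), and the `impute` helper and the median/mode rule are illustrative choices only.

```r
# Made-up rows in the raw layout: A1..A15 plus the class in column 16.
raw_lines <- c(
  "b,30.83,0.000,u,g,w,v,1.25,t,t,1,f,g,202,0,+",
  "a,?,4.460,u,g,q,h,3.04,t,t,6,f,g,43,560,+",
  "?,24.50,0.500,u,g,q,h,1.50,t,f,0,f,g,280,824,+",
  "b,27.83,1.540,u,g,w,v,3.75,t,t,5,t,g,?,3,+",
  "b,20.17,5.625,y,p,?,?,1.71,t,f,0,f,s,120,0,-",
  "a,32.08,4.000,?,?,m,v,2.50,f,f,0,t,g,360,0,-"
)

# Treat "?" as NA so continuous columns parse as numbers, not strings.
crx <- read.csv(text = paste(raw_lines, collapse = "\n"), header = FALSE,
                col.names = c(paste0("A", 1:15), "label"),
                na.strings = "?", stringsAsFactors = TRUE)

# Missing values per column before imputation.
print(colSums(is.na(crx)))

# Hypothetical helper: median for numeric columns, most frequent level otherwise.
impute <- function(x) {
  if (is.numeric(x)) {
    x[is.na(x)] <- median(x, na.rm = TRUE)
  } else {
    x[is.na(x)] <- names(which.max(table(x)))
  }
  x
}

crx_imputed <- crx
crx_imputed[] <- lapply(crx, impute)
print(colSums(is.na(crx_imputed)))
```

In a real run the imputation values should be computed on the training split only and then applied to the test split, otherwise the holdout estimate gets a little optimistic.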

This time I can't be bothered with anything fancy, so I'll simply fit models, predict, and evaluate on the holdout set, one after another.

> library(e1071)
> library(randomForest)
> train.glm<-glm(label~.,train,family=binomial)
Warning message:
 glm.fit: fitted probabilities numerically 0 or 1 occurred
> table(test$label,round(predict(train.glm,newdata=test[,-16],type='response'),0))
   
     0  1
  N 42  8
  Y 10 40

# Logistic regression: 82% accuracy

Warning message:
In predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type ==  :
  prediction from a rank-deficient fit may be misleading
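
The two warnings above are worth a second look. "Fitted probabilities numerically 0 or 1" typically shows up when some combination of predictors separates the classes (almost) perfectly, and "rank-deficient fit" typically means some model-matrix columns are linear combinations of others, so their coefficients come back as NA. Which columns are responsible in train is not something I checked. The sketch below reproduces both conditions on made-up toy data (the data frame `toy` and its columns are hypothetical).

```r
# Toy data (made up): y is perfectly separated by x, and x2 is an exact
# multiple of x, so the design matrix is rank-deficient.
toy <- data.frame(x = c(1, 2, 3, 4, 5, 6), y = c(0, 0, 0, 1, 1, 1))
toy$x2 <- 2 * toy$x

# Collect warnings instead of letting them print, so they can be inspected.
msgs <- character(0)
fit <- withCallingHandlers(
  glm(y ~ x + x2, data = toy, family = binomial),
  warning = function(w) {
    msgs <<- c(msgs, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)

print(msgs)       # warnings raised while fitting
print(coef(fit))  # the aliased column x2 gets an NA coefficient

# Columns with NA coefficients are the ones dropped by the fit.
print(names(which(is.na(coef(fit)))))
```

Running `names(which(is.na(coef(train.glm))))` on the real fit would show which columns are aliased there.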
> train.tune1<-tune.svm(label~.,data=train,kernel='radial')
> train.tune2<-tune.svm(label~.,data=train,kernel='linear')
> train.tune1$best.model

Call:
best.svm(x = label ~ ., data = train, kernel = "radial")


Parameters:
   SVM-Type:  C-classification 
 SVM-Kernel:  radial 
       cost:  1 
      gamma:  0.02325581 

Number of Support Vectors:  295

> train.tune2$best.model

Call:
best.svm(x = label ~ ., data = train, kernel = "linear")


Parameters:
   SVM-Type:  C-classification 
 SVM-Kernel:  linear 
       cost:  1 
      gamma:  0.02325581 

Number of Support Vectors:  190
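
One thing to be aware of: both calls report cost = 1 and gamma = 0.02325581 (that is 1/43, presumably one over the number of model-matrix columns), which are the defaults of svm(). With no candidate values passed, tune.svm() has nothing to search over and only cross-validates that single default setting. A minimal sketch of an actual grid search follows. The built-in iris data stands in for train, which isn't reproduced here, and the grid values are arbitrary picks, not recommendations.

```r
library(e1071)

set.seed(1)  # the cross-validation folds are random

# Search a small grid of gamma and cost for the radial kernel.
tuned <- tune.svm(Species ~ ., data = iris, kernel = "radial",
                  gamma = 10^(-3:0), cost = 10^(0:2))

print(tuned$best.parameters)   # winning gamma/cost pair
print(tuned$best.performance)  # its cross-validated error

# best.model is already refit on the full data with the winning parameters,
# so it can be used for prediction directly.
pred <- predict(tuned$best.model, newdata = iris)
print(table(iris$Species, pred))
```

For a linear kernel only cost matters, so gamma can be left out of the grid.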

> train.svm1<-svm(label~.,train,kernel='radial',cost=train.tune1$best.model$cost,gamma=train.tune1$best.model$gamma)
> train.svm2<-svm(label~.,train,kernel='linear',cost=train.tune2$best.model$cost,gamma=train.tune2$best.model$gamma)
> table(test$label,predict(train.svm1,newdata=test[,-16]))
   
     N  Y
  N 42  8
  Y  3 47

# Gaussian (RBF) kernel SVM: 89%

> table(test$label,predict(train.svm2,newdata=test[,-16]))
   
     N  Y
  N 42  8
  Y  6 44

# Linear kernel SVM: 86%

> tuneRF(train[,-16],train[,16],doBest=TRUE)
mtry = 3  OOB error = 14.41% 
Searching left ...
mtry = 2 	OOB error = 14.07% 
0.02352941 0.05 
Searching right ...
mtry = 6 	OOB error = 13.9% 
0.03529412 0.05 

Call:
 randomForest(x = x, y = y, mtry = res[which.min(res[, 2]), 1]) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 6

        OOB estimate of  error rate: 14.24%
Confusion matrix:
    N   Y class.error
N 287  46   0.1381381
Y  38 219   0.1478599
> train.rf<-randomForest(label~.,train,mtry=6)
> table(test$label,predict(train.rf,newdata=test[,-16]))
   
     N  Y
  N 45  5
  Y  9 41

# Random forest: 86%

Of the results so far, the Gaussian kernel SVM is the best. That leaves XGBoost. I'll skip H2O's Deep Learning since the sample size is far too small (lol).
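
To make the comparison explicit, the four confusion matrices printed so far can be typed back in and scored in a few lines. The counts are copied from the tables above (rows are actual N/Y, columns are predicted). The helper names `to_matrix` and `accuracy` are just illustrative. Note that the linear-kernel table works out to (42 + 44) / 100 = 86%.

```r
# Counts in row-major order: actual N (pred N, pred Y), actual Y (pred N, pred Y).
counts <- list(
  logistic      = c(42, 8, 10, 40),
  svm_radial    = c(42, 8, 3, 47),
  svm_linear    = c(42, 8, 6, 44),
  random_forest = c(45, 5, 9, 41)
)

to_matrix <- function(v) {
  matrix(v, nrow = 2, byrow = TRUE,
         dimnames = list(actual = c("N", "Y"), predicted = c("N", "Y")))
}

# Accuracy is the share of test cases on the diagonal.
accuracy <- function(cm) sum(diag(cm)) / sum(cm)

cms <- lapply(counts, to_matrix)
acc <- sapply(cms, accuracy)
print(acc)
# logistic 0.82, svm_radial 0.89, svm_linear 0.86, random_forest 0.86

# Share of actual Y cases that were predicted Y.
recall_y <- sapply(cms, function(cm) cm["Y", "Y"] / sum(cm["Y", ]))
print(recall_y)
```

With only 100 test cases, differences of a few points between models are within what a different holdout split could produce, so the ranking should be read loosely.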

> library(xgboost)
> library(Matrix)
> train.mx<-sparse.model.matrix(label~.,train)
> test.mx<-sparse.model.matrix(label~.,test)
> dtrain<-xgb.DMatrix(train.mx,label=as.integer(train$label)-1)
> dtest<-xgb.DMatrix(test.mx,label=as.integer(test$label)-1)
> train.gbdt<-xgb.train(params=list(objective="binary:logistic",eval_metric="logloss",eta=0.7),dtrain,nrounds=150,watchlist=list(train=dtrain,test=dtest))
[0]	train-logloss:0.358473	test-logloss:0.440524
[1]	train-logloss:0.257837	test-logloss:0.396443
[2]	train-logloss:0.206318	test-logloss:0.406964
# ...
[148]	train-logloss:0.008419	test-logloss:0.602591
[149]	train-logloss:0.008393	test-logloss:0.602187
> table(test$label,round(predict(train.gbdt,newdata=dtest),0))
   
     0  1
  N 44  6
  Y  8 42

# XGBoost: 86%
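
The evaluation log above deserves a comment. Among the rounds that were printed, test logloss is lowest at round 1 and has climbed to about 0.60 by round 149, while train logloss keeps falling toward zero, which is the usual picture of overfitting (eta = 0.7 is a fairly aggressive learning rate). The sketch below just tabulates the printed rounds in base R. The elided rounds are not known, so "best round" here means best among the five shown, nothing more.

```r
# Logloss values copied from the printed rounds of the log above.
eval_log <- data.frame(
  round = c(0, 1, 2, 148, 149),
  train = c(0.358473, 0.257837, 0.206318, 0.008419, 0.008393),
  test  = c(0.440524, 0.396443, 0.406964, 0.602591, 0.602187)
)

# Gap between test and train loss: it widens as boosting continues.
eval_log$gap <- eval_log$test - eval_log$train
print(eval_log)

# Round with the lowest test logloss among the printed ones.
best <- eval_log[which.min(eval_log$test), ]
print(best)
```

A natural follow-up would be a smaller eta together with xgb.train's early_stopping_rounds argument, monitored on a validation split carved out of the training data rather than on the test set, so that the holdout estimate stays honest.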