統計的因果推論(4): 機械学習分類器による傾向スコアを調整してみる - 渋谷駅前で働くデータサイエンティストのブログ

この記事は以下の記事の続きです。

ご覧の通り、機械学習分類器3種で傾向スコアを算出してみたらおかしな結果になったわけです。この点について、実は後日2点ほどコメントをいただきました。1つはブコメで、

統計的因果推論(3): 傾向スコア算出を機械学習に置き換えてみると - 渋谷駅前で働くデータサイエンティストのブログ
CM接触群と非接触群に分けて、傾向スコアの分布をみてみると、2群のスコアが0.25～0.75でしか重複していません。傾向スコアが0.25～0.75のデータに絞って比較すると、とりあえずは妥当な結論が出ると思います。後は、傾向
2016/10/12 21:46

とのことでした。これは確かにその通りかもということで、試してみる価値がありそうです。一方で半可通のMLerとしては以下の[twitter:@toshi_k_datasci]さんからのコメントも気になったのでした。

@TJO_datasci 機械学習で確率を推定したいなら，多くの場合calibration的な操作が必要だと考えています．ご参考→｜1.16. Probability calibration https://t.co/MW6oYOEkTZ
— toshi_k (@toshi_k_datasci) 2016年11月12日

ということでRでもprobability calibrationやりたいなぁと思ったんですが、調べても出てこないしこれは自分でRスクリプト書くしかないのかなぁと思っていたらありました。こんな感じです。

{caret}を使えば良いということなので、こちらについてはひとまず同じ{randomForest}を共通して使うことになるランダムフォレストに的を絞ってprobability calibrationを行った上で、傾向スコアを算出し直してみようと思います。

上記二者のどちらについても、前回の記事までで用いていたワークスペースをそのまま利用する前提ですので、この記事から読み始めたという方はお手数ですが過去2回分記事を遡ってRスクリプトを実行しておいてください。

傾向スコア0.25-0.75の範囲に絞ってATE / ATTを算出し直してみる

これは簡単なので、サクッとやってみましょう。まずはSVM。

# ATE

> idx.svm <- which(ps.svm[,1]>=0.25&ps.svm[,1]<=0.75)
> ps.svm.prt <- ps.svm[idx.svm,]
> ivec1.svm.tmp <- ivec1[idx.svm]
> ivec0.svm.tmp <- ivec0[idx.svm]
> iestp1.svm <- (ivec1.svm.tmp/ps.svm.prt[,2])
> iestp0.svm <- (ivec0.svm.tmp/ps.svm.prt[,1])
> iestp.svm <- iestp1.svm+iestp0.svm
> ipwe_gs.svm <- lm(d[idx.svm,]$gamesecond ~ ivec[idx.svm,]+0, weights=iestp.svm)
> summary(ipwe_gs.svm)

Call:
lm(formula = d[idx.svm, ]$gamesecond ~ ivec[idx.svm, ] + 0, weights = iestp.svm)

Weighted Residuals:
   Min     1Q Median     3Q    Max 
 -8177  -6259  -5235  -4725 369945 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
ivec[idx.svm, ]ivec1   4090.0      570.9   7.165 1.02e-12 ***
ivec[idx.svm, ]ivec0   3518.8      624.1   5.638 1.92e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 31090 on 2474 degrees of freedom
Multiple R-squared:  0.0325,	Adjusted R-squared:  0.03172 
F-statistic: 41.56 on 2 and 2474 DF,  p-value: < 2.2e-16

# ATT

> iestp1_ATT.svm <- ivec1.svm.tmp
> iestp0_ATT.svm <- ivec0.svm.tmp*ps.svm.prt[,2]/ps.svm.prt[,1]
> iestp_ATT.svm <- iestp1_ATT.svm+iestp0_ATT.svm
> ipwe_ATT_gs.svm <- lm(d[idx.svm,]$gamesecond ~ ivec[idx.svm,]+0, weights=iestp_ATT.svm)
> summary(ipwe_ATT_gs.svm)

Call:
lm(formula = d[idx.svm, ]$gamesecond ~ ivec[idx.svm, ] + 0, weights = iestp_ATT.svm)

Weighted Residuals:
   Min     1Q Median     3Q    Max 
 -6270  -3743  -3743  -3279 219353 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
ivec[idx.svm, ]ivec1   3743.5      559.5   6.691 2.73e-11 ***
ivec[idx.svm, ]ivec0   3624.8      558.2   6.494 1.01e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 20380 on 2474 degrees of freedom
Multiple R-squared:  0.03395,	Adjusted R-squared:  0.03317 
F-statistic: 43.47 on 2 and 2474 DF,  p-value: < 2.2e-16

次にランダムフォレスト。

# ATE

> idx.rf <- which(ps.rf[,1]>=0.25&ps.rf[,1]<=0.75)
> ps.rf.prt <- ps.rf[idx.rf,]
> ivec1.rf.tmp <- ivec1[idx.rf]
> ivec0.rf.tmp <- ivec0[idx.rf]
> iestp1.rf <- (ivec1.rf.tmp/ps.rf.prt[,2])
> iestp0.rf <- (ivec0.rf.tmp/ps.rf.prt[,1])
> iestp.rf <- iestp1.rf+iestp0.rf
> ipwe_gs.rf <- lm(d[idx.rf,]$gamesecond ~ ivec[idx.rf,]+0, weights=iestp.rf)
> summary(ipwe_gs.rf)

Call:
lm(formula = d[idx.rf, ]$gamesecond ~ ivec[idx.rf, ] + 0, weights = iestp.rf)

Weighted Residuals:
   Min     1Q Median     3Q    Max 
-10948  -7729  -7167  -5831 243515 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
ivec[idx.rf, ]ivec1   6173.5      691.5   8.927  < 2e-16 ***
ivec[idx.rf, ]ivec0   4780.0      829.8   5.760 9.77e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 30180 on 1903 degrees of freedom
Multiple R-squared:  0.05599,	Adjusted R-squared:  0.055 
F-statistic: 56.44 on 2 and 1903 DF,  p-value: < 2.2e-16

# ATT

> iestp1_ATT.rf <- ivec1.rf.tmp
> iestp0_ATT.rf <- ivec0.rf.tmp*ps.rf.prt[,2]/ps.rf.prt[,1]
> iestp_ATT.rf <- iestp1_ATT.rf+iestp0_ATT.rf
> ipwe_ATT_gs.rf <- lm(d[idx.rf,]$gamesecond ~ ivec[idx.rf,]+0, weights=iestp_ATT.rf)
> summary(ipwe_ATT_gs.rf)

Call:
lm(formula = d[idx.rf, ]$gamesecond ~ ivec[idx.rf, ] + 0, weights = iestp_ATT.rf)

Weighted Residuals:
   Min     1Q Median     3Q    Max 
 -6819  -4778  -4778  -3073 126555 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
ivec[idx.rf, ]ivec1   4778.2      572.1   8.352  < 2e-16 ***
ivec[idx.rf, ]ivec0   4379.0      843.4   5.192  2.3e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 18780 on 1903 degrees of freedom
Multiple R-squared:  0.04836,	Adjusted R-squared:  0.04736 
F-statistic: 48.35 on 2 and 1903 DF,  p-value: < 2.2e-16

最後に、xgboost。

# ATE

> idx.gbdt <- which(ps.gbdt[,1]>=0.25&ps.gbdt[,1]<=0.75)
> ps.gbdt.prt <- ps.gbdt[idx.gbdt,]
> ivec1.gbdt.tmp <- ivec1[idx.gbdt]
> ivec0.gbdt.tmp <- ivec0[idx.gbdt]
> iestp1.gbdt <- (ivec1.gbdt.tmp/ps.gbdt.prt[,2])
> iestp0.gbdt <- (ivec0.gbdt.tmp/ps.gbdt.prt[,1])
> iestp.gbdt <- iestp1.gbdt+iestp0.gbdt
> ipwe_gs.gbdt <- lm(d[idx.gbdt,]$gamesecond ~ ivec[idx.gbdt,]+0, weights=iestp.gbdt)
> summary(ipwe_gs.gbdt)

Call:
lm(formula = d[idx.gbdt, ]$gamesecond ~ ivec[idx.gbdt, ] + 0, 
    weights = iestp.gbdt)

Weighted Residuals:
   Min     1Q Median     3Q    Max 
 -8931  -5837  -2771  -2146 227295 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
ivec[idx.gbdt, ]ivec1   4674.2      469.6   9.955  < 2e-16 ***
ivec[idx.gbdt, ]ivec0   1754.8      488.9   3.589 0.000337 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 23070 on 2739 degrees of freedom
Multiple R-squared:  0.03928,	Adjusted R-squared:  0.03858 
F-statistic: 55.99 on 2 and 2739 DF,  p-value: < 2.2e-16

# ATT

> iestp1_ATT.gbdt <- ivec1.gbdt.tmp
> iestp0_ATT.gbdt <- ivec0.gbdt.tmp*ps.gbdt.prt[,2]/ps.gbdt.prt[,1]
> iestp_ATT.gbdt <- iestp1_ATT.gbdt+iestp0_ATT.gbdt
> ipwe_ATT_gs.gbdt <- lm(d[idx.gbdt,]$gamesecond ~ ivec[idx.gbdt,]+0, weights=iestp_ATT.gbdt)
> summary(ipwe_ATT_gs.gbdt)

Call:
lm(formula = d[idx.gbdt, ]$gamesecond ~ ivec[idx.gbdt, ] + 0, 
    weights = iestp_ATT.gbdt)

Weighted Residuals:
   Min     1Q Median     3Q    Max 
 -3958  -3958  -2451  -1464 156776 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
ivec[idx.gbdt, ]ivec1   3957.8      421.7   9.385  < 2e-16 ***
ivec[idx.gbdt, ]ivec0   2080.0      529.4   3.929 8.73e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 15810 on 2739 degrees of freedom
Multiple R-squared:  0.03642,	Adjusted R-squared:  0.03572 
F-statistic: 51.76 on 2 and 2739 DF,  p-value: < 2.2e-16

SVM・ランダムフォレスト・xgboostのいずれであっても、傾向スコアを0.25-0.75の範囲に制限してATE/ATTを計算し直した結果、「処置群の方が大きい」という結論が得られました。これはブコメでご指摘いただいた通りだと思います。有難うございました。

Probability calibrationをやってみる

これは結構難儀しました。というか多分失敗してますorz　冒頭にも書いたように、ランダムフォレストのみで試しています。

{caret}でキャリブレーションする

一応、stackoverflowに書いてあった通りにやってみます。

> d2 <- d
> d2$cm_dummy <- factor(x=d$cm_dummy, labels=c("Class1","Class2"))

> library(caret)
> model.rf.cal <- train(cm_dummy ~ TVwatch_day + age + sex + marry_dummy + child_dummy + inc + pmoney + area_kanto + area_tokai + area_keihanshin + job_dummy1 + job_dummy2 + job_dummy3 + job_dummy4 + job_dummy5 + job_dummy6 + job_dummy7 + fam_str_dummy1 + fam_str_dummy2 + fam_str_dummy3 + fam_str_dummy4, data=d2, method="rf", trControl = trainControl(savePredictions=T, classProbs=T))
> str(model.rf.cal$pred)
'data.frame':	276249 obs. of  7 variables:
 $ pred    : Factor w/ 2 levels "Class1","Class2": 1 1 1 1 1 1 1 1 1 1 ...
 $ obs     : Factor w/ 2 levels "Class1","Class2": 1 1 1 1 1 1 1 1 1 1 ...
 $ Class1  : num  0.972 0.996 0.994 0.854 0.96 0.666 0.99 0.642 0.982 0.99 ...
 $ Class2  : num  0.028 0.004 0.006 0.146 0.04 0.334 0.01 0.358 0.018 0.01 ...
 $ rowIndex: int  5 9 10 15 19 23 27 28 29 33 ...
 $ mtry    : num  2 2 2 2 2 2 2 2 2 2 ...
 $ Resample: chr  "Resample01" "Resample01" "Resample01" "Resample01" ...
> cal.rf <- calibration(obs~Class1, data=model.rf.cal$pred)
> str(cal.rf)
List of 5
 $ data     :'data.frame':	11 obs. of  5 variables:
  ..$ calibModelVar: chr [1:11] "Class1" "Class1" "Class1" "Class1" ...
  ..$ bin          : Factor w/ 11 levels "[0,0.0909]","(0.0909,0.182]",..: 1 2 3 4 5 6 7 8 9 10 ...
  ..$ Percent      : num [1:11] 0.0187 1.2742 4.9501 9.968 17.5349 ...
  ..$ Count        : num [1:11] 14 56 253 591 1471 ...
  ..$ midpoint     : num [1:11] 4.55 13.64 22.73 31.82 40.91 ...
 $ cuts     : num 11
 $ class    : chr "Class1"
 $ probNames: chr "Class1"
 $ call     : language calibration.formula(x = obs ~ Class1, data = model.rf.cal$pred)
 - attr(*, "class")= chr "calibration"
> xyplot(cal.rf)

{caret}のtrain関数でリサンプリングして膨らませたものをもとにcalibrationした結果を出しているわけですが、とりあえず何となく見た感じには良さげではあります。これで調整すれば良いのでしょうか。

傾向スコアを算出してみる

You could average the class probabilities per rowIndex if you like.

と言うのでこれで調整してみました。いわゆるisotonic regressionをやるってことですかね？

> ps.rf.cal.out <- rep(0,10000)
> for (i in 1:18){
+ ps.rf.cal.out[which(table(model.rf.cal$pred$rowIndex)==(i*3))] <- mean(ps.rf.cal[which(table(model.rf.cal$pred$rowIndex)==(i*3)),2])
+ }

たったこれだけです。でもこんな簡単に行くのかなぁ。。。

ATEを算出してみる

後は前回同様にただATEを算出するだけです。

> iestp1.rf.cal <- (ivec1/ps.rf.cal.out)
> iestp0.rf.cal <- (ivec0/(1-ps.rf.cal.out))
> iestp.rf.cal <- iestp1.rf.cal+iestp0.rf.cal
> ipwe_gs.rf.cal <- lm(d2$gamesecond ~ ivec+0, weights=iestp.rf.cal)
> summary(ipwe_gs.rf.cal)

Call:
lm(formula = d2$gamesecond ~ ivec + 0, weights = iestp.rf.cal)

Weighted Residuals:
   Min     1Q Median     3Q    Max 
 -4889  -4059  -4011  -3859 474508 

Coefficients:
          Estimate Std. Error t value Pr(>|t|)    
ivecivec1   2483.5      249.6   9.952   <2e-16 ***
ivecivec0   3106.7      249.6  12.449   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 24950 on 9998 degrees of freedom
Multiple R-squared:  0.02478,	Adjusted R-squared:  0.02458 
F-statistic:   127 on 2 and 9998 DF,  p-value: < 2.2e-16

全然ダメですねorz　ということで、この方法ではどうやらうまくいかないようです。。。というのも理由は簡単で、調整した傾向スコアを見てみるとこうなっていたのでした。

> hist(ps.rf.cal.out,breaks=100)

いくら何でもこれはおかしいので、多分やり方自体が間違っているのでしょう。Pythonのscikit-learnには上記リンク先記事の通り、ちゃんとしたメソッドがあるわけですが。。。どなたかprobability calibrationに詳しい方、ぜひぜひご教授下さいm(_ _)m

そもそも最初に算出した4種類の傾向スコアは互いにどう異なっていたのか

実はcalibration関数に与えるデータを適当にいじることで、xyplotを当初の4種類の傾向スコアのそれぞれについて描くことができます。

> df4 <- data.frame(obs=factor(d$cm_dummy),svm=ps.svm[,1],rf=ps.rf[,1],gbdt=ps.gbdt[,1],lr=1-ps)
> cal4 <- calibration(obs~svm+rf+gbdt+lr,data=df4)
> xyplot(cal4)