Anomaly detection in R (1): A review of what I have been doing so far - 渋谷駅前で働くデータサイエンティストのブログ

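Frankly this feels a bit late in the day, but it occurred to me that I have hardly ever done any serious, hands-on anomaly detection or change detection myself. What prompted this was the following dataset from the UCI Machine Learning Repository, which I use now and then to test various methods.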

UCI Machine Learning Repository: Water Treatment Plant Data Set

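As its description says, this dataset aggregates the inputs from various sensors at an urban waste water treatment plant into daily records. A typical task for this kind of plant dataset is precisely anomaly detection: in short, we want to detect the dates on which some fault occurred, even if only after the fact.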

I have touched on anomaly detection itself once before on this blog, though only briefly. That was the article introducing the {AnomalyDetection} package.

At that time I only gave a short introduction to the generalized ESD test that {AnomalyDetection} relies on, and never got around to the underlying theory or to other similar methods.

Even earlier than that, I also wrote about using a Markov switching model for anomaly detection with the {MSwM} package.

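That approach seems to work well enough in its own right, but it also left me with the impression that it can only be used on univariate time series. In other words, these methods as they stand cannot do multivariate anomaly detection. I was left feeling that I had not quite digested the topic, and I simply let it sit from then on.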

However, after happening to work with the Water Treatment Plant Dataset mentioned above a number of times, I decided that this would not do, and so I bought the following book.

入門機械学習による異常検知―Rによる実践ガイド

Author: 井手剛
Publisher: コロナ社
Release date: 2015/02/19
Format: Book (単行本)

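This is the anomaly detection book by @Idesan, whom everyone in the machine learning community will know*1. From the next post on I will study with this book as the basis. In this post I want to take stock of where I stand, namely how I actually go about anomaly detection right now.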

About the Water Treatment Plant Dataset

Here is the dataset once more.

UCI Machine Learning Repository: Water Treatment Plant Data Set

Data Set Information:
This dataset comes from the daily measures of sensors in an urban waste water treatment plant. The objective is to classify the operational state of the plant in order to predict faults through the state variables of the plant at each of the stages of the treatment process. This domain has been stated as an ill-structured domain.

Attribute Information:
All attributes are numeric and continuous
N. Attrib.
1 Q-E (input flow to plant)
2 ZN-E (input Zinc to plant)
3 PH-E (input pH to plant)
4 DBO-E (input Biological demand of oxygen to plant)
5 DQO-E (input chemical demand of oxygen to plant)
6 SS-E (input suspended solids to plant)
7 SSV-E (input volatile suspended solids to plant)
8 SED-E (input sediments to plant)
9 COND-E (input conductivity to plant)
10 PH-P (input pH to primary settler)
11 DBO-P (input Biological demand of oxygen to primary settler)
12 SS-P (input suspended solids to primary settler)
13 SSV-P (input volatile suspended solids to primary settler)
14 SED-P (input sediments to primary settler)
15 COND-P (input conductivity to primary settler)
16 PH-D (input pH to secondary settler)
17 DBO-D (input Biological demand of oxygen to secondary settler)
18 DQO-D (input chemical demand of oxygen to secondary settler)
19 SS-D (input suspended solids to secondary settler)
20 SSV-D (input volatile suspended solids to secondary settler)
21 SED-D (input sediments to secondary settler)
22 COND-D (input conductivity to secondary settler)
23 PH-S (output pH)
24 DBO-S (output Biological demand of oxygen)
25 DQO-S (output chemical demand of oxygen)
26 SS-S (output suspended solids)
27 SSV-S (output volatile suspended solids)
28 SED-S (output sediments)
29 COND-S (output conductivity)
30 RD-DBO-P (performance input Biological demand of oxygen in primary settler)
31 RD-SS-P (performance input suspended solids to primary settler)
32 RD-SED-P (performance input sediments to primary settler)
33 RD-DBO-S (performance input Biological demand of oxygen to secondary settler)
34 RD-DQO-S (performance input chemical demand of oxygen to secondary settler)
35 RD-DBO-G (global performance input Biological demand of oxygen)
36 RD-DQO-G (global performance input chemical demand of oxygen)
37 RD-SS-G (global performance input suspended solids)
38 RD-SED-G (global performance input sediments)

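The actual dataset has the date in the first column, ahead of these attributes. As the feature list shows, it is basically an unlabeled dataset suited to unsupervised learning*2. It does have quite a few missing values, though, which makes it a little awkward to handle as-is. Since this post is not about missing value imputation, I have prepared a version of the dataset with the rows containing NA removed beforehand, and I am putting it below.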

The rest of this post uses this dataset.
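How the NA-free file was produced is not shown in this post. As a minimal sketch of one way such a file could be made, the snippet below assumes a raw file that is comma-separated, has no header, puts the date first, and marks missing values with `?`. The three rows and the column names are made-up toy values for illustration, not actual records.

```r
# Toy stand-in for a raw file (assumed format: no header, date first, "?" = missing).
raw_text <- c(
  "D-1/3/90,44101,1.50,7.8",
  "D-2/3/90,39024,?,7.7",
  "D-4/3/90,32229,5.00,7.6"
)

# Read the text as CSV, turning "?" into NA; column names are hypothetical.
raw <- read.csv(text = raw_text, header = FALSE, na.strings = "?",
                col.names = c("date", "Q.E", "ZN.E", "PH.E"))

# Drop every row that contains at least one NA.
d_complete <- na.omit(raw)
print(d_complete)
```

Dropping rows this way is the simplest option and discards every day with even one missing sensor value, which is acceptable here only because imputation is out of scope.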

Getting a rough idea with Ward's method, then finishing it off with K-means

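What I can do with this dataset right now is to use clustering to find whichever cluster has the smallest sample size. Put differently, this is the naive idea of anomaly detection that the cluster with the smallest sample size can be regarded as the outliers.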

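That said, jumping straight in with a method like K-means seems likely to give results that are hard to interpret, so first I get a rough idea by clustering with Ward's method and visualizing the result.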

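> # Read the NA-free dataset; the first column holds the date labels
> d <- read.csv('watertreatment_mod.csv')
> # Euclidean distances between days, using every column except the date
> d.dist <- dist(d[,-1])
> # Hierarchical clustering with Ward's method, plotted with dates as leaf labels
> d.hcl <- hclust(d.dist, method='ward.D2')
> plot(d.hcl, labels=d[,1])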

[Figure: dendrogram of the Ward's method clustering, with dates as leaf labels]
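A dendrogram can also be read numerically by cutting the tree into a fixed number of clusters and tabulating their sizes. The following is only a sketch on made-up toy data (50 ordinary rows plus 3 rows shifted far away), not on the plant data. The seed, the toy matrix, and the range of k are all assumptions chosen for illustration.

```r
set.seed(7)  # arbitrary seed so the toy data is repeatable

# Toy data: 50 "normal" rows and 3 rows placed far from the rest.
x <- rbind(matrix(rnorm(50 * 3), ncol = 3),
           matrix(rnorm(3 * 3, mean = 20), ncol = 3))

# Same recipe as in the post: Euclidean distances, then Ward's method.
hcl <- hclust(dist(x), method = "ward.D2")

# Cut the tree into k clusters and look at the cluster sizes.
for (k in 2:4) {
  print(table(cutree(hcl, k = k)))
}

# Cluster sizes for the two-cluster cut, kept for inspection.
sizes_k2 <- as.vector(table(cutree(hcl, k = 2)))
```

A very small cluster that survives across several values of k is the numeric counterpart of an isolated branch in the plot.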

Here and there I can see clusters that look small in sample size and also somewhat isolated. If K-means can pin these down cleanly, that should be good enough, so for a start I try K = 4, ..., 10 in turn.

> for (i in 4:10){
+     km <- kmeans(d[,-1], centers=i)
+     print(table(km$cluster))
+ }

  1   2   3   4 
176  61  39 104 

  1   2   3   4   5 
 16  70 156  88  50 

  1   2   3   4   5   6 
 31  16 119  75  87  52 

  1   2   3   4   5   6   7 
  4  40 130  48  65  81  12 

 1  2  3  4  5  6  7  8 
84 64 16  3 40 63 67 43 

 1  2  3  4  5  6  7  8  9 
 7 45  3 65 53 44 64 16 83 

 1  2  3  4  5  6  7  8  9 10 
53 27 54  3 68 33 32 45 50 15
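One caveat when reading these tables: kmeans() starts from randomly chosen centers, so the cluster numbering and even the sizes can change from run to run. Below is a minimal sketch of a more repeatable variant, assuming a fixed seed and several restarts through the `nstart` argument (neither is used in the runs above). The toy matrix merely stands in for d[,-1] and is made up for illustration.

```r
set.seed(42)  # arbitrary seed, assumed here only to make the run repeatable

# Toy data: 60 ordinary rows plus 3 rows shifted far away.
x <- rbind(matrix(rnorm(60 * 4), ncol = 4),
           matrix(rnorm(3 * 4, mean = 15), ncol = 4))

# nstart = 20 runs K-means from 20 random starts and keeps the best fit.
for (k in 2:4) {
  km <- kmeans(x, centers = k, nstart = 20)
  print(table(km$cluster))
}

# Cluster sizes for K = 2, sorted so the label order does not matter.
km2 <- kmeans(x, centers = 2, nstart = 20)
sizes_k2 <- sort(as.vector(table(km2$cluster)))
```

Sorting the sizes sidesteps the fact that cluster labels themselves are arbitrary and may be permuted between runs.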

Once K > 7, a cluster containing only 3 samples keeps showing up for some reason, as the tables show. Identifying those samples gives:

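> for (i in 8:10){
+     km <- kmeans(d[,-1], centers=i)
+     # Index of the cluster that holds exactly 3 samples (assumes there is exactly one)
+     cls <- which(table(km$cluster)==3)
+     # Dates of the samples assigned to that cluster
+     print(d$date[km$cluster==cls])
+ }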
[1] D-16/9/90 D-2/8/90  D-11/8/91
527 Levels: D-1/1/90 D-1/1/91 D-1/10/90 D-1/10/91 D-1/11/90 D-1/2/90 D-1/2/91 D-1/3/90 D-1/3/91 D-1/4/90 ... D-9/9/90
[1] D-16/9/90 D-2/8/90  D-11/8/91
527 Levels: D-1/1/90 D-1/1/91 D-1/10/90 D-1/10/91 D-1/11/90 D-1/2/90 D-1/2/91 D-1/3/90 D-1/3/91 D-1/4/90 ... D-9/9/90
[1] D-16/9/90 D-2/8/90  D-11/8/91
527 Levels: D-1/1/90 D-1/1/91 D-1/10/90 D-1/10/91 D-1/11/90 D-1/2/90 D-1/2/91 D-1/3/90 D-1/3/91 D-1/4/90 ... D-9/9/90
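The loop above looks for a cluster of exactly 3 samples, which only works once that size is already known. A sketch of the more general "smallest cluster" idea follows. The helper name `smallest_cluster_rows`, the toy data, and the `day-` labels are all hypothetical and made up for illustration.

```r
# Return the row indices that belong to the smallest cluster in a label vector.
# On ties, the first label with the minimum size is used.
smallest_cluster_rows <- function(labels) {
  sizes <- table(labels)
  smallest <- names(sizes)[which.min(sizes)]
  which(as.character(labels) == smallest)
}

set.seed(1)  # arbitrary seed so the toy run is repeatable

# Toy data: 60 ordinary rows plus 3 rows shifted far away (rows 61 to 63).
x <- rbind(matrix(rnorm(60 * 4), ncol = 4),
           matrix(rnorm(3 * 4, mean = 15), ncol = 4))
dates <- paste0("day-", seq_len(nrow(x)))  # hypothetical date labels

km <- kmeans(x, centers = 2, nstart = 20)
flagged <- dates[smallest_cluster_rows(km$cluster)]
print(flagged)
```

Using `which.min` on the size table avoids hard-coding the cluster size, at the cost of always flagging something, even when no cluster is unusually small.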