RにTensorFlow + Kerasを実装した{keras}パッケージがやって来たので試してみた（追記2件あり）

Python側でのTensorFlowの隆盛を他所に、R側では{tensorflow}も使いにくいし*1これはPythonistaに転生しなければならんのかなぁ。。。ということを思っていたら、出ました。あのKerasのRパッケージです。

インストール手順は普通にhttps://rstudio.github.io/keras/に書いてある通り、以下の通りRコンソールから打てばおしまいです。

devtools::install_github("rstudio/keras")
library(keras)
install_tensorflow()

ということで、早速色々試してみようと思います。

3層NNをサクッと試してみる

データセットはこちらのXORデータセット（10万サンプル）を使います。

> d <- read.csv('xor_complex_large.txt', header=T, sep='\t')
> x_train <- as.matrix(d[,-3])
> y_train <- as.matrix(d[,3]) - 1

> library(keras)
> model <- keras_model_sequential() 
> 
> model %>% 
+     layer_dense(units = 5, input_shape = 2) %>% 
+     layer_activation(activation = 'relu') %>% 
+     layer_dense(units = 7) %>%
+     layer_activation(activation = 'relu') %>%
+     layer_dense(units = 1, activation = 'sigmoid')
> 
> model %>% compile(
+     loss = 'binary_crossentropy',
+     optimizer = optimizer_sgd(lr = 0.05),
+     metrics = c('accuracy')
+ )
> model %>% fit(x_train, y_train, epochs = 5, batch_size = 100)
Epoch 1/5
100000/100000 [==============================] - 2s - loss: 0.3898 - acc: 0.8273     
Epoch 2/5
100000/100000 [==============================] - 1s - loss: 0.3381 - acc: 0.8525     
Epoch 3/5
100000/100000 [==============================] - 1s - loss: 0.3288 - acc: 0.8557     
Epoch 4/5
100000/100000 [==============================] - 2s - loss: 0.3246 - acc: 0.8571     
Epoch 5/5
100000/100000 [==============================] - 1s - loss: 0.3233 - acc: 0.8570    
> px <- seq(-4,4,0.03)
> py <- seq(-4,4,0.03)
> x_test <- expand.grid(px, py)
> x_test <- as.matrix(x_test)
> pred_class <- model %>% predict(x_test, batch_size=100)
> pred_class <- round(pred_class, 0)
> plot(d[,-3], col=c(rep(1,50000), rep(2,50000)), pch=19, cex=0.2, xlim=c(-4,4), ylim=c(-4,4))
> par(new=T)
> contour(px, py, array(pred_class, dim=c(length(px), length(py))), xlim=c(-4,4), ylim=c(-4,4), levels=0.5, col='purple', lwd=5)

f:id:TJO:20170601190431p:plain

当たり前ですが、綺麗なXOR決定境界が引けました。

オンラインニュース人気度データセットで4層のDNNを試してみる

これは以前{mxnet}で試した記事と同じものです。

> d <- read.csv('OnlineNewsPopularity.csv')
> d <- d[,-c(1,2)]
> idx <- sample(nrow(d),5000,replace=F)
> d_train <- d[-idx,]
> d_test <- d[idx,]
> train <- data.matrix(d_train)
> test <- data.matrix(d_test)
> train.x <- train[,-59]
> train.y <- train[,59]
> test.x <- test[,-59]
> test.y <- test[,59]
> # 正規化
> train_means <- apply(train.x, 2, mean)
> train_stds <- apply(train.x, 2, sd)
> test_means <- apply(test.x, 2, mean)
> test_stds <- apply(test.x, 2, sd)
> train.x <- t((t(train.x)-train_means)/train_stds)
> test.x <- t((t(test.x)-test_means)/test_stds)
> 
> # 目的変数を対数変換
> train.y <- log(train.y)
> test.y <- log(test.y)
> 
> library(keras)
> model <- keras_model_sequential()
> model %>% layer_dense(units = 200, input_shape=58) %>%
+ layer_activation(activation = 'sigmoid') %>%
+ layer_dropout(rate = 0.2) %>%
+ layer_dense(units = 50) %>%
+ layer_activation(activation = 'sigmoid') %>%
+ layer_dropout(rate = 0.2) %>%
+ layer_dense(units = 44) %>%
+ layer_activation(activation = 'sigmoid') %>%
+ layer_dropout(rate = 0.2) %>%
+ layer_dense(units = 1) %>%
+ layer_activation(activation = 'linear') %>%
+ compile(
+ loss = 'mse',
+ optimizer = optimizer_rmsprop(lr = 0.001)
+ )
> model %>% fit(train.x, train.y, epochs = 8, batch_size = 100)
Epoch 1/8
34644/34644 [==============================] - 1s - loss: 4.3911     
Epoch 2/8
34644/34644 [==============================] - 1s - loss: 1.2735     
Epoch 3/8
34644/34644 [==============================] - 1s - loss: 1.1401     
Epoch 4/8
34644/34644 [==============================] - 1s - loss: 1.0667     
Epoch 5/8
34644/34644 [==============================] - 1s - loss: 1.0511     
Epoch 6/8
34644/34644 [==============================] - 1s - loss: 1.0271     
Epoch 7/8
34644/34644 [==============================] - 1s - loss: 0.9964     
Epoch 8/8
34644/34644 [==============================] - 1s - loss: 1.0003   
> 
> pred_reg <- model %>% predict(test.x, batch_size=100)
> library(Metrics)
> rmse(pred_reg, test.y)
[1] 0.8589696

意外とそこまで悪くない結果になりました。少なくとも{mxnet}でベイズ最適化まで突っ込んで試行錯誤した割にあまり良いスコアにならなかったことを考えると上出来かなと。

ド定番MNISTをDNNでやってみる

もう見たまんまです。これはDNNというよりMLP（Multi-Layer Perceptron: 多層パーセプトロン）と言った方が正しいのだろうとは思いますが。

> train <- read.csv('short_mnist_train.csv', header=T, sep=',')
> test <- read.csv('short_mnist_test.csv', header=T, sep=',')
> train.x <- as.matrix(train[,-1]/255)
> test.x <- as.matrix(test[,-1]/255)
> train.y <- train[,1] %>% to_categorical(num_classes = 10)
> test.y <- test[,1] %>% to_categorical(num_classes = 10)

> library(keras)
> model <- keras_model_sequential()
> model %>%
+ layer_dense(units = 128, input_shape=c(784)) %>%
+ layer_activation(activation = 'relu') %>%
+ # layer_dropout(rate = 0.5) %>%
+ layer_dense(units = 64) %>%
+ layer_activation(activation = 'relu') %>%
+ # layer_dropout(rate = 0.5) %>%
+ layer_dense(units = 10) %>%
+ layer_activation(activation = 'softmax') %>%
+ compile(
+ loss = 'categorical_crossentropy',
+ optimizer = optimizer_sgd(lr = 0.07, decay = 1e-6, momentum = 0.9, nesterov = TRUE)
+ )
> t <- proc.time()
> model %>% fit(train.x, train.y, epochs = 10, batch_size = 100)
Epoch 1/10
5000/5000 [==============================] - 0s - loss: 0.7911     
Epoch 2/10
5000/5000 [==============================] - 0s - loss: 0.2737     
Epoch 3/10
5000/5000 [==============================] - 0s - loss: 0.1813     
Epoch 4/10
5000/5000 [==============================] - 0s - loss: 0.1218     
Epoch 5/10
5000/5000 [==============================] - 0s - loss: 0.0807     
Epoch 6/10
5000/5000 [==============================] - 0s - loss: 0.0583     
Epoch 7/10
5000/5000 [==============================] - 0s - loss: 0.0358     
Epoch 8/10
5000/5000 [==============================] - 0s - loss: 0.0238     
Epoch 9/10
5000/5000 [==============================] - 0s - loss: 0.0162     
Epoch 10/10
5000/5000 [==============================] - 0s - loss: 0.0104     
> proc.time() - t
   ユーザ   システム       経過  
     5.216      0.538      3.332 
> pred_class <- model %>% predict(test.x, batch_size=100)
> pred_label <- t(max.col(pred_class))
> table(test[,1], pred_label)
   pred_label
     1  2  3  4  5  6  7  8  9 10
  0 94  0  0  0  0  1  2  1  1  1
  1  0 97  2  0  0  0  0  0  1  0
  2  0  0 96  1  0  0  1  2  0  0
  3  0  1  2 91  0  3  0  0  2  1
  4  0  0  0  0 95  0  2  1  0  2
  5  0  1  0  1  0 95  2  0  1  0
  6  0  0  0  0  0  2 97  0  1  0
  7  0  0  0  0  1  0  0 96  0  3
  8  0  0  1  1  1  0  0  0 97  0
  9  0  0  0  0  2  1  1  1  0 95
> sum(diag(table(test[,1], pred_label)))/nrow(test)
[1] 0.953 # seedを決めていないので多少変動するはず

一応、それなりには走ります。またそれなりの精度にはなります。ちなみに、{mxnet}で全く同じネットワークを試してみるとこうなります。

> train<-data.matrix(train)
> test<-data.matrix(test)
> train.x<-train[,-1]
> train.y<-train[,1]
> train.x<-t(train.x/255)
> test_org<-test
> test<-test[,-1]
> test<-t(test/255)
> 
> library(mxnet)
> data <- mx.symbol.Variable("data")
> fc1 <- mx.symbol.FullyConnected(data, name="fc1", num_hidden=128)
> act1 <- mx.symbol.Activation(fc1, name="relu1", act_type="relu")
> fc2 <- mx.symbol.FullyConnected(act1, name="fc2", num_hidden=64)
> act2 <- mx.symbol.Activation(fc2, name="relu2", act_type="relu")
> fc3 <- mx.symbol.FullyConnected(act2, name="fc3", num_hidden=10)
> softmax <- mx.symbol.SoftmaxOutput(fc3, name="sm")
> devices <- mx.cpu()
> mx.set.seed(0)
> t <- proc.time()
> model <- mx.model.FeedForward.create(softmax, X=train.x, y=train.y,
+                                      ctx=devices, num.round=10, array.batch.size=100,
+                                      learning.rate=0.07, momentum=0.9,  eval.metric=mx.metric.accuracy,
+                                      initializer=mx.init.uniform(0.07),
+                                      epoch.end.callback=mx.callback.log.train.metric(100))
Start training with 1 devices
[1] Train-accuracy=0.470204081632653
[2] Train-accuracy=0.8326
[3] Train-accuracy=0.9052
[4] Train-accuracy=0.9278
[5] Train-accuracy=0.9466
[6] Train-accuracy=0.9568
[7] Train-accuracy=0.9646
[8] Train-accuracy=0.9756
[9] Train-accuracy=0.9888
[10] Train-accuracy=0.9926
> 
> proc.time() - t
   ユーザ   システム       経過  
     2.540      0.601      1.718 
> preds <- predict(model, test, ctx=devices)
> pred.label <- max.col(t(preds)) - 1
> table(test_org[,1],pred.label)
   pred.label
      0   1   2   3   4   5   6   7   8   9
  0  94   0   1   0   0   0   3   0   1   1
  1   0 100   0   0   0   0   0   0   0   0
  2   0   0  97   1   1   0   0   1   0   0
  3   0   0   2  91   0   3   0   1   2   1
  4   0   0   0   0  95   0   2   1   0   2
  5   0   1   0   1   0  96   0   0   2   0
  6   1   0   0   0   1   3  94   0   1   0
  7   0   0   0   0   0   0   0 100   0   0
  8   0   0   1   1   1   1   0   0  96   0
  9   0   0   0   0   6   1   0   4   0  89
> sum(diag(table(test_org[,1],pred.label)))/1000
[1] 0.952

あれれ、{mxnet}の方が精度は同じくらいなのに2秒以上速いですね（笑）。これはPythonバージョンのKerasだとちょっと考えにくい感じかなぁと。

MNISTのCNNをやろうとしたらコケた話

最後に、MNISTのCNNをやってみようと思います。なのですが。。。

library(keras)

train <- read.csv('short_mnist_train.csv', header=T, sep=',')
test <- read.csv('short_mnist_test.csv', header=T, sep=',')

x_train <- array(train[,-1]/255, dim=c(5000, 28, 28, 1))
y_train <- train[,1] %>% 
  to_categorical(num_classes = 10)
x_test <- array(test[,-1]/255, dim=c(1000, 28, 28, 1))
y_test <- test[,1] %>% 
  to_categorical(num_classes = 10)

# create model
model <- keras_model_sequential()

model %>% 
  layer_conv_2d(filters = 32, kernel_size = c(3,3), activation = 'relu', 
                input_shape = c(28,28, 1)) %>% 
  layer_conv_2d(filters = 64, kernel_size = c(3,3), activation = 'relu') %>% 
  layer_max_pooling_2d(pool_size = c(2,2)) %>% 
  layer_dropout(rate = 0.25) %>% 
  layer_conv_2d(filters = 64, kernel_size = c(3,3), activation = 'relu') %>% 
  layer_conv_2d(filters = 64, kernel_size = c(3,3), activation = 'relu') %>% 
  layer_max_pooling_2d(pool_size = c(2,2)) %>% 
  layer_dropout(rate = 0.25) %>% 
  layer_flatten() %>% 
  layer_dense(units = 128, activation = 'relu') %>% 
  layer_dropout(rate = 0.25) %>% 
  layer_dense(units = 10, activation = 'softmax') %>% 
  compile(
    loss = 'categorical_crossentropy', 
    optimizer = optimizer_sgd(lr = 0.01, decay = 1e-6, momentum = 0.9, nesterov = TRUE)
  )
  
# train
model %>% fit(x_train, y_train, batch_size = 32, epochs = 10)

# evaluate
score <- model %>% evaluate(x_test, y_test, batch_size = 32)

これで走るはずなんですが。。。出来上がったx_train, x_testのarrayの容量を見たら147GB, 6GBもあってローカルでは走らないと判明orz　せめて{Matrix}のsparse.model.matrix使わせて欲しいなぁと。。。もしくはFAQで推奨されているtrain_on_batchを使うとか、time_distributedで軽量化するとかする必要がありそうです。ちょっとここは詳しい人に解決策教えてもらわないと。。。

追記1：ExampleのMNIST x CNNの例は走りました

ExampleにMNIST x CNNの例が載っているんですが、こちらは確かに走りました。

library(keras)

batch_size <- 128
num_classes <- 10
epochs <- 12

# input image dimensions
img_rows <- 28
img_cols <- 28

# the data, shuffled and split between train and test sets
mnist <- dataset_mnist()
x_train <- mnist$train$x
y_train <- mnist$train$y
x_test <- mnist$test$x
y_test <- mnist$test$y

x_train <- array(as.numeric(x_train), dim = c(dim(x_train)[[1]], img_rows, img_cols, 1))
x_test <- array(as.numeric(x_test), dim = c(dim(x_test)[[1]], img_rows, img_cols, 1))
input_shape <- c(img_rows, img_cols, 1)

x_train <- x_train / 255
x_test <- x_test / 255

cat('x_train_shape:', dim(x_train), '\n')
cat(dim(x_train)[[1]], 'train samples\n')
cat(dim(x_test)[[1]], 'test samples\n')

# convert class vectors to binary class matrices
y_train <- to_categorical(y_train, num_classes)
y_test <- to_categorical(y_test, num_classes)

# define model
model <- keras_model_sequential()
model %>%
  layer_conv_2d(filters = 32, kernel_size = c(3,3), activation = 'relu',
                input_shape = input_shape) %>% 
  layer_conv_2d(filters = 64, kernel_size = c(3,3), activation = 'relu') %>% 
  layer_max_pooling_2d(pool_size = c(2, 2)) %>% 
  layer_dropout(rate = 0.25) %>% 
  layer_flatten() %>% 
  layer_dense(units = 128, activation = 'relu') %>% 
  layer_dropout(rate = 0.5) %>% 
  layer_dense(units = num_classes, activation = 'softmax')

# compile model
model %>% compile(
  loss = loss_categorical_crossentropy,
  optimizer = optimizer_adadelta(),
  metrics = c('accuracy')
)

# train and evaluate
model %>% fit(
  x_train, y_train,
  batch_size = batch_size,
  epochs = epochs,
  verbose = 1,
  validation_data = list(x_test, y_test)
)
scores <- model %>% evaluate(
  x_test, y_test, verbose = 0
)

cat('Test loss:', scores[[1]], '\n')
cat('Test accuracy:', scores[[2]], '\n')

Train on 60000 samples, validate on 10000 samples
Epoch 1/12
2017-06-03 14:11:45.945824: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-06-03 14:11:45.945842: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-06-03 14:11:45.945848: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-06-03 14:11:45.945853: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-06-03 14:11:45.945859: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
60000/60000 [==============================] - 197s - loss: 0.3336 - acc: 0.8989 - val_loss: 0.0789 - val_acc: 0.9761
Epoch 2/12
60000/60000 [==============================] - 197s - loss: 0.1136 - acc: 0.9667 - val_loss: 0.0554 - val_acc: 0.9818
Epoch 3/12
60000/60000 [==============================] - 187s - loss: 0.0853 - acc: 0.9749 - val_loss: 0.0439 - val_acc: 0.9855
Epoch 4/12
60000/60000 [==============================] - 192s - loss: 0.0713 - acc: 0.9793 - val_loss: 0.0389 - val_acc: 0.9873
Epoch 5/12
60000/60000 [==============================] - 179s - loss: 0.0638 - acc: 0.9817 - val_loss: 0.0347 - val_acc: 0.9884
Epoch 6/12
60000/60000 [==============================] - 181s - loss: 0.0549 - acc: 0.9838 - val_loss: 0.0326 - val_acc: 0.9894
Epoch 7/12
60000/60000 [==============================] - 184s - loss: 0.0511 - acc: 0.9849 - val_loss: 0.0309 - val_acc: 0.9889
Epoch 8/12
60000/60000 [==============================] - 182s - loss: 0.0457 - acc: 0.9864 - val_loss: 0.0301 - val_acc: 0.9891
Epoch 9/12
60000/60000 [==============================] - 181s - loss: 0.0437 - acc: 0.9866 - val_loss: 0.0280 - val_acc: 0.9915
Epoch 10/12
60000/60000 [==============================] - 175s - loss: 0.0414 - acc: 0.9878 - val_loss: 0.0283 - val_acc: 0.9908
Epoch 11/12
60000/60000 [==============================] - 182s - loss: 0.0372 - acc: 0.9890 - val_loss: 0.0274 - val_acc: 0.9911
Epoch 12/12
60000/60000 [==============================] - 193s - loss: 0.0377 - acc: 0.9889 - val_loss: 0.0270 - val_acc: 0.9913
> scores <- model %>% evaluate(
+     x_test, y_test, verbose = 0
+ )
> 
> cat('Test loss:', scores[[1]], '\n')
Test loss: 0.0269964 
> cat('Test accuracy:', scores[[2]], '\n')
Test accuracy: 0.9913

ちゃんとExamplesの説明通り、99%以上の精度が出ています。

追記2：元のMNISTのCNNも走りました

ようやく分かりました。

TJOさんの例でも，x_train <- array(as.matrix(train[,-1]/255), dim=c(5000, 28, 28, 1))などとすると正しい配列ができて動くようです．私も普段あまり配列は使わないので，こういうのは地味にハマりポイントですね...．
— nakamichi (@__nakamichi__) 2017年6月3日

ということで、arrayに渡す時にas.matrixするとtrainで30MBぐらい＆testで6MBぐらいということで、ちゃんとまともな容量のarrayになって回るようです。ごめんなさいorz

library(keras)

train <- read.csv('short_mnist_train.csv', header=T, sep=',')
test <- read.csv('short_mnist_test.csv', header=T, sep=',')

x_train <- array(<span style="color: #ff0000">as.matrix</span>(train[,-1]/255), dim=c(5000, 28, 28, 1))
y_train <- train[,1] %>% 
    to_categorical(num_classes = 10)
x_test <- array(<span style="color: #ff0000">as.matrix</span>(test[,-1]/255), dim=c(1000, 28, 28, 1))
y_test <- test[,1] %>% 
    to_categorical(num_classes = 10)

# create model
model <- keras_model_sequential()

model %>% 
    layer_conv_2d(filters = 32, kernel_size = c(3,3), activation = 'relu', 
                  input_shape = c(28,28, 1)) %>% 
    layer_conv_2d(filters = 64, kernel_size = c(3,3), activation = 'relu') %>% 
    layer_max_pooling_2d(pool_size = c(2,2)) %>% 
    layer_dropout(rate = 0.25) %>% 
    layer_conv_2d(filters = 64, kernel_size = c(3,3), activation = 'relu') %>% 
    layer_conv_2d(filters = 64, kernel_size = c(3,3), activation = 'relu') %>% 
    layer_max_pooling_2d(pool_size = c(2,2)) %>% 
    layer_dropout(rate = 0.25) %>% 
    layer_flatten() %>% 
    layer_dense(units = 128, activation = 'relu') %>% 
    layer_dropout(rate = 0.25) %>% 
    layer_dense(units = 10, activation = 'softmax') %>% 
    compile(
        loss = 'categorical_crossentropy', 
        optimizer = optimizer_sgd(lr = 0.01, decay = 1e-6, momentum = 0.9, nesterov = TRUE)
    )

# train
model %>% fit(x_train, y_train, batch_size = 32, epochs = 10)

# evaluate
score <- model %>% evaluate(x_test, y_test, batch_size = 32)

Epoch 1/10
5000/5000 [==============================] - 20s - loss: 1.1239    
Epoch 2/10
5000/5000 [==============================] - 25s - loss: 0.2766    
Epoch 3/10
5000/5000 [==============================] - 23s - loss: 0.1871    
Epoch 4/10
5000/5000 [==============================] - 21s - loss: 0.1420    
Epoch 5/10
5000/5000 [==============================] - 23s - loss: 0.1142    
Epoch 6/10
5000/5000 [==============================] - 23s - loss: 0.1089    
Epoch 7/10
5000/5000 [==============================] - 22s - loss: 0.0929    
Epoch 8/10
5000/5000 [==============================] - 22s - loss: 0.0819    
Epoch 9/10
5000/5000 [==============================] - 22s - loss: 0.0674    
Epoch 10/10
5000/5000 [==============================] - 21s - loss: 0.0616    
> 
> # evaluate
> score <- model %>% evaluate(x_test, y_test, batch_size = 32)
 992/1000 [============================>.] - ETA: 0s
> score
[1] 0.05340803
> pred_class <- model %>% predict(x_test, batch_size=100)
> pred_label <- t(max.col(pred_class))
> table(test[,1], pred_label)
   pred_label
      1   2   3   4   5   6   7   8   9  10
  0  98   0   0   0   0   0   2   0   0   0
  1   0 100   0   0   0   0   0   0   0   0
  2   0   0  99   0   0   0   0   1   0   0
  3   0   0   0 100   0   0   0   0   0   0
  4   0   2   0   0  97   0   1   0   0   0
  5   0   0   0   0   0 100   0   0   0   0
  6   1   0   0   0   0   2  97   0   0   0
  7   0   0   0   1   0   0   0  99   0   0
  8   0   1   1   1   0   1   0   0  96   0
  9   0   0   0   0   1   1   0   1   0  97
> sum(diag(table(test[,1], pred_label)))/nrow(test)
[1] 0.983

ちゃんと回りました。スコアも学習5000行、テスト1000行にしては悪くない結果になっていると思います。なお、これとほぼ同じ？ネットワークを{mxnet}で作って回してみたら。。。何故か同じ畳み込み層の連続＆学習係数＆ミニバッチ数＆エポック数では全然学習しないので、ちょっといじってありますがこんな感じになります。

> train <- read.csv('https://github.com/ozt-ca/tjo.hatenablog.samples/raw/master/r_samples/public_lib/jp/mnist_reproduced/short_prac_train.csv')
> test <- read.csv('https://github.com/ozt-ca/tjo.hatenablog.samples/raw/master/r_samples/public_lib/jp/mnist_reproduced/short_prac_test.csv')
> train<-data.matrix(train)
> test<-data.matrix(test)
> train.x<-train[,-1]
> train.y<-train[,1]
> train.x<-t(train.x/255)
> test_org<-test
> test<-test[,-1]
> test<-t(test/255)
> 
> library(mxnet)
> 
> data <- mx.symbol.Variable("data")
> devices<-mx.cpu()
> # first conv
> conv1 <- mx.symbol.Convolution(data=data, kernel=c(3,3), num_filter=32)
> tanh1 <- mx.symbol.Activation(data=conv1, act_type="relu")
> pool1 <- mx.symbol.Pooling(data=tanh1, pool_type="max",
+                            kernel=c(2,2), stride=c(2,2))
> drop1 <- mx.symbol.Dropout(data=pool1,p=0.5)
> # second conv
> conv2 <- mx.symbol.Convolution(data=drop1, kernel=c(3,3), num_filter=64)
> tanh2 <- mx.symbol.Activation(data=conv2, act_type="relu")
> pool2 <- mx.symbol.Pooling(data=tanh2, pool_type="max",
+                            kernel=c(2,2), stride=c(2,2))
> drop2 <- mx.symbol.Dropout(data=pool2,p=0.5)
> # first fullc
> flatten <- mx.symbol.Flatten(data=drop2)
> fc1 <- mx.symbol.FullyConnected(data=flatten, num_hidden=128)
> tanh4 <- mx.symbol.Activation(data=fc1, act_type="relu")
> drop4 <- mx.symbol.Dropout(data=tanh4,p=0.5)
> # second fullc
> fc2 <- mx.symbol.FullyConnected(data=drop4, num_hidden=10)
> # loss
> lenet <- mx.symbol.SoftmaxOutput(data=fc2)
> train.array <- train.x
> dim(train.array) <- c(28, 28, 1, ncol(train.x))
> test.array <- test
> dim(test.array) <- c(28, 28, 1, ncol(test))
> mx.set.seed(71)
> devices <- mx.cpu()
> tic <- proc.time()
> model <- mx.model.FeedForward.create(lenet, X=train.array, y=train.y,
+                                      ctx=devices, num.round=30, array.batch.size=100,
+                                      learning.rate=0.05, momentum=0.9, wd=0.00001,
+                                      eval.metric=mx.metric.accuracy,
+                                      epoch.end.callback=mx.callback.log.train.metric(100))
Start training with 1 devices
[1] Train-accuracy=0.0965306122448979
[2] Train-accuracy=0.0892
[3] Train-accuracy=0.0886
[4] Train-accuracy=0.0888
[5] Train-accuracy=0.0878
[6] Train-accuracy=0.0876
[7] Train-accuracy=0.0882
[8] Train-accuracy=0.089
[9] Train-accuracy=0.0924
[10] Train-accuracy=0.1272
[11] Train-accuracy=0.4322
[12] Train-accuracy=0.676
[13] Train-accuracy=0.7794
[14] Train-accuracy=0.8334
[15] Train-accuracy=0.8672
[16] Train-accuracy=0.8854
[17] Train-accuracy=0.9046
[18] Train-accuracy=0.912
[19] Train-accuracy=0.9182
[20] Train-accuracy=0.9234
[21] Train-accuracy=0.925
[22] Train-accuracy=0.9336
[23] Train-accuracy=0.9342
[24] Train-accuracy=0.933
[25] Train-accuracy=0.9442
[26] Train-accuracy=0.9402
[27] Train-accuracy=0.9384
[28] Train-accuracy=0.9502
[29] Train-accuracy=0.9466
[30] Train-accuracy=0.9468
> print(proc.time() - tic)
   ユーザ   システム       経過  
   528.029      3.727    475.477 
> preds <- predict(model, test.array, ctx=devices)
> pred.label <- max.col(t(preds)) - 1
> table(test_org[,1],pred.label)
   pred.label
      0   1   2   3   4   5   6   7   8   9
  0  98   0   0   0   0   0   2   0   0   0
  1   0 100   0   0   0   0   0   0   0   0
  2   0   0  97   2   0   0   0   1   0   0
  3   0   0   0  99   0   0   0   0   1   0
  4   0   2   0   0  96   0   0   1   0   1
  5   0   1   0   0   0  99   0   0   0   0
  6   0   0   0   0   0   1  98   0   1   0
  7   0   0   0   0   0   0   0 100   0   0
  8   0   2   1   0   1   0   0   0  96   0
  9   0   0   0   0   1   0   0   2   0  97
> sum(diag(table(test_org[,1],pred.label)))/1000
[1] 0.98

{keras}が222秒でACC 0.983だったのに対して、{mxnet}は528秒でACC 0.98。同じSGDで回してるはずなんですが（{mxnet}だとデフォルトはSGD）、ちょっと結構な差がついた気がします。

感想

（追記前の感想）
同じことをやっても{mxnet}の方が{keras}より速いし、何と言っても{keras}でCNN回そうとするとshort version MNISTでもOOMでクラッシュするし。。。{mxnet}ならバカでっかいarrayにしなくても自前で2次元に変換して畳み込みやってくれるし。。。ということで、RでKerasを使うのはあまり相性が良くないんじゃないかと思った次第です。

ということで実際に回り切ったのを確認した上で改めて感想を書くと、全く同じネットワークを組んで比較した感じだと（実はPython側でKerasを触っていた時も思っていましたが）{keras}の方が学習効率も良く高精度のモデルが組み上がる印象があります。ただ、RっぽさのあるところとRっぽくないところが混在している感じで、若干やりにくいなぁ感が。。。

（追記前の感想）
やっぱりKerasはPythonでTensorFlowバックエンドで動かして何ぼだと思いますので、KerasでDeep Learningやりたい人はPythonでやりましょう。お粗末様でしたm(_ _)m

この感想は正直言って変わらないですorz　ぶっちゃけ単なるラッパーとしてではなく、R向けにソースから書き直してくれたらいいなぁと思ったんですが、それはTensorFlowのR版を作るというのと同じことなので過剰な要望かなぁと。。。ということで、やはり皆さんPythonもしっかり勉強しましょう（汗）。お疲れ様でした。

*1:自社のフレームワークに対してあるまじき発言ですが笑