RにTorchとLightGBMがやってきた - 渋谷駅前で働くデータサイエンティストのブログ

f:id:TJO:20201004154349p:plain

これまで、RとPythonは両方使える人が少なくないながらも開発陣やコミュニティの思想が違うせいもあってか、「Rは統計学向け」「Pythonは機械学習向け」的な住み分けが年々進み、特に機械学習関連の重要なフレームワーク・ライブラリ類はPython向けのみがリリースされることが多く、R向けにはリリースされないということが常態化している印象がありました。

そんな中、この9月にPythonの機械学習 OSSを代表する2つのライブラリが相次いでR版パッケージを発表したので、個人的にはなかなか驚きました。中には「この2つがRに来たからにはもうPythonは触らない」と豪語する過激派の方もいらっしゃるようですが（笑）、それはさておき個人的な備忘録としてこの2つのR版パッケージを試してみた記録を記事として残しておこうと思います。

なお、以下のモデリングはほぼ何もチューニングを行っておりません。あくまでも「こうすれば回ります」という動作確認以外の何物をも目的としておりませんので、チューニングなどは皆さんのお手元で皆さんの責任で行っていただくようお願いいたしますm(_ _)m　またいつもながらですが、コード実装部分に誤りなどあれば是非コメント欄などでご指摘いただけると有難いです。

実行環境
PyTorchのCRAN版{torch}パッケージ
- KMNIST（くずし字MNIST）チュートリアル
- {torch}のWine Qualityデータセットへの適用
R版{lightgbm}パッケージ
- agaricusデータセットを用いたチュートリアル
- {lightgbm}のWine Qualityデータセットへの適用
感想など

実行環境

sessionInfo()
#>R version 3.6.3 (2020-02-29)
#>Platform: x86_64-apple-darwin15.6.0 (64-bit)
#>Running under: macOS Catalina 10.15.6
#>
#>Matrix products: default
#>BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
#>LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
#>
#>locale:
#>[1] ja_JP.UTF-8/ja_JP.UTF-8/ja_JP.UTF-8/C/ja_JP.UTF-8/ja_JP.UTF-8

上記の通りです。Mac Book ProでGPUは挿していないので、原則CPUモードで全て実行しています。

library(randomForest)
# Load the datasets
w_train <- read.csv("https://github.com/ozt-ca/tjo.hatenablog.samples/raw/master/r_samples/public_lib/jp/exp_uci_datasets/wine/wine_red_train.csv")
w_test <- read.csv("https://github.com/ozt-ca/tjo.hatenablog.samples/raw/master/r_samples/public_lib/jp/exp_uci_datasets/wine/wine_red_test.csv")

# Transform "quality" into binary
w_train$quality[w_train$quality < 6] <- 0
w_test$quality[w_test$quality < 6] <- 0
w_train$quality[w_train$quality > 0] <- 1
w_test$quality[w_test$quality > 0] <- 1

c_scale <- function(x){
  x <- (x - min(x)) / (max(x) - min(x))
}

w_train[, -12] <- apply(w_train[, -12], 2, c_scale)
w_test[, -12] <- apply(w_test[, -12], 2, c_scale)

rf <- randomForest(as.factor(quality) ~., w_train)
table(w_test$quality, predict(rf, newdata = w_test))
   
     0  1
  0 66  8
  1 38 48
sum(diag(table(w_test$quality, predict(rf, newdata = w_test)))) / nrow(w_test)
#>[1] 0.70625

ついでに、それぞれのパッケージのチュートリアルとは別のベンチマークとして、時々このブログで使っているWine Qualityデータセットの赤ワインの方を{randomForest}で回した時のACCを置いておきます。

PyTorchのCRAN版{torch}パッケージ

RStudio社の尽力で、主要なDeep Learning フレームワークで最後までRに入ってなかったPyTorchというかTorchが、ついにCRANパッケージとしてRにやってきました。その名も{torch}です。インストール方法は普通にリンク先に書いてある通りにやればOKです。ただし{torchvision}はCRANからは入れられないので、GitHubからremotes::install_githubを使って入れます。すなわち、

install.packages("torch")
# remotes::install_github("mlverse/torch") # ダメならこちらで
remotes::install_github("mlverse/torchvision")

とやればOKです。バージョン依存関係などが気になるようなら、全てGitHubから入れてしまった方が確実だと思います。

KMNIST（くずし字MNIST）チュートリアル

上記リンク先にそのままコード例として紹介されているのがKMNIST（くずし字MNIST）チュートリアルです。これは僕も何度かお目にかかったことのあるTarin Clanuwatさんが中心となって構築した、日本のくずし字文献判読モデル学習のためのオープンデータセットを用いたチュートリアルで、使い古されたオリジナルのMNISTよりも難易度が高いため、より各モデルの特徴がつかみやすくなっています。普通にCNNを組んで回すだけですが、同様のネットワーク実装は他のフレームワークでも良く見かけるため、Torchの特徴が良く見て取れるのではないかと思います。

なお冒頭で予めお断りした通りCPUモードで実行しているため、以下の{torch}のコードは全てdevice = "cpu"としています。GPUを挿している方はdevice = "cuda"に換えてください。ちなみに以下の実行結果に至るまでに実はOpenMP絡みの謎のバグを踏んでしまい、上記リンクのようにしばらく作者のdfabel氏とissueであれこれやり取りしていたのでした。

Sys.setenv(KMP_DUPLICATE_LIB_OK="TRUE") # 僕の環境では必要でした：不具合の出る方は削除してください

library(torch)
library(torchvision)

train_ds <- kmnist_dataset(
  ".",
  download = TRUE,
  train = TRUE,
  transform = transform_to_tensor
)

test_ds <- kmnist_dataset(
  ".",
  download = TRUE,
  train = FALSE,
  transform = transform_to_tensor
)

train_dl <- dataloader(train_ds, batch_size = 32, shuffle = TRUE)
test_dl <- dataloader(test_ds, batch_size = 32)

par(mfrow = c(4,8), mar = rep(0, 4))
images <- train_dl$.iter()$.next()[[1]][1:32, 1, , ] 
images %>%
  purrr::array_tree(1) %>%
  purrr::map(as.raster) %>%
  purrr::iwalk(~{plot(.x)})

net <- nn_module(
  
  "KMNIST-CNN",
  
  initialize = function() {
    # in_channels, out_channels, kernel_size, stride = 1, padding = 0
    self$conv1 <- nn_conv2d(1, 32, 3)
    self$conv2 <- nn_conv2d(32, 64, 3)
    self$dropout1 <- nn_dropout2d(0.25)
    self$dropout2 <- nn_dropout2d(0.5)
    self$fc1 <- nn_linear(9216, 128)
    self$fc2 <- nn_linear(128, 10)
  },
  
  forward = function(x) {
    x %>% 
      self$conv1() %>%
      nnf_relu() %>%
      self$conv2() %>%
      nnf_relu() %>%
      nnf_max_pool2d(2) %>%
      self$dropout1() %>%
      torch_flatten(start_dim = 2) %>%
      self$fc1() %>%
      nnf_relu() %>%
      self$dropout2() %>%
      self$fc2()
  }
)

model <- net()
model$to(device = "cpu")

optimizer <- optim_adam(model$parameters)

for (epoch in 1:5) {
  
  l <- c()
  
  for (b in enumerate(train_dl)) {
    # make sure each batch's gradient updates are calculated from a fresh start
    optimizer$zero_grad()
    # get model predictions
    output <- model(b[[1]]$to(device = "cpu"))
    # calculate loss
    loss <- nnf_cross_entropy(output, b[[2]]$to(device = "cpu"))
    # calculate gradient
    loss$backward()
    # apply weight updates
    optimizer$step()
    # track losses
    l <- c(l, loss$item())
  }
  
  cat(sprintf("Loss at epoch %d: %3f\n", epoch, mean(l)))
}
#> Loss at epoch 1: 1.807057
#> Loss at epoch 2: 1.550666
#> Loss at epoch 3: 1.503203
#> Loss at epoch 4: 1.468439
#> Loss at epoch 5: 1.450541

model$eval()

test_losses <- c()
total <- 0
correct <- 0

for (b in enumerate(test_dl)) {
  output <- model(b[[1]]$to(device = "cpu"))
  labels <- b[[2]]$to(device = "cpu")
  loss <- nnf_cross_entropy(output, labels)
  test_losses <- c(test_losses, loss$item())
  # torch_max returns a list, with position 1 containing the values 
  # and position 2 containing the respective indices
  predicted <- torch_max(output$data(), dim = 2)[[2]]
  total <- total + labels$size(1)
  # add number of correct classifications in this batch to the aggregate
  correct <- correct + (predicted == labels)$sum()$item()
}

mean(test_losses)
#> [1] 1.548806

test_accuracy <-  correct/total
test_accuracy
#> [1] 0.9365

公式チュートリアルではACC 0.9449なので僅かに下回りますが、性能としては十分かなと思います。ともあれ、これで{torch}できちんとCNNが回ることが確認できました。

{torch}のWine Qualityデータセットへの適用

PyTorchを使ったことがなかった*1ので知らなかったんですが、Torch系列は"Dataset"の作り方が重要なんですね。これが結構Rユーザーにとっては複雑な印象で、Pythonでクラスを書いたことがない人には難しいかもしれません*2。

一応、{torch}のドキュメントにも簡単ながらDatasetの作り方が書いてあるので、そもそもTorchに初めて触るという人は先にこちらを読んでおくことをお薦めします。ここさえ何とかなれば、後は他のDeep Learning フレームワークを使ったことのある人なら簡単に回せると思います。

library(torch)
Sys.setenv(KMP_DUPLICATE_LIB_OK="TRUE")

# Load the datasets
w_train <- read.csv("https://github.com/ozt-ca/tjo.hatenablog.samples/raw/master/r_samples/public_lib/jp/exp_uci_datasets/wine/wine_red_train.csv")
w_test <- read.csv("https://github.com/ozt-ca/tjo.hatenablog.samples/raw/master/r_samples/public_lib/jp/exp_uci_datasets/wine/wine_red_test.csv")

# Transform "quality" into binary
w_train$quality[w_train$quality < 6] <- 0
w_test$quality[w_test$quality < 6] <- 0
w_train$quality[w_train$quality > 0] <- 1
w_test$quality[w_test$quality > 0] <- 1

c_scale <- function(x){
  x <- (x - min(x)) / (max(x) - min(x))
}

w_train[, -12] <- apply(w_train[, -12], 2, c_scale)
w_test[, -12] <- apply(w_test[, -12], 2, c_scale)

# Create {torch} dataset

df_dataset <- dataset(
  
  "wine",

    initialize = function(df, response_variable) {
    self$df <- df[,-which(names(df) == response_variable)]
    self$response_variable <- df[[response_variable]]
  },
 
   .getitem = function(index) {
    response <- torch_tensor(self$response_variable[index])
    x <- torch_tensor(as.numeric(self$df[index,]))
    
    list(x = x, y = response)
  },
  
  .length = function() {
    length(self$response_variable)
  }
  
)

wtrain_ds <- df_dataset(w_train, "quality")
wtest_ds <- df_dataset(w_test, "quality")

wtrain_dl <- dataloader(wtrain_ds, batch_size = 32, shuffle = T)
wtest_dl <- dataloader(wtest_ds, batch_size = 1, shuffle = F)

# Define a network

net <- nn_module(
  
  "wine_DNN",
  
  initialize = function() {
    self$fc1 <- nn_linear(11, 66)
    self$fc2 <- nn_linear(66, 44)
    self$fc3 <- nn_linear(44, 1)
    self$dropout <- nn_dropout(0.5)
  },
  
  forward = function(x) {
    x %>% 
      self$fc1() %>%
      nnf_relu() %>%
      self$fc2() %>%
      nnf_relu() %>%
      self$dropout() %>%
      self$fc3() %>%
      nnf_sigmoid()
  }
)

model <- net()
model$to(device = "cpu")

optimizer <- optim_adam(model$parameters)

# Run a learning iteration

for (epoch in 1:20) {
  
  l <- c()
  
  for (b in enumerate(wtrain_dl)) {
    optimizer$zero_grad()
    output <- model(b[[1]]$to(device = "cpu"))
    loss <- nnf_binary_cross_entropy_with_logits(output, b[[2]]$to(device = "cpu"))
    loss$backward()
    optimizer$step()
    l <- c(l, loss$item())
  }
  
  cat(sprintf("Loss at epoch %d: %3f\n", epoch, mean(l)))
}
#>Loss at epoch 1: 0.697528
#>Loss at epoch 2: 0.684255
#>Loss at epoch 3: 0.671300
#>Loss at epoch 4: 0.649464
#>Loss at epoch 5: 0.637502
#>Loss at epoch 6: 0.626071
#>Loss at epoch 7: 0.624771
#>Loss at epoch 8: 0.620353
#>Loss at epoch 9: 0.617436
#>Loss at epoch 10: 0.616738
#>Loss at epoch 11: 0.615517
#>Loss at epoch 12: 0.615118
#>Loss at epoch 13: 0.614372
#>Loss at epoch 14: 0.611524
#>Loss at epoch 15: 0.612358
#>Loss at epoch 16: 0.612271
#>Loss at epoch 17: 0.610837
#>Loss at epoch 18: 0.610578
#>Loss at epoch 19: 0.611047
#>Loss at epoch 20: 0.612309

# Predict labels for test data

model$eval()

i <- 1
pred_labels <- rep(0, nrow(w_test))

for (b in enumerate(wtest_dl)) {
  output <- model(b[[1]]$to(device = "cpu"))
  pred_labels[i] <- round(output$item(), 0)
  i <- i + 1
}

# Evaluate

table(w_test$quality, pred_labels)
   pred_labels
     0  1
  0 66  8
  1 42 44
(torch_acc <- sum(diag(table(w_test$quality, pred_labels))) / nrow(w_test))
#>[1] 0.6875

何もチューニングしていないので{randomForest}に一歩及ばない感じになっていますが、あくまでも「こんな感じでその辺のテーブルデータも回せますよ」という参考材料にしてくだされば有難いです。ちなみにテンソルの扱いが面倒なところはTensorFlowに瓜二つなので、ご注意ください。

R版{lightgbm}パッケージ

これまで基本的にPython版が主に使われてきたKaggle competitors御用達のLightGBMに、ついにCRAN版が登場してRユーザーにも使いやすくなったという話を聞いたので、ちょろっと試してみようと思います。。。というつもりで記事を書いていたら、何故か10月2日付でCRANからarchiveに回されてしまっていました。

ただしR版自体はarchive zipからのインストール及びGitHubからremotes::install_urlなどで引き続きインストールできます。

agaricusデータセットを用いたチュートリアル

以下"agaricus"データセットを使った公式のチュートリアルを回してみた結果です。

library(lightgbm)
library(methods)

# We load in the agaricus dataset
# In this example, we are aiming to predict whether a mushroom is edible
data(agaricus.train, package = "lightgbm")
data(agaricus.test, package = "lightgbm")
train <- agaricus.train
test <- agaricus.test

# The loaded data is stored in sparseMatrix, and label is a numeric vector in {0,1}
class(train$label)
#>[1] "numeric"
class(train$data)
#>[1] "dgCMatrix"
attr(,"package")
#>[1] "Matrix"

#--------------------Basic Training using lightgbm----------------
# This is the basic usage of lightgbm you can put matrix in data field
# Note: we are putting in sparse matrix here, lightgbm naturally handles sparse input
# Use sparse matrix when your feature is sparse (e.g. when you are using one-hot encoding vector)
print("Training lightgbm with sparseMatrix")
#>[1] "Training lightgbm with sparseMatrix"
bst <- lightgbm(
  data = train$data
  , label = train$label
  , num_leaves = 4L
  , learning_rate = 1.0
  , nrounds = 2L
  , objective = "binary"
)
#>[LightGBM] [Info] Number of positive: 3140, number of negative: 3373
#>[LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000501 seconds.
#>You can set `force_row_wise=true` to remove the overhead.
#>And if memory is not enough, you can set `force_col_wise=true`.
#>[LightGBM] [Info] Total Bins 214
#>[LightGBM] [Info] Number of data points in the train set: 6513, number of used features: 107
#>[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.482113 -> initscore=-0.071580
#>[LightGBM] [Info] Start training from score -0.071580
#>[1] "[1]:\ttrain's binary_logloss:0.198597"
#>[1] "[2]:\ttrain's binary_logloss:0.111535"

# Alternatively, you can put in dense matrix, i.e. basic R-matrix
print("Training lightgbm with Matrix")
#>[1] "Training lightgbm with Matrix"
bst <- lightgbm(
  data = as.matrix(train$data)
  , label = train$label
  , num_leaves = 4L
  , learning_rate = 1.0
  , nrounds = 2L
  , objective = "binary"
)
#>[LightGBM] [Info] Number of positive: 3140, number of negative: 3373
#>[LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.002130 seconds.
#>You can set `force_row_wise=true` to remove the overhead.
#>And if memory is not enough, you can set `force_col_wise=true`.
#>[LightGBM] [Info] Total Bins 214
#>[LightGBM] [Info] Number of data points in the train set: 6513, number of used features: 107
#>[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.482113 -> initscore=-0.071580
#>[LightGBM] [Info] Start training from score -0.071580
#>[1] "[1]:\ttrain's binary_logloss:0.198597"
#>[1] "[2]:\ttrain's binary_logloss:0.111535"

# You can also put in lgb.Dataset object, which stores label, data and other meta datas needed for advanced features
print("Training lightgbm with lgb.Dataset")
#>[1] "Training lightgbm with lgb.Dataset"
dtrain <- lgb.Dataset(
  data = train$data
  , label = train$label
)
bst <- lightgbm(
  data = dtrain
  , num_leaves = 4L
  , learning_rate = 1.0
  , nrounds = 2L
  , objective = "binary"
)
#>[LightGBM] [Info] Number of positive: 3140, number of negative: 3373
#>[LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.006867 seconds.
#>You can set `force_row_wise=true` to remove the overhead.
#>And if memory is not enough, you can set `force_col_wise=true`.
#>[LightGBM] [Info] Total Bins 214
#>[LightGBM] [Info] Number of data points in the train set: 6513, number of used features: 107
#>[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.482113 -> initscore=-0.071580
#>[LightGBM] [Info] Start training from score -0.071580
#>[1] "[1]:\ttrain's binary_logloss:0.198597"
#>[1] "[2]:\ttrain's binary_logloss:0.111535"

# Verbose = 0,1,2
print("Train lightgbm with verbose 0, no message")
#>[1] "Train lightgbm with verbose 0, no message"
bst <- lightgbm(
  data = dtrain
  , num_leaves = 4L
  , learning_rate = 1.0
  , nrounds = 2L
  , objective = "binary"
  , verbose = 0L
)
#>[LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000580 seconds.
#>You can set `force_row_wise=true` to remove the overhead.
#>And if memory is not enough, you can set `force_col_wise=true`.

print("Train lightgbm with verbose 1, print evaluation metric")
#>[1] "Train lightgbm with verbose 1, print evaluation metric"
bst <- lightgbm(
  data = dtrain
  , num_leaves = 4L
  , learning_rate = 1.0
  , nrounds = 2L
  , nthread = 2L
  , objective = "binary"
  , verbose = 1L
)
#>[LightGBM] [Info] Number of positive: 3140, number of negative: 3373
#>[LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000930 seconds.
#>You can set `force_row_wise=true` to remove the overhead.
#>And if memory is not enough, you can set `force_col_wise=true`.
#>[LightGBM] [Info] Total Bins 214
#>[LightGBM] [Info] Number of data points in the train set: 6513, number of used features: 107
#>[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.482113 -> initscore=-0.071580
#>[LightGBM] [Info] Start training from score -0.071580
#>[1] "[1]:\ttrain's binary_logloss:0.198597"
#>[1] "[2]:\ttrain's binary_logloss:0.111535"

print("Train lightgbm with verbose 2, also print information about tree")
#>[1] "Train lightgbm with verbose 2, also print information about tree"
bst <- lightgbm(
  data = dtrain
  , num_leaves = 4L
  , learning_rate = 1.0
  , nrounds = 2L
  , nthread = 2L
  , objective = "binary"
  , verbose = 2L
)
#>[LightGBM] [Info] Number of positive: 3140, number of negative: 3373
#>[LightGBM] [Debug] Dataset::GetMultiBinFromSparseFeatures: sparse rate 0.930600
#>[LightGBM] [Debug] Dataset::GetMultiBinFromAllFeatures: sparse rate 0.433362
#>[LightGBM] [Debug] init for col-wise cost 0.002197 seconds, init for row-wise cost 0.002363 seconds
#>[LightGBM] [Debug] col-wise cost 0.000221 seconds, row-wise cost 0.000317 seconds
#>[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.002901 seconds.
#>You can set `force_col_wise=true` to remove the overhead.
#>[LightGBM] [Info] Total Bins 214
#>[LightGBM] [Info] Number of data points in the train set: 6513, number of used features: 107
#>[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.482113 -> initscore=-0.071580
#>[LightGBM] [Info] Start training from score -0.071580
#>[LightGBM] [Debug] Trained a tree with leaves = 4 and max_depth = 3
#>[1] "[1]:\ttrain's binary_logloss:0.198597"
#>[LightGBM] [Debug] Trained a tree with leaves = 4 and max_depth = 3
#>[1] "[2]:\ttrain's binary_logloss:0.111535"

# You can also specify data as file path to a LibSVM/TCV/CSV format input
# Since we do not have this file with us, the following line is just for illustration
# bst <- lightgbm(
#     data = "agaricus.train.svm"
#     , num_leaves = 4L
#     , learning_rate = 1.0
#     , nrounds = 2L
#     , objective = "binary"
# )

#--------------------Basic prediction using lightgbm--------------
# You can do prediction using the following line
# You can put in Matrix, sparseMatrix, or lgb.Dataset
pred <- predict(bst, test$data)
err <- mean(as.numeric(pred > 0.5) != test$label)
print(paste("test-error=", err))
#>[1] "test-error= 0.0217256362507759"

#--------------------Save and load models-------------------------
# Save model to binary local file
lgb.save(bst, "lightgbm.model")

# Load binary model to R
bst2 <- lgb.load("lightgbm.model")
pred2 <- predict(bst2, test$data)

# pred2 should be identical to pred
print(paste("sum(abs(pred2-pred))=", sum(abs(pred2 - pred))))
#>[1] "sum(abs(pred2-pred))= 0"

#--------------------Advanced features ---------------------------
# To use advanced features, we need to put data in lgb.Dataset
dtrain <- lgb.Dataset(data = train$data, label = train$label, free_raw_data = FALSE)
dtest <- lgb.Dataset.create.valid(dtrain, data = test$data, label = test$label)

#--------------------Using validation set-------------------------
# valids is a list of lgb.Dataset, each of them is tagged with name
valids <- list(train = dtrain, test = dtest)

# To train with valids, use lgb.train, which contains more advanced features
# valids allows us to monitor the evaluation result on all data in the list
print("Train lightgbm using lgb.train with valids")
#>[1] "Train lightgbm using lgb.train with valids"
bst <- lgb.train(
  data = dtrain
  , num_leaves = 4L
  , learning_rate = 1.0
  , nrounds = 2L
  , valids = valids
  , nthread = 2L
  , objective = "binary"
)
#>[LightGBM] [Info] Number of positive: 3140, number of negative: 3373
#>[LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001222 seconds.
#>You can set `force_row_wise=true` to remove the overhead.
#>And if memory is not enough, you can set `force_col_wise=true`.
#>[LightGBM] [Info] Total Bins 214
#>[LightGBM] [Info] Number of data points in the train set: 6513, number of used features: 107
#>[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.482113 -> initscore=-0.071580
#>[LightGBM] [Info] Start training from score -0.071580
#>[1] "[1]:\ttrain's binary_logloss:0.198597\ttest's binary_logloss:0.204754"
#>[1] "[2]:\ttrain's binary_logloss:0.111535\ttest's binary_logloss:0.113096"

# We can change evaluation metrics, or use multiple evaluation metrics
print("Train lightgbm using lgb.train with valids, watch logloss and error")
#>[1] "Train lightgbm using lgb.train with valids, watch logloss and error"
bst <- lgb.train(
  data = dtrain
  , num_leaves = 4L
  , learning_rate = 1.0
  , nrounds = 2L
  , valids = valids
  , eval = c("binary_error", "binary_logloss")
  , nthread = 2L
  , objective = "binary"
)
#>[LightGBM] [Info] Number of positive: 3140, number of negative: 3373
#>[LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000811 seconds.
#>You can set `force_row_wise=true` to remove the overhead.
#>And if memory is not enough, you can set `force_col_wise=true`.
#>[LightGBM] [Info] Total Bins 214
#>[LightGBM] [Info] Number of data points in the train set: 6513, number of used features: 107
#>[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.482113 -> initscore=-0.071580
#>[LightGBM] [Info] Start training from score -0.071580
#>[1] "[1]:\ttrain's binary_error:0.0304007\ttrain's binary_logloss:0.198597\ttest's binary_error:0.0335196\ttest's binary_logloss:0.204754"
#>[1] "[2]:\ttrain's binary_error:0.0222632\ttrain's binary_logloss:0.111535\ttest's binary_error:0.0217256\ttest's binary_logloss:0.113096"

# lgb.Dataset can also be saved using lgb.Dataset.save
lgb.Dataset.save(dtrain, "dtrain.buffer")
#>[LightGBM] [Warning] File dtrain.buffer exists, cannot save binary to it

# To load it in, simply call lgb.Dataset
dtrain2 <- lgb.Dataset("dtrain.buffer")
bst <- lgb.train(
  data = dtrain2
  , num_leaves = 4L
  , learning_rate = 1.0
  , nrounds = 2L
  , valids = valids
  , nthread = 2L
  , objective = "binary"
)
 data$get_colnames() でエラー: 
  dim: cannot get dimensions before dataset has been constructed, please call lgb.Dataset.construct explicitly

# information can be extracted from lgb.Dataset using getinfo
label <- getinfo(dtest, "label")
pred <- predict(bst, test$data)
err <- as.numeric(sum(as.integer(pred > 0.5) != label)) / length(label)
print(paste("test-error=", err))
#>[1] "test-error= 0.0217256362507759"

途中エラーを吐いているのが気になりますが、一応回ってくれているみたいです。

{lightgbm}のWine Qualityデータセットへの適用

{torch}同様にWine Qualityデータセットでもやってみます。Matrix / SparseMatrix形式でないとデータを受け付けないところ以外はトラップはほぼないので、回すのは簡単だと思います。

library(lightgbm)

# Load the datasets
w_train <- read.csv("https://github.com/ozt-ca/tjo.hatenablog.samples/raw/master/r_samples/public_lib/jp/exp_uci_datasets/wine/wine_red_train.csv")
w_test <- read.csv("https://github.com/ozt-ca/tjo.hatenablog.samples/raw/master/r_samples/public_lib/jp/exp_uci_datasets/wine/wine_red_test.csv")

# Transform "quality" into binary
w_train$quality[w_train$quality < 6] <- 0
w_test$quality[w_test$quality < 6] <- 0
w_train$quality[w_train$quality > 0] <- 1
w_test$quality[w_test$quality > 0] <- 1

c_scale <- function(x){
  x <- (x - min(x)) / (max(x) - min(x))
}

w_train[, -12] <- apply(w_train[, -12], 2, c_scale)
w_test[, -12] <- apply(w_test[, -12], 2, c_scale)

# Run a training procedure

bst <- lightgbm(
  data = as.matrix(w_train[, -12])
  , label = w_train$quality
  , num_leaves = 4L
  , learning_rate = 1.0
  , nrounds = 2L
  , objective = "binary"
)
#>[LightGBM] [Info] Number of positive: 769, number of negative: 670
#>[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000175 seconds.
#>You can set `force_col_wise=true` to remove the overhead.
#>[LightGBM] [Info] Total Bins 1041
#>[LightGBM] [Info] Number of data points in the train set: 1439, number of used features: 11
#>[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.534399 -> initscore=0.137813
#>[LightGBM] [Info] Start training from score 0.137813
#>[1] "[1]:\ttrain's binary_logloss:0.569634"
#>[1] "[2]:\ttrain's binary_logloss:0.52085"

# Predict labels for test data & evaluate

pred <- predict(bst, as.matrix(w_test[, -12]))
table(w_test$quality, round(pred, 0))
   
     0  1
  0 61 13
  1 36 50
sum(diag(table(w_test$quality, round(pred, 0)))) / nrow(w_test)
#>[1] 0.69375

チューニング皆無なので若干ACCが低いですが、ともあれこんな感じで{lightgbm}もR上で回ることが見て取れたかと思います。

感想など

正直言って、どちらのパッケージも「実体はPythonです」主張がかなり強くて*3、Pythonも使う身としては「これならRでやらなくてもいいんじゃないかなぁ」感満載でした（苦笑）。特に{torch}はぶっちゃけ{keras}とかで代用しても良いような気もしますが。。。とは言え、R側でしか出来ない処理に接続したい時などは有難いと思います。

なお、Wine Qualityデータセットを用いた部分のRコードは切り出してGitHubに上げてありますので、必要に応じてご参照ください。以上、このブログ恒例の「やってみただけです」記事でした。

*1:TensorFlowユーザーなので

*2:しかもオブジェクトとしてワークスペースに格納されるのでRStudioのEnviromentタブにも中身が表示されない

*3:特に{torch}は非常に顕著というかRで動かしているのにOOP的要素が色濃く、Pythonでやっている感が凄かった