Until now, even though plenty of people can use both R and Python, partly because the developer teams and communities have different philosophies, the division of labor along the lines of "R for statistics, Python for machine learning" has deepened year after year. In particular, my impression is that it has become the norm for important machine learning frameworks and libraries to be released for Python only, with no R release at all.
Against that backdrop, two libraries that represent Python's machine learning OSS released R packages one after the other this September, which frankly surprised me. Apparently some hard-liners are even declaring "now that these two have come to R, I'm done with Python" (laughs). That aside, as a personal memo I am leaving this article behind as a record of trying out these two R packages.
Note that essentially no tuning has been done on any of the models below. The sole purpose is to confirm that "this is how you get it to run"; please do any tuning yourselves, on your own machines and at your own risk m(_ _)m And as always, if there are any mistakes in the code, I would be grateful if you could point them out in the comments.
Execution environment
sessionInfo()
#> R version 3.6.3 (2020-02-29)
#> Platform: x86_64-apple-darwin15.6.0 (64-bit)
#> Running under: macOS Catalina 10.15.6
#>
#> Matrix products: default
#> BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
#>
#> locale:
#> [1] ja_JP.UTF-8/ja_JP.UTF-8/ja_JP.UTF-8/C/ja_JP.UTF-8/ja_JP.UTF-8
That is the setup. It is a MacBook Pro with no GPU installed, so as a rule everything below is run in CPU mode.
As an aside, and as a benchmark separate from each package's tutorial, the snippet below shows the accuracy obtained by running the red-wine half of the Wine Quality dataset, which I occasionally use on this blog, through {randomForest}.

library(randomForest)

# Load the datasets
w_train <- read.csv("https://github.com/ozt-ca/tjo.hatenablog.samples/raw/master/r_samples/public_lib/jp/exp_uci_datasets/wine/wine_red_train.csv")
w_test <- read.csv("https://github.com/ozt-ca/tjo.hatenablog.samples/raw/master/r_samples/public_lib/jp/exp_uci_datasets/wine/wine_red_test.csv")

# Transform "quality" into binary
w_train$quality[w_train$quality < 6] <- 0
w_test$quality[w_test$quality < 6] <- 0
w_train$quality[w_train$quality > 0] <- 1
w_test$quality[w_test$quality > 0] <- 1

# Min-max scale the 11 feature columns
c_scale <- function(x){
  x <- (x - min(x)) / (max(x) - min(x))
}
w_train[, -12] <- apply(w_train[, -12], 2, c_scale)
w_test[, -12] <- apply(w_test[, -12], 2, c_scale)

rf <- randomForest(as.factor(quality) ~ ., w_train)
table(w_test$quality, predict(rf, newdata = w_test))
#>
#>      0  1
#>   0 66  8
#>   1 38 48
sum(diag(table(w_test$quality, predict(rf, newdata = w_test)))) / nrow(w_test)
#> [1] 0.70625
The CRAN {torch} package (PyTorch for R)
Thanks to the efforts of RStudio, PyTorch, or rather Torch, the last of the major deep learning frameworks that had never made it into R, has finally arrived as a CRAN package under the name {torch}. For installation, just follow the instructions at the link above. Note, however, that {torchvision} cannot be installed from CRAN, so you install it from GitHub with remotes::install_github. In other words,
install.packages("torch")
# remotes::install_github("mlverse/torch")  # use this instead if the above fails
remotes::install_github("mlverse/torchvision")
and that is all. If version dependencies worry you, installing everything from GitHub is probably the surer route.
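For reference, a minimal sketch of the all-from-GitHub route; the explicit torch::install_torch() call, which fetches the libtorch backend, is my assumption of the usual extra step (it may also be triggered automatically the first time the package is loaded):

# All-from-GitHub install (sketch; adjust to your environment)
remotes::install_github("mlverse/torch")
remotes::install_github("mlverse/torchvision")
torch::install_torch()  # downloads the libtorch backend if it is not already present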
The KMNIST (Kuzushiji-MNIST) tutorial
The code example presented directly at the link above is the KMNIST (Kuzushiji-MNIST) tutorial. It uses an open dataset for training models that read Japanese kuzushiji (cursive-script) documents, built by a team centered on Tarin Clanuwat, whom I have had the chance to meet a few times. Because it is harder than the well-worn original MNIST, the characteristics of each model come through more clearly. The tutorial simply builds and runs a CNN, but since similar network implementations show up in other frameworks too, I think it gives a good feel for what Torch is like.
As mentioned up front, I am running in CPU mode, so all of the {torch} code below uses device = "cpu"; if you have a GPU, change it to device = "cuda". Incidentally, before getting to the results below I actually hit a mysterious OpenMP-related bug, and as the linked issue shows, I went back and forth with the author, dfalbel, about it for a while.
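As an aside, one way to avoid hand-editing the device string is to pick it at runtime; a minimal sketch, assuming cuda_is_available() behaves the same way as in the Python API:

library(torch)

# Fall back to CPU when no CUDA device is visible
device <- if (cuda_is_available()) "cuda" else "cpu"
device
#> [1] "cpu"   # on this machine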
Sys.setenv(KMP_DUPLICATE_LIB_OK = "TRUE")  # needed in my environment; remove this line if it causes trouble on yours

library(torch)
library(torchvision)

train_ds <- kmnist_dataset(
  ".",
  download = TRUE,
  train = TRUE,
  transform = transform_to_tensor
)
test_ds <- kmnist_dataset(
  ".",
  download = TRUE,
  train = FALSE,
  transform = transform_to_tensor
)

train_dl <- dataloader(train_ds, batch_size = 32, shuffle = TRUE)
test_dl <- dataloader(test_ds, batch_size = 32)

par(mfrow = c(4, 8), mar = rep(0, 4))
images <- train_dl$.iter()$.next()[[1]][1:32, 1, , ]
images %>%
  purrr::array_tree(1) %>%
  purrr::map(as.raster) %>%
  purrr::iwalk(~{plot(.x)})

net <- nn_module(
  "KMNIST-CNN",
  initialize = function() {
    # in_channels, out_channels, kernel_size, stride = 1, padding = 0
    self$conv1 <- nn_conv2d(1, 32, 3)
    self$conv2 <- nn_conv2d(32, 64, 3)
    self$dropout1 <- nn_dropout2d(0.25)
    self$dropout2 <- nn_dropout2d(0.5)
    self$fc1 <- nn_linear(9216, 128)
    self$fc2 <- nn_linear(128, 10)
  },
  forward = function(x) {
    x %>%
      self$conv1() %>%
      nnf_relu() %>%
      self$conv2() %>%
      nnf_relu() %>%
      nnf_max_pool2d(2) %>%
      self$dropout1() %>%
      torch_flatten(start_dim = 2) %>%
      self$fc1() %>%
      nnf_relu() %>%
      self$dropout2() %>%
      self$fc2()
  }
)

model <- net()
model$to(device = "cpu")

optimizer <- optim_adam(model$parameters)

for (epoch in 1:5) {
  l <- c()
  for (b in enumerate(train_dl)) {
    # make sure each batch's gradient updates are calculated from a fresh start
    optimizer$zero_grad()
    # get model predictions
    output <- model(b[[1]]$to(device = "cpu"))
    # calculate loss
    loss <- nnf_cross_entropy(output, b[[2]]$to(device = "cpu"))
    # calculate gradient
    loss$backward()
    # apply weight updates
    optimizer$step()
    # track losses
    l <- c(l, loss$item())
  }
  cat(sprintf("Loss at epoch %d: %3f\n", epoch, mean(l)))
}
#> Loss at epoch 1: 1.807057
#> Loss at epoch 2: 1.550666
#> Loss at epoch 3: 1.503203
#> Loss at epoch 4: 1.468439
#> Loss at epoch 5: 1.450541

model$eval()

test_losses <- c()
total <- 0
correct <- 0

for (b in enumerate(test_dl)) {
  output <- model(b[[1]]$to(device = "cpu"))
  labels <- b[[2]]$to(device = "cpu")
  loss <- nnf_cross_entropy(output, labels)
  test_losses <- c(test_losses, loss$item())
  # torch_max returns a list, with position 1 containing the values
  # and position 2 containing the respective indices
  predicted <- torch_max(output$data(), dim = 2)[[2]]
  total <- total + labels$size(1)
  # add number of correct classifications in this batch to the aggregate
  correct <- correct + (predicted == labels)$sum()$item()
}

mean(test_losses)
#> [1] 1.548806

test_accuracy <- correct / total
test_accuracy
#> [1] 0.9365
The official tutorial reports an accuracy of 0.9449, so this is slightly below that, but the performance seems perfectly adequate. At any rate, this confirms that a CNN runs properly with {torch}.
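If you want to keep the trained model around, {torch} ships serialization helpers; a minimal sketch, assuming torch_save()/torch_load() handle nn_module objects the way I expect (the file name is arbitrary):

# Save and restore the trained CNN (sketch)
torch_save(model, "kmnist_cnn.pt")
model2 <- torch_load("kmnist_cnn.pt")
model2$eval()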
Applying {torch} to the Wine Quality dataset
Since I had never used PyTorch*1, I did not know this, but in the Torch family the way you build a "Dataset" matters a lot. To an R user this feels rather involved, and it may be hard for people who have never written a class in Python*2.
The {torch} documentation does include a brief explanation of how to build a Dataset, so if this is your very first contact with Torch, I recommend reading that first. Once you clear this hurdle, anyone who has used another deep learning framework should find the rest easy to run.
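To make the structure concrete before the full example, here is a minimal sketch of the three methods a dataset() definition needs; the name toy_dataset and the toy data frame are made up for illustration, and the condensed pattern simply mirrors the wine example below:

# Minimal dataset sketch: wraps a data frame with one response column
toy_dataset <- dataset(
  "toy",
  initialize = function(df, response) {
    self$x <- as.matrix(df[, setdiff(names(df), response)])
    self$y <- df[[response]]
  },
  # return one (features, label) pair as tensors
  .getitem = function(i) {
    list(x = torch_tensor(self$x[i, ]), y = torch_tensor(self$y[i]))
  },
  # total number of rows
  .length = function() {
    length(self$y)
  }
)

toy_ds <- toy_dataset(data.frame(a = rnorm(5), b = rnorm(5), label = rbinom(5, 1, 0.5)), "label")
toy_ds$.length()     # 5
toy_ds$.getitem(1)   # first (x, y) pair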
library(torch)
Sys.setenv(KMP_DUPLICATE_LIB_OK = "TRUE")

# Load the datasets
w_train <- read.csv("https://github.com/ozt-ca/tjo.hatenablog.samples/raw/master/r_samples/public_lib/jp/exp_uci_datasets/wine/wine_red_train.csv")
w_test <- read.csv("https://github.com/ozt-ca/tjo.hatenablog.samples/raw/master/r_samples/public_lib/jp/exp_uci_datasets/wine/wine_red_test.csv")

# Transform "quality" into binary
w_train$quality[w_train$quality < 6] <- 0
w_test$quality[w_test$quality < 6] <- 0
w_train$quality[w_train$quality > 0] <- 1
w_test$quality[w_test$quality > 0] <- 1

c_scale <- function(x){
  x <- (x - min(x)) / (max(x) - min(x))
}
w_train[, -12] <- apply(w_train[, -12], 2, c_scale)
w_test[, -12] <- apply(w_test[, -12], 2, c_scale)

# Create {torch} dataset
df_dataset <- dataset(
  "wine",
  initialize = function(df, response_variable) {
    self$df <- df[, -which(names(df) == response_variable)]
    self$response_variable <- df[[response_variable]]
  },
  .getitem = function(index) {
    response <- torch_tensor(self$response_variable[index])
    x <- torch_tensor(as.numeric(self$df[index, ]))
    list(x = x, y = response)
  },
  .length = function() {
    length(self$response_variable)
  }
)

wtrain_ds <- df_dataset(w_train, "quality")
wtest_ds <- df_dataset(w_test, "quality")
wtrain_dl <- dataloader(wtrain_ds, batch_size = 32, shuffle = T)
wtest_dl <- dataloader(wtest_ds, batch_size = 1, shuffle = F)

# Define a network
net <- nn_module(
  "wine_DNN",
  initialize = function() {
    self$fc1 <- nn_linear(11, 66)
    self$fc2 <- nn_linear(66, 44)
    self$fc3 <- nn_linear(44, 1)
    self$dropout <- nn_dropout(0.5)
  },
  forward = function(x) {
    x %>%
      self$fc1() %>%
      nnf_relu() %>%
      self$fc2() %>%
      nnf_relu() %>%
      self$dropout() %>%
      self$fc3() %>%
      nnf_sigmoid()
  }
)

model <- net()
model$to(device = "cpu")

optimizer <- optim_adam(model$parameters)

# Run a learning iteration
for (epoch in 1:20) {
  l <- c()
  for (b in enumerate(wtrain_dl)) {
    optimizer$zero_grad()
    output <- model(b[[1]]$to(device = "cpu"))
    loss <- nnf_binary_cross_entropy_with_logits(output, b[[2]]$to(device = "cpu"))
    loss$backward()
    optimizer$step()
    l <- c(l, loss$item())
  }
  cat(sprintf("Loss at epoch %d: %3f\n", epoch, mean(l)))
}
#> Loss at epoch 1: 0.697528
#> Loss at epoch 2: 0.684255
#> Loss at epoch 3: 0.671300
#> Loss at epoch 4: 0.649464
#> Loss at epoch 5: 0.637502
#> Loss at epoch 6: 0.626071
#> Loss at epoch 7: 0.624771
#> Loss at epoch 8: 0.620353
#> Loss at epoch 9: 0.617436
#> Loss at epoch 10: 0.616738
#> Loss at epoch 11: 0.615517
#> Loss at epoch 12: 0.615118
#> Loss at epoch 13: 0.614372
#> Loss at epoch 14: 0.611524
#> Loss at epoch 15: 0.612358
#> Loss at epoch 16: 0.612271
#> Loss at epoch 17: 0.610837
#> Loss at epoch 18: 0.610578
#> Loss at epoch 19: 0.611047
#> Loss at epoch 20: 0.612309

# Predict labels for test data
model$eval()
i <- 1
pred_labels <- rep(0, nrow(w_test))
for (b in enumerate(wtest_dl)) {
  output <- model(b[[1]]$to(device = "cpu"))
  pred_labels[i] <- round(output$item(), 0)
  i <- i + 1
}

# Evaluate
table(w_test$quality, pred_labels)
#>    pred_labels
#>      0  1
#>   0 66  8
#>   1 42 44
(torch_acc <- sum(diag(table(w_test$quality, pred_labels))) / nrow(w_test))
#> [1] 0.6875
Nothing is tuned, so it falls a step short of {randomForest}, but please take this simply as a reference showing that you can run everyday tabular data this way as well. Incidentally, the fiddly parts of tensor handling are the spitting image of TensorFlow, so be warned.
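For example, here is a minimal sketch of the kind of tensor bookkeeping involved; shapes, dtypes and 1-based indexing are the usual stumbling points, and the specific values are made up:

library(torch)

# A made-up 2x3 tensor, just to illustrate shape/dtype handling
t <- torch_tensor(matrix(1:6, nrow = 2))
t$shape                                    # 2 3
t_float <- t$to(dtype = torch_float())     # cast integer -> float
t_float[1, ]                               # first row; R {torch} indexing is 1-based
t_flat <- torch_flatten(t_float, start_dim = 1)
t_flat$shape                               # 6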
The R {lightgbm} package
LightGBM, the Kaggle competitors' staple that has so far been used mostly through its Python version, finally got a CRAN release and became much easier for R users to adopt, or so I heard, so I figured I would give it a quick try... but while I was writing this article it was, for some reason, moved from CRAN to the archive as of October 2.
That said, the R version itself can still be installed, either from the archived zip or from GitHub via remotes::install_url and the like.
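For instance, a minimal sketch of those two routes; the version string and URL are only examples of mine, so check the CRAN archive listing or the LightGBM GitHub releases page for the actual file name before running this:

# Install from the CRAN archive (sketch; verify the version/file name first)
remotes::install_url("https://cran.r-project.org/src/contrib/Archive/lightgbm/lightgbm_3.0.0-1.tar.gz")

# Or, if you downloaded the archived tarball by hand:
install.packages("lightgbm_3.0.0-1.tar.gz", repos = NULL, type = "source")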
A tutorial using the agaricus dataset
以下"agaricus"データセットを使った公式のチュートリアルを回してみた結果です。
library(lightgbm)
library(methods)

# We load in the agaricus dataset
# In this example, we are aiming to predict whether a mushroom is edible
data(agaricus.train, package = "lightgbm")
data(agaricus.test, package = "lightgbm")
train <- agaricus.train
test <- agaricus.test

# The loaded data is stored in sparseMatrix, and label is a numeric vector in {0,1}
class(train$label)
#> [1] "numeric"
class(train$data)
#> [1] "dgCMatrix"
#> attr(,"package")
#> [1] "Matrix"

#--------------------Basic Training using lightgbm----------------
# This is the basic usage of lightgbm you can put matrix in data field
# Note: we are putting in sparse matrix here, lightgbm naturally handles sparse input
# Use sparse matrix when your feature is sparse (e.g. when you are using one-hot encoding vector)
print("Training lightgbm with sparseMatrix")
#> [1] "Training lightgbm with sparseMatrix"
bst <- lightgbm(
    data = train$data
    , label = train$label
    , num_leaves = 4L
    , learning_rate = 1.0
    , nrounds = 2L
    , objective = "binary"
)
#> [LightGBM] [Info] Number of positive: 3140, number of negative: 3373
#> [LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000501 seconds.
#> You can set `force_row_wise=true` to remove the overhead.
#> And if memory is not enough, you can set `force_col_wise=true`.
#> [LightGBM] [Info] Total Bins 214
#> [LightGBM] [Info] Number of data points in the train set: 6513, number of used features: 107
#> [LightGBM] [Info] [binary:BoostFromScore]: pavg=0.482113 -> initscore=-0.071580
#> [LightGBM] [Info] Start training from score -0.071580
#> [1] "[1]:\ttrain's binary_logloss:0.198597"
#> [1] "[2]:\ttrain's binary_logloss:0.111535"

# Alternatively, you can put in dense matrix, i.e. basic R-matrix
print("Training lightgbm with Matrix")
#> [1] "Training lightgbm with Matrix"
bst <- lightgbm(
    data = as.matrix(train$data)
    , label = train$label
    , num_leaves = 4L
    , learning_rate = 1.0
    , nrounds = 2L
    , objective = "binary"
)
#> [LightGBM] [Info] Number of positive: 3140, number of negative: 3373
#> [LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.002130 seconds.
#> You can set `force_row_wise=true` to remove the overhead.
#> And if memory is not enough, you can set `force_col_wise=true`.
#> [LightGBM] [Info] Total Bins 214
#> [LightGBM] [Info] Number of data points in the train set: 6513, number of used features: 107
#> [LightGBM] [Info] [binary:BoostFromScore]: pavg=0.482113 -> initscore=-0.071580
#> [LightGBM] [Info] Start training from score -0.071580
#> [1] "[1]:\ttrain's binary_logloss:0.198597"
#> [1] "[2]:\ttrain's binary_logloss:0.111535"

# You can also put in lgb.Dataset object, which stores label, data and other meta datas needed for advanced features
print("Training lightgbm with lgb.Dataset")
#> [1] "Training lightgbm with lgb.Dataset"
dtrain <- lgb.Dataset(
    data = train$data
    , label = train$label
)
bst <- lightgbm(
    data = dtrain
    , num_leaves = 4L
    , learning_rate = 1.0
    , nrounds = 2L
    , objective = "binary"
)
#> [LightGBM] [Info] Number of positive: 3140, number of negative: 3373
#> [LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.006867 seconds.
#> You can set `force_row_wise=true` to remove the overhead.
#> And if memory is not enough, you can set `force_col_wise=true`.
#> [LightGBM] [Info] Total Bins 214
#> [LightGBM] [Info] Number of data points in the train set: 6513, number of used features: 107
#> [LightGBM] [Info] [binary:BoostFromScore]: pavg=0.482113 -> initscore=-0.071580
#> [LightGBM] [Info] Start training from score -0.071580
#> [1] "[1]:\ttrain's binary_logloss:0.198597"
#> [1] "[2]:\ttrain's binary_logloss:0.111535"

# Verbose = 0,1,2
print("Train lightgbm with verbose 0, no message")
#> [1] "Train lightgbm with verbose 0, no message"
bst <- lightgbm(
    data = dtrain
    , num_leaves = 4L
    , learning_rate = 1.0
    , nrounds = 2L
    , objective = "binary"
    , verbose = 0L
)
#> [LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000580 seconds.
#> You can set `force_row_wise=true` to remove the overhead.
#> And if memory is not enough, you can set `force_col_wise=true`.

print("Train lightgbm with verbose 1, print evaluation metric")
#> [1] "Train lightgbm with verbose 1, print evaluation metric"
bst <- lightgbm(
    data = dtrain
    , num_leaves = 4L
    , learning_rate = 1.0
    , nrounds = 2L
    , nthread = 2L
    , objective = "binary"
    , verbose = 1L
)
#> [LightGBM] [Info] Number of positive: 3140, number of negative: 3373
#> [LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000930 seconds.
#> You can set `force_row_wise=true` to remove the overhead.
#> And if memory is not enough, you can set `force_col_wise=true`.
#> [LightGBM] [Info] Total Bins 214
#> [LightGBM] [Info] Number of data points in the train set: 6513, number of used features: 107
#> [LightGBM] [Info] [binary:BoostFromScore]: pavg=0.482113 -> initscore=-0.071580
#> [LightGBM] [Info] Start training from score -0.071580
#> [1] "[1]:\ttrain's binary_logloss:0.198597"
#> [1] "[2]:\ttrain's binary_logloss:0.111535"

print("Train lightgbm with verbose 2, also print information about tree")
#> [1] "Train lightgbm with verbose 2, also print information about tree"
bst <- lightgbm(
    data = dtrain
    , num_leaves = 4L
    , learning_rate = 1.0
    , nrounds = 2L
    , nthread = 2L
    , objective = "binary"
    , verbose = 2L
)
#> [LightGBM] [Info] Number of positive: 3140, number of negative: 3373
#> [LightGBM] [Debug] Dataset::GetMultiBinFromSparseFeatures: sparse rate 0.930600
#> [LightGBM] [Debug] Dataset::GetMultiBinFromAllFeatures: sparse rate 0.433362
#> [LightGBM] [Debug] init for col-wise cost 0.002197 seconds, init for row-wise cost 0.002363 seconds
#> [LightGBM] [Debug] col-wise cost 0.000221 seconds, row-wise cost 0.000317 seconds
#> [LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.002901 seconds.
#> You can set `force_col_wise=true` to remove the overhead.
#> [LightGBM] [Info] Total Bins 214
#> [LightGBM] [Info] Number of data points in the train set: 6513, number of used features: 107
#> [LightGBM] [Info] [binary:BoostFromScore]: pavg=0.482113 -> initscore=-0.071580
#> [LightGBM] [Info] Start training from score -0.071580
#> [LightGBM] [Debug] Trained a tree with leaves = 4 and max_depth = 3
#> [1] "[1]:\ttrain's binary_logloss:0.198597"
#> [LightGBM] [Debug] Trained a tree with leaves = 4 and max_depth = 3
#> [1] "[2]:\ttrain's binary_logloss:0.111535"

# You can also specify data as file path to a LibSVM/TSV/CSV format input
# Since we do not have this file with us, the following line is just for illustration
# bst <- lightgbm(
#     data = "agaricus.train.svm"
#     , num_leaves = 4L
#     , learning_rate = 1.0
#     , nrounds = 2L
#     , objective = "binary"
# )

#--------------------Basic prediction using lightgbm--------------
# You can do prediction using the following line
# You can put in Matrix, sparseMatrix, or lgb.Dataset
pred <- predict(bst, test$data)
err <- mean(as.numeric(pred > 0.5) != test$label)
print(paste("test-error=", err))
#> [1] "test-error= 0.0217256362507759"

#--------------------Save and load models-------------------------
# Save model to binary local file
lgb.save(bst, "lightgbm.model")

# Load binary model to R
bst2 <- lgb.load("lightgbm.model")
pred2 <- predict(bst2, test$data)

# pred2 should be identical to pred
print(paste("sum(abs(pred2-pred))=", sum(abs(pred2 - pred))))
#> [1] "sum(abs(pred2-pred))= 0"

#--------------------Advanced features ---------------------------
# To use advanced features, we need to put data in lgb.Dataset
dtrain <- lgb.Dataset(data = train$data, label = train$label, free_raw_data = FALSE)
dtest <- lgb.Dataset.create.valid(dtrain, data = test$data, label = test$label)

#--------------------Using validation set-------------------------
# valids is a list of lgb.Dataset, each of them is tagged with name
valids <- list(train = dtrain, test = dtest)

# To train with valids, use lgb.train, which contains more advanced features
# valids allows us to monitor the evaluation result on all data in the list
print("Train lightgbm using lgb.train with valids")
#> [1] "Train lightgbm using lgb.train with valids"
bst <- lgb.train(
    data = dtrain
    , num_leaves = 4L
    , learning_rate = 1.0
    , nrounds = 2L
    , valids = valids
    , nthread = 2L
    , objective = "binary"
)
#> [LightGBM] [Info] Number of positive: 3140, number of negative: 3373
#> [LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001222 seconds.
#> You can set `force_row_wise=true` to remove the overhead.
#> And if memory is not enough, you can set `force_col_wise=true`.
#> [LightGBM] [Info] Total Bins 214
#> [LightGBM] [Info] Number of data points in the train set: 6513, number of used features: 107
#> [LightGBM] [Info] [binary:BoostFromScore]: pavg=0.482113 -> initscore=-0.071580
#> [LightGBM] [Info] Start training from score -0.071580
#> [1] "[1]:\ttrain's binary_logloss:0.198597\ttest's binary_logloss:0.204754"
#> [1] "[2]:\ttrain's binary_logloss:0.111535\ttest's binary_logloss:0.113096"

# We can change evaluation metrics, or use multiple evaluation metrics
print("Train lightgbm using lgb.train with valids, watch logloss and error")
#> [1] "Train lightgbm using lgb.train with valids, watch logloss and error"
bst <- lgb.train(
    data = dtrain
    , num_leaves = 4L
    , learning_rate = 1.0
    , nrounds = 2L
    , valids = valids
    , eval = c("binary_error", "binary_logloss")
    , nthread = 2L
    , objective = "binary"
)
#> [LightGBM] [Info] Number of positive: 3140, number of negative: 3373
#> [LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000811 seconds.
#> You can set `force_row_wise=true` to remove the overhead.
#> And if memory is not enough, you can set `force_col_wise=true`.
#> [LightGBM] [Info] Total Bins 214
#> [LightGBM] [Info] Number of data points in the train set: 6513, number of used features: 107
#> [LightGBM] [Info] [binary:BoostFromScore]: pavg=0.482113 -> initscore=-0.071580
#> [LightGBM] [Info] Start training from score -0.071580
#> [1] "[1]:\ttrain's binary_error:0.0304007\ttrain's binary_logloss:0.198597\ttest's binary_error:0.0335196\ttest's binary_logloss:0.204754"
#> [1] "[2]:\ttrain's binary_error:0.0222632\ttrain's binary_logloss:0.111535\ttest's binary_error:0.0217256\ttest's binary_logloss:0.113096"

# lgb.Dataset can also be saved using lgb.Dataset.save
lgb.Dataset.save(dtrain, "dtrain.buffer")
#> [LightGBM] [Warning] File dtrain.buffer exists, cannot save binary to it

# To load it in, simply call lgb.Dataset
dtrain2 <- lgb.Dataset("dtrain.buffer")
bst <- lgb.train(
    data = dtrain2
    , num_leaves = 4L
    , learning_rate = 1.0
    , nrounds = 2L
    , valids = valids
    , nthread = 2L
    , objective = "binary"
)
#> Error in data$get_colnames() :
#>   dim: cannot get dimensions before dataset has been constructed, please call lgb.Dataset.construct explicitly

# information can be extracted from lgb.Dataset using getinfo
label <- getinfo(dtest, "label")
pred <- predict(bst, test$data)
err <- as.numeric(sum(as.integer(pred > 0.5) != label)) / length(label)
print(paste("test-error=", err))
#> [1] "test-error= 0.0217256362507759"
The error it throws partway through is a little concerning, but on the whole it does seem to run.
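The error message itself points at a likely workaround: construct the reloaded dataset explicitly before training. A minimal sketch, assuming lgb.Dataset.construct() behaves as documented (I have not dug into why the tutorial hits this in the first place):

# Possible fix for the error above: force construction of the reloaded dataset
dtrain2 <- lgb.Dataset("dtrain.buffer")
lgb.Dataset.construct(dtrain2)
bst <- lgb.train(
    data = dtrain2
    , num_leaves = 4L
    , learning_rate = 1.0
    , nrounds = 2L
    , valids = valids
    , nthread = 2L
    , objective = "binary"
)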
Applying {lightgbm} to the Wine Quality dataset
As with {torch}, let us also try the Wine Quality dataset. Other than the fact that it only accepts data in matrix / sparse-matrix form, there are essentially no traps, so it should be easy to run.
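For reference, a minimal sketch of converting a data frame into the dgCMatrix sparse format that appeared in the tutorial above, using the same w_train frame as in the code below (for a small dense dataset like this, the as.matrix() route used below works just as well):

library(Matrix)

# Turn the 11 feature columns into a sparse dgCMatrix (sketch)
w_train_sp <- Matrix(as.matrix(w_train[, -12]), sparse = TRUE)
class(w_train_sp)   # "dgCMatrix"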
library(lightgbm)

# Load the datasets
w_train <- read.csv("https://github.com/ozt-ca/tjo.hatenablog.samples/raw/master/r_samples/public_lib/jp/exp_uci_datasets/wine/wine_red_train.csv")
w_test <- read.csv("https://github.com/ozt-ca/tjo.hatenablog.samples/raw/master/r_samples/public_lib/jp/exp_uci_datasets/wine/wine_red_test.csv")

# Transform "quality" into binary
w_train$quality[w_train$quality < 6] <- 0
w_test$quality[w_test$quality < 6] <- 0
w_train$quality[w_train$quality > 0] <- 1
w_test$quality[w_test$quality > 0] <- 1

c_scale <- function(x){
  x <- (x - min(x)) / (max(x) - min(x))
}
w_train[, -12] <- apply(w_train[, -12], 2, c_scale)
w_test[, -12] <- apply(w_test[, -12], 2, c_scale)

# Run a training procedure
bst <- lightgbm(
    data = as.matrix(w_train[, -12])
    , label = w_train$quality
    , num_leaves = 4L
    , learning_rate = 1.0
    , nrounds = 2L
    , objective = "binary"
)
#> [LightGBM] [Info] Number of positive: 769, number of negative: 670
#> [LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000175 seconds.
#> You can set `force_col_wise=true` to remove the overhead.
#> [LightGBM] [Info] Total Bins 1041
#> [LightGBM] [Info] Number of data points in the train set: 1439, number of used features: 11
#> [LightGBM] [Info] [binary:BoostFromScore]: pavg=0.534399 -> initscore=0.137813
#> [LightGBM] [Info] Start training from score 0.137813
#> [1] "[1]:\ttrain's binary_logloss:0.569634"
#> [1] "[2]:\ttrain's binary_logloss:0.52085"

# Predict labels for test data & evaluate
pred <- predict(bst, as.matrix(w_test[, -12]))
table(w_test$quality, round(pred, 0))
#>
#>      0  1
#>   0 61 13
#>   1 36 50
sum(diag(table(w_test$quality, round(pred, 0)))) / nrow(w_test)
#> [1] 0.69375
With zero tuning the accuracy is somewhat low, but at any rate I hope you can see that {lightgbm} also runs like this in R.
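If you do want to start tuning from here, cross-validation is the natural first step. Below is a minimal sketch using lgb.cv(); the parameter values are placeholders of my own, not recommendations:

# A possible starting point for tuning: 5-fold CV on the same training matrix
dtrain_w <- lgb.Dataset(data = as.matrix(w_train[, -12]), label = w_train$quality)
cv <- lgb.cv(
    params = list(objective = "binary", metric = "binary_logloss", num_leaves = 8L, learning_rate = 0.1)
    , data = dtrain_w
    , nrounds = 100L
    , nfold = 5L
    , early_stopping_rounds = 10L
    , verbose = -1L
)
cv$best_iter  # number of rounds to reuse when refitting with lightgbm()/lgb.train()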