Hackathon Summer School 2022

Classification of pasture areas through EO data and machine learning

Author

Francisco Zambrano (from Chile)

1 Background

Land use and land cover (LULC) change is a key topic to understand the human footprint and environmental impacts on the planet (Winkler et al., 2019). In recent years the convergence of publicly available Earth Observation (EO) data (Ma, Y. et al., 2015), more accessible cloud-computing platforms and modern machine learning (ML) techniques (Gorelick et al., 2017; Schramm et al., 2021) allowed the production of several LULC maps at national (Souza et al., 2020), continental (Witjes et al., 2022) and global scale (Venter et al., 2022). In general, these products are self-consistent, which means that they map consistently different LULC classes considering distinct training samples and ML techniques, an aspect that makes the comparison across them challenging (Venter et al., 2022). Even so, most of them report a relatively low accuracy for pasture / grasslands classes, a LULC class that suffers from a lack of precise definitions able to incorporate cross-disciplinary evidence of animal production systems into mapping products (Phelps et al., 2017). These areas are well-known to be the single most extensive land cover class of the Earth’s ice-free land surface (Ellis et al., 2010) and highly affected by climate change (Gao et al., 2016). Considering the global relevance of this LULC class WRI, OpenGeoHub, LAPIG and IIASA initiated a Global Pastures Monitoring project supported by the Land & Carbon Lab (LCL) initiative.

2 Task definition

The task is to develop a reproducible computation notebook using your favorite programming language (Julia, Python or R) explaining step-by-step the implemented workflow and providing evidence for any claims that you make.

The computation notebook must include at least the following steps (minimum requirements):

  1. Download and access the training, validation and test set files publicly available in Parquet and CSV format,
  2. Train a machine learning model able to predict the three land cover classes established by the dataset (column class),
  3. Predict the test set using the trained model,
  4. Upload your predictions in a Google Drive and insert the URL link to access them in this Google Sheet, so we can derive a macro average f1-score (arithmetic mean of f1-score for the three classes, which is insensitive to the imbalance of the classes). IMPORTANT: Remember to change the access of uploaded file to “Anyone with the link”.

3 Procedure

3.1 Download and read the data

To download the data I used the script recommended with the instruction of the hackaton. The data was in parquet format and was read with arrow::read_parquet function.

Code
library(arrow)

read_pq_from_url <- function(url) {
  dst <- basename(url)
  
  if (!file.exists(dst)) {
    download.file(url, dst);
  } 
  
  return( read_parquet(dst) )
}

train_url <-'https://s3.eu-central-1.wasabisys.com/global-pastures/samples/lcv_pasture_classif.matrix.train_2000..2020_brazil.eumap_summer.school.2022.pq'
val_url <- 'https://s3.eu-central-1.wasabisys.com/global-pastures/samples/lcv_pasture_classif.matrix.val_2000..2020_brazil.eumap_summer.school.2022.pq'
test_url <- 'https://s3.eu-central-1.wasabisys.com/global-pastures/samples/lcv_pasture_classif.matrix.test_2000..2020_brazil.eumap_summer.school.2022.pq'

train_data <- read_pq_from_url(train_url)
val_data <- read_pq_from_url(val_url)
test_data <- read_pq_from_url(test_url)

3.2 Merging datasets

The training and validation datasets were merged to create an unifed dataset, then this dataset was spli into training and test. The prediction data in which I want to run the model will be new_data.

Code
## 02. Creating additional covariates
library(dplyr)
library(tidyr)

data <- train_data |> 
  bind_rows(val_data) |> 
  as_tibble() |> 
  select(-c(tile_id:longitude,class_pct,class_label)) |> 
  drop_na() |> 
  mutate(class = as.factor(class)) |> 
  janitor::clean_names() |> 
  select(-index_level_0) 

new_data <- test_data |> 
  as_tibble() |> 
  select(-c(tile_id:longitude)) |> 
  drop_na() |> 
  janitor::clean_names() |> 
  select(-index_level_0) 

3.3 Setting seed and a logger

Code
set.seed(321)
lgr::get_logger("mlr3")$set_threshold("warn")
lgr::get_logger("bbotk")$set_threshold("warn")

I created a task with the data merged using the varibale “class” as the target for the model. Then I made a partition of the data using 70% for training and 30% for testing.

3.4 Task and partition of train and test dataset

Code
library(mlr3)
task <- as_task_classif(data, target = "class",id = 'train')
split <- partition(task,ratio = .7)

3.5 Learners selected for the classification

First, I created the task using the train_data, the target is the class variable.

I selected the following learners, as the ones that will be stacked for the classification.

  1. glmnet : Lasso and Elastic-Net Regularized Generalized Linear Models ({glmnet}),
  2. ranger : A Fast Implementation of Random Forests, ({ranger})
  3. naive_bayes, Naive Bayes Classifier ({e1071})
  4. svm: super vector machine ({e1071})
Code
library(mlr3learners)
lrn_glmnet = lrn("classif.glmnet", predict_type = "prob")
lrn_ranger = lrn("classif.ranger", predict_type = "prob")
lrn_nb = lrn("classif.naive_bayes",predict_type = "prob")
lrn_svm = lrn("classif.svm", predict_type = "prob")

3.6 Pipeline {mlr3pipelines}

To train the stacked models I used {mlr3pipelines} by layers. The layers were:

  1. level_0 : feature selection using {mlr3filters} and out-of-bag predictions for the learners
  2. level_1 : principal components analysis (PCA) and out-of-bag predictions for the learners
  3. level_2 : random forest with {ranger}

3.6.1 level_0

Code
library(mlr3pipelines)

cv2_glmnet = po("learner_cv", lrn_glmnet, id = "glmnet_2")
cv2_ranger = po("learner_cv", lrn_ranger, id = "ranger_2")
cv2_nb = po("learner_cv", lrn_nb, id = "nb_2")
cv2_svm = po("learner_cv", lrn_svm, id = "svm_2")

# 07.- Filters for feature selection
library(mlr3filters)
jmi1 = po("filter", flt("jmi"), id = "filt_jmi1",filter.nfeat =2)
jmi2 = po("filter", flt("jmi"), id = "filt_jmi2",filter.nfeat =2)
jmi3 = po("filter", flt("jmi"), id = "filt_jmi3",filter.nfeat =2)
jmi4 = po("filter", flt("jmi"), id = "filt_jmi4",filter.nfeat =2)

# 08.- union for level0

level_0 = gunion(list(
  jmi1 %>>% cv2_glmnet,
  jmi2 %>>% cv2_ranger,
  jmi3 %>>% cv2_nb,
  jmi4 %>>% cv2_svm,
  po("nop", id = "nop1")))  %>>%
  po("featureunion", id = "union1")

level_0$plot()

3.6.2 level_1

Code
cv3_glmnet = po("learner_cv", lrn_glmnet, id = "glmnet_3")
cv3_ranger = po("learner_cv", lrn_ranger, id = "ranger_3")
cv3_nb = po("learner_cv", lrn_nb, id = "nb_3")
cv3_svm = po("learner_cv", lrn_svm, id = "svm_3")

level_1 = level_0 %>>%
  po("copy", 5) %>>%
  gunion(list(
    po("pca", id = "pca_1", param_vals = list(scale. = TRUE)) %>>% cv3_glmnet,
    po("pca", id = "pca_2", param_vals = list(scale. = TRUE)) %>>% cv3_ranger,
    po("pca", id = "pca_3", param_vals = list(scale. = TRUE)) %>>% cv3_nb,
    po("pca", id = "pca_4", param_vals = list(scale. = TRUE)) %>>% cv3_svm,
    po("nop", id = "nop3")
  )) %>>%
  po("featureunion", id = "union2")

level_1$plot()

3.6.3 level_2

Code
lrn_ranger = lrn("classif.ranger", predict_type = "prob",id='ranger')

level_2 = level_1 %>>%
  po("featureunion", 3, id = "u2") %>>%
  po("learner", lrn_ranger, id = "ranger")

level_2$plot()

3.7 Tunning and training

I defined the space of search for some of the parameters of the learners,

Code
lrn = as_learner(level_2)

search_space = ps(
  filt_jmi1.filter.nfeat = p_int(5,50),
  filt_jmi2.filter.nfeat = p_int(5,50),
  filt_jmi3.filter.nfeat = p_int(5,50),
  filt_jmi4.filter.nfeat = p_int(5,50),
  pca_1.rank. = p_int(3, 50),
  pca_2.rank. = p_int(3, 50),
  pca_3.rank. = p_int(3, 20),
  glmnet_2.alpha = p_dbl(0,1),
  ranger_2.mtry = p_int(1,10),
  ranger_2.sample.fraction = p_dbl(0,1),
  ranger_2.num.trees = p_int(1,2000),
  svm_2.kernel = p_fct(levels =c('linear','polynomial','radial','sigmoid')),
  glmnet_3.alpha = p_dbl(0,1),
  ranger_3.mtry = p_int(1,10),
  ranger_3.sample.fraction = p_dbl(0,1),
  ranger_3.num.trees = p_int(1,2000),
  svm_3.kernel = p_fct(levels =c('linear','polynomial','radial','sigmoid')),
  ranger_end.mtry = p_int(1, 10),
  ranger_end.sample.fraction = p_dbl(0.5, 1),
  ranger_end.num.trees = p_int(50, 200)
)

Then, I created the auto tunner learner at

Code
library(mlr3tuning)
at = auto_tuner(
  method = "random_search",
  learner = lrn,
  resampling = rsmp("cv"),
  search_space = search_space,
  measure = msr("classif.mauc_aunu"),
  term_evals = 3) 

Running the training and tunning the model on the train dataset,

Code
future::plan('multisession')
progressr::with_progress({
  at$train(task, split$train)
})

3.8 The model trained

Code
at$model
$learner
<GraphLearner:filt_jmi1.filt_jmi2.filt_jmi3.filt_jmi4.nop1.glmnet_2.ranger_2.rpart_2.svm_2.union1.copy.pca_1.pca_2.pca_3.pca_4.nop3.glmnet_3.ranger_3.rpart_3.svm_3.union2.u2.ranger_end>
* Model: list
* Parameters: filt_jmi1.filter.nfeat=23, filt_jmi1.threads=1,
  glmnet_2.resampling.method=cv, glmnet_2.resampling.folds=3,
  glmnet_2.resampling.keep_response=FALSE, glmnet_2.alpha=0.6807,
  filt_jmi2.filter.nfeat=47, filt_jmi2.threads=1,
  ranger_2.resampling.method=cv, ranger_2.resampling.folds=3,
  ranger_2.resampling.keep_response=FALSE, ranger_2.mtry=9,
  ranger_2.num.threads=1, ranger_2.num.trees=475,
  ranger_2.sample.fraction=0.2269, filt_jmi3.filter.nfeat=17,
  filt_jmi3.threads=1, rpart_2.resampling.method=cv,
  rpart_2.resampling.folds=3, rpart_2.resampling.keep_response=FALSE,
  filt_jmi4.filter.nfeat=15, filt_jmi4.threads=1,
  svm_2.resampling.method=cv, svm_2.resampling.folds=3,
  svm_2.resampling.keep_response=FALSE, svm_2.kernel=sigmoid,
  pca_1.scale.=TRUE, pca_1.rank.=4, glmnet_3.resampling.method=cv,
  glmnet_3.resampling.folds=3, glmnet_3.resampling.keep_response=FALSE,
  glmnet_3.alpha=0.5503, pca_2.scale.=TRUE, pca_2.rank.=19,
  ranger_3.resampling.method=cv, ranger_3.resampling.folds=3,
  ranger_3.resampling.keep_response=FALSE, ranger_3.mtry=7,
  ranger_3.num.threads=1, ranger_3.num.trees=1168,
  ranger_3.sample.fraction=0.4376, pca_3.scale.=TRUE, pca_3.rank.=10,
  rpart_3.resampling.method=cv, rpart_3.resampling.folds=3,
  rpart_3.resampling.keep_response=FALSE, pca_4.scale.=TRUE,
  svm_3.resampling.method=cv, svm_3.resampling.folds=3,
  svm_3.resampling.keep_response=FALSE, svm_3.kernel=sigmoid,
  ranger_end.mtry=4, ranger_end.num.threads=1,
  ranger_end.num.trees=153, ranger_end.sample.fraction=0.9536
* Packages: mlr3, mlr3pipelines, mlr3learners, ranger
* Predict Types:  response, [prob]
* Feature Types: logical, integer, numeric, character, factor, ordered,
  POSIXct
* Properties: featureless, hotstart_backward, hotstart_forward,
  importance, loglik, missings, multiclass, oob_error,
  selected_features, twoclass, weights

$tuning_instance
<TuningInstanceSingleCrit>
* State:  Optimized
* Objective: <ObjectiveTuning:filt_jmi1.filt_jmi2.filt_jmi3.filt_jmi4.nop1.glmnet_2.ranger_2.rpart_2.svm_2.union1.copy.pca_1.pca_2.pca_3.pca_4.nop3.glmnet_3.ranger_3.rpart_3.svm_3.union2.u2.ranger_end_on_train>
* Search Space:
                            id    class lower upper nlevels
 1:     filt_jmi1.filter.nfeat ParamInt   5.0    50      46
 2:     filt_jmi2.filter.nfeat ParamInt   5.0    50      46
 3:     filt_jmi3.filter.nfeat ParamInt   5.0    50      46
 4:     filt_jmi4.filter.nfeat ParamInt   5.0    50      46
 5:                pca_1.rank. ParamInt   3.0    50      48
 6:                pca_2.rank. ParamInt   3.0    50      48
 7:                pca_3.rank. ParamInt   3.0    20      18
 8:             glmnet_2.alpha ParamDbl   0.0     1     Inf
 9:              ranger_2.mtry ParamInt   1.0    10      10
10:   ranger_2.sample.fraction ParamDbl   0.0     1     Inf
11:         ranger_2.num.trees ParamInt   1.0  2000    2000
12:               svm_2.kernel ParamFct    NA    NA       4
13:             glmnet_3.alpha ParamDbl   0.0     1     Inf
14:              ranger_3.mtry ParamInt   1.0    10      10
15:   ranger_3.sample.fraction ParamDbl   0.0     1     Inf
16:         ranger_3.num.trees ParamInt   1.0  2000    2000
17:               svm_3.kernel ParamFct    NA    NA       4
18:            ranger_end.mtry ParamInt   1.0    10      10
19: ranger_end.sample.fraction ParamDbl   0.5     1     Inf
20:       ranger_end.num.trees ParamInt  50.0   200     151
* Terminator: <TerminatorEvals>
* Result:
   filt_jmi1.filter.nfeat filt_jmi2.filter.nfeat filt_jmi3.filter.nfeat
1:                     23                     47                     17
   filt_jmi4.filter.nfeat pca_1.rank. pca_2.rank. pca_3.rank. glmnet_2.alpha
1:                     15           4          19          10      0.6807026
   ranger_2.mtry ranger_2.sample.fraction ranger_2.num.trees svm_2.kernel
1:             9                0.2268957                475      sigmoid
   glmnet_3.alpha ranger_3.mtry ranger_3.sample.fraction ranger_3.num.trees
1:      0.5502556             7                 0.437633               1168
   svm_3.kernel ranger_end.mtry ranger_end.sample.fraction ranger_end.num.trees
1:      sigmoid               4                  0.9536162                  153
   classif.mauc_aunu
1:         0.9210168
* Archive:
   filt_jmi1.filter.nfeat filt_jmi2.filter.nfeat filt_jmi3.filter.nfeat
1:                     44                     20                     47
2:                      9                      8                     13
3:                     23                     47                     17
   filt_jmi4.filter.nfeat pca_1.rank. pca_2.rank. pca_3.rank. glmnet_2.alpha
1:                     38          11          13           8      0.3087261
2:                     29          17          40          18      0.8067394
3:                     15           4          19          10      0.6807026
   ranger_2.mtry ranger_2.sample.fraction ranger_2.num.trees svm_2.kernel
1:             2                0.9199991                 88      sigmoid
2:             3                0.4590146                214   polynomial
3:             9                0.2268957                475      sigmoid
   glmnet_3.alpha ranger_3.mtry ranger_3.sample.fraction ranger_3.num.trees
1:      0.5508915             9                0.7515166               1267
2:      0.5327359             6                0.7703981                 61
3:      0.5502556             7                0.4376330               1168
   svm_3.kernel ranger_end.mtry ranger_end.sample.fraction ranger_end.num.trees
1:   polynomial              10                  0.5060165                  183
2:      sigmoid               2                  0.5191757                  190
3:      sigmoid               4                  0.9536162                  153
   classif.mauc_aunu
1:         0.9148065
2:         0.9168070
3:         0.9210168

3.9 Performance

To evaluate the performance of the model I used the test dataset (30% from data)

Code
prediction_test = at$predict_newdata(data[split$test,])
prediction_test
<PredictionClassif> for 816 observations:
    row_ids truth response      prob.1      prob.2     prob.3
          1     3        3 0.000000000 0.000000000 1.00000000
          2     3        3 0.000000000 0.000000000 1.00000000
          3     3        3 0.000000000 0.005607428 0.99439257
---                                                          
        814     2        2 0.003812636 0.909267040 0.08692032
        815     2        3 0.194148771 0.046630875 0.75922035
        816     2        3 0.018487395 0.310696130 0.67081647

The metrics used to evaluate the model are accuracy (acc), multiclass brier score, and the multiclass auc score (mauc_aunu).

Code
measure_acc = msr("classif.acc")
measure_brier =msr("classif.mbrier")
measure_mauc =msr("classif.mauc_aunu")

acc <- prediction_test$score(measure_acc)
mbrier <- prediction_test$score(measure_brier)
mauc_aunu <- prediction_test$score(measure_mauc)

data.frame(metric = c('acc','mbrier','mauc_aunu'),value = c(acc,mbrier,mauc_aunu))
                     metric     value
classif.acc             acc 0.8100490
classif.mbrier       mbrier 0.2542765
classif.mauc_aunu mauc_aunu 0.9238399

3.10 Prediction over new data

Code
prediction_new = at$predict_newdata(new_data)
prediction_new
<PredictionClassif> for 1311 observations:
    row_ids truth response     prob.1     prob.2    prob.3
          1  <NA>        3 0.08236332 0.02559913 0.8920376
          2  <NA>        3 0.08574022 0.04460784 0.8696519
          3  <NA>        3 0.12509596 0.10917367 0.7657304
---                                                       
       1309  <NA>        3 0.01437908 0.01623094 0.9693900
       1310  <NA>        3 0.36272954 0.13516703 0.5021034
       1311  <NA>        3 0.14384272 0.08762320 0.7685341

3.11 Resampling

Takeing some time to run…

Coming soon..