MODIS-FIRMS and ground-truthing based wildfire likelihood mapping of Sikkim Himalaya using machine learning algorithms.
Polash Banerjee ( [email protected] ), Sikkim Manipal University of Health Medical and Technological Sciences: Sikkim Manipal University, https://orcid.org/0000-0002-2187-9347
Research Article
Keywords: Forest fire, Prediction map, algorithm, statistical learning, GIS
Posted Date: August 31st, 2021
DOI: https://doi.org/10.21203/rs.3.rs-750123/v1
License: This work is licensed under a Creative Commons Attribution 4.0 International License.
Version of Record: A version of this preprint was published at Natural Hazards on August 15th, 2021. See the published version at https://doi.org/10.1007/s11069-021-04973-6.
Where, $P_{obs}$ is the proportion of pixels correctly classified as wildfires or as non-wildfires, and $P_{exp}$ is the proportion of pixels for which the agreement is expected by chance only.
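As a worked illustration of the kappa statistic defined above, the following minimal Python sketch computes the observed agreement, the chance agreement and kappa from a 2 × 2 confusion matrix. The counts and the function name `cohen_kappa` are hypothetical and are not taken from this study, which used R packages.

```python
def cohen_kappa(tp, fp, fn, tn):
    """Cohen's kappa for a binary wildfire / non-wildfire classification."""
    n = tp + fp + fn + tn
    # Observed agreement: pixels correctly classified in either class.
    p_obs = (tp + tn) / n
    # Chance agreement: product of the marginal proportions, summed over classes.
    p_fire = ((tp + fn) / n) * ((tp + fp) / n)
    p_nofire = ((tn + fp) / n) * ((tn + fn) / n)
    p_exp = p_fire + p_nofire
    return (p_obs - p_exp) / (1 - p_exp)

# Hypothetical counts, for illustration only.
kappa = cohen_kappa(tp=40, fp=10, fn=10, tn=40)
print(round(kappa, 3))
```

With these made-up counts the observed agreement is 0.8 and the chance agreement is 0.5, so kappa is about 0.6.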
The ROC curve visualizes how the sensitivity of the prediction model changes with its specificity. A perfect model yields the ideal tuple (1, 1), implying perfect sensitivity and specificity. Usually, a good model generates an ROC curve that is concave with respect to the diagonal connecting (1, 0) to (0, 1) for the tuple (sensitivity, specificity), with a high value of Area Under the Curve (AUC).
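The construction of an ROC curve and its AUC can be sketched as follows. This is a minimal illustration in plain Python, assuming a binary target and one predicted score per instance; the labels, scores and function names are hypothetical, and the study itself relied on an R package for ROC analysis.

```python
def roc_points(y_true, scores):
    """Return (specificity, sensitivity) pairs for every score threshold."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(1.0, 0.0)]  # threshold above every score: nothing is flagged
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 0)
        points.append((1 - fp / neg, tp / pos))
    return points

def auc(points):
    """Trapezoidal area under sensitivity plotted against 1 - specificity."""
    xy = sorted((1 - spec, sens) for spec, sens in points)
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(xy, xy[1:]))

# Hypothetical labels (1 = wildfire) and predicted scores.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]
area = auc(roc_points(y_true, scores))
print(round(area, 3))
```

A model that ranks every wildfire instance above every non-wildfire instance passes through the tuple (1, 1) and has an AUC of 1, while a curve along the diagonal has an AUC of 0.5.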
The performance of the machine learning methods was compared using box and whisker plots and scatter plot matrices. For this purpose, 30 samples were drawn from the dataset using the cross-validation method. These plots visualize the similarities between the performances of the machine learning methods.
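One way to obtain 30 resamples by cross-validation is 10-fold cross-validation repeated three times. The text does not state the exact scheme, so the sketch below is an assumption; the function names are illustrative, and 1090 is the instance count reported later for the GLM.

```python
import random
import statistics

def repeated_kfold(n, k=10, repeats=3, seed=0):
    """Yield (train_idx, test_idx) pairs; k * repeats resamples in total."""
    rng = random.Random(seed)
    for _ in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)
        for f in range(k):
            test = idx[f::k]  # every k-th shuffled index forms one fold
            held_out = set(test)
            train = [i for i in idx if i not in held_out]
            yield train, test

def box_summary(values):
    """Minimum, quartiles and maximum, as drawn in a box and whisker plot."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return min(values), q1, median, q3, max(values)

splits = list(repeated_kfold(n=1090))
print(len(splits))
```

Each model would be trained and scored once per resample, and `box_summary` applied to the 30 resulting values of a performance index gives the quantities displayed in a box and whisker plot.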
2.9. Wildfire likelihood model
In this study, several R packages were used for data pre-processing and machine learning (Hunt, 2020; Kuhn, 2020; Liaw & Wiener, 2002; Robin et al., 2011). These packages were used to train and test the machine learning algorithms. The algorithms also reported the importance of the feature variables used in the model. Features with no importance to the model were dropped from the dataset and the algorithms were rerun. The feature rasters were then stacked, and the machine learning algorithms used the stack to predict the wildfire probability over the entire study area. The matrix of predicted values for the entire study area was exported as a GeoTIFF raster, which stores the predicted values along with their respective latitude and longitude values. The rasters of predicted values were imported into the GIS framework for further analysis, which included the categorization of the raster into areas of very low, low, medium, high and very high probability of wildfire (Figure 2).
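The step of applying a trained model to a raster stack can be sketched as follows: each pixel contributes one feature vector, built from the same cell of every layer, and the prediction is written back to the same cell. The grids, the logistic coefficients and the function names are all made up for illustration; GeoTIFF export is omitted because it needs a GIS library.

```python
import math

def predict_raster(stack, model):
    """Apply `model` to every pixel of a stack of equally sized 2-D grids."""
    rows, cols = len(stack[0]), len(stack[0][0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            features = [layer[r][c] for layer in stack]  # one value per layer
            out[r][c] = model(features)
    return out

def toy_model(features):
    """Hypothetical logistic model; the coefficients are invented."""
    road_km, wind_ms = features
    z = 1.0 - 2.0 * road_km - 0.5 * wind_ms
    return 1.0 / (1.0 + math.exp(-z))

road = [[0.1, 0.4], [2.0, 3.0]]   # distance to road in km (made up)
wind = [[1.6, 1.6], [2.5, 3.0]]   # mean wind speed in m/s (made up)
prob = predict_raster([road, wind], toy_model)
```

The output grid has the same shape as each input layer, so it inherits their georeferencing when written back to a raster.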
3. Results
The wildfire inventory dataset was used to analyse whether wildfires were on the rise in Sikkim Himalaya. Time series data of wildfires from 2000 to 2019 indicated a growing trend. Forecasting the wildfires with Holt's forecast model made the picture clearer: the model predicted that wildfires were likely to increase from 82 in 2019 to 96 in 2022, with an uncertainty of ±62.343 events (Figure 3). To identify the relevant environmental features contributing to wildfires, a multicollinearity analysis was performed.
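Holt's method extends exponential smoothing with a trend term, so forecasts continue along the most recent smoothed slope. The sketch below is a minimal version with one common initialization (first value as level, first difference as trend); the smoothing constants and the series are hypothetical, not the fitted values or the wildfire counts of this study, and prediction intervals are not modelled.

```python
def holt_forecast(series, alpha, beta, horizon):
    """Holt's linear-trend exponential smoothing; returns `horizon` forecasts."""
    level = series[0]
    trend = series[1] - series[0]
    for y in series[1:]:
        prev_level = level
        # Blend the new observation with the previous one-step-ahead forecast.
        level = alpha * y + (1 - alpha) * (level + trend)
        # Blend the latest change in level with the previous trend estimate.
        trend = beta * (level - prev_level) + (1 - beta) * trend
    return [level + h * trend for h in range(1, horizon + 1)]

# Hypothetical yearly event counts with a steady increase of two per year.
counts = [10, 12, 14, 16, 18, 20]
forecast = holt_forecast(counts, alpha=0.8, beta=0.2, horizon=3)
print([round(f, 2) for f in forecast])
```

For a perfectly linear series such as this one, the forecasts simply continue the line, which makes the behaviour of the level and trend updates easy to check by hand.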
3.1. Multicollinearity analysis
Based on the prevailing literature, 16 environmental features were initially considered for the prediction. However, multicollinearity analysis reduced the number of explanatory environmental features to 15 (Figure 4, Table 2).
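Multicollinearity is commonly screened with a correlation matrix and the variance inflation factor, VIF = 1 / (1 - R²), where R² comes from regressing one feature on the others. With only two features, R² reduces to the squared Pearson correlation, which the sketch below uses to stay short. The values and function names are made up; a full analysis would regress each feature on all the remaining ones.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def vif_two_features(x, y):
    """VIF = 1 / (1 - R^2); with two predictors R^2 equals r^2."""
    r = pearson(x, y)
    return 1.0 / (1.0 - r ** 2)

temperature = [24.0, 22.0, 19.0, 15.0, 10.0]   # made-up values
wind_speed = [1.5, 2.4, 1.6, 2.6, 2.2]         # made-up values
r = pearson(temperature, wind_speed)
vif = vif_two_features(temperature, wind_speed)
print(round(r, 3), round(vif, 3))
```

Rule-of-thumb cut-offs of 5 or 10 are often used to flag a feature for removal; the threshold applied in this study is not stated in this section.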
3.2. Impact of environmental features on wildfires
Wildfires were more frequent over certain intervals of the ranges of the environmental features. Except for TWI, wildfire occurrences were mostly normally distributed over the topographical features. For instance, for aspect, wildfires were more common over the interval 140°–220°, covering the southeast to southwest directions. Wildfires were also more common on steep slopes (29°–32°). Wildfire events showed normal distributions over certain ranges of plan and profile curvature. All the wildfires occurred in convex curvatures in the case of plan curvature and in concave curvatures in the case of profile curvature (Pourghasemi, 2013; 2014). In contrast, wildfires were skewed towards the lower TWI interval (5–6.5). Wildfires showed more skewness over the meteorological features. They were very common in the warmer areas of Sikkim, mainly the lower altitude areas with a high average temperature (19 °C–24 °C). Moreover, wildfires were concentrated in areas of low average wind speed (about 1.6 m s⁻¹). Regarding ecological features, wildfires were more common over moderate NDVI values (about 0.6) and moderate tree cover (43%–49%). Wildfires were also clustered near water bodies (400 m–630 m away). Regarding the in-situ features, wildfire events were mostly confined to soil with low carbon content (25–40 g/kg of soil) and moderately humid soil surfaces (30–45 volumetric %). Considering the anthropogenic features, wildfires showed moderate skewness towards areas close to settlements (1 km to 2.5 km from settlements), whereas wildfire incidence was high in areas close to roadways (≤ 400 m from the roads) (Supplement Figure 2, Figure 5).
3.3. Model summary and model performance
3.3.1. Generalized linear model
The GLM was run with 14 environmental features as explanatory variables. However, aspect, plan curvature, profile curvature and TWI were not significant, so they were dropped and the model was rerun (Table 2). Hence, the GLM-based prediction included ten explanatory variables and 1090 instances. 10-fold cross-validation was performed for model tuning. Table 2 shows that proximity to roadways and low wind speed were the strongest determinants of wildfire in Sikkim Himalaya. Features such as proximity to water bodies, slope and average ambient temperature were partly accountable for wildfires. Interestingly, distance from human habitations had an inverse effect on wildfire occurrences. Low soil carbon and drier soil promoted wildfires, and low tree cover increased the chances of wildfires. The model was able to explain 62% of the predictions (Table 3). The model performance was satisfactory, with low RMSE and MAE and high AUC, accuracy, kappa, sensitivity, specificity, precision, F1 score and goodness of fit (R²) (Table 4).
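The performance indices listed above can all be derived from the predicted probabilities and the true labels. The following sketch assumes a 0.5 cut-off for turning probabilities into classes and computes RMSE and MAE on the probabilities; both choices are assumptions, and the labels and predictions are invented.

```python
import math

def classification_metrics(y_true, y_prob, threshold=0.5):
    """Confusion-matrix and error metrics; assumes both classes are present
    and at least one instance is predicted positive."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    sensitivity = tp / (tp + fn)
    precision = tp / (tp + fp)
    errors = [t - p for t, p in zip(y_true, y_prob)]
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": sensitivity,
        "specificity": tn / (tn + fp),
        "precision": precision,
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
        "mae": sum(abs(e) for e in errors) / len(errors),
        "rmse": math.sqrt(sum(e * e for e in errors) / len(errors)),
    }

# Hypothetical labels (1 = wildfire) and predicted probabilities.
metrics = classification_metrics([1, 1, 1, 0, 0, 0],
                                 [0.9, 0.8, 0.4, 0.6, 0.2, 0.1])
```

Because RMSE squares the errors before averaging, it is never smaller than MAE and reacts more strongly to a few badly predicted instances.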
3.3.2. Support vector machine
A nonlinear kernel, the radial basis function, was used in SVM for the prediction of wildfires. SVM used 727 support vectors to distinguish between the presence and absence of wildfire instances in the training dataset. 10-fold cross-validation was performed to tune the model. SVM uses several parameters, known as hyper-parameters, to tune the algorithm and converge to the solution. The model hyper-parameters sigma (σ), epsilon (ε) and cost (C) settled at 0.106, 0.1 and 1, respectively. The ε tunes SVM by determining the number of support vectors to be considered for regression. C, which is similar to λ in Eq. (9), is accountable for regularization, which provides a trade-off between over- and under-fitting of SVM. The objective function value of SVM settled at -292.33, and the training error, the convergent error of the model achieved on the training set, was 0.286. The RMSE of the final iteration of SVM was the same as that of GLM (Supplement Figure 3a). However, the other performance indices were worse than those of GLM, except for sensitivity, F1 score and RMSE (Table 4).
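The radial basis function kernel measures the similarity of two feature vectors through their squared distance. The sketch below assumes the parameterization k(x, z) = exp(-σ‖x − z‖²), with σ set to the reported 0.106; the support vectors, coefficients and intercept in the decision function are invented for illustration and are not the fitted model.

```python
import math

def rbf_kernel(x, z, sigma=0.106):
    """k(x, z) = exp(-sigma * ||x - z||^2); the form is an assumption."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sigma * sq_dist)

def decision_value(x, support_vectors, coefficients, intercept):
    """Kernel expansion: weighted similarity to each support vector."""
    return intercept + sum(c * rbf_kernel(sv, x)
                           for sv, c in zip(support_vectors, coefficients))

# Hypothetical support vectors, dual coefficients and intercept.
svs = [[0.0, 0.0], [3.0, 3.0]]
coefs = [1.0, -1.0]
value_near_first = decision_value([0.2, 0.1], svs, coefs, intercept=0.0)
value_near_second = decision_value([2.9, 3.1], svs, coefs, intercept=0.0)
```

A point close to a support vector with a positive coefficient receives a positive contribution, and the contribution decays smoothly with distance at a rate set by σ.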
3.3.3. Gradient boosting model
For GBM, stochastic gradient boosting was applied with the Gaussian loss function. While converging to the solution, GBM takes smaller learning steps to reduce the effect of each additional fitted weak learner tree. This penalization, called 'shrinkage', reduces the chances of giving undue importance to erroneous iterations. Another tuning parameter of GBM, 'n.minobsinnode', is the minimum number of observations in the terminal nodes of the trees. The GBM in this study used the default values of shrinkage and n.minobsinnode, 0.1 and 10, respectively. GBM converged to the solution with 150 decision trees (n.trees) and an interaction depth of 3 (Supplement Figure 3b). Performance-wise, GBM outperformed GLM and SVM, except for MAE (Table 4).
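The role of shrinkage can be seen in a stripped-down boosting loop. The sketch below fits one-split regression stumps to the residuals of a squared-error (Gaussian) loss on a single feature and adds only a fraction of each stump to the running prediction. It is a simplification under stated assumptions: there is no random subsampling (so it is not stochastic), the trees have depth 1 rather than 3, and no minimum node size is enforced. The data are made up.

```python
def fit_stump(x, residuals):
    """Best single-split regression stump on one feature (squared error)."""
    best = None
    for split in sorted(set(x))[1:]:
        left = [r for xi, r in zip(x, residuals) if xi < split]
        right = [r for xi, r in zip(x, residuals) if xi >= split]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lm) ** 2 for r in left)
               + sum((r - rm) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, split, lm, rm)
    _, split, lm, rm = best
    return lambda v: lm if v < split else rm

def gradient_boost(x, y, n_trees=150, shrinkage=0.1):
    """Squared-error boosting: each stump fits the current residuals."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    stumps = []
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        # Shrinkage: only a fraction of each new stump's output is added.
        pred = [pi + shrinkage * stump(xi) for pi, xi in zip(pred, x)]
    return lambda v: base + shrinkage * sum(s(v) for s in stumps)

road_km = [0.1, 0.2, 0.3, 2.0, 2.5, 3.0]   # made-up distances to road
fire = [1, 1, 1, 0, 0, 0]                  # made-up target
gbm = gradient_boost(road_km, fire)
```

With a shrinkage of 0.1, each iteration removes only about a tenth of the remaining residual, which is why many trees are needed before the predictions settle.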
3.3.4. Random Forest model
RF was used to predict wildfires in Sikkim Himalaya using 500 decision trees. RF converged to the solution when the algorithm selected eight environmental features at random at each split (mtry) (Supplement Figure 3c). The mean squared residual of RF was 0.082, and RF explained 67.27% of the variance of the out-of-bag predictions of the target variable of the training set (% Var explained). RF outperformed all the other prediction models (Table 4).
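Three ingredients of RF mentioned above can be illustrated briefly: bootstrap sampling, which leaves roughly a third of the rows out-of-bag for each tree; random selection of mtry candidate features per split; and the pseudo R², 1 − MSE / Var(y), reported as "% Var explained". In the sketch below the feature names are placeholders, and the target variance of 0.25 assumes a balanced 0/1 target, which this section does not state.

```python
import random

def bootstrap_split(n, seed=0):
    """Draw n row indices with replacement; untouched rows are out-of-bag."""
    rng = random.Random(seed)
    in_bag = [rng.randrange(n) for _ in range(n)]
    out_of_bag = sorted(set(range(n)) - set(in_bag))
    return in_bag, out_of_bag

def pct_var_explained(mse, var_y):
    """Pseudo R-squared in percent: 100 * (1 - MSE / Var(y))."""
    return 100.0 * (1.0 - mse / var_y)

in_bag, oob = bootstrap_split(1000)
oob_fraction = len(oob) / 1000          # expected to be near exp(-1)

features = ["feature_%d" % i for i in range(15)]    # placeholder names
candidates = random.Random(1).sample(features, 8)   # mtry = 8 at one split

# var_y = 0.25 assumes a balanced 0/1 target (an assumption, see lead-in).
pct = pct_var_explained(mse=0.082, var_y=0.25)
print(round(pct, 1))
```

Under that assumption the reported mean squared residual of 0.082 corresponds to about 67.2% of variance explained, which is consistent with the reported 67.27% up to rounding.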
3.3.5. Comparative analysis of models
The box and whisker plots of accuracy and kappa showed very similar distributions over the 30 samples drawn from the wildfire dataset using the cross-validation method. In both cases, the RF distribution lay further to the right than those of all other models, indicating higher model performance, followed by GBM. The greater width of the box for GBM indicated a greater interquartile range. This showed a generalization approach of classification by GBM, which is reflected in its better classification ability than GLM and SVM. The long whiskers in the case of GLM indicated its inability to classify the incidents efficiently. In contrast, for MAE and RMSE, RF showed a distribution shifted towards the left, with a compact interquartile range and the range of observations skewed towards the left. This showed that RF is a much better contender in classification than the other models, which provided very similar MAE values but varied prediction distributions. RMSE, being sensitive to outliers, showed a clearer picture, with the lowest value for RF followed by GBM. In the case of R², RF showed a compact interquartile distribution with a high value. GBM showed the next highest value of R²; however, its mean value was pulled leftward by the skewed distribution of its predictions (Figure 6). The scatter plot matrix (SPLOM) showed that the predictions of RF and GBM were highly correlated. Similarly, GLM and SVM showed a strong correlation. In contrast, no such correlation was observed in the case of MAE, except for RF and GBM (Figure 7). The ROC curves indicated that all the models performed satisfactorily, while RF outperformed all the other models (Figure 8).
3.4. Variable importance
Proximity to roadways received the highest importance in GLM, GBM and RF. This was followed by average wind speed, which received the highest importance in SVM and GLM. Features such as average temperature, NDVI and tree cover were also found important in SVM. Methods like RF and GBM, which use regression trees for prediction, gave disproportionately high importance to proximity to roadways and average wind speed. In contrast, methods like SVM and GLM, which do not rely on regression trees, distributed importance more evenly across the features. Topographic features like plan curvature, profile curvature and TWI received no to low importance in all the models. In-situ features received low to moderate importance. Anthropogenic, meteorological and ecological features were found to be the most important determinants of predictions (Figure 9).
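For tree-based models, variable importance is often estimated by permutation: one feature column is shuffled at a time and the resulting increase in mean squared error is recorded. The sketch below shows the idea on a toy model and made-up data; it is not the exact procedure of the R packages used in the study, and all names are illustrative.

```python
import random

def mse(model, rows, y):
    """Mean squared error of `model` over feature rows and targets."""
    return sum((model(r) - t) ** 2 for r, t in zip(rows, y)) / len(y)

def permutation_importance(model, rows, y, seed=0):
    """Increase in MSE when one feature column is shuffled at a time."""
    rng = random.Random(seed)
    baseline = mse(model, rows, y)
    scores = []
    for j in range(len(rows[0])):
        column = [r[j] for r in rows]
        rng.shuffle(column)
        # Rebuild the rows with only column j replaced by its shuffled copy.
        shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
        scores.append(mse(model, shuffled, y) - baseline)
    return scores

def toy_model(row):
    """Hypothetical model that only looks at feature 0 (distance to road)."""
    return 1.0 if row[0] < 1.0 else 0.0

rows = [[0.1, 5.0], [0.3, 6.0], [0.4, 5.5],
        [2.0, 6.5], [2.5, 5.2], [3.0, 6.1]]   # made-up feature values
y = [1, 1, 1, 0, 0, 0]
importance = permutation_importance(toy_model, rows, y)
```

A feature the model never uses, such as the second column here, receives an importance of exactly zero, because shuffling it cannot change any prediction.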
3.5. Wildfire prediction maps
All the feature maps were projected to the plane coordinate system WGS-1984-UTM-Zone-45N, as it is appropriate for this part of India. Furthermore, all the feature maps were resampled to a resolution of 30.7 m. Accordingly, all the WLMs had the same projection system and resolution. In all WLMs, the southern part and valley areas of Sikkim Himalaya were found to be at a higher risk of wildfires. In the case of GLM, wildfire probability was very low in most of the study area, except for the warmer valley areas in the southern parts of Sikkim Himalaya. However, GLM put more emphasis on areas with high soil carbon content, which led to such areas being assigned high wildfire likelihood values (Figure 10a). In contrast, SVM gave a slightly higher probability than GLM to most of the study area, and gave a higher probability of wildfire to a much larger fraction of the study area (Figure 10b). GBM assigned more area to wildfire than GLM and SVM (Figure 10c). RF gave more importance to valley areas than GBM, although its spatial distribution of wildfire probability was highly similar to that of GBM (Figure 10d). Based on the accuracy and performance of the prediction models, the WLM of RF was considered the best for Sikkim Himalaya. The WLM of RF was classified into five categories, namely very low, low, medium, high and very high likelihood of wildfire, based on natural breaks in the GIS framework (Figure 11). The very high likelihood category covered a relatively larger area than the high likelihood category (Figure 12).
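Once class breaks are known, classifying the likelihood raster and tallying the area per category is straightforward. The sketch below uses the reported 30.7 m pixel size; the four break values are equal-interval placeholders rather than the natural breaks computed in the GIS, and the small grid is made up.

```python
import bisect

PIXEL_SIZE_M = 30.7  # raster resolution reported in the text
LABELS = ["very low", "low", "medium", "high", "very high"]

def classify(prob, breaks):
    """Map a probability to one of five classes given four ascending breaks."""
    return LABELS[bisect.bisect_right(breaks, prob)]

def class_areas_km2(grid, breaks):
    """Total area per likelihood class, in square kilometres."""
    pixel_km2 = (PIXEL_SIZE_M ** 2) / 1e6
    areas = {label: 0.0 for label in LABELS}
    for row in grid:
        for p in row:
            areas[classify(p, breaks)] += pixel_km2
    return areas

breaks = [0.2, 0.4, 0.6, 0.8]   # hypothetical; not natural breaks
grid = [[0.05, 0.25, 0.95],
        [0.55, 0.70, 0.90]]     # made-up probabilities
areas = class_areas_km2(grid, breaks)
```

Each pixel covers 30.7 m × 30.7 m, about 942.5 m², so the per-class areas are simply pixel counts scaled by that constant.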
4. Discussion
The overarching objective of this study was to prepare the WLM of Sikkim Himalaya based on a comparative study of machine learning methods with appropriate explanatory variables. The study yielded prediction maps with good model performance indices.
4.1. Comparison between machine learning methods and their implications
In this study, four algorithms were considered instead of just one, mainly to identify which of these popular algorithms performs best in wildfire prediction. Contrary to some previous studies, RF outperformed the other machine learning methods in wildfire predictions (Ogutu et al., 2011; Tehrany et al., 2019; Xie & Peng, 2019). However, this observation was in harmony with studies performed by other authors (Guo et al., 2016; E. Kim et al., 2015; Massada et al., 2013). The better performance of RF in comparison to GLM and SVM may be because RF uses the ensemble method of learning instead of linear or kernel-based learning. In the ensemble method, the outputs of several decision trees are averaged, which increases the chances of correct prediction. Also, contrary to SVM, RF is good at handling datasets with many outliers (Andreas, 2013). The histograms of several environmental features in this study were skewed, which was perhaps another reason for the better performance of RF.
The comparative analysis of the models was based on samples extracted from the wildfire dataset through the cross-validation method. It showed that GLM had a much wider range of accuracy and kappa values. This can be explained by the limited number of feature variables considered in the GLM in comparison to the other models. Furthermore, the smallest range of MAE for GLM indicated that the wide ranges of accuracy and kappa values may be due to a large set of outliers in the wildfire dataset (Géron, 2017). The higher correlation between GLM and SVM, as well as between GBM and RF, in terms of accuracy and MAE showed that only one model from each pair should be considered when building an ensemble of models to improve the prediction of wildfire events (Brownlee, 2016).
4.2. Importance of feature variables
Consistent with previous studies, this study suggested that meteorological features like wind speed and, to some extent, ambient temperature were important determinants of wildfires. The low wind speed and warm temperature of the valley areas are features of the sub-tropical Sal and Oak deciduous forests prone to wildfires in Sikkim Himalaya. Among the anthropogenic features, distance from the roadways was on average the strongest predictor of wildfires. To a lesser extent, proximity to human habitations also contributed to the predictions of wildfires. These observations agree with previous studies on wildfires of Sikkim (Sharma et al., 2014). Ecological features like the fraction of tree cover and in-situ features like soil carbon were moderate predictors. Compared to other features, topographical features were not very good predictors of wildfires (Arpaci et al., 2014; Estes et al., 2017; Flannigan & Harrington, 1988; Guo et al., 2016; Jaafari et al., 2018; T. Kim et al., 2015; Ljubomir et al., 2019; Sachdeva et al., 2018; Tien Bui et al., 2019; Yathish et al., 2019). Contrary to the MCDA-based study of forest fire risk zones of Sikkim, which used AHP (Laha et al., 2020), the present study gave limited importance to aspect, except in the SVM model. However, the indirect measures of human population density, namely proximity to human settlements and roadways, supported the observations of Laha et al. (2020). Like Banerjee (2021), this study showed that proximity to roadways was the most important determinant of wildfire in Sikkim. However, contrary to Banerjee (2021), average wind speed was given more weight in this study than average ambient temperature. In the correlation matrix, these two meteorological variables had a significant negative correlation. However, their collinearity in terms of VIF was within acceptable limits, so they were considered as independent variables in this study. Furthermore, tree cover fraction was considered an important factor in wildfire prediction in both studies.
4.3. Future risks of wildfire
The study showed that wildfires were predominantly distributed in the lower altitudes and valley areas of Sikkim Himalaya. A few observations can be made about these areas. Their meteorological conditions were identified as relatively warm ambient temperatures and low wind speed. Also, the road network of these areas closely follows the river network. Steep slopes facing southeast to southwest with low TWI explained most of the wildfires of these areas (Graham et al., 2004; Jo et al., 2000; Mhawej et al., 2015). Areas with low soil carbon and water content had more incidents of wildfire. The role of human activities in the occurrence of wildfires was evident from the study. These observations were similar to previous studies (Arpaci et al., 2014; S. J. Kim et al., 2019). However, contrary to previous studies, proximity to settlements had a contradictory role in this study, as the bulk of the wildfires were on average 2.5 km away from human habitations (S. J. Kim et al., 2019; Massada et al., 2013; Nami et al., 2018; Vilar et al., 2016). This may be because the land around the settlements was mainly non-forest land, such as agrarian or fallow land. Thereby, the areas of Sikkim bordering the state of West Bengal, the district borders of West and South Sikkim, and the populated valleys of North Sikkim are more prone to wildfires.
The WLMs did not effectively predict the wildfires of upper North Sikkim. This may be because meteorological factors such as the occurrence of lightning were not considered in this study. In contrast, an earlier study mentions the role of lightning in wildfires in North Sikkim (Sharma, Joshi, & Chhetri, 2014).
This study is probably the first attempt to systematically prepare the WLM of Sikkim Himalaya using multiple machine learning models. In line with studies done in other locations, this study indicated that anthropogenic and meteorological factors were the most prominent descriptors of wildfires. This study also highlighted that machine learning methods are a reliable means of preparing hazard maps. However, the reliability of the predictions depends heavily on the quality of the wildfire inventory. The inventory can be improved by pruning incomplete instances and instances with an incorrect target variable. Usually, a large and representative inventory leads to better predictions. Feature engineering steps, such as normalization and the removal of multicollinear features, are also essential for dataset preparation. Regarding the choice of algorithms, it is important to consider the nature of the dataset, in particular whether the target variable is binomial, multinomial, categorical or continuous. Moreover, the skewness of the features plays an important role in the choice of machine learning methods. Cross-validation and the choice of hyperparameters for regularization are essential steps towards reliable algorithm-based predictions.
The outcomes of this study can be useful to stakeholders for preparedness and for the effective allocation of fire-retarding resources and manpower to wildfire-prone areas. Furthermore, a vulnerability assessment of wildfire in Sikkim can be performed based on this study by overlaying socioeconomic and environmental cost maps on the wildfire likelihood map of Sikkim. Such studies can be very helpful for wildfire mitigation and land-use policies.
5. Conclusion
Applications of machine learning in geospatial analysis are progressively expanding. One of the prominent niches of this new branch of science is the predictive modelling of natural hazards. This study presents the wildfire prediction map of Sikkim Himalaya using four machine learning methods. These methods were run over the wildfire dataset involving several environmental features encompassing meteorological, topographical, ecological, in-situ and anthropogenic factors. The methods, namely the Generalized Linear Model in the form of Logistic Regression, the Radial Basis Function kernel-based Support Vector Machine, the Gradient Boosting Model and Random Forest, are compared using model performance criteria. Amongst these, Random Forest computes the most accurate prediction, followed by the Gradient Boosting Model. These methods produce high values of AUC, accuracy, kappa, sensitivity, specificity, precision, F1 score and goodness of fit, and low values of RMSE and MAE. These decision tree-based methods marginally outperformed SVM and GLM.
Furthermore, it is concluded that meteorological factors like ambient temperature and wind speed over the dry season, as well as anthropogenic factors like proximity to roadways, are the most important descriptors of wildfires in Sikkim Himalaya. Most of the wildfires in Sikkim occur in the low altitude valley areas of the south. These observations can be incorporated into wildfire mitigation policies that address the consequences of slash and burn farming, the use of fire to discourage entry of wildlife into settlements, and traffic-induced wildfires. Also, long-term policy interventions can be prepared from this study regarding the impact of climate change-induced changes in the meteorological conditions of Sikkim Himalaya.
This study shows that machine learning can be combined with GIS to produce robust geospatial models of wildfire predictions. Machine learning can be a reliable wildfire management tool. Such a tool can be further improved by integrating online learning, in which the prediction model learns incrementally from a near real-time database like MODIS FIRMS. The methodology of this study can be further extended to include more in-situ and meteorological factors in the feature space. Also, other artificial intelligence methods like ANN, evolutionary algorithms and agent-based learning can be applied to the wildfire dataset to generate better and more reliable prediction maps. However, such studies need to trade off accuracy against interpretability.
Figure 2
Methodology of the preparation of the wildfire likelihood map. (Source of raster stack image: https://i.stack.imgur.com/whXlL.png)
Figure 3
Time series of wildfire events in Sikkim Himalaya from 2000 to 2019. Holt's forecast model indicates an increasing trend of wildfire in Sikkim Himalaya when projected to the year 2022. The forecast has an average boundary of ±62.343 wildfire events from 2020 onwards.
Figure 4
Correlation matrix of feature variables.
Figure 5
Environmental features. All the maps have been reclassified using the Jenks natural breaks method. The natural breaks method minimizes the variance within categories while maximizing the variance between categories. This leads to an increase in the quality of the classification (Jenks, 1967). (a) Aspect (Cont.)
Figure 6
Box and whisker plots of model performance indices.
Figure 7
Scatter plot matrix of accuracy and MAE
Figure 8
ROC curve of (a) GLM, (b) SVM, (c) GBM, (d) RF.
Figure 9
Importance or influence of the environmental features on the prediction models. For RF and GBM, variable importance was calculated by estimating the Mean Squared Error (MSE) of the out-of-bag sample after shuffling the dataset. The loess R-squared method was used to estimate the variable importance of SVM. For GLM, the absolute value of the t-statistic of the model parameters was used to estimate the variable importance (Kuhn, 2019).
Figure 10
Wildfire likelihood map of Sikkim Himalaya based on the prediction of (a) GLM (Cont.)
Figure 11
Wildfire likelihood map of Sikkim Himalaya showing various categories of likelihood of wildfire.
Figure 12
Areas under the various wildfire likelihood categories.