Feature Selection for Tree Species Identification in Very High Resolution Satellite Images
Matthieu Molinier and Heikki Astola
VTT Technical Research Centre of Finland
[email protected], [email protected]
IGARSS 2011 Vancouver, 28.7.2011
Introduction
NewForest – Renewal of Forest Resource Mapping
• A 1.5-year study (2009-2010) funded by the Finnish Funding Agency for Technology and Innovation (TEKES), with Finnish forest companies and research organizations (VTT and University of Eastern Finland, UEF)
Study motivation
• Improve methods for operational forest inventory from remote sensing data
• Species-wise estimates (e.g. stem volume) are not accurate enough (accuracy vs. cost)
NewForest approach in forest variable estimation
ToV: total stem volume; PinV, SprV, DecV: stem volume of pine, spruce, deciduous
Plot ID  Eastings [m]  Northings [m]  ToV [m3/ha]  PinV [m3/ha]  SprV [m3/ha]  DecV [m3/ha]
49       508469.1      6973060        149.49       149.49        0             0
117      510723.7      6972375        68.36        0             8.37          59.99
118      510732.8      6972976        150.78       0             0             150.78
121      511123        6973176        97.08        78.12         18.01         0.94
122      511324        6973177        89.63        20.31         65.63         3.69
132      516717.5      6969571        337.18       39.81         168.01        129.37
133      516716.5      6969773        370.18       71.73         282.28        16.17
134      516720.5      6969978        229.53       0             229.53        0
135      516717.6      6970173        159.47       0.31          131.7         27.46
158      510024.1      6973974        69.23        53.58         15.32         0.33
159      510223.8      6973976        103.16       0             0             103.16
160      510438.2      6973988        108.96       31.92         8.19          68.85
168      510732.5      6974084        218.62       0             215.11        3.5
169      510935.5      6974078        228.76       0             218.32        10.43
171      513829.3      6972287        97.35        0             90.18         7.18
172      513817        6972480        162.25       109.52        11.79         40.94
174      514021.7      6973078        247.16       156.36        7.38          83.42
175      514227.1      6973084        316.99       135.16        181.1         0.73
176      514409.4      6973059        288.53       177.77        110.45        0.31
196      515921.4      6971387        133.66       0             1.26          132.4
197      515922.5      6971571        242.78       0             21.72         221.06
200      516121.7      6971978        86.56        0             42.66         43.91
209      513921.5      6971078        56.27        0             0             56.27
210      514123.4      6971079        103.61       59            39.52         5.09
212      514527.7      6971098        164.14       90.45         72.96         0.73
213      514714.9      6971086        101.19       8.71          3.94          88.54
222      513720.3      6969993        282.51       0             254.33        28.17
223      513719.9      6970169        220.55       34.85         185.7         0
226      514112.5      6970168        219.36       0             176           43.36
227      514321.7      6970179        164.12       0             124.97        39.15
• Modelling based on satellite image pixel reflectances and contextual features
• Individual tree crown (ITC) detection and crown width estimation
• Combining data to predict total amount and size variation by species
• Segmentation estimates
• Refined, more accurate species-wise estimates
Study site
Karttula / Kuopio, Central Finland
62.9007° N, 27.2392° E
GeoEye image, 26.6.2009, RGB and NIR, 10.5 km x 11.5 km, 3% clouds
Mixed forest, spruce dominated: 25% pine, 45% spruce, 30% deciduous (mainly birch)
Optical image data pre-processing
• Rectification to a geographic coordinate system (WGS84, UTM zone 35N)
• Geo-coding corrected using Digital Elevation Model (Airborne Laser Scanning DEM) : mean corrections 2.65 m, maximum 20 m
• Calibration to Top Of Atmosphere (TOA) reflectances using the band-specific calibration coefficients
• Atmospheric correction into surface reflectances by applying the SMAC4 radiative transfer code
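The TOA calibration step can be illustrated with the textbook conversion from digital numbers to radiance and then to reflectance. This is a minimal sketch, not the processing chain used in the study: the function name, the linear gain/offset radiance model, and all numeric values are assumptions chosen for illustration, and the SMAC4 atmospheric correction is not reproduced here.

```python
import math

def toa_reflectance(dn, gain, offset, esun, sun_elevation_deg, earth_sun_dist_au=1.0):
    """Convert a digital number to top-of-atmosphere reflectance.

    Assumed model (hypothetical coefficients, one set per band):
      radiance    L   = gain * dn + offset
      reflectance rho = pi * L * d^2 / (ESUN * cos(solar zenith angle))
    """
    radiance = gain * dn + offset
    # The solar zenith angle is the complement of the solar elevation angle.
    cos_zenith = math.cos(math.radians(90.0 - sun_elevation_deg))
    return math.pi * radiance * earth_sun_dist_au ** 2 / (esun * cos_zenith)

# Illustrative values only; real coefficients come from the image metadata.
rho_nir = toa_reflectance(dn=100, gain=0.1, offset=0.0, esun=1000.0, sun_elevation_deg=90.0)
```

Each band would be processed with its own gain, offset and solar irradiance (ESUN) value.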
Ground reference data
Training data – from 222 field plots
• 212 field plots within GeoEye image area (2009)
• 10 additional 0-stem volume plots extracted visually
• Tree species classification: training data from 20 pure species field plots
Testing data – from 178 field plots (mixed species)
• 178 field plots acquired in 2009, limited spatial distribution (several plots per forest stand)
In total: 1164 ground objects mapped (276 pines, 277 spruces, 347 deciduous, 264 non-trees)
Stem volume statistics [m3/ha]
Set       Statistic  ToV    PinV  SprV   DecV
Training  Mean       200.8  48.4  99.2   53.2
Training  Stdev      116.3  88.8  108.5  63.8
Test      Mean       203.3  78.7  79.0   45.6
Test      Stdev      107.3  91.5  96.2   49.4
GeoEye image : 10.5 km x 11.5 km
Input for feature selection – 35 + 4 features
SPECTRAL (5) – set A
• R, G, B, NIR, PAN: mean intensity within 1.5 m radius around tree candidates (TC)
CONTEXTUAL (9) – set B
From PAN, 7.5 m radius around TC
• mean
• mean / median
• skewness
• kurtosis
• contrast
• pm1 : mean of brightest pixels
• ps1 : std of brightest pixels
• pm2 : mean of darkest pixels
• ps2 : std of darkest pixels
SEGMENT-WISE (21) – set C
From PAN, 3 segment sizes: 50 m2, 85 m2, 125 m2
• mean
• mean / median
• skewness
• kurtosis
• std : standard deviation
• pmean : partial mean
• pstd : partial standard deviation
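As an illustration of how such window or segment statistics could be computed from a list of PAN pixel values, here is a minimal sketch. The slides do not say which fraction of pixels counts as "brightest" or "darkest", nor how contrast is defined, so `tail_fraction` is a hypothetical parameter and contrast is left out; extracting the pixels within the radius or segment is also not shown.

```python
import statistics

def window_stats(pixels, tail_fraction=0.25):
    """Summary statistics of PAN pixel values around one tree candidate.

    tail_fraction (an assumption, not from the slides) sets how many of the
    brightest and darkest pixels enter pm1/ps1 and pm2/ps2.
    """
    n = len(pixels)
    mean = statistics.fmean(pixels)
    median = statistics.median(pixels)
    std = statistics.pstdev(pixels)
    # Standardized third and fourth central moments.
    m3 = sum((p - mean) ** 3 for p in pixels) / n
    m4 = sum((p - mean) ** 4 for p in pixels) / n
    ordered = sorted(pixels)
    m = max(1, round(tail_fraction * n))
    darkest, brightest = ordered[:m], ordered[-m:]
    return {
        "mean": mean,
        "mean/median": mean / median if median != 0 else float("nan"),
        "skewness": m3 / std ** 3 if std > 0 else 0.0,
        "kurtosis": m4 / std ** 4 if std > 0 else 0.0,
        "pm1": statistics.fmean(brightest),
        "ps1": statistics.pstdev(brightest),
        "pm2": statistics.fmean(darkest),
        "ps2": statistics.pstdev(darkest),
    }

stats = window_stats([1, 2, 3, 4, 5, 6, 7, 8])
```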
Probe variables
• Random vectors or random permutations of a feature vector
• probe_gauss1, probe_gauss2
• probe_shuffle1, probe_shuffle2
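The two kinds of probe can be sketched as follows. The function names and the fixed seed are illustrative assumptions; only the idea (Gaussian random vectors and shuffled copies of a real feature) comes from the slide.

```python
import random

def gaussian_probe(n_samples, rng):
    """A probe that carries no information: i.i.d. standard normal values."""
    return [rng.gauss(0.0, 1.0) for _ in range(n_samples)]

def shuffle_probe(feature, rng):
    """A probe with the same value distribution as a real feature,
    but with the link to the class labels destroyed by permutation."""
    shuffled = list(feature)
    rng.shuffle(shuffled)
    return shuffled

rng = random.Random(0)
real_feature = [0.12, 0.40, 0.33, 0.95, 0.58, 0.71]  # made-up values
probe_gauss1 = gaussian_probe(len(real_feature), rng)
probe_shuffle1 = shuffle_probe(real_feature, rng)
```

A real feature that ranks below such probes is unlikely to be informative, which is why probes are a convenient sanity check on a ranking or selection procedure.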
Class definitions and training scheme
Class #  Class name          Group
1        pine                tree
2        spruce              tree
3        deciduous           tree
4        shadow              non-tree
5        open area / sunlit  non-tree
6        bare ground         non-tree
7        green vegetation    non-tree
WHOLE DATASET (1164 samples): 900 trees, 264 non-trees
• MODEL DESIGN (773): 2/3 of the whole dataset
– TRAINING (512): 2/3 of model design, used for model building
– VAL (261): 1/3 of model design, used for ranking
• TESTING (391): 1/3 of the whole dataset
Stratified sampling to preserve class proportions
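The nested 2/3 and 1/3 stratified split can be sketched as below. This is an assumption-laden illustration: the four non-tree classes are merged into one label because their individual counts are not given, and the per-class rounding is a choice made here, so the resulting set sizes are close to but not identical to those on the slide.

```python
import random
from collections import defaultdict

def stratified_split(labels, holdout_fraction, rng):
    """Split sample indices so each class keeps roughly the same proportion
    in both parts. Returns (kept indices, held-out indices)."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    kept, held_out = [], []
    for label in sorted(by_class):
        indices = by_class[label][:]
        rng.shuffle(indices)
        n_hold = round(len(indices) * holdout_fraction)
        held_out.extend(indices[:n_hold])
        kept.extend(indices[n_hold:])
    return kept, held_out

rng = random.Random(0)
# Class counts from the ground reference data; non-tree classes merged (assumption).
labels = ["pine"] * 276 + ["spruce"] * 277 + ["deciduous"] * 347 + ["non-tree"] * 264

# First level: model design (2/3) vs. testing (1/3).
design_idx, test_idx = stratified_split(labels, 1 / 3, rng)
# Second level: training (2/3) vs. validation (1/3) within model design.
design_labels = [labels[i] for i in design_idx]
train_pos, val_pos = stratified_split(design_labels, 1 / 3, rng)
train_idx = [design_idx[i] for i in train_pos]
val_idx = [design_idx[i] for i in val_pos]
```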
Feature selection preparation (Guyon et al., 2003)
• Feature normalization to the range [0, 1]
• Visual screening of scatter plots of the 35 real features: no obvious correlations, very few outlier samples
• Variable ranking – assessing features one by one with the simplest classifier (single threshold), one (+) vs. all (-). 4 scores:
– Fisher criterion F, scaled to [0, 1]
– R2 : from the Pearson correlation coefficient between a single feature and the +/- labels
– AUC : area under the ROC (Receiver Operating Characteristic) curve
– sum of the previous scores (FR2AUC)
• All scores computed for every class, then averaged to rank the variables for all 7 classes and for tree classes only (1, 2, 3).
• No single feature significantly and consistently outperformed the others
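A minimal sketch of single-feature ranking for one class against all others follows. Several details are assumptions made for illustration, not taken from the slides: the Fisher criterion is computed as squared mean difference over summed variances and scaled by its maximum over features, R2 is the squared Pearson correlation, and the AUC is made direction-agnostic. Averaging over classes, as described above, would repeat this for each positive class.

```python
import statistics

def min_max(values):
    lo, hi = min(values), max(values)
    return [0.0] * len(values) if hi == lo else [(v - lo) / (hi - lo) for v in values]

def fisher(pos, neg):
    spread = statistics.pvariance(pos) + statistics.pvariance(neg)
    return (statistics.fmean(pos) - statistics.fmean(neg)) ** 2 / (spread + 1e-12)

def pearson_r2(x, labels):
    mx, ml = statistics.fmean(x), statistics.fmean(labels)
    sxy = sum((a - mx) * (b - ml) for a, b in zip(x, labels))
    sxx = sum((a - mx) ** 2 for a in x)
    sll = sum((b - ml) ** 2 for b in labels)
    return 0.0 if sxx == 0 or sll == 0 else sxy ** 2 / (sxx * sll)

def auc(pos, neg):
    # Probability that a random positive scores above a random negative.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    a = wins / (len(pos) * len(neg))
    return max(a, 1.0 - a)  # the threshold may point either way (assumption)

def rank_features(features, classes, positive):
    """Rank features by F + R2 + AUC for one class (+) against the rest (-)."""
    labels = [1.0 if c == positive else -1.0 for c in classes]
    scores = {}
    for name, values in features.items():
        x = min_max(values)
        pos = [v for v, l in zip(x, labels) if l > 0]
        neg = [v for v, l in zip(x, labels) if l < 0]
        scores[name] = {"F": fisher(pos, neg), "R2": pearson_r2(x, labels), "AUC": auc(pos, neg)}
    f_max = max(s["F"] for s in scores.values()) or 1.0
    for s in scores.values():
        s["F"] /= f_max  # scale the Fisher criterion to [0, 1]
        s["FR2AUC"] = s["F"] + s["R2"] + s["AUC"]
    return sorted(scores.items(), key=lambda item: item[1]["FR2AUC"], reverse=True)

# Made-up values: one separating feature and one uninformative feature.
features = {
    "separating": [0.1, 0.2, 0.3, 0.8, 0.9, 1.0],
    "uninformative": [0.5, 0.1, 0.9, 0.4, 0.8, 0.2],
}
ranking = rank_features(features, classes=[1, 1, 1, 2, 2, 2], positive=2)
```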
Feature selection and image classification
• Classification accuracy on validation set VAL (261) as a score
• Sequential Forward Selection (SFS) with three classification methods:
– Linear Discriminant Analysis (LDA)
– Quadratic LDA
– k-nearest neighbor (kNN) classifier, k ∈ [2, 9]. Feature selection and choice of k at the same time.
• Find the best minimal feature subset by a brute-force approach
– 10 best features from the SFS
– retrain the best model using the whole modeling dataset (TRAIN + VAL) and test with the independent TEST set
– brute-force approach is tractable in this case with simple classifiers
– overcomes the sub-optimality of SFS
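The selection procedure can be sketched with a small kNN classifier, a greedy forward search scored on a validation set, and an exhaustive search over the SFS candidates. The data are synthetic and the function names are illustrative assumptions; the joint search over k and the final retraining on TRAIN + VAL with evaluation on TEST are not shown.

```python
import itertools
import random
from collections import Counter

def knn_accuracy(train_X, train_y, test_X, test_y, feats, k):
    """Fraction of test samples whose k-NN majority vote matches the label."""
    correct = 0
    for x, y in zip(test_X, test_y):
        # Squared Euclidean distance restricted to the chosen feature subset.
        neighbours = sorted(
            (sum((row[f] - x[f]) ** 2 for f in feats), label)
            for row, label in zip(train_X, train_y)
        )[:k]
        vote = Counter(label for _, label in neighbours).most_common(1)[0][0]
        correct += vote == y
    return correct / len(test_y)

def sequential_forward_selection(train_X, train_y, val_X, val_y, k, max_feats):
    """Greedily add the feature that most improves validation accuracy."""
    n_feats = len(train_X[0])
    selected, history = [], []
    while len(selected) < min(max_feats, n_feats):
        best_f, best_acc = None, -1.0
        for f in range(n_feats):
            if f in selected:
                continue
            acc = knn_accuracy(train_X, train_y, val_X, val_y, selected + [f], k)
            if acc > best_acc:
                best_f, best_acc = f, acc
        selected.append(best_f)
        history.append((tuple(selected), best_acc))
    return history

def best_subset_brute_force(train_X, train_y, val_X, val_y, candidates, k):
    """Try every non-empty subset; smaller subsets win ties on accuracy."""
    best, best_acc = None, -1.0
    for size in range(1, len(candidates) + 1):
        for subset in itertools.combinations(candidates, size):
            acc = knn_accuracy(train_X, train_y, val_X, val_y, subset, k)
            if acc > best_acc:
                best, best_acc = subset, acc
    return best, best_acc

def make_samples(n_per_class, rng):
    """Synthetic data: feature 0 separates three classes, features 1-2 are noise."""
    X, y = [], []
    for label in range(3):
        for _ in range(n_per_class):
            X.append([label + rng.gauss(0.0, 0.1), rng.random(), rng.random()])
            y.append(label)
    return X, y

rng = random.Random(0)
train_X, train_y = make_samples(20, rng)
val_X, val_y = make_samples(10, rng)
history = sequential_forward_selection(train_X, train_y, val_X, val_y, k=5, max_feats=3)
candidates = list(history[-1][0])
best_subset, best_acc = best_subset_brute_force(train_X, train_y, val_X, val_y, candidates, k=5)
```

With 10 candidate features the exhaustive search covers 1023 subsets, which stays cheap for a classifier as simple as kNN.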
6-10 features are enough
Spectral features performed best
Segment-wise features are not suited to a mixed-species study
Overall classification accuracy on tree classes over 80%
Probe variables were selected more often in the first places with LDA than with kNN: the linear classifier is too simple. Quadratic LDA was overfitting.
kNN with k = 5 gave the best overall performance and the lowest difference from training to validation error => lower risk of overfitting
Example of tree species classification map
pine : 76 %
spruce : 76 %
deciduous : 88 %
non-forest
• Pan-sharpened GeoEye image extract of 1 km x 1 km
• Individual tree crown classification with 5-NN classifier trained with pure species training data
• Non-forest mask generated with k-means clustering + cluster labeling
Predicted species-wise stem numbers vs. field plot data
• Predicted stem number per species plotted against test data (178 test plots)
• Systematic under-estimation of predicted stem number for the spruce and deciduous classes
• Noise partly due to the small collecting radius (r = 8 m) of test data, and to geolocation differences between satellite and ground data
[Figure: three scatter plots of predicted vs. true stem numbers per field plot (Nspruce, Npine, Ndecid, in stems/ha), each with a fitted line]
Spruce: line labels y = 0.98*x + 137.1 and y = 0.33*x + 239.8, R2 = 0.24
Pine: y = 0.85*x + 45.0, R2 = 0.34
Broadleaved: y = 0.56*x + 21.0, R2 = 0.54
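The line equations and R2 values attached to such scatter plots are typically obtained by ordinary least squares. The sketch below shows that computation on made-up numbers; it is not the study's data, and the slides do not state how the fits were produced.

```python
import statistics

def fit_line(x, y):
    """Ordinary least-squares fit y = slope * x + intercept, with R2."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((b - (slope * a + intercept)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    return slope, intercept, 1.0 - ss_res / ss_tot

# Hypothetical stem numbers per field plot (true vs. predicted), for illustration.
true_stems = [200, 500, 800, 1200, 1500]
predicted_stems = [260, 420, 700, 850, 1100]
slope, intercept, r2 = fit_line(true_stems, predicted_stems)
```

A slope below 1 combined with a small intercept indicates that the predictions fall increasingly short of the truth as the true stem number grows.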
Conclusions
• The methodology could detect individual treetops, identify their species and determine species proportions in mixed forest.
• Feature ranking and feature selection were performed on a set of 35 features for tree species classification.
• Several classifiers (model including a feature subset and a classification method) were built. The best turned out to be 5-NN with a subset of 6 features, mostly spectral. Segment-wise features could be discarded.
• The tree species proportion accuracy was good (1.4% to 3.5%), but the correlation of stem numbers per species was not as good as expected.
Future work
• Model selection with more elaborate classifiers (e.g. SVMs)
• Embedding feature selection into a cross-validation scheme
• Improve stem number estimation with adaptive filtering
• Tree crown width estimation validation with ground data