PE&RS March 2016 full version - page 191

population ($Y_1$) was developed to replicate a population
where the value was a linear combination of the X values
and the assumptions required for multiple linear regression
were valid. Likewise, the Normal Low ($Y_2$) population was
equivalent to the Normal High population except that less
random noise was added. Both the $Y_1$ and $Y_2$ populations
were normally distributed and error under linear regression
was normal and homoscedastic. The Model Misspecification
($Y_3$) population was developed based on a non-linear
combination of the X values that would violate the assumptions of
multiple linear regression. The $Y_3$ population was non-normal
and skewed left, and the error distribution under multiple
linear regression was non-normal and heteroscedastic.
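As a rough illustration of the three population types described above, the following sketch simulates a linear response with a large noise term, the same response with less noise, and a non-linear response with heteroscedastic noise. The coefficients, noise levels, non-linear form, and function names are illustrative assumptions, not the values used to build the populations in this study.

```python
import math
import random


def simulate_populations(n, seed=0):
    """Simulate (X, y1, y2, y3) for n locations.

    All coefficients and noise levels are hypothetical choices
    made for illustration only.
    """
    rng = random.Random(seed)
    beta = [1.0, -0.5, 0.8, 0.3, -0.7, 0.4]  # hypothetical coefficients, six X variables
    X, y1, y2, y3 = [], [], [], []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in beta]
        signal = sum(b * xi for b, xi in zip(beta, x))
        X.append(x)
        # Normal High analogue: linear signal plus a large Gaussian noise term.
        y1.append(signal + rng.gauss(0.0, 2.0))
        # Normal Low analogue: same linear signal, less noise.
        y2.append(signal + rng.gauss(0.0, 0.5))
        # Model Misspecification analogue: non-linear in X, with noise whose
        # spread grows with the signal (heteroscedastic). The negative
        # exponential produces a long left tail.
        scale = math.exp(0.5 * signal)
        y3.append(-scale + rng.gauss(0.0, 0.2 * scale))
    return X, y1, y2, y3


def skewness(values):
    """Sample skewness (third standardized moment)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


X, y1, y2, y3 = simulate_populations(5000)
print("skewness y1:", round(skewness(y1), 2))
print("skewness y3:", round(skewness(y3), 2))
```

In this sketch the first two responses differ only in the noise standard deviation, while the third breaks both the linearity and the constant-variance assumptions of multiple linear regression.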
Quantifying Prediction Uncertainty from the Known (Simulated) Populations
To approximate common sampling rates, we drew 500 sample
locations (0.05 percent sample) and extracted values for each
dependent variable ($y_1$, $y_2$, $y_3$) and independent variable
($X = [x_1, x_2, \ldots, x_6]$). First, for each dependent variable
($y_1$, $y_2$, $y_3$),
we constructed a random forest model of the form
$y = f(X) + \varepsilon$ (Figure 1A). Call these models RFy1, RFy2,
and RFy3. For
each random forest model we used 500 regression trees; two
independent variables were randomly selected for determining
the splits at each node, and the models were bias corrected
(Liaw and Wiener, 2002). Next, we employed bootstrap resampling
(Figure 1B), using $B$ = 2,000 bootstrap samples to create 2,000
random forest models, with the same model specification as
described above, for $y_1$, $y_2$, and $y_3$. We
then predicted values ($\hat{y}$), calculated the variance across
tree-level predictions ($\mathrm{var}(\hat{y})$) (Figure 1C), calculated
the squared prediction error ($(y - \bar{\hat{y}})^2$), and estimated
$\hat{\tau}$ for each of RFy1, RFy2, and RFy3 based on observations
that were not part of each bootstrap sample (Figure 1D). To estimate
the $\hat{\tau}$ value for each model (RFy1, RFy2, and RFy3), we
selected the 95th percentile of the $T$ values for each model
(Figure 1D).
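The percentile step above can be sketched as follows. $T$ is not defined in this section, so the sketch assumes that, for each out-of-bag observation, $T$ is the absolute prediction error divided by the standard deviation of the tree-level predictions, and that the forest prediction is the mean of the tree-level predictions. The function names and the toy tree-level predictions are hypothetical. Fitting the random forests and pooling out-of-bag observations across bootstrap samples are left out.

```python
import math
import random
import statistics


def t_value(y_obs, tree_preds):
    """Assumed form of T: |y - mean(tree_preds)| / sd(tree_preds)."""
    y_hat = statistics.fmean(tree_preds)
    sd = statistics.stdev(tree_preds)
    return abs(y_obs - y_hat) / sd


def percentile(values, q):
    """Percentile by linear interpolation between order statistics, q in [0, 1]."""
    s = sorted(values)
    pos = q * (len(s) - 1)
    lo = math.floor(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (pos - lo) * (s[hi] - s[lo])


def estimate_tau(y_oob, tree_preds_oob, level=0.95):
    """Take the `level` percentile of the T values over out-of-bag observations."""
    t_values = [t_value(y, p) for y, p in zip(y_oob, tree_preds_oob)]
    return percentile(t_values, level)


# Toy stand-in for out-of-bag observations and their tree-level predictions.
# In practice these would come from the bootstrapped random forest models.
rng = random.Random(1)
y_oob, tree_preds_oob = [], []
for _ in range(2000):
    truth = rng.gauss(10.0, 3.0)
    centre = truth + rng.gauss(0.0, 1.0)  # error of the forest-level prediction
    tree_preds_oob.append([centre + rng.gauss(0.0, 0.8) for _ in range(50)])
    y_oob.append(truth)

tau_hat = estimate_tau(y_oob, tree_preds_oob)

# Share of observations whose scaled error does not exceed tau_hat.
covered = sum(
    abs(y - statistics.fmean(p)) <= tau_hat * statistics.stdev(p)
    for y, p in zip(y_oob, tree_preds_oob)
) / len(y_oob)
print("tau_hat:", round(tau_hat, 3), "share within tau_hat:", covered)
```

Under the assumed form of $T$, choosing the 95th percentile means that about 95 percent of the out-of-bag observations have a scaled error no larger than the estimated value, which is the property the calibration relies on.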
The random forest models RFy1, RFy2, and RFy3 were used
to predict each variable spatially (i.e., applied to all pixels in the
map). Additionally, for each of the predicted maps, the variance
across tree-level predictions ($\hat{y}$) was calculated, and the
width of each 95 percent prediction interval was then
$\pm\,\hat{\tau} \cdot \mathrm{sd}(\hat{y})$, where $\mathrm{sd}(\hat{y})$ is the
square root of that variance (Figure 1E). We then compared the
predicted values and the 95 percent prediction intervals with the
true population values for $Y_1$ (the Normal High population),
$Y_2$ (the Normal Low population), and $Y_3$ (the Model
Misspecification population) by examining the
Figure 1. Schematic of proposed methods for random forest prediction uncertainty.