uncertainty. The objective of this study is to develop a technique to approximate prediction uncertainty for random forest models of continuous data. In our case we consider prediction uncertainty to be the uncertainty around a future prediction for a new observation (i.e., pixel-level uncertainty).
We further present a case example using predicted percent tree canopy cover in Georgia, USA. Portraying map uncertainty is an important consideration for geospatial data developers. In some cases, prediction uncertainty is a central component of developing a final geospatial dataset. For example, the 2001 and 2011 NLCD percent tree canopy cover layers sought to mask out areas where there is clearly no tree canopy cover but the canopy cover models predict low levels of canopy cover. In the 2001 NLCD percent tree canopy cover layer, the mask was derived from a "liberal" tree cover map that was then hand edited (Huang et al., 2001). However, the techniques presented here facilitate a more parsimonious approach.
Methods
Throughout the Methods section we use standard matrix and bootstrap notation. Bold lower case letters (e.g., $\mathbf{y}$) represent vectors. Bold upper case letters (e.g., $\mathbf{X}$) represent matrices. A * superscript followed by $b$ (e.g., $\mathbf{y}^{*b}$) refers to the $b$th bootstrap sample, and a * superscript followed by $-b$ (e.g., $\mathbf{y}^{*-b}$) denotes the portion of the original data that was not part of the $b$th bootstrap sample. Greek letters represent parameters (e.g., $\tau$), and vectors or matrices of parameters are bold as described above.
Random Forest Overview
We provide a brief overview of random forest but point the interested reader to Breiman (2001) for more details. Random forest is an ensemble approach that relies on CART models. The goal of CART is to understand (learn) the relationship between a dependent variable ($\mathbf{y}$) and a set of predictor variables ($\mathbf{X}$). The learning algorithm employs recursive partitioning, in which splits in the $\mathbf{X}$ variables are selected to create homogeneous groupings of $\mathbf{y}$. The recursive partitioning continues until either the subset of $\mathbf{y}$ at each node is the same value or further splitting adds no improvement. Random forest differs from the CART procedure by (a) employing bootstrap resampling (Efron and Tibshirani, 1993), and (b) random variable selection. Consider a regression tree, which is made up of splits and nodes. With random forest, a random subset of $\mathbf{X}$ variables (selected without replacement) is used to determine the split at each node. Bootstrap resampling is used to develop replicates of the CART model. For continuous variables, the ensemble estimate is the mean of the predicted values across trees ($\bar{\hat{y}}$) and the variance across trees is $\mathrm{var}(\hat{y})$.
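For illustration, the ensemble mean and across-tree variance can be computed directly from the individual trees. The following is a minimal sketch using scikit-learn's RandomForestRegressor on synthetic data; the paper does not tie its method to any particular implementation, and all data and names here are illustrative:

```python
# Minimal sketch: ensemble mean and across-tree variance from individual trees.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))                  # six predictor variables
y = X.sum(axis=1) + rng.normal(0, 1, size=500) # synthetic response

rf = RandomForestRegressor(n_estimators=500, random_state=0).fit(X, y)

X_new = rng.normal(size=(10, 6))
# Predictions from each individual CART model (tree) in the ensemble.
per_tree = np.stack([tree.predict(X_new) for tree in rf.estimators_])
y_bar_hat = per_tree.mean(axis=0)              # ensemble estimate: mean across trees
var_y_hat = per_tree.var(axis=0)               # variance across trees, var(y-hat)
```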
Methods Overview
Generally speaking, our method to approximate prediction uncertainty for random forest regression models has five main steps (Figure 1). Step 1 is to fit a random forest model based on all available data. Step 2 is to use bootstrap resampling to parameterize a large number of random forest models (Figure 1B). Bootstrap resampling generally results in ~37 percent of the observations not being selected. In Step 3, for each bootstrap replicate of the random forest (RF) model, the observed and predicted values are retained for observations not included in the bootstrap sample (Figure 1C). Step 3 yields an error assessment dataset. In Step 4, the properties of the prediction error are quantified using the error assessment dataset (Figure 1D). Step 5 is to make a prediction, including error, for a new observation (Figure 1E).
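To make the workflow concrete, the sketch below implements Steps 1 through 3 on synthetic data with scikit-learn; Steps 4 and 5 are developed in the subsections that follow, whose sketches build on the objects created here. The sample size, number of replicates, and names are illustrative assumptions, not the paper's values (the paper uses B = 2000):

```python
# Steps 1-3 of the workflow on synthetic data (illustrative sketch only).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 6))
y = X.sum(axis=1) + rng.normal(0, 1, size=300)

# Step 1: fit a random forest model to all available data.
rf_full = RandomForestRegressor(n_estimators=200, random_state=1).fit(X, y)

# Steps 2-3: bootstrap replicates of the RF model; retain observed and
# predicted values for out-of-bag observations (~37 percent per replicate).
B, n = 25, len(y)                                   # B kept small here for speed
rf_models, oob_records = [], []
for b in range(B):
    in_bag = rng.integers(0, n, size=n)             # draw n rows with replacement
    out_bag = np.setdiff1d(np.arange(n), in_bag)    # rows never drawn (~37%)
    rf_b = RandomForestRegressor(n_estimators=200, random_state=b)
    rf_b.fit(X[in_bag], y[in_bag])
    rf_models.append(rf_b)
    oob_records.append((out_bag, y[out_bag], rf_b.predict(X[out_bag])))
```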
Bootstrap Resampling
The bootstrap is one tool that can be used to approximate the prediction uncertainty of an RF model (Figure 1B). Consider the response and predictor variables $(\mathbf{y}, \mathbf{X})$, where a bootstrap sample of $(\mathbf{y}, \mathbf{X})$ is $(\mathbf{y}^{*b}, \mathbf{X}^{*b})$. Suppose we draw $B = 2000$ bootstrap samples to create $B = 2000$ bootstrap datasets. Using the bootstrap datasets we construct $RF^{*1}, RF^{*2}, \ldots, RF^{*2000}$ random forest models, and then for each replicate quantify the prediction error for each observation in $\mathbf{y}^{*-b}$ based on the corresponding $RF^{*b}$ replicate. The error assessment is constructed for each observation based on the distribution of predicted values when the observation was not part of the bootstrap sample (Figure 1C). The prediction error is $\sqrt{MSE}$, where $MSE$ is the mean square error for each observation. This technique allows one to quantify the prediction error for each element of a holdout dataset but does not directly apply to predictions based on a new $\mathbf{X}$. However, because random forest relies on bootstrap sampling to construct the ensemble, a random forest model contains information that we can use to quantify prediction uncertainty for new locations (i.e., new $\mathbf{X}$ data are available; Figure 1D and 1E).
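A minimal sketch of this per-observation error assessment, continuing from the oob_records list built above (the data structure and names are my own illustrative choices): for each observation, pool the squared errors from every replicate in which it was out of bag and take the square root of their mean.

```python
# Per-observation prediction error: sqrt(MSE) over the bootstrap replicates
# in which each observation was out-of-bag (Figure 1C).
import numpy as np

def per_observation_rmse(y, oob_records):
    """y: (n,) observed values. oob_records: list of
    (oob_indices, observed, predicted) tuples, one per bootstrap replicate."""
    sq_err = [[] for _ in range(len(y))]
    for idx, obs, pred in oob_records:
        for i, o, p in zip(idx, obs, pred):
            sq_err[i].append((o - p) ** 2)
    # sqrt(MSE) per observation; NaN if an observation was never out-of-bag
    return np.array([np.sqrt(np.mean(s)) if s else np.nan for s in sq_err])

prediction_error = per_observation_rmse(y, oob_records)
```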
Prediction Uncertainty
In traditional parametric models (e.g., multiple regression), the prediction error for a new observation is a function of the mean squared error ($MSE$) and the variability in $\mathbf{X}$. Recall that a random forest model is an ensemble of CART models and the ensemble estimate is the mean across the set of CART model predictions. Each of the CART models is considered a weak learner. The predictions from these weak learners inherently capture the variability in the relationship between $\mathbf{X}$ and $\mathbf{y}$. We can calculate the variance among predictions, $\mathrm{var}(\hat{y})$, for each observation in $\mathbf{X}$, which represents the variability of predictions among CART models. However, we need to scale between $\mathrm{var}(\hat{y})$ and $(y - \bar{\hat{y}})^2$ to approximate prediction uncertainty (Figure 1D). This is because only $\mathrm{var}(\hat{y})$ will be available when approximating the prediction uncertainty for a new observation. A measure such as

$$\tau = \sqrt{\frac{(y - \bar{\hat{y}})^2}{\mathrm{var}(\hat{y})}}$$

provides such a scaling.
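For one bootstrap replicate, this statistic can be computed for each out-of-bag observation from the replicate's individual trees, as in the following sketch (assuming the square-root form of the scaling above; names continue the earlier illustrative code):

```python
# tau for one bootstrap replicate: squared out-of-bag prediction error
# scaled by the across-tree variance, then square-rooted.
import numpy as np

def tau_for_replicate(rf_b, X_oob, y_oob):
    per_tree = np.stack([t.predict(X_oob) for t in rf_b.estimators_])
    y_bar_hat = per_tree.mean(axis=0)       # ensemble mean for each observation
    var_y_hat = per_tree.var(axis=0)        # across-tree variance, var(y-hat)
    # Observations with zero across-tree variance would yield inf here;
    # in practice they could be filtered or given a small variance floor.
    return np.sqrt((y_oob - y_bar_hat) ** 2 / var_y_hat)
```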
To implement this approach, for each bootstrap dataset $(\mathbf{y}^{*b}, \mathbf{X}^{*b})$ a random forest model is constructed, yielding $RF^{*1}, RF^{*2}, \ldots, RF^{*B}$. For each observation in $\mathbf{y}^{*-1}, \mathbf{y}^{*-2}, \ldots, \mathbf{y}^{*-B}$, a prediction ($\bar{\hat{y}}$) is made from the corresponding model $RF^{*1}, RF^{*2}, \ldots, RF^{*B}$. Subsequently, $\mathbf{T} = [\boldsymbol{\tau}^{*-1}, \boldsymbol{\tau}^{*-2}, \ldots, \boldsymbol{\tau}^{*-B}]$ is calculated. The $\hat{\tau}$ for a 95 percent confidence interval can be estimated using either a bootstrap approach or a Monte Carlo approach (Figure 1D). For the Monte Carlo approach, the value of $\hat{\tau}$ such that 95 percent of the predictions lie within $\hat{\tau} \cdot \mathrm{sd}(\hat{y})$ of the true value is estimated by taking the 95th percentile across all elements in $\mathbf{T}$. For the bootstrap approach, the value of $\hat{\tau}$ such that 95 percent of the predictions lie within $\hat{\tau} \cdot \mathrm{sd}(\hat{y})$ of the true value is found by taking the 95th percentile for each $\boldsymbol{\tau}^{*-b}$ in $\mathbf{T}$, and the bootstrap estimate of $\hat{\tau}$ is then the average across the $B$ replicates. For this analysis we used the Monte Carlo approach.
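A sketch of the Monte Carlo estimate and the resulting pixel-level interval (Step 5), continuing the earlier illustrative code:

```python
# Step 4: Monte Carlo estimate of tau-hat, pooling all elements of T.
# Step 5: prediction with error for new observations, using the full-data
# model and the across-tree standard deviation.
import numpy as np

T = [tau_for_replicate(rf_b, X[oob], y_oob)         # one tau vector per replicate
     for (oob, y_oob, _), rf_b in zip(oob_records, rf_models)]
tau_hat = np.percentile(np.concatenate(T), 95)
# Bootstrap alternative: average the per-replicate 95th percentiles instead,
# e.g., np.mean([np.percentile(t, 95) for t in T]).

def predict_with_interval(rf, X_new, tau_hat):
    per_tree = np.stack([t.predict(X_new) for t in rf.estimators_])
    y_bar_hat = per_tree.mean(axis=0)
    half = tau_hat * per_tree.std(axis=0)           # tau-hat * sd(y-hat)
    return y_bar_hat, y_bar_hat - half, y_bar_hat + half

X_new = rng.normal(size=(5, 6))
pred, lower, upper = predict_with_interval(rf_full, X_new, tau_hat)
```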
Simulation to Generate Known Populations
We used a simulation approach to examine our proposed method to approximate prediction uncertainty in random forest models, so that we could evaluate our technique for known populations. We constructed six populations representing X variables. Each was generated using Gaussian random fields (Schlather et al., 2014) with different levels of spatial correlation (Figure 2). Gaussian random fields were used to simulate the X variables because they offer a framework for developing normally distributed data within the spatial domain. Each map was 1,000 by 1,000 pixels. From the X maps we constructed three different Y populations (Figure 2):
$$Y_1 = X_1 + X_2 + X_3 + X_4 + X_5 + X_6 + N(0, 2)$$
$$Y_2 = X_1 + X_2 + X_3 + X_4 + X_5 + X_6 + N(0, 1)$$
$$Y_3 = X_1 \cdot X_2 + X_3 + (X_4 + X_5)^2 + X_6 + N(0, 1)$$
where N(mean, standard deviation) is additional random noise drawn from a normal distribution. The Normal High
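For illustration, the sketch below approximates this simulation in Python. Spatially correlated X surfaces are generated by smoothing white noise, which is a stand-in assumption rather than the Gaussian random field framework of Schlather et al. (2014) used in the study, and the map size and smoothing levels are illustrative, not the study's values:

```python
# Rough stand-in for the simulation: six spatially correlated X surfaces
# (smoothed, standardized white noise) and the three Y populations above.
# Maps are kept small here for speed (the study used 1,000 x 1,000 pixels).
import numpy as np
from scipy.ndimage import gaussian_filter

rng = np.random.default_rng(2)
size = (250, 250)
sigmas = [1, 2, 4, 8, 16, 32]        # different levels of spatial correlation
Xs = []
for s in sigmas:
    field = gaussian_filter(rng.normal(size=size), sigma=s)
    Xs.append((field - field.mean()) / field.std())   # standardize each surface
X1, X2, X3, X4, X5, X6 = Xs

Y1 = X1 + X2 + X3 + X4 + X5 + X6 + rng.normal(0, 2, size)
Y2 = X1 + X2 + X3 + X4 + X5 + X6 + rng.normal(0, 1, size)
Y3 = X1 * X2 + X3 + (X4 + X5) ** 2 + X6 + rng.normal(0, 1, size)
```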