
uncertainty. The objective of this study is to develop a technique to approximate prediction uncertainty for random forest models of continuous data. In our case we consider prediction uncertainty to be the uncertainty around a future prediction for a new observation (i.e., pixel-level uncertainty).
We further present a case example using predicted percent tree canopy cover in Georgia, USA. Portraying map uncertainty is an important consideration for geospatial data developers. In some cases, prediction uncertainty is a central component in developing a final geospatial dataset. For example, the 2001 and 2011 NLCD percent tree canopy cover layers strove to mask out areas where there is clearly no tree canopy cover but canopy cover models predict low levels of tree canopy cover. In the 2001 NLCD percent tree canopy cover layer, the mask was produced by creating a "liberal" tree cover map and hand editing (Huang et al., 2001). However, the techniques presented here facilitate a more parsimonious approach.
Methods
Throughout the Methods section we use standard matrix and bootstrap notation. Bold lower case letters (e.g., $\mathbf{y}$) represent vectors, and bold upper case letters (e.g., $\mathbf{X}$) represent matrices. A $*$ superscript followed by $b$ (e.g., $\mathbf{y}^{*b}$) refers to the $b$th bootstrap sample, and a $*$ superscript followed by $-b$ (e.g., $\mathbf{y}^{*-b}$) denotes the portion of the original data that was not part of the $b$th bootstrap sample. Greek letters represent parameters (e.g., $\tau$), and vectors or matrices of parameters are bold as described above.
Random Forest Overview
We provide a brief overview of random forest but point the interested reader to Breiman (2001) for more details. Random forest is an ensemble approach that relies on classification and regression tree (CART) models. The goal of CART is to understand (learn) the relationship between a dependent variable ($\mathbf{y}$) and a set of predictor variables ($\mathbf{X}$). The learning algorithm employs recursive partitioning, in which splits in the $\mathbf{X}$ variables are selected to create homogeneous groupings of $\mathbf{y}$. The recursive partitioning continues until either the subset of $\mathbf{y}$ at each node is the same value or further splitting adds no improvement. Random forest differs from the CART procedure by (a) employing bootstrap resampling (Efron and Tibshirani, 1993), and (b) random variable selection. Consider a regression tree, which is made up of splits and nodes. With random forest, a random subset of $\mathbf{X}$ variables (selected without replacement) is used to determine the split at each node. Bootstrap resampling is used to develop replicates of the CART model. For continuous variables the ensemble estimate is the mean of the predicted values across trees ($\bar{\hat{y}}$) and the variance across trees is $\mathrm{var}(\hat{y})$.
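To make this concrete, the following minimal sketch (our illustration, not code from the study; it assumes scikit-learn's RandomForestRegressor and toy data) computes the per-tree predictions, their mean $\bar{\hat{y}}$, and the variance across trees $\mathrm{var}(\hat{y})$:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy data standing in for (y, X); sizes and coefficients are
# illustrative assumptions, not values from the study.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
y = X.sum(axis=1) + rng.normal(0.0, 1.0, size=500)

rf = RandomForestRegressor(n_estimators=500, random_state=0).fit(X, y)

# Predictions of every CART model (tree) in the ensemble:
# shape (n_trees, n_observations).
tree_preds = np.stack([tree.predict(X) for tree in rf.estimators_])

y_hat_bar = tree_preds.mean(axis=0)  # ensemble estimate: mean across trees
var_y_hat = tree_preds.var(axis=0)   # variance across trees
```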
Methods Overview
Generally speaking, our method to approximate prediction uncertainty for random forest regression models has five main steps (Figure 1). Step 1 is to fit a random forest model based on all available data. Step 2 is to use bootstrap resampling to parameterize a large number of random forest models (Figure 1B); bootstrap resampling generally leaves ~37 percent of the observations out of each sample, because the probability that an observation is never drawn is $(1 - 1/n)^n \approx e^{-1} \approx 0.37$. Step 3 is, for each bootstrap replicate of the random forest (RF) model, to retain the observed and predicted values for the observations not included in the bootstrap sample (Figure 1C); this yields an error assessment dataset. In Step 4 the properties of the prediction error are quantified using the error assessment dataset (Figure 1D). Step 5 is to make a prediction, including error, for a new observation (Figure 1E).
Bootstrap Resampling
The bootstrap is one tool that can be used to approximate the prediction uncertainty of an RF model (Figure 1B). Consider the response and predictor variables $(\mathbf{y}, \mathbf{X})$, where a bootstrap sample of $(\mathbf{y}, \mathbf{X})$ is $(\mathbf{y}^{*b}, \mathbf{X}^{*b})$. Suppose we draw $B = 2000$ bootstrap samples to create $B = 2000$ bootstrap datasets. Using the bootstrap datasets we construct $RF^{*1}, RF^{*2}, \ldots, RF^{*2000}$ random forest models, and then for each replicate quantify the prediction error for each observation in $\mathbf{y}^{*-b}$ based on the corresponding $RF^{*b}$ replicate. The error assessment is constructed for each observation from the distribution of predicted values when the observation was not part of the bootstrap sample (Figure 1C). The prediction error is $\sqrt{MSE}$, where $MSE$ is the mean square error for each observation. This technique allows one to quantify prediction error for each element of a holdout dataset but does not directly apply to predictions based on a new $\mathbf{X}$. However, because random forest relies on bootstrap sampling to construct the ensemble, a random forest model contains information that we can use to quantify prediction uncertainty for new locations (i.e., where new $\mathbf{X}$ data are available; Figure 1D and 1E).
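A sketch of this bootstrap error assessment (Steps 2 and 3) might look as follows; the function name, library choice, and parameter values are our assumptions rather than the authors' implementation:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def bootstrap_error_assessment(X, y, B=2000, seed=0):
    """Steps 2-3: fit RF*b on each bootstrap sample (y*b, X*b) and
    retain predictions for the out-of-bag portion y*-b."""
    rng = np.random.default_rng(seed)
    n = len(y)
    oob_preds = [[] for _ in range(n)]  # per-observation OOB predictions
    for b in range(B):
        in_bag = rng.integers(0, n, size=n)           # draw with replacement
        out_bag = np.setdiff1d(np.arange(n), in_bag)  # ~37 percent left out
        rf_b = RandomForestRegressor(n_estimators=100, random_state=b)
        rf_b.fit(X[in_bag], y[in_bag])
        for i, p in zip(out_bag, rf_b.predict(X[out_bag])):
            oob_preds[i].append(p)
    # Distribution of predictions per observation (Figure 1C).
    return oob_preds
```

Note that because each replicate refits a full random forest, $B = 2000$ implies 2000 forest fits; a smaller $B$ is often practical while prototyping.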
Prediction Uncertainty
In traditional parametric models (e.g., multiple regression), the prediction error for a new observation is a function of the mean squared error ($MSE$) and the variability in $\mathbf{X}$. Recall that a random forest model is an ensemble of CART models and that the ensemble estimate is the mean across the set of CART model predictions. Each of the CART models is considered a weak learner, and the predictions from these weak learners inherently capture the variability in the relationship between $\mathbf{X}$ and $\mathbf{y}$. We can calculate the variance among predictions, $\mathrm{var}(\hat{y})$, for each observation in $\mathbf{X}$, which represents the variability of predictions among CART models. However, we need to scale between $\mathrm{var}(\hat{y})$ and $(y - \bar{\hat{y}})^2$ to approximate prediction uncertainty (Figure 1D), because for a new observation only $\mathrm{var}(\hat{y})$ will be available. A measure such as

$$\tau = \sqrt{\frac{(y - \bar{\hat{y}})^{2}}{\mathrm{var}(\hat{y})}}$$

provides such a scaling. To implement this approach, a random forest model is constructed for each bootstrap dataset $(\mathbf{y}^{*b}, \mathbf{X}^{*b})$, giving $RF^{*1}, RF^{*2}, \ldots, RF^{*B}$. For each observation in $\mathbf{y}^{*-1}, \mathbf{y}^{*-2}, \ldots, \mathbf{y}^{*-B}$, a prediction ($\bar{\hat{y}}$) is made from the corresponding model $RF^{*1}, RF^{*2}, \ldots, RF^{*B}$. Subsequently $\mathbf{T} = [\boldsymbol{\tau}^{*-1}, \boldsymbol{\tau}^{*-2}, \ldots, \boldsymbol{\tau}^{*-B}]$ is calculated. The $\hat{\tau}$ for a 95 percent confidence interval can be estimated using either a bootstrap approach or a Monte Carlo approach (Figure 1D). For the Monte Carlo approach, the value of $\hat{\tau}$ such that 95 percent of the predictions lie within $\hat{\tau} \cdot sd(\hat{y})$ of the true value is estimated by taking the 95th percentile across all elements of $\mathbf{T}$. For the bootstrap approach, the value of $\hat{\tau}$ is found by taking the 95th percentile of each $\boldsymbol{\tau}^{*-b}$ in $\mathbf{T}$; the bootstrap estimate of $\hat{\tau}$ is then the average across the $B$ replicates. For this analysis we used the Monte Carlo approach.
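A minimal sketch of the Monte Carlo estimate of $\hat{\tau}$ (Steps 4 and 5); scikit-learn, the function name, and parameter values are again our assumptions. It computes $\tau$ as $|y - \bar{\hat{y}}| / sd(\hat{y})$ for each out-of-bag observation, matching the scaling above:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def tau_hat_monte_carlo(X, y, B=200, seed=0):
    """Collect tau = |y - mean(tree preds)| / sd(tree preds) for every
    out-of-bag observation across B replicates, then take the 95th
    percentile across all elements of T (the Monte Carlo approach)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    T = []
    for b in range(B):
        in_bag = rng.integers(0, n, size=n)
        out_bag = np.setdiff1d(np.arange(n), in_bag)
        rf_b = RandomForestRegressor(n_estimators=100, random_state=b)
        rf_b.fit(X[in_bag], y[in_bag])
        # Per-tree predictions at the out-of-bag locations.
        preds = np.stack([t.predict(X[out_bag]) for t in rf_b.estimators_])
        y_hat_bar = preds.mean(axis=0)
        sd_y_hat = preds.std(axis=0)
        ok = sd_y_hat > 0  # guard against zero spread across trees
        T.extend(np.abs(y[out_bag][ok] - y_hat_bar[ok]) / sd_y_hat[ok])
    return np.percentile(T, 95)

# Step 5: for a new observation x_new, the approximate 95 percent
# prediction interval is y_hat_bar(x_new) +/- tau_hat * sd_y_hat(x_new),
# where both quantities come from the full-data random forest's trees.
```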
Simulation to Generate Known Populations
We used a simulation approach to examine our proposed method for approximating prediction uncertainty in random forest models, so that we could evaluate our technique against known populations. We constructed six populations representing $\mathbf{X}$ variables. Each was constructed using Gaussian random fields (Schlather et al., 2014) with different levels of spatial correlation (Figure 2). Gaussian random fields were used to simulate the $\mathbf{X}$ variables because they offer a framework to develop normally distributed data within the spatial domain. Each map was 1,000 by 1,000 pixels. From the $\mathbf{X}$ maps we constructed three different $Y$ populations (Figure 2):
$$Y_1 = X_1 + X_2 + X_3 + X_4 + X_5 + X_6 + N(0, 2)$$
$$Y_2 = X_1 + X_2 + X_3 + X_4 + X_5 + X_6 + N(0, 1)$$
$$Y_3 = X_1 \cdot X_2 + X_3 + (X_4 + X_5)^2 + X_6 + N(0, 1)$$

where $N(\text{mean}, \text{standard deviation})$ is additional random noise drawn from a normal distribution. The Normal High
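Populations of this form can be reproduced roughly as follows. The authors used Gaussian random fields (Schlather et al., 2014, an R package); Gaussian-filtered white noise is only a crude stand-in here, and the filter widths controlling spatial correlation are our assumptions:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

rng = np.random.default_rng(0)
SIZE = 1000  # each map is 1,000 x 1,000 pixels

def random_field(sigma):
    """Spatially correlated field: white noise smoothed by a Gaussian
    kernel, rescaled to unit variance (a stand-in for a Gaussian random
    field; larger sigma gives stronger spatial correlation)."""
    field = gaussian_filter(rng.normal(size=(SIZE, SIZE)), sigma)
    return (field - field.mean()) / field.std()

# Six X populations with different (assumed) levels of spatial correlation;
# zero-based indexing, so X[0] plays the role of X1.
X = [random_field(s) for s in (2, 4, 8, 16, 32, 64)]

Y1 = sum(X) + rng.normal(0, 2, (SIZE, SIZE))
Y2 = sum(X) + rng.normal(0, 1, (SIZE, SIZE))
Y3 = X[0] * X[1] + X[2] + (X[3] + X[4])**2 + X[5] + rng.normal(0, 1, (SIZE, SIZE))
```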