Approximating Prediction Uncertainty for
Random Forest Regression Models
John W. Coulston, Christine E. Blinn, Valerie A. Thomas, and Randolph H. Wynne
Abstract
The use of machine learning approaches such as random forest
has increased for the spatial modeling and mapping of continuous
variables. Random forest is a non-parametric ensemble approach
and, unlike traditional regression approaches, provides no direct
quantification of prediction error. Understanding
prediction uncertainty is important when using model-based
continuous maps as inputs to other modeling applications
such as fire modeling. Here we use a Monte Carlo approach to
quantify prediction uncertainty for random forest regression
models. We test the approach by simulating maps of dependent
and independent variables with known characteristics and
comparing actual errors with prediction errors. Our approach
produced conservative prediction intervals across most of the
range of predicted values. However, because the Monte Carlo
approach was data driven, prediction intervals were either too
wide or too narrow in sparse parts of the prediction distribu-
tion. Overall, our approach provides reasonable estimates of
prediction uncertainty for random forest regression models.
Introduction
Remote sensing scientists have a rich set of standard methods
with which the uncertainty of (inherently categorical) thematic
maps derived from remotely-sensed data can be estimated (e.g.,
Congalton and Green, 2008). For the most part, the resulting
uncertainty estimates (a) are independent of the analytical method
used for the categorical data analysis, and (b) contain information
on category-specific accuracy but not pixel-specific accuracy.
Methods with which to estimate the uncertainty of mapped
continuous fields are, in contrast, much less standardized.
Category-specific accuracy, of course, is no longer relevant,
but the means by which uncertainty of continuous variables
is estimated is often tied to the technique used. Examples
abound, including the use of RMSE in classical regression-oriented
approaches (Fernandes et al., 2004) and cross-validation-derived
PRESS (sum of squares of the prediction residuals) RMSE (Popescu
et al., 2004). Cross-validation approaches are also widely used in
regression tree analyses of remotely sensed data (Bacini et al.,
2007). Cross-validation can estimate many prediction error
statistics, including the residual sum of squares. However,
cross-validation is increasingly used primarily for model selection,
and (usually non-parametric) bootstrapping is used once the model
is “fixed” (see, e.g., Molinaro, 2005).
These methods have been extended to random forest imple-
mentations, but the resulting estimates of prediction uncertain-
ty are aggregated (i.e., global) and do not produce pixel-specific
uncertainties required for use in subsequent spatial modeling.
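To make the distinction between aggregated and pixel-specific uncertainty concrete, the following is a minimal sketch of a leave-one-out PRESS and the RMSE derived from it. It assumes a simple one-predictor linear model as a stand-in for whatever model is being assessed; the function names and the toy data are hypothetical and are not taken from the studies cited above. Note that the result is a single global number, with nothing attached to any individual prediction.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def press_rmse(xs, ys):
    """Leave-one-out PRESS and the RMSE derived from it."""
    press = 0.0
    for i in range(len(xs)):
        # Refit the model without observation i, then predict observation i.
        a, b = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        press += (ys[i] - (a + b * xs[i])) ** 2
    return press, math.sqrt(press / len(xs))


# Hypothetical toy data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.1, 1.9, 3.2, 3.9, 5.1, 6.2]
press, rmse = press_rmse(xs, ys)
print(f"PRESS = {press:.4f}, PRESS RMSE = {rmse:.4f}")
```

The same loop could wrap any model-fitting routine; what does not change is that the output summarizes error over the whole dataset rather than per pixel.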
The use of machine learning techniques has increased sub-
stantially in remote sensing and geospatial data development.
For example, Homer et al. (2004) used regression trees for the
development of a categorical land cover map for the United
States, and Coulston et al. (2012) used random forests to
develop a continuous field map of percent tree canopy cover.
Other techniques that have been proposed and tested include
artificial neural networks, support vector machines, stochas-
tic gradient boosting, and K nearest neighbor (Moisen and
Frescino, 2002; Wieland and Pittore, 2014). Machine learning
approaches have become particularly attractive because they
are well suited to recognize patterns in high-dimension data
(Cracknell and Reading, 2014). Further, several of these ap-
proaches allow for modeling either categorical response vari-
ables or continuous response variables (e.g., random forests,
support vector machines/support vector regression). However,
unlike with traditional parametric approaches (e.g., multiple
regression), information about prediction error (the standard error
of a prediction for a new data point) is not readily available.
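For reference, the per-prediction error statistic that parametric regression makes available can be written explicitly. The expression below is the standard textbook result for simple linear regression (a single predictor), given only as an illustration of the general idea; the symbols $x_0$, $s$, and $t_{n-2,\,1-\alpha/2}$ are notation introduced for this sketch and do not appear in the original text.

```latex
% Standard error of a prediction at a new point x_0 (simple linear regression)
\[
  \widehat{\mathrm{SE}}_{\mathrm{pred}}(x_0)
    = s \sqrt{1 + \frac{1}{n}
        + \frac{(x_0 - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}},
  \qquad
  s^2 = \frac{1}{n-2} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 .
\]
% Corresponding (1 - \alpha) prediction interval
\[
  \hat{y}_0 \pm t_{n-2,\,1-\alpha/2}\; \widehat{\mathrm{SE}}_{\mathrm{pred}}(x_0)
\]
```

The standard error grows as $x_0$ moves away from $\bar{x}$, so each new prediction carries its own uncertainty. No comparable closed form comes with a random forest.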
Broad scale raster maps of continuous variables have been
developed for percent impervious surface (Homer et al., 2007),
percent tree canopy (Huang et al., 2001; Coulston et al., 2012),
forest biomass (Blackard et al., 2008), and forest carbon (Wilson
et al., 2013) among other examples. These efforts all relied
on machine learning approaches and used either Landsat or MODIS
imagery for predictor variables. Each pixel within these modeled
raster maps contains a predicted value, yet per-pixel uncertainty
is rarely expressed along with the predictions. Understanding the
pixel-level uncertainty is critical to understanding the utility
of the data. Furthermore, many geospatial datasets (such as those
mentioned above) are used in subsequent modeling applications. For
example, the 2001 NLCD tree canopy cover dataset (Huang et al.,
2001) was a major component of forest fire behavior and fuel models
(Rollins and Frame, 2006). Clearly, the uncertainty around this
fire behavior model is related to the uncertainty in the underlying
data, such as the 2001 NLCD percent tree canopy cover. Our intent
is to provide guidance on
quantifying prediction uncertainty at the pixel level.
While there are numerous machine learning techniques,
here we focus on random forest because it is straightforward
to train, computationally efficient, and provides stable pre-
dictions (Cracknell and Reading, 2014). Random forest is an
ensemble method that uses bootstrap aggregating (bagging)
to develop multiple models to improve prediction (Breiman,
2001). Along with bagging, random forest also relies on random
feature selection to develop a forest of independent CART models.
This technique has been used by Powell et al. (2010) and Baccini
et al. (2008) to predict forest biomass, Evans and Cushman (2009)
to predict species occurrence probability, Hernandez et al. (2008)
to predict faunal species distributions, and Moisen et al. (2012)
to predict percent tree canopy cover.
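The two ingredients named above, bagging and random feature selection, can be sketched in a few lines. This is a deliberately simplified illustration and not the implementation used in any of the cited studies. Each "tree" here is a single-split regression stump rather than a fully grown CART model, and all function names, parameters, and data are hypothetical. Because a stump has only one split, choosing the candidate features once per tree coincides with choosing them at the split.

```python
import random
import statistics


def fit_stump(rows, ys, features):
    """Best single-split regression stump over the candidate features."""
    best = None  # (sse, feature, threshold, left_mean, right_mean)
    for f in features:
        # Every unique value except the largest leaves both sides non-empty.
        for t in sorted({r[f] for r in rows})[:-1]:
            left = [y for r, y in zip(rows, ys) if r[f] <= t]
            right = [y for r, y in zip(rows, ys) if r[f] > t]
            lm, rm = statistics.fmean(left), statistics.fmean(right)
            sse = (sum((y - lm) ** 2 for y in left)
                   + sum((y - rm) ** 2 for y in right))
            if best is None or sse < best[0]:
                best = (sse, f, t, lm, rm)
    if best is None:  # no valid split: fall back to the sample mean
        m = statistics.fmean(ys)
        return (0, float("inf"), m, m)
    return best[1:]


def predict_stump(stump, row):
    f, t, lm, rm = stump
    return lm if row[f] <= t else rm


def fit_forest(rows, ys, n_trees=100, n_features=1, seed=0):
    rng = random.Random(seed)
    n, p = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]   # bootstrap sample (bagging)
        feats = rng.sample(range(p), n_features)     # random feature subset
        forest.append(fit_stump([rows[i] for i in idx],
                                [ys[i] for i in idx], feats))
    return forest


def predict_forest(forest, row):
    """Ensemble prediction: the mean over all individual models."""
    return statistics.fmean(predict_stump(s, row) for s in forest)


# Hypothetical toy data: y depends on the first predictor only.
rng = random.Random(1)
rows = [(rng.uniform(0, 10), rng.uniform(0, 10)) for _ in range(60)]
ys = [2.0 * x1 + rng.gauss(0, 0.5) for x1, _ in rows]
forest = fit_forest(rows, ys, n_trees=100, n_features=1)
print(predict_forest(forest, (2.0, 5.0)), predict_forest(forest, (8.0, 5.0)))
```

The ensemble returns only a point prediction for each new row, which is the gap in per-prediction error information that motivates this work.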
Though there have been numerous studies describing and
using random forests, there is a lack of information regarding
John W. Coulston is with the USDA Forest Service, Southern
Research Station, Blacksburg, VA.
Christine E. Blinn, Valerie A. Thomas, and Randolph H.
Wynne are with Virginia Polytechnic Institute and State University,
Department of Forest Resources and Environmental
Conservation, Blacksburg, VA.
Photogrammetric Engineering & Remote Sensing
Vol. 82, No. 3, March 2016, pp. 189–197.
0099-1112/16/189–197
© 2016 American Society for Photogrammetry
and Remote Sensing
doi: 10.14358/PERS.82.3.189