Street Addressing and Mapping Board (WVSAMB) datasets (available from the West Virginia GIS Technical Center; wvu.edu/), which were originally created using manual photo-interpretation of the same leaf-off orthophotography used to produce the DEM data used in this study. The WVSAMB data were supplemented with the 1:24 000 scale National Hydrography Dataset (NHD). The Cost Distance tool creates a surface in which each cell is assigned the accumulative cost to the closest waterbody. Slope was used as the measure of movement impedance.
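The accumulated-cost idea described above can be sketched in a few lines. This is a minimal illustration, not the ArcGIS implementation: it assumes 8-connected movement, a move cost equal to the mean impedance of the two cells multiplied by the travel distance, and a toy 3 × 3 slope grid with one hypothetical waterbody cell. The names `cost_distance`, `slope`, and `surface` are invented for this example.

```python
import heapq
import math


def cost_distance(cost, sources, cell_size=1.0):
    """Least accumulated cost from every cell to its cheapest-to-reach source."""
    rows, cols = len(cost), len(cost[0])
    acc = [[math.inf] * cols for _ in range(rows)]
    heap = []
    for r, c in sources:
        acc[r][c] = 0.0
        heapq.heappush(heap, (0.0, r, c))
    # 8-connected neighbours; a diagonal step is sqrt(2) cells long.
    steps = [(dr, dc, math.hypot(dr, dc))
             for dr in (-1, 0, 1) for dc in (-1, 0, 1) if (dr, dc) != (0, 0)]
    while heap:
        d, r, c = heapq.heappop(heap)
        if d > acc[r][c]:
            continue  # stale queue entry, a cheaper route was already found
        for dr, dc, length in steps:
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                # Assumed move cost: mean impedance of both cells x distance.
                nd = d + 0.5 * (cost[r][c] + cost[nr][nc]) * length * cell_size
                if nd < acc[nr][nc]:
                    acc[nr][nc] = nd
                    heapq.heappush(heap, (nd, nr, nc))
    return acc


# Toy impedance raster (slope in degrees) with a steep cell in the middle.
slope = [
    [2.0, 2.0, 2.0],
    [2.0, 10.0, 2.0],
    [2.0, 2.0, 2.0],
]
# Hypothetical waterbody occupying the upper-left cell.
surface = cost_distance(slope, sources=[(0, 0)])
for row in surface:
    print([round(v, 3) for v in row])
```

In this toy grid the cheapest route to the opposite corner bends around the steep centre cell, which is the behaviour that makes slope-weighted distance differ from straight-line distance to water.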
Topographic slope in degrees (Burrough and McDonnell, 1998), surface curvature, plan curvature (curvature perpendicular to slope), and profile curvature (curvature in the direction of the slope) (Moore et al., 1991; Zevenbergen and Thorne, 1987) were also calculated using the Spatial Analyst Extension of ArcMap 10.2 (Esri, 2012). Additional variables were calculated using the ArcGIS Geomorphometry & Gradient Metrics Toolbox (Evans et al., 2014), including CTMI (Moore et al., 1993; Gessler et al., 1995), slope position (Berry, 2002), roughness (Blaszczynski, 1997; Riley et al., 1999), and dissection (Evans, 1972). Slope position, roughness, and dissection rely on focal statistics calculated using a moving window, so the result depends on the window size used. For this study, we used window sizes of 11 × 11 pixels, 21 × 21 pixels, 41 × 41 pixels, and 51 × 51 pixels in an attempt to capture terrain variability at the hillslope scale. The window sizes were chosen based on the range of typical ridge to valley bottom distances in the state. We also calculated summary measures from the average of the outputs at all four window sizes. This resulted in a total of five variables (four variables for the individual window sizes and one average) for each of slope position, roughness, and dissection. A total of 21 predictor variables was therefore produced (Table 1). The terrain predictor variables were then associated with the training and validation pixels using the software tool Geospatial Modeling Environment (GME) (Beyer, 2012).
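The dependence of a focal metric on window size, and the averaging across windows, can be illustrated with a small sketch. It assumes dissection is computed as (z - zmin) / (zmax - zmin) within a square window and that windows are clipped at the raster edge; both are assumptions for illustration and may differ from the toolbox used in the study. The toy DEM and the 3- and 5-pixel windows are hypothetical stand-ins for the much larger windows named above.

```python
def dissection(dem, window):
    """Assumed definition: (z - zmin) / (zmax - zmin) in a square moving window."""
    half = window // 2
    rows, cols = len(dem), len(dem[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # Window is clipped at the raster edge (an assumption).
            cells = [dem[i][j]
                     for i in range(max(0, r - half), min(rows, r + half + 1))
                     for j in range(max(0, c - half), min(cols, c + half + 1))]
            lo, hi = min(cells), max(cells)
            out[r][c] = (dem[r][c] - lo) / (hi - lo) if hi > lo else 0.0
    return out


def multiscale_mean(dem, windows, metric):
    """Return one layer per window size plus the cell-wise mean of all layers."""
    layers = [metric(dem, w) for w in windows]
    rows, cols = len(dem), len(dem[0])
    mean = [[sum(layer[r][c] for layer in layers) / len(layers)
             for c in range(cols)] for r in range(rows)]
    return layers, mean


# Toy DEM: elevation rises with the square of the column index.
dem = [[float(c * c) for c in range(7)] for _ in range(7)]
layers, mean_layer = multiscale_mean(dem, windows=(3, 5), metric=dissection)
print(layers[0][3][3], layers[1][3][3], mean_layer[3][3])
```

With the four window sizes used in the study, the same pattern would give four layers plus one mean layer per metric, that is, the five variables each for slope position, roughness, and dissection.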
Probability Model Creation
The RF algorithm was implemented using the randomForest package (Liaw and Wiener, 2002) within the statistical software tool R (R Core Development Team, 2012). This algorithm requires two user-defined parameters: the number of trees produced (ntree) and the number of predictor variables randomly sampled as candidates at each node (mtry). For each model, ntree was set to 501 trees, as this was found to be large enough to produce stable results, and mtry was set to the default value, the square root of the number of predictor variables, which resulted in a value of five. Five separate models were produced using five separate training sets, which were then combined to form a single model.
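The arithmetic behind the two parameters, and one plausible reading of how separately trained forests combine, can be sketched as follows. The sketch assumes that combining means pooling all trees, so that the wetland probability for a pixel is the fraction of pooled trees voting for wetland; the vote fractions are made-up values, and `combine_forests` is a hypothetical helper rather than a function of any package.

```python
import math

n_predictors = 21
sqrt_p = math.sqrt(n_predictors)   # about 4.58
mtry_floor = math.floor(sqrt_p)    # 4 if the square root is truncated
mtry_round = round(sqrt_p)         # 5 if it is rounded, matching the text


def combine_forests(vote_fractions, ntrees):
    """Pool sub-forests: tree-count-weighted mean of per-forest vote fractions."""
    total = sum(ntrees)
    return sum(f * n for f, n in zip(vote_fractions, ntrees)) / total


# Hypothetical wetland vote fractions for one pixel from five 501-tree models.
fractions = [0.80, 0.70, 0.90, 0.75, 0.85]
p_wetland = combine_forests(fractions, [501] * 5)
print(mtry_floor, mtry_round, round(p_wetland, 3))
```

Because every model here has the same number of trees, pooling reduces to a simple average of the five per-model probabilities, based on 5 × 501 = 2505 trees in total.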
Model Validation
As we were primarily interested in the probabilistic prediction as opposed to the per-pixel classification in this study, we relied on receiver operating characteristic (ROC) curves and the area under the ROC curve (AUC) measure to evaluate and compare the RF models produced. An ROC curve plots the true positive rate (in this case the proportion of wetlands correctly classified as wetlands) against the false positive rate (the proportion of absence (not wetland) points incorrectly classified as wetlands) at various probability thresholds for a binary classifier. The AUC measure is the area under the curve in the ROC plot, and is equivalent to the probability that the classifier will rank a randomly chosen positive (true) record higher than a randomly chosen negative (false) record. It is equivalent to the Wilcoxon test of ranks. Generally, values over 0.9 indicate excellent prediction rates (Hanley and McNeil, 1982; Swets et al., 2000; Fawcett, 2007). ROC curves and AUC measures were produced using the pROC package (Robin et al., 2011) in R (R Core Development Team, 2012). As one goal of this study was to compare multiple models, we also made use of DeLong's test for two ROC curves to assess the difference in model performance. This test, which provides a p-value for statistical comparison, is available in the pROC package (Hanley and McNeil, 1982; DeLong et al., 1988; Venkatraman and Begg, 1996; Venkatraman, 2000; Pepe et al., 2009; Robin et al., 2011; Wickham, 2011).
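The equivalence between the area under the ROC curve and the ranking probability can be checked numerically. The sketch below is an illustration in plain Python, not the pROC implementation used in the study; the labels (1 = wetland, 0 = not wetland) and predicted probabilities are invented, and the function names are hypothetical.

```python
def roc_points(labels, scores):
    """(FPR, TPR) pairs from sweeping the threshold over every distinct score."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc_trapezoid(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def auc_rank(labels, scores):
    """Share of positive/negative pairs ranked correctly (ties count one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Invented validation records: 1 = wetland, 0 = not wetland.
labels = [1, 1, 0, 1, 0, 0, 1, 0]
scores = [0.95, 0.80, 0.70, 0.65, 0.40, 0.30, 0.35, 0.10]

area = auc_trapezoid(roc_points(labels, scores))
rank = auc_rank(labels, scores)
print(area, rank)
```

Both routes give the same value for these records, since 13 of the 16 wetland and non-wetland pairs are ranked correctly.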
One strength of the RF algorithm is its ability to generate measures of variable importance during the training process by excluding each variable sequentially and recording the resulting increase in OOB error (Breiman, 2001; Rodríguez-Galiano et al., 2012a; Rodríguez-Galiano et al., 2012b). This ancillary output of RF was used to assess the relative contribution of each terrain variable for predicting the probability of wetland occurrence. However, variable importance from standard RF tends to be biased towards correlated predictor variables, and the predictor variables in this study are correlated (Strobl et al., 2008; Strobl et al., 2009; Genuer et al., 2010). Therefore, we used the conditional variable importance measure, available in the R party package, which is more robust in the presence of highly correlated input variables (Strobl et al., 2008; Strobl et al., 2009).
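The general idea of scoring a predictor by how much prediction error grows when its information is removed can be sketched with a simple permutation scheme. This is an unconditional permutation importance on a hypothetical stand-in model, shown only to make the concept concrete; it is neither the OOB-based measure of the randomForest package nor the conditional measure of the party package, and all data and names are invented.

```python
import random


def accuracy(predict, X, y):
    """Fraction of records whose predicted label matches the observed label."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(predict, X, y, n_repeats=20, seed=0):
    """Mean drop in accuracy after shuffling one predictor column at a time."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            shuffled = [row[:j] + [v] + row[j + 1:]
                        for row, v in zip(X, column)]
            drops.append(baseline - accuracy(predict, shuffled, y))
        importances.append(sum(drops) / n_repeats)
    return importances


# Hypothetical stand-in for a fitted model: wetland (1) when slope < 5 degrees.
def predict(row):
    return 1 if row[0] < 5.0 else 0


# Invented records: [slope in degrees, an unrelated second predictor].
X = [[1.0, 0.3], [2.0, 0.9], [3.0, 0.1], [8.0, 0.5], [12.0, 0.7], [20.0, 0.2]]
y = [1, 1, 1, 0, 0, 0]
importances = permutation_importance(predict, X, y)
print(importances)
```

Shuffling the slope column degrades accuracy while shuffling the unused column does not. A conditional scheme would instead permute each predictor within groups defined by its correlated covariates, which is what reduces the bias discussed above.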
Results and Discussion
Importance of Physiography in Mapping Wetlands
Plate 1 shows a subset of the classification results within each selected ecological subregion in West Virginia for the PEM wetlands, masked to the extent of grass cover in the state, and PFO/
Table 2. AUC Values for PEM Models for Each Ecological Subregion and the Entire State. Bold Text Indicates the Highest AUC Value Obtained for Each Subregion Being Predicted; * Indicates Statistical Difference at the 95% Confidence Level (p = 0.05) between the Model and the Model Trained and Predicted in that Region

PEM Model Comparison. Columns: Random Forest Model. Rows: Subregion Where Model Applied.

| Subregion where model applied | Great Valley of Virginia | Pittsburgh Low Plateau | Ridge and Valley | Western Allegheny Mountains | Western Coal Fields | Statewide |
|---|---|---|---|---|---|---|
| Great Valley of Virginia | 0.974 | 0.931* | 0.945* | 0.958* | 0.953* | 0.963* |
| Pittsburgh Low Plateau | 0.924* | 0.946 | 0.918* | 0.931* | 0.934* | 0.939* |
| Ridge and Valley | 0.903* | 0.901* | 0.940 | 0.913* | 0.891* | 0.900* |
| Western Allegheny Mountains | 0.932* | 0.936* | 0.920* | 0.954 | 0.935* | 0.941* |
| Western Coal Fields | 0.916* | 0.938* | 0.877* | 0.911* | 0.962 | 0.943* |
| West Virginia Total | 0.893* | 0.936* | 0.879* | 0.931* | 0.937* | 0.947 |
Table 3. AUC Values for PFO/PSS Models for Each Ecological Subregion and the Entire State. Bold Text Indicates the Highest AUC Value Obtained for Each Subregion Being Predicted; * Indicates Statistical Difference at the 95% Confidence Level (p = 0.05) between the Model and the Model Trained and Predicted in that Region

PFO/PSS Model Comparison. Columns: Random Forest Model. Rows: Subregion Where Model Applied.

| Subregion where model applied | Great Valley of Virginia | Pittsburgh Low Plateau | Ridge and Valley | Western Allegheny Mountains | Western Coal Fields | Statewide |
|---|---|---|---|---|---|---|
| Great Valley of Virginia | 0.963 | 0.886* | 0.912* | 0.914* | 0.920* | 0.884* |
| Pittsburgh Low Plateau | 0.980* | 0.993 | 0.990* | 0.992 | 0.991* | 0.993 |
| Ridge and Valley | 0.986* | 0.993 | 0.994 | 0.994 | 0.992* | 0.992 |
| Western Allegheny Mountains | 0.975* | 0.988* | 0.982* | 0.991 | 0.986* | 0.990* |
| Western Coal Fields | 0.995* | 0.997 | 0.997* | 0.998 | 0.998 | 0.998 |
| West Virginia Total | 0.963* | 0.989* | 0.985* | 0.991 | 0.988* | 0.991 |
PHOTOGRAMMETRIC ENGINEERING & REMOTE SENSING
June 2016
441