positive results that over-estimate the extent of the problem may lead to a misallocation of limited funds and resources.
The multiple comparison problem is a long-standing topic within the field of statistics (Dunnett, 1955; Dunn, 1961) and remains an area of active research (McDonald, 2009). As an example of the widespread acceptance of this issue in other fields, the foundational paper by Benjamini and Hochberg (1995) on controlling the false discovery rate (FDR), a popular method to account for the multiple comparison problem, has been cited over 17,000 times (Web of Knowledge, URL: http://webofknowledge.com). The Benjamini and Hochberg (1995) paper is often cited in fields such as biochemistry / molecular biology (2,947 papers), genetics / heredity (2,518 papers), and biotechnology / applied microbiology (1,653 papers), as well as fields closely related to remote sensing such as environmental science / ecology (994 papers) and plant sciences (801 papers).
By comparison, remote sensing papers rarely explicitly consider the multiple comparison problem. For example, in a systematic search of a wide range of disciplines in the natural and physical sciences related to the remote sensing of the Earth’s surface using the Web of Knowledge, only seven remote sensing papers were found that cited Benjamini and Hochberg (1995). Undoubtedly, other remote sensing investigations may have considered the MCP in their statistical analyses without reference to Benjamini and Hochberg, but the difference of at least two orders of magnitude between the number of citations of that paper in remote sensing and in such wide-ranging other disciplines indicates that the multiple comparison problem is largely underappreciated within the field of remote sensing. To illustrate the wide range of remote sensing applications in which the multiple comparison problem applies, papers that explicitly apply FDR include comparisons of multiple land classifiers (Brenning, 2009; Xu et al., 2014), detection of pixel-based temporal trends across many pixels (Brown et al., 2012; Wessels et al., 2012), and comparisons of multiple hyperspectral indices to canopy structure (Pena et al., 2012). Perhaps contributing to the obscurity of the multiple comparison problem is that its discussion is typically a methodological footnote rather than the focus of any given paper. This paper hopes to help illuminate the MCP issue within the remote sensing community.
Simple Solutions
Since the tradeoffs between false positive and false negative results depend on the context of the research, no single solution for the multiple comparison problem exists. Papers in other fields have discussed this issue in more detail (e.g., Curran-Everett, 2000; Gelman, 2012; McDonald, 2009). Four common solutions found in other fields that can be easily implemented in remote sensing analyses are presented below: Bonferroni Correction, False Discovery Rate, Independent Validation, and Improved Interpretation. More complicated solutions, including multi-level or Bayesian models, may also be implemented but are not discussed here. It should be noted that these solutions target multiple comparisons when testing for significant relationships and not differences between population distributions, which can be addressed using Tukey’s multiple comparison test.
The Bonferroni Correction is one of the oldest and simplest solutions. It adjusts the α-value of significance by dividing the α-value by the number of hypotheses tested (Bonferroni, 1936). For example, if a test considers ten alternate hypotheses at an α-value of 0.05, the corrected significance threshold is 0.005 (α/10). The Bonferroni Correction is often considered the most traditional and conservative correction because it controls the likelihood of a type-1 error occurring in any of the multiple tests conducted. In situations where a large number of hypotheses are tested, and especially if sample sizes are small, the adjusted α-value can, and often does, exclude all hypotheses tested.
The False Discovery Rate is an alternative approach proposed by Benjamini and Hochberg (1995) based on controlling the rate of false discovery, or the proportion of false positive results within a group of tests (i.e., family-wise error). Using the rank of p-values to help interpret significance among multiple comparisons, FDR uses the following equation:

q-value = (i / m) × q    (1)

where i is the rank of the test based on p-values sorted from smallest to largest, m is the total number of tests, and q is the maximum desired false discovery rate for all tests. The p-value for each test is then compared to the q-value, and if the p-value is less than the q-value, the test is significant. For easier interpretation, many software packages include adjusted p-values. Furthermore, FDR is adaptable to the number of tests, as the threshold for significance is the rate at which a false discovery is expected (e.g., 5 percent). For a review of the false discovery rate and other techniques for correcting for multiple comparisons, see Groppe et al. (2011).
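A minimal sketch of Equation 1, applied to the same ten hypothetical p-values, is given below; many packages provide equivalent adjusted p-values (e.g., the multipletests function in the Python statsmodels package with method="fdr_bh").

```python
import numpy as np

# Minimal sketch of the Benjamini-Hochberg procedure (Equation 1),
# using the same hypothetical p-values as the Bonferroni example above.
q = 0.05                                   # maximum desired false discovery rate
p_values = np.array([0.001, 0.004, 0.012, 0.030, 0.047,
                     0.060, 0.110, 0.250, 0.400, 0.810])

m = len(p_values)
order = np.argsort(p_values)               # ranks i, from smallest to largest p-value
ranks = np.arange(1, m + 1)
q_values = ranks / m * q                   # q-value = (i / m) * q

# A test passes if its p-value falls below its rank-based q-value; following
# Benjamini and Hochberg (1995), all tests ranked at or below the largest
# passing rank are declared significant.
below = p_values[order] <= q_values
n_significant = 0 if not below.any() else int(np.max(np.where(below)[0])) + 1
print(f"Significant tests at FDR q = {q}: {n_significant} of {m}")
```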
Another approach, and one with a long tradition in remote sensing, is the use of independent training and validation datasets (Congalton, 1991). While the use of validation datasets is less common for trend detection, this approach can help verify an empirical relationship derived from multiple testing. Since the test of the empirical relationship with the validation dataset is a new, single test, no p-value adjustment is required. Furthermore, if the empirical relationship is used for prediction, then the uncertainty in the prediction can also be assessed. One of the drawbacks of the validation approach is the need for independent data, which is often unattainable. For example, the detection of temporal trends for each pixel over large areas is impossible to validate.
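A minimal sketch of this workflow with synthetic data is given below; the variable names and the training/validation split are purely illustrative, not those of any particular study.

```python
import numpy as np
from scipy import stats

# Hypothetical illustration: an empirical relationship between an image-derived
# metric and a field measurement is developed on training data and then
# confirmed with a single, new test on an independent validation set.
rng = np.random.default_rng(0)
image_metric = rng.normal(size=60)                      # e.g., a texture value per plot
field_value = 0.5 * image_metric + rng.normal(size=60)  # e.g., a field measurement

train_x, valid_x = image_metric[:40], image_metric[40:]
train_y, valid_y = field_value[:40], field_value[40:]

# Relationship fitted on the training data (possibly selected after screening
# many candidate metrics, i.e., after multiple comparisons).
fit = stats.linregress(train_x, train_y)

# Single confirmatory test on the validation data: no p-value adjustment needed.
r_valid, p_valid = stats.pearsonr(valid_x, valid_y)
print(f"Training fit: slope = {fit.slope:.2f}, p = {fit.pvalue:.3g}")
print(f"Validation test: r = {r_valid:.2f}, p = {p_valid:.3g}")
```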
The occurrence of the multiple comparison problem does not automatically necessitate a quantitative solution. An alternative approach to the multiple comparison problem is to simply change the interpretation of the p-value (Gelman et al., 2012), rather than trying to adjust the p-value so that it fits the classic interpretation. For example, if 1,000 tests are conducted and 500 are found to be significant, but only 50 would be expected by chance from random datasets at an α-value of 0.05, then it has been demonstrated that the number of significant relationships is much greater than the number expected by chance. In a situation where each test detects a temporal trend in each pixel of a time-series of images, the results would clearly indicate overall significance (i.e., the number of significant results far exceeds that expected by chance alone), although the number of pixels with significant trends is likely over-estimated. However, since it would be expected that 5 or 10 percent of the “significant” results could be false positives, this uncertainty should be considered when mapping the results. Since it is difficult to ascertain which pixels are false positives, interpretation of trends should focus on aggregate results rather than local patterns. Furthermore, the portion of the total area with a significant trend that may be a false positive result should be discussed. In some situations, the inclusion of false positive results may be useful. For example, including potentially false positive results in a pilot study can help indicate areas for future research, as long as those results are not used for generalization. Interpretation of the results should clearly describe this situation if applicable.
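A minimal sketch of this kind of aggregate interpretation, using the hypothetical counts from the example above, is shown below; the use of a binomial test here is one possible way to formalize the comparison against chance, not a prescribed method.

```python
from scipy.stats import binomtest

# Hypothetical aggregate interpretation: 1,000 pixel-wise tests, 500 significant.
n_tests = 1000
n_significant = 500
alpha = 0.05

expected_by_chance = alpha * n_tests   # 50 false positives expected at alpha = 0.05

# One-sided test of whether the observed count exceeds chance alone.
result = binomtest(n_significant, n_tests, alpha, alternative="greater")

print(f"Expected significant by chance: {expected_by_chance:.0f} of {n_tests}")
print(f"Observed: {n_significant}; probability of at least this many by chance: {result.pvalue:.3g}")
# Some of the 500 "significant" pixels are still likely false positives, so
# interpretation should emphasize the aggregate pattern, not individual pixels.
```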
First Case Study: Estimating Leaf Area Index of Mangroves using Image Texture
Context
This case study demonstrates the impact of the multiple comparison problem when repeatedly testing different remote sensing products against ground measurements, in this case image texture and leaf area index. The amount of