Automatic Estimation of Number of Clusters in Hyperspectral Imagery
Amin Alizadeh Naeini, Mohammad Saadatseresht, and Saeid Homayouni
Abstract
One of the most challenging problems in automated clustering of hyperspectral data is determining the number of clusters (NOC) either prior to or during the clustering. We propose a statistical method for best estimating the NOC, not only prior to but also independent of the clustering. This method uses both residual analysis (RA) and change point analysis (CPA) to select a number of candidates. Because the NOC and the data intrinsic dimension (ID) interact with one another in a predictable way, the ID can provide useful inferential information about the NOC. Indeed, once the ID has been found, the NOC can be inferred on the basis of this information. The performance of the proposed method is evaluated by processing several hyperspectral datasets. Furthermore, a comparison with the results of the partitional approach, using some well-known similarity measures, has demonstrated that our method is more effective in detecting the optimal NOC.
Introduction
Unsupervised classifiers or clustering algorithms can potentially resolve the limitations of supervised methods. The unsupervised algorithms perform the classification task by exploiting the information provided by the data, without requiring any prior information or training samples. Even though supervised classification algorithms perform better than unsupervised ones, the unavailability of high-quality training data for supervised methods warrants the use of clustering algorithms (Paoli et al., 2009).
Most clustering algorithms, especially those frequently used for remotely sensed data, require a specified NOC (Schowengerdt, 2006). In some cases, the NOC can be estimated using expert knowledge about the land covers (Liang et al., 2012). However, in many other situations, such as a complex remotely sensed scene, the NOC for a given dataset cannot be known a priori (Farrell Jr. and Mersereau, 2004) and hence must be estimated automatically. It is well known that either over-estimation or under-estimation of the NOC will severely affect the quality of clustering results (Liang et al., 2012). Therefore, accurately estimating the NOC in any data is of fundamental importance in clustering analysis.
Efforts to determine the NOC have mostly followed one of the following cluster structures: (a) hierarchical clustering (Johnson, 1967), (b) histogram peak selection (HPS) based clustering (Richards and Jia, 2006), (c) statistical clustering (Datta, 2003), and (d) partitional clustering (Jain et al., 1999).

In hierarchical clustering, the fusion dendrogram can be inspected in an attempt to determine the intrinsic NOC or spectral classes. In this case, an objective measure, such as the Bayesian information criterion (BIC), is needed to cut the dendrogram (Jung et al., 2003). This kind of clustering is not very popular within the remote sensing community, because of both the large number of pixels and the high computational time required by hierarchical clustering (Richards and Jia, 2006).
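A minimal sketch of this dendrogram-cutting strategy is given below. The synthetic spectra, the sub-sampling, and the use of SciPy/scikit-learn are assumptions made for illustration only; the Caliński-Harabasz index is used merely as a readily available stand-in for the BIC-style objective measure mentioned above.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import calinski_harabasz_score

# Toy stand-in for a sub-sampled set of pixel spectra; hierarchical clustering
# is rarely run on a full hyperspectral image because of its cost.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.1, size=(150, 10)) for c in (0.0, 0.8, 1.6)])

Z = linkage(X, method="ward")  # build the fusion dendrogram

# Cut the dendrogram at each candidate NOC and score the resulting partition.
# The cited work uses BIC; the Calinski-Harabasz index is used here only as an
# easily computed objective measure.
scores = {}
for k in range(2, 11):
    labels = fcluster(Z, t=k, criterion="maxclust")
    scores[k] = calinski_harabasz_score(X, labels)

best_noc = max(scores, key=scores.get)
print("NOC estimated from the dendrogram cuts:", best_noc)
```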
In HPS-based clustering, a multidimensional histogram of a segment of an image may exhibit several peaks at the locations of similar spectral clusters. Pixels are then associated with the nearest spectral peak for the assignment of clusters. This technique is only useful when the dimensionality of the data is low; therefore, it is not applicable to hyperspectral data (Richards and Jia, 2006).
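As a hedged sketch of the peak-selection idea, the snippet below builds a two-band histogram from synthetic pixel values (assumed data, not from the cited work), treats bins that dominate their neighbourhood as spectral peaks, and assigns each pixel to the nearest peak. The rapid growth and sparsity of such a histogram as bands are added is exactly why the approach breaks down for hyperspectral data.

```python
import numpy as np
from scipy.ndimage import maximum_filter

# Two bands only: a multidimensional histogram becomes impractically sparse as
# the number of bands grows, which is why HPS does not scale to hyperspectral data.
rng = np.random.default_rng(0)
pixels = np.vstack([rng.normal(c, 0.04, size=(1000, 2))
                    for c in ((0.25, 0.35), (0.65, 0.70))])

hist, edges = np.histogramdd(pixels, bins=20)

# Bins that dominate their 5x5 neighbourhood are taken as spectral peaks.
peaks_mask = (hist == maximum_filter(hist, size=5)) & (hist > hist.max() * 0.2)
centres = [0.5 * (e[:-1] + e[1:]) for e in edges]
peaks = np.array([[centres[d][i] for d, i in enumerate(idx)]
                  for idx in np.argwhere(peaks_mask)])

# Each pixel is assigned to the nearest spectral peak.
labels = np.argmin(np.linalg.norm(pixels[:, None, :] - peaks[None, :, :], axis=2), axis=1)
print(f"{len(peaks)} peaks found; cluster sizes: {np.bincount(labels)}")
```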
In statistical clustering, the data are assumed to be independent random variables modeled by either a density or a probability function. A variety of approaches for choosing the optimal NOC under a mixture of distributions has been used (Mirkin, 2011). However, the statistical clustering approaches are complex and time-consuming because of (a) the high dimensionality of hyperspectral data, (b) the need to estimate several parameters for statistical modeling, and (c) the Gaussian assumption of class distributions (Richards and Jia, 2006).
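The mixture-of-distributions idea can be illustrated with a short sketch: Gaussian mixture models with an increasing number of components are fitted to synthetic spectra, and the component count with the lowest BIC is kept. The data, its dimensionality, and the use of scikit-learn are assumptions for illustration, not the specific formulations surveyed in Mirkin (2011).

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for pixel spectra; real hyperspectral pixels would usually
# be reduced in dimension first (e.g., by PCA) before mixture modeling.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.1, size=(300, 8)) for c in (0.0, 0.7, 1.4, 2.1)])

# Fit a Gaussian mixture for each candidate NOC and record its BIC.
bics = {}
for k in range(1, 9):
    gm = GaussianMixture(n_components=k, covariance_type="full",
                         n_init=3, random_state=0).fit(X)
    bics[k] = gm.bic(X)

best_noc = min(bics, key=bics.get)  # lower BIC indicates a better trade-off
print("NOC selected by BIC:", best_noc)
```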
In the field of remote sensing, much research has been done using partitional clustering methods such as the k-means approach (Prabhu, 2011; Mirkin, 2011). In these methods, the NOC varies from 2 to a predefined maximum NOC. By calculating validity indices of the clustering results (see Davies and Bouldin, 1979; Caliński and Harabasz, 1974; Tibshirani et al., 2001; Prabhu, 2011) for the different values of the NOC, the ideal NOC of a given dataset is obtained in Liang et al. (2012).
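A minimal sketch of this partitional strategy follows: k-means is run for every candidate NOC and the partitions are compared with internal validity indices. The synthetic data and the scikit-learn implementations of the Davies-Bouldin and Caliński-Harabasz indices are assumptions made for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score, calinski_harabasz_score

# Synthetic spectra with five well-separated groups standing in for image pixels.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.1, size=(250, 12)) for c in (0.0, 0.6, 1.2, 1.8, 2.4)])

# Cluster with every candidate NOC and score each partition.
results = {}
for k in range(2, 13):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    results[k] = (davies_bouldin_score(X, labels), calinski_harabasz_score(X, labels))

best_by_db = min(results, key=lambda k: results[k][0])  # lower Davies-Bouldin is better
best_by_ch = max(results, key=lambda k: results[k][1])  # higher Calinski-Harabasz is better
print("NOC by Davies-Bouldin:", best_by_db, "| NOC by Calinski-Harabasz:", best_by_ch)
```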
In addition, the partitional methods are not only computationally efficient, but also suitable for high-dimensional data such as hyperspectral data. However, they often, problematically, get stuck in local optima, a problem that is aggravated in hyperspectral data (Samadzadegan and Alizadeh Naeini, 2011). Furthermore, a major difficulty is the absence of an objective criterion for comparing the quality of various clustering results of the same dataset (Wang, 2010). Although the first problem (i.e., getting stuck in local optima) can be tackled by using an evolutionary-based algorithm (Maulik et al., 2011), the second, despite measures, either internal measures (Prabhu,
Amin Alizadeh Naeini and Mohammad Saadatseresht are with the Department of Geomatics Engineering, College of Engineering, University of Tehran, North Kargar Street, P.O. Box 11155/4563, Tehran, Iran.

Saeid Homayouni is with the Department of Geography, Simard Hall, 60 University, University of Ottawa, Ottawa, ON, Canada, K1N 6N5.