PE&RS November 2018 Full - page 724

scene classification (Chong et al., 2009; Putthividhy et al., 2010; Huang et al., 2015). However, the BOVW and topic models rely on empirically designed features to depict the local patches of the images as visual words. To overcome this limitation, an increasing amount of research has focused on unsupervised machine learning methods that autonomously extract adaptive and suitable features from unlabeled input data. For example, Coates et al. (2010) built a single-layer UCNN for unsupervised feature learning. They used different unsupervised learning algorithms to generate local convolutional features (i.e., function bases), and found that k-means clustering, which is an extremely simple learning algorithm, achieved the best performance. Blum et al. (2012) applied the network proposed in Coates et al. (2010) to object recognition from natural images with depth information. Dosovitskiy et al. (2014) developed a multi-layer UCNN to learn feature representations from unlabeled images, and the learned features performed well in natural image classification. Recently, in order to classify remotely sensed scenes, Li et al. (2016) trained a multi-layer UCNN using k-means clustering to autonomously mine complex structure features from high-resolution images, and used a support vector machine (SVM) for the final scene classification. In that study, the features extracted by the UCNN achieved a better scene classification accuracy than BOVW and sparse coding.

To the best of our knowledge, little research has so far been performed concerning tea garden detection from remote sensing data. However, such research is necessary, since tea cultivation plays an important part in Chinese agriculture, yet current tea garden monitoring relies on field investigation, which is time-consuming and labor-intensive. In this context, the proposed scene-based framework employs high-resolution satellite images to detect tea gardens, since these images can provide abundant spatial and textural information. Considering that a tea garden is a semantic scene composed of a variety of interrelated objects in a high-resolution image, we propose to apply scene-based semantic learning methods for tea garden detection, including the following experimental configurations: (1) BOVW is used to represent the scenes with spectral and Gabor textural features, and an SVM classifier is then employed to classify the representation into tea gardens and non-tea gardens; (2) sLDA is used to extract the topic features from the BOVW representation of the scenes and predict the category label of each scene; and (3) a multi-layer UCNN is trained to generate discriminative features from the original spectral images, and the derived features are also classified by SVM.

The rest of this paper is organized as follows. The next section introduces the tea garden detection framework, followed by a description of the datasets and the experimental setup. The subsequent section presents the detection results and a discussion of the different methods and features. The last section concludes the paper.

Methodology

In this section, we introduce the scene classification methods employed in this study (i.e., BOVW, sLDA, and the UCNN). Subsequently, the proposed scene-based tea garden detection framework is described in detail.

Topic Scene Classification Models

BOVW is the basis of the topic models, and thus it is presented before introducing the sLDA model. The BOVW model was derived from a text analysis method which represents a document by its word frequencies, ignoring word order. The idea was then applied to images by utilizing the visual words formed by vector quantizing the visual features. The BOVW representation is constructed in two stages, as shown in Figure 2, i.e., visual word learning and feature encoding. During the visual word learning, the remotely sensed images are divided into patches, and the spectral or textural information of these patches is extracted to generate feature vectors which describe the patches. We then quantize the spectral and textural descriptors using the k-means clustering algorithm. The cluster centers, which are known as “visual words”, form a dictionary. In the feature encoding, a scene to be represented is split into several patches. After the features of each patch are extracted, the patch is assigned the label of the closest cluster center. In this way, an image can be represented by a frequency histogram of the labeled patches. The histogram, whose size is equal to the size of the dictionary, can be regarded as a feature vector for the subsequent classification.
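The two BOVW stages described above can be illustrated with a minimal sketch. It is not the implementation used in this study: the function names (learn_dictionary, encode_scene), the toy two-dimensional descriptors, the dictionary size, and the fixed iteration count are all assumptions made for illustration, and a plain Lloyd-style k-means stands in for whatever clustering settings were actually used. Real patch descriptors would be the spectral or Gabor textural features of the image patches.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two descriptors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_word(descriptor, dictionary):
    """Index of the visual word (cluster center) closest to a descriptor."""
    return min(range(len(dictionary)),
               key=lambda i: squared_distance(descriptor, dictionary[i]))


def learn_dictionary(descriptors, k, iterations=20, seed=0):
    """Stage 1, visual word learning: k-means over training patch descriptors.

    The k cluster centers returned are the visual words of the dictionary.
    """
    rng = random.Random(seed)
    centers = [list(d) for d in rng.sample(descriptors, k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for d in descriptors:
            groups[nearest_word(d, centers)].append(d)
        for i, group in enumerate(groups):
            if group:  # an empty cluster keeps its previous center
                centers[i] = [sum(col) / len(group) for col in zip(*group)]
    return centers


def encode_scene(patch_descriptors, dictionary):
    """Stage 2, feature encoding: count how often each visual word occurs."""
    histogram = [0] * len(dictionary)
    for d in patch_descriptors:
        histogram[nearest_word(d, dictionary)] += 1
    return histogram


# Made-up 2-D descriptors standing in for spectral/textural patch features.
training_patches = [[0.10, 0.20], [0.15, 0.22], [0.90, 0.80],
                    [0.88, 0.85], [0.50, 0.10], [0.52, 0.12]]
dictionary = learn_dictionary(training_patches, k=3)

# Patches of one scene to be represented (also hypothetical values).
scene_patches = [[0.12, 0.21], [0.89, 0.82], [0.90, 0.79], [0.51, 0.11]]
histogram = encode_scene(scene_patches, dictionary)
print(histogram)
```

The histogram has one bin per visual word, so its length equals the dictionary size, and its entries sum to the number of patches in the scene. This vector is what a classifier such as an SVM would receive as input.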

The BOVW model represents a scene as a text document by the frequencies of the visual words. The LDA model (Blei et al., 2003), which is a generative probabilistic model from the statistical text literature, characterizes the scene as a random mixture over latent topics, where each topic in turn is described by a distribution over the visual words in the dictionary. The process by which LDA generates a scene can then be described as follows (as shown in Figure 3):

Figure 1. Examples of tea gardens: (a) Digital camera photographs, and (b) Google Earth images.


PHOTOGRAMMETRIC ENGINEERING & REMOTE SENSING
