PPD: Pyramid Patch Descriptor via Convolutional Neural Network
Jie Wan, Alper Yilmaz, and Lei Yan
Abstract
Local features play an important role in remote sensing image
matching, and handcrafted features have been excessively
used in this area for a long time. This article proposes a
pyramid convolutional neural triplet network that extracts a
128-dimensional deep descriptor that significantly improves
the matching performance. The proposed approach first
extracts deep descriptors of the anchor patches and corresponding positive patches in a batch using the pyramid
convolutional neural network. Then, for each anchor patch
and corresponding positive patch, it chooses the closest
negative patch to form the triplet sample based on the
descriptor distances among all other image patches in the
batch. These triplets are used to optimize the parameters
of the network using a new loss
function. We evaluated the proposed deep descriptors on
two benchmark data sets (Brown and HPatches) as well as
real image data sets. The results reveal that the proposed
descriptor achieves state-of-the-art performance on the
Brown data set and highly competitive performance on the
HPatches data set. The proposed approach finds more
correct matches than classical handcrafted feature
descriptors on aerial image pairs and is robust to
variations in viewpoint and illumination.
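The batch-wise hardest-negative mining described in the abstract can be made concrete with a short sketch. The PyTorch fragment below is a minimal illustration under assumed details (the function name, the Euclidean metric, the plain margin loss, and margin=1.0 are our assumptions); it follows the spirit of hardest-in-batch mining rather than reproducing the paper's exact loss function.

```python
import torch
import torch.nn.functional as F

def hardest_in_batch_triplet_loss(anchor_desc, positive_desc, margin=1.0):
    """In-batch hardest-negative mining followed by a triplet margin loss.

    anchor_desc, positive_desc: (B, 128) descriptor tensors for the anchor
    patches and their matching positive patches; row i of each tensor
    corresponds to the same 3-D point.
    """
    # Pairwise Euclidean distances between every anchor and every positive.
    dist = torch.cdist(anchor_desc, positive_desc)              # (B, B)

    # The distance of each anchor to its own positive lies on the diagonal.
    pos_dist = dist.diag()                                      # (B,)

    # Exclude matching pairs, then take the closest non-matching patch as
    # seen from the anchor side (row min) and the positive side (col min).
    eye = torch.eye(dist.size(0), dtype=torch.bool, device=dist.device)
    masked = dist.masked_fill(eye, float('inf'))
    neg_dist = torch.min(masked.min(dim=1).values, masked.min(dim=0).values)

    # Triplet margin loss over the mined triplets.
    return F.relu(pos_dist - neg_dist + margin).mean()
```

Mining the hardest negative inside the batch avoids a separate negative-sampling pass: every descriptor already computed for the batch doubles as a candidate negative for all the other pairs.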
Introduction
With the rapid development of image sensors over the past
few decades, images have increasingly been used as an
effective source of information for a wide range of
applications. Automatic extraction, analysis, and understanding
of this information typically require extracting features
from the image. Local point features are the most widely
used features in photogrammetry and computer vision
applications. Finding accurate correspondences among local
point features refers to the task of associating points in one
image with points in another image. This task is challenging
when differences arise from camera motion, time lapse, and
object motion. These variations require high-quality
descriptors, which are essential for most tasks, such as
structure from motion (Schonberger and Frahm 2016),
simultaneous localization and mapping (Mur-Artal and
Tardós 2017), image registration (Bentoutou et al. 2005),
image matching (Dufournaud et al. 2004), image retrieval
(Zhou et al. 2017), and image scene classification (Cheng
et al. 2017). A high-quality feature descriptor should be not
only robust to changes in viewpoint, illumination, shading,
and partial occlusion but also highly discriminative against
noncorresponding features with high similarity.
The past decades have witnessed the development of many
feature descriptor algorithms. Traditional handcrafted
descriptor extraction algorithms obtain a descriptor by
encoding the color and texture at pixels according to fixed
rules. The scale-invariant feature transform (SIFT) is one of
the most commonly used algorithms, with a descriptor
composed of the histograms of gradient orientations weighted
by their magnitudes around the keypoint. Later, several
algorithms, such as PCA-SIFT, which applies principal
component analysis to the normalized gradient patch (Ke and
Sukthankar 2004), and DSP-SIFT, which uses pooling of gradient
orientations (Dong and Soatto 2015), have improved the
performance of the classical SIFT algorithm.
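To make the gradient-histogram construction concrete, the following NumPy sketch computes a single magnitude-weighted orientation histogram over a patch. The full SIFT descriptor concatenates such histograms over a 4 x 4 grid of spatial cells (yielding 128 dimensions) and applies Gaussian weighting, both of which are omitted here for brevity; the function name and bin count are illustrative.

```python
import numpy as np

def orientation_histogram(patch, num_bins=8):
    """Magnitude-weighted histogram of gradient orientations: the basic
    building block of a SIFT-style descriptor. `patch` is a 2-D float array.
    """
    gy, gx = np.gradient(patch.astype(np.float64))
    magnitude = np.hypot(gx, gy)
    orientation = np.arctan2(gy, gx) % (2 * np.pi)       # in [0, 2*pi)

    # Each pixel votes for its orientation bin, weighted by its magnitude.
    bins = (orientation / (2 * np.pi) * num_bins).astype(int) % num_bins
    hist = np.bincount(bins.ravel(), weights=magnitude.ravel(),
                       minlength=num_bins)

    # L2-normalize the histogram, as SIFT does, for illumination robustness.
    return hist / (np.linalg.norm(hist) + 1e-12)
```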
SURF is another commonly used scale- and rotation-invariant
descriptor that relies on integral images (Bay et al. 2008).
Besides these descriptors, binary descriptors have also been
popular for comparing feature points very quickly while
requiring a comparatively small amount of memory. Balntas
et al. (2015) presented a full inter- and intraclass
optimization of binary descriptors that adaptively selects
binary intensity tests online to simultaneously increase
interclass and decrease intraclass distances. BinBoost uses a
boosted binary hash function to compute each bit of the
descriptor (Trzcinski et al. 2013). BRIEF interprets an image
patch as a binary string that relies on a relatively small
number of intensity difference tests; thus, it is very fast
(Calonder et al. 2012). To address the issue of nonrigid
deformations, Simo-Serra, Torras, et al. (2015) proposed the
DaLI descriptor, which is highly resilient to nonrigid image
transformations and illumination changes. All these methods
are handcrafted descriptors driven by intuition or the
researcher's expertise and are generated from low-level
features, such as image gradients or binary intensity tests;
hence, they inevitably suffer from information loss (Tian
et al. 2017).
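The binary-test idea behind BRIEF can be sketched in a few lines of NumPy. The uniform random sampling pattern, the test count, and the function names below are our simplifications, not the exact scheme of Calonder et al. (2012).

```python
import numpy as np

def brief_descriptor(patch, num_tests=256, rng=None):
    """BRIEF-style binary descriptor: a string of pairwise intensity
    comparisons at fixed locations. The same sampling pattern must be
    reused for every patch so that descriptors remain comparable, hence
    the fixed default seed.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    h, w = patch.shape
    # A fixed set of test-point pairs inside the patch.
    p = rng.integers(0, [h, w], size=(num_tests, 2))
    q = rng.integers(0, [h, w], size=(num_tests, 2))
    bits = patch[p[:, 0], p[:, 1]] < patch[q[:, 0], q[:, 1]]
    return np.packbits(bits)            # 256 tests -> 32 bytes

def hamming_distance(d1, d2):
    """Matching two binary descriptors reduces to a Hamming distance."""
    return int(np.unpackbits(d1 ^ d2).sum())
```

Because each descriptor is a packed bit string compared with a Hamming distance, matching is extremely fast and memory-efficient, which is exactly the appeal of binary descriptors noted above.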
Recently, deep learning, especially the convolutional neural
network (CNN), has been used as an efficient approach in
object detection and recognition tasks and has dramatically
improved performance (Zbontar and LeCun 2015). To extend the
performance of CNN in image recognition, researchers need to
develop tests that reveal problems with matching specific
features. For example, Fischer et al. (2014) used the filters
from various layers of a CNN trained on the ImageNet data set
and compared the matching performance to that of standard
SIFT descriptors. The authors observed that these filters
provided better descriptors than SIFT for the image matching
problem. This new feature was referred to as the deep feature
in subsequent articles. Following their work, researchers
started to use CNNs to learn the descriptors for image
Jie Wan is with the Beijing Key Laboratory of Spatial
Information Integration and Its Applications, School of
Earth and Space Sciences, Peking University, Beijing, China;
and the Department of Civil, Environmental, and Geodetic
Engineering, Ohio State University, Columbus, OH.
Alper Yilmaz is with the Department of Civil, Environmental,
and Geodetic Engineering, Ohio State University, Columbus, OH.
Lei Yan is with the Beijing Key Laboratory of Spatial
Information Integration and Its Applications, School of Earth
and Space Sciences, Peking University, Beijing, China.
(Corresponding author.)