patches.
CNN-based methods can be divided into two classes (Wei et al. 2018). The first class of studies treats feature matching as a binary classification problem based on the similarity score generated from a pair of image patches. An obvious drawback of this class of studies is its high computational cost, since each image patch must be matched against all other image patches. The second class of studies directly generates the descriptor from an image patch by passing it through a CNN. This approach treats each image patch independently and is similar to the traditional handcrafted descriptors; thus, it can easily be used in various vision problems by directly replacing a classical handcrafted descriptor. Many variations of this approach have been proposed to generate a deep learning-based descriptor through a CNN, including (but not limited to) new loss functions, new architectures, and regularization terms. A multi-resolution CNN-based descriptor was proposed by Mitra et al. (2017) that requires different image resolutions that must be manually generated from the original image patch.
TFeat also suggests a multi-scale architecture to improve the performance of the descriptor (Balntas et al. 2016). However, TFeat explores only a shallow structure that focuses on fast extraction.
In this article, we propose a pyramid network structure to
make full use of the multi-scale information to produce the
deep descriptor for a given image patch. The main contribu-
tions of this article are as follows:
1. We propose a pyramid CNN that can better incorporate the global context of the image patch at different scales to learn the descriptor for an image patch.
2. Our triplet network uses a new distance loss function that
avoids setting the margins manually.
We test the proposed deep descriptor on two benchmark data sets: it achieves the highest performance on the Brown benchmark data set and, with the exception of HardNet, the top accuracy on the HPatches benchmark data set. The proposed pyramid patch descriptor (PPD) performs better than HardNet in the verification and matching tasks but worse in the retrieval task. For the first time in the literature, we use three aerial image pairs to demonstrate the effectiveness of the proposed deep descriptor PPD in realistic scenarios. The experimental results show that PPD, when applied to the aerial image matching problem, obtains more correct correspondences than traditional feature descriptors such as SIFT and BRISK. The PPD is also observed to be more robust and effective in finding correspondences in image pairs with illumination and viewpoint variation.
The remainder of this article is organized as follows. Related research on extracting learning-based descriptors is presented in the section "Related Work." The proposed deep patch descriptor extraction algorithm is described in the section "Proposed Method." Experimental results and analysis are presented in the section "Experiments and Analysis," and a discussion follows in the section "Discussion." Concluding remarks on our study are given in the section "Conclusions."
Related Work
The proposed algorithm extracts the patch descriptor through a deep CNN. Hence, the following discussion presents a detailed overview of descriptor learning methods based on convolutional networks. MatchNet (Han et al. 2015) directly matches two image patches; it consists of a deep convolutional network that extracts features from patches and a network of three fully connected layers that computes a similarity between the extracted features. Zagoruyko and Komodakis (2015) propose two separate architectures, a two-channel network and a central-surround two-stream network, for the patch matching problem. Their results show that the two-channel network performs better than the Siamese network used in MatchNet. Kumar et al. (2016) use a triplet network structure and propose a global loss function that minimizes the mean value of the distances between matching pairs and maximizes that of nonmatching pairs. The above methods use a metric learning layer in their networks, which limits their applicability. According to the architecture of the network and the loss function, the methods proposed in the published literature can be classified into two classes: Siamese CNN and triplet CNN.
Siamese CNN
A commonly used architecture for learning feature embedding with a CNN is a Siamese network, which consists of two identical networks parallel to each other. Given a set of positive pairs P and a set of negative pairs N, the network tries to directly decrease the distance of matching image pairs and increase that of nonmatching image pairs. Given two image patches I_1 and I_2 (of size n × n), DeepDesc (Simo-Serra, Trulls, et al. 2015) uses the contrastive loss in Equation 1 to obtain the descriptor in the training phase:

l(I_1, I_2) = \begin{cases} \left\| f(I_1) - f(I_2) \right\|_2, & \text{if } (I_1, I_2) \in P \\ \max\left(0, \varepsilon - \left\| f(I_1) - f(I_2) \right\|_2\right), & \text{if } (I_1, I_2) \in N \end{cases} \quad (1)
where f(·) denotes the network, ε > 0 is the margin parameter, and ‖·‖_2 is the L_2 norm computed from two deep descriptors.
DeepDesc is a shallow network with only three convolutional
layers and no fully connected layers. DeepDesc uses a hard
negative mining approach to suppress the negative pairs that
do not contribute to the update of the gradients during the
training process.
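To make Equation 1 concrete, the following is a minimal PyTorch-style sketch of the contrastive loss together with a simple hard-negative-mining step in the spirit of DeepDesc. The function names, the batch layout, and the keep_ratio parameter are our own illustrative assumptions, not the original implementation.

```python
import torch

def contrastive_loss(f1, f2, is_positive, margin=1.0, keep_ratio=0.5):
    """Contrastive loss of Equation 1 with simple hard negative mining.

    f1, f2      : (B, D) descriptors f(I1), f(I2) from the two identical branches
    is_positive : (B,) bool tensor, True where (I1, I2) is a matching pair in P
    margin      : the margin parameter epsilon > 0
    keep_ratio  : fraction of hardest negatives kept for the gradient update
    Assumes the batch contains both positive and negative pairs.
    """
    d = torch.norm(f1 - f2, p=2, dim=1)                         # L2 distance between descriptors
    pos_loss = d[is_positive]                                   # pull matching pairs together
    neg_loss = torch.clamp(margin - d[~is_positive], min=0.0)   # push nonmatching pairs apart

    # Hard negative mining: keep only the negatives with the largest loss
    # (the nonmatching pairs that are closest together), so that easy
    # negatives with zero loss do not dilute the gradients.
    k = max(1, int(keep_ratio * neg_loss.numel()))
    hard_neg_loss, _ = torch.topk(neg_loss, k)

    return pos_loss.mean() + hard_neg_loss.mean()
```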
Triplet CNN
Compared to the Siamese networks, which use two image patches as the input of the network, the triplet networks use three image patches to learn the deep descriptor for image patches. The triplet network differs from the Siamese network in that it requires the distance of the negative pair to be larger than that of the positive pair during the optimization. One triplet contains three samples I_a, I_p, and I_n ((I_a, I_p) ∈ P, (I_a, I_n) ∈ N), where I_a is the anchor image patch, I_p is the positive image patch for the anchor I_a, and I_n is the negative image patch for the anchor. That is, I_a and I_p are different image patches of the same physical point, and I_n comes from another physical point. The triplet network forces the distance between I_a and I_p to be smaller and the distance between I_a and I_n to be larger. There are two kinds of triplet loss functions for learning convolutional embeddings in the literature. One is the margin ranking loss (Wang et al. 2014):
l(I_a, I_p, I_n) = \max(0, \varepsilon + d_+ - d_-) \quad (2)
where ε > 0 is the margin parameter, d_+ = ‖f(I_a) − f(I_p)‖_2, and d_- = ‖f(I_a) − f(I_n)‖_2; the other is the ratio loss (Hoffer and Ailon 2015):
l(I_a, I_p, I_n) = \left( \frac{e^{d_+}}{e^{d_+} + e^{d_-}} \right)^2 + \left( 1 - \frac{e^{d_-}}{e^{d_+} + e^{d_-}} \right)^2 \quad (3)
Margin ranking loss is a convex approximation and tries to optimize the network so that d_- > ε + d_+, while ratio loss aims to maximize the ratio d_-/d_+.
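As a concrete comparison of the two losses, here is a minimal PyTorch-style sketch of Equations 2 and 3. The tensor names fa, fp, fn for the descriptors of I_a, I_p, and I_n and the batch layout are our own assumptions for illustration.

```python
import torch

def margin_ranking_loss(fa, fp, fn, margin=1.0):
    """Triplet margin ranking loss of Equation 2 (Wang et al. 2014)."""
    d_pos = torch.norm(fa - fp, p=2, dim=1)  # d+ : anchor-positive distance
    d_neg = torch.norm(fa - fn, p=2, dim=1)  # d- : anchor-negative distance
    # Loss is zero once d- exceeds d+ by the margin epsilon.
    return torch.clamp(margin + d_pos - d_neg, min=0.0).mean()

def ratio_loss(fa, fp, fn):
    """Triplet ratio loss of Equation 3 (Hoffer and Ailon 2015); no margin to tune."""
    d_pos = torch.norm(fa - fp, p=2, dim=1)
    d_neg = torch.norm(fa - fn, p=2, dim=1)
    # Softmax over the two distances yields e^{d+}/(e^{d+}+e^{d-}) and its complement.
    probs = torch.softmax(torch.stack([d_pos, d_neg], dim=1), dim=1)
    p_pos, p_neg = probs[:, 0], probs[:, 1]
    # Driving p_pos toward 0 and p_neg toward 1 forces d- to be much larger than d+.
    return (p_pos ** 2 + (1.0 - p_neg) ** 2).mean()
```

Note that the ranking loss requires choosing the margin ε by hand, whereas the ratio loss only constrains the relative magnitudes of d_+ and d_-; this distinction motivates the margin-free distance loss proposed in this article.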