September 2019 Full - page 674

patches. CNN-based methods can be divided into two classes (Wei et al. 2018). The first class treats feature matching as a binary classification problem based on the similarity score generated from a pair of image patches. An obvious drawback of this class is its very high computational cost, since each image patch must be matched against all other image patches. The second class generates the descriptor directly from an image patch by passing it through a CNN. This approach treats each image patch independently and is similar to the traditional handcrafted descriptor; thus, it can be used in various vision problems as a direct replacement for the classical handcrafted descriptor. Many variations of this approach have been proposed to generate a deep learning–based descriptor through a CNN, including (but not limited to) new loss functions, new architectures, and regularization terms. Mitra et al. (2017) proposed a multi-resolution CNN-based descriptor, which requires versions of the original image patch at different resolutions to be generated manually. Another approach, TFeat, also suggests a multi-scale architecture that has been shown to improve the performance of the deep descriptor (Balntas et al. 2016). However, TFeat explores only a “shallow” network structure that focuses on fast extraction of the deep descriptor.
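The practical consequence of the second class can be illustrated with a minimal sketch. It assumes that descriptors have already been produced by some network; the vectors below are made-up toy values, and the function names `l2_distance` and `match_descriptors` are hypothetical, not taken from any of the cited works. Matching is done by nearest neighbor in L2 distance.

```python
import math

def l2_distance(u, v):
    """Euclidean (L2) distance between two descriptor vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def match_descriptors(desc_a, desc_b):
    """For each descriptor in desc_a, return the index of its nearest
    neighbor in desc_b under the L2 distance."""
    matches = []
    for u in desc_a:
        dists = [l2_distance(u, v) for v in desc_b]
        matches.append(dists.index(min(dists)))
    return matches

# Toy descriptors (assumed values, standing in for network outputs).
desc_a = [[1.0, 0.0], [0.0, 1.0]]
desc_b = [[0.1, 0.9], [0.9, 0.1], [0.5, 0.5]]

matches = match_descriptors(desc_a, desc_b)
print(matches)
```

In this setting each patch passes through the network once, so the number of forward passes grows with `len(desc_a) + len(desc_b)`; a pairwise similarity classifier would instead need one pass per candidate pair, that is, `len(desc_a) * len(desc_b)`.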

In this article, we propose a pyramid network structure to make full use of the multi-scale information to produce the deep descriptor for a given image patch. The main contributions of this article are as follows:

1. We propose a pyramid CNN that can better incorporate the global context of the image patch at different scales to learn the descriptor for an image patch.

2. Our triplet network uses a new distance loss function that avoids setting the margins manually.

We test the proposed deep descriptor on two benchmark data sets. It achieves the highest performance on the Brown benchmark data set and, on the HPatches benchmark data set, an accuracy second only to HardNet. Compared with HardNet, the proposed pyramid patch descriptor (PPD) performs better in the verification and matching tasks but worse in the retrieval task. For the first time in the literature, we use three aerial image pairs to demonstrate the effectiveness of the proposed deep descriptor PPD on realistic scenarios. The experimental results show that PPD, when applied to the aerial image matching problem, can obtain more correct correspondences than traditional feature descriptors, such as SIFT and BRISK. The PPD is also observed to be more robust and effective in finding correspondences in image pairs with illumination and viewpoint variation.

The remainder of this article is organized as follows. Related research on extracting learning-based descriptors is presented in the section “Related Work.” The proposed deep patch descriptor extraction algorithm is described in the section “Proposed Method.” Experimental results and analysis are presented in the section “Experiments and Analysis,” and a discussion follows in the section “Discussion.” Concluding remarks on our study are given in the section “Conclusions.”

Related Work

The proposed algorithm extracts the patch descriptor through a deep CNN. Hence, the following discussion presents a detailed overview of descriptor learning methods based on convolutional networks. MatchNet (Han et al. 2015) directly matches two image patches; it consists of a deep convolutional network that extracts features from patches and a network of three fully connected layers that computes a similarity between the extracted features. Zagoruyko and Komodakis (2015) propose two separate architectures for the patch matching problem: a two-channel network and a central-surround two-stream network. Their results show that the two-channel network performs better than the Siamese network used in MatchNet. Kumar et al. (2016) use a triplet network structure and propose a global loss function that minimizes the mean value of the distances between matching pairs and maximizes that of nonmatching pairs. The above methods use a metric learning layer in their networks, which limits the application of such networks. According to the architecture of the network and the loss function, the methods proposed in the published literature can be classified into two classes: Siamese CNN and triplet CNN.

Siamese CNN

A commonly used architecture for learning feature embedding with a CNN is a Siamese network, which consists of two identical networks parallel to each other. Given a set of positive pairs $\mathcal{P}$ and a set of negative pairs $\mathcal{N}$, the network tries to directly decrease the distance of matching image pairs and increase the distance of nonmatching image pairs. Given two image patches $I_1$ and $I_2$ that form a pair in one of these sets, DeepDesc (Simo-Serra, Trulls, et al. 2015) uses the contrastive loss in Equation 1 to obtain the descriptor in the training phase:

\[
l(I_1, I_2) =
\begin{cases}
\lVert f(I_1) - f(I_2) \rVert_2 & \text{if } (I_1, I_2) \in \mathcal{P} \\
\max\left(0,\ \mu - \lVert f(I_1) - f(I_2) \rVert_2\right) & \text{if } (I_1, I_2) \in \mathcal{N}
\end{cases}
\tag{1}
\]

where $f(\cdot)$ denotes the network, $\mu > 0$ is the margin parameter, and $\lVert \cdot \rVert_2$ is the $L_2$ norm computed from two deep descriptors. DeepDesc is a shallow network with only three convolutional layers and no fully connected layers. DeepDesc uses a hard negative mining approach to suppress the negative pairs that do not contribute to the update of the gradients during the training process.
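As an illustration of how the contrastive loss in Equation 1 behaves, the following minimal sketch evaluates it on toy descriptors. It assumes the usual form of the loss (the distance itself for a positive pair, and the margin minus the distance, clipped at zero, for a negative pair). The function names, the toy vectors, and the margin value of 1.0 are assumptions made for the example and do not come from DeepDesc.

```python
import math

def l2_distance(u, v):
    """Euclidean (L2) distance between two descriptor vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def contrastive_loss(desc_1, desc_2, is_positive, margin=1.0):
    """Contrastive loss for one pair of descriptors.

    Positive pairs are penalized by their distance; negative pairs are
    penalized only while they are closer than the margin.
    """
    d = l2_distance(desc_1, desc_2)
    if is_positive:
        return d
    return max(0.0, margin - d)

# A negative pair that is already far apart contributes zero loss,
# while a close negative pair is still penalized.
far_negative = contrastive_loss([0.0, 0.0], [3.0, 4.0], is_positive=False)
near_negative = contrastive_loss([0.0, 0.0], [0.3, 0.4], is_positive=False)
print(far_negative, near_negative)
```

Negative pairs whose distance already exceeds the margin yield a loss of zero and therefore no gradient, which is the kind of uninformative pair that hard negative mining is meant to filter out.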

Triplet CNN

Compared to the Siamese networks, which use two image patches as the input of the network, the triplet networks use three image patches to learn the deep descriptor for image patches. The triplet network is different from the Siamese network because it requires that the distance of the negative pair be larger than that of the positive pair during the optimization. One triplet contains three samples $I_a$, $I_p$, and $I_n$ (with $(I_a, I_p)$ in the set of positive pairs and $(I_a, I_n)$ in the set of negative pairs), where $I_a$ is the anchor image patch, $I_p$ is the positive image patch for the anchor $I_a$, and $I_n$ is the negative image patch for the anchor. That is, $I_a$ and $I_p$ are different image patches of the same physical point, and $I_n$ comes from another physical point. The triplet network forces the distance between $I_a$ and $I_p$ to be smaller and the distance between $I_a$ and $I_n$ to be larger. There are two kinds of triplet loss functions for learning convolutional embedding in the literature. One is margin ranking loss (Wang et al. 2014):

\[
l(I_a, I_p, I_n) = \max\left(0,\ \mu + d_{+} - d_{-}\right)
\tag{2}
\]

where $\mu > 0$ is the margin parameter, $d_{+} = \lVert f(I_a) - f(I_p) \rVert_2$, and $d_{-} = \lVert f(I_a) - f(I_n) \rVert_2$; the other is ratio loss (Hoffer and Ailon 2015):

\[
l(I_a, I_p, I_n) = \left( \frac{e^{d_{+}}}{e^{d_{+}} + e^{d_{-}}} \right)^{2} + \left( 1 - \frac{e^{d_{-}}}{e^{d_{+}} + e^{d_{-}}} \right)^{2}
\tag{3}
\]
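The two triplet losses in Equations 2 and 3 can be compared numerically with a minimal sketch. It takes the anchor-positive and anchor-negative distances as plain numbers. The ratio loss is written in the softmax form usually attributed to Hoffer and Ailon (2015), which is an assumption about the intended formula. The function names and the margin value of 1.0 are hypothetical, and the subtraction of the larger distance before exponentiation is only a numerical-stability detail of this sketch.

```python
import math

def margin_ranking_loss(d_pos, d_neg, margin=1.0):
    """Margin ranking loss: zero once d_neg exceeds d_pos by the margin."""
    return max(0.0, margin + d_pos - d_neg)

def ratio_loss(d_pos, d_neg):
    """Ratio loss in softmax form (assumed): pushes the softmax weight of
    the positive distance toward 0 and that of the negative toward 1."""
    shift = max(d_pos, d_neg)          # avoids overflow in exp()
    e_pos = math.exp(d_pos - shift)
    e_neg = math.exp(d_neg - shift)
    total = e_pos + e_neg
    return (e_pos / total) ** 2 + (1.0 - e_neg / total) ** 2

# A well-separated triplet gives (near) zero loss under both functions.
print(margin_ranking_loss(0.5, 2.0), ratio_loss(0.0, 10.0))
# An ambiguous triplet with equal distances is penalized by both.
print(margin_ranking_loss(1.0, 1.0), ratio_loss(1.0, 1.0))
```

Note that the margin ranking loss needs the margin to be chosen by hand, whereas the ratio loss as written here has no such parameter.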

Margin ranking loss is a convex approximation and tries to optimize the network so that $d_{-} > d_{+} + \mu$, while ratio loss aims to maximize


PHOTOGRAMMETRIC ENGINEERING & REMOTE SENSING
