
PPD: Pyramid Patch Descriptor via
Convolutional Neural Network
Jie Wan, Alper Yilmaz, and Lei Yan
Abstract
Local features play an important role in remote sensing image
matching, and handcrafted features have been extensively
used in this area for a long time. This article proposes a
pyramid convolutional neural triplet network that extracts a
128-dimensional deep descriptor that significantly improves
the matching performance. The proposed approach first extracts deep descriptors of the anchor patches and corresponding positive patches in a batch through the pyramid convolutional neural network. The approach then chooses, for each anchor patch and corresponding positive patch, the closest nonmatching patch to form the triplet sample, based on the descriptor distances among all other image patches in the batch. These triplets are used to optimize the parameters of the network using a new loss function. We evaluated the proposed deep descriptors on
two benchmark data sets (Brown and HPatches) as well as
real image data sets. The results reveal that the proposed
descriptor achieves the state-of-the-art performance on the
Brown data set and a comparatively very high performance
on the HPatches data set. The proposed approach finds
more correct matches than the classical handcrafted fea-
ture descriptors on aerial image pairs and is observed to be
robust to variations in the viewpoint and illumination.
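To make the mining step concrete, the following is a minimal sketch of in-batch hardest-negative triplet mining. It assumes L2-normalized 128-dimensional descriptors; the function names are hypothetical, and the standard triplet margin loss shown here is an illustrative stand-in, not the new loss function proposed in this article.

```python
# Sketch of in-batch hardest-negative triplet mining (illustrative only).
import numpy as np

def mine_hardest_triplets(anchors, positives):
    """anchors, positives: (n, 128) descriptor arrays; row i of each is a matching pair."""
    n = anchors.shape[0]
    # Pairwise Euclidean distances between every anchor and every positive descriptor.
    dists = np.linalg.norm(anchors[:, None, :] - positives[None, :, :], axis=2)
    pos_dist = np.diag(dists).copy()      # distance of each anchor to its true match
    np.fill_diagonal(dists, np.inf)       # exclude the matching pair itself
    neg_idx = np.argmin(dists, axis=1)    # hardest (closest) in-batch negative
    neg_dist = dists[np.arange(n), neg_idx]
    return pos_dist, neg_dist, neg_idx

def triplet_margin_loss(pos_dist, neg_dist, margin=1.0):
    # Standard triplet margin loss as a placeholder; the paper's loss differs.
    return np.maximum(0.0, margin + pos_dist - neg_dist).mean()
```

Mining the hardest negative inside each batch avoids a separate search over the whole training set and keeps the triplets informative as training progresses.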
Introduction
With the rapid development of image sensors over the past
few decades, more and more images have been used as an
effective way to extract useful information for a number of
applications. Automatic extraction, analysis, and understand-
ing of this useful information typically require extraction of
features from the image. Local points are the most widely used features in most photogrammetry and computer vision applications. Finding accurate correspondences among local point features across images refers to the task of associating points in one image with points in another image. This task is challenging when differences occur due to camera motion, time lapse, and object motion. These variations require high-quality descriptors, which are essential for most tasks, such as structure from motion (Schonberger and Frahm 2016), simultaneous localization and mapping (Mur-Artal and Tardós 2017), image registration (Bentoutou et al. 2005), image matching (Dufournaud et al. 2004), image retrieval (Zhou et al. 2017), and image scene classification (Cheng et al. 2017). A high-quality feature descriptor should be not only robust to changes in viewpoint, illumination, shading, and partial occlusion but also highly discriminative against noncorresponding features with high similarity.
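As context for how such descriptors are used, the sketch below illustrates descriptor-based point association between two images; the function name and the mutual nearest-neighbor criterion are illustrative choices, not tied to any specific method in this article.

```python
# Hypothetical sketch: associate keypoints across two images by matching
# their descriptors, keeping only matches that are mutual nearest neighbors.
import numpy as np

def mutual_nearest_matches(desc_a, desc_b):
    """desc_a: (m, d) and desc_b: (n, d) arrays of local feature descriptors."""
    dists = np.linalg.norm(desc_a[:, None, :] - desc_b[None, :, :], axis=2)
    ab = np.argmin(dists, axis=1)   # best match in B for each point in A
    ba = np.argmin(dists, axis=0)   # best match in A for each point in B
    # Keep pairs that choose each other; this rejects many ambiguous matches.
    return [(i, j) for i, j in enumerate(ab) if ba[j] == i]
```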
The past decades have witnessed the development of many
feature descriptor algorithms. Traditional handcrafted descriptor extraction algorithms obtain a descriptor by encoding the color and texture at pixels according to fixed rules. The SIFT descriptor is one of the most commonly used handcrafted descriptors; it is composed of the histograms of gradient orientations weighted by their magnitudes around the keypoint. Later, several algorithms, such as PCA-SIFT, which applies principal components analysis to the normalized gradient patch (Ke and Sukthankar 2004), and DSP-SIFT, which uses pooling of gradient orientations (Dong and Soatto 2015), have improved the performance of the classical SIFT algorithm.
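As an illustration of the gradient-histogram idea underlying SIFT, the simplified sketch below accumulates a single magnitude-weighted orientation histogram over a patch; real SIFT additionally uses a 4 × 4 spatial grid, Gaussian weighting, interpolation, and clipping.

```python
# Simplified sketch of a SIFT-style orientation histogram (not full SIFT).
import numpy as np

def orientation_histogram(patch, n_bins=8):
    """patch: 2-D grayscale array centered on the keypoint."""
    gy, gx = np.gradient(patch.astype(float))          # image gradients
    magnitude = np.hypot(gx, gy)
    orientation = np.arctan2(gy, gx) % (2 * np.pi)     # angles in [0, 2*pi)
    bins = (orientation / (2 * np.pi) * n_bins).astype(int) % n_bins
    # Each pixel votes for its orientation bin, weighted by gradient magnitude.
    hist = np.bincount(bins.ravel(), weights=magnitude.ravel(), minlength=n_bins)
    return hist / (np.linalg.norm(hist) + 1e-12)       # unit-normalized histogram
```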
SURF is another commonly used scale- and rotation-invariant descriptor that relies on integral images (Bay et al. 2008). Besides these descriptors, binary descriptors have also been popular for comparing feature points very quickly while requiring a comparatively small amount of memory. Balntas et al. (2015) presented a full inter- and intraclass optimization of binary descriptors that adaptively selects binary intensity tests online to simultaneously increase interclass and decrease intraclass distances. BinBoost uses a boosted binary hash function to compute each bit of the descriptor (Trzcinski et al. 2013).
BRIEF interprets an image patch as a binary string derived from a relatively small number of intensity difference tests and is therefore very fast to compute (Calonder et al. 2012). To address the issue of nonrigid deformations, Simo-Serra, Torras, et al. (2015) proposed the DaLI descriptor, which offers high resilience to nonrigid image transformations and illumination changes. All these methods are handcrafted descriptors driven by intuition or the researcher's expertise and are generated from low-level features, such as image gradients or binary intensity tests; hence, they inevitably suffer from information loss (Tian et al. 2017).
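To illustrate the intensity-test idea behind BRIEF-style binary descriptors, here is a minimal sketch; the random test pattern and patch size are placeholders, and the published BRIEF sampling pattern and patch smoothing differ.

```python
# Minimal sketch of a BRIEF-style binary descriptor (illustrative pattern only).
import numpy as np

rng = np.random.default_rng(0)

def make_test_pattern(patch_size=32, n_tests=256):
    # Each test compares the intensity at point p with the intensity at q.
    p = rng.integers(0, patch_size, size=(n_tests, 2))
    q = rng.integers(0, patch_size, size=(n_tests, 2))
    return p, q

def brief_descriptor(patch, pattern):
    """patch: 2-D grayscale array; pattern: (p, q) test locations. One bit per test."""
    p, q = pattern
    return (patch[p[:, 0], p[:, 1]] < patch[q[:, 0], q[:, 1]]).astype(np.uint8)

# Matching then reduces to the Hamming distance between bit strings:
# np.count_nonzero(desc_a != desc_b)
```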
Recently, deep learning, especially the convolutional neural network (CNN), has been used as an efficient approach in object detection and recognition tasks and has dramatically improved their performance (Zbontar and LeCun 2015). To extend the performance of CNNs in image recognition to feature matching, researchers need to develop tests that reveal problems with matching specific features. For example, Fischer et al. (2014) used the filters from various layers of a CNN trained on the ImageNet data set and compared their matching performance to that of standard SIFT descriptors. The authors observed that these filters provided better descriptors than SIFT for the image matching problem. This new feature was referred to as the deep feature in subsequent articles. Following their work, researchers started to use CNNs to learn descriptors for image patches.
Jie Wan is with the Beijing Key Laboratory of Spatial
Information Integration and Its Applications, School of
Earth and Space Sciences, Peking University, Beijing, China;
and the Department of Civil, Environment, and Geodetic
Engineering, Ohio State University, Columbus, OH.
Alper Yilmaz is with the Department of Civil, Environment, and
Geodetic Engineering, Ohio State University, Columbus, OH.
Lei Yan is with the Beijing Key Laboratory of Spatial
Information Integration and Its Applications, School of Earth
and Space Sciences, Peking University, Beijing, China.
(Corresponding author).
Photogrammetric Engineering & Remote Sensing
Vol. 85, No. 9, September 2019, pp. 673–686.
0099-1112/19/673–686
© 2019 American Society for Photogrammetry
and Remote Sensing
doi: 10.14358/PERS.85.9.673