
to a threshold (Moravec, 1981). The Förstner detector extracts interest points by searching for optimal windows using the auto-correlation matrix and optimizes the point locations based on a differential edge intersection approach (Förstner, 1986). The Harris detector improves on the Moravec detector by using an auto-correlation matrix to ensure better repeatability (Harris and Stephens, 1988). Traditional software packages equipped with these detectors generally used the normalized correlation coefficient as the matching metric. Various constraints have been employed to improve matching performance, such as the epipolar constraint (Gruen and Baltsavias, 1988), the triangulation constraint (Wu et al., 2012), and redundant information from multiple images (Agouris and Schenk, 1992; Zhu et al., 2010). The point matches can be refined to sub-pixel level by using least-squares matching methods (Gruen, 1985 and 2012).
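As a minimal illustration of this classical pipeline, the sketch below (Python with NumPy, an implementation choice not specified in the paper) matches a window around a reference point against candidate windows along the same row of a search image using the normalized correlation coefficient; the window size and search range are hypothetical parameters.

```python
import numpy as np

def ncc(a, b):
    """Normalized correlation coefficient between two equally sized windows."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return (a * b).sum() / denom if denom > 0 else 0.0

def match_point(ref, search, row, col, win=5, max_shift=20):
    """Find the column offset along the same row that maximizes the NCC score."""
    template = ref[row - win:row + win + 1, col - win:col + win + 1]
    best_score, best_shift = -1.0, 0
    for d in range(-max_shift, max_shift + 1):
        c = col + d
        if c - win < 0 or c + win + 1 > search.shape[1]:
            continue
        window = search[row - win:row + win + 1, c - win:c + win + 1]
        score = ncc(template, window)
        if score > best_score:
            best_score, best_shift = score, d
    return best_shift, best_score
```

In practice, the epipolar and triangulation constraints mentioned above would restrict the search range, and a least-squares matching step would refine the integer shift to sub-pixel accuracy.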
Since the milestone work of the scale invariant feature transform (SIFT) algorithm (Lowe, 1999), the use of invariant features has offered another paradigm for sparse point matching. SIFT combines a scale-invariant interest point detector and a descriptor based on the gradient distribution in the detected local regions (Lowe, 2004). SURF, which was inspired by the SIFT algorithm (Bay et al., 2008), incorporates a Hessian matrix approximation to speed up matching. Other fast corner detectors, such as SUSAN (Smith and Brady, 1997) and FAST (Rosten and Drummond, 2006), the latter of which uses machine learning to accelerate detection, have also been used widely for point matching. Despite the subsequent introduction of many new developments, SIFT has proven remarkably successful, and most subsequent improvements to it were designed for efficiency.
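For reference, a minimal sparse-matching sketch in the SIFT paradigm is given below, assuming an OpenCV build in which SIFT is available (cv2.SIFT_create); the ratio-test threshold of 0.75 follows Lowe (2004), and the image file names are placeholders.

```python
import cv2

# Load a stereo pair in grayscale (placeholder file names).
img1 = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

# Detect scale-invariant keypoints and compute gradient-based descriptors.
sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# Match descriptors and keep candidates that pass Lowe's ratio test.
matcher = cv2.BFMatcher(cv2.NORM_L2)
good = [m for m, n in matcher.knnMatch(des1, des2, k=2)
        if m.distance < 0.75 * n.distance]
print(f"{len(good)} putative sparse matches")
```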
In the past, a large number of dense matching methods have been developed to suit different contexts. For dense matching, disparity computation is the most important step. Methods used to calculate disparity can be classified into three main categories: local methods, global methods, and methods based on a combination of local and global factors.
Local methods simply select the disparity with the minimum cost value; i.e., a local "winner-take-all" optimization. These local methods feature different strategies, such as block matching and gradient-based optimization. Block matching methods calculate disparity by seeking the highest correlation between the reference image and the matching image within a region around the pixel of interest. The most significant drawback of block matching methods is their sensitivity to locally ambiguous regions in images. Gradient-based methods determine small local disparities between two images by formulating a differential equation that relates motion and image brightness. Notably, local methods share a common limitation: the uniqueness of matches is enforced only for the reference image, whereas points in the search image might be matched to multiple points (Stentoumis et al., 2014).
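A minimal winner-take-all sketch of local block matching is given below (Python with NumPy and SciPy, with hypothetical window size and disparity range); each pixel independently receives the disparity with the lowest sum-of-absolute-differences cost, which illustrates both the simplicity of local methods and the absence of any uniqueness enforcement in the search image.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def wta_block_matching(left, right, max_disp=64, win=4):
    """Winner-take-all local matching with a sum-of-absolute-differences cost."""
    h, w = left.shape
    left = left.astype(np.float32)
    right = right.astype(np.float32)
    cost = np.full((h, w, max_disp + 1), np.inf, dtype=np.float32)
    for d in range(max_disp + 1):
        # Pixel-wise absolute difference between the reference image and the
        # matching image shifted by the candidate disparity d (valid columns only).
        diff = np.abs(left[:, d:] - right[:, :w - d])
        # Aggregate the cost over a (2*win+1)^2 block around each pixel.
        cost[:, d:, d] = uniform_filter(diff, size=2 * win + 1)
    # Local "winner-take-all": each pixel keeps the disparity of minimum cost.
    return np.argmin(cost, axis=2)
```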
In contrast to local methods, global methods are formulated in an energy-minimization framework (Geiger et al., 2010; Shaobo et al., 2012; Blumenthal-Barby and Eisert, 2014). These methods aim to identify a disparity function that minimizes a global energy. Global methods incorporate different strategies, including dynamic programming, belief propagation, and graph cuts. Dynamic programming approaches (Van Meerbergen et al., 2002; Veksler, 2005) compute a minimum-cost path through the matrix of all pairwise matching costs between two corresponding scanlines. Belief propagation approaches (Sun et al., 2003; Yang et al., 2009) attempt to solve the maximum a posteriori Markov Random Field problem using global optimization. The graph cut method (Boykov et al., 2001) is based on the max-flow algorithm from graph theory.
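To make the dynamic programming idea concrete, the sketch below (Python/NumPy, an illustrative formulation rather than any specific cited method) fills the table for the minimum-cost alignment of two corresponding scanlines, where the occlusion penalty is a hypothetical parameter; backtracking through the table would yield the per-pixel disparities.

```python
import numpy as np

def scanline_dp(left_row, right_row, occlusion_cost=10.0):
    """Minimum-cost alignment of two scanlines by dynamic programming."""
    n, m = len(left_row), len(right_row)
    # Pairwise matching cost between every pixel pair on the two scanlines.
    cost = np.abs(left_row[:, None].astype(np.float32) -
                  right_row[None, :].astype(np.float32))
    dp = np.zeros((n + 1, m + 1), dtype=np.float32)
    dp[:, 0] = np.arange(n + 1) * occlusion_cost
    dp[0, :] = np.arange(m + 1) * occlusion_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i, j] = min(dp[i - 1, j - 1] + cost[i - 1, j - 1],  # match
                           dp[i - 1, j] + occlusion_cost,          # pixel occluded in right
                           dp[i, j - 1] + occlusion_cost)          # pixel occluded in left
    return dp  # dp[n, m] is the minimum total cost along this scanline
```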
Local methods are considered more straightforward and efficient for real-time applications and large dataset manipulation. However, they are inferior to global methods in terms of accuracy and robustness. The elaborate models used by global methods to describe the matching process tend to increase computational loads. Therefore, combinations of local and global methods, which retain the strengths while discarding the weaknesses of each technique, are now considered the most successful option for image matching. Semi-global matching (SGM) (Hirschmuller, 2011) is the most widely used of these combination methods.
SGM was initially proposed to handle complex image matching and uses mutual information (MI) as the cost function (Viola and Wells III, 1997). MI depends on the individual and joint entropies of the two images. SGM uses a global cost function with a smoothness constraint that penalizes discontinuities. In contrast to the original version, the latest cost function separates the penalty into two parts: one for disparity changes of a single pixel, and one for larger discontinuities (Hirschmuller, 2011), as shown in Equation 1.
E(D) = \sum_{p} \left( C(p, D_p) + \sum_{q \in N_p} P_1 \, T\left[ \left| D_p - D_q \right| = 1 \right] + \sum_{q \in N_p} P_2 \, T\left[ \left| D_p - D_q \right| > 1 \right] \right)  (1)

where C(p, D_p) is the matching cost of pixel p at disparity D_p, N_p is the neighborhood of p, T[\cdot] equals 1 when its condition holds and 0 otherwise, and P_1 and P_2 are the penalties for disparity discontinuities of one pixel and of more than one pixel, respectively.
This cost function performs well in flat or radiometrically uniform areas, or even in areas with lower building densities. However, the cost terms C(p, D_p) may decrease while the penalty terms \sum_{q \in N_p} P_2 \, T\left[ \left| D_p - D_q \right| > 1 \right] increase in areas with high densities of buildings. As a result, the matching performance in these areas may not be as good as expected (with reference to the experimental analysis in this paper).
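As a sketch of how Equation 1 is evaluated, the Python/NumPy snippet below computes E(D) for a given disparity image D and a precomputed cost volume C, using a 4-connected neighborhood and hypothetical penalty values; it illustrates the energy term only, not the semi-global path-wise aggregation that SGM uses to approximately minimize it.

```python
import numpy as np

def sgm_energy(cost_volume, disparity, p1=1.0, p2=8.0):
    """Evaluate Equation 1: data cost plus P1/P2 smoothness penalties.

    cost_volume : array of shape (H, W, D) holding C(p, d) for every pixel
                  and candidate disparity (e.g., MI-based costs in SGM).
    disparity   : integer disparity image of shape (H, W).
    """
    h, w = disparity.shape
    rows, cols = np.mgrid[0:h, 0:w]
    # Data term: C(p, D_p) summed over all pixels p.
    data_term = cost_volume[rows, cols, disparity].sum()

    # Smoothness term over a 4-connected neighborhood N_p. Each neighboring
    # pair is counted once here; summing over both directions, as Equation 1
    # does, would simply double this term.
    smooth_term = 0.0
    for axis in (0, 1):
        diff = np.abs(np.diff(disparity.astype(np.int32), axis=axis))
        smooth_term += p1 * np.count_nonzero(diff == 1)  # P1 for |D_p - D_q| = 1
        smooth_term += p2 * np.count_nonzero(diff > 1)   # P2 for |D_p - D_q| > 1
    return data_term + smooth_term
```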
In the image matching methods described above, matching difficulties in urban areas, which are mainly caused by occlusions from buildings, have rarely been considered. In most image matching methods, areas of occlusion are addressed by detection after matching. Some research studies have compensated for occlusion problems by using hard constraints, such as the uniqueness constraint (Kolmogorov and Zabih, 2001) and the ordering constraint (Silva and Santos-Victor, 2000). The uniqueness constraint enforces one-to-one mapping between pixels in two images, while the ordering constraint preserves order along scanlines. However, the abilities of these constraints to solve occlusion issues are limited. The computer vision community has also used image segmentation to estimate disparities in close-range images (Bleyer and Gelautz, 2005; Bleyer et al., 2011). The resulting disparity maps are normally interpolated from the matching results and are relatively smooth. However, images of urban areas are very complicated. Disparities vary drastically between buildings, and may not be smooth even within a single building. Therefore, a direct disparity interpolation is not rigorous.
This paper proposes an integrated image matching and segmentation approach. By integrating image matching with image segmentation, objects in images with significantly different disparities are handled separately. The image segmentation results are used for occlusion filtering and similarity measurements to provide better matching results for 3D surface reconstruction in urban areas.
Integrated Image Matching and Segmentation Approach
The integrated image matching and segmentation approach, named SATM+, is based on our existing self-adaptive triangulation-constrained matching (SATM) framework (Wu et al., 2011, 2012). In the SATM framework, the first step is to identify several robust seed points on stereo images, which are used to