disparity map with dataset 65246. However, the results differ for the other dataset; e.g., the JR result shown in the last column of Figure 8b looks overly smoothed and is less complete than the other two (although this does not mean it is sparse). Please also note that both disparity maps from JPL (see the second column of Figure 8) contain a few spikes, which have been removed for visualization.
Given the error bounds calculated from the manual measurements, the matching scores and rewarding scores of each tie-point are evaluated, and the results are shown in Figure 9. The matching scores of the three algorithms are generally similar when they can define a tie-point, but no score is awarded when an algorithm fails to define one, e.g., see the JPL matching scores of ID 23 and 49 in Figure 9a. The rewarding score of UCL's disparity map is generally lower than the other two for dataset 65246 (see Figure 9c). However, it improves on the other dataset, which contains more depth discontinuities.
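As a rough illustration of this per-tie-point evaluation (not the exact score functions defined earlier in the paper), the sketch below assumes a matching score that decays with the deviation of an algorithm's disparity from the manual mean, normalized by the spread of the manual measurements; the function and variable names are hypothetical.

```python
# Hypothetical per-tie-point scoring sketch; the paper's exact score functions
# are defined elsewhere, so this only illustrates how the statistics of the
# manual measurements can bound an algorithm's disparity error.
import numpy as np

def score_tie_point(manual_disps, algo_disp, max_score=1.0):
    """Score one tie-point from repeated manual disparity measurements.

    manual_disps : disparities measured by several operators.
    algo_disp    : disparity from the stereo algorithm, or None if the
                   algorithm failed to define this tie-point.
    """
    if algo_disp is None:
        return 0.0                       # no score for an undefined tie-point
    mean = np.mean(manual_disps)         # consensus disparity
    sigma = np.std(manual_disps) + 1e-6  # measurement spread -> error bound
    error = abs(algo_disp - mean) / sigma
    # Full score inside the error bound, decaying smoothly outside it.
    return max_score * np.exp(-0.5 * error**2)

# Example: three manual measurements and one algorithmic estimate.
print(score_tie_point([12.4, 12.6, 12.5], 12.7))
```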
The total scores were calculated using an equal weight of the matching scores and rewarding scores, and the results are summarized in Table 2, where the best score for each dataset is labeled in bold font. We can observe that for dataset 65246, JR's stereo matching pipeline produced the best result over the overall area. To understand this result clearly, it is worth mentioning that the total score (TS) defined in Equation 8 has been designed to award higher scores if a disparity map defines all queried tie-points; in other words, no score is given if there is no corresponding tie-point in a disparity map. Thus, this metric generally favors a dense and smooth disparity map, which we believe is why JR's results perform best on both test datasets.
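A minimal sketch of this aggregation, assuming Equation 8 combines per-tie-point matching and rewarding scores with equal weight (α = 0.5) and contributes zero for any tie-point the disparity map fails to define; the exact form of Equation 8 is given earlier in the paper and may differ from this simplification.

```python
# Hypothetical TS aggregation sketch (not the paper's exact Equation 8):
# equal-weight combination of matching and rewarding scores, with undefined
# tie-points (None) counted as zero so that sparse maps are penalized.
def total_score(matching_scores, rewarding_scores, alpha=0.5):
    """matching_scores / rewarding_scores: per-tie-point scores, None = undefined."""
    ms = sum(s if s is not None else 0.0 for s in matching_scores) / len(matching_scores)
    rs = sum(s if s is not None else 0.0 for s in rewarding_scores) / len(rewarding_scores)
    return alpha * ms + (1.0 - alpha) * rs
```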
To give more weight to the accuracy of an algorithm, we modified Equation 8 so that it does not penalize an algorithm for failing to define a queried tie-point in a disparity map, and we call this score TS-B. The TS-B results for both datasets are presented in Table 3.
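Under the same assumptions as the TS sketch above, TS-B can be illustrated by excluding undefined tie-points from the averages instead of counting them as zero.

```python
# Hypothetical TS-B sketch: identical weighting to TS above, but undefined
# tie-points are skipped rather than penalized, so the metric reflects
# accuracy only where a match exists.
def total_score_b(matching_scores, rewarding_scores, alpha=0.5):
    ms = [s for s in matching_scores if s is not None]
    rs = [s for s in rewarding_scores if s is not None]
    ms_avg = sum(ms) / len(ms) if ms else 0.0
    rs_avg = sum(rs) / len(rs) if rs else 0.0
    return alpha * ms_avg + (1.0 - alpha) * rs_avg
```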
We also introduce a new term, MFR, representing the Matching Failure Rate. MFR can be used as an indicator of either the incompleteness of a disparity map or how conservative the algorithm is. As shown in Table 2, JPL's results have a higher MFR, but when matching failures are not counted (i.e., using TS-B), JPL's pipeline produced the best result on dataset 65246. For dataset 70000, JPL's pipeline achieves the second-best score, while UCL's processing pipeline produces the best accuracy.
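A minimal sketch of MFR, assuming it is simply the percentage of queried tie-points for which a disparity map provides no match; the function name is hypothetical.

```python
# Hypothetical MFR (Matching Failure Rate) sketch: the fraction of queried
# tie-points without a disparity value, expressed in percent.
def matching_failure_rate(disparities):
    """disparities: per-tie-point values from the disparity map, None = no match."""
    failures = sum(1 for d in disparities if d is None)
    return 100.0 * failures / len(disparities)
```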
Discussion and Conclusions
In this paper, we introduced an accuracy evaluation method to assess stereo matching results. The main motivation of this work is to provide a straightforward method that can be applied to stereo matching evaluation for planetary rover missions, where it is currently impossible to obtain ground truth data.
We have demonstrated the use of a generic portable stereo workstation, including a stereo cursor from the open source StereoWS tool, to produce visually correct tie-points of a stereo pair, i.e., manual tie-points, with the help of a softcopy stereo display. The manual tie-points from stereo measurements are not identical across repeated measurements, but our assumption is that the variation of multiple measurements can be used to estimate the confidence of a tie-point, and these confidence values can quantitatively evaluate the quality of disparity maps from different algorithms. Based on this idea, we have defined useful evaluation metrics using the statistics of multiple measurements (such as the mean and variance), as sketched below. We also define three types of tie-points to test performance in highly textured, textureless, and occluded regions. Performance in textureless regions is of particular interest for DTM construction from orbital imagery, but this is left for future work. Type (b) tie-points are related to scene occlusion. At the moment, we populate these points manually, but it is also possible to design a semi-automatic pipeline to collect them, e.g., detecting one tie-point with a conventional feature detector and manually selecting an adjacent feature in the background.
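A minimal sketch of this confidence idea, assuming each tie-point is measured several times and summarized by its per-axis mean and variance, with a hypothetical 3-sigma error bound; the names and the specific bound are illustrative, not the paper's exact definitions.

```python
# Hypothetical sketch: repeated manual measurements of the same tie-point are
# summarized by their mean and variance, and the spread defines an error bound
# against which algorithmic disparities can be judged.
import numpy as np

def tie_point_statistics(measurements):
    """measurements: (N, 2) array of image coordinates from N manual picks."""
    pts = np.asarray(measurements, dtype=float)
    mean = pts.mean(axis=0)            # consensus position of the tie-point
    var = pts.var(axis=0)              # spread of the manual picks
    error_bound = 3.0 * np.sqrt(var)   # e.g., a 3-sigma bound per axis
    return mean, var, error_bound

# Example: five operators picking the same feature in the right image.
mean, var, bound = tie_point_statistics(
    [(101.2, 55.0), (101.5, 54.8), (100.9, 55.3), (101.1, 55.1), (101.3, 54.9)])
print(mean, bound)
```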
It is worth noting that in these experiments the number of tie-points, particularly around depth discontinuities, may not be sufficient in some cases. It would have been better to add more tie-points; however, we erred on the side of designing an experiment that could be accomplished by a group of ten "citizen scientists" within a limited time period (one week). Other comparison measures, e.g., disparity density or 3D accuracy, could also be employed in future experiments to improve the final matching score.
During the evaluation work, we implemented an open source stereo workstation with an integrated stereo matching method that is used to produce the UCL results shown in the evaluation. We have published the Java code of the Stereo
Figure 9. (a, b) Individual matching scores of the processing results of the two datasets; and (c, d) rewarding scores from 10 tie-point pairs in the two datasets.
Table 2. Total score (TS) estimated from Equation 8 with α = 0.5.

                     Dataset 65246              Dataset 70000
                   UCL     JPL     JR         UCL     JPL     JR
Matching score    63.96   61.26   64.16      50.45   45.15   57.01
MFR (%)            0.00    3.50    0.00      16.10   26.80   10.70
Rewarding score   50.11   61.87   67.15      43.07   31.05   44.64
MFR (%)            0.00   10.00    0.00      10.00   30.00    0.00
TS                55.65   61.63   65.95      46.02   36.69   49.59
Table 3. Total Score B (TS-B), which is similar to TS but removes the effect of missing tie-points.

                     Dataset 65246              Dataset 70000
                   UCL     JPL     JR         UCL     JPL     JR
Matching score    63.96   63.45   64.16      60.11   61.67   63.86
Rewarding score   50.11   68.75   67.15      47.85   44.35   44.64
TS-B              55.65   66.63   65.95      52.75   51.28   52.33