
the Facet Errors. Only 8 out of 27 cross-scene projection cases yield worse results
than training on the same test area.
As mentioned earlier, additional modalities play an important role in prediction accuracy. We start with image-based attributes. In some cases, they were pivotal in obtaining better results for geometric errors (FIB, BIB) as well as for topological ones (FUS, FIT). These features are highly informative when trained over Elancourt (FIB and FUS) and transfer very well to other scenes (FIB, FUS, BIB, and FIT) (Table 4). On the other hand, as expected, geometric features alone are best for topological errors when trained on dense areas, especially BIT (Table 4). Finally, although standing out for FIG to a minor extent (cf. Table 4), height-based features proved to be less transferable. In fact, adding height-based features leads, in most cases, to a small decrease in accuracy (≈2%) for atomic errors. All these findings further justify why we did not leave out any modality: modalities are more frequently critical for transferability than the ablation study suggests (Table 3).
An analysis can also be drawn for atomic errors with respect to the best training scene. We can see that for BOS, training on a dense urban scene like Nantes is the best solution, as it is for topology errors (FIT and BIT). Paris-13 also represents a dense downtown scene, but with even more diverse building types. This is instrumental in achieving transferability for BUS and BIB. Conversely, Elancourt offers more heterogeneity at the LoD-2 level. As a consequence, it is the best training zone for FUS, FIB, and FIG. Finally, as one might expect, FOS learning is evenly transferable, as this error is well detected when training on any scene.
Generalization Study
We try to find out how omitting one urban zone from the
training dataset affects the test results on that same area. An-
other way to look at it is, from an operational point of view,
to find out how much learning on a union of many urban
scenes is helpful when testing on an unseen one. We also seek
to confirm the outcome of the transferability experiments.
Experiments that merge all zones except Z_i (i.e., ∪_{j≠i} Z_j) for training and test on Z_i are noted by the couple (∪_{j≠i} Z_j, Z_i) or by ∪_{j≠i} Z_j → Z_i. There are three possibilities: Elancourt ∪ Nantes → Paris-13, Paris-13 ∪ Nantes → Elancourt, and Paris-13 ∪ Elancourt → Nantes. The F-score evolution per experiment and error is depicted in Figure 12.
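To make the protocol concrete, the following sketch illustrates how such leave-one-zone-out experiments could be run. It is only illustrative: the feature matrix X, the atomic error labels y, the zone identifiers, the function name, and the random forest classifier are assumptions standing in for the actual features and model described earlier in the paper.

# Illustrative sketch (not the paper's code): leave-one-zone-out training.
# X: per-building feature matrix, y: atomic error labels,
# zones: array of zone names ("Elancourt", "Nantes", "Paris-13") per sample.
import numpy as np
from sklearn.ensemble import RandomForestClassifier  # stand-in classifier
from sklearn.metrics import f1_score

def leave_one_zone_out(X, y, zones):
    """Train on the union of all zones except Z_i, then test on Z_i."""
    scores = {}
    for held_out in np.unique(zones):
        train = zones != held_out              # union of the remaining zones
        clf = RandomForestClassifier(n_estimators=100, random_state=0)
        clf.fit(X[train], y[train])
        y_pred = clf.predict(X[~train])
        # Per-label F-scores on the unseen zone (cf. Figure 12)
        scores[held_out] = f1_score(y[~train], y_pred, average=None)
    return scores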
We compare these experiments with the ones trained and tested on the same area (cf. Table 4). We analyze results according to the same criteria as in the transferability study.
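These criteria are based on the F-score, here taken as the standard F1 measure, i.e., the harmonic mean of precision and recall:

F = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}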
We start again with a comparison by error family. Out of the 11 possibilities for the Building Errors family, eight yield worse results. For the Facet Errors family, 6 out of 13 comparisons exhibit the same trend. Proportionally, this is worse than the transferability comparisons. This results from the fact that fusing more datasets that are not tailored to detecting a specific error does not help alleviate the problem. It only evens out the best (resp. worst) performances by including the best (resp. worst) urban scene in the training set.
Similarly to the previous study, image and height modalities play a major role in error detection. Image-based features are crucial for FIB, BIB, FUS, and BUS detection (Table 4). Height-based attributes, however, induce a larger improvement in predicting FIG and BIT, while geometric ones are relegated to a minor role. One curiosity can also be noticed: only when fusing all modalities together, in Paris-13 and Nantes, do predictions improve for BOS.
We also confirm the observations about the best urban scene for error prediction training. In this setting, leaving out the best zone should always yield the worst scores. This is mostly the case for all atomic errors, with the exception of BIT. This outlier can be explained by the resemblance of the Paris-13 3D model samples to those of Nantes (which was established to be the best training zone). Indeed, for most labels, Nantes and Paris-13 reach the same scores. However, the discrepancy in F-scores proves the added value of each dataset.
Representativeness Study
The objective is to find out, after merging all training samples from all datasets, what the minimal amount of data is that can guarantee stable predictions. We can, thereafter, understand how much learning on one scene typology can affect results compared to a mixed training set. Figure 13 depicts the F-score as a function of training set ratios (between 20% and 70%) for the atomic errors. Overall, it shows a relative stability of the F-score. This indicates that a heterogeneous dataset is not detrimental to prediction quality and can even be the most suitable solution.
BOS, FOS, and FIG have a standard deviation under 2%, as opposed to FIB, BIT, and FIT. Indeed, the latter have a large variance, and even a standard deviation larger than the mean value. Scalability is hence ensured with a limited training set. No standard logarithmic behavior can be found at the studied scales. 20% of the full label set is sufficient to retrieve results with a performance similar to the initial ablation study. The best results are observed for BOS, BUS, and FUS. These errors are topological defects of building roof facets, which require a high diversity of training samples for their detection. More sophisticated features are, however, still required to help predict less frequent and more semantic labels.
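As an illustration of the representativeness protocol, the sketch below subsamples the merged training set at increasing ratios and records the spread of F-scores over repeated random draws. Names, the fixed test split, and the classifier are placeholders rather than the exact implementation used here.

# Illustrative sketch (not the paper's code): F-score stability vs. training ratio.
# X, y: merged features and labels from all zones; a held-out test split is kept fixed.
import numpy as np
from sklearn.ensemble import RandomForestClassifier  # stand-in classifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

def representativeness_curve(X, y, ratios=(0.2, 0.3, 0.4, 0.5, 0.6, 0.7),
                             n_repeats=5, seed=0):
    """Mean and standard deviation of the F-score for each training ratio."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=seed)
    rng = np.random.RandomState(seed)
    curve = {}
    for ratio in ratios:
        runs = []
        for _ in range(n_repeats):
            # Random subset of the merged training pool at the given ratio
            idx = rng.choice(len(X_tr), size=int(ratio * len(X_tr)), replace=False)
            clf = RandomForestClassifier(n_estimators=100)
            clf.fit(X_tr[idx], y_tr[idx])
            runs.append(f1_score(y_te, clf.predict(X_te), average="macro"))
        curve[ratio] = (np.mean(runs), np.std(runs))  # cf. Figure 13
    return curve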
Finesse Study
In this section, we reproduce the experimental settings described in the previous two sections. This time, the finesse level is fixed at 2. The goal is to find out how good, transferable, and stable the model quality predictions are at the semantic level of error families (i.e., Building Errors vs. Facet Errors).
Error Family Detection
We start with the ablation study. Table 5 reveals that inserting more remote sensing modalities does not change the prediction
Figure 12. Mean F-score and standard deviation for the
generalization study per test zone.
Figure 13. Mean F-score and standard deviation for the representativeness experiments depending on the training set size.