PE&RS May 2015

For the case of two processors, one processor is used as a master for
receiving and writing the resulting data, and the other is used as a slave
to read and process the image data. The execution time for two processors
is slightly shorter than that for one processor, except for the case of
de-correlation on the Windows computer. As the number of processors
increases, the execution times decrease. When the number of processors is
increased from two to eight, the decreases in processing times are
significant. For example, in the case of the BR fusion algorithm on the
Windows computer, the time for two processors is 660 sec, while the time
for eight processors is 119 sec. In this range, parallel computing achieves
nearly linear speedup for the selected algorithms. When the number of
processors is greater than eight, the decrease in processing time is
gradual, or no further decrease occurs. For instance, for de-correlation on
the Linux computer, the time for ten processors is 185 sec, while the time
for sixteen processors is 122 sec, a decrease of 63 sec. In the cases of
the filtering and RPC-based correction on the Windows computer, further
decreases in time do not occur.
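The two quoted examples can be compared directly with a few lines of Python. This is only a sketch of the arithmetic: the times are the ones quoted in the text, while the function name and dictionary layout are illustrative choices. The speedups reported elsewhere in the paper are relative to one processor, whose times are not quoted in this passage, so the ratios below are between the two quoted runs only.

```python
# Times quoted in the text, in seconds, keyed by number of processors.
br_fusion_windows = {2: 660.0, 8: 119.0}
decorrelation_linux = {10: 185.0, 16: 122.0}


def relative_gain(times, p_low, p_high):
    """Return (time ratio, absolute decrease in seconds) between two runs."""
    return times[p_low] / times[p_high], times[p_low] - times[p_high]


br_ratio, br_saved = relative_gain(br_fusion_windows, 2, 8)
dc_ratio, dc_saved = relative_gain(decorrelation_linux, 10, 16)

print(f"BR fusion, 2 -> 8 processors: {br_ratio:.2f}x faster, {br_saved:.0f} sec saved")
print(f"De-correlation, 10 -> 16 processors: {dc_ratio:.2f}x faster, {dc_saved:.0f} sec saved")
```

Going from two to eight processors cuts the BR fusion time by a factor of about 5.5, whereas going from ten to sixteen processors cuts the de-correlation time by a factor of about 1.5.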

Factors Determining Parallel Performance

The differences in speedup values for the different algorithms are mainly
determined by the following factors.

First, the amount of required computation per image size is one of the main
factors that determine the effectiveness of parallel computing. In other
words, for the same amount of disk input/output (I/O), the larger the
amount of required computation, the higher the maximum speedup achieved.
This is evidenced by the fact that the maximum speedup values differ when
the amount of disk I/O is the same. For instance, in these experiments, the
amount of disk I/O for de-correlation is the same as that for the RPC-based
correction, but the maximum speedups are different: 7.1× and 7.9× on the
Windows and Linux computers, respectively, for de-correlation, as opposed
to 4.8× and 7.3× on the Windows and Linux computers, respectively, for the
RPC-based correction. This result occurs because the computational cost of
de-correlation is larger than that of the RPC-based correction in these
experiments. Another example is the DEM extraction algorithm, which
achieves a speedup as high as 13.6× because of the very large amount of
computation it requires. Hence, for the same amount of disk I/O, the
maximum speedup is determined by the amount of computation required.

Second, the amount of data in these algorithms is relatively large. Because
disk I/O is not very efficient due to the limits of the hardware, the
reading and writing processes are time-consuming. Thus, the amount of
required disk I/O per image size is also one of the main factors that
determine the parallel performance. In this regard, the experiments in this
paper record the entire time required to finish the full procedure of an
algorithm, including the time for reading and writing and the time for
computing, instead of only the time for computing. For some cases (i.e.,
RPC-based correction and mosaics on the Windows computer and mosaics on the
Linux computer), the maximum speedup is not improved very much, even though
more processors are used. These experimental results imply that the large
amount of disk I/O is the main factor that prevents greater improvement of
the speedup for those algorithms. Theoretically, the time required to read
the source images and write the resulting images directly, without any
computation, is a hard lower limit on the runtime, which is most closely
approached when unlimited processors are utilized. The best case is
therefore that the entire time required to finish the complete operation is
equal or close to this limit. Because the algorithms in this paper are all
data-intensive operations, the best parallel computing strategy for
achieving high speed and efficiency is one in which the computation
operations and the disk I/O operations occur simultaneously. In other
words, overlap between the computational time and the disk I/O time results
in the greatest decrease of the overall runtime. The strategies described
in The Method of Parallel Processing Section are supported by the fact that
the shortest times indicated in Table 4 and Table 5 are close to the
theoretical limit. In this sense, the parallel strategy is optimal.
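The role of the I/O limit can be illustrated with a deliberately simple runtime model. This is a sketch under stated assumptions, not the authors' analysis: it assumes that computation divides evenly among the processors not acting as master, that computation overlaps disk I/O perfectly, and that a single processor cannot overlap the two at all. The function name and the numbers (100 sec of I/O, 600 sec of computation) are hypothetical and are not taken from Table 4 or Table 5.

```python
def modeled_runtime(t_io, t_compute, processors):
    """Idealized runtime in seconds when computation fully overlaps disk I/O.

    Assumptions (not from the paper): one processor acts as master, so
    processors - 1 workers share the computation evenly, and the disk I/O
    time t_io cannot be reduced by adding processors.
    """
    if processors < 1:
        raise ValueError("processors must be at least 1")
    if processors == 1:
        # No overlap is possible: read, compute, and write in sequence.
        return t_io + t_compute
    workers = processors - 1
    # With perfect overlap, the slower of the two activities dominates.
    return max(t_io, t_compute / workers)


# Hypothetical workload: 100 sec of disk I/O and 600 sec of computation.
T_IO, T_COMPUTE = 100.0, 600.0
serial = modeled_runtime(T_IO, T_COMPUTE, 1)
for p in (2, 4, 8, 16, 24):
    t = modeled_runtime(T_IO, T_COMPUTE, p)
    print(f"{p:2d} processors: {t:6.1f} sec, speedup {serial / t:.2f}x")

# In this model the speedup can never exceed (t_io + t_compute) / t_io.
max_speedup = serial / T_IO
```

Under these assumptions the runtime stops improving once the per-worker computation time drops below the I/O time, which mirrors the plateau described above: a workload with more computation per unit of I/O has a higher ceiling.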

Third, the computing platform’s performance also determines the execution
times and the effectiveness of parallel computing. The performance of a
multi-core computer depends on three factors: the computing power of one
CPU core, the number of cores, and disk I/O performance. Generally
speaking, if the number of cores used is less than the number that achieves
the maximum speedup, then the more powerful a single CPU core is, the
shorter the execution time. For instance, when four cores are used, the
time required for BR fusion on the Windows computer is shorter than that on
the Linux computer. The reason is that a single core in the Windows
computer is more powerful than one in the Linux computer. Thus, a smaller
number of powerful cores is equivalent to a larger number of weak cores in
terms of execution performance. If the amount of required computation is
large enough, as in the DEM extractions, a larger number of cores and
higher disk I/O performance are preferable. The speedup is 13.6× for DEM
extractions on the Windows computer when 24 processors are used. In any
case, high disk I/O performance means that less time is required for
reading and writing.

In these experiments, once the size of the available RAM in a computer is
above a threshold, RAM is not a bottleneck affecting the parallel
performance. Assuming there are k cores in a computer, the threshold of
required memory is the size supporting the concurrent processing of (k - 1)
blocks. From the Parallel Processing Mechanism Section, we know that memory
consumption in the adopted parallel mechanism is small. An ordinary
personal computer can usually meet the requirements. Although the available
RAM in the employed computers is relatively large (48 GB and 12 GB,
respectively), as configured at the time of purchase, the amount of RAM is
not a factor that impacts parallel performance when a readily available
multi-core computer is used.
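The (k - 1)-block threshold can be made concrete with a small calculation. The sketch below is illustrative only: the block dimensions, band count, and sample size are hypothetical values chosen for the example, and the actual working memory per block in the paper's mechanism is not stated in this section.

```python
def ram_threshold_bytes(cores, block_bytes):
    """Memory needed to process (cores - 1) blocks concurrently."""
    if cores < 2:
        raise ValueError("at least two cores are needed (one master, one worker)")
    return (cores - 1) * block_bytes


# Hypothetical block: 4096 x 4096 pixels, 4 bands, 2 bytes per sample.
block_bytes = 4096 * 4096 * 4 * 2

for k in (4, 8, 24):
    gib = ram_threshold_bytes(k, block_bytes) / 2**30
    print(f"k = {k:2d} cores: about {gib:.2f} GiB for {k - 1} concurrent blocks")
```

With these assumed block dimensions, even 24 cores need under 3 GiB, which is consistent with the statement that an ordinary personal computer can meet the requirement.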

Comparison with Other Multi-core Parallel Methods

In the experimental results reported by Remon et al. (2011), no significant
improvement is observed for the multi-core version over the optimized OSP
(orthogonal subspace projection) using one core, and the multi-core version
of N-FINDR with eight cores achieves a 3.1× speedup (15.001 sec using one
core, 4.879 sec using eight cores). By comparison, the speedup values in
this paper range from 3.7× to 6.6× when eight processors are used. The
results on a multi-core system with 12 physical cores presented by Bernabe
et al. (2013) show that there is no significant difference between using 4
or 12 cores (Figure 11 in the referenced paper), because the parallel
implementation is not optimized to increase its scalability to a high
number of cores. Our method achieves high scalability, as evidenced by the
results in Table 4 and Table 5.

The higher performance is achieved through two improvements. First,
flexible and optimal parallel strategies adapted for remote sensing image
processing can be embedded in the method by means of MPI. In contrast, the
parallel computing in the literature (Bernabe et al., 2013; Remon et al.,
2011), which is built on OpenMP and the basic linear algebra subprograms
(BLAS) and linear algebra package (LAPACK) libraries supported by the
compilers, exploits multi-threaded linear algebra subprograms and
parallelized loops. Second, the latest computers containing more cores are
used in these experiments. Therefore, the scalability of the adopted
parallel mechanism properly matches the newest hardware.
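The master/slave arrangement described earlier in this section, in which slaves read and process image data while the master receives and writes the results, can be sketched in a few lines. This is not the authors' MPI implementation: threads and queues from the Python standard library stand in for MPI processes and messages, and the in-memory list `source`, the function `process_block`, and all other names are hypothetical placeholders.

```python
import queue
import threading


def process_block(block):
    """Placeholder per-block computation (hypothetical): double each value."""
    return [2 * v for v in block]


def worker(source, tasks, results):
    """Slave role: read a block, process it, and send the result to the master."""
    while True:
        index = tasks.get()
        if index is None:            # sentinel: no more blocks to process
            break
        block = source[index]        # stands in for reading a block from disk
        results.put((index, process_block(block)))


def run_master(source, n_workers):
    """Master role: hand out block indices, then receive and 'write' results."""
    tasks, results = queue.Queue(), queue.Queue()
    threads = [threading.Thread(target=worker, args=(source, tasks, results))
               for _ in range(n_workers)]
    for t in threads:
        t.start()
    for index in range(len(source)):
        tasks.put(index)
    for _ in threads:
        tasks.put(None)              # one sentinel per worker
    output = [None] * len(source)
    for _ in range(len(source)):
        index, data = results.get()
        output[index] = data         # stands in for writing the block to disk
    for t in threads:
        t.join()
    return output


result = run_master([[1, 2], [3], [4, 5, 6]], n_workers=3)
print(result)
```

Because the master only collects and writes while the workers read and compute, writing one block can proceed while other blocks are still being processed, which is the kind of overlap between computation and disk I/O that the section identifies as the source of the speedup.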

