For the case of two processors, one processor is used as a
master for receiving and writing the resulting data, and the
other is used as a slave to read and process the image data.
The execution time for two processors is slightly shorter than
that for one processor, except for the case of de-correlation on
the Windows computer. With an increase in the number of
processors, the execution times decrease. When the number of
processors is increased from two to eight, the decreases in processing times are significant. For example, in the case of the BR fusion algorithm on the Windows computer, the time for two processors is 660 sec, while the time for eight processors is 119 sec. In this range, parallel computing achieves nearly
linear speedup for the selected algorithms. When the number
of processors is greater than eight, the decrease in processing time is gradual, or a decrease in processing time does not
occur. For instance, for de-correlation on the Linux computer,
the time for ten processors is 185 sec, while the time for
sixteen processors is 122 sec. The decrease in time is 63 sec.
In the cases of the filtering and RPC-based correction on the
Windows computer, further decreases in time do not occur.
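A minimal sketch of this master-slave arrangement, written with the mpi4py bindings, is shown below; the tags, block count, and process_block function are illustrative assumptions rather than the authors' implementation:

from mpi4py import MPI

TAG_WORK, TAG_RESULT, TAG_STOP = 1, 2, 3

def process_block(block_id):
    # Placeholder for reading and processing one image block.
    return block_id

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()
n_blocks = 16  # illustrative number of image blocks

if rank == 0:
    # Master: hand out block indices, receive and write results.
    next_block = 0
    active = size - 1
    for slave in range(1, size):
        if next_block < n_blocks:
            comm.send(next_block, dest=slave, tag=TAG_WORK)
            next_block += 1
        else:
            comm.send(None, dest=slave, tag=TAG_STOP)
            active -= 1
    status = MPI.Status()
    while active > 0:
        result = comm.recv(source=MPI.ANY_SOURCE, tag=TAG_RESULT, status=status)
        # ... write `result` into the output image here ...
        if next_block < n_blocks:
            comm.send(next_block, dest=status.Get_source(), tag=TAG_WORK)
            next_block += 1
        else:
            comm.send(None, dest=status.Get_source(), tag=TAG_STOP)
            active -= 1
else:
    # Slave: read and process blocks until the master says stop.
    status = MPI.Status()
    while True:
        block_id = comm.recv(source=0, tag=MPI.ANY_TAG, status=status)
        if status.Get_tag() == TAG_STOP:
            break
        comm.send(process_block(block_id), dest=0, tag=TAG_RESULT)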
Factors Determining Parallel Performance
The differences in speedup values for the different algorithms
are mainly determined by the following factors.
First, the amount of required computation per image size is one of the main factors that determines the effectiveness of parallel computing. In other words, under conditions of the same amount of disk input/output (I/O), the larger the amount of required computation, the higher the achieved maximum speedup. This is evidenced by the fact that the maximum speedup values differ when the amount of disk I/O is the same. For instance, in these experiments, the amount of disk I/O for de-correlation is the same as that for the RPC-based correction, but the maximum speedups are different: 7.1× and 7.9× on the computers with Windows OS and Linux OS, respectively, for the de-correlation, as opposed to 4.8× and 7.3× on the computers with Windows OS and Linux OS, respectively, for the RPC-based correction. This result occurs because the computational cost of de-correlation is larger than that of the RPC-based correction in these experiments. Another example is that the DEM extraction algorithm achieves a speedup as high as 13.6× because of its very large amount of required computation. Hence, under conditions of the same amount of disk I/O, the maximum speedup is determined by the amount of computation required.
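This relationship can be made explicit with an Amdahl-style bound (our illustration, not a formula from the paper): if the disk I/O time $t_{io}$ is treated as non-parallelizable and the computation time $t_c$ divides evenly among $p$ processors, then

$$S(p) = \frac{t_{io} + t_c}{t_{io} + t_c / p} \;\longrightarrow\; 1 + \frac{t_c}{t_{io}} \quad (p \to \infty),$$

so for a fixed amount of disk I/O, the attainable maximum speedup grows directly with the amount of computation $t_c$.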
Second, the amount of data in these algorithms is relatively large. Because the disk I/O operation is not very efficient due to the limits of the hardware, the reading and writing processes are time-consuming. Thus, the amount of required disk I/O per image size is also one of the main factors that determine the parallel performance. In this regard, the experiments in this paper record the entire time required to finish the full procedure of an algorithm, including the time for reading and writing as well as the time for computing, instead of only the time for computing. For some cases (i.e., RPC-based correction and mosaics on the Windows computer and mosaics on the Linux computer), the maximum speedup does not improve very much even when more processors are used. These experimental results imply that the large amount of disk I/O is the main factor preventing greater improvement of the speedup for those algorithms. Theoretically, the time required to directly read the source images and to write the resulting images without any computation is a hard limit, which is approached most closely when unlimited processors are utilized. In the best case, the entire time required to finish the complete operation is equal to or close to this limit. Because the algorithms in this paper are all data-intensive operations, the best parallel computing strategy for achieving high speed and efficiency is one in which the computation operations and the disk I/O operations occur simultaneously. In other words, overlapping the computational time with the disk I/O time yields the greatest decrease in the overall runtime. The strategies described in the Method of Parallel Processing Section are supported by the fact that the shortest times in Table 4 and Table 5 approach this theoretical limit; thus, the adopted parallel strategy is close to optimal.
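One common way to realize this overlap is double buffering: a background thread prefetches the next block from disk while the current block is being processed. The following sketch is a generic illustration under our own assumptions, not the paper's code; read_block, process_block, and write_block are hypothetical placeholders:

import queue
import threading

def read_block(block_id):
    return bytes(1024)          # stand-in for a disk read

def process_block(data):
    return data                 # stand-in for the per-block computation

def write_block(block_id, result):
    pass                        # stand-in for writing one output block

def reader(n_blocks, buf):
    # Prefetch blocks; put() blocks when the two-slot buffer is full,
    # so reading stays at most one block ahead of the computation.
    for block_id in range(n_blocks):
        buf.put((block_id, read_block(block_id)))
    buf.put(None)               # sentinel: no more blocks

n_blocks = 16
buf = queue.Queue(maxsize=2)    # two slots -> classic double buffering
t = threading.Thread(target=reader, args=(n_blocks, buf))
t.start()
while True:
    item = buf.get()
    if item is None:
        break
    block_id, data = item
    # The next disk read proceeds in the reader thread while this
    # block is being processed and written.
    write_block(block_id, process_block(data))
t.join()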
Third, the computing platform's performance also determines the execution times and the effectiveness of parallel computing. The performance of a multi-core computer depends on three factors: the computing power of one CPU core, the number of cores, and disk I/O performance. Generally speaking, when the number of cores used is less than the number at which the maximum speedup is achieved, the more powerful a single CPU core, the shorter the execution time. For instance, when four cores are used, the time required for BR fusion on the Windows computer is shorter than that on the Linux computer, because a single core in the Windows computer is more powerful than one in the Linux computer. Thus, a smaller number of powerful cores can match a larger number of weaker cores in terms of execution performance. If the amount of required computation is large enough, as in the DEM extractions, a larger number of cores and higher disk I/O performance are preferable: the speedup reaches 13.6× for DEM extraction on the Windows computer when 24 processors are used. In any case, high disk I/O performance means that less time is required for reading and writing.
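The trade-off between core power and core count can be summarized with a simple cost model (our illustration; $W$ denotes the total computational work and $c$ the per-core throughput, neither of which is a quantity defined in the paper):

$$T(p, c) \approx t_{io} + \frac{W}{p \, c},$$

so once disk I/O is fixed, two platforms with equal aggregate throughput $p\,c$ (fewer powerful cores versus more weak cores) yield approximately the same execution time.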
In these experiments, once the size of the available RAM in a computer is above a threshold, RAM is not a bottleneck affecting the parallel performance. Assuming there are k cores in a computer, the threshold of required memory is the size supporting the concurrent processing of (k - 1) blocks. From the Parallel Processing Mechanism Section, we know that memory consumption in the adopted parallel mechanism is small, so an ordinary personal computer can usually meet this requirement. Although the available RAM in the employed computers is relatively large (48 GB and 12 GB, respectively, as configured at purchase), the amount of RAM is not a factor that impacts parallel performance when a readily available multi-core computer is used.
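As a worked illustration of this threshold (the block size here is our own assumption, not a figure reported in the experiments): with $k = 24$ cores and a block size of 256 MB,

$$M_{\text{threshold}} = (k - 1) \times s_{\text{block}} = 23 \times 256 \text{ MB} \approx 5.75 \text{ GB},$$

which is comfortably below the 48 GB available on the Windows computer.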
Comparison with Other Multi-core Parallel Methods
In the experimental results reported in the literature (Remon et al., 2011), no significant improvement is observed for the multi-core version over the optimized OSP (orthogonal subspace projection) using one core, and the multi-core version of N-FINDR with eight cores achieves a 3.1× speedup (15.001 sec using one core, 4.879 sec using eight cores). In comparison, the speedup values in this paper range from 3.7× to 6.6× when eight processors are used. The results on a multi-core system with 12 physical cores presented by Bernabe et al. (2013) show that there is no significant difference between using 4 or 12 cores (Figure 11 in that paper), because the parallel implementation is not optimized to increase its scalability to a high number of cores. Our method achieves high scalability, as evidenced by the results in Table 4 and Table 5.
The higher performance is achieved through two improvements. First, flexible and optimal parallel strategies adapted for remote sensing image processing can be embedded in the method by means of MPI. In contrast, the parallel computing in the literature (Bernabe et al., 2013; Remon et al., 2011), which is built on OpenMP and the basic linear algebra subprograms (BLAS) and linear algebra package (LAPACK) libraries supported by the compilers, exploits multi-threaded linear algebra subprograms and parallelized loops. Second, the latest computers containing more cores are used in these experiments; therefore, the scalability of the adopted parallel mechanism properly matches the newest hardware.