
proposed to acquire specific land cover reflectance for further fusion (Xie et al. 2016). To deal with complex surface changes, Huang and Zhang (2014) proposed the unmixing-based spatiotemporal reflectance fusion model (U-STFM), which significantly improved the accuracy of capturing land-cover-type changes. Zhu et al. (2016) proposed a flexible spatiotemporal data fusion (FSDAF) algorithm with minimal input data, which is suitable for heterogeneous landscapes and land-cover-type changes.
Among the learning-based methods, Huang and Song used sparse representation theory to establish the correspondence between high- and low-resolution images and predict the high-resolution image of another period, which shows advantages in forecasting images with phenological changes and land-cover-type changes (Huang and Song 2012). To reduce the amount of input data, one pair of images is used for dictionary learning and the prediction process is implemented in a two-stage framework (Song and Huang 2013). However, algorithms based on sparse representation introduce considerable complexity and instability. In recent years, some scholars have introduced deep learning algorithms into the field of spatiotemporal fusion. Song et al. (2018) established a novel spatiotemporal fusion method using deep convolutional neural networks (STFDCNN), which can effectively extract and express information in large-scale remote sensing data and achieves satisfactory performance. Tan et al. (2018) proposed a new deep convolutional spatiotemporal fusion network (DCSTFN), which fully uses convolutional neural networks to obtain images with high spatial and temporal resolution. The results show that this method has higher fusion precision and is more robust than traditional spatiotemporal fusion algorithms.
The above methods are mostly based on pixel-by-pixel processing, so fusing each fine-coarse resolution image pair takes a long time to produce a prediction. In practical applications, when long time-series images are needed, these methods are very time-consuming. In contrast, learning-based methods spend more time in the training process but need much less time for prediction. At present, deep learning has achieved excellent performance in computer vision (Zhang et al. 2017; Kim, Lee, and Lee 2016; Tong et al. 2017; Ledig et al. 2017; Girshick et al. 2014; Ren et al. 2015; Wang et al. 2017). More and more deep learning networks have been introduced into remote sensing (Zhang, Zhang, and Du 2016; Zhu et al. 2017). Furthermore, spatiotemporal fusion methods based on deep learning have recently shown their potential to process data in batches and detect texture features of objects (Song et al. 2018).
Given the aforementioned concerns, the objective of this research is to develop an effective spatiotemporal fusion algorithm. To reduce the impact of the large resolution difference between the input data, a two-stage fusion algorithm is proposed. In the first stage, the input data are preprocessed to a transitional resolution, and a linear interpolation model is then used to fuse them into an initial fusion result. In the second stage, a residual dense network is introduced to reconstruct the preliminary fusion results and obtain the final fusion results at the same resolution as the fine-resolution image. The main contributions of this paper are as follows: 1) the linear interpolation model effectively integrates the spatial and spectral information from the preprocessed input data to produce more accurate transitional fused results; 2) the introduced residual dense network (a generic building block is sketched below) fully extracts and uses the hierarchical features of the transitional fused data and can effectively establish the mapping from preliminary images to real fine-resolution images for reconstruction, so that more detailed spatial structure can be captured in the fusion results; and 3) the model can be saved and reused after training, which improves data processing efficiency when acquiring long-term image sequences over the same study area.
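As a concrete illustration of the residual dense network mentioned in contribution 2, the following is a minimal PyTorch sketch of a generic residual dense block. The class name, layer count, and channel widths are illustrative assumptions, not the exact architecture used in this paper.

```python
# A generic residual dense block: densely connected convolutions,
# local feature fusion, and local residual learning. Illustrative
# only; layer/channel settings are assumptions, not the paper's.
import torch
import torch.nn as nn

class ResidualDenseBlock(nn.Module):
    def __init__(self, channels=64, growth=32, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(num_layers):
            # Each conv sees the block input plus all earlier layer
            # outputs (dense connections), so hierarchical features
            # from every depth are reused.
            self.layers.append(nn.Sequential(
                nn.Conv2d(channels + i * growth, growth, 3, padding=1),
                nn.ReLU(inplace=True)))
        # Local feature fusion: a 1x1 conv back to the input width.
        self.fuse = nn.Conv2d(channels + num_layers * growth, channels, 1)

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            features.append(layer(torch.cat(features, dim=1)))
        # Local residual learning: fused features plus the block input.
        return x + self.fuse(torch.cat(features, dim=1))
```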
Methodology
In this paper, a two-stage spatiotemporal fusion method is proposed for remote sensing images. Here, Landsat and MODIS data sets are used to demonstrate the proposed fusion method (however, other remote sensing images such as Sentinel-3 and Sentinel-2 can also be adopted in our method). One Landsat-MODIS image pair and a MODIS image on the prediction date are used as input data to predict the Landsat image on the prediction date. Considering the large spatial resolution difference between the Landsat image and the MODIS image, the method is implemented in two stages. In the first stage, the Landsat image is downsampled to a transitional resolution, and the MODIS image is upsampled to the same resolution. A linear interpolation model is then used to fuse the preprocessed input images and obtain preliminary fusion results. In the second stage, a residual model is introduced to reconstruct the preliminary fusion results and acquire the Landsat image on the prediction date. The main idea and flowchart of the proposed method are demonstrated in Figure 1.
Figure 1. Flowchart of the proposed method.
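As a rough sketch of the flow in Figure 1, the two stages can be composed as below. All names here are hypothetical rather than the authors' code; the internals of each stage are sketched in the material that follows.

```python
# Minimal skeleton of the two-stage flow; `preliminary_fusion` and
# `reconstructor` stand in for Stage 1 and the trained Stage 2 model.
def predict_fine_image(landsat_t1, modis_t1, modis_t2,
                       preliminary_fusion, reconstructor):
    """Predict the fine-resolution (Landsat) image on date t2 from one
    Landsat-MODIS pair at t1 and a MODIS image at t2."""
    # Stage 1: preprocess to the transitional resolution and fuse.
    preliminary = preliminary_fusion(landsat_t1, modis_t1, modis_t2)
    # Stage 2: reconstruct fine-resolution detail with the trained model.
    return reconstructor(preliminary)
```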
The resolution difference between the high-resolution Landsat data and the low-resolution MODIS data used in this paper is a factor of 20. In this case, we first consider a fusion across a 10-times resolution difference: the spatial resolution of the Landsat data is reduced by a factor of 2, and the MODIS data are upsampled to the same resolution as the resampled Landsat data.
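The preprocessing just described might be sketched as follows, assuming single-band NumPy arrays; scipy.ndimage.zoom with cubic interpolation is used here as a stand-in for the paper's (unspecified) resampling implementation.

```python
# Hedged sketch of the Stage 1 preprocessing: Landsat down 2x, MODIS
# up 10x, so both reach the transitional resolution (overall gap: 20x).
from scipy.ndimage import zoom

def to_transitional(landsat_t1, modis_t1, modis_t2):
    f1 = zoom(landsat_t1, 0.5, order=3)   # Landsat, resolution halved
    c1 = zoom(modis_t1, 10.0, order=3)    # MODIS at t1, upsampled 10x
    c2 = zoom(modis_t2, 10.0, order=3)    # MODIS at t2, upsampled 10x
    return f1, c1, c2
```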
A linear interpolation model commonly used for pan-sharpening is used to fuse the resampled Landsat and MODIS data (Vivone et al. 2015). The model is given as follows:
$$\hat{F}^{2}_{2,b} = F^{2}_{1,b} + g_b \left( C_{2,b} - C_{1,b} \right) \tag{1}$$
where $F^{2}_{1,b}$ represents the band-$b$ Landsat image at time $t_1$ after reducing the resolution by a factor of 2, and $C_{1,b}$ and $C_{2,b}$ are the upsampled images at $t_1$ and $t_2$ with the same resolution as $F^{2}_{1,b}$, obtained using the bicubic interpolation method. The linear interpolation model integrates the spatial information of the high-resolution image and the spectral information of the low-resolution image to produce the fused result $\hat{F}^{2}_{2,b}$.
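For illustration, Equation 1 translates directly into a NumPy-style function, assuming f1, c1, and c2 are the co-registered transitional-resolution arrays for one band (as produced by the preprocessing sketch above) and treating the injection gain g_b as a given parameter, since its definition follows in the next equation of the paper.

```python
def linear_interpolation_fusion(f1, c1, c2, g_b):
    """Equation 1: the preliminary fused band at the transitional
    resolution; g_b may be a scalar or a per-pixel array."""
    return f1 + g_b * (c2 - c1)
```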
In Equation 1, the injection gain $g_b$ is defined, for band $b$, as