
Video Event Recognition and Anomaly Detection
by Combining Gaussian Process and Hierarchical
Dirichlet Process Models
Michael Ying Yang, Wentong Liao, Yanpeng Cao and Bodo Rosenhahn
Abstract
In this paper, we present an unsupervised learning framework for analyzing activities and interactions in surveillance videos. In our framework, three levels of video events are connected by a Hierarchical Dirichlet Process (HDP) model: low-level visual features, simple atomic activities, and multi-agent interactions. Atomic activities are represented as distributions over low-level features, while complicated interactions are represented as distributions over atomic activities. This learning process is unsupervised. Given a training video sequence, low-level visual features are extracted based on optical flow and then clustered into different atomic activities, and video clips are clustered into different interactions. The HDP model automatically decides the number of clusters, i.e., the categories of atomic activities and interactions. Based on the learned atomic activities and interactions, a training dataset is generated to train the Gaussian Process (GP) classifier. The trained GP models then work on newly captured video to classify interactions and detect abnormal events in real time. Furthermore, the temporal dependencies between video events, learned by HDP-Hidden Markov Models (HDP-HMM), are effectively integrated into the GP classifier to enhance classification accuracy on newly captured videos. Our framework couples the benefits of the generative model (HDP) with those of the discriminative model (GP). We provide detailed experiments showing that our framework achieves favorable performance for real-time video event classification in a crowded traffic scene.
Introduction
High-level video event classification is an important problem in computer vision and has attracted great attention in recent years [1] due to its significant practical value in applications such as security monitoring and traffic control. Most existing approaches focus on recognizing an individual activity [2] or a collective activity [3] against clean backgrounds. The task remains challenging in a crowded public scene because a large number of agents pursue different activities at the same time and form complicated interactions, such as traffic flows at a busy junction. Moreover, surveillance video captured in a crowded scene is typically of low quality.
Discriminative models such as GP models and SVMs are the most popular approaches to classifying video events [4], [5], [6], [7] because of their advantage in classification accuracy. However, they are supervised models, so a manually labeled training dataset must be available in advance. Moreover, they are feature-based approaches and place high demands on the applicability and precision of the features to ensure good performance. The most widely used features include the HOG feature, flow-based features, etc.
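To make the discriminative side concrete, the following is a minimal sketch, not the authors' implementation, of training such a classifier with scikit-learn's GaussianProcessClassifier on placeholder flow-based feature vectors; the data dimensions and class count are assumptions made only for illustration.

import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)

# Placeholder training data: one row per video clip, columns are
# flow-based features (e.g., a histogram of quantized motion directions).
X_train = rng.random((200, 32))          # 200 clips, 32-dim features
y_train = rng.integers(0, 4, size=200)   # 4 hypothetical interaction classes

# An RBF kernel is a common default; the paper's kernel choice may differ.
gpc = GaussianProcessClassifier(kernel=1.0 * RBF(length_scale=1.0))
gpc.fit(X_train, y_train)

# Classify a newly captured clip; the per-class probabilities are what
# a temporal prior can later reweight.
x_new = rng.random((1, 32))
print(gpc.predict(x_new), gpc.predict_proba(x_new))

Note, however, that this sketch presumes exactly the manually labeled training set (y_train) that the supervised setting requires, which is the burden the unsupervised learning stage described below removes.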
Generative models, especially topic models such as LDA [8] and HDP [9], [10], have achieved great progress in high-level video event recognition in complex surveillance scenes. They effectively learn activities and interactions from unlabeled video by analyzing semantic relationships instead. However, they have serious limitations: they are computationally expensive and work in batch mode. Moreover, most existing methods neglect the temporal dependencies between activities and interactions [9].
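As a hedged sketch of this topic-model view (using gensim's HdpModel on toy data; the vocabulary of "visual words" below is hypothetical), each short clip becomes a document of quantized motion words, and the HDP infers how many topics, i.e., atomic activities, the data support:

from gensim.corpora import Dictionary
from gensim.models import HdpModel

# Each "document" is one clip; each "word" encodes a quantized cell
# position plus a motion direction (hypothetical visual vocabulary).
clips = [
    ["cell12_up", "cell12_up", "cell13_up", "cell40_left"],
    ["cell40_left", "cell41_left", "cell41_left", "cell12_up"],
    ["cell70_down", "cell71_down", "cell70_down"],
]

dictionary = Dictionary(clips)
corpus = [dictionary.doc2bow(clip) for clip in clips]

# Unlike LDA, the HDP does not need the topic count in advance.
hdp = HdpModel(corpus, id2word=dictionary, random_state=0)
for topic in hdp.print_topics(num_topics=5, num_words=3):
    print(topic)  # each inferred topic corresponds to an atomic activity

The batch character criticized above is visible even in this toy sketch: the model is fit on the whole corpus at once, and inference must be rerun as new clips arrive.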
Inspired by the respective strengths of generative and discriminative models, in this paper we propose a method that combines HDP models and GP models to realize unsupervised video behavior classification in real time in a complex and crowded traffic scene. The first step is unsupervised learning of the activities using HDP models and of the traffic states using HDP-HMM, respectively. Based on these learning results, we construct feature vectors that represent activities and traffic states in a new way. A training set is then generated from these feature vectors to feed the GP models. In addition, the temporal dependencies between two consecutive states are integrated into our GP models to enhance classification accuracy, as sketched below.
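A minimal sketch of this last coupling, under the assumption, made purely for illustration, that the HDP-HMM yields a state-transition matrix and that the GP's per-class probabilities are reweighted by the transition probabilities out of the previous state (the paper's exact integration may differ):

import numpy as np

# Hypothetical transition matrix over 4 traffic states, of the kind an
# HDP-HMM could provide: T[i, j] = P(state j follows state i).
T = np.array([
    [0.70, 0.20, 0.05, 0.05],
    [0.10, 0.60, 0.25, 0.05],
    [0.05, 0.15, 0.70, 0.10],
    [0.20, 0.05, 0.05, 0.70],
])

def fuse(gp_probs, prev_state):
    """Reweight GP class probabilities by the temporal transition prior."""
    scores = gp_probs * T[prev_state]
    return scores / scores.sum()

gp_probs = np.array([0.30, 0.40, 0.20, 0.10])  # GP output for the current clip
print(fuse(gp_probs, prev_state=0))            # the prior pulls mass toward state 0

This multiply-and-renormalize step is the standard Bayesian filtering update, so it adds negligible cost on top of the GP prediction and preserves the real-time property.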
The major contributions of this paper are as follows. First, we effectively combine the unsupervised generative model (HDP) with the supervised discriminative model (GP) to realize unsupervised classification of video events. Second, we integrate transition information between two consecutive states into the GP models to enhance classification accuracy. Third, we provide detailed experiments showing that our framework achieves favorable performance for real-time video event classification in a crowded traffic scene.
Related Work
Topic models have received increasing attention for analyzing activity in surveillance video [7], [10], [11], [13]. However, [9] and [12] are offline, batch procedures, and temporal dependencies are neglected. Hospedales et al. [11] used latent Dirichlet allocation (LDA) models to infer activities in a video, which requires a predefined number of clusters; it is hard to give a proper number of possible activities that may occur in a video of a crowded scene. Besides, their models perform Gibbs sampling on each newly captured video clip to estimate the joint distribution. It
Michael Ying Yang is with the Scene Understanding Group, University of Twente, Enschede, The Netherlands.
Wentong Liao and Bodo Rosenhahn are with the Institute for Information Processing, Leibniz University Hannover, Hannover, Germany.
Yanpeng Cao is with the College of Mechanical Engineering, Zhejiang University, Hangzhou, China (Corresponding Author).
Photogrammetric Engineering & Remote Sensing
Vol. 84, No. 4, April 2018, pp. 203–214.
0099-1112/17/203–214
© 2018 American Society for Photogrammetry and Remote Sensing
doi: 10.14358/PERS.84.4.203