Video Event Recognition and Anomaly Detection
by Combining Gaussian Process and Hierarchical
Dirichlet Process Models
Michael Ying Yang, Wentong Liao, Yanpeng Cao and Bodo Rosenhahn
Abstract
In this paper, we present an unsupervised learning framework for
analyzing activities and interactions in surveillance videos. In
our framework, three levels of video events are connected by a
Hierarchical Dirichlet Process (HDP) model: low-level visual
features, simple atomic activities, and multi-agent interactions.
Atomic activities are represented as distributions over low-level
features, while complicated interactions are represented as
distributions over atomic activities. This learning process is
unsupervised. Given a training video sequence, low-level visual
features are extracted based on optical flow and then clustered
into different atomic activities, and video clips are clustered
into different interactions. The HDP model automatically decides
the number of clusters, i.e., the categories of atomic activities
and interactions. Based on the learned atomic activities and
interactions, a training dataset is generated to train a Gaussian
Process (GP) classifier. The trained GP models then operate on
newly captured video to classify interactions and detect abnormal
events in real time. Furthermore, the temporal dependencies
between video events, learned by an HDP-Hidden Markov Model
(HDP-HMM), are effectively integrated into the GP classifier to
enhance classification accuracy on newly captured videos. Our
framework couples the benefits of a generative model (HDP) with
those of a discriminative model (GP). We provide detailed
experiments showing that our framework achieves favorable
performance for real-time video event classification in a crowded
traffic scene.
Introduction
High-level video event classification is an important problem in
computer vision and has attracted great attention in recent years
[1] due to its significant practical value in applications such as
security monitoring and traffic control. Most existing approaches
focus on recognizing an individual activity [2] or a collective
activity [3] against a clean background. The task remains
challenging in a crowded public scene, where a large number of
agents perform different activities at the same time and the
interactions are complicated, such as the traffic flows at a busy
junction. Moreover, surveillance video captured in a crowded scene
is normally of low quality.
Discriminative models such as GP models and SVMs are the most
popular approaches for classifying video events [4], [5], [6], [7]
because of their advantage in classification accuracy. However,
they are supervised models, so a manually labeled training dataset
must be prepared in advance. Moreover, they are feature-based
approaches, which places high demands on the applicability and
precision of the features to guarantee their performance. The most
widely used features include the HOG feature and flow-based
features.
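As an illustration, the following is a minimal Python sketch of the
kind of flow-based feature extraction common in this line of work:
dense optical flow is computed per frame pair and quantized into
discrete visual words by cell position and motion direction. The
cell size, magnitude threshold, and number of direction bins are
hypothetical choices, not values taken from this paper.

import cv2
import numpy as np

def flow_words(prev_gray, curr_gray, cell=10, mag_thresh=1.0, n_dirs=4):
    """Quantize dense optical flow into (cell position, direction) words."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        pyr_scale=0.5, levels=3, winsize=15,
                                        iterations=3, poly_n=5,
                                        poly_sigma=1.2, flags=0)
    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])  # ang in radians
    h, w = prev_gray.shape
    words = []
    for y in range(0, h - cell + 1, cell):
        for x in range(0, w - cell + 1, cell):
            m = mag[y:y + cell, x:x + cell].mean()
            if m < mag_thresh:                 # ignore near-static cells
                continue
            a = ang[y:y + cell, x:x + cell].mean()
            d = int(a / (2 * np.pi) * n_dirs) % n_dirs  # quantized direction
            cell_id = (y // cell) * (w // cell) + (x // cell)
            words.append(cell_id * n_dirs + d)  # word = (cell, direction)
    return words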
Generative models, especially topic models such as LDA [8] and HDP
[9], [10], have achieved great progress in high-level video event
recognition in complex surveillance scenes. They effectively learn
activities and interactions from unlabeled video by analyzing
semantic relationships instead. However, they have serious
limitations: they are computationally expensive and operate in
batch mode. Besides, most existing methods neglect the temporal
dependencies between activities and interactions [9].
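To make the contrast concrete, here is a minimal sketch of
unsupervised topic discovery with an HDP model. It uses gensim's
HdpModel as a stand-in inference engine (an assumption for
illustration; the paper does not prescribe this library), with each
document being the bag of visual words extracted from one short
video clip. Unlike LDA, the HDP infers the number of topics, i.e.,
atomic activities, from the data.

from gensim.corpora import Dictionary
from gensim.models import HdpModel

# Toy corpus: one bag of visual words per video clip, e.g. the
# output of the flow quantization sketched above (indices stringified).
docs = [["w12", "w12", "w87", "w87", "w87"],
        ["w87", "w340", "w340", "w12"],
        ["w12", "w12", "w12", "w340"]]

dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

# The HDP decides the number of topics itself; no cluster count is given.
hdp = HdpModel(corpus, id2word=dictionary)
for bow in corpus:
    print(hdp[bow])  # per-clip distribution over the discovered topics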
Inspired by the complementary power of generative and
discriminative models, in this paper we propose a method that
combines HDP models with GP models to realize unsupervised video
behavior classification in real time in a complex and crowded
traffic scene. The first step is unsupervised learning of the
activities using HDP models and of the traffic states using an
HDP-HMM, respectively. Based on their learning results, we
construct feature vectors that represent activities and traffic
states in a new way. A training set is then generated from these
feature vectors to feed the GP models. In addition, the temporal
dependencies between two consecutive states are integrated into
our GP models to enhance classification accuracy, as sketched
below.
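The following minimal sketch shows one plausible reading of this
last step (an assumption for illustration, not the authors' exact
formulation): a GP classifier is trained on per-clip feature
vectors, and its class posteriors are reweighted by a
row-stochastic transition matrix of the kind an HDP-HMM would
provide, i.e., p(s_t | x_t, s_{t-1}) ∝ p_GP(s_t | x_t) *
A[s_{t-1}, s_t]. All data below are synthetic placeholders.

import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier

rng = np.random.default_rng(0)
X_train = rng.random((60, 8))          # stand-in per-clip topic-proportion features
y_train = rng.integers(0, 3, size=60)  # stand-in interaction labels from the HDP
gp = GaussianProcessClassifier().fit(X_train, y_train)

# Transition matrix between the three interactions, as an HDP-HMM
# would estimate; the values here are made up.
A = np.array([[0.8, 0.1, 0.1],
              [0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8]])

def classify_sequence(gp, X_seq, A):
    """Label a clip sequence, weighting GP posteriors by transition priors."""
    prev, labels = None, []
    for x in X_seq:
        p = gp.predict_proba(x.reshape(1, -1))[0]
        if prev is not None:
            p = p * A[prev]   # transition prior from the previous state
            p = p / p.sum()   # renormalize
        prev = int(np.argmax(p))
        labels.append(prev)
    return labels

print(classify_sequence(gp, rng.random((5, 8)), A))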
The major contributions of this paper are as follows. First, we
effectively combine the unsupervised generative model (HDP) with
the supervised discriminative model (GP) to realize unsupervised
classification of video events. Second, we integrate transition
information between two consecutive states into the GP models to
enhance classification accuracy. Third, we provide detailed
experiments showing that our framework achieves favorable
performance for real-time video event classification in a crowded
traffic scene.
Related Work
Topic models have received increasing attention as a tool for
analyzing activities in surveillance video [7], [10], [11], [13].
However, [9] and [12] are offline, batch procedures, and temporal
dependencies are neglected. Hospedales et al. [11] used latent
Dirichlet allocation (LDA) models to infer activities in a video,
which requires a predefined number of clusters. It is hard to give
a proper number of the possible activities that may occur in a
video of a crowded scene. Besides, their models perform Gibbs
sampling on each newly captured video clip to estimate the joint
distribution. It
Michael Ying Yang is with the Scene Understanding Group,
University of Twente, Enschede, The Netherlands.
Wentong Liao and Bodo Rosenhahn are with the Institute for
Information Processing, Leibniz University Hannover, Hannover,
Germany.
Yanpeng Cao is with the College of Mechanical Engineering,
Zhejiang University, Hangzhou, China (Corresponding Author).