Jason J. Corso
People: Jason Corso (PI), Raymond Fu, Werner Ceusters, Venkat Krovi, and Michalis Petropoulos
Funding Agency: DARPA Mind's Eye in TCTO. This project kicked off in June 2010.
Overview and Project Goals
Comprehensive visual scene understanding has long been the ultimate challenge
in computer vision research. While images and videos of the natural world are
highly structured and redundant (Kersten, 1987; Ruderman, 1994), they exhibit
complex appearance and shape, complex hierarchical scale-varying nature, and
occlusion. Early successes have focused on particular sub-problems, such as
face detection (Viola and Jones, 2002, 2004). State-of-the-art systems are
capable of detecting instances of objects (the "nouns" of the scene) among a
few hundred object classes (Fei-Fei et al., 2004), and contests such as the
PASCAL Challenge annually pit the world's best object detection methods against
one another on novel datasets. Although some may argue these object detection
methods have not
been thoroughly evaluated in the wild, a more elusive problem now presents
itself: the "verbs" of the scene. As Biederman stated nearly 30 years ago,
specifying not only the elements in an image but also the manner in which they
are interacting and relating to one another is integral to full image
understanding (Biederman, 1981).
However, representing and recognizing actions (especially those of
humans), with a view to understanding their underlying motivation, has proved
to be an extremely challenging task because: (A) Motion is the projected output
of a set of coordinated actions of often high-dimensional systems, an extremely
high-dimensional neuro-musculo-skeletal one in the case of humans (and thus not
particularly well-suited for any attempt at reconstruction solely on the basis
of visual observation of coordinated actions); (B) Motion occurs and gets
described semantically/linguistically at a wide variety of spatiotemporal
scales (i.e., varying levels of abstraction serve to agglomerate or subdivide
spatial and temporal characteristics, and careful attention must be paid to the
appropriate creation of "equivalence classes"); (C) Most importantly, the
unambiguous extraction of intent from motion alone can never be achieved due to
the significant dependence upon contextual knowledge (making the case for the
development of a systematic ontology in which to ground the visual reasoning).
The representation, learning, and recognition of, and reasoning over,
activities in persistent surveillance videos is the overarching objective of
ISTARE.
Data / Code Releases:
Please acknowledge any use of these releases.
Other Info:
Publications: