Prof. Corso moved to the Electrical Engineering and Computer Science department at the University of Michigan in the 8/2014. He continues his work and research group in high-level computer vision at the intersection of perception, semantics/language, and robotics. Unless you are looking for something specific, historically, here, you probably would rather go to his new page.
CSE 642

Main Course Material
Course Outline
Course Project
Full Teaching List

CSE 642 Techniques in AI: Vision for HCI
University at Buffalo SUNY
Fall 2009

Instructor: Jason Corso (jcorso)
Course Webpage:
Meeting Times:MW 3-4:20
Location: M Bell 340 and W Bell 224
We will meet directly in a computer lab for one of the two class meetings each week. The instructor will provide both a working software framework and a physical camera (which the student can take with them for the entire semester). The instructor will also provide a list of suitable projects at the beginning of the semester, but students are encouraged to design their own projects (subject to instructor approval). At the end of the semester, we will host an open-house demo of the projects.
Office Hours: Monday 1-3 or by appointment.


  • Code for the adaptive background subtraction lecture is available on the Eng Linux network at ~jcorso/642/23sept-bg. You will need to change the makefile paths. It will work as is, but as we discussed in lecture, there are some things that could be improved.
  • Class begins. First class will either be in Bell 224 or 340, check 224 first.

Main Course Material

Course Overview: The promise of computer vision for enabling natural human-machine interfaces is great: vision-based interfaces would allow unencumbered, large-scale spatial motion; they could make use of hand gestures, movements, or other similar natural input means; and video itself is passive, cheap, and nearly ubiquitous. In the simplest case, tracked hand motion and gesture recognition could replace the mouse in traditional applications, but, computer vision offers the additional possibility of using both hands simultaneously, using the face, incorporating multiple users concurrently, etc. In this course, we will develop these ideas from both a theoretical and a practical perspective. From the theoretical side, we will cover ideas ranging from interaction paradigms suitable for vision-based interfaces to mathematical models for tracking (e.g., particle filtering), modeling high-dimensional articulated objects, and modeling a grammar of interaction, as well as algorithms for rapid and real-time inference suitable for interaction scenarios. From the practical side, we will each build (in pairs) an actual working vision-based interactive system. Each project must ``close the loop'' and be integrated directly into an interactive computer system (e.g., sort photos on the screen by grabbing them with each hand and moving them around). During the semester, very practical-minded topics such as interactive system design and architecture, debugging programs that process high-dimensional video data, and program optimization will be discussed alongside the underlying computer vision theory.

Course Project: Each student will be required to implement a course project that is either a direct implementation of a method discussed during the semester or new research in Bayesian vision. A paper describing the project is required at the end of the semester (8-10 pages two column IEEE format) and we will have an open-house poster session to present the projects. Working project demos are suggested but not required for the poster session.

Prerequisites: It is assumed that the students have taken introductory courses in machine learning/pattern recognition (CSE 555 or 574), and computer vision (CSE 573). Permission of the instructor is required if these prerequisites have not been met. Working knowledge of C/C++ is a must.

Course Goals: Students taking this course will learn the major ideas in the theory of computer vision for building natural human-machine interfaces, and they will gain practical experience in building large, real-time computer vision systems. These are challenging and complementary goals, unmatched in most vision curricula.

Textbooks: There is no textbook for the course. The instructor will prepare lectures and hand-out relevant academic articles. The OpenCV book (Learning OpenCV by Bradski and Kaehler) is suggested as OpenCV will serve as the core software library on which the projects will be built.

Grading: Grading will be discussed in the class. Basically, the grade will be a function of the students having achieved the course goals and it will ultimately be measured by the success of his or her project. Quantitatively speaking, the project will cover about 75% of the grade and the remaining 25% will be filled will class participation and the occasional minor homework/quiz.

Programming Language: C++ is the programming language.

Course Outline

The outline is presented in an idea-centric scheme rather than an application-centric one. This is an initial run; as the course evolves, this will be refined. During each lecture, we will ground the theoretical ideas with practical examples and real-world applications.

  1. Computer Vision Background Material

    1. Light-Models, Image Acquisition, and Backgrounds [Trucco & Verri, 1998,Forsyth & Ponce, 2003]

    2. Camera Geometry, Calibration, and Stereo [Trucco & Verri, 1998,Hartley & Zisserman, 2004,Zhang, 2000]

    3. Interest Operators and Feature Extraction (There are many here, we survey some) [Mikolajczyk & Schmid, 2004,Harris & Stephens, 1988,Fraundorfer & Bischof, 2003,Lowe, 2004,Corso & Hager, 2005,Kadir & Brady, 2001,Matas et al., 2002]

  2. Tracking

    1. Feature Tracking [Shi & Tomasi, 1994]

    2. Direct Method (Image-Based Tracking) [Lucas & Kanade, 1981,Baker & Matthews, 2004]

    3. Model-based Tracking

    4. Exemplar-based Tracking

    5. Kalman Filtering [Kalman, 1960,Welch & Bishop, 95]

    6. Particle Filtering [Arulampalam et al., 2002,Isard & Blake, 1998]

    7. Mean shift / Kernel-based tracking [Collins, 2003,Comaniciu et al., 2003,Comaniciu & Meer, 2002,Cheng, 1995,Hager et al., 2004]

  3. Recognition

    1. Space-time Methods

      1. Discriminant Analysis

      2. State-based methods [Bobick & Wilson, 1997]

      3. Templates/Exemplar Methods

      4. Bayesian Network Classifiers

    2. Temporal Methods

      1. Model-Based Methods

      2. Hidden Markov Models [Rabiner, 1989]

      3. Dynamic Bayesian Networks [Murphy, 2002,Ghahramani, 1998]

      4. Stochastic Parsing [Ivanov & Bobick, 2000]

  4. Localized Interaction

    1. VICs framework [Ye et al., 2004,Corso et al., 2008]

    2. Dynamically Reconfigurable UIs [Kjeldsen et al., 2003]

    3. Tracking the scene

Course Project

The goal of the project is threefold: first, we want to put the theory of Vision for HCI to work; second, we want to build substantial student experience in programming and debugging a large system with images, video, and interactive/near-real-time constraints (on a Unix system); third, we want to give the students a chance to explore the fun and exciting area of building next generation and useful user interfaces.

The projects will be tackled in pairs (if there is an odd-student in the course then we can have a student work alone or have one group of three). At the beginning of the semester, the instructor will hand out commodity webcams to each student (on loan from his lab). If the students need additional hardware to complete their project, they should discuss it with the instructor as early as possible. (Erector sets will be available in the instructor's lab for some simple construction tasks.)

Project Software, Platform, and Requirements

All projects will be performed primarily in the Bell 340 lab working environment, which is a Linux environment. Students will be able to gain important working knowledge of Linux systems from this experience. However, the project will be built on the following open-source cross-platform libraries and the students should hence pursue cross-platform project code:

  1. OpenCV ( is a widely used open-source computer vision toolkit. It contains the necessary low-level classes like images, matrices, and capture devices; algorithms like linear system solving, and frequency transformations; as well as many advanced computer vision algorithms like face detection with boosting. \textbf{Constraint:} the temptation will be to design a project and build it by stitching together bits and pieces of OpenCV; this is not permitted. OpenCV is there to provide you with a substantial working base to build real software. Each project must include substantial new algorithm development on top of the existing methods in OpenCV; this will be a challenge.
  2. wxWidgets ( is a cross-platform GUI library. If a project demands advanced interface tools that HighGUI does not provide (such as buttons!), then the students must use wxWidgets. \textbf{Exception:} if the project involves extending an existing open-source piece of software to add new computer vision-based functionality, such as Firefox, then wxWidgets need not be used. Such exceptions must first be discussed with the instructor at the beginning of the semester.

Both of these libraries and additional ones have been installed by the CSE IT staff and are available on the CSE and Bell 340 networks. Although the students may use an IDE of their choosing if they so wish, the projects must provide a standard Makefile solution for compiling on the instructor's machine (paths may be manually configured). The projects must use a revision control system of the students choosing (either CVS or SVN).

List of Possible Projects

No two teams may tackle the same project. First come are first served. See project schedule below. Teams and project selection are due in class by Sept. 21 at which time project work will begin in the lab.

For reasons that may or may not be obvious, the actual project list will be distributed via hard-copy on the first day of class and not posted to the website.

Project Schedule

The project schedule is largely student organized within the following date framework:

  • Sept. 21 --- Teams are formed and projects are selected.
  • Sept. 28 --- Requirements document must be submitted to the instructor in class. We will discuss what this means in class. Basically, this is a less than three page document that describes the project goals, anticipated methods to be used, clearly states what the new research developments will be (e.g., implementation beyond those in OpenCV, or altogether new ideas), and defines a schedule that includes two milestone deliverables and the ultimate end-game deliverables.
  • Oct. 26 --- Milestone 1 Due. Content is specified by the students via their requirements document.
  • Nov. 23 --- Milestone 2 Due. Content is specified by the students via their requirements document.
  • Dec. 14+ --- Demo day. Final deliverables and project report are due. (Specific date of demo-day is yet to be determined, but it will be on or after the 14th.)

Project Write-Up

The project write-up is a standard two-column IEEE conference format at maximum of 8 pages. The students can use the CVPR style files. It should be approached as a standard paper containing introduction and related work, methodology, results, and discussion.

Project Evaluation and End-Goals

The projects can have multiple non-exclusive end-points, and it is the instructor's goal that all projects achieve all of these end-points. At the very least, a project must achieve at least one of these end-points to be considered a success. The three end-points are

  1. Project yields a working interactive demo at demo-day that can be used by all or most of the demo visitors. Demo visitors will be asked to to informally evaluate the projects and rank them.
  2. Project yields a nice demo video that is uploaded and well-received to YouTube.
  3. Project yields a research manuscript that is submitted to a good conference in vision or interfaces such as CVPR, or CHI. Since paper reviewing will not happen until after the end of the semester, the instructor has the ultimate say in judging the quality of the manuscript.

In addition to these possible end-states, the final project deliverable will include the project report, all of the project code, and Doxygen generated documentation. These can be emailed to the instructor (if the size is not prohibitive) or submitted via CD/key.

Here is how the project scores will be computed:

  • How to get an A: Achieve two or three of the end-points mentioned above. Coherent code and report.
  • How to get a B: Achieve one of the three end-points mentioned above. Coherent code and report.
  • How to get a C: Clear effort has been made but non of the three end-points mentioned above have been achieved; project does some thing but is incomplete. Coherent code and report.
  • How to get a D: Unclear how much effort was put in; project does not do anything.
  • How to get an F: Students have not done anything. There is no project to speak of.
  • In most cases, both partners of a project team will receive the same grade, but if there are cases of clear irresponsibility of a project team member, these will be addressed on an individual basis with the instructor and the project team.


    Arulampalam et al., 2002
    Arulampalam, S., Maskell, S., Gordon, N., & Clapp, T. 2002.
    A Tutorial on Particle Filters for On-line Non-linear/Non-Gaussian Bayesian Tracking.
    IEEE Transactions on Signal Processing, 50, 174-188.

    Baker & Matthews, 2004
    Baker, S., & Matthews, I. 2004.
    Lucas-Kanade 20 Years On: A Unifying Framework.
    International Journal of Computer Vision, 56(3), 221-255.

    Black & Yacoob, 1997
    Black, M. J., & Yacoob, Y. 1997.
    Tracking and Recognizing Rigid and Non-Rigid Facial Motions using Local Parametric Models of Image Motion.
    International Journal of Computer Vision, 25(1), 23-48.

    Bobick & Davis, 2001
    Bobick, A., & Davis, J. 2001.
    The Recognition of Human Movement Using Temporal Templates.
    IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(3), 257-267.

    Bobick & Wilson, 1997
    Bobick, A. F., & Wilson, A. 1997.
    A State-based Approach to the Representation and Recognition of Gesture.
    IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(12), 1325-1337.

    Cheng, 1995
    Cheng, Y. 1995.
    Mean Shift, Mode Seeking and Clustering.
    IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(8), 790-799.

    Cohen et al., 2003
    Cohen, I., Sebe, N., Cozman, F. G., Cirelo, M. C., & Huang, T. S. 2003.
    Learning Bayesian Network Classifiers for Facial Expression Recognition using both Labeled and Unlabeled Data.
    In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition.

    Collins, 2003
    Collins, R. 2003.
    Mean-Shift Blob Tracking Through Scale Space.
    In: IEEE Conference on Computer Vision and Pattern Recognition.

    Collins & Liu, 2003
    Collins, R., & Liu, Y. 2003.
    On-Line Selection of Discriminative Tracking Features.
    Pages 346-352 of: International Conference on Computer Vision, vol. 1.

    Comaniciu & Meer, 2002
    Comaniciu, D., & Meer, P. 2002.
    Mean Shift: A Robust Approach Toward Feature Space Analysis.
    IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(5), 603-619.

    Comaniciu et al., 2003
    Comaniciu, D., Ramesh, V., & Meer, P. 2003.
    Kernel-Based Object Tracking.
    IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(5), 564-577.

    Corso & Hager, 2005
    Corso, J. J., & Hager, G. D. 2005.
    Coherent Regions for Concise and Stable Image Description.
    In: IEEE Conference on Computer Vision and Pattern Recognition.

    Corso et al., 2003a
    Corso, J. J., Burschka, D., & Hager, G. D. 2003a.
    Direct Plane Tracking in Stereo Image for Mobile Navigation.
    In: Proceedings of International Conference on Robotics and Automation.

    Corso et al., 2003b
    Corso, J. J., Ramey, N., & Hager, G. D. 2003b.
    Stereo-Based Direct Surface Tracking with Deformable Parametric Models.
    Tech. rept. The Johns Hopkins University.

    Corso et al., 2005
    Corso, J. J., Ye, G., & Hager, G. D. 2005.
    Analysis of Composite Gestures with a Coherent Probabilistic Graphical Model.
    Virtual Reality, 8(4), 242-252.

    Corso et al., 2008
    Corso, J. J., Ye, G., Burschka, D., & Hager, G. D. 2008.
    A Practical Paradigm and Platform for Video-Based Human-Computer Interaction.
    IEEE Computer, 42(5), 48-55.

    Cui & Weng, 2000
    Cui, Y., & Weng, J. 2000.
    Appearance-Based Hand Sign Recognition from Intensity Image Sequences.
    Computer Vision and Image Understanding, 78, 157-176.

    Darrell & Pentland, 1993
    Darrell, T., & Pentland, A. 1993.
    Space-Time Gestures.
    Pages 335-340 of: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition.

    Darrell et al., 1996
    Darrell, T. J., Essa, I. A., & Pentland, A. P. 1996.
    Task-specific Gesture Analysis in Real-Time using Interpolated Views.
    IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(1236-1242).

    Forsyth & Ponce, 2003
    Forsyth, D. A., & Ponce, J. 2003.
    Computer Vision.
    Prentice Hall.

    Fraundorfer & Bischof, 2003
    Fraundorfer, F., & Bischof, H. 2003.
    Detecting Distinguished Regions By Saliency.
    Pages 208-215 of: Scandinavian Conference on Image Analysis.

    Galata et al., 2001
    Galata, A., Johnson, N., & Hogg, D. 2001.
    Learning Variable-Length Markov Models of Behavior.
    Computer Vision and Image Understanding, 83(1), 398-413.

    Gavrila & Davis, 1995
    Gavrila, D. M., & Davis, L. S. 1995.
    Towards 3-D Model-Based Tracking and Recognition of Human Movement: A Multi-View Approach.
    International Conference on Automatic Face and Gesture Recognition.

    Ghahramani, 1998
    Ghahramani, Z. 1998.
    Learning Dynamic Bayesian Networks.
    Pages 168-197 of: Giles, C. L., & Gori, M. (eds), Adaptive Processing of Sequences and Data Structures, vol. Lecture Notes in Artificial Intelligence.

    Gorodnichy et al., 2002
    Gorodnichy, D. O., Malik, S., & Roth, G. 2002.
    Nouse Use Your Nose as a Mouse - A New Technology for Hands-free Games and Interfaces.
    Pages 354-361 of: International Conference on Vision Interface.

    Hager & Belhumeur, 1998
    Hager, G. D., & Belhumeur, P. N. 1998.
    Efficient Region Tracking with Parametric Models of Geometry and Illumination.
    IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(10), 1025-1039.

    Hager et al., 2004
    Hager, G. D., Dewan, M., & Stewart, C. V. 2004.
    Multiple Kernel Tracking with SSD.
    IEEE Conference on Computer Vision and Pattern Recognition.

    Hardenberg & Berard, 2001
    Hardenberg, C. von, & Berard, F. 2001.
    Bare-Hand Human-Computer Interaction.
    Pages 113-120 of: Perceptual User Interfaces.

    Harris & Stephens, 1988
    Harris, C., & Stephens, M. J. 1988.
    A Combined Corner and Edge Detector.
    Pages 147-151 of: ALVEY Vision Conference.

    Hartley & Zisserman, 2004
    Hartley, R. I., & Zisserman, A. 2004.
    Multiple View Geometry in Computer Vision. Second edn.
    Cambridge University Press, ISBN: 0521540518.

    Isard & Blake, 1998
    Isard, M., & Blake, A. 1998.
    CONDENSATION - conditional density propagation for visual tracking.
    International Journal of Computer Vision, 29(1), 5-28.

    Ivanov & Bobick, 2000
    Ivanov, Y. A., & Bobick, A. F. 2000.
    Recognition of Visual Activities and Interactions by Stochastic Parsing.
    IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8), 852-872.

    Kadir & Brady, 2001
    Kadir, T., & Brady, M. 2001.
    Saliency, Scale and Image Description.
    International Journal of Computer Vision, 43(2), 83-105.

    Kalman, 1960
    Kalman, R. E. 1960.
    A New Approach to Linear Filtering and Prediction Problems.
    Transactions of the ASMA; Journal of Basic Engineering, 82, 35-45.

    Kjeldsen et al., 2003
    Kjeldsen, R., Levas, A., & Pinhanez, C. 2003.
    Dynamically Reconfigurable Vision-Based User Interfaces.
    Pages 323-332 of: International Conference on Computer Vision Systems.

    Lowe, 2004
    Lowe, D. 2004.
    Distinctive Image Features from Scale-Invariant Keypoints.
    International Journal of Computer Vision, 60(2), 91-110.

    Lucas & Kanade, 1981
    Lucas, B.D., & Kanade, Takeo. 1981.
    An iterative Image Registration Technique with an Application to Stereo Vision.
    Pages 674-679 of: International Joint Conference on Artificial Intelligence.

    Matas et al., 2002
    Matas, J., Chum, O., Urban, M., & Pajdla, T. 2002.
    Robust Wide Baseline Stereo from Maximally Stable Extremal Regions.
    In: British Machine Vision Conference.

    Mikolajczyk & Schmid, 2004
    Mikolajczyk, K., & Schmid, C. 2004.
    Scale and Affine Invariant Interest Point Detectors.
    International Journal of Computer Vision, 60(1), 63-86.

    Murphy, 2002
    Murphy, K. 2002.
    Dynamic Bayesian Networks: Representation, Inference and Learning.
    Ph.D. thesis, University of California at Berkeley, Computer Science Division.

    Pentland & Liu, 1999
    Pentland, A., & Liu, A. 1999.
    Modeling and Prediciton of Human Behavior.
    Neural Computation, 11(1), 229-242.

    Rabiner, 1989
    Rabiner, L. 1989.
    A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition.
    Proceedings of the IEEE, 77(2), 257-286.

    Rehg & Kanade, 1994
    Rehg, J. M., & Kanade, Takeo. 1994.
    Visual Tracking of High DOF Articulated Structures: An Application to Human Hand Tracking.
    Pages 35-46 of: European Conference on Computer Vision, vol. 2.

    Shi & Tomasi, 1994
    Shi, Jianbo, & Tomasi, Carlo. 1994.
    Good Features to Track.
    IEEE Conference on Computer Vision and Pattern Recognition, 0(0).

    Simon et al., 2000
    Simon, G., Fitzgibbon, A. W., & Zisserman, A. 2000.
    Markerless Tracking Using Planar Structures in the Scene.
    Pages 137-146 of: International Symposium on Augmented Reality.

    Stauffer & Grimson, 1999
    Stauffer, C., & Grimson, W.E.L. 1999.
    Adaptive Background Mixture Modeling for Real-Time Tracking.
    Pages 246-252 of: IEEE Conference on Computer Vision and Pattern Recognition, vol. 2.

    Tong et al., 2007
    Tong, Y., Liao, W., Xue, Z., & Ji, Q. 2007.
    A Unified Probabilistic Framework for Facial Activity Modeling and Understanding.
    In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition.

    Toyama & Blake, 2002
    Toyama, K., & Blake, A. 2002.
    Probabilistic Tracking with Exemplars in a Metric Space.
    International Journal of Computer Vision, 48(1), 9-19.

    Trucco & Verri, 1998
    Trucco, E., & Verri, A. 1998.
    Introductory Techniques for 3-D Computer Vision.
    Prentice Hall.

    Utsumi & Ohya, 1999
    Utsumi, A., & Ohya, J. 1999.
    Multiple-Hand-Gesture Tracking Using Multiple Cameras.
    In: IEEE Conference on Computer Vision and Pattern Recognition.

    Welch & Bishop, 95
    Welch, G., & Bishop, G. 95.
    An Introduction to the Kalman Filter.
    Tech. rept. University of North Carolina at Chapel Hill.

    Wilson & Bobick, 1999
    Wilson, A., & Bobick, A. 1999.
    Parametric Hidden Markov Models for Gesture Recognition.
    IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(9), 884-900.

    Wren et al., 1997
    Wren, C. R., Azarbayejani, A., Darrell, T., & Pentland, A. 1997.
    Pfinder: Real-Time Tracking of the Human Body.
    IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7), 780-785.

    Ye et al., 2004
    Ye, G., Corso, J. J., Burschka, D., & Hager, G. D. 2004.
    VICs: A Modular HCI Framework Using Spatio-Temporal Dynamics.
    Machine Vision and Applications, 16(1), 13-20.

    Zhang, 2000
    Zhang, Z. 2000.
    A Flexible New Technique for Camera Calibration.
    IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(11), 1330-1334.

    Zhang et al., 2001
    Zhang, Z., Wu, Y., Shan, Y., & Shafer, S. 2001.
    Visual Panel: Virtual mouse, keyboard, and 3D Controller With an Ordinary Piece of Paper.
    In: Perceptual User Interfaces.

last updated: Sat Jun 21 07:38:45 2014; copyright jcorso