Challenge Description

In this Aerial View Activity Classification Challenge, participants classify human actions in low-resolution videos. The challenge aims to motivate researchers to explore techniques that address the difficulties of recognizing human actions in low-resolution footage. This is typically the case when videos are filmed from a distant viewpoint (e.g., aerial footage), a common setting in surveillance and military applications. We provide videos of a single person performing various actions, filmed from the top of the University of Texas at Austin's main tower. The average height of a human figure in this dataset is about 20 pixels. In addition to the low resolution, shadows and blurry visual cues pose further challenges. Participants are expected to classify each video into one of 9 categories of human actions: {1: pointing, 2: standing, 3: digging, 4: walking, 5: carrying, 6: running, 7: wave1, 8: wave2, 9: jumping}.

  [Figure: example frames for the nine action types in the Aerial View Challenge: Pointing, Standing, Digging, Walking, Carrying, Running, Wave1, Wave2, Jumping]


The UT-Tower dataset consists of 108 low-resolution video sequences covering 9 types of actions. Each of 6 individuals performs each action twice, yielding 12 clips per action. The dataset is composed of two types of scenes: a concrete square and a lawn. The video specifications for each scene are summarized as follows:

  Scene            Actions                                   Camera                  Resolution     Frame Rate
  concrete square  pointing, standing, digging, walking      stationary with jitter  360 x 240 [1]  10 fps
  lawn             carrying, running, wave1, wave2, jumping  stationary with jitter  360 x 240      10 fps

Ground truth labels for all action videos are provided for both training and testing. In addition, to alleviate segmentation and tracking issues and let participants focus on the classification problem, we provide ground truth bounding boxes as well as foreground masks for each video. Contestants are free to take advantage of them. Only the acting person is included in the bounding box.
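The provided bounding boxes and foreground masks can be used to isolate the actor before feature extraction. The sketch below illustrates one plausible way to do this; the (x, y, w, h) box layout and all function names here are assumptions for illustration, not the dataset's actual annotation format (consult the release for that).

```python
import numpy as np

def crop_actor(frame, box):
    """Crop the actor region from a frame.

    frame: (H, W) or (H, W, C) array; box: assumed (x, y, w, h) in pixels.
    """
    x, y, w, h = box
    return frame[y:y + h, x:x + w]

def apply_mask(frame, mask):
    """Zero out background pixels using a binary foreground mask."""
    return np.where(mask.astype(bool), frame, 0)

# Demo on a synthetic 360x240 grayscale frame (dataset resolution).
frame = np.zeros((240, 360), dtype=np.uint8)
patch = crop_actor(frame, (100, 50, 15, 20))  # roughly 20 px tall actor
assert patch.shape == (20, 15)

mask = np.zeros((240, 360), dtype=np.uint8)
mask[50:70, 100:115] = 1                      # foreground region
masked = apply_mask(frame, mask)
assert masked.shape == frame.shape
```

Cropping with the ground-truth box removes most background clutter, which matters at this scale since the actor occupies only a tiny fraction of each 360 x 240 frame.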

The complete dataset is available here (670MB): video sequence | bounding box

Performance Evaluation Methodology

The evaluation setting is similar to the traditional protocol often used with the Weizmann dataset. The performance of the systems created by contest participants will be evaluated using leave-one-out cross-validation, in which one video sequence at a time is held out for testing. Contestants are required to classify each test video into one of the 9 action categories and submit the resulting confusion matrix.
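The protocol above can be sketched as follows. This is a minimal illustration, not a reference implementation: each of the 108 clips is held out in turn, a classifier is trained on the remaining 107, and predictions accumulate into a 9x9 confusion matrix. The 1-NN classifier and the synthetic feature vectors are stand-ins for whatever method a participant actually uses.

```python
import numpy as np

N_ACTIONS = 9  # pointing, standing, digging, walking, carrying,
               # running, wave1, wave2, jumping

def nearest_neighbor_predict(train_X, train_y, test_x):
    """Toy 1-NN classifier standing in for a real recognition method."""
    dists = np.linalg.norm(train_X - test_x, axis=1)
    return train_y[np.argmin(dists)]

def leave_one_out_confusion(X, y):
    """X: (n_videos, n_features) array; y: integer labels in {0..8}."""
    conf = np.zeros((N_ACTIONS, N_ACTIONS), dtype=int)
    n = len(y)
    for i in range(n):
        held_in = np.arange(n) != i       # hold out video i
        pred = nearest_neighbor_predict(X[held_in], y[held_in], X[i])
        conf[y[i], pred] += 1             # rows: ground truth, cols: prediction
    return conf

# Synthetic demo: 108 videos, 12 per action, toy class-separated features.
rng = np.random.default_rng(0)
y = np.repeat(np.arange(N_ACTIONS), 12)
X = rng.normal(size=(108, 16)) + y[:, None]
conf = leave_one_out_confusion(X, y)
accuracy = np.trace(conf) / conf.sum()    # diagonal = correct classifications
```

Each row of the submitted matrix should therefore sum to 12 (the number of clips per action), and the overall matrix to 108.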


Please cite the following reference if you make use of the UT-Tower dataset in any form:

      @misc{UT-Tower-Data,
        author = "Chia-Chih Chen and M. S. Ryoo and J. K. Aggarwal",
        title = "{UT}-{T}ower {D}ataset: {A}erial {V}iew {A}ctivity {C}lassification {C}hallenge",
        year = "2010",
        howpublished = "\_View\_Activity.html"
      }

Optionally, you may also cite [2]:

      @inproceedings{chen2009farfield,
        author = "Chia-Chih Chen and J. K. Aggarwal",
        title = "Recognizing Human Action from a Far Field of View",
        booktitle = "IEEE Workshop on Motion and Video Computing (WMVC)",
        year = "2009"
      }

[1] Due to issues with right of publicity, we have masked the videos to contain only the actor.
[2] This paper presents results on the lawn scene actions.