A look into Google’s atomic visual actions (AVA) data set

Google on Thursday announced AVA, short for "atomic visual actions," a new labeled dataset of human actions taking place in videos, intended to help solve problems in computer vision.

Teaching machines to understand human actions in videos is a fundamental research problem in computer vision, and it is essential to applications such as personal video search and discovery, sports analysis, and gesture interfaces.

Recognizing human actions nevertheless remains a major challenge: actions are, by nature, less well-defined than objects in videos, which makes it difficult to construct a finely labeled action video dataset.

Google's AVA consists of URLs for publicly available YouTube videos, annotated with a set of 80 atomic actions (e.g. “walk”, “kick (an object)”, “shake hands”) that are spatiotemporally localized, resulting in 57.6k video segments, 96k labeled humans performing actions, and a total of 210k action labels.
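
To make the structure of these annotations concrete, here is a minimal sketch of how one such record could be represented and parsed in Python. The field names and the comma-separated layout are illustrative assumptions, not AVA's official file format.

```python
from dataclasses import dataclass

# Hypothetical layout for one AVA-style annotation row; the field names and
# column order are illustrative assumptions, not the official schema.
@dataclass
class ActionAnnotation:
    video_id: str          # YouTube video identifier (AVA ships URLs, not video files)
    segment_start: float   # start time of the 3-second segment, in seconds
    person_box: tuple      # (x1, y1, x2, y2) bounding box of the labeled person
    action_label: str      # one of the 80 atomic actions, e.g. "walk", "shake hands"

def parse_row(line: str) -> ActionAnnotation:
    """Parse a comma-separated annotation row into a structured record."""
    video_id, start, x1, y1, x2, y2, action = line.strip().split(",")
    return ActionAnnotation(
        video_id=video_id,
        segment_start=float(start),
        person_box=(float(x1), float(y1), float(x2), float(y2)),
        action_label=action,
    )

# Example: one labeled person performing one atomic action in one segment.
row = "abc123XYZ,902.0,0.12,0.08,0.45,0.91,shake hands"
print(parse_row(row))
```

Under this reading, each labeled person in a segment would contribute one record per action they perform, which is consistent with the counts above (96k labeled humans but 210k action labels).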

The video segments, pulled from publicly available YouTube videos, are labeled manually using a predefined vocabulary of 80 action types such as walking, kicking, or hugging.

Google analyzed a 15-minute clip from each video and uniformly partitioned it into 300 non-overlapping 3-second segments, a strategy that preserves sequences of actions in a coherent temporal context.
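
As a rough illustration of that partitioning (15 minutes is 900 seconds, which divides into 300 non-overlapping 3-second pieces), the short sketch below enumerates the segment boundaries; the constant and function names are my own, not part of Google's pipeline.

```python
# Sketch of the segmentation strategy described above: a 15-minute clip
# split into non-overlapping 3-second segments (15 * 60 / 3 = 300).
CLIP_LENGTH_S = 15 * 60   # 900 seconds analyzed per video
SEGMENT_LENGTH_S = 3      # each segment is 3 seconds long

def segment_boundaries(clip_start_s: float = 0.0):
    """Yield (start, end) times for each non-overlapping 3-second segment."""
    n_segments = CLIP_LENGTH_S // SEGMENT_LENGTH_S  # 300 segments
    for i in range(n_segments):
        start = clip_start_s + i * SEGMENT_LENGTH_S
        yield (start, start + SEGMENT_LENGTH_S)

segments = list(segment_boundaries())
print(len(segments))              # 300
print(segments[0], segments[-1])  # (0.0, 3.0) ... (897.0, 900.0)
```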

AVA's unique design also reveals some interesting statistics that are not available in other existing datasets.

Google promises to continue expanding and improving AVA and is eager to hear feedback from the community to help guide future directions.