HA-ViD is a human assembly video dataset that records participants assembling our custom-designed Generic Assembly Box (GAB). It is benchmarked on four foundational video understanding tasks, with analysis of how well current methods comprehend application-oriented assembly knowledge.
HA-ViD stands out in three key aspects:
To represent real-world industrial assembly scenarios, we designed the GAB: a 250 × 250 × 250 mm box comprising 35 standard and non-standard parts commonly used in industrial assembly. Four standard tools are required to assemble it. The GAB includes three plates with different task precedence and collaboration requirements, providing contextual links between actions and enabling situational action understanding.
The CAD files, bill of materials, and instructions to replicate the GAB can be downloaded below.
The subject-agnostic task precedence graphs (SA-TPGs) of the GAB plates can also be downloaded below as .owl files.
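Because the SA-TPGs are distributed as OWL ontologies, they can be inspected with any OWL-aware library. Below is a minimal sketch using the owlready2 Python package; the file name is a placeholder, so check the downloaded ontology for the actual class and property names before querying it.

```python
from owlready2 import get_ontology

# Load the downloaded SA-TPG ontology.
# "gab_plate1_satpg.owl" is a placeholder file name.
onto = get_ontology("file://./gab_plate1_satpg.owl").load()

# List the classes and object properties defined in the ontology
# to discover the actual schema before querying it.
print("Classes:", list(onto.classes()))
print("Object properties:", list(onto.object_properties()))

# Enumerate all individuals (e.g., assembly tasks) and the
# properties asserted on them, such as precedence relations.
for individual in onto.individuals():
    print(individual.name, individual.get_properties())
```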
Data was collected with three Azure Kinect RGB-D cameras mounted on an assembly workbench, facing the participant from the left, front, and top views.
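For readers who want to replicate a similar capture setup, here is a minimal sketch of grabbing color and depth frames from a single Azure Kinect using the pyk4a Python bindings. The camera settings are illustrative only, not the configuration used for HA-ViD.

```python
import pyk4a
from pyk4a import Config, PyK4A

# Open a single Azure Kinect with illustrative settings;
# HA-ViD's actual capture configuration may differ.
k4a = PyK4A(
    Config(
        color_resolution=pyk4a.ColorResolution.RES_1080P,
        depth_mode=pyk4a.DepthMode.NFOV_UNBINNED,
        synchronized_images_only=True,
    )
)
k4a.start()

# Grab one capture: a BGRA color image plus a depth map
# warped into the color camera's frame of reference.
capture = k4a.get_capture()
color = capture.color               # (H, W, 4) uint8 BGRA image
depth = capture.transformed_depth   # depth registered to the color image

print(color.shape, depth.shape)
k4a.stop()
```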
To capture how humans acquire procedural knowledge and how their behavior changes during learning, we designed a three-stage progressive assembly setup. The stages are:
The instructions provided during the discovery and instruction stages can be downloaded below.
To enable human-robot assembly knowledge transfer, we created structured temporal annotations following HR-SAT (Human-Robot Shared Assembly Taxonomy). HR-SAT ensures that annotations are transferable, adaptable, and consistent. The HR-SAT structure is shown briefly below.
For more information, visit the HR-SAT website, where you can also find the HR-SAT supplementary material containing definitions of the action verbs.
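As a concrete illustration, a single temporal annotation can be thought of as a labeled time interval whose label decomposes into HR-SAT-style components (subject, action verb, objects, and tool). Below is a minimal sketch of such a record in Python; the field names are hypothetical and do not reflect the dataset's actual file schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TemporalAnnotation:
    """One annotated segment of an assembly video.

    Field names are hypothetical; consult the released
    annotation files for the actual schema.
    """
    start_frame: int
    end_frame: int
    subject: str                          # e.g., "right hand"
    action_verb: str                      # HR-SAT action verb, e.g., "screw"
    manipulated_object: str               # part being moved
    target_object: Optional[str] = None   # part/location acted upon
    tool: Optional[str] = None            # tool used, if any

# Illustrative example of one annotated segment.
segment = TemporalAnnotation(
    start_frame=120,
    end_frame=245,
    subject="right hand",
    action_verb="screw",
    manipulated_object="bolt",
    target_object="plate",
    tool="screwdriver",
)
print(segment)
```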
The video below demonstrates the temporal annotation process.
For spatial annotations, we use CVAT, an open-source video annotation tool, to label bounding boxes for subjects, objects, and tools frame by frame.
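If the spatial annotations are exported in CVAT's "CVAT for video 1.1" XML format, each object becomes a track element containing one box per frame, which can be parsed with the Python standard library as sketched below. The file name is a placeholder, and the export format of the released annotations may differ.

```python
import xml.etree.ElementTree as ET

# Placeholder file name; CVAT's "CVAT for video 1.1" export
# stores one <track> per object, with one <box> per frame.
tree = ET.parse("annotations.xml")
root = tree.getroot()

for track in root.iter("track"):
    label = track.get("label")           # e.g., "screwdriver"
    for box in track.iter("box"):
        if box.get("outside") == "1":    # object not visible in this frame
            continue
        frame = int(box.get("frame"))
        # Corner coordinates in pixels: top-left and bottom-right.
        xtl, ytl = float(box.get("xtl")), float(box.get("ytl"))
        xbr, ybr = float(box.get("xbr")), float(box.get("ybr"))
        print(label, frame, (xtl, ytl, xbr, ybr))
```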
This project was funded by The University of Auckland FRDF New Staff Research Fund (No. 3720540).
HA-ViD is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License.
@misc{zheng2023havid,
  title={HA-ViD: A Human Assembly Video Dataset for Comprehensive Assembly Knowledge Understanding},
  author={Hao Zheng and Regina Lee and Yuqian Lu},
  year={2023},
  eprint={2307.05721},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}