Effectively recognizing human gestures from variant viewpoints plays a fundamental role in the successful collaboration between humans and robots. Deep learning approaches have achieved promising performance in gesture recognition. However, they are usually data-hungry and require large-scale labeled data, which are not usually accessible in a practical setting. Synthetic data, on the other hand, can be easily obtained from simulators with fine-grained annotations and variant modalities. Existing state-of-the-art approaches have shown promising results using synthetic data, but there is still a large performance gap between the models trained on synthetic data and real data. To learn domain-invariant feature representations, we propose a novel approach which jointly takes RGB videos and 3D meshes as inputs to perform robust action recognition. We empirically validate our model on the RoCoG-v2 dataset, which consists of a variety of real and synthetic videos of gestures from the ground and air perspectives. We show that our model trained on synthetic data can outperform state-of-the-art models under the same training setting and models trained on real data.
Access to the requested content is limited to institutions that have purchased or subscribe to SPIE eBooks.
You are receiving this notice because your organization may not have SPIE eBooks access.*
*Shibboleth/Open Athens users─please
sign in
to access your institution's subscriptions.
To obtain this item, you may purchase the complete book in print or electronic format on
SPIE.org.
INSTITUTIONAL Select your institution to access the SPIE Digital Library.
PERSONAL Sign in with your SPIE account to access your personal subscriptions or to use specific features such as save to my library, sign up for alerts, save searches, etc.