MEgoHand: Multimodal Egocentric Hand-Object Interaction Motion Generation


PKU     THU     BeingBeyond
* Equal contribution
§ Corresponding author

Abstract

Egocentric hand-object motion generation is crucial for immersive AR/VR and robotic imitation but remains challenging due to unstable viewpoints, self-occlusions, perspective distortion, and noisy ego-motion. Existing methods rely on predefined 3D object priors, which restricts their generalizability to novel objects. Meanwhile, recent multimodal approaches suffer from ambiguous generation from abstract textual cues, intricate pipelines for modeling 3D hand-object correlation, and compounding errors in open-loop prediction. We propose MEgoHand, a multimodal model that synthesizes physically plausible hand-object interactions from egocentric RGB, text, and an initial hand pose. MEgoHand introduces a bi-level architecture: a high-level “cerebrum” leverages a vision-language model (VLM) to infer motion priors from visual-textual context and a monocular depth estimator for object-agnostic spatial reasoning, while a low-level DiT-based motion decoder generates fine-grained trajectories via flow matching, aided by Temporal Orthogonal Filtering to enhance smoothness. To unify heterogeneous datasets, we design an Inverse MANO Retargeting Network and a Virtual RGB-D Renderer, curating a unified dataset of 3.35M RGB-D frames, 24K interactions, and 1.2K objects. Extensive experiments across 5 in-domain and 2 cross-domain datasets demonstrate the effectiveness of MEgoHand, achieving substantial reductions in wrist translation error (86.9%) and joint rotation error (34.1%) and highlighting its capacity to accurately model fine-grained hand joint structures and generalize robustly across diverse scenarios.

Multimodal Egocentric Hand Motion Generator

[Pipeline] Built upon Eagle-2, MEgoHand predicts an H-step sequence of future MANO parameters. The system prompt and task instruction are encoded with the frozen VLM tokenizer. At each timestep, the RGB image is processed by a pretrained depth estimator to obtain a metric depth map. The RGB and depth images are then fused and encoded into a visual embedding, which is combined with the text embedding and fed into the frozen VLM. A DiT-based motion generator receives this multimodal representation along with the initial hand parameters and predicts the relative future hand motion.
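To make the dataflow concrete, below is a minimal PyTorch sketch of one closed-loop generation step. The callables depth_estimator, vlm_encode, and motion_dit, as well as the horizon and parameter dimensions, are illustrative stand-ins for the pretrained depth estimator, frozen VLM, and DiT motion decoder, not the exact interfaces used by MEgoHand.

import torch

@torch.no_grad()
def generate_chunk(rgb, text_emb, init_hand, depth_estimator, vlm_encode, motion_dit,
                   horizon=16, mano_dim=51, num_steps=10):
    """Predict an H-step chunk of relative future MANO parameters for one egocentric frame."""
    depth = depth_estimator(rgb)                  # (B, 1, H, W) metric depth map
    rgbd = torch.cat([rgb, depth], dim=1)         # fuse RGB and depth channels
    context = vlm_encode(rgbd, text_emb)          # (B, L, D) multimodal tokens from the frozen VLM

    # Flow matching: integrate a learned velocity field from Gaussian noise to motion.
    x = torch.randn(rgb.size(0), horizon, mano_dim, device=rgb.device)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((rgb.size(0),), i * dt, device=rgb.device)
        v = motion_dit(x, t, context, init_hand)  # DiT predicts the velocity field
        x = x + dt * v                            # Euler integration step
    return x                                      # (B, horizon, mano_dim) relative hand motion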





[Decoding Strategy] To ensure temporal coherence in the generated motion sequences, we propose Temporal Orthogonal Filtering (TOF), a training-free decoding strategy that denoises predicted rotation sequences. At each timestep, we query the motion generator to produce overlapping motion chunks. To suppress high-frequency jitter, a temporal convolution with uniform weights aggregates all rotation and translation estimates corresponding to the same timestep. The resulting convolved rotation is then projected back onto the SO(3) manifold via Singular Value Decomposition (SVD), yielding the closest valid rotation. The query frequency can be adjusted freely to balance inference speed and generation quality. As the figure below shows, the motion predicted without smooth decoding exhibits more fluctuations, indicating that the smoothed decoding strategy is effective in mitigating jitter.
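As a concrete reference, here is a minimal PyTorch sketch of TOF under the assumption that predictions are collected as overlapping chunks of per-joint rotation matrices; the function names and chunk bookkeeping are illustrative. Translation estimates can be aggregated with the same uniform averaging, without the SVD projection step.

import torch

def project_to_so3(R):
    """Project (possibly non-orthogonal) 3x3 matrices onto SO(3) via SVD."""
    U, _, Vh = torch.linalg.svd(R)
    det = torch.det(U @ Vh)
    # Flip the last singular direction if needed so the determinant stays +1.
    S = torch.diag_embed(torch.stack([torch.ones_like(det), torch.ones_like(det), det], dim=-1))
    return U @ S @ Vh

def temporal_orthogonal_filter(chunks, starts, total_len):
    """Average overlapping rotation estimates per timestep, then re-orthogonalize.

    chunks: list of (H, J, 3, 3) rotation tensors, one per query
    starts: list of start timesteps for each chunk
    """
    J = chunks[0].shape[1]
    acc = torch.zeros(total_len, J, 3, 3)
    cnt = torch.zeros(total_len, 1, 1, 1)
    for chunk, s in zip(chunks, starts):
        H = chunk.shape[0]
        acc[s:s + H] += chunk          # uniform-weight temporal aggregation
        cnt[s:s + H] += 1.0
    mean_R = acc / cnt.clamp(min=1.0)  # convolved (averaged) rotations
    return project_to_so3(mean_R)      # snap the averages back onto SO(3)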



Dataset Curation

["Can scaling larger HOI data benefit motion generation?"] MEgohand says "yes"! Existing HOI datasets vary in language instructions, annotation quality, and hand pose representations. We systematically integrate and preprocess large-scale public datasets into a unified and standardized training corpus. Two prominent challenges get in the way:


1. Some datasets, such as FPHA, provide only 3D hand joint positions captured with wearable sensors instead of MANO parameters. We design an Inverse MANO Retargeting Network that recovers MANO parameters from raw hand keypoints (a fitting sketch follows this list).





2. Many existing datasets, such as ARCTIC, HOT3D, and OakInk2, provide RGB image sequences and object models without corresponding depth maps. We design a Virtual RGB-D Renderer to synthesize depth images aligned with the available RGB frames (see the rendering sketch below).
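For item 1, the following is a minimal optimization-based sketch of the keypoint-to-MANO fitting problem. It assumes a hypothetical differentiable mano_layer (e.g., from the smplx or manotorch packages) whose joint ordering matches the dataset's keypoint convention; it illustrates the underlying fitting objective, not the retargeting network itself.

import torch

def fit_mano_to_keypoints(target_joints, mano_layer, iters=300, lr=1e-2):
    """target_joints: (T, J, 3) sensor-captured 3D hand keypoints."""
    T = target_joints.shape[0]
    pose = torch.zeros(T, 48, requires_grad=True)   # axis-angle: global orient + 15 joints
    shape = torch.zeros(1, 10, requires_grad=True)  # betas shared across the sequence
    trans = torch.zeros(T, 3, requires_grad=True)
    opt = torch.optim.Adam([pose, shape, trans], lr=lr)

    for _ in range(iters):
        opt.zero_grad()
        joints = mano_layer(pose, shape.expand(T, -1), trans)   # (T, J, 3) predicted joints
        loss = ((joints - target_joints) ** 2).mean()           # keypoint reprojection term
        loss = loss + 1e-3 * (pose[:, 3:] ** 2).mean()          # mild articulation regularizer
        loss = loss + 1e-2 * ((pose[1:] - pose[:-1]) ** 2).mean()  # temporal smoothness
        loss.backward()
        opt.step()
    return pose.detach(), shape.detach(), trans.detach()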





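For item 2, below is a minimal sketch of rendering a depth map aligned with an RGB frame using pyrender, given the object mesh, its object-to-camera pose, and the camera intrinsics. Variable names are illustrative, a posed hand mesh can be added to the scene in the same way, and this shows the general technique rather than the exact renderer used in the paper.

import numpy as np
import trimesh
import pyrender

def render_depth(mesh_path, object_pose, K, width, height):
    """object_pose: 4x4 object-to-camera transform; K: 3x3 camera intrinsics."""
    scene = pyrender.Scene()
    mesh = pyrender.Mesh.from_trimesh(trimesh.load(mesh_path, force='mesh'))
    scene.add(mesh, pose=object_pose)

    camera = pyrender.IntrinsicsCamera(fx=K[0, 0], fy=K[1, 1], cx=K[0, 2], cy=K[1, 2])
    # pyrender cameras follow the OpenGL convention (looking down -Z), so flip the Y/Z
    # axes relative to a computer-vision camera placed at the origin.
    cv_to_gl = np.diag([1.0, -1.0, -1.0, 1.0])
    scene.add(camera, pose=cv_to_gl)

    renderer = pyrender.OffscreenRenderer(viewport_width=width, viewport_height=height)
    _, depth = renderer.render(scene)   # depth in the same metric units as the mesh
    renderer.delete()
    return depth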
Finally, we curate a unified multimodal dataset covering 3.35M RGB-D frames, 24K interaction trajectories, and 1.2K objects, 15× larger than previous work.





This initiative provides a solid foundation for building robust, universally applicable motion models and offers a comprehensive testbed for future research. Trained on the curated dataset, MEgoHand outperforms the baseline LatentAct on both in-domain and cross-domain benchmarks.



Visualizations

Conclusion

In this paper, we explore how to advance the field of multimodal egocentric HOI motion generation. To this end, we introduce MEgoHand, a multimodal model for egocentric hand motion generation that integrates initial hand parameters, textual instructions, and RGB images to predict realistic hand-object interaction motion sequences. Its hierarchical design combines a vision-language model and depth estimation for semantic understanding and 3D reasoning. A DiT-based motion generator performs closed-loop prediction, enhanced by Temporal Orthogonal Filtering for temporal consistency. To address data scarcity, we curate a million-scale HOI dataset by leveraging inverse MANO retargeting and virtual RGB-D rendering. As an initial attempt to unify vision-language models with 3D reasoning for motion generation, MEgoHand demonstrates strong generalization capabilities and achieves SOTA results.

BibTeX

@article{zhou2025megohand,
  title={MEgoHand: Multimodal Egocentric Hand-Object Interaction Motion Generation},
  author={Zhou, Bohan and Zhan, Yi and Zhang, Zhongbin and Lu, Zongqing},
  journal={arXiv preprint arXiv:2505.16602},
  year={2025}
}