Siyuan Huang
- Papers: 38
Authored papers (38)
DexJoCo: A Benchmark and Toolkit for Task-Oriented Dexterous Manipulation on MuJoCo
arXiv 2026
GEMS: Agent-Native Multimodal Generation with Memory and Skills
arXiv 2026
Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs
arXiv 2026
LatentMem: Customizing Latent Memory for Multi-Agent Systems
arXiv 2026
Rethinking VLM Representation for VLA Initialization
arXiv 2026
EWMBench: Evaluating Scene, Motion, and Semantic Quality in Embodied World Models
arXiv 2025
ArtGS: Building Interactable Replicas of Complex Articulated Objects via Gaussian Splatting
arXiv 2025
Pretraining Language Models to Ponder in Continuous Space
arXiv 2025
DiffThinker: Towards Generative Multimodal Reasoning with Diffusion Models
arXiv 2025
Simultaneous Tactile-Visual Perception for Learning Multimodal Robot Manipulation
arXiv 2025
VideoSSR: Video Self-Supervised Reinforcement Learning
arXiv 2025
Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation
arXiv 2025
FrameThinker: Learning to Think with Long Videos via Multi-Turn Frame Spotlighting
arXiv 2025
Spotlight on Token Perception for Multimodal Reinforcement Learning
arXiv 2025
Unveiling the Mist over 3D Vision-Language Understanding: Object-centric Evaluation with Chain-of-Analysis
CVPR 2025 1
SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models
arXiv 2024
PixWizard: Versatile Image-to-Image Visual Assistant with Open-Language Instructions
arXiv 2024
Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want
arXiv 2024
A3VLM: Actionable Articulation-Aware Vision Language Model
arXiv 2024
Not All Experts are Equal: Efficient Expert Pruning and Skipping for Mixture-of-Experts Large Language Models
arXiv 2024
ManipVQA: Injecting Robotic Affordance and Physically Grounded Information into Multi-Modal Large Language Models
arXiv 2024
Multi-modal Situated Reasoning in 3D Scenes
arXiv 2024
Ag2Manip: Learning Novel Manipulation Skills with Agent-Agnostic Visual and Action Representations
arXiv 2024
3D Vision and Language Pretraining with Large-Scale Synthetic Data
arXiv 2024
Graph Parsing Networks
arXiv 2024
UniAff: A Unified Representation of Affordances for Tool Usage and Articulation with Vision-Language Models
arXiv 2024
An Embodied Generalist Agent in 3D World
arXiv 2023
Instruct2Act: Mapping Multi-modality Instructions to Robotic Actions with Large Language Model
arXiv 2023
ARNOLD: A Benchmark for Language-Grounded Task Learning With Continuous States in Realistic 3D Scenes
ICCV 2023 1
MLLM-DataEngine: An Iterative Refinement Approach for MLLM
arXiv 2023
Prompt, Generate, then Cache: Cascade of Foundation Models makes Strong Few-shot Learners
CVPR 2023 1
Diffusion-based Generation, Optimization, and Planning in 3D Scenes
CVPR 2023 1
SUG: Single-dataset Unified Generalization for 3D Point Cloud Classification
arXiv 2023
GAPartNet: Cross-Category Domain-Generalizable Object Perception and Manipulation via Generalizable and Actionable Parts
arXiv 2022
SQA3D: Situated Question Answering in 3D Scenes
arXiv 2022
Full-Body Articulated Human-Object Interaction
ICCV 2023 1
Perceive, Ground, Reason, and Act: A Benchmark for General-purpose Visual Representation
arXiv 2022
Inter-GPS: Interpretable Geometry Problem Solving with Formal Language and Symbolic Reasoning
ACL 2021 5
Affiliations
Frequent co-authors
10 from 38 papers