
Siyuan Huang

Papers
38


Authored papers

38

DexJoCo: A Benchmark and Toolkit for Task-Oriented Dexterous Manipulation on MuJoCo

arXiv 2026

2026

GEMS: Agent-Native Multimodal Generation with Memory and Skills

arXiv 2026

2026

Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs

arXiv 2026

2026

LatentMem: Customizing Latent Memory for Multi-Agent Systems

arXiv 2026

2026

Rethinking VLM Representation for VLA Initialization

arXiv 2026

2026

EWMBench: Evaluating Scene, Motion, and Semantic Quality in Embodied World Models

arXiv 2025

2025

ArtGS: Building Interactable Replicas of Complex Articulated Objects via Gaussian Splatting

arXiv 2025

2025

Pretraining Language Models to Ponder in Continuous Space

arXiv 2025

2025

DiffThinker: Towards Generative Multimodal Reasoning with Diffusion Models

arXiv 2025

2025

Simultaneous Tactile-Visual Perception for Learning Multimodal Robot Manipulation

arXiv 2025

2025

VideoSSR: Video Self-Supervised Reinforcement Learning

arXiv 2025

2025

Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation

arXiv 2025

2025

FrameThinker: Learning to Think with Long Videos via Multi-Turn Frame Spotlighting

arXiv 2025

2025

Spotlight on Token Perception for Multimodal Reinforcement Learning

arXiv 2025

2025

Unveiling the Mist over 3D Vision-Language Understanding: Object-centric Evaluation with Chain-of-Analysis

CVPR 2025 1

2025

SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models

arXiv 2024

2024

PixWizard: Versatile Image-to-Image Visual Assistant with Open-Language Instructions

arXiv 2024

2024

Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want

arXiv 2024

2024

A3VLM: Actionable Articulation-Aware Vision Language Model

arXiv 2024

2024

Not All Experts are Equal: Efficient Expert Pruning and Skipping for Mixture-of-Experts Large Language Models

arXiv 2024

2024

ManipVQA: Injecting Robotic Affordance and Physically Grounded Information into Multi-Modal Large Language Models

arXiv 2024

2024

Multi-modal Situated Reasoning in 3D Scenes

arXiv 2024

2024

Ag2Manip: Learning Novel Manipulation Skills with Agent-Agnostic Visual and Action Representations

arXiv 2024

2024

3D Vision and Language Pretraining with Large-Scale Synthetic Data

arXiv 2024

2024

Graph Parsing Networks

arXiv 2024

2024

UniAff: A Unified Representation of Affordances for Tool Usage and Articulation with Vision-Language Models

arXiv 2024

2024

An Embodied Generalist Agent in 3D World

arXiv 2023

2023

Instruct2Act: Mapping Multi-modality Instructions to Robotic Actions with Large Language Model

arXiv 2023

2023

ARNOLD: A Benchmark for Language-Grounded Task Learning With Continuous States in Realistic 3D Scenes

ICCV 2023 1

2023

MLLM-DataEngine: An Iterative Refinement Approach for MLLM

arXiv 2023

2023

Prompt, Generate, then Cache: Cascade of Foundation Models makes Strong Few-shot Learners

CVPR 2023 1

2023

Diffusion-based Generation, Optimization, and Planning in 3D Scenes

CVPR 2023 1

2023

SUG: Single-dataset Unified Generalization for 3D Point Cloud Classification

arXiv 2023

2023

GAPartNet: Cross-Category Domain-Generalizable Object Perception and Manipulation via Generalizable and Actionable Parts

arXiv 2022

2022

SQA3D: Situated Question Answering in 3D Scenes

arXiv 2022

2022

Full-Body Articulated Human-Object Interaction

ICCV 2023 1

2022

Perceive, Ground, Reason, and Act: A Benchmark for General-purpose Visual Representation

arXiv 2022

2022

Inter-GPS: Interpretable Geometry Problem Solving with Formal Language and Symbolic Reasoning

ACL 2021 5

2021

Affiliations

No known affiliations.

Frequent co-authors: 10 (from 38 papers)