Hao Li
- Papers: 84
Authored papers
Lance: Unified Multimodal Modeling by Multi-Task Synergy
arXiv 2026
The Python Simulations of Chemistry Framework: 10 years of an open-source quantum chemistry project
arXiv 2026
SpatialBench: Is Your Spatial Foundation Model an All-Round Player?
arXiv 2026
GigaBrain-0.5M*: a VLA That Learns From World Model-Based Reinforcement Learning
arXiv 2026
MetricAnything: Scaling Metric Depth Pretraining with Noisy Heterogeneous Sources
arXiv 2026
ViVa: A Video-Generative Value Model for Robot Reinforcement Learning
arXiv 2026
GigaWorld-Policy: An Efficient Action-Centered World–Action Model
arXiv 2026
AutoResearchBench: Benchmarking AI Agents on Complex Scientific Literature Discovery
arXiv 2026
Holi-Spatial: Evolving Video Streams into Holistic 3D Spatial Intelligence
arXiv 2026
Organizing, Orchestrating, and Benchmarking Agent Skills at Ecosystem Scale
arXiv 2026
Spatial-TTT: Streaming Visual-based Spatial Intelligence with Test-Time Training
arXiv 2026
Uni-ViGU: Towards Unified Video Generation and Understanding via A Diffusion-Based Video Generator
arXiv 2026
AgentKernelArena: Generalization-Aware Benchmarking of GPU Kernel Optimization Agents
arXiv 2026
From Masks to Pixels and Meaning: A New Taxonomy, Benchmark, and Metrics for VLM Image Tampering
arXiv 2026
Chain of World: World Model Thinking in Latent Motion
arXiv 2026
AgentSys: Secure and Dynamic LLM Agents Through Explicit Hierarchical Memory Management
arXiv 2026
Rethinking VLM Representation for VLA Initialization
arXiv 2026
Agent READMEs: An Empirical Study of Context Files for Agentic Coding
arXiv 2025
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
arXiv 2025
Bridging Evolutionary Multiobjective Optimization and GPU Acceleration via Tensorization
arXiv 2025
Step1X-3D: Towards High-Fidelity and Controllable Generation of Textured 3D Assets
arXiv 2025
Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing
arXiv 2025
Give Me FP32 or Give Me Death? Challenges and Solutions for Reproducible Reasoning
arXiv 2025
T2I-R1: Reinforcing Image Generation with Collaborative Semantic-level and Token-level CoT
arXiv 2025
The Avengers: A Simple Recipe for Uniting Smaller Language Models to Challenge Proprietary Giants
arXiv 2025
DiffPortrait360: Consistent Portrait Diffusion for 360 View Synthesis
CVPR 2025
SOAP: Style-Omniscient Animatable Portraits
arXiv 2025
ZeroGUI: Automating Online GUI Learning at Zero Human Cost
arXiv 2025
Rethinking Text-based Protein Understanding: Retrieval or LLM?
arXiv 2025
IPO: Iterative Preference Optimization for Text-to-Video Generation
arXiv 2025
Cockatiel: Ensembling Synthetic and Human Preferenced Training for Detailed Video Caption
arXiv 2025
AI Meets Brain: Memory Systems from Cognitive Neuroscience to Autonomous Agents
arXiv 2025
IGGT: Instance-Grounded Geometry Transformer for Semantic 3D Reconstruction
arXiv 2025
F1: A Vision-Language-Action Model Bridging Understanding and Generation to Actions
arXiv 2025
InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy
arXiv 2025
InstructVLA: Vision-Language-Action Instruction Tuning from Understanding to Manipulation
arXiv 2025
NaViL: Rethinking Scaling Properties of Native Multimodal Large Language Models under Data Constraints
arXiv 2025
LangScene-X: Reconstruct Generalizable 3D Language-Embedded Scenes with TriMap Video Diffusion
arXiv 2025
Omni-Video: Democratizing Unified Video Understanding and Generation
arXiv 2025
Mono-InternVL-1.5: Towards Cheaper and Faster Monolithic Multimodal Large Language Models
arXiv 2025
Character Mixing for Video Generation
arXiv 2025
Sequential Diffusion Language Models
arXiv 2025
Beyond Multiple Choice: Verifiable OpenQA for Robust Vision-Language RFT
arXiv 2025
Hierarchical Budget Policy Optimization for Adaptive Reasoning
arXiv 2025
Adaptive Markup Language Generation for Contextually-Grounded Visual Document Understanding
CVPR 2025
Parameter-Inverted Image Pyramid Networks for Visual Perception and Multimodal Understanding
arXiv 2025
UniFork: Exploring Modality Alignment for Unified Multimodal Understanding and Generation
arXiv 2025
GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing
arXiv 2025
Can Understanding and Generation Truly Benefit Together -- or Just Coexist?
arXiv 2025
Derail Yourself: Multi-turn LLM Jailbreak Attack through Self-discovered Clues
arXiv 2024
LLaVA-CoT: Let Vision Language Models Reason Step-by-Step
arXiv 2024
InjecGuard: Benchmarking and Mitigating Over-defense in Prompt Injection Guardrail Models
arXiv 2024
FaceVid-1K: A Large-Scale High-Quality Multiracial Human Face Video Dataset
arXiv 2024
Instance Brownian Bridge as Texts for Open-vocabulary Video Instance Segmentation
arXiv 2024
LAN: Learning Adaptive Neighbors for Real-Time Insider Threat Detection
arXiv 2024
ProLLaMA: A Protein Large Language Model for Multi-Task Protein Language Processing
arXiv 2024
FuXi Weather: A data-to-forecast machine learning system for global weather
arXiv 2024
PUMA: Empowering Unified MLLM with Multi-granular Visual Generation
ICCV 2025
Hello Again! LLM-powered Personalized Agent for Long-term Dialogue
arXiv 2024
The state-of-the-art in Cardiac MRI Reconstruction: Results of the CMRxRecon Challenge in MICCAI 2023
arXiv 2024
Grasp as You Say: Language-guided Dexterous Grasp Generation
arXiv 2024
VLSBench: Unveiling Visual Leakage in Multimodal Safety
arXiv 2024
An Empirical Study and Analysis of Text-to-Image Generation Using Large Language Model-Powered Textual Representation
arXiv 2024
Which Side Are You On? A Multi-task Dataset for End-to-End Argument Summarisation and Evaluation
arXiv 2024
Predicting fluorescent labels in label-free microscopy images with pix2pix and adaptive loss in Light My Cells challenge
arXiv 2024
Retinal IPA: Iterative KeyPoints Alignment for Multimodal Retinal Imaging
arXiv 2024
Chatlaw: A Multi-Agent Collaborative Legal Assistant with Knowledge Graph Enhanced Mixture-of-Experts Large Language Model
arXiv 2023
Learning A Sparse Transformer Network for Effective Image Deraining
CVPR 2023
XMem++: Production-level Video Segmentation From Few Annotated Frames
ICCV 2023
DiffusionRet: Generative Text-Video Retrieval with Diffusion Model
ICCV 2023
InfMLLM: A Unified Framework for Visual-Language Tasks
arXiv 2023
NDC-Scene: Boost Monocular 3D Semantic Scene Completion in Normalized Device Coordinates Space
ICCV 2023
ADDP: Learning General Representations for Image Recognition and Generation with Alternating Denoising Diffusion Process
arXiv 2023
Taiyi: A Bilingual Fine-Tuned Large Language Model for Diverse Biomedical Tasks
arXiv 2023
MonoNeRD: NeRF-like Representations for Monocular 3D Object Detection
ICCV 2023
FreestyleRet: Retrieving Images from Style-Diversified Queries
arXiv 2023
Text-Video Retrieval with Disentangled Conceptualization and Set-to-Set Alignment
arXiv 2023
Detecting Line Segments in Motion-blurred Images with Events
arXiv 2022
Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks
CVPR 2023
GiraffeDet: A Heavy-Neck Paradigm for Object Detection
Spatiotemporal Entropy Model is All You Need for Learned Video Compression
arXiv 2021
Neural Architecture Design for GPU-Efficient Networks
arXiv 2020
Position-Aware Tagging for Aspect Sentiment Triplet Extraction
EMNLP 2020
Visualizing the Loss Landscape of Neural Nets
Affiliations
Frequent co-authors
10 from 84 papers