Hao Li

Papers
84

Attribution

Affiliations & profile
Semantic Scholar

Authored papers

84

Lance: Unified Multimodal Modeling by Multi-Task Synergy

arXiv 2026

2026

The Python Simulations of Chemistry Framework: 10 years of an open-source quantum chemistry project

arXiv 2026

2026

SpatialBench: Is Your Spatial Foundation Model an All-Round Player?

arXiv 2026

2026

GigaBrain-0.5M*: a VLA That Learns From World Model-Based Reinforcement Learning

arXiv 2026

2026

MetricAnything: Scaling Metric Depth Pretraining with Noisy Heterogeneous Sources

arXiv 2026

2026

ViVa: A Video-Generative Value Model for Robot Reinforcement Learning

arXiv 2026

2026

GigaWorld-Policy: An Efficient Action-Centered World-Action Model

arXiv 2026

2026

AutoResearchBench: Benchmarking AI Agents on Complex Scientific Literature Discovery

arXiv 2026

2026

Holi-Spatial: Evolving Video Streams into Holistic 3D Spatial Intelligence

arXiv 2026

2026

Organizing, Orchestrating, and Benchmarking Agent Skills at Ecosystem Scale

arXiv 2026

2026

Spatial-TTT: Streaming Visual-based Spatial Intelligence with Test-Time Training

arXiv 2026

2026

Uni-ViGU: Towards Unified Video Generation and Understanding via A Diffusion-Based Video Generator

arXiv 2026

2026

AgentKernelArena: Generalization-Aware Benchmarking of GPU Kernel Optimization Agents

arXiv 2026

2026

From Masks to Pixels and Meaning: A New Taxonomy, Benchmark, and Metrics for VLM Image Tampering

arXiv 2026

2026

Chain of World: World Model Thinking in Latent Motion

arXiv 2026

2026

AgentSys: Secure and Dynamic LLM Agents Through Explicit Hierarchical Memory Management

arXiv 2026

2026

Rethinking VLM Representation for VLA Initialization

arXiv 2026

2026

Agent READMEs: An Empirical Study of Context Files for Agentic Coding

arXiv 2025

2025

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

arXiv 2025

2025

Bridging Evolutionary Multiobjective Optimization and GPU Acceleration via Tensorization

arXiv 2025

2025

Step1X-3D: Towards High-Fidelity and Controllable Generation of Textured 3D Assets

arXiv 2025

2025

Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing

arXiv 2025

2025

Give Me FP32 or Give Me Death? Challenges and Solutions for Reproducible Reasoning

arXiv 2025

2025

T2I-R1: Reinforcing Image Generation with Collaborative Semantic-level and Token-level CoT

arXiv 2025

2025

The Avengers: A Simple Recipe for Uniting Smaller Language Models to Challenge Proprietary Giants

arXiv 2025

2025

DiffPortrait360: Consistent Portrait Diffusion for 360 View Synthesis

CVPR 2025 1

2025

SOAP: Style-Omniscient Animatable Portraits

arXiv 2025

2025

ZeroGUI: Automating Online GUI Learning at Zero Human Cost

arXiv 2025

2025

Rethinking Text-based Protein Understanding: Retrieval or LLM?

arXiv 2025

2025

IPO: Iterative Preference Optimization for Text-to-Video Generation

arXiv 2025

2025

Cockatiel: Ensembling Synthetic and Human Preferenced Training for Detailed Video Caption

arXiv 2025

2025

AI Meets Brain: Memory Systems from Cognitive Neuroscience to Autonomous Agents

arXiv 2025

2025

IGGT: Instance-Grounded Geometry Transformer for Semantic 3D Reconstruction

arXiv 2025

2025

F1: A Vision-Language-Action Model Bridging Understanding and Generation to Actions

arXiv 2025

2025

InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy

arXiv 2025

2025

InstructVLA: Vision-Language-Action Instruction Tuning from Understanding to Manipulation

arXiv 2025

2025

NaViL: Rethinking Scaling Properties of Native Multimodal Large Language Models under Data Constraints

arXiv 2025

2025

LangScene-X: Reconstruct Generalizable 3D Language-Embedded Scenes with TriMap Video Diffusion

arXiv 2025

2025

Omni-Video: Democratizing Unified Video Understanding and Generation

arXiv 2025

2025

Mono-InternVL-1.5: Towards Cheaper and Faster Monolithic Multimodal Large Language Models

arXiv 2025

2025

Character Mixing for Video Generation

arXiv 2025

2025

Sequential Diffusion Language Models

arXiv 2025

2025

Beyond Multiple Choice: Verifiable OpenQA for Robust Vision-Language RFT

arXiv 2025

2025

Hierarchical Budget Policy Optimization for Adaptive Reasoning

arXiv 2025

2025

Adaptive Markup Language Generation for Contextually-Grounded Visual Document Understanding

CVPR 2025 1

2025

Parameter-Inverted Image Pyramid Networks for Visual Perception and Multimodal Understanding

arXiv 2025

2025

UniFork: Exploring Modality Alignment for Unified Multimodal Understanding and Generation

arXiv 2025

2025

GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing

arXiv 2025

2025

Can Understanding and Generation Truly Benefit Together -- or Just Coexist?

arXiv 2025

2025

Derail Yourself: Multi-turn LLM Jailbreak Attack through Self-discovered Clues

arXiv 2024

2024

LLaVA-CoT: Let Vision Language Models Reason Step-by-Step

arXiv 2024

2024

InjecGuard: Benchmarking and Mitigating Over-defense in Prompt Injection Guardrail Models

arXiv 2024

2024

FaceVid-1K: A Large-Scale High-Quality Multiracial Human Face Video Dataset

arXiv 2024

2024

Instance Brownian Bridge as Texts for Open-vocabulary Video Instance Segmentation

arXiv 2024

2024

LAN: Learning Adaptive Neighbors for Real-Time Insider Threat Detection

arXiv 2024

2024

ProLLaMA: A Protein Large Language Model for Multi-Task Protein Language Processing

arXiv 2024

2024

FuXi Weather: A data-to-forecast machine learning system for global weather

arXiv 2024

2024

PUMA: Empowering Unified MLLM with Multi-granular Visual Generation

ICCV 2025

2024

Hello Again! LLM-powered Personalized Agent for Long-term Dialogue

arXiv 2024

2024

The state-of-the-art in Cardiac MRI Reconstruction: Results of the CMRxRecon Challenge in MICCAI 2023

arXiv 2024

2024

Grasp as You Say: Language-guided Dexterous Grasp Generation

arXiv 2024

2024

VLSBench: Unveiling Visual Leakage in Multimodal Safety

arXiv 2024

2024

An Empirical Study and Analysis of Text-to-Image Generation Using Large Language Model-Powered Textual Representation

arXiv 2024

2024

Which Side Are You On? A Multi-task Dataset for End-to-End Argument Summarisation and Evaluation

arXiv 2024

2024

Predicting fluorescent labels in label-free microscopy images with pix2pix and adaptive loss in Light My Cells challenge

arXiv 2024

2024

Retinal IPA: Iterative KeyPoints Alignment for Multimodal Retinal Imaging

arXiv 2024

2024

Chatlaw: A Multi-Agent Collaborative Legal Assistant with Knowledge Graph Enhanced Mixture-of-Experts Large Language Model

arXiv 2023

2023

Learning A Sparse Transformer Network for Effective Image Deraining

CVPR 2023 1

2023

XMem++: Production-level Video Segmentation From Few Annotated Frames

ICCV 2023 1

2023

DiffusionRet: Generative Text-Video Retrieval with Diffusion Model

ICCV 2023 1

2023

InfMLLM: A Unified Framework for Visual-Language Tasks

arXiv 2023

2023

NDC-Scene: Boost Monocular 3D Semantic Scene Completion in Normalized Device Coordinates Space

ICCV 2023 1

2023

ADDP: Learning General Representations for Image Recognition and Generation with Alternating Denoising Diffusion Process

arXiv 2023

2023

Taiyi: A Bilingual Fine-Tuned Large Language Model for Diverse Biomedical Tasks

arXiv 2023

2023

MonoNeRD: NeRF-like Representations for Monocular 3D Object Detection

ICCV 2023 1

2023

FreestyleRet: Retrieving Images from Style-Diversified Queries

arXiv 2023

2023

Text-Video Retrieval with Disentangled Conceptualization and Set-to-Set Alignment

arXiv 2023

2023

Detecting Line Segments in Motion-blurred Images with Events

arXiv 2022

2022

Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks

CVPR 2023 1

2022

GiraffeDet: A Heavy-Neck Paradigm for Object Detection

2022

Spatiotemporal Entropy Model is All You Need for Learned Video Compression

arXiv 2021

2021

Neural Architecture Design for GPU-Efficient Networks

arXiv 2020

2020

Position-Aware Tagging for Aspect Sentiment Triplet Extraction

EMNLP 2020 11

2020

Visualizing the Loss Landscape of Neural Nets

2017

Affiliations

No known affiliations.

Frequent co-authors

10

from 84 papers