Hao Fei
- Papers: 34
Authored papers (34)
Audio-Visual Intelligence in Large Foundation Models
arXiv 2026
JavisDiT++: Unified Modeling and Optimization for Joint Audio-Video Generation
arXiv 2026
SAMTok: Representing Any Mask with Two Words
arXiv 2026
SOAR: Self-Correction for Optimal Alignment and Refinement in Diffusion Models
arXiv 2026
UniM: A Unified Any-to-Any Interleaved Multimodal Benchmark
arXiv 2026
Taming Actor-Observer Asymmetry in Agents via Dialectical Alignment
arXiv 2026
Semantic Role Labeling: A Systematical Survey
arXiv 2025
VistaDPO: Video Hierarchical Spatial-Temporal Direct Preference Optimization for Large Video Models
arXiv 2025
Probing then Editing Response Personality of Large Language Models
arXiv 2025
UniVA: Universal Video Agent towards Open-Source Next-Generation Video Generalist
arXiv 2025
JavisGPT: A Unified Multi-modal LLM for Sounding-Video Comprehension and Generation
arXiv 2025
Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey
arXiv 2025
On Path to Multimodal Generalist: General-Level and General-Bench
arXiv 2025
Mixed-R1: Unified Reward Perspective For Reasoning Capability in Multimodal Large Language Models
arXiv 2025
Dr.V: A Hierarchical Perception-Temporal-Cognition Framework to Diagnose Video Hallucination by Fine-grained Spatial-Temporal Grounding
arXiv 2025
CCHall: A Novel Benchmark for Joint Cross-Lingual and Cross-Modal Hallucinations Detection in Large Language Models
arXiv 2025
JavisDiT: Joint Audio-Video Diffusion Transformer with Hierarchical Spatio-Temporal Prior Synchronization
arXiv 2025
Derm1M: A Million-scale Vision-Language Dataset Aligned with Clinical Ontology Knowledge for Dermatology
ICCV 2025
Any2Caption: Interpreting Any Condition to Caption for Controllable Video Generation
arXiv 2025
Faithful Logical Reasoning via Symbolic Chain-of-Thought
arXiv 2024
Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent
arXiv 2024
CoMT: A Novel Benchmark for Chain of Multi-modal Thought on Large Vision-Language Models
arXiv 2024
ControlMLLM: Training-Free Visual Prompt Learning for Multimodal Large Language Models
arXiv 2024
A Survey on Benchmarks of Multimodal Large Language Models
arXiv 2024
Towards Semantic Equivalence of Tokenization in Multimodal LLM
arXiv 2024
Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning
arXiv 2024
LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning
arXiv 2023
NExT-GPT: Any-to-Any Multimodal LLM
arXiv 2023
MolCA: Molecular Graph-Language Modeling with Cross-Modal Projector and Uni-Modal Adapter
arXiv 2023
Scene Graph as Pivoting: Inference-time Image-free Unsupervised Multimodal Machine Translation with Visual Scene Hallucination
arXiv 2023
LasUIE: Unifying Information Extraction with Latent Adaptive Structure-aware Generative Language Model
arXiv 2023
VPGTrans: Transfer Visual Prompt Generator across LLMs
NeurIPS 2023
Reasoning Implicit Sentiment with Chain-of-Thought Prompting
arXiv 2023
Generating Visual Spatial Description via Holistic 3D Scene Understanding
arXiv 2023
Frequent co-authors
10 from 34 papers