Linjie Li
- Papers: 49
Authored papers (49)
AutoResearchClaw: Self-Reinforcing Autonomous Research with Human-AI Collaboration
arXiv 2026
RAGEN-2: Reasoning Collapse in Agentic RL
arXiv 2026
AdaReasoner: Dynamic Tool Orchestration for Iterative Visual Reasoning
arXiv 2026
FlowInOne: Unifying Multimodal Generation as Image-in, Image-out Flow Matching
arXiv 2026
RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning
arXiv 2025
OpenThinkIMG: Learning to Think with Images via Visual Tool Reinforcement Learning
arXiv 2025
Unfolding Spatial Cognition: Evaluating Multimodal Models on Visual Simulations
arXiv 2025
Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers
arXiv 2025
VCode: a Multimodal Coding Benchmark with SVG as Symbolic Visual Representation
arXiv 2025
Computer-Use Agents as Judges for Generative User Interface
arXiv 2025
ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning
arXiv 2025
Chain-of-Experts: Unlocking the Communication Power of Mixture-of-Experts Models
arXiv 2025
ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs
arXiv 2025
FullFront: Benchmarking MLLMs Across the Full Front-End Engineering Workflow
arXiv 2025
Test-Time Preference Optimization: On-the-Fly Alignment via Iterative Textual Feedback
arXiv 2025
SoTA with Less: MCTS-Guided Sample Selection for Data-Efficient Visual Reasoning Self-Improvement
arXiv 2025
A Simple "Try Again" Can Elicit Multi-Turn LLM Reasoning
arXiv 2025
Glance: Accelerating Diffusion Models with 1 Sample
arXiv 2025
V-MAGE: A Game Evaluation Framework for Assessing Vision-Centric Capabilities in Multimodal Large Language Models
arXiv 2025
ShowUI: One Vision-Language-Action Model for GUI Visual Agent
CVPR 2025 1
MM-Vet v2: A Challenging Benchmark to Evaluate Large Multimodal Models for Integrated Capabilities
arXiv 2024
List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs
arXiv 2024
Motion Consistency Model: Accelerating Video Diffusion with Disentangled Motion-Appearance Distillation
arXiv 2024
IDOL: Unified Dual-Modal Latent Diffusion for Human-Centric Joint Video-Depth Generation
arXiv 2024
Scaling Inference-Time Search with Vision Value Model for Improved Visual Comprehension
ICCV 2025
MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models
arXiv 2024
MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos
arXiv 2024
Leveraging Visual Tokens for Extended Text Contexts in Multi-Modal Learning
arXiv 2024
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action
arXiv 2023
Multimodal Foundation Models: From Specialists to General-Purpose Assistants
arXiv 2023
DisCo: Disentangled Control for Realistic Human Dance Generation
CVPR 2024 1
Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning
arXiv 2023
GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation
arXiv 2023
Interfacing Foundation Models' Embeddings
arXiv 2023
Equivariant Similarity for Vision-Language Foundation Models
ICCV 2023 1
Adaptive Human Matting for Dynamic Videos
CVPR 2023 1
MMSum: A Dataset for Multimodal Summarization and Thumbnail Generation of Videos
CVPR 2024 1
Diagnostic Benchmark and Iterative Inpainting for Layout-Guided Image Generation
arXiv 2023
DEsignBench: Exploring and Benchmarking DALL-E 3 for Imagining Visual Design
arXiv 2023
Generalized Decoding for Pixel, Image, and Language
CVPR 2023 1
Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone
NeurIPS 2022
LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling
CVPR 2023 1
An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling
CVPR 2023 1
GIT: A Generative Image-to-text Transformer for Vision and Language
arXiv 2022
SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning
CVPR 2022 1
VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling
arXiv 2021
HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training
EMNLP 2020 11
Graph Optimal Transport for Cross-Domain Alignment
ICML 2020 1
UNITER: UNiversal Image-TExt Representation Learning
ECCV 2020 8
Affiliations
Frequent co-authors
10 from 49 papers