Linjie Li

Papers
49

Authored papers

49

AutoResearchClaw: Self-Reinforcing Autonomous Research with Human-AI Collaboration

arXiv 2026

2026

RAGEN-2: Reasoning Collapse in Agentic RL

arXiv 2026

2026

AdaReasoner: Dynamic Tool Orchestration for Iterative Visual Reasoning

arXiv 2026

2026

FlowInOne: Unifying Multimodal Generation as Image-in, Image-out Flow Matching

arXiv 2026

2026

RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning

arXiv 2025

2025

OpenThinkIMG: Learning to Think with Images via Visual Tool Reinforcement Learning

arXiv 2025

2025

Unfolding Spatial Cognition: Evaluating Multimodal Models on Visual Simulations

arXiv 2025

2025

Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers

arXiv 2025

2025

VCode: a Multimodal Coding Benchmark with SVG as Symbolic Visual Representation

arXiv 2025

2025

Computer-Use Agents as Judges for Generative User Interface

arXiv 2025

2025

ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning

arXiv 2025

2025

Chain-of-Experts: Unlocking the Communication Power of Mixture-of-Experts Models

arXiv 2025

2025

ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs

arXiv 2025

2025

FullFront: Benchmarking MLLMs Across the Full Front-End Engineering Workflow

arXiv 2025

2025

Test-Time Preference Optimization: On-the-Fly Alignment via Iterative Textual Feedback

arXiv 2025

2025

SoTA with Less: MCTS-Guided Sample Selection for Data-Efficient Visual Reasoning Self-Improvement

arXiv 2025

2025

A Simple "Try Again" Can Elicit Multi-Turn LLM Reasoning

arXiv 2025

2025

Glance: Accelerating Diffusion Models with 1 Sample

arXiv 2025

2025

V-MAGE: A Game Evaluation Framework for Assessing Vision-Centric Capabilities in Multimodal Large Language Models

arXiv 2025

2025

ShowUI: One Vision-Language-Action Model for GUI Visual Agent

CVPR 2025 1

2024

MM-Vet v2: A Challenging Benchmark to Evaluate Large Multimodal Models for Integrated Capabilities

arXiv 2024

2024

List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs

arXiv 2024

2024

Motion Consistency Model: Accelerating Video Diffusion with Disentangled Motion-Appearance Distillation

arXiv 2024

2024

IDOL: Unified Dual-Modal Latent Diffusion for Human-Centric Joint Video-Depth Generation

arXiv 2024

2024

Scaling Inference-Time Search with Vision Value Model for Improved Visual Comprehension

ICCV 2025

2024

MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models

arXiv 2024

2024

MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos

arXiv 2024

2024

Leveraging Visual Tokens for Extended Text Contexts in Multi-Modal Learning

arXiv 2024

2024

MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action

arXiv 2023

2023

Multimodal Foundation Models: From Specialists to General-Purpose Assistants

arXiv 2023

2023

DisCo: Disentangled Control for Realistic Human Dance Generation

CVPR 2024 1

2023

Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning

arXiv 2023

2023

GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation

arXiv 2023

2023

Interfacing Foundation Models' Embeddings

arXiv 2023

2023

Equivariant Similarity for Vision-Language Foundation Models

ICCV 2023 1

2023

Adaptive Human Matting for Dynamic Videos

CVPR 2023 1

2023

MMSum: A Dataset for Multimodal Summarization and Thumbnail Generation of Videos

CVPR 2024 1

2023

Diagnostic Benchmark and Iterative Inpainting for Layout-Guided Image Generation

arXiv 2023

2023

DEsignBench: Exploring and Benchmarking DALL-E 3 for Imagining Visual Design

arXiv 2023

2023

Generalized Decoding for Pixel, Image, and Language

CVPR 2023 1

2022

Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone

2022

LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling

CVPR 2023 1

2022

An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling

CVPR 2023 1

2022

GIT: A Generative Image-to-text Transformer for Vision and Language

arXiv 2022

2022

SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning

CVPR 2022 1

2021

VIOLET : End-to-End Video-Language Transformers with Masked Visual-token Modeling

arXiv 2021

2021

HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training

EMNLP 2020 11

2020

Graph Optimal Transport for Cross-Domain Alignment

ICML 2020 1

2020

UNITER: UNiversal Image-TExt Representation Learning

ECCV 2020 8

2019

Affiliations

No known affiliations.

Frequent co-authors

10

from 49 papers