Kevin Lin
- Papers: 25
Authored papers (25)
Sleep-time Compute: Beyond Inference Scaling at Test-time
arXiv 2025
ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs
arXiv 2025
GR00T N1: An Open Foundation Model for Generalist Humanoid Robots
arXiv 2025
BizGen: Advancing Article-level Visual Text Rendering for Infographics Generation
CVPR 2025 1
SoTA with Less: MCTS-Guided Sample Selection for Data-Efficient Visual Reasoning Self-Improvement
arXiv 2025
MM-Vet v2: A Challenging Benchmark to Evaluate Large Multimodal Models for Integrated Capabilities
arXiv 2024
List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs
arXiv 2024
Motion Consistency Model: Accelerating Video Diffusion with Disentangled Motion-Appearance Distillation
arXiv 2024
IDOL: Unified Dual-Modal Latent Diffusion for Human-Centric Joint Video-Depth Generation
arXiv 2024
Scaling Inference-Time Search with Vision Value Model for Improved Visual Comprehension
ICCV 2025
MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos
arXiv 2024
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action
arXiv 2023
DisCo: Disentangled Control for Realistic Human Dance Generation
CVPR 2024 1
Lost in the Middle: How Language Models Use Long Contexts
arXiv 2023
Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning
arXiv 2023
DEsignBench: Exploring and Benchmarking DALL-E 3 for Imagining Visual Design
arXiv 2023
GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation
arXiv 2023
Equivariant Similarity for Vision-Language Foundation Models
ICCV 2023 1
Adaptive Human Matting for Dynamic Videos
CVPR 2023 1
LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling
CVPR 2023 1
An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling
CVPR 2023 1
GIT: A Generative Image-to-text Transformer for Vision and Language
arXiv 2022
SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning
CVPR 2022 1
VIOLET : End-to-End Video-Language Transformers with Masked Visual-token Modeling
arXiv 2021
Cross-Domain Complementary Learning Using Pose for Multi-Person Part Segmentation
arXiv 2019
Frequent co-authors
10 from 25 papers