Yi Wang

Urban Socio-Semantic Segmentation with Vision-Language Reasoning

arXiv 2026

RIVER: A Real-Time Interaction Benchmark for Video LLMs

arXiv 2026

LLaTiSA: Towards Difficulty-Stratified Time Series Reasoning from Visual Perception to Semantics

arXiv 2026

Code2World: A GUI World Model via Renderable Code Generation

arXiv 2026

AcademiClaw: When Students Set Challenges for AI Agents

arXiv 2026

Qwen-Image Technical Report

arXiv 2025

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

arXiv 2025

InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling

arXiv 2025

VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning

arXiv 2025

Make Your Training Flexible: Towards Deployment-Efficient Video Models

ICCV 2025

ImplicitRDP: An End-to-End Visual-Force Diffusion Policy with Structural Slow-Fast Learning

arXiv 2025

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

arXiv 2025

PATS: Process-Level Adaptive Thinking Mode Switching

arXiv 2025

GeoLangBind: Unifying Earth Observation with Agglomerative Vision-Language Foundation Models

arXiv 2025

Syn-GRPO: Self-Evolving Data Synthesis for MLLM Perception Reasoning

arXiv 2025

Let It Flow: Agentic Crafting on Rock and Roll, Building the ROME Model within an Open Agentic Learning Ecosystem

arXiv 2025

VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos

ICCV 2025

ExpVid: A Benchmark for Experiment Video Understanding & Reasoning

arXiv 2025

Vchitect-2.0: Parallel Transformer for Scaling Up Video Diffusion Models

arXiv 2025

Towards a Unified Copernicus Foundation Model for Earth Vision

ICCV 2025

VKnowU: Evaluating Visual Knowledge Understanding in Multimodal LLMs

arXiv 2025

EVA02-AT: Egocentric Video-Language Understanding with Spatial-Temporal Rotary Positional Embeddings and Symmetric Optimization

arXiv 2025

Synthetic Generation and Latent Projection Denoising of Rim Lesions in Multiple Sclerosis

arXiv 2025

TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning

arXiv 2024

VideoMamba: State Space Model for Efficient Video Understanding

arXiv 2024

ChatMusician: Understanding and Generating Music Intrinsically with LLM

arXiv 2024

ESP-MedSAM: Efficient Self-Prompting SAM for Universal Domain-Generalized Medical Image Segmentation

arXiv 2024

Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment

CVPR 2025

Explaining Time Series via Contrastive and Locally Sparse Perturbations

arXiv 2024

SGIFormer: Semantic-guided and Geometric-enhanced Interleaving Transformer for 3D Instance Segmentation

arXiv 2024

ComposerX: Multi-Agent Symbolic Music Composition with LLMs

arXiv 2024

Bootstrapping Language-Guided Navigation Learning with Self-Refining Data Flywheel

arXiv 2024

Multi-Label Guided Soft Contrastive Learning for Efficient Earth Observation Pretraining

arXiv 2024

VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling

arXiv 2024

OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text

arXiv 2024

Recurrent Drafter for Fast Speculative Decoding in Large Language Models

arXiv 2024

Neural Plasticity-Inspired Multimodal Foundation Model for Earth Observation

arXiv 2024

Internal Consistency and Self-Feedback in Large Language Models: A Survey

arXiv 2024

XTRUST: On the Multilingual Trustworthiness of Large Language Models

arXiv 2024

CaRe-Ego: Contact-aware Relationship Modeling for Egocentric Interactive Hand-object Segmentation

arXiv 2024

Tracking the Feature Dynamics in LLM Training: A Mechanistic Study

arXiv 2024

SSL4EO-L: Datasets and Foundation Models for Landsat Imagery

VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking

CVPR 2023

MVBench: A Comprehensive Multi-modal Video Understanding Benchmark

CVPR 2024

NeRFLiX: High-Quality Neural View Synthesis by Learning a Degradation-Driven Inter-viewpoint MiXer

CVPR 2023

Decoupling Common and Unique Representations for Multimodal Self-supervised Learning

arXiv 2023

Enhancing NeRF akin to Enhancing LLMs: Generalizable NeRF Transformer with Mixture-of-View-Experts

ICCV 2023

GAMUS: A Geometry-aware Multi-modal Semantic Segmentation Benchmark for Remote Sensing Data

arXiv 2023

Feature Guided Masked Autoencoder for Self-supervised Learning in Remote Sensing

arXiv 2023

LAVIE: High-Quality Video Generation with Cascaded Latent Diffusion Models

arXiv 2023

Unmasked Teacher: Towards Training-Efficient Video Foundation Models

ICCV 2023

Scaling Data Generation in Vision-and-Language Navigation

ICCV 2023

SimPLe: Similarity-Aware Propagation Learning for Weakly-Supervised Breast Cancer Segmentation in DCE-MRI

arXiv 2023

MAT: Mask-Aware Transformer for Large Hole Image Inpainting

CVPR 2022

UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer

arXiv 2022

NeuralLift-360: Lifting An In-the-wild 2D Photo to A 3D Object with 360° Views

arXiv 2022

SSL4EO-S12: A Large-Scale Multi-Modal, Multi-Temporal Dataset for Self-Supervised Learning in Earth Observation

arXiv 2022