Yi Wang
- Papers: 61
Authored papers: 61
Qwen-Image-VAE-2.0 Technical Report
arXiv 2026
Urban Socio-Semantic Segmentation with Vision-Language Reasoning
arXiv 2026
RIVER: A Real-Time Interaction Benchmark for Video LLMs
arXiv 2026
AcademiClaw: When Students Set Challenges for AI Agents
arXiv 2026
LLaTiSA: Towards Difficulty-Stratified Time Series Reasoning from Visual Perception to Semantics
arXiv 2026
Code2World: A GUI World Model via Renderable Code Generation
arXiv 2026
Qwen-Image Technical Report
arXiv 2025
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
arXiv 2025
InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling
arXiv 2025
VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning
arXiv 2025
Make Your Training Flexible: Towards Deployment-Efficient Video Models
ICCV 2025
ImplicitRDP: An End-to-End Visual-Force Diffusion Policy with Structural Slow-Fast Learning
arXiv 2025
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
arXiv 2025
Let It Flow: Agentic Crafting on Rock and Roll, Building the ROME Model within an Open Agentic Learning Ecosystem
arXiv 2025
VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos
ICCV 2025
ExpVid: A Benchmark for Experiment Video Understanding & Reasoning
arXiv 2025
Vchitect-2.0: Parallel Transformer for Scaling Up Video Diffusion Models
arXiv 2025
EVA02-AT: Egocentric Video-Language Understanding with Spatial-Temporal Rotary Positional Embeddings and Symmetric Optimization
arXiv 2025
Synthetic Generation and Latent Projection Denoising of Rim Lesions in Multiple Sclerosis
arXiv 2025
Towards a Unified Copernicus Foundation Model for Earth Vision
ICCV 2025
PATS: Process-Level Adaptive Thinking Mode Switching
arXiv 2025
GeoLangBind: Unifying Earth Observation with Agglomerative Vision-Language Foundation Models
arXiv 2025
Syn-GRPO: Self-Evolving Data Synthesis for MLLM Perception Reasoning
arXiv 2025
VKnowU: Evaluating Visual Knowledge Understanding in Multimodal LLMs
arXiv 2025
TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning
arXiv 2024
VideoMamba: State Space Model for Efficient Video Understanding
arXiv 2024
ChatMusician: Understanding and Generating Music Intrinsically with LLM
arXiv 2024
VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling
arXiv 2024
Neural Plasticity-Inspired Multimodal Foundation Model for Earth Observation
arXiv 2024
Internal Consistency and Self-Feedback in Large Language Models: A Survey
arXiv 2024
ESP-MedSAM: Efficient Self-Prompting SAM for Universal Domain-Generalized Medical Image Segmentation
arXiv 2024
Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment
CVPR 2025
CaRe-Ego: Contact-aware Relationship Modeling for Egocentric Interactive Hand-object Segmentation
arXiv 2024
OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text
arXiv 2024
Recurrent Drafter for Fast Speculative Decoding in Large Language Models
arXiv 2024
Explaining Time Series via Contrastive and Locally Sparse Perturbations
arXiv 2024
SGIFormer: Semantic-guided and Geometric-enhanced Interleaving Transformer for 3D Instance Segmentation
arXiv 2024
ComposerX: Multi-Agent Symbolic Music Composition with LLMs
arXiv 2024
Bootstrapping Language-Guided Navigation Learning with Self-Refining Data Flywheel
arXiv 2024
Multi-Label Guided Soft Contrastive Learning for Efficient Earth Observation Pretraining
arXiv 2024
XTRUST: On the Multilingual Trustworthiness of Large Language Models
arXiv 2024
Tracking the Feature Dynamics in LLM Training: A Mechanistic Study
arXiv 2024
SSL4EO-L: Datasets and Foundation Models for Landsat Imagery
VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking
CVPR 2023
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark
CVPR 2024
LAVIE: High-Quality Video Generation with Cascaded Latent Diffusion Models
arXiv 2023
NeRFLiX: High-Quality Neural View Synthesis by Learning a Degradation-Driven Inter-viewpoint MiXer
CVPR 2023
SimPLe: Similarity-Aware Propagation Learning for Weakly-Supervised Breast Cancer Segmentation in DCE-MRI
arXiv 2023
Unmasked Teacher: Towards Training-Efficient Video Foundation Models
ICCV 2023
Scaling Data Generation in Vision-and-Language Navigation
ICCV 2023
Decoupling Common and Unique Representations for Multimodal Self-supervised Learning
arXiv 2023
Enhancing NeRF akin to Enhancing LLMs: Generalizable NeRF Transformer with Mixture-of-View-Experts
ICCV 2023
GAMUS: A Geometry-aware Multi-modal Semantic Segmentation Benchmark for Remote Sensing Data
arXiv 2023
Feature Guided Masked Autoencoder for Self-supervised Learning in Remote Sensing
arXiv 2023
MAT: Mask-Aware Transformer for Large Hole Image Inpainting
CVPR 2022
NeuralLift-360: Lifting An In-the-wild 2D Photo to A 3D Object with 360° Views
arXiv 2022
UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer
arXiv 2022
SSL4EO-S12: A Large-Scale Multi-Modal, Multi-Temporal Dataset for Self-Supervised Learning in Earth Observation
arXiv 2022
VCNet: A Robust Approach to Blind Image Inpainting
ECCV 2020
Image Inpainting via Generative Multi-column Convolutional Neural Networks
Deep Speech 2: End-to-End Speech Recognition in English and Mandarin
arXiv 2015
Affiliations
Frequent co-authors
10 from 61 papers