Yu-Gang Jiang
- Papers: 61
Authored papers (61)
World Action Models: The Next Frontier in Embodied AI
arXiv 2026
Internal Safety Collapse in Frontier Large Language Models
arXiv 2026
A Survey of Large Audio Language Models: Generalization, Trustworthiness, and Outlook
arXiv 2026
The Latent Space: Foundation, Evolution, Mechanism, Ability, and Outlook
arXiv 2026
CL-bench: A Benchmark for Context Learning
arXiv 2026
ArcFlow: Unleashing 2-Step Text-to-Image Generation via High-Precision Non-Linear Flow Distillation
arXiv 2026
FRoM-W1: Towards General Humanoid Whole-Body Control with Language Instructions
arXiv 2026
SciAgentGym: Benchmarking Multi-Step Scientific Tool-use in LLM Agents
arXiv 2026
PixelSmile: Toward Fine-Grained Facial Expression Editing
arXiv 2026
CT-1: Vision-Language-Camera Models Transfer Spatial Reasoning Knowledge to Camera-Controllable Video Generation
arXiv 2026
CaTok: Taming Mean Flows for One-Dimensional Causal Image Tokenization
arXiv 2026
WeEdit: A Dataset, Benchmark and Glyph-Guided Framework for Text-centric Image Editing
arXiv 2026
A Safety Report on GPT-5.2, Gemini 3 Pro, Qwen3-VL, Doubao 1.8, Grok 4.1 Fast, Nano Banana Pro, and Seedream 4.5
arXiv 2026
StableAvatar: Infinite-Length Audio-Driven Avatar Video Generation
arXiv 2025
OmniSVG: A Unified Scalable Vector Graphics Generation Model
arXiv 2025
SimpleAR: Pushing the Frontier of Autoregressive Visual Generation through Pretraining, SFT, and RL
arXiv 2025
WorldPM: Scaling Human Preference Modeling
arXiv 2025
WithAnyone: Towards Controllable and ID Consistent Image Generation
arXiv 2025
BackdoorVLM: A Benchmark for Backdoor Attacks on Vision-Language Models
arXiv 2025
MOSEv2: A More Challenging Dataset for Video Object Segmentation in Complex Scenes
arXiv 2025
Multimodal Referring Segmentation: A Survey
arXiv 2025
AgentGym-RL: Training LLM Agents for Long-Horizon Decision Making through Multi-Turn Reinforcement Learning
arXiv 2025
Visual Multi-Agent System: Mitigating Hallucination Snowballing via Visual Flow
arXiv 2025
UniToken: Harmonizing Multimodal Understanding and Generation through Unified Visual Encoding
arXiv 2025
Safety at Scale: A Comprehensive Survey of Large Model Safety
arXiv 2025
EgoNight: Towards Egocentric Vision Understanding at Night with a Challenging Benchmark
arXiv 2025
ControlThinker: Unveiling Latent Semantics for Controllable Image Generation through Visual Reasoning
arXiv 2025
A Comprehensive Survey in LLM(-Agent) Full Stack Safety: Data, Training and Deployment
arXiv 2025
CoMP: Continual Multimodal Pre-training for Vision Foundation Models
arXiv 2025
Reliable and Efficient Concept Erasure of Text-to-Image Diffusion Models
arXiv 2024
Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning
arXiv 2024
OmniTokenizer: A Joint Image-Video Tokenizer for Visual Generation
arXiv 2024
Secrets of RLHF in Large Language Models Part II: Reward Modeling
arXiv 2024
SVTRv2: CTC Beats Encoder-Decoder Models in Scene Text Recognition
ICCV 2025
AgentGym: Evolving Large Language Model-based Agents across Diverse Environments
arXiv 2024
Hydra-MDP: End-to-end Multimodal Planning with Multi-target Hydra-Distillation
arXiv 2024
REDUCIO! Generating 1024$\times$1024 Video within 16 Seconds using Extremely Compressed Motion Latents
ICCV 2025
Llama Scope: Extracting Millions of Features from Llama-3.1-8B with Sparse Autoencoders
arXiv 2024
Comprehensive Multi-Modal Prototypes are Simple and Effective Classifiers for Vast-Vocabulary Object Detection
arXiv 2024
Shortcuts Everywhere and Nowhere: Exploring Multi-Trigger Backdoor Attacks
arXiv 2024
Brain3D: Generating 3D Objects from fMRI
arXiv 2024
Expose Before You Defend: Unifying and Enhancing Backdoor Defenses via Exposed Models
arXiv 2024
MouSi: Poly-Visual-Expert Vision-Language Models
arXiv 2024
OmniVid: A Generative Framework for Universal Video Understanding
CVPR 2024 1
ForgerySleuth: Empowering Multimodal Large Language Models for Image Manipulation Detection
arXiv 2024
A Survey on Video Diffusion Models
arXiv 2023
NuScenes-QA: A Multi-modal Visual Question Answering Benchmark for Autonomous Driving Scenario
arXiv 2023
MotionEditor: Editing Video Motion via Content-Aware Diffusion
CVPR 2024 1
Fake Alignment: Are LLMs Really Aligned Well?
arXiv 2023
To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning
arXiv 2023
MRN: Multiplexed Routing Network for Incremental Multilingual Text Recognition
ICCV 2023 1
MagDiff: Multi-Alignment Diffusion for High-Fidelity Video Generation and Editing
arXiv 2023
TPS++: Attention-Enhanced Thin-Plate Spline for Scene Text Recognition
arXiv 2023
Implicit Temporal Modeling with Learnable Alignment for Video Recognition
ICCV 2023 1
Reconstructive Neuron Pruning for Backdoor Defense
arXiv 2023
SEGIC: Unleashing the Emergent Correspondence for In-Context Segmentation
arXiv 2023
Unlearnable Clusters: Towards Label-agnostic Unlearnable Examples
CVPR 2023 1
ResFormer: Scaling ViTs with Multi-Resolution Training
CVPR 2023 1
WildDeepfake: A Challenging Real-World Dataset for Deepfake Detection
arXiv 2021
M2TR: Multi-modal Multi-scale Transformers for Deepfake Detection
arXiv 2021
Imbalanced Gradients: A Subtle Cause of Overestimated Adversarial Robustness
arXiv 2020
Frequent co-authors
10 from 61 papers