Wenxuan Wang

DeepMath-103K: A Large-Scale, Challenging, Decontaminated, and Verifiable Mathematical Dataset for Advancing Reasoning

arXiv 2025

Deep Research: A Systematic Survey

arXiv 2025

Uniform Discrete Diffusion with Metric Path for Video Generation

arXiv 2025

Emu3.5: Native Multimodal Models are World Learners

arXiv 2025

Unified Vision-Language-Action Model

arXiv 2025

A Survey of Deep Learning for Geometry Problem Solving

arXiv 2025

VLMs as GeoGuessr Masters: Exceptional Performance, Hidden Biases, and Privacy Risks

arXiv 2025

A Comprehensive Survey in LLM(-Agent) Full Stack Safety: Data, Training and Deployment

arXiv 2025

EVEv2: Improved Baselines for Encoder-Free Vision-Language Models

ICCV 2025

TimeZero: Temporal Video Grounding with Reasoning-Guided LVLM

arXiv 2025

Trust, But Verify: A Self-Verification Approach to Reinforcement Learning with Verifiable Rewards

arXiv 2025

LogicAsker: Evaluating and Improving the Logical Reasoning Ability of Large Language Models

arXiv 2024

New Job, New Gender? Measuring the Social Bias in Image Generation Models

arXiv 2024

Knowledge-to-Jailbreak: Investigating Knowledge-driven Jailbreaking Attacks for Large Language Models

arXiv 2024

Diffusion Feedback Helps CLIP See Better

arXiv 2024

How Far Are We on the Decision-Making of LLMs? Evaluating LLMs' Gaming Ability in Multi-Agent Environments

arXiv 2024

Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training

arXiv 2024

How Well Can LLMs Echo Us? Evaluating AI Chatbots' Role-Play Ability with ECHO

arXiv 2024

Distributional Soft Actor-Critic with Three Refinements

arXiv 2023

GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher

arXiv 2023

Who is ChatGPT? Benchmarking LLMs' Psychological Portrayal Using PsychoBench

arXiv 2023

Leveraging Word Guessing Games to Assess the Intelligence of Large Language Models

arXiv 2023

What Makes Good In-context Demonstrations for Code Intelligence Tasks with LLMs?

arXiv 2023

Is ChatGPT A Good Translator? Yes With GPT-4 As The Engine

arXiv 2023

Revisiting the Reliability of Psychological Scales on Large Language Models

arXiv 2023

All Languages Matter: On the Multilingual Safety of Large Language Models

arXiv 2023

BiasAsker: Measuring the Bias in Conversational AI System

arXiv 2023