William Yang Wang
- Papers: 58
Authored papers
Proactive Agent Research Environment: Simulating Active Users to Evaluate Proactive Assistants
arXiv 2026
TermiGen: High-Fidelity Environment and Robust Trajectory Synthesis for Terminal Agents
arXiv 2026
Atomic Skills are the Prerequisite: When Reinforcement Learning Synthesizes Compositional Reasoning, and When It Only Amplifies
arXiv 2025
MLGym: A New Framework and Benchmark for Advancing AI Research Agents
arXiv 2025
MacRAG: Compress, Slice, and Scale-up for Multi-Scale Adaptive Context RAG
arXiv 2025
MELON: Provable Defense Against Indirect Prompt Injection Attacks in AI Agents
arXiv 2025
InductionBench: LLMs Fail in the Simplest Complexity Class
arXiv 2025
Dr.V: A Hierarchical Perception-Temporal-Cognition Framework to Diagnose Video Hallucination by Fine-grained Spatial-Temporal Grounding
arXiv 2025
Gödel Agent: A Self-Referential Agent Framework for Recursive Self-Improvement
arXiv 2024
A Survey on Data Selection for Language Models
arXiv 2024
RAG-QA Arena: Evaluating Domain Robustness for Long-form Retrieval Augmented Question Answering
arXiv 2024
MMSci: A Dataset for Graduate-Level Multi-Discipline Multimodal Scientific Understanding
arXiv 2024
MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos
arXiv 2024
TC-Bench: Benchmarking Temporal Compositionality in Text-to-Video and Image-to-Video Generation
arXiv 2024
Can Editing LLMs Inject Harm?
arXiv 2024
RuleArena: A Benchmark for Rule-Guided Reasoning with LLMs in Real-World Scenarios
arXiv 2024
DebUnc: Improving Large Language Model Agent Communication With Uncertainty Metrics
arXiv 2024
Weak-to-Strong Jailbreaking on Large Language Models
arXiv 2024
Disentangling Memory and Reasoning Ability in Large Language Models
arXiv 2024
Scaling LLM Inference with Optimized Sample Compute Allocation
arXiv 2024
AntiLeak-Bench: Preventing Data Contamination by Automatically Constructing Benchmarks with Updated Real-World Knowledge
arXiv 2024
BPO: Staying Close to the Behavior LLM Creates Better Online LLM Alignment
arXiv 2024
T2V-Turbo: Breaking the Quality Bottleneck of Video Consistency Model with Mixed Reward Feedback
arXiv 2024
T2V-Turbo-v2: Enhancing Video Generation Model Post-Training through Data, Reward, and Conditional Guidance Design
arXiv 2024
Multimodal C4: An Open, Billion-scale Corpus of Images Interleaved with Text
INSTRUCTSCORE: Explainable Text Generation Evaluation with Finegrained Feedback
arXiv 2023
DNA-GPT: Divergent N-Gram Analysis for Training-Free Detection of GPT-Generated Text
arXiv 2023
Discffusion: Discriminative Diffusion Models as Few-shot Vision and Language Learners
arXiv 2023
VELMA: Verbalization Embodiment of LLM Agents for Vision and Language Navigation in Street View
arXiv 2023
ReDi: Efficient Learning-Free Diffusion Inference via Trajectory Retrieval
arXiv 2023
Text as Images: Can Multimodal Large Language Models Follow Printed Instructions in Pixels?
arXiv 2023
Guiding Instruction-based Image Editing via Multimodal Large Language Models
arXiv 2023
Automatically Correcting Large Language Models: Surveying the landscape of diverse self-correction strategies
arXiv 2023
LayoutGPT: Compositional Visual Planning and Generation with Large Language Models
NeurIPS 2023 11
Logic-LM: Empowering Large Language Models with Symbolic Solvers for Faithful Logical Reasoning
arXiv 2023
A Survey on Detection of LLMs-Generated Content
arXiv 2023
Multimodal Procedural Planning via Dual Text-Image Prompting
arXiv 2023
LLMScore: Unveiling the Power of Large Language Models in Text-to-Image Synthesis Evaluation
Improving Few-Shot Generalization by Exploring and Exploiting Auxiliary Data
MAF: Multi-Aspect Feedback for Improving Reasoning in Large Language Models
arXiv 2023
Let's Think Frame by Frame with VIP: A Video Infilling and Prediction Dataset for Evaluating Video Chain-of-Thought
arXiv 2023
ASSERT: Automated Safety Scenario Red Teaming for Evaluating the Robustness of Large Language Models
arXiv 2023
An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling
CVPR 2023 1
Imagination-Augmented Natural Language Understanding
NAACL 2022 7
Not All Errors are Equal: Learning Text Generation Metrics using Stratified Error Synthesis
arXiv 2022
Towards Large-Scale Interpretable Knowledge Graph Reasoning for Dialogue Systems
Findings (ACL) 2022 5
Training-Free Structured Diffusion Guidance for Compositional Text-to-Image Synthesis
arXiv 2022
ConvFinQA: Exploring the Chain of Numerical Reasoning in Conversational Finance Question Answering
arXiv 2022
FinQA: A Dataset of Numerical Reasoning over Financial Data
EMNLP 2021 11
VIOLET : End-to-End Video-Language Transformers with Masked Visual-token Modeling
arXiv 2021
A Dataset for Answering Time-Sensitive Questions
arXiv 2021
Attacking Open-domain Question Answering by Injecting Misinformation
Answering Complex Open-Domain Questions with Multi-Hop Dense Retrieval
ICLR 2021 1
Logical Natural Language Generation from Open-Domain Tables
r/Fakeddit: A New Multimodal Benchmark Dataset for Fine-grained Fake News Detection
arXiv 2019
Self-Supervised Learning for Contextualized Extractive Summarization
TabFact: A Large-scale Dataset for Table-based Fact Verification
ICLR 2020 1
Hate Lingo: A Target-based Linguistic Analysis of Hate Speech in Social Media
arXiv 2018