Dawn Song

DecodingTrust-Agent Platform (DTap): A Controllable and Interactive Red-Teaming Platform for AI Agents

arXiv 2026

dLLM: Simple Diffusion Language Modeling

arXiv 2026

MLS-Bench: A Holistic and Rigorous Assessment of AI Systems on Building Better AI

arXiv 2026

When Benign Inputs Lead to Severe Harms: Eliciting Unsafe Unintended Behaviors of Computer-Use Agents

arXiv 2026

InfoSynth: Information-Guided Benchmark Synthesis for LLMs

arXiv 2026

Adaptation of Agentic AI

arXiv 2025

An Illusion of Progress? Assessing the Current State of Web Agents

arXiv 2025

VERINA: Benchmarking Verifiable Code Generation

arXiv 2025

Learning to Reason without External Rewards

arXiv 2025

Scalable Best-of-N Selection for Large Language Models via Self-Certainty

arXiv 2025

On the Trustworthiness of Generative Foundation Models: Guideline, Assessment, and Perspective

arXiv 2025

Progent: Programmable Privilege Control for LLM Agents

arXiv 2025

FrontierCS: Evolving Challenges for Evolving Intelligence

arXiv 2025

AgentSynth: Scalable Task Generation for Generalist Computer-Use Agents

arXiv 2025

Climbing the Ladder of Reasoning: What LLMs Can-and Still Can't-Solve after SFT?

arXiv 2025

Improving LLM Safety Alignment with Dual-Objective Optimization

arXiv 2025

Can LLMs Design Good Questions Based on Context?

arXiv 2025

Are You Getting What You Pay For? Auditing Model Substitution in LLM APIs

arXiv 2025

Predicting Task Performance with Context-aware Scaling Laws

arXiv 2025

SteeringControl: Holistic Evaluation of Alignment Steering in LLMs

arXiv 2025

AgentPoison: Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases

arXiv 2024

Tamper-Resistant Safeguards for Open-Weight LLMs

arXiv 2024

KnowHalu: Hallucination Detection via Multi-Form Knowledge Based Factual Checking

arXiv 2024

Multimodal Situational Safety

arXiv 2024

Decoding Compressed Trust: Scrutinizing the Trustworthiness of Efficient LLMs Under Compression

arXiv 2024

RigorLLM: Resilient Guardrails for Large Language Models against Undesired Content

arXiv 2024

Can Editing LLMs Inject Harm?

arXiv 2024

BEEAR: Embedding-based Adversarial Removal of Safety Backdoors in Instruction-tuned Language Models

arXiv 2024

C-RAG: Certified Generation Risks for Retrieval-Augmented Language Models

arXiv 2024

CodeHalu: Investigating Code Hallucinations in LLMs via Execution-based Verification

arXiv 2024

Re-Tuning: Overcoming the Compositionality Limits of Large Language Models with Recursive Tuning

arXiv 2024

RedCode: Risky Code Execution and Generation Benchmark for Code Agents

arXiv 2024

Representation Engineering: A Top-Down Approach to AI Transparency

arXiv 2023

TrojDiff: Trojan Attacks on Diffusion Models with Diverse Targets

CVPR 2023

Agent Instructs Large Language Models to be General Zero-Shot Reasoners

arXiv 2023

Benchmarking Language Models for Code Syntax Understanding

arXiv 2022

Forecasting Future World Events with Neural Networks

arXiv 2022

Measuring Mathematical Problem Solving With the MATH Dataset

NeurIPS 2021

Measuring Coding Challenge Competence With APPS

arXiv 2021

Measuring Massive Multitask Language Understanding

ICLR 2021

The Many Faces of Robustness: A Critical Analysis of Out-of-Distribution Generalization

ICCV 2021

Aligning AI With Shared Human Values

arXiv 2020

Extracting Training Data from Large Language Models

arXiv 2020