Ruoxi Jia
- Papers: 11
Authored papers (11)
LLM Can be a Dangerous Persuader: Empirical Study of Persuasion Safety in Large Language Models
arXiv 2025
Safety at Scale: A Comprehensive Survey of Large Model Safety
arXiv 2025
Sparse Autoencoder as a Zero-Shot Classifier for Concept Erasing in Text-to-Image Diffusion Models
arXiv 2025
LLMs Can Plan Only If We Tell Them
arXiv 2025
SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal Behaviors
arXiv 2024
RigorLLM: Resilient Guardrails for Large Language Models against Undesired Content
arXiv 2024
How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs
arXiv 2024
BEEAR: Embedding-based Adversarial Removal of Safety Backdoors in Instruction-tuned Language Models
arXiv 2024
Data-Centric Human Preference Optimization with Rationales
arXiv 2024
Revisiting Data-Free Knowledge Distillation with Poisoned Teachers
arXiv 2023
Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!
arXiv 2023
Affiliations
Frequent co-authors
10 from 11 papers