Ruoxi Jia

Papers: 11

Attribution

Affiliations & profile: Semantic Scholar

Authored papers

Safety at Scale: A Comprehensive Survey of Large Model Safety

arXiv 2025

Sparse Autoencoder as a Zero-Shot Classifier for Concept Erasing in Text-to-Image Diffusion Models

arXiv 2025

LLM Can be a Dangerous Persuader: Empirical Study of Persuasion Safety in Large Language Models

arXiv 2025

LLMs Can Plan Only If We Tell Them

arXiv 2025

How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs

arXiv 2024

SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal Behaviors

arXiv 2024

Data-Centric Human Preference Optimization with Rationales

arXiv 2024

RigorLLM: Resilient Guardrails for Large Language Models against Undesired Content

arXiv 2024

BEEAR: Embedding-based Adversarial Removal of Safety Backdoors in Instruction-tuned Language Models

arXiv 2024

Revisiting Data-Free Knowledge Distillation with Poisoned Teachers

arXiv 2023

Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!

arXiv 2023

Affiliations

No known affiliations.

Frequent co-authors

from 11 papers

Yi Zeng

6 shared papers

Bo Li

4 shared papers

Dawn Song

professor

Ming Jin

Peter Henderson

Prateek Mittal

Tinghao Xie

Xiangyu Qi

Anit Sahu

Baoyuan Wu