Zhexin Zhang

LongSafety: Evaluating Long-Context Safety of Large Language Models

arXiv 2025

Guiding not Forcing: Enhancing the Transferability of Jailbreaking Attacks on LLMs via Removing Superfluous Constraints

arXiv 2025

Be Careful When Fine-tuning On Open-Source LLMs: Your Fine-tuning Data Could Be Secretly Stolen!

arXiv 2025

How Should We Enhance the Safety of Large Reasoning Models: An Empirical Study

arXiv 2025

ShieldLM: Empowering LLMs as Aligned, Customizable and Explainable Safety Detectors

arXiv 2024

Agent-SafetyBench: Evaluating the Safety of LLM Agents

arXiv 2024

Knowledge-to-Jailbreak: Investigating Knowledge-driven Jailbreaking Attacks for Large Language Models

arXiv 2024

From Theft to Bomb-Making: The Ripple Effect of Unlearning in Defending Against Jailbreak Attacks

arXiv 2024

Seeker: Towards Exception Safety Code Generation with Intermediate Language Agents Framework

arXiv 2024

Safety Assessment of Chinese Large Language Models

arXiv 2023

SafetyBench: Evaluating the Safety of Large Language Models

arXiv 2023

Ethicist: Targeted Training Data Extraction Through Loss Smoothed Soft Prompting and Calibrated Confidence Estimation

arXiv 2023

Unveiling the Implicit Toxicity in Large Language Models

arXiv 2023

Defending Large Language Models Against Jailbreaking Attacks Through Goal Prioritization

arXiv 2023