We introduce the Granite Guardian models, a suite of safeguards designed to provide risk detection for prompts and responses, enabling safe and responsible use in combination with any large language model (LLM). These models offer comprehensive coverage across multiple risk dimensions, including social bias, profanity, violence, sexual content, unethical behavior, jailbreaking, and hallucination-related risks such as context relevance, groundedness, and answer relevance for retrieval-augmented generation (RAG). Trained on a unique dataset combining human annotations from diverse sources and synthetic data, Granite Guardian models address risks typically overlooked by traditional risk detection models, such as jailbreaks and RAG-specific issues. With AUC scores of 0.871 and 0.854 on harmful content and RAG-hallucination-related benchmarks respectively, Granite Guardian is the most generalizable and competitive model available in the space. Released as open-source, Granite Guardian aims to promote responsible AI development across the community. https://github.com/ibm-granite/granite-guardian
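The AUC figures quoted above summarize how well a detector's risk scores rank harmful examples above benign ones. As a minimal sketch of that metric only, the snippet below computes ROC AUC from binary labels and scores using the pairwise-ranking definition. The function name `roc_auc` and the toy labels and scores are illustrative assumptions, not data or evaluation code from the paper.

```python
def roc_auc(labels, scores):
    """ROC AUC as the probability that a random positive example
    receives a higher score than a random negative one (ties count half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive correctly ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half a win
    return wins / (len(positives) * len(negatives))


# Hypothetical toy data: 1 = risky, 0 = benign; scores are made-up detector outputs.
labels = [1, 0, 1, 0, 1, 0]
scores = [0.92, 0.10, 0.65, 0.70, 0.81, 0.05]
toy_auc = roc_auc(labels, scores)
print(f"toy AUC = {toy_auc:.3f}")
```

An AUC of 1.0 means every risky example outscores every benign one, while 0.5 corresponds to random ranking, which is the scale on which the reported 0.871 and 0.854 should be read.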
Granite Guardian
Granite Guardian models provide comprehensive risk detection for prompts and responses of large language models (LLMs), covering social bias, profanity, violence, sexual content, unethical behavior, jailbreaking, and hallucination risks in retrieval-augmented generation (RAG).
- Year: 2024
- Venue: arXiv 2024
- Authors: 23
Attribution
- Abstract & full text: arxiv.org/abs/2412.07724v2
- TL;DR: Semantic Scholar
Authors
Giulio Zizzo, Ambrish Rawat, Inkit Padhi, Erik Miehling, Pierre Dognin, Manish Nagireddy, Prasanna Sattigeri, Tejaswini Pedapati, Giandomenico Cornacchia, Kieran Fraser, Muhammad Zaid Hameed, Mark Purcell, Keerthiram Murugesan, Subhajit Chaudhury, Martín Santillán Cooper, Michael Desmond, Qian Pan, Zahra Ashktorab, Inge Vejsbjerg, Elizabeth M. Daly, Michael Hind, Werner Geyer, Kush R. Varshney