Patrick Schramowski

LlavaGuard: An Open VLM-based Framework for Safeguarding Vision Datasets and Models

arXiv 2024

ALERT: A Comprehensive Benchmark for Assessing Large Language Models' Safety through Red Teaming

arXiv 2024

T-FREE: Subword Tokenizer-Free Generative LLMs via Sparse Representations for Memory-Efficient Embeddings

arXiv 2024

SCAR: Sparse Conditioned Autoencoders for Concept Detection and Steering in LLMs

arXiv 2024

Multilingual Text-to-Image Generation Magnifies Gender Stereotypes and Prompt Engineering May Not Help You

arXiv 2024

Fair Diffusion: Instructing Text-to-Image Generation Models on Fairness

arXiv 2023

AtMan: Understanding Transformer Predictions Through Memory Efficient Attention Manipulation

NeurIPS 2023

Class Attribute Inference Attacks: Inferring Sensitive Class Information by Diffusion-Based Attribute Manipulations

arXiv 2023

Safe Latent Diffusion: Mitigating Inappropriate Degeneration in Diffusion Models

CVPR 2023

Can Machines Help Us Answering Question 16 in Datasheets, and In Turn Reflecting on Inappropriate Content?

arXiv 2022

Does CLIP Know My Face?

arXiv 2022

Exploiting Cultural Biases via Homoglyphs in Text-to-Image Synthesis

arXiv 2022

A Typology for Exploring the Mitigation of Shortcut Behavior

arXiv 2022

ILLUME: Rationalizing Vision-Language Models through Human Interactions

arXiv 2022

Revision Transformers: Instructing Language Models to Change their Values

arXiv 2022

Speaking Multiple Languages Affects the Moral Bias of Language Models

arXiv 2022

Inferring Offensiveness In Images From Natural Language Supervision

Adaptive Rational Activations to Boost Deep Reinforcement Learning

arXiv 2021

Large Pre-trained Language Models Contain Human-like Biases of What is Right and Wrong to Do

arXiv 2021

Interactively Providing Explanations for Transformer Language Models
