IRIS Colloquium | Laurène Vaugrante: Compromising Honesty and Harmlessness in Language Models via Deception Attacks

May 28, 2025, 2:00 p.m. (CEST)

Our third colloquium this year will be given by Laurène Vaugrante. The event is internal to the university and open to students and other interested parties.

Time: May 28, 2025, 2:00 p.m. – 3:00 p.m.
Venue: Room 101 (UN 32.101), ground floor
Universitätsstr. 32 (entrance via Universitätsstr. 34)
Campus Vaihingen

Large language models (LLMs) can often be trusted to produce honest, harmless responses—yet they are not foolproof. We demonstrate a “deception attack” that fine-tunes LLMs to mislead users on chosen topics while remaining accurate elsewhere. Not only do these deceptive models undermine user trust, but they also exhibit toxic behaviors, including hate speech and harmful stereotypes. Our findings underscore the urgent need for stronger safeguards as LLMs become increasingly integrated into everyday applications.

The lecture will be held in English.
 
Join us for cake after the colloquium.

