🛡️Awesome LLM-Safety🛡️

October 12, 2024

English | 中文

🤗Introduction

Welcome to our Awesome-llm-safety repository! 🥰🥰🥰

🔥 News

2024.05: Updated the NAACL 2024 Papers Collection. Thanks to @zhrli324 and @feqHe!

🧑‍💻 Our Work

We've curated a collection of the latest 😋, most comprehensive 😎, and most valuable 🤩 resources on large language model safety (llm-safety). We also include relevant talks, tutorials, conferences, news, and articles. The repository is updated regularly so that the listed resources stay current.

If a resource is relevant to multiple subcategories, we place it under each applicable section. For instance, the "Awesome-LLM-Safety" repository is listed under every subcategory to which it pertains 🤩!

✔️ Perfect for the Majority

For beginners curious about llm-safety, our repository serves as a compass for grasping the big picture and diving into the details. The classic and influential papers retained in the README offer a beginner-friendly path through interesting directions in the field.
For seasoned researchers, this repository is a tool for staying informed and filling gaps in your knowledge. Within each subtopic, we keep adding the latest content and continuously backfill earlier work. Our thorough compilation and careful selection save you time.

🧭 How to Use this Guide

Quick Start: The README contains a curated selection of resources sorted by date, along with links to related resources.
In-Depth Exploration: If you have a special interest in a particular subtopic, delve into the "subtopic" folder for more. Each item, be it an article or piece of news, comes with a brief introduction, allowing researchers to swiftly zero in on relevant content.

💼 How to Contribute

If you have completed insightful work or carefully compiled a set of conference papers, we would love to add it to the repository.

For individual papers, you can raise an issue, and we will quickly add your paper under the corresponding subtopic.
If you have compiled a collection of papers for a conference, you are welcome to submit a pull request directly. We would greatly appreciate your contribution. Please note that these pull requests need to be consistent with our existing format.

📜Advertisement

🌱 If you would like more people to read your recent insightful work, please contact me via email. I can offer you a promotional spot here for up to one month.

Let’s start the LLM Safety tutorial!

🚀Table of Contents

🛡️Awesome LLM-Safety🛡️
- 🤗Introduction
- 🚀Table of Contents
- 🔐Security & Discussion
- 🔏Privacy
- 📰Truthfulness & Misinformation
- 😈JailBreak & Attacks
- 🛡️Defenses & Mitigation
  - 📖Tutorials, Articles, Presentations and Talks
  - Other
- 💯Datasets & Benchmark
- 🧑‍🏫 Scholars 👩‍🏫
- 🧑‍🎓Author

🤔AI Safety & Security Discussions

Date	Link	Authors	Publication
2024/5/20	Managing extreme AI risks amid rapid progress	Yoshua Bengio, Geoffrey Hinton, Andrew Yao, Dawn Song, Pieter Abbeel, Trevor Darrell, Yuval Noah Harari, Ya-Qin Zhang, Lan Xue, Shai Shalev-Shwartz, Gillian Hadfield, Jeff Clune, Tegan Maharaj, Frank Hutter, Atılım Güneş Baydin, Sheila McIlraith, Qiqi Gao, Ashwin Acharya, David Krueger, Anca Dragan, Philip Torr, Stuart Russell, Daniel Kahneman, Jan Brauner, Sören Mindermann	Science

🔐Security & Discussion

📑Papers

Date	Institute	Publication	Paper
20.10	Facebook AI Research	arxiv	Recipes for Safety in Open-domain Chatbots
22.03	OpenAI	NIPS2022	Training language models to follow instructions with human feedback
23.07	UC Berkeley	NIPS2023	Jailbroken: How Does LLM Safety Training Fail?
23.12	OpenAI	OpenAI	Practices for Governing Agentic AI Systems

📖Tutorials, Articles, Presentations and Talks

Date	Type	Title	URL
22.02	Toxicity Detection API	Perspective API	link paper
23.07	Repository	Awesome LLM Security	link
23.10	Tutorials	Awesome-LLM-Safety	link
24.01	Tutorials	Awesome-LM-SSP	link

🔏Privacy

📑Papers

Date	Institute	Publication	Paper
19.12	Microsoft	CCS2020	Analyzing Information Leakage of Updates to Natural Language Models
21.07	Google Research	ACL2022	Deduplicating Training Data Makes Language Models Better
21.10	Stanford	ICLR2022	Large language models can be strong differentially private learners
22.02	Google Research	ICLR2023	Quantifying Memorization Across Neural Language Models
22.02	UNC Chapel Hill	ICML2022	Deduplicating Training Data Mitigates Privacy Risks in Language Models

📖Tutorials, Articles, Presentations and Talks

Date	Type	Title	URL
23.10	Tutorials	Awesome-LLM-Safety	link
24.01	Tutorials	Awesome-LM-SSP	link

📰Truthfulness & Misinformation

📑Papers

Date	Institute	Publication	Paper
21.09	University of Oxford	ACL2022	TruthfulQA: Measuring How Models Mimic Human Falsehoods
23.11	Harbin Institute of Technology	arxiv	A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions
23.11	Arizona State University	arxiv	Can Knowledge Graphs Reduce Hallucinations in LLMs? : A Survey

📖Tutorials, Articles, Presentations and Talks

Date	Type	Title	URL
23.07	Repository	llm-hallucination-survey	link
23.10	Repository	LLM-Factuality-Survey	link
23.10	Tutorials	Awesome-LLM-Safety	link

😈JailBreak & Attacks

📑Papers

Date	Institute	Publication	Paper
20.12	Google	USENIX Security 2021	Extracting Training Data from Large Language Models
22.11	AE Studio	NIPS2022(ML Safety Workshop)	Ignore Previous Prompt: Attack Techniques For Language Models
23.06	Google	arxiv	Are aligned neural networks adversarially aligned?
23.07	CMU	arxiv	Universal and Transferable Adversarial Attacks on Aligned Language Models
23.10	University of Pennsylvania	arxiv	Jailbreaking Black Box Large Language Models in Twenty Queries

📖Tutorials, Articles, Presentations and Talks

Date	Type	Title	URL
23.01	Community	Reddit/ChatGPTJailbreak	link
23.02	Resource&Tutorials	Latest Jailbreak Prompts	link
23.10	Tutorials	Awesome-LLM-Safety	link
23.10	Article	Adversarial Attacks on LLMs(Author: Lilian Weng)	link
23.11	Video	[1hr Talk] Intro to Large Language Models From 45:45(Author: Andrej Karpathy)	link
24.09	Repo	awesome_LLM-harmful-fine-tuning-papers	link
12.10	Resource	Jailbreak Communities	link
12.10	Article	Jailbreak Techniques and Safeguards	link

🛡️Defenses & Mitigation

📑Papers

Date	Institute	Publication	Paper
21.07	Google Research	ACL2022	Deduplicating Training Data Makes Language Models Better
22.04	Anthropic	arxiv	Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

📖Tutorials, Articles, Presentations and Talks

Date	Type	Title	URL
23.10	Tutorials	Awesome-LLM-Safety	link

💯Datasets & Benchmark

📑Papers

Date	Institute	Publication	Paper
20.09	University of Washington	EMNLP2020(findings)	RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models
21.09	University of Oxford	ACL2022	TruthfulQA: Measuring How Models Mimic Human Falsehoods
22.03	MIT	ACL2022	ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection

📖Tutorials, Articles, Presentations and Talks

Date	Type	Title	URL
23.10	Tutorials	Awesome-LLM-Safety	link

📚Resource📚

Toxicity - RealToxicityPrompts dataset
Truthfulness - TruthfulQA dataset

Other

👉Latest & Comprehensive Datasets & Benchmark Papers

🧑‍🎓Author

🤗If you have any questions, please contact our authors!🤗

✉️: ydyjya ➡️ zhouzhenhong@bupt.edu.cn

💬: LLM Safety Discussion

WeChat Group | My WeChat

⬆ Back to ToC