🛡️Awesome LLM-Safety🛡️

October 12, 2024


English | 中文

🤗Introduction

Welcome to our Awesome-llm-safety repository! 🥰🥰🥰

🔥 News

  • 2024.05: Updated the NAACL 2024 papers collection. Thanks to @zhrli324 and @feqHe!

🧑‍💻 Our Work

We've curated a collection of the latest 😋, most comprehensive 😎, and most valuable 🤩 resources on large language model safety (llm-safety). Beyond papers, we also include relevant talks, tutorials, conferences, news, and articles. The repository is updated regularly so that you have current information at your fingertips.

If a resource is relevant to multiple subcategories, we place it under each applicable section. For instance, the "Awesome-LLM-Safety" repository is listed under each subcategory to which it pertains 🤩!

✔️ Perfect for Most Readers

  • For beginners curious about llm-safety, our repository serves as a compass for grasping the big picture and diving into the details. Classic or influential papers retained in the README provide beginner-friendly navigation through interesting directions in the field.
  • For seasoned researchers, this repository is a tool to keep you informed and fill any gaps in your knowledge. Within each subtopic, we are diligently updating all the latest content and continuously backfilling with previous work. Our thorough compilation and careful selection are time-savers for you.

🧭 How to Use this Guide

  • Quick Start: The README offers a curated selection of resources sorted by date, along with links to further information.
  • In-Depth Exploration: If you have a special interest in a particular subtopic, delve into the "subtopic" folder for more. Each item, be it an article or piece of news, comes with a brief introduction, allowing researchers to swiftly zero in on relevant content.

💼 How to Contribute

If you have completed an insightful work or carefully compiled conference papers, we would love to add your work to the repository.

  • For individual papers, you can raise an issue, and we will quickly add your paper under the corresponding subtopic.
  • If you have compiled a collection of papers for a conference, you are welcome to submit a pull request directly. We would greatly appreciate your contribution. Please note that these pull requests need to be consistent with our existing format.

📜Advertisement

🌱 If you would like more people to read your recent insightful work, please contact me via email. I can offer you a promotional spot here for up to one month.

Let’s start the LLM Safety tutorial!


🚀Table of Contents


🤔AI Safety & Security Discussions

| Date | Link | Publication | Authors |
|---|---|---|---|
| 2024/5/20 | Managing extreme AI risks amid rapid progress | Science | Yoshua Bengio, Geoffrey Hinton, Andrew Yao, Dawn Song, Pieter Abbeel, Trevor Darrell, Yuval Noah Harari, Ya-Qin Zhang, Lan Xue, Shai Shalev-Shwartz, Gillian Hadfield, Jeff Clune, Tegan Maharaj, Frank Hutter, Atılım Güneş Baydin, Sheila McIlraith, Qiqi Gao, Ashwin Acharya, David Krueger, Anca Dragan, Philip Torr, Stuart Russell, Daniel Kahneman, Jan Brauner, Sören Mindermann |

🔐Security & Discussion

📑Papers

| Date | Institute | Publication | Paper |
|---|---|---|---|
| 20.10 | Facebook AI Research | arXiv | Recipes for Safety in Open-domain Chatbots |
| 22.03 | OpenAI | NIPS2022 | Training language models to follow instructions with human feedback |
| 23.07 | UC Berkeley | NIPS2023 | Jailbroken: How Does LLM Safety Training Fail? |
| 23.12 | OpenAI | OpenAI | Practices for Governing Agentic AI Systems |

📖Tutorials, Articles, Presentations and Talks

| Date | Type | Title | URL |
|---|---|---|---|
| 22.02 | Toxicity Detection API | Perspective API | link, paper |
| 23.07 | Repository | Awesome LLM Security | link |
| 23.10 | Tutorials | Awesome-LLM-Safety | link |
| 24.01 | Tutorials | Awesome-LM-SSP | link |

Other

👉Latest&Comprehensive Security Paper


🔏Privacy

📑Papers

| Date | Institute | Publication | Paper |
|---|---|---|---|
| 19.12 | Microsoft | CCS2020 | Analyzing Information Leakage of Updates to Natural Language Models |
| 21.07 | Google Research | ACL2022 | Deduplicating Training Data Makes Language Models Better |
| 21.10 | Stanford | ICLR2022 | Large language models can be strong differentially private learners |
| 22.02 | Google Research | ICLR2023 | Quantifying Memorization Across Neural Language Models |
| 22.02 | UNC Chapel Hill | ICML2022 | Deduplicating Training Data Mitigates Privacy Risks in Language Models |

📖Tutorials, Articles, Presentations and Talks

| Date | Type | Title | URL |
|---|---|---|---|
| 23.10 | Tutorials | Awesome-LLM-Safety | link |
| 24.01 | Tutorials | Awesome-LM-SSP | link |

Other

👉Latest&Comprehensive Privacy Paper


📰Truthfulness & Misinformation

📑Papers

| Date | Institute | Publication | Paper |
|---|---|---|---|
| 21.09 | University of Oxford | ACL2022 | TruthfulQA: Measuring How Models Mimic Human Falsehoods |
| 23.11 | Harbin Institute of Technology | arXiv | A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions |
| 23.11 | Arizona State University | arXiv | Can Knowledge Graphs Reduce Hallucinations in LLMs? : A Survey |

📖Tutorials, Articles, Presentations and Talks

| Date | Type | Title | URL |
|---|---|---|---|
| 23.07 | Repository | llm-hallucination-survey | link |
| 23.10 | Repository | LLM-Factuality-Survey | link |
| 23.10 | Tutorials | Awesome-LLM-Safety | link |

Other

👉Latest&Comprehensive Truthfulness&Misinformation Paper


😈JailBreak & Attacks

📑Papers

| Date | Institute | Publication | Paper |
|---|---|---|---|
| 20.12 | Google | USENIX Security 2021 | Extracting Training Data from Large Language Models |
| 22.11 | AE Studio | NIPS2022 (ML Safety Workshop) | Ignore Previous Prompt: Attack Techniques For Language Models |
| 23.06 | Google | arXiv | Are aligned neural networks adversarially aligned? |
| 23.07 | CMU | arXiv | Universal and Transferable Adversarial Attacks on Aligned Language Models |
| 23.10 | University of Pennsylvania | arXiv | Jailbreaking Black Box Large Language Models in Twenty Queries |

📖Tutorials, Articles, Presentations and Talks

| Date | Type | Title | URL |
|---|---|---|---|
| 23.01 | Community | Reddit/ChatGPTJailbreak | link |
| 23.02 | Resource & Tutorials | Latest Jailbreak Prompts | link |
| 23.10 | Tutorials | Awesome-LLM-Safety | link |
| 23.10 | Article | Adversarial Attacks on LLMs (Author: Lilian Weng) | link |
| 23.11 | Video | [1hr Talk] Intro to Large Language Models, from 45:45 (Author: Andrej Karpathy) | link |
| 24.09 | Repo | awesome_LLM-harmful-fine-tuning-papers | link |
| 12.10 | Resource | Jailbreak Communities | link |
| 12.10 | Article | Jailbreak Techniques and Safeguards | link |

Other

👉Latest&Comprehensive JailBreak & Attacks Paper


🛡️Defenses & Mitigation

📑Papers

| Date | Institute | Publication | Paper |
|---|---|---|---|
| 21.07 | Google Research | ACL2022 | Deduplicating Training Data Makes Language Models Better |
| 22.04 | Anthropic | arXiv | Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback |

📖Tutorials, Articles, Presentations and Talks

| Date | Type | Title | URL |
|---|---|---|---|
| 23.10 | Tutorials | Awesome-LLM-Safety | link |

Other

👉Latest&Comprehensive Defenses Paper


💯Datasets & Benchmark

📑Papers

| Date | Institute | Publication | Paper |
|---|---|---|---|
| 20.09 | University of Washington | EMNLP2020 (findings) | RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models |
| 21.09 | University of Oxford | ACL2022 | TruthfulQA: Measuring How Models Mimic Human Falsehoods |
| 22.03 | MIT | ACL2022 | ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection |

📖Tutorials, Articles, Presentations and Talks

| Date | Type | Title | URL |
|---|---|---|---|
| 23.10 | Tutorials | Awesome-LLM-Safety | link |

📚Resource📚

Other

👉Latest&Comprehensive datasets & Benchmark Paper


🧑‍🎓Author

🤗If you have any questions, please contact our authors!🤗

✉️: ydyjya ➡️ zhouzhenhong@bupt.edu.cn

💬: LLM Safety Discussion

WeChat Group | My WeChat



⬆ Back to ToC