Datasets & Benchmark
May 30, 2026
📑Papers
| Date | Institute | Publication | Paper | Keywords |
|---|---|---|---|---|
| 20.09 | University of Washington | EMNLP2020(findings) | RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models | Toxicity |
| 21.09 | University of Oxford | ACL2022 | TruthfulQA: Measuring How Models Mimic Human Falsehoods | Truthfulness |
| 22.03 | MIT | ACL2022 | ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection | Toxicity |
| 23.07 | Zhejiang University; School of Engineering Westlake University | arxiv | Latent Jailbreak: A Benchmark for Evaluating Text Safety and Output Robustness of Large Language Models | Text Safety&Benchmark&Jailbreaking |
| 23.07 | Stevens Institute of Technology | NAACL2024(findings) | HateModerate: Testing Hate Speech Detectors against Content Moderation Policies | Hate Speech Detection&Content Moderation&Machine Learning |
| 23.08 | Meta Reality Labs | NAACL2024 | Head-to-Tail: How Knowledgeable are Large Language Models (LLMs)? | Large Language Models&Knowledge Graphs&Question Answering |
| 23.08 | Bocconi University | NAACL2024 | XSTEST: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models | Large Language Models&Safety Behaviours&Test Suite |
| 23.09 | LibrAI, MBZUAI, The University of Melbourne | arxiv | Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs | Safety Evaluation&Safeguards |
| 23.10 | University of Edinburgh, Huawei Technologies Co., Ltd. | NAACL2024 | Assessing the Reliability of Large Language Model Knowledge | Large Language Models&Factual Knowledge&Knowledge Probing |
| 23.10 | University of Pennsylvania | NAACL2024(findings) | Automatic Hallucination Assessment for Aligned Large Language Models via Transferable Adversarial Attacks | Hallucination Assessment&Adversarial Attacks&Large Language Models |
| 23.11 | Fudan University | arxiv | JADE: A Linguistic-based Safety Evaluation Platform for LLM | Safety Benchmarks |
| 23.11 | UNC-Chapel Hill | arxiv | Holistic Analysis of Hallucination in GPT-4V(ision): Bias and Interference Challenges | Hallucination&Benchmark&Multimodal |
| 23.11 | IBM Research AI | EMNLP2023(GEM workshop) | Unveiling Safety Vulnerabilities of Large Language Models | Adversarial Examples&Clustering&Automatically Identifying |
| 23.11 | The Hong Kong University of Science and Technology | arxiv | P-Bench: A Multi-level Privacy Evaluation Benchmark for Language Models | Differential Privacy&Privacy Evaluation |
| 23.11 | UC Berkeley | arxiv | CAN LLMS FOLLOW SIMPLE RULES? | Evaluation&Attack Strategies |
| 23.11 | University of Central Florida | arxiv | THOS: A Benchmark Dataset for Targeted Hate and Offensive Speech | Hate Speech&Offensive Speech&Dataset |
| 23.11 | Beijing Jiaotong University; DAMO Academy, Alibaba Group, Peng Cheng Lab | arXiv | AMBER: An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation | Multi-modal Large Language Models&Hallucination&Benchmark |
| 23.11 | Patronus AI, University of Oxford, Bocconi University | arxiv | SIMPLESAFETYTESTS: a Test Suite for Identifying Critical Safety Risks in Large Language Models | Safety Risks&Test Suite&Evaluation |
| 23.11 | University of Southern California, University of Pennsylvania, University of California Davis | arxiv | Deceiving Semantic Shortcuts on Reasoning Chains: How Far Can Models Go without Hallucination? | Hallucinations&Semantic Associations&Benchmark |
| 23.11 | Seoul National University, Chung-Ang University, NAVER AI Lab, NAVER Cloud, University of Richmond | arxiv | LifeTox: Unveiling Implicit Toxicity in Life Advice | LifeTox Dataset&Toxicity Detection&Social Media Analysis |
| 23.11 | School of Information Renmin University of China | arxiv | UHGEval: Benchmarking the Hallucination of Chinese Large Language Models via Unconstrained Generation | Hallucination&Evaluation Benchmark |
| 23.11 | UC Santa Cruz, UNC-Chapel Hill | arxiv | How Many Are in This Image? A Safety Evaluation Benchmark for Vision LLMs | Vision Large Language Models&Safety Evaluation&Adversarial Robustness |
| 23.11 | Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences; Baidu Inc. | arxiv | FFT: Towards Harmlessness Evaluation and Analysis for LLMs with Factuality Fairness Toxicity | Harmlessness Evaluation |
| 23.11 | Fudan University&Shanghai Artificial Intelligence Laboratory | NAACL2024 | Fake Alignment: Are LLMs Really Aligned Well? | Large Language Models&Safety Evaluation&Fake Alignment |
| 23.11 | Kahlert School of Computing | NAACL2024 | Whispers of Doubt Amidst Echoes of Triumph in NLP Robustness | NLP Robustness&Out-of-Domain Evaluation&Adversarial Evaluation |
| 23.11 | Shanghai Jiao Tong University | NAACL2024(findings) | CLEAN–EVAL: Clean Evaluation on Contaminated Large Language Models | Clean Evaluation&Data Contamination&Large Language Models |
| 23.12 | Meta | arxiv | Purple Llama CYBERSECEVAL: A Secure Coding Benchmark for Language Models | Safety&Cybersecurity&Code Security Benchmark |
| 23.12 | University of Illinois Chicago, Bosch Research North America & Bosch Center for Artificial Intelligence (BCAI), UNC Chapel-Hill | arxiv | DELUCIONQA: Detecting Hallucinations in Domain-specific Question Answering | Hallucination Detection&Domain-specific QA&Retrieval-augmented LLMs |
| 23.12 | University of Science and Technology of China, Hong Kong University of Science and Technology, Microsoft | arxiv | Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models | Indirect Prompt Injection Attacks&BIPIA Benchmark&Defense |
| 24.01 | NewsBreak, University of Illinois Urbana-Champaign | arxiv | RAGTruth: A Hallucination Corpus for Developing Trustworthy Retrieval-Augmented Language Models | Retrieval-Augmented Generation&Hallucination Detection&Dataset |
| 24.01 | University of Notre Dame, Lehigh University, Illinois Institute of Technology, Institut Polytechnique de Paris, William & Mary, Texas A&M University, Samsung Research America, Stanford University | ICML 2024 | TRUSTLLM: TRUSTWORTHINESS IN LARGE LANGUAGE MODELS | Trustworthiness&Benchmark Evaluation |
| 24.01 | University College London | arxiv | Hallucination Benchmark in Medical Visual Question Answering | Medical Visual Question Answering&Hallucination Benchmark |
| 24.01 | Carnegie Mellon University | arxiv | TOFU: A Task of Fictitious Unlearning for LLMs | Data Privacy&Ethical Concerns&Unlearning |
| 24.01 | IRLab CITIC Research Centre, Universidade da Coruña | arxiv | MetaHate: A Dataset for Unifying Efforts on Hate Speech Detection | Hate Speech Detection&Social Media |
| 24.01 | Northwestern University, New York University, University of Liverpool, Rutgers University | arxiv | AttackEval: How to Evaluate the Effectiveness of Jailbreak Attacking on Large Language Models | Jailbreak Attack&Evaluation Frameworks&Ground Truth Dataset |
| 24.01 | Shanghai Jiao Tong University | arxiv | R-Judge: Benchmarking Safety Risk Awareness for LLM Agents | LLM Agents&Safety Risk Awareness&Benchmark |
| 24.02 | University of Illinois Urbana-Champaign, Center for AI Safety, Carnegie Mellon University, UC Berkeley, Microsoft | arxiv | HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal | Automated Red Teaming&Robust Refusal |
| 24.02 | Shanghai Artificial Intelligence Laboratory, Harbin Institute of Technology, Beijing Institute of Technology, Chinese University of Hong Kong | arxiv | SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models | Safety Benchmark&Safety Evaluation&Hierarchical Taxonomy |
| 24.02 | Middle East Technical University | arxiv | HypoTermQA: Hypothetical Terms Dataset for Benchmarking Hallucination Tendency of LLMs | Hallucination&Benchmarking Dataset |
| 24.02 | Indian Institute of Technology Kharagpur | arxiv | How (un)ethical are instruction-centric responses of LLMs? Unveiling the vulnerabilities of safety guardrails to harmful queries | Instruction-centric Responses&Ethical Vulnerabilities |
| 24.03 | East China Normal University | arxiv | DiaHalu: A Dialogue-level Hallucination Evaluation Benchmark for Large Language Models | Dialogue-level Hallucination&Benchmarking&Human-machine Interaction |
| 24.03 | Tianjin University, Zhengzhou University, China Academy of Information and Communications Technology | arxiv | OpenEval: Benchmarking Chinese LLMs across Capability, Alignment, and Safety | Chinese LLMs&Benchmarking&Safety |
| 24.04 | University of Pennsylvania, ETH Zurich, EPFL, Sony AI | arxiv | JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models | Jailbreaking Attacks&Robustness Benchmark |
| 24.04 | Vector Institute for Artificial Intelligence, University of Limerick | arxiv | Developing Safe and Responsible Large Language Models - A Comprehensive Framework | Responsible AI&AI Safety&Generative AI |
| 24.04 | LMU Munich, University of Oxford, Siemens AG, Munich Center for Machine Learning (MCML), Wuhan University | arxiv | RED TEAMING GPT-4V: ARE GPT-4V SAFE AGAINST UNI/MULTI-MODAL JAILBREAK ATTACKS? | Jailbreak Attacks&GPT-4V&Evaluation Benchmark&Robustness |
| 24.04 | Bocconi University, University of Oxford | arxiv | SafetyPrompts: a Systematic Review of Open Datasets for Evaluating and Improving Large Language Model Safety | LLM Safety&Open Datasets&Systematic Review |
| 24.04 | University of Alberta&The University of Tokyo | arxiv | Online Safety Analysis for LLMs: a Benchmark, an Assessment, and a Path Forward | LLM Safety&Online Safety Analysis&Benchmark |
| 24.04 | Technion – Israel Institute of Technology, Google Research | arxiv | Constructing Benchmarks and Interventions for Combating Hallucinations in LLMs | Hallucinations&Benchmarks |
| 24.05 | Carnegie Mellon University | arxiv | PolygloToxicityPrompts: Multilingual Evaluation of Neural Toxic Degeneration in Large Language Models | Multilingual Evaluation&Datasets |
| 24.05 | Paul G. Allen School of Computer Science & Engineering | arxiv | MASSIVE Multilingual Abstract Meaning Representation: A Dataset and Baselines for Hallucination Detection | Hallucination Detection&Multilingual AMR&Dataset |
| 24.05 | University of California, Riverside | arxiv | Cross-Task Defense: Instruction-Tuning LLMs for Content Safety | Instruction-Tuning&LLM Safety&Content Safety |
| 24.06 | University of Waterloo | arxiv | TruthEval: A Dataset to Evaluate LLM Truthfulness and Reliability | Truthfulness&Reliability |
| 24.06 | Rutgers University | arxiv | MoralBench: Moral Evaluation of LLMs | Moral Evaluation&MoralBench |
| 24.06 | Tsinghua University | arxiv | Benchmarking Trustworthiness of Multimodal Large Language Models: A Comprehensive Study | Trustworthiness&MLLMs&Benchmark |
| 24.06 | Beijing Academy of Artificial Intelligence | arxiv | HalluDial: A Large-Scale Benchmark for Automatic Dialogue-Level Hallucination Evaluation | Hallucination Evaluation&Dialogue-Level&HalluDial |
| 24.06 | Sichuan University | arxiv | LEGEND: Leveraging Representation Engineering to Annotate Safety Margin for Preference Datasets | Safety Margin&Preference Datasets&Representation Engineering |
| 24.06 | The Hong Kong University of Science and Technology (Guangzhou) | arxiv | Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs | Jailbreak Attacks&Benchmarking |
| 24.06 | AI Innovation Center, China Unicom | arxiv | CHiSafetyBench: A Chinese Hierarchical Safety Benchmark for Large Language Models | Chinese Hierarchical Safety Benchmark&Large Language Models&Automatic Evaluation |
| 24.06 | | arxiv | Supporting Human Raters with the Detection of Harmful Content using Large Language Models | Harmful Content Detection&Hate Speech |
| 24.06 | South China University of Technology, Pazhou Laboratory, University of Maryland, Baltimore County | arxiv | GenderAlign: An Alignment Dataset for Mitigating Gender Bias in Large Language Models | Gender Bias Mitigation&Alignment Dataset&Bias Categories |
| 24.06 | Center for AI Safety and Governance, Institute for AI, Peking University | arxiv | SAFESORA: Towards Safety Alignment of Text2Video Generation via a Human Preference Dataset | Safety Alignment&Text2Video Generation |
| 24.06 | Fudan University | arxiv | Cross-Modality Safety Alignment | Multimodal Safety&Large Vision-Language Models&SIUO Benchmark |
| 24.06 | KAIST | arxiv | CSRT: Evaluation and Analysis of LLMs using Code-Switching Red-Teaming Dataset | Code-Switching&Red-Teaming&Multilingualism |
| 24.06 | University College London | arxiv | JobFair: A Framework for Benchmarking Gender Hiring Bias in Large Language Models | Gender Bias&Hiring Bias&Benchmarking |
| 24.06 | Peking University | arxiv | PKU-SafeRLHF: A Safety Alignment Preference Dataset for Llama Family Models | Safety Alignment&Preference Dataset |
| 24.06 | University of California, Los Angeles | arxiv | MOSSBench: Is Your Multimodal Language Model Oversensitive to Safe Queries? | Multimodal Language Models&Oversensitivity&Safety Mechanisms |
| 24.06 | Allen Institute for AI | arxiv | WILDGUARD: Open One-stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs | Safety Moderation&Jailbreak Attacks&Moderation Tools |
| 24.06 | University of Washington | arxiv | WILDTEAMING at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models | Jailbreaking&Safety Training&Adversarial Attacks |
| 24.07 | Beijing Jiaotong University | arxiv | KG-FPQ: Evaluating Factuality Hallucination in LLMs with Knowledge Graph-based False Premise Questions | Factuality Hallucination&Knowledge Graph&False Premise Questions |
| 24.07 | Chinese Academy of Sciences | arxiv | T2VSafetyBench: Evaluating the Safety of Text-to-Video Generative Models | Text-to-Video Generation&Safety Evaluation&Generative Models |
| 24.07 | Patronus AI | arxiv | Lynx: An Open Source Hallucination Evaluation Model | Hallucination Detection&RAG&Evaluation Model |
| 24.07 | Virginia Tech | arxiv | AIR-BENCH 2024: A Safety Benchmark Based on Risk Categories from Regulations and Policies | AI Safety&Regulations&Policies&Risk Categories |
| 24.07 | Columbia University | ECCV 2024 | HaloQuest: A Visual Hallucination Dataset for Advancing Multimodal Reasoning | Hallucination&Vision-Language Models&Datasets |
| 24.07 | Center for AI Safety | arxiv | Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress? | AI Safety&Benchmarks |
| 24.08 | Walled AI Labs | arxiv | WALLEDEVAL: A Comprehensive Safety Evaluation Toolkit for Large Language Models | AI Safety&Prompt Injection |
| 24.08 | ShanghaiTech University | arxiv | MMJ-Bench: A Comprehensive Study on Jailbreak Attacks and Defenses for Vision Language Models | Jailbreak Attacks&Vision-Language Models&Security |
| 24.08 | Stanford University | arxiv | Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risk of Language Models | Cybersecurity&Capture the Flag |
| 24.08 | Zhejiang University | arxiv | Characterizing and Evaluating the Reliability of LLMs against Jailbreak Attacks | Jailbreak Attacks&LLM Reliability&Evaluation Framework |
| 24.08 | Enkrypt AI | arxiv | SAGE-RT: Synthetic Alignment Data Generation for Safety Evaluation and Red Teaming | Synthetic Data Generation&Safety Evaluation&Red Teaming |
| 24.08 | Tianjin University | Findings of ACL 2024 | CMoralEval: A Moral Evaluation Benchmark for Chinese Large Language Models | Moral Evaluation&Moral Dilemma |
| 24.08 | University of Surrey | IJCAI 2024 | CodeMirage: Hallucinations in Code Generated by Large Language Models | Code Hallucinations&CodeMirage Dataset |
| 24.08 | Chalmers University of Technology | arxiv | LLMSecCode: Evaluating Large Language Models for Secure Coding | Secure Coding&Evaluation Framework |
| 24.09 | The Chinese University of Hong Kong | arxiv | Testing and Evaluation of Large Language Models: Correctness, Non-Toxicity, and Fairness | Correctness&Non-Toxicity&Fairness |
| 24.09 | KAIST | arxiv | Evaluating Image Hallucination in Text-to-Image Generation with Question-Answering | Image Hallucination&Text-to-Image Generation&Question-Answering |
| 24.09 | Zhejiang University | arxiv | GenTel-Safe: A Unified Benchmark and Shielding Framework for Defending Against Prompt Injection Attacks | Prompt Injection&LLM Safety&Benchmarking |
| 24.10 | Zhejiang University | arxiv | AGENT SECURITY BENCH (ASB): FORMALIZING AND BENCHMARKING ATTACKS AND DEFENSES IN LLM-BASED AGENTS | LLM-based Agents&Security Benchmarks&Adversarial Attacks |
| 24.10 | Zhejiang University, Duke University | arxiv | SCISAFEEVAL: A Comprehensive Benchmark for Safety Alignment of Large Language Models in Scientific Tasks | Safety Alignment&Scientific Tasks |
| 24.10 | The Chinese University of Hong Kong, Tencent AI Lab | arxiv | Chain-of-Jailbreak Attack for Image Generation Models via Editing Step by Step | Chain-of-Jailbreak&Image Generation Models&Safety |
| 24.10 | University of California, Santa Cruz, University of California, Berkeley | arxiv | Multimodal Situational Safety: A Benchmark for Large Language Models | Multimodal Situational Safety&MLLMs&Safety Benchmark |
| 24.10 | IBM Research | arxiv | ST-WebAgentBench: A Benchmark for Evaluating Safety and Trustworthiness in Web Agents | Web Agents&Safety&Trustworthiness |
| 24.10 | Renmin University of China, Anthropic, University of Oxford, University of Edinburgh, Mila, Tangentic | arxiv | POISONBENCH: Assessing Large Language Model Vulnerability to Data Poisoning | Data poisoning&LLM vulnerability&Preference learning |
| 24.10 | Gray Swan AI, UK AI Safety Institute | arxiv | AGENTHARM: A Benchmark for Measuring Harmfulness of LLM Agents | Jailbreaking&LLM agents&Harmful agent tasks |
| 24.10 | Purdue University | arxiv | COLLU-BENCH: A Benchmark for Predicting Language Model Hallucinations in Code | Code hallucinations&Code generation&Automated program repair |
| 24.10 | The Hong Kong University of Science and Technology (Guangzhou), University of Birmingham, Baidu Inc. | arxiv | JAILJUDGE: A Comprehensive Jailbreak Judge Benchmark with Multi-Agent Enhanced Explanation Evaluation Framework | Jailbreak judge&Multi-agent framework |
| 24.10 | University of Notre Dame, IBM Research | arxiv | BenchmarkCards: Large Language Model and Risk Reporting | BenchmarkCards&Bias&Fairness |
| 24.10 | Vectara, Inc., Iowa State University, University of Southern California, Entropy Technologies, University of Waterloo, Funix.io, University of Wisconsin, Madison | arxiv | FaithBench: A Diverse Hallucination Benchmark for Summarization by Modern LLMs | Hallucination detection&Human-annotated benchmark&Faithfulness |
| 24.10 | Southern University of Science and Technology | arxiv | ChineseSafe: A Chinese Benchmark for Evaluating Safety in Large Language Models | ChineseSafe&Content Safety&LLM Evaluation |
| 24.10 | Beihang University | arxiv | SafeBench: A Safety Evaluation Framework for Multimodal Large Language Models | Multimodal Large Language Models&Safety Evaluation Framework&Risk Assessment |
| 24.10 | University of Wisconsin-Madison | arxiv | CFSafety: Comprehensive Fine-grained Safety Assessment for LLMs | Safety Assessment&LLM Evaluation&Instruction Attacks |
| 24.10 | University of Pennsylvania | arxiv | Benchmarking LLM Guardrails in Handling Multilingual Toxicity | Multilingual Toxicity Detection&Guardrails&Jailbreaking Attacks |
| 24.10 | University of Wisconsin-Madison | arxiv | InjecGuard: Benchmarking and Mitigating Over-defense in Prompt Injection Guardrail Models | Prompt Injection Defense&Over-defense Detection&Guardrail Models |
| 24.10 | National Engineering Research Center for Software Engineering, Peking University | NeurIPS 2024 | SG-Bench: Evaluating LLM Safety Generalization Across Diverse Tasks and Prompt Types | LLM Safety&Prompt Engineering&Jailbreak Attacks |
| 24.10 | Alan Turing Institute | arxiv | AI Cyber Risk Benchmark: Automated Exploitation Capabilities | Cybersecurity&LLMs&Automated Exploitation |
| 24.11 | Fudan University | arXiv | LONGSAFETYBENCH: LONG-CONTEXT LLMS STRUGGLE WITH SAFETY ISSUES | Long-Context Models&Safety Evaluation&Benchmarking |
| 24.11 | Anthropic | arXiv | Rapid Response: Mitigating LLM Jailbreaks with a Few Examples | Jailbreak Defense&Rapid Response |
| 24.11 | Texas A&M University | arXiv | Responsible AI in Construction Safety: Systematic Evaluation of Large Language Models and Prompt Engineering | Construction Safety&Prompt Engineering&LLM Evaluation |
| 24.11 | IBM Research Europe | NeurIPS 2024 SafeGenAI Workshop | HarmLevelBench: Evaluating Harm-Level Compliance and the Impact of Quantization on Model Alignment | Jailbreaking Techniques&LLM Vulnerability&Quantization Impact |
| 24.11 | Peking University | arxiv | ChemSafetyBench: Benchmarking LLM Safety on Chemistry Domain | LLM Safety&Chemistry Domain&Benchmarking |
| 24.11 | New York University, JPMorgan Chase, Cornell Tech, Northeastern University | arxiv | Assessment of LLM Responses to End-user Security Questions | LLM Evaluation&End-user Security&Information Integrity |
| 24.11 | National Library of Medicine, NIH&University of Maryland&University of Virginia&Universidad de Chile | arxiv | Ensuring Safety and Trust: Analyzing the Risks of Large Language Models in Medicine | Medical AI&LLM Safety&MedGuard Benchmark |
| 24.11 | European Commission Joint Research Centre | EMNLP 2024 | GuardBench: A Large-Scale Benchmark for Guardrail Models | guardrail models&benchmark&evaluation |
| 24.12 | Vizuara AI Labs | arxiv | CBEVAL: A Framework for Evaluating and Interpreting Cognitive Biases in LLMs | Cognitive Biases&LLM Evaluation&Reasoning Limitations |
| 24.12 | Beijing Institute of Technology, Beihang University | arxiv | REFF: Reinforcing Format Faithfulness in Language Models across Varied Tasks | Format Faithfulness&Benchmark |
| 24.12 | UCLA, Salesforce AI Research | NeurIPS 2024 | SAFEWORLD: Geo-Diverse Safety Alignment | Geo-Diverse Alignment&Safety Evaluation&Legal Compliance |
| 24.12 | Shanghai Jiao Tong University | arxiv | SafeAgentBench: A Benchmark for Safe Task Planning of Embodied LLM Agents | Safety-Aware Task Planning&Embodied LLM Agents&Hazard Mitigation |
| 24.12 | Tsinghua University | arxiv | AGENT-SAFETYBENCH: Evaluating the Safety of LLM Agents | Agent Safety&Risk Awareness&Interactive Evaluation |
| 24.12 | TU Darmstadt | arxiv | LLMs Lost in Translation: M-ALERT Uncovers Cross-Linguistic Safety Gaps | Cross-Linguistic Safety&Multilingual Benchmark&LLM Alignment |
| 24.12 | Alibaba, China Academy of Information and Communications Technology | arxiv | Chinese SafetyQA: A Safety Short-form Factuality Benchmark for Large Language Models | Safety Benchmark&Factuality Evaluation |
| 24.12 | University of Warwick, Cranfield University | arxiv | MedHallBench: A New Benchmark for Assessing Hallucination in Medical Large Language Models | Medical Hallucinations&Benchmark&RLHF |
| 24.12 | The Hong Kong Polytechnic University | arxiv | SecBench: A Comprehensive Multi-Dimensional Benchmarking Dataset for LLMs in Cybersecurity | Cybersecurity Benchmark&Large Language Models&Dataset Evaluation |
| 25.01 | KTH Royal Institute of Technology | arxiv | CySecBench: Generative AI-based CyberSecurity-focused Prompt Dataset for Benchmarking Large Language Models | Cybersecurity Benchmark&Jailbreaking&Prompt Dataset |
| 25.01 | Shahjalal University of Science and Technology | arxiv | From Scarcity to Capability: Empowering Fake News Detection in Low-Resource Languages with LLMs | Fake News Detection&Bangla&Low-Resource Languages |
| 25.01 | NVIDIA | arxiv | AEGIS2.0: A Diverse AI Safety Dataset and Risks Taxonomy for Alignment of LLM Guardrails | AI Safety&Content Moderation Dataset&LLM Risk Taxonomy |
| 25.01 | Georgia Institute of Technology | arxiv | On The Origin of Cultural Biases in Language Models: From Pre-training Data to Linguistic Phenomena | Cultural Bias in LLMs&Cross-Linguistic Analysis&Arabic-English Benchmarks |
| 25.01 | Bocconi University | arxiv | MSTS: A Multimodal Safety Test Suite for Vision-Language Models | Multimodal Safety&Vision-Language Models |
| 25.01 | Fudan University | arxiv | You Can't Eat Your Cake and Have It Too: The Performance Degradation of LLMs with Jailbreak Defense | Jailbreak Defense&LLM Performance&USEBench |
| 25.01 | McGill University | arxiv | OnionEval: A Unified Evaluation of Fact-conflicting Hallucination for Small-Large Language Models | Fact-conflicting Hallucination&Small-Large Language Models (SLLMs)&Benchmark |
| 25.01 | HKUST | arxiv | Tune In, Act Up: Exploring the Impact of Audio Modality-Specific Edits on Large Audio Language Models in Jailbreak | Audio Language Models&Jailbreak Vulnerabilities&Audio Modality Edits |
| 25.01 | University of Cambridge | arxiv | CASE-BENCH: Context-Aware Safety Evaluation Benchmark for Large Language Models | LLM Safety&Context-Aware Evaluation&Over-Refusal |
| 25.01 | CISPA Helmholtz Center for Information Security | USENIX Security 2025 | HATEBENCH: Benchmarking Hate Speech Detectors on LLM-Generated Content and Hate Campaigns | Hate Speech Detection&LLM-Generated Content&Hate Campaigns |
| 25.01 | Shanghai Artificial Intelligence Laboratory, Tianjin University | arxiv | Rethinking Bottlenecks in Safety Fine-Tuning of Vision Language Models | Vision-Language Models (VLMs)&Chain-of-Thought (CoT) |
| 25.01 | Independent Research Team “Annyeong! Luda” | PACLIC 38 | RICoTA: Red-teaming of In-the-wild Conversation with Test Attempts | Jailbreaking&Conversational AI&User Intent Detection |
| 25.01 | Renmin University of China | arxiv | SafeRAG: Benchmarking Security in Retrieval-Augmented Generation of Large Language Models | Retrieval-Augmented Generation&Security Benchmarking&Adversarial Attacks |
| 25.02 | Rochester Institute of Technology | arxiv | HOPE VS. HATE: UNDERSTANDING USER INTERACTIONS WITH LGBTQ+ NEWS CONTENT IN MAINSTREAM US NEWS MEDIA THROUGH THE LENS OF HOPE SPEECH | Hope Speech&LGBTQ+&Political Bias |
| 25.02 | Nanjing University | arxiv | SafeDialBench: A Fine-Grained Safety Benchmark for Large Language Models in Multi-Turn Dialogues with Diverse Jailbreak Attacks | Safety Benchmark&Jailbreak Attacks&Multi-Turn Dialogues |
| 25.02 | China Unicom | arxiv | Safety Evaluation of DeepSeek Models in Chinese Contexts | DeepSeek Models&Safety Evaluation&Chinese Contexts |
| 25.02 | Hong Kong University of Science and Technology | arxiv | Can’t See the Forest for the Trees: Benchmarking Multimodal Safety Awareness for Multimodal LLMs | Multimodal LLMs&Safety Awareness&Benchmarking |
| 25.02 | Columbia University | arXiv | TREECUT: A Synthetic Unanswerable Math Word Problem Dataset for LLM Hallucination Evaluation | Math Word Problem&LLM Hallucination&Synthetic Dataset |
| 25.02 | Barcelona Supercomputing Center (BSC) | arXiv | Efficient Safety Retrofitting Against Jailbreaking for LLMs | Model Alignment&LLM Safety&Jailbreaking |
| 25.02 | The Hong Kong University of Science and Technology | arxiv | GuidedBench: Equipping Jailbreak Evaluation with Guidelines | Jailbreak Evaluation&AI Safety&LLM Security |
| 25.02 | Wuhan University | arxiv | ELBA-Bench: An Efficient Learning Backdoor Attacks Benchmark for Large Language Models | Backdoor Attacks&LLM Security&Benchmarking |
| 25.02 | Tsinghua University | arxiv | LongSafety: Evaluating Long-Context Safety of Large Language Models | Long-Context Safety&LLM Evaluation&Safety Benchmark |
| 25.02 | KAIST | arxiv | Is Safety Standard Same for Everyone? User-Specific Safety Evaluation of Large Language Models | User-Specific Safety&LLM Evaluation&Safety Benchmark |
| 25.02 | Beijing University of Posts and Telecommunications | arxiv | JailBench: A Comprehensive Chinese Security Assessment Benchmark for Large Language Models | Chinese Benchmark&Security Assessment&Jailbreak Attack |
| 25.02 | 34 Affiliates | arxiv | On the Trustworthiness of Generative Foundation Models: Guideline, Assessment, and Perspective | Safety Assessment&Guideline Paper |
| 25.02 | IBM Research | NeurIPS 2024, SafeGenAI Workshop | Adversarial Prompt Evaluation: Systematic Benchmarking of Guardrails Against Prompt Input Attacks on LLMs | Adversarial Prompting&Jailbreak Detection&Guardrail Benchmarking |
| 25.03 | Google DeepMind, ETH Zurich | arXiv | AutoAdvExBench: Benchmarking Autonomous Exploitation of Adversarial Example Defenses | Adversarial Example Defenses&LLM Security&Autonomous Exploitation |
| 25.03 | The Pennsylvania State University, GE Healthcare | arXiv | MedHEval: Benchmarking Hallucinations and Mitigation Strategies in Medical Large Vision-Language Models | Medical Hallucination&Hallucination Mitigation&Vision-Language Models |
| 25.03 | Nanyang Technological University, Agency for Science, Technology and Research (A*STAR) | arXiv | Benchmarking LLMs and LLM-based Agents in Practical Vulnerability Detection for Code Repositories | Software Vulnerability Detection&LLM-based Agents&Interprocedural Analysis |
| 25.03 | Zhejiang University | arxiv | Can Small Language Models Reliably Resist Jailbreak Attacks? A Comprehensive Evaluation | Small Language Models&Jailbreak Attack&Security Evaluation |
| 25.03 | University of Virginia | arxiv | Benchmarking Group Fairness in Reward Models | Group Fairness&Reward Models&LLM Alignment |
| 25.03 | Texas A&M University | arxiv | HalluVerse25: Fine-grained Multilingual Benchmark Dataset for LLM Hallucinations | Hallucination Detection&Multilingual Benchmark&Fine-grained Annotation |
| 25.03 | University of Kent | arxiv | CyberLLMInstruct: A New Dataset for Analysing Safety of Fine-Tuned LLMs Using Cyber Security Data | Cyber Security&Fine-Tuning Safety&Instruction Dataset |
| 25.03 | UC San Diego | arxiv | ExtremeAIGC: Benchmarking LMM Vulnerability to AI-Generated Extremist Content | Multimodal Jailbreaking&Extremist Content&Safety Evaluation |
| 25.03 | Beihang University | arxiv | Towards Understanding the Safety Boundaries of DeepSeek Models: Evaluation and Findings | Safety Evaluation&Jailbreak Attacks&DeepSeek Models |
| 25.03 | George Mason University | arxiv | Efficient but Vulnerable: Benchmarking and Defending LLM Batch Prompting Attack | Batch Prompting&Prompt Injection Attack&LLM Security |
| 25.03 | Arizona State University | arxiv | Personalized Attacks of Social Engineering in Multi-turn Conversations - LLM Agents for Simulation and Detection | Social Engineering&Multi-turn Conversation&LLM Agents |
| 25.03 | University of Illinois Urbana-Champaign | ICLR 2025 | MMDT: Decoding the Trustworthiness and Safety of Multimodal Foundation Models | Multimodal Foundation Models&Trustworthiness Evaluation&Safety Benchmark |
| 25.03 | University of Chicago | arxiv | SHIELDAGENT: Shielding Agents via Verifiable Safety Policy Reasoning | LLM Agents&Safety Policy Reasoning&Guardrails |
| 25.04 | University College London | arxiv | On Benchmarking Code LLMs for Android Malware Analysis | Code LLMs&Android Malware Analysis&Structured Evaluation |
| 25.04 | UC Santa Cruz | arxiv | STAR-1: Safer Alignment of Reasoning LLMs with 1K Data | Reasoning LLMs&Safety Alignment&High-Quality Dataset |
| 25.04 | National University of Singapore | arxiv | SafeMLRM: Demystifying Safety in Multi-modal Large Reasoning Models | MLRM&Safety Evaluation&Jailbreaking Attacks |
| 25.04 | Tsinghua University | arxiv | RealSafe-R1: Safety-Aligned DeepSeek-R1 without Compromising Reasoning Capability | Safety Alignment&DeepSeek-R1&Large Reasoning Models |
| 25.04 | The Hong Kong University of Science and Technology | arxiv | Benchmarking Multi-National Value Alignment for Large Language Models | LLM Hallucination&Value Alignment&Cross-National Evaluation |
| 25.04 | Qatar Computing Research Institute | arxiv | AIXAMINE: SIMPLIFIED LLM SAFETY AND SECURITY | LLM Safety&Security Evaluation&Black-box Benchmarking |
| 25.04 | University of California, Berkeley | arxiv | JailbreaksOverTime: Detecting Jailbreak Attacks Under Distribution Shift | Jailbreak Detection&Concept Drift&Continuous Learning |
| 25.04 | Intuit | arxiv | Security Steerability is All You Need | Security Steerability&Prompt Guardrails&LLM Robustness |
| 25.04 | University of Notre Dame | NAACL2025(System Demo) | TRUSTEVAL: A Dynamic Evaluation Toolkit on Trustworthiness of Generative Foundation Models | Trustworthiness Evaluation Across GenFMs |
| 25.05 | Rutgers University-New Brunswick | arxiv | Unmasking the Canvas: A Dynamic Benchmark for Image Generation Jailbreaking and LLM Content Safety | Jailbreaking&Image Generation&Multilingual Obfuscation |
| 25.05 | Beijing Jiaotong University | arxiv | Think in Safety: Unveiling and Mitigating Safety Alignment Collapse in Multimodal Large Reasoning Model | Multimodal Large Reasoning Models&Safety Alignment&Chain-of-Thought |
| 25.05 | Shanghai Artificial Intelligence Laboratory | arxiv | Benchmarking Ethical and Safety Risks of Healthcare LLMs in China – Toward Systemic Governance under Healthy China 2030 | Medical LLMs&Ethical Risk Assessment&Governance Framework |
| 25.05 | University of Technology Sydney | arxiv | SecReEvalBench: A Security Resilient Evaluation Benchmark for Large Language Models | Prompt Attack&Security Benchmark&Adversarial Robustness |
| 25.05 | Dartmouth College | arxiv | FalseReject: A Resource for Improving Contextual Safety and Mitigating Over-Refusals in LLMs via Structured Reasoning | Over-Refusal Mitigation&Structured Reasoning&LLM Safety Evaluation |
| 25.05 | National University of Singapore | arxiv | GuardReasoner-VL: Safeguarding VLMs via Reinforced Reasoning | VLM Guard Models&Reinforcement Learning&Reasoning Alignment |
| 25.05 | Fudan University | arxiv | Think Twice Before You Act: Enhancing Agent Behavioral Safety with Thought Correction | LLM Agent&Behavioral Safety&Thought Correction |
| 25.05 | Giskard AI | arxiv | Phare: A Safety Probe for Large Language Models | Safety Evaluation&Hallucination Detection&Bias Diagnosis |
| 25.05 | Beijing University of Posts and Telecommunications | arxiv | Video-SafetyBench: A Benchmark for Safety Evaluation of Video LVLMs | Safety Evaluation&Video LVLMs&Prompt Injection |
| 25.05 | Yonsei University | arxiv | Better Safe Than Sorry? Overreaction Problem of Vision Language Models in Visual Emergency Recognition | Visual Emergency Recognition&Vision-Language Models&Contextual Reasoning |
| 25.05 | POSTECH | arxiv | Are Vision-Language Models Safe in the Wild? A Meme-Based Benchmark Study | VLM Safety&Meme Benchmark&Multimodal Alignment |
| 25.05 | Mohamed bin Zayed University of Artificial Intelligence | arxiv | Audio Jailbreak: An Open Comprehensive Benchmark for Jailbreaking Large Audio-Language Models | Audio-Language Models&Jailbreak Attacks&Adversarial Benchmark |
| 25.05 | Fujitsu Research of Europe | arxiv | MAPS: A Multilingual Benchmark for Global Agent Performance and Security | Agentic AI&Multilingual Evaluation&LLM Security |
| 25.05 | Nanyang Technological University | arxiv | AudioTrust: Benchmarking the Multifaceted Trustworthiness of Audio Large Language Models | Audio LLMs&Trustworthiness Evaluation&Multimodal Safety |
| 25.05 | University of Science and Technology of China | arxiv | From Evaluation to Defense: Advancing Safety in Video Large Language Models | Video LLMs&Safety Benchmark&Multimodal Defense |
| 25.05 | Hong Kong University of Science and Technology (Guangzhou) | arxiv | JALMBench: Benchmarking Jailbreak Vulnerabilities in Audio Language Models | Jailbreak Benchmark&Audio Language Model&Adversarial Attack |
| 25.05 | University of Washington | arxiv | SOS BENCH: Benchmarking Safety Alignment on Scientific Knowledge | Safety Alignment&Scientific Knowledge&Benchmark |
| 25.05 | Johns Hopkins University | arxiv | Jailbreak Distillation: Renewable Safety Benchmarking | Jailbreak Distillation&Safety Benchmark&Prompt Selection |
| 25.05 | Alibaba Group | arxiv | USB: A Comprehensive and Unified Safety Evaluation Benchmark for Multimodal Large Language Models | Multimodal LLMs&Safety Evaluation&Benchmark |
| 25.05 | Wuhan University | arxiv | TRIDENT: Enhancing Large Language Model Safety with Tri-Dimensional Diversified Red-Teaming Data Synthesis | LLM Safety&Red-Teaming&Data Diversity |
| 25.06 | New York University Abu Dhabi | arxiv | AgentAuditor: Human-Level Safety and Security Evaluation for LLM Agents | LLM Agent&Safety Evaluation&Security Benchmark |
| 25.06 | Sabaragamuwa University of Sri Lanka | arxiv | XGUARD: A Graded Benchmark for Evaluating Safety Failures of Large Language Models on Extremist Content | LLM Safety&Extremist Content&Graded Evaluation |
| 25.06 | University of Zurich | arxiv | Simple Prompt Injection Attacks Can Leak Personal Data Observed by LLM Agents During Task Execution | Prompt Injection&Agent Security&Data Exfiltration |
| 25.06 | Sun Yat-sen University | arxiv | MTCMB: A Multi-Task Benchmark Framework for Evaluating LLMs on Knowledge, Reasoning, and Safety in Traditional Chinese Medicine | Traditional Chinese Medicine&LLM Benchmark&Safety Evaluation |
| 25.06 | NII-LLMC | arxiv | AnswerCarefully: A Dataset for Improving the Safety of Japanese LLM Output | LLM Safety&Japanese Dataset&Safety Fine-Tuning |
| 25.06 | University of Oxford | arxiv | AgentMisalignment: Measuring the Propensity for Misaligned Behaviour in LLM-Based Agents | LLM Agent&Misalignment Propensity&Safety Benchmark |
| 25.06 | ETRI | arxiv | HoliSafe: Holistic Safety Benchmarking and Modeling with Safety Meta Token for Vision-Language Model | VLM Safety&Safety Benchmark&Safety Meta Token |
| 25.06 | ByteDance | arxiv | SafeGenBench: A Benchmark Framework for Security Vulnerability Detection in LLM-Generated Code | Code Generation&Security Vulnerability&LLM Benchmark&Automatic Evaluation |
| 25.06 | ShanghaiTech University | arxiv | Beyond Jailbreaks: Revealing Stealthier and Broader LLM Security Risks Stemming from Alignment Failures | LLM Safety&Implicit Harm&JailFlipBench&Adversarial Attacks |
| 25.06 | Manipal University, GGSI University, IIITDM Kancheepuram, Meta AI, Amazon, BITS Goa | arxiv | AdversariaL attacK sAfety aLIgnment(ALKALI): Safeguarding LLMs through GRACE: Geometric Representation-Aware Contrastive Enhancement—Introducing Adversarial Vulnerability Quality Index (AVQI) | LLM Safety&Adversarial Benchmark&Latent Camouflage&Contrastive Alignment&AVQI |
| 25.06 | Microsoft, ISTA, Trend Micro, RainaResearch, University of Coimbra, Vietnamese German University, SK Shieldus, HiddenLayer | arxiv | LLMail-Inject: A Dataset from a Realistic Adaptive Prompt Injection Challenge | Prompt Injection&LLM Security&Benchmark Dataset&Email Assistant&Defense Evaluation |
| 25.06 | Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences; The Chinese University of Hong Kong, Shenzhen | arxiv | From Judgment to Interference: Early Stopping LLM Harmful Outputs via Streaming Content Monitoring | LLM Safety&Content Moderation&Streaming Detection&Early Stopping&Token-Level Annotation |
| 25.06 | Zhejiang University, Tencent, National University of Singapore | arxiv | ChineseHarm-Bench: A Chinese Harmful Content Detection Benchmark | Harmful Content Detection&Chinese Benchmark&LLM Safety&Knowledge Augmentation&Content Moderation |
| 25.06 | Institute of Artificial Intelligence (TeleAI), China Telecom; Beihang University; Beijing Jiaotong University; Northwestern Polytechnical University | ACL 2025 | LLMs Caught in the Crossfire: Malware Requests and Jailbreak Challenges | Jailbreak Attack&Malware Generation&LLM Security&Benchmark&Code Generation |
| 25.06 | Hong Kong University of Science and Technology, Peking University | Findings of ACL 2025 | SafeLawBench: Towards Safe Alignment of Large Language Models | LLM Safety&Legal Benchmark&Risk Hierarchy&Refusal Behavior&Safety Evaluation |
| 25.06 | The University of Hong Kong | arxiv | RealFactBench: A Benchmark for Evaluating Large Language Models in Real-World Fact-Checking | Fact-checking&Large Language Models&Benchmark&Multimodal Reasoning&Uncertainty Evaluation |
| 25.06 | dreadnode | arxiv | AIRTBench: Measuring Autonomous AI Red Teaming Capabilities in Language Models | AI Red Teaming&Security Benchmark&Capture-the-Flag&Autonomous Agents&Prompt Injection |
| 25.06 | Beihang University | arxiv | AGENTSAFE: Benchmarking the Safety of Embodied Agents on Hazardous Instructions | Embodied Agents&Safety Benchmark&Vision-Language Model&Hazardous Instructions&Jailbreak Attack |
| 25.06 | Zhejiang University | arxiv | RAS-Eval: A Comprehensive Benchmark for Security Evaluation of LLM Agents in Real-World Environments | LLM Agent&Security Evaluation&Benchmark&CWE&Tool Execution |
| 25.06 | Anthropic | arxiv | SHADE-Arena: Evaluating Sabotage and Monitoring in LLM Agents | Sabotage&Monitoring&LLM Agent |
| 25.06 | NASK – National Research Institute | arxiv | PL-Guard: Benchmarking Language Model Safety for Polish | Safety Benchmark&Polish LLM&Adversarial Robustness |
| 25.06 | University of Illinois Urbana-Champaign | arxiv | GUARDSET-X: Multi-Domain, Policy-Grounded, AI Security Guardrail Benchmark | Guardrail Dataset&Safety Evaluation&Policy-Grounded |
| 25.06 | Emory University | arxiv | Model Editing as a Double-Edged Sword: Steering Agent Ethical Behavior Toward Beneficence or Harm | Model Editing&Ethical Alignment&Behavior Control |
| 25.06 | University of Illinois Urbana-Champaign | arxiv | Beyond Reactive Safety: Risk-Aware LLM Alignment via Long-Horizon Simulation | LLM Alignment&Long-Horizon Simulation&Risk-Aware |
| 25.06 | University of Illinois Urbana-Champaign | arxiv | HalluSegBench: Counterfactual Visual Reasoning for Segmentation Hallucination Evaluation | Hallucination&Segmentation&Counterfactual Reasoning |
| 25.07 | Inner Mongolia University | arxiv | McBE: A Multi-task Chinese Bias Evaluation Benchmark for Large Language Models | Bias Evaluation&Chinese LLM&Benchmark |
| 25.07 | University of Edinburgh | arxiv | Controlling What You Share: Assessing Language Model Adherence to Privacy Preferences | Privacy Profile&LLM&PII |
| 25.07 | University of California, Los Angeles | arxiv | ModelCitizens: Representing Community Voices in Online Safety | Toxicity Detection&Community Annotation&Context |
| 25.07 | Carnegie Mellon University | arxiv | OpenAgentSafety: A Comprehensive Framework for Evaluating Real-World AI Agent Safety | Agent Safety&Evaluation&Risk |
| 25.05 | ETH Zurich | ACL 2025 Findings | ACE: Benchmarking Agent-based Code Execution Safety in LLMs | Agent Safety&Code Execution&Benchmark |
| 25.05 | National University of Singapore | ACL 2025 | DORIS: A Fine-grained Benchmark for Detecting Hallucinations in Long-form Text Generation | Hallucination Detection&Long-form Generation&Benchmark |
| 25.05 | Tsinghua University | COLM 2025 | MEmoBench: Evaluating Memorization-based Security Risks in LLMs | Memorization&Security Risk&Benchmark |
| 25.07 | CISPA Helmholtz Center for Information Security | USENIX Security 2025 | Bridging the Gap in Vision Language Models in Identifying Unsafe Concepts Across Modalities | VLM&Safety&Alignment |
| 25.07 | Maastricht University | arxiv | Dutch CrowS-Pairs: Adapting a Challenge Dataset for Measuring Social Biases in Language Models for Dutch | Social Bias&Dataset&Dutch |
| 25.07 | Ben-Gurion University of the Negev | arxiv | Tab-MIA: A Benchmark Dataset for Membership Inference Attacks on Tabular Data in LLMs | Membership Inference&Tabular Data&Privacy |
| 25.07 | RWTH Aachen University | GeBNLP @ ACL 2025 | Exploring Gender Bias in Large Language Models: An In-depth Dive into the German Language | Gender Bias&German&LLM |
| 25.07 | Chinese Academy of Sciences, University of Chinese Academy of Sciences | arxiv | Libra: Large Chinese-based Safeguard for AI Content | Safeguard&Chinese&Safety |
| 25.07 | Peking University, Peking Union Medical College | arxiv | A Novel Evaluation Benchmark for Medical LLMs: Illuminating Safety and Effectiveness in Clinical Domains | MedicalLLM&Safety&Effectiveness |
| 25.07 | Glass Box Medicine | arxiv | Large language models provide unsafe answers to patient-posed medical questions | PatientSafety&MedicalAdvice&Chatbots |
| 25.07 | University of Cambridge | arxiv | TRIDENT: Benchmarking LLM Safety in Finance, Medicine, and Law | Safety&Benchmark&Compliance |
| 25.07 | Mohamed bin Zayed University of Artificial Intelligence, University of Delhi | arxiv | UnsafeChain: Enhancing Reasoning Model Safety via Hard Cases | SafetyAlignment&HardPrompts&Reasoning |
| 25.08 | AIM Intelligence, Seoul National University | arxiv | Eliciting and Analyzing Emergent Misalignment in State-of-the-Art Large Language Models | LLM Alignment&Misalignment&Red-Teaming&Narrative Manipulation |
| 25.08 | Shanghai Jiao Tong University | arxiv | Beyond Uniform Criteria: Scenario-Adaptive Multi-Dimensional Jailbreak Evaluation | Jailbreak Evaluation&Scenario-Adaptivity&LLM Safety |
| 25.08 | Shanghai Jiao Tong University | arxiv | MCSBench: Benchmarking Multimodal Commonsense Reasoning of Large Vision-Language Models | Multimodal Commonsense Reasoning&Large Vision-Language Models&Benchmark |
| 25.08 | Fordham University | arxiv | Towards Safer AI Moderation: Evaluating LLM Moderators Through a Unified Benchmark Dataset and Advocating a Human-First Approach | AI Moderation&Bias Detection&Human-in-the-Loop |
| 25.08 | Apple | COLM 2025 | Investigating Intersectional Bias in Large Language Models using Confidence Disparities in Coreference Resolution | Intersectional Bias&Coreference Resolution&Fairness Evaluation |
| 25.08 | Tsinghua University | arxiv | Omni-SafetyBench: A Benchmark for Safety Evaluation of Audio-Visual Large Language Models | Audio-Visual LLMs&Safety Evaluation&Cross-Modal Consistency |
| 25.08 | The Hong Kong University of Science and Technology (Guangzhou) | arxiv | Exploring Safety Alignment Evaluation of LLMs in Chinese Mental Health Dialogues via LLM-as-Judge | Mental Health Dialogues&Safety Alignment&LLM-as-Judge |
| 25.08 | Peking University | ACM MM 2025 | Safe Semantics, Unsafe Interpretations: Tackling Implicit Reasoning Safety in Large Vision-Language Models | Vision-Language Models&Implicit Reasoning Safety&In-Context Learning |
| 25.08 | Guizhou University | arxiv | XFacta: Contemporary, Real-World Dataset and Evaluation for Multimodal Misinformation Detection with Multimodal LLMs | Multimodal Misinformation Detection&MLLM Evaluation&Contemporary Dataset |
| 25.08 | Shanghai Artificial Intelligence Laboratory | arxiv | LinguaSafe: A Comprehensive Multilingual Safety Benchmark for Large Language Models | Multilingual Safety&Benchmark&LLM Evaluation |
| 25.08 | University of Science and Technology of China, Beihang University | arxiv | MCPTox: A Benchmark for Tool Poisoning Attack on Real-World MCP Servers | Tool Poisoning&LLM Agent&Benchmark |
| 25.08 | Tsinghua University, Beihang University, RealAI | arxiv | Unveiling Trust in Multimodal Large Language Models: Evaluation, Analysis, and Mitigation | Multimodal LLMs&Trustworthiness&Benchmark |
| 25.08 | Shanghai AI Lab, East China Normal University | arxiv | SafetyFlow: An Agent-Flow System for Automated LLM Safety Benchmarking | LLM Safety&Automated Benchmarking&Agent System |
| 25.08 | Oregon State University | arxiv | Mind the Gap: Time-of-Check to Time-of-Use Vulnerabilities in LLM-Enabled Agents | LLM Agents&TOCTOU Vulnerabilities&Security Mitigation |
| 25.08 | University of Wisconsin-Madison | arxiv | PRISM: Robust VLM Alignment with Principled Reasoning for Integrated Safety in Multimodality | Vision-Language Models&Multimodal Safety&Chain-of-Thought Alignment |
| 25.08 | Shandong University | arxiv | Mind the Third Eye! Benchmarking Privacy Awareness in MLLM-powered Smartphone Agents | Smartphone Agents&Privacy Awareness&SAPA-Bench |
| 25.08 | CISPA Helmholtz Center for Information Security | arxiv | JADES: A Universal Framework for Jailbreak Assessment via Decompositional Scoring | Jailbreak Assessment&Decompositional Scoring&Fact-Checking |
| 25.09 | Shanghai Artificial Intelligence Laboratory | EMNLP 2025 Findings | Self-adaptive Dataset Construction for Real-World Multimodal Safety Scenarios | Multimodal Safety&Dataset Construction&Risk Evaluation |
| 25.09 | Sichuan University | EMNLP 2025 Findings | CANDY: Benchmarking LLMs' Limitations and Assistive Potential in Chinese Misinformation Fact-Checking | Misinformation Fact-Checking&Chinese LLMs&Benchmark Dataset |
| 25.09 | Beihang University | arxiv | VaccineRAG: Boosting Multimodal Large Language Models' Immunity to Harmful RAG Samples | Retrieval-Augmented Generation&Chain-of-Thought&Multimodal LLMs |
| 25.09 | Leibniz Institute for Resilience Research | arxiv | Scaling behavior of large language models in emotional safety classification across sizes and tasks | Emotional Safety&Model Scaling&Mental Health Applications |
| 25.09 | King Fahd University of Petroleum and Minerals | arxiv | AraHalluEval: A Fine-grained Hallucination Evaluation Framework for Arabic LLMs | Arabic LLMs&Hallucination Evaluation&Multilingual Models |
| 25.09 | Stevens Institute of Technology | arxiv | Behind the Mask: Benchmarking Camouflaged Jailbreaks in Large Language Models | Camouflaged Jailbreaks&Benchmark Dataset&LLM Security |
| 25.09 | Sharif University of Technology | arxiv | EPT Benchmark: Evaluation of Persian Trustworthiness in Large Language Models | Trustworthiness&Persian LLMs&Cultural Alignment |
| 25.09 | Beijing Institute of Technology | arxiv | SafeToolBench: Pioneering a Prospective Benchmark to Evaluating Tool Utilization Safety in LLMs | Tool Safety&LLM Agents&Prospective Evaluation |
| 25.09 | Appen | arxiv | Red Teaming Multimodal Language Models: Evaluating Harm Across Prompt Modalities and Models | Multimodal LLMs&Red Teaming&Adversarial Prompts&Safety Evaluation |
| 25.09 | Peking University, TeleAI (China Telecom), UCAS | arxiv | When Safe Unimodal Inputs Collide: Optimizing Reasoning Chains for Cross-Modal Safety in Multimodal Large Language Models | Multimodal Safety&Reasoning Path Optimization&Implicit Reasoning Risk |
| 25.09 | University of Washington, IAEA | arxiv | Automated Triaging and Transfer Learning of Incident Learning Safety Reports Using Large Language Representational Models | Incident Learning Systems&Transfer Learning&BlueBERT |
| 25.09 | EPFL & Tampere University & University of Tokyo & National Institute of Informatics | EMNLP 2025 | TactfulToM: Do LLMs Have the Theory of Mind Ability to Understand White Lies? | Theory of Mind&White Lies&Social Reasoning |
| 25.09 | Singapore Management University, National University of Singapore, Monash University, Aalto University, York University, Zhejiang University | arxiv | SecureAgentBench: Benchmarking Secure Code Generation under Realistic Vulnerability Scenarios | Secure Code Generation&LLM Agents&Software Security Benchmark |
| 25.09 | Zhejiang University | arxiv | HarmMetric Eval: Benchmarking Metrics and Judges for LLM Harmfulness Assessment | LLM Harmfulness Evaluation&Benchmark&Safety Alignment |
| 25.09 | Shanghai Jiao Tong University, China Telecom Institute of Artificial Intelligence, USTC, GigaAI, Xinjiang University | arxiv | RADAR: A Risk-Aware Dynamic Multi-Agent Framework for LLM Safety Evaluation via Role-Specialized Collaboration | LLM Safety Evaluation&Multi-Agent Collaboration&Risk-Aware Framework |
| 25.09 | UC Berkeley, AWS AI Labs | arxiv | STAC: When Innocent Tools Form Dangerous Chains to Jailbreak LLM Agents | LLM Agents&Sequential Tool Attacks&Multi-turn Security Vulnerability |
| 25.09 | Chinese Academy of Sciences | arxiv | SafeMind: Benchmarking and Mitigating Safety Risks in Embodied LLM Agents | Embodied LLM Agents&Safety Benchmark&Risk Mitigation |
| 25.09 | Shanghai AI Laboratory, Fudan University, The University of Hong Kong, East China Normal University | arxiv | SafeEvalAgent: Toward Agentic and Self-Evolving Safety Evaluation of LLMs | Safety Evaluation&Multi-Agent Framework&Dynamic Benchmarking |
| 25.10 | UC San Diego, Columbia University, Zhejiang University, MIT-IBM Watson Lab | arxiv | BiasFreeBench: A Benchmark for Mitigating Bias in Large Language Model Responses | Bias Mitigation&Benchmarking&Fairness Evaluation |
| 25.10 | Duke University | arxiv | WAInjectBench: Benchmarking Prompt Injection Detections for Web Agents | Prompt Injection Detection&Web Agents&Multimodal Security&Benchmarking |
| 25.10 | AITRICS, KAIST, Severance Hospital (Yonsei University) | EMNLP 2025 | Taxonomy of Comprehensive Safety for Clinical Agents | Clinical Safety&Taxonomy&Intent Classification&LLM Safety |
| 25.10 | Rutgers University | arxiv | Read the Scene, Not the Script: Outcome-Aware Safety for LLMs | LLM Safety Alignment&Consequence-Blindness&Outcome-Aware Reasoning |
| 25.10 | University of California, San Diego | arxiv | When Benchmarks Age: Temporal Misalignment through Large Language Model Factuality Evaluation | Benchmark Aging&Factuality Evaluation&Temporal Misalignment |
| 25.10 | Beihang University & Chinese Academy of Sciences & China University of Petroleum (East China) & Zhejiang University of Technology & 360 AI Security Lab & University of Sydney | arxiv | SecureWebArena: A Holistic Security Evaluation Benchmark for LVLM-based Web Agents | Web Agent Security&LVLM Benchmark&Adversarial Evaluation&Prompt Injection |
| 25.10 | Carnegie Mellon University & Amazon AGI | arxiv | RefusalBench: Generative Evaluation of Selective Refusal in Grounded Language Models | Selective Refusal&RAG Evaluation&Generative Benchmark |
| 25.10 | Beijing University of Posts and Telecommunications, University of California Santa Barbara | arxiv | MCP Security Bench (MSB): Benchmarking Attacks Against Model Context Protocol in LLM Agents | Model Context Protocol&Agent Security&Benchmark |
| 25.10 | National Taiwan University, NVIDIA | arxiv | Investigating Safety Vulnerabilities of Large Audio-Language Models Under Speaker Emotional Variations | Audio-Language Models&Safety Alignment&Emotional Variation |
| 25.10 | Apple | arxiv | VLSU: Mapping the Limits of Joint Multimodal Understanding for AI Safety | Multimodal Safety&Vision-Language Models&Benchmark |
| 25.10 | Amirkabir University of Technology, KCL, Politecnico di Milano, KTH | arxiv | PBBQ: A Persian Bias Benchmark Dataset Curated with Human-AI Collaboration for Large Language Models | Bias Benchmark&Persian LLM&Human-AI Collaboration |
| 25.10 | NYU Shanghai, NUS, Yale University, NYU Center for Data Science | EMNLP 2025 | FinTrust: A Comprehensive Benchmark of Trustworthiness Evaluation in Finance Domain | Trustworthiness Benchmark&Financial LLM&Ethical Alignment |
| 25.10 | Beihang University, Tsinghua University, Shanghai Qi Zhi Institute | NeurIPS 2025 | DeceptionBench: A Comprehensive Benchmark for AI Deception Behaviors in Real-world Scenarios | Deception Behavior&Benchmark&LLM Safety |
| 25.10 | Manchester Metropolitan University | arxiv | Safeguarding Efficacy in Large Language Models: Evaluating Resistance to Human-Written and Algorithmic Adversarial Prompts | Large Language Models&Adversarial Attacks&AI Safety |
| 25.10 | Lakera AI, ETH Zürich, UK AI Security Institute, University of Oxford | arxiv | Breaking Agent Backbones: Evaluating the Security of Backbone LLMs in AI Agents | AI Agent Security&Backbone LLM Benchmarking&Threat Modeling |
| 25.10 | William & Mary | arxiv | Measuring the Security of Mobile LLM Agents under Adversarial Prompts from Untrusted Third-Party Channels | Mobile LLM Agent&Adversarial Prompt&Security Evaluation |
| 25.10 | Hong Kong Baptist University | arxiv | MemeArena: Automating Context-Aware Unbiased Evaluation of Harmfulness Understanding for Multimodal Large Language Models | Multimodal Large Language Model&Harmfulness Evaluation&Agent-based Framework |
| 25.11 | Penn State University | arxiv | Do Methods to Jailbreak and Defend LLMs Generalize Across Languages? | Multilingual Jailbreak&Cross-lingual Safety&Defense Robustness |
| 25.11 | Tsinghua University | arxiv | LiveSecBench: A Dynamic and Culturally-Relevant AI Safety Benchmark for LLMs in Chinese Context | AI Safety Benchmark&Chinese LLM Evaluation&Cultural Alignment |
| 25.11 | Nanjing University | arxiv | Understanding New-Knowledge-Induced Factual Hallucinations in LLMs: Analysis, Solution, and Interpretation | Factual Hallucination&New Knowledge&Attention Mechanism |
| 25.11 | German Research Center for Artificial Intelligence | arxiv | CONVERSE: Benchmarking Contextual Safety in Agent-to-Agent Conversations | Agent-to-Agent Safety&Privacy Leakage&Security Attacks |
| 25.11 | Southern University of Science and Technology | arxiv | MCP-RiskCue: Can LLM Infer Risk Information From MCP Server System Logs? | MCP Security&System Log Analysis&Risk Detection Benchmark |
| 25.11 | Beijing Institute of AI Safety and Governance | arxiv | Efficient LLM Safety Evaluation through Multi-Agent Debate | LLM Safety Evaluation&Multi-Agent Debate&Jailbreak Benchmark |
| 25.11 | Xi’an Jiaotong University | arxiv | How Brittle is Agent Safety? Rethinking Agent Risk under Intent Concealment and Task Complexity | Agent Safety&Intent Concealment&Task Complexity |
| 25.11 | Fudan University / City University of Hong Kong / Singapore Management University | arxiv | AttackVLA: Benchmarking Adversarial and Backdoor Attacks on Vision–Language–Action Models | Vision–Language–Action Models&Adversarial Attacks&Backdoor Attacks |
| 25.11 | Nanjing University | arxiv | SafeRBench: A Comprehensive Benchmark for Safety Assessment in Large Reasoning Models | Safety Benchmark&Large Reasoning Models&Chain-of-Thought Safety |
| 25.11 | École Polytechnique, Sorbonne Paris Nord University, Crédit Agricole SA | arxiv | MUCH: A Multilingual Claim Hallucination Benchmark | Uncertainty Quantification&Hallucination Detection&Multilingual Benchmark |
| 25.11 | Intuit AI Security Research | arxiv | ASTRA: Agentic Steerability and Risk Assessment Framework | Agentic AI Security&Steerability Evaluation&Risk Assessment Framework |
| 25.11 | Scale AI, University of Maryland, UNC, Google DeepMind, Netflix, UT Austin | arxiv | PropensityBench: Evaluating Latent Safety Risks in Large Language Models via an Agentic Approach | LLM Safety&Risk Propensity&Agentic Evaluation |
| 25.11 | University at Buffalo, SUNY | arxiv | DUALGAUGE: Automated Joint Security–Functionality Benchmarking for Secure Code Generation | Secure Code Generation&Benchmarking&LLM Evaluation |
| 25.12 | Bangladesh University of Engineering and Technology | arxiv | When Safety Blocks Sense: Measuring Semantic Confusion in LLM Refusals | Safety Alignment&Semantic Confusion&False Refusal Analysis |
| 25.12 | BITS Pilani, Hyderabad Campus | arxiv | BHRAM-IL: A Benchmark for Hallucination Recognition and Assessment in Multiple Indian Languages | Multilingual Hallucination Detection&Indic Languages&Benchmark Dataset |
| 25.12 | University of Leeds, Ben-Gurion University of the Negev | arxiv | DialogGuard: Multi-Agent Psychosocial Safety Evaluation of Sensitive LLM Responses | Psychosocial Safety&Multi-Agent Evaluation&LLM-as-a-Judge |
| 25.12 | Carnegie Mellon University, Columbia University, Johns Hopkins University, HydroX AI | arxiv | Is Vibe Coding Safe? Benchmarking Vulnerability of Agent-Generated Code in Real-World Tasks | Vibe Coding&AI Agents&Secure Code Generation&Benchmarking |
| 25.12 | Harbin Institute of Technology (Shenzhen) | NeurIPS 2025 / arxiv | Context-Aware Hierarchical Learning: A Two-Step Paradigm Towards Safer LLMs | LLM Security&Prompt Injection Defense&Tool-Augmented Models&Context-Aware Hierarchical Learning (CAHL) |
| 25.12 | Institute of Artificial Intelligence (TeleAI) of China Telecom | arxiv | TeleAI-Safety: A comprehensive LLM jailbreaking benchmark towards attacks, defenses, and evaluations | Large Language Model&Jailbreak&Safety Benchmark |
| 25.12 | Vidyasirimedhi Institute of Science and Technology | arxiv | SEA-SafeguardBench: Evaluating AI Safety in SEA Languages and Cultures | AI Safety Benchmark&Southeast Asian Languages&Cultural Safety |
| 25.12 | Nanyang Technological University | arxiv | OmniSafeBench-MM: A Unified Benchmark and Toolbox for Multimodal Jailbreak Attack–Defense Evaluation | Multimodal LLM Safety&Jailbreak Benchmark&Attack–Defense Evaluation |
| 25.12 | East China Normal University | arxiv | MCP-SafetyBench: A Benchmark for Safety Evaluation of Large Language Models with Real-World MCP Servers | MCPSecurity&SafetyBenchmark&LLMAgents |
| 25.12 | BRAC University, Bangladesh | arxiv | Beyond the Benchmark: Innovative Defenses Against Prompt Injection Attacks | PromptInjection&DefenseGeneration&LLMSecurity |
| 25.12 | i.AI | arxiv | A Real-World Evaluation of LLM Medication Safety Reviews in NHS Primary Care | Medical Safety&Real-World Evaluation&Failure Analysis |
| 25.12 | Western Sydney University | arxiv | Prompt-Induced Over-Generation as Denial-of-Service: A Black-Box Attack-Side Benchmark | Over-Generation&Denial-of-Service&Prompt Attacks |
| 25.12 | International Institute of Information Technology Hyderabad | arxiv | PrivacyBench: A Conversational Benchmark for Evaluating Privacy in Personalized AI | Privacy Benchmark&Personalized AI&RAG Systems |
| 26.01 | Xidian University | arxiv | Overlooked Safety Vulnerability in LLMs: Malicious Intelligent Optimization Algorithm Request and its Jailbreak | Algorithm Design&Jailbreak&Optimization Safety |
| 26.01 | Nanyang Technological University | arxiv | CSSBench: Evaluating the Safety of Lightweight LLMs against Chinese-Specific Adversarial Patterns | Chinese Safety&Adversarial Evaluation&Over-Refusal |
| 26.01 | Kyoto University | arxiv | JMedEthicBench: A Multi-Turn Conversational Benchmark for Evaluating Medical Safety in Japanese Large Language Models | Medical Safety&Multi-Turn Evaluation&Japanese Benchmark |
| 26.01 | Indian Institute of Technology Indore | AAAI 2026 | X-MuTeST: A Multilingual Benchmark for Explainable Hate Speech Detection and A Novel LLM-consulted Explanation Framework | Explainable AI&Hate Speech Detection&Multilingual NLP |
| 26.01 | Fordham University | arxiv | GuardEval: A Multi-Perspective Benchmark for Evaluating Safety, Fairness, and Robustness in LLM Moderators | LLM Moderation&Safety Benchmark&Fairness Robustness |
| 26.01 | United Arab Emirates University | arxiv | α³-Bench: A Unified Benchmark of Safety, Robustness, and Efficiency for LLM-Based UAV Agents over 6G Networks | UAV Agents&6G Networks&Safety Benchmark |
| 26.01 | Monash University | arxiv | PsychEthicsBench: Evaluating Large Language Models Against Australian Mental Health Ethics | Mental Health Ethics&Ethical Benchmark&Jurisdiction Awareness |
| 26.01 | VNU University of Science | arxiv | RedBench: A Universal Dataset for Comprehensive Red Teaming of Large Language Models | Red Teaming&Safety Benchmark&Risk Taxonomy |
| 26.01 | Xidian University | arxiv | HearSay Benchmark: Do Audio LLMs Leak What They Hear? | Audio Privacy&Voiceprint Leakage&Safety Benchmark |
| 26.01 | Beijing Jiaotong University | arxiv | When Helpers Become Hazards: A Benchmark for Analyzing Multimodal LLM-Powered Safety in Daily Life | Multimodal Safety&Daily Life Risks&Safety Benchmark |
| 26.01 | National University of Defense Technology | arxiv | The Forgotten Shield: Safety Grafting in Parameter-Space for Medical MLLMs | Medical Safety&Parameter Alignment&Safety Grafting |
| 26.01 | Federal University of Goiás | arxiv | MiJaBench: Revealing Minority Biases in Large Language Models via Hate Speech Jailbreaking | Demographic Bias&Hate Speech Safety&Jailbreak Benchmark |
| 26.01 | Dongguk University | arxiv | Safety Not Found (404): Hidden Risks of LLM-Based Robotics Decision Making | Robotics Safety&Spatial Reasoning&LLM Reliability |
| 26.01 | University of California, Riverside | arxiv | PII-VisBench: Evaluating Personally Identifiable Information Safety in Vision Language Models Along a Continuum of Visibility | PII Safety&VLM Privacy&Benchmark |
| 26.01 | University of California, Santa Cruz | arxiv | SafePro: Evaluating the Safety of Professional-Level AI Agents | Agent Safety&Professional Tasks&Benchmark |
| 26.01 | University of Notre Dame | arxiv | MTMCS-Bench: Evaluating Contextual Safety of Multimodal Large Language Models in Multi-Turn Dialogues | Multimodal Safety&Multi-Turn&Benchmark |
| 26.01 | Shanghai University of Finance and Economics | arxiv | FinVault: Benchmarking Financial Agent Safety in Execution-Grounded Environments | Financial Agents&Security Benchmark&Agent Safety |
| 26.01 | Fudan University | arxiv | WebTrap Park: An Automated Platform for Systematic Security Evaluation of Web Agents | Web Agents&Security Evaluation&Benchmark |
| 26.01 | The University of Texas at Austin | arxiv | Faithfulness vs. Safety: Evaluating LLM Behavior Under Counterfactual Medical Evidence | Medical Safety&Counterfactual Evidence&LLM Behavior |
| 26.01 | The Alan Turing Institute | arxiv | DriveSafe: A Hierarchical Risk Taxonomy for Safety-Critical LLM-Based Driving Assistants | Driving Safety&Risk Taxonomy&LLM Evaluation |
| 26.01 | Brown University | arxiv | UbuntuGuard: A Culturally-Grounded Policy Benchmark for Equitable AI Safety in African Languages | Multilingual Safety&Cultural Grounding&Guardian Models |
| 26.01 | INSAIT | arxiv | ChartAttack: Testing the Vulnerability of LLMs to Malicious Prompting in Chart Generation | Chart Security&Multimodal Jailbreak&Visualization Attacks |
| 26.01 | Shanghai Qi Zhi Institute | arxiv | VirtualCrime: Evaluating Criminal Potential of Large Language Models via Sandbox Simulation | Agentic Crime&Sandbox Simulation&LLM Safety |
| 26.01 | Tsinghua University | arxiv | The Side Effects of Being Smart: Safety Risks in MLLMs’ Multi-Image Reasoning | Multi-Image Reasoning&MLLM Safety&Safety Benchmarks |
| 26.01 | Eindhoven University of Technology | arxiv | AdversaRiskQA: An Adversarial Factuality Benchmark for High-Risk Domains | Adversarial Factuality&High-Risk Domains&LLM Benchmark |
| 26.01 | Nanyang Technological University | CCS 2026 | The Shadow Self: Intrinsic Value Misalignment in Large Language Model Agents | LLM Agents&Value Misalignment&Safety Benchmark |
| 26.01 | Beijing Normal University | arxiv | GUIGuard: Toward a General Framework for Privacy-Preserving GUI Agents | GUI Agents&Privacy Preservation&Agent Safety |
| 26.01 | Shanghai AI Lab | arxiv | Automated Safety Benchmarking: A Multi-agent Pipeline for LVLMs | LVLM Safety&Automated Benchmarking&Multi-agent Systems |
| 26.01 | The Chinese University of Hong Kong | arxiv | VoxPrivacy: A Benchmark for Evaluating Interactional Privacy of Speech Language Models | Interactional Privacy&Speech LLMs&Safety Benchmark |
| 26.01 | Unknown | arxiv | SHARP: Social Harm Analysis via Risk Profiles for Measuring Inequities in Large Language Models | Social Harm&Tail Risk&Fairness Evaluation |
| 26.01 | Tehran Institute for Advanced Studies | arxiv | FFE-HALLU: Hallucinations in Fixed Figurative Expressions: Benchmark of Idioms and Proverbs in the Persian Language | Figurative Hallucination&Idioms Benchmark&Cultural Grounding |
| 26.01 | University of Illinois | WWW 2026 | Mind the Ambiguity: Aleatoric Uncertainty Quantification in LLMs for Safe Medical Question Answering | Aleatoric Uncertainty&Medical QA&Ambiguity Detection |
| 26.02 | National University of Singapore | arxiv | Risky-Bench: Probing Agentic Safety Risks under Real-World Deployment | Agent Safety&Deployment&Benchmark |
| 26.02 | ShanghaiTech University, Shanghai, China | arxiv | LPS-BENCH: Benchmarking Safety Awareness of Computer-Use Agents in Long-Horizon Planning under Benign and Adversarial Scenarios | Computer-Use Agents&Planning Safety&Benchmark |
| 26.02 | Zhejiang University | arxiv | CSR-Bench: A Benchmark for Evaluating the Cross-modal Safety and Reliability of MLLMs | Cross-modal Safety&MLLM Reliability&Benchmark |
| 26.02 | Qatar Computing Research Institute, HBKU, Doha, Qatar | arxiv | Once Correct, Still Wrong: Counterfactual Hallucination in Multilingual Vision-Language Models | Counterfactual Hallucination&Multilingual VLMs&Benchmark |
| 26.02 | Zhejiang University | ICLR 2026 | When Agents “Misremember” Collectively: Exploring the Mandela Effect in LLM-Based Multi-Agent Systems | Mandela Effect&Multi-Agent Systems&False Memory |
| 26.02 | Seoul National University College of Medicine | arxiv | MPIB: A Benchmark for Medical Prompt Injection Attacks and Clinical Safety in LLMs | Prompt Injection&Clinical Safety&RAG |
| 26.02 | School of Software, Shandong University, Jinan, China | arxiv | TrajAD: Trajectory Anomaly Detection for Trustworthy LLM Agents | LLM Agents&Anomaly Detection&Process Supervision |
| 26.02 | Critical ML Lab | arxiv | TAMPERBENCH: Systematically Stress-Testing LLM Safety Under Fine-Tuning and Tampering | Tamper Resistance&Safety Evaluation&Fine-Tuning Attacks |
| 26.02 | Texas A&M University, College Station, TX, USA | arxiv | Halluverse-M3: A multitask multilingual benchmark for hallucination in LLMs | Hallucination Benchmark&Multilingual Evaluation&Detection |
| 26.02 | Computer Science Department, Southern Illinois University, Carbondale, IL, United States | arxiv | AGENT-FENCE: Mapping Security Vulnerabilities Across Deep Research Agents | Agent Security&Trust Boundaries&Architectural Evaluation |
| 26.02 | Xidian University | arxiv | From Assistant to Double Agent: Formalizing and Benchmarking Attacks on OpenClaw for Personalized Local AI Agent | Personalized Agents&Security Benchmarking&OpenClaw |
| 26.02 | University of Oregon | arxiv | Benchmarking Knowledge-Extraction Attack and Defense on Retrieval-Augmented Generation | Retrieval-Augmented Generation&Knowledge Extraction&Benchmarking |
| 26.02 | Sentient Labs | arxiv | CryptoAnalystBench: Failures in Multi-Tool Long-Form LLM Analysis | Analyst Agents&Tool Use&Long-Form Evaluation |
| 26.02 | Macquarie University, Sydney, Australia | arxiv | Are Aligned Large Language Models Still Misaligned? | Misalignment Benchmark&Safety&Culture |
| 26.02 | ETH Zürich | arxiv | GT-HarmBench: Benchmarking AI Safety Risks Through the Lens of Game Theory | AI Safety Benchmark&Game Theory&Multi-Agent Risks |
| 26.02 | Unknown | arxiv | ProMoral-Bench: Evaluating Prompting Strategies for Moral Reasoning and Safety in LLMs | Moral Reasoning&Prompt Engineering&Safety Benchmark |
| 26.02 | Northeastern University | arxiv | Unsafer in Many Turns: Benchmarking and Defending Multi-Turn Safety Risks in Tool-Using Agents | Tool-Using Agents&Multi-Turn Safety&Defense |
| 26.02 | Beijing Institute of AI Safety and Governance, China | arxiv | ForesightSafety Bench: A Frontier Risk Evaluation and Governance Framework towards Safe AI | AI Safety Benchmark&Frontier Risks&Governance |
| 26.02 | KI-Werkstatt / FB2, University of Applied Sciences Berlin | arxiv | LLMStructBench: Benchmarking Large Language Model Structured Data Extraction | Structured Extraction&Benchmark&JSON Generation |
| 26.02 | Tsinghua University, Beijing, China | arxiv | SecCodeBench-V2 Technical Report | Secure Code Generation&Benchmark&Code Security |
| 26.02 | University of Tübingen & Max-Planck Institute for Intelligent Systems | arxiv | NESSIE: The Necessary Safety Benchmark - Identifying Errors That Should Not Exist | Safety Benchmark&Instruction Following&Necessary Condition |
| 26.02 | Independent Researcher | arxiv | Mind the GAP: Text Safety Does Not Transfer to Tool-Call Safety in LLM Agents | Tool-Call Safety&Benchmark&Agent Alignment |
| 26.02 | Oregon State University, Corvallis, OR USA | arxiv | Fail-Closed Alignment for Large Language Models | Fail-Closed Safety&Refusal Robustness&Jailbreak Defense |
| 26.02 | Oracle America Inc. | EACL 2026 | IndicJR: A Judge-Free Benchmark of Jailbreak Robustness in South Asian Languages | Multilingual Safety&Jailbreak Benchmark&South Asian Languages |
| 26.02 | Kakaobank, South Korea | arxiv | FENCE: A Financial and Multimodal Jailbreak Detection Dataset | jailbreak detection&finance&multimodal |
| 26.02 | School of Computer Science, Carnegie Mellon University | arxiv | How Well Can LLM Agents Simulate End-User Security and Privacy Attitudes and Behaviors? | user simulation&security privacy&benchmark |
| 26.02 | Southeast University | arxiv | CIBER: A Comprehensive Benchmark for Security Evaluation of Code Interpreter Agents | code interpreter&benchmark&agent security |
| 26.02 | Max Planck Institute for Intelligent Systems, ELLIS Institute Tübingen, Tübingen AI Center | arxiv | SKILL-INJECT: Measuring Agent Vulnerability to Skill File Attacks | skill attack&prompt injection&agent benchmark |
| 26.02 | Harbin Institute of Technology, Shenzhen | arxiv | Evaluating Proactive Risk Awareness of Large Language Models | proactive safety&risk awareness&ecology |
| 26.03 | Shanghai AI Lab | arxiv | SafeSci: Safety Evaluation of Large Language Models in Science Domains and Beyond | scientific safety&benchmark&alignment |
| 26.03 | UC Santa Cruz | arxiv | ZERODAYBENCH: Evaluating LLM Agents on Unseen Zero-Day Vulnerabilities | agents&zero-day&benchmark |
| 26.03 | Duke University | arxiv | MUSE: A Run-Centric Platform for Multimodal Unified Safety Evaluation of Foundation Models | multimodal safety&evaluation platform&benchmarking |
| 26.03 | Renmin University of China, China | arxiv | TRUSTMH-BENCH: A Comprehensive Benchmark for Evaluating the Trustworthiness of Mental Health LLMs | mental health&trustworthiness&benchmark |
| 26.03 | Compumacy for Artificial Intelligence Solutions, Cairo, Egypt | arxiv | SalamahBench: Toward Standardized Safety Evaluation for Arabic Language Models | Arabic LMs&safety benchmark&evaluation |
| 26.03 | CISPA Helmholtz Center for Information Security | arxiv | Benchmark of Benchmarks: Unpacking Influence and Code Repository Quality in LLM Safety Benchmarks | safety benchmarks&code quality&meta-analysis |
| 26.03 | SCB DataX | arxiv | THAISAFETYBENCH: Assessing Language Model Safety in Thai Cultural Contexts | Thai safety&cultural context&benchmark |
| 26.03 | Mohamed bin Zayed University of Artificial Intelligence | ICLR 2026 | JAILNEWSBENCH: Multi-Lingual and Regional Benchmark for Fake News Generation Under Jailbreak Attacks | fake news&jailbreak&multilingual benchmark |
| 26.03 | Technical University of Munich | arxiv | A Coin Flip for Safety: LLM Judges Fail to Reliably Measure Adversarial Robustness | LLM-as-a-Judge&adversarial robustness&evaluation reliability |
| 26.03 | Oracle | arxiv | LieCraft: A Multi-Agent Framework for Evaluating Deceptive Capabilities in Language Models | deception evaluation&multi-agent games&alignment |
| 26.03 | MediaTek Research | arxiv | Taiwan Safety Benchmark and Breeze Guard: Toward Trustworthy AI for Taiwanese Mandarin | Taiwanese Mandarin&safety benchmark&cultural alignment |
| 26.03 | Singapore Management University | arxiv | Backdoor4Good: Benchmarking Beneficial Uses of Backdoors in LLMs | beneficial backdoors&controllability&trustworthy AI |
| 26.03 | School of Informatics, Computing, and Cyber Systems, Northern Arizona University, Flagstaff, USA | arxiv | Where Do LLM-based Systems Break? A System-Level Security Framework for Risk Assessment and Treatment | system security&risk assessment&attack-defense trees |
| 26.03 | Wuhan University, Wuhan, Hubei, China | KDD 2026 | AdaCultureSafe: Adaptive Cultural Safety Grounded by Cultural Knowledge in Large Language Models | cultural safety&cultural knowledge&adaptive alignment |
| 26.03 | School of Computing and Augmented Intelligence, Arizona State University, Tempe, USA | arxiv | Real-Time Trust Verification for Safe Agentic Actions using TrustBench | trust verification&agent safety&real-time intervention |
| 26.03 | Fudan University, Shanghai, China | arxiv | OOD-MMSafe: Advancing MLLM Safety from Harmful Intent to Hidden Consequences | multimodal safety&hidden consequences&policy optimization |
| 26.03 | Department of Statistics & Data Science, Yale University | arxiv | Benchmarking Political Persuasion Risks Across Frontier Large Language Models | political persuasion&frontier models&risk benchmarking |
| 26.03 | University of Twente, Enschede, The Netherlands | arxiv | TOSSS: a CVE-based Software Security Benchmark for Large Language Models | software security benchmarks&CVE&code selection |
| 26.03 | State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences | arxiv | INFACT: A Diagnostic Benchmark for Induced Faithfulness and Factuality Hallucinations in Video-LLMs | video-LLMs&hallucination benchmark&reliability |
| 26.03 | Fraunhofer AISEC | arxiv | You Told Me to Do It: Measuring Instructional Text-induced Private Data Leakage in LLM Agents | privacy leakage&README injection&trusted executor dilemma |
| 26.03 | CISPA Helmholtz Center for Information Security | arxiv | Understanding LLM Behavior When Encountering User-Supplied Harmful Content in Harmless Tasks | harmful content&harmless tasks&content-level ethics |
| 26.03 | Renmin University of China | arxiv | HomeSafe-Bench: Evaluating Vision-Language Models on Unsafe Action Detection for Embodied Agents in Household Scenarios | household safety&vision-language models&embodied agents |
| 26.03 | Tsinghua University, Beijing, China | arxiv | LABSHIELD: A Multimodal Benchmark for Safety-Critical Reasoning and Planning in Scientific Laboratories | laboratory safety&multimodal benchmarks&reasoning |
| 26.03 | KAIST AI | arxiv | UniSAFE: A Comprehensive Benchmark for Safety Evaluation of Unified Multimodal Models | Multimodal Safety&Benchmarking&Unified Models |
| 26.03 | Oracle America Inc. | arxiv | IndicSafe: A Benchmark for Evaluating Multilingual LLM Safety in South Asia | Multilingual Safety&Indic Languages&Safety Drift |
| 26.03 | Department of Computer Science, Stanford University, Stanford, CA, USA | arxiv | TherapyGym: Evaluating and Aligning Clinical Fidelity and Safety in Therapy Chatbots | Therapy Chatbots&Clinical Safety&CBT Fidelity |
| 26.03 | King Abdullah University of Science and Technology (KAUST), Saudi Arabia | arxiv | SAVeS: Steering Safety Judgments in Vision-Language Models via Semantic Cues | Vision-Language Models&Semantic Steering&Multimodal Safety |
| 26.03 | Robotics Institute, Carnegie Mellon University | CVPR 2026 | Evolving Contextual Safety in Multi-Modal Large Language Models via Inference-Time Self-Reflective Memory | Contextual Safety&Self-Reflective Memory&Multimodal Benchmark |
| 26.03 | Institute of Science Tokyo | arxiv | JUBAKU: An Adversarial Benchmark for Exposing Culturally Grounded Stereotypes in Japanese LLMs | Social bias&Japanese LLMs&Cultural stereotypes |
| 26.03 | Department of Electrical, Computer and Biomedical Engineering, University of Pavia, Italy | arxiv | SecureBreak - A dataset towards safe and secure models | Safety dataset&Unsafe outputs&Security alignment |
| 26.03 | CSIRO Data61, Australia | arxiv | Does Teaming-Up LLMs Improve Secure Code Generation? A Comprehensive Evaluation with Multi-LLMSecCodeEval | Secure code generation&Multi-LLM ensembles&Static analysis |
| 26.03 | Southern University of Science and Technology | arxiv | BeSafe-Bench: Unveiling Behavioral Safety Risks of Situated Agents in Functional Environments | Agent safety&Situated agents&Behavioral benchmark |
| 26.03 | Ca' Foscari University of Venice, Venice, Italy | arxiv | Reentrancy Detection in the Age of LLMs | Smart contract security&Reentrancy detection&LLM evaluation |
| 26.03 | Institute of Cyberspace Security, School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China | arxiv | CDH-Bench: A Commonsense-Driven Hallucination Benchmark for Evaluating Visual Fidelity in Vision-Language Models | VLM hallucination&Commonsense conflict&Visual fidelity |
| 26.03 | Key Laboratory of High Confidence Software Technologies, Ministry of Education, Peking University, Beijing, China | arxiv | SkillTester: Benchmarking Utility and Security of Agent Skills | Agent skills&Utility evaluation&Security benchmarking |
| 26.03 | City University of Hong Kong | arxiv | TSHA: A Benchmark for Visual Language Models in Trustworthy Safety Hazard Assessment Scenarios | Safety hazard assessment&Visual language models&Trustworthy benchmark |
| 26.04 | Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China | arxiv | Does Unification Come at a Cost? Uni-SafeBench: A Safety Benchmark for Unified Multimodal Large Models | Unified multimodal models&Safety benchmark&Intrinsic safety |
| 26.04 | George Mason University | arxiv | CLAWSAFETY: "Safe" LLMs, Unsafe Agents | Agent safety benchmark&Prompt injection&High-privilege agents |
| 26.04 | Carnegie Mellon University | arxiv | AGENTSOCIALBENCH: Evaluating Privacy Risks in Human-Centered Agentic Social Networks | Privacy risks&Agentic social networks&Multi-agent coordination |
| 26.04 | International Institute of Information Technology, Hyderabad | arxiv | ImplicitBBQ: Benchmarking Implicit Bias in Large Language Models through Characteristic Based Cues | Implicit bias&Bias benchmark&Characteristic cues |
| 26.04 | Shanghai AI Lab | arxiv | ATBench: A Diverse and Realistic Trajectory Benchmark for Long-Horizon Agent Safety | Agent safety benchmark&Trajectory evaluation&Long-horizon risks |
| 26.04 | BenchFlow | arxiv | ClawsBench: Evaluating Capability and Safety of LLM Productivity Agents in Simulated Workspaces | OpenClaw&Benchmarking&Productivity Agents |
| 26.04 | Binjiang Institute of Zhejiang University, Hangzhou, China | arxiv | From Retinal Evidence to Safe Decisions: RETINA-SAFE and ECRT for Hallucination Risk Triage in Medical LLMs | Hallucination&Benchmarking&Healthcare AI |
| 26.04 | Beijing University of Posts and Telecommunications, Beijing, China | arxiv | DetailVerifyBench: A Benchmark for Dense Hallucination Localization in Long Image Captions | Hallucination&Benchmarking&Multimodal Safety |
| 26.04 | Graduate School of Data Science, Seoul National University | arxiv | Beyond Case Law: Evaluating Structure-Aware Retrieval and Safety in Statute-Centric Legal QA | Hallucination&Benchmarking&Emotional Attacks |
| 26.04 | Microsoft Research | arxiv | The Art of Building Verifiers for Computer Use Agents | Computer-Use Agents&Verifiers&Agent Evaluation |
| 26.04 | University of Wisconsin-Madison | arxiv | WebSP-Eval: Evaluating Web Agents on Website Security and Privacy Tasks | Web Agents&Benchmarking&Multimodal Safety |
| 26.04 | CyCraft AI Lab, Taiwan | arxiv | TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step Tool-Calling Trajectories | Hallucination&Benchmarking&Prompt Injection |
| 26.04 | The Pennsylvania State University | ACL 2026 | PIArena: A Platform for Prompt Injection Evaluation | Benchmarking&Prompt Injection&PIArena |
| 26.04 | Zhejiang University, State Key Laboratory of Internet Architecture, Tsinghua University | ACL 2026 | ACIArena: Toward Unified Evaluation for Agent Cascading Injection | Benchmarking&ACIArena&Unified |
| 26.04 | University of Wisconsin–Madison | arxiv | The Blind Spot of Agent Safety: How Benign User Instructions Expose Critical Vulnerabilities in Computer-Use Agents | Computer-Use Agents&Agent Safety&Benign Instructions |
| 26.04 | Department of Artificial Intelligence, Yonsei University | arxiv | PAC-BENCH: Evaluating Multi-Agent Collaboration under Privacy Constraints | PAC-BENCH&Evaluating Multi-Agent Collaboration&Evaluating Multi-Agent |
| 26.04 | Kolega.Dev | arxiv | RealVuln: Benchmarking Rule-Based, General-Purpose LLM, and Security-Specialized Scanners on Real-World Code | RealVuln&General-Purpose LLM&Real-World Code |
| 26.04 | Vulcan Research, AIFT | arxiv | MCPThreatHive: Automated Threat Intelligence for Model Context Protocol Ecosystems | MCPThreatHive&Context Protocol Ecosystems&Model Context Protocol |
| 26.04 | Shanghai Artificial Intelligence Laboratory | arxiv | Benchmarks for Trajectory Safety Evaluation and Diagnosis in OpenClaw and Codex: ATBench-Claw and ATBench-CodeX | Trajectory Safety Evaluation&Evaluation and Diagnosis&Safety Evaluation |
| 26.04 | School of Computing, National University of Singapore | IJCNN 2026 | PilotBench: A Benchmark for General Aviation Agents with Safety Constraints | PilotBench&Safety Constraints&General Aviation Agents |
| 26.04 | CISPA Helmholtz Center for Information Security | arxiv | HarmfulSkillBench: How Do Harmful Skills Weaponize Your Agents? | Agent Safety&Skill Ecosystem&Harmful Skills |
| 26.04 | The Hong Kong University of Science and Technology | arxiv | Into the Gray Zone: Domain Contexts Can Blur LLM Safety Boundaries | Safety Boundary&Domain Context&Jailbreak |
| 26.04 | DEXAI – Icaro Lab | arxiv | Adversarial Humanities Benchmark: Results on Stylistic Robustness in Frontier Model Safety | Frontier Models&Stylistic Robustness&Safety Benchmark |
| 26.04 | Korea University | arxiv | HarDBench: A Benchmark for Draft-Based Co-Authoring Jailbreak Attacks for Safe Human–LLM Collaborative Writing | Co-Authoring&Jailbreak Benchmark&Collaborative Writing |
| 26.04 | University of Michigan | arxiv | SafetyALFRED: Evaluating Safety-Conscious Planning of Multimodal Large Language Models | Multimodal Planning&Safety Benchmark&Embodied AI |
| 26.04 | Institute for Automation and Applied Informatics, Karlsruhe Institute of Technology | arxiv | CyberCertBench: Evaluating LLMs in Cybersecurity Certification Knowledge | Cybersecurity&Certification&Benchmark |
| 26.04 | Department of Artificial Intelligence, Hanyang University | ACL 2026 | MHSafeEval: Role-Aware Interaction-Level Evaluation of Mental Health Safety in Large Language Models | Mental Health&Safety Evaluation&Role-Aware |
| 26.04 | Shanghai Jiao Tong University | arxiv | OS-SPEAR: A Toolkit for the Safety, Performance, Efficiency, and Robustness Analysis of OS Agents | OS Agents&Safety Evaluation&Robustness |
| 26.04 | Kyushu Institute of Technology | arxiv | Benchmarking the Safety of Large Language Models for Robotic Health Attendant Control | Healthcare Robotics&LLM Safety&Benchmark |
| 26.05 | University of Illinois Urbana-Champaign | arxiv | ML-Bench&Guard: Policy-Grounded Multilingual Safety Benchmark and Guardrail for Large Language Models | Multilingual Safety&Guardrails&Benchmark |
| 26.05 | Northeastern University | arxiv | Toward a Principled Framework for Agent Safety Measurement | Agent Safety&Measurement Framework&Evaluation |
| 26.05 | Swarms & AI Lab (SAIL), University of Haifa | arxiv | MOSAIC-Bench: Measuring Compositional Vulnerability Induction in Coding Agents | Coding Agents&Compositional Vulnerability&Benchmark |
| 26.05 | Pattern Recognition Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg | arxiv | Safety and accuracy follow different scaling laws in clinical large language models | Clinical LLMs&Safety Scaling&Accuracy |
| 26.05 | AIM Intelligence | arxiv | XL-SafetyBench: A Country-Grounded Cross-Cultural Benchmark for LLM Safety and Cultural Sensitivity | Cross-Cultural Safety&Country-Grounded Benchmark&Cultural Sensitivity |
| 26.05 | Simula Metropolitan Center for Digital Engineering | arxiv | When No Benchmark Exists: Validating Comparative LLM Safety Scoring Without Ground-Truth Labels | Safety Scoring&No Ground Truth&Validation |
| 26.05 | Tsinghua University | ACL 2026 | LeakDojo: Decoding the Leakage Threats of RAG Systems | RAG Leakage&Benchmark&Privacy |
| 26.05 | Shanghai University of Finance and Economics | ACL 2026 | FinSafetyBench: Evaluating LLM Safety in Real-World Financial Scenarios | Financial Safety&Benchmark&LLM Evaluation |
| 26.05 | The Hong Kong Polytechnic University | arxiv | When Routine Chats Turn Toxic: Unintended Long-Term State Poisoning in Personalized Agents | Agent Memory&State Poisoning&Personalized Agents |
| 26.05 | Tencent Hunyuan | arxiv | Safe, or Simply Incapable? Rethinking Safety Evaluation for Phone-Use Agents | LLM Agents&Phone-Use Agents&Safety Evaluation |
| 26.05 | Chung-Ang University | arxiv | CyBiasBench: Benchmarking Bias in LLM Agents for Cyber-Attack Scenarios | LLM Agents&Bias&Cybersecurity |
| 26.05 | Redwood Research | arxiv | MonitoringBench: Semi-Automated Red-Teaming for Agent Monitoring | Red Teaming&Agent Monitoring&Coding Agents |
| 26.05 | Stanford University | arxiv | CalBench: Evaluating Coordination-Privacy Trade-offs in Multi-Agent LLMs | Multi-Agent Systems&Privacy&Coordination |
| 26.05 | School of Computer Science | arxiv | EditRisk-Bench: Benchmarking Safety Risks of Knowledge-Intensive Reasoning under Malicious Knowledge Editing | Knowledge Editing&Safety Risks&Reasoning |
| 26.05 | Nanjing University of Aeronautics and Astronautics | arxiv | LITMUS: Benchmarking Behavioral Jailbreaks of LLM Agents in Real OS Environments | LLM Agents&Behavioral Jailbreak&OS Security |
| 26.05 | UC Berkeley | arxiv | MT-JailBench: A Modular Benchmark for Understanding Multi-Turn Jailbreak Attacks | Jailbreak&Multi-Turn&Benchmark |
| 26.05 | Peking University | arxiv | SkillSafetyBench: Evaluating Agent Safety under Skill-Facing Attack Surfaces | Agent Skills&Safety Evaluation&Attack Surfaces |
| 26.05 | University of Southern California | arxiv | PRIVACYSIM: Evaluating LLM Simulation of User Privacy Behavior | Privacy Simulation&User Behavior&LLM Evaluation |
| 26.05 | University of Southern California | arxiv | AgentTrap: Measuring Runtime Trust Failures in Third-Party Agent Skills | Agent Skills&Runtime Trust&Supply Chain |
| 26.05 | Carnegie Mellon University | arxiv | EXPLOITBENCH: A Capability Ladder Benchmark for LLM Cybersecurity Agents | Cybersecurity Agents&Exploitation&Benchmark |
| 26.05 | University of California, Santa Barbara | arxiv | Auditing Agent Harness Safety | Agent Harness&Safety Audit&Trajectory Evaluation |
| 26.05 | Unknown | arxiv | The Great Pretender: A Stochasticity Problem in LLM Jailbreak | Jailbreak&Stochasticity&Evaluation |
| 26.05 | Unknown | arxiv | Defenses at Odds: Measuring and Explaining Defense Conflicts in Large Language Models | Defense Conflicts&Sequential Deployment&LLM Safety |
| 26.05 | School of Information Studies, McGill University | arxiv | Taxonomy and Consistency Analysis of Safety Benchmarks for AI Agents | Agent Safety&Benchmark Taxonomy&Evaluation Consistency |
| 26.05 | Anthropic Fellows Program | arxiv | SLEIGHT-Bench: A Benchmark of Evasion Attacks Against Agent Monitors | Agent Monitoring&Evasion Attacks&Coding Agents |
| 26.05 | Stanford University | arxiv | Why Do Safety Guardrails Degrade Across Languages? | Multilingual Safety&Guardrails&Jailbreak Evaluation |
| 26.05 | Southern University of Science and Technology | arxiv | Trust No Tool: Evaluating and Defending LLM Agents under Untrusted Tool Feedback | Tool Agents&Cognitive Poisoning&Agent Security |
| 26.05 | ETH Zurich | arxiv | POLAR-Bench: A Diagnostic Benchmark for Privacy-Utility Trade-offs in LLM Agents | Privacy Benchmark&LLM Agents&Utility Trade-off |
| 26.05 | University College London | arxiv | Measuring Safety Alignment Effects in Autonomous Security Agents | Security Agents&Safety Alignment&Autonomous Systems |
| 26.05 | University of Nevada Las Vegas | arxiv | Refusal Evaluation in Coding LLMs and Code Agents: A Systematic Review of Thirteen Malicious-Code Prompt Corpora (2023-2025) | Refusal Evaluation&Malicious Code&Prompt Corpora |
| 26.05 | ELLIS Institute Tübingen & MPI-IS & Tübingen AI Center | arxiv | Measuring Security Without Fooling Ourselves: Why Benchmarking Agents Is Hard | Agent Security&Benchmarking&Evaluation |
| 26.05 | Sapienza University of Rome | arxiv | Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety | Agentic Safety&Multi-Turn&Benchmark |
| 26.05 | The Pennsylvania State University | ICML 2026 | LLM Benchmark Datasets Should Be Contamination-Resistant | Benchmark Contamination&Unlearnable Datasets&LLM Evaluation |
| 26.05 | University of Michigan | arxiv | The Age of Curiosity Meets the Age of AI: Benchmarking Child Safety in Large Language Models | Child Safety&Safety Benchmark&LLM Evaluation |
| 26.05 | Department of Computer Engineering, Bahcesehir University, Istanbul, Turkiye | arxiv | AgentSecBench: Measuring Prompt Injection, Privacy Leakage, and Tool-Use Integrity in LLM Agents | Prompt Injection&Privacy Leakage&Agent Security |
| 26.05 | Helivan | arxiv | Jailbreak susceptibility prediction and mitigation via the behavioral geometry of models | Jailbreak Evaluation&Behavioral Geometry&Defense Transfer |
| 26.05 | Northwestern University in Qatar | arxiv | KZ-SafetyPrompts: A Kazakh Safety Evaluation Prompt Dataset for Large Language Models | Multilingual Safety&Kazakh Dataset&Safety Evaluation |
| 26.05 | Northwestern University in Qatar | arxiv | AlbanianLLMSafety: A Safety Evaluation Dataset for Large Language Models in Albanian | Albanian Safety&Low-Resource Languages&Safety Benchmark |
| 26.05 | School of Computer Science, University of Nottingham, Jubilee Campus, Wollaton Road, Nottingham, NG8 1BB, United Kingdom | arxiv | Prompt Injection Detection is Regime-Dependent: A Deployment-Aware Evaluation with Interpretable Structural Signals | Prompt Injection&Detection Evaluation&Structural Signals |
| 26.05 | AIM Intelligence | arxiv | When Context Flips, Safety Breaks: Diagnosing Brittle Safety in Aligned Language Models | Brittle Safety&Context Flip&Guardrails |
| 26.05 | Domyn, Gurugram, India | ICLR 2026 Workshop | Benchmarking Open-Source Safety Guard Models: A Comprehensive Evaluation | safety guard models&content moderation&benchmark evaluation |
| 26.05 | McGill University | arxiv | SafeRx-Agent: A Knowledge-Grounded Multi-Agent Framework for Safe and Explainable Medication Recommendation | Medication Safety&Clinical Agents&Knowledge Grounding |
| 26.05 | Department of Electrical and Computer Engineering, National University of Singapore | arxiv | Relevance as a Vulnerability: How Web Retrieval Degrades Safety Alignment in LLM Agents | Web Retrieval&Safety Alignment&Agent Safety |
| 26.05 | Qwen DianJin Team, Alibaba Cloud Computing | arxiv | FinGuard: Detecting Financial Regulatory Non-Compliance in LLM Interactions | Financial Compliance&Guard Models&Regulatory Benchmarks |
📚Resource
- Toxicity - RealToxicityPrompts datasets
- Truthfulness - TruthfulQA datasets
- TRUSTLLM - TRUSTLLM
- Protection - CircleGuardBench