Datasets & Benchmark

May 30, 2026

📑 Papers

DateInstitutePublicationPaperKeywords
20.09University of WashingtonEMNLP2020(findings)RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language ModelsToxicity
21.09University of OxfordACL2022TruthfulQA: Measuring How Models Mimic Human FalsehoodsTruthfulness
22.03MITACL2022ToxiGen: A Large-Scale Machine-Generated datasets for Adversarial and Implicit Hate Speech DetectionToxicity
23.07Zhejiang University; School of Engineering Westlake UniversityarxivLatent Jailbreak: A Benchmark for Evaluating Text Safety and Output Robustness of Large Language ModelsText Safety&Benchmark&Jailbreaking
23.07Stevens Institute of TechnologyNAACL2024(findings)HateModerate: Testing Hate Speech Detectors against Content Moderation PoliciesHate Speech Detection&Content Moderation&Machine Learning
23.08Meta Reality LabsNAACL2024Head-to-Tail: How Knowledgeable are Large Language Models (LLMs)?Large Language Models&Knowledge Graphs&Question Answering
23.08Bocconi UniversityNAACL2024XSTEST: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language ModelsLarge Language Models&Safety Behaviours&Test Suite
23.09LibrAI, MBZUAI, The University of MelbournearxivDo-Not-Answer: A Dataset for Evaluating Safeguards in LLMsSafety Evaluation&Safeguards
23.10University of Edinburgh, Huawei Technologies Co., Ltd.NAACL2024Assessing the Reliability of Large Language Model KnowledgeLarge Language Models&Factual Knowledge&Knowledge Probing
23.10University of PennsylvaniaNAACL2024(findings)Automatic Hallucination Assessment for Aligned Large Language Models via Transferable Adversarial AttacksHallucination Assessment&Adversarial Attacks&Large Language Models
23.11Fudan UniversityarxivJADE: A Linguistic-based Safety Evaluation Platform for LLMSafety Benchmarks
23.11UNC-Chapel HillarxivHolistic Analysis of Hallucination in GPT-4V(ision): Bias and Interference ChallengesHallucination&Benchmark&Multimodal
23.11IBM Research AIEMNLP2023(GEM workshop)Unveiling Safety Vulnerabilities of Large Language ModelsAdversarial Examples&Clustering&Automatically Identifying
23.11The Hong Kong University of Science and TechnologyarxivP-Bench: A Multi-level Privacy Evaluation Benchmark for Language ModelsDifferential Privacy&Privacy Evaluation
23.11UC BerkeleyarxivCAN LLMS FOLLOW SIMPLE RULESEvaluation&Attack Strategies
23.11University of Central FloridaarxivTHOS: A Benchmark Dataset for Targeted Hate and Offensive SpeechHate Speech&Offensive Speech&Dataset
23.11Beijing Jiaotong University; DAMO Academy, Alibaba Group, Peng Cheng LabarXivAMBER: An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination EvaluationMulti-modal Large Language Models&Hallucination&Benchmark
23.11Patronus AI, University of Oxford, Bocconi UniversityarxivSIMPLESAFETYTESTS: a Test Suite for Identifying Critical Safety Risks in Large Language ModelsSafety Risks&Test Suite&Evaluation
23.11University of Southern California, University of Pennsylvania, University of California DavisarxivDeceiving Semantic Shortcuts on Reasoning Chains: How Far Can Models Go without Hallucination?Hallucinations&Semantic Associations&Benchmark
23.11Seoul National University, Chung-Ang University, NAVER AI Lab, NAVER Cloud, University of RichmondarxivLifeTox: Unveiling Implicit Toxicity in Life AdviceLifeTox Dataset&Toxicity Detection&Social Media Analysis
23.11School of Information Renmin University of ChinaarxivUHGEval: Benchmarking the Hallucination of Chinese Large Language Models via Unconstrained GenerationHallucination&Evaluation Benchmark
23.11UC Santa Cruz, UNC-Chapel HillarxivHow Many Are in This Image? A Safety Evaluation Benchmark for Vision LLMsVision Large Language Models&Safety Evaluation&Adversarial Robustness
23.11Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences; Baidu Inc.arxivFFT: Towards Harmlessness Evaluation and Analysis for LLMs with Factuality Fairness ToxicityHarmlessness Evaluation
23.11Fudan University&Shanghai Artificial Intelligence LaboratoryNAACL2024Fake Alignment: Are LLMs Really Aligned Well?Large Language Models&Safety Evaluation&Fake Alignment
23.11Kahlert School of ComputingNAACL2024Whispers of Doubt Amidst Echoes of Triumph in NLP RobustnessNLP Robustness&Out-of-Domain Evaluation&Adversarial Evaluation
23.11Shanghai Jiao Tong UniversityNAACL2024(findings)CLEAN–EVAL: Clean Evaluation on Contaminated Large Language ModelsClean Evaluation&Data Contamination&Large Language Models
23.12MetaarxivPurple Llama CYBERSECEVAL: A Secure Coding Benchmark for Language ModelsSafety&Cybersecurity&Code Security Benchmark
23.12University of Illinois Chicago, Bosch Research North America & Bosch Center for Artificial Intelligence (BCAI), UNC Chapel-HillarxivDELUCIONQA: Detecting Hallucinations in Domain-specific Question AnsweringHallucination Detection&Domain-specific QA&Retrieval-augmented LLMs
23.12University of Science and Technology of China, Hong Kong University of Science and Technology, MicrosoftarxivBenchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language ModelsIndirect Prompt Injection Attacks&BIPIA Benchmark&Defense
24.01NewsBreak, University of Illinois Urbana-ChampaignarxivRAGTruth: A Hallucination Corpus for Developing Trustworthy Retrieval-Augmented Language ModelsRetrieval-Augmented Generation&Hallucination Detection&Dataset
24.01University of Notre Dame, Lehigh University, Illinois Institute of Technology, Institut Polytechnique de Paris, William & Mary, Texas A&M University, Samsung Research America, Stanford UniversityICML 2024TRUSTLLM: TRUSTWORTHINESS IN LARGE LANGUAGE MODELSTrustworthiness&Benchmark Evaluation
24.01University College LondonarxivHallucination Benchmark in Medical Visual Question AnsweringMedical Visual Question Answering&Hallucination Benchmark
24.01Carnegie Mellon UniversityarxivTOFU: A Task of Fictitious Unlearning for LLMsData Privacy&Ethical Concerns&Unlearning
24.01IRLab CITIC Research Centre, Universidade da CoruñaarxivMetaHate: A Dataset for Unifying Efforts on Hate Speech DetectionHate Speech Detection&Social Media
24.01Northwestern University, New York University, University of Liverpool, Rutgers UniversityarxivAttackEval: How to Evaluate the Effectiveness of Jailbreak Attacking on Large Language ModelsJailbreak Attack&Evaluation Frameworks&Ground Truth Dataset
24.01Shanghai Jiao Tong UniversityarxivR-Judge: Benchmarking Safety Risk Awareness for LLM AgentsLLM Agents&Safety Risk Awareness&Benchmark
24.02University of Illinois Urbana-Champaign, Center for AI Safety, Carnegie Mellon University, UC Berkeley, MicrosoftarxivHarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust RefusalAutomated Red Teaming&Robust Refusal
24.02Shanghai Artificial Intelligence Laboratory, Harbin Institute of Technology, Beijing Institute of Technology, Chinese University of Hong KongarxivSALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language ModelsSafety Benchmark&Safety Evaluation**&Hierarchical Taxonomy
24.02Middle East Technical UniversityarxivHypoTermQA: Hypothetical Terms Dataset for Benchmarking Hallucination Tendency of LLMsHallucination&Benchmarking Dataset
24.02Indian Institute of Technology KharagpurarxivHow (un)ethical are instruction-centric responses of LLMs? Unveiling the vulnerabilities of safety guardrails to harmful queriesInstruction-centric Responses&Ethical Vulnerabilities
24.03East China Normal UniversityarxivDiaHalu: A Dialogue-level Hallucination Evaluation Benchmark for Large Language ModelsDialogue-level Hallucination&Benchmarking&Human-machine Interaction
24.03Tianjin University, Tianjin University, Zhengzhou University, China Academy of Information and Communications TechnologyarxivOpenEval: Benchmarking Chinese LLMs across Capability, Alignment, and SafetyChinese LLMs&Benchmarking&Safety
24.04University of Pennsylvania, ETH Zurich, EPFL, Sony AIarxivJailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language ModelsJailbreaking Attacks&Robustness Benchmark
24.04Vector Institute for Artificial Intelligence, University of LimerickarxivDeveloping Safe and Responsible Large Language Models - A Comprehensive FrameworkResponsible AI&AI Safety&Generative AI
24.04LMU Munich, University of Oxford, Siemens AG, Munich Center for Machine Learning (MCML), Wuhan UniversityarxivRED TEAMING GPT-4V: ARE GPT-4V SAFE AGAINST UNI/MULTI-MODAL JAILBREAK ATTACKS?Jailbreak Attacks&GPT-4V&Evaluation Benchmark&Robustness
24.04Bocconi University, University of OxfordarxivSafetyPrompts: a Systematic Review of Open Datasets for Evaluating and Improving Large Language Model SafetyLLM Safety&Open Datasets&Systematic Review
24.04University of Alberta&The University of TokyoarxivOnline Safety Analysis for LLMs: a Benchmark, an Assessment, and a Path ForwardLLM Safety&Online Safety Analysis&Benchmark
24.04Technion – Israel Institute of Technology, Google ResearcharxivConstructing Benchmarks and Interventions for Combating Hallucinations in LLMsHallucinations&Benchmarks
24.05Carnegie Mellon UniversityarxivPolygloToxicityPrompts: Multilingual Evaluation of Neural Toxic Degeneration in Large Language ModelsMultilingual Evaluation&*Datasets
24.05Paul G. Allen School of Computer Science & EngineeringarxivMASSIVE Multilingual Abstract Meaning Representation: A Dataset and Baselines for Hallucination DetectionHallucination Detection&Multilingual AMR&Dataset
24.05University of California, RiversidearxivCross-Task Defense: Instruction-Tuning LLMs for Content SafetyInstruction-Tuning&LLM Safety&Content Safety
24.06University of WaterlooarxivTruthEval: A Dataset to Evaluate LLM Truthfulness and ReliabilityTruthfulness&Reliability
24.06Rutgers UniversityarxivMoralBench: Moral Evaluation of LLMsMoral Evaluation&MoralBench
24.06Tsinghua UniversityarxivBenchmarking Trustworthiness of Multimodal Large Language Models: A Comprehensive StudyTrustworthiness&MLLMs&Benchmark
24.06Beijing Academy of Artificial IntelligencearxivHalluDial: A Large-Scale Benchmark for Automatic Dialogue-Level Hallucination EvaluationHallucination Evaluation&Dialogue-Level&HalluDial
24.06Sichuan UniversityarxivLEGEND: Leveraging Representation Engineering to Annotate Safety Margin for Preference DatasetsSafety Margin&Preference Datasets&Representation Engineering
24.06The Hong Kong University of Science and Technology (Guangzhou)arxivBag of Tricks: Benchmarking of Jailbreak Attacks on LLMsJailbreak Attacks&Benchmarking
24.06AI Innovation Center, China UnicomarxivCHiSafetyBench: A Chinese Hierarchical Safety Benchmark for Large Language ModelsChinese Hierarchical Safety Benchmark&Large Language Models&Automatic Evaluation
24.06GooglearxivSupporting Human Raters with the Detection of Harmful Content using Large Language ModelsHarmful Content Detection&Hate Speech
24.06South China University of Technology, Pazhou Laboratory, University of Maryland, Baltimore CountyarxivGenderAlign: An Alignment Dataset for Mitigating Gender Bias in Large Language ModelsGender Bias Mitigation&Alignment Dataset&Bias Categories
24.06Center for AI Safety and Governance, Institute for AI, Peking UniversityarxivSAFESORA: Towards Safety Alignment of Text2Video Generation via a Human Preference DatasetSafety Alignment&Text2Video Generation
24.06Fudan UniversityarxivCross-Modality Safety AlignmentMultimodal Safety&Large Vision-Language Models&SIUO Benchmark
24.06KAISTarxivCSRT: Evaluation and Analysis of LLMs using Code-Switching Red-Teaming DatasetCode-Switching&Red-Teaming&Multilingualism
24.06University College LondonarxivJobFair: A Framework for Benchmarking Gender Hiring Bias in Large Language ModelsGender Bias&Hiring Bias&Benchmarking
24.06Peking UniversityarxivPKU-SafeRLHF: A Safety Alignment Preference Dataset for Llama Family ModelsSafety Alignment&Preference Dataset
24.06University of California, Los AngelesarxivMOSSBench: Is Your Multimodal Language Model Oversensitive to Safe Queries?Multimodal Language Models&Oversensitivity&Safety Mechanisms
24.06Allen Institute for AIarxivWILDGUARD: Open One-stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMsSafety Moderation&Jailbreak Attacks&Moderation Tools
24.06University of WashingtonarxivWILDTEAMING at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language ModelsJailbreaking&Safety Training&Adversarial Attacks
24.07Beijing Jiaotong UniversityarxivKG-FPQ: Evaluating Factuality Hallucination in LLMs with Knowledge Graph-based False Premise QuestionsFactuality Hallucination&Knowledge Graph&False Premise Questions
24.07Chinese Academy of SciencesarxivT2VSafetyBench: Evaluating the Safety of Text-to-Video Generative ModelsText-to-Video Generation&Safety Evaluation&Generative Models
24.07Patronus AIarxivLynx: An Open Source Hallucination Evaluation ModelHallucination Detection&RAG&Evaluation Model
24.07Virginia TecharxivAIR-BENCH 2024: A Safety Benchmark Based on Risk Categories from Regulations and PoliciesAI Safety&Regulations&Policies&Risk Categories
24.07Columbia UniversityECCV 2024HaloQuest: A Visual Hallucination Dataset for Advancing Multimodal ReasoningHallucination&Vision-Language Models&Datasets
24.07Center for AI SafetyarxivSafetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?AI Safety&Benchmarks
24.08Walled AI LabsarxivWALLEDEVAL: A Comprehensive Safety Evaluation Toolkit for Large Language ModelsAI Safety&Prompt Injection
24.08ShanghaiTech UniversityarxivMMJ-Bench: A Comprehensive Study on Jailbreak Attacks and Defenses for Vision Language ModelsJailbreak Attacks&Vision-Language Models&Security
24.08Stanford UniversityarxivCybench: A Framework for Evaluating Cybersecurity Capabilities and Risk of Language ModelsCybersecurity&Capture the Flag
24.08Zhejiang UniversityarxivCharacterizing and Evaluating the Reliability of LLMs against Jailbreak AttacksJailbreak Attacks&LLM Reliability&Evaluation Framework
24.08Enkrypt AIarxivSAGE-RT: Synthetic Alignment Data Generation for Safety Evaluation and Red TeamingSynthetic Data Generation&Safety Evaluation&Red Teaming
24.08Tianjin UniversityFindings of ACL 2024CMoralEval: A Moral Evaluation Benchmark for Chinese Large Language ModelsMoral Evaluation&Moral Dilemma
24.08University of SurreyIJCAI 2024CodeMirage: Hallucinations in Code Generated by Large Language ModelsCode Hallucinations&CodeMirage Dataset
24.08Chalmers University of TechnologyarxivLLMSecCode: Evaluating Large Language Models for Secure CodingSecure Coding&Evaluation Framework
24.09The Chinese University of Hong KongarxivTesting and Evaluation of Large Language Models: Correctness, Non-Toxicity, and FairnessCorrectness&Non-Toxicity&Fairness
24.09KAISTarxivEvaluating Image Hallucination in Text-to-Image Generation with Question-AnsweringImage Hallucination&Text-to-Image Generation&Question-Answering
24.09Zhejiang UniversityarxivGenTel-Safe: A Unified Benchmark and Shielding Framework for Defending Against Prompt Injection AttacksPrompt Injection&LLM Safety&Benchmarking
24.10Zhejiang UniversityarxivAGENT SECURITY BENCH (ASB): FORMALIZING AND BENCHMARKING ATTACKS AND DEFENSES IN LLM-BASED AGENTSLLM-based Agents&Security Benchmarks&Adversarial Attacks
24.10Zhejiang University, Duke UniversityarxivSCISAFEEVAL: A Comprehensive Benchmark for Safety Alignment of Large Language Models in Scientific TasksSafety Alignment&Scientific Tasks
24.10The Chinese University of Hong Kong, Tencent AI LabarxivChain-of-Jailbreak Attack for Image Generation Models via Editing Step by StepChain-of-Jailbreak&Image Generation Models&Safety
24.10University of California, Santa Cruz, University of California, BerkeleyarxivMultimodal Situational Safety: A Benchmark for Large Language ModelsMultimodal Situational Safety&MLLMs&Safety Benchmark
24.10IBM ResearcharxivST-WebAgentBench: A Benchmark for Evaluating Safety and Trustworthiness in Web AgentsWeb Agents&Safety&Trustworthiness
24.10Renmin University of China, Anthropic, University of Oxford, University of Edinburgh, Mila, TangenticarxivPOISONBENCH: Assessing Large Language Model Vulnerability to Data PoisoningData poisoning&LLM vulnerability&Preference learning
24.10Gray Swan AI, UK AI Safety InstitutearxivAGENTHARM: A Benchmark for Measuring Harmfulness of LLM AgentsJailbreaking&LLM agents&Harmful agent tasks
24.10Purdue UniversityarxivCOLLU-BENCH: A Benchmark for Predicting Language Model Hallucinations in CodeCode hallucinations&Code generation&Automated program repair
24.10The Hong Kong University of Science and Technology (Guangzhou), University of Birmingham, Baidu Inc.arxivJAILJUDGE: A Comprehensive Jailbreak Judge Benchmark with Multi-Agent Enhanced Explanation Evaluation FrameworkJailbreak judge&Multi-agent framework
24.10University of Notre Dame, IBM ResearcharxivBenchmarkCards: Large Language Model and Risk ReportingBenchmarkCards&Bias&Fairness
24.10Vectara, Inc., Iowa State University, University of Southern California, Entropy Technologies, University of Waterloo, Funix.io, University of Wisconsin, MadisonarxivFaithBench: A Diverse Hallucination Benchmark for Summarization by Modern LLMsHallucination detection&Human-annotated benchmark&Faithfulness
24.10Southern University of Science and TechnologyarxivChineseSafe: A Chinese Benchmark for Evaluating Safety in Large Language ModelsChineseSafe&Content Safety&LLM Evaluation
24.10Beihang UniversityarxivSafeBench: A Safety Evaluation Framework for Multimodal Large Language ModelsMultimodal Large Language Models&Safety Evaluation Framework&Risk Assessment
24.10University of Washington-MadisonarxivCFSafety: Comprehensive Fine-grained Safety Assessment for LLMsSafety Assessment&LLM Evaluation&Instruction Attacks
24.10University of PennsylvaniaarxivBenchmarking LLM Guardrails in Handling Multilingual ToxicityMultilingual Toxicity Detection&Guardrails&Jailbreaking Attacks
24.10University of Wisconsin-MadisonarxivInjecGuard: Benchmarking and Mitigating Over-defense in Prompt Injection Guardrail ModelsPrompt Injection Defense&Over-defense Detection&Guardrail Models
24.10National Engineering Research Center for Software Engineering, Peking UniversityNeurIPS 2024SG-Bench: Evaluating LLM Safety Generalization Across Diverse Tasks and Prompt TypesLLM Safety&Prompt Engineering&Jailbreak Attacks
24.10Alan Turing InstitutearxivAI Cyber Risk Benchmark: Automated Exploitation CapabilitiesCybersecurity&LLMs&Automated Exploitation
24.11Fudan UniversityarXivLONGSAFETYBENCH: LONG-CONTEXT LLMS STRUGGLE WITH SAFETY ISSUESLong-Context Models&Safety Evaluation&Benchmarking
24.11AnthropicarXivRapid Response: Mitigating LLM Jailbreaks with a Few ExamplesJailbreak Defense&Rapid Response
24.11Texas A&M UniversityarXivResponsible AI in Construction Safety: Systematic Evaluation of Large Language Models and Prompt EngineeringConstruction Safety&Prompt Engineering&LLM Evaluation
24.11IBM Research EuropeNeurIPS 2024 SafeGenAI WorkshopHarmLevelBench: Evaluating Harm-Level Compliance and the Impact of Quantization on Model AlignmentJailbreaking Techniques&LLM Vulnerability&Quantization Impact
24.11Peking UniversityarxivChemSafetyBench: Benchmarking LLM Safety on Chemistry DomainLLM Safety&Chemistry Domain&Benchmarking
24.11New York University, JPMorgan Chase, Cornell Tech, Northeastern UniversityarxivAssessment of LLM Responses to End-user Security QuestionsLLM Evaluation&End-user Security&Information Integrity
24.11National Library of Medicine, NIH&University of Maryland&University of Virginia&Universidad de ChilearxivEnsuring Safety and Trust: Analyzing the Risks of Large Language Models in MedicineMedical AI&LLM Safety&MedGuard Benchmark
24.11European Commission Joint Research CentreEMNLP 2024GuardBench: A Large-Scale Benchmark for Guardrail Modelsguardrail models&benchmark&evaluation
24.12Vizuara AI LabsarxivCBEVAL: A Framework for Evaluating and Interpreting Cognitive Biases in LLMsCognitive Biases&LLM Evaluation&Reasoning Limitations
24.12Beijing Institute of Technology, Beihang UniversityarxivREFF: Reinforcing Format Faithfulness in Language Models across Varied TasksFormat Faithfulness&Benchmark
24.12UCLA, Salesforce AI ResearchNeurIPS 2024SAFEWORLD: Geo-Diverse Safety AlignmentGeo-Diverse Alignment&Safety Evaluation&Legal Compliance
24.12Shanghai Jiao Tong UniversityarxivSafeAgentBench: A Benchmark for Safe Task Planning of Embodied LLM AgentsSafety-Aware Task Planning&Embodied LLM Agents&Hazard Mitigation
24.12Tsinghua UniversityarxivAGENT-SAFETYBENCH: Evaluating the Safety of LLM AgentsAgent Safety&Risk Awareness&Interactive Evaluation
24.12TU DarmstadtarxivLLMs Lost in Translation: M-ALERT Uncovers Cross-Linguistic Safety GapsCross-Linguistic Safety&Multilingual Benchmark&LLM Alignment
24.12Alibaba, China Academy of Information and Communications TechnologyarxivChinese SafetyQA: A Safety Short-form Factuality Benchmark for Large Language ModelsSafety Benchmark&Factuality Evaluation
24.12University of Warwick, Cranfield UniversityarxivMedHallBench: A New Benchmark for Assessing Hallucination in Medical Large Language ModelsMedical Hallucinations&Benchmark&RLHF
24.12The Hong Kong Polytechnic UniversityarxivSecBench: A Comprehensive Multi-Dimensional Benchmarking Dataset for LLMs in CybersecurityCybersecurity Benchmark&Large Language Models&Dataset Evaluation
25.01KTH Royal Institute of TechnologyarxivCySecBench: Generative AI-based CyberSecurity-focused Prompt Dataset for Benchmarking Large Language ModelsCybersecurity Benchmark&Jailbreaking&Prompt Dataset
25.01Shahjalal University of Science and TechnologyarxivFrom Scarcity to Capability: Empowering Fake News Detection in Low-Resource Languages with LLMsFake News Detection&Bangla&Low-Resource Languages
25.01NVIDIAarxivAEGIS2.0: A Diverse AI Safety Dataset and Risks Taxonomy for Alignment of LLM GuardrailsAI Safety&Content Moderation Dataset&LLM Risk Taxonomy
25.01Georgia Institute of TechnologyarxivOn The Origin of Cultural Biases in Language Models: From Pre-training Data to Linguistic PhenomenaCultural Bias in LLMs&Cross-Linguistic Analysis&Arabic-English Benchmarks
25.01Bocconi UniversityarxivMSTS: A Multimodal Safety Test Suite for Vision-Language ModelsMultimodal Safety&Vision-Language Models
25.01Fudan UniversityarxivYou Can't Eat Your Cake and Have It Too: The Performance Degradation of LLMs with Jailbreak DefenseJailbreak Defense&LLM Performance&USEBench
25.01McGill UniversityarxivOnionEval: A Unified Evaluation of Fact-conflicting Hallucination for Small-Large Language ModelsFact-conflicting Hallucination&Small-Large Language Models (SLLMs)&Benchmark
25.01HKUSTarxivTune In, Act Up: Exploring the Impact of Audio Modality-Specific Edits on Large Audio Language Models in JailbreakAudio Language Models&Jailbreak Vulnerabilities&Audio Modality Edits
25.01University of CambridgearxivCASE-BENCH: Context-Aware Safety Evaluation Benchmark for Large Language ModelsLLM Safety&Context-Aware Evaluation&Over-Refusal
25.01CISPA Helmholtz Center for Information SecurityUSENIX Security 2025HATEBENCH: Benchmarking Hate Speech Detectors on LLM-Generated Content and Hate CampaignsHate Speech Detection&LLM-Generated Content&Hate Campaigns
25.01Shanghai Artificial Intelligence Laboratory, Tianjin UniversityarxivRethinking Bottlenecks in Safety Fine-Tuning of Vision Language ModelsVision-Language Models (VLMs)&Chain-of-Thought (CoT)
25.01Independent Research Team “Annyeong! Luda”PACLIC 38RICoTA: Red-teaming of In-the-wild Conversation with Test AttemptsJailbreaking&Conversational AI&User Intent Detection
25.01Renmin University of ChinaarxivSafeRAG: Benchmarking Security in Retrieval-Augmented Generation of Large Language ModelsRetrieval-Augmented Generation&Security Benchmarking&Adversarial Attacks
25.02Rochester Institute of TechnologyarxivHOPE VS. HATE: UNDERSTANDING USER INTERACTIONS WITH LGBTQ+ NEWS CONTENT IN MAINSTREAM US NEWS MEDIA THROUGH THE LENS OF HOPE SPEECHHope Speech&LGBTQ+&Political Bias
25.02Nanjing UniversityarxivSafeDialBench: A Fine-Grained Safety Benchmark for Large Language Models in Multi-Turn Dialogues with Diverse Jailbreak AttacksSafety Benchmark&Jailbreak Attacks&Multi-Turn Dialogues
25.02China UnicomarxivSafety Evaluation of DeepSeek Models in Chinese ContextsDeepSeek Models&Safety Evaluation&Chinese Contexts
25.02Hong Kong University of Science and TechnologyarxivCan’t See the Forest for the Trees: Benchmarking Multimodal Safety Awareness for Multimodal LLMsMultimodal LLMs&Safety Awareness&Benchmarking
25.02Columbia UniversityarXivTREECUT: A Synthetic Unanswerable Math Word Problem Dataset for LLM Hallucination EvaluationMath Word Problem&LLM Hallucination&Synthetic Dataset
25.02Barcelona Supercomputing Center (BSC)arXivEfficient Safety Retrofitting Against Jailbreaking for LLMsModel Alignment&LLM Safety&Jailbreaking
25.02The Hong Kong University of Science and TechnologyarxivGuidedBench: Equipping Jailbreak Evaluation with GuidelinesJailbreak Evaluation&AI Safety&LLM Security
25.02Wuhan UniversityarxivELBA-Bench: An Efficient Learning Backdoor Attacks Benchmark for Large Language ModelsBackdoor Attacks&LLM Security&Benchmarking
25.02Tsinghua UniversityarxivLongSafety: Evaluating Long-Context Safety of Large Language ModelsLong-Context Safety&LLM Evaluation&Safety Benchmark
25.02KAISTarxivIs Safety Standard Same for Everyone? User-Specific Safety Evaluation of Large Language ModelsUser-Specific Safety&LLM Evaluation&Safety Benchmark
25.02Beijing University of Posts and TelecommunicationsarxivJailBench: A Comprehensive Chinese Security Assessment Benchmark for Large Language ModelsChinese Benchmark&Security Assessment&Jailbreak Attack
25.0234 AffiliatesarxivOn the Trustworthiness of Generative Foundation Models: Guideline, Assessment, and PerspectiveSafety Assessment&Guideline Paper
25.02IBM ResearchNeurIPS 2024, SafeGenAI WorkshopAdversarial Prompt Evaluation: Systematic Benchmarking of Guardrails Against Prompt Input Attacks on LLMsAdversarial Prompting&Jailbreak Detection&Guardrail Benchmarking
25.03Google DeepMind, ETH ZuricharXivAutoAdvExBench: Benchmarking Autonomous Exploitation of Adversarial Example DefensesAdversarial Example Defenses&LLM Security&Autonomous Exploitation
25.03The Pennsylvania State University, GE HealthcarearXivMedHEval: Benchmarking Hallucinations and Mitigation Strategies in Medical Large Vision-Language ModelsMedical Hallucination&Hallucination Mitigation&Vision-Language Models
25.03Nanyang Technological University, Agency for Science, Technology and Research (A*STAR)arXivBenchmarking LLMs and LLM-based Agents in Practical Vulnerability Detection for Code RepositoriesSoftware Vulnerability Detection&LLM-based Agents&Interprocedural Analysis
25.03Zhejiang UniversityarxivCan Small Language Models Reliably Resist Jailbreak Attacks? A Comprehensive EvaluationSmall Language Models&Jailbreak Attack&Security Evaluation
25.03University of VirginiaarxivBenchmarking Group Fairness in Reward ModelsGroup Fairness&Reward Models&LLM Alignment
25.03Texas A&M UniversityarxivHalluVerse25: Fine-grained Multilingual Benchmark Dataset for LLM HallucinationsHallucination Detection&Multilingual Benchmark&Fine-grained Annotation
25.03University of KentarxivCyberLLMInstruct: A New Dataset for Analysing Safety of Fine-Tuned LLMs Using Cyber Security DataCyber Security&Fine-Tuning Safety&Instruction Dataset
25.03UC San DiegoarxivExtremeAIGC: Benchmarking LMM Vulnerability to AI-Generated Extremist ContentMultimodal Jailbreaking&Extremist Content&Safety Evaluation
25.03Beihang UniversityarxivTowards Understanding the Safety Boundaries of DeepSeek Models: Evaluation and FindingsSafety Evaluation&Jailbreak Attacks&DeepSeek Models
25.03George Mason UniversityarxivEfficient but Vulnerable: Benchmarking and Defending LLM Batch Prompting AttackBatch Prompting&Prompt Injection Attack&LLM Security
25.03Arizona State UniversityarxivPersonalized Attacks of Social Engineering in Multi-turn Conversations - LLM Agents for Simulation and DetectionSocial Engineering&Multi-turn Conversation&LLM Agents
25.03University of Illinois Urbana-ChampaignICLR 2025MMDT: Decoding the Trustworthiness and Safety of Multimodal Foundation ModelsMultimodal Foundation Models&Trustworthiness Evaluation&Safety Benchmark
25.03University of ChicagoarxivSHIELDAGENT: Shielding Agents via Verifiable Safety Policy ReasoningLLM Agents&Safety Policy Reasoning&Guardrails
25.04University College LondonarxivOn Benchmarking Code LLMs for Android Malware AnalysisCode LLMs&Android Malware Analysis&Structured Evaluation
25.04UC Santa CruzarxivSTAR-1: Safer Alignment of Reasoning LLMs with 1K DataReasoning LLMs&Safety Alignment&High-Quality Dataset
25.04National University of SingaporearxivSafeMLRM: Demystifying Safety in Multi-modal Large Reasoning ModelsMLRM&Safety Evaluation&Jailbreaking Attacks
25.04Tsinghua UniversityarxivRealSafe-R1: Safety-Aligned DeepSeek-R1 without Compromising Reasoning CapabilitySafety Alignment&DeepSeek-R1&Large Reasoning Models
25.04The Hong Kong University of Science and TechnologyarxivBenchmarking Multi-National Value Alignment for Large Language ModelsLLM Hallucination&Value Alignment&Cross-National Evaluation
25.04Qatar Computing Research InstitutearxivAIXAMINE: SIMPLIFIED LLM SAFETY AND SECURITYLLM Safety&Security Evaluation&Black-box Benchmarking
25.04University of California, BerkeleyarxivJailbreaksOverTime: Detecting Jailbreak Attacks Under Distribution ShiftJailbreak Detection&Concept Drift&Continuous Learning
25.04IntuitarxivSecurity Steerability is All You NeedSecurity Steerability&Prompt Guardrails&LLM Robustness
25.04University of Notre DameNAACL2025(System Demo)TRUSTEVAL: A Dynamic Evaluation Toolkit on Trustworthiness of Generative Foundation ModelsTrustworthiness Evaluation Across GenFMs
25.05Rutgers University-New BrunswickarxivUnmasking the Canvas: A Dynamic Benchmark for Image Generation Jailbreaking and LLM Content SafetyJailbreaking&Image Generation&Multilingual Obfuscation
25.05Beijing Jiaotong UniversityarxivThink in Safety: Unveiling and Mitigating Safety Alignment Collapse in Multimodal Large Reasoning ModelMultimodal Large Reasoning Models&Safety Alignment&Chain-of-Thought
25.05Shanghai Artificial Intelligence LaboratoryarxivBenchmarking Ethical and Safety Risks of Healthcare LLMs in China – Toward Systemic Governance under Healthy China 2030Medical LLMs&Ethical Risk Assessment&Governance Framework
25.05University of Technology SydneyarxivSecReEvalBench: A Security Resilient Evaluation Benchmark for Large Language ModelsPrompt Attack&Security Benchmark&Adversarial Robustness
25.05Dartmouth CollegearxivFalseReject: A Resource for Improving Contextual Safety and Mitigating Over-Refusals in LLMs via Structured ReasoningOver-Refusal Mitigation&Structured Reasoning&LLM Safety Evaluation
25.05National University of SingaporearxivGuardReasoner-VL: Safeguarding VLMs via Reinforced ReasoningVLM Guard Models&Reinforcement Learning&Reasoning Alignment
25.05Fudan UniversityarxivThink Twice Before You Act: Enhancing Agent Behavioral Safety with Thought CorrectionLLM Agent&Behavioral Safety&Thought Correction
25.05Giskard AIarxivPhare: A Safety Probe for Large Language ModelsSafety Evaluation&Hallucination Detection&Bias Diagnosis
25.05Beijing University of Posts and TelecommunicationsarxivVideo-SafetyBench: A Benchmark for Safety Evaluation of Video LVLMsSafety Evaluation&Video LVLMs&Prompt Injection
25.05Yonsei UniversityarxivBetter Safe Than Sorry? Overreaction Problem of Vision Language Models in Visual Emergency RecognitionVisual Emergency Recognition&Vision-Language Models&Contextual Reasoning
25.05POSTECHarxivAre Vision-Language Models Safe in the Wild? A Meme-Based Benchmark StudyVLM Safety&Meme Benchmark&Multimodal Alignment
25.05Mohamed bin Zayed University of Artificial IntelligencearxivAudio Jailbreak: An Open Comprehensive Benchmark for Jailbreaking Large Audio-Language ModelsAudio-Language Models&Jailbreak Attacks&Adversarial Benchmark
25.05Fujitsu Research of EuropearxivMAPS: A Multilingual Benchmark for Global Agent Performance and SecurityAgentic AI&Multilingual Evaluation&LLM Security
25.05Nanyang Technological UniversityarxivAudioTrust: Benchmarking the Multifaceted Trustworthiness of Audio Large Language ModelsAudio LLMs&Trustworthiness Evaluation&Multimodal Safety
25.05University of Science and Technology of ChinaarxivFrom Evaluation to Defense: Advancing Safety in Video Large Language ModelsVideo LLMs&Safety Benchmark&Multimodal Defense
25.05Hong Kong University of Science and Technology (Guangzhou)arxivJALMBench: Benchmarking Jailbreak Vulnerabilities in Audio Language ModelsJailbreak Benchmark&Audio Language Model&Adversarial Attack
25.05University of WashingtonarxivSOS BENCH: Benchmarking Safety Alignment on Scientific KnowledgeSafety Alignment&Scientific Knowledge&Benchmark
25.05Johns Hopkins UniversityarxivJailbreak Distillation: Renewable Safety BenchmarkingJailbreak Distillation&Safety Benchmark&Prompt Selection
25.05Alibaba GrouparxivUSB: A Comprehensive and Unified Safety Evaluation Benchmark for Multimodal Large Language ModelsMultimodal LLMs&Safety Evaluation&Benchmark
25.05Wuhan UniversityarxivTRIDENT: Enhancing Large Language Model Safety with Tri-Dimensional Diversified Red-Teaming Data SynthesisLLM Safety&Red-Teaming&Data Diversity
25.06New York University Abu DhabiarxivAgentAuditor: Human-Level Safety and Security Evaluation for LLM AgentsLLM Agent&Safety Evaluation&Security Benchmark
25.06Sabaragamuwa University of Sri LankaarxivXGUARD: A Graded Benchmark for Evaluating Safety Failures of Large Language Models on Extremist ContentLLM Safety&Extremist Content&Graded Evaluation
25.06University of ZuricharxivSimple Prompt Injection Attacks Can Leak Personal Data Observed by LLM Agents During Task ExecutionPrompt Injection&Agent Security&Data Exfiltration
25.06Sun Yat-sen UniversityarxivMTCMB: A Multi-Task Benchmark Framework for Evaluating LLMs on Knowledge, Reasoning, and Safety in Traditional Chinese MedicineTraditional Chinese Medicine&LLM Benchmark&Safety Evaluation
25.06NII-LLMCarxivAnswerCarefully: A Dataset for Improving the Safety of Japanese LLM OutputLLM Safety&Japanese Dataset&Safety Fine-Tuning
25.06University of OxfordarxivAgentMisalignment: Measuring the Propensity for Misaligned Behaviour in LLM-Based AgentsLLM Agent&Misalignment Propensity&Safety Benchmark
25.06ETRIarxivHoliSafe: Holistic Safety Benchmarking and Modeling with Safety Meta Token for Vision-Language ModelVLM Safety&Safety Benchmark&Safety Meta Token
25.06ByteDancearxivSafeGenBench: A Benchmark Framework for Security Vulnerability Detection in LLM-Generated CodeCode Generation&Security Vulnerability&LLM Benchmark&Automatic Evaluation
25.06ShanghaiTech UniversityarxivBeyond Jailbreaks: Revealing Stealthier and Broader LLM Security Risks Stemming from Alignment FailuresLLM Safety&Implicit Harm&JailFlipBench&Adversarial Attacks
25.06Manipal University, GGSI University, IIITDM Kancheepuram, Meta AI, Amazon, BITS GoaarxivAdversariaL attacK sAfety aLIgnment(ALKALI): Safeguarding LLMs through GRACE: Geometric Representation-Aware Contrastive Enhancement—Introducing Adversarial Vulnerability Quality Index (AVQI)LLM Safety&Adversarial Benchmark&Latent Camouflage&Contrastive Alignment&AVQI
25.06Microsoft, ISTA, Trend Micro, RainaResearch, University of Coimbra, Vietnamese German University, SK Shieldus, HiddenLayerarxivLLMail-Inject: A Dataset from a Realistic Adaptive Prompt Injection ChallengePrompt Injection&LLM Security&Benchmark Dataset&Email Assistant&Defense Evaluation
25.06Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences; The Chinese University of Hong Kong, ShenzhenarxivFrom Judgment to Interference: Early Stopping LLM Harmful Outputs via Streaming Content MonitoringLLM Safety&Content Moderation&Streaming Detection&Early Stopping&Token-Level Annotation
25.06Zhejiang University, Tencent, National University of SingaporearxivChineseHarm-Bench: A Chinese Harmful Content Detection BenchmarkHarmful Content Detection&Chinese Benchmark&LLM Safety&Knowledge Augmentation&Content Moderation
25.06Institute of Artificial Intelligence (TeleAI), China Telecom; Beihang University; Beijing Jiaotong University; Northwestern Polytechnical UniversityACL 2025LLMs Caught in the Crossfire: Malware Requests and Jailbreak ChallengesJailbreak Attack&Malware Generation&LLM Security&Benchmark&Code Generation
25.06Hong Kong University of Science and Technology, Peking UniversityFindings of ACL 2025SafeLawBench: Towards Safe Alignment of Large Language ModelsLLM Safety&Legal Benchmark&Risk Hierarchy&Refusal Behavior&Safety Evaluation
25.06The University of Hong KongarxivRealFactBench: A Benchmark for Evaluating Large Language Models in Real-World Fact-CheckingFact-checking&Large Language Models&Benchmark&Multimodal Reasoning&Uncertainty Evaluation
25.06dreadnodearxivAIRTBench: Measuring Autonomous AI Red Teaming Capabilities in Language ModelsAI Red Teaming&Security Benchmark&Capture-the-Flag&Autonomous Agents&Prompt Injection
25.06Beihang UniversityarxivAGENTSAFE: Benchmarking the Safety of Embodied Agents on Hazardous InstructionsEmbodied Agents&Safety Benchmark&Vision-Language Model&Hazardous Instructions&Jailbreak Attack
25.06Zhejiang UniversityarxivRAS-Eval: A Comprehensive Benchmark for Security Evaluation of LLM Agents in Real-World EnvironmentsLLM Agent&Security Evaluation&Benchmark&CWE&Tool Execution
25.06AnthropicarxivSHADE-Arena: Evaluating Sabotage and Monitoring in LLM AgentsSabotage&Monitoring&LLM Agent
25.06NASK – National Research InstitutearxivPL-Guard: Benchmarking Language Model Safety for PolishSafety Benchmark&Polish LLM&Adversarial Robustness
25.06University of Illinois Urbana-ChampaignarxivGUARDSET-X: Multi-Domain, Policy-Grounded, AI Security Guardrail BenchmarkGuardrail Dataset&Safety Evaluation&Policy-Grounded
25.06Emory UniversityarxivModel Editing as a Double-Edged Sword: Steering Agent Ethical Behavior Toward Beneficence or HarmModel Editing&Ethical Alignment&Behavior Control
25.06University of Illinois Urbana-ChampaignarxivBeyond Reactive Safety: Risk-Aware LLM Alignment via Long-Horizon SimulationLLM Alignment&Long-Horizon Simulation&Risk-Aware
25.06University of Illinois Urbana-ChampaignarxivHalluSegBench: Counterfactual Visual Reasoning for Segmentation Hallucination EvaluationHallucination&Segmentation&Counterfactual Reasoning
25.07Inner Mongolia UniversityarxivMcBE: A Multi-task Chinese Bias Evaluation Benchmark for Large Language ModelsBias Evaluation&Chinese LLM&Benchmark
25.07University of EdinburgharxivControlling What You Share: Assessing Language Model Adherence to Privacy PreferencesPrivacy Profile&LLM&PII
25.07University of California, Los AngelesarxivModelCitizens: Representing Community Voices in Online SafetyToxicity Detection&Community Annotation&Context
25.07Carnegie Mellon UniversityarxivOpenAgentSafety: A Comprehensive Framework for Evaluating Real-World AI Agent SafetyAgent Safety&Evaluation&Risk
25.05ETH ZurichACL 2025 FindingsACE: Benchmarking Agent-based Code Execution Safety in LLMsAgent Safety&Code Execution&Benchmark
25.05National University of SingaporeACL 2025DORIS: A Fine-grained Benchmark for Detecting Hallucinations in Long-form Text GenerationHallucination Detection&Long-form Generation&Benchmark
25.05Tsinghua UniversityCOLM 2025MEmoBench: Evaluating Memorization-based Security Risks in LLMsMemorization&Security Risk&Benchmark
25.07CISPA Helmholtz Center for Information SecurityUSENIX Security 2025Bridging the Gap in Vision Language Models in Identifying Unsafe Concepts Across ModalitiesVLM&Safety&Alignment
25.07Maastricht UniversityarxivDutch CrowS-Pairs: Adapting a Challenge Dataset for Measuring Social Biases in Language Models for DutchSocial Bias&Dataset&Dutch
25.07Ben-Gurion University of the NegevarxivTab-MIA: A Benchmark Dataset for Membership Inference Attacks on Tabular Data in LLMsMembership Inference&Tabular Data&Privacy
25.07RWTH Aachen UniversityGeBNLP @ ACL 2025Exploring Gender Bias in Large Language Models: An In-depth Dive into the German LanguageGender Bias&German&LLM
25.07Chinese Academy of Sciences, University of Chinese Academy of SciencesarxivLibra: Large Chinese-based Safeguard for AI ContentSafeguard&Chinese&Safety
25.07Peking University, Peking Union Medical CollegearxivA Novel Evaluation Benchmark for Medical LLMs: Illuminating Safety and Effectiveness in Clinical DomainsMedicalLLM&Safety&Effectiveness
25.07Glass Box MedicinearxivLarge language models provide unsafe answers to patient-posed medical questionsPatientSafety&MedicalAdvice&Chatbots
25.07University of CambridgearxivTRIDENT: Benchmarking LLM Safety in Finance, Medicine, and LawSafety&Benchmark&Compliance
25.07Mohamed bin Zayed University of Artificial Intelligence, University of DelhiarxivUnsafeChain: Enhancing Reasoning Model Safety via Hard CasesSafetyAlignment&HardPrompts&Reasoning
25.08AIM Intelligence, Seoul National UniversityarxivEliciting and Analyzing Emergent Misalignment in State-of-the-Art Large Language ModelsLLM Alignment&Misalignment&Red-Teaming&Narrative Manipulation
25.08Shanghai Jiao Tong UniversityarxivBeyond Uniform Criteria: Scenario-Adaptive Multi-Dimensional Jailbreak EvaluationJailbreak Evaluation&Scenario-Adaptivity&LLM Safety
25.08Shanghai Jiao Tong UniversityarxivMCSBench: Benchmarking Multimodal Commonsense Reasoning of Large Vision-Language ModelsMultimodal Commonsense Reasoning&Large Vision-Language Models&Benchmark
25.08Fordham UniversityarxivTowards Safer AI Moderation: Evaluating LLM Moderators Through a Unified Benchmark Dataset and Advocating a Human-First ApproachAI Moderation&Bias Detection&Human-in-the-Loop
25.08AppleCOLM 2025Investigating Intersectional Bias in Large Language Models using Confidence Disparities in Coreference ResolutionIntersectional Bias&Coreference Resolution&Fairness Evaluation
25.08Tsinghua UniversityarxivOmni-SafetyBench: A Benchmark for Safety Evaluation of Audio-Visual Large Language ModelsAudio-Visual LLMs&Safety Evaluation&Cross-Modal Consistency
25.08The Hong Kong University of Science and Technology (Guangzhou)arxivExploring Safety Alignment Evaluation of LLMs in Chinese Mental Health Dialogues via LLM-as-JudgeMental Health Dialogues&Safety Alignment&LLM-as-Judge
25.08Peking UniversityACM MM 2025Safe Semantics, Unsafe Interpretations: Tackling Implicit Reasoning Safety in Large Vision-Language ModelsVision-Language Models&Implicit Reasoning Safety&In-Context Learning
25.08Guizhou UniversityarxivXFacta: Contemporary, Real-World Dataset and Evaluation for Multimodal Misinformation Detection with Multimodal LLMsMultimodal Misinformation Detection&MLLM Evaluation&Contemporary Dataset
25.08Shanghai Artificial Intelligence LaboratoryarxivLinguaSafe: A Comprehensive Multilingual Safety Benchmark for Large Language ModelsMultilingual Safety&Benchmark&LLM Evaluation
25.08University of Science and Technology of China, Beihang UniversityarxivMCPTox: A Benchmark for Tool Poisoning Attack on Real-World MCP ServersTool Poisoning&LLM Agent&Benchmark
25.08Tsinghua University, Beihang University, RealAIarxivUnveiling Trust in Multimodal Large Language Models: Evaluation, Analysis, and MitigationMultimodal LLMs&Trustworthiness&Benchmark
25.08Shanghai AI Lab, East China Normal UniversityarxivSafetyFlow: An Agent-Flow System for Automated LLM Safety BenchmarkingLLM Safety&Automated Benchmarking&Agent System
25.08Oregon State UniversityarxivMind the Gap: Time-of-Check to Time-of-Use Vulnerabilities in LLM-Enabled AgentsLLM Agents&TOCTOU Vulnerabilities&Security Mitigation
25.08University of Wisconsin-MadisonarxivPRISM: Robust VLM Alignment with Principled Reasoning for Integrated Safety in MultimodalityVision-Language Models&Multimodal Safety&Chain-of-Thought Alignment
25.08Shandong UniversityarxivMind the Third Eye! Benchmarking Privacy Awareness in MLLM-powered Smartphone AgentsSmartphone Agents&Privacy Awareness&SAPA-Bench
25.08CISPA Helmholtz Center for Information SecurityarxivJADES: A Universal Framework for Jailbreak Assessment via Decompositional ScoringJailbreak Assessment&Decompositional Scoring&Fact-Checking
25.09Shanghai Artificial Intelligence LaboratoryEMNLP 2025 FindingsSelf-adaptive Dataset Construction for Real-World Multimodal Safety ScenariosMultimodal Safety&Dataset Construction&Risk Evaluation
25.09Sichuan UniversityEMNLP 2025 FindingsCANDY: Benchmarking LLMs' Limitations and Assistive Potential in Chinese Misinformation Fact-CheckingMisinformation Fact-Checking&Chinese LLMs&Benchmark Dataset
25.09Beihang UniversityarxivVaccineRAG: Boosting Multimodal Large Language Models' Immunity to Harmful RAG SamplesRetrieval-Augmented Generation&Chain-of-Thought&Multimodal LLMs
25.09Leibniz Institute for Resilience ResearcharxivScaling behavior of large language models in emotional safety classification across sizes and tasksEmotional Safety&Model Scaling&Mental Health Applications
25.09King Fahd University of Petroleum and MineralsarxivAraHalluEval: A Fine-grained Hallucination Evaluation Framework for Arabic LLMsArabic LLMs&Hallucination Evaluation&Multilingual Models
25.09Stevens Institute of TechnologyarxivBehind the Mask: Benchmarking Camouflaged Jailbreaks in Large Language ModelsCamouflaged Jailbreaks&Benchmark Dataset&LLM Security
25.09Sharif University of TechnologyarxivEPT Benchmark: Evaluation of Persian Trustworthiness in Large Language ModelsTrustworthiness&Persian LLMs&Cultural Alignment
25.09Beijing Institute of TechnologyarxivSafeToolBench: Pioneering a Prospective Benchmark to Evaluating Tool Utilization Safety in LLMsTool Safety&LLM Agents&Prospective Evaluation
25.09AppenarxivRed Teaming Multimodal Language Models: Evaluating Harm Across Prompt Modalities and ModelsMultimodal LLMs&Red Teaming&Adversarial Prompts&Safety Evaluation
25.09Peking University, TeleAI (China Telecom), UCASarxivWhen Safe Unimodal Inputs Collide: Optimizing Reasoning Chains for Cross-Modal Safety in Multimodal Large Language ModelsMultimodal Safety&Reasoning Path Optimization&Implicit Reasoning Risk
25.09University of Washington, IAEAarxivAutomated Triaging and Transfer Learning of Incident Learning Safety Reports Using Large Language Representational ModelsIncident Learning Systems&Transfer Learning&BlueBERT
25.09EPFL & Tampere University & University of Tokyo & National Institute of InformaticsEMNLP 2025TactfulToM: Do LLMs Have the Theory of Mind Ability to Understand White Lies?Theory of Mind&White Lies&Social Reasoning
25.09Singapore Management University, National University of Singapore, Monash University, Aalto University, York University, Zhejiang UniversityarxivSecureAgentBench: Benchmarking Secure Code Generation under Realistic Vulnerability ScenariosSecure Code Generation&LLM Agents&Software Security Benchmark
25.09Zhejiang UniversityarxivHarmMetric Eval: Benchmarking Metrics and Judges for LLM Harmfulness AssessmentLLM Harmfulness Evaluation&Benchmark&Safety Alignment
25.09Shanghai Jiao Tong University, China Telecom Institute of Artificial Intelligence, USTC, GigaAI, Xinjiang UniversityarxivRADAR: A Risk-Aware Dynamic Multi-Agent Framework for LLM Safety Evaluation via Role-Specialized CollaborationLLM Safety Evaluation&Multi-Agent Collaboration&Risk-Aware Framework
25.09UC Berkeley, AWS AI LabsarxivSTAC: When Innocent Tools Form Dangerous Chains to Jailbreak LLM AgentsLLM Agents&Sequential Tool Attacks&Multi-turn Security Vulnerability
25.09Chinese Academy of SciencesarxivSafeMind: Benchmarking and Mitigating Safety Risks in Embodied LLM AgentsEmbodied LLM Agents&Safety Benchmark&Risk Mitigation
25.09Shanghai AI Laboratory, Fudan University, The University of Hong Kong, East China Normal UniversityarxivSafeEvalAgent: Toward Agentic and Self-Evolving Safety Evaluation of LLMsSafety Evaluation&Multi-Agent Framework&Dynamic Benchmarking
25.10UC San Diego, Columbia University, Zhejiang University, MIT-IBM Watson LabarxivBiasFreeBench: A Benchmark for Mitigating Bias in Large Language Model ResponsesBias Mitigation&Benchmarking&Fairness Evaluation
25.10Duke UniversityarxivWAInjectBench: Benchmarking Prompt Injection Detections for Web AgentsPrompt Injection Detection&Web Agents&Multimodal Security&Benchmarking
25.10AITRICS, KAIST, Severance Hospital (Yonsei University)EMNLP 2025Taxonomy of Comprehensive Safety for Clinical AgentsClinical Safety&Taxonomy&Intent Classification&LLM Safety
25.10Rutgers UniversityarxivRead the Scene, Not the Script: Outcome-Aware Safety for LLMsLLM Safety Alignment&Consequence-Blindness&Outcome-Aware Reasoning
25.10University of California, San DiegoarxivWhen Benchmarks Age: Temporal Misalignment through Large Language Model Factuality EvaluationBenchmark Aging&Factuality Evaluation&Temporal Misalignment
25.10Beihang University & Chinese Academy of Sciences & China University of Petroleum (East China) & Zhejiang University of Technology & 360 AI Security Lab & University of SydneyarxivSecureWebArena: A Holistic Security Evaluation Benchmark for LVLM-based Web AgentsWeb Agent Security&LVLM Benchmark&Adversarial Evaluation&Prompt Injection
25.10Carnegie Mellon University & Amazon AGIarxivRefusalBench: Generative Evaluation of Selective Refusal in Grounded Language ModelsSelective Refusal&RAG Evaluation&Generative Benchmark
25.10Beijing University of Posts and Telecommunications, University of California Santa BarbaraarxivMCP Security Bench (MSB): Benchmarking Attacks Against Model Context Protocol in LLM AgentsModel Context Protocol&Agent Security&Benchmark
25.10National Taiwan University, NVIDIAarxivInvestigating Safety Vulnerabilities of Large Audio-Language Models Under Speaker Emotional VariationsAudio-Language Models&Safety Alignment&Emotional Variation
25.10ApplearxivVLSU: Mapping the Limits of Joint Multimodal Understanding for AI SafetyMultimodal Safety&Vision-Language Models&Benchmark
25.10Amirkabir University of Technology, KCL, Politecnico di Milano, KTHarxivPBBQ: A Persian Bias Benchmark Dataset Curated with Human-AI Collaboration for Large Language ModelsBias Benchmark&Persian LLM&Human-AI Collaboration
25.10NYU Shanghai, NUS, Yale University, NYU Center for Data ScienceEMNLP 2025FinTrust: A Comprehensive Benchmark of Trustworthiness Evaluation in Finance DomainTrustworthiness Benchmark&Financial LLM&Ethical Alignment
25.10Beihang University, Tsinghua University, Shanghai Qi Zhi InstituteNeurIPS 2025DeceptionBench: A Comprehensive Benchmark for AI Deception Behaviors in Real-world ScenariosDeception Behavior&Benchmark&LLM Safety
25.10Manchester Metropolitan UniversityarxivSafeguarding Efficacy in Large Language Models: Evaluating Resistance to Human-Written and Algorithmic Adversarial PromptsLarge Language Models&Adversarial Attacks&AI Safety
25.10Lakera AI, ETH Zürich, UK AI Security Institute, University of OxfordarxivBreaking Agent Backbones: Evaluating the Security of Backbone LLMs in AI AgentsAI Agent Security&Backbone LLM Benchmarking&Threat Modeling
25.10William & MaryarxivMeasuring the Security of Mobile LLM Agents under Adversarial Prompts from Untrusted Third-Party ChannelsMobile LLM Agent&Adversarial Prompt&Security Evaluation
25.10Hong Kong Baptist UniversityarxivMemeArena: Automating Context-Aware Unbiased Evaluation of Harmfulness Understanding for Multimodal Large Language ModelsMultimodal Large Language Model&Harmfulness Evaluation&Agent-based Framework
25.11Penn State UniversityarxivDo Methods to Jailbreak and Defend LLMs Generalize Across Languages?Multilingual Jailbreak&Cross-lingual Safety&Defense Robustness
25.11Tsinghua UniversityarxivLiveSecBench: A Dynamic and Culturally-Relevant AI Safety Benchmark for LLMs in Chinese ContextAI Safety Benchmark&Chinese LLM Evaluation&Cultural Alignment
25.11Nanjing UniversityarxivUnderstanding New-Knowledge-Induced Factual Hallucinations in LLMs: Analysis, Solution, and InterpretationFactual Hallucination&New Knowledge&Attention Mechanism
25.11German Research Center for Artificial IntelligencearxivCONVERSE: Benchmarking Contextual Safety in Agent-to-Agent ConversationsAgent-to-Agent Safety&Privacy Leakage&Security Attacks
25.11Southern University of Science and TechnologyarxivMCP-RiskCue: Can LLM Infer Risk Information From MCP Server System Logs?MCP Security&System Log Analysis&Risk Detection Benchmark
25.11Beijing Institute of AI Safety and GovernancearxivEfficient LLM Safety Evaluation through Multi-Agent DebateLLM Safety Evaluation&Multi-Agent Debate&Jailbreak Benchmark
25.11Xi’an Jiaotong UniversityarxivHow Brittle is Agent Safety? Rethinking Agent Risk under Intent Concealment and Task ComplexityAgent Safety&Intent Concealment&Task Complexity
25.11Fudan University / City University of Hong Kong / Singapore Management UniversityarxivAttackVLA: Benchmarking Adversarial and Backdoor Attacks on Vision–Language–Action ModelsVision–Language–Action Models&Adversarial Attacks&Backdoor Attacks
25.11Nanjing UniversityarxivSAFERBENCH: A COMPREHENSIVE BENCHMARK FOR SAFETY ASSESSMENT IN LARGE REASONING MODELSSafety Benchmark&Large Reasoning Models&Chain-of-Thought Safety
25.11École Polytechnique, Sorbonne Paris Nord University, Crédit Agricole SAarxivMUCH: A Multilingual Claim Hallucination BenchmarkUncertainty Quantification&Hallucination Detection&Multilingual Benchmark
25.11Intuit AI Security ResearcharxivASTRA: Agentic Steerability and Risk Assessment FrameworkAgentic AI Security&Steerability Evaluation&Risk Assessment Framework
25.11Scale AI, University of Maryland, UNC, Google DeepMind, Netflix, UT AustinarxivPropensityBench: Evaluating Latent Safety Risks in Large Language Models via an Agentic ApproachLLM Safety&Risk Propensity&Agentic Evaluation
25.11University at Buffalo, SUNYarxivDUALGAUGE: Automated Joint Security–Functionality Benchmarking for Secure Code GenerationSecure Code Generation&Benchmarking&LLM Evaluation
25.12Bangladesh University of Engineering and TechnologyarxivWhen Safety Blocks Sense: Measuring Semantic Confusion in LLM RefusalsSafety Alignment&Semantic Confusion&False Refusal Analysis
25.12BITS Pilani, Hyderabad CampusarxivBHRAM-IL: A Benchmark for Hallucination Recognition and Assessment in Multiple Indian LanguagesMultilingual Hallucination Detection&Indic Languages&Benchmark Dataset
25.12University of Leeds, Ben-Gurion University of the NegevarxivDialogGuard: Multi-Agent Psychosocial Safety Evaluation of Sensitive LLM ResponsesPsychosocial Safety&Multi-Agent Evaluation&LLM-as-a-Judge
25.12Carnegie Mellon University, Columbia University, Johns Hopkins University, HydroX AIarxivIs Vibe Coding Safe? Benchmarking Vulnerability of Agent-Generated Code in Real-World TasksVibe Coding&AI Agents&Secure Code Generation&Benchmarking
25.12Harbin Institute of Technology (Shenzhen)NeurIPS 2025 / arxivContext-Aware Hierarchical Learning: A Two-Step Paradigm Towards Safer LLMsLLM Security&Prompt Injection Defense&Tool-Augmented Models&Context-Aware Hierarchical Learning (CAHL)
25.12Institute of Artificial Intelligence (TeleAI) of China TelecomarxivTeleAI-Safety: A comprehensive LLM jailbreaking benchmark towards attacks, defenses, and evaluationsLarge Language Model&Jailbreak&Safety Benchmark
25.12Vidyasirimedhi Institute of Science and TechnologyarxivSEA-SafeguardBench: Evaluating AI Safety in SEA Languages and CulturesAI Safety Benchmark&Southeast Asian Languages&Cultural Safety
25.12Nanyang Technological UniversityarxivOmniSafeBench-MM: A Unified Benchmark and Toolbox for Multimodal Jailbreak Attack–Defense EvaluationMultimodal LLM Safety&Jailbreak Benchmark&Attack–Defense Evaluation
25.12East China Normal UniversityarxivMCP-SafetyBench: A Benchmark for Safety Evaluation of Large Language Models with Real-World MCP ServersMCPSecurity&SafetyBenchmark&LLMAgents
25.12BRAC University, BangladesharxivBeyond the Benchmark: Innovative Defenses Against Prompt Injection AttacksPromptInjection&DefenseGeneration&LLMSecurity
25.12i.AIarxivA Real-World Evaluation of LLM Medication Safety Reviews in NHS Primary CareMedical Safety&Real-World Evaluation&Failure Analysis
25.12Western Sydney UniversityarxivPrompt-Induced Over-Generation as Denial-of-Service: A Black-Box Attack-Side BenchmarkOver-Generation&Denial-of-Service&Prompt Attacks
25.12International Institute of Information Technology HyderabadarxivPrivacyBench: A Conversational Benchmark for Evaluating Privacy in Personalized AIPrivacy Benchmark&Personalized AI&RAG Systems
26.01Xidian UniversityarxivOverlooked Safety Vulnerability in LLMs: Malicious Intelligent Optimization Algorithm Request and its JailbreakAlgorithm Design&Jailbreak&Optimization Safety
26.01Nanyang Technological UniversityarxivCSSBench: Evaluating the Safety of Lightweight LLMs against Chinese-Specific Adversarial PatternsChinese Safety&Adversarial Evaluation&Over-Refusal
26.01Kyoto UniversityarxivJMedEthicBench: A Multi-Turn Conversational Benchmark for Evaluating Medical Safety in Japanese Large Language ModelsMedical Safety&Multi-Turn Evaluation&Japanese Benchmark
26.01Indian Institute of Technology IndoreAAAI 2026X-MuTeST: A Multilingual Benchmark for Explainable Hate Speech Detection and A Novel LLM-consulted Explanation FrameworkExplainable AI&Hate Speech Detection&Multilingual NLP
26.01Fordham UniversityarxivGuardEval: A Multi-Perspective Benchmark for Evaluating Safety, Fairness, and Robustness in LLM ModeratorsLLM Moderation&Safety Benchmark&Fairness Robustness
26.01United Arab Emirates Universityarxivα³-Bench: A Unified Benchmark of Safety, Robustness, and Efficiency for LLM-Based UAV Agents over 6G NetworksUAV Agents&6G Networks&Safety Benchmark
26.01Monash UniversityarxivPsychEthicsBench: Evaluating Large Language Models Against Australian Mental Health EthicsMental Health Ethics&Ethical Benchmark&Jurisdiction Awareness
26.01VNU University of SciencearxivRedBench: A Universal Dataset for Comprehensive Red Teaming of Large Language ModelsRed Teaming&Safety Benchmark&Risk Taxonomy
26.01Xidian UniversityarxivHearSay Benchmark: Do Audio LLMs Leak What They Hear?Audio Privacy&Voiceprint Leakage&Safety Benchmark
26.01Beijing Jiaotong UniversityarxivWhen Helpers Become Hazards: A Benchmark for Analyzing Multimodal LLM-Powered Safety in Daily LifeMultimodal Safety&Daily Life Risks&Safety Benchmark
26.01National University of Defense TechnologyarxivThe Forgotten Shield: Safety Grafting in Parameter-Space for Medical MLLMsMedical Safety&Parameter Alignment&Safety Grafting
26.01Federal University of GoiásarxivMiJaBench: Revealing Minority Biases in Large Language Models via Hate Speech JailbreakingDemographic Bias&Hate Speech Safety&Jailbreak Benchmark
26.01Dongguk UniversityarxivSafety Not Found (404): Hidden Risks of LLM-Based Robotics Decision MakingRoboticsSafety&SpatialReasoning&LLMReliability
26.01University of California, RiversidearxivPII-VisBench: Evaluating Personally Identifiable Information Safety in Vision Language Models Along a Continuum of VisibilityPIISafety&VLMPrivacy&Benchmark
26.01University of California, Santa CruzarxivSafePro: Evaluating the Safety of Professional-Level AI AgentsAgentSafety&ProfessionalTasks&Benchmark
26.01University of Notre DamearxivMTMCS-Bench: Evaluating Contextual Safety of Multimodal Large Language Models in Multi-Turn DialoguesMultimodalSafety&MultiTurn&Benchmark
26.01Shanghai University of Finance and EconomicsarxivFinVault: Benchmarking Financial Agent Safety in Execution-Grounded EnvironmentsFinancialAgents&SecurityBenchmark&AgentSafety
26.01Fudan UniversityarxivWebTrap Park: An Automated Platform for Systematic Security Evaluation of Web AgentsWebAgents&SecurityEvaluation&Benchmark
26.01The University of Texas at AustinarxivFaithfulness vs. Safety: Evaluating LLM Behavior Under Counterfactual Medical EvidenceMedicalSafety&CounterfactualEvidence&LLMBehavior
26.01The Alan Turing InstitutearxivDriveSafe: A Hierarchical Risk Taxonomy for Safety-Critical LLM-Based Driving AssistantsDrivingSafety&RiskTaxonomy&LLMEvaluation
26.01Brown UniversityarxivUbuntuGuard: A Culturally-Grounded Policy Benchmark for Equitable AI Safety in African LanguagesMultilingualSafety&CulturalGrounding&GuardianModels
26.01INSAITarxivChartAttack: Testing the Vulnerability of LLMs to Malicious Prompting in Chart GenerationChartSecurity&MultimodalJailbreak&VisualizationAttacks
26.01Shanghai Qi Zhi InstitutearxivVirtualCrime: Evaluating Criminal Potential of Large Language Models via Sandbox SimulationAgenticCrime&SandboxSimulation&LLMSafety
26.01Tsinghua UniversityarxivThe Side Effects of Being Smart: Safety Risks in MLLMs’ Multi-Image ReasoningMultiImageReasoning&MLLMSafety&SafetyBenchmarks
26.01Eindhoven University of TechnologyarxivAdversaRiskQA: An Adversarial Factuality Benchmark for High-Risk DomainsAdversarialFactuality&HighRiskDomains&LLMBenchmark
26.01Nanyang Technological UniversityCCS 2026The Shadow Self: Intrinsic Value Misalignment in Large Language Model AgentsLLM Agents&Value Misalignment&Safety Benchmark
26.01Beijing Normal UniversityarxivGUIGuard: Toward a General Framework for Privacy-Preserving GUI AgentsGUI Agents&Privacy Preservation&Agent Safety
26.01Shanghai AI LabarxivAutomated Safety Benchmarking: A Multi-agent Pipeline for LVLMsLVLM Safety&Automated Benchmarking&Multi-agent Systems
26.01The Chinese University of Hong KongarxivVoxPrivacy: A Benchmark for Evaluating Interactional Privacy of Speech Language ModelsInteractional Privacy&Speech LLMs&Safety Benchmark
26.01UnknownarxivSHARP: Social Harm Analysis via Risk Profiles for Measuring Inequities in Large Language ModelsSocial Harm&Tail Risk&Fairness Evaluation
26.01Tehran Institute for Advanced StudiesarxivFFE-HALLU: Hallucinations in Fixed Figurative Expressions: Benchmark of Idioms and Proverbs in the Persian LanguageFigurative Hallucination&Idioms Benchmark&Cultural Grounding
26.01University of IllinoisWWW 2026Mind the Ambiguity: Aleatoric Uncertainty Quantification in LLMs for Safe Medical Question AnsweringAleatoric Uncertainty&Medical QA&Ambiguity Detection
26.02National University of SingaporearxivRisky-Bench: Probing Agentic Safety Risks under Real-World DeploymentAgent Safety&Deployment&Benchmark
26.02ShanghaiTech University, Shanghai, ChinaarxivLPS-BENCH: Benchmarking Safety Awareness of Computer-Use Agents in Long-Horizon Planning under Benign and Adversarial ScenariosComputer-Use Agents&Planning Safety&Benchmark
26.02Zhejiang UniversityarxivCSR-Bench: A Benchmark for Evaluating the Cross-modal Safety and Reliability of MLLMsCross-modal Safety&MLLM Reliability&Benchmark
26.02Qatar Computing Research Institute, HBKU, Doha, QatararxivOnce Correct, Still Wrong: Counterfactual Hallucination in Multilingual Vision-Language ModelsCounterfactual Hallucination&Multilingual VLMs&Benchmark
26.02Zhejiang UniversityICLR 2026WHEN AGENTS “MISREMEMBER” COLLECTIVELY: EXPLORING THE MANDELA EFFECT IN LLM-BASED MULTI-AGENT SYSTEMSMandela Effect&Multi-Agent Systems&False Memory
26.02Seoul National University College of MedicinearxivMPIB: A Benchmark for Medical Prompt Injection Attacks and Clinical Safety in LLMsPrompt Injection&Clinical Safety&RAG
26.02School of Software, Shandong University, Jinan, ChinaarxivTrajAD: Trajectory Anomaly Detection for Trustworthy LLM AgentsLLM Agents&Anomaly Detection&Process Supervision
26.02Critical ML LabarxivTAMPERBENCH: SYSTEMATICALLY STRESS-TESTING LLM SAFETY UNDER FINE-TUNING AND TAMPERINGTamper Resistance&Safety Evaluation&Fine-Tuning Attacks
26.02Texas A&M University, College Station, TX, USAarxivHalluverse-M3: A multitask multilingual benchmark for hallucination in LLMsHallucination Benchmark&Multilingual Evaluation&Detection
26.02Computer Science Department, Southern Illinois University, Carbondale, IL, United StatesarxivAGENT-FENCE: MAPPING SECURITY VULNERABILITIES ACROSS DEEP RESEARCH AGENTSAgent Security&Trust Boundaries&Architectural Evaluation
26.02Xidian UniversityarxivFROM ASSISTANT TO DOUBLE AGENT: FORMALIZING AND BENCHMARKING ATTACKS ON OPENCLAW FOR PERSONALIZED LOCAL AI AGENT.Personalized Agents&Security Benchmarking&OpenClaw
26.02University of OregonarxivBenchmarking Knowledge-Extraction Attack and Defense on Retrieval-Augmented GenerationRetrieval-Augmented Generation&Knowledge Extraction&Benchmarking
26.02Sentient LabsarxivCryptoAnalystBench: Failures in Multi-Tool Long-Form LLM AnalysisAnalyst Agents&Tool Use&Long-Form Evaluation
26.02Macquarie University, Sydney, AustraliaarxivAre Aligned Large Language Models Still Misaligned?Misalignment Benchmark&Safety&Culture
26.02ETH ZüricharxivGT-HarmBench: Benchmarking AI Safety Risks Through the Lens of Game TheoryAI Safety Benchmark&Game Theory&Multi-Agent Risks
26.02UnknownarxivProMoral-Bench: Evaluating Prompting Strategies for Moral Reasoning and Safety in LLMsMoral Reasoning&Prompt Engineering&Safety Benchmark
26.02Northeastern UniversityarxivUnsafer in Many Turns: Benchmarking and Defending Multi-Turn Safety Risks in Tool-Using AgentsTool-Using Agents&Multi-Turn Safety&Defense
26.02Beijing Institute of AI Safety and Governance, ChinaarxivForesightSafety Bench: A Frontier Risk Evaluation and Governance Framework towards Safe AIAI Safety Benchmark&Frontier Risks&Governance
26.02KI-Werkstatt / FB2, University of Applied Sciences BerlinarxivLLMStructBench: Benchmarking Large Language Model Structured Data ExtractionStructured Extraction&Benchmark&JSON Generation
26.02TSINGHUA UNIVERSITY, BEIJING, CHINAarxivSecCodeBench-V2 Technical ReportSecure Code Generation&Benchmark&Code Security
26.02University of Tübingen & Max-Planck Institute for Intelligent SystemsarxivNESSIE: THE NECESSARY SAFETY BENCHMARK - IDENTIFYING ERRORS THAT SHOULD NOT EXISTSafety Benchmark&Instruction Following&Necessary Condition
26.02Independent ResearcherarxivMind the GAP: Text Safety Does Not Transfer to Tool-Call Safety in LLM AgentsTool-Call Safety&Benchmark&Agent Alignment
26.02Oregon State University, Corvallis, OR USAarxivFail-Closed Alignment for Large Language ModelsFail-Closed Safety&Refusal Robustness&Jailbreak Defense
26.02Oracle America Inc.EACL 2026IndicJR: A Judge-Free Benchmark of Jailbreak Robustness in South Asian LanguagesMultilingual Safety&Jailbreak Benchmark&South Asian Languages
26.02Kakaobank, South KoreaarxivFENCE: A Financial and Multimodal Jailbreak Detection Datasetjailbreak detection&finance&multimodal
26.02School of Computer Science, Carnegie Mellon UniversityarxivHow Well Can LLM Agents Simulate End-User Security and Privacy Attitudes and Behaviors?user simulation&security privacy&benchmark
26.02Southeast UniversityarxivCIBER: A Comprehensive Benchmark for Security Evaluation of Code Interpreter Agentscode interpreter&benchmark&agent security
26.02Max Planck Institute for Intelligent Systems, ELLIS Institute Tubingen, Tubingen AI CenterarxivSKILL-INJECT: Measuring Agent Vulnerability to Skill File Attacksskill attack&prompt injection&agent benchmark
26.02Harbin Institute of Technology, ShenzhenarxivEvaluating Proactive Risk Awareness of Large Language Modelsproactive safety&risk awareness&ecology
26.03Shanghai AI LabarxivSafeSci: Safety Evaluation of Large Language Models in Science Domains and Beyondscientific safety&benchmark&alignment
26.03UC Santa CruzarxivZERODAYBENCH: EVALUATING LLM AGENTS ON UNSEEN ZERO-DAY VULNERABILITIESagents&zero-day&benchmark
26.03Duke UniversityarxivMUSE: A Run-Centric Platform for Multimodal Unified Safety Evaluation of Foundation Modelsmultimodal safety&evaluation platform&benchmarking
26.03Renmin University of China, ChinaarxivTRUSTMH-BENCH: A Comprehensive Benchmark for Evaluating the Trustworthiness of Mental Health LLMsmental health&trustworthiness&benchmark
26.03Compumacy for Artificial Intelligence Solutions, Cairo, EgyptarxivSalamahBench: Toward Standardized Safety Evaluation for Arabic Language ModelsArabic LMs&safety benchmark&evaluation
26.03CISPA Helmholtz Center for Information SecurityarxivBenchmark of Benchmarks: Unpacking Influence and Code Repository Quality in LLM Safety Benchmarkssafety benchmarks&code quality&meta-analysis
26.03SCB DataXarxivTHAISAFETYBENCH: ASSESSING LANGUAGE MODEL SAFETY IN THAI CULTURAL CONTEXTSThai safety&cultural context&benchmark
26.03Mohamed bin Zayed University of Artificial IntelligenceICLR 2026JAILNEWSBENCH: MULTI-LINGUAL AND REGIONAL BENCHMARK FOR FAKE NEWS GENERATION UNDER JAILBREAK ATTACKSfake news&jailbreak&multilingual benchmark
26.03Technical University of MunicharxivA Coin Flip for Safety: LLM Judges Fail to Reliably Measure Adversarial RobustnessLLM-as-a-Judge&adversarial robustness&evaluation reliability
26.03OraclearxivLieCraft: A Multi-Agent Framework for Evaluating Deceptive Capabilities in Language Modelsdeception evaluation&multi-agent games&alignment
26.03MediaTek ResearcharxivTaiwan Safety Benchmark and Breeze Guard: Toward Trustworthy AI for Taiwanese MandarinTaiwanese Mandarin&safety benchmark&cultural alignment
26.03Singapore Management UniversityarxivBackdoor4Good: Benchmarking Beneficial Uses of Backdoors in LLMsbeneficial backdoors&controllability&trustworthy AI
26.03School of Informatics, Computing, and Cyber Systems, Northern Arizona University, Flagstaff, USAarxivWhere Do LLM-based Systems Break? A System-Level Security Framework for Risk Assessment and Treatmentsystem security&risk assessment&attack-defense trees
26.03Wuhan University, Wuhan, Hubei, ChinaKDD 2026AdaCultureSafe: Adaptive Cultural Safety Grounded by Cultural Knowledge in Large Language Modelscultural safety&cultural knowledge&adaptive alignment
26.03School of Computing and Augmented Intelligence, Arizona State University, Tempe, USAarxivReal-Time Trust Verification for Safe Agentic Actions using TrustBenchtrust verification&agent safety&real-time intervention
26.03Fudan University, Shanghai, ChinaarxivOOD-MMSafe: Advancing MLLM Safety from Harmful Intent to Hidden Consequencesmultimodal safety&hidden consequences&policy optimization
26.03Department of Statistics & Data Science, Yale UniversityarxivBenchmarking Political Persuasion Risks Across Frontier Large Language Modelspolitical persuasion&frontier models&risk benchmarking
26.03University of Twente, Enschede, The NetherlandsarxivTOSSS: a CVE-based Software Security Benchmark for Large Language Modelssoftware security benchmarks&CVE&code selection
26.03State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of SciencesarxivINFACT: A Diagnostic Benchmark for Induced Faithfulness and Factuality Hallucinations in Video-LLMsvideo-LLMs&hallucination benchmark&reliability
26.03Fraunhofer AISECarxivYou Told Me to Do It: Measuring Instructional Text-induced Private Data Leakage in LLM Agentsprivacy leakage&README injection&trusted executor dilemma
26.03CISPA Helmholtz Center for Information SecurityarxivUnderstanding LLM Behavior When Encountering User-Supplied Harmful Content in Harmless Tasksharmful content&harmless tasks&content-level ethics
26.03Renmin University of ChinaarxivHomeSafe-Bench: Evaluating Vision-Language Models on Unsafe Action Detection for Embodied Agents in Household Scenarioshousehold safety&vision-language models&embodied agents
26.03Tsinghua University, Beijing, ChinaarxivLABSHIELD: A Multimodal Benchmark for Safety-Critical Reasoning and Planning in Scientific Laboratorieslaboratory safety&multimodal benchmarks&reasoning
26.03KAIST AIarxivUniSAFE: A Comprehensive Benchmark for Safety Evaluation of Unified Multimodal ModelsMultimodal Safety&Benchmarking&Unified Models
26.03Oracle America Inc.arxivIndicSafe: A Benchmark for Evaluating Multilingual LLM Safety in South AsiaMultilingual Safety&Indic Languages&Safety Drift
26.03Department of Computer Science, Stanford University, Stanford, CA, USAarxivTherapyGym: Evaluating and Aligning Clinical Fidelity and Safety in Therapy ChatbotsTherapy Chatbots&Clinical Safety&CBT Fidelity
26.03King Abdullah University of Science and Technology (KAUST), Saudi ArabiaarxivSAVeS: Steering Safety Judgments in Vision-Language Models via Semantic CuesVision-Language Models&Semantic Steering&Multimodal Safety
26.03Robotics Institute, Carnegie Mellon UniversityCVPR 2026Evolving Contextual Safety in Multi-Modal Large Language Models via Inference-Time Self-Reflective MemoryContextual Safety&Self-Reflective Memory&Multimodal Benchmark
26.03Institute of Science TokyoarxivJUBAKU: An Adversarial Benchmark for Exposing Culturally Grounded Stereotypes in Japanese LLMsSocial bias&Japanese LLMs&Cultural stereotypes
26.03Department of Electrical, Computer and Biomedical Engineering, University of Pavia, ItalyarxivSecureBreak - A dataset towards safe and secure modelsSafety dataset&Unsafe outputs&Security alignment
26.03CSIRO Data61, AustraliaarxivDoes Teaming-Up LLMs Improve Secure Code Generation? A Comprehensive Evaluation with Multi-LLMSecCodeEvalSecure code generation&Multi-LLM ensembles&Static analysis
26.03Southern University of Science and TechnologyarxivBeSafe-Bench: Unveiling Behavioral Safety Risks of Situated Agents in Functional EnvironmentsAgent safety&Situated agents&Behavioral benchmark
26.03Ca' Foscari University of Venice, Venice, ItalyarxivReentrancy Detection in the Age of LLMsSmart contract security&Reentrancy detection&LLM evaluation
26.03Institute of Cyberspace Security, School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, ChinaarxivCDH-Bench: A Commonsense-Driven Hallucination Benchmark for Evaluating Visual Fidelity in Vision-Language ModelsVLM hallucination&Commonsense conflict&Visual fidelity
26.03Key Laboratory of High Confidence Software Technologies, Ministry of Education, Peking University, Beijing, ChinaarxivSkillTester: Benchmarking Utility and Security of Agent SkillsAgent skills&Utility evaluation&Security benchmarking
26.03City University of Hong KongarxivTSHA: A Benchmark for Visual Language Models in Trustworthy Safety Hazard Assessment ScenariosSafety hazard assessment&Visual language models&Trustworthy benchmark
26.04Institute of Information Engineering, Chinese Academy of Sciences, Beijing, ChinaarxivDoes Unification Come at a Cost? Uni-SafeBench: A Safety Benchmark for Unified Multimodal Large ModelsUnified multimodal models&Safety benchmark&Intrinsic safety
26.04George Mason UniversityarxivCLAWSAFETY: "Safe" LLMs, Unsafe AgentsAgent safety benchmark&Prompt injection&High-privilege agents
26.04Carnegie Mellon UniversityarxivAGENTSOCIALBENCH: Evaluating Privacy Risks in Human-Centered Agentic Social NetworksPrivacy risks&Agentic social networks&Multi-agent coordination
26.04International Institute of Information Technology, HyderabadarxivImplicitBBQ: Benchmarking Implicit Bias in Large Language Models through Characteristic Based CuesImplicit bias&Bias benchmark&Characteristic cues
26.04Shanghai AI LabarxivATBench: A Diverse and Realistic Trajectory Benchmark for Long-Horizon Agent SafetyAgent safety benchmark&Trajectory evaluation&Long-horizon risks
26.04BenchFlowarxivClawsBench: Evaluating Capability and Safety of LLM Productivity Agents in Simulated WorkspacesOpenClaw&Benchmarking&Productivity Agents
26.04Binjiang Institute of Zhejiang University, Hangzhou, ChinaarxivFrom Retinal Evidence to Safe Decisions: RETINA-SAFE and ECRT for Hallucination Risk Triage in Medical LLMsHallucination&Benchmarking&Healthcare AI
26.04Beijing University of Posts and Telecommunications, Beijing, ChinaarxivDetailVerifyBench: A Benchmark for Dense Hallucination Localization in Long Image CaptionsHallucination&Benchmarking&Multimodal Safety
26.04Graduate School of Data Science, Seoul National UniversityarxivBeyond Case Law: Evaluating Structure-Aware Retrieval and Safety in Statute-Centric Legal QAHallucination&Benchmarking&Emotional Attacks
26.04Microsoft ResearcharxivThe Art of Building Verifiers for Computer Use AgentsArt&Building&Verifiers
26.04University of Wisconsin-MadisonarxivWebSP-Eval: Evaluating Web Agents on Website Security and Privacy TasksWeb Agents&Benchmarking&Multimodal Safety
26.04CyCraft AI Lab, TaiwanarxivTraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step Tool-Calling TrajectoriesHallucination&Benchmarking&Prompt Injection
26.04The Pennsylvania State UniversityACL 2026PIArena: A Platform for Prompt Injection EvaluationBenchmarking&Prompt Injection&Piarena
26.04Zhejiang University, State Key Laboratory of Internet Architecture, Tsinghua UniversityACL 2026ACIArena: Toward Unified Evaluation for Agent Cascading InjectionBenchmarking&Aciarena&Unified
26.04University of Wisconsin–MadisonarxivThe Blind Spot of Agent Safety: How Benign User Instructions Expose Critical Vulnerabilities in Computer-Use AgentsThe Blind Spot of Agent Safety&Expose Critical Vulnerabilities&Blind Spot
26.04Department of Artificial Intelligence, Yonsei UniversityarxivPAC-BENCH: Evaluating Multi-Agent Collaboration under Privacy ConstraintsPAC-BENCH&Evaluating Multi-Agent Collaboration&Evaluating Multi-Agent
26.04Kolega.DevarxivRealVuln: Benchmarking Rule-Based, General-Purpose LLM, and Security-Specialized Scanners on Real-World CodeRealVuln&General-Purpose LLM&Real-World Code
26.04Vulcan Research, AIFTarxivMCPThreatHive: Automated Threat Intelligence for Model Context Protocol EcosystemsMCPThreatHive&Context Protocol Ecosystems&Model Context Protocol
26.04Shanghai Artificial Intelligence LaboratoryarxivBenchmarks for Trajectory Safety Evaluation and Diagnosis in OpenClaw and Codex: ATBench-Claw and ATBench-CodeXTrajectory Safety Evaluation&Evaluation and Diagnosis&Safety Evaluation
26.04School of Computing, National University of SingaporeIJCNN 2026PilotBench: A Benchmark for General Aviation Agents with Safety ConstraintsPilotBench&Safety Constraints&General Aviation Agents
26.04CISPA Helmholtz Center for Information SecurityarxivHarmfulSkillBench: How Do Harmful Skills Weaponize Your Agents?Agent Safety&Skill Ecosystem&Harmful Skills
26.04The Hong Kong University of Science and TechnologyarxivInto the Gray Zone: Domain Contexts Can Blur LLM Safety BoundariesSafety Boundary&Domain Context&Jailbreak
26.04DEXAI – Icaro LabarxivAdversarial Humanities Benchmark: Results on Stylistic Robustness in Frontier Model SafetyFrontier Models&Stylistic Robustness&Safety Benchmark
26.04Korea UniversityarxivHarDBench: A Benchmark for Draft-Based Co-Authoring Jailbreak Attacks for Safe Human–LLM Collaborative WritingCo-Authoring&Jailbreak Benchmark&Collaborative Writing
26.04University of MichiganarxivSafetyALFRED: Evaluating Safety-Conscious Planning of Multimodal Large Language ModelsMultimodal Planning&Safety Benchmark&Embodied AI
26.04Institute for Automation and Applied Informatics, Karlsruhe Institute of TechnologyarxivCyberCertBench: Evaluating LLMs in Cybersecurity Certification KnowledgeCybersecurity&Certification&Benchmark
26.04Department of Artificial Intelligence, Hanyang UniversityACL 2026MHSafeEval: Role-Aware Interaction-Level Evaluation of Mental Health Safety in Large Language ModelsMental Health&Safety Evaluation&Role-Aware
26.04Shanghai Jiao Tong UniversityarxivOS-SPEAR: A Toolkit for the Safety, Performance, Efficiency, and Robustness Analysis of OS AgentsOS Agents&Safety Evaluation&Robustness
26.04Kyushu Institute of TechnologyarxivBenchmarking the Safety of Large Language Models for Robotic Health Attendant ControlHealthcare Robotics&LLM Safety&Benchmark
26.05University of Illinois Urbana-ChampaignarxivML-Bench&Guard: Policy-Grounded Multilingual Safety Benchmark and Guardrail for Large Language ModelsMultilingual Safety&Guardrails&Benchmark
26.05Northeastern UniversityarxivToward a Principled Framework for Agent Safety MeasurementAgent Safety&Measurement Framework&Evaluation
26.05Swarms & AI Lab (SAIL), University of HaifaarxivMOSAIC-Bench: Measuring Compositional Vulnerability Induction in Coding AgentsCoding Agents&Compositional Vulnerability&Benchmark
26.05Pattern Recognition Lab, Friedrich-Alexander-Universität Erlangen-NürnbergarxivSafety and accuracy follow different scaling laws in clinical large language modelsClinical LLMs&Safety Scaling&Accuracy
26.05AIM IntelligencearxivXL-SafetyBench: A Country-Grounded Cross-Cultural Benchmark for LLM Safety and Cultural SensitivityCross-Cultural Safety&Country-Grounded Benchmark&Cultural Sensitivity
26.05Simula Metropolitan Center for Digital EngineeringarxivWhen No Benchmark Exists: Validating Comparative LLM Safety Scoring Without Ground-Truth LabelsSafety Scoring&No Ground Truth&Validation
26.05Tsinghua UniversityACL 2026LeakDojo: Decoding the Leakage Threats of RAG SystemsRAG Leakage&Benchmark&Privacy
26.05Shanghai University of Finance and EconomicsACL 2026FinSafetyBench: Evaluating LLM Safety in Real-World Financial ScenariosFinancial Safety&Benchmark&LLM Evaluation
26.05The Hong Kong Polytechnic UniversityarxivWhen Routine Chats Turn Toxic: Unintended Long-Term State Poisoning in Personalized AgentsAgent Memory&State Poisoning&Personalized Agents
26.05Tencent HunyuanarxivSafe, or Simply Incapable? Rethinking Safety Evaluation for Phone-Use AgentsLLM Agents&Phone-Use Agents&Safety Evaluation
26.05Chung-Ang UniversityarxivCyBiasBench: Benchmarking Bias in LLM Agents for Cyber-Attack ScenariosLLM Agents&Bias&Cybersecurity
26.05Redwood ResearcharxivMonitoringBench: Semi-Automated Red-Teaming for Agent MonitoringRed Teaming&Agent Monitoring&Coding Agents
26.05Stanford UniversityarxivCalBench: Evaluating Coordination-Privacy Trade-offs in Multi-Agent LLMsMulti-Agent Systems&Privacy&Coordination
26.05School of Computer SciencearxivEditRisk-Bench: Benchmarking Safety Risks of Knowledge-Intensive Reasoning under Malicious Knowledge EditingKnowledge Editing&Safety Risks&Reasoning
26.05Nanjing University of Aeronautics and AstronauticsarxivLITMUS: Benchmarking Behavioral Jailbreaks of LLM Agents in Real OS EnvironmentsLLM Agents&Behavioral Jailbreak&OS Security
26.05UC BerkeleyarxivMT-JailBench: A Modular Benchmark for Understanding Multi-Turn Jailbreak AttacksJailbreak&Multi-Turn&Benchmark
26.05Peking UniversityarxivSkillSafetyBench: Evaluating Agent Safety under Skill-Facing Attack SurfacesAgent Skills&Safety Evaluation&Attack Surfaces
26.05University of Southern CaliforniaarxivPRIVACYSIM: Evaluating LLM Simulation of User Privacy BehaviorPrivacy Simulation&User Behavior&LLM Evaluation
26.05University of Southern CaliforniaarxivAgentTrap: Measuring Runtime Trust Failures in Third-Party Agent SkillsAgent Skills&Runtime Trust&Supply Chain
26.05Carnegie Mellon UniversityarxivEXPLOITBENCH: A CAPABILITY LADDER BENCHMARK FOR LLM CYBERSECURITY AGENTSCybersecurity Agents&Exploitation&Benchmark
26.05University of California, Santa BarbaraarxivAuditing Agent Harness SafetyAgent Harness&Safety Audit&Trajectory Evaluation
26.05UnknownarxivThe Great Pretender: A Stochasticity Problem in LLM JailbreakJailbreak&Stochasticity&Evaluation
26.05UnknownarxivDefenses at Odds: Measuring and Explaining Defense Conflicts in Large Language ModelsDefense Conflicts&Sequential Deployment&LLM Safety
26.05School of Information Studies, McGill UniversityarxivTaxonomy and Consistency Analysis of Safety Benchmarks for AI AgentsAgentSafety&BenchmarkTaxonomy&EvaluationConsistency
26.05Anthropic Fellows ProgramarxivSLEIGHT-Bench: A Benchmark of Evasion Attacks Against Agent MonitorsAgentMonitoring&EvasionAttacks&CodingAgents
26.05Stanford UniversityarxivWhy Do Safety Guardrails Degrade Across Languages?MultilingualSafety&Guardrails&JailbreakEvaluation
26.05Southern University of Science and TechnologyarxivTrust No Tool: Evaluating and Defending LLM Agents under Untrusted Tool FeedbackToolAgents&CognitivePoisoning&AgentSecurity
26.05ETH ZuricharxivPOLAR-Bench: A Diagnostic Benchmark for Privacy-Utility Trade-offs in LLM AgentsPrivacyBenchmark&LLMAgents&UtilityTradeoff
26.05University College LondonarxivMeasuring Safety Alignment Effects in Autonomous Security AgentsSecurityAgents&SafetyAlignment&AutonomousSystems
26.05University of Nevada Las VegasarxivRefusal Evaluation in Coding LLMs and Code Agents: A Systematic Review of Thirteen Malicious-Code Prompt Corpora (2023-2025)RefusalEvaluation&MaliciousCode&PromptCorpora
26.05ELLIS Institute Tübingen & MPI-IS & Tübingen AI CenterarxivMeasuring Security Without Fooling Ourselves: Why Benchmarking Agents Is HardAgentSecurity&Benchmarking&Evaluation
26.05Sapienza University of RomearxivBoiling the Frog: A Multi-Turn Benchmark for Agentic SafetyAgenticSafety&MultiTurn&Benchmark
26.05The Pennsylvania State UniversityICML 2026LLM Benchmark Datasets Should Be Contamination-ResistantBenchmarkContamination&UnlearnableDatasets&LLMEvaluation
26.05University of MichiganarxivThe Age of Curiosity Meets the Age of AI: Benchmarking Child Safety in Large Language ModelsChild Safety&Safety Benchmark&LLM Evaluation
26.05Department of Computer Engineering, Bahcesehir University, Istanbul, TurkiyearxivAgentSecBench: Measuring Prompt Injection, Privacy Leakage, and Tool-Use Integrity in LLM AgentsPrompt Injection&Privacy Leakage&Agent Security
26.05HelivanarxivJailbreak susceptibility prediction and mitigation via the behavioral geometry of modelsJailbreak Evaluation&Behavioral Geometry&Defense Transfer
26.05Northwestern University in QatararxivKZ-SafetyPrompts: A Kazakh Safety Evaluation Prompt Dataset for Large Language ModelsMultilingual Safety&Kazakh Dataset&Safety Evaluation
26.05Northwestern University in QatararxivAlbanianLLMSafety: A Safety Evaluation Dataset for Large Language Models in AlbanianAlbanian Safety&Low-Resource Languages&Safety Benchmark
26.05School of Computer Science, University of Nottingham, Jubilee Campus, Wollaton Road, Nottingham, NG8 1BB, United KingdomarxivPrompt Injection Detection is Regime-Dependent: A Deployment-Aware Evaluation with Interpretable Structural SignalsPrompt Injection&Detection Evaluation&Structural Signals
26.05AIM IntelligencearxivWhen Context Flips, Safety Breaks: Diagnosing Brittle Safety in Aligned Language ModelsBrittle Safety&Context Flip&Guardrails
26.05Domyn, Gurugram, IndiaICLR 2026 WorkshopBenchmarking Open-Source Safety Guard Models: A Comprehensive Evaluationsafety guard models&content moderation&benchmark evaluation
26.05McGill UniversityarxivSafeRx-Agent: A Knowledge-Grounded Multi-Agent Framework for Safe and Explainable Medication RecommendationMedication Safety&Clinical Agents&Knowledge Grounding
26.05Department of Electrical and Computer Engineering, National University of SingaporearxivRelevance as a Vulnerability: How Web Retrieval Degrades Safety Alignment in LLM AgentsWeb Retrieval&Safety Alignment&Agent Safety
26.05Qwen DianJin Team, Alibaba Cloud ComputingarxivFinGuard: Detecting Financial Regulatory Non-Compliance in LLM InteractionsFinancial Compliance&Guard Models&Regulatory Benchmarks

📚Resource