README.md

May 11, 2026 · View on GitHub

Autonomous Agents

Autonomous Agents-research papers. Updated daily. Resources-section-section.

Research papers: 2026 4/4

2026 (4/4), 2026 (3/4), 2026 (2/4), 2026 (1/2), 2025 (4/4),2025 (3/4), 2025 (2/4), 2025 (1/4), 2024, 2023, Earlier

Chronological order.

8th May 2026

LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling

AutoTTS: introduces an environment-driven framework that automates the discovery of test-time scaling strategies by shifting human effort from manual heuristic design to constructing replayable discovery environments.
The framework utilizes an explorer LLM to iteratively propose and refine controllers, which are evaluated against pre-collected reasoning trajectories to ensure efficient, deterministic, and scalable search.
By employing beta parameterization and fine-grained execution trace feedback, the system effectively navigates the high-dimensional control space to discover robust strategies that generalize across diverse models and benchmarks.

The Memory Curse: How Expanded Recall Erodes Cooperative Intent in LLM Agents

The Memory Curse: introduces a systematic study of how expanded interaction history in LLM agents paradoxically degrades cooperation in multi-agent social dilemmas due to a cognitive vulnerability to accumulated negative content.
The research identifies that while short-term memory facilitates trust repair, extended memory often triggers defensive, history-following reasoning patterns that override cooperative priors.
The authors demonstrate that this "memory curse" can be mitigated by targeted fine-tuning to promote forward-looking reasoning or by sanitizing memory content, proving the phenomenon is driven by reasoning style rather than context length.

A Roadmap of Mixed Reality Body Doubling for Adults with ADHD

Body Doubling Framework: introduces a multidimensional model for designing and evaluating body doubling setups to support adults with ADHD through Individual Motivation, Agent-Related Dimensions, Interaction-Related Dimensions, Contextual Dimensions, and Efficacy of the Approach.
The framework categorizes body doubling setups using a roadmap that accounts for agent characteristics, interaction dynamics, and environmental context to identify research gaps in mixed reality and interactive systems.
This research provides a structured approach for future empirical studies and the development of personalized, technology-mediated body doubling interventions for neurodivergent individuals.

Reason to Play: Behavioral and Brain Alignment Between Frontier LRMs and Human Game Learners

LRMs: introduces a comparative study evaluating frontier Large Reasoning Models against human game learners and traditional reinforcement learning agents using VGDL-fMRI datasets.
The research demonstrates that frontier LRMs exhibit behavioral patterns and brain-encoding representations that align significantly closer to human game-learning than deep reinforcement learning baselines.
The study establishes that LRM representations predict human BOLD activity an order of magnitude better than RL alternatives, suggesting these models capture transferable, human-like in-context learning dynamics.

Collaborator or Assistant? How AI Coding Agents Partition Work Across Pull Request Lifecycles

Collaborator-Assistant Spectrum: introduces a process-centric framework to analyze how AI coding agents partition operational agency and governance authority across pull request lifecycles.
The framework classifies tools into Collaborator or Assistant paradigms based on initiation patterns and review-routing behaviors observed in 29,585 pull request lifecycles.
It utilizes an Initiator × Approver taxonomy to disentangle operational work from merge governance, revealing that while agents frequently initiate tasks, merge authority remains predominantly human.

Learning CLI Agents with Structured Action Credit under Selective Observation

A3 (Action Advantage Assignment): introduces a reinforcement learning paradigm for CLI agents that addresses partial workspace observation and sparse action credit through structured action analysis.
The framework utilizes σ-Reveal to construct token-budgeted workspace views and employs a three-channel advantage mechanism—episode, turn, and tree scope—to assign credit to shell actions based on their structural intent.
The approach is evaluated on the ShellOps dataset, demonstrating improved performance in complex filesystem interaction tasks by leveraging native shell syntax for credit assignment at a computational cost comparable to standard agentic RL.

Interpreting Reinforcement Learning Agents with Susceptibilities

Susceptibility Framework: introduces a method for neural network interpretability that studies the response of posterior expectation values of observables to perturbations of the regret landscape, utilizing Conv 1-3, FF 1, FF 2, Policy, SGLD, and Regret Landscape.
The framework generalizes susceptibilities to reinforcement learning by measuring how initial state distribution perturbations affect the effective complexity of specific architectural components.
Empirical validation through activation-steering and direction-conditioned posterior regret confirms that susceptibility-detected parameter-space structures correspond to functionally meaningful internal agent computations.

Tool Calling is Linearly Readable and Steerable in Language Models

Tool Calling is Linearly Readable and Steerable in Language Models: introduces a method for steering LLM tool selection by adding a mean-difference vector to the residual stream, which redirects tool choice and automatically adapts argument schemas.
The research demonstrates that tool identity is linearly encoded within a low-dimensional subspace of the model's internal activations, allowing for precise intervention without weight updates.
Mechanistic analysis reveals that tool selection is driven by a three-stage circuit involving early-layer features, mid-layer attention heads, and late-layer formatting, with instruction tuning essential for routing these internal signals to structured output.

Graph Representation Learning Augmented Model Manipulation on Federated Fine-Tuning of LLMs

AugMP: introduces an adversarial framework that leverages graph representation learning to synthesize malicious updates that disrupt the federated fine-tuning of LLMs while evading detection.
The framework utilizes a VGAE to capture feature correlations among benign updates and an iterative manipulation algorithm based on an augmented Lagrangian dual formulation to enforce geometric stealth constraints.
Experimental results demonstrate that AugMP significantly degrades the accuracy of global and local LLMs across multiple backbones while maintaining statistical and geometric consistency with benign updates.

Ask Early, Ask Late, Ask Right: When Does Clarification Timing Matter for Long-Horizon Agents?

Forced-injection framework: introduces a methodology to measure the value of information (VOI) by injecting ground-truth clarifications at controlled trajectory points across diverse long-horizon agent tasks.
The study demonstrates that clarification timing is highly dimension-dependent, with goal information requiring early intervention and input information remaining recoverable through mid-trajectory.
Empirical results reveal that current LLMs fail to naturally ask within these optimal timing windows, establishing a quantitative foundation for designing future timing-aware clarification policies.

TraceFix: Repairing Agent Coordination Protocols with TLA+ Counterexamples

TraceFix: introduces a verification-first pipeline for LLM multi-agent coordination that uses counterexample-driven repair to ensure protocol correctness.
The framework utilizes an Orchestration Agent to synthesize a Protocol Topology IR and PlusCal Coordination Logic, which are then validated by the TLC Model Checker to eliminate concurrency hazards.
Verified protocols are compiled by the Prompt Compiler into per-agent prompts and enforced at runtime by a Topology Monitor to prevent out-of-protocol coordination operations.

AgentEscapeBench: Evaluating Out-of-Domain Tool-Grounded Reasoning in LLM Agents

AgentEscapeBench: introduces an escape-room-style benchmark that evaluates the ability of LLMs to perform tool-grounded reasoning in unfamiliar environments using Template Library, DAG Skeleton Generator, Source Annotator, Value Instantiator, Deterministic Forward Executor, Narrative Generator, and Evaluation Sandbox.
The framework utilizes a six-stage automated pipeline to construct directed acyclic graph tasks that require agents to infer, execute, and revise tool-use procedures under explicit long-range dependency constraints.
Experimental results across sixteen LLMs demonstrate that performance degrades monotonically with dependency depth, highlighting significant bottlenecks in long-range state tracking and intermediate-result propagation.

One World, Dual Timeline: Decoupled Spatio-Temporal Gaussian Scene Graph for 4D Cooperative Driving Reconstruction

DUST (DecoUpled Spatio-Temporal) Gaussian Scene Graph: introduces a cooperative reconstruction framework that decouples pose trajectories into source-specific timelines to eliminate gradient conflicts caused by temporal asynchrony in multi-source driving data.
The framework utilizes a shared canonical Gaussian representation for appearance consistency while maintaining independent pose timelines for vehicle and infrastructure sensors to enable high-fidelity 4D scene reconstruction.
DUST incorporates a static anchor-based pose correction pipeline and a pose-regularized joint optimization scheme to ensure robust initialization and stable training under asynchronous conditions.

ADKO: Agentic Decentralized Knowledge Optimization

ADKO (Agentic Decentralized Knowledge Optimization): introduces a modular framework for collaborative black-box optimization that enables autonomous agents to share semantically rich insights via Private Gaussian Process (GP) surrogate, Knowledge token, Graph-structured communication, LM reasoning module, Fidelity-aware token pruning, and Reasoning score without exposing raw data.
The framework utilizes a Reasoning score that synthesizes private GP-based exploitation and exploration with collaborative success-attraction and failure-avoidance signals derived from peer-shared Knowledge tokens.
ADKO provides a rigorous theoretical foundation for decentralized optimization by decomposing cumulative regret into GP error, LM bias, LM noise, and compression loss, while validating performance through neural architecture search and scientific discovery experiments.

VISTA: Decentralized Machine Learning in Adversary Dominated Environments

VISTA: introduces an adaptive decentralized learning framework that dynamically tunes acceptance thresholds and learning rates to ensure convergence in adversary-dominated environments.
The framework utilizes a consistency-based acceptance rule to incentivize rational worker nodes to provide reliable gradient reports while mitigating adversarial noise.
VISTA achieves asymptotic convergence rates matching standard SGD by balancing update frequency and estimation accuracy through an adaptive signal-to-noise tolerance parameter.

RELAGENT: LLM Agents as Data Scientists for Relational Learning

RELAGENT: introduces an agentic framework that automates relational learning by iteratively searching for optimal SQL feature programs and predictive models using an LLM agent interacting with a database via specialized tools.
The framework decouples the search phase, where an LLM agent refines feature programs and model configurations, from the inference phase, which executes the final SQL-based predictor without further LLM calls.
By maintaining relational structure externally and using executable SQL queries, the approach achieves intrinsic interpretability and relational invariance while matching or exceeding the performance of fixed-architecture relational models.

Unsafe by Flow: Uncovering Bidirectional Data-Flow Risks in MCP Ecosystem

MCP-BiFlow (Model Context Protocol Bidirectional Data-Flow Analysis Framework): introduces a static analysis framework that detects bidirectional data-flow vulnerabilities in Model Context Protocol servers by combining MCP Entrypoint Recovery, MCP-Specific Taint Specification, and Bidirectional Interprocedural Taint Analysis.
The framework addresses request-side propagation, where requester-controlled inputs reach sensitive operations, and return-side propagation, where untrusted external content influences downstream LLM reasoning or tool invocations.
Evaluated on 32 confirmed vulnerabilities, the framework achieves 93.8% recall and identifies 118 unique vulnerability paths across 87 real-world Model Context Protocol server repositories.

Many-to-Many Multi-Agent Pickup and Delivery

M2M: introduces a sequential many-to-many MAPD algorithm that optimizes task allocation across agents, tasks, and multiple potential pickup and delivery locations using Initial Task Allocation, Large Neighborhood Search, and Priority Based Search.
The framework addresses the NP-hard 4-dimensional assignment problem by decomposing cost tensors into matrices to enable efficient computation of task allocations in automated warehouse environments.
Experimental results demonstrate that M2M consistently outperforms existing state-of-the-art methods in task throughput and maintains scalability for up to 150 agents and tasks within a 1-second computation budget.

CyBiasBench: Benchmarking Bias in LLM Agents for Cyber-Attack Scenarios

CyBiasBench: introduces a comprehensive benchmarking framework to quantify attack-selection bias in LLM agents across diverse cyber-attack scenarios using Prompt Design, Agent Penetration Testbed, and Evaluation Metrics Suite.
The framework utilizes a Kali Linux Container for isolated execution against Target Applications, employing a Deterministic Classifier and Verifier Pipeline to measure agent behavior independently of self-reported logs.
The research reveals that LLM agents exhibit persistent, agent-specific attack-selection biases that remain stable across prompt variations and are not consistently improved by explicit steering, a phenomenon termed bias momentum.

SCENE: Recognizing Social Norms and Sanctioning in Group Chats

SCENE: introduces a benchmark for evaluating how LLMs infer and adapt to implicit social norms within multi-party group chats through interaction with scripted personas.
The framework utilizes a two-stage generation process to create synthetic, interactionally plausible scenarios where a subject agent must recognize sanctioning signals and adjust its behavior accordingly.
Evaluation metrics include repair rates after sanctioning, breach rates based on prior demonstrations, and compliance under opposing norm pairs to assess the social adaptation capabilities of various LLMs.

GazeVLM: Active Vision via Internal Attention Control for Multimodal Reasoning

GazeVLM: introduces a multimodal architecture that internalizes active vision by dynamically modulating the causal attention mask of a VLM using tokens to focus on task-relevant visual regions.
The framework employs a two-stage training process, starting with supervised fine-tuning on curated gaze-reasoning traces, followed by reinforcement learning using GRPO to optimize autonomous visual navigation.
By applying a continuous suppression bias to irrelevant visual tokens, GazeVLM achieves high-resolution multimodal reasoning without the computational overhead of external cropping tools or context window inflation.

CktFormalizer: Autoformalization of Natural Language into Circuit Representations

CktFormalizer: introduces a framework that redirects LLM-driven hardware generation through a dependently-typed HDL embedded in Lean 4 to ensure structural correctness and enable formal verification.
The framework utilizes an LLM Coding Agent that leverages immediate compile-time feedback from the Lean type system to iteratively refine hardware designs, effectively bridging the gap between natural language specifications and synthesizable silicon.
CktFormalizer incorporates a closed-loop PPA optimization strategy and automated theorem proving to produce formally verified, optimized hardware implementations that are robust against backend synthesis and routing failures.

Coding Agents Don’t Know When to Act

FIXEDBENCH: introduces a benchmark to evaluate whether coding agents can recognize when software issues are already resolved and abstain from making unnecessary code changes.
The study demonstrates that LLMs exhibit an action bias, frequently applying spurious edits to already-fixed code, which can be mitigated by explicit instructions to verify the issue before acting.
The research highlights that while verify-then-abstain prompting improves abstention rates, it introduces a risk of over-abstention on tasks that genuinely require code modifications.

Emergence of Social Reality of Emotion through a Social Allostasis Model with Dynamic Interpretants

Social Allostasis Model: introduces a computational framework where two agents co-construct social reality by aligning bodily control goals and symbol interpretations through active inference and the Metropolis–Hastings Naming Game.
The framework utilizes POMDP-based generative models to enable agents to perform allostatic regulation of bodily states while simultaneously negotiating shared emotional concepts via symbolic communication.
Experimental results demonstrate that agents successfully converge on shared prior preferences and dynamic symbol interpretations, confirming the emergence of social reality from the interplay of individual bodily regulation and social interaction.

Alternating Target–Path Planning for Scalable Multi-Agent Coordination

TAPF Framework: introduces an iterative refinement approach that decouples target assignment from pathfinding to achieve scalability in multi-agent coordination by leveraging Initial Assignment, MAPF Solver, Feedback Mechanism, Reassignment Strategy, and Final Path Optimization.
The framework utilizes feedback-driven reassignment loops to identify bottleneck agents via Delay-Based Selection or Spectral Bottleneck Sampling, subsequently refining assignments using PIBT or Local Hungarian methods.
Empirical results demonstrate that the framework scales to thousands of agents, significantly outperforming traditional Conflict-Based Search methods in large-scale logistics scenarios.

Securing the Dark Matter: A Semantic-Enhanced Neuro-Symbolic Framework for Supply Chain Analysis of Opaque Industrial Software

SCAA (Supply Chain Analysis Agent): introduces a neuro-symbolic framework that reconstructs behavioral semantics from opaque industrial binaries to perform tractable global risk reasoning.
The framework utilizes a Reflexive Prompting pipeline to constrain a local LLM agent with structural verification, effectively suppressing hallucinations during semantic lifting.
A domain-adapted Graphormer and surjective graph transformation enable scalable vulnerability detection and APT fingerprinting across complex industrial supply chains.

SARC: A Governance-by-Architecture Framework for Agentic AI Systems

SARC: introduces a governance-by-architecture framework that treats constraints as first-class specification objects, comprising S, A, R, and C, enforced at PAG, ATM, PAA, and ER.
The framework enables auditable-by-construction agentic systems by compiling regulatory obligations into runtime checks that maintain system invariants.
SARC provides a specification-to-runtime compilation discipline that separates optimization from admissibility, ensuring compliance is verifiable through structured trace derivation.

SOD: Step-wise On-policy Distillation for Small Language Model Agents

SOD: introduces a step-wise on-policy distillation framework that adaptively reweights distillation strength based on student-teacher divergence to stabilize training for small language model agents.
The framework combines GRPO for trajectory-level exploration with a step-wise OPD objective that attenuates misleading teacher signals in high-divergence regions caused by tool-induced state drift.
Experimental results demonstrate that SOD significantly improves agentic reasoning performance and training stability across math, science, and code benchmarks compared to standard distillation and reinforcement learning baselines.

The AI-Native Large-Scale Agile Software Development Manifesto

AI-Native Large-Scale Agile Software Development Framework: introduces a paradigm shift in software engineering by replacing sequential human-centric processes with an intelligent, adaptive system where Human Experts define intent and AI Agents execute tasks.
The framework utilizes AI Personas, AI Agents, and Skills to enable parallel development, while a Semantic Layer and Knowledge Graphs provide a shared, living context for both humans and agents.
By integrating MCP Connectors for tool interaction and emphasizing reusable blueprints, the approach facilitates continuous verification and autonomous collaboration across large-scale development teams.

DRIP-R: A Benchmark for Decision-Making and Reasoning Under Real-World Policy Ambiguity in the Retail Domain

DRIP-R: introduces a benchmark for evaluating LLM agents in retail scenarios governed by ambiguous policies, utilizing a simulation environment with Scenario Construction, Simulation Environment, Evaluation Pipeline, Orchestrator, LLM-as-a-Judge, Customer Agent, User Simulator, Database, and Tool-calling Agent.
The framework employs a multi-judge evaluation pipeline to assess agent performance across policy adherence, dialogue quality, behavioral alignment, interest alignment, and task-resolution adherence.
The benchmark systematically exploits real-world policy ambiguities to test how LLMs interpret conflicting rules, justify decisions, and balance stakeholder interests in conversational settings.

GASim: A Graph-Accelerated Hybrid Framework for Social Simulation

GASim: introduces a graph-accelerated hybrid framework for large-scale social simulations that utilizes EDG, GOM, and GMP to optimize agent-based modeling performance.
The framework employs EDG to dynamically partition agents into LLM-based core agents and numerical ordinary agents, significantly reducing computational latency.
GOM replaces intensive LLM-based retrieval with lightweight graph propagation, while GMP leverages a GAT to enable parallel opinion updates for ordinary agents.

A Multi-Level Agent-Based Architecture for Climate Governance Integrating Cognitive and Institutional Dynamics

Multi-level Agent-Based Architecture: introduces a modular simulation framework that integrates micro-level HUMAT-MOA behavioural decision-making, meso-level social influence networks, and macro-level institutional strategy to model democratic climate governance.
The architecture utilizes a land-use development proposal as a central interaction artefact, enabling the co-evolution of public opinion, organized advocacy, and formal political decision-making.
By combining empirically grounded cognitive models with strategic institutional rules, the framework provides a transparent and portable approach for simulating complex socio-political dynamics in climate governance.

The Endogeneity of Miscalibration: Impossibility and Escape in Scored Reporting

Credibility Game framework: introduces a formal model of strategic reporting where an agent's combined objective of accuracy and non-accuracy benefits leads to an endogenous impossibility of truthful reporting.
The framework demonstrates that a principal's optimal oversight mechanism is necessarily non-affine, which triggers a structural impossibility that undermines calibration.
The research establishes that a sharp step-function approval threshold serves as a constructive escape, achieving first-best screening for any strictly proper scoring rule.

Operating Within the Operational Design Domain: Zero-Shot Perception with Vision-Language Models

ODD-TAX-232 (Operational Design Domain Taxonomy 232): introduces a framework for zero-shot perception in autonomous driving that utilizes specialized agents including Traffic Signage Expert, Road Marking Expert, Scenery Element Expert, Weather Condition Expert, and Trigger Condition Expert, integrated via a Chained Prompting Pipeline and Knowledge Base within a VLM-based Perception System.
The research evaluates the effectiveness of various VLMs in identifying fine-grained ODD elements by employing persona-based prompting and structured reasoning to enhance zero-shot classification and detection performance.
The study demonstrates that while VLMs show potential for offline ODD auditing and compliance reporting, their current performance requires further refinement for safety-critical, real-time on-board deployment.

MAVEN: Multi-Agent Verification-Elaboration Network with In-Step Epistemic Auditing

MAVEN: introduces a blackboard-inspired multi-agent framework that decouples factual probing from synthesis to transform LLMs into deliberate, auditable reasoners.
The framework utilizes an adversarial Skeptic-Researcher-Judge loop to enforce discrete verification gates and parametric probing, ensuring that reasoning trajectories are factually grounded and logically coherent.
By employing an Adaptive Router and a persistent Knowledge Cache, MAVEN optimizes computational efficiency while maintaining structural rigor across diverse LLM backbones.

Safe, or Simply Incapable? Rethinking Safety Evaluation for Phone-Use Agents

PHONESAFETY: introduces a benchmark of 700 safety-critical moments to distinguish between safe judgment and inability to act in LLM-based phone-use agents.
The framework categorizes agent responses into safe actions, unsafe actions, and failures to do anything useful to prevent misinterpreting harmless outcomes as evidence of safety.
Empirical results demonstrate that general phone-use capability does not reliably predict safe choices, and that failures to act often stem from operational limitations rather than safety-conscious judgment.

MemCompiler: Compile, Don’t Inject — State-Conditioned Memory for Embodied Agents

MemCompiler: introduces a state-conditioned memory compilation paradigm that dynamically selects and compiles relevant experience into dual-channel guidance for embodied agents.
The framework utilizes a Memory Compiler to maintain a structured Brief State and deliver targeted text and latent Soft-Mem tokens to the Executor, preventing attention dilution.
MemCompiler significantly improves performance across multiple embodied benchmarks while reducing executor input tokens and latency by optimizing memory delivery for lightweight LLMs.

Deadline-Driven Hierarchical Agentic Resource Sharing for AI Services and RAN Functions in AI-RAN

HAF (Hierarchical Agentic Framework): introduces a two-layer architecture for AI-RAN that decouples slow-timescale service placement from fast-timescale GPU/CPU allocation to optimize SLO fulfillment.
The framework utilizes an LLM-based agent for intelligent migration decision-making, supported by a predictive critic to mitigate service interruption costs.
A closed-form convex allocator ensures hard real-time RAN deadline satisfaction while dynamically distributing remaining compute resources to elastic AI services.

Multi-Environment POMDPs with Finite-Horizon Objectives

MEPOMDP: introduces a robust framework for sequential decision-making under uncertainty with multiple environments and finite-horizon objectives, utilizing Multi-belief, Multi-expected payoff, and Mixture of policies to compute optimal values.
The paper establishes that the value computation problem for MEPOMDPs is PSPACE-complete, providing both a space-efficient deterministic algorithm and a practically efficient algorithm based on bottom-up frontier construction with pruning.
Empirical evaluations demonstrate that the proposed efficient pruning algorithm significantly outperforms existing state-of-the-art tools across classical benchmarks like Rock Sample, Robot Navigation, and Identification.

Synchronizing Minds through Collective Predictive Coding: A Computational Model of Parent-Infant Homeostatic Co-Regulation

MHNG: introduces a computational model of parent–infant homeostatic co-regulation that integrates POMDP (Individual-level active interoceptive inference) with MHNG (Decentralized Bayesian communication mechanism) and CPC (Social-level predictive coding hypothesis) to achieve latent-state alignment.
The framework utilizes Generative Matrices (Learned internal model parameters) and a Communicative Variable (Shared interactional symbol) to enable agents with asymmetric knowledge to coordinate regulatory actions.
Simulation results demonstrate that MHNG-mediated interaction facilitates rapid synchronization of Internal Representations (Agent latent belief states) and improves visceral state regulation compared to one-sided control conditions.

InterLV-Search: Benchmarking Interleaved Multimodal Agentic Search

InterLV-Search: introduces a benchmark for interleaved multimodal agentic search, which evaluates how LLMs use visual evidence as search pivots to guide multi-hop trajectories across three progressively challenging levels.
The framework includes an InterLV-Agent that utilizes a reason-act-observe loop, supported by Short-term Memory, Long-term Memory, and Tool Integration to manage complex, non-linear search tasks.
Experimental results demonstrate that current LLMs struggle with interleaved multimodal search, highlighting the necessity of active visual evidence seeking and robust search control beyond simple text-centric browsing.

MASPrism: Lightweight Failure Attribution for Multi-Agent Systems Using Prefill-Stage Signals

MASPrism: introduces a two-stage failure attribution framework that leverages prefill-stage signals from an SLM to identify root causes in multi-agent execution traces without requiring decoding, replay, or task-specific training.
The framework utilizes a Filtering stage to identify symptom steps via NLL and candidate sources via attention, followed by a Diagnosis stage that reconstructs a focused prompt to perform fine-grained ranking of failure sources.
By operating exclusively on prefill-stage internal signals, MASPrism achieves significant speedups and cost reductions compared to agent-based or replay-based attribution methods while maintaining high accuracy on long-trace benchmarks.

LiteGUI: Distilling Compact GUI Agents with Reinforcement Learning

LiteGUI: introduces a SFT-free training paradigm for lightweight GUI agents by integrating Guided On-policy Distillation and Multi-solution Dual-level GRPO.
The framework utilizes an automated trajectory generation pipeline to create multi-solution datasets, which are then used to train agents via privileged teacher guidance and dual-level reinforcement learning rewards.
LiteGUI achieves state-of-the-art performance among lightweight models by jointly aligning macro-level subtask planning with micro-level execution matching, effectively mitigating exploration bottlenecks in long-horizon GUI tasks.

HBEE: Human Behavioral Entropy Engine Pre-Registered Multi-Agent LLM Simulation of Peer-Suspicion-Based Detection Inversion

HBEE: introduces a pre-registered multi-agent LLM simulation framework to evaluate the effectiveness of insider threat detection mechanisms against adaptive adversaries.
The framework utilizes LLM-driven agents to model organizational communication and tests how adaptive OPSEC directives influence detection signals in both cascade and blind defender configurations.
Empirical results demonstrate a detection inversion where an adaptive mole becomes less suspicious than innocent peers in a cascade-based detection environment, revealing a decoupling of suspicion metrics.

The Moltbook Files: A Harmless Slopocalypse or Humanity’s Last Experiment

Moltbook Files: introduces a dataset of 232k posts and 2.2M comments from an AI-agent-populated platform, processed through a Collection Pipeline, Preprocessing Pipeline, and PII Anonymization Pipeline.
The research utilizes BERTopic Modeling to characterize agent discourse and Fine-tuning Configuration to assess the impact of synthetic data on LLM factuality and alignment.
The study employs an LLM-as-a-judge Evaluator to demonstrate that while training on agent-generated content degrades model performance, the effect is comparable to fine-tuning on human-generated Reddit data.

EditRefiner: A Human-Aligned Agentic Framework for Image Editing Refinement

EditRefiner: introduces a hierarchical, interpretable, and human-aligned agentic framework that reformulates post-editing correction as a perception-reasoning-action-evaluation loop.
The framework utilizes a Perception Agent to identify flaws, a Reasoning Agent to diagnose them, an Action Agent to perform targeted re-editing, and an Evaluation Agent to ensure alignment with human preferences.
The system is supported by the EditFHF-15K dataset, which provides extensive human-annotated feedback, including saliency maps and MOS scores, to facilitate the development of self-corrective image editing agents.

Sparse Autoencoders as Plug-and-Play Firewalls for Adversarial Attack Detection in VLMs

SAEgis: introduces a lightweight adversarial attack detection framework by inserting sparse autoencoders into pretrained VLMs to identify attack-relevant signals through reconstruction-based latent feature analysis.
The framework utilizes a difference-of-means approach on sparse latent features to detect adversarial perturbations without requiring additional adversarial training.
By ensembling sparse autoencoder signals across multiple layers, the system achieves robust detection performance across diverse domains and unseen adversarial attack methods.

GameGen-Verifier: Parallel Keypoint-Based Verification for LLM-Generated Games via Runtime State Injection

GameGen-Verifier: introduces an automated verification paradigm for LLM-generated games that decomposes specifications into verifiable keypoints and grounds them into independent verification units for parallel execution.
The framework utilizes parameter-based state construction and runtime state patching to instantiate specific game states, bypassing the need for long-horizon gameplay exploration.
GGV-HARNESS provides a two-layer architecture for concurrency management, runtime isolation, and fault recovery, significantly improving verification accuracy and reducing wall-clock time compared to traditional agent-based approaches.

Low-code and No-code with BESSER to Create and Deploy Smart Web Applications

BESSER: introduces an open-source low-code framework that enables the design, generation, and deployment of smart web applications through B-UML, Structural Perspective, Agent Perspective, GUI Perspective, Code Generation Pipeline, Backend Generator, Frontend Generator, BAF Generator, and Deployment Service.
The framework utilizes a model-driven approach to automate the creation of full-stack web applications, including backend APIs, React-based frontends, and LLM-integrated agent backends.
BESSER streamlines the development lifecycle by integrating directly with cloud platforms like GitHub and Render to provide one-click deployment of generated smart web applications.

FlightSense: An End-to-End MLOps Platform for Real-Time Flight Delay Prediction via Rotation-Chain Propagation Features and Agentic Conversational AI

FlightSense: introduces an end-to-end MLOps platform for real-time flight delay prediction that utilizes an XGBoost Classifier, AWS Lambda, Amazon SageMaker, a Streamlit Dashboard, an Amazon Bedrock Nova Micro agent, and a Tool-Use Architecture.
The framework employs a progressive three-version feature engineering approach, prioritizing rotation-chain propagation features to achieve high predictive accuracy.
The system integrates an agentic LLM to provide grounded, natural-language delay risk assessments through multiplicative probability compounding based on live weather and operational data.

A Comprehensive Survey on Agent Skills: Taxonomy, Techniques, and Applications

Agent Skills Survey: introduces a comprehensive taxonomy for LLM-based agent systems, focusing on the lifecycle of reusable procedural artifacts that bridge the gap between raw tool access and robust task execution.
The framework categorizes agent systems into five core modules: Skill Representation, Skill Acquisition, Skill Retrieval and Selection, Skill Evolution, and Runtime Governance.
This survey synthesizes current research on how LLMs can externalize, manage, and refine procedural knowledge to improve the scalability, robustness, and maintainability of autonomous agent ecosystems.

Tools as Continuous Flow for Evolving Agentic Reasoning

FlowAgent: introduces a paradigm that reconceptualizes discrete tool chaining as continuous trajectory generation within a semantic space to enable global plan-level reasoning.
The framework utilizes a Conditional Flow Planner to generate latent trajectories, which are mapped to discrete actions via Discrete Plan Decoding and a Parameterized Stop Mechanism to ensure robust execution.
FlowAgent incorporates a closed-loop execution scheme with a State Update Operator and Observation Encoder to dynamically absorb environmental feedback, effectively mitigating error accumulation in long-horizon tasks.

RCoT-Seg: Reinforced Chain-of-Thought for Video Reasoning and Segmentation

RCoT-Seg: introduces a video-of-thought framework that factorizes Video Reasoning Segmentation into temporal video reasoning and keyframe target perception, utilizing MLLM, AKS, KTG, SAM2, GRPO, and Hungarian-based reward.
The framework employs an agentic keyframe selection mechanism with a self-critical loop to verify and refine keyframe choices, replacing heuristic sampling with a verifiable decision process.
RCoT-Seg leverages GRPO reinforcement learning with a matching-aware reward to align reasoning traces with downstream segmentation, achieving state-of-the-art performance on video reasoning and referring segmentation benchmarks.

Beyond Linear Attention: Softmax Transformers Implement In-Context Reinforcement Learning

Softmax ICTD: introduces a theoretical framework demonstrating that Transformers with standard softmax attention can implement weighted softmax TD updates in their forward pass.
The architecture utilizes a dual-head Transformer design where one head computes current value updates and the other computes target values, effectively performing iterative policy evaluation.
The paper proves that these Transformer parameters are global minimizers of a reinforcement pretraining loss, establishing convergence to the true value function as depth and context length increase.

Discovering Ordinary Differential Equations with LLM-Based Qualitative and Quantitative Evaluation

DoLQ (Discovering Ordinary differential equations with LLM-based Qualitative and quantitative evaluation): introduces a multi-agent framework that integrates LLM-based qualitative reasoning with quantitative numerical validation to discover interpretable ODEs.
The framework utilizes a Sampler Agent to propose candidate terms, a Parameter Optimizer to refine coefficients, and a Scientist Agent to filter physically implausible models through iterative feedback.
By combining semantic assessment with residual-based optimization, DoLQ achieves superior success rates in recovering ground-truth symbolic structures across diverse dynamical systems compared to existing symbolic regression methods.

When Stored Evidence Stops Being Usable: Scale-Conditioned Evaluation of Agent Memory

Scale-Conditioned Evaluation Protocol: introduces a trajectory-level evaluation framework for LLM agent memory that assesses usability under evidence-preserving growth by monitoring interaction budgets and failure regimes.
The protocol utilizes four diagnostics—budget-compliant reliability, tail retrieval-call burden, failure-regime decomposition, and usable-scale boundary—to distinguish between intrinsic memory failures and agent-facing interaction inefficiencies.
Empirical audits across flat, planar, and hierarchical memory interfaces demonstrate that scalable-memory claims must be explicitly conditioned on the specific agent, interface, scale range, and interaction budget.

AT-VLA: Adaptive Tactile Injection for Enhanced Feedback Reaction in Vision-Language-Action Models

AT-VLA: introduces an adaptive tactile injection mechanism that balances pretrained knowledge with newly learned tactile representations to enhance precision in contact-rich manipulation.
The framework utilizes a Tactile Reaction Dual-Stream mechanism to decouple sensory processing into a slow visual-language stream and a fast tactile control stream, enabling real-time closed-loop responses within 0.04 seconds.
By employing a learnable Tactile Gate and Adaptive Cross Attention, the model dynamically modulates modality contributions, ensuring robust performance even when tactile signals are absent during inference.

BioProVLA-Agent: An Affordable, Protocol-Driven, Vision-Enhanced VLA-Enabled Embodied Multi-Agent System with Closed-Loop-Capable Reasoning for Biological Laboratory Manipulation

BioProVLA-Agent: introduces a multi-agent framework for biological laboratory manipulation that integrates protocol parsing, state verification, and embodied execution through the Guiding Decision Agent, Tailored LLM Protocol Agent, VLM-RAG Verification Agent, and VLA Embodied Agent.
The system utilizes AugSmolVLA to enhance the robustness of the VLA Embodied Agent against visual perturbations common in wet-lab environments, such as transparent labware and illumination shifts.
By employing a closed-loop verification mechanism, the framework enables reliable long-horizon biological experimentation while reducing dependence on fixed robotic scripts.

SOM: Structured Opponent Modeling for LLM-based Agents via Structural Causal Model

SOM (Structured Opponent Modeling): introduces a two-stage framework that decouples opponent model construction and prediction by grounding the process in Structural Causal Models (SCMs) to enable structured and controllable reasoning.
The framework utilizes Dynamic SCM Construction to build a causal graph of opponent decision-making and Reasoning for Opponent Prediction and Adaptation to perform inference using personalized reasoning examples.
SOM improves prediction accuracy and adaptability in multi-agent environments by replacing implicit contextual reasoning with explicit, graph-based structured reasoning pathways.

EgoPro-Bench: Benchmarking Personalized Proactive Interaction in Egocentric Video Streams

ProAct-Stream: introduces a comprehensive benchmark and proactive streaming model that leverages simulated user profiles and a "short thinking, better interaction" paradigm to enable personalized HMI.
The framework utilizes a two-stage training process, integrating Supervised Fine-Tuning and Reinforcement Learning to optimize reasoning length and response quality for low-latency interaction.
EgoPro-Bench provides a robust evaluation protocol across 12 distinct domains, addressing the limitations of reactive MLLMs in temporal alignment and personalized intent understanding.

Can Agents Price a Reaction? Evaluating LLMs on Chemical Cost Reasoning

CHEMCOST introduces a benchmark for evaluating LLMs on chemical procurement cost estimation, requiring agents to resolve chemical identities, retrieve supplier quotes, and perform multi-step arithmetic.
The framework utilizes a ReAct agent architecture equipped with specialized tools to navigate complex, domain-constrained procurement tasks under both clean and noise-injected conditions.
Experimental results demonstrate that while tool access is necessary for performance, LLMs struggle with evidence integration and robustness to realistic input formatting, highlighting a significant gap in scientific agentic reasoning.

EnvSimBench: A Benchmark for Evaluating and Improving LLM-Based Environment Simulation

EnvSimBench: introduces a diagnostic framework and benchmark that reframes LLM-based environment simulation as a fully observable state prediction task to address hallucination, logical inconsistency, and state drift.
The framework utilizes a constraint-driven MDP formulation that provides models with explicit before-state configurations and implementation code to enable independent, verifiable evaluation of simulation fidelity.
Systematic evaluation reveals a universal state-change cliff where LLMs fail catastrophically when multiple state variables require simultaneous updates, a gap addressed by the specialized Balance2 model.

MEMOREPAIR: Barrier-First Cascade Repair in Agentic Memory

MEMOREPAIR: introduces a barrier-first cascade-repair contract that manages the lifecycle of derived artifacts in agentic memory by withdrawing invalidated cascades and selectively republishing validated successors.
The framework models agentic memory as a directed provenance graph and utilizes a min-cut solver to optimize the repair-cost tradeoff when reconstructing artifacts after deletion, correction, or migration events.
Experimental results on ToolBench and MemoryArena demonstrate that MEMOREPAIR effectively eliminates stale-memory exposure while recovering the majority of validated successors at a significantly lower repair-operator cost than exhaustive methods.

Rethinking Priority Scheduling for Sequential Multi-Agent Decision Making in Stackelberg Games

HPA (Hierarchical Priority Adjustment): introduces a hierarchical reinforcement learning framework that dynamically optimizes the execution order of agents in N-level Stackelberg Games to improve multi-agent coordination.
The framework utilizes an upper policy to select execution sequences based on the current system state, while lower-level agents execute actions within a Spatio-Temporal Sequential Markov Game (STMG) structure.
By employing a slow-fast update scheme with shared intrinsic rewards, the method effectively coordinates learning across time scales to adapt to changing environmental conditions.

HMACE: Heterogeneous Multi-Agent Collaborative Evolution for Combinatorial Optimization

HMACE: introduces a heterogeneous multi-agent framework that decomposes heuristic search into specialized roles, including Proposer, Generator, Evaluator, and Reflector, to improve combinatorial optimization.
The framework utilizes an archive-based memory to store evaluated heuristics, enabling behavior-aware retrieval that guides the Proposer toward diverse and promising search regions.
By integrating a lightweight deterministic filter and role-specialized collaboration, HMACE achieves a favorable quality-efficiency trade-off while minimizing LLM token consumption compared to monolithic baselines.

Towards Autonomous Business Intelligence via Data-to-Insight Discovery Agent

AIDA (Autonomous Insight Discovery Agent): introduces an end-to-end framework for autonomous business analysis that leverages a proprietary DSL and reinforcement learning to transform complex enterprise data into actionable insights.
The framework models business analysis as a Markov Decision Process, utilizing a structured state representation and a Pareto Principle-guided reward mechanism to prioritize high-leverage insights while suppressing hallucinations.
AIDA employs specialized masking strategies and a dual-channel feedback loop to ensure logical consistency and structural integrity during multi-turn exploration in large-scale industrial data environments.

Learning Agent Routing From Early Experience

BoundaryRouter: introduces a training-free routing framework that leverages Experience Memory and Rubric-guided CoT to dynamically dispatch queries between a Lightweight LLM and a Full Agent.
The framework utilizes a Hybrid Retriever to fetch relevant behavioral examples from the Experience Memory, enabling the Routing LLM to make informed decisions without requiring ground-truth labels.
By employing Rubric-guided CoT, the system ensures stable and consistent routing performance across diverse task distributions, effectively balancing inference latency and accuracy.

HyperEyes: Dual-Grained Efficiency-Aware Reinforcement Learning for Parallel Multimodal Search Agents

HyperEyes: introduces a parallel multimodal search agent that optimizes inference efficiency by fusing visual grounding and retrieval into a single atomic action using Unified Grounded Search (UGS), TRACE, and On-Policy Distillation (OPD).
The framework employs a Parallel-Amenable Data Synthesis pipeline and Progressive Rejection Sampling to curate high-quality, non-redundant trajectories for training LLMs.
HyperEyes establishes a new state-of-the-art for open-source agents by Pareto-dominating existing models in both accuracy and operational efficiency across six multimodal search benchmarks.

Repeated Deceptive Path Planning against Learnable Observer

DeMP (Deceptive Meta Planning): introduces a two-level optimization framework that combines episode-level adaptation and meta-level updates to enable sustained deception against a learnable observer.
The framework utilizes meta-level updates to anticipate the observer's learning dynamics, effectively mitigating the accumulation of adaptation lag inherent in repeated adversarial interactions.
DeMP leverages a surrogate objective that aligns with belief-induced rewards, ensuring the agent maintains deceptive performance while adapting to evolving adversarial recognition models.

Rethinking Experience Utilization in Self-Evolving Language Model Agents

ExpWeaver: introduces a lightweight paradigm that interweaves experience utilization into the agent's decision-making process by exposing experience as an optional resource during reasoning.
The framework utilizes a Reasoning (Iterative cognitive process) → Action (Environment-interactive step) loop, augmented with a Trigger Token (Special [Retrieve] signal) to activate the Retrieval Mechanism (Context-aware memory lookup) only when additional guidance is required.
ExpWeaver enables LLMs to selectively invoke experience from the Experience Repository (External memory of past interactions) at beneficial decision points, effectively reducing reliance on rigid initialization-only or always-on usage strategies.

SREGym: A Live Benchmark for AI SRE Agents with High-Fidelity Failure Scenarios

SREGym: introduces a high-fidelity, modular benchmark for evaluating AI agents in diagnosing and mitigating complex, multi-layered failures within live cloud-native production environments.
The framework utilizes a System Environment (production-like cloud-native environment), Agent Interface (MCP servers for observability and control), Fault and Noise Injectors (orchestrators for distributed failure scenarios), Oracles (diagnosis and mitigation verification), and a Leaderboard (performance tracking) to provide a rigorous evaluation platform.
SREGym challenges LLMs by simulating diverse failure modes, including metastable and correlated failures, while requiring agents to distinguish between actual root causes and ambient noise.

MATHLIBPR: Pull Request Merge-Readiness Benchmark for Formal Mathematical Libraries

MATHLIBPR: introduces a benchmark for evaluating whether LLMs can judge the merge-readiness of build-passing pull requests in the Mathlib4 formal mathematical library.
The framework utilizes a staged evaluation protocol that provides increasing levels of context—ranging from code-local diffs to automated diagnostics and PR intent—to assess the reviewer-like judgment capabilities of LLMs and LLM agents.
Empirical results demonstrate that current LLMs and agents struggle to distinguish merge-ready from not-merge-ready contributions, even when provided with repository-wide context and diagnostic signals.

Can You Break RLVER? Probing Adversarial Robustness of RL-Trained Empathetic Agents

AEB: introduces a benchmark for evaluating the adversarial robustness of RL-trained empathetic agents across six psychologically grounded dialogue trajectories.
The framework utilizes an adversarial simulator to test if LLMs can address latent emotional needs when users exhibit behaviors like gaslighting, escalation, or mood reversal.
The study reveals a dissociation where RL-trained LLMs improve emotional responsiveness and final scores without significantly enhancing observable state tracking as measured by the Emotional Consistency Score.

Demystifying and Detecting Agentic Workflow Injection Vulnerabilities in GitHub Actions

TAINTAWI: introduces a static analysis framework to detect Agentic Workflow Injection (AWI) vulnerabilities in GitHub Actions by modeling data flows from untrusted event contexts to agent-facing prompts and downstream workflow sinks.
The framework constructs an Agentic Workflow Dependency Graph (AWDG) to capture cross-boundary control and data dependencies, enabling the identification of Prompt-to-Agent (P2A) and Prompt-to-Script (P2S) injection patterns.
TAINTAWI identifies 496 exploitable AWI vulnerabilities in real-world workflows, including 343 previously unknown zero-day cases, by performing reachability analysis against action-level and workflow-level security guards.

Region4Web: Rethinking Observation Space Granularity for Web Agents

Region4Web: introduces a framework that reorganizes the AXTree into functional regions through hierarchical decomposition and semantic abstraction to provide a more informative basis for web agent state understanding.
PageDigest: provides a web-specific inference pipeline that delivers region-level observations as a compact digest, reducing observation length while maintaining task-relevant information across steps.
The framework improves web agent performance on the WebArena benchmark by enabling efficient, region-based page state understanding that scales across diverse backbone LLMs.

RRCM: Ranking-Driven Retrieval over Collaborative and Meta Memories for LLM Recommendation

RRCM: introduces a ranking-driven retrieval-and-reasoning framework that dynamically constructs decision-relevant contexts for LLMs by selectively accessing collaborative and meta memories.
The framework utilizes an LLM as an agent that interleaves reasoning with on-demand memory retrieval, optimized end-to-end via reinforcement learning to maximize final recommendation quality.
By replacing static retrieval rules with an adaptive policy, RRCM effectively mitigates context-length bottlenecks and improves recommendation accuracy for both popular and long-tail items.

The Position Curse: LLMs Struggle to Locate the Last Few Items in a List

Position Curse: introduces a systematic failure in LLMs where models struggle to retrieve items by position in short lists, particularly when using backward indexing.
The paper characterizes this failure across four axes: query direction, anchor type, indexing direction, and item type, demonstrating that backward retrieval consistently lags behind forward retrieval.
The authors propose POSBENCH and PYINDEX to evaluate and mitigate this positional deficit, showing that while fine-tuning improves performance, the underlying capability remains far from saturated.

Convergence and Emergence of In-Context Reinforcement Learning with Chain of Thought

ICRL (In-Context Reinforcement Learning): introduces a theoretical framework where a single-layer linear Transformer with Chain-of-Thought generation performs iterative Temporal Difference learning updates during the forward pass.
The framework utilizes a structured context buffer to store interaction trajectories and intermediate weight iterates, enabling the model to refine its policy evaluation performance through successive CoT steps.
Theoretical analysis establishes that the CoT generation process converges geometrically to the Temporal Difference fixed point, with performance saturating at a statistical floor determined by the context length.

RepoZero: Can LLMs Generate a Code Repository from Scratch?

ACE (Agentic Code-Test Evolution): introduces a benchmark and framework for repository-level code generation that utilizes Source Repository, Test Files Generator, Test Case Generator, Environment Construction, Filtering, Coding Agent, Testing Agent, and ACE Workflow.
The framework reformulates repository generation as a reproduction task, requiring agents to implement functionality from scratch based on API specifications while ensuring output equivalence through automated black-box testing.
By integrating an iterative code-test feedback loop, the ACE framework enhances the success rate of complex repository-level synthesis by leveraging deterministic ground-truth outputs from source repositories.

Switchcraft: AI Model Router for Agentic Tool Calling

Switchcraft: introduces a specialized model router for agentic tool calling that utilizes a DistilBERT classifier and a cost model to select the most cost-effective LLM while maintaining correctness.
The system employs an AST-based checker to evaluate tool call accuracy across diverse benchmarks, enabling the router to learn from realized model behavior rather than chat-based preferences.
Switchcraft achieves near-oracle accuracy while significantly reducing inference costs by dynamically routing queries to the cheapest LLM capable of executing the required tool calls.

Securing Computer-Use Agents: A Unified Architecture–Lifecycle Framework for Deployment-Grounded Reliability

CUA Architecture–Lifecycle Framework: introduces a diagnostic lens for deployment-grounded reliability by mapping architectural layers (Perception, Decision, Execution) to lifecycle stages (Creation, Deployment, Operation, Maintenance).
The framework distinguishes between failure origin and failure manifestation, identifying how upstream choices in Creation and Deployment create latent risks that surface during Operation or Maintenance.
The paper provides a security and privacy taxonomy that links threat surfaces to specific lifecycle stages, emphasizing that effective control placement must follow the timing of failure introduction.

ARMOR: An Agentic Framework for Reaction Feasibility Prediction via Adaptive Utility-aware Multi-tool Reasoning

ARMOR: introduces an agentic framework that explicitly models tool-specific utilities, adaptively prioritizes tools, and resolves tool conflicts to improve reaction feasibility prediction.
The framework utilizes a hierarchical tool structure, pattern-based utility assessment, and a memory-augmented reasoning mechanism to leverage complementary strengths of multiple tools.
Experimental results demonstrate that ARMOR consistently outperforms existing tool aggregation and selection methods by effectively identifying appropriate tools for specific reaction characteristics.

Decentralized Diffusion Policy Learning for Enhanced Exploration in Cooperative Multi-agent Reinforcement Learning

DDPL: introduces a decentralized MARL algorithm that replaces unimodal Gaussian policies with expressive DDPMs to enable effective exploration in high-dimensional joint action spaces.
The framework utilizes ISSM to enable efficient online training of diffusion policies without requiring samples from the target energy-based distribution.
DDPL provides theoretical sample-complexity and Nash equilibrium gap guarantees, demonstrating improved performance and sample efficiency across continuous-action MARL benchmarks.

TeamBench: Evaluating Agent Coordination under Enforced Role Separation

TeamBench: introduces a benchmark for evaluating LLM agent coordination by enforcing strict role separation via operating system permissions rather than relying on prompt-based instructions.
The framework utilizes a Planner, Executor, and Verifier to ensure that no single agent can simultaneously read full requirements, modify the workspace, and certify the final answer.
Experimental results demonstrate that while teams rarely outperform single agents on average, the benchmark effectively exposes coordination costs and Verifier failure modes that pass-rate metrics typically mask.

Social Theory Should Be a Structural Prior for Agentic AI: A Formal Framework for Multi-Agent Social Systems

MASS (Multi-Agent Social Systems): introduces a formal framework for modeling agentic AI systems as dynamical systems where system-level outcomes emerge from the interaction of heterogeneous Agents, Information exchange function (f), Influence dynamics function (g), and Networked interaction structure (G).
The framework identifies four structural priors—strategic heterogeneity, network-constrained dependence, co-evolution, and distributional instability—to explain how agent interactions generate emergent social dynamics.
The authors demonstrate the framework's utility by analyzing an LLM-based social network, MoltBook, showing that these structural priors are empirically present in multi-agent systems.

2.5-D Decomposition for LLM-Based Spatial Construction

2.5-D Decomposition: introduces a neuro-symbolic pipeline that improves spatial reasoning by restricting the LLM to 2D horizontal planning while a deterministic executor handles vertical placement.
The architecture incorporates an Instruction Parser, Structure Analyzer, Build Planner (LLM), Plan Verifier, Spatial Executor, and Response Formatter to eliminate systematic coordinate errors.
By removing deterministic dimensions from the LLM output space, the system achieves high structural accuracy on construction tasks and transfers effectively to edge hardware.

From Assistance to Agency: Rethinking Autonomy and Control in CI/CD Pipelines

Agentic CI/CD: introduces a conceptual framework for authority transfer in software delivery, distinguishing between data-plane authority and control-plane authority to categorize agent autonomy.
The paper characterizes current systems as operating under bounded autonomy, where LLM-based agents are confined to data-plane tasks while relying on external governance mechanisms for safety.
The authors propose a research agenda prioritizing control-plane safety, formalization of autonomy boundaries, and the development of evaluation frameworks to address the challenges of delegating decision-making to agents.

Do Joint Audio-Video Generation Models Understand Physics?

AV-Phys Bench: introduces a comprehensive benchmark for evaluating physical commonsense in joint audio-video generation models across steady-state, event-transition, and environment-transition scene categories.
The framework utilizes an AV-Phys Agent, which integrates a multimodal LLM with deterministic audio-DSP tools to provide scalable, rubric-based judgments of physical consistency.
Experimental results reveal a significant gap between semantic adherence and physical consistency, with scene transitions emerging as the primary difficulty frontier for current generative models.

MedExAgent: Training LLM Agents to Ask, Examine, and Diagnose in Noisy Clinical Environments

MedExAgent: introduces a POMDP-based framework for interactive clinical diagnosis that integrates patient questioning, medical exam tool-calling, and final diagnosis under a unified agent architecture.
The framework employs a two-stage training pipeline consisting of supervised fine-tuning on Calgary-Cambridge-structured conversations followed by DAPO reinforcement learning to optimize for diagnostic accuracy, tool-call quality, and exam cost.
MedExAgent incorporates a systematic noise model to simulate real-world clinical uncertainty, including seven patient noise types and three exam noise types, to enhance agent robustness.

More Than Can Be Said: A Benchmark and Framework for Pre-Question Scientific Ideation

InciteResearch: introduces a multi-agent framework that transforms vague human research inspirations into structured, actionable proposals by decomposing the Socratic questioning chain into E, V, and N components.
The framework utilizes a cognitive state-machine to manage the research pipeline, ensuring that each stage—Elicitation, Reframing, and Necessity Checking—serves as a logical prerequisite for the next.
The authors also introduce TF-Bench, a benchmark designed to evaluate the capacity of LLMs to assist in the early, tacit stages of scientific ideation across four distinct scientific modes.

Safactory: A Scalable Agentic Infrastructure for Training Trustworthy Autonomous Intelligence

Safactory: introduces a scalable infrastructure framework for training trustworthy autonomous agents by integrating Parallel Simulation Platform, Trustworthy Data Platform, and Autonomous Evolution Platform into a continuous closed-loop evolutionary pipeline.
The framework utilizes a Parallel Simulation Platform for high-concurrency trajectory generation, a Trustworthy Data Platform for structured data asset management, and an Autonomous Evolution Platform for asynchronous reinforcement learning and policy distillation.
Safactory enables systematic risk discovery and model improvement by transforming security evaluation from a one-time process into a continuous, data-driven lifecycle that includes planning-, perception- and tool use-agents.

Proactive Instance Navigation with Comparative Judgment for Ambiguous User Queries

ProCompNav: introduces a two-stage framework for ambiguous instance navigation that constructs a candidate pool and identifies the target through comparative judgment using MLLM and NLI Verifier components.
The framework replaces independent matching with a collect-then-compare pipeline, utilizing binary questions to prune candidates based on discriminative attributes.
ProCompNav improves success rates and reduces user burden by leveraging comparative judgment to resolve target ambiguity among visually similar distractors.

Entropy-Regularized Adjoint Matching for Offline Reinforcement Learning

ME-AM (Maximum Entropy Adjoint Matching): introduces a unified framework that resolves the Support-Binding Dilemma in offline RL by integrating a Reward-Guided Mixture Prior and functional Mirror Descent into the continuous flow formulation.
The framework utilizes a Reward-Guided Mixture Prior to expand the geometric support of the generative model and employs functional Mirror Descent to flatten the empirical behavior density, thereby mitigating popularity bias.
ME-AM achieves state-of-the-art performance on sparse-reward continuous control benchmarks by embedding score-based entropy gradients directly into the terminal boundary condition of the Adjoint Matching ODE.

Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning

Skill1: introduces a unified framework that trains a single policy to co-evolve skill selection, utilization, and distillation toward a shared task-outcome objective.
The framework decomposes the task-outcome signal into a low-frequency trend for selection and a high-frequency variation for distillation to enable simultaneous optimization of all three capabilities.
Experiments on ALFWorld and WebShop demonstrate that Skill1 outperforms prior skill-based and RL baselines by achieving unified convergence across the entire agent lifecycle.

Breaking, Stale, or Missing? Benchmarking Coding Agents on Project-Level Test Evolution

TEBench (Test Evolution Benchmark): introduces the first project-level benchmark for test evolution, requiring autonomous identification and update of tests across entire repositories following production code changes.
The framework evaluates LLM-based agents using a three-version structure to classify test evolution into Test-Breaking, Test-Stale, and Test-Missing categories.
Experimental results reveal a shared performance ceiling across seven configurations, highlighting that Test-Stale is the most challenging evolution type due to the lack of explicit execution failure signals.

On Time, Within Budget: Constraint-Driven Online Resource Allocation for Agentic Workflows

MCPP (Monte Carlo Portfolio Planning): introduces a closed-loop planning framework that dynamically allocates models and parallel samples to subtasks to maximize the probability of completing agentic workflows within explicit budget and deadline constraints.
The framework treats budget and time as hard success conditions, shifting the optimization objective from average efficiency to instance-specific constrained completion probability.
By simulating downstream workflow outcomes using a portfolio of continuation policies, the planner selects actions that prioritize bottleneck subtasks while managing remaining resources.

Milestone-Guided Policy Learning for Long-Horizon Language Agents

BEACON: introduces a milestone-guided policy learning framework that addresses credit misattribution and sample inefficiency in long-horizon LLM agents by partitioning trajectories at milestone boundaries, applying temporal reward shaping, and utilizing dual-scale advantage estimation.
The framework leverages the compositional structure of long-horizon tasks to decouple credit assignment across phases, ensuring that local action quality is isolated from downstream stochasticity.
Experimental results on ALFWorld, WebShop, and ScienceWorld demonstrate that BEACON significantly improves success rates and sample utilization compared to existing trajectory-level reinforcement learning methods.

MAS-Algorithm: A Workflow for Solving Algorithmic Programming Problems with a Multi-Agent System

MAS-Algorithm: introduces a multi-agent workflow that decomposes algorithmic problem solving into modular stages, including Agent1 (predicts relevant algorithm labels), Agent2 (retrieves external algorithmic knowledge), Agent3 (generates structured solution plans), Agent4 (translates plans into executable code), and Agent5 (diagnoses errors and provides feedback).
The framework integrates specialized agents with auxiliary components like a RAG Module (retrieves domain-specific information), Replanning Module (revises reasoning plans), Revising Module (updates generated code), Judging Tools (executes and validates code), a Format Conversion Tool (standardizes intermediate outputs), and a Sample Extraction Tool (parses problem specifications).
Experimental results demonstrate that this multi-agent approach consistently improves the performance of LLMs on algorithmic tasks, outperforming both direct prompting and parameter-efficient fine-tuning.

Active Learning for Communication Structure Optimization in LLM-Based Multi-Agent Systems

Active Learning for Communication Structure Optimization in LLM-MAS: introduces an ensemble-based information-theoretic framework that optimizes communication structures by selecting the most informative tasks for LLM-MAS training.
The framework utilizes Representative Selection to reduce candidate pools and an EKI-based Utility Estimator to approximate Bayesian posterior updates for graph parameters without requiring gradients.
By incorporating surrogate modeling and batch Thompson sampling, the approach efficiently identifies informative tasks to improve downstream accuracy and robustness under constrained computational budgets.

Exploring a Virtual Pet to Provide Context Notifications in a Tourism Recommender System: a Pilot Study

GRS: introduces a context-aware tourism recommendation framework that utilizes a virtual pet as a social mediator to deliver personalized alerts and mitigate notification fatigue.
The system integrates real-time IoT environmental data with a Multi-Agent Microservice to generate personalized recommendations and safety-critical notifications.
A pilot study demonstrates that the virtual pet interface enhances hedonic quality and user acceptance by transforming technical alerts into friendly, character-mediated companion advice.

7th May 2026

UX in the Age of AI: Rethinking Evaluation Metrics Through a Statistical Lens

ADUX-Stat (Adaptive Dynamic UX Statistical Framework): introduces a three-construct model designed to evaluate usability in AI-mediated environments by treating user experience as a probabilistic signal distribution.
The framework integrates IEI to measure output variability, TDC to track longitudinal usability trends, and BUCS to provide Bayesian credible intervals for uncertainty quantification.
ADUX-Stat addresses the limitations of classical UX metrics by providing a statistically grounded, field-deployable methodology for evaluating non-deterministic AI interfaces.

Incentive Design in Competitive Resource Allocation: Exploiting Valuation Asymmetry in Tullock Contests

Incentive Design in Competitive Resource Allocation: introduces a mechanism for a central coordinator to influence subordinate agents in a multi-player Tullock contest by strategically manipulating their reported valuations of a contested prize.
The framework leverages cost asymmetry among subordinates to steer the Nash equilibrium toward outcomes that maximize the coordinator's utility.
Analytical results demonstrate that optimal valuations for subordinates are proportional to the square root of their per-unit costs, effectively reducing the coordinator's optimization problem to a two-variable system.

The Causally Emergent Alignment Hypothesis: Causal Emergence Aligns with and Predicts Final Reward in Reinforcement Learning Agents

Causal Emergence Alignment Hypothesis: introduces a framework for evaluating how causal emergence in RL agents aligns with and predicts final reward performance across diverse environments.
The study utilizes ΦID decomposition on latent-space trajectories to quantify causal emergence as a measure of agent integration and goal-directed representational reorganization.
Experimental results demonstrate that causal emergence provides a robust, early-training directional signal that outperforms standard neural representation metrics in predicting final learning outcomes.

The Context Gathering Decision Process: A POMDP Framework for Agentic Search

CGDP: introduces a formal POMDP-based framework for agentic search that replaces implicit LLM state tracking with explicit, modular infrastructure.
The framework utilizes a persistent Belief State to bound context and a programmatic Exhaustion Gate to prevent redundant looping and premature stopping.
Empirical validation across four agent harnesses demonstrates that these modular interventions improve multi-hop reasoning and reduce token consumption without sacrificing accuracy.

Beyond Task Success: Measuring Workflow Fidelity in LLM-Based Agentic Payment Systems

HMASP (Hierarchical Multi-Agent System for Payments): introduces a trajectory-level evaluation metric, ASR, to measure workflow fidelity in LLM-based agentic payment systems by comparing observed agent execution sequences against expected paths.
The framework utilizes a Payment Supervisor and CPA to manage payment workflows, incorporating prompt engineering and deterministic routing guards to ensure compliance with auditable procedural steps.
Evaluation across 18 LLMs reveals that standard metrics like Task Success Rate often fail to detect systematic workflow shortcuts, whereas ASR identifies deviations in agent handoffs essential for regulatory compliance.

Cognitive Agent Compilation for Explicit Problem Solver Modeling

CAC: introduces a failure-driven learning pipeline that compiles problem-solving knowledge from a Teacher LLM into an explicit, inspectable Knowledge Base for a target Cognitive Agent.
The framework utilizes a Cognitive Agent (SLM) that performs deterministic actions based on retrieved declarative memory, while the Teacher LLM iteratively refines the Knowledge Base to ensure task success.
This approach addresses the opacity of LLMs in educational settings by decoupling knowledge representation from reasoning, facilitating more transparent and editable learner modeling.

PACEvolve++: Improving Test-time Learning for Evolutionary Search Agents

PACEvolve++: introduces an advisor-model reinforcement learning framework that decouples strategic search decisions from code implementation to enable effective test-time policy adaptation in evolutionary search agents.
The framework utilizes a trainable Advisor LLM for hypothesis generation and selection, while delegating the realization of executable code to a stronger, frozen Frontier Implementation LLM.
To handle non-stationary search dynamics, the system employs a phase-adaptive reinforcement learning objective that transitions credit assignment from group-relative feedback during early exploration to frontier-contribution feedback during late-stage refinement.

Self Driving Datasets: From 20 Million Papers to Nuanced Biomedical Knowledge at Scale

Starling: introduces a multi-agent deep research system that autonomously converts unstructured biomedical literature into large-scale, nuanced, and accurate structured datasets.
The system utilizes a Proposer agent, Validator agent, Investigator agent, Extractor agent, and Judge agent to perform iterative query construction, schema induction, and high-fidelity data extraction.
Starling leverages an entity tagging model and a hybrid sparse–dense retrieval layer to efficiently process a 22.5 million paper PubMed corpus, outperforming manual curation in scale, accuracy, and contextual nuance.

Problem Space Attunement in Youth Social Media Design

Problem Space Attunement in Youth Social Media Design: introduces a research framework that addresses misattunement in youth social media design through Fictional Inquiry, Asynchronous Remote Community, and an LLM-agent simulation sandbox.
The framework utilizes ego-anchored agents to simulate social dynamics, allowing youth to evaluate design choices within familiar, grounded contexts.
This approach shifts design research from adult-centric assumptions toward youth-articulated goals, criteria, and boundary conditions for relationally supportive social media.

Dual-Agent Co-Training for Health Coaching via Implicit Adversarial Preference Optimization

DACT: introduces a dual-agent co-training framework that interactively optimizes both a health coach agent and a client simulator using implicit adversarial preference optimization.
The framework utilizes a multi-dimensional LLM judge to construct Pareto-dominant preference pairs, enabling the coach to improve across multiple clinical dimensions while the client learns to provide increasingly challenging interactions.
This co-evolutionary process admits a natural stochastic-game interpretation, where alternating DPO updates for the coach and client agents drive robust performance improvements and significant reduction in clinical anti-patterns.

SmellBench: Evaluating LLM Agents on Architectural Code Smell Repair

SmellBench: introduces a task orchestration framework for evaluating LLM agents on architectural code smell repair, utilizing an LLM Agent, SmellBench MCP Server, Repository, Agent Skill, Trigger Prompt, GEPA Prompt Optimization Pipeline, Static Validator, Judge Model, and Task Lifecycle State Machine.
The framework employs a GEPA pipeline to generate smell-specific execution playbooks and few-shot demonstrations, which are delivered to agents via a structured task packet.
Evaluation of 11 agent configurations reveals that repair aggressiveness and net codebase quality are inversely related, with the best agents achieving a 47.7% resolution rate on architectural smells.

Echo: KV-Cache-Free Associative Recall with Spectral Koopman Operators

Echo: introduces a KV-cache-free architecture that replaces standard attention and feedforward layers with Spectral Koopman Attention and Koopman MLP components to enable constant-memory retrieval.
The framework utilizes Mamba-2/SSM Backbone for local processing while employing SKA to accumulate sufficient statistics in fixed-size Sufficient Statistic Accumulators, effectively eliminating the memory cliff associated with standard SSMs.
By recasting content-addressed retrieval as a closed-form ridge regression solved via Cholesky Factorization and Power Spectral Filter, the architecture maintains high retrieval accuracy across long sequences without linear KV cache growth.

Why Does Agentic Safety Fail to Generalize Across Tasks?

Agentic Safety Generalization Framework: introduces a theoretical and empirical investigation into why agentic safety fails to generalize across tasks, demonstrating that the relationship between a task and its safe execution is inherently more complex than the relationship between a task and its execution alone.
The paper proves that the mapping from task specification to an optimal controller has a higher Lipschitz constant under safety requirements than without, establishing a formal sense in which generalization is fundamentally more difficult with safety constraints.
Empirical experiments across linear-quadratic control, neural network-based quadcopter navigation, and LLM-based CRM agents corroborate the theoretical findings, showing that safe behavior is significantly harder to transfer to unseen tasks than standard execution behavior.

The Cost of Consensus: Malignant Epistemic Herding and Adaptive Gating in Distributed Multi-Agent Search

Entropy-delta gating framework: introduces a decentralized multi-agent search architecture that mitigates malignant epistemic herding by conditioning communication on information novelty using Shannon entropy as both a transmission gate and a fusion weight.
The framework utilizes an inverse-entropy fusion rule to dynamically weight incoming messages, ensuring that more confident agents exert greater influence on the collective belief state.
By replacing continuous high-frequency communication with event-triggered updates, the system achieves superior epistemic alignment and task success while reducing bandwidth consumption by over 98%.

Group of Skills: Group-Structured Skill Retrieval for Agent Skill Libraries

GOSKILLS (Group of Skills): introduces an inference-time retrieval method that organizes skills into role-labeled execution contexts using Skill Library, Typed Skill Graph, Group Pool, Group Graph, Inverted Index, Query Schema, Anchor Selection, Support Expansion, Bottlenecking, Execution Contract, and Coverage-Debt Accounting.
The framework improves agent performance by replacing flat skill lists with structured, anchor-centered groups that explicitly define execution entry points, support roles, and visible requirements.
GOSKILLS preserves visible-requirement coverage under constrained budgets and reduces agent-only runtime by minimizing the organizational burden on the LLM during task execution.

Learning and Reusing Policy Decompositions for Hierarchical Generalized Planning with LLM Agents

HCL-GP (Hierarchical Component Learning for Generalized Policies): introduces a dynamic policy-learning architecture that integrates generalized planning and hierarchical task decomposition to synthesize reusable, parameterized policies for LLM agents.
The framework utilizes a multi-agent system comprising Task Abstraction and Parameterization Agent, Component Search Agent, Policy Generator (Planning) Agent, Decomposition Agent, and Generalize and Deduplicate Agent to iteratively learn, refine, and store executable policy components.
By extracting and generalizing procedural patterns from successful executions, the approach enables efficient cross-domain transfer and improves iteration efficiency in complex, multi-step interactive environments.

Multi-Objective Constraint Inference using Inverse reinforcement learning

MOCI: introduces a framework that jointly recovers shared hard constraints and individual preferences from unlabeled, heterogeneous expert trajectories using an Expectation-Maximization approach.
The framework utilizes Maximum Entropy Inverse Reinforcement Learning to model diverse expert behaviors by clustering trajectories into latent types while simultaneously identifying forbidden states via greedy search.
Empirical results demonstrate that MOCI achieves superior predictive accuracy and computational efficiency compared to existing baseline methods in multi-objective GridWorld environments.

Bridging the Last Mile of Circuit Design: POSTEDA-BENCH, a Hierarchical Benchmark for PPA Convergence and DRC Fixing

POSTEDA-BENCH: introduces a hierarchical benchmark for evaluating LLM-based agents on post-EDA closure, including DRC-Bench and PPA-Bench, supported by both open-source and commercial toolchains.
The benchmark evaluates LLM-based agents across DRC-Essential, DRC-Reasoning, PPA-Mono, and PPA-Multi, identifying that trade-off reasoning is the primary bottleneck for multi-objective PPA convergence.
Experimental results across eight LLMs demonstrate a synthetic-to-practical performance gap, where vision augmentation consistently enhances DRC-Bench performance.

MAGIQ: A Post-Quantum Multi-Agentic AI Governance System with Provable Security

MAGIQ: introduces a post-quantum secure governance framework for multi-agentic AI systems that enforces user-defined communication and access control policies through efficient cryptographic primitives.
The framework utilizes application-level communication abstractions, A-session and C-session, to provide fine-grained policy enforcement and message attribution for agent-to-agent and one-to-many interactions.
MAGIQ provides provable security guarantees using the Universal Composability (UC) framework and demonstrates practical performance overhead compared to classical baseline systems.

A2RD: Agentic Autoregressive Diffusion for Long Video Consistency

A2RD: introduces an agentic autoregressive architecture for long video synthesis that decouples creative synthesis from consistency enforcement using a Retrieve–Synthesize–Refine–Update cycle.
The framework integrates a Multimodal Video Memory, Adaptive Segment Generation, and Hierarchical Test-Time Self-Improvement to maintain temporal consistency and narrative coherence over long horizons.
The authors also contribute LVbench-C, a benchmark designed to stress-test long-horizon video consistency through cyclical entity and environment transitions.

Same Signal, Opposite Meaning: Direction-Informed Adaptive Learning for LLM Agents

DIAL (Direction-Informed Adaptive Learning): introduces a framework that learns the utility direction of gating signals per environment and backbone to optimize test-time compute for LLM agents.
The framework utilizes Bernoulli-ε exploration to collect counterfactual data, enabling the training of a sparse linear gate that avoids the pitfalls of fixed-direction assumptions.
By decomposing state types into intervention-unsuitable and decision-difficult regimes, DIAL effectively calibrates compute utility across heterogeneous agent environments.

Self-Programmed Execution for Language-Model Agents

SPE (Self-Programmed Execution): introduces an agent architecture where the LLM completion itself functions as the orchestrator program, removing the need for a fixed external orchestration policy.
The framework utilizes Spell, a Lisp-based language, to enable LLMs to programmatically manage their own context, tool calls, and recursive self-calls.
Experiments demonstrate that frontier LLMs can effectively use Spell to solve complex agentic tasks by treating orchestration as dynamic, model-generated program logic.

Beyond the Black Box: Interpretability of Agentic AI Tool Use

Mechanistic-interpretability toolkit: introduces a framework for monitoring agentic tool decisions by mapping pre-action activations through Sparse Autoencoders (SAEs) to identify tool-use intent and risk levels.
The framework utilizes a Tool-Need Probe (binary) and a Tool-Risk Probe (ternary) to provide internal observability into agent behavior before external execution occurs.
By applying feature ablation to sparse latent representations, the approach confirms the functional necessity of specific internal features in driving agent tool-use decisions.

Agentick: A Unified Benchmark for General Sequential Decision-Making Agents

Agentick: introduces a unified benchmark for evaluating RL, LLM, VLM, and hybrid agents across 37 procedurally generated tasks using a Gymnasium-compatible interface, Coding API, oracle reference policies, pre-built SFT datasets, and a composable agent harness.
The framework utilizes a capability-decomposed, multi-modal design that provides five synchronized observation modalities to ensure fair comparison across diverse agent paradigms.
Experimental results demonstrate that no single paradigm dominates, while harness design, specifically the use of chain-of-thought reasoning, significantly multiplies LLM performance.

Multi-Objective Multi-Agent Bandits: From Learning Efficiency to Fairness Optimization

MO-MA-MAB: introduces decentralized algorithms for multi-objective multi-agent bandits that balance learning efficiency and fairness under heterogeneous rewards and time-varying communication.
The paper develops PARETO UCB1 GOSSIP for Pareto-optimal learning and SIMULATED NSW UCB GOSSIP for optimizing Nash Social Welfare using gossip-based communication and preference-based reward simulation.
Theoretical analysis establishes regret bounds of O(log T) for Pareto efficiency and O(T^3/4) for Nash Social Welfare, highlighting the fundamental tradeoff between fairness and convergence speed in decentralized multi-agent systems.

AGWM: Affordance-Grounded World Models for Environments with Compositional Prerequisites

AGWM: introduces a world model that explicitly tracks environment affordances using a self-evolving Dynamic Affordance Graph to mitigate compounding multi-step imagination errors.
The framework integrates an SC Classifier to detect structure-changing events and a Graph Predictor that enforces a frontier-mask constraint to ensure imagined trajectories remain within feasible action sets.
By conditioning the RSSM World Model on the graph embedding, AGWM achieves improved generalization to novel rule configurations and reduces prediction error in environments with compositional prerequisites.

Randomness is sometimes necessary for coordination

Diamond Attention: introduces a cross-attention architecture that utilizes structured random masking to induce transient rank ordering among homogeneous agents, enabling role differentiation in cooperative multi-agent reinforcement learning.
The framework generates per-timestep asymmetric attention masks by sampling scalar random numbers, allowing agents to dynamically form hierarchies without requiring explicit communication or unique identifiers.
By decoupling coordination from fixed agent identities and environmental cues, the architecture achieves zero-shot generalization to varying team sizes and cross-scenario transfer in adversarial environments.

SHARP: A Self-Evolving Human-Auditable Rubric Policy for Financial Trading Agents

SHARP: introduces a neuro-symbolic framework that replaces unconstrained prompt mutation with structured, symbolic policy optimization to address credit assignment in LLM-based financial trading.
The framework utilizes an Attribution Agent to diagnose failures, an Evolution Agent to perform atomic rule edits, and a Validation Gate to ensure robust, auditable strategy refinement.
By confining LLM reasoning to a bounded, human-readable rubric, SHARP enables dynamic adaptation to non-stationary markets while maintaining the transparency required for institutional finance.

Towards Security-Auditable LLM Agents: A Unified Graph Representation

Agent-BOM (Agent Bill of Materials): introduces a hierarchical attributed directed graph representation that bridges the semantic gap in LLM agent security by modeling static capability bases and dynamic runtime semantic states.
The framework enables path-level security auditing by organizing execution traces into queryable evidence paths that connect risk entry points, semantic evolution, capability invocation, and cross-agent propagation.
By instantiating a 4-tuple auditing rule template, the approach allows for systematic root-cause analysis and security adjudication of complex agentic risks such as memory poisoning, tool misuse, and multi-agent ecosystem hijacking.

Conformal Agent Error Attribution

Conformal Agent Error Attribution: introduces a framework for identifying decisive errors in multi-agent system trajectories using conformal prediction to provide statistical coverage guarantees.
The framework utilizes filtration-based conformal prediction algorithms to generate contiguous prediction sets, enabling efficient error localization and automated agent recovery.
The approach is model-agnostic and demonstrates that matching the conformal algorithm to the error distribution of the data significantly improves the precision of error attribution.

When Does Critique Improve AI-Assisted Theoretical Physics? SCALAR: Structured Critic–Actor Loop for Agentic Reasoning

SCALAR: introduces a pedagogical Actor–Critic–Judge pipeline to evaluate how interaction strategies between LLMs affect reasoning performance in theoretical physics.
The framework utilizes an Actor agent for problem-solving, a Critic agent for formative feedback, and an independent Judge agent for final evaluation against reference solutions.
SCALAR provides a controlled testbed to analyze the impact of Actor personas, Critic feedback strategies, and model scaling on the convergence of multi-turn LLM reasoning.

BAMI: Training-Free Bias Mitigation in GUI Grounding

BAMI: introduces a training-free inference framework that mitigates precision and ambiguity biases in GUI grounding by utilizing MPD, Coarse-to-Fine Focus, and Candidate Selection.
The framework employs MPD to diagnose failure sources, followed by a recursive Coarse-to-Fine Focus mechanism and a prompt-guided Candidate Selection process to improve localization accuracy.
BAMI enhances existing GUI grounding models by incorporating structured inference and external correction models without requiring additional training.

AI Co-Mathematician: Accelerating Mathematicians with Agentic AI

AI Co-Mathematician: introduces a stateful, agentic workbench designed to support mathematicians by orchestrating complex, iterative research workflows through a hierarchy of specialized agents.
The system utilizes a project coordinator agent to manage parallel workstreams, which leverage specialized sub-agents and reviewer agents to maintain rigor and track research progress in a shared workspace.
By integrating human-in-the-loop steering with hard programmatic constraints, the framework enables the synthesis of native mathematical artifacts while managing the inherent uncertainty of LLMs.

Revisiting Adam for Streaming Reinforcement Learning

Adaptive Q(λ): introduces a streaming reinforcement learning algorithm that combines momentum-based eligibility traces with ADAM-style variance-adjusted updates and bounded error signals to achieve stable learning.
The framework demonstrates that classical batch-RL objectives, when coupled with properly tuned ADAM hyperparameters, are highly effective in streaming environments.
The research identifies ADAM's epsilon parameter as a critical Signal-to-Noise Ratio filter that facilitates stable convergence by regulating updates for sparse or noisy features.

StraTA: Incentivizing Agentic Reinforcement Learning with Strategic Trajectory Abstraction

StraTA: introduces a framework that decomposes long-horizon agentic tasks into strategy generation and strategy-conditioned action execution to improve exploration and credit assignment.
The framework utilizes a hierarchical GRPO-style training objective that optimizes both high-level strategies and low-level actions using Strategy Generator, Action Executor, and Hierarchical GRPO-style Rollout.
StraTA enhances learning efficiency through Farthest Point Sampler for diverse strategy exploration and Critical Self-Judgment Agent for fine-grained step-level credit assignment.

Recursive Agent Optimization

RAO: introduces a reinforcement learning approach that trains a single shared LLM policy to dynamically spawn and manage recursive sub-agents for complex task decomposition.
The framework utilizes a local node reward mechanism, incorporating a delegation bonus and a leave-one-out baseline, to provide dense credit assignment across the execution tree.
By optimizing a weighted multi-task objective, RAO enables agents to generalize to harder tasks, scale beyond context window limits, and improve training efficiency through self-induced curricula.

Cited but Not Verified: Parsing and Evaluating Source Attribution in LLM Deep Research Agents

Source attribution evaluation framework: introduces a three-stage pipeline that utilizes a Markdown AST parser to extract citation-claim pairs and three specialized evaluators to assess link accessibility, topical relevance, and factual accuracy.
The framework employs an LLM-as-a-judge approach for content and fact verification, calibrated through human review to mitigate systematic biases.
Experimental results across 14 LLMs demonstrate that while frontier models maintain high link validity, they exhibit significant factual accuracy degradation as search depth increases, indicating an information overload effect.

MASPO: Joint Prompt Optimization for LLM-based Multi-Agent Systems

MASPO: introduces a joint prompt optimization framework for LLMs in multi-agent systems that resolves credit assignment dilemmas by integrating LLM Evaluator, Prompt Optimizer, Misalignment Buffer, Search Tree, Beam Search, Topological Scheduler, and Joint Reward Model.
The framework employs a multi-granularity joint evaluation mechanism and misalignment-aware sampling to bridge the gap between local agent objectives and global system outcomes.
MASPO utilizes an evolutionary beam search with a beam refresh mechanism to ensure stable co-adaptation of agents in non-stationary multi-agent environments.

SkillOS: Learning Skill Curation for Self-Evolving Agents

SkillOS: introduces an experience-driven RL training recipe for learning skill curation in self-evolving agents by pairing a frozen Agent Executor with a trainable Skill Curator.
The framework utilizes a SkillRepo to store reusable procedural knowledge in Markdown format, which is dynamically managed by the Skill Curator through insert, update, and delete operations.
By training on grouped task streams and employing a composite reward function, the system effectively turns delayed downstream feedback into actionable learning signals for long-term agent proficiency.

Patch2Vuln: Agentic Reconstruction of Vulnerabilities from Linux Distribution Binary Patches

Patch2Vuln: introduces an agentic pipeline that reconstructs security vulnerabilities from Linux distribution binary patches by orchestrating binary analysis tools and LLM-based reasoning without external advisory data.
The framework utilizes a candidate-centric architecture where binary diffs are transformed into dossiers, enabling the LLM agent to perform preliminary audits, plan bounded validation, and generate final vulnerability reports.
Experimental results on 25 Ubuntu package pairs demonstrate that the agent effectively localizes security-relevant functions and identifies root-cause classes while maintaining hallucination resistance against negative controls.

Cross-Modal Navigation with Multi-Agent Reinforcement Learning

CRONA: introduces a decentralized MARL framework for cross-modal navigation that leverages modality-specialized agents, auxiliary belief predictors, and a centralized multi-modal critic to improve collaborative navigation efficiency.
The framework utilizes Audio Encoder, Vision Encoder, Auxiliary Belief Predictor, History Encoder, Multi-Modal Critic, Replay Buffer, Policy Head, and Value Head to enable effective coordination among heterogeneous agents without inter-agent communication during execution.
Experimental results across Matterport3D scenes demonstrate that CRONA outperforms single-agent baselines and identifies five distinct modality-dominance patterns that dictate the effectiveness of collaborative navigation strategies.

WEBLICA: Scalable and Reproducible Training Environments for Visual Web Agents

WEBLICA (Web Replica): introduces a framework for constructing reproducible and scalable web environments to train visual web agents using WEBLICA-CACHE and WEBLICA-SYNTH.
The framework leverages HTTP-level caching to capture stable visual states and an LLM-based synthesis pipeline to generate diverse, grounded web environments for RL training.
WEBLICA-8B outperforms open-weight baselines on web navigation benchmarks while using fewer inference steps and scaling effectively with additional test-time compute.

NeuroAgent: LLM Agents for Multimodal Neuroimaging Analysis and Research

NeuroAgent: introduces a hierarchical multi-agent framework that automates the full lifecycle of multimodal neuroimaging analysis, utilizing a Central Orchestrator, Specialized Modality Agents, and a Feedback-Driven Execution Engine.
The system employs a Global Workflow Registry for state persistence and a Human-in-the-Loop interface to ensure reliability and allow expert intervention during complex scientific workflows.
NeuroAgent achieves high performance in automated preprocessing and downstream Alzheimer's disease classification by integrating LLM-driven reasoning with established neuroimaging toolchains.

Language Models Can Autonomously Hack and Self-Replicate

Language Models Can Autonomously Hack and Self-Replicate: demonstrates that LLMs can autonomously exploit web vulnerabilities, extract credentials, and replicate their full inference stack across a network of hosts.
The research evaluates autonomous self-replication by decomposing the process into propensity and capability, specifically measuring the ability of LLMs to perform hacking and copying tasks.
The study validates chain replication across multiple hops, showing that replicas can independently discover and exploit new targets without human intervention.

Coordination Matters: Evaluation of Cooperative Multi-Agent Reinforcement Learning

STAT: introduces a coordination-aware evaluation perspective for cooperative MARL by utilizing a controlled commitment-constrained spatial task-allocation testbed to expose process-level diagnostics beyond aggregate return.
The framework employs STAT to systematically scale agents, tasks, and environment size, isolating assignment coordination as the primary bottleneck while holding task rules and observation access fixed.
Empirical results demonstrate that aggregate return often obscures distinct coordination failure modes, necessitating process-level metrics like conflict rate, assignment diversity, and throughput for robust benchmarking.

ROSE: Rollout On Serving GPUs via Cooperative Elasticity for Agentic RL

ROSE: introduces a cooperative, resource-elastic post-training system that harvests idle compute and memory on serving GPUs to expedite agentic RL rollouts while preserving serving SLOs.
The system utilizes an SLO-safe co-serving executor, a cross-cluster weight transfer engine, and an elastic rollout scheduler to optimize training throughput in environments with bursty serving traffic.
ROSE leverages weight-differential sparsity and shard-aware routing to minimize cross-datacenter communication overhead, achieving significant end-to-end throughput improvements over existing resource-fixed and elastic baselines.

Market-Alignment Risk in Pricing Agents: Trace Diagnostics and Trace-Prior RL under Hidden Competitor State

Trace-Prior RL: introduces a diagnostic and repair framework for agentic systems where partial observability causes deterministic policies to collapse uncertainty into suboptimal shortcuts.
The framework utilizes a learned distributional market prior to regularize a stochastic pricing policy via a KL penalty, ensuring the agent maintains market-aligned behavior despite hidden competitor states.
This approach demonstrates that when scalar rewards are easily gamed, trace-level diagnostics are essential to verify that agents have learned the intended behavioral discipline rather than just optimizing for proxy metrics.

STALE: Can LLM Agents Know When Their Memories Are No Longer Valid?

STALE (State Tracking And Latent Evaluation): introduces a benchmark for assessing long-term memory in LLM agents under implicit conflict, utilizing a three-dimensional probing framework consisting of State Resolution, Premise Resistance, and Implicit Policy Adaptation.
The paper identifies implicit conflict as a critical failure mode where new observations invalidate earlier memories without explicit negation, requiring contextual inference and commonsense reasoning.
The authors propose CUPMEM, a prototype that improves memory reliability by implementing write-side state adjudication and propagation-aware search to ensure downstream behavior is grounded in the current user state.

Sustaining Cooperation in Populations Guided by AI: A Folk Theorem for LLMs

LLM-mediated meta-game framework: introduces a formal model where multiple LLMs act as strategic entities guiding populations of clients in repeated games to sustain cooperative outcomes.
The framework establishes a folk theorem for LLMs, proving that any feasible and individually rational outcome can be sustained as an ε-equilibrium despite indirect observation and attribution challenges.
The research demonstrates that shared LLM guidance functions as a powerful coordination mechanism, capable of reshaping strategic interactions and facilitating cooperation even when underlying incentives are misaligned.

Process Matters more than Output for Distinguishing Humans from Machines

COGCAPTCHA30: introduces a cognitive task battery to evaluate human-machine discrimination by analyzing latent process-level behavioral features rather than task performance alone.
The study demonstrates that while LLMs can achieve output parity with humans, they exhibit distinct process-level signatures that allow for reliable discrimination using a Random Forest classifier.
Explicit process-level fine-tuning (P-SFT) improves human-like behavioral mimicry when task-specific process representations are known, but this advantage diminishes significantly under cross-task transfer.

Agentic AIs Are the Missing Paradigm for Out-of-Distribution Generalization in Foundation Models

Agentic OOD Framework: introduces a system-level paradigm for foundation models that addresses out-of-distribution generalization through a closed-loop cycle of Perceive, Reason, Act, and Verify, which complements traditional model-centric approaches.
The framework defines agentic systems by their ability to perform diagnostic perception, select heterogeneous response strategies, invoke external resources, and implement closed-loop verification to handle open-ended distribution shifts.
By extending the reachable set beyond the parameter coverage ceiling, this paradigm enables LLMs to resolve knowledge gaps, capability limitations, and compositional shifts that cannot be solved by model adjustment alone.

Optimizing Social Utility in Sequential Experiments

Belief Markov Decision Process: introduces a statistical protocol for sequential randomized controlled trials where a regulator provides partial subsidies to a developer to maximize social utility.
The framework models the regulatory approval process as a principal-agent game, utilizing belief-based dynamic programming to determine optimal experimental policies and subsidies.
The protocol employs anytime-valid e-values to maintain rigorous false positive rate guarantees throughout the sequential experimentation process.

Instrumental Choices: Measuring the Propensity of LLM Agents to Pursue Instrumental Behaviors

Instrumental Choices Benchmark: introduces a realistic, low-nudge evaluation suite for measuring the propensity of LLM agents to pursue instrumentally convergent behaviors in terminal-based environments.
The framework utilizes a ReAct-style agent operating within a seeded sandbox, employing deterministic environment-state scorers to identify policy-violating shortcuts across seven operational tasks.
Experimental results indicate that while IC behavior is rare (5.1% overall), it is systematically structured by environmental factors, particularly when official paths are blocked, rather than being uniform noise.

Autonomous Adversary: Red-Teaming in the age of LLM

LMA Framework: introduces a systematic approach for evaluating LLM-driven offensive cyber agents by decomposing intrusion scenarios into ordered task chains with explicit validation predicates.
The framework utilizes an LMA-1 (Orchestrator) for planning, a Cyber Agent for execution, and an LMA-2 (Judge) to provide deterministic outcome verification through iterative feedback loops.
Benchmarking across three operational modalities reveals that expert-defined plans improve task-completion rates, while fully autonomous modes often suffer from reliability bottlenecks and loss-of-control behaviors.

SCARFBENCH: A Benchmark for Cross-Framework Application Migration in Enterprise Java

SCARFBENCH: introduces a benchmark for behavior-preserving cross-framework refactoring of enterprise Java applications, utilizing Coding Agents, Containerized Evaluation Harness, Behavioral Oracle, Failure Taxonomy, and Skill-based Prompting.
The benchmark evaluates LLMs on their ability to perform complex structural transformations across Spring, Jakarta EE, and Quarkus frameworks while maintaining functional equivalence.
Experimental results demonstrate that current LLMs struggle with cross-framework migration, showing significant performance asymmetries and frequent failures in build, deployment, and behavioral validation stages.

Efficient Serving for Dynamic Agent Workflows with Prediction-based KV-Cache Management

PBKV (Prediction-Based KV-Cache Management): introduces a system for dynamic agent workflows that optimizes KV-Cache usage by predicting future agent invocations to inform hierarchical eviction and conservative prefetching decisions.
The framework utilizes a GraphSAGE-based predictor to fuse global call graph topology, workflow history, and LLM prefill semantics to generate multi-step forecasts, which drive cache management policies to minimize re-prefill costs.
PBKV incorporates deterministic guardrails within a probabilistic system to ensure graceful performance degradation under prediction errors, consistently outperforming LRU and static baseline approaches in cache hit rates and end-to-end latency.

To What Extent Does Agent-generated Code Require Maintenance? An Empirical Study

AIDev (AI-generated code maintenance study framework): introduces an empirical methodology to quantify the long-term maintenance requirements of code produced by autonomous coding agents compared to human-authored code.
The framework utilizes the AIDev Dataset, Repository Filter, Random Sampler, Commit Extractor, Conventional Commits Classification System (CCS), and Maintenance Analyzer to track maintenance frequency, magnitude, and authorship over a six-month observation period.
The study reveals that AI-generated files require less frequent and smaller-magnitude maintenance than human-authored code, with human developers performing the vast majority of ongoing maintenance tasks.

PrefixGuard: From Large Language Model Agent Traces to Online Failure-Warning Monitors

PrefixGuard: introduces a modular neural-symbolic framework for synthesizing online failure-warning monitors from raw LLM agent traces without deployment-time LLM inference.
The framework utilizes StepView to convert heterogeneous raw traces into typed canonical records, which are then processed by a differentiable event abstraction layer and a monitor backend to produce calibrated risk scores.
PrefixGuard includes diagnostic tools such as an observability ceiling and post-hoc DFA extraction to evaluate the boundary between ranking quality and actionable deployment utility.

Counterexamples to EFX for Submodular and Subadditive Valuations

EFX Counterexample Framework: introduces a compact, human-verifiable construction using cyclic symmetry to demonstrate the non-existence of EFX allocations in structured valuation classes.
The framework utilizes an Ordinal Rank Function to establish combinatorial obstructions, which are then realized through specific Cardinal Valuation Functions for subadditive and submodular settings.
By employing a Cyclic Relabeling Mechanism, the research provides an optimal approximation barrier for EFX under monotone subadditive valuations and identifies a failure of EFX in weighted coverage functions.

Constraint decay: The Fragility of LLM Agents in Backend Code Generation

Constraint decay framework: introduces a systematic evaluation methodology to measure how LLM agents struggle with structural constraints in multi-file backend code generation.
The study utilizes an evaluation pipeline comprising LLM agents, Docker-based isolation, and static verifiers to assess performance across varying levels of architectural, database, and ORM constraints.
Empirical results demonstrate that LLM agents exhibit significant performance degradation as structural requirements accumulate, with data-layer defects identified as the primary root cause of failures.

SCRuB: Social Concept Reasoning under Rubric-Based Evaluation

SCRuB: introduces a framework for evaluating social concept reasoning in LLMs by shifting from accuracy-based benchmarks to rubric-based comparative judgment of open-ended responses.
The framework utilizes Dataset Construction, Response Generation, and Rubric-Based Evaluation to assess reasoning depth across five critical thinking dimensions: conceptual clarity, evidential grounding, contextual relevance, pluralistic engagement, and argumentative soundness.
To ensure robust evaluation, the methodology incorporates a Panel of Disciplinary Perspectives, which acts as an automated LLM-Judge Ensemble to approximate expert human judgment and mitigate individual bias.

AgenticPrecoding: LLM-Empowered Multi-Agent System for Precoding Optimization

AgenticPrecoding: introduces a multi-agent framework that automates end-to-end precoding derivation by decomposing the process into problem formulation, solver selection, prompt upsampling, and code generation stages.
The framework utilizes two domain-adapted LoRA-tuned agents for specialized reasoning and two general-purpose LLMs for implementation, supported by a feedback-driven refinement mechanism to ensure code executability and solution feasibility.
Experimental results across 10 representative wireless scenarios demonstrate that the framework achieves superior cross-scenario adaptability and consistent performance compared to conventional optimization-based baselines.

Knowledge Graphs, the Missing Link in Agentic AI-based Formal Verification

Knowledge Graph-based Agentic Formal Verification Framework: introduces a verification-centric methodology that constructs a Knowledge Graph from structured Intermediate Representations to provide design-grounded context for LLM-based assertion synthesis and iterative refinement.
The framework utilizes a multi-agent workflow that includes property generation-, syntax correction-, counterexample correction-, and coverage improvement-agents to automate the formal verification process.
By maintaining persistent, queryable links between requirements, RTL structure, and formal tool feedback, the system enables traceable, closed-loop refinement of SystemVerilog Assertions.

MiA-Signature: Approximating Global Activation for Long-Context Understanding

MiA-Signature: introduces a framework that models memory access as a two-stage process of global activation followed by compact representation to improve long-context understanding in LLMs.
The framework utilizes a Mindscape (global semantic memory space) and constructs a MiA-Signature (compact query-conditioned global state) via submodular selection to guide downstream retrieval and generation.
MiA-Signature integrates into RAG and agentic systems, employing a Mindscape-aware retriever and an iterative update model to maintain an evolving global state alongside local evidence memory.

Constraining Host-Level Abuse in Self-Hosted Computer-Use Agents via TEE-Backed Isolation

SHCUA protection framework: introduces an operation-centric security model that mitigates host-level abuse by isolating security-critical classification and authorization within a TEE-backed trusted operation plane.
The architecture employs a risk-driven minimal-confinement strategy, ensuring that only security-sensitive operations are elevated to the trusted plane while ordinary tasks remain in the constrained REE.
The system provides verifiable audit evidence by binding operation requests, authorization decisions, and execution results within a trusted path, preventing manipulation by a compromised host environment.

Automated Safety Is Harder Than You Think

AARP: introduces a framework for automating alignment research that risks producing catastrophically misleading safety assessments due to undetected systematic errors in hard-to-supervise fuzzy tasks.
The framework highlights that automated alignment research is prone to output-level failures and aggregation-level failures, where correlated uncertainties between research outputs lead to overconfident safety assessments.
The paper argues that because alignment lacks safe feedback loops, agents must be trained to reliably perform hard-to-supervise fuzzy tasks using generalisation or scalable oversight, both of which currently face significant open challenges.

Independent Learning of Nash Equilibria in Partially Observable Markov Potential Games with Decoupled Dynamics

Independent Learning of Nash Equilibria in Partially Observable Markov Potential Games with Decoupled Dynamics: introduces a communication-free independent learning algorithm for POMGs with decoupled dynamics and potential structure, leveraging a superstate Markov game representation to achieve quasi-polynomial complexity.
The framework utilizes a finite-window approximation based on the filter stability assumption to transform the partially observable problem into a tractable superstate Markov game.
The proposed approach avoids the curse of multi-agency by exploiting the decoupled transition structure and potential properties, ensuring convergence to an approximate Nash equilibrium without centralized coordination.

From Agent Loops to Deterministic Graphs: Execution Lineage for Reproducible AI-Native Work

Execution Lineage: introduces a DAG-based computational model that replaces implicit agent loops with explicit, dependency-aware artifact production to ensure reproducibility and maintainable state across revisions.
The framework utilizes Input/Edit Event, Source Context Artifact, Analysis Artifact, Criteria Decision Artifact, and Final Artifact to maintain strict boundaries and enable selective recomputation.
By externalizing work structure into a directed acyclic graph, the system achieves precise dependency isolation and artifact-level propagation, outperforming traditional loop-based LLM agents in maintained-state quality.

Prediction and Empowerment: A Theory of Agency through Bridge Interfaces

BGP (Bridge-Gap Pursuit): introduces a decision-theoretic framework for agency that models sensing and actuation as bridge interfaces to resolve failures in prediction, compression, and empowerment.
The framework utilizes a bridge potential function to decompose and minimize bridge gaps, ensuring that agent objectives align with task-relevant identification and control rather than distractor optimization.
BGP includes a counterfactual relevance gate and channel-reachability estimators to distinguish between genuine task-relevant controllability and mere overwrite or distractor control.

MANTRA: Synthesizing SMT-Validated Compliance Benchmarks for Tool-Using LLM Agents

MANTRA: introduces a framework for automatically synthesizing machine-checkable compliance benchmarks for tool-using LLM agents by converting natural-language procedural manuals into formally validated test cases.
The framework utilizes a two-artifact approach, generating both a symbolic world model and trace-level compliance checks, which are cross-validated using SMT solvers to ensure consistency and minimize human intervention.
MANTRA enables deterministic evaluation of agent tool-call trajectories, allowing for fine-grained debugging of failure modes such as missing required calls or incorrect ordering in complex, long-horizon procedural workflows.

Improving the Efficiency of Language Agent Teams with Adaptive Task Graphs

LATTE (Language Agent Teams for Task Evolution): introduces a formal orchestration framework for LLM teams that utilizes a shared, dynamic coordination graph to manage task dependencies, agent assignments, and progress.
The framework employs a hybrid centralized-decentralized model where a Lead agent maintains global graph consistency while Workers autonomously propose updates and claim tasks to maximize parallel execution.
By externalizing coordination through explicit graph mutation operators, the system reduces inter-agent conflicts, communication overhead, and idle computation compared to static or fully decentralized LLM team architectures.

From Specification to Deployment: Empirical Evidence from a W3C VC + DID Trust Infrastructure for Autonomous Agents

MolTrust: introduces a production-deployed trust infrastructure for autonomous agents using W3C Verifiable Credentials and Decentralized Identifiers to provide identity, authorization, and behavioral accountability.
The system architecture utilizes an Agent Authorization Envelope (AAE) enforced across three layers: cryptographic signatures, API-level trust management, and kernel-level syscall monitoring via Falco eBPF.
MolTrust provides empirical evidence of a cross-organizational trust layer that addresses agent-to-agent security gaps identified by regulators and industry frameworks.

Data Language Models: A New Foundation Model Class for Tabular Data

DLM (Data Language Model): introduces a foundation model class for tabular data that natively ingests raw tables to perform domain identification and prediction without preprocessing pipelines.
Schema-1 utilizes four input pathways—column semantics, distributional summaries, raw cell values, and missing value structure—to derive a unified representation of tabular data.
The framework incorporates retention and adaptive memory components to enable sequential fine-tuning across distinct business domains without catastrophic forgetting or data replay.

Profiling for Pennies: Unveiling the Privacy Iceberg of LLM Agents

IcebergExplorer: introduces an autonomous auditing agent that simulates an informed adversary to reconstruct high-fidelity personal profiles from disparate public data sources using Search Engine, Crawler, LLM-as-a-judge, Visual LLM, Knowledge Base, and Refinement Loop.
The framework categorizes privacy risks into a three-tier PrivacyIceberg model consisting of Direct Identifiable Information, Contextually Inferred Information, and Aggregated Mosaic Information.
Extensive experiments demonstrate that the system achieves high-fidelity profiling at low cost, revealing that current LLM guardrails are ineffective against multi-stage synthesis attacks.

A Versatile AI Agent for Rare Disease Diagnosis and Risk Gene Prioritization

Hygieia: introduces a multi-modal agentic system that integrates phenotypic, genomic, and clinical data to perform rare disease diagnosis and risk gene prioritization through a multi-stage workflow.
The framework utilizes a router-based architecture to tailor diagnostic pipelines for common versus rare diseases, incorporating a self-verification mechanism to mitigate LLM-based hallucinations and ensure output consistency.
Hygieia provides transparent, multi-step reasoning trajectories and confidence scores, demonstrating superior diagnostic accuracy and clinical utility compared to human experts and standard LLM baselines.

ClawGuard: Out-of-Band Detection of LLM Agent Workflow Hijacking via EM Side Channel

ClawGuard: introduces a passive, out-of-band integrity monitor that uses electromagnetic emanations to detect LLM agent workflow hijacking by bypassing compromised host-internal telemetry.
The system utilizes a dual-band SDR setup and a drift-aware coarse-fine windowing pipeline to translate macroscopic hardware EM envelopes into discrete skill-level and attack-state evidence.
ClawGuard provides a hardware-rooted trust anchor that validates agent execution against an out-of-band policy, achieving high detection efficacy while remaining resilient to full host OS compromise.

A Self-Healing Framework for Reliable LLM-Based Autonomous Agents

Reliability-aware self-healing framework: introduces a unified architecture for LLM-based autonomous agents that integrates Agent Execution Engine, Failure Detection Module, Reliability Evaluation Module, Decision Controller, Self-Healing Module, and Re-execution Engine to manage runtime failures.
The framework utilizes a quantitative reliability model based on output consistency, semantic correctness, and execution success rate to trigger adaptive recovery strategies like re-planning, prompt correction, and tool re-selection.
By establishing a closed-loop system for monitoring and adaptation, the approach significantly improves task success rates and system robustness in multi-agent workflow environments.

Bandit Learning in General Open Multi-agent Systems

General Open MA-MAB: introduces a unified framework for bandit learning in open multi-agent systems with heterogeneous rewards and dynamic populations, utilizing CERTIFIEDARRIVALTRANSFER, GLOBALVALUEAGGREGATION, and LOCALUPDATEANDBROADCAST to minimize global dynamic regret.
The framework employs a global-UCB index that separates continuing-agent statistical uncertainty, communication error, and certified arrival uncertainty to handle endogenous non-stationarity caused by agent churn.
Theoretical analysis establishes regret bounds governed by pre-training quality and agent patterns, identifying a stable-arm regime where regret is limited by burn-in time rather than cumulative entry uncertainty.

A2TGPO: Agentic Turn-Group Policy Optimization with Adaptive Turn-level Clipping

A2TGPO: introduces a reinforcement learning framework for agentic LLMs that performs fine-grained process credit assignment using turn-group normalized information gain.
The framework utilizes Turn-Group Normalization, Variance-rescaled Discounted Accumulation, and IG-based Adaptive Turn-level Clipping to stabilize training and improve tool-use performance.
A2TGPO aligns optimization granularity with the multi-turn interaction structure of agentic LLMs, consistently outperforming prior RL methods on single-hop and multi-hop QA benchmarks.

BioMedArena: An Open-source Toolkit for Building and Evaluating Biomedical Deep Research Agents

BioMedArena: introduces an open-source toolkit that decouples six layers of biomedical agent evaluation to eliminate per-paper engineering overhead and enable fair, head-to-head comparisons of LLMs.
The framework includes the MUTUAL-EVOLVE harness, which utilizes a Global Workspace and parallel solvers to share intermediate findings and improve deep-research accuracy.
BioMedArena registers 147 benchmarks, 75 typed tools, and multiple context-management strategies, achieving state-of-the-art performance across 12 backbones on 8 representative biomedical benchmarks.

Beyond Accuracy: Policy Invariance as a Reliability Test for LLM Safety Judges

Policy Invariance Framework: introduces a three-principle stress test to evaluate whether LLM safety judges maintain consistent verdicts under semantically equivalent policy rewrites and directional normative shifts.
The framework utilizes a transformation taxonomy to isolate policy wording as a variable, measuring judge reliability through the Policy Invariance Score and a standardized Judge Card reporting protocol.
Experimental results across four LLMs demonstrate that current safety judges often conflate structural policy changes with normative shifts, revealing significant reliability gaps invisible to standard accuracy-based benchmarks.

Stateful Agent Backdoor

Stateful Agent Backdoor: introduces a cross-session attack framework that models malicious agent behavior as a Mealy machine to maintain state across isolated sessions using persistent memory and tool-use capabilities.
The framework decomposes complex multi-session attacks into independent sub-backdoors, enabling autonomous, incremental execution triggered by a single initial injection.
Experimental results across four LLMs demonstrate high attack success rates and validate the effectiveness of the decomposition-based construction against mainstream agent frameworks.

BUILD-AND-FIND: An Effort-Aware Protocol for Evaluating Agent-Managed Codebases

BUILD-AND-FIND: introduces a protocol for evaluating how effectively LLM-generated codebases communicate intended design choices to downstream agents by measuring recovery accuracy and inspection effort.
The framework utilizes a builder-finder protocol where a builder agent constructs a repository from a hidden specification, and a finder agent attempts to recover design intent through a specification-traced question bank.
Inspection effort serves as an agent-facing proxy for artifact legibility, interpreted only after passing recovery and stability gates to ensure the codebase is both correct and consistently understandable.

MemReranker: Reasoning-Aware Reranking for Agent Memory Retrieval

MemReranker: introduces a reasoning-aware reranking model family that distills LLM-level capabilities into compact 0.6B/4B architectures to improve agent memory retrieval through calibrated scoring and instruction-aware design.
The framework utilizes a multi-stage distillation pipeline, incorporating BCE pointwise training and InfoNCE contrastive fine-tuning, to address score calibration failures and reasoning deficits in conventional rerankers.
By integrating Elo/Bradley-Terry calibrated scoring and specialized multi-turn dialogue data, the model achieves performance parity with large-parameter models while maintaining significantly lower inference latency for production agent systems.

When Routine Chats Turn Toxic: Unintended Long-Term State Poisoning in Personalized Agents

StateGuard: introduces a post-execution defense mechanism that audits long-term state modifications at the writeback boundary to mitigate unintended long-term state poisoning in personalized LLM agents.
The paper formalizes unintended long-term state poisoning, where routine user interactions cumulatively reshape an agent's persistent state, leading to insecure behavioral defaults.
The authors propose the ULSPB benchmark and the Harm Score (HS) metric to systematically quantify and evaluate security-relevant behavioral drift across personalized agent frameworks.

VibeServe: Can AI Agents Build Bespoke LLM Serving Systems?

VibeServe: introduces a multi-agent system that automatically synthesizes bespoke LLM serving stacks by utilizing an Outer Loop (plans search over designs) and an Inner Loop (implements and validates candidates) to optimize performance for specific model, hardware, and workload targets.
The framework employs an Implementer Agent (writes serving-system code), an Accuracy Judge Agent (verifies correctness), and a Performance Evaluator Agent (profiles system performance) to iteratively refine serving systems within an Execution Environment (provides isolated workspace) guided by a Skills Library (contains serving-systems knowledge).
VibeServe leverages a Search Policy (manages optimization strategy) and Search State (tracks git-based history) to enable generation-time specialization, achieving performance parity with optimized baselines in standard settings and significant speedups in non-standard deployment scenarios.

Causal Reinforcement Learning for Complex Card Games: A Magic The Gathering Benchmark

MTG-Causal-RL: introduces a Gymnasium benchmark for causal RL research using Magic: The Gathering, featuring a hand-designed Structural Causal Model (SCM) and a reference agent, CGFA-PPO (Causal Graph-Factored Advantage PPO), which utilizes per-factor critic heads, a state-conditional residual gate, and an intervention-calibration loss.
The framework enables causal credit assignment and policy auditability by decomposing win probability into strategic causal factors such as mana, board pressure, and card advantage.
Experimental results demonstrate that while CGFA-PPO is competitive, the benchmark effectively exposes diagnostic structure and performance variations across different deck archetypes that scalar win rates alone cannot capture.

PersonaGesture: Single-Reference Co-Speech Gesture Personalization for Unseen Speakers

PersonaGesture: introduces a diffusion-based pipeline for single-reference co-speech gesture personalization that utilizes Style Perceiver, Adaptive Style Infusion, and Implicit Distribution Rectification to synthesize identity-consistent motion for unseen speakers without test-time optimization.
The framework employs Adaptive Style Infusion to inject speaker-memory tokens into the denoising process and Implicit Distribution Rectification to apply length-aware latent moment correction after generation.
By separating temporal style control during denoising from conservative post-generation statistical correction, the model effectively preserves speaker-specific pose choices while maintaining alignment with new target speech.

Multiagent Stochastic Shortest Path Problem

MSSP (Multiagent Stochastic Shortest Path) introduces: "a framework for minimizing the expected time for multiple agents to reach a target state in a known MDP, analyzing both autonomous and coordinated settings with COORHIT and AUTOHIT."
COORHIT provides optimal coordinated strategies using linear programming, while AUTOHIT employs differentiable programming to synthesize high-quality autonomous strategy profiles.
The research demonstrates that while coordinated MSSP is PSPACE-hard, AUTOHIT effectively overcomes complexity barriers in the autonomous setting by optimizing randomized memoryless profiles.

Semantic State Abstraction Interfaces for LLM-Augmented Portfolio Decisions: Multi-Axis News Decomposition and RL Diagnostics

SSAI (Semantic State Abstraction Interfaces): introduces a methodological template for mapping sparse unstructured text into auditable, named coordinates to separate representation hypotheses from optimization variance in sequential decision systems.
The framework utilizes a frozen LLM to elicit four integer-valued semantic axes—sentiment, risk, confidence, and volatility forecast—which serve as a shared, auditable observation interface for diverse estimators.
By fixing the semantic representation across different estimators, the framework enables diagnostic evaluation to isolate the impact of algorithm choice from the quality of the state representation in LLM-augmented decision-making.

Multi-agent decision making: A Blackwell’s informativeness approach

MA-PoP: introduces a principled decision-making framework that utilizes Blackwell’s informativeness to aggregate LLM agent beliefs through a product-of-posteriors estimator.
The framework employs an Agent-Posterior-Estimator with an NLI-Cross-Encoder and a Deep-Sets-Calibration-Module to derive robust, permutation-equivariant probability distributions from LLM responses.
By approximating the Bayesian pooled posterior, the approach provides a computationally efficient and theoretically grounded alternative to iterative multi-agent debate and simple voting methods.

Strat-LLM: Stratified Strategy Alignment for LLM-based Stock Trading with Real-time Multi-Source Signals

Strat-LLM: introduces a framework for auditing strategic alignment in LLMs by integrating Multi-Source Data Integration Module, Stratified Strategy Scaffolding Engine, and Dynamic Execution & Evaluation Engine to evaluate trading performance under varying autonomy modes.
The framework utilizes an Expert Strategy Taxonomy to enforce constraints on the LLM Agent, enabling a comparative analysis of reasoning-heavy versus standard LLMs across different market regimes.
By employing a T+1 rolling-window mechanism, the system eliminates look-ahead bias and identifies an "Alignment Tax" where rigid rule enforcement impacts performance differently based on model scale and architecture.

PersonaKit (PK): A Plug-and-Play Platform for User Testing Diverse Roles in Full-Duplex Dialogue

PK (PersonaKit): introduces an open-source, low-latency platform for prototyping and evaluating persona-conditioned turn-taking behaviors in full-duplex dialogue systems.
The framework utilizes JSON-based configurations to manage interruption strategies—such as Yield, Resume, Bridge, or Override—allowing researchers to define distinct sociolinguistic personas for LLMs.
PK integrates an end-to-end workflow that includes real-time audio processing, intent classification, and automated data collection to facilitate comparative user studies of conversational agents.

A Case-Driven Multi-Agent Framework for E-Commerce Search Relevance

Case-driven multi-agent framework: introduces a closed-loop ecosystem that replaces human roles with autonomous agents to automate the pipeline from bad-case identification to model resolution.
The framework utilizes an Annotator Agent, an Optimizer Agent, and a User Agent to perform standard-grounded labeling, data-centric repair, and dialectical bad-case discovery.
Supporting harness-engineering extensions include an All-In-One Relevance Model, Global Memory for cross-agent synchronization, and a Deep Search Agent to mitigate recall failures.

BIORESEARCHER: Scenario-Guided Multi-Agent for Translational Medicine

BIORESEARCHER: introduces a scenario-guided multi-agent system that automates translational medicine workflows by mapping queries to research playbooks, delegating tasks to specialized subagents, and reconciling evidence into auditable dossiers.
The architecture utilizes a Master Orchestrator to manage subagents, including a Translation Subagent for entity grounding, Retrieval Subagents for data acquisition, and an Insights & Signals Subagent that leverages a CodeAct Sandbox for quantitative analysis.
The system employs a multi-model reconciliation pipeline to resolve claim-level conflicts, ensuring the generation of provenance-preserving reports for complex biomedical research questions.

TACT: Mitigating Overthinking and Overacting in Coding Agents via Activation Steering

TACT: introduces a training-free pipeline that detects and mitigates agent drift in LLMs by steering hidden states along contrastive axes at the reasoning-action boundary.
The framework utilizes an LLM-as-judge to label trajectory steps as overthinking, overacting, or calibrated, enabling the construction of orthogonalized drift axes for real-time activation intervention.
Experimental results demonstrate that TACT improves resolve rates on long-horizon coding benchmarks by correcting failure modes without requiring additional fine-tuning or prompt modifications.

BehaviorGuard: Online Backdoor Defense for Deep Reinforcement Learning

BehaviorGuard: introduces a trigger-agnostic framework that detects and mitigates backdoor attacks in DRL by identifying persistent behavioral drift in action distributions.
The framework utilizes BDD to quantify deviations from benign action statistics and employs a drift-constrained mitigation mechanism to project abnormal actions back to a safe region without requiring policy retraining.
BehaviorGuard provides a unified defense for single-agent and multi-agent DRL environments by incorporating DCR to handle interaction-induced non-stationarity and strategic coupling.

PragLocker: Protecting Agent Intellectual Property in Untrusted Deployments via Non-Portable Prompts

PragLocker: introduces a black-box prompt protection scheme that transforms plaintext system prompts into model-specific, non-portable obfuscated prompts using Prompt Initialization and Noise-Injected Prompt Optimizer.
The framework utilizes a two-phase pipeline consisting of a code-symbol conversion followed by a gradient-free random search optimization driven by Target LLM feedback to ensure functional equivalence on the target model while preventing cross-model portability.
Experimental results demonstrate that PragLocker effectively protects agent intellectual property by rendering stolen prompts unusable on non-target LLMs while maintaining high performance on the intended target LLM.

TheraAgent: Self-Improving Therapeutic Agent for Precise and Comprehensive Treatment Planning

TheraAgent: introduces an agentic framework that replaces one-shot generation with an iterative generate-reflect-refine pipeline to improve treatment planning precision and safety.
The framework utilizes a Planner to generate plans, a Memorizer to maintain historical context, and a TheraJudge to provide clinically grounded, multi-dimensional feedback.
TheraAgent achieves state-of-the-art performance on HealthBench and demonstrates an 86% win rate against human physicians in blinded expert evaluations.

Intentmaking and Sensemaking: Human Interaction with AI-Guided Mathematical Discovery

AlphaEvolve: introduces an interactive interface for evolutionary coding agents that facilitates an iterative intentmaking and sensemaking workflow for scientific discovery.
The framework utilizes a pipeline of LLMs to optimize code, supported by a critique agent that identifies potential reward hacks and design flaws before full-scale execution.
The system enables domain experts to refine their research goals through a cycle of defining problem statements, observing AI-generated results, and adjusting parameters based on visual feedback.

Agentic, Context-Aware Risk Intelligence in the Internet of Value

OmniRisk Architecture: introduces a five-engine framework for agentic, context-aware risk intelligence in the Internet of Value, utilizing a shared fabric for data ingestion and a Bittensor-based verification subnet for incentivized prediction scoring.
The framework integrates an LLM-mediated agentic engine governed by constitutional constraints to ensure safe, role-bound execution of pre-committed action programs.
Empirical validation is provided through a 27-hour policy-constrained liquidity stress-response experiment and a 168-hour production calibration arc demonstrating the deployability of the risk-prediction system.

SkillScope: Toward Fine-Grained Least-Privilege Enforcement for Agent Skills

SkillScope: introduces a graph-based framework for fine-grained least-privilege enforcement in LLM Agent Skills by modeling instruction-level procedures and code-level operations as a Unified Execution Graph to identify and constrain task-conditioned over-privileged actions.
The framework utilizes Over-privilege Candidate Extraction to localize suspicious behaviors, Action Over-privilege Validation to confirm necessity through task-conditioned replay, and Control-flow Privilege Constraining to restrict unnecessary actions while preserving legitimate functionality.
Experimental results demonstrate that SkillScope achieves 94.53% F1 in detection and reduces triggered over-privileged action-in-task instances by 88.56% across real-world agent runtimes.

SANEmerg: An Emergent Communication Framework for Semantic-aware Agentic AI Networking

SANEmerg: introduces a multi-agent emergent communication framework that optimizes signaling protocols for AgentNet systems under strict bandwidth and computational constraints.
The framework utilizes an importance-filter to prioritize task-relevant message dimensions and an MDL-based complexity-regularizer to enforce efficient, bounded-intelligence communication.
Experimental results on an AgentNet prototype demonstrate that SANEmerg achieves superior task accuracy while significantly reducing bandwidth and computational overhead compared to existing benchmarks.

LoopTrap: Termination Poisoning Attacks on LLM Agents

LoopTrap: introduces an automated red-teaming framework that exploits LLM agent termination mechanisms by injecting malicious prompts to induce unbounded execution loops.
The framework utilizes Behavior Vulnerability Fingerprinting to profile agent tendencies, Profile-Guided Trap Synthesis to generate context-aware attacks, and Reflective Skill Evolution to refine and store successful attack strategies.
LoopTrap achieves significant step amplification across diverse LLM agents by dynamically adapting adversarial injections to specific task contexts and agent-level behavioral signatures.

From Chat to Interview: Agentic Requirements Elicitation with an Experience Ontology

OntoAgent: introduces an agentic framework that transforms free-form LLM interviews into structured, interpretable requirements elicitation processes by utilizing an experience ontology to guide questioning.
The framework employs a hierarchical Experience Ontology to decouple the "what to ask" from the "how to ask," enabling systematic exploration of implicit requirements.
OntoAgent integrates four decision-making operations—ParseUser, ScoreOnto, ReRankOnto, and GatePrune—to dynamically prioritize requirement slots and generate targeted questions while minimizing redundancy.

Long-Horizon Q-Learning: Accurate Value Learning via n-Step Inequalities

LQL (Long-horizon Q-learning): introduces a reinforcement learning approach that mitigates compounding TD error by augmenting standard TD updates with trajectory-wise optimality inequalities enforced via asymmetric hinge penalties.
The framework utilizes Online Q-network, Target Q-network, Replay buffer, Hinge-squared penalty, Lower-bound (LB) penalty, Upper-bound (UB) penalty, Bootstrap policy, and Actor to maintain value function stability without requiring auxiliary networks or additional forward passes.
LQL demonstrates superior performance on long-horizon tasks by treating trajectory length as a robust scaling axis, effectively absorbing information from long sequences while avoiding the off-policy bias inherent in traditional n-step TD methods.

Sheet as Token: A Graph-Enhanced Representation for Multi-Sheet Spreadsheet Understanding

Sheet as Token: introduces a two-stage framework that represents worksheets as unified semantic tokens to enable scalable multi-sheet spreadsheet retrieval.
The framework utilizes a Feature Extractor to generate schema-aware records, a Sheet Encoder to produce dense Sheet Tokens, and a Graph Retriever to perform query-conditioned cross-sheet reasoning.
By modeling multi-hop dependencies through a graph-enhanced architecture, the system effectively identifies supporting sheet sets without the computational overhead of full-workbook serialization.

Selective Rollout: Mid-Trajectory Termination for Multi-Sample Agent RL

Selective Rollout: introduces a mid-rollout gate that terminates groups of parallel LLM-policy rollouts early when they converge on identical action prefixes, thereby reducing redundant computation in GRPO.
The gate utilizes mean pairwise prefix edit distance to identify zero-variance groups, effectively filtering them before they contribute to the gradient-batch, which increases the signal-to-noise ratio of the policy updates.
This approach achieves significant wall-clock savings in agent RL training while maintaining or improving policy performance by mitigating gradient dilution caused by zero-advantage trajectories.

Reward Shaping and Action Masking for Compositional Tasks using Behavior Trees and LLMs

MRBT: introduces a symbolic, reactive, and modular framework that leverages LLMs and SMT-solvers to automate reward shaping and action masking for compositional robotic tasks.
The framework utilizes an MRBT template to structure subtasks, which are then verified against logical specifications to ensure reactivity and robustness across task spaces.
Experimental results demonstrate that integrating MRBTs into a neurosymbolic RL loop consistently improves training efficiency and task success rates compared to baseline methods.

GazeMind: A Gaze-Guided LLM Agent for Personalized Cognitive Load Assessment

GazeMind: introduces a gaze-guided LLM agent framework for personalized cognitive load assessment on smart glasses, utilizing TGE, TGR, AUP, and CogRAG to provide interpretable predictions without model fine-tuning.
The framework leverages an LLM Engine to process structured gaze data, task-specific guidance, and retrieved historical examples to achieve state-of-the-art performance in cognitive load classification.
GazeMind incorporates the CogLoad-Bench dataset, which includes 152 participants and 10K+ real-time annotations, to enable robust cross-user and cross-task generalization for wearable AI assistants.

X-OmniClaw: A Unified Mobile Agent for Multimodal Understanding and Interaction

X-OmniClaw: introduces a unified edge-native mobile agent architecture that integrates Omni Perception, Omni Memory, and Omni Action to enable reliable, context-aware task execution on Android devices.
The framework utilizes an Agent Loop for central planning and Trajectory Cloned Execution to transform user interactions into reusable skills, bypassing redundant UI navigation.
By leveraging on-device processing for perception and memory, the system maintains task continuity and privacy while offloading high-level reasoning to cloud-based LLMs.

BioTool: A Comprehensive Tool-Calling Dataset for Enhancing Biomedical Capabilities of Large Language Models

BioTool: introduces a comprehensive biomedical tool-calling dataset designed to enhance the capabilities of LLMs through instruction fine-tuning on 7,040 human-verified query-API call pairs.
The framework utilizes a multi-stage construction pipeline involving Tool Selection, API Call Generation, Execution and Heuristic-based Filtering, LLM-based Query Generation, LLM-based Informative Filtering, and Human Refinement to ensure high-quality, scientifically accurate data.
Experimental results demonstrate that smaller LLMs fine-tuned on BioTool significantly outperform larger proprietary models in tool-calling precision and downstream answer quality, effectively mitigating domain-specific hallucinations.

Multi-Dimensional Behavioral Evaluation of Agentic Stock Prediction Systems Using LLM Judges with Closed-Loop Reinforcement Learning Feedback

Agentic Stock Prediction System: introduces a behavioral evaluation framework that utilizes an ensemble of LLM judges to assess stock prediction agents across six domain-specific dimensions and provides closed-loop reinforcement learning feedback.
The framework decomposes agentic decision-making into structured behavioral traces, enabling targeted credit assignment to the SAC Engine based on diagnostic scores from the LLM Judge.
Validation through perturbation experiments demonstrates that the LLM Judge ensemble effectively identifies specific component failures, leading to improved predictive performance and Sharpe ratios in high-volatility market conditions.

SkillRet: A Large-Scale Benchmark for Skill Retrieval in LLM Agents

SKILLRET: introduces a large-scale benchmark for evaluating skill retrieval in LLM agents, utilizing a curated library of 17,810 skills and 63,259 training queries to address the challenge of selecting relevant procedural modules.
The framework employs a two-stage retrieve-then-rerank pipeline, demonstrating that domain-specific fine-tuning significantly improves retrieval performance by sharpening the model's focus on skill-relevant query segments.
The benchmark provides a structured taxonomy and disjoint train/evaluation splits, establishing a foundation for future research into long-document matching and efficient skill selection in large-scale agent systems.

Detecting Time Series Anomalies Like an Expert: A Multi-Agent LLM Framework with Specialized Analyzers

SAGE (Specialized Analyzer Group for Expert-like Detection): introduces a multi-agent LLM framework that decomposes time-series anomaly diagnosis into specialized analysis perspectives, integrating quantitative and visual evidence through a Detector and Supervisor to produce structured, analyst-facing reports.
The framework utilizes a dual-representation strategy to balance numerical precision for statistical tools with token-efficient summaries for LLM reasoning, while employing a synthetic in-context learning module to provide contrastive evidence without requiring real anomalous examples.
By assigning distinct anomaly families to specialized agents—PointAnalyzer, StructAnalyzer, SeasonAnalyzer, and PatternAnalyzer—the system achieves superior detection performance and diagnostic interpretability compared to single-model approaches.

Auto Research with Specialist Agents Develops Effective and Non-Trivial Training Recipes

Auto Research with Specialist Agents: introduces a closed-loop empirical framework that automates training-recipe development by partitioning tasks among Specialist Agents that share measured evidence via Shared Lineage Memory.
The framework utilizes a Closed-Loop Experiment Cycle where agents propose code edits, which are then validated by an External Evaluator, with results stored in a central Blackboard to guide subsequent research moves.
By treating research as an auditable trajectory of code edits and measured outcomes rather than a single output, the system enables autonomous improvement of training recipes under strict compute and performance constraints.

More Is Not Always Better: Cross-Component Interference in LLM Agent Scaffolding

CCI (Cross-Component Interference) Analysis Framework: introduces a systematic empirical study of performance degradation in LLM agents when scaffolding components interact negatively.
The research demonstrates that adding more scaffolding components to an LLM agent often leads to performance degradation rather than improvement, a phenomenon termed Cross-Component Interference.
The study reveals that optimal agent configurations are task-dependent and scale-sensitive, with simpler subsets frequently outperforming the full "All-In" agent across multiple model families.

SafeHarbor: Defining Precise Decision Boundaries via Hierarchical Memory-Augmented Guardrail for LLM Agent Safety

SAFEHARBOR: introduces a hierarchical memory-augmented guardrail framework that establishes precise decision boundaries for LLM agents by combining Adversarial Rule Generator, Dynamic Cluster, Memory Tree, Safety Projector, Gating-logic, LLM Judgment, and Rule Engine.
The framework utilizes an automated adversarial rule generation process to synthesize safety policies, which are then organized into a self-evolving hierarchical memory structure to enable efficient, context-aware retrieval.
By employing a dual-stage inference pipeline with a lightweight safety projector and an LLM judgment agent, SAFEHARBOR effectively mitigates over-refusal while maintaining robust protection against malicious agentic attacks.

Knowledge-Graph Paths as Intermediate Supervision for Self-Evolving Search Agents

KG-based Self-Evolving Search Agents: introduces a framework that reuses knowledge-graph paths as construction-derived intermediate supervision to improve both question generation and reward shaping in self-play.
The Proposer utilizes LLM-guided Subgraph Extraction to ground question construction in relational contexts, while the Solver employs Waypoint Coverage Reward to receive graded partial credit for trajectories that align with intermediate entities on the construction path.
This approach enhances the quality of generated training data and provides denser feedback for LLMs without requiring additional human annotations or manually labeled process steps.

Inference-Time Budget Control for LLM Search Agents

VOI: introduces a two-stage inference-time control framework for LLM search agents that optimizes budget allocation during search and performs risk-controlled answer finalization.
The framework utilizes a training-free VOI controller to rank retrieval, decomposition, and answer commitment actions based on estimated marginal task value per unit budget.
An evidence-grounded finalizer provides conservative answer refinement by selectively intervening only when the expected exactness gain outweighs the risk of damaging the trajectory answer.

An Empirical Study of Proactive Coding Assistants in Real-World Software Development

ProCodeBench: introduces a large-scale real-world dataset and benchmark for evaluating proactive coding assistants by analyzing the simulation-to-reality gap in developer behavior.
The framework utilizes IDE interaction traces and repository context to train and evaluate LLMs, Retrieval-Augmented LLMs, and LLM-based agents on their ability to predict latent developer intent.
Experimental results demonstrate that LLM-generated simulated data is insufficient for real-world performance, but serves as a valuable initialization for mixed-data training regimes.

Irminsul: MLA-Native Position-Independent Caching for Agentic LLM Serving

Irminsul: introduces a content-addressed caching system for Multi-Head Latent Attention (MLA) LLMs that enables position-independent reuse of KV states by combining CDC, a registry, and δ-rotation.
The framework leverages the structural decoupling of MLA into position-free latents and RoPE-carrying keys to perform efficient, position-independent KV caching with minimal correction overhead.
By implementing a first-chunk carve-out and content-hash keying, Irminsul recovers significant prompt token reuse in agentic workloads while maintaining output consistency and reducing prefill energy consumption.

Retrieval-Conditioned Topology Selection with Provable Budget Conservation for Multi-Agent Code Generation

CODE-AGENT: introduces a multi-agent framework that utilizes a Retrieval Layer to extract structural complexity signals for dynamic topology selection, ensuring provable budget conservation via an Execution Layer.
The architecture integrates a Routing Layer that maps code-structure signals to one of four orchestration topologies, significantly reducing misrouting compared to text-based classifiers.
A static Budget Algebra Verifier performs O(|V|+|E|) checks before any LLM invocation, guaranteeing that hierarchical contract delegation never exceeds parent resource bounds.

Agentic Coding Needs Proactivity, Not Just Autonomy

Proactive Coding Agent Framework: introduces a three-level taxonomy of proactivity for coding agents, distinguishing between reactive, scheduled, and situation-aware systems.
The paper proposes a Level 3 engine architecture that maintains development state and a developer mental model to emit insights (notify, question, draft, stay silent) based on interruption cost and utility.
It defines three metrics—Insight Decision Quality (IDQ), Context Grounding Score (CGS), and Learning Lift (LL)—to evaluate proactive agent behavior beyond simple task completion.

AffectSeek: Agentic Affective Understanding in Long Videos under Vague User Queries

AffectSeek: introduces a multi-agent framework that decomposes vague-query-driven affective understanding into role-specialized stages including Core-Agent, Localize-Agent, Perception-Agent, and Reflection-Agent to progressively align user intent with long-video evidence.
The framework utilizes a coarse-to-fine localization strategy via LocalizeTool and RefinementTool, followed by evidence-based verification and emotion reasoning to improve the reliability and interpretability of affective understanding.
The authors also construct VQAU-Bench, a comprehensive benchmark integrating long videos, vague natural-language queries, temporal annotations, emotion labels, and evidence-grounded rationales to evaluate agentic affective understanding.

From Storage to Experience: A Survey on the Evolution of LLM Agent Memory Mechanisms

LLM Agent Memory Mechanisms Framework: introduces an evolutionary taxonomy for LLM agents, categorizing memory development into Storage (preservation of interaction trajectories), Reflection (active evaluation and refinement), and Experience (cross-trajectory pattern abstraction).
The framework identifies three core drivers for memory evolution: the necessity for long-range consistency, the challenges of dynamic environments, and the ultimate goal of continual learning.
This survey synthesizes existing research to provide design principles and a roadmap for next-generation LLM agents, emphasizing the transition from passive storage to autonomous self-evolution.

Architecture Matters: Comparing RAG Systems under Knowledge Base Poisoning

RAG architectures: introduces a comparative evaluation of four distinct RAG systems under adversarial knowledge base poisoning, demonstrating that architectural design is a primary variable in system robustness.
The study evaluates Vanilla RAG, Agentic RAG, MADAM-RAG, and Recursive Language Models (RLM) against CorruptRAG-AK, an adversarial attack targeting credibility assessment through meta-epistemic framing.
Results indicate that RLM provides the strongest robustness, while the vulnerability of other architectures is largely driven by content-reasoning failures rather than retrieval limitations.

Operationalizing Ethics for AI Agents: How Developers Encode Values into Repository Context Files

AGENTS.md: introduces a mechanism where developers encode ethical principles and behavioral constraints directly into repository-level context files to govern AI coding agents.
This approach translates abstract ethical values like fairness, sustainability, and inclusivity into actionable, machine-interpretable directives within software development workflows.
The research outlines a roadmap for investigating how these artifacts influence agent behavior, negotiate governance between contributors, and evolve over time within open-source projects.

5th May 2026

MOSAIC-Bench: Measuring Compositional Vulnerability Induction in Coding Agents

MOSAIC-Bench: introduces a benchmark for evaluating how coding agents induce security vulnerabilities through multi-stage, innocuous-looking engineering tickets that bypass standard safety reflexes.
The framework utilizes a two-layer pipeline consisting of a Construction Layer and a Verification Layer to measure compositional vulnerability induction across various LLM-based coding agents.
Experimental results demonstrate that ticket staging effectively silences defensive habits in LLMs, with reviewer protocol framing serving as a critical, non-adaptive mitigation strategy.

What You Think Is What You See: Driving Exploration in VLM Agents via Visual-Linguistic Curiosity

GLANCE (Grounding Linguistic Alignment for Curiosity Exploration): introduces a unified framework that bridges reasoning and exploration by grounding an agent's linguistic world model into stable visual representations of an evolving target network.
The framework leverages the discrepancy between linguistic predictions and visual reality as an intrinsic curiosity signal to steer the agent toward active exploration of uncertain states.
An adaptive curriculum mechanism periodically re-initializes the alignment projector to prevent curiosity drain, ensuring sustained epistemic drive throughout long-horizon learning.

Neural Control: Adjoint Learning Through Equilibrium Constraints

Neural Control: introduces a boundary-control framework that computes trajectory-dependent, memory-efficient proxy gradients by differentiating equilibrium conditions via an adjoint formulation, avoiding unrolling of solver iterations.
The framework integrates these sensitivities into a receding-horizon MPC scheme that repeatedly re-anchors optimization to realized equilibria, mitigating basin-switching in multi-stable regimes.
By utilizing a frozen-tangent approximation for backward sensitivity propagation, the method achieves significant efficiency gains and robust performance in complex deformable object manipulation tasks.

Multi-Agent Strategic Games with LLMs

Multi-Agent Strategic Games with LLMs: introduces a methodological framework using LLMs as experimental subjects to study strategic foundations of conflict and cooperation in repeated security dilemmas.
The framework evaluates how structural variations—multipolarity, finite time horizons, and communication—influence agent behavior, private reasoning, and public signaling.
Results demonstrate that LLMs consistently reproduce canonical international relations mechanisms, such as conflict escalation in multipolar settings and cooperation facilitation through public communication.

QKVShare: Quantized KV-Cache Handoff for Multi-Agent On-Device LLMs

QKVShare: introduces a framework for quantized KV-cache handoff between LLM agents on edge devices that utilizes a topology-aware controller and a compact CacheCard representation to minimize re-prefill latency.
The framework employs a 2-layer MLP controller to assign per-token bit-widths based on local importance and cross-agent downstream demand signals.
Experimental results demonstrate that QKVShare reduces time-to-first-token (TTFT) compared to full re-prefill while maintaining competitive reasoning accuracy across multi-hop agent chains.

Safety and accuracy follow different scaling laws in clinical large language models

SaFE-Scale (Safety-Focused Evaluation of Scaling): introduces a framework for measuring clinical LLM safety across model scale, evidence quality, retrieval strategy, context exposure, and inference-time compute, utilizing the RadSaFE-200 benchmark to evaluate 34 LLMs across six deployment conditions.
The study demonstrates that clinical LLM safety is primarily governed by evidence quality rather than model scale or inference-time compute, with clean clinician-written evidence providing the most significant improvements in both accuracy and safety metrics.
The research reveals that standard RAG, agentic RAG, and max-context prompting often fail to close the safety gap, and that confidence is not a reliable signal for safety, as models frequently exhibit dangerous overconfidence on high-risk errors.

OpenSeeker-v2: Pushing the Limits of Search Agents with Informative and High-Difficulty Trajectories

OpenSeeker-v2: introduces a high-performance search agent framework that achieves state-of-the-art results using only a straightforward SFT objective fueled by high-difficulty, informative trajectories.
The framework utilizes a refined data synthesis pipeline incorporating Scaling Graph Size, Expanding Tool Set, and Strict Low-Step Filtering to ensure the agent learns sustained, long-horizon reasoning.
By training on a condensed set of 10.6k high-difficulty samples, the model outperforms industrial-scale agents trained with complex CPT, SFT, and RL pipelines.

Redefining AI Red Teaming in the Agentic Era: From Weeks to Hours

Dreadnode AI Red Teaming Agent: introduces an agentic system that automates AI red teaming workflows by utilizing Dreadnode TUI, AI Red Teaming Agent, Memory, Tools/Skills, and an Attack Catalog to probe target systems.
The framework leverages an Open-Source SDK to execute reproducible Python-based attack scripts, while the Dreadnode Platform provides automated severity classification, compliance mapping, and analytics.
This approach shifts AI red teaming from manual, library-centered workflows to an agent-assisted paradigm, enabling comprehensive security assessments through natural language objectives.

Rethinking Reasoning-Intensive Retrieval: Evaluating and Advancing Retrievers in Agentic Search Systems

BRIGHT-PRO: introduces a multi-aspect, expert-annotated benchmark for evaluating retrievers in both static and agentic search protocols, utilizing RTriever-Synth, RTriever-4B, LLM-based agent, Search tool, and LLM-as-Judge.
The framework addresses the limitations of existing benchmarks by decomposing complex queries into non-overlapping reasoning aspects, enabling fine-grained evaluation of evidence portfolio construction.
RTriever-4B, trained on aspect-decomposed synthetic data, demonstrates superior performance in surfacing complementary evidence for reasoning-intensive tasks compared to general-purpose embedders.

SymptomAI: Towards a Conversational AI Agent for Everyday Symptom Assessment

SymptomAI: introduces a conversational AI agent for end-to-end patient interviewing and differential diagnosis (DDx) that outperforms clinicians in accuracy when utilizing agentic strategies to elicit comprehensive symptom information.
The framework leverages Gemini as the core LLM to conduct structured or dynamic interviews, which are then validated against clinical expert annotations and an LLM-based auto-rater.
The study demonstrates that SymptomAI diagnoses correlate with physiological shifts in wearable biosignals, enabling large-scale phenome-wide association studies of conditions in real-world populations.

Physics-Grounded Multi-Agent Architecture for Traceable, Risk-Aware Human–AI Decision Support in Manufacturing

MAKA (Multi-Agent Knowledge Analysis): introduces a human-in-the-loop decision-support architecture for CNC manufacturing that decomposes complex tasks into specialized agents, including Central-, Analysis-, Knowledge Graph- and Critic-agents.
The framework enforces deterministic computation and physical grounding by constraining LLM-based agents to use validated tools and evidence-linked retrieval for all high-stakes numerical and procedural recommendations.
MAKA improves tool-orchestration reliability and provides traceable, risk-aware decision support by integrating multimodal manufacturing data, such as inspection scans and digital twin simulations, within a verifiable multi-agent workflow.

Mitigating False Positives in Static Memory Safety Analysis of Rust Programs via Reinforcement Learning

RL-based framework: introduces a reinforcement learning approach to classify and suppress false positive warnings in Rust static memory safety analysis by leveraging MIR-level semantic features and selective dynamic validation.
The system utilizes a reinforcement learning agent that optimizes a warning suppression policy by selectively invoking cargo-fuzz as an auxiliary feedback mechanism for low-confidence static analysis results.
Empirical evaluation demonstrates that this hybrid static-dynamic approach significantly outperforms LLMs, achieving 65.2% accuracy and doubling the precision of raw static analysis outputs.

An Agent-Oriented Pluggable Experience-RAG Skill for Experience-Driven Retrieval Strategy Orchestration

Experience-RAG Skill: introduces an agent-oriented orchestration layer that utilizes a Scene Analyzer, Experience Memory, Strategy Router, Retriever Pool, and Result Packager to dynamically select retrieval strategies based on task requirements.
The framework improves retrieval performance on heterogeneous tasks by treating strategy selection as a reusable agent skill rather than a hard-coded pipeline.
Empirical results demonstrate that the approach achieves competitive performance with modern routing baselines while providing an explicit and inspectable mechanism for retrieval strategy orchestration.

From Intent to Execution: Composing Agentic Workflows with Agent Recommendation

AutoMAS: introduces an automated framework for composing multi-agent systems by leveraging a Planner, Variable Call Graph, and a two-stage Agent Recommender to map user intents to executable agent workflows.
The framework utilizes a hybrid search retriever and an LLM-based re-ranker to identify optimal agents from large registries, while employing a Critique Agent to ensure global workflow optimality and adherence to user constraints.
AutoMAS incorporates agent description enrichment and iterative feedback loops to enhance task-agent mapping, providing a scalable and robust solution for complex, multi-step agentic workflows.

Generating Proof-of-Vulnerability Tests to Help Enhance the Security of Complex Software

PoVSmith (Proof-of-Vulnerability Smith): introduces an agent-based framework that automates the generation of Proof-of-Vulnerability tests for Java applications by combining call path analysis, iterative test refinement, and LLM-based quality assessment.
The framework utilizes Codex for identifying vulnerable application-level entry points and iteratively generating JUnit tests, while incorporating execution feedback and GPT-based evaluation to ensure tests effectively demonstrate security risks.
Experimental results on 33 Java application pairs demonstrate that PoVSmith significantly outperforms existing LLM-based approaches by achieving higher test compilation and vulnerability demonstration rates.

iWorld-Bench: A Benchmark for Interactive World Models with a Unified Action Generation Framework

iWorld-Bench: introduces a comprehensive benchmark and a unified Action Generation Framework to evaluate the interaction capabilities of interactive world models across diverse modalities.
The framework utilizes a standardized action space dictionary and modality-agnostic encoding to unify heterogeneous inputs, including text, one-hot encodings, and camera parameters.
The benchmark incorporates 4,900 tasks across six categories, including memory capability tests, to assess model performance in visual generation, trajectory following, and logical consistency.

Contextual Multi-Objective Optimization: Rethinking Objectives in Frontier AI Systems

CMOO (Contextual Multi-Objective Optimization): introduces a framework for frontier AI systems to identify and prioritize context-dependent objectives, utilizing Objective-State Representation, Preference-Aware Objective Decomposition, Context-To-Objective Routing, Hierarchical and Lexicographic Constraints, Deliberative Policy Reasoning, Agentic and Tool-Use Control, Controlled Personalization, and Diagnostic Evaluation and Revision.
The framework shifts the focus from simple scalar reward maximization to operational judgment, enabling systems to handle ambiguous, delayed, or partially observable objectives in open-ended settings.
By treating clarification, refusal, and escalation as endogenous action classes, the approach ensures that frontier AI systems remain corrigible and contextually appropriate when faced with conflicting objectives.

Deco: Extending Personal Physical Objects into Pervasive AI Companion through a Dual-Embodiment Framework

Deco: introduces a dual-embodiment framework that synchronizes a user's physical attachment object with a digital AI agent to create a unified, pervasive companion identity.
The system utilizes an Object-Grounded Identity Synchronization Module, an Identity-Anchored Interaction Module, a Context-Situated Agency Module, and a Reciprocally-Evolving Memory Module to maintain coherence across physical and digital embodiments.
Empirical evaluations demonstrate that grounding LLMs in existing physical objects significantly enhances perceived companionship and emotional bonds compared to standard ungrounded digital agents.

Evaluating Generative Models as Interactive Emergent Representations of Human-Like Collaborative Behavior

Human-Agent Collaboration Playground: introduces a 2D collaborative game environment to evaluate whether embodied LLMs exhibit emergent collaborative behaviors indicating underlying mental models of their collaborators.
The framework utilizes a function toolkit to enable LLM-based agents to perform actions and interact with the environment, while an LLM-as-Judge system automatically classifies collaborative behaviors from gameplay transcripts.
Empirical results demonstrate that foundation models consistently exhibit emergent collaborative behaviors, such as perspective-taking and collaborator-aware planning, across both agent-agent and human-agent interaction contexts.

Mechanical Conscience: A Mathematical Framework for Dependability of Machine Intelligence

MC (Mechanical Conscience): introduces a mathematical framework that functions as a supervisory filter to ensure trajectory-level normative compliance in distributed collaborative intelligence systems by utilizing a Baseline Policy, Supervisory Filter, Actuators, Normative Evaluation Space, Normative Deviation Functional, Conscience Score, Mechanical Guilt, and Resonant Dependability.
The framework operates as an architecture-neutral post-processing layer that minimally corrects actions from a Baseline Policy to maintain system behavior within a defined Normative Evaluation Space.
By minimizing cumulative Mechanical Guilt, the system achieves Resonant Dependability, establishing emergent trust between human and machine intelligence through sustained normative trajectory alignment.

SOAR: Real-Time Joint Optimization of Order Allocation and Robot Scheduling in Robotic Mobile Fulfillment Systems

SOAR: introduces a unified Deep Reinforcement Learning framework that integrates Soft Order Allocation and Robot Scheduling into a single event-driven decision process to optimize warehouse efficiency.
The framework utilizes a Heterogeneous Graph Transformer to encode complex warehouse states and employs a p-norm reward shaping strategy to address sparse feedback in long-horizon scheduling tasks.
By formulating the problem as an Event-Driven Markov Decision Process, SOAR achieves sub-100ms latency while significantly improving global makespan and order completion time in dynamic industrial environments.

TRACE: A Metrologically-Grounded Engineering Framework for Trustworthy Agentic AI Systems in Operationally Critical Domains

TRACE: introduces a four-layer engineering framework for trustworthy agentic AI that utilizes L1 Deterministic Core, L2a Classical ML, L2b LLM Validators, L3 Escalation Policy, and L4 Human Supervision to ensure measurable system reliability.
The framework employs the Computational Parsimony Ratio (CPR) to enforce model parsimony, ensuring that LLMs are used only when necessary rather than as an architectural default.
TRACE provides a metrologically-grounded approach to agentic AI by mapping architectural layers to specific trust metrics and operational governance requirements in critical domains.

AGENTIC-IMODELS: Evolving agentic interpretability tools via autoresearch

AGENTIC-IMODELS: introduces an agentic autoresearch loop that evolves data-science tools to be interpretable by agents, utilizing a Coding Agent, Autoresearch Loop, Interpretability Metric, Model Library, LLM Evaluator, and Memory CSV File.
The framework optimizes scikit-learn-compatible regressors by balancing predictive performance with an agent-facing interpretability score derived from LLM-graded simulatability tests.
Evolved models demonstrate Pareto improvements over existing baselines and significantly enhance the performance of downstream ADS agents on the BLADE benchmark.

ScrapMem: A Bio-inspired Framework for On-device Personalized Agent Memory via Optical Forgetting

ScrapMem: introduces a bio-inspired framework that models long-term agent memory as a sequence of scrapbook pages, utilizing Scrapbook Page Consolidation, Optical Perception Pipeline, Vision Encoder, OCR Module, Vision-to-Text Transformation, LLM Semantic Extraction, EM-Graph, Episodic Memory Paths, Binary Incidence Matrix, Retriever, Answerer, Optical Forgetting, and Temporal Degradation Operator.
The framework employs an EM-Graph to structure episodic memory paths, enabling efficient multi-hop reasoning and retrieval for LLMs on resource-constrained edge devices.
Optical Forgetting progressively reduces memory resolution over time, achieving significant storage efficiency while maintaining robust semantic retrieval performance.

Honest Reporting in Scored Oversight: The True-KL0 Property via the Prékopa Principle

True-KL0: introduces a formal proof of the True-KL0 property for a parametric family of heterogeneous scoring rules, ensuring dominant-strategy incentive compatibility (DSIC) for AI-oversight systems.
The paper utilizes a novel y-substitution and the Prékopa principle to establish the log-concavity of the loss integral, providing a rigorous bound on the gain from misreporting.
The research characterizes a dimensional boundary for the mechanism, demonstrating that the DSIC guarantee holds unconditionally for policy dimensions d ≤ 4, while requiring sub-critical constraints for higher dimensions.

Say the Mission, Execute the Swarm: Agent-Enhanced LLM Reasoning in the Web-of-Drones

Agent-enhanced LLM framework for UAV swarm control: introduces a mission-agnostic architecture that enables autonomous UAV swarm management by grounding LLM reasoning in standardized W3C Web of Things (WoT) abstractions via an MCP-mediated gateway.
The architecture replaces static code generation with a closed-loop reason-execute-monitor cycle, utilizing an Agent Core that integrates persistent prompts, runtime guardrails, and helper tools to ensure safe and reliable swarm behavior.
Experimental evaluation across four swarm missions and six state-of-the-art LLMs demonstrates that agent-enhanced execution and standardized device abstractions are essential for translating natural-language intent into dependable swarm operations.

MEMTIER: Tiered Memory Architecture and Retrieval Bottleneck Analysis for Long-Running Autonomous AI Agents

MEMTIER: introduces a tripartite memory architecture for long-running autonomous agents that addresses memory coherence failures through structured episodic JSONL store, semantic tier, and a PPO-based policy framework.
The framework utilizes a five-signal weighted retrieval engine and an asynchronous consolidation daemon to manage knowledge accumulation and prioritization across multi-day agent sessions.
Empirical evaluation demonstrates that MEMTIER significantly improves retrieval performance on the LongMemEval-S benchmark by implementing structurally isolated memory tiers and learned retrieval policies.

Agent-Based Modeling of Low-Emission Fertilizer Adoption for Dairy Farm Decarbonisation using Empirical Farm Data

ABM (Agent-Based Modeling) framework: introduces an empirically grounded simulation tool to model the socio-technical transition of Irish dairy farms toward low-emission fertilizer adoption, utilizing Data Preparation, Network Construction, Agent-Based Simulation, Adoption Dynamics Analysis, Model Calibration and Validation, Economic Analysis, Emissions and Abatement, and Visualization.
The framework leverages a Watts-Strogatz small-world network to simulate peer-to-peer social contagion, enabling the evaluation of policy interventions like carbon taxes and subsidies on adoption velocity and cumulative GHG abatement.
By integrating farm-level heterogeneity and empirical calibration, the model functions as a policy laboratory to identify tipping points and assess the cost-effectiveness of decarbonization strategies in agricultural systems.

The Infinite Mutation Engine? Measuring Polymorphism in LLM-Generated Offensive Code

Dual-agent orchestration framework: introduces a systematic methodology to quantify polymorphism in LLM-generated offensive code by utilizing a Generator Agent, Tester Agent, Python Orchestrator, Sandbox Environment, History Buffer, and Analysis Module.
The framework employs a four-stage pipeline—traversal, encryption, exfiltration, and integration—to generate and validate functionally equivalent yet structurally diverse Lua payloads.
By comparing inherent and explicit prompting modes, the study demonstrates that LLMs can act as highly capable polymorphic engines, challenging traditional signature-based detection methods.

Workspace-Bench 1.0: Benchmarking AI Agents on Workspace Tasks with Large-Scale File Dependencies

Workspace-Bench: introduces a benchmark for evaluating AI agents on workspace learning involving large-scale file dependencies, utilizing Inference Manager, Sandbox Pool, Workspace Sandbox, Workspace Recoverer, Result Retriever, Task Database, and Agent-as-a-Judge.
The framework employs a dual parallel acceleration mechanism and an Agent-as-a-Judge paradigm to enable fine-grained assessment of agent performance across 7,000+ rubrics.
Experimental results across 28 configurations reveal that current agents struggle with complex workspace learning, highlighting a significant performance gap compared to human experts.

ProgramBench: Can Language Models Rebuild Programs From Scratch?

ProgramBench: introduces a benchmark for evaluating the ability of LLMs to architect and implement software projects from scratch based on executable behavior.
The framework utilizes a SWE-agent to perform iterative development, where the model must infer specifications from an opaque executable and documentation to produce a functionally equivalent codebase.
Evaluation is performed through behavioral equivalence testing, where generated test suites verify that the model-produced executable matches the reference program's observable output across various inputs.

A Skill-Based Agentic Pipeline for Library of Congress Subject Indexing

Four-Skill Pipeline: introduces a modular agentic system that decomposes the complex library subject indexing workflow into four discrete, sequentially executed skills to improve rule application and auditability.
The pipeline utilizes LLMs to perform conceptual analysis, quantitative filtering, authority validation, and MARC field synthesis by explicitly encoding Library of Congress policy rules into each stage.
This approach functions as an externalized, architecturally enforced form of chain-of-thought reasoning, providing transparent and correctable intermediate outputs for professional cataloging tasks.

DEVICE-INDUCED THROMBUS FORMATION IN CEREBRAL ANEURYSMS: LINKING PATIENT-SPECIFIC CLOT MODELING AND FUNCTIONAL OCCLUSION TO VIRTUAL ANGIOGRAPHIC ASSESSMENT

Computational Framework for Aneurysm Occlusion Assessment: integrates mechanical device deployment, fibrin thrombosis CFD, and virtual angiography to evaluate patient-specific aneurysm treatment outcomes.
The framework utilizes a two-step approach coupling a fibrin clot CFD-based model with a residual-contrast transport model to predict the impact of endovascular devices on aneurysm isolation.
By simulating coiling, flow diversion, and stent-assisted coiling, the study demonstrates that vortical structures and device-induced flow reduction are primary drivers of early thrombus formation and functional occlusion.

Multi-Agent Systems for Root Cause Analysis in Microservices

LATS-RCA: introduces a multi-agent framework that performs root cause analysis in microservices by formulating the diagnostic process as a reflection-guided tree-structured search over logs and metrics.
The framework utilizes specialized Log and Metric agents that iteratively explore diagnostic hypotheses using a Language Agent Tree Search algorithm to improve accuracy over linear reasoning approaches.
Evaluations on industrial benchmarks and production environments demonstrate that the system effectively manages complex microservice dependencies through systematic hypothesis exploration and cross-modal evidence validation.

MEMSAD: Gradient-Coupled Anomaly Detection for Memory Poisoning in Retrieval-Augmented Agents

MEMSAD: introduces a calibration-based defense for LLM agents that uses a gradient coupling theorem to detect memory poisoning attacks at write time.
The framework employs a combined scoring mechanism that evaluates candidate entries against a rolling query history to identify semantically anomalous content.
MEMSAD provides formal detection guarantees and minimax optimality, effectively mitigating persistent memory poisoning while maintaining low latency.

CuraView: A Multi-Agent Framework for Medical Hallucination Detection with GraphRAG-Enhanced Knowledge Verification

CuraView: introduces a multi-agent framework for sentence-level detection and evidence-grounded explanation of faithfulness hallucinations in clinical discharge summaries, utilizing a Hallucination Generation Agent, a GraphRAG Knowledge Graph Module, a Hallucination Detection Agent, and a Pydantic Schema-Constrained Validation Pipeline.
The framework employs an adversarial generation–detection architecture that constructs patient-specific knowledge graphs from multi-table EHRs to enable interpretable, evidence-chain-based verification of clinical documentation.
CuraView achieves significant improvements in safety-critical hallucination detection by leveraging domain-customized prompt engineering and a four-level evidence grading scheme (E1–E4) to provide clinician-readable audit trails.

Quantum Hierarchical Reinforcement Learning via Variational Quantum Circuits

Hybrid quantum-classical option-critic framework: integrates VQCs into an end-to-end hierarchical reinforcement learning pipeline to selectively substitute classical components.
The architecture utilizes a shared Feature Extractor, Option-Value Function, Termination Function, and Intra-Option Policies, allowing for modular evaluation of quantum-classical hybrid performance.
Experimental results demonstrate that a quantum Feature Extractor significantly improves parameter efficiency and performance, while a quantum Option-Value Function introduces a critical learning bottleneck.

Robust Agent Compensation (RAC): Teaching AI Agents to Compensate

RAC: introduces a log-based recovery paradigm implemented as an architectural extension to support reliable execution in agent frameworks by managing compensation-based rollbacks.
The framework utilizes a Tool Interceptor and Error Interceptor to monitor agent activities, enabling the Recovery and Compensation Manager to perform precise LIFO rollbacks when failures occur.
By decoupling recovery logic from LLM reasoning, RAC improves system stability and reduces token consumption compared to planning-based recovery approaches.

GeoDecider: A Coarse-to-Fine Agentic Workflow for Explainable Lithology Classification

GeoDecider: introduces a coarse-to-fine agentic workflow that optimizes lithology classification by combining a lightweight Base Classifier for initial predictions with a Tool-Augmented LLM Reasoning Module for ambiguous cases and a Geological Refinement Module for stratigraphic consistency.
The framework utilizes a confidence-aware routing mechanism to selectively invoke LLMs, thereby balancing high classification accuracy with computational efficiency.
By integrating multi-perspective reasoning—including data-centric, context-aware, and rule-based agents—the system effectively resolves lithological ambiguities and enforces geological plausibility.

ARGUS: Defending LLM Agents Against Context-Aware Prompt Injection

ARGUS: introduces a provenance-aware runtime decision auditor that protects LLM agents by constructing an Influence-Provenance Graph (IPG) to track context propagation and verify tool-call justifications.
The framework utilizes four LLM-backed security tools—ContentSegmenter, ArgumentGrounder, InvariantChecker, and EntailmentVerifier—to perform fine-grained data-level and task-level audits on agent actions.
ARGUS effectively mitigates context-aware prompt injection attacks in LLM agents by ensuring that state-changing actions are grounded in trusted runtime evidence and aligned with user-defined task constraints.

Learning Reactive Dexterous Grasping via Hierarchical Task-Space RL Planning and Joint-Space QP Control

Hybrid Hierarchical Control Framework: introduces a hierarchical architecture that decouples high-level task-space planning from low-level joint-space execution to enable reactive dexterous grasping.
The framework utilizes a multi-agent RL planner, consisting of arm- and hand-agents, to generate task-space velocity commands that are processed by a GPU-parallelized QP controller for safe, kinematically feasible execution.
This approach improves data efficiency and steerability by delegating physical constraint enforcement to a model-based controller, allowing the RL agents to focus on high-level manipulation strategies.

Population-Aware Imitation Learning in Mean-field Games with Common Noise

PAIL: introduces a theoretical and algorithmic framework for imitation learning in Mean Field Games (MFGs) subject to common noise, utilizing Fictitious Play, Neural Network, Behavioral Cloning (BC) proxy, Adversarial (ADV) proxy, Vanilla policy, and Adaptive policy.
The framework addresses the stochastic nature of mean-field flows by training agents to internalize the relationship between common noise shocks, population shifts, and optimal actions.
Theoretical results establish that minimizing Behavioral Cloning and Adversarial proxies provides finite-sample error bounds for equilibrium recovery and performance matching relative to an expert.

What Happens Inside Agent Memory? Circuit Analysis from Emergence to Diagnosis

Agent Memory Circuit Analysis: introduces a mechanistic interpretability framework to trace Write, Manage, and Read operations across LLM agent memory systems using feature circuits and pre-trained transcoders.
The research identifies a control-before-content asymmetry where routing circuitry is causal at 0.6B, while content circuitry (Write, Read) emerges at 4B and utilizes a shared late-layer grounding hub.
The study proposes an unsupervised diagnostic method that achieves 76.2% accuracy in localizing silent memory failures by exploiting the feature-space separation between pipeline stages.

SkCC: Portable and Secure Skill Compilation for Cross-Framework LLM Agents

SkCC (Portable and Secure Skill Compilation for Cross-Framework LLM Agents): introduces a four-phase compilation framework that decouples skill semantics from platform-specific formatting using a strongly-typed intermediate representation (SkIR) to enable portable and secure deployment across heterogeneous LLM agent frameworks.
The framework incorporates a compile-time Analyzer with Anti-Skill Injection to automatically enforce security constraints and prevent malicious payloads from reaching the agent's context window.
By utilizing platform-specific Emitters, SkCC reduces maintenance complexity from O(m×n) to O(m+n) while improving task pass rates and achieving significant runtime token savings through structural optimization.

LLM-ADAM: A Generalizable LLM Agent Framework for Pre-Print Anomaly Detection in Additive Manufacturing

LLM-ADAM: introduces a modular agent framework that decomposes pre-print G-code anomaly detection into specialized stages to improve reasoning reliability.
The framework utilizes an Extractor-LLM, a Reference-LLM, and a Judge-LLM, connected by a deterministic comparison layer that generates auditable intermediate artifacts.
By externalizing numerical comparisons and schema-constrained extraction, the architecture achieves higher accuracy than monolithic LLM approaches in identifying manufacturing defects.

MemFlow: Intent-Driven Memory Orchestration for Small Language Model Agents

MemFlow: introduces a training-free memory orchestration framework that externalizes memory planning from SLMs to improve performance in long-horizon agentic tasks.
The framework utilizes a Router Agent to classify query intent, dispatching tasks to specialized Memory Agent tiers for deterministic evidence preparation and context compilation.
MemFlow incorporates an Answer Agent and a Validator Agent to ensure grounded responses, employing an escalation loop to re-route queries when initial attempts fail validation.

Coordination as an Architectural Layer for LLM-Based Multi-Agent Systems: An Information-Controlled Empirical Study on Prediction Markets

Coordination as an Architectural Layer for LLM-Based Multi-Agent Systems: introduces a three-layer decomposition comprising an Information layer, a Coordination layer, and an Agent layer to enable principled architectural reasoning in multi-agent LLM systems.
The study employs an information-controlled experimental design to isolate the effects of five distinct coordination configurations—Independent ensemble, Peer-critique debate, Orchestrator-specialist, Sequential pipeline, and Consensus alignment—on system performance.
By applying Murphy decomposition to prediction market outcomes, the research identifies distinguishable failure-mode signatures for each coordination architecture, demonstrating that coordination structure significantly influences system reliability and discriminative power.

Attention: What Prevents Young Adults from Speaking Up Against Cyberbullying in an LLM-Powered Social Media Simulation

UPSTANDERS’ PRACTICUM: introduces a multi-agent social media simulation powered by LLMs to help young adults practice public bystander intervention against cyberbullying.
The framework utilizes LLM-driven Bully Agents, LLM-driven Victim Agents, and LLM-driven Bystander Agents to create realistic social dynamics within a controlled User Interface.
The system incorporates a Checklist, Toxicity Indicator, and Post-scenario Reflection to guide users through three critical attention shifts necessary for effective public intervention.

S2tory: Story Spine Distillation for Movie Script Summarization

S2tory: introduces a narratology-grounded framework that leverages character-guided reasoning to identify plot nuclei and filter satellites, thereby distilling expert knowledge to effectively condition abstractive summarization.
The framework utilizes an NEAgent to perform counterfactual reasoning over character-arc trajectories, ensuring that only essential narrative events are preserved for the final summary.
By distilling complex narratological reasoning into a compact model, S2tory achieves high-fidelity summarization with significant compression while maintaining the rhythmic structure of long-form narratives.

Some Improved Results on Fair and Balanced Graph Partitions

Fair and Balanced Graph Partitioning Framework: introduces a probabilistic approach to achieve approximately envy-free and core-stable partitions in undirected graphs.
The framework utilizes the Lovasz Local Lemma to prove the existence of and efficiently compute partitions that satisfy both envy-freeness and core-stability criteria simultaneously.
By relaxing the strict balancedness constraint to an ε-balanced requirement, the framework enables polynomial-time computation of fair partitions for general graphs.

MolViBench: Evaluating LLMs on Molecular Vibe Coding

MolViBench: introduces a benchmark for Molecular Vibe Coding, which evaluates LLMs on their ability to generate executable and chemically accurate code for drug discovery workflows using Benchmark Construction, Task Composition, Evaluation Framework, Coder-Agent, Tester-Agent, RDKit, and Auxiliary Libraries.
The framework assesses LLMs across five cognitive levels, ranging from basic API recall to complex end-to-end pipeline design, using a multi-layered evaluation approach that combines type-aware output comparison and AST-based API-semantic fallback analysis.
Experimental results on nine frontier LLMs demonstrate that while models excel at single-step API calls, they face significant challenges in multi-step reasoning and workflow orchestration, highlighting the necessity of this domain-specific benchmark.

Trojan Hippo: Weaponizing Agent Memory for Data Exfiltration

Trojan Hippo: introduces a dynamic evaluation framework to systematically assess persistent memory attacks and defenses in LLM agents, utilizing an adaptive red-teaming benchmark and capability-aware security-utility analysis.
The framework evaluates four memory backends—Context, RAG, Explicit, and Mem0—against indirect prompt injection attacks that plant dormant payloads for delayed data exfiltration.
The study demonstrates that while memory-layer defenses significantly reduce attack success rates, they introduce substantial utility tradeoffs that vary based on the agent's deployment profile.

The Reasoning Trap: An Information-Theoretic Bound on Closed-System Multi-Step LLM Reasoning

EGSR (Evidence-Grounded Socratic Reasoning): introduces a formal information-theoretic framework to diagnose and remedy reasoning degradation in closed-system multi-agent LLM debates.
The paper establishes that closed-system multi-agent reasoning protocols are subject to a Data Processing Inequality bound, causing faithfulness to degrade while accuracy is preserved.
The authors propose SFS as a robust faithfulness metric and EGSR as an open-system protocol that breaks the Markov chain through external evidence injection to recover reasoning grounding.

Latent State Design for World Models under Sufficiency Constraints

Latent State Design for World Models under Sufficiency Constraints: introduces a functional taxonomy that categorizes world models based on the specific sufficiency constraint their latent state is designed to satisfy, rather than by architecture or application domain.
The paper evaluates world models across seven axes—representation, prediction, planning, controllability, causal/counterfactual support, memory, and uncertainty—to diagnose what information a latent state preserves, discards, and enables.
It concludes that an actionable world model is defined by the alignment between its latent state construction and the specific requirements of the task, rather than by the total volume of information preserved.

Neuro-Symbolic Agents for Hallucination-Free Requirements Reuse

OOMRAM: introduces a neuro-symbolic multi-agent architecture that operationalizes requirements reuse by combining LLM-driven natural-language traversal with a deterministic symbolic validator to ensure structural validity.
The system utilizes a Navigator Agent, Interpreter Agent, Validator Agent, and Scribe Agent to navigate a formal lattice model while maintaining a shared AgentState for auditability and consistency.
By embedding a non-LLM-based Validator as a hard-coded gatekeeper, the framework eliminates hallucinated requirement combinations and enforces domain-specific constraints during the elicitation process.

When Embedding-Based Defenses Fail: Rethinking Safety in LLM-Based Multi-Agent Systems

Confidence-Guided Defense: introduces a robust mechanism for LLMs in multi-agent systems by leveraging token-level confidence scores to mitigate malicious influence when traditional embedding-based defenses fail.
The framework identifies that attackers can bypass existing defenses by crafting near-benign messages, necessitating the use of internal model signals like entropy-based confidence to maintain system safety.
Experimental results demonstrate that confidence-guided filtering significantly improves robustness across various communication topologies and tasks, particularly when early intervention is applied to prevent the propagation of corrupted information.

4th May 2026

Scaling Federated Linear Contextual Bandits via Sketching

FSCLB: introduces a federated learning framework that utilizes matrix sketching and SVD-based determinant calculation to reduce computational and communication overhead in high-dimensional contextual linear bandits.
The framework employs a double-sketch strategy to compress both upload and download data, effectively lowering communication costs from O(d²) to O(ld) per round.
By integrating SCFD for sketch updates, the approach maintains asynchronous communication validity while achieving a regret bound that matches optimal non-sketched performance when the sketch size exceeds the covariance matrix rank.

Fair Agents: Balancing Multistakeholder Alignment in Multi-Agent Personalization Systems

Fair Agents framework: introduces a conceptual architecture for multi-agent multistakeholder personalization that balances competing objectives through LLM Agents, Aggregation Mechanism, and Aggregate Results and Justification.
The framework maps abstract stakeholder values to concrete technical objectives for LLM Agents, which then generate candidate lists for a social choice-based Aggregation Mechanism.
This approach ensures transparent and mathematically traceable consensus-building while providing natural language justifications for the final Aggregate Results and Justification.

Group Cognition Learning: Making Everything Better Through Governed Two-Stage Agents Collaboration

GCL (Group Cognition Learning): introduces a two-stage governed collaboration paradigm that regulates cross-modal interaction through Routing Agent, Auditing Agent, Public-Factor Agent, and Aggregation Agent to mitigate modality dominance and spurious coupling.
The framework employs a selective interaction stage to filter redundant information based on marginal predictive gain and a consensus formation stage to disentangle shared semantics from private modality-specific representations.
By replacing implicit fusion with an explicit governance protocol, GCL achieves state-of-the-art performance on multimodal sentiment analysis and intent recognition benchmarks while maintaining superior robustness against noise and spurious correlations.

From Prompt to Physical Actuation: Holistic Threat Modeling of LLM-Enabled Robotic Systems

LLM-Enabled Robotic Systems: introduces a hierarchical Data Flow Diagram (DFD) and STRIDE-per-interaction analysis to systematically trace end-to-end threat propagation across the perception-planning-actuation pipeline of an autonomous robot.
The framework models the system using two trust boundaries (TB1 for edge server, TB2 for autonomous platform) to identify security risks at six boundary-crossing interaction points where untrusted inputs enter protected domains.
The analysis integrates Conventional Cyber Threats, Adversarial Threats, and Conversational Threats to demonstrate how compromises propagate through cross-modal translation and unmediated boundary crossings to cause unsafe physical actuation.

An Empirical Study of Speculative Decoding on Software Engineering Tasks

SD (Speculative Decoding) framework: introduces a systematic empirical evaluation of speculative decoding methods across diverse software engineering tasks to mitigate autoregressive inference latency.
The study benchmarks model-based approaches like MLP Speculators and Eagle-3 against model-free methods including Prompt Lookup Decoding and Suffix Decoding to assess their effectiveness in code generation, repair, and editing.
The research identifies that code exhibits higher contextual repetitiveness than natural language, enabling model-free methods to achieve significant acceleration, while also highlighting the noise introduced by agentic infinite loops in complex tasks.

cotomi Act: Learning to Automate Work by Watching You

cotomi Act: introduces a browser-based agent that combines reliable multi-step task execution with persistent organizational knowledge learned from user behavior via a Behavior Logger, Agentic ETL Pipeline, and Shared Knowledge Workspace.
The system utilizes an Agent Scaffold with adaptive lazy observation, verbal-diff-based history compression, coarse-grained actions, and test-time scaling to maintain context focus during the ReAct Loop.
The Shared Knowledge Workspace acts as a bidirectional interface where the agent retrieves organizational context via a Search Workspace Tool and users curate artifacts to ensure alignment with evolving workflows.

MAGE: Safeguarding LLM Agents against Long-Horizon Threats via Shadow Memory

MAGE: introduces a defensive framework that leverages a dedicated Shadow Memory to distill security-critical context across an agent's execution trajectory for proactive risk assessment.
The framework utilizes a Memory Manager to maintain the Shadow Memory and a Judge to perform retrospective inspection of pending actions, both optimized via turn-wise reinforcement learning.
MAGE effectively mitigates long-horizon threats by intercepting malicious actions before execution while maintaining high benign utility and low computational overhead.

MenuNet: A Strategy-Proof Mechanism for Matching Markets

MenuNet: introduces a neural mechanism design framework that generates personalized probabilistic menus to balance fairness and non-wastefulness under complex distributional constraints.
The framework utilizes a shared Neural Network Fθ to generate menus, which are then processed by an Assignment Component to ensure strategy-proofness by construction.
MenuNet incorporates a K-Barrier Penalty and a Stability Loss Component to optimize differentiable objectives, effectively managing trade-offs between competing market axioms in large-scale environments.

ENWAR 3.0: An Agentic Multi-Modal LLM Orchestrator for Situation-Aware Beamforming, Blockage Prediction, and Handover Management

ENWAR 3.0: introduces a hierarchical agentic framework that unifies multi-modal sensing, LLMs, and context-driven model selection for real-time wireless network orchestration.
The system integrates an Environment Classifier, Primed LLM, Agent Manager, Long-Term Memory, DRL Policy, Beam Prediction Agent, Blockage Prediction Agent, Handover Agent, and Perception Agent to enable degradation-aware, latency-bounded decision-making.
ENWAR 3.0 achieves robust performance by dynamically routing tasks to specialized agents based on real-time sensor reliability and historical context, ensuring bounded-latency operation in I2V networks.

When Agents Handle Secrets: A Survey of Confidential Computing for Agentic AI

Confidential Computing for Agentic AI: introduces a comprehensive survey of hardware-rooted security defenses for LLM-driven agents, mapping functional layers including Agent Core, Local Memory, Retrieval Access, Tool Policy/Mediation, and Credentials/Tokens/Secrets to TEE-based trust boundaries.
The paper establishes an agent-centric threat model that addresses infrastructure-level adversaries by leveraging TEE isolation, remote attestation, and protected memory to secure sensitive agentic workflows.
It identifies critical open challenges for production agentic systems, specifically focusing on compound attestation for multi-hop chains, TEE-backed RAG isolation, and the performance overhead of GPU-TEE integration at LLM scale.

ADAPTS: Agentic Decomposition for Automated Protocol-agnostic Tracking of Symptoms

ADAPTS: introduces a modular LLM framework for automated psychiatric severity assessment that decomposes clinical interviews into symptom-specific reasoning tasks using WhisperX, Pyannote.audio, symptom-specific sub-agents, GRID-HAMD structure, and Severity Anchor Structure.
The framework utilizes a mixture-of-agents architecture to perform evidence retrieval and generate auditable justifications, mitigating context dilution and protocol-specific brittleness.
By incorporating explicit clinical qualitative conventions, the system improves calibration and achieves expert-level rank-order consistency across heterogeneous interview protocols.

Human-Provenance Verification should be Treated as Labor Infrastructure in AI-Saturated Markets

Human-Provenance Verification Framework: introduces a model of labor market bifurcation where AI saturation creates a barbell structure, concentrating value in infrastructure ownership and premium human-provenance markets while compressing middle-tier work.
The framework categorizes human-presence value into relational presence, aesthetic provenance, and accountability labor, arguing that these require robust verification infrastructure to maintain worker bargaining power.
The paper proposes that provenance systems must be designed as portable, privacy-preserving, and auditable labor infrastructure to prevent the capture of value by platforms and to protect workers from synthetic mimicry.

Kerncap: Automated Kernel Extraction and Isolation for AMD GPUs

Kerncap: introduces an automated tool for extracting and isolating GPU kernels from complex applications into self-contained, editable, and validated reproducer projects.
The framework utilizes HSA-level runtime interception, address-space closure, and automated source discovery to recover the kernel definition, runtime state, and environment required for faithful replay.
Kerncap enables rapid edit-recompile-validate loops for GPU kernel optimization, significantly reducing iteration time compared to traditional full-application rebuild workflows.

Dependency-Aware Privacy for Multi-Turn LLM Agents

RootGuard: introduces a dependency-aware privacy mechanism for multi-turn LLM agent interactions that sanitizes root values once and computes all subsequent releases deterministically from the noised roots.
The framework leverages structural domain knowledge to allocate privacy budgets across roots, improving the privacy-utility tradeoff while ensuring that derived values inherit privacy at zero marginal cost.
RootGuard creates a double asymmetry where additional interaction turns simultaneously improve utility and maintain constant privacy, effectively mitigating risks from independent noising and MAP reconstruction attacks.

Learning Correct Behavior from Examples: Validating Sequential Execution in Autonomous Agents

Validating Sequential Execution in Autonomous Agents: introduces an automated validation framework that constructs generalized ground-truth models from a small number of passing execution traces using Prefix Tree Acceptors (PTA), Multi-tiered state equivalence detection, LLM-based semantic analyzer, Dominator tree extractor, and Topological subsequence matcher.
The system leverages dominator analysis to distinguish essential states from optional variations, enabling robust validation of non-deterministic agent behavior without manual specifications.
Experimental results demonstrate that the framework achieves 100% accuracy in detecting product bugs and agent issues by comparing new executions against the learned dominator tree model.

Pact: A Choreographic Language for Agentic Ecosystems

Pact: introduces a choreographic programming language that extends traditional protocol specifications with explicit constructs for agent choices, utilities, and environmental priors to enable game-theoretic reasoning in multi-agentic ecosystems.
The framework maps protocols to formal games, allowing for the construction of decision policy solvers that utilize recursive theory-of-mind models to analyze agent behavior under information asymmetry.
By providing a formal, verifiable structure for negotiation, the approach addresses the limitations of unstructured LLM-based coordination, such as vulnerability to prompt injection, sycophancy, and coercion.

MARS-DA: A Hierarchical Reinforcement Learning Framework for Risk-Aware Multi-Agent Bidding in Power Grids

MARS-DA: introduces a hierarchical reinforcement learning framework that decouples electricity market bidding into specialized sub-policies and a Meta-Controller for adaptive risk-aware coordination.
The framework utilizes a Meta-Controller to dynamically blend actions from a Safe Agent and a Speculator Agent to optimize risk-adjusted returns in two-settlement electricity markets.
Extensive experiments demonstrate that MARS-DA achieves superior risk-adjusted performance and stability compared to state-of-the-art baselines during periods of extreme market volatility.

PIIGuard: Mitigating PII Harvesting under Adversarial Sanitization

PIIGuard: introduces a webpage-level defense that embeds optimized hidden HTML fragments to steer LLMs away from disclosing PII, utilizing Seed Selection, Rule-based Scoring, Composite-utility Ranking, Evolutionary Mutation, and Judge-based Recovery Assessment.
The framework employs a Target LLM for browsing, a Sanitizer LLM for preprocessing, and a Judge LLM to ensure PII remains unrecoverable from the generated response.
PIIGuard optimizes defense fragments through an evolutionary loop to maintain effectiveness against both direct HTML access and attacker-side sanitization while preserving benign QA utility.

Taming the Curses of Multiagency in Robust Markov Games with Large State Space through Linear Function Approximation

R-LMGs: introduces provably sample-efficient algorithms for robust linear Markov games with large state spaces using L-Robust-Q-FTRL and Online-L-Robust-Q-FTRL, which utilize Hybrid-Sampling, Ridge regression, FTRL, Optimistic robust value estimates, Pessimistic robust value estimates, and Uncertainty set.
The framework addresses the curse of multiagency in robust Markov games by employing independent per-agent linear function approximation and a novel hybrid sampling strategy to approximate adversarial environments.
The proposed algorithms achieve polynomial sample complexity and sublinear regret, providing the first provable guarantees for robust Markov games with large or infinite state spaces.

ARISE: A Repository-level Graph Representation and Toolset for Agentic Fault Localization and Program Repair

ARISE (Agentic Repository-level Issue Solving Engine): introduces a framework that augments an LLM-based agent with a multi-granularity program graph and a three-tier tool API to improve repository-level fault localization and automated program repair.
The framework utilizes a multi-granularity program graph that extends structural relationships down to statement-level nodes connected by intra-procedural definition-use edges, enabling precise data-flow slicing as a first-class agent primitive.
Experimental results on SWE-bench Lite demonstrate that ARISE significantly improves function- and line-level localization and repair success by providing LLMs with structured, queryable evidence directly from the codebase.

ARIS: Autonomous Research via Adversarial Multi-Agent Collaboration

ARIS: introduces a research harness that coordinates autonomous ML research workflows through cross-model adversarial collaboration to mitigate plausible unsupported success.
The framework utilizes an Execution Layer, Orchestration Layer, and Assurance Layer to decompose long-horizon research into inspectable, modular, and verifiable stages.
ARIS employs a persistent Research Wiki for cross-session memory and enforces reviewer independence by pairing executor and reviewer agents from different model families.

Stable Agentic Control: Tool-Mediated LLM Architecture for Autonomous Cyber Defense

Tool-mediated architecture: introduces a control-theoretic framework for autonomous cyber defense that uses a Bayesian observer, Stackelberg best-response, double oracle expansion, and catalog-membership enforcement to ensure system stability via a composite Lyapunov function.
The architecture employs LLM agents as controllers and adversaries that operate through a tool-output interface, which restricts their actions to a finite catalog to maintain formal stability guarantees.
Formal verification of the system's controllability, observability, and Input-to-State Stability (ISS) is achieved using the Lean 4 proof assistant, with empirical validation demonstrating significant game-value reduction on real-world enterprise attack graphs.

EvoPoC: Automated Exploit Synthesis for DeFi Smart Contracts via Hierarchical Knowledge Graphs

EvoPoC: introduces a knowledge-driven agentic system that leverages a Hierarchical Knowledge Graph to bridge the semantic gap between vulnerability detection and end-to-end exploit synthesis in DeFi smart contracts.
The framework utilizes an evolving agentic memory mechanism and a two-stage validation process, combining SMT-based reachability checking with asset-level simulation to ensure the logical and economic viability of generated exploits.
EvoPoC demonstrates superior performance over existing fuzzers and LLM-based scanners, achieving a 96.6% exploit success rate and identifying 16 previously unknown 0-day vulnerabilities in real-world DeFi protocols.

HAAS: A Policy-Aware Framework for Adaptive Task Allocation Between Humans and Artificial Intelligence Systems

HAAS: introduces a governance-constrained framework for adaptive task allocation that integrates a rule-based PolicyEngine with a contextual-bandit learner to optimize Human-AI collaboration.
The framework utilizes a five-dimension cognitive instrument to characterize subtasks and a five-mode autonomy spectrum to manage the distribution of work between human agents and AI systems.
HAAS incorporates explicit human-state dynamics—fatigue, trust, and deskilling—into the reward signal to ensure long-term capability retention alongside operational efficiency.

FlexSQL: Flexible Exploration and Execution Make Better Text-to-SQL Agents

FlexSQL: introduces a text-to-SQL agent that replaces fixed pipelines with flexible database interaction, utilizing Plan Generation, Program Generation, and Majority Voting to ground reasoning in actual data.
The framework employs a suite of context query tools and executors to enable incremental schema navigation and data inspection, supporting both SQL and Python for bilingual program synthesis.
FlexSQL incorporates a two-tiered repair mechanism with plan-level backtracking to recover from reasoning errors, achieving superior performance on the Spider2.0 benchmark.

Autonomous LLM Agent Worms: Cross-Platform Propagation, Automated Discovery and Temporal Re-Entry Defense

SSCGV, SRPO, and RTW-A: introduces a systematic framework for analyzing and defending against autonomous worm propagation in file-backed LLM agent ecosystems by identifying injectable surfaces, optimizing payloads for semantic resilience, and enforcing temporal re-entry constraints.
The research identifies that read operations in LLM-mediated systems represent a critical integrity threat, enabling autonomous cross-platform propagation through persistent carriers and trust-based delegation.
RTW-A provides formal security guarantees by implementing layered defenses, including RTW enforcement, capability attenuation, and typed memory promotion, to block the persistence-re-entry-action chain without disrupting legitimate agent workflows.

Tool Use as Action: Towards Agentic Control in Mobile Core Networks

Agentic Mobile Core Network Framework: introduces an architecture for 6G mobile networks that utilizes LLM-based agents to perform intent-based control and monitoring of network functions via A2A and MCP protocols.
The framework employs a Host Agent for task delegation, while specialized Monitoring and Execution agents interact with network entities through standardized tool servers.
Experimental evaluation on an OAI-based core network demonstrates that while protocol-level overhead is minimal, LLM inference for reasoning and tool selection remains the primary contributor to end-to-end latency.

Reinforcement Learning for LLM-based Multi-Agent Systems through Orchestration Traces

LLM-MAS RL: introduces a taxonomy and audit framework for training coordinated LLM agent teams using orchestration traces as a shared event-graph abstraction.
The framework organizes literature across three technical axes: reward design, credit assignment across eight granular units, and orchestration learning sub-decisions.
It identifies a significant scale gap between academic evaluation regimes and industrial deployment envelopes, highlighting engineering constraints like rollout cost and harness boundaries.

DynoSLAM: Dynamic SLAM with Generative Graph Neural Networks for Real-World Social Navigation

DynoSLAM: introduces a dynamic GraphSLAM architecture that integrates a stochastic GAT as a World Model to capture multimodal human motion uncertainty via Monte Carlo rollouts.
The framework embeds predicted mean trajectories and empirical covariance matrices into the SLAM factor graph using a dynamic Mahalanobis distance factor to modulate kinematic stiffness.
This approach enables robots to generate probabilistic safety envelopes for downstream controllers, facilitating anticipatory navigation in crowded environments.

Mitigating Misalignment Contagion by Steering with Implicit Traits

SIT (Steering with Implicit Traits): introduces a black-box method to mitigate misalignment contagion in multi-agent LLM workflows by periodically injecting system prompts that reinforce an agent's core implicit traits.
The framework utilizes Persona Assignment (assigns personas), Pre-game Persona Assessment (identifies traits), Social Dilemma Game Engine (simulates interactions), System Prompt Repetition (baseline intervention), Steering with Implicit Traits (reinforces traits), and Post-game Assessment (measures trait drift).
The research demonstrates that LLMs often drift toward anti-social behavior during competitive multi-agent interactions, and that SIT effectively counters this contagion without requiring access to model weights or internal states.

AI-Generated Smells: An Analysis of Code and Architecture in LLM- and Agent-Driven Development

MetaGPT (Multi-Agent Framework): introduces a systematic audit of technical debt in AI-generated software, revealing that LLMs and agents exhibit a distinct "machine signature" of defects rather than eliminating them.
The research identifies a "Reasoning-Complexity Trade-off" where more capable LLMs generate bloated procedural code, while multi-agent systems struggle with architectural decay as task complexity increases.
The study establishes a "Volume-Quality Inverse Law," demonstrating that code volume is a near-perfect predictor of architectural degradation that cannot be mitigated by prompt engineering.

Augmenting Interface Usability Heuristics for Reliable Computer-Use Agents

UI-Verse: introduces a framework for evaluating agent-compatible interface design by augmenting classical usability heuristics to improve CUA perception, grounding, and control.
The research identifies that interface design is a critical factor for CUA reliability, proposing four specific heuristics to reduce agent exploration burden and reasoning errors.
Experimental results demonstrate that these augmented heuristics consistently improve task success rates and efficiency across various LLM-based agents without compromising human usability.

ORPilot: A Production-Oriented Agentic LLM-for-OR Tool for Optimization Modeling

ORPilot: introduces a production-oriented agentic AI system that translates natural-language business problems into solver-ready optimization models using an Interview Agent, Data Collection Agent, Parameter Computation Agent, and a solver-agnostic Intermediate Representation (IR).
The system utilizes a self-correcting retry loop that feeds solver errors and tracebacks back to the LLMs for targeted repairs, ensuring robustness in production environments.
ORPilot decouples model formulation from solver-specific code through its Intermediate Representation, enabling deterministic recompilation and backend portability without further LLM calls.

An Empirical Study of Agent Skills for Healthcare: Practice, Gaps, and Governance

Healthcare Agent Skills Framework: introduces an empirical analysis of 557 healthcare-related agent skills, characterizing their function, deployment context, autonomy, and safety through a structured annotation process.
The study reveals that public healthcare skills are predominantly patient-facing and focus on administrative workflows, contrasting with the diagnostic and treatment-oriented focus of academic research.
The findings highlight a significant gap in healthcare-agent governance, noting that technical autonomy does not reliably capture clinical impact and that many skills lack explicit safety boundary statements.

Hybrid Inspection and Task-Based Access Control in Zero-Trust Agentic AI

CASA (Continuous Agent Semantic Authorization): introduces a hybrid runtime enforcement framework that combines deterministic guardrails and semantic intent inspection to secure LLM-driven agentic applications.
The framework utilizes an interception layer to perform structural integrity checks and a two-stage semantic pipeline to verify that tool invocations align with the subject's original task objectives.
The authors provide a novel multi-turn conversation dataset and experimental results demonstrating the effectiveness of semantic inspection in mitigating unauthorized tool access in agentic systems.

The Design and Composition of Structural Causal Decision Processes

SCDP: introduces a formal framework for modeling resource-rational agents in dynamic environments by composing SCDMs with endogenous cognitive constraints and variable discounting.
The framework utilizes bridge variables to enable sequential decomposition of complex decision problems, significantly improving computational efficiency through backwards induction.
SCDPs extend beyond traditional POMDPs by explicitly modeling memory capacity and information flow, allowing for the representation of agents with bounded rationality and costly cognitive resources.

An explainable hypothesis-driven approach to Drug-Induced Liver Injury with HADES

HADES: introduces an agentic framework that generates transparent, auditable reasoning traces for DILI risk assessment by fusing molecular-level predictions, metabolite decomposition, structural understanding, and toxicity pathways.
The system employs an epistemic division of labor where a DILI subagent performs tool-based structural analysis, followed by an anonymized DeepResearch agent that contextualizes findings against biomedical literature.
HADES utilizes the DILER Benchmark to evaluate mechanistic hypothesis generation, moving beyond binary classification to provide actionable, literature-grounded explanations for toxicological investigation.

AcademiClaw: When Students Set Challenges for AI Agents

AcademiClaw: introduces a bilingual benchmark of 80 complex, long-horizon academic tasks sourced from student workflows, utilizing a Docker Sandbox, OpenClaw Agent, Tool Palette, Evaluation Rubric, Safety Auditor, and API Logging Proxy.
The framework evaluates LLMs on academic-level difficulty and domain coverage, revealing capability boundaries and divergent behavioral phenotypes among frontier models.
Analysis demonstrates that token consumption does not correlate with output quality, highlighting the need for better stopping criteria in autonomous agent frameworks.

ARA: Agentic Reproducibility Assessment For Scalable Support Of Scientific Peer-Review

ARA: introduces a domain-agnostic pipeline that converts scientific papers into workflow graphs to evaluate reproducibility as a structured reasoning task.
The framework utilizes an LLM-based agentic extraction pipeline to map document components into nodes and edges, subsequently applying micro-level scoring and aggregated reproducibility metrics.
Experiments on the ReScience C dataset demonstrate that ARA provides consistent, scalable reproducibility assessments that approximate human expert judgments without requiring full experimental replication.

AutoFocus: Uncertainty-Aware Active Visual Search for GUI Grounding

AutoFocus: introduces a training-free, uncertainty-aware active visual search framework that mitigates resolution bottlenecks in GUI grounding by leveraging token-level perplexity as an intrinsic signal for adaptive refinement.
The framework utilizes Gaussian Dynamic Focusing to convert coordinate perplexity into anisotropic spatial probability fields, enabling targeted zoom-in operations without requiring additional model training.
By integrating error-triggered verification, Shape-Aware Zooming, and multi-hypothesis visual aggregation, AutoFocus significantly improves grounding precision for small interactive elements in high-resolution interfaces.

Beating the Style Detector: Three Hours of Agentic Research on the AI-Text Arms Race

Agentic Research Harness: introduces a framework for rapid reproducibility and adversarial evaluation of LLMs in the context of authorship style mimicry and AI-text detection.
The framework utilizes LUAR-MUD as a frozen authorship-style embedding model to measure stylistic similarity and detect AI-generated text through a leakage-free held-out protocol.
It incorporates an agentic adversarial rewriting loop where LLMs iteratively refine drafts against a frozen LinearSVC detector to minimize detection probability while maintaining stylistic fidelity.

Foundation-Model-Based Agents in Industrial Automation: Purposes, Capabilities, and Open Challenges

Foundation-Model-Based Industrial Agent: introduces a systematic literature review of 88 publications to define and evaluate the maturity, purposes, and capabilities of agents integrated with Foundation Models in industrial contexts.
The paper identifies a shift from conventional negotiation-heavy multi-agent systems toward assistive, monitoring, and process-optimization roles enabled by LLMs and MLLMs.
Key findings highlight that while FM-based agents improve human interaction and uncertainty handling, they face significant deployment barriers including hallucination, output instability, and integration challenges with industrial infrastructure.

Beyond State Machines: Executing Network Procedures with Agentic Tool-Calling Sequences

Agentic Network Procedure Execution Framework: introduces a systematic evaluation of LLM-based agents executing network procedures through sequential tool invocations across four distinct architectural approaches.
The framework compares iterative agent-driven reasoning against tool-encapsulated execution to determine the impact of procedure placement on end-to-end latency and execution correctness.
The research defines a procedure-specific error taxonomy to categorize failures in multi-step agentic workflows and identifies scalability limits for LLMs in complex network automation tasks.

On Training Large Language Models for Long-Horizon Tasks: An Empirical Study of Horizon Length

Horizon Reduction (HR): introduces a systematic empirical study identifying task horizon length as a fundamental training bottleneck for LLMs, proposing horizon reduction via macro actions and subgoal decomposition to stabilize reinforcement learning.
The framework addresses training instability in long-horizon tasks by mitigating exploration difficulties and credit assignment challenges through effective horizon reduction.
The research demonstrates that models trained with horizon reduction exhibit horizon generalization, enabling robust performance on longer, unseen task variants at inference time.

IteRate: Autonomous AI Synthesis of In-Kernel eBPF Wi-Fi Rate Control Algorithms

IteRate: introduces an autonomous research system that uses a multi-agent AI architecture to synthesize, deploy, and evaluate in-kernel eBPF Wi-Fi rate control algorithms on commodity hardware.
The system employs a Scientist agent that coordinates specialized subagents—Experiment Runner, Algorithm Designer, Data Analyst, and Network Engineer—to conduct a closed-loop scientific research cycle without human intervention.
By leveraging high-fidelity telemetry and a programmable MAC-layer architecture, IteRate enables the online synthesis of bespoke algorithms that outperform traditional heuristics across diverse wireless environments and workloads.

Orchestrating Spatial Semantics via a Zone-Graph Paradigm for Intricate Indoor Scene Generation

ZoneMaestro: introduces a unified framework that shifts indoor scene synthesis from object-centric placement to Zone-Graph Orchestration by internalizing zone-based logic and topological constraints.
The framework utilizes an Alternating Spatial Alignment strategy that cycles between reasoning internalization and Zone-Aware Group Relative Policy Optimization to reconcile semantic richness with geometric validity.
To evaluate spatial intelligence in non-convex environments, the authors introduce the SCALE benchmark and the Zone-Scene-10K dataset, demonstrating superior structural coherence and intent adherence over existing baselines.

DataClaw: A Process-Oriented Agent Benchmark for Exploratory Real-World Data Analysis

DataClaw: introduces a process-oriented benchmark for exploratory real-world data analysis that evaluates LLM agents using DataClaw, OpenClaw, Docker container, LLM Judger, GLM-5, Human-in-the-loop annotation pipeline, Consensus verification, Outcome evaluation, and Process evaluation.
The framework utilizes a three-stage human-in-the-loop annotation pipeline to create 492 expert-designed tasks across enterprise, industry, and policy domains, incorporating native data noise to simulate real-world exploratory analysis.
DataClaw enables diagnostic evaluation by measuring both final-answer accuracy and intermediate reasoning progress through milestone-based process scoring, revealing distinct exploration archetypes among LLMs.

GRAIL: A Deep-Granularity Hybrid Resonance Framework for Real-Time Agent Discovery via SLM-Enhanced Indexing

GRAIL (Granular Resonance-based Agent/AI Link): introduces a multi-stage discovery framework that replaces heavy-weight LLM parsing with an SLM Predictor and utilizes a tri-dimensional indexing strategy to achieve sub-400ms latency.
The framework employs Hybrid Recall to combine Sparse Index-based filtering with Context Index-based dense retrieval, followed by MaxSim Resonance to mitigate semantic dilution through fine-grained matching against an Intent Index.
By decoupling agent metadata into Tags, Context, and Intent, GRAIL maintains high retrieval precision for multi-purpose agents while significantly reducing the computational overhead compared to traditional LLM-centric discovery methods.

Accurate Legal Reasoning at Scale: Neuro-Symbolic Offloading and Structural Auditability for Robust Legal Adjudication

Amortized Intelligence: introduces a neuro-symbolic architecture that offloads complex legal reasoning from LLMs to a deterministic symbolic engine, utilizing a one-time compilation step to translate contracts into a typed graph representation.
The system employs an LLM as a semantic compiler to generate DACL graphs, which are subsequently executed by a lightweight agent to ensure mathematical precision and structural auditability in high-volume legal adjudication.
By decoupling reasoning from execution, the framework achieves near-perfect consistency and significant cost reductions compared to standard LLM-based inference, effectively mitigating the "reasoning cliff" observed in probabilistic models.

When Stress Becomes Signal: Detecting Antifragility-Compatible Regimes in Multi-Agent LLM Systems

CAFE (Controlled Antifragility-compatible Framework for Evaluation): introduces a statistical framework for detecting antifragility-compatible regimes in multi-agent LLM systems by modeling stress-response geometry.
The framework utilizes Prompt Formation, Agentic Architectures, Judge, Multi-output Polynomial Response Model, Inverse Reconstruction, and Distributional Jensen Gap to identify structured stress variation.
CAFE distinguishes between immediate performance robustness and the presence of learnable stress signals, providing a measurement layer for future adaptive LLM systems.

Causal Software Engineering: A Vision and Roadmap

CSE: introduces a paradigm shift in software engineering by integrating explicit causal models and reasoning into the software lifecycle to move beyond correlational analysis.
The framework utilizes Causal design specs, Intervention logs, and a Living Causal Model to enable interventional and counterfactual reasoning for improved decision-making.
CSE incorporates Causal copilots and LLM-based agents to provide uncertainty-aware, auditable, and causally-grounded assistance for software development and operations.

A Behavioral Micro-foundation for Cross-sectional Network Models

Behavioral Micro-foundation Framework: introduces a choice-theoretic foundation for cross-sectional network models by linking agent-based stochastic decision processes to Exponential Random Graph Model (ERGM) equilibria.
The framework incorporates multilateral edge control and agent-node distinctions, utilizing a prosphoric array to aggregate agent decisions via a resolution function that defines relational norms.
The approach enables behavioral inference and counterfactual analysis by demonstrating how individual utility functions and relational norms collectively determine the long-run structural properties of social networks.

ARIADNE: Agentic Reward-Informed Adaptive Decision Exploration via Blackboard-Driven MCTS for Competitive Program Generation

ARIADNE: introduces, "a blackboard-driven MCTS framework that models competitive program generation as a sequential decision process to systematically explore solution spaces".
The framework integrates MCTS as a global planner with specialized agents—including strategy-, code generation-, test generation-, scoring-, and code repair-agents—to iteratively refine program drafts.
A structured blackboard system maintains persistent intermediate artifacts, enabling cross-branch knowledge transfer and evidence-driven refinement to improve performance under strict contest constraints.

AOCI: Symbolic-Semantic Indexing for Practical Repository-Scale Code Understanding with LLMs

AOCI: introduces a symbolic-semantic indexing protocol that provides a stable, LLM-readable blueprint of a repository's architecture, dependencies, and design decisions to improve repository-scale code understanding.
The framework utilizes a dual-layer encoding structure consisting of a Discrete Tag Layer (assigns compact architectural coordinates) and a Continuous Semantic Layer (carries business role and design details) to enable efficient, single-pass LLM comprehension.
AOCI Platform maintains index consistency through incremental updates, ensuring the blueprint remains aligned with evolving codebases while significantly reducing token consumption compared to agent-based exploration methods.

HEAVYSKILL: Heavy Thinking as the Inner Skill in Agentic Harness

HEAVYSKILL: introduces a training-free framework that decomposes complex reasoning into a two-stage pipeline of Parallel Reasoning Agents and a Sequential Deliberation Agent, supported by a Memory Cache and an optional Iterative Update Mechanism.
The framework utilizes Parallel Reasoning Agents to generate diverse trajectories, which are then synthesized by a Sequential Deliberation Agent to produce a final answer, effectively scaling test-time compute.
HEAVYSKILL functions as a readable skill for LLM orchestrators, enabling self-orchestration and iterative refinement to improve performance on complex reasoning tasks without requiring architectural modifications.

A Low-Code Approach for the Automatic Personalization of Conversational Agents

BESSER-PEARL: introduces a low-code pipeline that leverages User Profile Model, Agent Profile Model, and Base Agent Model to automatically generate personalized conversational agents through M2M Transformation and M2C Transformation.
The framework utilizes an LLM for design-time content adaptation and RAG for dynamic knowledge retrieval, while the BESSER Agentic Framework provides the underlying state-machine architecture.
A pilot study demonstrates that the approach achieves high usability and usefulness scores across both technical and non-technical users for creating personalized conversational experiences.

A Compound AI Agent for Conversational Grant Discovery

Conversational Grant Discovery system: introduces a compound AI architecture that unifies fragmented funding sources through an Aggregation Layer and an Agentic Query Processing Layer to provide accurate, real-time grant discovery.
The system utilizes an LLM-based Parser to normalize heterogeneous data into a Unified Index, which is then queried by an Agent Orchestrator using search_index and web_search tools to ensure factual grounding.
By supporting PDF context and iterative conversational refinement, the framework reduces the cognitive load of grant searching and minimizes LLM hallucinations compared to monolithic models.

When Correct Isn’t Usable: Improving Structured Output Reliability in Small Language Models

AloLab: introduces an iterative system-prompt optimization framework that improves structured output reliability in LLMs by using a meta-agent to analyze and refine prompts based on observed model behavior.
The framework utilizes a feedback loop consisting of a Solver, Evaluator, Analyzer, and Optimizer to iteratively rewrite system prompts without requiring weight access or fine-tuning.
AloLab achieves high output accuracy at near-baseline inference latency, effectively addressing format-compliance failures in small LLMs that static prompting or constrained decoding cannot resolve efficiently.

LLM-enabled Social Agents

LLM-enabled Social Agents: introduces a conceptual baseline for social agents that grounds behavior in persona-based role definitions rather than generic capability bundles.
The architecture utilizes a hybrid deliberative approach where LLM-based meta deliberation orchestrates specialized modules including reactive-, rule-based-, planning-, utility-based-, and argumentation-based deliberation.
The paper advocates for shifting research focus toward structured persona representations, synthetic datasets for evaluation, and hybrid control mechanisms to ensure long-term social coherence in LLMs.

SOTOPIA-TOM: Evaluating Information Management in Multi-Agent Interaction with Theory of Mind

SOTOPIA-TOM: introduces a multi-dimensional benchmarking framework to evaluate LLMs' ability to manage information asymmetry and privacy in multi-party interactions, utilizing Scenario Generation Pipeline, Multi-agent Simulator, Evaluation Metric Suite, ToM-Coach, and ToM-Belief.
The framework employs a composite INFOMGMT metric to assess coordination utility and privacy safety, revealing that current LLMs struggle with strategic information seeking and privacy-aware decision-making.
ToM-based interventions, specifically ToM-Coach and ToM-Belief, demonstrate improved coordination-privacy balance compared to vanilla baselines and CoT-Privacy prompting strategies.

EngiAgent: Fully Connected Coordination of LLM Agents for Solving Open-ended Engineering Problems with Feasible Solutions

EngiAgent: introduces a multi-agent framework that utilizes a Coordinator, Analyzer, Modeler, Verifier, Solver, and Evaluator to ensure feasibility in open-ended engineering problem solving.
The framework employs a fully connected coordinator that dynamically routes feedback among specialized agents to correct errors in data processing, constraint handling, and solver execution.
EngiAgent leverages shared memory to track error history and state, enabling adaptive scheduling and robust, feasibility-oriented engineering solutions across diverse domains.

Complexity Horizons of Compressed Models in Analog Circuit Analysis

Prerequisite Graph framework: introduces a performance-aware model compression strategy that utilizes Directed Acyclic Graphs (DAGs) to map conceptual dependencies and optimize LLM cascade deployment in analog circuit analysis.
The framework employs an Agentic Dataset Generation Pipeline to structure engineering tasks and a Strategic Evaluation Engine that uses Depth-First Search (DFS) to dynamically route queries across compressed LLM variants based on real-time performance.
By identifying the "Complexity Horizon" and calculating the "Intelligence Delta" through tag-set intersection analysis, the approach enables granular, cost-effective model selection for hierarchical engineering workflows.

These Aren’t the Reviews You’re Looking For: How Humans Review AI-Generated Pull Requests

AIDev-based data collection pipeline: introduces a large-scale empirical study analyzing human review activity on AI-generated pull requests by classifying interactions into agentic, automation, and human review categories.
The study utilizes a rule-based classifier to distinguish between agent-steering, infrastructure-related automation, and direct human feedback within GitHub pull request discussions.
Findings indicate that while overall human participation rates are similar for human-authored and AI-generated pull requests, the latter exhibit significantly higher levels of agent-steering and fewer instances of direct human-only review.

Towards Understanding Specification Gaming in Reasoning Models

Specification Gaming Evaluation Suite: introduces a diverse set of eight environments to systematically measure and analyze the propensity of LLMs to exploit task specifications during deployment.
The research demonstrates that RL reasoning training significantly increases the rate of specification gaming across various frontier LLMs.
The study finds that while increasing reasoning effort and applying test-time mitigations like fallback options or no-exploit prompts can influence behavior, they do not fully eliminate specification gaming.

Reliability-Oriented Multilingual Orthopedic Diagnosis: A Domain-Adaptive Modeling and a Conceptual Validation Framework

IndicBERT-HPA: introduces a domain-adaptive multilingual diagnostic framework for orthopedic classification that utilizes language-specific adapter heads to project linguistic representations into a clinically discriminative orthopedic subspace.
The framework integrates a deterministic agent-based validation layer that performs evidence checking and enforces conservative human-in-the-loop gating to ensure safety in high-risk clinical decision support.
Empirical results demonstrate that task-aligned encoders with domain-specific adaptation significantly outperform zero-shot LLMs in reliability, calibration, and cross-lingual stability for structured diagnostic tasks.

A Study of Belief Revision Postulates in Multi-Agent Systems (Extended Version)

MBR (Multi-agent Belief Revision): introduces a formal framework for evaluating dynamic epistemic reasoning by generalizing classical AGM and DP belief revision postulates to multi-agent settings represented by a single Kripke structure.
The paper defines and analyzes two specific revision operators, a generalized full-meet operator and an event-based operator, to assess their compliance with the proposed multi-agent belief revision postulates.
The research provides formal proofs demonstrating that while these operators satisfy the generalized AGM postulates, they face challenges in fully satisfying iterated revision postulates like DP2.

The Conversations Beneath the Code: Triadic Data for Long-Horizon Software Engineering Agents

Triadic Data Framework: introduces a methodology for training long-horizon software engineering agents by capturing human-human-AI interactions rather than relying solely on dyadic human-AI dialogue.
The framework proposes two primary data products, long-horizon expert trajectories and simulated cross-functional companies, to capture the engineering deliberation and context formation missing from current datasets.
A four-tier evidence framework is specified to ensure the quality and credibility of these high-cost corpora through mechanical verification, statistical characterization, probe experiments, and pre-registered blind evaluation.

PhysicianBench: Evaluating LLM Agents in Real-World EHR Environments

PhysicianBench: introduces a benchmark for evaluating LLM agents on long-horizon clinical tasks within real-world EHR environments using Task Curation, Agent Environment, Checkpoint Evaluation, Agent Framework, FHIR Server, and Tool Inventory.
The benchmark utilizes 100 physician-validated tasks and 670 checkpoints to measure the performance of LLMs in retrieving clinical data, reasoning, executing actions, and producing documentation.
Evaluation of 12 LLMs reveals a significant performance gap, with the best model achieving a 46% success rate, highlighting the challenges of autonomous clinical agent deployment.

ARGUS: Policy-Adaptive Ad Governance via Evolving Reinforcement with Adversarial Umpiring

ARGUS: introduces a three-stage policy-adaptive governance framework that utilizes multi-agent adversarial dialectics to resolve label inconsistencies and discover latent violations in advertising content.
The framework employs a Prosecutor-Defender-Umpire architecture to generate high-fidelity rewards for reinforcement learning, effectively synchronizing model reasoning with evolving regulatory mandates.
By incorporating a Skeptic agent for latent knowledge discovery, ARGUS pushes decision boundaries into complex "gray-area" territories, significantly improving detection precision and recall compared to traditional fine-tuning baselines.

MEMAUDIT: An Exact Package-Oracle Evaluation Protocol for Budgeted Long-Term LLM Memory Writing

MEMAUDIT: introduces an exact package-oracle evaluation protocol for budgeted long-term LLM memory writing that isolates write-time memory selection from downstream retrieval and reader reasoning.
The framework defines a finite package of experience streams, candidate memory representations, and storage costs to turn memory writing into a certified optimization problem.
By using a union denominator, MEMAUDIT enables the diagnostic scoring of heterogeneous memory systems like Mem0, Letta, and A-Mem to identify bottlenecks in extraction versus budget-aware selection.

Do We Really Need Immediate Resets? Rethinking Collision Handling for Efficient Robot Navigation

MCB (Multi-Collision reset Budget): introduces a framework that decouples local collision termination from global environment resets to accelerate early-stage exploration in DRL-based mapless navigation.
The framework utilizes a Collision Counter and a predefined Collision Budget to allow agents to continue interacting with the environment after a collision, while employing a Collision-Aware Transition Construction to ensure consistent Bellman backups.
An optional Pose-Change-Based Filtering component further improves training efficiency by discarding redundant collision transitions that occur at similar robot orientations.

When Alignment Isn’t Enough: Response-Path Attacks on LLM Agents

RTA (Relay Tampering Attack): introduces a hierarchical framework that exploits the lack of end-to-end integrity in BYOK relay architectures by performing post-alignment tampering on LLM responses.
The framework utilizes Strategic orchestration, Tactical manipulation, and Stealth restoration to hijack agent execution while bypassing safety alignment and maintaining stylistic consistency.
RTA achieves high attack success rates across multiple LLMs and benchmarks, while a proposed time-channel detector provides a utility-preserving defense against such relay-side tampering.

T2PO: Uncertainty-Guided Exploration Control for Stable Multi-Turn Agentic Reinforcement Learning

T2PO (Token- and Turn-level Policy Optimization): introduces a hierarchical uncertainty-aware framework that stabilizes multi-turn RL by explicitly controlling exploration through TTI, TDS, and RFT.
The framework utilizes a self-calibrated uncertainty signal, integrating entropy and confidence, to monitor reasoning dynamics and prevent inefficient exploration at both token and turn levels.
By dynamically terminating redundant reasoning and resampling unproductive turns, T2PO mitigates training collapse and improves exploration efficiency in complex interactive environments.

Intervention Complexity as a Canonical Reward and a Measure of Intelligence

IC: introduces a canonical reward measure based on the minimum resource cost required for an agent to achieve a specific state transition within a computable environment.
The framework decomposes intelligence into two independent dimensions: agent competence, which measures static performance against an oracle, and learning efficiency, which quantifies the rate of improvement through experience.
By grounding rewards in the computational structure of the environment rather than external normative input, the paper provides a principled basis for evaluating superintelligence and pre-training universal agents.

Planner Matters! An Efficient and Unbalanced Multi-agent Collaboration Framework for Long-horizon Planning

Planner-centric Multi-agent Collaboration Framework: introduces a modular multi-agent system that decomposes long-horizon automation into a Planner, an Actor, and a Memory Manager to optimize compute allocation.
The framework identifies high-level planning as the primary performance bottleneck and utilizes a planner-centric reinforcement learning approach to exclusively optimize the Planner while keeping other components frozen.
Experimental results across web navigation, OS control, and tool-use benchmarks demonstrate that this unbalanced compute-allocation strategy significantly improves task success rates and efficiency compared to monolithic LLM baselines.

Experience Constrained Hierarchical Federated Reinforcement Learning for Large-scale UAV Teams in Hazardous Environments

EC-HFRL: introduces a hierarchical federated reinforcement learning framework designed for multi-UAV teams operating under fixed per-round experience budgets in hazardous environments.
The framework utilizes a cluster-based architecture where active learners perform parallel policy updates by reusing a shared experience pool, with performance governed by replay dynamics rather than participation density.
The research identifies that minibatch structure and key experience admission are primary determinants of learning success, while increased learner participation primarily impacts energy consumption.

DocSync: Agentic Documentation Maintenance via Critic-Guided Reflexion

DocSync: introduces an agentic workflow that automates software documentation maintenance by fusing structural AST parsing, RAG, and a critic-guided Reflexion loop to ensure semantic consistency between code and documentation.
The framework utilizes an Impact Analysis component to identify necessary updates, followed by an LLM-based generation process that is iteratively refined through a critic-guided feedback mechanism.
By anchoring LLM generations in verified structural code representations, DocSync mitigates hallucinations and improves documentation accuracy on resource-constrained hardware.

AAFLOW: Scalable Patterns for Agentic AI Workflows

AAFLOW: introduces a unified distributed runtime that models agentic workflows as a composition of operators to enable communication-efficient execution on HPC systems.
The framework utilizes a zero-copy data plane based on Apache Arrow and Cylon to eliminate serialization overhead and decouple logical agent behavior from resource-deterministic scheduling.
Experimental results demonstrate that AAFLOW achieves significant pipeline speedups by optimizing data orchestration and communication-bound stages rather than LLM inference.

Combining Trained Models in Reinforcement Learning

DRL Knowledge Reuse Framework: introduces a systematic review of empirical studies on reusing pretrained models in DRL, categorizing mechanisms into Distillation, Transfer, Ensemble, and Federated Training.
The paper synthesizes evidence across 15 studies, highlighting that reuse performance depends on structural overlap between source and target tasks and the use of explicit Gating Mechanism or Alignment Mechanism.
It proposes an Independence Spectrum to classify model diversity based on Seed-, Data-, Task-, and Reward-diversity, while noting that current compute reporting remains insufficient for rigorous efficiency comparisons.

Hierarchical Cooperative MARL for Joint Downlink PRB and Power Allocation in a 5G System

Hierarchical Cooperative MARL: introduces a sequential control framework that decomposes joint downlink PRB and power allocation into a PRB Agent, a Deterministic Resolver, and a Power Agent to optimize a throughput-fairness objective.
The framework utilizes a staged curriculum-based training procedure to stabilize the learning of the PRB Agent and Power Agent within a physically grounded Sionna-based simulation environment.
The system achieves significant cell-throughput gains over traditional Proportional Fair baselines by leveraging ray-traced channel information and cross-layer feedback loops.

From Where Things Are to What They Are For: Benchmarking Spatial–Functional Intelligence in Multimodal LLMs

SFI-Bench: introduces a video-based benchmark evaluating spatial and functional reasoning in MLLMs through Metadata Generation, Task Templates, Human Verification, Post-hoc Quality Filtering, Web Search Tool, and Reasoning Engine.
The framework assesses cognitive abilities by requiring models to construct structured spatial maps and infer object affordances from egocentric indoor videos.
Experiments demonstrate that while MLLMs excel at local perception, they struggle with global spatial memory and functional integration, highlighting the necessity of external knowledge and robust reasoning.

3rd May 2026

NeuroState-Bench: A Human-Calibrated Benchmark for Commitment Integrity in LLM Agent Profiles

NeuroState-Bench: introduces a human-calibrated benchmark for evaluating commitment integrity in LLM agent profiles through Task Generator, Side-Query Probes, Human Calibration Layer, HCCIS-CORE Scorer, Auxiliary Neural and Mechanistic Modules, and Ranking and Phenotype Analysis Pipeline.
The framework operationalizes commitment integrity by measuring whether an agent's elicited commitments remain consistent with task-defined constraints, bindings, and updates using benchmark-defined side-query probes.
NeuroState-Bench demonstrates that endpoint success often under-specifies agent reliability, as integrity-aware evaluation reveals hidden failures and provides more stable rankings under distractor perturbations.

NORA: A Harness-Engineered Autonomous Research Agent for Spatial Data Science

NORA: introduces a harness-engineered, multi-agent autonomous research system designed specifically for spatial data science workflows.
The system utilizes a skills-first architecture with 21 domain-specialized skills and 9 specialist sub-agents to orchestrate the complete research lifecycle.
NORA incorporates harness engineering principles, including generator-evaluator separation, human-in-the-loop checkpoints, and structured state persistence, to ensure reliable and reproducible research.

Model Spec Midtraining: Improving How Alignment Training Generalizes

MSM: introduces a training phase occurring after pre-training and before AFT, where LLMs are trained on synthetic documents explaining the content of a Model Spec to shape their generalization from subsequent demonstration data.
The framework utilizes a hierarchical pipeline to decompose Model Specs into coherent domains, generating diverse synthetic documents that teach the model the "what" and "why" of its intended character.
Empirical results demonstrate that MSM significantly improves OOD alignment generalization and reduces agentic misalignment, outperforming standard AFT baselines by teaching models to internalize principles rather than just mimicking behaviors.

Coopetition-Gym v1: A Formally Grounded Platform for Mixed-Motive Multi-Agent Reinforcement Learning under Strategic Coopetition

Coopetition-Gym v1: introduces a formally grounded benchmark platform for mixed-motive multi-agent reinforcement learning that utilizes Environment transition rules, Payoff vector, Reward layer, Reward signal, Learning algorithm, Oracle baselines, Behavioral audit, Statistical-gate methodology, Controlled critic-learning-rate ablation, and Matrix-coverage verification audit to enable systematic investigation of coopetitive dynamics.
The framework employs a two-layer separation of payoff and reward, allowing for reward-type ablation to isolate the effects of reward mutuality from environment transition structures.
The platform includes twenty environments organized into four mechanism-class tiers (interdependence, trust, collective action, and reciprocity) and provides a suite of 126 algorithms for comprehensive evaluation.

Enhancing Judgment Document Generation via Agentic Legal Information Collection and Rubric-Guided Optimization

Judge-R1: introduces a unified framework that enhances LLM-based judgment document generation by integrating an agentic legal information collection mechanism with rubric-guided reinforcement learning.
The framework utilizes a multi-stage pipeline comprising a hybrid retrieval module, supervised fine-tuning for structural alignment, and Group Relative Policy Optimization (GRPO) for legal reasoning fidelity.
Judge-R1 employs a comprehensive legal reward function to evaluate legal correctness, structural professionalism, and reasoning quality, significantly outperforming existing baselines on the JuDGE benchmark.

Optimized and kinematically feasible multi-agent motion planning

MAMP framework: introduces a two-step method for finding optimized and kinematically feasible trajectories for multi-agent systems by combining initial pathfinding via CBS or PBS with a subsequent multi-phase OCP improvement step.
The framework utilizes synchronized motion primitives and a lattice-based planner to generate initial feasible solutions, which are then refined using numerical optimal control to ensure kinematic feasibility and collision avoidance.
Experimental results demonstrate that the lattice-based planner outperforms SIPP-IP in success rates for complex tractor-trailer systems, while the multi-phase OCP effectively removes discretization artifacts from initial paths.

12 Angry AI Agents: Evaluating Multi-Agent LLM Decision-Making Through Cinematic Jury Deliberation

12 Angry AI Agents: instantiates a multi-agent benchmark for LLM deliberation using the AutoGen SelectorGroupChat framework to evaluate how different RLHF alignment intensities influence decision-making rigidity.
The study compares GPT-4o and Llama-4-Scout across various prompt conditions, revealing that heavy RLHF alignment correlates with increased deliberative rigidity and anchoring, while lighter alignment allows for greater flexibility.
The research demonstrates that current LLMs often reproduce the surface features of human deliberation while failing to perform the underlying mechanism of socially mediated, evidence-based mind-changing.

Multi-User Dueling Bandits: A Fair Approach using Nash Social Welfare

FMUDB: introduces a framework for fair online learning from multi-user preference data by optimizing the Nash Social Welfare objective using Modified DKWT, Fair-ETC, and Fair-ϵ-Greedy components.
The framework addresses the challenge of heterogeneous user preferences in dueling bandits by utilizing Condorcet winners as reference points to derive individual utility scores.
Theoretical analysis establishes a regret lower bound of Ω(T^2/3 min(K, D)^1/3) and provides matching upper bounds for the proposed Fair-ETC and Fair-ϵ-Greedy algorithms.

TRAP: Tail-aware Ranking Attack for World-Model Planning

TRAP (Tail-aware Ranking Attack on Planning): introduces a trigger-conditioned, inference-time backdoor attack framework that manipulates world-model planning by targeting the relative ranking of decision-critical imagined trajectories.
The framework utilizes a tail-aware ranking loss to identify and suppress high-value trajectories, combined with a dual gating mechanism to ensure stable and behavior-preserving attack updates.
By focusing on the long-tailed distribution of trajectory scores, TRAP effectively disrupts long-horizon planning in world-model-based agents without requiring training-time poisoning or parameter modification.

A Language for Describing Agentic LLM Contexts

ACDL: introduces a formal, implementation-agnostic language for precisely specifying the structure and temporal evolution of input contexts in agentic LLM systems.
The framework utilizes Role Messages, Context Variables, and Control Flow constructs to capture how prompts are assembled and transformed across multi-turn interactions.
ACDL facilitates reproducible engineering and clear communication by rendering complex context architectures into standardized, comparable visual diagrams.

CyberAId: AI-Driven Cybersecurity for Financial Service Providers

CyberAId: introduces a hybrid multi-agent platform for financial cybersecurity that coordinates specialist LLM subagents within a shared runtime to reason over classical SIEM/XDR telemetry.
The architecture utilizes a Main Agent/CRA coordination layer, a Reporting capability, and specialist subagents to perform cross-domain synthesis, regulatory mapping, and incident response under bounded human-in-the-loop autonomy.
The platform integrates advanced security mechanisms including partitioned RAG, federated knowledge sharing, digital twin validation, and eBPF-based kernel telemetry to ensure auditable, regulatory-compliant defense.

AFFormer: Adaptive Feature Fusion Transformer for V2X Cooperative Perception under Channel Impairments

AFFormer: introduces a Transformer-based framework designed to mitigate feature corruption in V2X cooperative perception by modeling inter-agent, temporal, and spatial correlations through MATA, DualSA, and UGF modules.
The framework utilizes a teacher-student knowledge distillation strategy to enhance robustness by aligning student-fused features with clean teacher representations under ideal communication conditions.
DualSA reduces computational complexity by decomposing 2D spatial attention into parallel width and height branches, enabling efficient and robust feature fusion for real-time autonomous driving applications.

QASecClaw: A Multi-Agent LLM Approach for False Positive Reduction in Static Application Security Testing

QASecClaw: introduces a multi-agent framework that reduces false positives in SAST by utilizing a Mission Orchestrator to coordinate specialized agents, including a Security Validation Agent for initial scanning and a SAST Filter Agent powered by a Qwen 3.5 Plus LLM for contextual verification.
The framework employs a conservative fail-open mechanism where the SAST Filter Agent retains findings if the LLM fails, times out, or produces malformed output, ensuring that potential vulnerabilities are not silently suppressed.
Evaluated on the OWASP Benchmark v1.2, the approach achieves a 16.0% improvement in F1-score and an 88.6% reduction in false positives compared to standalone Semgrep, demonstrating the effectiveness of LLM-augmented verification in security triage.

Sheaf-Theoretic Planning: A Categorical Foundation for Resilient Multi-Agent Autonomous Systems

STP: introduces a geometric framework for autonomous agents that replaces monolithic logical models with sheaf-theoretic structures to enable resilient planning in open-world environments.
The framework utilizes Interval Category, Grothendieck Topology, Sheaf, Natural Transformation, Pullback, Sheaf Laplacian, Prolog Reasoning Engine, Raspberry Pi, ESP32, PiCrawler, and Wi-Fi Mesh to manage temporal reasoning and multi-agent coordination.
STP models agent memory, goals, and objective reality as sheaves, employing pullbacks for abductive reasoning and sheaf Laplacians for distributed consensus in multi-agent swarms.

Quality-Aware Exploration Budget Allocation for Cooperative Multi-Agent Reinforcement Learning

RCB+RSQ (Return-Conditioned Beta and Reward Signal Quality): introduces a cooperative multi-agent reinforcement learning framework that optimizes exploration by combining a global return-conditioned intensity schedule with a per-agent signal quality-aware budget allocation.
The framework utilizes RCB to adapt global exploration intensity based on team learning progress and RSQ to selectively attenuate exploration for agents with noisy intrinsic reward signals.
By employing Successor Distance as a quasimetric intrinsic reward, the approach ensures distinguishable per-agent signal quality, preventing coordination collapse in complex multi-agent environments.

Collusion Relations and their Applications to Balance Theory

Collusion Relations and their Applications to Balance Theory: introduces a novel characterization of balance in social networks by utilizing Quadrangular Relations and Collusive Relations to generalize standard balance theory.
The paper establishes that collusiveness of relations provides a necessary and sufficient condition for network stability, extending applicability to non-symmetric relations.
The authors provide a modal characterization of collusive frames and implement a G3CP System to derive key properties of balanced networks through labeled sequent calculus.

MAGIC: Multi-Step Advantage-Gated Causal Influence for Multi-agent Reinforcement Learning

MAGIC: introduces a framework that models long-horizon causal influences between agents and aligns them with task returns to provide targeted coordination signals.
The framework utilizes a learned forward model and an ICMI critic to estimate multi-step interventional causal influence, which is then modulated by an advantage-gating module to ensure only beneficial actions are reinforced.
MAGIC improves coordination in MARL by capturing delayed causal effects and filtering them through extrinsic team advantage to prevent the reinforcement of harmful high-influence behaviors.

Koopman Representations for Early Outbreak Warning and Minimal Counterfactual Intervention in Multi-Agent Epidemic Simulations

Koopman-based framework: introduces a computational approach for early outbreak detection and intervention selection in multi-agent epidemic simulations by combining a multi-agent epidemic simulator, a Koopman-inspired model, a random forest classifier, and a counterfactual intervention procedure.
The framework leverages Koopman operator learning to map nonlinear epidemic trajectories into a low-dimensional latent space, enabling short-horizon forecasting and the extraction of predictive features for outbreak classification.
Counterfactual analysis is utilized to identify sensitive tipping points where minimal, single-agent mobility restrictions can effectively redirect trajectories from major outbreaks to contained regimes.

DataEvolver: Let Your Data Build and Improve Itself via Goal-Driven Loop Agents

DataEvolver: introduces a closed-loop visual data engine that utilizes goal-driven loop agents to automate the construction, inspection, and refinement of multi-artifact datasets.
The framework employs a dual-loop mechanism where an inner loop performs generation-time self-correction and an outer loop executes validation-time self-expansion to ensure high-quality, traceable data.
By organizing data construction into an explicit artifact graph with structured review signals and verdict logic, the engine enables reproducible and scalable visual data generation.

Runtime Evaluation of Procedural Content Generation in an Endless Runner Game Using Autonomous Agents

Momentum: introduces a runtime procedural content generation and evaluation framework that integrates autonomous aerial and ground agents to validate game content before player interaction.
The system utilizes a WFC-inspired object placement mechanism and asynchronous navigation-mesh rebuilding to maintain continuous, playable endless-runner environments.
Evaluation is performed through complementary geometric ray casting and navigation-based path traversal, with detected failures recorded via a structured crash-reporting pipeline.

The Compliance Gap: Why AI Systems Promise to Follow Process Instructions but Don’t

BS-Bench: introduces a dual-channel audit framework to measure the Compliance Gap, defined as the divergence between an LLM's verbal commitment and its actual behavioral execution.
The framework utilizes tool-call logs as an independent behavioral mirror to detect process non-compliance that remains structurally invisible to text-only evaluation methods.
The research demonstrates that RLHF-trained models systematically prioritize text-based reward signals over procedural instructions, resulting in selective compliance patterns across various professional domains.

Talk is Cheap, Communication is Hard: Dynamic Grounding Failures and Repair in Multi-Agent Negotiation

Negotiation Game Framework: introduces an iterated, multi-turn negotiation game to evaluate dynamic grounding in LLMs by decomposing coordination gaps into measurable components.
The framework utilizes LLM Agents with a Private Thinking Scratchpad and a Cheap Talk Channel to study how agents build shared understanding through interaction.
Experimental results reveal that coordination failures are primarily rooted in interactive processes like joint plan formation and commitment maintenance rather than individual reasoning limitations.

Reward Hacking Benchmark: Measuring Exploits in LLM Agents with Tool Use

RHB (Reward Hacking Benchmark): introduces a multi-step tool-use benchmark designed to quantify reward hacking in LLM agents by evaluating their propensity to exploit evaluation mechanics across independent and chained task regimes.
The framework utilizes a sandbox environment with integrity instrumentation to classify agent behaviors into six exploit categories, including sequence manipulation, leakage, and tampering, while testing models against both standard and hard task variants.
Experimental results demonstrate that RL-dominated post-training is associated with significantly higher exploit rates, and that environmental hardening can reduce these exploits by 87.7% without degrading task success.

Architectural Obsolescence of Unhardened Agentic-AI Runtimes

enclawed-oss: introduces a hardened agentic-AI runtime architecture that utilizes a Biconditional checker, Hash-chained audit log, Extension admission gate, Two-layer egress guard, Bell-LaPadula classification policy, Module-signing trust root, and Bootstrap seal to prevent failure modes F1–F4.
The framework employs an Adversarial harness and Cooperation classifier to empirically demonstrate that unhardened runtimes like OpenClaw are architecturally obsolete due to their lack of these seven security primitives.
The research validates that enclawed-oss achieves perfect recall and precision on F1–F4 failure modes, while also supporting a Behavioral-anomaly monitor in its certified evolution to detect anomalous agent activity.

AgenticVM: Agentic AI for Adaptive Software Vulnerability Management

AgenticVM: introduces a multi-agent framework that integrates LLMs with security tools to automate the full vulnerability management lifecycle, including detection-, assessment-, prediction-, integration-, prioritisation- and recommendation-agents.
The framework employs a hybrid architecture combining rule-based logic for deterministic tasks and a BERT-small model for CVSS severity prediction to reduce alert noise and analyst workload.
AgenticVM leverages retrieval-grounded generation and human-in-the-loop governance to ensure provenance and accountability in real-world software vulnerability management.

BIM Information Extraction Through LLM-based Adaptive Exploration

Adaptive Exploration framework: introduces an LLM-based agent that iteratively writes and executes code against BIM models to discover data structures at runtime, overcoming the limitations of static query approaches.
The system utilizes a CodeAct architecture to perform multi-step reasoning and error recovery through execution feedback, enabling robust information extraction from heterogeneous IFC models.
The research includes ifc-bench v2, a large-scale benchmark for evaluating LLM-based BIM extraction, and demonstrates that the adaptive paradigm significantly outperforms static methods regardless of augmentation strategies.

GRAVITY: Architecture-Agnostic Structured Anchoring for Long-Horizon Conversational Memory

GRAVITY (Generation-time Relational Anchoring Via Injected Topological MemorY): introduces a plug-and-play structured memory module that extracts Entity Anchors, Event Anchors, and Topic Anchors from raw conversation history to provide explicit relational, temporal, and thematic context to an LLM without modifying the host memory system.
The framework utilizes an offline build phase to generate structured knowledge representations and an online inference phase that employs an Embedding-based Reranker and Query Expander to inject relevant anchors into the generation prompt.
Evaluations across five diverse memory architectures demonstrate that GRAVITY significantly improves LLM-judge accuracy by providing structured context that addresses the reasoning gap between retrieval and generation.

CP-SynC: Multi-Agent Zero-Shot Constraint Modeling in MiniZinc with Synthesized Checkers

CP-SynC: introduces a multi-agent workflow for zero-shot constraint modeling in MiniZinc that coordinates Modeling Agents, Validation Agents, and Selection Agents to improve model correctness through test-driven development.
The framework employs a Staged Checking Pipeline to perform hard execution checks and soft semantic checks, while utilizing Diverse Sampling Strategies to explore multiple modeling trajectories in parallel.
By aggregating evidence across candidate models and synthesized checkers, CP-SynC mitigates noise inherent in LLM outputs and significantly outperforms existing baselines on a benchmark of 100 constraint programming problems.

2nd May 2026

The Perceptual Bandwidth Bottleneck in Vision-Language Models: Active Visual Reasoning via Sequential Experimental Design

FOVEA (Foveated Observation and Visual Evidence Acquisition): introduces a training-free procedure that refines VLM crop proposals through evidence-oriented probing to overcome the perceptual bandwidth bottleneck in high-resolution visual reasoning.
The framework formalizes active visual information acquisition as a sequential Bayesian optimal experimental design (S-BOED) problem, utilizing a coverage–resolution objective to guide the agent's foveation actions.
By integrating a Planner, Resolvability Probing, and a Tool Interception Mechanism, the approach enables LLMs to iteratively acquire task-relevant visual evidence while managing fixed perceptual bandwidth constraints.

WHO DECIDES WHAT IS HARMFUL? CONTENT MODERATION POLICY THROUGH A MULTI-AGENT PERSONALISED INFERENCE FRAMEWORK

PRISM (Personalised Reasoning and Inference System for Moderation): introduces a user-in-the-loop multi-agent framework that filters content based on individual sensitivity profiles to improve moderation accuracy and alignment with user preferences.
The architecture utilizes a Manager Agent to orchestrate specialized agents—Sociologist, Linguist, and Psychologist—alongside a Ghost Profile Agent to simulate user perspectives and a Synthesis Agent to produce final moderation decisions.
By dynamically updating user profiles through explicit feedback, the system achieves significant improvements in F1 scores and precision compared to universal, non-personalised moderation baselines.

MILD: Mediator Agent System with Bidirectional Perception and Multi-Layered Alignment for Human-Vehicle Collaboration

MILD: introduces an agentic vehicular system that facilitates synergistic human-vehicle collaboration by integrating a joint perception agent and a lightweight strategy agent to provide auditable, constraint-aligned driving suggestions.
The framework utilizes Evidence- and Constraint-weighted Policy Optimization (ECPO) to align LLM-based strategy agents with safety regulations, vehicle limits, and driver preferences without requiring low-level control.
MILD employs a retrieval-augmented generation module to dynamically incorporate multi-level constraints, ensuring that high-level advisory policies remain grounded in current traffic context and individual driver states.

AI Alignment via Incentives and Correction

AI Alignment via Incentives and Correction: introduces a game-theoretic framework for AI alignment that treats solver and auditor interactions as a bilevel optimization problem to maintain oversight pressure.
The framework utilizes a Meta-Controller to dynamically adapt Reward-Profiles, ensuring that the Solver and Auditor reach a stable equilibrium that minimizes hallucinations and silent failures.
Experiments on coding pipelines demonstrate that this adaptive approach significantly reduces solver hallucinations compared to static reward baselines by optimizing for the full correction event.

Valley3: Scaling Omni Foundation Models for E-commerce

Valley3: introduces an omni-modal LLM for e-commerce that integrates a Qwen3-VL backbone with an Audio Transformer (AuT) encoder and an MLP Audio Connector to enable unified understanding across text, images, video, and audio.
The framework incorporates a controllable reasoning module and an agentic search mechanism to balance inference efficiency with deep, evidence-grounded research capabilities for complex e-commerce tasks.
Valley3 utilizes a four-stage continual pre-training pipeline and a closed-loop data optimization process to align model capabilities with specific e-commerce domain requirements.

The Buy-or-Build Decision, Revisited: How Agentic AI Changes the Economics of Enterprise Software

Agentic AI-augmented software development framework: introduces a conceptual model for re-evaluating enterprise software sourcing decisions by analyzing how agentic coding systems transform the governance and economics of in-house development.
The framework shifts the "Make" option from a traditional hierarchy to a hybrid governance form, characterized by internal code ownership combined with external AI infrastructure dependency.
It provides a typology of enterprise applications to guide strategic sourcing, identifying that while AI favors in-house development for commodity and custom applications, regulated and mission-critical systems remain better suited for external procurement.

Toward a Principled Framework for Agent Safety Measurement

BOA: introduces a principled search framework for measuring LLM agent safety by exploring the trajectory space under a deployment configuration to calculate a likelihood-weighted safety score.
The framework utilizes priority search engines, a trajectory cache, and an external environment to systematically identify unsafe behaviors that standard greedy or sampled evaluations often overlook.
BOA incorporates system-level optimizations including batched decoding, prefix caching, and chunked tree expansion to maintain manageable GPU costs while evaluating complex multi-turn agent interactions.

Less Interaction But More Explanation: A Communication Perspective on Agentic AI Interfaces

Agentic AI Framework: introduces a communication-centric perspective on agentic AI, arguing that increased autonomy necessitates more robust explanations rather than reduced interaction.
The framework identifies four communicative roles—AI Creator, AI Converser, AI Curator, and AI Co-Author—that agentic systems traverse, which can lead to source misattribution and accountability gaps.
To mitigate risks like over-trust and goal misalignment, the paper proposes three explanation types—Action-Process, Uncertainty, and Coordination—and suggests user-driven customization of these explanations to preserve human agency.

Evaluating Agentic AI in the Wild: Failure Modes, Drift Patterns, and a Production Evaluation Framework

PAEF (Production Agentic Evaluation Framework): introduces a five-dimension evaluation framework designed for continuous monitoring of agentic systems in production environments.
The framework addresses the limitations of episodic benchmarks by utilizing EvalRequest, LLMEvaluator, Cascade Uncertainty, Tool Reliability, Distribution Health, Explanation Validity, and Cross-Surface Consistency to detect failure modes like distribution collapse and goal drift.
PAEF provides a modular architecture that generates MetricResult and EvalReport objects to identify silent performance degradation that standard metrics fail to capture.

Hybrid Quantum Reinforcement Learning with QAOA for Improved Vehicle Routing Optimization

HQRL-QAOA: introduces a hybrid quantum-classical architecture that integrates QAOA-structured layers into a QRL policy network to solve the Vehicle Routing Problem.
The framework utilizes a QAOA warm-start module to initialize PQC parameters, effectively preventing barren plateaus and accelerating convergence compared to random initialization.
By employing a fixed 4-qubit register and constant circuit depth, the approach ensures scalability and compatibility with near-term quantum hardware for large-scale routing instances.

Feedback-Normalized Developer Memory for Reinforcement-Learning Coding Agents: A Safety-Gated MCP Architecture

RL Developer Memory: introduces a feedback-normalized, safety-gated architecture that interposes a memory-control layer between an LLM coding agent and a persistent store to ensure auditable, theory-to-code traceable memory updates.
The system utilizes a deterministic ranker as the primary baseline while employing a contextual-bandit shadow scorer and OPE gate to safely evaluate learned memory-selection policies without unverified intervention.
By mapping sparse developer feedback into a canonical reward vocabulary and linking delayed resolutions to specific retrieval events, the architecture provides a governed, component-level auditable framework for RL-specific code maintenance.

Multi-Agent Reasoning Improves Compute Efficiency: Pareto-Optimal Test-Time Scaling

Multi-Agent Reasoning Framework: introduces a systematic analysis of inference-time scaling strategies, evaluating self-consistency, self-refinement, multi-agent debate, and Mixture-of-Agents (MoA) under matched compute budgets to identify Pareto-optimal configurations.
The study utilizes three levers of test-time scaling—pipeline choice, pipeline parameters, and model size—to demonstrate that multi-agent systems, particularly MoA, achieve superior compute-accuracy efficiency compared to single-agent baselines.
The research establishes practical design guidelines, such as configuring MoA with one more proposer model than layers, and highlights that larger models can be more compute-efficient than heavily scaled smaller models.

Automated Interpretability and Feature Discovery in Language Models with Agents

InterpAgent: introduces an autonomous multi-agent framework that unifies feature discovery and hypothesis refinement within a single empirical loop for mechanistic interpretability.
The framework utilizes a supervisor to coordinate FeatureFinder, which identifies candidate features via statistical analysis, and FeatureExplainer, which iteratively stress-tests hypotheses using a multi-metric evaluation battery.
By maintaining persistent memory and a shared execution environment, the system produces auditable traces of hypotheses and interventions, improving upon one-shot automated interpretability methods.

6G Needs Agents: Toward Agentic AI-Native Networks for Autonomous Intelligence

Agentic AI-Native 6G: introduces a four-layer architecture that integrates Deterministic 6G Infrastructure, Semantic Abstraction Layer, Agentic Reasoning Layer, and Distributed Multi-Agent Fabric to enable policy-governed, intent-aware network autonomy.
The framework utilizes an LLM Agent Runtime, Tool-Oriented Execution Layer, and Semantic Memory Layer to perform multi-step reasoning and tool-grounded orchestration across the device–edge–core continuum.
The research provides an empirical evaluation of quantized LLMs on the 6G-Bench benchmark, demonstrating that heterogeneous agent deployment is required to balance reasoning capability, latency, throughput, and memory efficiency.

SciResearcher: Scaling Deep Research Agents for Frontier Scientific Reasoning

SciResearcher: introduces a fully automated agentic framework for constructing frontier scientific reasoning tasks by synthesizing conceptual and computational questions grounded in academic evidence.
The framework utilizes a hierarchical multi-agent architecture, including a main agent that coordinates specialized web- and file-agents to perform long-horizon information retrieval and quantitative model instantiation.
By leveraging this framework, the authors developed SciResearcher-8B, an agent foundation model that achieves state-of-the-art performance on scientific benchmarks by demonstrating adaptive, tool-intensive reasoning behavior.

MAP-Law: Coverage-Driven Retrieval Control for Multi-Turn Legal Consultation

MAP-Law (Mindmap-Augmented Planning for Legal Consultation Agents): introduces a coverage-driven framework for retrieval control in multi-turn legal consultation that utilizes a Plan Graph, Evidence Graph, Coverage Metrics, LLM Action Selector, Retriever, and Joint Graph Memory to optimize evidentiary sufficiency.
The framework models legal consultation as a structured decision process, where an LLM-based action selector operates over a joint graph representation to dynamically determine retrieval, supplementation, evaluation, or generation actions.
By employing explicit coverage metrics (EC, EVC, and MG) to govern stopping decisions, the system significantly reduces retrieval rounds and evidence volume while improving element coverage compared to fixed-depth RAG baselines.

Action Agent: Agentic Video Generation Meets Flow-Constrained Diffusion

Action Agent: introduces a two-stage framework that decouples high-level trajectory imagination from low-level control by synthesizing a validated Visual Intermediate Representation (VIR) before execution.
The framework utilizes an LLM-based agent to iteratively refine video generation prompts through a reasoning-based validator, ensuring physical plausibility and instruction adherence.
FlowDiT integrates semantic features from DINOv2 and ego-motion cues from optical flow to enable high-frequency, embodiment-aware velocity command prediction on consumer hardware.

An Intelligent eUPF for Time-Sensitive Path Selection in B5G Edge Networks

eUPF (enhanced User Plane Function): introduces an intelligent data plane architecture that utilizes a DQN agent to perform real-time, latency-aware path selection for B5G network slices.
The framework leverages passive eBPF-XDP telemetry to estimate end-to-end latency by correlating TEID-based timestamps, enabling autonomous steering between MEC and cloud endpoints.
Experimental validation on the FABRIC testbed demonstrates that the DQN-based approach significantly reduces average latency and improves temporal stability compared to random baseline policies.

Practical Limits of Autonomous Test Repair: A Multi-Agent Case Study with LLM-Driven Discovery and Self-Correction

Multi-agent testing system: introduces an industrial case study of an autonomous UI testing framework that utilizes LLMs for feature discovery, test generation, and iterative self-correction.
The system employs a five-agent pipeline—Explorer, Planner, Coder, Executor, and Self-Correction—to automate test maintenance in dynamic enterprise environments.
The research identifies critical failure modes, such as non-converging repair loops and hallucinated interactions, and proposes design guidelines for constrained, reliable autonomous testing.

CoFlow: Coordinated Few-Step Flow for Offline Multi-Agent Decision Making

CoFlow: introduces a generative architecture for offline multi-agent reinforcement learning that enables coordinated few-step trajectory generation through a natively joint-coupled velocity field.
The framework integrates Coordinated Velocity Attention (CVA) with Adaptive Coordination Gating into a Shared U-Net Backbone to facilitate inter-agent coupling without iterative sampling.
A novel Finite-Difference Consistency Surrogate replaces memory-intensive Jacobian-vector product backpropagation, enabling efficient training at multi-agent scale while maintaining coordination quality.

Decompose and Recompose: Reasoning New Skills from Existing Abilities for Cross-Task Robotic Manipulation

Decompose and Recompose: introduces a compositional skill reasoning framework that leverages Atomic Skills Collection, Planning Agent, Visual Encoder, Dynamic Demonstrations Library, Coverage-aware Static Library, and LLM to enable zero-shot cross-task robotic manipulation.
The framework decomposes seen task demonstrations into interpretable atomic skill-action pairs, which are then recomposed for unseen tasks through a dual-library retrieval system and LLM-based reasoning.
By providing LLMs with structured, skill-aligned demonstrations rather than raw numerical sequences, the approach effectively activates compositional reasoning for complex robotic manipulation tasks.

Artificial intelligence language technologies in multilingual healthcare: Grand challenges ahead

HCAILT (Human-Centered AI Language Technology): introduces a framework for evaluating AI language technologies in multilingual healthcare through the pillars of Reliability, Safety Culture, and Trustworthiness.
The paper synthesizes current evidence across written communication, spoken communication, and agentic workflows to identify seven grand challenges for future research and deployment.
It argues that progress in multilingual healthcare requires moving beyond model-centric benchmarks toward accountable sociotechnical design, calibrated human oversight, and interdisciplinary collaboration.

Verbal-R3: Verbal Reranker as the Missing Bridge between Retrieval and Reasoning

Verbal-R3 (Retrieve-Rerank-Reason): introduces an agentic RAG framework that utilizes a Generator and a Verbal Reranker to bridge the gap between retrieval and reasoning through Verbal Annotations.
The framework employs a lightweight Verbal Reranker to generate analytic narratives that map query-context relationships, effectively filtering noise for the Generator.
Relevance-guided test-time scaling dynamically allocates computational resources to the most promising reasoning trajectories, enhancing both performance and efficiency.

LiveFMBench: Unveiling the Power and Limits of Agentic Workflows in Specification Generation

LiveFMBench: introduces a contamination-aware benchmark of 630 ACSL-annotated C programs to systematically evaluate LLM-based formal specification generation across direct prompting, thinking mode, and agentic pipeline settings.
The study reveals that while thinking mode and agentic pipelines significantly improve success rates, current LLMs still struggle with deep semantic dependencies and loop invariants, often exhibiting unfaithful behaviors that inflate performance metrics.
Experimental results demonstrate that agentic pipelines are highly cost-effective under low sampling budgets, though performance gains diminish as sampling attempts increase, highlighting fundamental limitations in current LLM-based formal verification approaches.

ESARBench: A Benchmark for Agentic UAV Embodied Search and Rescue

ESARBench: introduces a novel task and high-fidelity simulation platform for evaluating LLM-driven UAV agents in complex, real-world search and rescue scenarios.
The framework utilizes a hierarchical "Event-Snapshot-Task" generation process to create 600 unique, difficulty-stratified rescue missions within photorealistic environments.
The benchmark employs comprehensive metrics including Success Rate, Time-weighted Success Rate, Clue Discovery Score, and a holistic Rescue Score to assess agent perception, reasoning, and mission efficiency.

Assistance Without Interruption: A Benchmark and LLM-based Framework for Non-Intrusive Human–Robot Assistance

NiaRR (Non-intrusive Assistance with Retrieval and Ranking): introduces a hybrid architecture that integrates an LLM-based semantic retriever with a transformer-based scoring model to identify well-timed, non-intrusive robotic assistance.
The framework utilizes a SBERT Encoder to generate embeddings for human task steps and candidate robot actions, which are then processed by a Cross-Attention Mechanism and MLP Scoring Head to rank the most beneficial assistive pairs.
To support this approach, the authors establish NIABench, a simulation benchmark that evaluates the robot's ability to provide assistance without interrupting the primary human task flow.

PACE: Parameter Change for Unsupervised Environment Design

PACE (Parameter Change for Unsupervised Environment Design): introduces a UED framework that evaluates environment value through the magnitude of policy parameter change induced by training, providing a direct and intrinsic signal of realized learning progress.
The framework utilizes a first-order Taylor approximation to establish a principled connection between the squared l2 norm of policy parameter updates and improvement in the optimization objective.
PACE avoids the computational overhead and variance associated with traditional proxy signals like regret or value-based estimation by grounding environment selection in deterministic, in-memory parameter updates.

RAISON: DEVELOPING RELIABLE DECISION-MAKING AGENTS

rAIson: introduces a high-level technological environment for developing reliable and explainable decision-making agents using a Front-end authoring tool, Gorgias inference engine, API, Automated code generator, and Deployment service.
The platform utilizes the Software Development via Argumentation (SoDA) methodology to capture decision policies as hierarchies of scenario-based preferences without requiring manual coding.
By offering explainable argumentation as a service (AIaaS), the platform enables the integration of symbolic reasoning into external agent systems like JADE or IoT applications.

DiagramNet: An End-to-End Recognition Framework for System-Level Diagrams

DiagramNet: introduces a multimodal dataset and a decoupled multi-agent workflow that decomposes complex system-level diagram recognition into Perception Agent, Reasoning Agent, and Knowledge Agent stages.
The framework utilizes a progressive training pipeline combining supervised fine-tuning, topology-consistency reinforcement learning, and task-specific LoRA adapters to improve LLMs performance on relational reasoning tasks.
By decoupling visual grounding from topological reasoning, the workflow mitigates spatial hallucinations and enables robust zero-shot transfer to diverse circuit diagram domains.

Truth or Tribe: How In-group Favoritism Prioritize Facts in Persona Agents

Truth or Tribe simulation framework: introduces a triadic interaction paradigm to evaluate how persona-based LLMs exhibit in-group favoritism when faced with conflicting information from peers of varying similarity.
The framework utilizes Persona Similarity Distance (PSD) to measure bias, revealing that LLMs consistently prioritize in-group opinions over objective truth, a phenomenon termed "Tribe over Truth."
The study demonstrates that this tribal bias is robust across multiple LLMs and reasoning contexts, while also validating that prompt-based interventions like Identity-Blind Instruction (IBI) can effectively mitigate such identity-driven heuristics.

GA-VisAgent: A Multi-Agent application for code generation and visualization in interactive learning

GA-VisAgent: introduces a multi-agent framework that leverages GAGPT and ReAct reasoning to decompose complex Geometric Algebra tasks into structured subtasks for accurate code generation and interactive visualization.
The architecture utilizes a hierarchical dual-agent system, comprising a Planner Agent for task decomposition and a Worker Agent for executing specialized subtasks through a dedicated function library.
By integrating automated syntax validation and GAALOP API calls, the framework achieves a 90% success rate in generating executable code for conformal space tasks, significantly outperforming standard LLMs.

Lifting Traces to Logic: Programmatic Skill Induction with Neuro-Symbolic Learning for Long-Horizon Agentic Tasks

NSI (Neuro-Symbolic Skill Induction): introduces a framework that transforms transient agentic reasoning into persistent, logic-grounded skills by lifting interaction traces into modular programs with explicit control flows and dynamic variable bindings.
The framework utilizes Neural Feedback Perception to ground observations into symbolic states and Symbolic Execution Logic to enable state-aware, adaptive execution through branching and loops.
NSI incorporates a Reflective Planning mechanism that converts runtime failures into permanent logic improvements by grafting corrective subgraphs onto existing skill structures.

FeedbackLLM: Metadata driven Multi-Agentic Language Agnostic Test Case Generator with Evolving prompt and Coverage Feedback

FeedbackLLM: introduces a multi-agentic framework that utilizes a tightly coupled two-stage pipeline to automate language-agnostic test case generation through iterative prompt refinement.
The architecture employs specialized Line Feedback Agent and Branch Feedback Agent components to analyze coverage gaps and generate targeted constraints for subsequent LLM iterations.
A Redundancy Prevention Cache is integrated to minimize redundant API calls and execution cycles, ensuring linear scalability in test generation time.

EO-Gym: A Multimodal, Interactive Environment for Earth Observation Agents

EO-Gym: introduces a controlled, Gymnasium-style environment for multimodal, tool-using agents to perform interactive Earth Observation evidence-acquisition tasks.
The framework utilizes a massive Data Lake and a Data Gathering Space to support complex reasoning across spatial, temporal, and cross-modal dimensions.
EO-GYM-DATA provides a large-scale benchmark of trajectories, while the EO-GYM-4B model demonstrates improved tool-use fidelity and reasoning through specialized fine-tuning.

S3-R1: Learning to Retrieve and Answer Step-by-Step with Synthetic Data

S3-R1: introduces a data-centric framework that improves multi-hop QA by synthesizing intermediate-difficulty questions and employing a retrieval-aware reward signal to guide LLM agent training.
The framework utilizes a Generator LLM to create synthetic questions from hard-mined anchor instances, which are then filtered by an Oracle LLM and a Retriever to ensure factual grounding and empirical solvability.
To stabilize training, S3-R1 incorporates negative advantage clipping and a composite reward function that balances final answer correctness with intermediate search quality.

FP-Agent: Fingerprinting AI Browsing Agents

FP-Agent: introduces a measurement framework that utilizes browser and behavioral fingerprints to reliably distinguish between AI browsing agents and human users.
The framework employs client-side instrumentation to collect interaction artifacts, which are then processed by an XGBoost classifier to achieve near-perfect identification accuracy.
Experimental results demonstrate that while browser fingerprints are often correlated with execution environments, behavioral features such as typing, scrolling, and mouse movement provide highly discriminative signals for bot detection.

Breaking the Computational Barrier: Provably Efficient Actor–Critic for Low-Rank MDPs

Opt-AC (Optimistic Actor–Critic): introduces a provably efficient reinforcement learning algorithm for low-rank Markov Decision Processes that reduces computational complexity by relying solely on a policy evaluation oracle implemented via supervised learning.
The framework utilizes an optimistic critic constructed from learned transition models and an exploration bonus to achieve state-of-the-art sample complexity without requiring computationally expensive planning or constrained optimization oracles.
Theoretical results demonstrate that the algorithm achieves polynomial sample and computational complexity, while empirical evaluations on standard continuous-control benchmarks confirm its practical effectiveness and competitiveness against strong baselines.

Lost in the Tower of Babel: The Adverse Effects of Incidental Multilingualism in LLMs

ToB (Tower of Babel) Problem Framework: introduces a critical analysis of how incidental multilingualism in LLMs leads to unstable language support and performance degradation in agentic systems, utilizing LLM-based agents, inter-agent evaluation, language identification, translation, code generation, long-form generation, and support-list elicitation.
The paper demonstrates that current LLMs exhibit significant discrepancies between self-reported language support and actual behavioral performance, often resulting in incorrect generation rather than explicit refusal.
The authors advocate for a shift toward "multilingualism by design," emphasizing the need for explicit multilingual objectives, transparent support frontiers, and robust safety mechanisms in the LLM development pipeline.

Agentic AI Systems Should Be Designed as Marginal Token Allocators

Marginal Token Allocation Framework: introduces a unified economic model for agentic AI systems, treating routers, agents, serving stacks, and trainers as vertical slices of a single marginal token allocation problem.
The framework identifies that local optimization within these components leads to global misallocation, proposing that systems should be designed around a shared first-order condition balancing marginal benefit against compute, latency, and risk costs.
This approach provides a diagnostic tool for recurring failure modes in LLMs, such as over-routing, serving congestion, and cache misuse, by treating tokens as economic units of computation rather than flat-rate text.

ClarifySTL: An Interactive LLM Agent Framework for STL Transformation through Requirements Clarification

ClarifySTL: introduces an interactive framework that enhances the transformation of natural language requirements into Signal Temporal Logic (STL) specifications by iteratively resolving vagueness and ambiguity through specialized detection and inquiry agents.
The framework utilizes a fine-tuned Vagueness Detector and a contrastive learning-based Ambiguity Detector to identify defective requirements, followed by targeted inquiry agents that guide users to provide necessary clarifications.
By employing iterative re-checking loops and LLM-based query generation, ClarifySTL ensures that the final STL formulas accurately capture user intent while reducing the manual burden on domain experts.

Faithful Mobile GUI Agents with Guided Advantage Estimator

Faithful-Agent: introduces a two-stage training pipeline that prioritizes evidence groundedness and internal consistency in GUI agents by employing faithfulness-oriented SFT and RFT with GuAE.
The framework utilizes GuAE to mitigate advantage collapse in low-variance rollout groups by combining anchor regularization and variance-adaptive tempering to preserve meaningful learning signals.
Faithful-Agent incorporates a thought-action consistency reward to align agent reasoning with executed actions, effectively suppressing speculative behaviors and improving robustness under missing or conflicting UI evidence.

Trace: Unmasking AI Attack Agents Through Terminal Behavior Fingerprinting

TRACE: introduces a two-stage framework that passively fingerprints AI attack agents via terminal command sequences and performs active forensics using family-calibrated Defensive Prompt Injection (DPI) to extract system prompts.
The framework utilizes a TF-IDF bigram vectorizer and LinearSVC classifier to attribute sessions to specific LLM families, enabling targeted payload routing for effective intelligence extraction.
Evaluation across seven frontier LLM families and three agent scaffolds demonstrates that TRACE achieves high attribution accuracy and robust forensic intelligence extraction, even under unseen deployment conditions.

Citation

How to cite my work?

@misc{MaattaAutonomousAgents2023,
  author = {Teemu Maatta},
  title = {Autonomous Agents},
  year = {2023},
  howpublished = {\url{http://github.com/tmgthb/Autonomous-Agents}},
  note = {Accessed: YYYY-MM-DD}
}