Evaluating and Governing Agentic AI: Benchmarks, Metrics and Regulatory Alignment

Azib Farooq^∗1, Shaina Raza^∗2, Nazmul Karim³, Hasan Iqbal⁴, Athanasios V. Vasilakos^†5, Christos Emmanouilidis^†6,

^∗Equal Contributors ^†Senior Authors

Various Institutions

Abstract

Agentic AI represents a new generation of Artificial Intelligence (AI) systems capable of perceiving, reasoning, planning, and acting toward achieving goals with a degree of autonomy. Unlike traditional AI models that merely generate outputs, these systems maintain memory, interact with their environment, and adapt over time. However, evaluating such interactive and evolving behavior remains a significant challenge. While several recent surveys have examined agentic AI architectures, components, and applications, few have systematically reviewed their evaluation, particularly regarding performance, reliability, and governance across an evolving agentic AI ecosystem. This paper addresses that gap by reviewing recent progress in the development and assessment of agentic AI, focusing on three core dimensions: benchmarks, metrics, and governance. We analyze how current evaluation frameworks capture reasoning, planning, collaboration, and ethical alignment across single- and multi-agent systems. Ultimately, this study aims to establish a unified foundation for building trustworthy, auditable, and human-aligned AI agents.

Key Contributions

Structured Benchmark Review: The first comprehensive survey categorizing existing benchmarks and evaluation metrics for agentic AI systems by lifecycle stage and highlighting commonly used metrics.
Thematic Analysis: Identification of emerging trends and open challenges in the evaluation of agentic AI systems through a detailed thematic synthesis.
Governance Examination: Analysis of governance aspects and regulatory alignment in current evaluation frameworks for agentic AI.

Symbolic Comparison of Prior Survey Papers and Our Work on Agentic AI

Survey	Focus	Timeline	Benchmarks	Metrics	Governance
Yehudai et al. (2025)	Evaluation of LLM-based agents	Post	✓	✓	◐
Plaat et al. (2025)	Agentic LLMs: reason–act–interact	Post	◐	◐	◐
Acharya et al. (2025)	Agentic AI foundations & applications	Pre+Post	◐	◐	✓
Bandi et al. (2025)	Taxonomy: AI agents vs. agentic AI	Pre+Post	✗	✗	◐
Hughes et al. (2025)	Multi-expert industry analysis	Pre+Post	✗	✗	✓
Piccialli et al. (2025)	Distributed AgentAI for Industry 4.0	Pre+Post	◐	◐	✓
Mohammadi et al. (2025)	Evaluation/benchmarking of LLM agents	Post	✓	✓	✗
Nisa et al. (2025)	Agentic AI overview (org transformation)	Pre+Post	◐	◐	✗
Yu et al. (2025)	Trustworthy & Security evaluation of agents	Post	✗	◐	✗
Ours	Benchmarks, metrics, and governance	Post	✓	✓	✓

Legend: ✓ = Present / Strong, ◐ = Partial / Limited, ✗ = Absent, Pre = Pre-LLM era, Post = Post-LLM era.

BibTeX

@article{farooq2025agentic, title={Evaluating and Regulating Agentic AI: A Study of Benchmarks, Metrics, and Regulation}, author={Farooq, Azib; Raza, Shaina; Karim, Nazmul; Iqbal, Hassan; Vasilakos, Athanasios; Emmanouilidis, Christos}, journal={techrxiv, techrxiv.176186841.18883348/v1}, year={2025} }

Evaluating and Governing Agentic AI: Benchmarks, Metrics and Regulatory Alignment

A comprehensive survey exploring agentic AI Benchmarks, Evaluation Metrics & Regulatory Alignment.

Agentic Pipeline Overview

Regulatory Alignment

Agentic Benchmarks

Timeline of the Agentic AI papers

Abstract

Taxonomy of Agentic AI metrics and Benchmarks

Key Contributions

Symbolic Comparison of Prior Survey Papers and Our Work on Agentic AI

BibTeX