Evaluating and Governing Agentic AI: Benchmarks, Metrics and Regulatory Alignment

Azib Farooq∗1, Shaina Raza∗2, Nazmul Karim3, Hasan Iqbal4, Athanasios V. Vasilakos†5, Christos Emmanouilidis†6,
Equal Contributors Senior Authors
Various Institutions

A comprehensive survey exploring agentic AI Benchmarks, Evaluation Metrics & Regulatory Alignment.

Agentic pipeline

Agentic Pipeline Overview

Agentic pipeline

Regulatory Alignment

Agentic pipeline

Agentic Benchmarks

Timeline of the Agentic AI papers

Description of your image

Abstract

Agentic AI represents a new generation of Artificial Intelligence (AI) systems capable of perceiving, reasoning, planning, and acting toward achieving goals with a degree of autonomy. Unlike traditional AI models that merely generate outputs, these systems maintain memory, interact with their environment, and adapt over time. However, evaluating such interactive and evolving behavior remains a significant challenge. While several recent surveys have examined agentic AI architectures, components, and applications, few have systematically reviewed their evaluation, particularly regarding performance, reliability, and governance across an evolving agentic AI ecosystem. This paper addresses that gap by reviewing recent progress in the development and assessment of agentic AI, focusing on three core dimensions: benchmarks, metrics, and governance. We analyze how current evaluation frameworks capture reasoning, planning, collaboration, and ethical alignment across single- and multi-agent systems. Ultimately, this study aims to establish a unified foundation for building trustworthy, auditable, and human-aligned AI agents.

Taxonomy of Agentic AI metrics and Benchmarks

Description of your image

Key Contributions

  • Structured Benchmark Review: The first comprehensive survey categorizing existing benchmarks and evaluation metrics for agentic AI systems by lifecycle stage and highlighting commonly used metrics.
  • Thematic Analysis: Identification of emerging trends and open challenges in the evaluation of agentic AI systems through a detailed thematic synthesis.
  • Governance Examination: Analysis of governance aspects and regulatory alignment in current evaluation frameworks for agentic AI.

Symbolic Comparison of Prior Survey Papers and Our Work on Agentic AI

Survey Focus Timeline Benchmarks Metrics Governance
Yehudai et al. (2025) Evaluation of LLM-based agents Post
Plaat et al. (2025) Agentic LLMs: reason–act–interact Post
Acharya et al. (2025) Agentic AI foundations & applications Pre+Post
Bandi et al. (2025) Taxonomy: AI agents vs. agentic AI Pre+Post
Hughes et al. (2025) Multi-expert industry analysis Pre+Post
Piccialli et al. (2025) Distributed AgentAI for Industry 4.0 Pre+Post
Mohammadi et al. (2025) Evaluation/benchmarking of LLM agents Post
Nisa et al. (2025) Agentic AI overview (org transformation) Pre+Post
Yu et al. (2025) Trustworthy & Security evaluation of agents Post
Ours Benchmarks, metrics, and governance Post

Legend: = Present / Strong,  = Partial / Limited,  = Absent,  Pre = Pre-LLM era,  Post = Post-LLM era.

BibTeX

@article{farooq2025agentic,
  title={Evaluating and Regulating Agentic AI:  A Study of Benchmarks, Metrics, and Regulation},
  author={Farooq, Azib; Raza, Shaina; Karim, Nazmul; Iqbal, Hassan; Vasilakos, Athanasios; Emmanouilidis, Christos},
  journal={techrxiv, techrxiv.176186841.18883348/v1},
  year={2025}
}