
Microsoft this week filled in details on enhancements to the PowerCAT Copilot Studio Kit, which augments Copilot Studio to help develop, govern, and test custom AI agents.
Chief among the new features is Rubrics Refinement, which is used to create, test, and improve evaluation standards, or rubrics, that measure the quality of AI agents and the responses they generate.
Rubrics Refinement helps ensure that AI grading of an agent’s responses aligns with human judgment and therefore measures up to organizational quality standards. It helps address the growing need among enterprises for AI agents built with robust quality, testing, and validation, in line with traditional enterprise software standards.
Other new features in PowerCAT Copilot Studio Kit include a series of enhancements that similarly focus on measuring the quality of agents and their outputs, as well as broader agent governance.
Rubrics: How They’re Used, How They’re Advancing
In an AI agent context, a rubric is a set of natural-language grading instructions that an AI judge uses to evaluate the quality of an agent's responses; it describes what a quality response looks like and how responses should be graded.
An AI judge — in the form of an LLM — produces not only a grade but also a rationale explaining its assessment. A human also grades the responses, and customers know the rubric is working as intended when the AI grade and the human grade align. If they don't, the rubric needs improvement. An AI judge's grading quality depends on the rubric's quality.
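As a rough illustration of that judge-plus-rubric pattern, the sketch below stubs out the LLM call (`call_llm_judge` is a hypothetical placeholder, not a Copilot Studio API) and shows how an AI grade and rationale can be checked against a human grade:

```python
# Hypothetical sketch of LLM-as-judge grading against a rubric.
# `call_llm_judge` and its canned result are illustrative stand-ins
# for a real model call; none of this is a Copilot Studio API.

RUBRIC = """Grade the agent's answer from 1 (poor) to 5 (excellent).
A 5 is accurate, cites a knowledge source, and directly answers the question.
A 1 is inaccurate or off-topic."""

def call_llm_judge(rubric: str, question: str, answer: str) -> dict:
    # Placeholder for an LLM call; returns a fixed grade and rationale
    # so the sketch is runnable.
    return {"grade": 4, "rationale": "Accurate and on-topic, but no citation."}

result = call_llm_judge(RUBRIC, "What is our refund window?", "30 days from purchase.")
human_grade = 5  # the human reviewer's grade for the same answer

print(result["grade"], "-", result["rationale"])
if result["grade"] != human_grade:
    print("AI and human grades disagree: the rubric may need refinement")
```

Here the judge and the human disagree by one point, which in the Rubrics Refinement workflow would flag the rubric instructions for another iteration.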
In the absence of systematic evaluation criteria, organizations are challenged to define standards, compare grades, and identify where rubric instructions may need improvement.
That’s where Rubrics Refinement comes in, with the goal of maximizing alignment between AI and human evaluations.
It does so through:
- Reusable evaluation standards, which define rubrics once and reuse them across agents and tests
- Alignment with human judgment, systematically minimizing disagreement between AI and human graders
- Quality assurance that establishes durable assets in the form of organizational quality standards
- Confidence in AI evaluation, building trust through transparent, iterative refinement
Rubrics Refinement involves these steps:
- Defining a rubric and evaluation criteria
- Running tests using the rubric to generate AI grades
- Reviewing agent responses and providing human grades
- Comparing AI and human assessments to identify misalignment
Following these steps, users refine and repeat as needed.
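The compare-and-refine loop described above can be sketched as follows; the test cases, grades, and the 80% agreement threshold are illustrative assumptions, not values from the kit:

```python
# Hedged sketch of the refine-and-repeat step: compare AI grades with
# human grades across a test run and flag misaligned cases.
# All data and the 0.8 threshold are made up for illustration.

test_run = [
    {"case": "refund policy",  "ai_grade": 5, "human_grade": 5},
    {"case": "shipping time",  "ai_grade": 3, "human_grade": 4},
    {"case": "warranty terms", "ai_grade": 2, "human_grade": 2},
    {"case": "store hours",    "ai_grade": 4, "human_grade": 4},
]

matches = sum(1 for t in test_run if t["ai_grade"] == t["human_grade"])
agreement = matches / len(test_run)

print(f"agreement: {agreement:.0%}")
for t in test_run:
    if t["ai_grade"] != t["human_grade"]:
        print(f"misaligned: {t['case']} (AI {t['ai_grade']} vs human {t['human_grade']})")

if agreement < 0.8:
    print("refine the rubric instructions and rerun the tests")
```

With three of four cases matching (75% agreement), this run would fall below the assumed threshold, so the rubric would be revised and the tests rerun.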
Response optimization — the actual steps to improve the quality of an agent’s answers — takes place in Copilot Studio itself. Rubrics Refinement focuses purely on ensuring the organization’s evaluation criteria accurately reflect human judgment so that automated grading results can be trusted.
Microsoft noted that rubrics are used for two distinct “levels”: the test case level, for test automation with custom grading, and the test run level, for iteratively refining and improving the rubric. Rubrics Refinement supports both levels.
The company said Rubrics Refinement is designed for use by quality assurance teams, agent builders, enterprises, and anyone seeking trustworthy AI evaluations.
Additional Agent Quality Features
Several additional features in the PowerCAT Copilot Studio Kit address agent quality, response quality, and governance. They include:
- Compliance Hub, which is used to define and enforce governance policies for Copilot Studio agents. It continuously evaluates agent configurations against risk thresholds and automatically creates compliance cases when violations are detected.
- Conversation KPIs, which are designed to track and analyze the performance of custom agents, making conversation outcomes easier to understand by providing aggregated data in Dataverse rather than requiring analysis of complex conversation transcripts.
- Agent Inventory, which provides a tenant-wide view of all Copilot Studio custom agents, including the features those agents use, their knowledge sources, and more. Agent Inventory ships with a dashboard, and the data it captures can be exported for use in other applications.
- Conversation Analyzer, which allows users to review their custom agents' conversations using custom prompts to surface additional insights.
Future releases of the Copilot Studio Kit, the company said, will include enhanced diagnostics and analytics, and governance features for approvals, lifecycle management, and publishing.
More on AI Agent Testing and Governance:
- Microsoft Advances Copilot With Automated Agent Test Tools, Built-In Collaboration in Teams App
- Microsoft Fills in Agent 365 Management and Governance Details





