
Microsoft this week filled in details on enhancements to the PowerCAT Copilot Studio Kit, which augments Copilot Studio to help develop, govern, and test custom AI agents.
Chief among the new features is Rubrics Refinement, which is used to create, test, and improve evaluation standards, or rubrics, that measure the quality of AI agents and the responses they generate.
Rubrics Refinement helps ensure that AI grading of an agent’s responses aligns with human judgment and therefore measures up to organizational quality standards. It helps address the growing need among enterprises for AI agents built with robust quality, testing, and validation, in line with traditional enterprise software standards.
Other new features in PowerCAT Copilot Studio Kit include a series of enhancements that similarly focus on measuring the quality of agents and their outputs, as well as broader agent governance.
Rubrics: How They’re Used, How They’re Advancing
In an AI agent context, a rubric is a set of natural-language grading instructions that an AI judge uses to evaluate the quality of an agent's responses; it describes what a quality response looks like and how responses should be graded.
An AI judge — in the form of an LLM — produces not only a grade but also a rationale explaining its assessment. A human also grades the responses, and customers know the rubric is working as intended when the AI grade and the human grade align. If they don't, the rubric needs improvement. An AI judge's grading quality depends on the rubric's quality.
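As a rough illustration of that judge-plus-rubric pattern, the sketch below stubs out the LLM call (`call_llm_judge` is a hypothetical placeholder, not a Copilot Studio API) and shows how an AI grade and rationale can be checked against a human grade:

```python
# Hypothetical sketch of LLM-as-judge grading against a rubric.
# `call_llm_judge` and its canned result are illustrative stand-ins
# for a real model call; none of this is a Copilot Studio API.

RUBRIC = """Grade the agent's answer from 1 (poor) to 5 (excellent).
A 5 is accurate, cites a knowledge source, and directly answers the question.
A 1 is inaccurate or off-topic."""

def call_llm_judge(rubric: str, question: str, answer: str) -> dict:
    # Placeholder for an LLM call; returns a fixed grade and rationale
    # so the sketch is runnable.
    return {"grade": 4, "rationale": "Accurate and on-topic, but no citation."}

result = call_llm_judge(RUBRIC, "What is our refund window?", "30 days from purchase.")
human_grade = 5  # the human reviewer's grade for the same answer

print(result["grade"], "-", result["rationale"])
if result["grade"] != human_grade:
    print("AI and human grades disagree: the rubric may need refinement")
```

Here the judge and the human disagree by one point, which in the Rubrics Refinement workflow would flag the rubric instructions for another iteration.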
In the absence of systematic evaluation criteria, organizations are challenged to define standards, compare grades, and identify where rubric instructions may need improvement.
That’s where Rubrics Refinement comes in, with the goal of maximizing alignment between AI and human evaluations.
It does so through:
- Reusable evaluation standards, which define rubrics once and reuse them across agents and tests
- Alignment with human judgment, systematically minimizing disagreement between AI and human graders
- Quality assurance that establishes durable assets in the form of organizational quality standards
- Confidence in AI evaluation, building trust through transparent, iterative refinement
Rubrics Refinement involves these steps:
- Defining a rubric and evaluation criteria
- Running tests using the rubric to generate AI grades
- Reviewing agent responses and providing human grades
- Comparing AI and human assessments to identify misalignment
Following these steps, users refine and repeat as needed.
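The compare-and-refine loop described above can be sketched as follows; the test cases, grades, and the 80% agreement threshold are illustrative assumptions, not values from the kit:

```python
# Hedged sketch of the refine-and-repeat step: compare AI grades with
# human grades across a test run and flag misaligned cases.
# All data and the 0.8 threshold are made up for illustration.

test_run = [
    {"case": "refund policy",  "ai_grade": 5, "human_grade": 5},
    {"case": "shipping time",  "ai_grade": 3, "human_grade": 4},
    {"case": "warranty terms", "ai_grade": 2, "human_grade": 2},
    {"case": "store hours",    "ai_grade": 4, "human_grade": 4},
]

matches = sum(1 for t in test_run if t["ai_grade"] == t["human_grade"])
agreement = matches / len(test_run)

print(f"agreement: {agreement:.0%}")
for t in test_run:
    if t["ai_grade"] != t["human_grade"]:
        print(f"misaligned: {t['case']} (AI {t['ai_grade']} vs human {t['human_grade']})")

if agreement < 0.8:
    print("refine the rubric instructions and rerun the tests")
```

With three of four cases matching (75% agreement), this run would fall below the assumed threshold, so the rubric would be revised and the tests rerun.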
Response optimization — the actual steps to improve the quality of an agent’s answers — takes place in Copilot Studio itself. Rubrics Refinement focuses purely on ensuring the organization’s evaluation criteria accurately reflect human judgment so that automated grading results can be trusted.
Microsoft noted that rubrics are used for two distinct “levels”: the test case level, for test automation with custom grading, and the test run level, for iteratively refining and improving the rubric. Rubrics Refinement supports both levels.
The company said Rubrics Refinement is designed for use by quality assurance teams, agent builders, enterprises, and anyone seeking trustworthy AI evaluations.
Additional Agent Quality Features
Several additional features in the PowerCAT Copilot Studio Kit address agent quality, response quality, and governance. They include:
- Compliance Hub, which is used to define and enforce governance policies for Copilot Studio agents. It continuously evaluates agent configurations against risk thresholds and automatically creates compliance cases when violations are detected.
- Conversation KPIs, which are designed to track and analyze the performance of custom agents, making conversation outcomes easier to understand by providing aggregated data in Dataverse rather than requiring analysis of complex conversation transcripts.
- Agent Inventory, which provides a tenant-wide view of all Copilot Studio custom agents, including the features those agents use, their knowledge sources, and more. Agent Inventory ships with a dashboard, and the data it captures can be exported for use in other applications.
- Conversation Analyzer, which allows users to review their custom agents' conversations using custom prompts to surface additional insights.
Future releases of the Copilot Studio Kit, the company said, will include enhanced diagnostics and analytics, and governance features for approvals, lifecycle management, and publishing.
More on AI Agent Testing and Governance:
- Microsoft Advances Copilot With Automated Agent Test Tools, Built-In Collaboration in Teams App
- Microsoft Fills in Agent 365 Management and Governance Details





