Uncovering Learner Behavior Patterns in AI Coaching Interactions

Situation

An education technology company's platform served 190,000 U.S. college students learning constructive dialogue skills. I synthesized years of open-ended student feedback data and confirmed a persistent pattern: students consistently asked for more interactivity and more chances to practice. There was a gap between learning dialogue skills and having a safe place to actually try them. With polarization rising and demand for these skills growing, the stakes for closing that gap were higher than ever.

A competitive landscape analysis confirmed what the feedback data showed: low-stakes practice opportunities for constructive dialogue were scarce. Secondary research on AI attitudes in higher education surfaced a critical tension: the same student population that wanted more practice also harbored significant mistrust of AI, driven by concerns about data privacy, bias, and surveillance. These findings defined the opportunity space and the trust guardrails that shaped every subsequent decision about the AI coach.

But AI models can produce confident, plausible responses that miss the mark entirely. Before this coach reached students, the team needed to know whether it was performing as expected, where it was failing, and what those failures revealed about how students actually engage with dialogue practice.

Complication

The organization needed a system to accurately classify student responses, deliver feedback that helped struggling students improve, detect rude or off-topic input, and calibrate difficulty so most students could realistically succeed, all while navigating the trust concerns the research had surfaced. This required an evaluation and classification approach designed specifically for AI coaching, one that could surface behavioral insights and keep pace with the product as it scaled.

What I Did

Synthesizing Unstructured Data to Define the Opportunity

I analyzed open-ended student feedback collected over multiple years, conducted a competitive landscape analysis of existing practice tools, and led secondary research on AI trust in higher education. The synthesis revealed that students wanted a fundamentally different mode of engagement. It also revealed that any AI-powered solution would need to actively earn trust with a skeptical audience. These insights directly shaped the product requirements, defining what the coach needed to do and the guardrails it needed to have from day one.

Audience Segmentation & Behavioral Analysis

Through moderated research with college students, I identified six distinct audience segments based on how learners approached dialogue practice. These segments were shaped by prior AI experience, comfort with vulnerability, and willingness to engage with feedback. The segmentation informed how the platform framed tasks and delivered coaching responses to different types of learners.

A pilot study found that 80% of users rated the coach 4/5 or 5/5 on helpfulness. But the data also showed that students who struggled across multiple attempts had a meaningfully different experience. I analyzed instances of AI feedback and built a five-category error taxonomy mapping exactly how learners get stuck when practicing dialogue skills. The most significant finding: 54% of user struggles traced to instruction clarity. That single insight redirected engineering investment toward the highest-impact area for improvement. The taxonomy was designed to be directly usable in system prompts, giving the engineering team a behavioral classification framework the AI could apply in real time.
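
For illustration, here is a minimal Python sketch of how a taxonomy like this could be encoded for use in a system prompt. Only the instruction-clarity category comes from the finding above; the other four category names, the guidance strings, and the build_coach_system_prompt helper are hypothetical placeholders, not the project's actual prompt.

```python
from enum import Enum

class StruggleCategory(str, Enum):
    """Five-category error taxonomy for classifying learner responses.

    Only INSTRUCTION_CLARITY reflects the finding described above; the
    other four names are hypothetical stand-ins for illustration.
    """
    INSTRUCTION_CLARITY = "instruction_clarity"    # learner misunderstood the task itself
    SKILL_MISAPPLICATION = "skill_misapplication"  # hypothetical: right intent, wrong technique
    OFF_TOPIC = "off_topic"                        # hypothetical: response drifts from the scenario
    LOW_EFFORT = "low_effort"                      # hypothetical: too thin to assess meaningfully
    HOSTILE = "hostile"                            # hypothetical: rude or adversarial input

# Hypothetical coaching guidance keyed to each category.
CATEGORY_GUIDANCE = {
    StruggleCategory.INSTRUCTION_CLARITY: "Restate the task in plainer language before giving feedback.",
    StruggleCategory.SKILL_MISAPPLICATION: "Name the intended skill and show one concrete correction.",
    StruggleCategory.OFF_TOPIC: "Gently redirect to the activity's scenario.",
    StruggleCategory.LOW_EFFORT: "Invite a fuller attempt with one specific question.",
    StruggleCategory.HOSTILE: "De-escalate, set a boundary, and offer to continue.",
}

def build_coach_system_prompt(activity_instructions: str) -> str:
    """Embed the taxonomy in a system prompt so the model classifies a
    response into exactly one category before tailoring its feedback."""
    lines = "\n".join(f"- {c.value}: {CATEGORY_GUIDANCE[c]}" for c in StruggleCategory)
    return (
        "You are a constructive-dialogue coach. First classify the learner's "
        "response into exactly one category below, then follow its guidance:\n"
        f"{lines}\n\nActivity instructions:\n{activity_instructions}"
    )
```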

AI-Powered Classification & Evaluation Methodology

I designed the golden dataset methodology: the labeled "ground truth" against which the system's classifications are measured. This involved collecting real student responses across the full range of quality, classifying each against a coaching rubric, and then using those labeled examples to generate 200+ synthetic variations per activity for systematic testing. Every response was graded blind by me and the education lead to calibrate agreement on classification standards.
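
As a rough sketch of the shape of this workflow, the Python below models a golden-dataset record and computes Cohen's kappa, one common way to quantify agreement between two blind graders. The GoldenExample fields, the label names, and the choice of kappa as the agreement statistic are my illustrative assumptions; the case study describes blind dual grading but not a specific metric.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class GoldenExample:
    activity_id: str
    student_response: str  # a real response, or a synthetic variation of one
    rubric_label: str      # ground-truth classification under the coaching rubric
    is_synthetic: bool     # True for generated variations used to scale testing

def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Chance-corrected agreement between two blind graders' labels."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)

# Two graders label the same responses without seeing each other's grades.
grader_1 = ["meets_rubric", "instruction_clarity", "off_topic", "meets_rubric"]
grader_2 = ["meets_rubric", "instruction_clarity", "meets_rubric", "meets_rubric"]
print(f"kappa = {cohens_kappa(grader_1, grader_2):.2f}")  # kappa = 0.56
```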

For each of the six activities, I ran 20 to 30 manual evaluation cycles against the coaching rubric and product requirements, logging every issue by theme. I assessed whether skill detection hit 95%+ classification accuracy, whether feedback matched classifications 100% of the time, and whether difficulty calibration matched realistic student performance patterns. This methodology now serves as the repeatable standard for all future AI activity development.
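
A minimal sketch of what a release gate built on the first two thresholds could look like, assuming parallel lists of golden and predicted labels; the function name, argument shapes, and returned dict are invented for illustration, and difficulty calibration (a judgment call against student performance patterns) is deliberately left out.

```python
def gate_activity_release(golden_labels, predicted_labels, feedback_matched):
    """Check one activity's evaluation run against the release thresholds.

    golden_labels / predicted_labels: parallel lists of rubric classifications.
    feedback_matched: one bool per example, True when the delivered feedback
    corresponded to the classification the system assigned.
    """
    accuracy = sum(g == p for g, p in zip(golden_labels, predicted_labels)) / len(golden_labels)
    alignment = sum(feedback_matched) / len(feedback_matched)
    return {
        "classification_accuracy": accuracy,
        "feedback_alignment": alignment,
        "passes": accuracy >= 0.95 and alignment == 1.0,  # thresholds from the criteria above
    }
```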

Trust & Safety Testing

Before the product reached students, I ran structured stress testing with internal staff. Testers submitted hostile, off-topic, and edge-case inputs to evaluate system behavior under pressure. I categorized findings for product and engineering teams, distinguishing between bugs, design decisions, and areas where the AI's behavioral principles needed refinement.
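
To make that triage concrete, here is a small illustrative schema for logging findings into those three buckets; every type name and field below is a hypothetical sketch, not the team's actual tooling.

```python
from dataclasses import dataclass
from enum import Enum

class FindingType(Enum):
    BUG = "bug"                      # behavior contradicted the product requirements
    DESIGN_DECISION = "design"       # behavior was intended; document the tradeoff
    PRINCIPLE_GAP = "principle_gap"  # the AI's behavioral principles need refinement

@dataclass
class StressTestFinding:
    input_category: str        # e.g. "hostile", "off_topic", "edge_case"
    tester_input: str          # the probe an internal tester submitted
    observed_behavior: str     # what the coach actually did
    finding_type: FindingType  # which of the three buckets it falls into
    routed_to: str             # e.g. "engineering" for bugs, "product" otherwise
```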

Impact

  • Audience segmentation adopted as strategic framework - Six behavioral segments now inform how all future activities frame tasks and deliver feedback

  • Diagnostic insight redirected engineering investment - 54% of user struggles traced to instruction clarity, shifting resources toward the highest-impact improvements

  • Classification taxonomy embedded in production - Error taxonomy designed for direct integration into system prompts, giving engineering a behavioral framework the AI applies in real time

  • Analysis acceleration - AI-assisted synthetic data generation and evaluation workflows reduced insight delivery cycles by 50%+

  • 6 AI coaching activities shipped - All met 95%+ classification accuracy, 100% feedback alignment, and 90%+ human-AI calibration

  • 80% user satisfaction at pilot - Users rated the AI coach 4/5 or 5/5 on helpfulness

  • Scalable, repeatable methodology - Golden dataset and evaluation workflow now serve as the standard for all future AI activity development

Methods & Analysis Approaches

Unstructured feedback data synthesis | Behavioral pattern analysis | Audience segmentation | Error taxonomy design | AI-assisted classification | Synthetic data generation | Competitive landscape analysis | Golden dataset development | Blind grading calibration | Moderated user research | Manual evaluation against coaching rubrics | Secondary research synthesis