Designing a Trustworthy AI Dialogue Coach at Scale
Situation
An education technology company had been hearing the same request from students for years: more interactivity, more chances to practice. Open-ended feedback data consistently pointed to a gap between learning dialogue skills and having a safe place to actually try them. With polarization rising and constructive dialogue skills in growing demand, the stakes for closing that gap were higher than ever.
A landscape analysis confirmed what students were saying: there were limited low-stakes practice opportunities for these skills. Secondary research surfaced a critical tension: the same student population that wanted more practice also harbored significant mistrust of AI, driven by concerns about data privacy, bias, and surveillance. These insights shaped the initial product requirements for an AI coach—one that could offer real-time, repeatable practice inside the company's flagship platform serving 190,000 U.S. college students.
But AI models can produce confident, plausible responses that miss the mark entirely. Before this coach reached students, the team needed to know whether it was performing as expected and where it needed to improve.
Complication
The organization needed a system that could accurately classify student responses, deliver feedback that helped struggling students improve, detect rude or off-topic input, and calibrate difficulty so most students could realistically succeed. It had to do all of this while navigating the trust concerns the research had surfaced. That required a quality evaluation approach designed specifically for AI coaching, one that could keep pace with the product as it scaled.
What I Did
Discovery & Product Strategy
I synthesized years of open-ended student feedback data alongside a competitive landscape analysis of existing practice tools, confirming that low-stakes practice opportunities for constructive dialogue were scarce. I also led secondary research on AI attitudes in higher education, which surfaced significant AI mistrust among the target user population. These findings directly informed the initial product requirements document for the AI coach, defining what it needed to do and the trust guardrails it needed to have from day one.
Trust & Safety Testing (Pre-Launch)
Before the product reached students, I ran structured stress testing with internal staff. Testers submitted hostile, off-topic, and edge-case inputs to see how the system responded under pressure. I categorized findings for the product and engineering teams, distinguishing between bugs, design decisions, and areas where the AI's behavior needed clearer principles (e.g., how should it respond to a frustrated student versus an abusive one?).
Building the Quality Evaluation System
I designed the golden dataset methodology, which is the "ground truth" the system is measured against. This involved collecting real student responses across the full range of quality, labeling each against a coaching rubric, then using those labeled examples to generate 200+ synthetic variations per activity for systematic testing. Every response was graded blind by me and the education lead to ensure we agreed on what "good" looked like.
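The blind-grading step above hinges on the two graders actually agreeing beyond chance. A minimal sketch of how that agreement could be checked is below; it assumes each grader's rubric labels are stored as parallel lists (the label names shown are hypothetical, not the actual rubric categories), and uses Cohen's kappa as the chance-corrected agreement statistic.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters' label lists."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Raw proportion of responses where the two graders agreed.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's label marginals.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(
        (counts_a[c] / n) * (counts_b[c] / n)
        for c in set(counts_a) | set(counts_b)
    )
    if expected == 1.0:  # degenerate case: both raters used one label only
        return 1.0
    return (observed - expected) / (1 - expected)

# Hypothetical rubric labels for illustration only.
me = ["strong", "developing", "strong", "missing"]
lead = ["strong", "developing", "developing", "missing"]
print(cohens_kappa(me, lead))  # ≈ 0.64
```

A kappa well below ~0.8 would signal that the rubric definitions need tightening before the labeled examples are trusted as ground truth.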
For each of seven shipped activities, I ran 20 to 30 manual test runs against the coaching rubric and product requirements, logging every issue by theme. I evaluated whether skill detection hit 95%+ accuracy, whether feedback matched classifications 100% of the time, and whether difficulty calibration matched realistic student performance.
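The pass/fail logic for those thresholds is simple enough to sketch. Assuming each test run is recorded with its predicted and expected (golden) classification plus a flag for whether the delivered feedback matched the classification, an activity-level check might look like this; the field names and function are illustrative, not the team's actual tooling.

```python
def evaluate_activity(runs, accuracy_floor=0.95, alignment_floor=1.0):
    """Score one activity's test runs against the quality thresholds.

    runs: list of dicts, each with 'predicted' and 'expected' labels
    (golden-dataset ground truth) and a boolean 'feedback_matches'
    indicating the feedback was consistent with the classification.
    """
    n = len(runs)
    accuracy = sum(r["predicted"] == r["expected"] for r in runs) / n
    alignment = sum(r["feedback_matches"] for r in runs) / n
    return {
        "accuracy": accuracy,        # must hit 95%+ to pass
        "alignment": alignment,      # must hit 100% to pass
        "passes": accuracy >= accuracy_floor and alignment >= alignment_floor,
    }
```

Keeping the thresholds as parameters makes it easy to apply the same check to every new activity as the catalog grows.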
Learner Experience Research
I conducted moderated usability testing with college students. The question was not just "can they use it" but "are they actually learning?" I identified six distinct learner profiles, mapped a trust spectrum shaped by prior AI experience, and found that the biggest opportunity for improvement was in how tasks were framed and how feedback was delivered.
Diagnostic Analysis
A pilot study (n=123) confirmed that 80% of users rated the coach 4/5 or 5/5 on helpfulness. The data also showed that students who struggled across multiple attempts had a meaningfully different experience. I analyzed 286 AI feedback instances and built a five-category error taxonomy mapping exactly how learners get stuck when practicing dialogue skills. This taxonomy was designed to be directly usable in system prompts, giving the engineering team a classification framework the AI could use in real time.
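One way a taxonomy like this becomes "directly usable in system prompts" is to keep it as structured data and render it into prompt text, so product and engineering share one source of truth. The sketch below assumes that shape; the category names and definitions are hypothetical placeholders, not the actual five categories from the research.

```python
# Hypothetical categories for illustration; the real taxonomy's five
# categories came from analyzing 286 feedback instances and are not shown here.
ERROR_TAXONOMY = {
    "misread_prompt": "Response addresses a different question than the task asks.",
    "vague_response": "Response is on topic but too general to demonstrate the skill.",
    "off_topic": "Response is unrelated to the dialogue scenario.",
}

def taxonomy_prompt_block(taxonomy):
    """Render the taxonomy as a classification instruction for a system prompt."""
    lines = ["Classify the learner's struggle into exactly one category:"]
    for name, definition in taxonomy.items():
        lines.append(f"- {name}: {definition}")
    return "\n".join(lines)
```

Because the AI classifies into the same categories the researchers used, pilot diagnostics and live coaching behavior stay directly comparable.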
Impact
Product strategy grounded in evidence - AI coach requirements shaped by multi-source discovery research (open-ended user data, landscape analysis, and secondary research on AI trust), reducing rework risk before development began
Shipped - 7 AI coaching activities deployed on time for pilot launch
Quality threshold - Activities met 95%+ classification accuracy, 100% feedback alignment, and 90%+ human-AI calibration
User satisfaction - 80% of pilot users rated the AI coach 4/5 or 5/5
Diagnostic depth - 54% of user struggles traced to instruction clarity, which redirected engineering investment toward the highest-impact improvements
Scalable process - Golden dataset and QA workflow now serve as the repeatable standard for all future AI activity development
AI integration - AI tools used to generate synthetic test data and improve evaluation efficiency
Methods & Deliverables
Open-ended survey data synthesis | Competitive landscape analysis | Secondary research synthesis | Product requirements development | Adversarial user testing | Golden dataset development | Synthetic data generation | Blind grading calibration | Manual QA against coaching rubrics | Moderated usability testing | Learner profile development | Feedback pattern analysis | Error taxonomy design | AI-assisted evaluation