Making Progress: Metrics and Methods for Validating AI Abstraction

September 16, 2025

The Pain of Manual Validation
Anyone who has tried to validate a chart abstraction pipeline manually knows how frustrating it can be. You send an email with a sample case to a colleague, wait for their feedback, tweak the settings, run it again, and then repeat. It’s slow, inconsistent, and exhausting. Worse, it’s hard to know if you’re actually making progress.
This is the core challenge with AI chart abstraction: AI can abstract data at scale, but without a way to measure and validate its accuracy, you’re left with a black box. Clinicians and health system leaders consistently cite this as one of the biggest hurdles. As one recent study put it, “validation and evaluation of AI systems were among the most important and difficult aspects of implementation.”
What Manual Abstraction Teaches Us
If we look at human abstraction, the need for systematic validation becomes clear. In one large-scale study, researchers designed their entire chart abstraction protocol around reliability, eventually achieving 94.3% interrater agreement. That’s an impressive number, but it required careful protocol design, training, and adjudication to get there. And even with those investments, roughly 6% of values needed to be adjudicated or corrected.
In many real-world settings, the story is even tougher. A recent study reported that each manual chart review took about 30 minutes, and even then, 36% of the reviews required correction afterward. That’s a massive cost in both time and money.
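To put those numbers in perspective: at 30 minutes per chart, a hypothetical 1,000-patient project would consume roughly 500 reviewer-hours before any quality checks, and with 36% of reviews needing a second pass, the true cost climbs well beyond that.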
These numbers make two things clear:
- Humans aren’t perfectly consistent, even with careful design.
- Manual abstraction is too slow and expensive to serve as the only solution.
The lesson: if we expect AI to take on this work, we need to hold it to rigorous validation standards, which means establishing those standards, measuring results against them, and building tools that make validation faster and less painful.
Brim’s Iterative & Integrated Approach to Validation
At Brim, we’ve built a dedicated pre-production validation workflow so you can understand how your AI abstraction is performing as you develop it. The process is grounded in three principles:
1. Clarify the question. Create variables with clear semantics and definitions, so they're easier for both humans and machines to interpret. Brim's AI tools and guidelines can highlight opportunities to improve variables for stronger results.
2. Invest in ground truth. Use high-quality datasets to develop and test your system. These don't need to be large (especially in the development phase), but they should be carefully reviewed by curators for the best results.
3. Measure what matters. Focus validation on the variables that drive decisions and outcomes.
Following these principles improves the accuracy of a system and reduces the effort required to validate it.
To validate an abstraction pipeline in Brim, you simply:
- Obtain or manually create an iteration dataset of a few dozen patients whose characteristics vary across the variables you're targeting. You can upload an existing dataset to Brim, or invite colleagues to label directly in Brim’s label review function, which includes detailed logging and collaboration tools. No more passing Word docs and emails around.
- Create the variables you want to compare in Brim, ensuring that they align with the variables in your iteration dataset.
- Upload your iteration dataset (if you haven't already) and instantly compare it to Brim's automated output on every generation. Brim gives an overall agreement score, a breakdown by variable, and detail on individual values; a minimal sketch of this kind of comparison appears after this list.
- Investigate gaps in semantics or definitions, and iterate on your variables using Brim's built-in tools and feedback from validation. Continue to generate and compare, tracking how agreement improves.
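The comparison itself is easy to reason about. Here's a minimal Python sketch of the idea, not Brim's API: the data structures, patient IDs, and variable names (ER_status, stage) are purely illustrative. It computes overall and per-variable agreement between AI output and manually labeled ground truth.

```python
from collections import defaultdict

def agreement(ai_rows, truth_rows, variables):
    """Compute overall and per-variable agreement between AI output
    and manually labeled ground truth.

    ai_rows / truth_rows: dicts keyed by patient ID, each mapping
    variable name -> abstracted value (structure is illustrative).
    """
    matches, totals = defaultdict(int), defaultdict(int)
    for patient_id, truth in truth_rows.items():
        ai = ai_rows.get(patient_id, {})
        for var in variables:
            totals[var] += 1
            if ai.get(var) == truth.get(var):
                matches[var] += 1
    per_variable = {v: matches[v] / totals[v] for v in variables if totals[v]}
    overall = sum(matches.values()) / sum(totals.values())
    return overall, per_variable

# Tiny example: two patients, two variables
truth = {"p1": {"ER_status": "positive", "stage": "II"},
         "p2": {"ER_status": "negative", "stage": "III"}}
ai    = {"p1": {"ER_status": "positive", "stage": "II"},
         "p2": {"ER_status": "positive", "stage": "III"}}
overall, by_var = agreement(ai, truth, ["ER_status", "stage"])
print(overall)  # 0.75
print(by_var)   # {'ER_status': 0.5, 'stage': 1.0}
```

The per-variable breakdown is what makes iteration productive: a single overall number can hide one variable that is dragging everything down.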
Brim brings the process of validation directly into abstraction design, making it easier to design robust systems using your actual data. Iteration is built into the workflow: overwrite labels, regenerate, compare. That way, each change you make is measurable, and you know if you’re moving in the right direction.
How to Know You’re Improving (and Ready for Production)
Progress isn’t about gut feel; it’s about metrics. With Brim, you can track:
- Overall agreement between the AI output and your validation dataset, as a single top-line measure of success.
- Per-variable performance, so you see where problems remain.
To prepare for production, we recommend creating or sourcing a separate pre-production validation dataset that contains different patients and is a bit larger. Then run those patients through Brim and compare the results against that dataset.
When is “good enough” good enough? We suggest aiming for AI agreement that matches or exceeds human interrater reliability: the ~95% benchmark demonstrated in manual studies. Depending on the application and the subjectivity of the variable, you may get valuable results at 90% agreement, and for critical variables you can set higher thresholds. If performance on the pre-production validation dataset falls short of expectations, continue to iterate with your iteration dataset, adding cases if needed, until you reach your target.
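To make that readiness check concrete, here is a small hypothetical sketch of a per-variable gate: each variable gets its own threshold (stricter for critical variables, looser where some subjectivity is expected), and the pipeline only passes when every variable clears its bar. The variable names, scores, and thresholds are illustrative, not recommendations for any specific study.

```python
# Per-variable agreement from the pre-production run (illustrative numbers)
per_variable_agreement = {"ER_status": 0.97, "stage": 0.91, "surgery_date": 0.88}

# Thresholds: ~0.95 default (the human interrater benchmark),
# stricter for critical variables, looser for more subjective ones
thresholds = {"ER_status": 0.98, "stage": 0.90}
DEFAULT_THRESHOLD = 0.95

failing = {v: score for v, score in per_variable_agreement.items()
           if score < thresholds.get(v, DEFAULT_THRESHOLD)}

if failing:
    print("Keep iterating on:", failing)  # {'ER_status': 0.97, 'surgery_date': 0.88}
else:
    print("All variables meet their thresholds; ready for production.")
```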
Before deploying, you want stable agreement on your pre-production validation dataset at or above your benchmarks. Once you’re live, continue periodic re-validation to ensure stability over time (we'll share more about this in future posts).
Conclusion
AI can transform chart abstraction, but only if you can trust it. Manual abstraction studies show both the potential for high reliability and the reality of how expensive and error-prone the process can be. Without validation, you risk replicating those same inefficiencies at scale.
Brim’s Validation Tool makes the process structured, measurable, and fast. Instead of emailing colleagues and guessing whether your AI is getting better, you can know exactly where it stands, improve it systematically, and launch with confidence. Book a demo to see it in action.