AI Patients: The new standard for clinical AI evaluation
Test your models in real-world conversations that are unpredictable, layered, and human — at any scale.
Current tools fail at scale
We still don't fully understand how AI models work. Progress moves faster than the tools we use to measure it.
We used to rely on benchmarks.
They were too static, went out of date too fast, and couldn't handle multi-turn conversations.
So we turned to human evaluations.
At first, we had high-school-level chemistry teachers grading models. Nine months later, we were working with post-doc chemistry experts.
We asked them to evaluate a model across 8 parameters. Why 8? Because we didn't know any better. We just rotated through whatever metrics we thought mattered at the time.
So…
- How do we measure accuracy?
- How do we do it in conversations?
- How do we evaluate for cases we haven't even seen yet?
- And how do we make it scale?
Meet AI Patients
Not a new idea.
But until now, they've never really worked.
The old way? Someone had to write every patient script by hand — making them complex enough to be realistic, matching condition distributions, layering in medical history and EHR data.
Do it right? Slow and expensive.
Do it fast? Shallow, scripted, fake.
We found a better way.
Realistic patients.
Real human nuance.
Our AI Patients don’t just answer your model’s questions — they come with memories, emotions, and the quirks that make humans unpredictable.
They cover every case
Give us a condition, and we'll map the probability of each comorbidity that comes with it. You get AI Patients that reflect the real distribution of people living with that condition, complications and all.
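To make the idea concrete, here is a minimal sketch of sampling a cohort whose comorbidities follow an assumed distribution. The condition names, prevalence numbers, and function names are illustrative assumptions, not real clinical data and not our actual pipeline.

```python
import random

# Illustrative sketch only: condition names and prevalence values are
# assumptions, not real clinical data or the product's actual pipeline.
COMORBIDITY_PREVALENCE = {
    "type 2 diabetes": {
        "hypertension": 0.60,
        "chronic kidney disease": 0.25,
        "peripheral neuropathy": 0.20,
        "depression": 0.15,
    },
}

def sample_patient_profile(condition: str, rng: random.Random) -> dict:
    """Draw one patient: the index condition plus comorbidities sampled
    independently according to their assumed prevalence."""
    prevalence = COMORBIDITY_PREVALENCE[condition]
    comorbidities = [c for c, p in prevalence.items() if rng.random() < p]
    return {"condition": condition, "comorbidities": comorbidities}

# A 1,000-patient cohort that mirrors the assumed distribution.
rng = random.Random(7)
cohort = [sample_patient_profile("type 2 diabetes", rng) for _ in range(1000)]
```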
They behave like humans
They forget things. Misquote details. Get embarrassed. Get frustrated. Sometimes, they’re uncooperative. Just like real patients.
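As a rough illustration of how quirks like these could be layered onto a sampled profile, here is a hypothetical sketch. The trait names, probabilities, and prompt wording are all assumptions, and `profile` is the dict produced by the sketch above; none of this is the product's actual schema.

```python
import random

# Hypothetical trait list and probabilities, for illustration only.
BEHAVIOR_TRAITS = {
    "forget exact dates": 0.30,
    "misquote your medication doses": 0.20,
    "feel embarrassed about certain symptoms": 0.15,
    "get frustrated when asked to repeat yourself": 0.25,
}

def render_persona(profile: dict, rng: random.Random) -> str:
    """Turn a sampled profile plus random quirks into the instructions
    that drive one simulated patient."""
    quirks = [t for t, p in BEHAVIOR_TRAITS.items() if rng.random() < p]
    lines = [
        f"You are a patient with {profile['condition']}.",
        "Comorbidities: " + (", ".join(profile["comorbidities"]) or "none"),
        "Answer like a real person, not a textbook.",
    ]
    lines += [f"In this conversation, you sometimes {quirk}." for quirk in quirks]
    return "\n".join(lines)
```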
Now you can test your model in the wild
Spin up conversations that feel real:
1,000. 10,000. 100,000. Even a million.
All multi-turn, all messy, all with patients who behave like people.
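A minimal sketch of what one of those conversations could look like, assuming a `model_under_test` client and an `ai_patient` simulator with the methods shown; both names and interfaces are stand-ins for illustration, not a real API.

```python
# Stand-in interfaces for illustration: `model_under_test` and `ai_patient`
# are assumed to expose the methods used below; this is not a real API.
def run_conversation(model_under_test, ai_patient, max_turns: int = 12) -> list:
    """Run one messy, multi-turn exchange and return the transcript."""
    transcript = []
    patient_msg = ai_patient.opening_complaint()
    for _ in range(max_turns):
        transcript.append({"role": "patient", "text": patient_msg})
        clinician_msg = model_under_test.reply(transcript)
        transcript.append({"role": "model", "text": clinician_msg})
        if ai_patient.is_done():
            break
        patient_msg = ai_patient.reply(clinician_msg)
    return transcript

# Scaling to 1,000 or 1,000,000 runs is then a loop or a job queue over
# the sampled cohort:
# transcripts = [run_conversation(model, patient) for patient in patients]
```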