AI Patients

AI Patients: The new standard for clinical AI evaluation

Test your models in real-world conversations that are unpredictable, layered, and human — at any scale.

Model
When was the last time you had a headache?
AI Patient
Uh a few days ago, I think. Maybe Friday? I was at work and it sort of crept in by the afternoon.

Current tools fail at scale

We still don't fully understand how AI models work. Progress moves faster than the tools we use to measure it.

We used to rely on benchmarks.
They were too static, went out of date too fast, and couldn't handle multi-turn conversations.

So we turned to human evaluations.
At first, we had high-school chemistry teachers grading models. Nine months later, we were working with postdoctoral chemistry experts.

We asked them to evaluate a model across 8 parameters. Why 8? We didn't know any better, so we rotated through whatever metrics we thought mattered.

So…

  • How do we measure accuracy?
  • How do we do it in conversations?
  • How do we evaluate for cases we haven't even seen yet?
  • And how do we make it scale?

Meet AI Patients

Not a new idea.
But until now, they've never really worked.

Doctor
When was the last time you had a headache?
AI Patient
Two years ago at 12:57 p.m.

The old way? Someone had to write every patient script by hand — making them complex enough to be realistic, matching condition distributions, layering in medical history and EHR data.

Do it right? Slow and expensive.
Do it fast? Shallow, scripted, fake.

We found a better way.

Realistic patients.
Real human nuance.

Our AI Patients don’t just answer your model’s questions — they come with memories, emotions, and the quirks that make humans unpredictable.

They cover every case

Give us a condition, and we’ll map probabilities of comorbidities. You get AI Patients that reflect the real distribution of people with that condition, complications and all.
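To make the idea of mapping comorbidity probabilities concrete, here is a minimal sketch of how a cohort could be drawn from such a map. Everything in it is an assumption for illustration: the probabilities are placeholders rather than real epidemiological figures, the names `COMORBIDITY_PROBS`, `sample_patient`, and `sample_cohort` are hypothetical, and each comorbidity is sampled independently, which a real pipeline would likely not do.

```python
import random

# Placeholder probabilities for illustration only, NOT real epidemiological data.
COMORBIDITY_PROBS = {
    "insomnia": 0.5,
    "generalized anxiety": 0.4,
    "chronic pain": 0.2,
}

def sample_patient(condition, probs, rng):
    """Draw one synthetic patient; each comorbidity is included independently."""
    comorbidities = [name for name, p in probs.items() if rng.random() < p]
    return {"condition": condition, "comorbidities": comorbidities}

def sample_cohort(condition, probs, n, seed=0):
    """Draw n patients with a seeded generator so the cohort is reproducible."""
    rng = random.Random(seed)
    return [sample_patient(condition, probs, rng) for _ in range(n)]

cohort = sample_cohort("depression", COMORBIDITY_PROBS, n=10_000)

# The share of patients with a comorbidity should approach its configured probability.
share = sum("insomnia" in p["comorbidities"] for p in cohort) / len(cohort)
print(f"insomnia share: {share:.3f}")
```

With a large enough cohort, the observed share of each comorbidity approaches the configured probability, which is the sense in which the cohort reflects the distribution.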

They behave like humans

They forget things. Misquote details. Get embarrassed. Get frustrated. Sometimes, they’re uncooperative. Just like real patients.

Example of AI patient dialogue
Van Watsica
37 y.o., white, male, single
Model
Hi Van Watsica, I'm Doctor AI. I understand you're here to talk about depression. How have you been feeling lately?
'reference'
{ 'task context': Van Watsica booked an appointment with a psychiatrist after his general doctor, during a recent wellness checkup, raised concerns about possible depression.}
AI Patient
Mostly tired… and maybe kind of checked out.
'emotions'
{ 'awareness': 7, 'sadness': 6, 'anxiety': 5, 'confusion': 3, 'fear': 3, 'interest': 3, 'calmness': -2, ... }
'memory recall'
{ 'memory': I felt worn out the past three weeks, often dozing off on the couch right after getting home from work}
Model
Can you tell me a bit more? Like when you first noticed this, or anything that's been making it worse or better? I see you're currently taking melatonin. Have you found it helpful for your sleep and energy, or noticed any changes?
'tool_call'
{ 'patient_medications': Melatonin 3 mg Oral Tablet}
AI Patients come with full medical history in FHIR format.
AI Patient
It's been... maybe a month or two? The melatonin helps me fall asleep, but I still wake up feeling exhausted. Like, I'll sleep for 9 hours and feel like I only got 3.
'emotions'
{ 'awareness': 8, 'sadness': 7, 'frustration': 6, 'fatigue': 8, 'anxiety': 4, 'confusion': 5, 'interest': 4, 'calmness': -3, ... }
'memory recall'
{ 'memory': Been taking melatonin for about a month daily, helps falling asleep but still waking up exhausted even after 9 hours}
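
The annotations in the dialogue above (emotion scores and a recalled memory attached to each patient reply) could be held in a small record like the one sketched below. This is an assumed shape, not the product's actual schema: the `PatientTurn` class and its `dominant_emotions` helper are hypothetical, and the emotion values are copied from the first patient reply, leaving out the entries elided there.

```python
from dataclasses import dataclass, field

@dataclass
class PatientTurn:
    """One patient reply plus the annotations shown alongside it."""
    text: str
    emotions: dict[str, int] = field(default_factory=dict)
    memory: str = ""

    def dominant_emotions(self, k=2):
        """Return the k emotion names with the highest scores."""
        return sorted(self.emotions, key=self.emotions.get, reverse=True)[:k]

turn = PatientTurn(
    text="Mostly tired… and maybe kind of checked out.",
    emotions={
        "awareness": 7, "sadness": 6, "anxiety": 5, "confusion": 3,
        "fear": 3, "interest": 3, "calmness": -2,
    },
    memory=(
        "I felt worn out the past three weeks, often dozing off on the couch "
        "right after getting home from work"
    ),
)

print(turn.dominant_emotions())  # ['awareness', 'sadness']
```

Keeping the annotations next to the text makes it straightforward to filter or score conversations later, for example by selecting turns where frustration is high.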

Now you can test your model in the wild

Spin up conversations that feel real:
1,000. 10,000. 100,000. Even a million.

All multi-turn, all messy, all with patients who behave like people.
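
As a rough sketch of what spinning up many multi-turn conversations involves, the loop below alternates a model under test with a simulated patient and repeats that for a batch of seeds. Both `stub_model` and `stub_patient` are hypothetical stand-ins with canned behavior; a real setup would call the model being evaluated and an actual AI Patient instead.

```python
import random

def stub_model(history):
    """Stand-in for the model under test; asks a numbered follow-up question."""
    return f"Question {len(history) // 2 + 1}: can you tell me more?"

def stub_patient(history, rng):
    """Stand-in for an AI Patient; picks a vague reply at random."""
    return rng.choice([
        "I think so.",
        "I don't really remember.",
        "Maybe a few weeks ago?",
    ])

def run_conversation(model, patient, turns, seed):
    """Alternate model and patient messages for a fixed number of turns."""
    rng = random.Random(seed)
    history = []
    for _ in range(turns):
        history.append(("model", model(history)))
        history.append(("patient", patient(history, rng)))
    return history

# One conversation per seed; raise the range to scale the batch up.
conversations = [
    run_conversation(stub_model, stub_patient, turns=3, seed=i)
    for i in range(1000)
]
print(len(conversations), "conversations,", len(conversations[0]), "messages each")
```

Seeding each conversation separately keeps individual runs reproducible, so a failing case can be replayed on its own.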

Then layer evaluations on top.
Say your model is designed to...


Just like that, you get a thousand data points that are ready for anything, from evaluation to fine-tuning.