Synthetic users and digital twins: what they can't tell you about real customers

The research teams most often cited in the synthetic users pitch have stepped back from their biggest claims. Here's what that means for anyone deciding whether to research with digital twins or with real people.

Key takeaways

  • A digital twin is an LLM-generated stand-in for a real person, fed enough background that it can, in theory, answer survey questions or sit in an interview as if it were them. The pitch is speed and cost, with none of the recruitment headache.
  • A Columbia-led study trained twins on each person's 500+ prior answers and still found an average human-to-twin correlation of r = 0.20. A perfect match is 1.0. At 0.20, the twin is barely beating a coin flip at predicting what its human would actually say or do.
  • The same study catalogued five systematic distortions: insufficient individuation, stereotyping, representation bias, ideological bias, and hyper-rationality.
  • Hyper-rationality is the one to worry about. Twins are coldly logical in exactly the moments where people aren't, which is where most of the interesting behaviour lives.
  • Synthetic personas still earn their keep for desk research, stress-testing a discussion guide, and generating edge cases for real fieldwork to cover. They are not the voice of the customer.

I posted a shorter version of this argument on LinkedIn last month. It became, by some distance, the most talked-about thing I've put out. That told me two things. This is a live question in the research world right now, and it deserves more than a few hundred words. So here's the longer version.

Let me get my bias out in the open first. I'm Chief Experience Officer at Indeemo, and part of my job is leading Indeemo Catalyst, our research services arm, where we get our hands dirty running all kinds of research for global brands and agencies. So I don't just sell research technology. I sell research. That's about as much skin in the game as a person can have in this debate. Synthetic users are an awkward subject for a business like mine, because they're both an opportunity and a threat. In theory, synthetic means less research. So you could read what follows and think, "of course the research guy says synthetic personas aren't the answer." That's fair, and I'd rather you held me to a higher standard than that. I've tried hard to be balanced, and there's a genuine case for synthetic personas that I'll make as well as I can before I pick it apart.

Here's where I'm coming from. I've spent more than 20 years trying to understand people and represent how they think and behave, through personas, segments, archetypes, and most of the other formats our industry has invented. Understanding people is the part of this work I've always found most interesting. And the one lesson that has stuck, the one this new research happens to validate, is that people are gloriously unpredictable. We say one thing and do another. The most useful thing I've ever done in strategy work is to get past the surface of a survey response, past the quant, past the behavioural data about what someone did, to the why underneath it. That's the whole reason I ended up at Indeemo.

So I came to synthetic users genuinely curious, not looking for a reason to dismiss them. What the research below did was confirm something I already believed from two decades of watching people: there is no shortcut to a real person.

What are synthetic users and digital twins?

A synthetic user, often called a digital twin, is an AI model built to imitate a specific person or a target audience. You feed a large language model enough background about someone (their demographics, past survey answers, maybe a transcript or two) and it produces answers as if it were them. The idea is that you can run a study in an afternoon, for a fraction of the cost, without recruiting or scheduling a single human.

There's a whole category of tools now built on this premise, and the promise from vendors is the same: research-shaped answers, without the people, at a fraction of the cost and with none of the recruitment headache. Some pitches quote accuracy figures. Synthetic users at 85% accuracy, that kind of thing. Hold onto that number, because it has a real origin, and it doesn't mean what it sounds like. I'll come back to it.

I understand the appeal. Recruitment is slow. Fieldwork is expensive. Timelines are always shorter than you want them to be. Leaders are under real pressure to move faster and spend less, and telling a team to "just use synthetic" sounds like a clean win. So when a vendor says you can skip all of that and still get "customer insight," it's easy to say yes.

The problem is what that insight is actually worth. And over the last year, the research behind the pitch has started to say something quite different from the sales deck.

Digital twin, in a sentence: an LLM-generated stand-in for a real person, fed enough context that it can answer survey questions or sit in a research interview as if it were them.

Why are synthetic users suddenly under scrutiny?

Because the two research teams most often cited in the pitch have both stepped back from its biggest claims. The Stanford paper that gave the field its scale story has been retitled, and the Columbia group that built one of the largest digital-twin datasets has published a detailed account of how twins distort real people.

The first is from a Stanford team led by Joon Sung Park. Their 2024 preprint arrived with a title that did a lot of the marketing on its own: "Generative Agent Simulations of 1,000 People". A thousand people, simulated. It's a compelling number if you're trying to convince a board that synthetic research scales. Go to that same paper today and the title reads "LLM Agents Grounded in Self-Reports Enable General-Purpose Simulation of Individuals". The number is gone, and so is the scale pitch, replaced by a careful, hedged description of what the method actually does. The work didn't get worse. The framing got more careful.

The second is the one that should give any buyer pause. A Columbia-led team (Peng, Toubia, Netzer, and around twenty co-authors) published "Digital Twins as Funhouse Mirrors: Five Key Distortions". What makes this one hard to wave away is who it comes from. The same group built Twin-2K-500, one of the largest public datasets for building digital twins: 2,058 people, each answering 500+ questions across demographics, psychology, economics, and behavioural experiments. This is the team that gave the field some of its best training data, reporting what that data actually shows. I'll get to the number in a moment, because it deserves its own section.

The short version: the academic case for synthetic users is a lot weaker than the commercial case being built on top of it.

What does an r = 0.20 correlation actually mean?

It means the twin is barely better than a coin flip at predicting what its human would really say or do. In the Columbia work, researchers built digital twins trained on each person's 500+ prior responses, then tested them across 19 pre-registered studies and 164 different outcomes. The average correlation between twin and human was r = 0.20.

That number is worth sitting with. A perfect match between twin and human is 1.0. Total noise, no relationship at all, is 0. At 0.20, you're closer to noise than to a match. And this is the average across a large body of work, using twins that had been fed hundreds of real answers from the very person they were meant to imitate. This wasn't a thin persona knocked together from a demographic profile. It was about as well-informed as a twin gets, and it still landed at 0.20.

The one number to remember:
r = 0.20. A perfect twin-to-human match is 1.0. Pure noise is 0. Twins trained on 500+ real answers per person averaged 0.20 across 19 studies and 164 outcomes (Peng et al., 2025).
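
One way to put r = 0.20 on a more familiar scale is to ask how much of the variation in human answers the twin accounts for, and how often twin and human would land on the same side of the typical answer. The sketch below does both. It is a back-of-envelope illustration, not the method used in the Columbia study: the function names are made up for this example, and the second calculation assumes twin and human scores follow a bivariate normal distribution, which real survey data may not.

```python
import math


def shared_variance(r: float) -> float:
    """Fraction of variance in one score that a linear fit on the other explains (r squared)."""
    return r ** 2


def same_side_probability(r: float) -> float:
    """Chance that twin and human fall on the same side of their respective medians.

    ASSUMPTION: scores are bivariate normal with correlation r. Under that
    assumption the probability is 1/2 + arcsin(r) / pi (Sheppard's formula).
    """
    return 0.5 + math.asin(r) / math.pi


if __name__ == "__main__":
    for r in (0.0, 0.20, 1.0):
        print(
            f"r = {r:.2f}: shared variance = {shared_variance(r):.0%}, "
            f"same-side agreement = {same_side_probability(r):.1%}"
        )
```

Under those assumptions, r = 0.20 works out to about 4% shared variance and roughly 56% same-side agreement, against 50% for a coin flip and 100% for a perfect twin.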

Here's why that matters beyond the methodology. Think about the decisions that sit downstream of research. Pricing. Positioning. Brand. The experience a customer actually has when they use your product. If you base any of that on personas with a 0.20 correlation to real humans, you haven't saved money. You've taken on a reputational risk that only looks like efficiency. The cost just shows up later, in a launch that lands flat or a message that doesn't connect, and by then it's much harder to trace back to the shortcut that caused it.

What are the five ways digital twins distort real people?

The Columbia study didn't just report a weak correlation. It explained the shape of the failure, cataloguing five systematic distortions in how twins represent people. Systematic is the key word. These aren't random errors that average out across a big sample. They bend the results in consistent directions, which is worse, because it makes the output look coherent while it's quietly wrong.

| Distortion | What it means | Why it matters for research |
|---|---|---|
| Insufficient individuation | Twins blur together and sound more alike than the real people they represent. | You lose the outliers and edge cases, the exact people who often reveal the most. |
| Stereotyping | Twins fall back on broad-brush assumptions about a group instead of the specific person. | You get a caricature of a segment, not a customer. |
| Representation bias | Some groups are modelled more accurately than others. | Findings skew toward whoever the underlying model knows best, and away from whoever it doesn't. |
| Ideological bias | Twins lean in consistent directions on values-laden questions. | Attitudes, values, and sensitive topics come back tilted. |
| Hyper-rationality | Twins behave like coldly logical decision-makers. | Real people aren't logical in the moments that matter most, so the twin misses the behaviour you actually care about. |

Any one of these would make me cautious. Together, they describe a mirror that reflects something recognisably human-shaped but bent out of true. A funhouse mirror, as the paper's title puts it. You can see yourself in it. That's exactly what makes it dangerous.
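
A toy simulation makes the difference between random and systematic error concrete. The numbers here are invented for illustration and are not taken from the Columbia paper: a made-up "true" average score of 5.0, random noise with a standard deviation of 1.0, and a hypothetical constant tilt of 0.5. The function names are likewise assumptions of this sketch.

```python
import random
import statistics

TRUE_MEAN = 5.0   # hypothetical "real" average answer on some scale
NOISE_SD = 1.0    # hypothetical random error per response
BIAS = 0.5        # hypothetical systematic tilt, same direction every time


def noisy_sample(n: int, rng: random.Random) -> list[float]:
    """Responses with random error only: wrong individually, right on average."""
    return [TRUE_MEAN + rng.gauss(0.0, NOISE_SD) for _ in range(n)]


def biased_sample(n: int, rng: random.Random) -> list[float]:
    """Responses with the same random error plus a constant systematic shift."""
    return [TRUE_MEAN + BIAS + rng.gauss(0.0, NOISE_SD) for _ in range(n)]


def mean_error(sample: list[float]) -> float:
    """How far the sample average sits from the true average."""
    return statistics.fmean(sample) - TRUE_MEAN


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so the run is repeatable
    for n in (100, 10_000):
        print(
            f"n = {n:>6}: random-only error = {mean_error(noisy_sample(n, rng)):+.3f}, "
            f"systematic error = {mean_error(biased_sample(n, rng)):+.3f}"
        )
```

As the sample grows, the random-only error shrinks toward zero while the systematic error stays near the 0.5 tilt. More synthetic responses make a biased answer more precise, not more correct.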

Why is hyper-rationality the finding that should worry you most?

Because it fails precisely where real people are most interesting, and where the money usually is. Twins are coldly rational in exactly the situations where humans aren't. And most of the behaviour worth researching lives in that gap.

Think about the last few genuinely human decisions you made.

You kept a gym membership you haven't used in four months, because cancelling it feels like admitting something about yourself. You put a product back on the shelf because the person behind you in the queue was watching. You said yes to a work drink you didn't want, because saying no felt awkward in the moment. You told a researcher a feature sounded "really useful," because you liked them and the room was warm and it seemed like the polite thing to say.

None of that is rational. All of it is real. It's the substance of how people actually spend, choose, switch, and stay. A digital twin does none of it. Ask a twin whether it would cancel the unused membership and it will give you the sensible answer, the one that optimises cost. That's the answer no real gym chain's revenue is built on.

This is the finding that should land hardest for anyone being pitched synthetic users at some suspiciously round accuracy figure. The pitch assumes people are prediction problems. They aren't. They're contradictory, socially anxious, loss-averse, and easily swayed by the temperature of a room. Strip that out and you haven't modelled a customer. You've modelled a spreadsheet that talks.

This is also the moment to come back to that "85% accuracy" number, because it has a real origin and it's worth knowing exactly what it measured. It comes from the Stanford paper. Their agents predicted participants' answers to held-out General Social Survey questions with accuracy equal to roughly 85% of participants' own consistency when re-asked the same questions two weeks later. Read that carefully. It's accuracy at replaying survey answers, graded on a curve set by how inconsistent people are with themselves. It is not accuracy at predicting what a customer will do in a situation nobody has asked them about, which is the thing you're actually buying research for. When someone quotes you an accuracy figure for synthetic users, the question to ask is always: accurate at what, measured against what?
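
The arithmetic behind a figure like that is simple enough to sketch. The inputs below are hypothetical, chosen only because they divide to 0.85. They are not the figures reported in the Stanford paper, and the function name is an assumption of this example.

```python
def normalized_accuracy(agent_agreement: float, self_consistency: float) -> float:
    """Agent-vs-human agreement divided by the human's own retest agreement.

    agent_agreement: share of held-out answers where the agent matched the person.
    self_consistency: share of answers the person repeated when re-asked later.
    """
    if not 0.0 < self_consistency <= 1.0:
        raise ValueError("self_consistency must be in (0, 1]")
    return agent_agreement / self_consistency


if __name__ == "__main__":
    # HYPOTHETICAL inputs for illustration only.
    raw = 0.68     # agent matches the person on 68% of questions
    retest = 0.80  # person matches their own earlier answer 80% of the time
    print(f"raw agreement:       {raw:.0%}")
    print(f"normalized accuracy: {normalized_accuracy(raw, retest):.0%}")
```

With these made-up inputs, a headline of 85% sits on top of a raw match rate of 68%. The lower people's own consistency, the more flattering the normalized figure becomes for the same raw performance.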

Where do synthetic personas actually help?

They're genuinely useful for the work that happens before you talk to a real person, and for pressure-testing your own thinking. I want to be fair here, because the honest position isn't "synthetic bad, real good." It's about using each for what it's good at.

A few places I'm comfortable reaching for synthetic personas:

  • Desk research and orientation. When you're getting up to speed on an unfamiliar category or audience, a synthetic persona is a fast way to map the basics and work out the obvious questions before you invest in fieldwork.
  • Stress-testing a discussion guide. Run your draft guide past a synthetic persona and you'll quickly spot the leading questions, the jargon, and the places where the flow breaks down. It's a rehearsal, not a performance.
  • Generating edge cases. Synthetic personas are good at throwing up scenarios you might not have thought of, the awkward situations and unusual users you'll want to make sure your real fieldwork actually covers.

Notice what all three have in common. They're preparation. They make your real research sharper, cheaper, and better targeted. They speed up the work around the study. What they don't do is replace the study. The moment a synthetic answer stops being a prompt for better fieldwork and starts being treated as a finding, you've crossed the line that the r = 0.20 number is warning you about.

Doesn't grounding a synthetic persona in real research fix it?

No, and that's the part of the Columbia findings that surprised me most. The obvious objection to everything above is that a persona conjured from a demographic profile was always going to be thin, so why not do the real research first and build the twin from that? Say I want to understand millennial coffee drinkers. I go and talk to them, watch how they shop and brew and choose, then feed all of that into a synthetic persona I can keep putting new questions to. It feels safe, because now it's grounded in evidence.

That is close to the setup the Columbia team actually tested. Their twins weren't thin sketches. They were trained on each person's 500+ real prior responses, about as well-grounded as a twin can be. And they still correlated with the real humans at r = 0.20, with hyper-rationality among the five distortions. Feeding the model real data did not rescue it.

Here's the trap, in my coffee example. You can give the persona everything your real coffee drinkers told you, and it will handle questions the research already answered. But put a genuinely new scenario to it (a different pack size, a price rise, an unfamiliar store layout) and you have no good reason to trust what comes back. The sensible, cost-optimising answer it defaults to is just what the hyper-rationality finding would predict. The real research was the valuable part. The synthetic layer on top doesn't stretch it to new questions. It just smuggles the distortions back in the moment you step outside what the research covered.

So this isn't the strong case for synthetic. It's the tempting one, and it's the one the evidence warns against most directly. The uses that hold up are the humbler ones above.

What do you give up when you replace real people with twins?

You keep the speed and lose the thing you were paying for. Synthetic research is faster and cheaper on the surface, and that's real. What it quietly gives up is the unpredictability, the context, and the confidence to bet a decision on the result. Here's the trade laid out plainly.

|  | Synthetic users / digital twins | Real in-the-moment research |
|---|---|---|
| Cost and speed | Very low cost, near-instant output | Higher cost, though AI has cut turnaround from weeks to days |
| Recruitment | None needed | Real, though a global panel removes most of the friction |
| What you learn | A plausible average, smoothed toward the expected answer | What a specific person actually did, in their own context |
| Irrational behaviour | Missed. Twins optimise where people improvise | Captured. The hesitation, the backtrack, the contradiction |
| Confidence for decisions | Low. Around r = 0.20 against real humans | High. It's the behaviour itself, not a guess at it |

The pattern in that table is the whole argument in miniature. Synthetic wins on the inputs, cost and time. Real wins on the output, which is the only thing a decision actually needs. If the question in front of you is low-stakes and you just need to get oriented, the trade can be worth it. If a pricing call, a positioning bet, or a product launch rests on the answer, it isn't close.

Why can't synthetic users be the voice of the customer?

Because the voice of the customer is a real person doing a real thing in a real context, and a twin is a plausible guess about that person built by a model. The gap between the two is the whole game.

Here's the line I keep coming back to. The gap between data richness and customer understanding doesn't close because the data got cheaper to generate. You can produce an enormous volume of synthetic responses for almost nothing. It will look like insight. It will fill a report. But volume was never the constraint. Understanding was. And understanding comes from watching a real person hesitate, backtrack, contradict themselves, and do the thing they told you they'd never do. It's the depth versus scale problem in a new outfit: more responses, not more understanding.

That's the case for getting closer to actual behaviour instead of modelling it from a distance. When you watch someone navigate their morning routine, screen-record themselves abandoning a checkout, or talk through why they switched brands, you see the messy, in-the-moment reality that a twin smooths away. Mobile ethnography exists to capture exactly that: real people sharing videos, photos, screen recordings, and texts from their own lives, before the moment is forgotten and post-rationalised. Routine studies do something similar for behaviour as it happens, in real life, rather than reconstructed after the fact.

Neither of those is a twin's best guess. They're the thing itself.

After 20 years of this, that's the conviction I keep landing back on, and it's what the new research validates. There is no replacement for talking to real people and getting into the moments where these experiences actually live, in their kitchens, their cars, their commutes, the ordinary corners of their day where the real decisions get made.

Examples of uploads from the Indeemo dashboard.

How should teams combine synthetic and real research?

Use synthetic personas to prepare and real people to decide. That's the whole rule. Anything upstream of a decision (orientation, guide design, edge-case hunting) is fair game for synthetic work. Anything a decision actually rests on should come from real humans.

In practice that looks like a sequence, not a substitution. Start with desk research and a synthetic persona or two to get oriented and rough out your questions. Use them to stress-test your discussion guide until it's tight. Then go and do the real study, with real participants, and let the synthetic work you did earlier make that fieldwork faster and more focused. The synthetic stage earns its place by improving the human stage, not by standing in for it.

This is also where the speed argument gets turned on its head. The reason teams reach for synthetic is pressure to move faster and spend less. But real in-the-moment research has closed that gap more than most people realise. With Indeemo you can recruit from a global panel, run studies in 30+ languages, and use AI to transcribe, translate, and analyse in minutes rather than weeks, then turn the highlights into subtitled reels your stakeholders will actually watch. You get the speed the synthetic pitch promises, without trading away the real behaviour that makes research worth doing. The point of the AI here is to accelerate the work around real people, not to replace them.

And if you have the research ambition but not the capacity, that's what Indeemo Catalyst is for. Whether you want to run a study yourself or have us handle design, recruitment, moderation, and analysis end to end, the goal is the same: to help you get closer to your customers, understand them more deeply, and move faster on the decisions that matter. Book a demo and we'll talk through what that could look like.

Frequently asked questions

Are synthetic users accurate?

Not as accurate as the pitch suggests. A large pre-registered study from a Columbia-led team found an average human-to-twin correlation of r = 0.20 across 19 studies and 164 outcomes, even when twins were trained on 500+ real answers per person. A perfect match is 1.0, so 0.20 is closer to noise than to accuracy. Treat any headline "accuracy" figure with real caution and ask what it was measured against.

What's the difference between a synthetic user and a digital twin?

The terms are mostly used interchangeably. Both describe an AI model, usually built on a large language model, designed to answer as if it were a specific person or a target audience. "Digital twin" tends to imply a model of one named individual, while "synthetic user" or "synthetic persona" is often used more loosely for a modelled segment. The limitations discussed here apply to both.

Can AI replace user research?

No, but it can make user research faster and better. AI is genuinely useful for desk research, for stress-testing a discussion guide, and for handling the heavy lifting of transcription, translation, and analysis once you have real data. What it can't do is stand in for the real person whose behaviour your decisions depend on. Use AI to accelerate the work around real research, not to skip the research itself.

When are synthetic personas worth using?

When you're preparing, not deciding. They're good for orienting yourself in an unfamiliar category, sketching early questions, testing whether your discussion guide holds up, and generating edge cases you'll want your real fieldwork to cover. The line to hold is simple: the moment a synthetic answer gets treated as a finding rather than a prompt for better fieldwork, you've gone too far.

Why does hyper-rationality matter so much in this debate?

Because it's the distortion that hits the behaviour you most want to understand. Digital twins act like coldly logical decision-makers, but real people are loss-averse, socially anxious, and easily swayed in the moment. Most of the decisions that drive pricing, retention, and brand happen in exactly those irrational moments, and that's precisely where twins are least like the people they're meant to represent.