Here's an uncomfortable truth about retargeting: a significant share of the people in your retargeting audiences would have converted without seeing your ads.
That's not a criticism of retargeting as a channel; it can genuinely move the needle. The problem is that most measurement systems can't tell the difference between ads that caused a purchase and ads that were shown to people who were already going to purchase. Attribution models, which assign credit based on tracked touchpoints, are fundamentally unable to make that distinction. Only an experiment can.
That experiment is called an incrementality test.
Figure: How a holdout test measures true incremental impact
What incrementality actually measures
Incrementality is the counterfactual answer to the question: what would have happened if this person had never seen this ad?
In a perfect world, you'd run two identical copies of your business simultaneously, one with a given campaign active, one without, and compare outcomes. You'd know exactly how much the campaign contributed. In the real world, you can't do this, but you can get close by randomly splitting your audience and withholding ads from one group.
The group that sees ads is the "exposed" group. The group that doesn't is the "holdout" or "control" group. The difference in conversion rates between the two groups, after adjusting for any baseline differences, is your incremental lift. Convert that to revenue and divide by your ad spend, and you have your incremental ROAS (iROAS). That's the real number. Not the attributed ROAS.
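The arithmetic in the paragraph above can be sketched in a few lines of Python. This is a minimal illustration rather than a full analysis: the group sizes, conversion counts, revenue per conversion, and ad spend are made-up numbers, the names `GroupResult` and `incremental_roas` are hypothetical, and the sketch assumes randomization has already removed baseline differences between the two groups.

```python
from dataclasses import dataclass


@dataclass
class GroupResult:
    """Users and conversions observed in one arm of the test."""
    users: int
    conversions: int

    @property
    def conversion_rate(self) -> float:
        return self.conversions / self.users


def incremental_roas(exposed: GroupResult, control: GroupResult,
                     revenue_per_conversion: float, ad_spend: float) -> dict:
    """Turn a difference in conversion rates into incremental revenue and iROAS."""
    # Absolute lift: extra conversions per user that the ads appear to cause.
    absolute_lift = exposed.conversion_rate - control.conversion_rate
    # Scale the lift to the exposed group, since only those users saw ads.
    incremental_conversions = absolute_lift * exposed.users
    incremental_revenue = incremental_conversions * revenue_per_conversion
    return {
        "absolute_lift": absolute_lift,
        "relative_lift": absolute_lift / control.conversion_rate,
        "incremental_conversions": incremental_conversions,
        "iroas": incremental_revenue / ad_spend,
    }


# Illustrative numbers only (assumptions, not real results).
exposed = GroupResult(users=100_000, conversions=2_000)
control = GroupResult(users=25_000, conversions=425)
result = incremental_roas(exposed, control,
                          revenue_per_conversion=50.0, ad_spend=10_000.0)

# Attributed ROAS credits every exposed-group conversion to the ads.
attributed_roas = exposed.conversions * 50.0 / 10_000.0

print(f"attributed ROAS: {attributed_roas:.2f}")
print(f"incremental ROAS: {result['iroas']:.2f}")
```

With these illustrative inputs, attributed ROAS counts all 2,000 exposed-group conversions, while iROAS counts only the conversions above the control group's baseline rate.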
Types of incrementality tests
There are several ways to run an incrementality test, each with different tradeoffs.
User-level holdout tests
The simplest approach: randomly assign users to exposed or control groups and withhold ads from the control group. Most major ad platforms offer some version of this natively: Meta's "Conversion Lift" studies, Google's "Brand Lift" tests, and TikTok's similar tools all use this approach.
User-level holdouts are fast and relatively easy to set up. The limitation is that the test is platform-controlled: you're trusting the platform to correctly randomize and withhold ads, and the platform has an obvious financial interest in showing that its ads are working. The methodology is usually sound, but the fact that you're running a test on infrastructure owned by the entity being tested is a legitimate concern.
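Where you control audience membership yourself (for example, through customer lists you upload), one way to reduce reliance on the platform's randomization is to do the split on your own side. The sketch below is one possible approach, assuming a stable user ID is available. The function name `assign_group`, the test name, and the 10% holdout share are illustrative assumptions, not something any platform prescribes.

```python
import hashlib


def assign_group(user_id: str, test_name: str, holdout_share: float = 0.10) -> str:
    """Deterministically assign a user to 'holdout' or 'exposed'.

    Hashing the test name together with the user ID gives each test its own
    independent split, and the same user always lands in the same group.
    """
    digest = hashlib.sha256(f"{test_name}:{user_id}".encode("utf-8")).hexdigest()
    # Map the first 32 bits of the hash to a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x1_0000_0000
    return "holdout" if bucket < holdout_share else "exposed"


# Hypothetical user IDs, for illustration only.
groups = [assign_group(f"user-{i}", "retargeting-q3") for i in range(100_000)]
holdout_fraction = groups.count("holdout") / len(groups)
print(f"holdout share: {holdout_fraction:.3f}")
```

Because the assignment is a pure function of the ID and the test name, it can be recomputed at analysis time without storing a lookup table, which makes the split easy to audit.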
Geo-holdout experiments
A more rigorous approach is to split geography rather than individual users. Geo experiments assign certain geographic markets (cities, regions, or DMAs) to a treatment group that runs advertising as normal, and a control group that goes dark for the test period. You compare outcomes between the two groups.
Geo experiments solve the platform-trust problem because you're measuring outcomes in your own data (sales, revenue, app installs from your own systems) rather than in the platform's attribution. They're also better at capturing cross-channel effects: if you pause Facebook ads in your control markets, you can see whether that affects Google-attributed conversions too.
The downsides are complexity and cost. You need a statistically meaningful number of geographic markets, a clean way to assign them, and enough time for the test to reach significance. Underpowered tests, ones that don't have enough volume to detect a real effect, are a common and serious mistake.
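One simple way to compare treatment and control markets while adjusting for baseline differences is a difference-in-differences calculation. This is a deliberately simplified stand-in for the more elaborate models that dedicated geo-testing tools use, and it assumes both groups of markets would have followed the same trend had the test not run. The market figures and the function name `diff_in_diff` are invented for illustration.

```python
from statistics import mean


def diff_in_diff(treatment_pre, treatment_test, control_pre, control_test):
    """Difference-in-differences on average weekly sales per market.

    Each argument is a list with one value per market: average weekly sales
    in the pre-test period or during the test period.
    """
    treatment_change = mean(treatment_test) - mean(treatment_pre)
    control_change = mean(control_test) - mean(control_pre)
    # Whatever changed in treatment beyond the change in control is
    # attributed to advertising staying on.
    return treatment_change - control_change


# Illustrative data: ads keep running in treatment markets and go dark in
# control markets during the test period.
treatment_pre = [100, 120, 80]
treatment_test = [104, 125, 83]
control_pre = [90, 110, 100]
control_test = [81, 99, 90]

effect = diff_in_diff(treatment_pre, treatment_test, control_pre, control_test)
print(f"estimated weekly sales per market attributable to ads: {effect}")
```

A point estimate like this says nothing about significance on its own; with only a handful of markets the noise between them can easily exceed the effect, which is why the number of markets and the test length matter so much.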
Ghost ads (PSA testing)
A third approach involves serving users in the control group a "ghost ad", usually a public service announcement or an irrelevant placeholder ad, so that both groups see the same number of impressions. This controls for the "ad experience" and isolates the specific creative effect, rather than just the presence vs. absence of advertising.
Ghost ads are the cleanest method in theory but the most logistically complex. They're used primarily by researchers and large brands with sophisticated measurement teams.
Designing a test that actually means something
The most common mistake in incrementality testing isn't methodological; it's underpowering. A test that can't detect a 10% lift at statistical significance isn't a test; it's a waste of time and money. Before running a test, you need to calculate your minimum detectable effect (MDE) and ensure your expected volume is large enough to reach it within your test window.
The second most common mistake is cherry-picking windows. Selecting the time window for your analysis after seeing the data ("let's use the 10 days where the lift was highest") invalidates the test. Pre-register your test window before you start.
Third: contamination. In geo tests, the geographic boundaries between your treatment and control markets need to be clean. If you're running YouTube ads in Boston (treated) and people in New Hampshire (control) watch those same YouTube ads because they commute to Boston, you have contamination: your control group is partially exposed, and you'll underestimate your lift.
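A rough power calculation for a user-level test can be sketched with the standard normal-approximation formula for comparing two proportions. The version below is a minimal sketch under stated assumptions: equal-sized groups, a two-sided test, and a conversion-rate outcome. The function name `users_per_group`, the 2% baseline rate, and the defaults of 5% significance and 80% power are illustrative choices, not recommendations from this post.

```python
import math
from statistics import NormalDist


def users_per_group(baseline_rate: float, relative_lift: float,
                    alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate users needed in EACH group to detect a relative lift.

    Uses the normal approximation for a two-sided test of two proportions.
    """
    p_control = baseline_rate
    p_exposed = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # significance threshold
    z_power = NormalDist().inv_cdf(power)          # desired power
    variance = p_control * (1 - p_control) + p_exposed * (1 - p_exposed)
    n = (z_alpha + z_power) ** 2 * variance / (p_exposed - p_control) ** 2
    return math.ceil(n)


# Illustrative: 2% baseline conversion rate, 10% relative lift as the MDE.
needed = users_per_group(baseline_rate=0.02, relative_lift=0.10)
print(f"users needed per group: {needed:,}")
```

Under these assumptions the answer comes out at roughly 80,000 users per group. Halving the detectable lift roughly quadruples the requirement, which is how tests end up underpowered when the volume check is skipped.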
A clean incrementality test needs:
- Pre-registration of hypothesis, test window, and primary metric
- Statistical power calculation before launch
- Clean separation between treatment and control
- Measurement in your own data, not the platform's
What you'll actually find
Many brands run their first incrementality test expecting confirmation that their ads are working. Some of them get it. Many are surprised.
The retargeting example at the top of this post isn't hypothetical. Multiple major brands have run holdout tests on their retargeting audiences and found that the organic conversion rate (the rate in the control group) was 80-90% of the rate in the exposed group. In other words, only 10-20% of the conversions attributed to retargeting were actually caused by the ads; the rest would have happened anyway. For a deeper look at what the data typically shows, see "Are you wasting money on retargeting?"
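The step from "the control rate is 80-90% of the exposed rate" to "10-20% of conversions are incremental" is simple arithmetic, and the same factor deflates attributed ROAS. The sketch below assumes that every exposed-group conversion is credited to retargeting and that incremental and organic conversions have the same average value. The function name and the attributed ROAS of 10 are illustrative assumptions.

```python
def iroas_from_attributed(attributed_roas: float,
                          control_to_exposed_ratio: float) -> float:
    """Deflate attributed ROAS by the share of conversions that are incremental.

    control_to_exposed_ratio is the control conversion rate divided by the
    exposed conversion rate (0.8 to 0.9 in the range described above).
    """
    incremental_share = 1 - control_to_exposed_ratio
    return attributed_roas * incremental_share


# Illustrative: a campaign reporting an attributed ROAS of 10.
for ratio in (0.80, 0.85, 0.90):
    iroas = iroas_from_attributed(10.0, ratio)
    print(f"control/exposed ratio {ratio:.2f} -> iROAS {iroas:.2f}")
```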
This doesn't mean retargeting is always inefficient. It means that if you've never tested it, you don't know. And your attribution model is not going to tell you, because it can't distinguish between caused and correlated conversions.
How incrementality fits into triangulation
Incrementality testing is the calibration layer in the marketing triangulation framework. You use it to check whether what your attribution says matches what a real experiment finds. You use it to validate whether the channel efficiency estimates from your marketing mix model (MMM) reflect reality.
You won't run incrementality tests on every channel every week; they're too expensive and too slow for that. But running 3-5 well-designed tests per year, on your highest-spend channels, gives you ground truth data points that are worth more than months of platform reporting.
When your incrementality results and your attribution results agree, you can make budget decisions with confidence. When they diverge (attribution says something is working, but an experiment shows modest or no lift), that's not a problem. That's the most valuable finding in marketing measurement. It tells you exactly where your budget assumptions need to change.