How to Design a Holdout Test for Paid Social (Step-by-Step)

A holdout test, also called an audience split test or conversion lift test, is the simplest form of incrementality measurement. You divide your target audience in two: 80–90% see your ads (the test group), and 10–20% are held out and shown nothing (the control group). You compare conversion rates between the two groups. The difference is your incremental lift.

It sounds straightforward. The design details are where it either works or produces numbers you can't trust.

How a holdout test measures true incremental impact

What platforms support holdout tests natively

Three major platforms have built-in holdout testing tools:

Meta Ads: Conversion Lift, found in the Experiments section of Ads Manager. Meta randomizes users into test and control before the campaign starts, then withholds ad delivery to the control group. The randomization is at the person level, which is cleaner than most DIY methods.

Google Ads: Conversion Lift in Campaign Manager 360. Works similarly to Meta's version, though you need to be running on Campaign Manager to access it.

TikTok: A holdout option within Reach and Frequency campaigns. Less mature than Meta's implementation, but available.

The advantage of platform-native holdout tests: the randomization and ad suppression are handled reliably, at a scale and with signal quality you can't replicate by manually splitting audiences. The disadvantage: the platform is designing and reporting its own test. Review the methodology carefully before accepting the results at face value. Specifically: is the control group seeing zero ads, or are they seeing PSAs? What conversion window is being used? How are cross-device conversions handled?

Designing a holdout test you control

If you want a test that doesn't depend on the platform's methodology, you can build one yourself, but it requires clean first-party data infrastructure.

The basic approach for a retargeting audience:

Export your retargeting audience from your CRM or pixel data as a list of identifiers (email addresses, phone numbers, or customer IDs)
Randomly assign each person to test (80%) or control (20%) before uploading to the platform
Upload only the test group as your ad targeting
Track conversions for both groups in your backend by matching order records to your identifier list

This gives you an outcome measure that doesn't come from the platform's attribution system, which is exactly what you want. The limitation is that it requires an identifier that persists across ad exposure and conversion (email match or logged-in sessions work; cookie-based matching is increasingly unreliable).
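The random assignment step can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the function name `split_audience`, the fixed seed, and the example email addresses are all assumptions made for the example. The fixed seed simply makes the split reproducible, so you can regenerate the same groups later when matching orders.

```python
import random

def split_audience(identifiers, holdout_share=0.20, seed=42):
    """Randomly assign each identifier to the test or control group.

    De-duplicating and sorting first means the same input list always
    produces the same split for a given seed.
    """
    unique_ids = sorted(set(identifiers))
    rng = random.Random(seed)          # local RNG, does not touch global state
    rng.shuffle(unique_ids)
    n_control = round(len(unique_ids) * holdout_share)
    control = set(unique_ids[:n_control])
    test = set(unique_ids[n_control:])
    return test, control

# Hypothetical audience of 1,000 email identifiers
audience = [f"user{i}@example.com" for i in range(1000)]
test_group, control_group = split_audience(audience)
print(len(test_group), len(control_group))
```

Save both sets before uploading the test group. The control list is what lets you measure the holdout's conversions later.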

Setting the holdout size

Ten to twenty percent is standard for most campaigns. The right size depends on your conversion volume.

If your holdout group is too small (under 5%), you won't collect enough conversions in the control group to detect a statistically meaningful difference. You might run a four-week test and end up with 12 control group conversions. That's not enough data to conclude anything.

If your holdout is too large (over 30%), you're forgoing a significant portion of your potential conversions during the test period. That's an opportunity cost that has to be weighed against the value of the measurement.

A useful heuristic: aim for at least 50–100 conversions in your control group by the end of the test. Work backward from that to set your holdout percentage and test duration.
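Working backward from the conversion target is simple arithmetic. The sketch below assumes you already have an estimate of the baseline conversion rate over the full test period for people who see no ads. The function name and the example figures (50,000 people, 2% baseline) are hypothetical.

```python
def min_holdout_share(audience_size, baseline_conv_rate, target_control_conversions=100):
    """Smallest holdout share expected to produce the target number of
    control-group conversions over the test period.

    baseline_conv_rate: expected conversion rate of un-advertised users
    across the whole test window (an estimate you supply).
    """
    users_needed = target_control_conversions / baseline_conv_rate
    return users_needed / audience_size

# Hypothetical inputs: 50,000-person audience, 2% baseline rate over four weeks
share = min_holdout_share(audience_size=50_000, baseline_conv_rate=0.02)
print(f"Holdout share needed: {share:.0%}")
```

If the share this returns is above roughly 30%, the audience is too small for the chosen duration, and lengthening the test is usually the cheaper fix.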

What to measure

Do not use platform-attributed conversions as your primary metric.

Here's why: if you use the conversions the platform attributes to the test group as your numerator, you're measuring attributed ROAS, not incrementality. The platform will attribute some organic conversions, people who would have bought regardless, to the ads. That's the very inflation you're trying to measure.

Instead, use your backend order data. Pull total orders during the test period from Shopify, your database, or your analytics warehouse, then split them by test vs. control group using the identifier list you created when you divided the audience. This gives you actual conversion rates that aren't inflated by the platform's attribution logic.
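As a rough sketch of that split, assuming orders have already been filtered to the test period and each order carries the same identifier used for the audience split. The `customer_id` field and the sample records are hypothetical, so adapt them to your own schema.

```python
def group_conversion_rates(orders, test_ids, control_ids):
    """Count distinct converters per group from backend order records.

    A person with several orders counts once, and orders from people
    outside the test audience are ignored.
    """
    buyers = {order["customer_id"] for order in orders}
    results = {}
    for name, ids in (("test", test_ids), ("control", control_ids)):
        converters = len(buyers & ids)
        results[name] = {
            "users": len(ids),
            "converters": converters,
            "rate": converters / len(ids),
        }
    return results

# Hypothetical data: identifiers from the split, orders from the backend
test_ids = {"c1", "c2", "c3", "c4"}
control_ids = {"c5", "c6"}
orders = [
    {"order_id": 1, "customer_id": "c1"},
    {"order_id": 2, "customer_id": "c1"},  # repeat buyer, counted once
    {"order_id": 3, "customer_id": "c5"},
    {"order_id": 4, "customer_id": "c9"},  # not part of the test audience
]
rates = group_conversion_rates(orders, test_ids, control_ids)
print(rates)
```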

Test duration

Two weeks is the minimum. Four weeks is better. Here's why duration matters:

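Purchase intent varies significantly by day of week and from one week to the next. Both groups live through the same calendar, so this doesn't bias the comparison by itself, but a two-week test that happens to include two unusually strong (or weak) weekends can produce a lift estimate that isn't representative of a typical period. Running whole weeks, ideally four, smooths out that variation.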

You also need time to accumulate enough conversions in the control group. If your baseline conversion rate is low and your holdout group is small, it can take three to four weeks to collect enough data for a statistically meaningful result.

Don't look at interim results and make decisions. Set your test duration at the start, commit to it, and analyze the data after the test ends. Early peeking and early stopping inflate false positive rates.
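To see why peeking is a problem, here is a small A/A simulation: both groups have the same true conversion rate, so any "significant" result is a false positive. It compares checking for significance every week against checking only at the planned end date. All parameters (5% rate, 500 users per group per week, 400 simulated tests) are illustrative assumptions, and the exact rates will vary with the seed.

```python
import math
import random

def z_stat(conv_a, n_a, conv_b, n_b):
    """Pooled two-proportion z statistic (0.0 if there is no variance yet)."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return 0.0 if se == 0 else (conv_a / n_a - conv_b / n_b) / se

def simulate(n_sims=400, weeks=4, users_per_week=500, rate=0.05, seed=1):
    """Return (false positive rate with weekly peeking, rate with one final look)."""
    rng = random.Random(seed)
    peek_hits = final_hits = 0
    for _ in range(n_sims):
        conv_a = conv_b = n = 0
        significant_at_any_look = False
        for _week in range(weeks):
            conv_a += sum(rng.random() < rate for _ in range(users_per_week))
            conv_b += sum(rng.random() < rate for _ in range(users_per_week))
            n += users_per_week
            significant = abs(z_stat(conv_a, n, conv_b, n)) > 1.96
            significant_at_any_look = significant_at_any_look or significant
        peek_hits += significant_at_any_look   # stopped at the first "win"
        final_hits += significant              # looked only at the end date
    return peek_hits / n_sims, final_hits / n_sims

peek_rate, final_rate = simulate()
print(f"False positives with weekly peeking: {peek_rate:.1%}")
print(f"False positives with a single final look: {final_rate:.1%}")
```

The single-look rate should land near the nominal 5%, while the peeking rate can only be equal or higher, because every extra look is another chance to be fooled by noise.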

Calculating statistical significance

Once the test ends, calculate the incremental conversion rate and test whether it's statistically significant.

The basic calculation:

Test group conversion rate: conversions in test group / test group size
Control group conversion rate: conversions in control group / control group size
Incremental lift: test rate minus control rate
p-value: run a two-proportion z-test or chi-square test to determine whether the difference is statistically significant
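The calculation above can be done with nothing but the Python standard library. This sketch uses the pooled two-proportion z-test with a two-sided p-value. The group sizes (40,000 test, 10,000 control) and conversion counts are hypothetical numbers chosen to give 4.2% and 3.5% conversion rates.

```python
import math

def two_proportion_z_test(conv_test, n_test, conv_control, n_control):
    """Return (absolute lift, z statistic, two-sided p-value)."""
    p_test = conv_test / n_test
    p_control = conv_control / n_control
    # Pooled rate under the null hypothesis that both groups convert equally
    pooled = (conv_test + conv_control) / (n_test + n_control)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_test + 1 / n_control))
    z = (p_test - p_control) / se
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_test - p_control, z, p_value

# Hypothetical results: 1,680 of 40,000 test users and 350 of 10,000 control users converted
lift, z, p_value = two_proportion_z_test(1680, 40_000, 350, 10_000)
print(f"lift = {lift:.4f}, z = {z:.2f}, p = {p_value:.4f}")
```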

A result with p greater than 0.05 doesn't prove the ads aren't working. It means the test couldn't distinguish the observed difference from random noise: either the true effect is small or zero, or the test lacked the power to detect it. If that happens, look at whether you had enough conversions in the control group. If not, extend the test or increase the holdout percentage.

A confidence interval on your lift estimate is more useful than a binary significant/not significant conclusion. "Incremental lift is 18%, with a 95% confidence interval of 8%–28%" tells you far more than "p = 0.03."
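One common way to get that interval is the normal-approximation (Wald) interval for a difference in proportions, sketched below. Note that it returns the interval for the absolute lift in conversion rate, not the relative lift quoted in the example sentence above. The input counts are hypothetical.

```python
import math

def lift_confidence_interval(conv_test, n_test, conv_control, n_control, z_crit=1.96):
    """Approximate 95% confidence interval for the absolute lift
    (test rate minus control rate), using unpooled standard errors."""
    p_test = conv_test / n_test
    p_control = conv_control / n_control
    se = math.sqrt(
        p_test * (1 - p_test) / n_test
        + p_control * (1 - p_control) / n_control
    )
    diff = p_test - p_control
    return diff - z_crit * se, diff + z_crit * se

# Hypothetical results: 4.2% of 40,000 test users vs. 3.5% of 10,000 control users
low, high = lift_confidence_interval(1680, 40_000, 350, 10_000)
print(f"Lift is between {low * 100:.2f} and {high * 100:.2f} percentage points")
```

If the interval excludes zero, the result is significant at the 5% level. Its width tells you how precisely the lift has been measured.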

Interpreting the results

If your test group converts at 4.2% and your control group converts at 3.5%, your incremental lift is 0.7 percentage points (a 20% relative lift over the control rate). That means roughly 17% of test group conversions (0.7 / 4.2) are incremental: the ads caused those. The other 83% were organic, from buyers who would have converted regardless.

Calculate iROAS from there:

Incremental conversion rate (0.7%) × test group size × average order value = incremental revenue
iROAS = incremental revenue / ad spend
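Here are the same two formulas as a small Python function. The conversion rates come from the example above. The test group size, average order value, and ad spend are made-up figures for illustration only.

```python
def iroas(test_rate, control_rate, test_group_size, average_order_value, ad_spend):
    """Incremental return on ad spend from holdout test results."""
    incremental_conversions = (test_rate - control_rate) * test_group_size
    incremental_revenue = incremental_conversions * average_order_value
    return incremental_revenue / ad_spend

# Hypothetical inputs: 40,000 test users, $80 average order value, $20,000 ad spend
result = iroas(
    test_rate=0.042,
    control_rate=0.035,
    test_group_size=40_000,
    average_order_value=80.0,
    ad_spend=20_000.0,
)
print(f"iROAS = {result:.2f}x")
```

With these inputs, the 0.7-point lift works out to about 280 incremental orders, $22,400 in incremental revenue, and an iROAS of roughly 1.12x.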

If iROAS is above your profit threshold, maintain or scale the campaign. If it's below, and especially if it's below 1x, you're spending money to take credit for conversions that would have happened without the ads. That's a clear signal to reduce spend in that campaign type.

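If your incremental lift is negative (the control group converts at a higher rate than the test group), first check whether the confidence interval includes zero, since a small negative number is often just noise, and confirm that the audience split and identifier matching were clean. A persistent negative result is usually a sign of ad fatigue or audience saturation. It can happen with retargeting audiences that have been seeing the same creative for months.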

What to do with the results

The test result is only valuable if you act on it and document it.

If incrementality is low: reduce retargeting budget and reallocate to prospecting, where organic intent is lower and incremental lift is typically higher.

If incrementality is high: consider scaling carefully, and plan to re-test in three to six months, because incrementality rates change as audiences saturate.

Either way, document the iROAS from this test and use it as a calibration factor for your attribution data. If attribution shows this campaign at 5x ROAS but your holdout test showed 17% incrementality, your true incremental ROAS is approximately 0.85x. That's the number to use for budget decisions, not the attributed figure.
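The calibration itself is a single multiplication. The sketch below shows it as a reusable helper, with the function name being an assumption for this example. It also assumes the incrementality share measured in the holdout test still applies to the period whose attributed ROAS you are adjusting.

```python
def calibrated_roas(attributed_roas, incrementality_share):
    """Scale platform-attributed ROAS by the share of conversions
    that the holdout test showed to be incremental."""
    return attributed_roas * incrementality_share

# Figures from the example above: 5x attributed ROAS, 17% incrementality
estimate = calibrated_roas(attributed_roas=5.0, incrementality_share=0.17)
print(f"Calibrated incremental ROAS: {estimate:.2f}x")
```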