Incrementality Testing: The Complete Guide

Here is a version of a story that plays out at many e-commerce brands: someone pauses the retargeting campaign for a weekend (the budget ran out, or it was an oversight), and sales do not drop. Maybe they go up slightly. When the campaign turns back on, attributed conversions climb back. Platform reporting looks exactly the same as before. The question that follows is uncomfortable: were those conversions real? Were we paying to reach people who would have bought anyway?

That is the incrementality question. And for most brands, it is worth asking directly rather than assuming the answer.

[Figure: Incrementality testing follows a four-step process from hypothesis to decision]

What Incrementality Means

Incrementality measures the difference between what happened and what would have happened without the advertising. The counterfactual: if these specific people had not seen this specific ad, how many of them would have converted anyway?

The incremental conversions are the ones your advertising actually caused. Everything else is organic demand: purchase intent that existed before your ad reached the customer and that would have led to a purchase regardless. You are not creating that demand by running the ad; you are simply paying to be present at a moment that would have ended in a purchase anyway.

The metric that captures this is iROAS: incremental return on ad spend. Where standard ROAS includes all revenue from conversions attributed to a channel in the numerator, iROAS counts only the incremental revenue.

iROAS = Incremental Revenue ÷ Ad Spend

Incremental Revenue = Revenue in exposed group − Revenue in holdout group (adjusted for group size)

If your holdout group converts at 70% of the rate of your exposed group, then roughly 70% of the conversions in the exposed group are organic: they would have happened without the ad. If the platform attributes all of those conversions to the campaign, your iROAS is roughly 30% of your reported ROAS. Same campaign, very different picture.
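The two formulas above can be sketched in a few lines of Python. All figures are hypothetical and chosen so the holdout earns 70% of the exposed group's revenue per user. The function names are illustrative, and the sketch assumes the platform attributes all exposed-group revenue to the campaign.

```python
def incremental_revenue(exposed_revenue, exposed_users, holdout_revenue, holdout_users):
    """Exposed-group revenue minus holdout revenue scaled up to the exposed group's size."""
    scale = exposed_users / holdout_users
    return exposed_revenue - holdout_revenue * scale


def iroas(incremental_rev, ad_spend):
    """Incremental return on ad spend."""
    return incremental_rev / ad_spend


# Hypothetical test: 90,000 exposed users, 10,000 holdout users.
exposed_revenue, exposed_users = 90_000.0, 90_000   # $1.00 per exposed user
holdout_revenue, holdout_users = 7_000.0, 10_000    # $0.70 per holdout user
ad_spend = 20_000.0

inc = incremental_revenue(exposed_revenue, exposed_users, holdout_revenue, holdout_users)
reported_roas = exposed_revenue / ad_spend  # assumes all exposed revenue is attributed
print(f"Incremental revenue: {inc:,.0f}")
print(f"Reported ROAS: {reported_roas:.2f}  iROAS: {iroas(inc, ad_spend):.2f}")
```

With these inputs the iROAS comes out at 30% of the reported ROAS, matching the 70/30 split described above.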

The Three Main Testing Methods

No single approach to incrementality testing works in every situation. The right method depends on the channel, the audience, and what business outcomes you can measure.

Holdout Tests (Audience Split Testing)

A holdout test randomly splits your target audience: one group sees your ads, one group (the holdout) does not. You compare conversion rates between the two groups. The difference is your incremental lift.
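For a manual holdout, one common way to get a stable random split is to hash each user ID into a bucket. The sketch below is only an illustration of that idea and not how any particular ad platform assigns groups. The function names, the test name, and the 10% holdout share are all assumptions.

```python
import hashlib


def assign_group(user_id, test_name, holdout_share=0.10):
    """Deterministically map a user to 'holdout' or 'exposed' via a salted hash."""
    digest = hashlib.sha256(f"{test_name}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 16**8  # roughly uniform value in [0, 1)
    return "holdout" if bucket < holdout_share else "exposed"


def relative_lift(exposed_conversions, exposed_users, holdout_conversions, holdout_users):
    """Incremental lift as a fraction of the holdout (baseline) conversion rate."""
    exposed_rate = exposed_conversions / exposed_users
    holdout_rate = holdout_conversions / holdout_users
    return (exposed_rate - holdout_rate) / holdout_rate


# The same user always lands in the same group for a given test name.
print(assign_group("user-123", "retargeting-q1"))

# Hypothetical results: 5.0% conversion in exposed, 3.5% in holdout.
print(f"Relative lift: {relative_lift(500, 10_000, 350, 10_000):.1%}")
```

Salting the hash with the test name means a user held out in one test is not automatically held out in the next one.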

The setup is conceptually simple. In practice, the execution depends on the platform. Meta's Conversion Lift study suppresses ads to the holdout group and shows them a PSA (public service announcement) instead, providing a clean counterfactual where the only difference is ad exposure. Manual holdouts (excluding a segment of your audience from all campaigns) are simpler to set up but more prone to contamination: the holdout might still see your ads through other placements, lookalike audiences, or organic reach.

Holdout tests are best suited to digital channels where you can deterministically target individuals: paid social retargeting, email, display. They require sufficient volume: Meta's Conversion Lift needs at least 30,000 users in each group and approximately $30,000 in spend over the test period for reliable results.

The limitation: user-level holdout tests cannot work for channels where you cannot suppress ad delivery to specific individuals. You cannot run a user-level TV holdout. Paid search is similarly difficult: users initiate the search themselves, so you cannot prevent the ad from showing to a holdout group without excluding them from the auction entirely, which changes the experiment. For these cases, geo experiments are the right approach.

Geo Experiments

Geo experiments use geography as the unit of randomization instead of individual users. You select a set of geographic markets (DMAs in the US, regions or countries elsewhere), assign some to test and some to control, run your campaign in test markets, go dark (or reduce spend) in control markets, and measure the difference in business outcomes.

The key advantage: geo experiments measure business outcomes, not attributed digital conversions. You compare actual orders, revenue, or new customer acquisitions across markets. This removes the attribution layer entirely: you do not need to know who clicked what. You are measuring the total business effect of your advertising in a market.

Geo experiments are the right approach for television, YouTube, podcast, OOH, and any other channel where individual-level suppression is not possible. They are also valuable for digital channels when you want to validate results against a real-world business outcome rather than a platform-attributed digital conversion.

The limitation: geo experiments require careful market matching and statistical discipline. If your test and control markets have different baseline trends, your results will be biased. You need a sufficient number of markets on each side, typically 10 or more, for reliable statistical inference. Small companies with concentrated customer bases in a few markets may find this method impractical.
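To make the baseline-trend concern concrete, here is a deliberately simple sketch that uses the control markets' growth as the counterfactual for the test markets (a ratio form of difference-in-differences). It assumes both sets of markets would have grown at the same rate without the campaign, which is exactly the assumption that careful market matching is meant to protect. The function name and all figures are hypothetical, and dedicated geo-testing tools use more robust methods than this.

```python
def geo_incremental(test_pre, test_during, control_pre, control_during):
    """Test-market outcome minus what test markets would have done at the control growth rate."""
    control_growth = control_during / control_pre
    expected_test = test_pre * control_growth  # counterfactual for test markets
    return test_during - expected_test


# Hypothetical revenue totals for a pre-period and a test period of equal length.
incremental = geo_incremental(
    test_pre=200_000.0, test_during=230_000.0,
    control_pre=180_000.0, control_during=189_000.0,
)
print(f"Estimated incremental revenue in test markets: {incremental:,.0f}")
```

A naive before/after comparison would credit the campaign with the full 30,000 increase in test markets; netting out the 5% growth seen in control markets leaves about 20,000.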

Tools that help with geo experiments: Meta GeoLift (open-source R package for market matching and analysis), Google's CausalImpact R package, and internal tools at larger companies, such as Google's GeoX.

Ghost Ads (PSA Testing)

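Ghost ad testing, as described here, shows a control group a Public Service Announcement (an ad for a neutral, unrelated cause) in the exact same placements where your test group sees your real ad. The control group sees an ad; they just do not see your ad. This holds the experience of being served an ad constant, so the comparison isolates the effect of your ad rather than the effect of seeing any ad at all. Strictly speaking, "ghost ads" in the research literature refers to a related design in which the platform logs where your ad would have been shown to control users and serves them whatever ad would otherwise have won the auction, rather than a PSA; both designs aim at the same like-for-like comparison.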

This is the cleanest experimental design for measuring ad effectiveness specifically, as opposed to the effect of ad suppression. Meta's Conversion Lift uses this approach. The limitation: it only works on platforms that support it. You cannot run ghost ads on TV or in most programmatic environments.

Designing a Clean Incrementality Test

The difference between a reliable incrementality test and a misleading one comes down to a few design decisions made before the test starts.

Statistical Power

Most incrementality tests are underpowered. An underpowered test cannot detect the actual effect size reliably: you risk concluding your ads have no incremental value when they do (a false negative), or overestimating incrementality based on noisy data.

Calculate your minimum detectable effect before you start. The minimum detectable effect depends on: your expected baseline conversion rate in the holdout group, the variance in that rate, your sample size, and your desired statistical confidence (typically 90% or 95%). If you need to detect a 15% lift but your test is only powered to detect 40% or more, the result will be unreliable regardless of what it shows.

Online calculators and tools like Meta's GeoLift power analysis make this calculation accessible. Do it before you commit to a test structure, not after.
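For a user-level holdout, a rough minimum detectable effect can be approximated with the standard two-proportion formula. The sketch below is one common approximation, not the method any specific tool uses. It assumes a two-sided test, uses the baseline rate for the variance of both groups, and defaults to 80% power, which is a conventional choice rather than something specified above. The function name and inputs are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def minimum_detectable_lift(baseline_rate, exposed_users, holdout_users,
                            confidence=0.95, power=0.80):
    """Approximate smallest relative lift a two-group test can reliably detect."""
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # two-sided test
    z_power = NormalDist().inv_cdf(power)
    std_error = sqrt(
        baseline_rate * (1 - baseline_rate) * (1 / exposed_users + 1 / holdout_users)
    )
    absolute_mde = (z_alpha + z_power) * std_error
    return absolute_mde / baseline_rate  # expressed relative to the baseline rate


# Hypothetical: 2% baseline conversion rate, 30,000 users per group.
mde = minimum_detectable_lift(0.02, 30_000, 30_000)
print(f"Minimum detectable relative lift: {mde:.1%}")
```

If the number printed is larger than the lift you realistically expect, the test needs more users, a longer window, or a higher-volume outcome before it is worth running.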

Test Duration

Run incrementality tests for a minimum of two weeks, preferably four weeks. Tests shorter than two weeks are vulnerable to day-of-week effects: if your audience purchases more on weekends, a test that runs Thursday through Sunday overweights weekend behavior, so the lift it measures may not hold for a typical week. Short windows also leave too few conversions to separate a real effect from daily noise.

Two weeks is the minimum because it covers two complete weekly cycles; four weeks covers four and smooths out most short-run variability. Longer tests are sometimes needed for products with long purchase cycles.
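A planned test window can be sanity-checked against these rules before launch. This is a minimal sketch with a hypothetical helper name. It treats the end date as inclusive and flags windows that are shorter than two weeks or that do not cover whole weeks.

```python
from datetime import date


def check_test_window(start, end, min_days=14):
    """Return a list of problems with a planned test window (empty list means OK)."""
    days = (end - start).days + 1  # inclusive of both start and end dates
    problems = []
    if days < min_days:
        problems.append(f"Window is {days} days; minimum is {min_days}.")
    if days % 7 != 0:
        problems.append(f"Window of {days} days does not cover whole weeks.")
    return problems


# A Thursday-to-Sunday test fails both checks; a four-week test passes.
print(check_test_window(date(2024, 1, 4), date(2024, 1, 7)))
print(check_test_window(date(2024, 1, 1), date(2024, 1, 28)))
```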

What to Measure

Measure business outcomes, not attributed platform conversions: orders, revenue, and new customer acquisitions, the numbers in your CRM or order management system.

Platform-attributed conversions in the exposed group will always be inflated, because the platform is counting its own ad exposures as the cause of conversions that may have happened for other reasons. Your holdout comparison should use a clean, platform-independent measurement of outcomes in each group.

Contamination Risks

Every incrementality test has contamination risk. In geo experiments, people travel between test and control markets. In audience holdouts, people use multiple devices or clear their cookies and fall back into the exposed group. In PSA holdouts, the holdout audience might see your organic social posts or branded search results and self-select into your purchase funnel.

Design around contamination. For geo experiments, pick markets with low cross-market travel. For audience holdouts, use household-level suppression if possible. Factor your expected contamination rate into your minimum detectable effect calculation, because contamination shrinks the difference you will actually observe between groups. Do not pretend contamination does not exist.
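One simple way to factor contamination into planning is to treat it as dilution. The sketch below assumes a known fraction of the holdout ends up exposed and then behaves like the exposed group, which is a strong simplification. Under that assumption the observed absolute difference between groups shrinks in proportion to the contamination rate. The function names and figures are hypothetical.

```python
def observed_difference(true_difference, contamination_rate):
    """Absolute difference you would observe if part of the holdout is exposed."""
    return true_difference * (1 - contamination_rate)


def debiased_difference(observed, contamination_rate):
    """Rough correction of an observed difference for a known contamination rate."""
    return observed / (1 - contamination_rate)


# Hypothetical: true effect is 0.6 percentage points, 10% of the holdout is contaminated.
true_diff = 0.006
seen = observed_difference(true_diff, 0.10)
print(f"Observed difference: {seen:.4f}")
print(f"Corrected estimate:  {debiased_difference(seen, 0.10):.4f}")
```

In planning terms, the test has to be powered to detect the smaller, diluted difference, not the true one.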

What Brands Typically Find

The single most common finding from incrementality testing, one that surprises nearly every marketing team that runs their first serious test, is that retargeting has much lower incrementality than its platform-reported ROAS suggests.

Typical retargeting holdout results show that 40-75% of attributed conversions in retargeting campaigns are organic. The people being retargeted visited your site, viewed your products, or added items to a cart. They have strong purchase intent that exists independently of whether you show them an ad. When they convert after seeing a retargeting ad, the platform takes credit. But many of them would have converted anyway.

This is a structural feature of retargeting, not a failure of any specific campaign or platform. You are deliberately targeting your highest-intent audience. High intent means high organic conversion rate. High organic conversion rate means low incrementality.

Brand search shows a similar pattern. Most branded search conversions happen because the user is already looking for your brand. They type your brand name into Google, click the paid result instead of the organic listing, and convert. The paid ad captured a click that the organic listing would have captured anyway. Incrementality tests on branded search often show true iROAS below 1x: the ads are not worth the cost because the conversions would have happened organically.

The channels that often show high incrementality in testing are not the ones that look best in attribution. YouTube, connected TV, podcast, and display prospecting tend to show strong incremental lift in geo experiments, because they are reaching cold audiences who would not have found you without the ad. These channels look weak in last-click attribution but can be doing substantial real work in driving awareness and intent.

Building an Incrementality Testing Program

A single holdout test answers one question about one channel at one point in time. A systematic testing program builds a body of evidence that lets you make reliable cross-channel allocation decisions.

Prioritize your testing calendar by three criteria. First, spend level: test your largest channels first, because accurate measurement has the highest expected value where the most money is at stake. Second, prior suspicion: channels where you have reason to believe attribution might be overclaiming, typically retargeting and brand search, should be tested early. Third, actionability: only test channels where you are willing to change behavior based on the result. A test you cannot act on is a waste.

A reasonable annual testing calendar for a mid-size brand: one major geo experiment per quarter (four channels per year), one to two audience holdout tests on digital retargeting campaigns, and one brand search holdout test. This generates enough calibration data to give you a reasonable picture of true incrementality across your main channels within 12-18 months.

The data from your testing program should feed back into how you interpret your MMM and attribution data. If your holdout test shows Facebook retargeting iROAS is 1.2x when platform attribution shows 4.5x, you have a calibration factor: Facebook is overclaiming by approximately 3.75x on that campaign type. You can apply a similar adjustment to comparable Facebook retargeting campaigns, treat it as a rough starting point for other Facebook campaigns until they are tested directly, and update your MMM priors accordingly.
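The calibration arithmetic is simple enough to keep in a shared script. This sketch uses the 4.5x and 1.2x figures from the example above. The function names and the second campaign's 3.0x reported ROAS are hypothetical, and applying one campaign's factor to another is an assumption that only holds if the two are genuinely comparable.

```python
def calibration_factor(platform_roas, measured_iroas):
    """How many times the platform overstates return relative to the tested iROAS."""
    return platform_roas / measured_iroas


def calibrated_roas(platform_roas, factor):
    """Platform-reported ROAS scaled down by a calibration factor."""
    return platform_roas / factor


factor = calibration_factor(platform_roas=4.5, measured_iroas=1.2)
print(f"Calibration factor: {factor:.2f}x")

# Hypothetical untested campaign of the same type reporting 3.0x ROAS.
print(f"Calibrated estimate: {calibrated_roas(3.0, factor):.2f}x")
```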