A geo experiment is one of the cleanest ways to measure whether your advertising is actually driving business results. Unlike attribution models that work backward from observable data, a geo experiment is prospective: you design it before you run it, choose your control group before you look at the outcomes, and measure the difference in real business outcomes (revenue, orders) rather than attributed conversions.
It's also harder to run well than most guides suggest. Here's the real process.
What a geo experiment is
You split geographic markets (DMAs in the US, counties, cities, or countries) into two groups. Test markets are where you run (or scale) your campaign. Control markets are where you go dark, or maintain a baseline spend level. You measure the difference in business outcomes between the two groups during the test period. The incremental lift is what you attribute to the advertising.
The reason geo experiments are valuable is that they bypass the fundamental problem with attribution: you're not asking "which channel got credit?" You're asking "what happened to revenue in places where we ran ads, compared to similar places where we didn't?" That's a much cleaner question.
Step 1: Define exactly what you're testing
"Does our Facebook advertising drive incremental revenue?" is too broad to test cleanly. You need a specific, falsifiable question: "Does scaling our Facebook prospecting spend from $15,000 to $30,000 per week in these markets drive incremental orders during a 6-week test window?"
Write this down before you touch any platform. It forces you to specify the channel, the spend change, the duration, and the markets, all of which you need to design the test.
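One way to make the written question concrete is to record it as a small, immutable test plan. The sketch below is illustrative only: the `GeoTestPlan` class, its field names, and the placeholder market names are assumptions, not part of any platform's API. The spend figures and duration come from the example question above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: the plan cannot be edited once written down
class GeoTestPlan:
    channel: str
    baseline_weekly_spend: float
    test_weekly_spend: float
    duration_weeks: int
    primary_metric: str
    test_markets: tuple
    control_markets: tuple

    def incremental_spend(self) -> float:
        """Total extra spend over the whole test window."""
        weekly_increase = self.test_weekly_spend - self.baseline_weekly_spend
        return weekly_increase * self.duration_weeks


plan = GeoTestPlan(
    channel="Facebook prospecting",
    baseline_weekly_spend=15_000,
    test_weekly_spend=30_000,
    duration_weeks=6,
    primary_metric="orders",
    # Placeholder names; real markets come out of Step 2.
    test_markets=tuple(f"test_market_{i}" for i in range(1, 7)),
    control_markets=tuple(f"control_market_{i}" for i in range(1, 7)),
)

print(f"Incremental spend over the test: ${plan.incremental_spend():,.0f}")
```

Freezing the record is a deliberate choice: any later change has to be made as a new, visible version of the plan rather than a quiet mid-test edit.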
Step 2: Select your markets
Market selection is the most consequential part of geo experiment design, and the step most often rushed. You need test and control markets that are genuinely comparable: similar in size, demographics, industry mix, seasonal patterns, and baseline conversion rates.
Tools that help with this: Google's GeoX (built into Google Ads for Search and YouTube experiments), Meta's GeoLift (open source, available on GitHub), or Google's CausalImpact (an open-source R package). All three rely on pre-period data: GeoX and GeoLift to find markets that move together historically, CausalImpact to model how your test markets would have behaved without the campaign.
Never assign markets randomly without pre-testing their comparability. Spend at least one to two weeks confirming that your candidate test and control markets show similar trends on your key business metric before the test starts. This is called pre-period matching, and skipping it is the single biggest source of bad geo experiment results.
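As a rough illustration of pre-period matching, the sketch below ranks candidate market pairs by how closely their weekly revenue moved together before the test. It uses plain Pearson correlation, which is a simplification of what the dedicated tools do, and the market names and revenue figures are made up for the example.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def rank_market_pairs(pre_period):
    """Return (correlation, market, market) tuples, most similar pair first."""
    pairs = [
        (pearson(pre_period[a], pre_period[b]), a, b)
        for a, b in combinations(sorted(pre_period), 2)
    ]
    return sorted(pairs, reverse=True)


# Hypothetical weekly revenue (in $k) over eight pre-period weeks.
pre_period = {
    "market_a": [100, 104, 98, 110, 120, 115, 108, 112],
    "market_b": [50, 52, 49, 55, 60, 58, 54, 56],
    "market_c": [80, 78, 85, 79, 83, 90, 77, 81],
    "market_d": [70, 75, 66, 72, 69, 74, 80, 71],
}

for corr, a, b in rank_market_pairs(pre_period):
    print(f"{a} vs {b}: r = {corr:.2f}")
```

Here `market_a` and `market_b` differ in size but track each other closely, so they make a good test/control pair. Correlation on raw levels can be flattered by a shared trend or seasonality, so it is worth checking week-over-week changes as well.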
A few practical rules:
- Avoid markets at geographic edges where consumer spillover between test and control is likely (people who live in a control city but work or shop in a test city)
- For national brands, DMAs work well. For local or regional businesses, zip codes or counties may give cleaner boundaries
- Exclude markets with unusual one-time events during the test period (local disasters, large sporting events, construction that blocks a major retail location)
Step 3: Size the test correctly
Before you start, calculate the minimum detectable effect (MDE). This is the smallest lift your test can reliably detect, given your baseline revenue, number of markets, and test duration.
The inputs you need:
- Baseline daily or weekly revenue in your candidate test markets
- The smallest lift you'd consider meaningful (5%? 10%? 20%?)
- How long you plan to run the test
- How many markets you can realistically put in each group
Most marketers dramatically underpower their geo experiments by running them for too short a period or with too few markets. A test with 3 test markets and 3 control markets cannot reliably detect a 10% lift because the statistical noise is too large relative to the signal. As a rough guide: aim for 6+ markets per group and 4+ weeks of test duration before drawing conclusions.
If you can't achieve that with your market footprint, a holdout test (audience-level, not geography-level) may be more practical.
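To get a feel for how the number of markets drives the MDE, here is a back-of-the-envelope sketch using the standard two-sample normal approximation. It is not how GeoLift or GeoX compute power. The 5% spread (`sd_pct`) is an assumed figure that you would replace with the variation you observe across your own markets, and the normal approximation is optimistic when there are only a handful of markets.

```python
from math import sqrt
from statistics import NormalDist


def minimum_detectable_effect(sd_pct, markets_per_group, alpha=0.05, power=0.80):
    """Approximate MDE, in percentage points of lift, for test vs. control.

    sd_pct: assumed standard deviation, across markets, of each market's
            percent change in revenue versus its own pre-period baseline.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_power = NormalDist().inv_cdf(power)
    # Standard error of the difference between two group means.
    standard_error = sd_pct * sqrt(2 / markets_per_group)
    return (z_alpha + z_power) * standard_error


for n in (3, 6, 12):
    mde = minimum_detectable_effect(sd_pct=5.0, markets_per_group=n)
    print(f"{n} markets per group: MDE is about {mde:.1f}% lift")
```

Under these assumptions the MDE falls from roughly 11% with three markets per group to roughly 8% with six. A longer test helps through the same mechanism: it averages out weekly noise and lowers the effective spread per market.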
Step 4: Run the test cleanly
Once the test starts, don't change anything. This sounds obvious but breaks down in practice: a promotion gets launched in test markets, a sales team runs a local event, the media team adjusts targeting mid-flight. Any of these will contaminate your results.
Before launch:
- Document exactly what "test" means (spend level, channels included, targeting parameters)
- Document exactly what "control" means (zero spend? Category-level holdout? Reduced spend?)
- Communicate the test parameters to everyone who touches media, pricing, or promotions
- Lock the test configuration in the platform and confirm it's running as specified on day one
Set a calendar reminder to check mid-test that nothing unexpected has happened. Not to optimize, just to verify the test is still clean.
Step 5: Measure the right outcome
This is where most geo experiments go wrong. Do not measure attributed conversions in the test group and call that your lift number. Attribution models will credit the channels you scaled in test markets for many conversions that would have happened anyway, which is exactly the bias you're trying to measure past.
Measure business outcomes from your backend: total orders, total revenue, or new customer acquisitions, pulled from your Shopify, ERP, or analytics warehouse by geography. Compare the test market aggregate to the control market aggregate, controlling for pre-period differences.
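A minimal way to control for pre-period differences is a ratio-based difference-in-differences: assume test markets would have grown at the same rate as control markets, and treat anything above that as incremental. The sketch below uses hypothetical revenue totals and is much simpler than the time series models the dedicated tools fit.

```python
def difference_in_differences(test_pre, test_post, control_pre, control_post):
    """Estimate incremental revenue and relative lift in the test markets.

    Each argument is total backend revenue for that group and period.
    Pre and post periods are assumed to be the same length.
    """
    control_growth = control_post / control_pre
    # Counterfactual: test markets grow at the same rate as control markets.
    counterfactual = test_pre * control_growth
    incremental = test_post - counterfactual
    return incremental, incremental / counterfactual


# Hypothetical six-week revenue totals pulled from the order database.
incremental, lift = difference_in_differences(
    test_pre=500_000, test_post=588_000,
    control_pre=400_000, control_post=420_000,
)
print(f"Incremental revenue: ${incremental:,.0f} ({lift:.1%} lift)")
```

In this example control markets grew 5%, so the test markets' counterfactual is $525,000, and the $63,000 above it is a 12% lift. A raw before/after comparison in the test markets alone would have overstated the effect by counting that background growth as lift.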
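If you're using Meta's GeoLift tool, it uses an augmented synthetic control method: it builds a weighted combination of control markets that tracks your test markets in the pre-period, uses that to estimate what would have happened in your test markets if you hadn't run the campaign, then compares that counterfactual to what actually happened. (Bayesian structural time series is the approach behind Google's CausalImpact.) The output is an estimated lift with a confidence interval, which is what you want.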
Step 6: Interpret the results
Compare test vs. control outcomes over the test period. If test markets showed 12% higher revenue than control markets (after accounting for pre-period baseline differences), that's your estimated lift.
Then calculate iROAS:
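- Incremental revenue = actual revenue in test markets minus counterfactual revenue (what those markets would have earned without the campaign)
- iROAS = Incremental revenue / Incremental ad spend (spend in test markets above the control or baseline level)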
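As a worked example, the sketch below applies the two formulas to hypothetical revenue figures, using the spend change from the Step 1 example ($15,000 to $30,000 per week for six weeks). The function name and the revenue numbers are illustrative assumptions.

```python
def iroas(actual_revenue, counterfactual_revenue, test_spend, baseline_spend):
    """Incremental return on ad spend for the test markets."""
    incremental_revenue = actual_revenue - counterfactual_revenue
    incremental_spend = test_spend - baseline_spend
    if incremental_spend <= 0:
        raise ValueError("test spend must exceed baseline spend")
    return incremental_revenue / incremental_spend


weeks = 6
result = iroas(
    actual_revenue=2_352_000,          # hypothetical test-market revenue
    counterfactual_revenue=2_100_000,  # hypothetical modelled counterfactual
    test_spend=30_000 * weeks,
    baseline_spend=15_000 * weeks,
)
print(f"iROAS: {result:.1f}x")
```

The denominator is only the extra $90,000, not the full $180,000 spent in test markets, because the baseline spend would have happened either way.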
Some results that commonly surprise brands running their first geo experiment:
Retargeting campaigns often show incremental returns 30–60% lower than their attributed ROAS suggests. The people in your retargeting audience already visited your site, and many of them were going to buy anyway.
Brand awareness channels (YouTube, Connected TV, podcasts) often show higher incremental lift than attribution gives them credit for. They appear at the top of the funnel where attribution undercounts them, but they move real business outcomes.
Branded search often shows very low incremental lift. People who type your brand name into Google were already planning to visit your site. You may be bidding on traffic you'd get for free.
Step 7: Act on the results
A geo experiment only earns its cost if you change something based on the output. If retargeting shows low incrementality, reduce that spend and reallocate to the channel that showed higher lift. Document the iROAS from the test so you can calibrate your attribution model for this channel going forward.
The calibration step is important: if your geo experiment shows Facebook prospecting has an iROAS of 2.8x while your attribution shows an attributed ROAS of 5.1x, you can apply a rough correction factor of 0.55x to Facebook prospecting attributed revenue in your dashboards. This isn't perfect, but it's far more accurate than trusting the raw attributed number.
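The correction factor is just the ratio of the two numbers. A small sketch, using the figures from the paragraph above and a made-up attributed revenue total:

```python
def calibration_factor(iroas, attributed_roas):
    """Share of attributed revenue that the experiment says is incremental."""
    return iroas / attributed_roas


factor = calibration_factor(iroas=2.8, attributed_roas=5.1)

attributed_revenue = 120_000  # hypothetical dashboard figure for the channel
calibrated_revenue = attributed_revenue * factor

print(f"Correction factor: {factor:.2f}x")
print(f"Calibrated revenue: ${calibrated_revenue:,.0f}")
```

The factor applies only to the channel and spend range you actually tested; it is worth re-running the experiment periodically rather than treating the factor as permanent.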
Common mistakes
Running for too short a window. Two weeks is rarely enough. Day-of-week variation, weather, and random noise can easily swamp a real effect in a short window. Four weeks minimum; six weeks if your baseline conversion volume is low.
Using too few markets. The noise in your lift estimate shrinks only with the square root of the number of markets, so halving it takes roughly four times as many markets. Three vs. three gives you very little power. Push for six vs. six or more if your footprint allows.
Measuring attributed conversions instead of business outcomes. Use your backend revenue data by geography, not what the platform reports.
Changing things mid-test. Promotions, creative changes, pricing adjustments during the test window invalidate the results.
Not pre-testing market comparability. Markets that move together in the pre-period are much more likely to give you a clean counterfactual during the test period.