
Statistical significance and multivariate testing: setting a new standard

Statistical significance is an important hallmark of data validity when it comes to testing ad creative. But multivariate testing has rendered the old 50-conversion benchmark obsolete.
Pierce Porterfield

Statistical significance — or stat sig — refers to data that can be attributed to a specific cause and not to random chance.

In the case of testing ad creative, reaching stat sig means you can be at least 80% confident that the differences you're seeing aren't just random chance; in other words, if you ran the test again, you'd expect to see similar data.

One of the biggest factors in reaching stat sig is sample size. In the case of testing ad creative, that's the number of people who were shown each version of the ad creative.
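
To make that concrete, here's a quick simulation sketch (illustrative only, not Marpipe's code): two ads with genuinely different conversion rates are "tested" over and over, and the share of runs where the true winner actually comes out ahead climbs as the sample size per ad grows.

```python
import numpy as np

rng = np.random.default_rng(seed=7)
TRUE_CVR_A, TRUE_CVR_B = 0.03, 0.04   # assumed "true" conversion rates
N_RUNS = 1_000                        # simulated repeats of the same test

def winner_consistency(impressions_per_ad: int) -> float:
    """Share of simulated tests in which ad B (the true winner) comes out ahead."""
    conv_a = rng.binomial(impressions_per_ad, TRUE_CVR_A, size=N_RUNS)
    conv_b = rng.binomial(impressions_per_ad, TRUE_CVR_B, size=N_RUNS)
    return float(np.mean(conv_b > conv_a))

for n in (200, 1_000, 5_000):
    print(f"{n:>5} impressions per ad -> B wins {winner_consistency(n):.0%} of runs")
```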

Facebook was the first to bring this measure of validity to most marketers’ attention in the world of paid social. Exiting the learning phase (i.e. reaching 50 conversions) has long been the stat sig standard that most marketers stand by.

But for today’s most reliable ad testing method — multivariate testing (MVT) — it’s not always realistic. Because MVT involves testing a large number of ads and creative elements at once, reaching statistical significance for each ad and element is more difficult than with, say, A/B testing.
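
The squeeze on sample size is easy to see with some back-of-the-napkin math (the numbers below are hypothetical): the same pool of conversions spread across more ads leaves fewer and fewer conversions per ad.

```python
TOTAL_CONVERSIONS = 400          # assumed conversions generated by the whole test

for n_ads in (2, 8, 24, 48):     # an A/B test vs. progressively larger MVT grids
    per_ad = TOTAL_CONVERSIONS / n_ads
    note = "clears the 50-conversion benchmark" if per_ad >= 50 else "falls short of it"
    print(f"{n_ads:>2} ads -> ~{per_ad:.0f} conversions each ({note})")
```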

There are two silver linings to this cloud.

  1. With the right tool, you can reach stat sig without hitting 50 conversions.
  2. Not reaching stat sig does not render MVT unreliable.

Let’s double click on both of these insights.

Reaching stat sig in a multivariate test

With the right tool, you can reach stat sig without hitting 50 conversions. Marpipe is the only automated ad testing platform with a live statistical significance calculator built right in. We call it the Confidence Meter, and it tells you if your data for each variant group is scientifically proven — or not.

Each multivariate test run on Marpipe contains multiple variant groups — some of which reach high confidence sooner than others. When a variant group reaches high confidence, it means you have enough data to make creative decisions. And when enough variant groups reach a high confidence level, you can move on to your next test.

Look to the Confidence Meter to understand:

  • whether or not a variant group has reached high confidence
  • if further testing for a certain variant group is necessary
  • whether repeating the test would result in a similar distribution of data
  • when you have enough information to move on to your next test

The Confidence Meter does NOT tell you that:

  • one variant or variant group is the all-time best (or worst)
  • a variant group will always (or never) impact your KPIs
  • there’s no need to challenge winners in subsequent tests

How to read the Confidence Meter

Gray means:

  • 0–55% confidence; fluctuations in performance are likely due to chance
  • further testing for this variant group is necessary
  • you do not have enough information to move on to your next test
  • try testing variants with more substantial differences between them
[Screenshot: Marpipe UI depicting low statistical confidence]

Yellow means:

  • 56–79% confidence; fluctuations in performance might be due to chance
  • further testing for this variant group is necessary
  • you do not have enough information to identify a definitive winning or losing ad or creative element
  • try looking at another KPI or continue to put spend behind this test to reach high confidence
[Screenshot: Marpipe UI depicting mild statistical confidence]

Green means:

  • 80–100% confidence; fluctuations in performance are very unlikely to be due to chance
  • further testing for this variant group is not necessary
  • if enough variant groups are green, you have enough information to move on to your next test
  • continue to challenge winning elements and drop low performers in future tests
[Screenshot: Marpipe UI depicting high statistical confidence]
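
If you want the thresholds above in one place, here's a minimal sketch of the color bands as described in this article. The function name and exact cutoffs come from the descriptions above, not from Marpipe's actual implementation.

```python
def confidence_band(confidence_pct: float) -> str:
    """Map a 0-100 confidence score to the Confidence Meter color bands described above."""
    if confidence_pct >= 80:
        return "green"   # enough data to act on this variant group
    if confidence_pct >= 56:
        return "yellow"  # keep testing, or look at another KPI
    return "gray"        # differences so far are likely noise

print([confidence_band(c) for c in (42, 63, 85)])  # ['gray', 'yellow', 'green']
```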

Our methodology

What underlying statistical methods do we use?

Marpipe uses the G-test (also known as the likelihood-ratio test), which determines whether the proportions of categories in two or more groups differ significantly from one another. It has been a standard test of significance in science and mathematics for decades.
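
For the curious, here's what a G-test looks like on a single pair of variant groups. The counts are hypothetical, and the final line is just one simple way to turn a p-value into a confidence percentage; it isn't necessarily the exact formula behind the Confidence Meter.

```python
from scipy.stats import chi2_contingency

# Rows: variant group A, variant group B. Columns: converted, did not convert.
observed = [
    [48, 1952],   # A: 48 conversions out of 2,000 impressions
    [74, 1926],   # B: 74 conversions out of 2,000 impressions
]

# Passing lambda_="log-likelihood" makes chi2_contingency run the G-test.
g_stat, p_value, dof, _ = chi2_contingency(observed, lambda_="log-likelihood")
print(f"G = {g_stat:.2f}, p = {p_value:.4f}, confidence ~ {(1 - p_value) * 100:.1f}%")
```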

Why don't we apply multiple-comparison corrections?

Marpipe lets you break down and analyze your results in a nearly infinite number of ways — or just one. If we corrected for every possible breakdown, the test would become overly conservative: from a statistics point of view, you would be more likely to conclude that there is no meaningful result when, in fact, there is one.

Because of this, we highly suggest customers decide on a primary hypothesis prior to running a test. And when looking at results across tests, we also suggest creating a new test to validate any specific patterns that seem to emerge.
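
To see the trade-off in action, here's a tiny illustration of why correcting across many breakdowns would bury real effects. A Bonferroni-style correction divides the evidence threshold by the number of breakdowns, so a result that looks meaningful on its own stops clearing the bar (all numbers are hypothetical).

```python
ALPHA = 0.20          # the 80% confidence threshold used in this article
p_value = 0.04        # a breakdown that looks meaningful on its own

for n_breakdowns in (1, 5, 20, 100):
    corrected = ALPHA / n_breakdowns     # Bonferroni-corrected threshold
    verdict = "still significant" if p_value <= corrected else "no longer significant"
    print(f"{n_breakdowns:>3} breakdowns -> threshold {corrected:.4f} ({verdict})")
```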

Not reaching stat sig in a multivariate test

Not reaching stat sig does not render MVT unreliable. There are still clear winners and losers, even with smaller sample sizes per ad. It just means we have to look at early indicators of success rather than stat sig to help us make quick decisions about which ads and creative elements are or aren’t performing.

In short, we need to analyze any creative outliers rising to the top to see if those ads and creative elements are worth further testing that could eventually get us to statistical significance.

Early indicators can look like this:

  • One or a few ads generate a few leads while others do not
  • One or a few ads increase engagement by a small percentage while others do not
  • A specific creative element has a consistently lower CPA
  • A specific creative element consistently increases conversion rates while others do not

It’s also wise to take another look at your results using another KPI. Because while an ad or creative element may not reach stat sig for, say, purchases, it might for, say, clicks. This is a good sign that, given more time and budget, that ad or element could reach stat sig for purchases, too.
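
If you're doing this analysis by hand, a simple aggregation by creative element is usually enough to surface those outliers. Here's a hypothetical sketch; the column names and numbers are invented for illustration.

```python
import pandas as pd

results = pd.DataFrame({
    "headline":    ["Free shipping", "Free shipping", "20% off", "20% off"],
    "image":       ["lifestyle", "product", "lifestyle", "product"],
    "spend":       [120.0, 115.0, 118.0, 122.0],
    "conversions": [6, 4, 3, 2],
})

results["cpa"] = results["spend"] / results["conversions"]

# Average CPA per headline across every ad that used it. An element with a
# consistently lower CPA is the kind of early indicator worth pushing toward
# stat sig in a follow-up test.
print(results.groupby("headline")["cpa"].mean().sort_values())
```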

The bottom line

Statistical significance is an important hallmark of data validity. But the old 50-conversion benchmark isn’t always possible — or necessary. Combine this knowledge with the right multivariate tool, and you can achieve stat sig while testing large numbers of ads and creative elements at once.
