Statistical significance and multivariate testing: setting a new standard

Statistical significance is an important hallmark of data validity when it comes to testing ad creative. But multivariate testing has rendered the old 50-conversion benchmark obsolete.

Pierce Porterfield

September 14, 2022

Statistical significance — or stat sig — refers to data that can be attributed to a specific cause and not to random chance.

In the case of testing ad creative, reaching stat sig means that if you ran the test multiple times, you would see similar data at least 80% of the time.

One of the greatest determining factors of stat sig is the sample size. Again, in the case of testing ad creative, this is the number of people who were presented with one version of ad creative.

Facebook was the first to bring this measure of validity to most marketers’ attention in the world of paid social. Exiting Learning Mode (i.e. reaching 50 conversions) has long been the stat sig standard that most marketers stand by.

But for today’s most reliable ad testing method — multivariate testing (MVT) — it’s not always realistic. Because MVT involves testing a large number of ads and creative elements at once, reaching statistical significance for each ad and element is more difficult than with, say, A/B testing.

There are two silver linings to this cloud.

With the right tool, you can reach stat sig without hitting 50 conversions.
Not reaching stat sig does not render MVT unreliable.

Let’s take a closer look at both of these insights.

Reaching stat sig in a multivariate test

With the right tool, you can reach stat sig without hitting 50 conversions. Marpipe is the only automated ad testing platform with a live statistical significance calculator built right in. We call it the Confidence Meter, and it tells you whether the performance differences within each variant group are statistically significant or could still be due to chance.

Each multivariate test run on Marpipe contains multiple variant groups — some of which reach high confidence sooner than others. When a variant group reaches high confidence, it means you have enough data to make creative decisions. And when enough variant groups reach a high confidence level, you can move on to your next test.

Look to the Confidence Meter to understand:

whether or not a variant group has reached high confidence
if further testing for a certain variant group is necessary
whether repeating the test would likely result in a similar distribution of data
when you have enough information to move on to your next test

The Confidence Meter does NOT tell you that:

one variant or variant group is the all-time best (or worst)
a variant group will always (or never) impact your KPIs
there’s no need to challenge winners in subsequent tests

How to read the Confidence Meter

Gray means:

0–55% confidence; fluctuations in performance are likely due to chance
further testing for this variant group is necessary
you do not have enough information to move on to your next test
try testing variants with more substantial differences between them

Screenshot of Marpipe UI depicting low statistical confidence

Yellow means:

56–79% confidence; fluctuations in performance might be due to chance
further testing for this variant group is necessary
you do not have enough information to identify a definitive winning or losing ad or creative element
try looking at another KPI or continue to put spend behind this test to reach high confidence

Screenshot of Marpipe UI depicting mild statistical confidence

Green means:

80–100% confidence; fluctuations in performance are unlikely to be due to chance
further testing for this variant group is not necessary
if enough variant groups are green, you have enough information to move on to your next test
continue to challenge winning elements and drop low performers in future tests

Screenshot of Marpipe UI depicting high statistical confidence
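
For readers who think in code, the three color bands above can be sketched as a small Python function. This is only an illustration, not Marpipe's implementation: the function names are hypothetical, the handling of fractional values between bands (for example 55.5%) is an assumption, and so is `min_green_share`, since the article only says "enough" variant groups need to be green.

```python
def meter_color(confidence_pct: float) -> str:
    """Map a confidence percentage (0-100) to the color bands described above."""
    if not 0 <= confidence_pct <= 100:
        raise ValueError("confidence must be between 0 and 100")
    if confidence_pct >= 80:
        return "green"   # high confidence: differences are unlikely to be chance
    if confidence_pct >= 56:
        return "yellow"  # mild confidence: differences might be chance
    return "gray"        # low confidence: differences are likely chance


def ready_for_next_test(confidences, min_green_share=0.5):
    """Return True when 'enough' variant groups are green.

    min_green_share is a hypothetical threshold chosen for illustration;
    the article does not define how many green groups count as enough.
    """
    if not confidences:
        return False
    greens = sum(1 for c in confidences if meter_color(c) == "green")
    return greens / len(confidences) >= min_green_share
```

Calling `meter_color(62)` returns `"yellow"`, which means the variant group needs more spend or a look at a different KPI before you call a winner.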


Our methodology

What underlying statistical methods do we use?

Marpipe uses the G-test, a likelihood-ratio test that determines whether the proportions of categories (for example, converted versus not converted) differ significantly between two or more groups. It has been a standard test for significance in science and mathematics for decades.
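
To make the G-test concrete, here is a minimal Python sketch for the simplest case: two ad variants, each with a count of people who converted and people who did not. It is an illustration under stated assumptions, not Marpipe's code. The function name and the example numbers are hypothetical, the p-value formula only holds for this two-by-two case (one degree of freedom), and treating confidence as 1 minus the p-value is an assumption, since the article does not say how the Confidence Meter derives its percentage.

```python
import math


def g_test_2x2(conv_a, n_a, conv_b, n_b):
    """G-test of independence for two variants x (converted, not converted).

    Returns (G statistic, p-value). With a 2x2 table there is 1 degree
    of freedom, so the chi-square tail probability is erfc(sqrt(G / 2)).
    """
    observed = [
        [conv_a, n_a - conv_a],
        [conv_b, n_b - conv_b],
    ]
    total = n_a + n_b
    row_totals = [n_a, n_b]
    col_totals = [conv_a + conv_b, total - conv_a - conv_b]

    g = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            e = row_totals[i] * col_totals[j] / total  # expected count if no difference
            if o > 0:  # a cell with zero observations contributes nothing
                g += o * math.log(o / e)
    g = max(2 * g, 0.0)  # guard against tiny negative rounding error

    p_value = math.erfc(math.sqrt(g / 2))
    return g, p_value


# Hypothetical example: variant A converts 30 of 1,000 viewers, variant B 15 of 1,000.
g, p = g_test_2x2(30, 1000, 15, 1000)
confidence_pct = (1 - p) * 100  # assumption: confidence read as 1 - p
```

Note that neither variant in the example hits 50 conversions, yet the difference between them can still be tested. The sample size (how many people saw each variant) matters as much as the raw conversion count.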

Why don't we do multiple analysis corrections?

Marpipe lets you break down and analyze your results in a nearly infinite number of ways, or just one. From a statistics point of view, the more analysis breakdowns you look at and accept, the more likely you are to think that there is a meaningful result when, in fact, it is only chance.

Because of this, we highly suggest customers decide on a primary hypothesis prior to running a test. And when looking at results across tests, we also suggest creating a new test to validate any specific patterns that seem to emerge.
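
A quick back-of-the-envelope calculation shows why a primary hypothesis matters. The Python sketch below is a simplified illustration: it assumes every breakdown is independent and that no real effect exists, which is rarely exactly true in practice. The function name is hypothetical.

```python
def chance_of_false_positive(num_breakdowns: int, confidence: float = 0.80) -> float:
    """Probability that at least one breakdown looks significant by chance alone.

    Assumes independent breakdowns and no true effect in any of them.
    """
    alpha = 1 - confidence  # chance that a single breakdown is a false alarm
    return 1 - (1 - alpha) ** num_breakdowns


one_look = chance_of_false_positive(1)    # a single, pre-planned analysis
ten_looks = chance_of_false_positive(10)  # slicing the same results ten ways
```

Under these assumptions, one pre-planned analysis at 80% confidence carries about a 20% chance of a false alarm, while ten separate breakdowns push the chance of at least one false alarm to roughly 89%. That is why a pattern spotted while exploring is best confirmed with a fresh test.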

Not reaching stat sig in a multivariate test

Not reaching stat sig does not render MVT unreliable. Likely winners and losers still emerge, even with smaller sample sizes per ad. It just means we have to look at early indicators of success rather than stat sig to help us make quick decisions about which ads and creative elements are or aren’t performing.

In short, we need to analyze any creative outliers rising to the top to see if those ads and creative elements are worth further testing that could eventually get us to statistical significance.

Early indicators can look like this:

One or a few ads generate a few leads while others do not
One or a few ads increase engagement by a small percentage while others do not
A specific creative element has a consistently lower CPA
A specific creative element consistently increases conversion rates while others do not

It’s also wise to take another look at your results using another KPI. An ad or creative element that does not reach stat sig for, say, purchases might reach it for clicks. That is a good sign that, given more time and budget, the ad or element could reach stat sig for purchases, too.

The bottom line

Statistical significance is an important hallmark of data validity. But the old 50-conversion benchmark isn’t always possible — or necessary. Combine this knowledge with the right multivariate tool, and you can achieve stat sig while testing large numbers of ads and creative elements at once.
