A while back I did a workshop on "Analytics Tricks for 21st-Century Copywriters" for members of The Copywriter Club Underground where, hilariously, I only got to about HALF of the stuff I wanted to share.
Thankfully, attendees found the stuff I DID share pretty useful ...
... Unfortunately, I failed to get around to the ONE THING I really wanted to share with fellow copywriters, namely: a "mathless" way of interpreting split-testing statistics.
(Hat-tip to Helen Peatfield above for asking about it in the Facebook group!)
So without further ado, here is "Momoko's Non-Technical Way To Understand Statistical Significance":
(Please disregard my "pineapple head" hairdo ... forgot to un-scruff myself before recording!)
Edited transcript (for old-school folks — like me! — who actually prefer text over video)
So, this graphic is the kind of stuff you'll see whenever you look up "statistical significance" and "statistical power" on Wikipedia or even most articles on stats:
These explanations immediately dive into the "null hypothesis" and the "alternative hypothesis" and "rejecting the null" and "failing to reject the null" and all this stuff. "Alphas" and "betas" and ... all that crap.
We are NOT going to be talking about that stuff. This should be an entirely intuitive (i.e. mathless) way to wrap your head around statistical significance. Let's get started.
Running an A/B test is like taking a picture (the resolution matters!)
Personally, I find the best way to conceptually understand split-testing is to think of it like photography.
Your test is like a snapshot of how your customers are interacting with any kind of change you've implemented on the site relative to what was there before (i.e. your test treatment or variant, vs. the original, or control).
It's literally just a way to observe what people are doing on your website. This contextualizes why sample size is important for testing.
You hear this all the time: someone says, "my test wasn't statistically significant ... what happened?" And almost invariably a CRO consultant or a data scientist / statistician will be like, "well, how big was your sample size?"
Now, if you're a photographer or videographer, the resolution of your camera — i.e. the density of pixels in your image — is super-important, right? If you have terrible resolution for your camera the images that you get are going to be pretty crap.
[An extreme example of where the camera's resolution impacts your ability to make decisions is CCTV footage for security systems. In fictional scenarios like CSI, detectives get video surveillance of a crime and say, "We need to see the person's face! Zoom and enhance, then we'll be able to figure out who they are."]
In reality, "zoom and enhance" is not a thing.
If you don't have enough pixels to determine what you're looking at, then you're blind — you can't make any deductions.
Split testing is like that, too.
The more visitors you have in a test, the more "pixels" you have in your snapshot, which improves your ability to see what really is going on.
And the more "pixels" you have, the more detail you can see. In some tests, you expect to see a big change. In others, you're expecting to see a small change. Sometimes, you have no idea what the change is going to be.
You can still see big changes, big differences, or trends with a crappy camera or low resolution. You can see a giant tree in a field 10 feet in front of you with a garbage camera from 20 years ago.
So oftentimes you can still see big effects (like 50%+) with not-so-huge test sample sizes.
But when it comes to detecting small things — either in a test or in an image — you have to have really GOOD resolution, i.e. really high pixel density.
If you want to be able to see small changes in a split test — say, someone claims "we saw a 5% difference in this test" — they'd better have a really big sample size, because seeing a statistically significant 5% difference takes quite a lot of visitors, or a lot of pixels.
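To put rough numbers on that intuition, here's a sketch in plain Python (standard library only) of the textbook sample-size formula for comparing two conversion rates. The 10% baseline and the two lift figures are made-up illustrations, not numbers from any real test:

```python
import math
from statistics import NormalDist

def visitors_per_arm(baseline, relative_lift, alpha=0.05, power=0.80):
    """Rough visitors needed per variation to reliably detect a given
    lift, using the standard two-proportion z-test formula."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2)   # two-sided significance cutoff
    z_power = nd.inv_cdf(power)           # desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)

# Big effect (a 50% lift on a 10% baseline): a few hundred visitors suffice
print(visitors_per_arm(0.10, 0.50))   # ~700 per variation
# Small effect (a 5% lift on the same baseline): tens of thousands needed
print(visitors_per_arm(0.10, 0.05))   # ~58,000 per variation
```

The exact numbers vary a little between calculators, but the scale difference — nearly 100x more "pixels" to see the small effect — is the point.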
Extending the analogy (with bears!): significance, confidence, power
Now let's break down how each of the different terms that we always hear about in relation to split testing — like significance, confidence, power — fit into the photography analogy.
POP QUIZ: Do you see a bear in this picture?
If you have functional eyeballs, you immediately think: "Duh. Yes."
Okay, then what are the chances that the bear that we think we're seeing isn't a bear, it's actually just a random smudge on the camera lens?
Again, your instinctive answer will be: "You've gotta be joking, right? That's CLEARLY not a random smudge. This is not just like a piece of lint on my camera."
So let's say for the sake of argument that the chances of this "bear" being just a smudge on the camera are like 0.000000 ... [however many zeros you want] ... 1%. Infinitesimally small.
Well, what's the inverse of that? In other words, how confident are we that this is NOT a smudge that we're looking at?
It'd be 1 - our "smudge factor," i.e. a 99.999999999 ...% chance that this is not a smudge on the camera.
The chance that whatever your test is picking up is just a random smudge or blip, THAT is your test's statistical significance. It's the chance of a false positive, of seeing something that LOOKS like an effect due to the changes you've implemented, but actually it's just completely due to something else.
The inverse of that is your "confidence level," which is a term that's a little bit more intuitive.
In this example, we'd say we have 99.999+% confidence that this is NOT a smudge.
So significance and confidence are basically two sides of the same coin.
Now, this leaves the one really confusing one that gets people a lot: statistical power.
POP QUIZ: Is there a bear in THIS picture?
Again, our instinctive answer when we look at this is like "Duh, no."
Okay, then what are the chances that there actually IS a bear in this picture and our camera is not able to pick it up or detect it for some reason?
The chances are obviously very, very low. For the sake of argument, we'll say 0.00000001%. This is our chance of calling a "false negative."
If the chance of NOT detecting a bear — i.e. calling a false negative — is very, very low, then clearly we are very confident in our [camera]'s ability to detect bears. Let's say 99.9999999% confident.
Your confidence in NOT accidentally missing a bear that's actually in front of you ... THAT is your "statistical power."
Power is how confident we feel that if our test treatment DID have a real effect on conversions, our test would be able to detect it.
From the wildlife photographer perspective, our "statistical power" is our confidence in our ability — or our camera's ability — to detect the bear.
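If you'd rather see this than take it on faith, here's a quick simulation sketch (standard-library Python; the 10% → 15% conversion rates are invented for illustration). It runs many fake A/B tests in which a real difference genuinely exists, and counts how often a simple two-sided z-test actually "sees the bear":

```python
import random
from statistics import NormalDist

def simulated_power(p_a, p_b, n, runs=400, alpha=0.05, seed=1):
    """Fraction of simulated tests that reach significance when a
    real difference between p_a and p_b truly exists."""
    rng = random.Random(seed)
    nd = NormalDist()
    hits = 0
    for _ in range(runs):
        conv_a = sum(rng.random() < p_a for _ in range(n))
        conv_b = sum(rng.random() < p_b for _ in range(n))
        pa, pb = conv_a / n, conv_b / n
        se = (pa * (1 - pa) / n + pb * (1 - pb) / n) ** 0.5
        if se == 0:          # both arms got zero conversions: skip
            continue
        z = abs(pb - pa) / se
        if 2 * (1 - nd.cdf(z)) < alpha:   # two-sided p-value vs. cutoff
            hits += 1
    return hits / runs

# A real 50% lift (10% -> 15%) with a tiny sample: the camera misses
# the bear most of the time
print(simulated_power(0.10, 0.15, n=200))    # roughly 0.3
# Same lift, bigger sample: the bear almost always shows up
print(simulated_power(0.10, 0.15, n=1500))   # close to 1.0
```

Same bear, same lens logic: more visitors per arm means more resolution, and the detection rate — the power — climbs toward 100%.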
All statistical tests are like blurry pictures
Once you break statistical power and significance down in relation to photography, understanding their importance becomes very simple, almost absurdly simple.
It's like, "oh, that totally makes sense. Of course these are important things to track."
But with the examples that I've shown so far — with the bear clearly in the picture and the bear clearly NOT in the picture — our instinctive, interpretive brain makes the role of confidence and power seem unnecessary. Our eyes and brain make the judgment for us: OBVIOUSLY the bear is/is not in those pictures, so we can tell by looking that the confidence & power of our test will be high.
But it's important to realize that in terms of being able to assess REALITY, all statistical tests are essentially pictures that look like this:
POP QUIZ: Is there a bear in THIS picture?
We can kind of make out some stuff ... but not really.
With stats, there's no way to know what's ACTUALLY real. This blurry snapshot is all you have.
There's no option to look away from the camera and rely on our own eyeballs to see what's "really" going on.
That's why we need statistical measurements like power and significance to make a cutoff and decide when you're seeing something important/real, when you're not, and whether or not your test (or picture) is detailed enough (or "high-res" enough) to make the call either way.
So back to the quiz ... IS there a bear in this picture?
What about that dark thing in the middle of this picture? Is that a bear?
How confident are we that this smudge is a bear?
Not that confident, right? We wouldn't say that our confidence is 99.9+% in this case. It could honestly go either way.
So let's say our confidence is ~50%, maybe.
And how confident are we in our camera's ability to detect bears in general?
Again, NOT very confident. The resolution is low. If this was a split test, we would say this is a pretty low-powered test.
This is what reality is like through the lens of statistical tests, including AB tests.
With statistics, it is assumed that we cannot be 100% sure of what reality actually is. That's the whole principle of statistics, that we NEVER have the full picture, so we need to calculate the chances of what's actually going on, and set clear cut-offs so we can make a call one way or another.
For simple A/B tests, there are some conventions to follow. A good test generally has 95% confidence or more, which means that you'll see a "smudge factor" of less than 5%.
We also want to make sure that our test has about 80% statistical power. In other words, we want our confidence in our test's (i.e. "camera"'s) ability to pick up a real effect — if it's actually there — to be at least 80%.
A lot of the time people focus solely on statistical significance (chance it's just a smudge) and they don't focus on power (the test's "resolution") at all. This can result in weak or false conclusions.
How statistical literacy can help YOU help your CLIENTS
A common scenario for well-meaning copywriters who write for small businesses is a client saying:
"We ran an AB test on your copy, but we didn't actually see a statistical lift! Nothing changed!"
If you're a statistically uninformed copywriter, this is the point where you crumple on the inside and feel like a complete fraud. (You should never feel like that, by the way.)
However, if you are statistically INFORMED, the first thing that you're going to think is, "Okay, how conclusive is this test? Did the client actually adhere to the right statistical cutoffs to come to their conclusions?"
For example, let's say our client had a test where they had 750 users go to a version A and 750 users go to version B.
And let's say version B is our copy. In version A, we got 72 conversions. And then in version B we only got 65 conversions.
If you plug these numbers into a free testing calculator, you can get a clear breakdown of everything you're seeing and assess the validity of the test.
The first thing to note is we're seeing about a 9.72% decrease in conversions in this test.
The second thing we see is the significance. To be clear: "p-value," "alpha," "statistical significance" ... in practice, people use all of these to talk about the same idea, AKA the "smudge factor." (Strictly speaking, the p-value is the smudge factor you observed, and alpha is the cutoff you compare it against.)
In the calculator's breakdown, it says there's a p-value of about 0.7348, or a 73% chance of it being a smudge.
So there is a 73% CHANCE that the results that we're seeing are due to just random chance.
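You can reproduce that number yourself in a few lines of standard-library Python. This sketch assumes the calculator runs a one-sided z-test on the two conversion rates, which matches the 0.7348 figure — other calculators report a two-sided p-value and would show something different:

```python
from statistics import NormalDist

n = 750                        # visitors per version
p_a, p_b = 72 / n, 65 / n      # conversion rates: A = 9.6%, B ~= 8.67%

# standard error of the difference between the two rates
se = (p_a * (1 - p_a) / n + p_b * (1 - p_b) / n) ** 0.5
z = (p_b - p_a) / se           # negative: B converted *worse* than A

# one-sided p-value for "B beats A"
p_value = 1 - NormalDist().cdf(z)
print(round(p_value, 4))       # ~0.7348
```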
Next, let's look at the observed power. (Remember, our cutoff for power should be 80% or higher, right?)
Well, the power for this test is about 16%, so this is a brutally underpowered test. If this test was a picture, it would have come from a horrible-quality camera — like a Kodak disposable camera that you'd get at a wedding back in the 80s.
So the client should not be making any assumptions or any conclusions from this test — let alone assuming that your copy is bad. They need more pixels, AKA more visitors, to get a better sense of what's actually going on.
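The calculator's ~16% power figure can also be approximately reproduced. This sketch assumes a one-sided 5% cutoff and uses the observed effect size as if it were the true effect (so-called "observed" or post-hoc power); the exact number depends on the calculator's formula:

```python
from statistics import NormalDist

n = 750
p_a, p_b = 72 / n, 65 / n

se = (p_a * (1 - p_a) / n + p_b * (1 - p_b) / n) ** 0.5
z_obs = abs(p_a - p_b) / se      # how far the observed "smudge" sticks out

nd = NormalDist()
z_crit = nd.inv_cdf(1 - 0.05)    # one-sided significance cutoff
observed_power = nd.cdf(z_obs - z_crit)
print(round(observed_power, 2))  # ~0.15, near the calculator's ~16%
```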
How to set a minimum sample size for your A/B tests
Now, let's say you tell your client that they need to run this test longer and they say, "well, how long should I run this test?"
You can go to a good sample size calculator for A/B tests to calculate this.
We know from the numbers we just plugged into the previous calculator that the baseline conversion rate we were dealing with — 72 out of 750 for Version A — comes out to 72/750 = 9.6%.
So we can plug that into the baseline conversion field of our sample size calculator.
Next: What is the smallest amount of change — or the greatest amount of "detail," in photography terms — that we want this test to be able to detect?
We can estimate that from the amount of lift that our client saw in their test, which was -9.72%.
Once you plug in the numbers, a sample-size calculator will just straight-up TELL you, right off the bat, what your sample size should be.
Our client only included about 750 users per treatment. But when you actually run the numbers with a calculator, it tells you that the minimum number of visitors you should have per treatment to be able to detect an effect as small as +/-9.72% is almost 16,000 visitors per treatment.
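Here's the same calculation as a standard-library Python sketch, using the textbook sample-size formula. Calculators use slightly different variants of this formula, which is why this version lands near 15,000 while the calculator in the post reports closer to 16,000 — the ballpark, and the conclusion, are the same:

```python
import math
from statistics import NormalDist

baseline = 72 / 750                  # 9.6% conversion on version A
mde = 0.0972 * baseline              # smallest effect worth detecting (~0.93 points)
p2 = baseline - mde                  # the observed change was a decrease

nd = NormalDist()
z_alpha = nd.inv_cdf(1 - 0.05 / 2)   # 95% confidence, two-sided
z_power = nd.inv_cdf(0.80)           # 80% power

variance = baseline * (1 - baseline) + p2 * (1 - p2)
n = math.ceil((z_alpha + z_power) ** 2 * variance / mde ** 2)
print(n)                             # ~15,000 visitors per treatment
```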
That's a LOT more than what they collected!
In conclusion ...
I hope this helps you guys understand the concept of statistical significance, confidence, power, and all that confusing stuff, and how you can use these terms to evaluate your [or your clients'] test results.
[ For the record, I love the illustrative power of a good analogy, but this is probably the most laboured metaphor I've ever come up with to explain something. So I'm curious to know what real, professional number-crunchers (statisticians, data scientists, CROs) think of it. Have something to add? Add your thoughts below! ]