
The four different benefits of different test outcomes

Posted June 22, 2024 by Will Critchlow

Long-time readers will be unsurprised to see me writing once again about a 2x2 that helps me think through a particularly gnarly problem. In this case, the question I’m thinking about is one that comes up a lot in our discussions with prospective customers in particular: what is the true value of an experimentation program?

Fully answering this question is a huge endeavour that needs a lot of business-specific data and insights. What I’m hoping to do today is flesh out a framework for how I’ve been thinking about this.

The line of thinking began with a discussion triggered by this post - and in particular the line:

For example, you may have collision car insurance but have had no accidents over the past year. What was the value of the collision insurance, zero? You sure? The value of insurance isn’t equal to the amount that ultimately gets paid out. Insurance doesn’t work like that

I don’t love the insurance analogy in the case where harm is (or would be) done - insurance typically pays out cash compensation, whereas a testing program can actually prevent the harm from occurring in the first place. In that case, I think a testing program is more like a seatbelt than insurance. It’s true, though, that a positive test result on a change you wanted to make anyway is a bit like unused insurance. You believed it was going to be positive; you tested it and would have caught it if it had been negative, but sure enough, it turned out how you thought it would.

The value of this is not zero, but it’s also not the full story.

The value of testing changes you would have made anyway

I have previously fallen into the trap of considering the value of tests in three buckets:

  1. Positive tests (value is the uplift)
  2. Inconclusive tests (value is the effort saved in not having to maintain a feature)
  3. Negative tests (value is the negative impact avoided)


In thinking about the insurance analogy, I realised that the other axis is critical - are we talking here about testing a change that you would or wouldn’t have made (blindly!) in the absence of the ability to test it?

Setting aside the value of inconclusive tests for now (we’ll return to them later), we end up considering four kinds of test - positive tests and negative tests in each of two categories:

  • Testing a change we would have made anyway if we were unable to run experiments
  • Running an experiment searching for uplifts - changes we would not have made blindly if we were unable to test



First, we are considering the right-hand side of the matrix - the changes we would roll out even if we couldn’t test them:


Unclaimed insurance value of expected positive tests

The expected positive tests are the positive results you get on changes you would have made anyway, even if you were unable to run any tests. As described above, there is some value here - but it’s akin to the value of having had insurance during a period when you didn’t need to make a claim.


Seatbelts have higher value in a crash - unexpected negative tests

The real value on the right-hand side of the 2x2 comes in the bottom corner. When we discover that something we intended to do would have a negative impact, we have gained something of real value. I characterise this as being like the value of a seatbelt in a crash - reducing the damage to the thing we care about the most (hopefully to zero).


Tests have a cost to run, and even just running a negative test can temporarily have negative impacts, but like our analogous seatbelt, hopefully we get to minimise those impacts.




The crucial realisation here is that we should only really count the “seatbelt” value in the straightforward calculation of “damage avoided” for those changes on the right-hand side of the matrix - the ones we were going to roll out (blindly) if we weren’t able to test them. Those are the real damages mitigated. If we are just testing something while hunting for value - testing things we wouldn’t naively roll out - then a negative test doesn’t have damage limitation value in the same way. This leads us, though, to figuring out the value of the left-hand side of the matrix.


The value of “hunting” for positive changes

The left-hand side of the matrix consists of tests of changes that are more speculative. These are the changes that we would not think to roll out in a world without testing. Typically they are either deliberately risky with high variance, they come without a strong enough hypothesis to explain why we’d have high confidence in them, or we assess them to be unlikely to have a major impact.


Panning for gold - the most-commonly-cited benefit to a testing program - unexpected winners

When we hit an unexpected winner, we really have cause to celebrate. Winning tests are clearly what everyone is hoping to see every time they hit “publish” on a test, but we should really only include the raw % uplift in our ROI calculations when we find that all-too-rare example of a winner that we weren’t confident would be positive.


There’s nothing too complicated going on here - we panned, and we found gold - and these wins are the lifeblood of any experimentation program. If we focus on counting the intersection of positive and unexpected tests then this is totally valid and legitimate.
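As a rough sketch of the two counting rules so far: raw uplift only counts for winners we wouldn’t have shipped blindly, and “damage avoided” only counts for losers we would have. The `ExperimentResult` record, its field names, and the idea of simply summing percentage effects are all assumptions made for illustration, not a real ROI model.

```python
from dataclasses import dataclass


@dataclass
class ExperimentResult:
    name: str
    effect_pct: float        # measured change in percent; negative means a drop (hypothetical field)
    would_ship_blind: bool   # True = right-hand side: we'd have rolled it out without testing


def directly_countable_value(results):
    """Return (uplift_found_pct, damage_avoided_pct) for a list of results.

    Assumption for illustration only: effects are simply summed.
    """
    # Raw uplift counts only for winners we would NOT have shipped blindly.
    uplift_found = sum(
        r.effect_pct for r in results
        if r.effect_pct > 0 and not r.would_ship_blind
    )
    # Damage avoided counts only for losers we WOULD have shipped blindly.
    damage_avoided = sum(
        -r.effect_pct for r in results
        if r.effect_pct < 0 and r.would_ship_blind
    )
    return uplift_found, damage_avoided


results = [
    ExperimentResult("expected winner", 4.0, True),       # unclaimed insurance
    ExperimentResult("caught a bad change", -3.0, True),  # seatbelt
    ExperimentResult("speculative winner", 6.0, False),   # panning for gold
    ExperimentResult("speculative loser", -2.0, False),   # no direct value counted here
]
print(directly_countable_value(results))
```

With these made-up numbers, only the speculative winner contributes to uplift (6.0) and only the caught bad change contributes to damage avoided (3.0). The expected winner and the speculative loser add nothing to either total, which is exactly why they need a different way of thinking about their value.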




The value of a safety net is not (just!) in injuries avoided - failing to find an unexpected winner is fine

I labelled the benefit of the bottom left quadrant as being the value of a safety net. This is, at first glance, similar to the value of a seatbelt: something goes wrong, and it saves you from (serious) injury. The distinction is deliberate, though. The goal of a seatbelt is to save you if something goes wrong while you’re going about your normal life, whereas a safety net is more explicitly there to enable you to act differently.


I recently went to see a wonderful performance by Cirque du Soleil. The grand finale was a breathtaking flying trapeze act - with a safety net. Few of their acts leading up to this point had visible safety apparatus, and it got me wondering: how much better might their most spectacular tricks be if they allowed themselves the possibility that, on any given night, there was a realistic chance of needing the safety net we could see in front of us?


Just as I had this thought, one of the acrobats lost his grip after a flying leap, and tumbled into the safety net. Uncowed, he brushed himself off and climbed back up to the platform to continue. That answered my question: they were pushing so close to the limits that on any given night, they might step over them.


The best way of thinking about the value of that safety net was not that it prevented serious injury or death - if there was no net there, they’d have stuck to safer tricks - it was that its presence allowed the performance of tricks that would be unacceptably risky if it wasn’t there.



This is the role our experimentation program plays in the bottom left quadrant: if you aren’t testing tons of things that are likely to fail, you aren’t being bold enough, and aren’t going to reap the rewards of the risky play that pays off.


Cirque du Soleil isn’t slightly better because they are doing tricks right near their limits - they are way better - and the rewards are non-linear because the best performer can outstrip the market.




So we shouldn’t think of the value of these negative tests on changes we weren’t going to roll out anyway in terms of costs, or drops avoided, but rather in terms of unlocking the potential of the whole left-hand side of the matrix: we know that web experiments typically find rare but often large uplifts, and we can only capture them if we are bold enough to experiment with potential failures - in other words, if we are confident in our safety net.


What about inconclusive tests?

There is value to inconclusive tests - which is good, because they are often the majority of test results! - but they mainly confuse the analogies above, so I set them to one side.


I think that the clearest way to consider them in this framework is to realise that on the left-hand side of our chart, they are simply an extension of the bottom left quadrant - the safety net - because without positive evidence that these things are a good idea, you weren’t going to do them anyway, so the only benefit they bring is the ability to be more adventurous.


On the right-hand side, the picture is slightly more complicated - sometimes you have reason to want to roll out inconclusive changes in this half because they’re things you want to do for other reasons, and all you’re testing for is to avoid really bad outcomes (the seatbelt scenario).




For some changes, though, you really only want them if they’re positive, and being inconclusive on this side is similar to (though less bad than) finding that something you thought was a sure winner was in fact negative. In this case, since it’s not preventing a seriously bad outcome, but just protecting you from smaller problems, I’ve labelled it the “umbrella” zone. You carried an umbrella hoping it wasn’t going to be needed, but it rained, so you stayed dry. This is the benefit of having “avoided having to maintain a feature that wasn’t going to bring provable performance improvements”.
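Pulling the whole framework together, here is a minimal sketch of how each combination of “would we have shipped it blindly?” and test outcome maps to one of the analogies. The function name, the `Outcome` enum, and the `wanted_regardless` flag are my own illustrative choices, and the label for an inconclusive change you roll out anyway is a placeholder, since I haven’t given that cell its own name above.

```python
from enum import Enum


class Outcome(Enum):
    POSITIVE = "positive"
    INCONCLUSIVE = "inconclusive"
    NEGATIVE = "negative"


def benefit_label(would_ship_blind: bool, outcome: Outcome,
                  wanted_regardless: bool = False) -> str:
    """Map a test to the analogy used for its benefit.

    would_ship_blind: True for the right-hand side of the matrix.
    wanted_regardless: only matters for inconclusive right-hand tests;
        True if we'd roll the change out for other reasons (hypothetical flag).
    """
    if not would_ship_blind:
        # Left-hand side: speculative changes we are "hunting" with.
        if outcome is Outcome.POSITIVE:
            return "panning for gold"
        # Negative and inconclusive results both sit under the safety net.
        return "safety net"

    # Right-hand side: changes we would have made anyway.
    if outcome is Outcome.POSITIVE:
        return "unclaimed insurance"
    if outcome is Outcome.NEGATIVE:
        return "seatbelt"
    if wanted_regardless:
        # Placeholder label: we were only testing to rule out a really bad outcome.
        return "roll out anyway (seatbelt not needed)"
    return "umbrella"


print(benefit_label(True, Outcome.INCONCLUSIVE))
print(benefit_label(False, Outcome.INCONCLUSIVE))
```

The first call prints “umbrella” and the second prints “safety net”: the same inconclusive result means different things depending on which side of the matrix the change started on.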




Want to quantify all of this?

Me too! If you work on a large website suitable for SEO testing and you’re interested in developing the ideas here for your specific business, and want to quantify the commercial benefits of extending an experimentation program to cover SEO, get in touch. Similarly, if you’re sold on the value of an experimentation program and just want help figuring out how to incorporate SEO testing, you should see the SearchPilot platform in action.

