SEO A/B testing: realizing the value of a good experimentation program

Here at SearchPilot, we’ve spent the last decade developing and championing true SEO A/B testing, which groups pages, not users. This kind of testing helps you deliver ROI-attributable SEO results to your business. And we know the SEO community has been paying attention to these game-changing results.

Getting experts on board with true SEO testing is a great start. But it’s not the end of the story. Today, we want to explore the next step in the SEO testing maturity curve. We want to talk about going from running a series of tests to setting up and managing an experimentation program.

When you build a program of tests, each built on lessons learned from the last, you can make more intelligent hypotheses and see more consistent results. What’s more, you can more effectively report the activity, impact, and ROI of your SEO testing efforts to the business and get them on board, too.

In this guide, we’ll look at what an experimentation program includes, how to set goals, and how to develop good test ideas. We’ll also think about the value of all results (positive, negative, and inconclusive) and how they feed into an evolving experimentation program.

Strap in, and let’s level up your SEO testing.

What a good experimentation program looks like

An experimentation program is a structured approach to hypothesis testing. It includes your entire portfolio of completed tests and the ROI those tests have delivered.

When we talk about ROI, we aren’t only talking about uplifts in organic traffic sessions. With robust test analysis, you can also report on:

An extrapolation of organic traffic impact to revenue generated.
Learnings about what works and doesn’t for your website.
How much you have de-risked by not rolling out negative changes.
How much time and money you have saved your engineering team by only deploying winning changes.
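As a rough illustration of the first and third items, the sketch below extrapolates a measured change in organic sessions to yearly revenue, and uses the same arithmetic to estimate the loss avoided by not shipping a negative change. The function name, the revenue-per-session figure, and all sample numbers are hypothetical and chosen only for illustration; a real model would also need to account for seasonality and the uncertainty around the measured uplift.

```python
def annual_revenue_impact(baseline_monthly_sessions: float,
                          uplift_fraction: float,
                          revenue_per_session: float) -> float:
    """Extrapolate a relative change in organic sessions to yearly revenue.

    uplift_fraction is the measured relative change (0.05 means +5%,
    -0.03 means -3%). The result is negative for a harmful change.
    """
    extra_sessions_per_month = baseline_monthly_sessions * uplift_fraction
    return extra_sessions_per_month * 12 * revenue_per_session


# Hypothetical inputs: 200,000 organic sessions a month at $0.50 per session.
winner_gain = annual_revenue_impact(200_000, 0.05, 0.50)

# A losing test that was never rolled out: the avoided loss is the
# negative impact with its sign flipped.
avoided_loss = -annual_revenue_impact(200_000, -0.03, 0.50)

print(f"Revenue from the winning change: ${winner_gain:,.0f} per year")
print(f"Loss avoided by not deploying the loser: ${avoided_loss:,.0f} per year")
```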

The benefits of being able to demonstrate this bigger picture are clear. When you can see precisely what does and doesn’t work to boost your website’s organic traffic, you can fine-tune your website into a consistent conversion machine.

You can also more effectively communicate to senior stakeholders how your work and investment are paying off.

Getting to this kind of insight doesn’t happen overnight, though. It takes time and commitment. It’s about gradually building and learning. So, let’s look first at the essentials you need to start.

A note on the groundwork

As we mentioned, developing an SEO experimentation program relies on a foundation of SEO testing within your organization. It also requires effective tools for running and analyzing your tests.

Our SearchPilot platform is designed to help you run server-side SEO A/B tests. This means it allows you to break up groups of statistically similar pages into control (no change on the page) and variant (some change or changes on the page) ‘buckets.’ You can then run your tests and, from the dashboard, see when and if you get statistically significant organic traffic differences between those groups.
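To make the idea of bucketing pages (rather than users) concrete, here is a minimal sketch. It assumes that historical organic sessions are a reasonable proxy for "statistically similar" and pairs pages of similar traffic before randomly splitting each pair. This is a deliberate simplification for illustration, not a description of how the SearchPilot platform assigns buckets; the function name and sample URLs are hypothetical.

```python
import random


def bucket_pages(page_sessions: dict[str, int], seed: int = 0) -> dict[str, str]:
    """Assign each page to 'control' or 'variant' with similar traffic per bucket.

    Pages are ranked by historical sessions and taken in adjacent pairs;
    within each pair, one page goes to each bucket at random.
    """
    rng = random.Random(seed)  # seeded so the assignment is reproducible
    ranked = sorted(page_sessions, key=page_sessions.get, reverse=True)
    assignment: dict[str, str] = {}
    for i in range(0, len(ranked), 2):
        pair = ranked[i:i + 2]
        labels = ["control", "variant"]
        rng.shuffle(labels)  # random side for each page in the pair
        for page, label in zip(pair, labels):
            assignment[page] = label  # an unpaired last page gets one random label
    return assignment


# Hypothetical product pages with their historical monthly sessions.
pages = {f"/product/{i}": 1000 - 10 * i for i in range(10)}
buckets = bucket_pages(pages, seed=42)
for page, bucket in buckets.items():
    print(page, bucket)
```

Because every user and every crawler sees the same version of a given page, the split is by URL, which is what makes the comparison meaningful for organic search.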

Getting a tool like this set up and your tests running is step one. It helps you to start engaging with the data and spotting patterns in your tests. It also enables you to work through the backlog of ‘obvious’ hypotheses that most SEO teams have and are desperate to test if only they had the tools to do so.

Getting this foundation in place means you can:

Design experiments.
Run experiments.
Analyze the data and either roll out winning changes or strike off the list anything that will have negative or no impact.

The crucial extra steps of an experimentation program

The heartbeat steps of running effective tests remain the same in an experimentation program. However, they are bookended by additional steps that elevate your testing efforts and deliver greater ROI.

Before you start designing your experiments, you need to:

Define strategic goals and set benchmarks for your program.
Develop more robust hypotheses.

There’s also more to do once the tests have run. Unlike the piecemeal nature of individual tests, an experimentation program is iterative and continuous. All tests, successful or otherwise, feed into a more informed and data-driven SEO approach.

So, once your tests have run, you need to:

Draw conclusions and define potential iterations. What happened? What did you learn? What could you change and test for next time?
Report on the details, results, and conclusions of your tests. Keeping this record helps you review progress against your goals and effectively report to senior stakeholders.

So, let’s dig into these differentiators.

Step one: Setting strategic goals and benchmarks

The point of an experimentation program is that you don’t place all the pressure on individual test results. Everyone wants a win, of course, but as your testing culture matures, redefining what it means to ‘win’ is important.

For most senior stakeholders, the ROI of any testing program is what matters. They want to understand how what you are doing is supporting the revenue of the business.

As we said before when we explored the value of losing SEO tests:

Ensuring you never again roll out a negative change keeps you ahead of the competition. And losing tests highlight dead-end testing, meaning fewer tickets to your engineering team. (And those that do have data-backed business cases.)

This means your most important strategic measure is your testing cadence, i.e., how many tests you want to run per month or over a year. It’s also the number you have the most direct control over.

If you talk about your testing program's cadence and scale first, followed by overall traffic impact, your scope for demonstrating business impact significantly broadens. These factors impact your ability to ensure consistent traffic and retain competitive advantage (rather than short-term boosts).

A note on win-rate

Sometimes teams approach us wanting to target a particular win-rate. It makes sense that stakeholders want to see positive results from an investment. But it’s not the be-all and end-all.

The win-rate benchmarks we’ve observed across all the SEO tests we ran in 2022-2023 might surprise you.

Nearly 75% of tests are inconclusive.
7-8% of tests are negative and statistically significant.
About 15% of all tests run are positive and statistically significant.

That 15% is our standard benchmark. If your results consistently meet that level, you’re doing great. It might seem like a low number, but the point is that win-rate is only a leading indicator for your program, not its strategic measure of success.

All results provide information, whether positive, negative, or inconclusive. Even if only 15% of tests provide positive results, 100% are useful. That said, win-rate can be a leading indicator of the quality of your hypotheses, which leads us to our next step.
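If you want to compare your own program against these benchmarks, one way is to classify each finished test and compute the share of each outcome. The sketch below assumes each test result is summarized as a confidence interval on the relative traffic change, and that a test counts as statistically significant when that interval excludes zero. Both the representation and the sample intervals are assumptions made for illustration.

```python
from collections import Counter


def classify(ci_low: float, ci_high: float) -> str:
    """Label a test from the confidence interval of its traffic change."""
    if ci_low > 0:
        return "positive"      # whole interval above zero
    if ci_high < 0:
        return "negative"      # whole interval below zero
    return "inconclusive"      # interval includes zero


def outcome_shares(intervals: list[tuple[float, float]]) -> dict[str, float]:
    """Return the fraction of tests that fall into each outcome."""
    outcomes = ("positive", "negative", "inconclusive")
    if not intervals:
        return {name: 0.0 for name in outcomes}
    counts = Counter(classify(low, high) for low, high in intervals)
    return {name: counts[name] / len(intervals) for name in outcomes}


# Hypothetical results: (lower bound, upper bound) of the relative change.
results = [(0.01, 0.08), (-0.02, 0.03), (-0.03, 0.04), (-0.05, -0.01)]
print(outcome_shares(results))
```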

Step two: Developing more advanced hypotheses

Every SEO test needs to start with a hypothesis. It’s a crucial component of the scientific method that defines what you will test and what a successful result will look like.

For example, a hypothesis for your next experiment might look like this:

‘Changing X on Y pages will deliver an uplift in organic traffic by improving the existing rankings of our core product keywords.’

While that might seem straightforward, be warned: not all hypotheses are equally valuable. A weak hypothesis leads to poor experiments. Even if the result is positive, you can’t draw reliable conclusions on which to form permanent changes. So, you don’t gain any useful insights.

The benefits of a strong hypothesis, on the other hand, are:

Clear follow-up actions for any given outcome of a test.
Better prioritization of the tests that are most likely to have a significant impact.
Clear expectation setting for all stakeholders of the likely outcome of a test.

So, how do you develop good, strong hypotheses?

The foundations of a good hypothesis

Every change that has the goal of improving organic traffic involves one or more of three core levers:

Improving existing rankings.
Ranking for more keywords.
Improving organic click-through rates (CTRs), independent of improving rankings.

Basing your hypothesis on at least one of these levers will generally produce a strong hypothesis. We created this simple flowchart to ensure an idea is a hypothesis:

[Figure: flowchart for checking whether an SEO idea is a hypothesis]
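One lightweight way to enforce the "at least one lever" rule in a test backlog is to record each idea in a structure that cannot produce a hypothesis statement without a change, a page group, and a lever. The sketch below is an illustrative data model only; the class and field names are hypothetical, and the statement wording follows the example hypothesis given earlier.

```python
from dataclasses import dataclass, field
from enum import Enum


class Lever(Enum):
    IMPROVE_EXISTING_RANKINGS = "improving existing rankings"
    RANK_FOR_MORE_KEYWORDS = "ranking for more keywords"
    IMPROVE_CTR = "improving organic click-through rates"


@dataclass
class Hypothesis:
    change: str                      # what will be changed
    pages: str                       # which group of pages it applies to
    levers: set[Lever] = field(default_factory=set)

    def is_testable(self) -> bool:
        """True only if a change, a page group, and at least one lever are given."""
        return bool(self.change.strip() and self.pages.strip() and self.levers)

    def statement(self) -> str:
        """Render the hypothesis as a sentence; refuse if it names no lever."""
        if not self.is_testable():
            raise ValueError("Not yet a hypothesis: name the change, pages, and a lever.")
        how = " and ".join(sorted(lever.value for lever in self.levers))
        return (f"{self.change} on {self.pages} will deliver an uplift "
                f"in organic traffic by {how}.")


idea = Hypothesis("Adding FAQ content", "category pages",
                  {Lever.RANK_FOR_MORE_KEYWORDS})
print(idea.statement())
```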

When starting a testing program, teams often want to work through some of the more ‘basic’ hypotheses around page titles, descriptions, or on-page content. And these are important tests. But they don’t always reveal the full value of SEO testing.

At this point, you want to start diving deeper into what you can test with robust server-side tools, like SearchPilot. For example, you might look at page layout, structured schema, or in-depth internal linking. Many of our successful customers spend their first year working with our professional services team testing and gradually exploring increasingly complex possibilities, which offer greater value.

But our SEO experts are only half the story. Our customers’ knowledge is critical in developing valuable hypotheses.

You know your industry best

SEO split testing is a scientific endeavor, but that doesn’t mean you should underestimate the value of a hunch. You know your industry, your site, and your customers a lot better than anyone else (including us). This makes your involvement in hypothesis development vital.

Ultimately, it’s your business. You’re in the driver’s seat. You will already have a sense of what is likely to work well or not. So, if you believe a website change has value, it’s probably worth testing.

You can then craft those insightful hunches into carefully designed hypotheses and tests. From there, an experimentation program enables you to iterate on those ideas and refine them in a rigorous and meaningful way.

Step three: Draw conclusions and define potential iterations

Interpreting your results goes beyond win or lose, even for one-off tests. But in a testing program, it’s even more important to understand the value and learnings you can take from every test. Let’s dig deeper.

Winning by losing

A failed experiment is still worthwhile. It stops you from rolling out a change that could harm your traffic and revenue. That alone is important. Beyond this, it can also help you refine your hypothesis. Does the result point to an alternative hypothesis? Or maybe you already knew a negative impact was a risk from the hypothesis and it simply hasn’t paid off. In this case, the fact you were able to test and measure that impact means you have data to feed into future ideation and research.

What feels like wasted effort often acts as the first step towards a more effective and powerful hypothesis—one that could not have been discovered without first running the ‘failed’ test.

Insights from the inconclusive

So, learnings from wins and losses are probably the easiest to understand. But what about the roughly 75% of tests that, on average, are inconclusive? That’s a considerable amount of investment and time to spend to seemingly get nothing back. But here’s the thing: inconclusive doesn’t mean irrelevant.

Like losing tests, spotting that a change you thought was a dead cert has almost no impact can be a sign to iterate and refine that hypothesis. Why might it not be working? What’s a different way to influence the same page element? This is where ‘inconclusive’ is just the start of an impact story.

The other thing to understand about inconclusive tests is that they are not a win or a loss, nor do they mean ‘no effect’. Digging into the nuance of a test and understanding its role in your broader program can still give you the information you need to roll out a change and get results.

Taking a business approach to risk tolerance

For every experiment, you need to define your risk tolerance.

In the scientific world, defining and proving statistical significance is incredibly important. And because SearchPilot’s platform is built on scientific principles, the results you get follow the same rigor. However, as we’ve said before, we’re doing business, not science. And that means:

“Our default approach is to seek the greatest business benefit for our customers rather than the greatest certainty.”

And this comes to life most critically when helping you decide what risk tolerance to accept and how to interpret the results you get. For example, you might accept that for a strong hypothesis based on a small, easy-to-implement change, a 90% confidence level is sufficient to roll out the change.

You might also decide it's still worth rolling out a simple change that doesn’t appear to be significantly changing the traffic. You might only see a marginal gain, too slight to detect on a test, but cumulatively, across the site, it still impacts the bottom line.

In interpreting inconclusive tests, we follow this four-quadrant approach. The stronger your hypothesis and the simpler the change, the lower the risk of rolling out a change from an inconclusive test:

[Figure: 2x2 matrix for interpreting inconclusive tests, by strength of hypothesis and complexity of the change]

While you might be tempted to roll back any change that isn’t positive, ultimately, that could lead to you missing out on crucial marginal gains against your competitors. It also stops you from evolving and creating new baselines for further testing and hypotheses. You aren’t aiming for a single state of perfection. You’re striving for constant iteration and improvement.
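A team could write its risk tolerance down as a small lookup table so that rollout decisions are applied consistently. The sketch below is one possible encoding, not the contents of the matrix above: it assumes each test yields an estimated confidence that the change is positive, and every threshold is a placeholder to be set by your own team. Only the 0.90 entry echoes the example given earlier in this section.

```python
# Confidence required before rolling out, keyed by
# (strong_hypothesis, simple_change). All values are placeholders;
# only the 0.90 entry mirrors the example in the text.
REQUIRED_CONFIDENCE = {
    (True, True): 0.90,    # strong hypothesis, easy change: lowest bar
    (True, False): 0.95,   # assumed
    (False, True): 0.95,   # assumed
    (False, False): 0.99,  # weak hypothesis, costly change: highest bar
}


def should_roll_out(confidence_positive: float,
                    strong_hypothesis: bool,
                    simple_change: bool,
                    thresholds: dict = REQUIRED_CONFIDENCE) -> bool:
    """Return True if the result clears the bar for its quadrant."""
    return confidence_positive >= thresholds[(strong_hypothesis, simple_change)]


# The same 92%-confidence result leads to different decisions by quadrant.
print(should_roll_out(0.92, strong_hypothesis=True, simple_change=True))
print(should_roll_out(0.92, strong_hypothesis=False, simple_change=False))
```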

Step four: Reporting on your success

Finally, you need to be able to track and report on this cumulative experimentation program. And you need to be able to communicate it effectively to stakeholders.

Establishing a business culture that values experimentation can be tricky, especially in the face of some inevitable negative and inconclusive tests. The frustrating reality is that you can do an excellent job in every aspect of your testing and still not exceed a 15% win-rate.

In fact, it’s hard for market leaders to do better; changing things can knock them off the top of the perch. As your testing maturity increases and your site moves closer to ‘perfect’, win-rates actually tend to drop. This is because it gets harder to come up with an iteration that beats the status quo. In these cases, the value of testing is to avoid loss more than it is to gain wins.

We’ll repeat this point because it’s so important: all tests are valuable. A successful experimentation program will combine quality AND quantity.

Quality: strong hypotheses and running an effective testing process.

Quantity: running the highest volume of quality tests you possibly can.

So, when it comes to reporting to stakeholders, return to your strategic goals and benchmarks and:

Keep everyone accountable to the testing cadence goals.
Benchmark win rates to make sure hypotheses and test setup are strong.
Measure and report on ROI/revenue numbers where possible.

Championing your program

Lessons learned from one experiment inform the design and hypothesis of the next. They can even provide valuable insight to other teams in the organization. Over time, this cycle of learning and adapting to changing SEO landscapes can drive significant innovation and improvement.

And these are the stories to tell. We can help you translate wins/losses into revenue numbers; for example, you might have a $/session value. Together, we can also help you showcase stories and learnings from iterations of an experiment concept that finally saw a winner.

The emphasis is on deeper understanding, continued growth, and a lack of unsightly dips in traffic that in the past might have been solved by scrambling an already-stretched engineering team to roll back an untested change.

Taking it one step further: Integrating CRO testing

In our view, CRO-only testing, in which you randomly assign cookies to users and show them a landing page corresponding to that cookie, is limited. Googlebot does not accept cookies and will always see the same version of every page, which means CRO tests give no insight into SEO changes. It’s almost certain that some inconclusive CRO tests would have been SEO winners, and that some seemingly straightforward CRO winners harm SEO performance.

SEO and CRO testing work better together. Our full-funnel testing allows you to measure the impact of SEO and CRO changes at the same time.

How does it work? We run a CRO test, where users are randomly bucketed. Once we have a result in that test, we end the CRO test, bucket pages as per our normal SEO testing methodology, and run the SEO test, during which we can monitor the combined impact.

CRO testing. Users are randomly bucketed, regardless of the page they land on, to test the impact of a change on conversion rate.
SEO testing. Once you have a result from the CRO test, we end the test. The pages are then split into control and variant groups as per our normal SEO testing methodology, and we run the SEO test.
During the SEO test, we can then monitor the combined impact on the two metrics.

The result? You can see the impact of SEO and CRO simultaneously and use this knowledge to inform future hypotheses and strategies.
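To see why the combined view matters, consider the arithmetic. If organic conversions are roughly organic sessions multiplied by conversion rate, the two relative changes multiply rather than add. The sketch below works through a hypothetical case under that assumption; the function name and the numbers are illustrative, not results from any real test.

```python
def combined_impact(traffic_change: float, conversion_rate_change: float) -> float:
    """Relative change in organic conversions from both effects together.

    Assumes conversions = sessions x conversion rate, so the two
    relative changes compound multiplicatively.
    """
    return (1 + traffic_change) * (1 + conversion_rate_change) - 1


# Hypothetical: a change lifts conversion rate by 4% but costs 6% of organic traffic.
net = combined_impact(traffic_change=-0.06, conversion_rate_change=0.04)
print(f"Net change in organic conversions: {net:+.2%}")
```

In this made-up case, a change that looks like a CRO winner in isolation would reduce organic conversions overall once its traffic cost is included.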

This will lay the foundations for more successful and targeted testing. It will also enable greater collaboration and understanding between departments. Your SEO team can test changes without fear of harming conversion rates, while your product team can deploy UX changes with a greater understanding of organic search impact.

Let’s experiment

Your experimentation program can and should be unique to your business. We can work with you to construct your approach from the ground up, from hypotheses to final reporting.

Book a meeting with one of our experts to discover how an experimentation program can accelerate your SEO progress and boost your bottom line.
