
Do it Yourself SEO Split Testing Tool With Causal Impact

Dominic Woodman

7 min read

SearchPilot enables SEO A/B testing on large and enterprise websites. If you’re just starting to experiment with SEO testing, you might want to play with a more basic mathematical approach first. Before we moved to a neural network model, we used an approach based on a modified version of causal impact. We have a free DIY tool that uses the causal impact approach, and you can try it yourself with your own data. If you want, you can jump to get the tool here. This post will walk you through how to use it, but before we do that, let’s jump back a step:

What is an SEO split test?

An SEO split test is where you make a change to a subset of the pages that share a particular page template, to see how those pages perform compared to the pages you left unchanged.

For example, you might change the title tags on 50% of your product pages and see how they perform compared to the other half.

This is different from a CRO split test, where you show users different versions of the SAME page.

Time for some quick definitions

  • Variant - This is the set of pages with the change. In our example, these are the pages with the altered title tag.
  • Control - This is the set of pages where we made no change.

What do you need to use this tool?

You’ll need to have run (or be running) an SEO split test (see ‘How does A/B testing for SEO work?’ in this post for help). You’ll need to have separated your pages into two groups, made a change to a percentage of them and then downloaded the total organic traffic to each of the two buckets.

(One way to do this with GA is to send a hit-level custom dimension which contains “control” or “variant” and then measure organic entrances. You could also do this by downloading the data for each of the individual pages and then matching them to your control and variant buckets.)
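If you go down the second route, the matching step could look something like the sketch below. It assumes you have a page-level export of (date, page, organic sessions) rows and your own lookup of which bucket each page belongs to. The URLs, numbers and the `daily_totals` helper are all made up for illustration and are not part of the tool.

```python
from collections import defaultdict

# Hypothetical page-level export: (date, page URL, organic sessions).
page_rows = [
    ("2020-01-01", "/product/a", 120),
    ("2020-01-01", "/product/b", 80),
    ("2020-01-01", "/product/c", 95),
    ("2020-01-02", "/product/a", 130),
    ("2020-01-02", "/product/b", 70),
    ("2020-01-02", "/product/c", 90),
]

# Hypothetical bucket assignment for each page in the test.
bucket_of = {
    "/product/a": "variant",
    "/product/b": "control",
    "/product/c": "control",
}


def daily_totals(rows, buckets):
    """Sum sessions per day for each bucket, skipping pages with no bucket."""
    totals = defaultdict(lambda: {"control": 0, "variant": 0})
    for date, page, sessions in rows:
        bucket = buckets.get(page)
        if bucket is not None:
            totals[date][bucket] += sessions
    # Sort by date so the series is in day-by-day order.
    return dict(sorted(totals.items()))


totals = daily_totals(page_rows, bucket_of)
for date, day in totals.items():
    print(date, day["control"], day["variant"])
```

The two columns this prints per day are the control and variant series you would paste into the tool.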

Specifically what you need is:

  • Total organic entrances (or sessions) day by day for the sum of your control set of pages.
  • Total organic entrances (or sessions) day by day for the sum of your variant set of pages.

For both of these groups, you’ll need 100 days of this data before the test begins, plus however many days your test has been running for.

So if your test has been running for 14 days, you would need 114 days of data.
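As a quick sanity check before pasting anything in, you could confirm that each series has more than 100 days and see where the test period starts. This is only a sketch: `split_periods` is a hypothetical helper and the traffic numbers are invented.

```python
PRE_PERIOD_DAYS = 100  # days of history needed before the test begins


def split_periods(series, pre_days=PRE_PERIOD_DAYS):
    """Split a daily series into the pre-test history and the test period."""
    if len(series) <= pre_days:
        raise ValueError(
            f"Need more than {pre_days} days of data, got {len(series)}"
        )
    return series[:pre_days], series[pre_days:]


# Invented example: 100 days of history plus a 14-day test.
daily_sessions = [4000 + (day % 7) * 10 for day in range(114)]
history, test_period = split_periods(daily_sessions)
print(len(history), len(test_period))
```

Run this on both the control and the variant series. They should come out with the same number of days.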

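Why 100 days of historic data? In short, the model needs a long enough run of pre-test data to learn how your control and variant normally move together, and this is what allows the maths behind it to work correctly.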

Want to see an example data set? Here’s one we’ve put together in a Google sheet. You can then enter this data in the tool.

How do you use it?

You enter the control and variant data into the boxes in the tool, choose the start date for your test and click run.

(The tool knows your test begins 100 days in, so the start date input is purely to set the axis correctly.)

The tool will then plot your variant against the control using the Causal Impact model. The start date will be highlighted on the graph, and you can see how they perform relative to each other.

If the red line is positive, your change was good. If the blue line is higher, then your change was bad.

You can also download the data in a CSV to calculate how much better they perform.

How does this work?

We’re about to enter the wonderful world of maths, so brace yourselves.

This tool uses Google’s Causal Impact model. You can find the academic paper here. There isn’t much written on this for people who aren’t maths inclined, although I think this post was better than some of the others.

It’s a form of regression model and works kind of like this (simplification ahead):

Causal impact lets you break down time series data (data which is day by day) into its component parts, i.e. seasonality, industry effects, and the underlying trend.

You provide causal impact with data to model those effects (seasonality, industry demand etc.) and then it creates the model using those inputs and your time series data. By isolating the other effects, it allows you to see the true performance beneath those.

So how does it work in this case?

Well, our time series data is the variant set. We want to know how that set of pages would have performed if there was no change, so we can use the causal impact model to mimic that.

We provide a variable for time (you never see this) and a control set of data (what you enter) which then helps the model to account for any swings like sales or Google updates which should affect both the control and variant equally. This allows us to isolate and compare the variant and modelled control, which will have accounted for seasonality and site-wide swings.
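To make the idea of a modelled control more concrete, here is a heavily simplified sketch. It is not the Causal Impact model and not what the tool runs: it swaps in a plain least-squares line, fitted on the 100 days before the test, that predicts the variant from the control, and it leaves out the time variable entirely. The function names and all of the numbers are invented for illustration.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def estimate_uplift(control, variant, pre_days=100):
    """Compare actual variant traffic with what the control predicts."""
    # Learn the normal control-to-variant relationship before the test.
    intercept, slope = fit_line(control[:pre_days], variant[:pre_days])
    # Predict what the variant would have done without the change.
    expected = [intercept + slope * c for c in control[pre_days:]]
    actual = variant[pre_days:]
    absolute = sum(actual) - sum(expected)
    return absolute, absolute / sum(expected)


# Invented data: a weekly pattern shared by both groups, with the variant
# 25% larger than the control and a 10% lift once the test starts.
control = [4000 + 200 * (day % 7) for day in range(114)]
variant = [1.25 * c for c in control[:100]] + [1.25 * c * 1.10 for c in control[100:]]

absolute_uplift, relative_uplift = estimate_uplift(control, variant)
print(round(absolute_uplift), round(relative_uplift, 3))
```

Because the weekly swings show up in the control too, they cancel out of the comparison, and what is left is the lift that was built into the invented data.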

Why not just directly compare control and variant? We can’t directly compare them because of possible differences between the variant and control groups. The most obvious example is that the two groups may be different sizes, depending on how the pages got sorted.

For example, your variant may have an average of 5,000 organic sessions a day, whereas your control may only have an average of 4,000 organic sessions a day, so we can’t compare the absolute fluctuations in our two sections.
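To put numbers on that, imagine a site-wide swing that hits both groups by the same percentage. The 10% dip below is an invented figure, used only to show that the absolute changes differ even though nothing about the test caused them.

```python
variant_avg = 5000  # average organic sessions a day in the variant
control_avg = 4000  # average organic sessions a day in the control
dip_percent = 10    # invented site-wide dip hitting both groups equally

# The same relative dip produces different absolute changes.
variant_drop = variant_avg * dip_percent / 100
control_drop = control_avg * dip_percent / 100

print(variant_drop, control_drop, variant_drop - control_drop)
```

Compared on raw sessions, the variant looks like it lost more traffic, even though both groups behaved identically in relative terms.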

There’s more to it than that, but that is the easiest to follow example. Get the tool here.

Why don’t we show statistical significance?

Statistical significance is an important concept. With any kind of statistical modelling there will be a band of error.

This gets a little more complicated however when looking at a prediction over time. If we were just comparing two days we might be able to say that A > B by such an amount that the result is statistically significant.

However, if we’re comparing two time series, then what is important is the performance over time and not a one-off date. If one consistently outperforms the other, then what is important is the total aggregate sessions, not any individual day. All the individual days may be within the margin of error and yet the total can still be significant.

This basically makes day-by-day significance misleading, which is what this graph would show. Instead, you need to calculate significance on total aggregate sessions, i.e. total sessions to control vs total sessions to the variant, which you’ll need to do manually with any standard significance tool. The SearchPilot platform has this kind of functionality, and you’ll see it in our published case studies, but it is not in the free tool.
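Here is a toy illustration of why the aggregate can be significant when no single day is. It assumes each day’s noise is independent and roughly normal with a known spread, which real traffic data will not satisfy exactly, and every number in it is invented. It is a sketch of the principle, not the calculation SearchPilot uses.

```python
import math

daily_noise_sd = 100.0       # assumed day-to-day noise in sessions
daily_uplift = [120.0] * 14  # invented uplift on each of 14 test days
threshold = 1.96             # two-sided 95% cut-off for a z-score

# Judged one day at a time, every day sits inside the margin of error.
per_day_z = [uplift / daily_noise_sd for uplift in daily_uplift]

# Judged in aggregate, the uplift adds up day after day while the noise
# only grows with the square root of the number of days.
total_uplift = sum(daily_uplift)
total_sd = daily_noise_sd * math.sqrt(len(daily_uplift))
total_z = total_uplift / total_sd

print(max(per_day_z), round(total_z, 2))
print(all(z < threshold for z in per_day_z), total_z > threshold)
```

No individual day clears the threshold, but the 14-day total does comfortably.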

Why have we made this tool?

We’re fully bought in on split testing. We think it’s the future of SEO and the way the industry is going. We even built our entire platform around it!

But we also recognise that not everyone can afford large scale enterprise tools, so we wanted to make the basic maths available to everyone and encourage more industry testing.

While the maths here is a simpler version of a mathematical approach we have now moved on from in SearchPilot (we can’t invest the same scale of resources into testing different models in this tool as we can in a full piece of software), we did successfully run SEO tests using a more complex model based on the same underlying causal impact approach.

Anyway, enough waffle. I hope you all find it useful!

Get the tool

We also publish regular SEO A/B testing case studies to our email list - you can register to receive those here:


© 2015-2020 SearchPilot. All rights reserved.