How to Document Your A/B Tests

Documenting A/B and multivariate tests should be a very straightforward exercise, but many folks dread this aspect of running experiments in their organizations. Documentation doesn’t need to be such an onerous task, and following the template presented here will help tremendously with expediting the process. Beyond providing an outline by which to structure test documentation, we will provide guidance on how to create the document in stages that are synchronized with the scientific method, spreading out the labor to make the process feel less overwhelming.

In fact, writing the introductory sections of your test documents upfront helps reinforce the focus of your entire testing program. We have witnessed testing teams at multiple companies discover that, by putting off all documentation until after tests conclude, they had for some period been haplessly testing solutions in search of a problem. (To be fair, some mature test programs are susceptible to falling into this rut when the team has run low on innovative ideas. We will provide advice on how to course-correct in such a situation in a future post.) Getting into the habit of defining the problem upfront helps keep everyone intellectually honest and focused on trying to address the most urgent areas of need with their testing program.

We recommend structuring the test document in a manner that follows the outline below:

  1. Introduction & Background
  2. Problem
  3. Hypothesis
  4. Experimental Design
  5. Results & Observations
  6. Conclusions
  7. Suppositions

As previously mentioned, this outline strongly adheres to the scientific method we all likely encountered at some point in our science curriculum. Below are details on how to populate these sections appropriately to produce sound, insightful, and actionable documentation.

I. Introduction & Background

The purpose of this section of your document is to clearly explain the premise of the test. Quantitatively speaking, what is it about the performance of this population segment that has prompted this particular test? Have any broader situational (economic, regulatory, etc.) factors impacting your entire industry or target audience made this test of strategic importance?

Truly, this portion of the document should be completed in advance of even designing the experiment. You should prove not only to your audience but also to yourself that you can clearly articulate the impetus behind the test prior to coming up with an execution plan.

II. Problem

This may already be stated in the introduction, but the goal of the Problem section is to distill the test objective down into a more concise statement. For example, “Based on [Segment ABC’s] deficiencies in conversion that appear to be in relation to XYZ in the signup experience as evidenced by [data], we will examine the optimal presentation of [Element X] within the hero section of our landing page to determine whether [Element X] itself has a statistically significant impact on conversion rate, and, if so, which exact design treatment of [Element X] produces the best conversion rate performance.”

This does not need to be a lengthy sentence modeled exactly on the example above, but we encourage clearly capturing the intended audience for the test, the element that seems ripe for optimization, the data supporting that assumption, and the questions the experiment aims to answer.

As with the introduction and background, the problem section should ideally be composed in advance of designing the experiment in full detail.

III. Hypothesis

The hypothesis section justifies the choice of test cases being used in the test. The experimenters do not necessarily need to play favorites with any one particular test case if there are multiple test cases challenging the Control; however, it is important to understand why each test case is believed to possibly have an advantage over the Control. Otherwise, someone will question why a decision was made to run one or more test cases that no one involved in the conception of the experiment truly thought would outperform the Control.

As a caveat, there may be scenarios in which you are forced to remove or modify an element in a manner that could be detrimental to conversion rate, and the purpose of the test is to identify the best “damage control.” You aim to (a) find out whether outright removal of the element is truly so detrimental and (b) if so, find the best replacement possible, even if it is only an interim solution until a comparable alternative is found. You are using the experiment to find a new treatment that degrades KPIs the least relative to other available options, and you iterate via subsequent tests from there.

IV. Experimental Design

The experimental design portion of the document should demonstrate sound decision-making in determining the audience segments involved in the test, requiring upfront quantification of what minimum sample size per test case is required to detect a certain magnitude of difference at a desired confidence and power.

This should immediately be translated into an estimated test duration, which is helpful in considering the macro view of the test program and in explaining why a test requiring a large amount of bandwidth over an extensive duration was justified at that time. There are free resources online that can help with this exercise. If you are unable to create or access measurement tools that produce an exact sample size or test duration requirement, you can use historical test performance data on the population segment to approximate test length and sample size. Many A/B testing tools today will also provide near real-time updates to indicate statistically significant differences along with the confidence level. Don’t let perfection be the enemy of progress, as they say, but it will ultimately help the management of your ongoing test program once you have the means of accurately estimating test lengths upfront.
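For teams without a sample size calculator at hand, the minimum sample size per test case can be approximated in a few lines. The sketch below is one common textbook approximation (a two-sided, two-proportion z-test on conversion rate), not the only valid method. The function name and the example inputs (a 5% baseline conversion rate and a 10% relative lift) are hypothetical and chosen purely for illustration.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_test_case(baseline_rate, relative_lift,
                              confidence=0.95, power=0.80):
    """Approximate visitors needed in EACH test case (Control included).

    Assumes a two-sided, two-proportion z-test on conversion rate.
    relative_lift is the smallest relative difference worth detecting,
    e.g. 0.10 for a 10% relative improvement over the baseline.
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    if not (0 < p1 < 1 and 0 < p2 < 1 and p1 != p2):
        raise ValueError("rates must be distinct and strictly between 0 and 1")

    # z-scores for the desired confidence (two-sided) and power
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    z_beta = NormalDist().inv_cdf(power)

    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p2 - p1) ** 2)


# Hypothetical inputs: 5% baseline conversion rate, 10% relative lift
print(sample_size_per_test_case(0.05, 0.10))
```

With these illustrative inputs the requirement comes out to a little over 31,000 visitors per test case. Smaller detectable differences or higher power push the number up quickly, which is exactly why it is worth quantifying before the test launches.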

As part of the test duration estimate, the number of test cases should be called out, and ideally screenshots of the test cases will be included with prominent annotations of where they differ from the Control.
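Turning a per-test-case sample size into a duration estimate is simple arithmetic once the number of test cases and the daily eligible traffic are known. The following is a minimal sketch assuming traffic is split evenly across test cases and arrives at a steady daily rate. All names and figures are hypothetical.

```python
from math import ceil


def estimated_test_days(sample_size_per_test_case, num_test_cases,
                        daily_eligible_visitors, traffic_share=1.0):
    """Estimate how many days a test must run.

    num_test_cases counts the Control as one of the test cases.
    traffic_share is the fraction of eligible visitors entered into the test.
    Assumes an even split across test cases and steady daily traffic.
    """
    daily_in_test = daily_eligible_visitors * traffic_share
    if daily_in_test <= 0:
        raise ValueError("the test must receive some daily traffic")
    total_needed = sample_size_per_test_case * num_test_cases
    return ceil(total_needed / daily_in_test)


# Hypothetical: 31,234 visitors per test case, Control plus two challengers,
# 5,000 eligible visitors per day, all of them entered into the test
print(estimated_test_days(31234, 3, 5000))
```

Adding a test case or holding traffic back from the test lengthens the estimate proportionally, so it is worth recording both numbers alongside the duration in this section.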

Other points of due diligence should be mentioned to ensure that any possible bias is removed from the test. Will the test cases be randomly and evenly rotated by the test platform in use? When testing landing pages, did the developers maintain similar page weights and page load times? If other changes to the test cases were required to accommodate the changes to a variable that otherwise was desired to be measured in isolation, that should be explained in this section, too.
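One way to sanity-check that the test platform really is rotating test cases evenly is to compare the visitor count in each test case against what an even split would produce. The sketch below uses a simple per-test-case z-score under a binomial model. It is a rough screening check under that assumption, not a substitute for a formal goodness-of-fit test, and the function names, threshold, and counts are hypothetical.

```python
from math import sqrt


def allocation_z_scores(visitors_per_test_case):
    """Return a z-score per test case measuring deviation from an even split."""
    k = len(visitors_per_test_case)
    total = sum(visitors_per_test_case.values())
    if k < 2 or total == 0:
        raise ValueError("need at least two test cases with some traffic")
    expected = total / k
    # Standard deviation of one test case's count under an even random split
    std_dev = sqrt(total * (1 / k) * (1 - 1 / k))
    return {name: (count - expected) / std_dev
            for name, count in visitors_per_test_case.items()}


def looks_evenly_split(visitors_per_test_case, z_threshold=3.0):
    """True if no test case deviates from an even split beyond the threshold."""
    scores = allocation_z_scores(visitors_per_test_case)
    return all(abs(z) <= z_threshold for z in scores.values())


# Hypothetical visitor counts
print(looks_evenly_split({"Control": 10000, "Test A": 10050}))
print(looks_evenly_split({"Control": 10000, "Test A": 11000}))
```

A failed check does not say what went wrong, only that the split is unlikely to be the product of even random rotation and deserves investigation before the results are trusted.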

V. Results & Observations

Lead this section with proof that the test execution was valid and that no detectable source of bias arose over the test duration. If there were any anomalous events during the test (e.g., unforeseen website outages or abrupt changes in volume from a given traffic source involved in the test), they should be mentioned here, along with either an explanation of how they were handled so as not to corrupt the test’s outcome or an acknowledgment of their possible impact on the outcome.

The results with respect to the primary KPI should be shared next. The details of presentation format are largely a subjective preference, but for each test case the results should indicate the test case’s sample size, the number of results observed, the KPI measurement, the relative difference from the Control’s KPI, and the % confidence and power. Don’t forget to also compare the test cases against one another (not shown in the examples below).
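The per-test-case figures listed above can be produced with a short helper. This sketch assumes the KPI is a conversion rate and uses a pooled two-proportion z-test. It reports confidence as one minus the two-sided p-value, which is a common convention in A/B testing tools but an assumption here, and it leaves power out. The function name and the counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def summarize_vs_control(control_visitors, control_conversions,
                         test_visitors, test_conversions):
    """Summarize one test case against the Control for a conversion-rate KPI."""
    cr_control = control_conversions / control_visitors
    cr_test = test_conversions / test_visitors

    # Pooled two-proportion z-test
    pooled = ((control_conversions + test_conversions)
              / (control_visitors + test_visitors))
    std_err = sqrt(pooled * (1 - pooled)
                   * (1 / control_visitors + 1 / test_visitors))
    if std_err == 0:
        raise ValueError("cannot test when there are no or only conversions")
    z = (cr_test - cr_control) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))

    return {
        "sample_size": test_visitors,
        "conversions": test_conversions,
        "conversion_rate": cr_test,
        "relative_difference": (cr_test - cr_control) / cr_control,
        "confidence": 1 - p_value,  # assumed convention: 1 - two-sided p-value
    }


# Hypothetical counts: Control converts 500 of 10,000; test case 560 of 10,000
for field, value in summarize_vs_control(10000, 500, 10000, 560).items():
    print(field, value)
```

The same helper can be pointed at any pair of test cases, not only a test case and the Control, which covers the comparison of test cases against one another.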

Example of conversion rate data presented for a landing page conversion optimization test. Conversion rate (CR) is typically the primary metric for such a test, but downstream metrics should be examined as well to ensure the quality (or profitability) of acquired customers did not degrade.

Any intermediate metrics, funnel data, or downstream retention and revenue data should be shared in a similar fashion, as applicable.

Example of conversion funnel data (Not shown: columns containing relative differences between micro conversions).

Example of downstream test case profitability comparison in context of a subscription service or other business model that relies upon returning customer revenue, but whose fulfillment costs of products and/or services vary proportionately with revenue.

Example of downstream test case profitability comparison in context of a subscription service or other business model that relies upon returning customer revenue; fulfillment costs may not be proportionate with the amount of revenue driven and therefore a profit per visit metric is more appropriate than revenue per visit.
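To see why the choice between revenue per visit and profit per visit matters, consider a small illustration. The figures below are entirely made up, and the helper name is hypothetical. The point is only that when fulfillment costs do not scale with revenue, a test case can win on one metric and lose on the other.

```python
def per_visit_metrics(visits, revenue, fulfillment_cost):
    """Return revenue per visit and profit per visit for one test case."""
    if visits <= 0:
        raise ValueError("visits must be positive")
    return {
        "revenue_per_visit": revenue / visits,
        "profit_per_visit": (revenue - fulfillment_cost) / visits,
    }


# Made-up figures: the test case drives more revenue than the Control,
# but its fulfillment costs grow faster than its revenue does
control = per_visit_metrics(visits=10000, revenue=50000, fulfillment_cost=20000)
test_a = per_visit_metrics(visits=10000, revenue=55000, fulfillment_cost=30000)

print(control)
print(test_a)
```

In this made-up case the test case leads on revenue per visit yet trails on profit per visit, so reporting revenue alone would point to the wrong winner.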

Stick to the facts in this section. Conclusions and suppositions have their own sections so as to keep this portion of the document focused on the data.

VI. Conclusions

This section need not dive into detailed data, but declarative, data-substantiated statements about the test outcome should be included here. Be thorough, taking care to highlight any interesting takeaways observed in secondary KPIs. Ultimately, though, the most prominent statement in this portion of the document should be whether the problem originally laid out was “solved,” or at least mitigated, based on learnings from this test.

VII. Suppositions

Use this section to speculate about what could have been done differently, and about whether the test concept should be expanded and tested on other audience segments (and in what format) or whether the idea should be shelved and why. This is a good place to log future test ideas inspired by what was learned in this experiment.

Finally, come up with a good record-keeping system that is readily searchable so that team members can easily reference your test documents as occasions arise. You may wish to institute some type of forum within your organization that allows an opportunity to present on experiments. These documents serve as the ideal foundation on which to create these presentations, and anyone who wants a deeper dive should be referred to the document for supporting details.
