Methodology
This page describes the requirements for obtaining statistically significant results. We use statistical tests to determine whether the success metric of the variant differs significantly from the control. This is particularly helpful in borderline cases where the control and variant perform similarly. For more technical detail, please follow your escalation process.
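The page does not prescribe a specific test, but for a binary success metric such as conversion rate, a two-proportion z-test is a common choice. The sketch below is illustrative only; the function name and the example figures are assumptions, not figures from any real experiment.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)  # pooled rate under H0
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical example: control 1,000/20,000 vs variant 1,100/20,000 conversions
z, p = two_proportion_z_test(1_000, 20_000, 1_100, 20_000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A p-value below the chosen significance level (commonly 0.05) indicates the variant's metric differs significantly from the control's.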
Traffic Threshold
Each bucket should receive at least 5,000 orders and 20,000 visitors to ensure the results are statistically reliable, with enough data to detect real differences and to avoid misleading conclusions driven by random variation.
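To see what a threshold like 20,000 visitors per bucket buys, you can compute the minimum detectable effect (MDE) under the usual normal approximation. The 5% baseline conversion rate below is an assumption for illustration, not a figure from this page.

```python
import math

Z_ALPHA = 1.96  # two-sided 5% significance level
Z_BETA = 0.84   # 80% power

def min_detectable_effect(baseline_rate, n_per_bucket):
    """Absolute MDE for a two-proportion comparison (normal approximation)."""
    se = math.sqrt(2 * baseline_rate * (1 - baseline_rate) / n_per_bucket)
    return (Z_ALPHA + Z_BETA) * se

# Assumed 5% baseline conversion rate at the 20,000-visitor threshold
mde = min_detectable_effect(0.05, 20_000)
print(f"absolute MDE: {mde:.4f} ({mde / 0.05:.1%} relative)")
```

Under these assumptions, 20,000 visitors per bucket can reliably detect only fairly large relative uplifts; smaller effects need more traffic or a longer run.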
Suggested Duration
The standard suggested duration of an experiment is 4 weeks. This provides confidence that the results are representative and that no significant website changes have affected them. Moreover, it provides more data, which allows more granular decisions, e.g. by locale or device breakdown. For guidance on any duration other than the standard 4 weeks, please check the FAQs section.
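Combining the traffic thresholds above with a site's weekly traffic gives a rough duration estimate. A minimal sketch, assuming the 5,000-order and 20,000-visitor thresholds from the Traffic Threshold section and the 4-week standard minimum; the function name and traffic figures are hypothetical.

```python
import math

MIN_ORDERS = 5_000     # per-bucket order threshold
MIN_VISITORS = 20_000  # per-bucket visitor threshold

def estimated_weeks(weekly_orders, weekly_visitors, min_weeks=4):
    """Weeks needed per bucket to reach both traffic thresholds,
    never below the standard minimum duration."""
    weeks_orders = math.ceil(MIN_ORDERS / weekly_orders)
    weeks_visitors = math.ceil(MIN_VISITORS / weekly_visitors)
    return max(min_weeks, weeks_orders, weeks_visitors)

# A low-traffic site may need longer than the 4-week standard
print(estimated_weeks(weekly_orders=800, weekly_visitors=6_000))  # → 7
```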
Early Termination
We recommend letting an experiment run for the full 4 weeks; however, if required from a commercial point of view, it can be paused early in two cases:
- The experiment is showing conclusive results after two weeks, i.e. there is enough data, metrics are stable, and there are no large discrepancies in segmented data, e.g. by locale or device.
- The experiment is showing inconclusive but consistently worrying negative results. You will be notified if your experiment is paused at any point.
Extension
An experiment can be extended when it is showing inconclusive results due to any of the following:
- Metric instability/fluctuations - making it hard to detect a clear trend
- Results discrepancies between sites within a division
- Results discrepancies between locales within a site
- Insufficient data because of unexpectedly low traffic
Aggregation
Aggregation can be done by division, e.g. Beauty or Health & Lifestyle, or by site. Locale aggregation, however, should be done only if we have a specific hypothesis, e.g. aggregating the locales of a region that show similar behaviour in a commercial aspect or in a metric such as page load speed. Avoid aggregation by locale without a specific hypothesis; instead, monitor the performance of each site and locale both in breakdown and in aggregate. If multiple locales are performing badly, investigate what they have in common.
Pitfalls to Avoid
Cherry-picking - aggregating sites/locales based only on their performance.
- Example 1: A negative but non-significant RPV uplift for MyProtein Germany and France can add up to a significant negative RPV uplift upon aggregation.
- Example 2: A significant negative RPV uplift for MyProtein Germany and a non-significant positive RPV uplift for MyProtein France can combine into a non-significant RPV uplift upon aggregation, hiding the poor performance of MyProtein Germany.
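Example 1 can be made concrete with numbers. The sketch below uses conversion rate as a simple stand-in for RPV (a proper RPV test would need the variance of revenue per visitor), and all counts are invented for illustration: each locale alone is non-significant, but the pooled sample crosses the significance threshold.

```python
import math

def z_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the difference between two conversion rates."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# Per locale: each shows a non-significant negative effect (p > 0.05)
p_de = z_p_value(500, 10_000, 455, 10_000)  # Germany alone
p_fr = z_p_value(500, 10_000, 455, 10_000)  # France alone (same figures)
# Pooled: the combined sample becomes significant (p < 0.05)
p_agg = z_p_value(1_000, 20_000, 910, 20_000)
print(f"DE: {p_de:.3f}  FR: {p_fr:.3f}  pooled: {p_agg:.3f}")
```

This is exactly why aggregation needs a pre-registered hypothesis: pooling after the fact lets you manufacture or hide significance.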
Site/Locale Selection
For site/locale selection, we want to run experiments in as many sites and locales as possible that are affected by the change or by our roll-out recommendation. We ask stakeholders about this at the test planning stage, as it affects data aggregation and duration estimate decisions. By default we do not recommend rolling out on any site/locale without an experiment. The exception is very small sites or locales, where the cost of running an experiment is higher than the potential impact. This is not straightforward to measure, but we usually drop sites/locales with fewer than 50 or 100 orders per week, depending on the overall size of the sites included.
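Applying the weekly-orders cut-off described above can be sketched as a simple filter. The site names, order counts, and the choice of 50 (the lower end of the 50-100 range) are all hypothetical.

```python
# Hypothetical weekly order counts per site/locale
weekly_orders = {
    "myprotein-de": 4_200,
    "myprotein-fr": 1_900,
    "myprotein-pt": 38,  # below threshold: roll out without a test
}

MIN_WEEKLY_ORDERS = 50  # lower bound of the 50-100 range in the text

in_experiment = {s for s, n in weekly_orders.items() if n >= MIN_WEEKLY_ORDERS}
rollout_only = set(weekly_orders) - in_experiment
print("experiment:", sorted(in_experiment), "rollout only:", sorted(rollout_only))
```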