
The problems of over-relying on A/B testing

Photo of two doors, one labelled A and the other B, by Jason Dent on Unsplash

Don’t get me wrong, I’m a big fan of experimentation and metrics-driven product development. It brings scientific evidence to decision-making, which is much better than blindly shipping features without validating the impact we expect them to have. Here’s a famous example:

“Around 2009, Google tested 41 different colors of blue for Gmail ads and search result links, a change that ultimately earned the company an extra $200 million a year in ad revenue.” — The Guardian

This is the ‘holy grail’ in Product: having full awareness of the outcome of what is shipped. However, this sense of certainty can sometimes make us over-rely on A/B tests as a solution to every problem we face when building products. As the saying goes, “if the only tool you have is a hammer, you will start treating all your problems like a nail”.

So let’s talk about some of the “silent” problems that occur from over-relying on A/B testing:

  1. Deprioritising disruptive innovation
  2. Deprioritising opportunities that “only” move the needle in the long term
  3. Deprioritising opportunities that don’t necessarily impact a major business KPI

#1 Deprioritising disruptive innovation

In the acclaimed book ‘The Innovator’s Dilemma’, Harvard professor Clayton Christensen distinguishes two types of innovation: the first, ‘sustaining innovations’, improve a product’s performance; the second, ‘disruptive innovations’, are about breakthroughs, about discovering and satisfying future customer needs. An example of a ‘sustaining innovation’ was when Airbnb started verifying the pictures of the homes listed on its platform (an attempt to improve the booking experience and, I assume, reduce churn in the booking funnel as a consequence). On the other hand, launching ‘Airbnb Experiences’ was a ‘disruptive innovation’, since it required looking at travellers’ journeys holistically and pursuing an opportunity not yet satisfied by the market.

Airbnb’s picture verification feature

Verifying host photos to reduce churn in the booking funnel is an example of how A/B tests are great at driving incremental innovation. However, they are pretty bad at producing creative disruption. Why? A/B testing encourages a mindset of relentlessly designing experiments that beat the control version, because that is what success means in the A/B framework. Prioritising the hypotheses that are most likely to win an A/B test naturally leads to deprioritising opportunities that are disruptive and haven’t been explored before. In other words, testing the first iteration of a new solution against one that is mature and optimised will likely result in a ‘failed’ experiment. A/B tests are great for incremental product improvements, but relying solely on this framework for product discovery will likely make you drift away from ‘disruptive innovations’.

testing the first iteration of a new solution against one that is mature and optimised will likely be a ‘failed’ experiment

The beauty of an A/B experiment is that by turning a hypothesis into an MVP, we get quantitative confidence on whether that hypothesis is worth pursuing further. This implies low development effort and lets us shift direction fast, which is key to building great products. However, it is hard to build a minimum-effort experiment around many ‘disruptive innovations’ (the iPhone, Snapchat Spectacles, …). Take Instagram Stories’ face filters, for example: how would you validate whether people want to change the way their faces look with effects, and whether they would actually share that in their Stories? How would you get statistical confidence that this effort is worth making? It’s tough to build an MVP around this feature; probably the most minimal way is to develop a few basic filters and see whether users actually engage with them. Sometimes you just have to take the risk, build it and see if it sticks. (Obviously, you should gain confidence in other ways, like qualitative research and having users interact with a prototype of the feature.)
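
To make the ‘quantitative confidence’ part concrete, here is a minimal sketch of the kind of computation behind a typical A/B readout: a two-proportion z-test comparing control and variant conversion rates. All the counts are made up for illustration.

```python
# Minimal sketch of the "quantitative confidence" an A/B test gives you:
# a two-proportion z-test comparing control vs. variant conversion rates.
# All counts below are made up for illustration.
from math import sqrt
from statistics import NormalDist

def ab_test_p_value(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-sided p-value for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)  # pooled rate under the null
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# 10,000 users per arm; 4.0% vs 4.5% conversion -> p ≈ 0.08,
# i.e. not significant at the usual 0.05 threshold.
print(ab_test_p_value(400, 10_000, 450, 10_000))
```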

Instagram face filters: a girl adding face-filter effects to her story

Andrew Chen (former VP of Growth at Uber) has also written about this phenomenon, arguing that designing with the goal of moving the needle on metrics often leads to prioritising features that are already proven to work elsewhere over innovative features or interactions. The result, he argues, is that you may end up with a “mish-mash of features that your audience has already seen elsewhere, and done better too…it’s a recipe for mediocrity”.

#2 Deprioritising opportunities that “only” move the needle in the long term

A/B testing culture is built on the following mantra: evaluate whether opportunity X impacts KPIs Y and Z, within a short timeframe. Do this quickly and you’ll be able to promptly validate many opportunities and identify which ones are worth building further. But what if some of your hypotheses are expected to have an impact only in the long term, and not so much in the first couple of weeks of the test? Hypotheses that are more likely to move the needle in the long run (opportunities that affect customer retention, loyalty, or delight) end up being deprioritised by the need to see results in the short term.
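
Part of the problem is plain statistics: the smaller the effect you expect to see within the test window, the more users (and therefore time) the test needs. Below is a rough sketch using the common rule of thumb of roughly 16·p(1−p)/δ² users per arm for 80% power at 5% significance; the baseline rate and traffic figures are invented for illustration.

```python
# Rough sketch of why small, slow-building effects stretch test duration.
# Rule of thumb for a two-proportion test (80% power, 5% significance):
# required users per arm: n ≈ 16 * p * (1 - p) / delta^2.
# Baseline rate and traffic numbers are invented for illustration.

def required_days(baseline_rate: float, relative_lift: float,
                  daily_users_per_arm: int) -> float:
    delta = baseline_rate * relative_lift                     # absolute effect
    n = 16 * baseline_rate * (1 - baseline_rate) / delta**2   # users per arm
    return n / daily_users_per_arm

# A 10% relative lift on a 5% baseline is detectable in under a week...
print(required_days(0.05, 0.10, 5_000))  # ≈ 6 days
# ...but the 2% lift a long-term bet shows early on takes ~5 months.
print(required_days(0.05, 0.02, 5_000))  # ≈ 152 days
```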

Screenshot of the Fiverr app showing the profile of a musician: a freelancer who composes songs

Imagine Fiverr (an online marketplace for freelance services) is looking to improve the quality of the services provided by the freelancers on its platform, and suppose the team has the following assumption: “by showing client feedback to freelancers, they will improve the quality of their work (which should be reflected in an improvement of their ratings)”. Seeing the real impact of an initiative like this may take a lot more than a few weeks, which is the standard duration of an A/B test. For the expected effect to show up, clients first have to be prompted to give feedback; second, they have to give enough of it that at least some is constructive and actionable; third, freelancers need time to digest that feedback and apply it across different projects; and finally, new clients have to rate the freelancers again so that we can measure whether ratings improved. This loop means the experiment can take far longer than we’d wish.

So, should we deprioritise this hypothesis and pursue others with shorter testing times? It depends. In this case, it’s a high-value opportunity with the potential to significantly impact the business, since there is likely a correlation between customer satisfaction and customer retention. We should therefore evaluate opportunities not just on metrics or testing speed, but also on the level of confidence we have in them and their potential value. As Jens-Fabian Goetzmann (ex-Product Lead at 8fit) wrote, “overreliance on A/B testing also leads to short-term thinking”; it makes us “myopically focus on metrics as opposed to customer and business value”.

#3 Deprioritising opportunities that don’t necessarily impact a major business KPI

Over-relying on A/B testing means you’re optimising for changes in KPIs. However, not everything that is good for a user will be reflected in a KPI change. This is a hard truth to accept, but it is the truth. Think of some of the products you use most often: maybe some of their features delight you, but you won’t necessarily use them more because of it. Take Facebook Messenger, for example: you didn’t start using it more because “now they have dark mode”. But if you’re a dark-mode fan, you were definitely excited about the feature, and you may even have encouraged some of your friends to switch their app from light to dark mode.

Not everything that is good for a user will be reflected in a change in KPIs

This means that major KPIs (like activation, conversion, or average revenue per user) can’t be the only way to measure success; we need complementary ways of understanding what success means. In the Messenger example, these could be “how satisfied are our users with this feature?” and “how much are they using it?”.

Example of what looks like an A/B test from Google: after clicking on a notification from Google News, users are directed to the news article and asked “How satisfied are you with the notification that brought you here?”. This aims to gather more insight into how users feel about being notified of news, an example of not just looking at the obvious KPIs (like CTR) but also gathering qualitative data.

Blindly focusing on metrics also means deprioritising a more delightful product experience. In some cases it may make sense to deprioritise ‘delight’ because of company strategy, limited resources, or the nature of the product. But more often than you may realise, this happens unintentionally: ‘delight’ features rarely move a KPI in the short term, and consequently they end up deprioritised. Yet when you focus on ‘delight’, your users stick with your product in the long run. When many delightful moments accumulate in a product, you have built a better overall experience, which leads users not only to keep choosing your product over the competition but also to become ambassadors who promote it in their social circles.

Spotify’s Canvas feature: the looping videos don’t necessarily impact a KPI in the short term (like music listened to per session), but they do create a more engaging and delightful experience. Arguably, this shapes users’ preference for Spotify over other streaming platforms in the long run.

Overlooking opportunities that pay off in the long run happens not just at the level of feature prioritisation (as shown above), but also at the level of prioritising development increments. Agile brought many improvements to how products are shipped, and a less obvious one is that it emphasised the need to articulate the value each feature increment brings before the team commits to shipping it. This is good because it encourages a ruthless focus on customer value. However, it isn’t reasonable to expect every increment to impact the KPIs we’re aiming at, and when it doesn’t, there is a tendency to build only a portion of a full-fledged solution (the portion we expect to impact KPIs the most). Optimising for metrics over business and customer value is a dangerous road to follow: do this many times and you end up with a mediocre product.

Conclusion

Experimentation is extremely important to building product, especially for generating optimisations. However, A/B tests can give an unreal sense of certainty that makes us over-rely on them as a tool, leading to the problems described above. In conclusion, there is no one-size-fits-all way to solve problems in Product, and being aware of the shortcomings of A/B tests is a good start to building better products.

Thoughts? Do you have more examples? Leave me a comment!

👋 Let’s be friends! Follow me on Medium for more Design-related articles!

The UX Collective donates US$1 for each article published on our platform. This story contributed to Bay Area Black Designers: a professional development community for Black people who are digital designers and researchers in the San Francisco Bay Area. By joining together in community, members share inspiration, connection, peer mentorship, professional development, resources, feedback, support, and resilience. Silence against systemic racism is not an option. Build the design community you believe in.


Written by Bernardo Domingues

✌️ Product Designer. I share my current thoughts and learnings about Design, Product and Tech, follow me to stay in the loop
