The Wild & Wonderful World Of A/B Testing

Paul Armstrong
Published in UX Collective
11 min read · Jun 25, 2019


In the dark and distant past, our design ancestors — the cave painters, the illuminated manuscript doodlers, the church muralists, the stained glass artisans, the poster printers, the annual report elitists, the cyber pixel-pushers — lived in a world where the design profession was relegated to the dark corners of subjectivity, somewhere between art and marketing, between personal perception and emotion. Designers were haunted by painfully unhelpful phrases uttered by clients, such as “I’ll know it when I see it”, “This doesn’t pop”, or “Could you make the logo bigger?”.

Every designer after getting useless feedback

Client feedback was an onslaught of verbal ambiguity, and unless you had some notoriety or prestige, this type of opinion-based reaction was nearly impenetrable. Beauty is, after all, in the eye of the beholder, and good taste has no playbook. If your patron, or client, or that distant uncle (who sold refurbished hubcaps on eBay and was your first paying customer while you decided which summer job to take) didn’t like what you designed, you had little to no recourse, because they were paying the bills.

But as our technological landscape has evolved and developed, most specifically with the creation of smartphones and mobile devices, so too have the tools by which design and designers can measure and bolster the success of their solutions — and the somber days of “I’m just not feeling it” can become a relic of the past.

One of the more powerful and insightful tools to emerge from the expanding utility belt of legitimized, data-centered design weaponry is the A/B test. While many UX specialists, front-end developers, data analysts, user sherpas, and statistic gurus understand and use A/B tests, there are still many people who have yet to utilize their power.

What is an A/B test?

An A/B test measures an equal number of people’s preference for one variant over another, roughly speaking

It’s easy to understand the purpose and meaning of an A/B test from context. Simply put, it’s comparing A to B, one thing to another thing. Or slightly less simply put, it’s showing an equal number of individuals either a control object (A) or a variable object (B) in order to learn which performs better.
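If it helps to see that in code, here’s a minimal sketch (in Python, with made-up user IDs and a hypothetical experiment name) of how a testing tool might split traffic: each visitor is deterministically bucketed into the control (A) or the variant (B), so the two groups stay roughly equal and a returning visitor always sees the same version.

```python
import hashlib

def assign_variant(user_id: str, experiment: str) -> str:
    """Deterministically bucket a user into 'A' (control) or 'B' (variant)."""
    # Hashing the user ID together with the experiment name keeps the split
    # roughly 50/50 and guarantees a returning visitor sees the same side.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"

# Hypothetical visitors and a hypothetical experiment name.
for uid in ["user-101", "user-102", "user-103"]:
    print(uid, assign_variant(uid, "signup-button-color"))
```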

The tendency for many of us is to make assumptions about what others like and understand, and how they behave, based on our own personal preferences and insights (which raises all kinds of logical-fallacy red flags). These individualized motives then guide recommendations for changes and updates to design or content. But when one personal preference contradicts another, how can anyone decide which is correct?

Rather than devolve into winless emotional arguments, an A/B test becomes a methodical and productive way to learn which content or design solutions are the most effective by tracking and comparing user preferences.

By observing the actions and behavior of a broad set of people, decisions can be made based on informed, verifiable, and statistically relevant data and not just personal opinions. And to avoid emotionally charged, reactionary impulses to make drastic and dramatic modifications, a properly implemented A/B test limits changes to those linked to specific goals, actions, and outcomes.

Why do an A/B test?

If you have an assumption on how someone will behave, it’s time to test!

So what’s the point in even doing an A/B test? Won’t it take forever to determine what specifically to test? And won’t the effort involved in setting up the test and monitoring the results, not to mention questions about the reliability of the data, make the whole thing impractical and time consuming? While it’s true that conducting an A/B test takes effort, coordination, and time, that’s no reason not to do one.

Assumptions are a foundational ingredient in all ideas. Our assumptions inform every initial solution we create. But our assumptions don’t always lead to the best results. That’s why nearly a hundred years ago Madison Avenue ad agencies (or more specifically a man named Daniel Starch) began conducting market research. In order to move past assumptions, researchers would conduct tests on groups of customers to gather insights on their intended behavior and overall reception to products.

The problem is that insights are just that — insights. While insights are important and necessary to establish a baseline of understanding, they do not provide precise data. An insight is the hypothetical intent of someone to do something. As we all know, everyone lies (or at the very least, they don’t always do what they say they will do).

House is right. It’s also never Lupus.

🛠 EXAMPLE:
Your friend asks if you are planning on going to the Whiskeytown reunion concert in six months, and you say yes — the insight is that, yes, you are going to the concert. But in the time between your answer and the concert, you learn about the sexual misconduct allegations against Ryan Adams and decide not to go. The insight and the result do not match.

That’s why a thorough test that observes action, not just intent, is a necessary research step: it helps clarify whether insights are accurate by exposing the difference between how we believe people will behave and what they actually do.

Whenever assumptions are made about a customer or user, there should always be an accompanying test to verify or invalidate those biases. The important thing in all testing, not just A/B testing, is that regardless of the results, the outcomes are always learning opportunities. There are no good or bad results; there is only information.

When to do an A/B test

It’s important to know how your product is performing to know if you should do an A/B test.

Now that you understand what an A/B test is and why it’s important for delivering validated, data-driven, and quantitative user solutions, the next question to answer is when you should perform one. Whenever there are claims of knowledge based solely on assumptions, especially assumptions drawn from personal experience, it’s time to run a test. Whenever you hear someone say “there’s no way anyone would do X”, it’s time to run a test. If you ever think “if you did X, you’d have to be an idiot”, it’s time to run a test. After you get an email where your client wrote “I believe X because it’s what I like”, it’s time to run a test.

Any product that engages with actual users or customers will have specific goals that are tracked as metrics. Whenever there is a need to incrementally improve or understand performance around key events in your product, that’s when an A/B test should be performed.

🛠 EXAMPLE:
Imagine you’re creating a new social media app — one that protects user privacy, champions polite conduct, and is exclusively for cat owners — called Prrr. In order to gain traction and overtake Twitter, you need to focus on getting new users. The key metric to measure for new users would be conversion from visitor to user (account creation).

There are many commonly used event metrics, which together tell the complete story of how your users or customers are interacting with, understanding, and engaging with your product. Here’s a list of some standard key event metrics (a minimal sketch after the list shows how a few of them reduce to simple arithmetic once you have event counts):

  • Daily Active Users (DAU)
  • Customer Acquisition Cost (CAC)
  • Bounce Rate
  • Conversion rates (which include various funnels, such as: Account Creation > Product Search and Discovery > Add To Cart > Purchase)
  • Lifetime Value (LTV)
  • Click Through Rates (CTR)
  • Usage/Behavioral Metrics (session length, timeline based measurements)
  • Monthly Revenue
  • Net Promoter Score (NPS)
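If you like seeing the arithmetic, here’s the sketch mentioned above, with entirely invented numbers for the hypothetical Prrr app, showing how conversion rate, bounce rate, and CAC fall out of raw event counts:

```python
# Entirely invented weekly numbers for the hypothetical Prrr app.
visitors = 12_400             # unique visitors to the account creation page
accounts_created = 992        # visitors who finished creating an account
single_page_sessions = 5_580  # sessions that left after viewing one page
total_sessions = 14_200
marketing_spend = 8_000.00    # dollars spent acquiring those visitors

conversion_rate = accounts_created / visitors   # visitor -> user conversion
bounce_rate = single_page_sessions / total_sessions
cac = marketing_spend / accounts_created        # Customer Acquisition Cost

print(f"Conversion rate: {conversion_rate:.1%}")  # 8.0%
print(f"Bounce rate:     {bounce_rate:.1%}")      # 39.3%
print(f"CAC:             ${cac:.2f}")             # $8.06
```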

Accompanying the vast amount of measurable behavior is an even greater ocean of tools and software solutions that allow you to gather that data. But don’t worry, you don’t need to drown in this sea of abundance; there is a boat of standard methodologies (boy, I really hate this analogy) that provide different types of information and research, and more accurate insights into your metrics, which in turn will help you determine the best tools to use.

While it’s important to collect data, it’s equally important that you gather accurate and applicable data. Just because you’re able to string together a set of words and sentences and paragraphs doesn’t mean you’ve told a story. Good data is the story of your users’ behavior and how that behavior directly impacts your product goals. Don’t let your data be like a game of Mad Libs (“The threatening kitten has a candy bar improvement show, because tomorrow constructed of lifeless USB cords”. See, those are all words, strung together, but they mean nothing; at least not intentionally).

Overall, there are two ways to understand your users and how they are interacting with and feeling about your product: direct feedback, which is gathered through things like customer support tickets, user reviews, or NPS and other survey results; and observed behavior, which is discovered through event tracking, heat and scroll maps, or click recordings.

Here are some common triggers of feedback and behavior that help lead you toward when and what to A/B test:

  • Consistent user complaints
    Whether they come from direct emails, customer support phone calls, product reviews and ratings, or support ticket submissions, patterns quickly emerge around what users are unhappy about.
  • Issues associated with functionality
    It’s important to have a system by which you can monitor and gather issues with your product, often reported as bugs. These either point to simple engineering solutions, or built-in issues with overall functionality.
  • Decreasing performance
    As you track your product’s metrics on a daily basis, you will inevitably find areas of decreasing performance within certain funnels, like a declining number of conversions or increasing cart abandonment. When this happens, it’s time to test to find out why.
  • Insights into viability of new design elements
    If you’re like any other designer who has ever lived, you will quickly grow to loathe your solutions, deeply, passionately, and with profound malice. When you inevitably begin to change specific design elements, it’s recommended that you test the impact those changes might have on product metrics. There are almost always downstream consequences to any change.
  • Test effectiveness of content wording
    If you have a marketing or brand team, they will inevitably want to test copy and content around a specific campaign or initiative. BOOM — perfect time for an A/B test.

How to create an A/B test

Something about the David Bowie song “Changes” and other smart words. I’m not sure you’re even reading this.

There are many types of in-market product testing methods (by “in-market”, I mean methods where actual users are engaging with your actual product, whether it’s physical or digital). What you want to change, and how much you want to change it, will help determine whether an A/B test is the best method.

A/B testing ought to involve small, focused, incremental element adjustments, whether it’s testing wording, language, color, size, or style. Why limit changes? Because adjusting too many things at one time makes it nearly impossible to determine which change is impacting your data; was it the color, or the size, maybe the shape, or the content? If you want to change more than a simple headline, a background color, or parts of a button, then you should perform a different kind of test (such as multivariate testing, feature flags, split-URLs, or a staging environment).

Let’s consider the ordinary button. A simple button can have up to eight testable variations: color, shape, style, size and scale, text wording, typeface options, visual icon, and placement within a component. While it is good to focus on one variation at a time, some variations don’t typically yield measurable improvements. For example, changing a font from bold to extra bold, or from one sans-serif typeface to another, doesn’t (typically) produce impactful results, unless the change also affects legibility. You might also find that changes in color, whether they’re subtle shade and tint adjustments or drastic hue shifts, don’t create significant variations in your results. But don’t take my word for it: test and find out!

It’s important to determine what you want to test, such as wording, shape, color, style, size, placement, icon use, or type style

So now that you’ve analyzed your event metrics, gathered data from direct feedback and observed behavior, and discovered patterns that need improvement, what do you do next? Develop a hypothesis! A hypothesis is the foundation of the learning and discovery that you want your test to reveal.

🛠 EXAMPLE:
Through analysis of your data you’ve learned that users are not creating accounts on Prrr. In the event tracking data you can clearly see many users visiting the account creation page, but only 8% of those who visit actually create an account. In user feedback you see a pattern of people complaining about providing too much personal information. When you observe their behavior, you see many users begin to fill out the form but abandon it at the address field (which is required). Your hypothesis is that removing the address field, or making it optional, will increase account creation.
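Once the test has run, the hypothesis gets judged by comparing the two conversion rates. Here’s a minimal sketch of that comparison using a standard two-proportion z-test, with invented counts (nothing here is specific to Prrr or to any particular testing tool):

```python
import math

def two_proportion_ztest(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Two-sided z-test for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)      # pooled conversion rate
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))    # two-sided p-value
    return p_a, p_b, z, p_value

# Hypothetical results: A keeps the required address field, B drops it.
p_a, p_b, z, p = two_proportion_ztest(conv_a=400, n_a=5000, conv_b=505, n_b=5000)
print(f"A: {p_a:.1%}  B: {p_b:.1%}  z = {z:.2f}  p = {p:.4f}")
# A p-value below your chosen threshold (0.05 is common) suggests the lift is
# unlikely to be random noise; above it, keep collecting data or rethink the idea.
```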

As with any hypothesis, it is never meant to be right or wrong, only to yield unbiased data. If your assumption is wrong, adjust your thinking and try again. Perhaps, as in the example above, it’s the format of the inputs, or the amount of information required, or the language and explanation. There is always something to learn, because no user base is exactly the same as another.

Now it’s time to choose the tool that will create, manage, and report on your A/B test. And there are many (many many many, like so many you guys) solutions to choose from. But here are just a few that I’ve either used, have heard of others using, or find intriguing:

All of these solutions typically require the installation of a script to run the software, in order to properly control the flow of user traffic to the variants you create, either using a WYSIWYG “overlay” on your product site or, for those with some basic coding knowledge, custom JavaScript and CSS overrides. Most tools also require a handful of baseline metrics, such as average monthly visitors, conversion rates, and the percentage of change you want to see in the test (here’s a handy tool to calculate how long you should run an A/B test). It’s also important to know the specific goals and actions you wish to track, such as button and element clicks, input submissions, or page visits. Lastly and most importantly, you need an element you wish to change to build your test around: the variation (B) to your control (A).
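Those baseline numbers feed a sample-size estimate, which is roughly what a test-duration calculator is doing under the hood. Here’s a hedged sketch using the standard two-proportion approximation at 5% significance and 80% power; the baseline rate, desired lift, and traffic numbers are all invented for illustration:

```python
import math

def sample_size_per_variant(baseline_rate: float, relative_lift: float) -> int:
    """Approximate visitors needed per variant to detect a relative lift,
    assuming a two-proportion test at 5% significance and 80% power."""
    z_alpha, z_beta = 1.96, 0.84     # two-sided alpha = 0.05, power = 0.80
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    pooled = (p1 + p2) / 2
    numerator = (z_alpha * math.sqrt(2 * pooled * (1 - pooled))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)

# Invented inputs: 8% baseline conversion, hoping to detect a 20% relative lift,
# with roughly 1,000 eligible visitors per day split across both variants.
n = sample_size_per_variant(0.08, 0.20)
days = math.ceil((2 * n) / 1000)
print(f"~{n:,} visitors per variant, roughly {days} days of traffic")
```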

💡Insider Tip:

Don’t think of an A/B test as an isolated learning exercise. Let’s say you want to change the entire visual layout of an item card. Rather than change the entire card all at once, update the component through a series of successive tests. This methodical, albeit slow, approach will strategically optimize individual elements, determining the best solution for each and, in the process, updating the entire item card.
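As a sketch of what that might look like (the elements, order, and wording here are entirely hypothetical), the item card redesign becomes a queue of single-variable tests rather than one big-bang change:

```python
# A hypothetical backlog of successive A/B tests for one item card.
# Each test changes exactly one element; each winner becomes the new control.
item_card_test_plan = [
    {"element": "product photo", "control": "square crop",  "variant": "full-bleed crop"},
    {"element": "price label",   "control": "bottom right", "variant": "under the title"},
    {"element": "CTA wording",   "control": "Add to cart",  "variant": "Buy now"},
    {"element": "CTA color",     "control": "brand blue",   "variant": "high-contrast green"},
]

for step, test in enumerate(item_card_test_plan, start=1):
    print(f"Test {step}: {test['element']}: '{test['control']}' vs '{test['variant']}'")
```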

Don’t be intimidated by testing. A/B testing is a tool and mechanism that anyone can implement — anyone. You don’t have to be a designer or a developer to learn about how to improve your solutions. So start testing!

A/B Test Cheatsheet

Common information that you will need to know or gather in order to run a successful A/B test:

  1. Define and understand the goals of your product or project.
  2. Create or monitor event metrics based on the product goals.
  3. Collect and analyze direct feedback and observed user behavior in order to prioritize testing.
  4. Create an A/B test around a specific hypothesis, linked to a specific goal, varying an individual element.
  5. Let the test run as long as it needs to in order to produce a statistically significant result.
  6. Use the results to implement a change or to test a new hypothesis.
  7. Repeat.


Head Of Design at Pixel Recess, pixel fabricator, artisanal vector craftsman, creative thinkvisor, husbandist, fathertian, one-time baby, long-time idiot