Humanizing Machine Learning

Lessons learned from designing adaptive page testing for HubSpot

Tyler Beauchamp
UX Collective

--

I’ve been working as a Product Designer at HubSpot since March 2018. Of all the features I’ve worked on, I’d have to say my favorite has been adaptive page testing — a new and more powerful way for our users to experiment with the design and content of their website.

A quick primer if you’re unfamiliar with HubSpot: our products are built around the philosophy of Inbound Marketing. We think it’s better for businesses to attract customers with valuable content, rather than bothering them with interruptions they don’t want.

Businesses use HubSpot’s CMS to create landing pages and website pages that contain this valuable content. But sometimes it’s hard for a business to know what content will attract customers, and what might drive them away. The only way to know for sure is to test out different ideas — hence the need for robust experimentation tools in HubSpot’s CMS.

The value of experimentation

HubSpot has long recognized the value of experimentation — it released basic A/B page testing years ago, and it’s served our users well. Before I even started researching and prototyping ideas for improving this tool, I wanted to make sure that it would be worth the design and development time that we’d have to invest.

I don’t think anyone would question the value of experimentation; digital marketers are constantly experimenting with pricing, ad campaigns, and site design. But after reading a few case studies, I was surprised by just how valuable experimentation can be when done right.

One of the most compelling examples I found was from a blog post by Dan Siroker, Director of Analytics for Barack Obama’s 2008 presidential campaign. In 2007, the campaign ran a simple experiment on their website, with the goal of maximizing campaign contributions. The original landing page — the control — looked like this.

The campaign tested 24 different variations of this page to see how copy, images, and videos might affect form submission rates. Below are a few of the variations they tested, along with their performance (percent increase or decrease of submission rates) compared to the control.

The winner — a picture of Obama with his family and a “learn more” button — garnered 40.6% more form submissions than the control. Over the course of the campaign, that improvement translated to roughly 2.8 million additional email addresses collected. With the average email subscriber donating $21, that’s an additional $60 million raised, just for running a simple experiment.
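For the curious, here’s the back-of-envelope arithmetic behind that last figure, using only the numbers quoted above (rough by design, since both inputs are rounded):

```python
# Back-of-envelope math using the figures from Siroker's post.
additional_signups = 2_800_000       # extra email addresses attributed to the winning page
avg_donation_per_subscriber = 21     # average donation per email subscriber, in dollars

additional_revenue = additional_signups * avg_donation_per_subscriber
print(f"${additional_revenue:,}")    # prints $58,800,000, roughly the $60 million cited above
```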

TL;DR: experimentation is a big deal.

The problem with A/B testing

A quick refresher on how A/B testing works: visitor traffic is evenly split between the page variations. At some point, it becomes clear which variation is performing the best, and the experimenter manually chooses a winner. From then on, the winning variation is shown to 100% of visitors.
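In code, that routing logic is about as simple as it sounds. A minimal sketch (illustrative only, not HubSpot’s implementation):

```python
import random

def choose_variation_ab(variations, winner=None):
    """Classic A/B routing: split traffic evenly at random until the
    experimenter manually declares a winner, then show only the winner."""
    if winner is not None:
        return winner
    return random.choice(variations)

# Before a winner is chosen: each visitor has a 50/50 chance of seeing "A" or "B".
choose_variation_ab(["A", "B"])
# After the experimenter picks "B": every visitor sees "B".
choose_variation_ab(["A", "B"], winner="B")
```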

It’s a very simple testing method that works well enough for many of our users. But after playing with the feature myself, chatting with users, and reading support cases, I noticed a few limitations that I thought we could improve upon.

  1. Users must (or at least should) wait for a sufficient sample size before manually selecting a winner for every test run.
  2. During the testing period, half of the site’s visitors see the inferior page variation.
  3. Users are limited to testing just two variations at a time.

If we wanted to make experimentation even more valuable, we’d have to solve these three problems.

Research

In the fall of 2018, I connected with Mark Collier & Hector Urdiales Llorens of HubSpot’s Machine Learning team. They were researching the potential applications of a more sophisticated testing methodology called multi-armed bandit testing. It seemed like it could be the perfect way to test page variations.

With multi-armed bandit testing, traffic is split evenly between page variations at first. As HubSpot learns how these variations are performing, we adjust the traffic automatically, so that better-performing variations are shown more, and poorer-performing variations are shown less. Since the experimenter doesn’t have to wait to see how variations are performing before traffic is adjusted, multi-armed bandit testing delivers results (i.e. improvements in form submission rates) much faster.
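HubSpot hasn’t published the exact algorithm here, but Thompson sampling with Beta priors is a common way to build a multi-armed bandit for binary outcomes like form submissions, and it captures the behavior described above. A minimal sketch of that idea:

```python
import random

class ThompsonSamplingBandit:
    """Illustrative multi-armed bandit (Thompson sampling with Beta priors),
    not HubSpot's actual implementation."""

    def __init__(self, variation_names):
        # Start every variation with a uniform Beta(1, 1) prior on its submission rate.
        self.successes = {name: 1 for name in variation_names}
        self.failures = {name: 1 for name in variation_names}

    def choose_variation(self):
        # Sample a plausible submission rate for each variation and show the one
        # whose sample is highest. Early on the samples are noisy, so traffic is
        # spread roughly evenly; as data accumulates, the better-performing
        # variation wins these draws more and more often.
        samples = {
            name: random.betavariate(self.successes[name], self.failures[name])
            for name in self.successes
        }
        return max(samples, key=samples.get)

    def record(self, name, submitted):
        # Update the shown variation's posterior with the observed outcome.
        if submitted:
            self.successes[name] += 1
        else:
            self.failures[name] += 1
```

In this sketch, every visitor triggers a fresh draw from each variation’s posterior, so a lagging variation still gets occasional traffic (the “explore” side of the trade-off), while most draws go to the front-runner (the “exploit” side).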

Multi-armed bandit testing could solve all three of the aforementioned limitations of HubSpot’s A/B testing tool:

  1. Users would no longer have to wait for a sufficient sample size before manually selecting a winner for every test run.
  2. During the testing period, fewer visitors would see the inferior version of a page, since traffic is automatically routed to the better-performing variations.
  3. Users would not be limited to testing just two variations at a time.

We were all excited about these potential benefits, but I wanted to make sure our users would find it useful too. So I did some digging and compiled a list of users who regularly used HubSpot’s existing A/B page testing feature, since I figured they’d be most interested in helping us improve it. I asked how they used A/B testing in HubSpot today, and how they might feel if we could automate some of the testing work for them. I explained how the multi-armed bandit model worked, and how it could solve some of the A/B testing shortcomings they’d described. While most of them seemed excited about the idea, I also began to notice a pattern in their reactions: apprehension about ceding some control of page testing to HubSpot.

It’s an understandable response. HubSpot users (rightly) feel that they know their businesses and processes better than we do. It takes a lot of trust for marketers to give up some control of their website traffic to some “mysterious algorithm,” as one user put it.

But this wasn’t an insurmountable problem. If we could make this feature less mysterious, more human, and more approachable, perhaps we could build some trust with our users. I wrote up the two principles that would guide all of our work going forward:

  1. Maintain clarity and transparency about how the multi-armed bandit model makes decisions.
  2. Clearly communicate test results in a way that demonstrates the value of experimentation.

What we built

Many weeks of sketching, prototyping, testing, and iterating went into building our new testing tool, but before I get into that, let me walk you through the finished product.

Here’s a step-by-step guide of how to create a multi-armed bandit test (presented to users as an “adaptive test”) in HubSpot.

1. Run a test

Start in the HubSpot CMS page editor and click “run a test.” A modal appears that asks which type of test you would like to run: A/B test or adaptive test. Descriptive copy helps explain the difference:

  • With an A/B test, your page traffic will be split 50/50 between two page variations. You choose the winner based on the test results.
  • With an adaptive test, you can test up to five page variations. The one that performs best will automatically be shown to visitors the most often.

Choose “adaptive test.”

2. Select a goal & number of variations

In the next step, there’s a dropdown for you to select the goal of your test. This goal is the metric used by the multi-armed bandit model to quantify performance of each of the variations. For now, the only goal is “maximize submission rate,” but more goals are coming soon.

Once you choose a goal, you can add more page variations (up to 5 total), and give each one a unique name.

3. Testing tips

Once you’ve created additional page variations, you’ll see simple instructions that tell you what to do next: toggle between your variations, and edit each one. If you’re not sure what to change for each variation you’ve created, click “see testing tips” to see strategies for testing different elements of your page, like CTAs, copy, media, layout and design.

4. Toggle between variations & edit

Once you’ve created your variations, the “run a test” button gets replaced with a dropdown that allows you to toggle between your variations and edit them to your liking. In the example below, the user is varying the main background image for each variation.

If you’d like to change a goal, rename variations, add variations, or delete variations, click “manage test” and make the changes in the resulting modal.

5. Publish

When you’re done editing all your variations, click “publish.” A modal appears to communicate that making major changes to variations after publishing may affect test results.

After you click “publish” in the confirmation modal, you’re brought to the page’s test results.

6. View test results

Once the page variations have been visited and (in this case) received form submissions, data from all page variations will appear in this “test results” tab. Until then, you’ll see a message that indicates testing is in progress.

Together, these test results aim to fulfill one of the principles I established at the beginning of this project: clearly communicate test results in a way that demonstrates the value of experimentation. Let’s go through each part in detail:

Goal & test start date. At the top of the page, you’ll find the goal you set for the test, the day the test started, & the option to end the test.

Goal metric comparison. A comparison of the goal metric across page variations. The header and thumbnail image clearly show which variation has been performing the best.

Traffic distribution over time. This is one of my favorite parts of the adaptive testing feature, and it’s one of the most important, since it fulfills the other principle I established: maintain clarity and transparency about how the multi-armed bandit model makes decisions.

This graph shows how HubSpot’s multi-armed bandit model allocates traffic to the page variations being tested, over time. You can see that it starts by splitting traffic evenly (though not exactly evenly, due to randomness) between the variations. Then, as the model learns that the “Beach” variation is receiving more submissions per view than the other variations, it starts to allocate more and more traffic to that variation. This behavior is succinctly explained at the top of this card.
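To make the shape of that graph concrete, here’s a tiny simulation. It’s purely illustrative: the submission rates and the variation names other than “Beach” are invented, and Thompson sampling stands in for whatever allocation strategy HubSpot actually uses.

```python
import random

# Invented "true" submission rates for three hypothetical variations.
true_rates = {"Beach": 0.12, "City": 0.08, "Forest": 0.07}
wins = {v: 1 for v in true_rates}     # Beta prior: one pseudo-success per variation
losses = {v: 1 for v in true_rates}   # Beta prior: one pseudo-failure per variation

for day in range(1, 15):
    shown = {v: 0 for v in true_rates}
    for _ in range(500):  # 500 simulated visitors per day
        # Show the variation with the highest sampled submission rate.
        choice = max(true_rates, key=lambda v: random.betavariate(wins[v], losses[v]))
        shown[choice] += 1
        if random.random() < true_rates[choice]:
            wins[choice] += 1
        else:
            losses[choice] += 1
    total = sum(shown.values())
    print(f"Day {day}:", {v: f"{shown[v] / total:.0%}" for v in true_rates})

# Day 1 is close to an even split; within a week or two, most traffic goes to "Beach".
```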

Comprehensive table. This table shows all the metrics that HubSpot collects for the page variations being tested, automatically sorted by the ‘goal metric’ (submission rate, in this case).

7. End test

One of the best things about adaptive tests is that they’re designed to run forever. You don’t have to remember all the tests you’re running across your website; just start a test, and HubSpot will take care of the rest. But if you don’t like the idea of never-ending tests, you can end them.

If you click “end test,” a confirmation modal will appear that asks you whether you really want to end the test, and which variation you would like to keep showing to visitors. The goal metric for all variations is shown, and the best-performing variation is automatically selected for you. Simply type “end test” (a precaution to prevent accidents) and click “yes, end test.”

Once the test is ended, you’ll see a card with a summary of the test’s results, including start date, end date, goal, and a comprehensive table of metrics.

And you’re done! You’ve created four variations of your page and let HubSpot take care of the rest. All you need to do is wait for us to automatically adjust traffic for you and maximize your submission rate.

It took a lot of testing and iterating to get to this point. Next I’ll go through what we learned and changed along the way to make the adaptive testing experience as clear and smooth as possible.

Testing & iterating

As soon as my team finished building a minimum viable version of the above flow, we began testing it. I set up a 2-hour meeting with a few dozen HubSpotters — designers, developers, and product managers — to see if they could successfully navigate this new feature, as if they were HubSpot users.

Throughout the session, everyone contributed to a shared document to record overall impressions, problems, points of confusion, and bugs. Here’s just a tiny sampling of the issues we uncovered:

It was pretty stunning to see how many potential issues came up, but I was grateful that we found them before our users did.

After several weeks of bug fixes, copy updates, and redesigning, I felt comfortable releasing adaptive testing to some of our users. I reached out to the same HubSpot users I interviewed about A/B testing a few weeks before, and asked if they’d be willing to try out adaptive testing and share their feedback. This gave us plenty more to improve upon. Below are just a few of the changes we made based on their feedback.

Name change

When we first presented this feature to users, we referred to it as “multi-armed bandit testing” since that’s the technical term for the model that powers it. It became clear very quickly that this was a mistake — nobody knew what the heck we were talking about.

To better understand how users were interpreting this term, I worked with HubSpot’s UX Research team and sent out a survey to a few hundred users with this question, among others:

Imagine there’s a new feature called “Multi-armed bandit test” next to “A/B test” in the HubSpot page editor. How would you describe what this feature might do?

172 users replied, and their responses were pretty hilarious:

As you can see from the responses, not only was the term unclear, it had a negative connotation that could prevent adoption of the feature.

So I worked with a few of HubSpot’s content designers to come up with a new name for multi-armed bandit testing. After much deliberation, we decided to change the name of the feature to “adaptive testing” since we thought it was a simpler, more approachable term that still accurately described the behavior of the multi-armed bandit model.

“Multi-armed bandit test” wasn’t the only confusing copy — the test descriptions weren’t super clear either. So, we updated them and added instructional copy to give users a better idea of what running a test involves: creating page variations and testing them out.

Goal setting

Initially, the second step of the test creation modal involved just naming page variations. We thought all users would have the same goal for their pages: maximizing submission rate. But some of our users had different goals in mind, like minimizing bounce rate. In response, we’ll soon be adding a way for users to choose from a wider variety of test goals.

Pre-publish warning

Any scientist worth their salt knows that making changes to variables during an experiment could affect its outcome. The same goes for experimenting with variations of a page. We wanted to make sure users understood this, so we replaced the standard publishing warning modal with a more descriptive one.

Test results

Our first iteration of test results was pretty basic. After testing, we discovered some opportunities for improvement.

Several users told us that the number one thing they want to know is which page variation is performing the best. We presented this information in our first iteration, but I thought we could make it even more prominent. We now clearly state the best-performing variation in the header of the first card along with an accompanying thumbnail image.

We also added a traffic distribution chart to help users understand how our multi-armed bandit testing model allocates traffic over time. You’ll notice in the example below that the best-performing variation (“Beach”) is also the variation that’s getting shown to the most visitors; the colors of the chart and graph match to help make that connection clear.

Rollout & Impact

After we resolved most of the above issues, we felt confident about releasing adaptive testing to more Marketing Hub Enterprise users. I developed a rollout plan that brought us from 20 users to over 7,500 users in 10 weeks.

Of course, things don’t always go exactly to plan — about 7 weeks in, we had to take a multi-week pause to improve the security, performance, and reliability of the HubSpot CMS. But on Thursday, August 8th, we released adaptive testing to all our Marketing Hub Enterprise users. 🥳

Adaptive testing in the wild

After many weeks of hard work, it’s been so gratifying to see all of the cool ways people are using adaptive testing. Here’s just a small sampling of what they’re experimenting with.

Video

This Germany-based fitness center is testing two videos on their page to see which leads to more memberships.

Image background

This tax software business is hoping to find out which background style leads to the most eBook downloads.

Layout

This publisher is experimenting with different page layouts to see which one leads to more consultation bookings.

Button copy

They’re also testing how the words on their main call to action button affect bookings. In this test, the differences between form submission rates have been huge: “free consultation” performed almost 10 times better than “talk to us.” Words matter!

Featured image

And finally, my personal favorite example. This senior living center is testing which featured image generates more interest from its visitors. Unsurprisingly, the image of the ladies drinking wine is nearly twice as successful as the image of the ladies playing mini golf. Cheers! 🍷

In the first weeks of the adaptive testing beta rollout, both feedback and usage metrics have been very promising. For users who were presented both options, adaptive testing was chosen more than 10x as often as A/B testing.

It took months of hard work across multiple teams, but we’re finally at the finish line. We’ve used machine learning to make page testing in HubSpot more powerful while making it approachable and easy to use.

Hundreds of users have already used adaptive testing in creative and surprising ways. Starting today, it will be available to thousands more — I can’t wait to see what else they come up with. 🤖
