How can AI help you write emails?
Insights from 156 people writing emails with AI in the loop

We built a prototype AI email editor that uses neural text generation from Natural Language Processing (NLP) and asked 156 people to write emails with it. We compared different UI settings and usage by native and non-native English writers. Our motivation was to explore AI text generation that supports writers, rather than aiming to replace them with AI. Here's what we've learned.
The prototype
As researchers in Human-Computer Interaction and AI, we set out to build and test an interactive AI system for writing. The figure below shows its user interface, realised as a web app: It offers a simple text editor view with typical functionality (e.g. typing, editing, deleting, moving the caret), plus the AI text suggestions.

Suggestions could be selected with the mouse or keyboard (tab: accept the first suggestion; up/down arrows: change the selection; enter: accept the current selection).
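To illustrate the selection logic, here is a minimal sketch as a plain state machine. The real prototype is a web app, so this Python version is only an illustration of the behaviour described above, not our actual UI code:

```python
# Sketch: keyboard handling for the suggestion list (tab, up/down, enter).
# Illustrative only; the actual prototype implements this in a web app.
class SuggestionList:
    def __init__(self, suggestions):
        self.suggestions = suggestions
        self.selected = 0  # index of the currently highlighted suggestion

    def on_key(self, key):
        """Return the accepted suggestion text, or None while still choosing."""
        if key == "tab":
            return self.suggestions[0]              # accept first suggestion
        if key == "down":
            self.selected = min(self.selected + 1, len(self.suggestions) - 1)
        elif key == "up":
            self.selected = max(self.selected - 1, 0)
        elif key == "enter":
            return self.suggestions[self.selected]  # accept current selection
        return None

ui = SuggestionList(["thank you for", "thanks for your", "I hope you"])
ui.on_key("down")
print(ui.on_key("enter"))  # -> "thanks for your"
```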
In addition, our user interface included study-specific information, namely an emailing task (scenario) and a button to submit the email and complete the task. The video below shows our prototype in action.
The general UI is inspired by related projects (e.g. the Write with transformer demo by HuggingFace). To the best of our knowledge, our project presents the first formal, detailed evaluation of such a system in a user study.
As the model for text generation, we used GPT-2 as provided by HuggingFace and finetuned it on a large email dataset (Enron), with some additional preprocessing.
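To make this concrete, here is a minimal sketch of how multiple parallel suggestions can be sampled from GPT-2 with the HuggingFace transformers library. The checkpoint and decoding parameters (nucleus sampling with top_p=0.9, ten new tokens) are illustrative assumptions, not the exact configuration used in the study:

```python
# Sketch: sampling several parallel suggestions from GPT-2 via HuggingFace
# transformers. Checkpoint and decoding parameters are assumptions for
# illustration, not the study's exact setup.
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")  # or an email-finetuned checkpoint
model.eval()

def suggest(text, n_suggestions=3, max_new_tokens=10):
    """Return n_suggestions possible continuations of the drafted email text."""
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    outputs = model.generate(
        input_ids,
        do_sample=True,                      # sample, so the continuations differ
        top_p=0.9,                           # nucleus sampling (assumed value)
        max_new_tokens=max_new_tokens,
        num_return_sequences=n_suggestions,
        pad_token_id=tokenizer.eos_token_id,
    )
    # Strip the prompt and decode only the newly generated tokens.
    new_tokens = outputs[:, input_ids.shape[1]:]
    return [tokenizer.decode(t, skip_special_tokens=True) for t in new_tokens]

print(suggest("Dear Ms Smith, thank you for", n_suggestions=3))
```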
We tested and iteratively improved our prototype in a prestudy with 30 people: For example, based on their feedback, we tweaked the UI and improved the model serving to increase the speed of text generation.
User study
We recruited 156 people to test our prototype. About 40% were native English speakers. Each person completed four email writing tasks (i.e. short scenarios), using a different version of our system for each, as shown in the figure below: Concretely, we compared showing zero, one, three and six parallel suggestions. The task order was counterbalanced across people to mitigate bias from learning effects.

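One standard way to counterbalance four conditions is a balanced Latin square. The sketch below applies that scheme to our four suggestion settings; it illustrates the general idea and is not necessarily the exact procedure we used:

```python
# Sketch: counterbalancing the order of the four suggestion settings
# (0, 1, 3, 6 parallel suggestions) with a balanced Latin square.
# One standard scheme, shown for illustration only.
CONDITIONS = [0, 1, 3, 6]  # number of parallel suggestions

def balanced_latin_square(n):
    """Return n orders: each condition appears once per position, and
    each condition directly precedes every other exactly once (even n)."""
    first = [0]
    left, right = 1, n - 1
    take_left = True
    while len(first) < n:
        first.append(left if take_left else right)
        left, right = (left + 1, right) if take_left else (left, right - 1)
        take_left = not take_left
    return [[(c + r) % n for c in first] for r in range(n)]

# Participant p is assigned row p mod n.
for p, row in enumerate(balanced_latin_square(len(CONDITIONS))):
    print(f"participant {p}: {[CONDITIONS[i] for i in row]}")
```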
We logged interaction data and asked people for opinions and feedback via questionnaires. The study ran online and took about 22 minutes on average.
Results
Let’s take a look at the data and results.
Nine fundamental interaction patterns
We modelled and analysed users’ interactions as sequences (e.g. type, type, delete, type, pick suggestion, …). This revealed nine fundamental interaction patterns for writing with AI in the loop, in three categories: Producing text, revising, and navigating. The figure below visualises these elementary behaviours.

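To illustrate the general idea of this sequence analysis, here is a minimal sketch that encodes an event log as a symbol sequence and counts recurring subsequences. The event names and the toy log are assumptions for illustration; the actual analysis behind the nine patterns is more involved:

```python
# Sketch: encoding a logged interaction stream as a symbol sequence and
# counting recurring subsequences (n-grams). Event names and the toy log
# are illustrative assumptions.
from collections import Counter

SYMBOLS = {"type": "T", "delete": "D", "move_caret": "M", "pick_suggestion": "S"}

log = ["type", "type", "type", "delete", "type",
       "pick_suggestion", "type", "pick_suggestion", "pick_suggestion"]

sequence = "".join(SYMBOLS[event] for event in log)  # "TTTDTSTSS"

def ngram_counts(seq, n):
    """Count every contiguous subsequence of length n."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))

# Frequent bigrams hint at elementary behaviours, e.g. "TS" = typing
# followed by accepting a suggestion, "SS" = chaining suggestions.
print(ngram_counts(sequence, 2).most_common(3))
```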
Key interaction metrics
Moreover, we analysed key metrics based on the logged interactions; the bar charts below show an overview. We break these down into concrete insights in the sections that follow, combined with further data not shown here (e.g. people’s feedback).

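As a rough illustration of how such metrics can be derived from the logs, here is a sketch over a toy event log. The log format and the metric definitions are simplified assumptions, not the exact definitions from the paper:

```python
# Sketch: deriving simple interaction metrics from a logged session.
# Log format and metric definitions are simplified assumptions.

# Toy log: (timestamp in seconds, event, number of characters involved).
log = [
    (0.0, "suggestions_shown", 0),
    (1.2, "type", 5),
    (3.0, "suggestions_shown", 0),
    (4.1, "pick_suggestion", 18),  # accepted 18 suggested characters
    (5.0, "type", 7),
]

shown = sum(1 for _, e, _ in log if e == "suggestions_shown")
accepted = sum(1 for _, e, _ in log if e == "pick_suggestion")
suggested_chars = sum(n for _, e, n in log if e == "pick_suggestion")
typed_chars = sum(n for _, e, n in log if e == "type")

acceptance_rate = accepted / shown                   # how often a shown list was used
suggested_ratio = suggested_chars / (suggested_chars + typed_chars)
task_time = log[-1][0] - log[0][0]                   # seconds, first to last event

print(f"acceptance rate: {acceptance_rate:.0%}, "
      f"suggested text share: {suggested_ratio:.0%}, "
      f"task time: {task_time:.1f}s")
```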
Benefits: Multiple suggestions help to find phrases
Multiple suggestions support writers in finding useful phrases, as evident from several results:
- Showing more suggestions increases the chance of accepting one of them.
- Showing more suggestions reduces the need for users to manually modify suggested text after acceptance.
- Giving users a choice of suggestions increases the use of suggested text in the email overall.
Costs: Suggestions cost time and actions
However, suggestions can come at a price for efficiency:
- As can be expected, users take longer to choose from more suggestions.
- With six suggestions in particular, people take longer to write their emails.
- Suggestions change users’ actions in the UI: For example, showing more suggestions partly shifts user actions away from typing in favour of navigation (of suggestion lists).
Engagement with suggestions varies
As indicated in the figure with the nine patterns above, engagement with suggestions varies. From low to high engagement, examples include: Typing in bursts while ignoring suggestions, integrating suggestions intermittently, using them more densely, and chaining multiple suggestions in a row. Variations can occur both between people and within a person over the course of writing.
Language proficiency matters
Finally, our results give insight into the impact of using text suggestions when writing in one’s native or non-native language:
- Non-native speakers accept more suggestions and gain relatively more from seeing more suggestions in parallel.
- Time spent with multiple suggestions is less of an overhead for non-native speakers.
- Non-native speakers perceived the suggestions slightly more positively and as more influential with regard to wording, content, and inspiration for other phrases and words.
Discussion and takeaways
Overall, this work motivates three concrete takeaways for the design of writing tools that integrate AI text generation:
- First, designers should be open to exploring a larger variety of UI parameters than previously considered for such systems. For example, we could try more than the de facto defaults of suggesting one phrase (e.g. Gmail Smart Compose) or three words (e.g. smartphone keyboards).
- Second, we should consider design goals beyond efficiency: Current text generation often aims to reduce typing and save time. However, our results highlight opportunities to design for other goals as well, for example inspiration or language learning. Such goals might also better align with a vision of human-AI collaboration, instead of replacing writers.
- Finally, we should more explicitly consider user groups and their needs and preferences when designing AI writing tools. For example, as shown here for language proficiency, different people (or one user in different contexts) may benefit from different AI and UI settings.
To put it briefly:
If you integrate AI text generation for speed and efficiency, show a single suggestion. If you design for inspiration or language learning, give users a choice of multiple phrases. In any case, consider diverse user groups.
Bias potentially hides not only in the model but also in the UI
There is one more, broader takeaway here: Our results imply that designing suggestion UIs exclusively for efficiency may hinder some user groups in using the system as they wish. Concretely, optimising AI writing tools for efficiency might not be in the best interest of, for example, non-native speakers. Considering recent discussions about biases in NLP models, this indicates that UI design might be another potential source of bias, with respect to the interactive experience of certain user groups.
Conclusion
Many fundamental UI design choices have not yet been explored for interactive uses of AI models from Natural Language Processing. To address this, we studied one key design dimension (the number of parallel suggestions), plus one important user-related aspect (language proficiency) that previously lacked investigation in this context.
We conclude that showing multiple suggestions is useful for ideation, at a cost in efficiency. How many suggestions to show in your UI thus depends on your design goals and target users. In this regard, the observed usage differences between native and non-native speakers clearly underline the importance of designing interactive AI with diverse user backgrounds in mind.
More details can be found in our paper, which will be published at the CHI’21 conference later this year. A preprint is already available on arXiv.