What content strategists should know to get started with NLP

Your work concerns words. Why not concern yourself with natural language processing (NLP)? Many features in the products we both use and make include NLP. As it expands further into content, we should consider it one of our content strategy tools rather than a mystery.

Jennifer Schmich
UX Collective

--

Some tools with NLP that my team uses: Qordoba, Sketch Engine and PoolParty

Where is NLP used in content?

NLP is all over the place. Common use cases that you’re probably familiar with:

  • Grammar checkers and quality scoring like what’s offered by Qordoba
  • Taxonomy curation as in PoolParty
  • Machine-generated content and metadata
  • Auto-summarization of existing content
  • Content auto-tagging
  • Recommendations
  • Sentiment analysis of social media posts and customer reviews
  • Prospecting based on sentiment in social posts
  • Search ranking, auto-correct and auto-complete
  • Finding gaps in content topics and keywords for search
  • Transcription of videos for search
  • Chatbots and voice search
  • Email spam filters
  • Machine translation of content between languages

No one’s limited to these use cases. Though these will keep you busy, you may also find novel ways of using NLP by collaborating on other problems. What might it do for you?
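To make one of these use cases concrete, here is a toy lexicon-based sentiment scorer. This is a simplified sketch for intuition only: the word lists are invented, and production sentiment tools use trained statistical models, not a hand-made lexicon.

```python
# Toy lexicon-based sentiment scoring -- an illustrative sketch,
# not how production sentiment-analysis tools actually work.
POSITIVE = {"love", "great", "easy", "helpful"}
NEGATIVE = {"hate", "confusing", "broken", "slow"}

def sentiment(text: str) -> int:
    """Return a crude score: count of positive words minus negative words."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

print(sentiment("I love this app but search is slow"))  # 1 - 1 = 0
```

Even this crude version shows the shape of the task: turning free text into a number a product can act on, such as routing an unhappy review to support.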

NLP fits in with analysis, targeting, creation, translation and search.

NLP gets into the structure of grammar

Underneath these use cases sits linguistic analysis for a fairly standard set of tasks, for example:

  • Entity and relationship extraction
  • Name and attribute recognition
  • Question answering
  • Part of speech tagging
  • Parsing and stemming
  • Disambiguation

No doubt you can geek out with the language part of NLP
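As a taste of one of these tasks, here is a naive suffix-stripping stemmer. It is a hypothetical sketch of what "stemming" means: real stemmers (the Porter stemmer, for instance) apply many more rules and handle far more edge cases.

```python
# A naive suffix-stripping stemmer -- a sketch of the "stemming" task,
# not a real algorithm. Suffix list and length rule are invented here.
SUFFIXES = ("ization", "ational", "ing", "ed", "es", "s")

def stem(word: str) -> str:
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([stem(w) for w in ["tagging", "tagged", "tags", "taxonomies"]])
```

Notice that "tagging" becomes "tagg", not "tag" -- stems are internal tokens for matching and counting, not words meant for readers.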

You might also think of NLP as the math of language, explaining to computers how to understand us.

We’ve learned the code of computers, now they’re learning ours so we can give them instructions.

Computers churn through tons of content fast. In doing so, they give us the ability to do things with content at a scale that’s not humanly possible.

Negotiate scale vs. quality early

NLP is part of the machine learning field. If you do NLP, you do ML at the same time. Getting into it feels like embarking on a research project.

Expect the output to be far from good at first. My partners often hide the worst from us. As the content strategist on the project, don’t miss the chance to have a voice early on. Align on what’s acceptable. How good is good enough for your users, where and when in the experience?

In the end, people make or break NLP by setting goals and training it on what is accurate through vetting, trial and error. Without being involved, you risk machines running a mess of content—at scale. One that takes more time and effort to change because the output is algorithmically derived.

Hire a professional linguist

When all of this is new, I’ve seen organizations confuse the expertise that’s needed. Developers code these services and data scientists produce the output. That doesn’t mean they can provide the linguistic skills your project might require. How much experience does your team have with discourse analysis? NLP toolkits can provide a false sense of readiness. You might be able to start just fine, but then over time, end up in a ditch.

The corpus really counts

Getting started, the teams need to line up some foundational pieces like taxonomy and linguistic text corpora.

A text corpus is a gigantic bin of content. Millions or even billions of words are typical. Data scientists and linguists carefully collect and curate it. The corpus acts like a sample data set that NLP uses for statistical validation: here’s how language generally works in this space; how does the analysis of our own content compare?

For many subject areas, you can license a proprietary corpus or use an open, general-purpose one for machine learning. If you’re like us, you may need to build a specialized corpus from scratch.

Building corpora is precarious. A flimsy corpus produces flimsy output. Are you choosing high quality content? Representative content? Has it been cleaned up to improve performance? The best way you can help is by pointing to good sources and assessing topics and coverage.
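One simple way a content strategist can contribute to corpus assessment is a frequency comparison: which terms are common in your own content but rare or absent in a general-purpose reference corpus? Those are candidate domain terms your corpus needs to cover well. The snippet below is a minimal sketch with invented miniature "corpora"; real comparisons use millions of words and statistical measures, not raw set differences.

```python
from collections import Counter

# Hypothetical sketch: spot domain-specific terms by comparing your own
# content against a general-purpose reference corpus. Both "corpora"
# here are tiny invented examples.
reference = "the cat sat on the mat the dog ran".split()
our_content = "file your taxes the refund the deduction".split()

ref_freq = Counter(reference)
ours_freq = Counter(our_content)

# Terms that appear in our content but never in the reference corpus
domain_terms = [w for w in ours_freq if w not in ref_freq]
print(domain_terms)
```

A real pipeline would also strip stopwords ("your" survives here) and normalize case, but the principle is the same: the gap between the two frequency lists tells you what your specialized corpus must represent.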

The model takes tedious training

Recently, for example, our team implemented a central taxonomy and had to re-classify and auto-tag our article and user-generated content. We were also bringing on NLP to start detecting (through extraction tasks) potential tags outside the taxonomy based on analysis of the content as it was being created.

First, the data scientists needed to train the machine learning. This happens on any ML project.

On ours though, it meant we had to check the auto-tags applied to the content and correct or suggest new ones. Early passes at auto-tagging can be entertaining, especially if you have a unique domain, low quality content or a patchy taxonomy. Training gives the computer the correct answers to compare and learn from.

Which subject matter experts could review our tags accurately? Our own content creators. If you don’t have experts internally, you may need a budget to hire experts externally.
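The expert review step can be pictured as a diff between what the machine tagged and what the expert says is correct. This is a hypothetical sketch with invented document IDs and tags, not our actual tooling: disagreements in either direction become the corrections the model trains on next.

```python
# Hypothetical sketch of the expert review loop: compare machine-applied
# tags with expert tags; disagreements feed the next training pass.
machine_tags = {"doc1": {"payroll", "taxes"}, "doc2": {"invoices"}}
expert_tags  = {"doc1": {"payroll"},          "doc2": {"invoices", "expenses"}}

for doc in sorted(machine_tags):
    over  = machine_tags[doc] - expert_tags[doc]   # over-tagged: remove
    under = expert_tags[doc] - machine_tags[doc]   # under-tagged: add
    print(doc, "remove:", sorted(over), "add:", sorted(under))
```

Set difference makes the two failure modes explicit: over-tagging (the machine was too eager) and under-tagging (it missed something an expert caught).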

Recruit subject matter experts to help

The Intuit team getting it done 🙌

We organized a series of tagathons. Fortunately, the amazing content teams at Intuit (thanks, all!) will jump to our common cause. Going in, we estimated the effort would take 500 work hours and 6 weeks to complete:

  • 6,000 document sample, divided into three sets and iterated in succession
  • 2–5 mins per document / 12 docs/hour
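The 500-hour figure follows directly from the throughput assumption, as a quick back-of-the-envelope check shows:

```python
# Back-of-the-envelope check of the tagathon estimate above:
# 6,000 documents at ~12 docs/hour (roughly 5 minutes each).
docs = 6000
docs_per_hour = 12
hours = docs / docs_per_hour
print(hours)  # 500.0 work hours
```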

Each tagathon generated follow-up work, with time required for modifications to adjust for over-tagging. Although the team was trained at the start of each tagathon, afterwards we had concerns about the consistency of the tagging across individuals in the group, and from one tagathon to the next. We definitely underestimated what it would take. So set realistic expectations, even though they may not be particularly welcome.

Ultimately, our training process contributed to improvements to both the taxonomy and auto-tagger, and gave us a vetted set of tagged content for future use cases. Plus, we released V1 with tagging precision at 77%, some 7 points higher than our initial target.
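For readers new to the metric, tagging precision is the fraction of machine-applied tags that reviewers confirm as correct. The counts below are illustrative placeholders chosen to produce 77%, not the project's actual numbers.

```python
# Hypothetical sketch of how tagging precision is computed: of the tags
# the auto-tagger applied, what fraction did experts confirm as correct?
confirmed_tags = 770   # illustrative count, not the project's real data
applied_tags = 1000
precision = confirmed_tags / applied_tags
print(f"{precision:.0%}")  # 77%
```

Precision only measures the tags the machine applied; it says nothing about correct tags the machine missed, which is why teams usually track recall alongside it.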

Natural language processing can help you execute content strategy. Find out how. It’s not just for data scientists, but for word nerds too.

Watch free courses online for an introduction to natural language processing. Start with courses that emphasize linguistics over statistics, toolkits or code. You’ll be fine.
