Reconstructing case information

October 17, 2018

The problem

When you register a business at Companies House in the UK, the names are stored in upper case regardless of input. That's a loss of information. This isn't the case (pun unintended) for some other jurisdictions like Germany, but is for the UK, France and Ireland to name a few. While this may not sound like a big deal, working with a long list of company names with blaring capital letters isn't very pleasing to the eye. Title case helps alleviate this a bit in some places, but also makes it worse in others (Hsbc Bank). We wanted to see if we could reconstruct the "intended" case for these names.

A naive approach might be to keep only non-dictionary words in upper case and title-case the rest. That works in some instances, like:

HSBC Insurance Holdings Limited
KLM UK Limited

But it quickly falls apart in most other examples.
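To make the naive approach concrete, here is a minimal sketch. The tiny DICTIONARY set and the naive_case function are stand-ins invented for illustration (a real run would load a full word list), and this is not how our production code works.

```python
# Toy stand-in for a real dictionary (assumption, for illustration only).
DICTIONARY = {"insurance", "holdings", "limited", "bank"}


def naive_case(name: str) -> str:
    """Title-case dictionary words and leave everything else in upper case."""
    words = []
    for word in name.split():
        if word.lower() in DICTIONARY:
            words.append(word.capitalize())
        else:
            words.append(word.upper())
    return " ".join(words)


print(naive_case("HSBC INSURANCE HOLDINGS LIMITED"))  # HSBC Insurance Holdings Limited
print(naive_case("SKYSCANNER LIMITED"))               # SKYSCANNER Limited
```

The second call shows the failure mode: a made-up brand name is not in the dictionary, so it gets shouted back at us.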


That's because the proper nouns in company names don't tend to be simple words from the dictionary, and aren't acronyms either. So how do you identify an acronym?

What makes an acronym?

Have you heard of YKK Corporation? If not, you probably should. They're a Japanese company that holds a 98% global monopoly in the market they're in: they make zippers. You probably are wearing at least one of their products right now. Even if you haven't heard of them, we can agree that YKK is an acronym and needs to be capitalised, as opposed to Skyscanner or Transferwise, for instance.

And that's because YKK looks nothing like an English word! It's short, lacks vowels, has unlikely consonants next to each other and can only dream of one day being pronounced. An easy way to detect acronyms would be to look for words that don't follow the typical consonant-vowel structure of English words.

Building an acronym classifier

Instead of defining these rules ourselves, we can build a tiny machine-learning application to do the work for us. For a list of English words, we don't need to look any further than a dictionary. We do not care about the frequency of words in literature, but just need a thorough list of English words.

By computing n-grams (short contiguous sequences of n characters) from these words, we can build a distribution of how often these grams appear in English words. For example, here's a list of bigrams (n-grams of length two) for a few words.

apple  -> ['ap', 'pp', 'pl', 'le']
banana -> ['ba', 'an', 'na', 'an', 'na']
chair  -> ['ch', 'ha', 'ai', 'ir']
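Lists like the ones above can be produced with a few lines of Python. This is a rough sketch; the function name ngrams is our own choice for illustration.

```python
from typing import List


def ngrams(word: str, n: int = 2) -> List[str]:
    """Return every run of n consecutive characters in the word, in order."""
    word = word.lower()
    # A word shorter than n yields an empty list.
    return [word[i:i + n] for i in range(len(word) - n + 1)]


for word in ["apple", "banana", "chair"]:
    print(word, "->", ngrams(word))
```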

Computing these for all words is a fairly inexpensive task. Once we've done that, we can group them by the individual bigrams and count the occurrences. This resulting lookup table, or model in ML parlance, is a representation of the kind of pairs of letters that make up English words.
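As a sketch of that counting step, Python's collections.Counter does the grouping for us. The three-word list is a placeholder for a full dictionary, and build_model is a hypothetical name, not the library's API.

```python
from collections import Counter
from typing import Iterable, List


def ngrams(word: str, n: int = 2) -> List[str]:
    word = word.lower()
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def build_model(words: Iterable[str], n: int = 2) -> Counter:
    """Count how often each n-gram occurs across all the given words."""
    counts = Counter()
    for word in words:
        counts.update(ngrams(word, n))
    return counts


# Placeholder word list (assumption); a real model would use a full dictionary.
model = build_model(["apple", "banana", "chair"])
total = sum(model.values())

print(model["an"])          # how many times 'an' was seen
print(model["an"] / total)  # the same count as a share of all bigrams
print(model["qz"])          # unseen bigrams simply count as 0
```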

Some interesting facts: er is the most common bigram in our dictionary, accounting for about 2% of all bigrams, and vd appears only once. (Can you tell which word?) More interesting are the bigrams that don't appear. The only bigram that begins with q is qu, meaning that whenever there's a Q in the middle of a word, it's always followed by a U, without exception.

The entire lookup fits on a single screen

Now, given a new never-before-seen word, all you have to do is check whether its individual n-grams are frequent or not. If there are too many uncommon n-grams, you can be fairly certain it's an acronym rather than an English-like word.
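Here is one way that check could look. It is a sketch under stated assumptions: the word list is a toy, and the function name looks_like_acronym and both thresholds (min_count, max_rare_fraction) are invented for illustration and would need tuning against real data.

```python
from collections import Counter
from typing import List

# Toy stand-in for a full dictionary (assumption, for illustration only).
WORDS = ["bar", "arc", "clay", "days", "sky", "scanner", "bank"]


def bigrams(word: str) -> List[str]:
    word = word.lower()
    return [word[i:i + 2] for i in range(len(word) - 1)]


# The "model": how often each bigram occurs across the word list.
MODEL = Counter(gram for word in WORDS for gram in bigrams(word))


def looks_like_acronym(word: str, min_count: int = 1,
                       max_rare_fraction: float = 0.5) -> bool:
    """Flag a word when more than max_rare_fraction of its bigrams are rare."""
    grams = bigrams(word)
    if not grams:
        return False  # a single letter is too short to judge
    rare = sum(1 for gram in grams if MODEL[gram] < min_count)
    return rare / len(grams) > max_rare_fraction


for word in ["YKK", "HSBC", "SKYSCANNER", "BARCLAYS"]:
    print(word, looks_like_acronym(word))
```

Using a fraction rather than an absolute count keeps long, mostly ordinary names from being flagged because of one odd pair of letters.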

Barclays vs HSBC

If you look at the bigrams for Barclays (['ba', 'ar', 'rc', 'cl', 'la', 'ay', 'ys']), you'll notice that they're all fairly common. You can come up with several words for each one of them, indicating that it is very much like a regular English word.

HSBC, on the other hand (['hs', 'sb', 'bc']), has fairly atypical bigrams that don't appear often in the language. Can you think of any word containing any of these chunks? This indicates that it's most likely an acronym, and should therefore stay capitalised.
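Once every word has a verdict, recasing a whole name is a matter of applying it word by word. The sketch below takes the acronym test as a parameter; the recase function and the hard-coded demo_is_acronym predicate are hypothetical stand-ins so the example runs on its own, and in practice the predicate would be the bigram-based check.

```python
from typing import Callable


def recase(name: str, is_acronym: Callable[[str], bool]) -> str:
    """Keep acronyms in upper case and capitalise only the first letter of other words."""
    return " ".join(
        word.upper() if is_acronym(word) else word.capitalize()
        for word in name.split()
    )


# Hard-coded stand-in for the bigram classifier (assumption, for the demo only).
def demo_is_acronym(word: str) -> bool:
    return word.lower() in {"hsbc", "klm", "uk", "ykk"}


print(recase("HSBC INSURANCE HOLDINGS LIMITED", demo_is_acronym))
print(recase("BARCLAYS BANK", demo_is_acronym))
```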

An open source library

All of the above model building, evaluation and application methods are wrapped up in a simple open-source Python class available here. We use companycase in most places throughout DueDil when surfacing data on the API or the site. Some names are ambiguous (Yo! Sushi vs YO! Sushi), and the model does occasionally misclassify a word, but we've found that it works really well in the majority of cases. It is a significant improvement over other methods, resulting in clean, correctly cased company names throughout our product.

Does this seem like a case you'd be interested in solving? Come and work with us, we're hiring.
