Frequently, robotic process automation (RPA) collides with the problem of unstructured data. A process to be automated might, for example, require data scraping from emails or websites with no specific format.
This kind of natural language processing (NLP) is one of the more obscure areas of machine learning. Things like Zipf's law and quasi-logical forms of sentences might cause restless nights even for hardened coders.
Description of the problem
For a human, a carelessly written email is easy to interpret, and minor spelling errors are, at most, annoying. But when a robot is coded to fetch data from a specific location or from around a specific keyword, it does exactly that, with no further interpretation. If the piece of data is not where it should be, is misspelled, or is in a slightly different format, the robot can't do anything with it.
In the worst-case scenario, the robot concludes that nothing is wrong. Fixing such unnoticed errors afterwards can be burdensome and costly.
Furthermore, it is common that the document under evaluation has to be interpreted without specific keywords. Interpretation has to be done solely from the overall sentiment of the document.
In traditional RPA, the developer solves this by giving the robot instructions for every imaginable situation. In NLP this is impossible, since there are myriad ways people can express the same idea.
Instead of hard-coded grammar, recent NLP solutions emphasize statistical methods. At the core of NLP are language models of various kinds: models that predict the probability distribution of language expressions. For example, a language model might predict that the sentence ‘A man walked his dog’ continues with ‘outside’ with 40% probability, with ‘on the street’ with 23% probability, with ‘on Monday’ with 4% probability, and so on.
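As a toy illustration of this idea (the four-sentence corpus below is hypothetical; real language models are trained on vastly more text), the probability distribution over continuations can be estimated simply by counting:

```python
from collections import Counter

# Hypothetical toy corpus; in practice a model is trained on millions of sentences.
corpus = [
    "a man walked his dog outside",
    "a man walked his dog outside",
    "a man walked his dog on the street",
    "a man walked his dog on monday",
]

# Count which word follows the fixed prefix "a man walked his dog".
prefix = "a man walked his dog"
continuations = Counter(
    sentence[len(prefix):].split()[0]
    for sentence in corpus
    if sentence.startswith(prefix + " ")
)

# Normalize the counts into a probability distribution.
total = sum(continuations.values())
probs = {word: count / total for word, count in continuations.items()}
print(probs)  # → {'outside': 0.5, 'on': 0.5}
```

Real models estimate these distributions over enormous vocabularies and much longer contexts, but the underlying object is the same: a probability for each possible continuation.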
Training a language model is seemingly simple: one just needs example material from which the algorithm can learn typical patterns. At the lowest level are the so-called n-gram models. An n-gram is a contiguous sequence of letters, morphemes, or words. For example, the word ‘word’ is a 4-gram or a 1-gram, depending on whether the model processes language at the level of characters or at the level of words.
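A minimal sketch of n-gram extraction, working on either characters or word tokens (the helper name `ngrams` is my own):

```python
def ngrams(sequence, n):
    """Return all contiguous n-grams of a sequence (a string or a token list)."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]

# Character level: the word 'word' is itself a single 4-gram.
print(ngrams("word", 4))  # → ['word']

# Word level: the same word is one 1-gram in a tokenized sentence.
print(ngrams("every word counts".split(), 1))  # → [['every'], ['word'], ['counts']]

# Word-level bigrams (2-grams) of a short sentence.
print(ngrams("a man walked his dog".split(), 2))
```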
Typically, it is not enough to just read off the probabilities of different n-grams. Instead, a variety of interpolation, normalization, and feedback methods are required. In particular, artificial neural networks with LSTM units are the tools of the trade in modern NLP.
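Before turning to neural networks, the interpolation step can be illustrated with classic Jelinek–Mercer smoothing, which mixes a bigram estimate with a unigram fallback so that word pairs never seen in training still get non-zero probability (the corpus and the weight 0.7 below are illustrative):

```python
from collections import Counter

tokens = "a man walked his dog and a man fed his dog".split()

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
total = len(tokens)

def interpolated_prob(prev, word, lam=0.7):
    """Linearly interpolate bigram and unigram estimates (Jelinek-Mercer smoothing)."""
    p_bigram = bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0
    p_unigram = unigrams[word] / total
    return lam * p_bigram + (1 - lam) * p_unigram

# 'dog' follows 'his' in both occurrences, so the bigram term dominates.
print(interpolated_prob("his", "dog"))
# The pair ('dog', 'walked') never occurs, yet the unigram term keeps it non-zero.
print(interpolated_prob("dog", "walked"))
```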
Recently, variants of the self-attention mechanism have been shown to be computationally more scalable and, in general, more accurate than models based on feedback loops.[1]
A language model, learnt in an unsupervised manner, can be applied in various ways.
One of the most entertaining examples is OpenAI's GPT-2.[2] It generates arbitrarily long stories by predicting a continuation for text that it has itself previously predicted. GPT-2 produces text so convincing that OpenAI decided not to publish the fully trained model, since it could be used to generate an endless stream of fake news and spam. Instructions for building the algorithm, however, are freely available.
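The generation loop itself — predict a continuation, append it, and predict from the extended text again — can be sketched with a toy bigram model standing in for GPT-2 (the corpus and function names are illustrative):

```python
import random
from collections import defaultdict

tokens = ("a man walked his dog outside and "
          "a man walked his dog on the street").split()

# Build a bigram successor table from the toy corpus.
successors = defaultdict(list)
for prev, nxt in zip(tokens, tokens[1:]):
    successors[prev].append(nxt)

def generate(start, length=8, seed=0):
    """Generate text by repeatedly predicting a continuation of its own output."""
    rng = random.Random(seed)
    words = [start]
    for _ in range(length):
        options = successors.get(words[-1])
        if not options:
            break  # no known continuation: stop generating
        words.append(rng.choice(options))
    return " ".join(words)

print(generate("a"))
```

GPT-2 does exactly this in spirit, but samples each next token from a neural model conditioned on the whole preceding text rather than on one previous word.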
Instead of generating text, a language model can also be used to calculate the probability that an observed text or document belongs to a certain class, or to estimate which keywords could be linked to the document. Such automated context recognition makes data scraping easier: there is no point, for example, in searching a taxi receipt for the total cost of an entire trip.
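A minimal sketch of such classification: train one unigram model per document class and pick the class under which the observed text is most probable. The class names and training snippets below are hypothetical, and the calculation is a simple naive-Bayes-style one with add-one smoothing:

```python
import math
from collections import Counter

# Hypothetical training snippets for two document classes.
training = {
    "taxi_receipt": "taxi fare total vat driver pickup dropoff fare total",
    "travel_invoice": "invoice travel total flight hotel nights invoice total",
}

class_counts = {label: Counter(text.split()) for label, text in training.items()}
vocab = {w for counts in class_counts.values() for w in counts}

def log_prob(text, label):
    """Log-probability of a text under a class-specific unigram model (add-one smoothed)."""
    counts = class_counts[label]
    total = sum(counts.values())
    return sum(
        math.log((counts[w] + 1) / (total + len(vocab)))
        for w in text.split()
    )

def classify(text):
    """Return the class whose language model assigns the text the highest probability."""
    return max(class_counts, key=lambda label: log_prob(text, label))

print(classify("taxi fare vat"))         # → 'taxi_receipt'
print(classify("flight hotel invoice"))  # → 'travel_invoice'
```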
Perhaps most importantly, language models give robots a measure of uncertainty. If a document, or a piece of information gathered from it, seems unlikely, the robot can point it out to a human expert.
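One common way to quantify that uncertainty is perplexity: the higher a document's perplexity under the language model, the more surprising the model finds it, and documents above some tuned threshold can be routed to a human. A toy sketch (the reference text and threshold are illustrative; real systems would use a far richer model):

```python
import math
from collections import Counter

# Unigram model estimated from hypothetical "typical" document text.
reference = "total amount due 120 eur paid by card".split()
counts = Counter(reference)
total = sum(counts.values())
vocab_size = len(counts)

def perplexity(text):
    """Per-word perplexity under an add-one-smoothed unigram model; higher = more surprising."""
    words = text.split()
    log_p = sum(
        math.log((counts[w] + 1) / (total + vocab_size)) for w in words
    )
    return math.exp(-log_p / len(words))

THRESHOLD = 10.0  # hypothetical cut-off; in practice tuned on validation documents

for doc in ["total amount due paid by card", "qwzx blorp 999999"]:
    ppl = perplexity(doc)
    verdict = "FLAG FOR HUMAN REVIEW" if ppl > THRESHOLD else "ok"
    print(f"{ppl:6.1f}  {verdict}  {doc}")
```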
Machine learning methods extend the possibilities of automation to textual content that requires interpretation. In the context of RPA, the most interesting applications appear to be data scraping, evaluating data quality, and improving data quality.
For developers, there are very good pre-trained open-source solutions, such as BERT from Google AI Language.[3] Developers just need to fine-tune the algorithm for their own application.
The problem with these ready-made solutions is that they are primarily optimized for texts written in English. Lack of data or poor data quality can also prevent the use of many good algorithms. One beauty of RPA compared to traditional software automation is that it doesn't require changes to clients' IT systems; but if those systems don't support heavy computing, some of the heavier machine learning algorithms may turn out to be useless.
The NLP algorithm developed by Knowit Oy for Finanssivalvonta (the Financial Supervisory Authority of Finland) is a good example of achieving good results even with a small amount of data.[4,5] The algorithm learns to recognize the contexts of new documents and to collect the relevant data after seeing just a few dozen examples annotated by a human expert. This is achieved by defining a domain-specific language model, which gives much more focused probability distributions than a general-purpose model. The algorithm can also handle several different languages.
In the future, I hope to see more easy-to-use data extraction robots that end users can train with statements like ‘From this kind of document, I would like to extract this type of information’.
References

[1] A. Vaswani et al., Attention Is All You Need, NIPS, 2017.
[2] A. Radford et al., Language Models are Unsupervised Multitask Learners, 2019.
[3] J. Devlin et al., BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2018.
[4] H. Toivonen, Robotti työkaveriksi (A Robot as a Co-worker), 2019.
[5] Knowit, Case: Finanssivalvonnan älykkäät robotit työkavereina (Case: The Financial Supervisory Authority's Intelligent Robots as Co-workers), 2019.