Get Started with Text Classification
Making sense of raw text is a hot topic, whether it is understanding financial data from a receipt, finding data security risks and vulnerabilities in a codebase, or improving that important email you are sending to your boss.
Classifying text is a task that can be solved with Machine Learning (ML), and more specifically natural language processing (NLP) tools, or with a more deterministic approach based on pattern matching with regular expressions. Both approaches have their own strengths and weaknesses, and in many cases it is beneficial to combine them.
In this article, I will discuss both of them and how you can get started.
Create a Classification Strategy
Before getting started with classification, you must analyze the data you want to classify and see if you can identify a data structure that can inform how you will design your labeling strategy.
Taking food classification as an example, the first classification level could be solid or liquid. Once we classify an item as liquid, it will be followed by more specific classifications, such as water, sodas, and juices. For the water category, the next layer would be sparkling and still. Ideally, these layers are informed by a product strategy. However, if there is no clear direction, the engineering team has to take the initiative.
This labeling structure will direct your work toward matching the higher-level category first and then getting into the specifics; we will discuss this further in the article. Down the road, you can also use it to measure how well your labeling process performs. For instance, your solution could be good at labeling soda products but fail at juices. Breaking results down by category helps to narrow down pain points.
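As a rough illustration, the layered labels from the food example could be stored as a nested dictionary, with results broken down per category to spot weak areas. This is a minimal sketch: the names `LABEL_TREE`, `label_paths`, and `accuracy_by_category` are hypothetical, and the categories are only the ones mentioned above.

```python
from collections import defaultdict

# Hypothetical label hierarchy, following the food example in the text.
LABEL_TREE = {
    "solid": {},
    "liquid": {
        "water": {"sparkling": {}, "still": {}},
        "soda": {},
        "juice": {},
    },
}


def label_paths(tree, prefix=()):
    """Yield every path in the hierarchy, from top-level category down to leaves."""
    for name, children in tree.items():
        path = prefix + (name,)
        yield path
        yield from label_paths(children, path)


def accuracy_by_category(records, level=1):
    """Return the share of correct predictions per category at a given depth.

    Each record is a (true_path, predicted_path) pair of tuples,
    for example (("liquid", "soda"), ("liquid", "juice")).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for true_path, predicted_path in records:
        key = true_path[: level + 1]
        total[key] += 1
        if predicted_path[: level + 1] == key:
            correct[key] += 1
    return {key: correct[key] / total[key] for key in total}
```

Calling `accuracy_by_category` with `level=0` compares only the solid/liquid layer, while `level=1` compares the layer below it, which is where a gap such as "good at sodas, weak at juices" would show up.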
While using regular expressions (regex) might not be as exciting as using ML, it is a powerful way to achieve classification results with few resources and in a short time. Regular expressions are also efficient, as they can be executed quickly on large amounts of data. To get started, pick a data set and identify the most common variations of the data you are trying to make sense of. Then, for each of them, come up with an associated regex. Rinse and repeat, and once the common variations are being matched, you can start digging into the edge cases.
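The loop described above could look like the following sketch. The patterns and label names are assumptions made up for the food example, not a recommended list; the point is the shape of the approach: an ordered list of (label, regex) pairs, checked from most specific to most general.

```python
import re

# Hypothetical patterns for the food example, ordered from most specific
# to most general so that the first match wins.
PATTERNS = [
    ("liquid/water/sparkling", re.compile(r"\b(sparkling|carbonated)\s+water\b", re.IGNORECASE)),
    ("liquid/water/still", re.compile(r"\b(still|spring|mineral)\s+water\b", re.IGNORECASE)),
    ("liquid/soda", re.compile(r"\b(soda|cola|lemonade)\b", re.IGNORECASE)),
    ("liquid/juice", re.compile(r"\b\w+\s+juice\b", re.IGNORECASE)),
]


def classify(text):
    """Return the label of the first matching pattern, or None if nothing matches."""
    for label, pattern in PATTERNS:
        if pattern.search(text):
            return label
    return None
```

Items for which `classify` returns `None` are the ones worth reviewing when looking for the next pattern to add.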
While a pattern-matching strategy can quickly get you to catch 90% of the occurrences, there are limitations. The biggest one is false negatives: we don’t know what we did not find, as we can only match patterns based on specific sequences of characters in the text. Regexes are generally not effective for tasks that require more sophisticated language processing, such as identifying the sentiment of a piece of text or classifying text into multiple categories. And it is extremely time- and resource-consuming for an engineering team to review millions of lines of data to catch the elements of interest that the regexes missed, which is what would inform how to extend the pattern lists. That’s where ML comes in.
The advantage of using an ML model is that it can generalize and classify newly introduced data that no known pattern covers. To get started, find a model trained to solve a similar problem. Using our example, look for a model that can classify food. If you cannot find an exact match, find a pre-trained model that does something similar and fine-tune it on your own data. For example, you may find a model that can classify ice creams; while it is not exactly what you need, it is close enough that you can fine-tune it to achieve your goal.
There are a lot of NLP models available on Hugging Face and TensorFlow Hub. My personal favorites for text classification tasks are BERT and DistilBERT. BERT is the more comprehensive one, while DistilBERT is a smaller, lighter model that is faster to train. If you’re unsure which model to use, try experimenting with several and comparing their performance on the test set.
A pre-trained model that you fine-tune yourself might show very low accuracy in the first iterations due to a lack of data and training. Be patient and methodical. Time and effort will be required to collect more data and retrain the model. Many public datasets for text classification are available from sources such as the UCI Machine Learning Repository and Kaggle Datasets.
If finding training data is an issue, it might be necessary to go back to square one and use the regex strategy to produce quality labeled data that enriches your training dataset. On top of that, data augmentation can enlarge your training dataset by replacing words with their synonyms. The open-source GloVe and Word2Vec embeddings are both great tools for the task. Another trick is back-translation: translate the text to another language and back. Google and Yandex offer APIs that can be used for that purpose.
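Synonym-based augmentation can be sketched in a few lines. Note the swap made here: instead of looking up neighbours in GloVe or Word2Vec embeddings, this example uses a small hand-written `SYNONYMS` table (hypothetical) so that it runs without any download. The function names and the `replace_prob` parameter are likewise assumptions.

```python
import random

# Hypothetical synonym table. In practice, nearest neighbours from word
# embeddings such as GloVe or Word2Vec could fill this role.
SYNONYMS = {
    "fizzy": ["sparkling", "carbonated"],
    "drink": ["beverage"],
    "cold": ["chilled", "iced"],
}


def augment(text, rng, replace_prob=0.5):
    """Return a copy of text with some known words swapped for a synonym."""
    words = []
    for word in text.split():
        options = SYNONYMS.get(word.lower())
        if options and rng.random() < replace_prob:
            words.append(rng.choice(options))
        else:
            words.append(word)
    return " ".join(words)


def augment_dataset(samples, copies=2, seed=0):
    """Add up to `copies` new variants per (text, label) sample, keeping the label.

    Variants identical to a text already in the dataset are skipped.
    """
    rng = random.Random(seed)
    augmented = list(samples)
    seen = {text for text, _ in samples}
    for text, label in samples:
        for _ in range(copies):
            variant = augment(text, rng)
            if variant not in seen:
                seen.add(variant)
                augmented.append((variant, label))
    return augmented
```

Each variant inherits the label of the sample it came from, so the approach only makes sense when the swapped words do not change the category.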
Once you have enough data, split it into training and test sets. Train your NLP model on the training set and evaluate its performance on the test set. Keeping the two sets separate is important: a model that overfits the training data will show misleadingly high scores if it is evaluated on that same data, while performing poorly on new, unseen data.
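A minimal version of the split, using only the standard library, might look like this. The function name mirrors a common convention, but this is a hand-rolled sketch; the `test_ratio` and `seed` defaults are arbitrary assumptions.

```python
import random


def train_test_split(samples, test_ratio=0.2, seed=42):
    """Shuffle samples and split them into disjoint training and test lists."""
    if not 0 < test_ratio < 1:
        raise ValueError("test_ratio must be between 0 and 1")
    shuffled = list(samples)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    test_size = max(1, round(len(shuffled) * test_ratio))
    return shuffled[test_size:], shuffled[:test_size]
```

Fixing the seed matters when comparing models: every candidate is then evaluated on exactly the same held-out items. Libraries such as scikit-learn provide a function of the same name that can also stratify the split by label.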
The next step is to feed your training set to the model. After training comes the performance evaluation on the test set. To evaluate the model, compute metrics such as accuracy, precision, recall, and F1 score. Here is a good blog post explaining how to get these metrics. You can then improve the results by fine-tuning the model’s parameters, revising the preprocessing steps, adjusting the model’s hyper-parameters, adding features to the input data, or simply experimenting with entirely different models.
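For reference, the four metrics can be computed directly from the true and predicted labels. The sketch below treats one label at a time as the "positive" class; the function name and the returned dictionary layout are assumptions for illustration.

```python
def classification_metrics(y_true, y_pred, positive_label):
    """Compute overall accuracy, plus precision, recall, and F1 for one label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    # True positives, false positives, and false negatives for the chosen label.
    tp = sum(1 for t, p in pairs if t == positive_label and p == positive_label)
    fp = sum(1 for t, p in pairs if t != positive_label and p == positive_label)
    fn = sum(1 for t, p in pairs if t == positive_label and p != positive_label)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

Running it once per label gives a per-category view, which is often more informative than a single accuracy number when some categories are rare.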
Building an ML model that performs well is a complex development process that requires knowledge, time, and resources. But it’s worth it, because that is the tool that can get you closer to the 100% coverage goal.
While regexes are a great way to build an MVP, get first classification results faster, or label data, they are quickly limited by their known-pattern lists. On the other hand, the ML approach is time-consuming and computationally intensive, but once in place it can provide higher accuracy.
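One possible way to combine the two approaches is to trust the known patterns first and fall back to the model only when nothing matches. The sketch below is an assumption about how that could be wired, not a prescribed design: `KNOWN_PATTERNS`, `hybrid_classify`, the `model_predict` callable, and the `min_confidence` threshold are all hypothetical.

```python
import re

# Hypothetical known-pattern list for the food example.
KNOWN_PATTERNS = [
    ("liquid/soda", re.compile(r"\b(soda|cola)\b", re.IGNORECASE)),
    ("liquid/juice", re.compile(r"\bjuice\b", re.IGNORECASE)),
]


def hybrid_classify(text, model_predict, min_confidence=0.8):
    """Try the regex list first, then fall back to a model prediction.

    model_predict is a hypothetical callable that takes a text and returns
    a (label, confidence) pair. The result is (label, source), where source
    records which path produced the label.
    """
    for label, pattern in KNOWN_PATTERNS:
        if pattern.search(text):
            return label, "regex"
    label, confidence = model_predict(text)
    if confidence >= min_confidence:
        return label, "model"
    # Low-confidence items are left unlabeled for manual review.
    return None, "unlabeled"
```

Recording the source of each label makes it possible to measure how much of the data each approach covers, and the unlabeled remainder is a natural queue for extending either the pattern list or the training set.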
With Khosla Ventures thinking that NLP is the most important technology trend of the next five years and the market expected to grow from today’s $14 million to surpass $49 billion by 2032, we can expect the NLP approach to be increasingly easy to use.