Tech Talk: Document classification and prediction using an integrated application (Part 1)
Text classification is one of the fundamental tasks in Natural Language Processing: the process of assigning text strings or documents to different categories depending on their contents. Text or document classification has many applications, such as detecting the research area of a scientific paper, classifying emails, sorting blog posts into categories, automatically tagging customer queries, etc.
The project we illustrate here is an application that classifies many types of documents using Deep Learning/Machine Learning Natural Language Processing (NLP) models. In a second stage, the application predicts the class of newly arriving unlabeled documents. To optimize workflow, processing, and usage, we conceived the project as two separate Flask RESTful APIs coupled with a user interface (built with HTML/CSS and JavaScript); a minimal sketch of a prediction endpoint follows the two descriptions below:
Document classification and prediction application: this app contains the entire project workflow, including the training of the classification models on a labeled document dataset and, in a second phase, the prediction of the class of newly arriving unlabeled documents based on the previously trained model. The app can run on a local machine or on a virtual machine with GPU capabilities (GPUs are required for training the models).
Document prediction application: the second app contains only the document prediction workflow (including multi-label prediction), and we created it as a web app able to run anywhere. For the app to run, the specifications of the previously trained model (e.g., the model's weights and biases) need to be uploaded.
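To make the two-app architecture concrete, here is a minimal sketch of what the prediction endpoint of the second app could look like. The route name, the request payload, and the `model` handle are hypothetical illustrations, not the project's actual API:

from flask import Flask, request, jsonify

app = Flask(__name__)
model = None  # the previously trained model, loaded once at startup (hypothetical)

@app.route("/predict", methods=["POST"])
def predict():
    # The client posts a document's text; the app returns the predicted label.
    payload = request.get_json(silent=True) or {}
    text = payload.get("text", "")
    # A real implementation would return the trained model's prediction(s).
    label = model.predict([text])[0] if model is not None else "unknown"
    return jsonify({"label": label})

if __name__ == "__main__":
    app.run()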
Note: the dataset presented to the system should be a directory containing all the documents, with subdirectories named after the labels. In the case of multi-label document classification (i.e., when a document can belong to more than one category), you should duplicate multi-labeled documents in all the subdirectories they belong to.
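As a sketch of how such a layout can be consumed, a loader along these lines would turn the directory tree into (document, label) pairs; the paths and function name below are illustrative only:

from pathlib import Path

def collect_documents(dataset_dir):
    """Collect (filepath, label) pairs from a tree whose subdirectory names are the labels."""
    samples = []
    for label_dir in Path(dataset_dir).iterdir():
        if label_dir.is_dir():
            for doc in label_dir.iterdir():
                # A document duplicated in several subdirectories simply
                # appears once per label (the multi-label case).
                samples.append((doc, label_dir.name))
    return samples

# e.g. collect_documents("dataset/") with dataset/culture/*.pdf, dataset/environment/*.pdf, ...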
Project features
We will outline the project's five key features:
1. Accommodate a wide variety of documents: native document formats as well as scanned documents/images, handled with OCR (see the sketch after this list);
2. Accommodate a wide variety of datasets: 'one-size-fits-all' models able to deal with binary, multiclass, and multi-label classification;
3. Serve both novice and expert users: a user-friendly interface, combined with the technical information needed for improvement (e.g., performance metrics, confusion matrix, ROC curve, etc.);
4. Perform well: following good/state-of-the-art practices (e.g., the BERT deep learning transformer model);
5. Run everywhere: partially deployed in the cloud (the second app, with the prediction workflow).
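For feature 1, a minimal sketch of the OCR step could rely on Tesseract via pytesseract, as below; this is an assumption for illustration, not necessarily the OCR engine the project uses:

from PIL import Image
import pytesseract

def ocr_page(image_path):
    # Tesseract converts the scanned page image into plain text, which can
    # then go through the same cleaning/vectorizing pipeline as native documents.
    return pytesseract.image_to_string(Image.open(image_path))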
Choosing the right models for document classification
One of the biggest challenges of the project was to construct classification models that could accommodate a large variety of dataset types (binary, multiclass, and multi-label classification) while providing good performance even in realistic, resource-limited circumstances. Indeed, not all potential customers will have huge datasets of high-quality documents with which to train the models and subsequently serve good predictions.
At the same time, we made an effort to submit high-quality data to the models and to use the text cleaning/preprocessing methods most appropriate for the specific task of classifying documents. NLP covers a wide variety of tasks (translation, text completion, topic modeling, summarization, Named-Entity Recognition, etc.), and not all text cleaning/preprocessing methods suit all of them. For example, in the context of the current project, Part-of-Speech tagging was not relevant, as document classification does not rely on word order or sentence structure. Moreover, in some cases, text cleaning/preprocessing methods should be tailored to both the type of task and the specific model used.
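As an illustration, a classic cleaning pipeline for the non-BERT models could look like the sketch below; the stopword list and regular expression are toy assumptions, not the project's exact preprocessing:

import re

STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in"}  # toy subset

def clean_for_classic_models(text):
    text = text.lower()                    # casing carries no class information
    text = re.sub(r"[^a-z\s]", " ", text)  # drop punctuation and digits
    tokens = [t for t in text.split() if t not in STOPWORDS]
    # Note: no POS tagging here, since document classification does not
    # rely on word order or sentence structure.
    return " ".join(tokens)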
Document classification model options
Some recent pre-trained text models (such as BERT, RoBERTa, or XLNet) come with their own tokenizer/vectorizer, so using a standard vectorizer (such as TF-IDF) with them is not relevant. In the same vein, stemming is not applicable when using BERT (BertForSequenceClassification): BERT's vocabulary is composed of real words (and sub-words) that won't match a substantial share of stemmed words.
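To make this concrete, here is a minimal sketch of BERT's built-in tokenizer via the Hugging Face transformers library; the checkpoint name and parameters are assumptions for illustration:

from transformers import BertTokenizer

# BERT ships with its own WordPiece tokenizer, which replaces a TF-IDF
# style vectorizer entirely.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoding = tokenizer("Classifying EU legislation documents",
                     truncation=True, padding="max_length", max_length=32)
# A stemmed form such as "classifi" would not match BERT's vocabulary of
# real words and sub-words, which is why stemming is skipped for this model.
print(encoding["input_ids"][:10])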
With all of this in mind, we browsed the literature for best practices and state-of-the-art text classification and cleaning methods. In the preliminary phase of the project, we pre-tested several promising candidate models from both machine learning frameworks (Gaussian Naive Bayes, Support Vector Machine, XGBoost) and deep learning frameworks (BERT, RoBERTa, XLNet). After this pre-test phase, we retained two models, for the following reasons:
The deep learning BERT (BertForSequenceClassification) model provides the best performance among all tested models and does not tend to overfit even with an increasing number of epochs (as opposed to RoBERTa and XLNet);
The machine learning XGBoost (XGBClassifier) model provides excellent performance and is much less demanding than BERT in computing capabilities, and therefore runs much faster (see the sketch below).
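As a sketch of the lighter XGBoost path, a TF-IDF vectorizer can be paired with XGBClassifier in a scikit-learn pipeline; the hyperparameters below are illustrative assumptions, not the project's tuned setup:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from xgboost import XGBClassifier

# TF-IDF features feed the gradient-boosted tree classifier directly.
clf = make_pipeline(
    TfidfVectorizer(max_features=20000),
    XGBClassifier(n_estimators=300, max_depth=6, learning_rate=0.1),
)
# texts: list of cleaned document strings; y: integer-encoded labels
# clf.fit(texts, y); clf.predict(new_texts)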
Confirming model choice
To confirm our choice, we tested these two models on various types of datasets. For example, trained on the DBpedia dataset (342,782 Wikipedia articles with hierarchical categories), BERT provided an f1-score of 99% on the first-level classes (9 categories) and of 96% on the second-level classes (70 categories).
Keeping in mind the need to deal with a realistic business context, we also tested the models on two PDF datasets of reduced size (one multiclass, the other multi-label), built specifically for the project from documents of the European Parliament think tank. The documents concern EU legislation and are sorted into various fields (e.g., culture, environment, economic and monetary issues).
We intentionally built datasets of reduced size to challenge the models' performance. The multiclass dataset was composed of 261 PDFs; the multi-label dataset had 313 PDFs, 33 of which were multi-labeled (6 categories and fewer than 60 documents per category). Trained on the multiclass dataset, the BERT model provides an f1-score of 97%, while the XGBoost model gives an f1-score of 92%.
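For reference, such f1-scores can be computed on a held-out test split with scikit-learn, as sketched below with toy labels; the averaging method is an assumption here, since the article does not specify it:

from sklearn.metrics import f1_score

y_true = [0, 1, 2, 2, 1, 0]  # toy true labels for the test split
y_pred = [0, 1, 2, 1, 1, 0]  # toy predicted labels
print(f1_score(y_true, y_pred, average="weighted"))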
Up next...
In part 2 of this article on document classification, we cover the workflow design, the Python implementation, navigation through the web app's user interface, and potential applications.