Tech Talk: How to automate email classification with AWS

A large organisation with many clients, such as a bank, an insurer, a mutual health fund, or a public administration, receives a vast amount of email from its customers every day. These emails are usually sent to a general contact address and then classified manually so they can be routed to and processed by the competent department. With this project, our client set out to find a smarter way forward with automated email classification.

Manual classification is a bottleneck in the customer support flow: it does not matter how quickly the departments handle customer requests if it takes ages for those requests to reach them. Automated email classification can therefore increase customer support efficiency and, as a result, customer satisfaction.

This article covers how we implemented an email classification model for one of our clients and put it into production with Amazon Web Services (AWS). The objective was to classify incoming emails into two categories. The challenge was that there was no historical classification of emails. However, there was a history of 1.5 million classified comments from the phone service center, and these categories could be used as a proxy for the email labels.

How was the model trained?

We decided to use Naïve Bayes for this text classification task because it is fast to make predictions and works well when trained on a lot of data.

Before feeding input text into Naïve Bayes, we must first transform it into numerical data. We used the term frequency-inverse document frequency (TF-IDF) method for this. To avoid putting too much weight on common French words, the TF-IDF vectorizer receives a list of French stop words (from the spaCy library) that it should ignore.
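
A minimal sketch of this featurization and classification pipeline, assuming scikit-learn and spaCy's French stop-word list (the training calls in the comments are illustrative):

```python
from spacy.lang.fr.stop_words import STOP_WORDS
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# TF-IDF turns raw text into numerical features; the French stop words
# from spaCy are excluded so common words do not dominate the weights.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words=list(STOP_WORDS))),
    ("nb", MultinomialNB()),
])

# The training data would be the classified phone service center comments:
# pipeline.fit(comments, labels)
# predictions = pipeline.predict(emails)
```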

Next, we used randomized search with cross-validation to find the best parameters for the pipeline made of the TF-IDF vectorizer followed by the Naïve Bayes classifier. The accuracy on emails was around 80%, even though the model was trained on comments from the phone service center: it learned the sector-specific jargon from these comments and transferred that knowledge to the emails.
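
The hyperparameter search could look like the sketch below; the parameter ranges are placeholders for illustration, not the values used in the project:

```python
from sklearn.model_selection import RandomizedSearchCV

# `pipeline` is the TF-IDF + Naive Bayes pipeline from the previous sketch.
param_distributions = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "tfidf__min_df": [1, 2, 5],
    "nb__alpha": [0.01, 0.1, 0.5, 1.0],
}

search = RandomizedSearchCV(
    pipeline,
    param_distributions=param_distributions,
    n_iter=10,
    cv=5,
    scoring="accuracy",
    n_jobs=-1,
)
# search.fit(comments, labels)
# best_model = search.best_estimator_
```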

How to put the model into production in AWS?

AWS provides all the resources to run a model as a stack. This single unit can include the networking rules, security rules, databases, and compute resources needed to run your model. It is best practice to use one stack per use case/model. The following schema shows the architecture:

The architecture of AWS for automated email classification

In short, the process is the following:

  1. The data engineer puts the emails to be classified into the input bucket as a .parquet file. Alternatively, we can test the stack with a sample email generated by the lambda trial.

  2. An S3 event is triggered by this new file and sends the file's metadata to the lambda trigger.

  3. The lambda trigger gets the asset name using the file name. It then asks the corresponding asset stack to process the data in the input bucket.

  4. The asset stack processes and outputs the predictions in the output bucket.

  5. The data engineer will then load this output into the Enterprise Data Warehouse.

The main stack includes the lambda trial, the lambda trigger, the input/output bucket, the networking, and compute resources.

Deep dive into an AWS architecture for automated email classification

Let’s review how these components are built and how they relate to each other! The components are described in order of execution: the first element described is the first one executed by the code.

App.py for automated email classification

The execution begins with a script (called app.py) whose purpose is to call the stacks. It first sets up the base configuration used to build the environment and the tags for each stack. The base configuration is a dictionary that AWS uses to identify the application:

  • Stage: The stage of development (dev, test, prod)

  • Region

  • App name

  • Account id

  • Account name

app.py then calls the MainStack, which sets up the resources used by the other stacks.
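
A hedged sketch of what app.py could look like with the AWS CDK (v2) for Python; the module layout, the MainStack signature, and all the configuration values below are assumptions for illustration:

```python
import aws_cdk as cdk

from stacks.main_stack import MainStack  # assumed module layout

# Base configuration: a dictionary AWS uses to identify the application
base_config = {
    "stage": "dev",                      # dev, test or prod
    "region": "eu-west-1",               # placeholder region
    "app_name": "email-classification",  # placeholder app name
    "account_id": "123456789012",        # placeholder account id
    "account_name": "example-account",   # placeholder account name
}

app = cdk.App()

main_stack = MainStack(
    app,
    f"{base_config['app_name']}-main-{base_config['stage']}",
    env=cdk.Environment(account=base_config["account_id"],
                        region=base_config["region"]),
    config=base_config,
)

# Tag every resource of the stack so it can be traced back to the application
for key, value in base_config.items():
    cdk.Tags.of(main_stack).add(key, str(value))

app.synth()
```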

MainStack for automated email classification

The MainStack sets up the following elements:

  • Input, output, and config bucket

  • Virtual private cloud (VPC) and subnets

  • Compute resources

  • Job queue

  • Lambda trial and Lambda triggers

The buckets are named using the account id, the app name, and the stage, which means that a separate set of buckets is created for each stage defined. It is better to separate the development, test, and production stages as much as possible.

The compute resources can define the minimum and maximum number of CPUs we wish to allocate to the stacks.

The job queue uses the VPC, the subnets, and the compute resources initialized just before it. Its purpose is to queue jobs and execute the job with the highest priority first.
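
The sketch below illustrates how the MainStack could define these resources with the CDK for Python; the bucket naming convention and construct names are assumptions, and the Batch compute environment, job queue, and Lambdas are only indicated in comments because their exact constructs depend on the CDK version:

```python
import aws_cdk as cdk
from aws_cdk import aws_ec2 as ec2, aws_s3 as s3
from constructs import Construct


class MainStack(cdk.Stack):
    def __init__(self, scope: Construct, construct_id: str, *, config: dict, **kwargs):
        super().__init__(scope, construct_id, **kwargs)

        # One set of buckets per stage, named from the account id, app name, and stage
        prefix = f"{config['account_id']}-{config['app_name']}"
        self.input_bucket = s3.Bucket(
            self, "InputBucket",
            bucket_name=f"{prefix}-input-{config['stage']}")
        self.output_bucket = s3.Bucket(
            self, "OutputBucket",
            bucket_name=f"{prefix}-output-{config['stage']}")
        self.config_bucket = s3.Bucket(
            self, "ConfigBucket",
            bucket_name=f"{prefix}-config-{config['stage']}")

        # Networking: a VPC and subnets for the compute resources
        self.vpc = ec2.Vpc(self, "Vpc", max_azs=2)

        # The Batch compute environment (with its minimum and maximum vCPUs),
        # the job queue, and the trial/trigger Lambdas would be defined here;
        # they are omitted because the exact API depends on the CDK version.
```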

The lambda trial creates a test parquet file in the input bucket to test the stacks. The lambda trigger then submits a job using the file key name and the bucket taken from the event. Moreover, the main stack must also attach a role policy to each lambda and register the input bucket as an event source for the lambda trigger.

Additionally, it is essential to grant bucket permissions to the lambda functions. Otherwise, they cannot perform actions on the buckets.
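
At runtime, the lambda trigger's handler could look like the following sketch; the environment variable names and the asset-name convention are assumptions:

```python
import os
import urllib.parse

import boto3

batch = boto3.client("batch")


def handler(event, context):
    # The S3 event carries the bucket name and the key of the new parquet file
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

    # Assumed convention: the first part of the key is the asset name, e.g. "email/..."
    asset_name = key.split("/")[0]

    # Submit a Batch job to the corresponding asset stack's job definition
    batch.submit_job(
        jobName=f"{asset_name}-classification",
        jobQueue=os.environ["JOB_QUEUE"],
        jobDefinition=os.environ["JOB_DEFINITION_" + asset_name.upper()],
        containerOverrides={
            "environment": [
                {"name": "INPUT_BUCKET", "value": bucket},
                {"name": "FILE_KEY", "value": key},
            ]
        },
    )
```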

Email stack resources and steps for automated email classification

A Docker image 'containerizes' each stack. We define the following resources within the stack class:

  • Docker image asset

  • Log configuration: It ensures that the logs are sent to CloudWatch.

  • Job role: It grants the job permission to access other AWS resources.

  • Job definition container: a wrapper around the Docker image asset.

  • Job definition: It defines the job using the job definition container, the role, and the log configuration.
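
As an illustration, these resources could be declared inside the stack class roughly as follows (CDK for Python); the construct names, resource sizes, and the local Docker directory are assumptions, and the job definition uses the low-level CloudFormation construct:

```python
import aws_cdk as cdk
from aws_cdk import aws_batch as batch, aws_ecr_assets as ecr_assets, aws_iam as iam
from constructs import Construct


class EmailStack(cdk.Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs):
        super().__init__(scope, construct_id, **kwargs)

        # Docker image asset: built from a local directory and pushed at deploy time
        image = ecr_assets.DockerImageAsset(self, "EmailImage", directory="docker/email")

        # Job role: grants the running container access to other AWS resources
        job_role = iam.Role(
            self, "JobRole",
            assumed_by=iam.ServicePrincipal("ecs-tasks.amazonaws.com"))

        # Job definition: wraps the container image, the role, and the log configuration
        batch.CfnJobDefinition(
            self, "EmailJobDefinition",
            type="container",
            container_properties=batch.CfnJobDefinition.ContainerPropertiesProperty(
                image=image.image_uri,
                job_role_arn=job_role.role_arn,
                vcpus=2,       # illustrative sizing
                memory=4096,   # illustrative sizing (MiB)
                log_configuration=batch.CfnJobDefinition.LogConfigurationProperty(
                    log_driver="awslogs"),  # logs go to CloudWatch
            ),
        )
```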

The Docker image contains the scripts executed within the container. We can summarize the stack's processing in the following steps:

  1. Create a context dictionary made of the environment and configuration information

  2. Get the input data from the input bucket

  3. Get the model from its bucket

  4. Check whether the input data has the correct format and columns

  5. Use the model to predict the categories of the input data

  6. Create a data frame with the predicted categories of the input data

  7. Upload the output data frame to the output bucket

More on the email stack configuration

All stacks start by fetching the environment variables and the configuration variables and putting them into a dictionary that is passed between the functions of the stack. The environment variables contain the input bucket, the output bucket, and the file key. The config variables hold static information, such as the model's key or the model name. To retrieve data from a bucket in AWS using Python, we use the boto3 library.
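
A minimal sketch of this configuration step, assuming pandas and boto3; the environment variable names and the config file key are assumptions:

```python
import io
import json
import os

import boto3
import pandas as pd

s3 = boto3.client("s3")


def build_context() -> dict:
    """Gather environment variables and static configuration in a single dictionary."""
    context = {
        "input_bucket": os.environ["INPUT_BUCKET"],
        "output_bucket": os.environ["OUTPUT_BUCKET"],
        "file_key": os.environ["FILE_KEY"],
    }
    # Static information (e.g. the model's key and name) read from the config bucket
    config_obj = s3.get_object(Bucket=os.environ["CONFIG_BUCKET"], Key="email/config.json")
    context.update(json.loads(config_obj["Body"].read()))
    return context


def load_input(context: dict) -> pd.DataFrame:
    """Download the input parquet file and load it into a pandas DataFrame."""
    obj = s3.get_object(Bucket=context["input_bucket"], Key=context["file_key"])
    return pd.read_parquet(io.BytesIO(obj["Body"].read()))
```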

The input data consisted of around 1 million rows. Your EC2 (Elastic Compute Cloud) instance might raise a memory error because there is not enough memory to handle that amount of data at once. The solution is to slice your data into smaller chunks and then apply your preprocessing function to each chunk. To speed up preprocessing, we recommend using the vectorized operations provided by NumPy.
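
For illustration, the chunked prediction could look like the sketch below, assuming a pandas DataFrame with a hypothetical text column named body and the trained pipeline from the modelling section:

```python
import numpy as np
import pandas as pd


def predict_in_chunks(df: pd.DataFrame, model, chunk_size: int = 50_000) -> pd.DataFrame:
    """Predict categories chunk by chunk to keep memory usage bounded."""
    predictions = []
    for start in range(0, len(df), chunk_size):
        chunk = df.iloc[start:start + chunk_size]
        # Vectorized string operations are much faster than row-by-row Python loops
        texts = chunk["body"].fillna("").str.lower().to_numpy()
        predictions.append(model.predict(texts))
    result = df.copy()
    result["predicted_category"] = np.concatenate(predictions)
    return result
```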

Conclusion

Congrats! You now have the complete picture of deploying a classification model in AWS. In our example, the automated email classification model we put into production processes around 500,000 emails daily. Imagine what your company could do with that kind of capability.
