Efficient document processing for policymakers 

Context & Objectives

An international public organization lacked a centralized approach to monitoring its partners’ publications. This prevented them from efficiently processing the information they needed to research and draft reports and policies.

They wanted to scrape partner institutions' websites at different levels (e.g., international, local, and regional), collecting and classifying documents related to the topics that interest their policymakers. Obtaining precise scraping methods and natural language processing (NLP) algorithms was crucial for them to allocate resources effectively. 

We set out to gather publicly available, open data to support better-informed policy making for our client.

Approach

After we delivered a serverless, cost-efficient proof of concept (POC), the client asked us to develop a prototype for this pipeline. We worked in agile iterations directly in the client’s environment to ensure the solution fit their needs.

For each document added to the pipeline, we needed to extract a summary, a title, and keywords, and to classify the document against more than 40 interest areas. The documents share no common structure or format and arrive in any of the languages spoken in the European Union.
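To make these per-document outputs concrete, here is a minimal Python sketch of a record holding them, with a toy keyword-overlap classifier. Everything in it is an assumption for illustration: the names `ProcessedDocument`, `INTEREST_AREA_TERMS`, and `classify`, the two sample interest areas, and the matching rule are hypothetical, and the sketch stands in for the NLP algorithms used in the project rather than reproducing them.

```python
from dataclasses import dataclass, field


@dataclass
class ProcessedDocument:
    """Hypothetical record for the fields extracted from one document."""
    title: str
    summary: str
    keywords: list[str]
    language: str  # source language, e.g. an ISO 639-1 code such as "fr"
    interest_areas: list[str] = field(default_factory=list)


# Hypothetical interest areas and trigger terms; the real taxonomy has 40+ areas.
INTEREST_AREA_TERMS = {
    "energy": {"energy", "renewable", "grid"},
    "transport": {"transport", "rail", "mobility"},
}


def classify(keywords: list[str]) -> list[str]:
    """Return every interest area that shares at least one term with the keywords."""
    normalized = {keyword.lower() for keyword in keywords}
    return sorted(
        area for area, terms in INTEREST_AREA_TERMS.items() if normalized & terms
    )


doc = ProcessedDocument(
    title="Rail electrification outlook",
    summary="Overview of planned rail electrification projects.",
    keywords=["Rail", "Grid", "investment"],
    language="en",
)
# A document can belong to several interest areas at once.
doc.interest_areas = classify(doc.keywords)
```

A multi-label result is used here because a single document can plausibly touch several interest areas; whether the delivered classifier works this way is not stated in the case study.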

We deployed the solution in the client’s AWS environment, designing and provisioning the cloud infrastructure with Terraform for ease of maintenance and scalability. We also implemented an API to manage the organizations to be scraped and to add specific documents to the pipeline.
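The two responsibilities of that API can be sketched in plain Python without any web framework. This is an illustrative in-memory model under stated assumptions, not the delivered service: `Organization`, `ScrapeRegistry`, their methods, and the example URLs are all hypothetical names chosen for the sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Organization:
    """Hypothetical description of a partner institution to scrape."""
    name: str
    website: str
    level: str  # "international", "regional", or "local"


@dataclass
class ScrapeRegistry:
    """In-memory stand-in for the state the management API would expose."""
    organizations: dict[str, Organization] = field(default_factory=dict)
    queued_documents: list[str] = field(default_factory=list)

    def add_organization(self, org: Organization) -> None:
        # Adding an organization with an existing name replaces the old entry.
        self.organizations[org.name] = org

    def remove_organization(self, name: str) -> bool:
        # Returns True if the organization existed and was removed.
        return self.organizations.pop(name, None) is not None

    def include_document(self, url: str) -> None:
        # Queue a specific document for processing, ignoring duplicates.
        if url not in self.queued_documents:
            self.queued_documents.append(url)


registry = ScrapeRegistry()
registry.add_organization(
    Organization("Example Agency", "https://example.org", "regional")
)
registry.include_document("https://example.org/reports/annual.pdf")
```

In a deployed service these operations would sit behind HTTP endpoints and persist to one of the SQL databases mentioned in this case study; the sketch only shows the shape of the operations.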

We coded the prototype in Python, using Docker for containerization and SQL databases for storage. We translated key extracts from the documents into English so that they could be used efficiently regardless of the source language.

The main deliverables were:

  • Code to deploy infrastructure and perform scraping of documents and their NLP analysis.

  • Documentation on deploying the infrastructure as code (IaC) solution in the client’s environment.

  • Rounds of validation tests with the client and cloud security audits.

  • Knowledge sharing with the client’s team for complete ownership of all parts of the pipeline, and the capacity to extend it with new features.

Results

The document classification solution improves our client’s ability to find the documents and information that support policy making and decisions.

They cited the solution as bringing quality and speed, flexibility, security, and cost-effectiveness to their search and decision processes.
