Our client wanted to build a knowledge hub to monitor the appearance of counterfeit drug and brand terminology in publicly available criminal proceedings documents.
The company wanted to understand how the prevalence of its own brand terms was changing over time, and how that was linked to other descriptive attributes such as geography.
What we did
✔︎ Built web scrapers to capture raw text from a range of sites describing recent drug counterfeit activity
✔︎ Automated scripts to extract key entities using NLP techniques and ingest into a graph database (Neo4j)
✔︎ Delivered a web application using Django and D3.js where users can interactively explore the graph
How we did it
We built bespoke web scrapers in Python and orchestrated them with Airflow, running daily and writing the raw text into a database.
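As a minimal sketch of one such scraper, the following pulls paragraph text from a page using only the standard library. The source URL is hypothetical and the real scrapers were bespoke per site, with scheduling handled by an Airflow DAG rather than shown here.

```python
# Minimal scraper sketch: fetch a page and collect <p> text.
# SOURCE_URL is a placeholder, not a real target site.
from html.parser import HTMLParser
from urllib.request import urlopen

SOURCE_URL = "https://example.com/enforcement-notices"  # hypothetical

class ParagraphExtractor(HTMLParser):
    """Collects the text content of <p> tags."""
    def __init__(self):
        super().__init__()
        self._in_p = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self.paragraphs[-1] += data

def scrape(url: str = SOURCE_URL) -> list[str]:
    """Fetch a page and return its non-empty paragraph texts."""
    with urlopen(url) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    parser = ParagraphExtractor()
    parser.feed(html)
    return [p.strip() for p in parser.paragraphs if p.strip()]
```

In production each scraper would be a task in the daily Airflow pipeline, with its output persisted to the database for the NLP stage.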
Our NLP module cleans, tokenises and stems the data, then extracts sets of entities (e.g. people, places, key terms, drug names, dates) using Python libraries. These entities are ingested into a Neo4j graph database as nodes, linked whenever they appear in the same document and weighted by their distance apart. Ingestion also matches incoming entities against existing nodes and updates them with new information.
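The distance-weighted linking step can be sketched in pure Python. The entity extraction itself relied on NLP libraries (which ones is not stated above, so treat the inputs here as given); the Cypher `MERGE` pattern shows how upserting matches incoming entities against existing nodes, with parameter names chosen for illustration.

```python
# Build distance-weighted co-occurrence edges from one document's
# extracted entities, given as (name, token_position) pairs.
from itertools import combinations

def cooccurrence_edges(entities):
    """Return {(a, b): weight}; weight decays with token distance."""
    edges = {}
    for (a, pos_a), (b, pos_b) in combinations(entities, 2):
        if a == b:
            continue  # don't link an entity to itself
        key = tuple(sorted((a, b)))
        # Closer mentions contribute more weight.
        edges[key] = edges.get(key, 0.0) + 1.0 / (1 + abs(pos_a - pos_b))
    return edges

# Cypher upsert: MERGE creates nodes/relationships only if absent,
# which is how incoming entities are matched with existing nodes
# and their link weights updated.
UPSERT_EDGE = """
MERGE (a:Entity {name: $a})
MERGE (b:Entity {name: $b})
MERGE (a)-[r:CO_OCCURS]->(b)
ON CREATE SET r.weight = $w
ON MATCH  SET r.weight = r.weight + $w
"""
```

Each document's edge dictionary would be replayed through the `UPSERT_EDGE` query via the Neo4j Python driver during the daily ingest.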
The graph database was surfaced in a web application built using the Django framework. The core interactive visualisation (built using D3.js) allows the user to zoom, pan, filter and recolour nodes and explore the web of activity across a given time frame. Users can search the graph for terms and quickly identify which nodes have been added or updated from the previous day.
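The hand-off between the back end and the D3.js visualisation can be sketched as a serialiser producing node-link JSON, the conventional input shape for a D3 force layout. The function and field names here are illustrative assumptions, and the Neo4j query layer is omitted.

```python
# Serialise graph data for the D3.js front end as node-link JSON.
import json

def graph_to_json(nodes, edges):
    """nodes: {name: {attribute: value, ...}}
    edges: {(a, b): weight}
    Returns a JSON string in D3 force-layout node-link form."""
    return json.dumps({
        "nodes": [{"id": name, **attrs} for name, attrs in nodes.items()],
        "links": [{"source": a, "target": b, "weight": w}
                  for (a, b), w in edges.items()],
    })
```

In the application this would sit behind a Django view (e.g. returning a `JsonResponse`) that the interactive visualisation fetches for a given time frame before rendering, filtering and recolouring the nodes client-side.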
Start a conversation
Take the first step by speaking with one of our data experts today.