tech4pets is a not-for-profit company that works with charities to improve animal welfare using technology solutions and big data. Whilst tech4pets had succeeded in collecting vast quantities of data, they requested the help of GlobalLogic to gain a better understanding of the movement of animals across the UK.

This project provided a great opportunity for our Data Science team to get hands-on with an interesting dataset, whilst benefiting a heart-warming cause.

The challenge

During the global pandemic, demand for pets boomed: the number of adverts doubled, and the average price more than doubled at its peak. This surge led to a rise in problem sellers, fraud, theft, tax evasion, smuggling, organised crime and animal abuse, as pets became an increasingly lucrative cash commodity.

With the majority of advertisements being posted online, tech4pets had been using graph databases and algorithms to identify entities of interest and were working with various organisations to tackle problem selling. tech4pets sought the help of GlobalLogic to develop a Breed Classifier: a tool that would label millions of unclassified adverts and so build a much clearer picture of the flow of animals online.

The solution

tech4pets began by providing a sample of their data – over a million rows of labelled and unlabelled classified adverts with text descriptions and more. Using Natural Language Processing (NLP), we extracted insights from the text description and built a classification model to predict the dog, cat, rabbit, or horse breed given these insights.

The data was supplied in AWS S3. A Glue crawler was created and pointed at an S3 bucket, and it returned a single database table combining all the fragmented datasets. This table was accessible in Athena, where we ran SQL queries and created further refined tables as needed.

Once the data was accessible in SageMaker, GlobalLogic began an initial exploratory data analysis. Using PyAthena and pandas, we gained an overview of the data, its columns and the sparsity of information. We then hypothesised that a model trained on the ‘description’ field of the labelled adverts could predict the breed of the unlabelled adverts, making the unlabelled data useful for tech4pets’ further analyses.
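The sparsity check described above can be sketched in a few lines. This is a minimal illustration only: the project used PyAthena and pandas against the real Athena table, whereas this sketch uses plain Python and a handful of made-up rows so that it runs anywhere. The column names (advert_id, species, breed, description) and the helper names are assumptions, not the actual schema.

```python
from collections import Counter

# Hypothetical sample rows; the real table schema is not given in this article.
adverts = [
    {"advert_id": 1, "species": "dog", "breed": "Labrador Retriever",
     "description": "Lovely lab puppies"},
    {"advert_id": 2, "species": "dog", "breed": None,
     "description": "Cockapoo pups ready now"},
    {"advert_id": 3, "species": "cat", "breed": None, "description": ""},
    {"advert_id": 4, "species": "rabbit", "breed": "Mini Lop",
     "description": "Mini lop bunnies"},
]


def is_missing(value):
    """Treat None and blank strings as missing."""
    return value is None or (isinstance(value, str) and not value.strip())


def column_sparsity(rows):
    """Return the fraction of missing values for each column."""
    columns = sorted({key for row in rows for key in row})
    missing = Counter()
    for row in rows:
        for col in columns:
            if is_missing(row.get(col)):
                missing[col] += 1
    return {col: missing[col] / len(rows) for col in columns}


sparsity = column_sparsity(adverts)

# Split the sample into rows that can train a model and rows that need a label.
labelled = [row for row in adverts if not is_missing(row["breed"])]
unlabelled = [row for row in adverts if is_missing(row["breed"])]
```

With a pandas DataFrame, the same per-column overview is available as df.isna().mean(), although that call does not count blank strings as missing.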

To get the best results from ML models on text, we cleaned the descriptions by removing symbols, punctuation, (overused) emojis and stop words, and by converting everything to lowercase. This made the text easier for the model to interpret accurately.
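A cleaning step of this kind might look like the sketch below. It is an illustration under stated assumptions, not the project's actual code: the stop-word list is a tiny hypothetical sample, the function name clean_description is invented for this example, and keeping only ASCII letters and digits is a simplification that would also strip accented characters.

```python
import re

# Tiny illustrative stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"a", "an", "and", "are", "for", "is", "of", "the", "to", "with"}


def clean_description(text):
    """Lowercase the text, strip punctuation/symbols/emojis, drop stop words."""
    text = text.lower()
    # Replace anything that is not a lowercase ASCII letter, digit or
    # whitespace with a space. This removes punctuation, currency symbols
    # and emojis in a single pass.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    return " ".join(tokens)


example = clean_description(
    "Gorgeous LAB puppies!!! 🐶🐶 Ready to leave, £800 & the mum is KC-registered."
)
# example == "gorgeous lab puppies ready leave 800 mum kc registered"
```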

Initially, an NLP technique called Term Frequency-Inverse Document Frequency (TF-IDF) was used in conjunction with an XGBoost classification model. TF-IDF weights each word by how often it appears in a description, discounted by how common it is across all descriptions. This combination was quick to set up in SageMaker and made it easy to test the viability of the hypothesis in the initial modelling phase.
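To show what TF-IDF does to advert text, here is a minimal pure-Python sketch using the textbook formula tf × log(N / df). It covers the weighting step only and is not the project's pipeline: the three descriptions are made up, the function name tfidf is an assumption, and library implementations typically use smoothed variants of the formula. In the project, features like these were passed to an XGBoost classifier.

```python
import math
from collections import Counter


def tfidf(documents):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenised = [doc.split() for doc in documents]
    n_docs = len(tokenised)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    weights = []
    for tokens in tokenised:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid dividing by zero on empty text
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Made-up, already-cleaned descriptions.
docs = [
    "lab puppies for sale",
    "cockapoo puppies for sale",
    "persian kittens for sale",
]
weights = tfidf(docs)
top_term = max(weights[0], key=weights[0].get)
```

In this toy corpus, words that appear in every advert ("for", "sale") receive a weight of zero, while the breed word "lab" receives the highest weight in the first description. That is why TF-IDF features suit breed prediction: the words that distinguish one advert from another are the ones that carry weight.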

Using this approach, GlobalLogic achieved 65% accuracy in predicting dog breeds from the description. The model learned the variations of breed names in use (e.g. lab = Labrador Retriever) and which words corresponded to the advert’s breed, even when multiple breeds were mentioned.

In an effort to improve results, our Data Science team used a more powerful ML model template built on Long Short-Term Memory (LSTM) networks, a type of recurrent neural network that learns patterns in word order within sentences (the same capability that lets such networks predict which word comes next).
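The sensitivity to word order can be seen in a toy LSTM cell. The sketch below is a pure-Python illustration with random, untrained weights and made-up two-dimensional word vectors. It is not the team's template, which would have used a deep learning framework, trained parameters and a classification layer on top. All names here are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def make_weights(input_size, hidden_size, seed=0):
    """Random toy parameters for the input, forget, output and candidate gates."""
    rng = random.Random(seed)

    def matrix(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    return {
        gate: {
            "W": matrix(hidden_size, input_size),   # input-to-hidden weights
            "U": matrix(hidden_size, hidden_size),  # hidden-to-hidden weights
            "b": [0.0] * hidden_size,
        }
        for gate in ("i", "f", "o", "g")
    }


def preactivation(params, x, h):
    """Compute W @ x + U @ h + b for one gate."""
    return [
        sum(w * xi for w, xi in zip(w_row, x))
        + sum(u * hi for u, hi in zip(u_row, h))
        + b
        for w_row, u_row, b in zip(params["W"], params["U"], params["b"])
    ]


def lstm_step(weights, x, h, c):
    """One LSTM time step: return the new hidden state and cell state."""
    i = [sigmoid(v) for v in preactivation(weights["i"], x, h)]    # input gate
    f = [sigmoid(v) for v in preactivation(weights["f"], x, h)]    # forget gate
    o = [sigmoid(v) for v in preactivation(weights["o"], x, h)]    # output gate
    g = [math.tanh(v) for v in preactivation(weights["g"], x, h)]  # candidate
    c_new = [fv * cv + iv * gv for fv, cv, iv, gv in zip(f, c, i, g)]
    h_new = [ov * math.tanh(cv) for ov, cv in zip(o, c_new)]
    return h_new, c_new


def encode(weights, words, vectors, hidden_size):
    """Feed the words through the cell in order and return the final hidden state."""
    h = [0.0] * hidden_size
    c = [0.0] * hidden_size
    for word in words:
        h, c = lstm_step(weights, vectors[word], h, c)
    return h


# Made-up 2-D word vectors, purely for illustration.
VECTORS = {"labrador": [1.0, 0.0], "cross": [0.0, 1.0], "poodle": [0.5, 0.5]}
HIDDEN = 3
weights = make_weights(input_size=2, hidden_size=HIDDEN)

forward = encode(weights, ["labrador", "cross", "poodle"], VECTORS, HIDDEN)
reverse = encode(weights, ["poodle", "cross", "labrador"], VECTORS, HIDDEN)
```

The same three words in a different order produce a different final state. A bag-of-words representation such as TF-IDF would treat the two sequences as identical.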

This template also uses GloVe, an unsupervised learning algorithm for obtaining vector representations of words (turning words into numbers). The vectors are trained on a large dataset and have ‘learned’ word meanings from the contexts in which words appear, e.g. king - man + woman ≈ queen. This was a big step up in complexity from the TF-IDF technique used previously, and potentially more accurate.
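The word-arithmetic example can be made concrete with a toy sketch. The two-dimensional vectors below are hand-made for illustration (one axis loosely standing for royalty, the other for gender) and are not real GloVe values. Real GloVe vectors have tens to hundreds of dimensions and are loaded from pretrained files. The function names are assumptions.

```python
import math

# Hand-made toy vectors; NOT real GloVe embeddings.
TOY_VECTORS = {
    "king": [0.9, 0.9],
    "queen": [0.9, -0.9],
    "man": [0.1, 0.9],
    "woman": [0.1, -0.9],
    "puppy": [-0.8, 0.1],
}


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = {w: v for w, v in vectors.items() if w not in {a, b, c}}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))


result = analogy("king", "man", "woman", TOY_VECTORS)
# result == "queen"
```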

Because the template was quick to implement, we were able to run these stronger models on the other animals as well, achieving improved accuracies of 78% for dogs, 81% for cats and 75% for rabbits.

What value did GlobalLogic bring?

Our Data Science team achieved the initial goal, proving we can predict the breed of animals from a subset of historical unlabelled data. This result is now helping tech4pets enrich their data and get a better understanding of the movement of animals across the UK.

Co-founder of tech4pets, Keith Hinde, discusses the challenges the organisation faces and the takeaways from GlobalLogic’s involvement in this project:

“Some of the most problematic sources of data we deal with at tech4pets are those which neither solicit nor provide breed information about the pets advertised. Unfortunately, those same sources are often the most problematic from both animal and consumer welfare perspectives. With the great work that the team has done, we now have access to an efficient and accurate breed classifier. This significantly improves the depth and quality of analysis we can provide to clients and therefore tangibly helps the pets they seek to protect”.

Interested to learn more about the project?

Listen to the Tech3 Podcast, where Data Scientists Roger Zorlu and Sami Aslindi take you through the motivations for the project, its challenges and how they’re helping tech4pets monitor and protect pets throughout the UK.