PyData London 2022 highlights


Our UK-based Data Science team recently attended the PyData London 2022 Conference – a three-day event for the international community of users and developers of data analysis tools, designed for knowledge sharing.

Dr Sami Alsindi (Lead Data Scientist), Nikhil Modha (Senior Data Scientist) and Dr Ben Wilson (Data Scientist) discuss their favourite speakers, learnings and key highlights from the event.

 

Dr Sami Alsindi – Lead Data Scientist

I’ve had fantastic experiences in the past with PyData London, and this event was no different! It was great to be back in-person, network and reconnect with some attendees I hadn’t seen since the last in-person event three years ago.

This year, there were three streams of talks; the Friday sessions were more interactive and the weekend sessions more like presentations. As many of these talks ran simultaneously, we decided to split up to cover all bases, ensuring we didn’t miss any high-value talks (big thanks to GlobalLogic for sending three of us!).

For those who couldn’t attend, the talks were all recorded and will be shared on the PyData YouTube channel.

My favourite Friday session was delivered by Natan Mish – Senior Machine Learning Engineer at Zimmer Biomet, and focussed on Data Validation. The talk covered a number of libraries that were new to me but thankfully already known to my team.

During the session, the speakers shared example code and materials, which were a great way to get started with these technologies. pydantic for defining data models and validation, pandera for pandas DataFrame schema validation, and Great Expectations for data testing, quality assessment and profiling will now all be part of my repertoire.
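To give a flavour of what record-level data validation involves, here is a minimal hand-rolled sketch in plain Python. It is not the pydantic, pandera or Great Expectations API, and it is not taken from the talk; the `UserRecord` class, its fields and the `validate_rows` helper are hypothetical names chosen for illustration. Those libraries let you declare checks like these far more concisely.

```python
from dataclasses import dataclass


@dataclass
class UserRecord:
    """Hypothetical record, used only to illustrate validation."""
    user_id: str
    age: int

    def __post_init__(self):
        # Type checks: reject values of the wrong type before using them.
        if not isinstance(self.user_id, str) or not self.user_id:
            raise ValueError("user_id must be a non-empty string")
        if not isinstance(self.age, int) or isinstance(self.age, bool):
            raise ValueError("age must be an integer")
        # Range check: catch values that are well-typed but implausible.
        if not 0 <= self.age <= 130:
            raise ValueError("age must be between 0 and 130")


def validate_rows(rows):
    """Split raw dicts into valid records and (row index, message) errors."""
    valid, errors = [], []
    for index, row in enumerate(rows):
        try:
            valid.append(UserRecord(**row))
        except (TypeError, ValueError) as exc:
            # TypeError covers missing or unexpected keys in the raw row.
            errors.append((index, str(exc)))
    return valid, errors
```

Collecting errors instead of stopping at the first bad row makes it easier to report on the overall quality of a dataset in one pass.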

My favourite ‘nerdy’ talk was from James Powell – Board Chair at NumFOCUS. This was a lightning talk about how pandas, one of the main libraries Data Scientists use, especially for tabular data, has undocumented inconsistencies that, in some cases, can accidentally alter data! James is a seasoned and fantastic speaker who never fails to deliver a great talk.

There were two brilliant technical talks I attended. The first was delivered by Tomasz Bartczak – Machine Learning Engineer at Cydar Medical, and focussed on image segmentation and on mapping live 2D images during surgery to 3D pre-surgery scans. This was presented as really brilliant technology that fully utilises the cloud, with their reconstruction models saving 20 minutes in surgery.

The talk mainly focussed on the treatment of blood vessels, in particular the fitting of aortic stents to minimise the risk of aneurysm. This is a delicate surgery in a very ‘mission-critical’ part of the body! They used several different techniques, including mapping similar aortic anatomies across different patients; this allows surgeons to leverage the knowledge of all previous treatments across the entire dataset and gain experience in procedures they may not have done in the past. It also helps in cases where a patient has a rare complicating factor, as they can be matched to similar patients and to the strategies that helped mitigate it.

Most impressive to me, having played about with UK healthcare data in the past (and sometimes tried unsuccessfully to get such data or interface with NHS systems), was that they had managed to embed connections to their cloud-based models within the NHS systems, allowing them to retain these anatomical scans for future treatment recommendations. I asked about this at the end of the talk, and they simply mentioned their fantastic legal and regulatory teams, who have expertly navigated these hurdles.

A close second was the Data Validation talk delivered by Marysia Winkels – Data Scientist at GoDataDriven. This session offered a really interesting perspective, somewhat confirming a gut feeling I’ve had for a long time: almost all of the accuracy improvement in modelling comes from having good data, even a really small amount of really good data, rather than from the model itself. The pendulum between spending more time improving the ML model and spending more time improving, gathering or generating data has sat more in the ‘improve the model’ camp in the past, and may swing the other way in future.

However, my highlight speaker slot was delivered by Ian Ozsvald – Interim Chief Data Scientist Consultant at Mor Consulting and very prominent in the PyData space. He shared examples of projects gone wrong and what he learned from them. A lot of it resonated with my own past experiences, and it was pleasing that some of the mitigation strategies from his 20+ years of experience in the field aligned with mine as well.

A special mention also has to go to Juan Rodriguez – Data Scientist Advocate at Orchest, whose session has to be seen to be believed! Juan composed and performed a ‘dubstep rave’ live from scratch in only five minutes using FoxDot, ending with ‘God Save The Queen’. Absolutely brilliant! Check out some of it here.

This event provided some invaluable time, not just to get to know my growing team better, but to connect with other Data Science teams of all flavours, sharing ideas and discussing the challenges we’ve all faced. I’ve written a LinkedIn post about my experiences here.

An additional win is that the ticket sales have raised significant funds for NumFOCUS, the charity that supports the ongoing maintenance and development of the open-source libraries that tens of millions of people around the world use for free and without restriction.

Altogether we had a fantastic time at PyData London 2022; hoping to convince GlobalLogic to sponsor a stand at the next one!

 

Nikhil Modha – Senior Data Scientist

This was my first time attending PyData London and I’m pleased to say that it was an amazing experience. There were several interesting talks which spanned a range of topics from pure Machine Learning theory to MLOps in the cloud, giving a broad catalogue of talks poking at every area of Python and Data Science.

In particular, two of my favourite talks were ‘How to Stack Neural Networks together: Ideas and Applications’ by Pranjal Biyani – AI Scientist at Polymerize, and ‘Extreme Multilabel Classification in the Biomedical NLP domain’ by Nick Sorros – Data Scientist at MantisNLP.

Pranjal’s talk about stacking neural networks was particularly interesting to me since I often run into problems modelling mixed data to predict an outcome. For example, suppose you had to predict the price of a house and you had both tabular data (number of rooms, size in square feet, number of bathrooms etc.) and pictures of all the rooms in the house. It is very difficult to build a single model that can take both the images and the tabular data as input.

Pranjal explained in great detail how you could stack two models, for example one for the tabular data (a neural net) and one for the image data (a CNN), to get a solution that can handle both. These methods could have a direct impact on the work we are currently doing here at GlobalLogic; in particular, they could be used in the tech4pets project, where we have text descriptions of dogs as well as pictures.
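As a rough illustration of the idea, the sketch below shows one common way to combine two branches: each branch turns its input into a small feature vector, the vectors are concatenated, and a final layer makes the prediction. This is an assumption about the general shape of such a model, not the architecture from the talk. The `StackedModel` class and its helpers are hypothetical, the weights are random and untrained, and a plain dense layer stands in for the CNN. In practice you would build this in a deep learning framework and train the branches together.

```python
import random


def dense_relu(inputs, weights, biases):
    """One fully connected layer followed by a ReLU activation."""
    return [
        max(0.0, sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def random_layer(rng, n_in, n_out):
    """Create untrained weights and zero biases for a dense layer."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


class StackedModel:
    """Hypothetical two-branch model: tabular features plus image pixels."""

    def __init__(self, n_tabular, n_pixels, n_hidden=4, seed=0):
        rng = random.Random(seed)
        self.tabular_branch = random_layer(rng, n_tabular, n_hidden)
        # A dense layer over flattened pixels stands in for a real CNN here.
        self.image_branch = random_layer(rng, n_pixels, n_hidden)
        # The head sees the features of both branches side by side.
        self.head = random_layer(rng, 2 * n_hidden, 1)

    def predict(self, tabular, pixels):
        tabular_features = dense_relu(tabular, *self.tabular_branch)
        image_features = dense_relu(pixels, *self.image_branch)
        combined = tabular_features + image_features  # list concatenation
        head_weights, head_biases = self.head
        # Linear output (no activation), as for a regression target like price.
        return sum(w * x for w, x in zip(head_weights[0], combined)) + head_biases[0]
```

For example, `StackedModel(n_tabular=3, n_pixels=4).predict([3.0, 1.0, 0.9], [0.1, 0.5, 0.2, 0.7])` returns a single number; it is meaningless until the weights are trained, but it shows how both kinds of input flow into one prediction.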

Nick Sorros’ talk also addressed a problem frequently seen in ML projects: how to build a classifier that predicts across thousands of labels. Nick’s example use case was predicting the category labels for medical publications to be uploaded onto “PubMed”. He was tasked with building a classifier to predict over 29,000 labels. Nick went through the different processes and challenges he faced when trying to solve this problem, including disk space, memory, training time and inference time. Using techniques such as hierarchical label clustering, reduction of sparse matrices and binary trees, Nick managed to produce a final ‘BERTMesh’ model, which was an impressive 63% accurate at predicting 29,000 classes.

Although I only mentioned these two talks, there were many great talks at PyData – you can find the full schedule of talks here.

Luckily for us the majority of the talks were recorded and will be free to access on the PyData YouTube channel in due time. I would highly recommend attending the next PyData for anyone who is interested in Python or Data Science!

 

Dr Ben Wilson – Data Scientist

PyData is a community for developers and users of Python, Julia, and R. It was created to help share open-source tools and ideas with an audience that can benefit from the latest developments in open-source software.

PyData is growing and now runs many conferences all over the world each year. We are fortunate to have one of these yearly conferences, PyData London, run in central London just a short walk from the GlobalLogic offices, and I was lucky enough to attend this year.

This was my first PyData conference, and I was impressed by the scope of talks included in the programme; there really was something for everyone. Talks ranged from topics such as Fairness in Algorithmic Decision Making, given by Adrin Jalali – Machine Learning Engineer at Hugging Face, to more technical talks such as that given by Sam Morley – Research Software Engineer at the DataSig Project, who presented the use of the Signature Method for time series data.

Although there were many talks, I have chosen to focus on the first one mentioned, from Adrin. I found this talk incredibly thought-provoking. Adrin discussed how recommender systems, and ML in general, can generate undesirable and unfair assumptions and biases towards specific groups of people. The talk recapped some famous case studies, such as Google’s photo tagging app, which was known to mislabel specific groups of people in a racist manner.

Another was the case of COMPAS, a risk-assessment tool used in US courts, where sentencing was too dependent on socio-economic factors, leading to unfair sentencing for individuals from these backgrounds.

Less serious cases were also mentioned, such as when YouTube created its first app for iOS. It was found that around 10% of videos were being uploaded upside down, due to bias introduced by all of the developers in the team being right-handed.

Also mentioned was the fact that the problem is not as simple as just removing some problem variables from the model, as many variables can be interconnected in ways that are not properly understood and can be highly correlated.
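A small sketch can make this point concrete. Assuming a sensitive attribute has already been dropped from the training data, you can still check whether any remaining feature is correlated with it and could therefore act as a proxy. The data below is synthetic and made up purely for illustration, and the feature names and the `proxy_candidates` helper are hypothetical; this is not code from the talk.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / (var_x * var_y) ** 0.5


def proxy_candidates(sensitive, features, threshold=0.4):
    """Names of features whose correlation with the sensitive attribute
    is at least `threshold` in absolute value."""
    return [
        name
        for name, values in features.items()
        if abs(pearson(sensitive, values)) >= threshold
    ]


# Synthetic example: the sensitive attribute is excluded from the model,
# but one of the remaining features still tracks it.
sensitive_group = [0, 0, 0, 0, 1, 1, 1, 1]
remaining_features = {
    "postcode_area": [0, 0, 0, 1, 1, 1, 1, 0],
    "account_age_years": [1, 2, 3, 4, 4, 3, 2, 1],
}
flagged = proxy_candidates(sensitive_group, remaining_features)
```

A simple pairwise correlation like this only catches linear, one-to-one relationships; combinations of several features can encode the same information in ways this check would miss, which is part of why the problem is hard.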

The talk can be summarised with the thought that great care is required when designing systems, as your decisions may affect a large number of people. Team diversity is also key, as people different from yourself might spot biases that are not immediately obvious to you.

I thoroughly enjoyed my time at PyData London and hope to have the opportunity to attend again. If you get the chance to go, I would highly recommend it!

 

Closing thoughts

PyData London 2022 was a very valuable event for GlobalLogic to attend. Jam-packed with excellent speakers from a variety of fields and with a real sense of community, it was a great opportunity to deepen our understanding of Python and Data Science and to learn from industry experts. We’re already looking forward to the next one!

 

About the authors

Sami – I’m Sami Alsindi, Lead Data Scientist at GlobalLogic. I’ve been a Data Scientist for almost four years now, having discovered my love of programming during my PhD. I’ve been a Consultant for most of my career and love making a difference to clients by tapping into their rich datasets and applying AI, ML and DS techniques bespoke to their business problems.

Nikhil – I’m Nikhil Modha and I’m from London. I’ve worked in the tech industry for four years, specialising in Data Science and MLOps. I love to work with NLP (Language Data) and productionising models in the cloud.

Ben – I am a Consultant Data Scientist working for GlobalLogic in London. I have been a Data Scientist for six months; prior to this I completed a PhD applying ML algorithms to a Computational Electromagnetics problem. I love all things Computer Vision and enjoy applying cutting-edge models to problems in my work.

 
