Tutorials – Friday 15th November
Location: Downing College, Howard Building, Regent Street, Cambridge, CB2 1DQ
09:00 – Developing medical analysis tools in Python with scikit-image, by Frank Longford, Scientific Software Developer at Enthought
This tutorial covers basic and intermediate use of the scikit-image package, following a use case in medical image analysis. Its target audience is programmers with a comfortable understanding of the core Python scientific stack (SciPy, NumPy, etc.) who would like to learn low-level image analysis techniques in an applied research environment.
In this tutorial, we shall explore how tools within the scikit-image package (https://scikit-image.org) can be used to build a grading system for microscope slides of cellular tissue. We shall cover the basics of image analysis techniques, as well as outline a current medical use case for automated diagnostic tools. We then go on to tackle this use case with routines currently available in scikit-image and highlight where a developer may need to supplement them with other packages in the Python scientific stack.
The aim is to provide an understanding of:
- Fundamental theory of image analysis
- What scikit-image currently has to offer
- Why a developer may choose to use scikit-image over a more established image analysis codebase
- How to apply the tools available in scikit-image to a scientific research problem.
As such, this tutorial will not directly cover the use of convolutional neural nets (since they do not appear in scikit-image), though it will refer to situations where a user may like to perform further training of machine learning algorithms on data that has been pre-processed with scikit-image.
Scikit-image is a peer-reviewed, open-source image analysis library, distributed within the Python SciKits ecosystem. It is built on a NumPy/SciPy backbone, has over 300 contributors and possesses a well-documented API.
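As a flavour of the kind of grading pipeline the tutorial builds, here is a minimal sketch using scikit-image's thresholding and labelling routines. The image below is synthetic (two bright blobs standing in for cells), not real tissue data:

```python
import numpy as np
from skimage import filters, measure

# Synthetic "slide": two bright blobs on a dark, slightly noisy background
image = np.zeros((64, 64))
image[10:20, 10:20] = 1.0
image[40:55, 40:55] = 1.0
image += 0.05 * np.random.default_rng(0).random(image.shape)

threshold = filters.threshold_otsu(image)   # automatic global threshold
binary = image > threshold                  # foreground/background segmentation
labels = measure.label(binary)              # label connected regions ("cells")
print(labels.max())                         # number of detected regions
```

From here, `measure.regionprops(labels)` gives per-region areas and shapes that a grading rule could score.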
Frank is a Scientific Software Developer at Enthought, UK. He has a multidisciplinary background in interface chemistry, computational modelling, data science and image analysis. Prior to joining Enthought he was awarded a postdoctoral fellowship to develop computational tools for medical imaging using Python, working alongside an international team of pathologists, clinicians and experimental physicists. He has a particular interest in creating elegant software that is intuitive to use and aids collaboration.
Frank holds a Ph.D. in complex systems simulation from the University of Southampton, UK, and an M.Chem. from the University of Sussex, UK.
11:00 – Advanced Software Testing for Data Scientists, by Raoul-Gabriel Urma, CEO at Cambridge Spark
The journey to deploy a model to production starts with testing it rigorously, including its code implementation. In this tutorial, you will learn about state-of-the-art software testing approaches. You will learn how to write unit tests with enhanced diagnostics, leverage validation tools from NumPy, pandas and scikit-learn, apply test doubles, and generate test cases using property-based testing.
It’s fun to develop a model in a Python notebook! But engineering teams are always complaining about code maintenance and code quality, asking for production-ready code. What can you, as a data scientist, learn from the software development world to help with this? In this tutorial, you will learn about state-of-the-art testing approaches. You will learn how to break down a model implemented in a notebook into separate parts which you can unit test and whose quality you can ensure with common tools available in Python. In addition, you will learn how to apply property-based testing and test doubles.
You will learn about:
- How to structure your preprocessing and modelling code to be testable
- How to write maintainable tests using unittest and pytest
- How to use data science validation tools and diagnostics from Pandas and Scikit-learn
- How to use test doubles to write better tests using unittest.mock
- How to automatically generate test cases using property-based testing and Hypothesis
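To illustrate the last bullet, here is a hand-rolled sketch of the idea behind property-based testing. The `standardise` function is a hypothetical preprocessing step (not from the tutorial), and the manual random-input loop is exactly what Hypothesis automates and improves on, with input shrinking when a case fails:

```python
import math
import random

def standardise(values):
    """Hypothetical preprocessing step: scale values to zero mean, unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(var) or 1.0   # guard against constant input
    return [(v - mean) / std for v in values]

# Example-based unit test: one known input, one known expectation
assert standardise([1.0, 3.0]) == [-1.0, 1.0]

# Property-based test, written by hand: for *any* input, the output should
# have (approximately) zero mean. Hypothesis generates and shrinks such
# inputs for you instead of this manual loop.
rng = random.Random(42)
for _ in range(100):
    sample = [rng.uniform(-1e3, 1e3) for _ in range(rng.randint(2, 50))]
    out = standardise(sample)
    assert abs(sum(out) / len(out)) < 1e-9
```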
Dr Raoul-Gabriel Urma is the CEO and Founder of Cambridge Spark, a leading learning organisation for data scientists and developers. In particular, Cambridge Spark has developed K.A.T.E.®, a proprietary AI system for technology L&D and assessment with the support of the UK innovation agency.
He is author of several programming books, including the best-seller “Modern Java in Action: lambdas, streams, functional and reactive programming” which sold over 30,000 copies globally and with a second edition published in November 2018.
Raoul holds a PhD in Computer Science from Cambridge University as well as an MEng in Computer Science from Imperial College London, where he graduated with first-class honours, having won several prizes for technical innovation. His research interests lie in the areas of programming languages, compilers, source code analysis, machine learning and education.
He was nominated an Oracle Java Champion in 2017. He is also an international speaker having delivered over 100 talks covering Artificial Intelligence, Business, Java and Python. Raoul has advised and worked for several organisations on large-scale software engineering projects including at Google, Oracle, eBay and Goldman Sachs. Raoul sits on the Faculty of the Centre for Digital Banking & Finance as Associate Director.
13:30 – Mastering the game: reinforcement learning with Keras, by Israel Herraiz, Strategic Cloud Engineer at Google
Thanks to reinforcement learning, we have seen how machines are now better than humans at many board games. In this tutorial, we will use Python and Keras to develop an agent powered by reinforcement learning that will learn how to play Connect-4. For this tutorial, you don’t need to have a deep knowledge of deep learning (pun intended ;), but you should be able to write Python fluently.
In the tutorial, we will use Jupyter Notebooks (either locally or in an online service such as Google Colab), with Python and Keras, to develop an agent that will play Connect-4. The agent will be built on an existing library, so its development will be very straightforward.
Once we have built an agent that can play Connect-4 following our instructions, we will create two agents that use different deep learning models to learn how to play the game.
We will set up a loop to let both agents play against each other, learning from the experience using reinforcement learning.
Finally, we will try to play against the best of the two agents, to check if it is actually a strong Connect-4 player.
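The self-play loop described above can be sketched in miniature. This is not the tutorial's code: the Connect-4 board and the Keras networks are replaced here by a tiny take-1-or-2 counting game and a tabular Monte-Carlo learner, but the learn-by-playing-yourself structure is the same:

```python
import random

# Two players alternate taking 1 or 2 counters; taking the last counter wins.
rng = random.Random(0)
Q, counts = {}, {}          # Q[(state, action)] = average return for the mover
EPS, N = 0.2, 10            # exploration rate, starting number of counters

def choose(state):
    actions = [a for a in (1, 2) if a <= state]
    if rng.random() < EPS:
        return rng.choice(actions)                             # explore
    return max(actions, key=lambda a: Q.get((state, a), 0.0))  # exploit

for episode in range(5000):                # the self-play training loop
    state, history = N, []
    while state > 0:
        action = choose(state)
        history.append((state, action))
        state -= action
    ret = 1.0                # the mover who took the last counter wins: +1
    for s, a in reversed(history):
        counts[(s, a)] = counts.get((s, a), 0) + 1
        Q[(s, a)] = Q.get((s, a), 0.0) + (ret - Q.get((s, a), 0.0)) / counts[(s, a)]
        ret = -ret           # alternate perspective between the two players

# Forced moves get exact values: taking the last counter is always worth +1,
# handing the opponent the last counter is always worth -1.
print(Q[(1, 1)], Q[(2, 2)], Q[(2, 1)])
```

In the tutorial the Q-table becomes a Keras model and the game becomes Connect-4, but the loop of play, record, and update from the final outcome is unchanged.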
Israel Herraiz is a Strategic Cloud Engineer at Google. He has worked in different data science roles at BBVA Data & Analytics and Amadeus. He holds a PhD in Computer Science from Universidad Rey Juan Carlos (2008) and has been a visiting researcher at universities in Europe, Canada and the United States. In a prior life, he was an assistant professor at Universidad Politécnica de Madrid, where he carried out research applying data science to the study of software development and the phenomenon of open-source software.
15:30 – A Deep Dive into NLP with PyTorch, By Jeffrey Hsu, Data Science Lead and Data Product Owner at Scoutbee
In this tutorial, I will give you some deeper insight into recent developments in the field of Deep Learning NLP. The first part of the workshop will be an introduction to the dynamic deep learning library PyTorch. We will explain the key steps for building a language model. In the second part, we will introduce more advanced architectures and apply them to real-world datasets.
Jeffrey will provide the notebooks listed in this repository (https://github.com/scoutbeedev/pytorch-nlp-notebooks/) for the attendees to follow along with during the workshop. The demo will be done with Google Colab, so there will be no environment setup needed for attendees. Throughout the tutorial, Jeffrey won’t explicitly go through all the notebooks but will instead focus on the slides, which show snippets from the notebooks.
The topics which will be covered are:
1) Intro to PyTorch
2) Build a text classifier with PyTorch (Bag-of-Words and RNN-based models)
3) Build a character-level text generator
4) Build a Seq2Seq model
5) Fine-tuning with GPT-2
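As a taste of topic 2, a bag-of-words classifier can be sketched in plain NumPy. In the actual notebooks the model would be an `nn.Linear` layer trained with PyTorch's autograd; the toy data here is invented:

```python
import numpy as np

# Invented toy data: positive (1) and negative (0) "reviews"
docs = ["good great film", "great acting good", "bad awful film", "awful bad acting"]
labels = np.array([1, 1, 0, 0])

vocab = sorted({w for d in docs for w in d.split()})
index = {w: i for i, w in enumerate(vocab)}

X = np.zeros((len(docs), len(vocab)))
for row, doc in enumerate(docs):
    for word in doc.split():
        X[row, index[word]] += 1            # word counts = bag-of-words features

weights, bias = np.zeros(len(vocab)), 0.0
for _ in range(200):                        # plain gradient descent on log-loss
    p = 1 / (1 + np.exp(-(X @ weights + bias)))
    weights -= 0.5 * X.T @ (p - labels) / len(docs)
    bias -= 0.5 * (p - labels).mean()

preds = (X @ weights + bias > 0).astype(int)
print(preds)                                # correctly separates the toy docs
```

PyTorch replaces the manual gradient lines with `loss.backward()` and an optimiser step, which is what makes scaling to RNNs and Seq2Seq models practical.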
Jeffrey is Data Science Lead and Data Product Owner at scoutbee, a Germany-based startup specialising in supplier discovery with an AI-powered sourcing platform. He is a data scientist with a product soul.
Conference Day One – Saturday 16th November
Location: Anglia Ruskin University – Science Centre, East Road, Cambridge, CB1 1PT
09:00 – Opening Speech, by Raoul-Gabriel Urma (Cambridge Spark) and Matthew Sattler (HSBC) - Lecture Theatre 05
09:15 – Keynote 1: Recent advances in Natural Language Processing: Towards human-like language understanding for machines, by Ekaterina Kochmar, Affiliated Lecturer and Senior Research Associate at the University of Cambridge, and Co-Founder and P/T CSO at Korbit AI - Lecture Theatre 05
Recent advances in Natural Language Processing have brought the field into the spotlight: these days, it is not unusual to see news reports claiming that machines have reached human-like performance on language-related tasks ranging from verbal reasoning to reading comprehension to language generation. This progress is due to better, more effective representations of meaning. Such representations are commonly referred to as word vectors or word embeddings. In this talk, I will overview the background and recent developments in meaning representations, talk about practical applications and challenges, and conclude with a discussion on the fundamental question – do machines really understand language like humans do?
Ekaterina Kochmar is an Affiliated Lecturer and a Senior Research Associate at the Natural Language and Information Processing group of the Department of Computer Science and Technology, University of Cambridge. She holds an MA degree in Computational Linguistics, an MPhil in Advanced Computer Science, and a PhD in Natural Language Processing.
10:25 – tf-explain: Interpretability for Tensorflow 2.0, by Raphael Meudec, Lead Data Scientist at Sicara - Lecture Theatre 05
Deep learning models now emerge in multiple domains. The question data scientists and users always ask is “Why does it work?”. Explaining decisions from neural networks is vital for model improvement and analysis, and for user adoption. In this talk, Raphael will explain implementations of interpretability methods with TF 2.0 and introduce tf-explain, a TF 2.0 library for interpretability.
We will explore some research papers on the interpretability of neural networks, at different scales: from the ultra-specific, with analysis of convolutional filters, to more user-friendly input visualizations.
For each method, Raphael will provide some theoretical explanations (what mathematical operations we are performing), and a Tensorflow 2 implementation to examine in detail how to proceed.
Finally, we will go through tf-explain usage, from offline model inspection to training monitoring.
- Convolutional Kernel Filter Visualization
- Saliency Maps (Vanilla Gradients, SmoothGrad)
- Class Activation Maps
- Occlusion Sensitivity
- TF-explain Usage
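Of the methods listed, occlusion sensitivity is the easiest to sketch without a trained network: grey out one patch of the input at a time and record how much the class score drops. The "model" below is a toy stand-in (a function that only looks at one corner of the image), not tf-explain's API:

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((8, 8))

def model_score(img):
    # Stand-in for model.predict(...)[class_index]: this fake classifier
    # only cares about the top-left 4x4 corner of the image.
    return img[:4, :4].sum()

patch = 2
heatmap = np.zeros((8 // patch, 8 // patch))
baseline = model_score(image)
for i in range(0, 8, patch):
    for j in range(0, 8, patch):
        occluded = image.copy()
        occluded[i:i + patch, j:j + patch] = 0.0     # grey out one patch
        heatmap[i // patch, j // patch] = baseline - model_score(occluded)

# The score only drops where the occlusion hits the region the model uses:
print(heatmap.round(2))
```

The resulting heatmap is non-zero only over the corner the stand-in model attends to, which is exactly the kind of evidence these methods surface for real networks.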
Raphael is a Lead Data Scientist at Sicara. He’s focused on Computer Vision models and exploring how we can better understand neural networks’ decisions. He is an occasional Keras and Tensorflow contributor, and a reviewer for Keras-contrib. He’s also a cycling and running addict!
11:00 – Taking your machine learning workflow to the next level using Scikit-Learn Pipelines, by Philip Goddard, Senior Data Scientist at Kindred Group - Lecture Theatre 05
Pipelines are a powerful, but often underused feature of the Scikit-Learn library. In this talk, I will demonstrate the features and advantages of using a pipeline approach to supervised machine learning problems, which can lead to an increasingly elegant, modular and reusable workflow.
The Scikit-Learn library is one of the cornerstones of the Python stack for data science, providing a clean and consistent API for building machine learning models. However, due to the nuances a practitioner will encounter with any data set, maintaining a clean, reproducible workflow can be challenging when faced with various permutations of feature selection and pre-processing before training an algorithm.
In this talk, Philip will demonstrate the features and advantages of a pipeline approach by using it in the context of a supervised machine learning task, specifically, building a model to predict customer churn. He will demonstrate how pipelines can be used all the way from data pre-processing and feature selection, through to model selection. He will also touch on other important topics, such as addressing class imbalance, and how to elegantly incorporate such considerations into your pipeline by using object-oriented programming techniques.
By using a pipeline approach, machine learning workflows can become increasingly elegant, modular and reusable. Within the Scikit-Learn implementation, only a small learning curve is required to obtain these advantages.
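A minimal sketch of the pipeline pattern, with invented toy data rather than real churn features, might look like this:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy churn-style data: one numeric column (monthly spend), one categorical (plan)
X = np.array([[10.0, 0], [12.0, 0], [95.0, 1], [99.0, 1]])
y = np.array([0, 0, 1, 1])

preprocess = ColumnTransformer([
    ("scale", StandardScaler(), [0]),    # numeric features
    ("onehot", OneHotEncoder(), [1]),    # categorical features
])

pipe = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression()),
])
pipe.fit(X, y)
print(pipe.predict([[11.0, 0], [97.0, 1]]))
```

Because preprocessing and the model live in one object, cross-validation and grid search operate on the whole workflow at once, which is where the modularity and reusability advantages come from.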
Phil currently holds the position of Senior Data Scientist at Kindred Group, one of Europe’s fastest-growing online gambling providers. He holds a PhD in nuclear physics, and has extensive experience developing machine learning solutions in an online business-to-consumer context.
Phil focusses on providing pragmatic, useful solutions to the business. He enjoys actively participating in all stages of the lifecycle of a data science project, collaborating with business stakeholders and engineering teams to ensure successful delivery.
11:35 – scikit-multiflow: machine learning on infinite data streams, by Jacob Montiel, postdoctoral fellow at the University of Waikato - Lecture Theatre 05
In the field of machine learning on data streams, data is assumed infinite and models are trained and updated continuously, thereby adapting to changes in the data. This talk provides an overview of data stream learning and introduces scikit-multiflow, an open-source Python framework to easily implement algorithms and perform experiments.
As traditional “batch” learning struggles to keep pace with today’s data deluge, a parallel field emerges — data stream mining. In this field, data is assumed infinite and models are trained and updated continuously, thereby adapting to changes in the data. This talk provides an overview of the core concepts of data stream learning and introduces scikit-multiflow, an open-source Python framework to implement algorithms and perform experiments in the field of machine learning on evolving data streams.
This talk is composed of two main sections:
An introduction to learning from data streams
- How is stream learning different from the “traditional” batch learning?
- An overview of methods for supervised learning
- Discussion of challenges from changes in the data distribution, known as concept drift
- Evaluating model performance on infinite data streams
What is scikit-multiflow?
- Overview of the core components of scikit-multiflow and available methods
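The "test-then-train" (prequential) evaluation at the heart of stream learning can be sketched in plain Python. This hand-rolled loop, with a toy online perceptron and a synthetic drifting stream, stands in for scikit-multiflow's stream generators and evaluators:

```python
import random

rng = random.Random(0)
w = [0.0, 0.0]              # toy online perceptron weights
correct, seen = 0, 0

for t in range(2000):
    x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
    flip = 1 if t < 1000 else -1            # concept drift: the rule inverts
    y = 1 if flip * x[0] > 0 else -1
    # 1) test: predict on the sample *before* learning from it
    pred = 1 if w[0] * x[0] + w[1] * x[1] > 0 else -1
    correct += pred == y
    seen += 1
    # 2) train: perceptron update on the same sample
    if pred != y:
        w = [w[0] + y * x[0], w[1] + y * x[1]]

print(round(correct / seen, 3))             # prequential accuracy over the stream
```

Because every sample is scored before the model sees it, accuracy reflects performance on genuinely unseen data, and the dip-then-recovery around the drift point is visible without a held-out test set.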
I am a postdoctoral fellow at the University of Waikato in New Zealand and the maintainer of scikit-multiflow. My research interests are in the field of machine learning for evolving data streams. Prior to focusing on research, I led development work on onboard software for aircraft and engine prognostics at GE Aviation, working on the development of GE’s Brilliant Machines, part of the IoT and GE’s approach to Industrial Big Data.
12:10 – Supplier Relationships in Transaction Data: Single class training challenges and supply chain network mapping, Matthew Sattler, HSBC - Lecture Theatre 05
Needle in a Haystack: Transforming inherited data assets into a customer engagement engine
The trials and tribulations of legacy ecosystems are well known but these disparate, “siloed” assets are rarely viewed as an opportunity, let alone transformed into one.
This talk centres on overcoming single class training and false-positive challenges in leveraging classifier models trained on the combined data assets of a global bank in order to create internal lead supply solutions.
Data scientists are commonly asked to create models to predict customer value or product preference. Transaction records or customer lists often reliably deliver sets of individuals who have converted but lack a set of individuals who have been verified as non-consumers of a particular good or service. These circumstances can require data scientists to select from a limited set of algorithms and engage in multi-stage validation processes to achieve success. This talk demonstrates how to avoid false-positive issues and reach strategic business goals.
Matthew Sattler has worked as the Global Head of Data Science at HSBC Global Banking and Markets for 6 years. After gaining a BA in Finance and Economics, he went to work for UBS Investment Bank for two and a half years before joining HSBC GBM.
13:00 – "Lunch and Learn" Session - Tech Skills Development: Big Business, hosted by Jules Wix, Talent & Apprenticeship Manager at Cambridge Spark (Optional) - Room 306
This is a roundtable discussion, over lunch, on best practice for research, data, and scientific computing skills & talent development in large businesses.
It is a known challenge that the supply of technical talent does not match the growing demand. This discussion, facilitated by skills development specialist, Jules Wix of Cambridge Spark, is an opportunity to share best practice solutions and novel ideas for reducing the skills gap within your organisation, with a particular focus on possible training and development opportunities available. Likely topics include levy-funded apprenticeships, corporate training options and their impact and facilitating peer-to-peer knowledge sharing within your organisation.
This talk is for leaders of tech teams and those with an interest in attracting, training and retaining technical talent within large organisations.
13:45 – Tools for Higher Performance Python, by Ian Ozsvald, Chief Data Scientist and Coach at Mor Consulting - Lecture Theatre 05
Your tools and workflow govern how quickly you can deliver results on new challenges. Often we’re constrained by slow algorithms, inefficient data pipelines and suboptimal use of complex tools like Pandas. We’ll look at recent changes in the Python ecosystem enabling fast identification of slow code, simple compilation of CPU-bound numpy processing with Numba, efficient Pandas operations and parallelised medium-data operations with Dask. This talk will give you new tools and processes to take back to the office. This talk is based upon the forthcoming 2nd edition of High Performance Python by Ian Ozsvald & Micha Gorelick, due in 2020.
Ian is a Chief Data Scientist and Coach. He co-organises the annual PyDataLondon conference, with 700+ attendees, and the associated 9,000+ member monthly meetup. He runs the established Mor Consulting Data Science consultancy in London, gives conference talks internationally, often as keynote speaker, and is the author of the bestselling O’Reilly book High Performance Python. He has 16 years of experience as a senior data science leader, trainer and team coach. For fun he’s walked by his high-energy Springer Spaniel, surfs the Cornish coast and drinks fine coffee. Past talks and articles can be found at: https://ianozsvald.com/
From the team that makes Plotly, Dash is a library for producing interactive web apps with Python. This talk introduces Dash and will discuss how it may fit into your team. We’ll take an introductory look into how Dash works, before exploring what you can, can’t, should and probably shouldn’t do with this library.
At decisionLab, a London-based data science consultancy producing decision tools, we’ve embraced Dash to produce proof-of-concept models for our projects in alpha. Although we’re not officially connected to the plotly/Dash project, by using the library daily across many projects we’ve learned many lessons and what we feel are best practices, which we’d like to share and hear feedback on!
This talk will give an overview of Dash, how it works and what it can be used for, before outlining some of the common problems that emerge when data scientists are let loose to produce web applications, and web developers have to work with the pydata ecosystem. The talk also covers effective working practices to start producing cool interactive statistical web applications, fast. We’ll also identify some of the pitfalls of Dash, and how and when to make the decision to stop using Dash and start building a proper web application.
Dom Weldon is a Senior Software Engineer at decisionLab, a London-based mathematical modelling consultancy with expertise in machine learning, simulation, optimization and visualization. Dom’s team specialize in taking models from data scientists and turning them into production ready tools. Current clients include the Royal Navy, Siemens and various UK public bodies.
Dom came to decisionLab from his PhD studies in Computational Geography at King’s College London, his initial degree was in Natural Sciences at the University of Cambridge, and he holds a master’s in the historical and cultural geography of the Cold War United States. Outside of work, Dom is interested in languages and travelling, and holds a voluntary statutory appointment on a board monitoring the welfare and dignity of prisoners in a challenging North London jail.
14:55 – Practical methods to optimise model stability: case study using customer-lifetime value at Farfetch, by Davide Sarra (Data Scientist) and Kishan Manani (Lead Data Scientist) from Farfetch - Lecture Theatre 05
Model performance often takes precedence over model stability when optimising machine learning pipelines. This can lead to unexpected outcomes in production. We present how we optimise for both model stability and performance at Farfetch. In this talk we will demonstrate practical methods to navigate the trade-off between model performance and model stability on a real world problem.
Model performance often takes precedence over model stability, if stability is considered at all, when optimising machine learning pipelines. While this can be beneficial in Kaggle competitions, it can lead to unexpected outcomes when a model is in production. For example, credit risk scores, medical classifiers, and customer-lifetime value models should be consistent over the same person. We present a real-world use case of how we optimise for both model stability and performance at Farfetch.
Farfetch is an online luxury-fashion platform with over one million active customers and more than one billion dollars of transactions in its marketplace yearly. We use machine learning to optimise customer relationship management (CRM) activities through customer-lifetime value and churn modelling. For this application, model stability is crucial for adoption by our internal stakeholders and to ensure a consistent customer experience.
Model stability relates to the variability of model predictions arising from the training process, training data, and shifts in the distribution of features over time. We shall discuss how this arises in the case of customer-lifetime value modelling, which requires making predictions for the same set of customers periodically. Firstly, we will introduce how to measure the variability arising from the sources above using methods such as bootstrapping, re-training, and simulation. Secondly, we present our solutions to enhance model stability. Finally, we benchmark a wide array of model classes including Linear Models, Random Forest, and Gradient Boosting on our dataset and use-case. We find that the most performant models are not the most stable.
By the end of the talk we will have demonstrated practical methods to navigate the trade-off between model performance and model stability on a real-world problem.
Kishan has extensive public speaking experience from working in academia and industry. He has presented scientific talks to the lay public and over radio. He has a PhD in Physics and multiple years of experience as a Data Scientist in Finance and E-Commerce.
Davide is a Data Scientist with experience in ad bidding, customer behaviour, pricing, and A/B testing tools. By working on large machine learning products, Davide has also developed skills and a passion for software engineering. He also enjoys public speaking and loves sharing what he has learned.
16:00 – Keynote: The Turing Way: Reproducible, Inclusive, Collaborative Data Science, by Kirstie Whitaker, Lead Developer of The Turing Way - Lecture Theatre 05
Reproducible research is necessary to ensure that scientific work can be trusted. By sharing data, analysis code and the computational environment used to generate the results, researchers can more effectively stand on the shoulders of their peers and colleagues and deliver high quality, trustworthy and verifiable outputs. This requires skills in data management, library sciences, software development, and continuous integration techniques: skills that are not widely taught or expected of academic researchers. Skills that are unreasonable, in fact, to expect in one individual team member. Even worse, they are not sufficient for ethical, transparent, collaborative, participatory and well-designed data science!
The Turing Way is a handbook to support students, their supervisors, industry data scientists, team leaders, funders, journal editors, and policy makers in ensuring that reliable and impactful data science is “too easy not to do”. It includes training material on version control, analysis testing, collaborating in distributed groups, open and transparent communication skills, and effective management of diverse research projects. The Turing Way is openly developed and any and all questions, comments and recommendations are welcome at our GitHub repository: https://github.com/alan-turing-institute/the-turing-way.
In this talk, Kirstie Whitaker, lead developer of The Turing Way, will take you on a whirlwind tour of the chapters that already exist and the directions in which we’re continuing to develop, including ethical considerations, research project design, scoping across a broad range of incentives and ways of working, and effective communication strategies. All participants will leave the talk knowing that “Every Little Helps” when making their work reproducible, where to ask for help as they start or continue their open research journey, and how they can contribute to improving The Turing Way for future readers.
Kirstie Whitaker is a research fellow at the Alan Turing Institute (London, UK) and senior research associate in the Department of Psychiatry at the University of Cambridge. Her work covers a broad range of interests and methods, but the driving principle is to improve the lives of neurodivergent people and people with mental health conditions.
Dr Whitaker uses magnetic resonance imaging to study child and adolescent brain development and participatory citizen science to educate non-autistic people about how they can better support autistic friends and colleagues. She is the lead developer of The Turing Way, an openly developed educational resource to enable more reproducible data science.
Kirstie is a passionate advocate for making science “open for all” by promoting equity and inclusion for people from diverse backgrounds, and by changing the academic incentive structure to reward collaborative working. She is the chair of the Turing Institute’s Ethics Advisory Group, a Fulbright scholarship alumna and was a 2016/17 Mozilla Fellow for Science. Kirstie was named, with her collaborator Petra Vertes, as a 2016 Global Thinker by Foreign Policy magazine. You can find more information at her lab website: whitakerlab.github.io.
Interested in doing a lightning talk? Register at the registration desk on Saturday morning!
Conference Day Two – Sunday 17th November
Location: Anglia Ruskin University – Science Centre, East Road, Cambridge, CB1 1PT
09:35 – Operationalising drilling intelligence, David Fraser Halliday, Data Analytics Manager, Schlumberger - Lecture Theatre 05
During Schlumberger Well Construction operations, drilling equipment is subjected to high pressures and temperatures and high levels of shock and vibration. A variety of different measurements are made during these operations to provide information on the conditions the equipment is experiencing, as well as how the equipment is performing. In this presentation, I will describe the challenges involved in accessing these data, before going on to demonstrate how automated data workflows enable large scale data analysis, leading to the operationalisation of drilling intelligence.
We are using the Dataiku Data Science Studio to access data in Google Cloud Storage and Google BigQuery. Within Dataiku we are orchestrating data science workflows, utilising Python packages such as NumPy, Pandas, and scikit-learn. We are visualising workflow outputs using interactive Bokeh web apps, and providing user-friendly access using custom chatbots in Slack.
David Halliday is the Well Construction Data Analytics Manager at Schlumberger Cambridge Research. In 2009, he received a PhD in Geophysics from the University of Edinburgh, with a thesis titled “Surface Wave Interferometry”. Upon completing his thesis he joined the Geophysics Department at Schlumberger Cambridge Research as a Research Scientist, working on a range of topics in seismic data acquisition and processing. He progressed to the level of Principal Research Scientist, before taking on his current role in Data Analytics in 2018. David has been recognised by three professional societies, being awarded in 2010 the European Association of Geoscientists and Engineers Arie van Weelden Award and the Royal Astronomical Society Keith Runcorn Prize, and in 2013 the Society of Exploration Geophysicists J. Clarence Karcher Award.
10:35 – Tests and reliability in Machine Learning Projects, by Stephanie Bracaloni, Software Engineer at Iotic Labs - Lecture Theatre 05
Good practices say you must write tests! But testing Machine Learning projects can be really difficult and frustrating. Before spending a lot of time writing or setting up complex frameworks to check your algorithm’s quality, several simpler things can bring a lot of value and help you avoid known mistakes! This talk is about how to bring a POC to production with (more) confidence.
Once your machine learning POC seems promising and your development environment is well set up, the next step is to refactor your code and write TESTS. “I know, it can really be difficult and frustrating. Some manual checks can address the need.”
That is not totally false. Tests can be really boring and time-consuming to write when you don’t have the right tools, the right APIs, the right environments or the right code structure. But it is always a bad idea to ignore tests or to perform them manually. If you want to be involved in your project’s life cycle, if you want to bring it from POC to production, you need to care about tests (and testable code). After some years tackling production bugs, you can’t feel safe delivering without tests, just as you can’t start driving until your seat belt is fastened.
There is more than one way to test. Tests can be split into several levels (unit, functional, scenario, performance, etc.) so that you can quickly identify the faulty code, data or parameter. Tests must also be automated in a Continuous Integration pipeline and (for most of them) run at least on each experiment before merging it into the baseline pipeline, as is done in software engineering (where the CI is triggered on each feature branch).
This talk is about how to easily write tests and testable code, how to avoid the most common traps, how to quickly set up a CI, and the benefits of tests on unrealistic data in your Machine Learning project. (Tests on real data are also really important, but they are not the main purpose of this talk.) Technologies discussed/mentioned: Docker, TDD, Pytest, Hypothesis, CircleCI, GitlabCI.
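As a sketch of what tests on unrealistic data buy you (the function below is hypothetical, not from the talk): tiny hand-built edge cases catch failures that a realistic data sample would hide.

```python
def fill_missing_with_median(column):
    """Hypothetical preprocessing step: replace None with the column median."""
    present = sorted(v for v in column if v is not None)
    if not present:
        raise ValueError("column has no observed values")
    mid = len(present) // 2
    median = (present[mid] if len(present) % 2
              else (present[mid - 1] + present[mid]) / 2)
    return [median if v is None else v for v in column]

# Edge cases that rarely appear in a real sample but will in production:
assert fill_missing_with_median([1, None, 3]) == [1, 2, 3]
assert fill_missing_with_median([None, 5]) == [5, 5]
try:
    fill_missing_with_median([None, None])
except ValueError:
    pass   # an all-missing column should fail loudly, not silently
```

Tests like these are cheap to write with pytest, run in milliseconds in CI, and pin down behaviour before any real data is involved.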
Stephanie has been working as a software engineer for more than six years. She now works on the industrialisation of machine learning projects (from POC to production). She likes development, but she's not "just a coder": she always keeps in mind systems and projects as a whole. Finding solutions to new problems and improving day-to-day processes is something she really enjoys.
11:10 – A Kedro Mindset for Data Scientists, by Ivan Danov, Machine Learning Engineer at QuantumBlack and Tech Lead for Kedro - Lecture Theatre 05
Kedro is a Python library that implements best practices for data pipelines with an eye towards productionising ML models. Learn how to use Kedro to structure your data science workflow while creating production-ready code.
This talk will go into detail about challenges that data scientists face in their workflow while creating ML models that are deployable; what software engineering principles data scientists should consider applying to their code to make it easier to deploy in the production environment; and, how they can use an open source Python library, called Kedro, to enhance their data analysis workflow as well as their transition to production-ready code.
Content is structured for beginners in Python and I will take the audience through the Kedro Spaceflights tutorial.
I. The difficulty of being a data scientist today (5 min)
- The rise of MLOps, DataOps and Data DevOps has doubled down on data scientists being able to master challenging frameworks and methodologies
- Workflow challenges you will face while trying to create production-level code on your own
II. What does production-ready code look like? (5 min)
- Definitions for production-level code and data pipelines
- Coverage of the software engineering principles that should be applied to create data pipelines
III. Using Kedro in your data analysis workflow (10 min)
- Setting up Kedro’s data access layer and configuration for the Spaceflights tutorial
- Jupyter Notebook / Lab workflow that creates reproducible code
IV. Transitioning from Jupyter Notebooks into production-ready code (5 min)
- Exporting code from Jupyter Notebooks into the Kedro project template
- Constructing a data pipeline within Kedro
- Visualising the Spaceflights data pipeline with Kedro-Viz
V. Deployment strategies (5 min)
VI. Q&A (5 min)
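The pipeline construction in part IV can be illustrated without Kedro installed. The dependency-free sketch below mimics Kedro's idea of a pipeline as pure-function nodes wired together by named datasets (the real API lives in `kedro.pipeline`); all node and dataset names here are invented for illustration.

```python
# Kedro expresses a pipeline as pure functions ("nodes") wired together by
# named datasets. This toy runner mimics that idea to show why it makes
# notebook code production-ready.

def preprocess_companies(companies):
    """Node: clean raw company records (toy version)."""
    return [c.strip().lower() for c in companies]

def create_model_input(companies, shuttles):
    """Node: join two cleaned datasets into a model input table."""
    return list(zip(companies, shuttles))

# A pipeline is an ordered list of (function, input names, output name).
PIPELINE = [
    (preprocess_companies, ["raw_companies"], "companies"),
    (create_model_input, ["companies", "raw_shuttles"], "model_input"),
]

def run(pipeline, catalog):
    """Resolve each node's inputs from the catalog and store its output."""
    for func, inputs, output in pipeline:
        catalog[output] = func(*(catalog[name] for name in inputs))
    return catalog

catalog = run(PIPELINE, {"raw_companies": [" Acme ", "Orbit"],
                         "raw_shuttles": ["s1", "s2"]})
```

Because nodes only declare named inputs and outputs, the same functions run unchanged in a notebook, in tests, or in production, which is the workflow benefit the talk describes.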
Ivan is a Machine Learning Engineer at QuantumBlack and Tech Lead for Kedro. His work on Kedro has created the product that we know today.
As a user of Kedro on projects, he really wanted to be able to help Data Scientists successfully transition into this new era of “production-ready code” without too many changes to their workflow.
His presentation experience is focused on three areas:
- Leading weekly training for the Python course for Code First: Girls, a non-profit social enterprise that focuses on increasing the proportion of women in tech
- Presenting Kedro at Data Science meetups
- Leading Kedro presentations and demos for CDOs, CEOs and CAOs
11:45 – GA2M: combining accuracy and explainability, by Guillaume Baquiast, Data Scientist at QuantumBlack - Lecture Theatre 05
As ML models are used in real-life applications, it is important to make them interpretable. In this talk, we will present GA2M, an algorithm that is both accurate and fully interpretable by design. We will describe the idea behind the algorithm, demonstrate how to use our implementation, and give real-life examples of its benefits.
We will start this talk by stating the importance of interpretability in Machine Learning. We will then present the GA2M algorithm, introduced by Y. Lou, R. Caruana et al. in "Accurate Intelligible Models with Pairwise Interactions". Finally, we will present some results and introduce our implementation.
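The additive structure that makes GA2M fully interpretable can be shown in a few lines. The shape functions below are fixed by hand purely for illustration; a real GA2M learns them from data, and this sketch is not the speaker's implementation (open-source GA2M implementations exist, for example in InterpretML).

```python
# GA2M models a prediction as a sum of per-feature shape functions plus a
# small set of pairwise interaction terms, so every term can be plotted
# and inspected. The functions below are hand-fixed toys.
import math

# Univariate shape functions f_i (learned in a real GA2M).
shape_functions = {
    "age": lambda v: 0.04 * v,
    "income": lambda v: math.log1p(v),
}

# A single pairwise term f_ij capturing an interaction.
pair_functions = {
    ("age", "income"): lambda a, i: -0.001 * a * math.log1p(i),
}

INTERCEPT = 1.5

def ga2m_predict(x):
    """Score = intercept + sum_i f_i(x_i) + sum_ij f_ij(x_i, x_j)."""
    score = INTERCEPT
    for feat, f in shape_functions.items():
        score += f(x[feat])
    for (a, b), f in pair_functions.items():
        score += f(x[a], x[b])
    return score

def explain(x):
    """Per-term contributions: the model's exact, additive explanation."""
    terms = {feat: f(x[feat]) for feat, f in shape_functions.items()}
    terms.update({f"{a}*{b}": f(x[a], x[b])
                  for (a, b), f in pair_functions.items()})
    return terms
```

Because the prediction is exactly the intercept plus the sum of the displayed terms, the explanation is faithful by construction rather than a post-hoc approximation.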
Guillaume Baquiast is a Data Scientist at QuantumBlack. He uses Data Science to have real-world impact.
12:20 – Using NLP to improve theatre utilisation in hospitals, by Jaymin Mistry, Data Scientist, PA Consulting - Lecture Theatre 05
NHS Trusts are under significant pressure to improve their efficiency and reduce waiting lists. Theatre time is both expensive and has a huge opportunity cost. In this talk, Jaymin will explain how an NLP solution can deliver improved theatre utilisation, the reasons for model selection, the nuances of working with medical data, and how this approach beats rules-based scheduling methods.
Explanation of the problem
Hospitals are under significant financial pressure to increase productivity. One area under specific pressure is the utilisation of theatres. Theatres have a high running cost (particularly the multiple specialist staff involved) and a high opportunity cost (wasted time could be used to treat more patients). Predicting surgery time involves a large amount of uncertainty and variance. This results in scheduled operations overrunning, which affects staff and expenditure, and in under-utilisation, when a theatre is used for only part of the time it is available, missing an opportunity to reduce waiting lists. Currently, theatre lists are created by administrative staff using surgeons' estimates of surgery time.
Solution: Several rules-based solutions exist that use business logic and summary statistics to assist administrative staff. They 1) fail to capture the variance between different patients (age, complexity, etc.) with the same condition, and 2) often require procedure categorisation by medically trained staff, which complicates and limits their usage.
I will explain how NLP was used to extract information from the free-text description of surgery. This supplemented other information available about patients to predict surgery duration with improved accuracy, enabling better scheduling of operations. I will explain the reasons for the selection of models and the different approaches used to prepare the data, along with the challenges of engaging with end users and medical professionals. The end result is a 5-10% increase in usable theatre time, which translates to hundreds more operations in a large hospital.
We are aiming to:
1) Improve the text preprocessing step so that more information can be extracted and noise can be reduced
2) Use information from different hospitals to improve a general model.
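To make the idea concrete, here is a deliberately tiny sketch of predicting duration from free-text descriptions. The token-overlap scheme, function names, and all data below are invented for illustration and are far simpler than the model described in the talk.

```python
# Toy sketch: turn the free-text surgery description into token features
# and predict duration from historical cases sharing those tokens.

def tokens(text):
    """Very naive tokenisation of a free-text procedure description."""
    return set(text.lower().replace("+", " ").split())

# (free-text description, observed duration in minutes) -- invented data.
HISTORY = [
    ("laparoscopic cholecystectomy", 90.0),
    ("open cholecystectomy + drain", 150.0),
    ("laparoscopic appendicectomy", 60.0),
]

def predict_duration(description, history=HISTORY, default=120.0):
    """Average past durations weighted by token overlap with the query."""
    query = tokens(description)
    weights = [(len(query & tokens(desc)), dur) for desc, dur in history]
    total = sum(w for w, _ in weights)
    if total == 0:
        return default  # unseen procedure: fall back to a global estimate
    return sum(w * dur for w, dur in weights) / total
```

A real system would replace the token overlap with learned text features (e.g. TF-IDF or embeddings) combined with patient attributes, but the structure, free text in, duration estimate out, is the same.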
Jaymin Mistry is a Data Scientist at PA Consulting.
13:05 – Bonus "Lunch and Learn" session - Tech Skills Development: Scaling Business, Hosted by Jules Wix, Talent & Apprenticeship Manager - Room 306
Over lunch, we’ll discuss best practice for research, data, and scientific computing skills & talent development in start-ups and scaling businesses.
Restricted access to technical talent is one of the biggest barriers to successfully scaling a small business, compounded by the lack of resources available to compete against larger employers. This round-table discussion, facilitated by skills development specialist Jules Wix of Cambridge Spark, is an opportunity to share best-practice solutions and novel ideas for reducing the skills gap within your organisation, with a particular focus on the training and development opportunities available. Likely topics include developing your own talent through apprenticeships, identifying resources available for upskilling your team, and maximising the resources you have to attract and develop new talent for your organisation.
This talk is for anyone with an interest in attracting, training and retaining technical talent within small and growing organisations.
13:55 – Pandas implants - extension arrays and other customisation techniques, by Jan Pipek, DTone - Lecture Theatre 05
The pandas library offers three approaches to user customisation: class inheritance, series/data frame accessors, and extension arrays/dtypes. The talk briefly introduces all three, while focusing on a complete implementation of a physical-unit-aware data column as an example.
Since version 0.23, the pandas library has allowed custom user types to be used for internal representation in series and data frames, via the ExtensionArray and ExtensionDtype interfaces (in places where a NumPy array would otherwise be used). Version 0.24 takes this further by implementing all of pandas' own "exotic" types in terms of these interfaces.
The talk will explore the possibilities and shortcomings of extension arrays and will gradually build towards a simple proof-of-concept custom column that supports physical units (including the dimension-aware arithmetics and conversions).
In addition, two other approaches to adding custom behaviour to pandas will be presented: inheriting from pandas types, and creating accessors for series, data frames and indices.
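The accessor route is the lightest of the three and can be shown in a few lines. `pd.api.extensions.register_series_accessor` is the real pandas API; the `length` accessor name and the metre/kilometre logic below are invented purely for illustration (the talk's own example uses full physical units via extension arrays).

```python
# One of the three customisation routes from the talk: a registered
# accessor that adds a namespace of methods to every Series.
import pandas as pd

@pd.api.extensions.register_series_accessor("length")
class LengthAccessor:
    """Interpret a numeric Series as lengths in metres (toy example)."""

    def __init__(self, series):
        self._s = series

    def to_km(self):
        """Return a new Series converted to kilometres."""
        return self._s / 1000.0

    @property
    def total_m(self):
        """Total length in metres."""
        return float(self._s.sum())

s = pd.Series([1500.0, 2500.0])
km = s.length.to_km()   # every Series now exposes the .length namespace
```

Unlike inheritance, an accessor survives pandas operations that return new Series objects, since it is looked up by name on every access.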
Jan is a data scientist at DTone, having recently transitioned from Monte Carlo simulations in medical physics.
He has been using Python for more than ten years, with a strong inclination for data analysis and visualization (having written several useless and hopefully at least one useful library – physt), but also trying to enjoy the language in the broader sense.
He is both happy and fortunate to be one of the PyData Prague meetup organizers.
14:30 – Automated Machine Learning for Time Series in Python, by Maksim Sipos, CTO, causaLens - Lecture Theatre 05
Most supervised and unsupervised machine learning happens in an offline, batch setting. However, in some online settings, new data arrive as individual points and models must process them immediately. In this talk, we will show how Python can be used to effectively and automatically model time series data in an online streaming setting.
In this talk we will present the approach we take in our company to do automated machine learning in online settings. Our company's technology stack is built entirely in Python and Cython. We use the usual Python data science stack of libraries, including numpy, pandas and sklearn, and have also built a great deal of custom time-series-specific code. We will present some of the design choices we made when building our system, and how we solved performance issues related to processing time series data in Python. In particular, we will discuss techniques including:
- time series alignment and resolutions
- time series forecasting
- automated machine learning
- regression and classification
- online time series computations
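As a minimal illustration of the online setting described above, here is a one-step-ahead forecaster that updates in O(1) per arriving point instead of refitting on a batch. Exponential smoothing is a standard illustrative choice here, not necessarily the method used by causaLens.

```python
# Online forecasting: ingest one point at a time, constant work per point.

class OnlineEWMForecaster:
    """Exponentially weighted moving average as a one-step-ahead forecast."""

    def __init__(self, alpha=0.5):
        self.alpha = alpha      # weight given to the newest observation
        self.level = None       # current smoothed level

    def update(self, y):
        """Ingest one new observation as it arrives on the stream."""
        if self.level is None:
            self.level = y
        else:
            self.level = self.alpha * y + (1 - self.alpha) * self.level

    def forecast(self):
        """Predict the next point from the current level."""
        return self.level

f = OnlineEWMForecaster(alpha=0.5)
for y in [10.0, 12.0, 11.0]:    # points arriving one by one
    f.update(y)
```

The same update/forecast interface generalises to online regression and classification models, which is what makes streaming pipelines composable.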
Maksim Sipos received his bachelor's degrees in Mathematics and Physics in 2008, and a PhD in Theoretical Statistical Physics in 2012, from the University of Illinois Urbana-Champaign in the United States. Maksim has received seven awards for his clear communication, presentation and teaching skills, and has published six peer-reviewed papers in research journals that have been cited more than 250 times.
After finishing his studies, Maksim worked at a prominent systematic hedge fund based in Princeton, on petabyte-scale live algorithmic systems directly managing hundreds of millions of USD. He has also helped bring to life a variety of data science initiatives as a consultant and CTO at a number of successful European and Silicon Valley startups.
Most recently, at causaLens, Maksim and his team are building a cutting edge, autonomous, real-time system that builds adaptive predictive models on time series data. Throughout his academic and work career, Maksim has used Python and Cython to do numerics and data science.
15:05 – Chasing nanoseconds: data science for low-latency networks, by Omer Yuksel, Performance Analyst at IMC Trading - Lecture Theatre 05
In this talk we will discuss the data science work at IMC Trading on low-latency networks, where every nanosecond counts. First we will show high-level examples of analysis and modelling of networks and systems, and then we will go over challenges that come with handling data at nanosecond scale.
In order to succeed in technology-driven, low-latency trading, it is important to understand the underlying networks and systems. This requires data analysis at high precision, often in nanoseconds. However, analysis on such a scale can be tricky.
The talk consists of two parts. First, I will show examples of the type of work we do with the PyData stack on low-latency datasets, e.g. modelling network behaviour, predicting the effects of the changes we introduce, and understanding queuing effects and bursts. The second part covers the challenges we face when analysing or visualising data at high precision: for example, handling the various sources of error, using platforms that are not fully ready for nanosecond scale, and various gotchas when handling the data in the PyData stack.
The talk will be high-level on the networks and systems part; no prior knowledge of these is required. Basic knowledge of Pandas, NumPy, and probability may be beneficial.
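One concrete gotcha of the kind the second part covers, shown with an invented timestamp value: float64 has a 53-bit significand, so nanosecond epoch timestamps near 1.6e18 are spaced 256 ns apart once converted to float, which is one reason to keep them as 64-bit integers end to end.

```python
# Nanosecond epoch timestamps do not survive a round-trip through float64.
t_ns = 1_573_776_000_000_000_123   # an epoch timestamp in nanoseconds
as_float = float(t_ns)

# At ~1.6e18 the float64 spacing is 256, so the last digits are rounded away:
assert int(as_float) != t_ns

# Worse, two distinct nanosecond timestamps can collapse to the same float:
assert float(t_ns) == float(t_ns + 4)
```

The same effect silently bites when a column of integer nanosecond timestamps is cast to float, e.g. by an operation that introduces NaNs.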
Omer Yuksel works at IMC as a Performance Analyst. He has a background in computer engineering and applied mathematics, and is responsible for models and Python modules for monitoring network performance metrics.
16:10 – Final Keynote: how does the brain work?, by Kenneth Harris, Professor of Quantitative Neuroscience in the UCL Institute of Neurology - Lecture Theatre 05
The brain is the most sophisticated computer, and the most complex piece of matter, in the known universe. Modern neuroscience is producing petabytes of experimental data, offering a unique opportunity to find out how this computer works. The data take very diverse forms, arising from very different types of experiments, from genomics to large-scale neuronal recordings.
This talk will describe some of the data, processing methods, and data organization challenges involved in modern neuroscience, some of the conclusions that can be drawn from it, and their implications for artificial learning systems.
Kenneth Harris studied mathematics at Cambridge University, did a PhD in robotics at UCL, then moved to Rutgers University in the United States for postdoctoral work in neuroscience. Before returning to UCL in 2012, he was Associate Professor of Neuroscience at Rutgers, and Professor of Neurotechnology at Imperial College London. He is currently Professor of Quantitative Neuroscience in the UCL Institute of Neurology.