The Role of Data Science in Research: A Case Study with Liverpool School of Tropical Medicine
In academia, new applications of Machine Learning are emerging that improve the accuracy and efficiency of processes, and open the way for disruptive data-driven solutions. For example, the implementation of Data Science in Biomedicine is helping to accelerate patient diagnoses and create personalised medicine based on biomarkers.
In line with these advancements, we have received growing interest from professionals in academic disciplines outside computer science. They want to know which Data Science tools and techniques they need to learn to prepare for the future, and which applications are relevant to their area of specialisation.
Working with Liverpool School of Tropical Medicine (LSTM), we set out to address these questions and upskill their Department of Vector Biology in Data Science using Python. Our goal was to provide PhD students and postdoctoral researchers with transferable knowledge and Data Science skills they can apply to their research in Epidemiology and Bioinformatics.
In this article we will provide an overview of:
- Essential Data Science techniques researchers need to know
- Applications of Data Science in Epidemiology
- Case study: A training plan for Liverpool School of Tropical Medicine
It’s worth noting that this Data Science training strategy can be applied in any field. Cambridge Spark Data Science and Machine Learning training programmes are designed to equip individuals with the skills to gather, analyse and interpret structured and unstructured data, in just two days.
An Introduction to Data Science in Python
The essential Data Science techniques researchers need to know about
To build Data Science capabilities, the first step is to upskill researchers and subject-matter experts in the foundations of Data Science using Python. Widely used techniques to start with are:
Data Science Essentials
- Working with Jupyter notebooks
- The NumPy library for array manipulation
- The pandas library for data manipulation
- Data cleaning and pre-processing
- Data visualisation with Matplotlib and Seaborn
- Applying Principal Component Analysis (PCA) in Python with scikit-learn
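To give a flavour of how these essentials fit together, here is a minimal sketch that cleans a small table with pandas, standardises it, and applies PCA with scikit-learn. The data is synthetic and the column names (`feature_0` to `feature_4`) are our own placeholders, not material from the LSTM course.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration): 100 samples, 5 numeric features.
rng = np.random.default_rng(seed=0)
df = pd.DataFrame(
    rng.normal(size=(100, 5)),
    columns=[f"feature_{i}" for i in range(5)],
)

# Introduce some missing values in the first column to mimic a messy dataset.
df.iloc[::10, 0] = np.nan

# Data cleaning: fill missing values with each column's median.
clean = df.fillna(df.median())

# Pre-processing: standardise every feature to zero mean and unit variance,
# since PCA is sensitive to the scale of its inputs.
scaled = StandardScaler().fit_transform(clean)

# PCA: project the five features onto two principal components.
pca = PCA(n_components=2)
components = pca.fit_transform(scaled)

print(components.shape)
print(pca.explained_variance_ratio_)
```

In a Jupyter notebook, the two columns of `components` could then be passed to a Matplotlib or Seaborn scatter plot to inspect the structure of the data visually.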
Unsupervised Learning and Supervised Learning
- The scikit-learn library for Machine Learning and scikit-learn pipelines
- k-means clustering
- Hierarchical cluster analysis
- Density-based clustering (DBSCAN)
- The k-Nearest Neighbour algorithm
- Overfitting, underfitting, bias-variance tradeoff
- Cross-Validation and hyperparameter tuning
- Decision Trees
- Bagging and bootstrapping: intuition, concept and algorithm, and Random Forests in scikit-learn
- Boosting classifiers: intuition, visualisation, and Boosting methods in scikit-learn
- AdaBoost, XGBoost, LightGBM
- Stacking in scikit-learn
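Several of the supervised learning topics above can be combined in a few lines. The following is a minimal sketch, not the course material itself: it builds a scikit-learn pipeline around a Random Forest, tunes one hyperparameter with 5-fold cross-validation, and checks the result on held-out data. The choice of scikit-learn's built-in breast cancer dataset and of the parameter grid are our own assumptions for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Built-in example dataset (an assumption for illustration; no download needed).
X, y = load_breast_cancer(return_X_y=True)

# Keep a held-out test set so the final score is not biased by the tuning step.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# A pipeline bundles pre-processing and the model, so cross-validation
# refits the scaler on each training fold and avoids data leakage.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("forest", RandomForestClassifier(n_estimators=50, random_state=0)),
])

# Hyperparameter tuning: compare shallow trees with fully grown trees.
param_grid = {"forest__max_depth": [3, None]}
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X_train, y_train)

test_accuracy = search.score(X_test, y_test)
print(search.best_params_)
print(f"held-out accuracy: {test_accuracy:.3f}")
```

Comparing the cross-validated score (`search.best_score_`) with the held-out accuracy is one simple way to spot overfitting.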
Applications of Data Science in Epidemiology
How researchers can make use of Machine Learning
Current research initiatives are using Machine Learning to detect health threats and to improve the accuracy and efficiency of diagnosis, with the aim of improving patient outcomes. Examples include:
- Using Feature Engineering and Feature Selection in order to identify biomarkers capable of distinguishing between diseases and group samples with shared characteristics.
- Applying regression models to examine the relationships between risk factors and disease outcomes.
- Using random forests to make highly informative predictions for more targeted drug prescriptions.
- Using convolutional neural networks (CNNs) for image analysis to detect diseases such as malaria.
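As an illustration of the first example, the sketch below uses univariate feature selection to shortlist candidate "biomarkers" from a synthetic dataset. Everything here is hypothetical: the data is generated with `make_classification`, the `marker_*` names are placeholders, and a real biomarker study would need domain-specific validation well beyond this step.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in for a biomarker table (an assumption for illustration):
# 300 samples, 30 measured markers, of which 5 carry signal about the label.
X, y = make_classification(
    n_samples=300,
    n_features=30,
    n_informative=5,
    n_redundant=0,
    random_state=0,
)
markers = pd.DataFrame(X, columns=[f"marker_{i}" for i in range(30)])

# Score each marker with an ANOVA F-test against the class label
# and keep the five highest-scoring ones.
selector = SelectKBest(score_func=f_classif, k=5)
selector.fit(markers, y)

selected = markers.columns[selector.get_support()].tolist()
print(selected)
```

Univariate scores ignore interactions between features, so in practice this kind of shortlist is usually a starting point that is refined with model-based selection and cross-validation.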
A training plan for researchers at Liverpool School of Tropical Medicine
“The course was intended to improve the data science capability of our department, though each student had their own motivation for signing up. Personally, I was looking for an overview of machine learning tools, the necessary considerations when applying them, and indications about how to implement them,” said Eric Lucas, Post Doctoral Research Associate, Liverpool School of Tropical Medicine.
In line with these technical requirements and learning objectives, Cambridge Spark delivered a three-day Introduction to Data Science using Python training session, on-site, at the Department of Vector Biology.
The training was very relevant. I am about to start a project that aims to predict phenotype based on genetic data, which I plan to approach using machine learning. I really enjoyed the discussions on pitfalls of machine learning, what makes them effective, what can be expected of them and what can’t be expected of them.
Eric Lucas, Post Doctoral Research Associate, Liverpool School of Tropical Medicine.
“I enjoyed learning about how the different machine learning tools work, their strengths and weaknesses. I do a lot of data analysis already (using a lot of tools that overlap strongly with machine learning, such as logistic regressions, PCA, clustering analysis) and I generally get a kick out of thinking about data,” said Eric Lucas, Post Doctoral Research Associate, Liverpool School of Tropical Medicine. “I was actively searching for organisations that could provide in-house machine learning courses, and the course which Raoul proposed matched very closely with what I envisaged.”