- Department: Chemistry
- Credit value: 20 credits
- Credit level: M
- Academic year of delivery: 2023-24
- See module specification for other years: 2024-25
This module will provide an overview of the application of data science and machine learning in chemistry and beyond by looking at four specific problem areas: the analysis of atomic structures, the simulation of molecular dynamics, the handling of atmospheric data, and the analysis of scientific image data. You will learn about different types of chemical data and the software methods used to validate, analyse and extract conclusions from them. You will apply this knowledge to some practical problems using real data and industry-standard tools. The diversity of data types, sources and approaches will give you enough experience to approach new problem domains with confidence.
Occurrence | Teaching period |
---|---|
A | Semester 2 2023-24 |
Although the general methodology of data analysis is common to all disciplines, particular methods are better suited to particular kinds and volumes of data. This module aims to provide relevant experience in the use of data analysis and machine learning techniques in four distinct areas of chemistry: atomic structure ('Molecular Structure Data'), atomistic simulations ('Machine Learning in Computational Chemistry'), atmospheric chemistry ('Atmospheric Data') and molecular property prediction and design ('Applications of Neural Networks in Chemistry').
Students will be able to:
Analyse and evaluate large datasets from different sources.
Develop suitable validation criteria for different data types.
Create software that extracts chemical knowledge from computational representations of molecules.
Appreciate applications of supervised and unsupervised machine learning models in computational chemistry.
Implement feedforward, graph (GNN) and recurrent (RNN) neural networks for molecular property prediction and generative molecular design.
Content separated by sub-module:
Macromolecular Structure: accessing and retrieving data from a molecular structure database; performing data validation; gathering statistical information about bond lengths, angles, and torsions.
Atmospheric Data: accessing and working with atmospheric data such as air pollution measurements; counterfactual analysis, used for example to assess interventions intended to improve air quality; parameterizations to support atmospheric chemistry modelling.
Machine Learning in Computational Chemistry: bypassing expensive computational chemical calculations using machine learning (ML); representing structures of molecules in computers; using these representations for unsupervised classification and clustering of molecular structures; using neural networks for rapid prediction of potential energies; and using kernel regression to predict molecular properties.
Applications of Neural Networks in Chemistry: working with molecules in computers using the Atomic Simulation Environment (ASE) and RDKit; molecular representations; molecular property prediction using feedforward neural networks (handcrafted features) and graph neural networks (GNNs; learned features); generative molecular design using recurrent neural networks (RNNs).
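As an illustration of the kind of geometric quantities gathered in the Macromolecular Structure sub-module, the following is a minimal sketch of how bond lengths, bond angles and torsion angles can be computed from atomic coordinates. It is not taken from the module materials: the function names and the example coordinates are hypothetical, and only the Python standard library is assumed (Python 3.8 or later for `math.dist`). In practice the coordinates would come from a structure database entry rather than being typed in by hand.

```python
import math

# Small vector helpers for 3D coordinates given as (x, y, z) tuples.
def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def bond_length(p, q):
    """Distance between two atoms, in the units of the input coordinates."""
    return math.dist(p, q)

def bond_angle(p, q, r):
    """Angle p-q-r in degrees, with atom q at the vertex."""
    u, v = _sub(p, q), _sub(r, q)
    cos_t = _dot(u, v) / math.sqrt(_dot(u, u) * _dot(v, v))
    # Clamp to [-1, 1] to guard against rounding error before acos.
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_t))))

def torsion_angle(p0, p1, p2, p3):
    """Signed torsion (dihedral) angle p0-p1-p2-p3 in degrees."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = tuple(x / norm for x in b1)  # unit vector along the central bond
    # Project the outer bonds onto the plane perpendicular to the central bond.
    v = _sub(b0, tuple(_dot(b0, b1) * x for x in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * x for x in b1))
    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x))

# Hypothetical coordinates for four bonded atoms (not real structural data).
atoms = [(1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.5), (0.0, 1.0, 1.5)]
print(bond_length(atoms[1], atoms[2]))
print(bond_angle(atoms[0], atoms[1], atoms[2]))
print(torsion_angle(*atoms))
```

Collecting these values over many entries gives the distributions from which statistical information, and validation criteria for outliers, can be derived.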
Task | % of module mark |
---|---|
Essay/coursework | 50 |
Essay/coursework | 50 |
None
Assessment 1
Essay/coursework (project report, data presentation including code): Data analysis and presentation for chemistry problem domains. Students submit their code in a compressed file and use up to four sides of A4 to describe the results of their data analysis for one chemistry problem domain.
Assessment 2
Essay/coursework (project report, data presentation including code): Machine learning and neural networks for chemistry applications. Students submit their code in a compressed file and use up to four sides of A4 to describe the results of applying machine learning and neural networks to one chemistry problem domain.
Task | % of module mark |
---|---|
Essay/coursework | 50 |
Essay/coursework | 50 |
Feedback will be provided through workshops and online exercises. Feedback on summative work will be provided within 25 working days of the assessment.
Introduction to Data Science : A Python Approach to Concepts, Techniques, and Applications
Laura Igual, Santi Seguí. Springer 2017
Python for Data Analysis : Data Wrangling with Pandas, NumPy, and IPython
Wes McKinney. O'Reilly 2017
Pro Git
Scott Chacon, Ben Straub. Apress 2014
Python and Matplotlib Essentials for Scientists and Engineers
Matt A. Wood. Claypool Publishers 2015
Visualization for the Physical Sciences
Lipsa et al. Computer Graphics Forum, 2012, Vol.31 (8), p.2317-2347
Introduction to Scientific Visualization
Helen Wright. Springer 2007
Data Modeling Essentials
Graeme Simsion, Graham Witt. Morgan Kaufmann 2004