The issue
Recent food crises have shown the importance of having effective means of food identification and analysis. Many tests have been developed to monitor food safety, but analysis of the resulting data is highly problematic.
Nuclear magnetic resonance (NMR) methods can identify a wide range of small molecules, or metabolites, yielding characteristic ‘fingerprints’ that show the relative concentrations of the compounds present in a sample.
Each sample may produce thousands of data points that require peak modelling and other data reduction techniques. Chemometrics applies mathematical methods from statistics and pattern recognition to these metabolomic fingerprints, extracting relevant features that enable samples to be classified, anomalies to be recognised and markers for different biological states to be identified.
Changes in temperature, pH and ionic strength result in unwanted shifts in peak position. It is common practice to accommodate small shifts by integrating the spectral data over regions of equal length, known as uniform binning. However, uniform binning can split a single NMR resonance across two bins or assign multiple peaks to the same bin, adding to the variance and making data interpretation difficult.
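The effect of a bin boundary cutting through a peak can be illustrated with a toy example. This is a minimal sketch, not taken from the software described here: the function names, the 12-point spectrum and the bin width of 4 points are all assumptions chosen for illustration.

```python
def uniform_bins(intensities, bin_width):
    """Integrate (sum) the spectrum over consecutive bins of equal width."""
    return [
        sum(intensities[start:start + bin_width])
        for start in range(0, len(intensities), bin_width)
    ]


def spectrum_with_peak(centre, n_points=12):
    """Toy spectrum: one 3-point peak (1, 2, 1) centred at index `centre`."""
    spectrum = [0.0] * n_points
    for offset, height in ((-1, 1.0), (0, 2.0), (1, 1.0)):
        spectrum[centre + offset] = height
    return spectrum


# The same peak in two samples, shifted by one data point
# (for example by a small change in pH).
sample_a = uniform_bins(spectrum_with_peak(centre=5), bin_width=4)
sample_b = uniform_bins(spectrum_with_peak(centre=4), bin_width=4)

print(sample_a)  # [0.0, 4.0, 0.0]  peak lies entirely in the second bin
print(sample_b)  # [1.0, 3.0, 0.0]  peak is split across the first two bins
```

Both samples contain an identical peak, yet the binned values differ because the fixed boundary at index 4 cuts through the shifted peak. That difference is variance introduced by the binning, not by the chemistry.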
The research
Our researchers designed the adaptive binning algorithm to allow variable-length bins which correspond directly to peaks in the spectra, facilitating interpretation of the data. As noise regions are excluded, the method significantly reduces variation within a biological class (for example, disease state) in comparison to fixed-width binning.
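The idea of peak-aligned, variable-length bins can be sketched as follows. This is a simplified illustration under stated assumptions, not the published algorithm: it assumes a reference spectrum is available as a list of intensities, treats everything at or below a hypothetical `noise_threshold` as noise, and splits overlapping peaks at the valley between them. The names `adaptive_bins` and `integrate` are invented for this sketch.

```python
def adaptive_bins(reference, noise_threshold):
    """Return (start, end) index pairs, end exclusive, one per peak region.

    A bin is a run of consecutive points above `noise_threshold`.
    A run is split wherever a point is lower than both neighbours,
    so that two overlapping peaks get separate bins.
    """
    bins, start = [], None
    # The appended sentinel is not above the threshold, so it closes a final run.
    for i, value in enumerate(list(reference) + [noise_threshold]):
        if value > noise_threshold:
            if start is None:
                start = i  # a new peak region begins
            elif (i + 1 < len(reference)
                  and reference[i - 1] > value < reference[i + 1]):
                bins.append((start, i))  # valley: close this bin ...
                start = i                # ... and open the next one here
        elif start is not None:
            bins.append((start, i))  # back in the noise: close the bin
            start = None
    return bins


def integrate(spectrum, bins):
    """Sum the spectrum over each variable-length bin."""
    return [sum(spectrum[start:end]) for start, end in bins]


reference = [0.1, 0.1, 1.0, 3.0, 1.0, 0.1, 2.0, 5.0, 2.5, 4.0, 1.0, 0.1]
bins = adaptive_bins(reference, noise_threshold=0.5)

print(bins)                        # [(2, 5), (6, 8), (8, 11)]
print(integrate(reference, bins))  # [5.0, 7.0, 7.5]
```

Each resulting variable corresponds to one peak, and the noise-only points at indices 0, 1, 5 and 11 contribute to no bin at all.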
The advantage of genetic programs over standard multivariate analyses is that they do not involve a transformation of the variables. This produces results that are easier to interpret in terms of the underlying chemistry.
However, although the use of integrated peaks rather than individual data points reduces the number of variables, the search space in metabolomics studies is still prohibitively large for evolutionary computing methods such as genetic programming.
To address this, we have developed a two-stage genetic programming algorithm designed specifically for use with one-dimensional datasets.
Computational efficiency is significantly improved by limiting the number of generations in the first stage and only submitting the most discriminatory variables to the second stage, in which the optimal classification solution is sought.
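The two-stage structure can be sketched in a few lines. This is an illustration only and makes several assumptions that do not come from the original work: a very simple evolutionary search over pairs of variables stands in for genetic programming (no program trees are evolved), fitness is nearest-centroid classification accuracy, and "most discriminatory" is taken to mean the variables used most often by the winners of several short first-stage runs. All function names and parameter values are hypothetical.

```python
import random
from collections import Counter


def accuracy(variables, rows, labels):
    """Fitness: nearest-class-centroid accuracy using only `variables`."""
    def centroid(cls):
        members = [r for r, y in zip(rows, labels) if y == cls]
        return [sum(r[v] for r in members) / len(members) for v in variables]

    c0, c1 = centroid(0), centroid(1)
    correct = 0
    for row, label in zip(rows, labels):
        d0 = sum((row[v] - c) ** 2 for v, c in zip(variables, c0))
        d1 = sum((row[v] - c) ** 2 for v, c in zip(variables, c1))
        correct += (0 if d0 <= d1 else 1) == label
    return correct / len(rows)


def evolve(allowed, rows, labels, generations, rng, pop_size=20, size=2):
    """Stand-in evolutionary search over `size`-tuples drawn from `allowed`."""
    def fitness(individual):
        return accuracy(individual, rows, labels)

    population = [tuple(rng.choice(allowed) for _ in range(size))
                  for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]  # keep the fitter half
        children = []
        for parent in parents:
            child = list(parent)
            child[rng.randrange(size)] = rng.choice(allowed)  # point mutation
            children.append(tuple(child))
        population = parents + children
    return max(population, key=fitness)


def two_stage(rows, labels, keep, runs=10,
              stage1_generations=3, stage2_generations=30, seed=0):
    rng = random.Random(seed)
    all_variables = list(range(len(rows[0])))
    # Stage 1: several short runs over every variable; count winners' variables.
    usage = Counter()
    for _ in range(runs):
        usage.update(set(evolve(all_variables, rows, labels,
                                stage1_generations, rng)))
    shortlist = [v for v, _ in usage.most_common(keep)]
    # Stage 2: one longer run restricted to the shortlisted variables.
    best = evolve(shortlist, rows, labels, stage2_generations, rng)
    return shortlist, best


# Toy data: 10 samples, 8 variables; only variable 5 is built to separate classes.
labels = [0, 1] * 5
rows = [[((i * 3 + j * 5) % 7) / 10 for j in range(8)] for i in range(10)]
for row, label in zip(rows, labels):
    row[5] = float(label)

shortlist, best = two_stage(rows, labels, keep=3)
print(shortlist, best, accuracy(best, rows, labels))
```

The saving comes from the second stage searching only `keep` variables rather than all of them, so the longer run explores a much smaller space. By construction, the final solution can only use shortlisted variables.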
The outcome
Our work has enabled the Food and Environment Research Agency (Fera) to maximise the information available from food testing. This has improved food safety and authentication worldwide and underpins the analytical testing services delivered by Fera.
Our techniques have been incorporated into a bespoke Matlab-based solution which is now routinely used by Fera’s Chemical and Biochemical Profiling section. It is used in the specialist testing services Fera provides across the food storage and retail, agri-environment and veterinary sectors to over 7,500 customers in over 100 countries.
The techniques are also used in Fera’s research, supporting around £8m worth of work to develop a wide range of global applications including the determination of disease-related biomarkers, contaminant detection, food traceability and the development of drought- and disease-resistant crop varieties.