CFM-ID Benchmarking

Publication

See the publication in the Journal of Chemical Information and Modeling.

Introduction

Chemical identification in metabolomics analysis often relies, in part, on signals (spectra) from a mass spectrometer. Classically, compounds are identified by comparing the spectra of unknowns to a library of spectra measured from known reference standards. However, only a small fraction of compounds have reference spectra. Therefore, spectra-prediction tools, such as the machine-learning tool CFM-ID, have been created to provide predicted, surrogate spectra when experimental spectra are not available.

Summary

This work compares CFM-ID predictions to empirical measurement.

The basic steps are:

1) Take a bird’s-eye view of performance. Compare every empirical spectrum to the CFM-ID predictions for the same compound at all energy levels. The resulting similarity distribution is bimodal.
2) Explain the bimodality by examining the role of collision energy. Compare each of a compound’s three predicted energy levels to the empirical spectrum, taking note of the empirical energy level. This reveals a performance bias in the low-energy predictions (the spike on the left-hand side of the overall performance distribution) and suggests that only the high-energy (40 eV) prediction should be used.
3) Probe the role of chemical structure. Take all compounds with empirical energies close to 40 eV (the energy level of CFM-ID’s high-energy prediction) and match them against the 40 eV prediction. Obtain chemical fingerprints for those same compounds and for the training set. Perform UMAP on the union of the two sets to see that performance is not related to similarity to the training set.
4) Determine what drives predictability, given that similarity to the training set does not. Classify all compounds used in step 3 and display the classes as a function of predictability (similarity). Finally, train a random forest model to predict predictability, using a new set of compounds. Explore feature differentiation in that model, and observe that the features correlated with good predictability are characteristic of the compound classes associated with good predictability.

Technologies

This project made use of the Python3 libraries/workflow tools:
* pandas
* sklearn
* umap
* matplotlib
* numpy
* snakemake

Method/Results

1) Overall Performance

To measure overall performance, we first obtained compound-spectrum pairs from the NIST20 library that were not in the CFM-ID train/test set. We then ran CFM-ID on each compound and computed the dot-product similarity between the compound’s predicted spectrum and its experimentally measured spectrum. The resulting list of compound-similarity pairs was partitioned by mass analyzer and adduct and is displayed below. In mass spectrometry, it is conventional to report a similarity of 1 as 999 and a similarity of 0 as 0.
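The dot-product score described above can be sketched as follows. This is a minimal illustration, not the scoring code used in this project: the function names, the unit-width m/z binning, and the 1000 m/z upper limit are all assumptions made for the example.

```python
import numpy as np

def bin_spectrum(peaks, max_mz=1000.0, bin_width=1.0):
    """Sum peak intensities into fixed-width m/z bins (binning scheme is an assumption)."""
    n_bins = int(np.ceil(max_mz / bin_width))
    vec = np.zeros(n_bins)
    for mz, intensity in peaks:
        idx = int(mz / bin_width)
        if 0 <= idx < n_bins:
            vec[idx] += intensity
    return vec

def dot_product_score(predicted, measured, max_mz=1000.0, bin_width=1.0):
    """Normalized dot-product (cosine) similarity, scaled to the 0-999 convention."""
    p = bin_spectrum(predicted, max_mz, bin_width)
    m = bin_spectrum(measured, max_mz, bin_width)
    norm = np.linalg.norm(p) * np.linalg.norm(m)
    if norm == 0.0:
        # An empty spectrum cannot match anything.
        return 0
    cosine = float(np.dot(p, m) / norm)
    return int(round(999 * cosine))

# Hypothetical peak lists as (m/z, intensity) pairs.
predicted = [(91.05, 100.0), (119.05, 40.0)]
measured = [(91.05, 80.0), (65.04, 20.0)]
score = dot_product_score(predicted, measured)
```

Because the score is normalized, only the relative intensities within each spectrum matter, so spectra recorded on different absolute intensity scales can be compared directly.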

2) Explaining Bimodality with Collision Energy

We explain the bimodality by examining the role of collision energy. We compare each of a compound’s three predicted energy levels to the range of empirical energy levels, and partition the results by CFM-ID energy level (lines) and empirical energy level (x-axis ranges). This reveals a performance bias in the low-energy predictions (the spike on the left-hand side of the overall performance distribution) and suggests that only the high-energy (40 eV) prediction should be used.
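One way to organize this partitioning with pandas is sketched below. The table contents are made-up toy values, and the column names, the bin edges, and the 10/20/40 eV prediction levels are assumptions for illustration only; they are not taken from this project’s data.

```python
import pandas as pd

# Toy, made-up results: one row per (compound, predicted energy level) comparison.
results = pd.DataFrame({
    "compound": ["A", "A", "A", "B", "B", "B"],
    "empirical_ev": [12.0, 12.0, 12.0, 38.0, 38.0, 38.0],
    "predicted_ev": [10, 20, 40, 10, 20, 40],
    "score": [310, 450, 620, 120, 400, 810],
})

# Assumed bin edges for the empirical collision energy (the x-axis ranges).
edges = [0, 15, 30, 45]
results["empirical_range"] = pd.cut(results["empirical_ev"], bins=edges)

# Median score per empirical-energy range (rows) and predicted energy (columns).
summary = (
    results
    .groupby(["empirical_range", "predicted_ev"], observed=True)["score"]
    .median()
    .unstack("predicted_ev")
)
print(summary)
```

Each column of `summary` corresponds to one line in a plot of this kind, so a prediction level that scores poorly across most empirical ranges stands out immediately.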

3) Hypothesizing that performance also depends on similarity to training set

Next, we probe the role that chemical structure plays. Because CFM-ID is largely a machine-learning model, we hypothesize that a compound’s prediction improves as its similarity to the training set increases.

To test this, we take all compounds with empirical energies close to 40 eV (the energy level of CFM-ID’s high-energy prediction, which the previous step identified as the best) and compute the similarity between each empirical spectrum and its corresponding prediction. These similarities serve as the colors in panel (b) of the above graph.

Then, for those same compounds and for CFM-ID’s training set, we obtain the CACTVS/PubChem fingerprints. We perform UMAP on the union of the sets (coloring the training set red), and see that 1) performance is not related to structural similarity to the training set and 2) the best performance seems to occur in a localized grouping, which suggests that structural motifs will explain performance.
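The UMAP projection is a visual check. A simple numeric complement, which is not necessarily part of the original analysis, is each compound’s highest Tanimoto similarity to any training-set fingerprint. The sketch below assumes fingerprints are supplied as 0/1 arrays of equal length; the function name is hypothetical.

```python
import numpy as np

def max_tanimoto_to_training(query_fps, train_fps):
    """For each query fingerprint, return its highest Tanimoto similarity
    to any fingerprint in the training set."""
    q = np.asarray(query_fps, dtype=bool)
    t = np.asarray(train_fps, dtype=bool)
    # Count of shared on-bits for every query/training pair.
    intersection = q.astype(int) @ t.astype(int).T
    q_bits = q.sum(axis=1)[:, None]
    t_bits = t.sum(axis=1)[None, :]
    union = q_bits + t_bits - intersection
    # Define the similarity as 0 when both fingerprints have no on-bits.
    sim = np.divide(
        intersection,
        union,
        out=np.zeros(union.shape, dtype=float),
        where=union > 0,
    )
    return sim.max(axis=1)

# Tiny made-up 4-bit fingerprints, for illustration only.
train = [[1, 1, 0, 0], [1, 0, 0, 0]]
query = [[1, 1, 0, 0], [0, 0, 1, 1]]
nearest = max_tanimoto_to_training(query, train)
```

Plotting these values against the spectral similarity scores gives a direct look at whether closeness to the training set tracks prediction quality.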

4) Hypothesizing that structural motifs explain CFM-ID performance

As a first test of this hypothesis, we classify all compounds used in step 3 and display the class distributions as a function of predictability (similarity). We see that the most predictable compounds are benzenoids.

As an independent method to confirm the above intuition, we trained a random-forest model on predictions from the VFNPL Natural Product Dataset. This model sought to predict predictability. From this model, we were able to ascertain which bits in the CACTVS fingerprints discriminate between good and bad performance by CFM-ID. Indeed, the bits associated with good performance are those whose substructures correspond to the classes “benzenoids” and “organo-heterocyclics”.
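The general idea of ranking fingerprint bits with a random forest can be sketched with scikit-learn. This is an illustration under stated assumptions, not the model trained here: the data are synthetic, the 16-bit fingerprints are stand-ins for CACTVS bits, and predictability is treated as a binary "well predicted" label, which may differ from the original setup.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in data: 200 compounds x 16 fingerprint bits (not real CACTVS bits).
X = rng.integers(0, 2, size=(200, 16))
# Toy label: a compound is "well predicted" exactly when bit 3 is set,
# so bit 3 should come out as the most important feature.
y = X[:, 3]

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

# Rank bits by impurity-based importance, most informative first.
ranked_bits = np.argsort(model.feature_importances_)[::-1]
print("Most informative bit:", ranked_bits[0])
```

With real fingerprints, the top-ranked bits can be mapped back to their substructure definitions and compared against the compound classes that score well.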