Deep Learning Spectrum Intensity Prediction

Results

Overall

While this project is ongoing, preliminary results indicate some success at improving the reliability of spectra generated from quantum mechanical (QM) predictions.

[Figure: overall results]

An Example

A cherry-picked example of a spectrum where ML outperforms QM.

[Figure: spectrum where ML outperforms QM]

A cherry-picked example of a spectrum where QM outperforms ML.

[Figure: spectrum where QM outperforms ML]

Method

Motivation

At some point, a colleague of mine mentioned that predicting EI/MS spectra using quantum mechanics often succeeded at predicting fragments but failed at predicting intensities. They posited that this was because the simulations modeled the electron impact itself, but not the interaction with the detector. I wondered if the intensities could instead be predicted using ML (in this case, deep learning).

Problem Formulation

Descriptors

At its core, supervised learning seeks to map one space to another. In our case, we want to map (something?) to intensity. What should that something be?

We will create an input feature space composed of three main types of descriptors: 1) absolute m/z, 2) surrounding m/z, and 3) structural fingerprints.

  1. In the case of quantum mechanics predictions, we can (somewhat) reliably predict fragments. This means we can assume we have the m/z values for a compound, but not their intensities. Our (something?) therefore begins with the central m/z, shown in red in the figure below.

[Figure: spectrum with the central m/z shown in red]
  2. It's very natural to use absolute m/z as a feature, which we will do. We also assume that the intensity relates to the other m/z values in the spectrum. Why is this? Loosely, we can think of the interaction with the detector as a competition of sorts between fragments. We do not use convolutional layers, though, because the competition is not based simply on local phenomena.

  3. Finally, if we are predicting fragments using QM, then we can include information about the structure. To do this, we encode structures as Morgan topological fingerprints. (A sketch combining all three descriptor types follows this list.)
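
To make this concrete, here is a minimal sketch of a featurizer in Python (NumPy + RDKit). The encodings themselves — a one-hot over integer m/z bins for the central fragment and a binary presence vector for the rest of the spectrum — as well as the function name `featurize_peak` are illustrative assumptions, not the project's actual code; only the dimensions (450, 1002, 2400) come from the RAM discussion below.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem

N_MZ_BINS = 450    # absolute m/z options (dimension from the text)
N_CONTEXT = 1002   # surrounding m/z descriptor (dimension from the text)
N_FP_BITS = 2400   # Morgan fingerprint length (dimension from the text)

def featurize_peak(central_mz, all_mz, smiles):
    """Build one input vector: central m/z + spectral context + structure."""
    # 1) absolute m/z: one-hot over integer m/z bins (assumed encoding)
    absolute = np.zeros(N_MZ_BINS, dtype=np.float32)
    absolute[int(round(central_mz))] = 1.0

    # 2) surrounding m/z: mark the other fragments present in the spectrum,
    #    since intensity is treated as a competition between fragments
    context = np.zeros(N_CONTEXT, dtype=np.float32)
    for mz in all_mz:
        if round(mz) != round(central_mz):
            context[int(round(mz))] = 1.0

    # 3) structure: Morgan (circular) topological fingerprint
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=N_FP_BITS)
    fingerprint = np.array(fp, dtype=np.float32)

    # total length: 450 + 1002 + 2400 = 3852
    return np.concatenate([absolute, context, fingerprint])
```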

Output Space

Because we are predicting intensities, each spectrum, such as the one above, actually becomes n training examples, where each input is a single "central" m/z (plus its context) and each output is that m/z's intensity.
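
As a toy illustration of this explosion, reusing the hypothetical `featurize_peak` sketched above (the molecule and values here are made up):

```python
# A single toy spectrum (m/z -> relative intensity) and its structure.
smiles = "CCCCC=O"
spectrum = {29.0: 0.40, 44.0: 1.00, 57.0: 0.25, 86.0: 0.10}

# One spectrum becomes n = 4 supervised examples: each peak's m/z (plus
# context and fingerprint) is an input, its intensity is the target.
examples = [
    (featurize_peak(mz, spectrum.keys(), smiles), intensity)
    for mz, intensity in spectrum.items()
]
assert len(examples) == len(spectrum)
```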

Noteworthy Complexities

Imbalance

The vast majority of m/z:intensity pairs have intensities below 10% of the spectrum maximum, while pairs near 90% of the maximum are the rarest class. Therefore, our Dataset implements a custom subsampler: as we iterate over each spectrum, every m/z is included probabilistically based on its output intensity. For example, intensities below 10% of the maximum had a 0.6% chance of being included, whereas those above 90% had a 15% chance.
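
A minimal sketch of such a subsampler is below. Only the two endpoint probabilities (0.6% for <10%, 15% for >90%) come from the text; the linear ramp between them and the function names are assumptions.

```python
import random

def keep_probability(rel_intensity):
    """Probability that a peak enters training, by relative intensity."""
    if rel_intensity < 0.10:
        return 0.006           # 0.6% chance for the overrepresented low peaks
    if rel_intensity > 0.90:
        return 0.15            # 15% chance for the rare high peaks
    # Linear interpolation between the two documented endpoints (assumption)
    return 0.006 + (rel_intensity - 0.10) / 0.80 * (0.15 - 0.006)

def subsample(spectrum):
    """Yield the (m/z, intensity) pairs that survive probabilistic sampling."""
    for mz, intensity in spectrum.items():
        if random.random() < keep_probability(intensity):
            yield mz, intensity
```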

RAM

The 200,000 spectra, assuming 50 m/z:intensity pairs each, generate 10,000,000 training points. With a dense input feature vector of 450 (m/z options) + 1002 (surrounding m/z) + 2400 (Morgan fingerprints) = 3,852 features, that gives 10,000,000 × 3,852 × 4 bytes (float32 tensors) ≈ 154 gigabytes of RAM, as a lower bound. Hence, we implemented a slightly complicated Dataset that generates these features on the fly during training (with a non-constant epoch length, since the sampling above is probabilistic).
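
A minimal sketch of the on-the-fly approach, assuming a PyTorch `IterableDataset` and the hypothetical `featurize_peak` and `subsample` helpers from above (the actual implementation lives in the repository linked at the end):

```python
import torch
from torch.utils.data import DataLoader, IterableDataset

class OnTheFlySpectrumDataset(IterableDataset):
    """Generate (features, intensity) pairs lazily, so the ~154 GB dense
    matrix never materializes. Epoch length varies because the subsampler
    keeps each peak probabilistically."""

    def __init__(self, spectra):
        # spectra: iterable of (spectrum_dict, smiles) pairs
        self.spectra = spectra

    def __iter__(self):
        for spectrum, smiles in self.spectra:
            for mz, intensity in subsample(spectrum):
                features = featurize_peak(mz, spectrum.keys(), smiles)
                yield (torch.from_numpy(features),
                       torch.tensor(intensity, dtype=torch.float32))

# Batches are featurized as they are drawn, never all at once, e.g.:
# loader = DataLoader(OnTheFlySpectrumDataset(training_spectra), batch_size=256)
```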

Discussion

The nature of machine learning and molecules

I think the core problem with machine learning and molecules is that the feature spaces commonly used to describe molecular phenomena are very chaotic, and because of this, interpolation suffers. Consider this graph, created during one of the training runs:

[Figure: training and validation loss curves]

Blue is the training-set error; orange is the validation-set error.

Even though our model continuously improves and is able to capture the arbitrary complexities involved in spectral prediction, what it learns says very, very little about new data. The optimally generalized model appears around epoch 8, long before the model fits the training data as well as it can.

Consider the chemistry of a long alkyl chain with two carbonyls. The chemistry is very different when there are zero carbons between them, one carbon between them, and different again with four or five carbons between them. In my opinion, we simply lack a linear basis on which to express chemicals and their behavior. Lewis structures are extremely communicable, but utilizing them is an art form.

How to improve this model?

More time availability is always a great start :)

I suspect that reformulating this as a classification problem might yield more diverse predicted intensities. As it stands, predictions show a "central tendency", approaching 50% of the maximum relative intensity.
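
For illustration, one hypothetical way the classification reformulation could look; the bin count and edges here are arbitrary choices, not anything from this project:

```python
import numpy as np

# Ten intensity classes: [0, 0.1), [0.1, 0.2), ..., [0.9, 1.0].
BIN_EDGES = np.linspace(0.0, 1.0, 11)[1:-1]  # interior edges only

def intensity_to_class(rel_intensity):
    """Map a relative intensity in [0, 1] to a class index in 0..9."""
    return int(np.digitize(rel_intensity, BIN_EDGES))

# Training would then use cross-entropy over these classes; class weights
# could additionally counter the imbalance described above.
print(intensity_to_class(0.05), intensity_to_class(0.95))  # -> 0 9
```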

I would also like to rent a large computer and benchmark the deep learning against classical ML approaches such as random forests.

Code availability

https://github.com/plbremer/gc_intensity_predictor