Visualizing 200k+ Spectra and Structures

Results

Deployment Website

Unfortunately, I had to take down the deployment website (buying a house is really expensive!). However, the docker image can be found here

Use case

You can use this app to compare how spectrum, structure, retention index, and molar mass might be related. Each dot is a compound. Dot positions can be based on spectrum or structure. Dot colors can be based on spectrum, structure, retention index, or molar mass.

An Example

There are many things that can (and can’t) be readily deduced from the app. One approach to finding interesting patterns is to base positions on one attribute and colors on another.

image.png

There is a lot of noise, but perhaps some extractable trends as well. The pink region selected here indicates that these compounds are largely co-localized on both the structure manifold (position) and the spectrum manifold (color). We can click on a compound to take a closer look.

image.png

The results are a little noisier than we might hope, but there is clearly some sort of trend: a spectrum peak at about m/z 150, and structures with cyclic or heterocyclic groups connected to carbonyls (and sometimes carboxylic acids).

Methods

Data Preparation

We started with a slightly modified version of the NIST17 EI/MS library. We removed all compounds with lower-bound m/z < 50 and upper-bound m/z > 500. We removed each compound that did not self-consistently complete a structural round trip from InChI to an RDKit Mol and back to InChI. We removed each compound that lacked a Kovats retention index. This reduced the library from about 240,000 compounds to just over 200,000.
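The three filters can be sketched roughly as follows. This is an illustration, not the code from the repository: the record layout (`peaks`, `inchi`, `retention_index`) and the function names are hypothetical, and the m/z rule is read here as "drop a compound if any peak lies outside 50 to 500", which is one possible interpretation. The RDKit calls `Chem.MolFromInchi` and `Chem.MolToInchi` perform the round trip.

```python
from typing import Callable, Optional

MZ_MIN, MZ_MAX = 50, 500  # bounds quoted in the text


def rdkit_round_trip(inchi: str) -> Optional[str]:
    """InChI -> RDKit Mol -> InChI; returns None if RDKit cannot parse the input."""
    from rdkit import Chem  # imported lazily so the filter logic itself needs no RDKit

    mol = Chem.MolFromInchi(inchi)
    return Chem.MolToInchi(mol) if mol is not None else None


def keep_compound(
    record: dict,
    round_trip: Callable[[str], Optional[str]] = rdkit_round_trip,
) -> bool:
    """Apply the three filters to one record (hypothetical dict layout)."""
    mzs = [mz for mz, _intensity in record["peaks"]]
    # Assumption: any peak outside the 50..500 window disqualifies the compound.
    if not mzs or min(mzs) < MZ_MIN or max(mzs) > MZ_MAX:
        return False
    # The structure must survive the round trip unchanged.
    if round_trip(record["inchi"]) != record["inchi"]:
        return False
    # A retention index must be present.
    return record.get("retention_index") is not None
```

Passing the round-trip function as an argument keeps the filter easy to check with a stub before running it over the full library.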

Clustering

Spectra

The EI/MS spectra form a natural feature space with 450 features from m/z 50 to m/z 500. We created a UMAP embedding with n_neighbors=5000 (as large as feasible on the available hardware) to minimize spurious grouping and min_dist=0 to encourage visual separation (the space can get very crowded with 200,000 points). We used cosine distance as the metric, with a disconnection distance of 0.99 (to avoid random connections between totally unrelated spectra).
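As a rough sketch, these settings map onto the umap-learn package like this. It assumes umap-learn is the implementation in use and that the embedding is two-dimensional (`n_components=2`); neither is stated explicitly above, and the helper names are hypothetical.

```python
def umap_settings(metric: str) -> dict:
    """Keyword arguments for umap.UMAP mirroring the values quoted in the text."""
    return {
        "n_neighbors": 5000,             # large neighborhood to limit spurious grouping
        "min_dist": 0.0,                 # allow tight packing for visual separation
        "metric": metric,                # "cosine" for the spectra
        "disconnection_distance": 0.99,  # cut edges between near-unrelated points
        "n_components": 2,               # assumption: a 2-D scatter plot
    }


def embed(features, metric: str, **overrides):
    """Fit UMAP on a (n_compounds, n_features) array and return the coordinates."""
    import umap  # umap-learn, imported lazily because fitting is the expensive step

    settings = {**umap_settings(metric), **overrides}
    return umap.UMAP(**settings).fit_transform(features)
```

For a quick trial on a subsample, a much smaller neighborhood can be passed as an override, for example `embed(sample, "cosine", n_neighbors=50)`.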

We used DBSCAN on the latent space to color clusters, with a minimum cluster size of 500 and an epsilon of 0.2, both chosen by visual inspection.
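A minimal sketch of this coloring step, assuming scikit-learn's `DBSCAN` and treating the "minimum cluster size" as its `min_samples` parameter (an approximation, since `min_samples` is a density threshold rather than a strict size limit). The palette handling and the noise color are hypothetical choices.

```python
from typing import List, Sequence

NOISE_COLOR = "lightgrey"  # hypothetical choice for points DBSCAN leaves unclustered


def cluster_embedding(embedding, eps: float = 0.2, min_samples: int = 500):
    """Run DBSCAN on the UMAP coordinates; a label of -1 marks noise."""
    from sklearn.cluster import DBSCAN  # lazy import: only needed when clustering

    return DBSCAN(eps=eps, min_samples=min_samples).fit_predict(embedding)


def labels_to_colors(labels: Sequence[int], palette: Sequence[str]) -> List[str]:
    """Map cluster labels to colors, cycling the palette and greying out noise."""
    return [
        NOISE_COLOR if label < 0 else palette[label % len(palette)]
        for label in labels
    ]
```

Because the palette cycles, distinct clusters share a color once there are more clusters than palette entries.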

Structures

The structures were converted to Morgan fingerprints with a bit space of 2400. We created a UMAP embedding with n_neighbors=5000 (as large as feasible on the available hardware) to minimize spurious grouping and min_dist=0 to encourage visual separation (the space can get very crowded with 200,000 points). We used Jaccard distance as the metric, with a disconnection distance of 0.99 (to avoid random connections between totally unrelated structures).
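For illustration, the fingerprinting and the Jaccard distance might look like the sketch below. The radius of 2 is an assumption (the radius is not given above), the function names are hypothetical, and the fingerprint is represented as the set of its on-bit indices so that the distance can be written in plain Python.

```python
from typing import FrozenSet


def morgan_on_bits(inchi: str, n_bits: int = 2400, radius: int = 2) -> FrozenSet[int]:
    """Return the indices of the set bits in a Morgan fingerprint."""
    from rdkit import Chem  # lazy imports: RDKit is only needed for fingerprinting
    from rdkit.Chem import AllChem

    mol = Chem.MolFromInchi(inchi)
    if mol is None:
        raise ValueError(f"RDKit could not parse: {inchi}")
    fingerprint = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
    return frozenset(fingerprint.GetOnBits())


def jaccard_distance(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """1 - |a & b| / |a | b|; two empty fingerprints are treated as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)
```

Two fingerprints with no bits in common have a distance of 1.0, which is above the 0.99 disconnection distance, so such pairs are never linked in the neighbor graph.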

We used DBSCAN on the latent space to color clusters, with a minimum cluster size of 500 and an epsilon of 0.2, both chosen by visual inspection.

Discussion

Efficacy of this approach

Visualizing complex spaces like this is useful because it allows non-informaticists to draw meaning from the data, albeit in a potentially misleading manner. No knowledge of algorithms or software is required, and (usually) some level of intuition garnered from these explorations is better than naive intuition. Organic chemists and spectral interpreters can also use this to derive spectrum-prediction heuristics that might previously have been overlooked, or to understand the coverage and biases of the NIST17 library as a whole.

On the negative side, the color/position contrast technique met with limited success. By and large, the spaces are non-linear, and only a few gems can be collected.

Hardware and Time Limitations

More time is needed to explore this project.

There are a lot of improvements I would like to make:

  1. More explored parameters. The UMAPs required 5 hours on a 48-core AWS EC2 machine with 300 GB of RAM. It would have been very nice to explore DensMAP, the more complicated analog of UMAP that does not make a uniformity assumption, as well as to try many parameter combinations.

  2. More visually meaningful coloring. The efficacy of color as a cluster indicator decreases as the number of clusters increases. It would be nice to add some cluster subset functionality.

Deployment Comments

A minimized version of this app is containerized and deployed on AWS Elastic Beanstalk. As I put more apps on there, perhaps some sampling will be appropriate to make the app faster. It would also be nice to deploy this app in a serverless fashion with AWS Lambda.

Code availability

https://github.com/plbremer/gc_manifold_widget