Butyrylcholinesterase Druggability

Publication

See the publication in the Journal of Chemical Information and Modeling.

Introduction

Proteins and ligands (such as small-molecule drugs) bind in a conformational equilibrium (across the ensemble individual protein-ligand pairs interchange between multiple distinct positions). Knowledge of the total set of conformations (called binding modes) provides a complete picture of the protein-ligand binding behaviour which can be used for rational drug design. We apply a novel method to study butyrylcholinesterase inhibited by dialkyl aryl phosphates, such as the example below.

dialkylarylphosphate.png

To study this system, we use molecular dynamics simulations. By the property of ergodicity, the ensemble of snapshots from the simulations-over-time average to the ensemble of protein-ligand systems at the same point in time, as one would find in-vivo or in-vitro. Ergodicity is visualized below.

Picture5.PNG
Therefore, the method can be outlined as:
1) performing simulations
2) partitioning simulation snapshots into binding modes
3) reporting binding modes

Technologies

This project made extensive use of the Python libraries: * pandas * sklearn * matplotlib * numpy * hdbscan

Method Overview

Performing simulations

We prepared simulations using the molecular dynamics suite GROMACS and ran those simulations on the distributed computing network Folding@home. In total, we prepared over 1 million snapshots from over 100 µs of simulations for each inhibitor.

Ascertaining Binding modes

The creation of binding modes can be simplified into two steps. First, we transform the snapshots (which can be thought of as a list of atoms and spatial coordinates) into descriptor vectors (often called fingerprints in the chemical sciences) that rationally encapsulate the system and onto which certain clustering techniques can be applied.

To do this, we used a fingerprint. The components of this fingerprint are each amino acid in the protein joined with each functional group within the ligand. The magnitude of each component is the (normalized) number of amino acid atoms within 5 Angstroms of that functional group. By studying the proximity of atoms we interrogate the intermolecular forces that govern binding.

Picture14.PNG

The next step involves clustering the descriptor vectors to group snapshots that are similar, where each cluster is considered a binding mode. To perform this clustering, we applied K-means clustering. K-means is a nice choice because of the large number of datapoints and comparatively large descriptor vectors. Contrastingly, K-means is a challenging choice because the number of clusters, k, must be specified.

To determine, k, we used a novel approach where we looked at the variance and range of cluster populations over repeated clusterings. We chose this approach because inertial graphs showed now elbow and we reserved silhouette methods for cluster confirmation because we lacked a readily available group truth. To explain our method, we show the following figure:

dif_k_smaller_2.png

In this image, it is clear that k can be 4 or 3.

For a given k, if one were to invoke K-means multiple times, the one could sort the populations of the resulting clusters. When K is a number for which there is a “natural” clustering (be it 3 or 4) then the populations of all clusters will, on the average, have comparatively little variance. On the other hand, if one uses an incorrect number of clusters like 5, then K-means will be forced to assign cluster membership in a non-reproducible way.

This, in turn, leads to variance in the population of the clusters. While smaller numbers without variance are valid choices (like 3 rather than 4), larger numbers allow for finer cluster-structure to be revealed. We see this in an example from one of the alkyl aryl phosphates that we studied.

dif_k_2_top-2.png

Here, the x-axis indicates the value of k and the red-and-blue coloring represent the variance and range over 10 clusterings for that k. Visually, there are 7 binding modes for this system.

Results

Before reporting results we perform two sanity checks: the first validates the simulations in the context of clusters and the second validates our clusters.

Ensemble Equilibration

To confirm that our simulations and partitions are valid (and choose a subset of snapshots on which to perform clustering - more details in publication) we plot the cluster populations against the time coordinate of the simulations. We observe asymptotic population behaviour (equilibration) and are therefore confident about our simulations and clusters.

population_over_time.png

Silhuoette Matrix

Because we lack a readily obtainable group truth, we use a silhuoette approach to evaluate the cluster quality. Here, we show the result for the same ligand whose k-selection graphic is shown above.

asdf.png

For each cluster, a random subset of descriptor vectors are chosen and associated with each cluster number (number being population rank). An RMSD similarity operation was performed between each invluded vector, with more similar vectors resulting in a darker green color. Hence, we see that the clusters that we obtain are more-similar than dissimilar. Notably, the smaller-k option revealed in the population-variance graph reveals itself here as a macro-structure block of mild similarity among the 4 most populated modes.

Contact Tables

Confident in our simulations and clustering, we report results in a compact, biologist-friendly result, shown here.

contact_table.png

This contact table describes the behaviour of one dialklyl aryl phosphate with butyrylcholinesterase. Each row is a binding mode and each column is an amino acid that is found in the binding pocket of butyrylcholinesterase. The columns are grouped-and-colored according to literature about important sets of amino acids. Entries in this matrix are the most prominent functional group and the color of an entry is the type of intermolecular interaction that is expected based on the chemistry of the amino acid and the functional group. Again, we see large similarity between the first 4 binding modes, but very different behaviour for the less populated modes.