The Computation Core leads a collaborative investigation into the benefits of introducing a covalent bond constraint in virtual screening of GST enzymes, with the aim of discovering known and novel substrates. Both X-ray structures and comparative models were used as docking receptors. Compared with conventional constraints, covalent docking increased both the accuracy of the docking pose and the enrichment of known substrates in the ligand library. The technique may be applied to other enzyme systems to speed the discovery of enzymatic function through more confident substrate predictions.
Enzymes in the glutathione transferase (GST) superfamily catalyze the conjugation of glutathione (GSH) to electrophilic substrates. As a consequence they are involved in a number of key biological processes, including protection of cells against chemical damage, steroid and prostaglandin biosynthesis, tyrosine catabolism, and cell apoptosis. Although virtual screening has been used widely to discover substrates by docking potential noncovalent ligands into active site clefts of enzymes, docking has been rarely constrained by a covalent bond between the enzyme and ligand. In this study, we investigate the accuracy of docking poses and substrate discovery in the GST superfamily, by docking 6738 potential ligands from the KEGG and MetaCyc compound libraries into 14 representative GST enzymes with known structures and substrates using the PLOP program [Jacobson, Proteins 2004, 55, 351]. For X-ray structures as receptors, one of the top 3 ranked models is within 3 Å all-atom root mean square deviation (RMSD) of the native complex in 11 of the 14 cases; the enrichment LogAUC value is better than random in all cases, and better than 25 in 7 of 11 cases. For comparative models as receptors, near-native ligand-enzyme configurations are often sampled but difficult to rank highly. For models based on templates with the highest sequence identity, the enrichment LogAUC is better than 25 in 5 of 11 cases, not significantly different from the crystal structures. In conclusion, we show that covalent docking can be a useful tool for substrate discovery and point out specific challenges for future method improvement.
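The LogAUC enrichment values quoted above can be illustrated with a minimal Python sketch. It assumes one common definition of LogAUC: the area under the enrichment curve (fraction of known substrates recovered versus fraction of the library screened) with a log10-scaled x-axis running from 0.1% to 100%, expressed as a percentage of the maximum possible area. The function name log_auc, the lower bound lam, and the linear interpolation between ranks are assumptions made for illustration, not details taken from the study.

```python
import math


def log_auc(ranked_is_known, lam=0.001):
    """Percent area under the enrichment curve on a log-scaled x-axis.

    ranked_is_known: booleans ordered from best to worst docking score;
        True marks a known substrate (hypothetical input format).
    lam: lower bound of the x-axis as a fraction of the library
        (0.001 = 0.1%, an assumed convention).
    """
    n = len(ranked_is_known)
    n_known = sum(bool(flag) for flag in ranked_is_known)
    if n == 0 or n_known == 0:
        raise ValueError("need a non-empty ranking with at least one known")

    area = 0.0
    found = 0
    for k, is_known in enumerate(ranked_is_known):
        # Curve segment between screening k and k + 1 compounds.
        x0, y0 = k / n, found / n_known
        found += bool(is_known)
        x1, y1 = (k + 1) / n, found / n_known
        if x1 <= lam:
            continue  # segment lies entirely below the lower bound
        slope = (y1 - y0) / (x1 - x0)
        intercept = y0 - slope * x0
        lo = max(x0, lam)
        # Exact integral of (intercept + slope * x) / x from lo to x1.
        area += intercept * math.log(x1 / lo) + slope * (x1 - lo)

    # Normalize by the area of a perfect curve (y = 1 everywhere).
    return 100.0 * area / math.log(1.0 / lam)
```

Under this assumed definition, a ranking that is no better than random scores about 14.5, which gives a reference point for reading a threshold such as 25.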
Figure 2: Representative network view of the GST superfamily showing where docking targets fall in this superfamily. In this view, 2,190 50% sequence identity filtered nodes (“ID50 node”) representing 13,493 sequences are shown as dots, and each sequence similarity relationship with a BLAST E-value ≤ 1 × 10⁻¹³ is shown as a line or edge between representative nodes. The largest clusters are labeled by their subgroup in the structure–function linkage database (SFLD).(37) The 13,493 sequences for cytosolic GSTs (“cytGSTs”) are based on Pfam (v26.0)(34) sequences that were at least 100 residues in length and had scores above the gathering threshold for at least one Pfam hidden Markov model (HMM) corresponding to the cytGST N-terminal domain (“GST_N,” “GST_N2”, or “GST_N3”; collectively referred to here as “GST_N*”). Also included in the data set were 58 proteins that lacked matches to a GST_N* HMM but were chosen as Enzyme Function Initiative (EFI)(56) targets for experimental characterization because of their sequence similarities to cytGSTs. Sequence similarity was detected with an all-by-all BLAST search using blastp (v2.2.24).(35) The data were viewed with Cytoscape (v2.8.3). Sequence identities are calculated using CD-HIT(36) (“ID50” set). The edges are shown using the organic layout where nodes that are more highly interconnected are clustered more tightly together. PDB structures associated with cytGST sequences (95% or more sequence identity to the PDB sequence) were identified using the SFLD(37) interface. An ID50 node is shown as a blue dot if any member sequence of the node was associated with a PDB structure. Representative nodes where only an EFI structure was associated with its member sequences are shown as red dots. Retrospective docking targets are marked using dark blue borders; only 10 targets are shown because the rest are more than 50% identical to these 10 representatives shown.
Figure 3: Comparison of the predicted complex between the native product and its holo X-ray structure with the native pose: (A) 1F3B with GBX; (B) 2F3M with GTD; (C) 2GST with GPS; (D) 1M9B with IBG. The surface of the receptor is shown, with red and blue corresponding to oxygen and nitrogen atoms. The product is shown in the native configuration (gray), the most accurate sampled configuration (yellow), and the top ranked configuration by the PLOP energy function (light blue) and PoseScore (green).
Figure 4: Comparison of comparative model properties and the RMSD errors of the docked poses. Each point represents the property of a single comparative model and/or the RMSD error of the docked pose of its corresponding crystallographic product against this model. Results for models built based on apo and holo template structures are shown in red and blue, respectively, with the least-squares linear fits shown as dashed lines. The Pearson correlation coefficients R are also shown. The gray dashed line indicates the lower bound of the RMSD of the top 3 ranked poses.
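The least-squares fits and Pearson correlation coefficients mentioned in the Figure 4 caption can be sketched in Python as follows. This is a minimal illustration of the standard formulas, not the analysis code used for the figure. The function name pearson_and_fit and the input format (two equal-length lists, for example a model property and the corresponding pose RMSD error) are assumptions.

```python
import math


def pearson_and_fit(xs, ys):
    """Return (r, slope, intercept) for paired observations.

    xs: model property values (hypothetical, e.g. binding site RMSD in Å).
    ys: RMSD errors of the docked poses for the same models.
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with >= 2 points")

    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations and cross-deviations from the means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")

    slope = sxy / sxx                    # least-squares slope
    intercept = mean_y - slope * mean_x  # fitted line passes through the means
    r = sxy / math.sqrt(sxx * syy)       # Pearson correlation coefficient
    return r, slope, intercept
```

Calling the function once for the apo-template models and once for the holo-template models would give the two separate fits drawn in the figure.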
Figure 5: Non-hydrogen atom RMSD error of the docked poses, for the crystallographic product against its corresponding comparative models with different sequence identities. For each target, the RMSD errors of the best-sampled poses using apo templates (striped red bar), the RMSD errors of the top-ranked poses by PoseScore using apo templates (striped blue bar), the RMSD errors of the best-sampled poses using holo templates (red bar), and the RMSD errors of the top-ranked poses by PoseScore using holo templates (blue bar) are shown. For comparison, the RMSD errors of the docked poses using holo X-ray structures are shown as horizontal lines (blue solid line for the top-ranked pose by PoseScore, red dashed line for the best-sampled pose).
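The RMSD errors reported in these captions compare a docked pose with the native pose in the same receptor frame. A minimal Python sketch of that measure follows. It assumes that both poses list the same non-hydrogen atoms in the same order, that no superposition is applied, and that symmetry-equivalent atoms are not remapped. The function names rmsd and best_rmsd_in_top_n are hypothetical.

```python
import math


def rmsd(pose, native):
    """Root mean square deviation between two matched coordinate lists.

    pose, native: lists of (x, y, z) tuples in Å, same atoms, same order.
    """
    if len(pose) != len(native) or not pose:
        raise ValueError("poses must be non-empty and the same length")
    squared = sum(
        (a - b) ** 2
        for atom_p, atom_n in zip(pose, native)
        for a, b in zip(atom_p, atom_n)
    )
    return math.sqrt(squared / len(pose))


def best_rmsd_in_top_n(ranked_poses, native, n=3):
    """Lowest RMSD to the native pose among the n best-ranked poses."""
    return min(rmsd(p, native) for p in ranked_poses[:n])
```

With these helpers, the best-sampled pose corresponds to the minimum RMSD over all sampled poses, and the top-ranked pose corresponds to n=1 on a list sorted by score.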
Figure 6: Docking of native products against comparative models. (A) Successful docking of GBX onto a model of 3CSH, based on 85% sequence identity to the template 3O76 (chain A). The ribbon representation of the crystal structure is shown in light gray and that for the model in orange. The C-α RMSD error of the model is 0.64 Å, and the non-hydrogen atom RMSD error of the binding site is 0.71 Å. The ligand is shown in the stick representation. The native pose is shown in light gray, the best-sampled pose (RMSD of 0.90 Å) is shown in yellow, the pose top ranked by PLOP (0.92 Å) is shown in cyan, and the pose top ranked by POSESCORE (0.92 Å) is shown in green (not visible because it is hidden behind the cyan structure). (B) Failed docking of GPS onto a model of 1F3B, based on 76% sequence identity to the template 1YDK (chain B). The ribbon representation of the crystal structure is shown in light gray and that for the model in orange. The C-α RMSD error of the model is 2.42 Å, and the non-hydrogen atom RMSD error of the binding site is 2.47 Å. The ligand is shown in the stick representation. The native pose is shown in light gray, the best-sampled pose (RMSD of 5.38 Å) is shown in yellow, the pose top ranked by PLOP (6.58 Å) is shown in cyan, and the pose top ranked by POSESCORE (0.92 Å) is shown in green.
Figure 7: Enrichment curves for the 11 GST targets with 3 or more known products, using holo X-ray structures. For each target, the enrichment curves for the individual chains (blue, green, yellow, and cyan lines), the consensus scoring for multiple chains (red line), and random selection (black dashed line) are shown.
Figure 8: Enrichment curves for target 2F3M using its 6 comparative models based on different apo and holo templates. For each model, the enrichment curves for the individual chains (blue, green, yellow, and cyan lines), the consensus scoring for multiple chains (red line), and random selection (black dashed line) are shown. Target–template sequence identities are as follows: 33% (2F3M-3T2U, apo), 50% (2F3M-2FHE, apo), 33% (2F3M-2GSR, holo), 42% (2F3M-1M99, holo), 66% (2F3M-1GSU, holo), and 85% (2F3M-2C4J, holo). Binding site RMSD errors are as follows: 5.9 Å (2F3M-3T2U, apo), 3.1 Å (2F3M-2FHE, apo), 7.4 Å (2F3M-2GSR, holo), 3.1 Å (2F3M-1M99, holo), 1.4 Å (2F3M-1GSU, holo), and 1.1 Å (2F3M-2C4J, holo).
Figure 9: Enrichment curves for the 11 GST targets with 3 or more known products, using their comparative models based on templates with the highest sequence identities. For each target, the enrichment curves for the individual chains (blue, green, yellow, and cyan lines), the consensus scoring for multiple chains (red line), and the random selection (black dashed line) are shown. Target–template sequence identities are as follows: 85% (2F3M-2C4J), 85% (3CSH-2C4J), 77% (2C4J-6GSV), 54% (3LJR-2C3Q), 33% (2GSQ-2GSR), 28% (3GX0-1PN9), 76% (2GST-2C4J), 44% (1M9B-2C4J), 64% (1VF3-3KTL), 76% (2AB6-6GSV), and 46% (1GWC-2VO4).
Reprinted with permission from Dong et al. Journal of Chemical Information and Modeling 54 (6), pp 1687–1699. Copyright 2014 American Chemical Society.