Statistical potential for modeling and ranking of protein-ligand interactions

Fan H, Schneidman-Duhovny D, Irwin JJ, Dong G, Shoichet BK, Sali A. (2011) J Chem Inf Model 51, 3078–3092. PMCID: PMC3246566

As part of the EFI’s Computation Core, the Shoichet and Sali labs developed a new statistical potential for docking that meshes with the strategies used for homology model selection and scoring, thereby strengthening methodologies central to the EFI.  Two atomic statistical scoring functions were developed: PoseScore for recognizing native binding geometries of ligands from decoy poses and RankScore for distinguishing ligands from decoy molecules. The statistical potentials are available through the Integrative Modeling Platform (IMP) software package (http://salilab.org/imp/) and the “Pose & Rank” web server (http://salilab.org/ligscore/). 

Abstract

Applications in structural biology and medicinal chemistry require protein-ligand scoring functions for two distinct tasks: (i) ranking different poses of a small molecule in a protein binding site and (ii) ranking different small molecules by their complementarity to a protein site. Using probability theory, we developed two atomic distance-dependent statistical scoring functions: PoseScore was optimized for recognizing native binding geometries of ligands from other poses and RankScore was optimized for distinguishing ligands from nonbinding molecules. Both scores are based on a set of 8,885 crystallographic structures of protein-ligand complexes but differ in the values of three key parameters. Factors influencing the accuracy of scoring were investigated, including the maximal atomic distance and non-native ligand geometries used for scoring, as well as the use of protein models instead of crystallographic structures for training and testing the scoring function. For the test set of 19 targets, RankScore improved the ligand enrichment (logAUC) and early enrichment (EF(1)) scores computed by DOCK 3.6 for 13 and 14 targets, respectively. In addition, RankScore performed better at rescoring than each of seven other scoring functions tested. Accepting both the crystal structure and decoy geometries with all-atom root-mean-square errors of up to 2 Å from the crystal structure as correct binding poses, PoseScore gave the best score to a correct binding pose among 100 decoys for 88% of all cases in a benchmark set containing 100 protein-ligand complexes. PoseScore accuracy is comparable to that of DrugScore(CSD) and ITScore/SE and superior to 12 other tested scoring functions. Therefore, RankScore can facilitate ligand discovery, by ranking complexes of the target with different small molecules; PoseScore can be used for protein-ligand complex structure prediction, by ranking different conformations of a given protein-ligand pair. 
The statistical potentials are available through the Integrative Modeling Platform (IMP) software package (http://salilab.org/imp) and the LigScore Web server (http://salilab.org/ligscore/).
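The early-enrichment measure mentioned in the abstract can be illustrated with a short sketch. It assumes one common definition of the enrichment factor (the hit rate among the top-ranked 1% of the database divided by the hit rate in the whole database) and assumes that lower scores rank better; the paper's exact EF(1) and logAUC definitions may differ, and the function name `enrichment_factor` is hypothetical.

```python
def enrichment_factor(scores, is_ligand, fraction=0.01):
    """Enrichment factor for the top `fraction` of a ranked database.

    Assumptions (not taken from the paper): lower scores rank better, and
    EF = (ligand rate in the top fraction) / (ligand rate overall).
    """
    if len(scores) != len(is_ligand):
        raise ValueError("scores and is_ligand must have the same length")
    n_total = len(scores)
    n_ligands = sum(1 for flag in is_ligand if flag)
    if n_total == 0 or n_ligands == 0:
        raise ValueError("need at least one molecule and one ligand")

    # Number of molecules in the top fraction (at least one).
    n_top = max(1, round(n_total * fraction))

    # Indices sorted from best (lowest) to worst (highest) score.
    order = sorted(range(n_total), key=lambda i: scores[i])
    hits = sum(1 for i in order[:n_top] if is_ligand[i])

    return (hits / n_top) / (n_ligands / n_total)
```

Under this definition a value of 1 corresponds to random ranking, and larger values mean that ligands are concentrated at the top of the ranked list.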

Figure 1. Effect of the distance cutoff on the performance of the statistical potential, shown on the training sets. (a) With two parameters of the potential fixed (wref = 0.4, wuni = 0), the potential was most accurate in ligand pose detection when the remaining parameter, rmax, was set to 6 Å, selecting the correct binding mode for 64 (91%) of the 70 proteins in the training set. (b) With two parameters of the potential fixed (wref = 0.4, wuni = 0), the potential was most accurate in rescoring when the remaining parameter, rmax, was set to 6 Å, improving enrichment (logAUC) for 14 targets in the DUD-1 training set.

Figure 2. Four examples of accurate ligand pose prediction from the PoseScore test set. For each target, the crystal structure of the protein binding site and the cocrystallized ligand (solid stick, green) as well as the best-ranked geometric decoy of the ligand (solid stick, yellow) are shown. (a) Thrombin (1a46). The crystal structure of the ligand was ranked 1. A geometric decoy with an rmsd error of 1.39 Å was ranked 2. (b) Carbonic anhydrase I (1bzm). The crystal structure of the ligand was ranked 3. A geometric decoy with an rmsd error of 1.65 Å was ranked 1. (c) Elastase (1ela). The crystal structure of the ligand was ranked 1. A geometric decoy with an rmsd error of 1.37 Å was ranked 2. (d) Streptavidin (1sre). The crystal structure of the ligand was ranked 5. A geometric decoy with an rmsd error of 1.39 Å was ranked 1.
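The success criterion behind these examples (the top-scored pose counts as correct if it is the crystal pose or a decoy within 2 Å rmsd of it) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, lower scores are assumed to rank better, and all score values and the second-ranked pose in the example are made up. Only the ranks and the 1.65 Å rmsd for carbonic anhydrase I come from the caption.

```python
def top_pose_is_correct(poses, rmsd_cutoff=2.0):
    """Return True if the best-scored pose lies within `rmsd_cutoff` of the crystal pose.

    `poses` is a list of (score, rmsd_to_crystal) tuples; the crystal pose
    itself has rmsd 0.0. Lower score = better rank (an assumption).
    """
    if not poses:
        raise ValueError("need at least one pose")
    best_score, best_rmsd = min(poses, key=lambda pose: pose[0])
    return best_rmsd <= rmsd_cutoff


def success_rate(complexes, rmsd_cutoff=2.0):
    """Fraction of complexes whose best-scored pose is a correct binding pose."""
    if not complexes:
        raise ValueError("need at least one complex")
    correct = sum(1 for poses in complexes if top_pose_is_correct(poses, rmsd_cutoff))
    return correct / len(complexes)


# Carbonic anhydrase I (1bzm), as in panel (b): a 1.65 Å decoy is ranked 1 and
# the crystal pose is ranked 3. Scores and the rank-2 pose are hypothetical.
carbonic_anhydrase = [(-5.0, 1.65), (-4.5, 3.10), (-4.0, 0.0)]
print(top_pose_is_correct(carbonic_anhydrase))  # True: 1.65 Å is within the 2 Å cutoff
```

This is why panels (b) and (d) count as accurate predictions even though the crystal pose itself is not ranked first.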

Figure 3. Four examples of inaccurate ligand pose prediction from the PoseScore test set. For each target, the crystal structure of the protein binding site, the cocrystallized ligand, and the highest ranking geometric decoy of the ligand are presented as in Figure 2. See Results for more detail.

Figure 4. Ligand poses of AmpC β-lactamase from the test set of RankScore. (a) 2D images of AmpC ligands HTC and CTC. (b) Docking poses of HTC (yellow stick) and CTC (blue stick) generated by screening against the B chain of the AmpC structure (PDB code: 1xgj).

Figure 5. Effect of the parameter α on the performance of the statistical potential derived using the DFIRE formula, shown on the training set. The potential was calculated independently with α set to 1, 2, 3, 4, 5, and 6. For each α value, five values of the maximal boundary rmax were tested: 6 Å (black solid line), 8 Å (black dotted line), 10 Å (red solid line), 12 Å (red dotted line), and 14 Å (blue solid line). The generated potentials were tested on the training set containing 70 proteins. The potential was most accurate when α was set to 3 and rmax was set to 6 Å.
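To make the roles of α and rmax concrete, the following sketch computes a DFIRE-style distance-dependent potential for a single atom-type pair. It follows the general DFIRE form, in which the expected count in each distance bin is the count observed in the cutoff bin scaled by (r / rmax)^α, and it assumes equal-width bins. It is an illustration under those assumptions rather than the paper's exact equation: the function name is hypothetical, and the wref and wuni parameters are not modeled.

```python
import math


def dfire_like_potential(n_obs, r_centers, alpha, kT=1.0):
    """DFIRE-style potential for one atom-type pair, one value per distance bin.

    n_obs     : observed pair counts per bin; the last bin is the cutoff (rmax) bin
    r_centers : bin-center distances in Å, same length as n_obs, equal bin widths
    alpha     : exponent of the reference state
    kT        : energy scale (arbitrary units in this sketch)
    """
    if len(n_obs) != len(r_centers) or not n_obs:
        raise ValueError("n_obs and r_centers must be non-empty and equally long")

    r_cut = r_centers[-1]
    n_cut = n_obs[-1]

    potential = []
    for n, r in zip(n_obs, r_centers):
        # Expected count from the reference state: scale the cutoff-bin count.
        n_expected = (r / r_cut) ** alpha * n_cut
        if n > 0 and n_expected > 0:
            potential.append(-kT * math.log(n / n_expected))
        else:
            # Simplification: bins without observations get zero energy.
            potential.append(0.0)
    return potential
```

By construction the potential is zero in the cutoff bin, and it vanishes everywhere if the observed counts grow exactly as r^α. Changing α or the position of the last bin therefore changes which distances are treated as favorable, which is the dependence the figure explores.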

Figure 6. Schematic representation of a protein–ligand complex. The protein is approximated as the outer sphere (solid line), and the ligand is completely embedded inside the protein as the inner sphere with a radius r. For a ligand atom positioned at a distance d from the ligand center, the number of protein–ligand atom pairs within a certain distance R (R ≤ Rcutoff) is calculated by eq 12.

Figure 7. The probability distribution of protein–ligand atom pairs, assuming no difference between atom types. Five distributions are plotted. First, the distribution derived using eq 8 from the sample of X-ray structures of protein–ligand complexes (black solid line). Second, the distribution derived using eq 8 from the sample of docking poses that had an rmsd error larger than 2 Å with respect to the X-ray structures (black dashed line). Third, the distribution derived using eq 11 with the parameter α set to 2 (red solid line). Fourth, the distribution derived using eq 11 with the parameter α set to 3 (blue solid line). Fifth, the distribution derived using eq 11 with the parameter α set to 4 (brown solid line).

2011 Computation Core Publication
Reprinted with permission from the Journal of Chemical Information and Modeling. © 2011 American Chemical Society.