
Avoidable errors in deposited macromolecular structures: an impediment to efficient data mining.

Dauter Z, Wlodawer A, Minor W, Jaskolski M, Rupp B (2014) IUCrJ 1, 179-193. PMCID: PMC4086436

From the EFI Data Core, the following review offers a comprehensive analysis of the most common mistakes made when building models from X-ray crystallographic diffraction data, and of how to avoid them. This is a critical issue for researchers who rely on comparative modeling and virtual docking for functional elucidation, as well as for those who use co-crystal structures to validate predictions. Both approaches depend heavily on the reliability of macromolecular structures, and any future high-throughput functional assignment scheme will depend on the overall accuracy of structural data repositories.


Whereas the vast majority of the more than 85 000 crystal structures of macromolecules currently deposited in the Protein Data Bank are of high quality, some suffer from a variety of imperfections. Although this fact has been pointed out in the past, it is still worth periodic updates so that the metadata obtained by global analysis of the available crystal structures, as well as the utilization of the individual structures for tasks such as drug design, should be based on only the most reliable data. Here, selected abnormal deposited structures have been analysed based on the Bayesian reasoning that the correctness of a model must be judged against both the primary evidence as well as prior knowledge. These structures, as well as information gained from the corresponding publications (if available), have emphasized some of the most prevalent types of common problems. The errors are often perfect illustrations of the nature of human cognition, which is frequently influenced by preconceptions that may lead to fanciful results in the absence of proper validation. Common errors can be traced to negligence and a lack of rigorous verification of the models against electron density, creation of non-parsimonious models, generation of improbable numbers, application of incorrect symmetry, illogical presentation of the results, or violation of the rules of chemistry and physics. Paying more attention to such problems, not only in the final validation stages but during the structure-determination process as well, is necessary not only in order to maintain the highest possible quality of the structural repositories and databases but most of all to provide a solid basis for subsequent studies, including large-scale data-mining projects. 
For many scientists PDB deposition is a rather infrequent event, so the need for proper training and supervision is emphasized, as well as the need for constant alertness of reason and critical judgment as absolutely necessary safeguarding measures against such problems. Ways of identifying more problematic structures are suggested so that their users may be properly alerted to their possible shortcomings.



Figure 1: Selected residues from the structure of proteinase K (PDB entry 3i34) accompanied by electron-density maps obtained from the Uppsala Electron Density Server (EDS) (a, d, g), the PDB_REDO server (b, e, h) and after manual rebuilding and refinement of the model by the present authors (c, f, i). The 2mFo − DFc map (blue) is contoured at 1.5σ and the mFo − DFc map at ±2.0σ (positive contours, green; negative contours, red). (a, b, c) His69 and a spurious Hg ion; (d, e, f) Gln54 with the maps in (f) calculated prior to the introduction of the second conformation of this residue; (g, h, i) Asp207, which is represented as Ser207 in the deposited structure. The purple and red spheres represent dubious Hg and water sites included in the original and the PDB_REDO models, respectively.

Figure 2: C-terminal fragment of the B chain from PDB entry 2p68 with 2mFo − DFc (blue at +1.5σ) and mFo − DFc (green at +2.0σ) maps obtained from (a) the EDS, (b) the PDB_REDO server and (c) after insertion and refinement (by the present authors) of six residues missing from the original model. Red spheres mark incorrectly placed water molecules in the original and the PDB_REDO models.

Figure 3: Two fragments of an inhibitor binding to the Polo-box domain of Plk1 (PDB entry 4mlu). (a) The environment of a histidine moiety of the ligand superimposed on the 2mFo − DFc map (blue at +1.0σ) and the mFo − DFc map at ±3.0σ (positive contours, green; negative contours, red), strongly suggesting the existence of a substituent at the Nε atom and beyond. (b) The environment of a phosphate group at the other end of the ligand with the 2mFo − DFc map (blue at +1.5σ) and the mFo − DFc map at ±3.0σ (positive contours, green; negative contours, red), suggesting that just a few water molecules are present rather than the two conformations of the expected phosphoester moiety.

Figure 4: Packing of the molecules in PDB entry 2w9s. Shown are the unit cell and symmetry elements of the original space group P2 (a) and of the true space group P6₂ (b). Molecules that are equivalent by space-group symmetry elements are presented in the same color.

Figure 5: The hexamer of protein molecules in PDB entry 3m9b, originally presented in P2₁ symmetry (a), is in fact arranged around the space-diagonal threefold axis of the true space group P2₁3 (b). Symmetry-equivalent molecules are shown in the same color.

Figure 6: Four protein molecules in the unit cell of PDB entry 1woc. (a) Original presentation as one dimer and two monomers; (b) after regrouping by an appropriate symmetry operation it is clear that this structure consists of two identical dimers.

Figure 7: Coordination of the ‘strong’ calcium site (yellow sphere) in savinase with the refined values of atomic displacement parameters (B factors; Å²) of the relevant atoms: (a) in PDB entry 1gci refined at 0.78 Å resolution, (b) in PDB entry 1svn refined at 1.4 Å resolution.

Figure 8: A fragment of PDB structure 3e0k around the site of a rather improbable Na+ ion. The B factors (Å²) of the Na+ ion and selected neighboring atoms are shown in parentheses and the distances (in Å) of two close O and two H atoms are also shown. The 2mFo − DFc map (blue) is contoured at 1.5σ and the mFo − DFc map at ±2.5σ (positive contours, green; negative contours, red).

Figure 9: A fragment of bovine trypsin from PDB structure 3unr that includes the residue His91 that is presented as doubly protonated. However, such an assignment of H atoms (gray) is in conflict with the neighboring peptide N—H group of Ser93.

Figure 10: A histogram showing the distribution of sums of site occupancies for individual atoms in double-conformation fragments of the structure of crambin (PDB entry 3nir) refined at the ultrahigh resolution of 0.48 Å.
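
The quantity plotted in Figure 10 can be recomputed for any deposited model. The following is a minimal sketch, not the authors' procedure: it reads fixed-column ATOM/HETATM records in the legacy PDB format and sums the occupancies of each atom over its alternate-location labels. The function names `occupancy_sums` and `flag_outliers` and the tolerance of 0.01 are assumptions made for illustration. For an atom that is fully present, the sum over its alternative conformations would normally be expected to be close to 1.0, so large deviations may be worth a closer look.

```python
from collections import defaultdict


def occupancy_sums(pdb_lines):
    """Sum occupancies over alternate locations for each multi-conformation atom.

    Keys are (chain ID, residue number, insertion code, atom name).
    Atoms with a blank altLoc field (single conformation) are skipped.
    """
    sums = defaultdict(float)
    for line in pdb_lines:
        if not line.startswith(("ATOM  ", "HETATM")) or len(line) < 60:
            continue
        altloc = line[16]                      # column 17: alternate location
        if altloc == " ":
            continue
        key = (
            line[21],                          # column 22: chain ID
            int(line[22:26]),                  # columns 23-26: residue number
            line[26],                          # column 27: insertion code
            line[12:16].strip(),               # columns 13-16: atom name
        )
        sums[key] += float(line[54:60])        # columns 55-60: occupancy
    return dict(sums)


def flag_outliers(sums, tol=0.01):
    """Return atoms whose summed occupancy differs from 1.0 by more than tol.

    The default tolerance is an illustrative assumption, not a published cutoff.
    """
    return {key: total for key, total in sums.items() if abs(total - 1.0) > tol}
```

A typical use would be `flag_outliers(occupancy_sums(open("model.pdb")))`, where `model.pdb` is a placeholder file name. The sketch keys atoms by residue number and atom name only, so sites where two different residue types share a position would need separate handling.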

Creative Commons Attribution Licence