The following review captures the successes achieved and challenges encountered by the EFI Computation Core over the past four years. While milestones have been achieved via techniques such as pathway docking, metabolite docking, and large-scale homology modeling, opportunities remain in the form of challenging metabolites, orphan enzymes, and enzymes classified as Domains of Unknown Function, to name a few.
The rapid growth in the number of protein sequences that can be inferred from sequenced genomes presents challenges for function assignment, because only a small fraction (currently <1%) has been experimentally characterized. Bioinformatics tools are commonly used to predict functions of uncharacterized proteins. Recently, there has been significant progress in using protein structures as an additional source of information to infer aspects of enzyme function, which is the focus of this review. Successful application of these approaches has led to the identification of novel metabolites, enzyme activities, and biochemical pathways. We discuss opportunities to systematically elucidate protein domains of unknown function, orphan enzyme activities, dead-end metabolites, and pathways in secondary metabolism.
Figure 1: Structure-based virtual metabolite docking protocol for enzyme activity prediction. When no structure has been experimentally determined for a protein sequence, a model can be built using a variety of comparative modeling methods, but only when a structure is available for a homologous protein that has approximately 30% or greater sequence identity to the protein of interest. Whether using a structure or a model, it is critical that active site metal ions and cofactors are present, and that catalytic residues are positioned appropriately for catalysis. Virtual metabolite libraries can be constructed and ‘docked’ against the putative active sites of structures or models using computational tools more commonly used in structure-based drug design (e.g., Glide or DOCK). The docking scoring functions can be used to rank the ligands according to their estimated relative binding affinities. Top-scoring metabolites are typically inspected for plausibility (Is the predicted binding mode compatible with catalysis? Is the metabolite likely to be present in the relevant organism?), and then selected for experimental testing (in vitro enzymology). Protocols similar to that shown here have been used in retrospective and prospective studies 22, 23, 24, 25, 27, 28, 29, 30, 31, 32, 33, 36 and 39.
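The ranking and plausibility-inspection steps of this protocol can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in the studies cited: the `DockedMetabolite` record, the `shortlist` function, the metabolite names, and the scores are all hypothetical placeholders, and the code assumes that more negative docking scores indicate better predicted binding. The two plausibility flags stand in for judgments that are, in practice, made by manual inspection.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DockedMetabolite:
    """One docked ligand from a virtual metabolite library (hypothetical record)."""
    name: str
    score: float          # docking score; assumed: more negative = better
    pose_catalytic: bool  # is the predicted binding mode compatible with catalysis?
    in_organism: bool     # is the metabolite likely present in the relevant organism?


def shortlist(hits: List[DockedMetabolite], top_n: int = 3) -> List[DockedMetabolite]:
    """Rank by docking score, keep the top_n, then apply the plausibility checks."""
    ranked = sorted(hits, key=lambda h: h.score)  # best (most negative) first
    top = ranked[:top_n]
    return [h for h in top if h.pose_catalytic and h.in_organism]


# Placeholder data for illustration only; these are not real docking results.
library = [
    DockedMetabolite("metabolite_A", -9.1, True, True),
    DockedMetabolite("metabolite_B", -8.0, False, True),
    DockedMetabolite("metabolite_C", -7.5, True, True),
    DockedMetabolite("metabolite_D", -10.2, True, False),
]

for candidate in shortlist(library, top_n=3):
    print(candidate.name, candidate.score)
```

Applying the plausibility filter only after truncating to the top-scoring ligands mirrors the order described in the caption: scores are used for ranking, and inspection is reserved for the small set that would otherwise go forward to in vitro testing.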
Figure 2: Predicted binding poses are in good agreement with subsequently determined experimental structures. Predicted ligand binding mode (cyan) superimposed with the X-ray crystal structure (gold) of: (A) S-adenosylhomocysteine deaminase (PDB: 2PLM); (B) N-succinyl-l-Arg racemase (PDB: 2P8C); (C) d-Ala-d-Ala epimerase (PDB: 3Q4D); and (D) a polyprenyl synthase (PDB: 4FP4). In (B–D), the docking predictions were made using homology models based on crystal structures with 35%, 39%, and 29% sequence identity, respectively.
Figure 3: Structure-guided discovery of new enzymes in a novel hydroxyproline betaine metabolism pathway. (A) shows the name, TrEMBL annotation, and most similar homolog in the Protein Data Bank (PDB) for each protein in the pathway. The automated TrEMBL annotations are incorrect or imprecise for all proteins in the pathway. However, there is rich structural information that can be used for modeling and docking, as shown in the closest PDB homolog column. The pathway is shown in (B). (C–E) show the binding site and/or active site of three proteins in the pathway [HpbD, HpbJ, and HpbR, respectively, shown in bold in (A)], along with the docking-predicted binding mode for the ligand trans-4-hydroxy-l-proline betaine (ball-and-stick, green). Both HpbJ and HpbR have a predicted cation-π cage, a motif known for binding quaternary amines. In HpbD, two catalytic residues (Lys163 and Lys265) replace aromatic residues, leaving Trp320 as the key aromatic residue forming a cation-π interaction with the substrate.
Figure 4: The biosynthesis of cholesterol: a paradigmatic isoprenoid pathway. Crystal structures of key enzymes in the pathway have been solved, including farnesyl pyrophosphate synthase [gold; Protein Data Bank (PDB): 1RQI], squalene synthase (light blue; PDB: 3WEG), and oxidosqualene-lanosterol cyclase (magenta; PDB: 1W6K). These crystal structures provide opportunities to predict functions of related enzymes of the isoprenoid synthase superfamily. However, function prediction for the terpenoid synthases (also called terpene cyclases) is challenging due to the huge product chemical space created by carbocation rearrangements.