Enzymes of unknown function selected for study by the EFI are referred to as “targets” and are used to develop, test, and revise the integrated sequence/structure-based strategy for functional assignment.
In the first stage of target selection, the Superfamily/Genome Core collects sequences for each superfamily in the Structure Function Linkage Database (SFLD). The Babbitt Laboratory created the SFLD with the resources and aid of the UCSF Resource for Biocomputing, Visualization, and Informatics (RBVI). The SFLD serves as an analysis and archive site for functionally diverse superfamilies and provides members of the EFI with access to highly curated and explicit sequence and genome context information. The Superfamily/Genome Core also generates sequence similarity networks to visualize the sequence relationships within each superfamily. These networks are viewed in Cytoscape, which has the advantage over traditional phylogenetic trees that large sequence sets can be analyzed and visualized more easily. These bioinformatic analyses identify divergent families, which allow exploration of diverse sequence space and, therefore, function space within each superfamily.
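As a rough illustration of the idea behind a sequence similarity network, the sketch below connects two sequences whenever their pairwise similarity score meets a cutoff, then groups connected sequences into candidate families. The sequence names, the scores, the 0.5 threshold, and the function names `build_network` and `find_clusters` are all hypothetical; the source does not describe how the Superfamily/Genome Core computes its networks, and in practice the results are explored in Cytoscape rather than in a script like this.

```python
from collections import defaultdict


def build_network(pairwise_scores, threshold):
    """Return an adjacency map with an edge for every pair scoring >= threshold."""
    adjacency = defaultdict(set)
    for (seq_a, seq_b), score in pairwise_scores.items():
        # Touch both keys so that unconnected sequences still appear as nodes.
        adjacency[seq_a]
        adjacency[seq_b]
        if score >= threshold:
            adjacency[seq_a].add(seq_b)
            adjacency[seq_b].add(seq_a)
    return dict(adjacency)


def find_clusters(adjacency):
    """Group nodes into connected components using a depth-first search."""
    seen = set()
    clusters = []
    for start in sorted(adjacency):
        if start in seen:
            continue
        stack, members = [start], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            members.add(node)
            stack.extend(adjacency[node] - seen)
        clusters.append(sorted(members))
    return clusters


# Hypothetical pairwise similarity scores (illustrative values only).
scores = {
    ("seqA", "seqB"): 0.90,
    ("seqB", "seqC"): 0.80,
    ("seqA", "seqC"): 0.30,
    ("seqC", "seqD"): 0.20,
    ("seqD", "seqE"): 0.95,
}

network = build_network(scores, threshold=0.5)
families = find_clusters(network)
print(families)
```

Raising the threshold splits the network into more, smaller clusters, while lowering it merges them; choosing the cutoff is a judgment call that this sketch does not address.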
Using this information, the Bridging Projects together with the Computation Core select targets that will provide the test cases for developing the strategies for in silico ligand docking. The selection process takes advantage of the expertise of the Bridging Projects, which provide insights into possible functions based on known chemistry, identity of active site functional groups, and composition of specificity-determining residues, motifs, or structures. The selected targets are placed into the EFI’s Protein Core and Structure Core “pipeline.” Protein samples are provided to the Bridging Projects for testing the substrate specificity predictions made by the Computation Core, and to the Structure Core for determination of structures that provide templates for in silico ligand docking and allow verification of the structure‑based predictions of substrate specificity. Priority is given to enzymes in genetically tractable organisms as identified by the Microbiology Core, thereby allowing genetic, phenotypic, and metabolomic approaches for establishing in vivo function.
Several criteria are used to direct target selection:
Specificity Boundaries: As sequence diverges within a superfamily, the substrate specificity (function) changes. An important test of substrate specificity predictions by the Computation Core is whether changes in the substrate specificity of homologous enzymes can be predicted.
Sequence/Function Diversity: Sequence similarity networks allow facile identification of divergent families that have not been experimentally or structurally characterized, and such divergent families likely will have new substrate specificities. An important test of the Computation Core’s algorithms is whether novel specificities can be predicted for targets selected from divergent families.
Structures with No Functions (SNFs): The goal of the Protein Structure Initiative (PSI‑1 and PSI‑2) was to explore sequence space in order to define “fold space.” To meet that goal, structures were determined for many functionally uncharacterized enzymes. A challenge is to “rescue” these targets by testing substrate specificity predictions generated by the Computation Core.
Operon‑Encoded Proteins: Many bacterial enzymes (the primary focus of the EFI) are encoded in operons that specify metabolic pathways. Because enzymes in a pathway will bind structurally related metabolites, in silico ligand docking by the Computation Core to all of the enzymes in a pathway is expected to facilitate identification of the pathway and, therefore, the substrate specificity for the target.
Chemoenzymatic Reagent: In response to the predictions made by the Computation Core, the Bridging Projects undertake the preparation of possible substrates. Many of these are most efficiently prepared enzymatically with, for example, kinases, dehydrogenases, and/or aldolases. As these enzymes are needed, they are added to the protein production pipeline.
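The Sequence/Function Diversity criterion above can be illustrated with a minimal sketch: given sequence clusters and a set of members that already have an experimental or structural characterization, keep only the clusters with no characterized member. The cluster contents, the `characterized` set, and the function name `uncharacterized_clusters` are hypothetical and serve only to show the filtering step; the source does not specify how the EFI performs it.

```python
def uncharacterized_clusters(clusters, characterized):
    """Return the clusters in which no member appears in the characterized set."""
    return [
        cluster
        for cluster in clusters
        if not any(member in characterized for member in cluster)
    ]


# Hypothetical clusters, e.g. taken from a sequence similarity network.
example_clusters = [
    ["seqA", "seqB", "seqC"],
    ["seqD", "seqE"],
    ["seqF"],
]

# Hypothetical set of sequences with a known function or structure.
characterized = {"seqB"}

candidates = uncharacterized_clusters(example_clusters, characterized)
print(candidates)
```

Clusters returned by this filter would be candidates for target selection only; the other criteria in the list, such as operon context or an available structure, would still need to be weighed separately.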