The following publication is the definitive report of the Structure-Function Linkage Database (SFLD), which has been supported in part by the EFI through the Superfamily/Genome Core. The SFLD has provided the EFI with a platform for both identifying and visualizing links between protein families and their corresponding functional specificity. The SFLD has been particularly fruitful because the EFI works with "functionally diverse" superfamilies, such as the Enolase Superfamily, which is highlighted within.
The Structure-Function Linkage Database (SFLD) is a manually curated classification resource describing structure-function relationships for functionally diverse enzyme superfamilies. Members of such superfamilies are diverse in their overall reactions yet share a common ancestor and some conserved active site features associated with conserved functional attributes such as a partial reaction. Thus, despite their different functions, members of these superfamilies 'look alike', making them easy to misannotate. To address this complexity and enable rational transfer of functional features to unknowns only for those members for which we have sufficient functional information, we subdivide superfamily members into subgroups using sequence information, and lastly into families, sets of enzymes known to catalyze the same reaction using the same mechanistic strategy. Browsing and searching options in the SFLD provide access to all of these levels. The SFLD offers manually curated as well as automatically classified superfamily sets, both accompanied by search and download options for all hierarchical levels. Additional information includes multiple sequence alignments, tab-separated files of functional and other attributes, and sequence similarity networks. The latter provide a new and intuitively powerful way to visualize functional trends mapped to the context of sequence similarity.
Figure 1. Hierarchical classification in the SFLD, SSNs and their potential contribution for enzyme classification and function prediction. (A) SFLD classification is exemplified by the enolase superfamily. (i) This superfamily is divided into seven subgroups; three of them (the enolase, mandelate racemase and muconate cycloisomerase subgroups) are shown in this panel. As shown here for the muconate cycloisomerase subgroup only, these subgroups are divided into families. (Note: The same name, e.g. enolase, can represent a superfamily, a subgroup, and a family.) Colored circles serve as a legend for panels B and C. (ii) Superposition of three residues reflecting conservation of important active site machinery across all members of the superfamily. Each color represents a different structure, one from each of the three subgroups: a dipeptide epimerase in green (PDB: 3RIT), a mandelate racemase in magenta (PDB: 1MDR) and an enolase in yellow (PDB: 7ENL). All enolase superfamily members share three metal binding active site residues that participate in a common partial reaction, abstraction of a proton, that initiates each of their different overall reactions. (iii) Dipeptide epimerases, members of a family within the muconate cycloisomerase subgroup, share functionally important residues: three conserved in all members of the superfamily and associated with the proton abstraction, and two additional residues (K162, K266) that also contribute to proton abstraction. Another set of residues (upper part of panel iii, R24, E51 and D296) are thought to participate in the specificity of some dipeptide epimerases (35). Thus, these latter three residues differentiate these dipeptide epimerases from other families in the superfamily. The dipeptide ligand crystallized with 3RIT is shown in cyan. (B) A representative SSN of the enolase superfamily. Each node represents all sequences that share >70% sequence identity. Node size corresponds to the number of sequences that are represented by the node; the smallest nodes represent one sequence, and the largest nodes represent >100 sequences. Edges between representative nodes indicate a mean BLAST E-value, between all pairs of sequences in these nodes, <1e−43. Coloring is as shown in the subgroup (node border color) and family (node fill color) sub-panels in A. (C) A full SSN of the muconate cycloisomerase subgroup. Each node represents a single protein, and each edge indicates a BLAST E-value <1e−80. Using this network representation layout, within-cluster similarities are greater than similarities between clusters. Nodes are colored only if they are associated with reliable evidence, i.e. better evidence than ‘inferred from electronic annotation’. Different families are color-coded (panel A). The correspondence between function and sequence is evident in the network. Different families tend to appear in specific sequence clusters, allowing reliable (and visual) delineation of the sequence space that corresponds to a specific function.
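The E-value thresholding described in the Figure 1 caption can be illustrated with a small sketch. The Python below is a minimal illustration only, not the SFLD pipeline: it assumes pairwise BLAST E-values are already available as a dictionary, keeps an edge only when the E-value is below a chosen cutoff, and reports the resulting connected clusters. The function names (build_ssn, find_clusters), the sequence labels and the toy E-values are all hypothetical.

```python
from collections import defaultdict


def build_ssn(pairwise_evalues, threshold):
    """Build an adjacency map, keeping only edges with E-value < threshold.

    pairwise_evalues maps (seq_a, seq_b) tuples to a BLAST E-value
    (hypothetical input format chosen for this sketch).
    """
    adjacency = defaultdict(set)
    for (a, b), evalue in pairwise_evalues.items():
        # Touch both keys so isolated sequences still appear as nodes.
        adjacency[a]
        adjacency[b]
        if evalue < threshold:
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency


def find_clusters(adjacency):
    """Return connected components as sorted lists, using depth-first search."""
    seen = set()
    result = []
    for start in sorted(adjacency):
        if start in seen:
            continue
        component = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        result.append(sorted(component))
    return result


# Toy E-values (invented for illustration, not taken from the SFLD).
toy_evalues = {
    ("seqA", "seqB"): 1e-95,
    ("seqB", "seqC"): 1e-82,
    ("seqC", "seqD"): 1e-50,
    ("seqD", "seqE"): 1e-90,
}

strict_clusters = find_clusters(build_ssn(toy_evalues, 1e-80))
relaxed_clusters = find_clusters(build_ssn(toy_evalues, 1e-43))
print(strict_clusters)
print(relaxed_clusters)
```

In this toy data the stricter cutoff separates the sequences into two clusters, while the more permissive cutoff merges them into one. This is the general effect of choosing a more stringent threshold for a subgroup-level network than for a superfamily-level one.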
Figure 2. Searching by sequence in the SFLD. The screenshot at the top shows the query sequence, GI:390523686. Choosing ‘search’ (with the HMM option) compares all SFLD HMMs against the query sequence. The table of results (middle panel) lists all the classification levels (family/subgroup/superfamily) for which a relevant hit was found. The two top families are glutamate 2,3-aminomutase and L-lysine 2,3-aminomutase. Clicking the ‘Align to this family’ button leads to a list of the active site residues that appear in the family, and to an MSA of the family with the query included as the bottom sequence (red ellipse). As explained in the text, although the protein is annotated in GenBank as L-lysine 2,3-aminomutase, evaluation of the MSA suggests that the characteristic active site residues for that function are not conserved in the query (indicated by the red arrows below the bottom panel). Instead, this analysis supports the annotation of this sequence as a glutamate 2,3-aminomutase. The user can also download an SSN of the relevant subgroup, here thresholded at 1e−85, which includes the two above-mentioned families. In the figure, L-lysine 2,3-aminomutases are shown in cyan, glutamate 2,3-aminomutases in red and arginine aminomutases in green. The query protein is represented by a blue circle and arrow. This network perspective also supports the annotation of the query protein as a glutamate 2,3-aminomutase. Two examples of clusters that do not include any annotated proteins are indicated by black arrows, hinting at potentially new functional families in this subgroup. A few of the white-colored sequences of unknown function that can potentially be annotated as L-lysine 2,3-aminomutases are indicated with a green arrow.
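The residue check described in the Figure 2 caption, comparing a query against the conserved active site positions of a family in an MSA, can be sketched as follows. This is a simplified illustration under stated assumptions, not the SFLD implementation: the alignments are toy strings of equal length, the active site columns are supplied by hand as 0-based indices, and the function names (conserved_residues, find_mismatches) and family labels are hypothetical.

```python
def conserved_residues(family_alignment, columns):
    """Collect the residues each family member shows at the given columns.

    family_alignment is a list of equal-length aligned sequences;
    columns are 0-based alignment positions (assumed known in advance).
    """
    return {col: {seq[col] for seq in family_alignment} for col in columns}


def find_mismatches(aligned_query, expected):
    """Return {column: query_residue} where the query departs from the family."""
    mismatches = {}
    for col, allowed in sorted(expected.items()):
        residue = aligned_query[col]
        if residue not in allowed:
            mismatches[col] = residue
    return mismatches


# Toy alignments (invented sequences, not real aminomutases).
family_one = ["MKDAC-R", "MKDGC-R", "MRDAC-R"]
family_two = ["MKNAS-R", "MRNGS-R"]
active_site_columns = [2, 4, 6]
query = "MKNGS-R"

vs_family_one = find_mismatches(
    query, conserved_residues(family_one, active_site_columns)
)
vs_family_two = find_mismatches(
    query, conserved_residues(family_two, active_site_columns)
)
print(vs_family_one)
print(vs_family_two)
```

Here the toy query differs from the first family at two of the three active site columns but matches the second family at all three, which mirrors the kind of evidence used in the figure to prefer one family annotation over another.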
Reprinted with permission from Oxford University Press. Copyright © 2014, Oxford University Press.