This paper from the EFI Superfamily/Genome Core illustrates the value and contributions of large-scale studies, such as the EFI, that use protein similarity networks. The work also points readers to the Structure-Function Linkage Database (SFLD) for EN and GST superfamily networks supported by the EFI.
As increasingly large amounts of data from genome and other sequencing projects become available, new approaches are needed to determine the functions of the proteins these genes encode. We show how large-scale computational analysis can help to address this challenge by linking functional information to sequence and structural similarities using protein similarity networks. Network analyses using three functionally diverse enzyme superfamilies illustrate the use of these approaches for facile updating and comparison of available structures for a large superfamily, for creation of functional hypotheses for metagenomic sequences, and for summarizing the limits of our functional knowledge about even well-studied superfamilies.
Figure 1. Structure similarity networks of ePK-like superfamily generated from pairwise comparisons using FAST algorithm. Each node represents a structure. Each edge represents a connection with a FAST N-score better than a given threshold. A, FAST N-score cutoff = 11, colored by Pfam family. Upper panel, structures available as of October 2005 (97 nodes). At this cutoff, the average root mean square deviation (r.m.s.d.) is ∼2.81 Å with ∼213 Cα atoms aligned. Lower panel, structures available as of May 2011 (295 nodes). At this cutoff, the average r.m.s.d. is ∼2.98 Å with ∼207 Cα atoms aligned. B, FAST N-score cutoff = 23. At this cutoff, the average r.m.s.d. is ∼1.97 Å with ∼247 Cα atoms aligned. Nodes colored green represent structures available in the Protein Data Bank as of October 2005; those colored blue represent structures added to the Protein Data Bank between October 2005 and May 2011 (total of 295 nodes). Nodes were arranged using the yFiles organic layout provided with Cytoscape version 2.7. Lengths of edges are not meaningful except that sequences in tightly clustered groups are relatively more similar to each other than sequences with few connections.
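The thresholding described in this caption can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, the toy pairwise scores, and the choice to treat "better than" as "greater than or equal to" are all assumptions made for illustration. It shows how raising a score cutoff removes weaker edges and splits one connected network into separate clusters.

```python
from collections import defaultdict

def build_network(pair_scores, cutoff):
    """Return an adjacency map keeping edges whose score is >= cutoff.

    Assumption: a higher score means a more similar pair, as with a
    structure-comparison score; every node is kept even if it has no edges.
    """
    adjacency = defaultdict(set)
    for (a, b), score in pair_scores.items():
        adjacency[a]  # register both nodes so singletons are preserved
        adjacency[b]
        if score >= cutoff:
            adjacency[a].add(b)
            adjacency[b].add(a)
    return dict(adjacency)

def connected_components(adjacency):
    """Group nodes into clusters using an iterative depth-first search."""
    seen = set()
    components = []
    for start in adjacency:
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        component = set()
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        components.append(component)
    return components

# Hypothetical pairwise scores for five structures (not data from the paper).
toy_scores = {
    ("A", "B"): 30.0,
    ("B", "C"): 25.0,
    ("C", "D"): 12.0,
    ("D", "E"): 24.0,
    ("A", "E"): 5.0,
}

permissive = connected_components(build_network(toy_scores, cutoff=11))
stringent = connected_components(build_network(toy_scores, cutoff=23))
print(len(permissive), len(stringent))
```

With these toy scores the permissive cutoff leaves all five nodes in one cluster, while the stringent cutoff drops the C-D edge and yields two clusters, which mirrors the qualitative difference between panels A and B.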
Figure 2. Alternative view of structure similarity networks of 86 representative structures in ePK-like superfamily (generated as described for Fig. 1). Nodes are colored according to their Manning/Bourne group classification. Dark gray nodes represent structures that were not classified. A, FAST N-score cutoff = 4. B, FAST N-score cutoff = 23.
Figure 3. Sequence similarity networks of acid-sugar dehydratases known or predicted to belong to enolase superfamily and human gut microbiome. Networks were generated from all-by-all BLAST comparisons of 1578 sequences representing sequences of eight known acid-sugar dehydratase families and the mandelate racemase family from the mandelate racemase subgroup (see Footnote 5) as defined by SFLD and a filtered set of gut metagenome sequences that showed significant similarity to the members of the subgroup. Each of the 1578 nodes represents a sequence. Larger square nodes represent those that have been experimentally characterized, so their reaction and substrate specificities are known. Brown nodes represent sequences from the human gut metagenome, and white nodes represent SFLD sequences in the subgroup for which the reaction and substrate specificities have not been predicted. The remainder (small nodes) represent sequences for which specificity can be predicted at high confidence, colored by their SFLD family names (see Footnote 4). Nodes were arranged using the yFiles organic layout provided with Cytoscape version 2.7. A, each edge in the network represents a BLAST connection with an e-value of 1e−44 or better. At this cutoff, sequences have a median percent identity and alignment length of ∼32% and 369, respectively. B, each edge in the network represents a BLAST connection with an e-value of 1e−84 or better. At this cutoff, sequences have a median percent identity and alignment length of ∼44% and 384, respectively. Lengths of edges are not meaningful except that sequences in tightly clustered groups are relatively more similar to each other than sequences with few connections.
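As a rough illustration of how edges and the per-cutoff statistics quoted in this caption could be derived, the following Python sketch filters pairwise BLAST hits by e-value and reports the median percent identity and alignment length of the retained edges. It assumes hits are supplied in the standard 12-column BLAST tabular layout (query, subject, percent identity, alignment length, ..., e-value, bit score). The function names and toy hits are hypothetical, and the actual procedure used for the figure may differ.

```python
import statistics

def filter_blast_hits(lines, max_evalue):
    """Keep one edge per unordered sequence pair with e-value <= max_evalue.

    Assumes tab-separated 12-column BLAST tabular output, where column 3 is
    percent identity, column 4 is alignment length, and column 11 is e-value.
    """
    edges = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        if query == subject:
            continue  # self-hits do not form edges
        percent_identity = float(fields[2])
        alignment_length = int(fields[3])
        evalue = float(fields[10])
        if evalue > max_evalue:
            continue
        key = tuple(sorted((query, subject)))
        best = edges.get(key)
        # All-by-all searches report each pair twice; keep the better hit.
        if best is None or evalue < best[0]:
            edges[key] = (evalue, percent_identity, alignment_length)
    return edges

def summarize(edges):
    """Return (median percent identity, median alignment length) over edges."""
    identities = [hit[1] for hit in edges.values()]
    lengths = [hit[2] for hit in edges.values()]
    return statistics.median(identities), statistics.median(lengths)

# Hypothetical hits for illustration only (not data from the paper).
toy_hits = [
    "s1\ts2\t45.0\t380\t200\t3\t1\t380\t1\t380\t1e-90\t300",
    "s2\ts1\t45.0\t380\t200\t3\t1\t380\t1\t380\t2e-90\t299",
    "s2\ts3\t33.0\t370\t240\t5\t1\t370\t2\t371\t1e-50\t180",
    "s3\ts4\t25.0\t150\t110\t4\t1\t150\t1\t150\t1e-10\t60",
    "s1\ts1\t100.0\t400\t0\t0\t1\t400\t1\t400\t0.0\t800",
]

loose_edges = filter_blast_hits(toy_hits, max_evalue=1e-44)
strict_edges = filter_blast_hits(toy_hits, max_evalue=1e-84)
print(summarize(loose_edges), summarize(strict_edges))
```

Tightening the e-value cutoff keeps only the most similar pairs, so the median identity of the surviving edges rises, which is the same trend the caption reports between panels A and B.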
Figure 4. Sequence similarity network of cytosolic GSTs. Similarity is defined by pairwise BLAST alignments better than an e-value cutoff of 1e−12. 622 representative sequences that are a maximum of 40% identical and that span the diversity of >6000 GSTs are shown. Nodes are colored by classification of the sequence in the Swiss-Prot Database (part of the UniProt Database), if available. The 40 large nodes designate sequences with structures. At this cutoff, edges represent alignments with a median 27% identity over 200 residues. This network and legend are adapted from Ref. 42 with permission.
2011 Superfamily/Genome Core Publication
Reprinted with permission: the Journal of Biological Chemistry. © 2012 by the American Society for Biochemistry and Molecular Biology.