Large-scale determination of sequence, structure, and function relationships in cytosolic glutathione transferases across the biosphere.

Mashiyama ST, Malabanan MM, Akiva E, Bhosle R, Branch MC, Hillerich B, Jagessar K, Kim J, Patskovsky Y, Seidel RD, Stead M, Toro R, Vetting MW, Almo SC, Armstrong RN, Babbitt PC (2014) PLoS Biol 12, e1001843. PMCID: PMC3995644

The pinnacle publication to emerge from EFI investigations into the GST Superfamily details 82 experimentally verified functional assignments and 37 X-ray structures, specifically targeted to expand annotation coverage into previously unexplored regions of the GST Superfamily sequence space. This study redefines superfamily subgroup boundaries and illustrates the full functional capacity of highly similar protein sequences, thereby contributing greatly to the scientific community's understanding of the sequence-structure-function relationship.


The cytosolic glutathione transferase (cytGST) superfamily comprises more than 13,000 nonredundant sequences found throughout the biosphere. Their key roles in metabolism and defense against oxidative damage have led to thousands of studies over several decades. Despite this attention, little is known about the physiological reactions they catalyze and most of the substrates used to assay cytGSTs are synthetic compounds. A deeper understanding of relationships across the superfamily could provide new clues about their functions. To establish a foundation for expanded classification of cytGSTs, we generated similarity-based subgroupings for the entire superfamily. Using the resulting sequence similarity networks, we chose targets that broadly covered unknown functions and report here experimental results confirming GST-like activity for 82 of them, along with 37 new 3D structures determined for 27 targets. These new data, along with experimentally known GST reactions and structures reported in the literature, were painted onto the networks to generate a global view of their sequence-structure-function relationships. The results show how proteins of both known and unknown function relate to each other across the entire superfamily and reveal that the great majority of cytGSTs have not been experimentally characterized or annotated by canonical class. A mapping of taxonomic classes across the superfamily indicates that many taxa are represented in each subgroup and highlights challenges for classification of superfamily sequences into functionally relevant classes. Experimental determination of disulfide bond reductase activity in many diverse subgroups illustrates a theme common for many reaction types. Finally, sequence comparison between an enzyme that catalyzes a reductive dechlorination reaction relevant to bioremediation efforts with some of its closest homologs reveals differences among them likely to be associated with evolution of this unusual reaction.
Interactive versions of the networks, associated with functional and other types of information, can be downloaded from the Structure-Function Linkage Database (SFLD;


Figure 1: Global view of sequence relationships in the cytGST superfamily. This level 1 representative network shows 2,190 nodes representing 13,493 proteins filtered at 50% sequence identity. A cluster (a group of interconnected nodes separated from other groups of interconnected nodes) is labeled if there are at least 50 member sequences in that cluster. These include the large and diverse Main subgroup, the AMPS subgroup containing the Swiss-Prot classes Alpha, Mu, Pi, and Sigma, the recently described Xi subgroup, and several smaller but distinct clusters labeled R1–R4. Colors assigned correspond to Swiss-Prot annotations for canonical cytGST classes or to annotations from the literature for the newer classes Nu and Xi. These representative nodes are colored only if at least 50% of Swiss-Prot annotated sequences in that node have been assigned to that class. Grey nodes denote representative nodes for which no corresponding Swiss-Prot annotation is available for a class for greater than 50% of the annotated sequences in that node. Heavy borders indicate that a 3D crystal structure is associated with at least one member sequence of a representative node and shape indicates the source of the structure data: triangle, structures that were solved for this work; square, from the literature; diamond, structure evidence both from this work and the literature. Edges or lines between nodes are shown if the least significant pairwise sequence similarity score between the representative sequences of two nodes is better than the threshold (BLAST E-value ≤ 1×10⁻¹³). The 32,716 edges depicted have a median percent sequence identity of 33% over 208 residues.

Figure 2: The level 2 representative network shows more detailed subgroupings. The same network as in Figure 1 except that it is visualized at a higher stringency threshold, i.e., E-value ≤ 1×10⁻²⁵. Coloring is the same as in Figure 1. The 15,070 edges depicted in the figure have a median percent sequence identity of 38% over 212 residues. As with level 1 subgroups, clusters are designated as level 2 subgroups if there are at least 50 member sequences in that cluster.
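The thresholding described in Figures 1 and 2 can be illustrated with a minimal sketch. It is not the authors' pipeline: it assumes each pair of representative nodes carries a single precomputed E-value, and the function name `find_clusters`, the node ids, and the toy data are hypothetical. Edges at or below the E-value threshold are kept, connected components are found with union-find, and a cluster is reported only if it holds at least 50 member sequences.

```python
from collections import defaultdict


def find_clusters(member_counts, edges, evalue_threshold, min_members=50):
    """Return clusters (sets of node ids) holding at least min_members sequences.

    member_counts: dict mapping node id -> number of member sequences
    edges: iterable of (node_a, node_b, evalue) tuples (assumed precomputed)
    """
    parent = {node: node for node in member_counts}

    def find(node):
        # Union-find lookup with path halving.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    # Keep only edges at least as significant as the threshold.
    for a, b, evalue in edges:
        if evalue <= evalue_threshold:
            parent[find(a)] = find(b)

    # Group nodes by their component root.
    groups = defaultdict(set)
    for node in member_counts:
        groups[find(node)].add(node)

    # Report a cluster only if it has enough member sequences.
    return [
        group for group in groups.values()
        if sum(member_counts[n] for n in group) >= min_members
    ]


# Hypothetical toy network, not data from the paper.
member_counts = {"n1": 40, "n2": 30, "n3": 60, "n4": 5}
edges = [("n1", "n2", 1e-30), ("n2", "n3", 1e-14), ("n3", "n4", 1e-5)]

level1 = find_clusters(member_counts, edges, 1e-13)  # level 1 threshold
level2 = find_clusters(member_counts, edges, 1e-25)  # stricter level 2 threshold
```

In this toy example the stricter threshold removes the n2-n3 edge, so the single level 1 cluster splits into two level 2 clusters, which mirrors how level 2 subgroups arise from level 1 subgroups.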

Figure 3: The level 2 representative network painted with known reaction type. Nodes are colored if at least one member in a representative node has experimental evidence for that reaction type. Colors denote reaction types as given in Figure 4, with the additional category of multiple reaction types (“multiple”), where orange indicates more than one reaction type occurring in a node. Some reaction types in Figure 4 are not represented by a separate color because they are subsumed by the “multiple” category. Single reaction type abbreviations: DSBR, disulfide bond reductase; ERO, epoxide ring opening; NA, nucleophilic addition; NAS, nucleophilic aromatic substitution; NS, nucleophilic substitution; RD, reductive dehalogenase. Node shapes indicate the source of experimental evidence for each reaction type: triangle, this work; square, literature; diamond, from this work and the literature. Nodes with member sequences that have evidence for biologically relevant reactions/functions are marked with thick black borders.
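The painting rule in the Figure 3 caption can be sketched as a small function. This is an illustration under stated assumptions, not the authors' code: the function name `paint_node`, the source labels `"this_work"` and `"literature"`, and the input format are all hypothetical.

```python
def paint_node(evidence):
    """Return (color, shape) for a representative node.

    evidence: list of (reaction_type, source) pairs for the node's member
    sequences, where source is "this_work" or "literature" (assumed labels).
    """
    reactions = {reaction for reaction, _ in evidence}
    sources = {source for _, source in evidence}

    if not sources <= {"this_work", "literature"}:
        raise ValueError("unknown evidence source")
    if not reactions:
        return None, None  # no experimental evidence: node stays uncolored

    # One reaction type gets its own color; several collapse to "multiple".
    color = next(iter(reactions)) if len(reactions) == 1 else "multiple"

    # Shape encodes where the evidence comes from.
    if sources == {"this_work"}:
        shape = "triangle"
    elif sources == {"literature"}:
        shape = "square"
    else:
        shape = "diamond"
    return color, shape
```

For example, a node with DSBR evidence from the literature and NAS evidence from this work would be painted as "multiple" with a diamond shape under these assumptions.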

Figure 4: Major reaction types of the cytGST superfamily. Reactions are grouped by chemistry with a sample reaction shown for each reaction type.

Figure 5: Representative structure similarity network for the cytGST superfamily. 131 representative structures for 379 cytGST structures filtered to 95% sequence identity are shown. Edges are shown as for Figure 1 except that structural similarity is defined from the FAST algorithm, with a FAST SN score ≥20 required to show edges. 565 edges are shown. For this network the median SN score is 22.7 over 187 residues. (A) Nodes are colored by level 2 subgroup assignments. (B) Nodes are colored by reaction type. As with the level 2 sequence similarity network, multiple reaction types are broadly spread throughout the structure similarity network, indicating that some divergent structures catalyze the same reaction types. Reaction type abbreviations: multiple, multiple reaction types present; DSBR, disulfide bond reductase; NAS, nucleophilic aromatic substitution; NS, nucleophilic substitution; RD, reductive dehalogenase.

Figure 6: Level 2 sequence similarity network painted by type of life. The nodes of this level 2 representative network are colored by the type of life represented if more than 50% of the annotated member sequences in a representative node have that classification. Taxonomic classifications were labeled and ordered by the class Insecta, the kingdoms Metazoa, Viridiplantae, and Fungi, and the superkingdoms Eukaryota, Bacteria, and Archaea. Grey nodes indicate nodes in which there was not a majority of annotated nodes for one of these classifications or if annotations were not available from NCBI.
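The majority rule used to color nodes in Figure 6 can be sketched as follows. This is a simplified illustration: it treats each member sequence as having at most one flat label and ignores the ordering of taxonomic ranks mentioned in the caption. The function name `node_color` and the use of `None` for unannotated sequences are assumptions.

```python
from collections import Counter


def node_color(annotations):
    """Return the majority label of a node, or "grey" if there is none.

    annotations: one entry per member sequence; None marks a sequence
    without an annotation (assumed encoding).
    """
    annotated = [label for label in annotations if label is not None]
    if not annotated:
        return "grey"  # no annotations available

    label, count = Counter(annotated).most_common(1)[0]
    # Color only if strictly more than 50% of annotated sequences agree.
    return label if count > len(annotated) / 2 else "grey"
```

Unannotated sequences are excluded from the denominator, matching the caption's wording "more than 50% of the annotated member sequences". Figure 1 uses the slightly different cutoff of "at least 50%", which would correspond to `>=` in the comparison.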

Figure 7: Full sequence similarity network of level 2 subgroup Main.4 indicating conflicting classifications from the literature. Edges are shown if they meet the similarity threshold of a BLAST E-value ≤ 1×10⁻³¹. Shapes indicate class annotation from the literature: square, Delta; triangle, Epsilon; and diamond, Theta. To show more detail, colors are for taxonomic classifications from NCBI Taxonomy at a finer grained level than that used in the representative network shown in Figure 6. There are 872 sequences in the network with a median percent sequence identity for the 92,539 edges of 44% over 209 residues. Literature references for class annotations can be obtained from the network file for this subgroup available for download from the SFLD.

Figure 8: New experimental evidence for DSBR activity in many level 2 subgroups. The level 2 sequence similarity network is painted by DSBR activity. Nodes are colored if one or more member sequences in a representative node have experimental evidence for DSBR activity. Green indicates the evidence comes only from this work and purple indicates evidence is from the literature. One representative node, labeled with member sequence YghU, has evidence both from the literature and this work. Heavy borders indicate that a 3D crystal structure is associated with at least one member sequence of a representative node where DSBR activity occurs and shape indicates the source of the structure data: square, from the literature; diamond, structure evidence both from this work and the literature.

Figure 9: Comparison of the active site region from divergent subgroups with DSBR activity reveals some commonalities. (A) Structure of Q4KED9 (UniProt accession) from Pseudomonas fluorescens (PDB ID 4IKH, subgroup Main.2), one of the new structures from this work, showing two molecules of glutathione bound in the active site. The interactions between the bound ligands and the side chains of Thr28, Gln57, Glu89, Ser90, and Arg152 of the other subunit (magenta) are shown. (B) Structure of B3VQJ7 from Phanerochaete chrysosporium (PDB ID 3PPU, subgroup Xi.1) with one molecule of glutathione bound. The corresponding interactions between glutathione and the side chains of Cys86, Glu173, and Ser174 are shown. (C) Summary of sequence motifs from the structure-guided alignment showing the sequence context for the residues highlighted in panels A and B from several divergent subgroups: Main.2 (red box), Main.3 (green box), and Xi.1 (blue box). All the proteins shown have experimental evidence for DSBR activity. UniProt entries and available PDB IDs are given on the left side of the sequences. Highlighted in yellow are the aligned positions of the residues that have notable interactions with the bound ligand as described in the text (numbered according to 4IKH/Q4KED9). New structures generated for this work are indicated with an asterisk.

Figure 10: Similarity relationships between a reductive dehalogenase protein and some homologs may give insight into function. (A) The full sequence similarity network view of Main.2 displays all 1,642 non-redundant sequences for this level 2 subgroup. Boxed labels and arrows indicate the slime mold protein with reductive dehalogenase activity (Q54B85) and its closest homolog Q9RBP3. Nodes are colored if there is experimental evidence for cytGST-like function with shapes indicating the evidence source: triangle, this work; square, the literature; diamond, both this work and the literature. Dark borders indicate nodes with crystal structures; the border color indicates the source of the structure: blue, this work; brown, the literature. Node colors designate the reaction type(s) associated with each sequence (node): multiple, multiple reaction types; DSBR, disulfide bond reductase; NA, nucleophilic addition; NAS, nucleophilic aromatic substitution; NS, nucleophilic substitution; RD, reductive dehalogenase. Edges with BLAST E-values ≤ 1×10⁻³¹ are shown; the 233,255 edges shown in the network have a median percent ID of 54% over 214 residues. (B) Chemical structures of DIF that is reductively dehalogenated by Q54B85, and CDNB, the synthetic compound commonly used in NAS assays. (C) Alignment of Q54B85 (red box) with homologs that have experimental evidence for GST-like activity. An arrow indicates residue Cys54 from Q54B85 that is critical for RD activity but is a conserved Asn in most homologs. The blue box indicates Q9RBP3, the closest homolog of Q54B85 that has CDNB activity. Reactions designated in the alignment are as follows: R, RD; D, DSBR; N, NAS; S, NS; A, NA; P, peroxidase. Sequences marked with an asterisk indicate that the experimental evidence for that reaction was obtained only from this work; sequences marked with a diamond indicate the availability of a crystal structure for that protein and black diamonds indicate crystal structures are from this work.

This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.