EFI Enzyme Similarity Tool Precomputed
The following tutorial is for the use of EFI's Enzyme Similarity Precomputed Tool (EST-Precompute). Fundamentally, EST-Precompute provides users with an expedited mechanism for determining the Pfam family membership of a particular sequence, and then links that user to precomputed Sequence Similarity Networks for that Pfam family. If a user is already aware of a Pfam family of interest, they may go directly to those networks from the start page. Thus, there are two potential EST-Precompute inputs: an amino-acid sequence, or a Pfam family identifier.
Once a Pfam family is identified, the user has two options: 1) access the "unfiltered" networks, or 2) use the provided statistics to select a new alignment score lower limit at which to generate "filtered" networks. The first option connects users with networks instantaneously, while the second option requires approximately 1 hour of computation time.
The EST-Precompute product is a database of downloadable Sequence Similarity Networks.
This project is made possible with an allocation of computing time at the Blue Waters Petascale Computing facility (UIUC).
Sequence Similarity Networks
Sequence Similarity Networks (SSNs) are increasingly popular tools for visualizing sequence identity relationships across a large set of protein sequences. For a thorough explanation of what SSNs are, how they are made, and how they are used, please see the tutorial pages that accompany EST-Precompute's predecessor, EFI-EST.
The Pfam Database
As described in the overview, EST-Precompute relies heavily on the Pfam database for definitions of protein families and domains, in order to reduce the protein universe into manageable datasets. Instead of creating one massive SSN for the ~90 million sequences currently within TrEMBL, the EFI has precomputed an individual SSN for each of the 14,831 Pfam families. These networks are thus most effective for assessing sequence relationships within a set of related proteins.
The current EST-Precompute database contains networks based on sequence information from InterPro release 48.0 and UniProt release 2014_07 (Release Notes).
In Option A, copy an amino-acid sequence (excluding any FASTA header information) in the provided window, enter your E-mail address, and hit GO. This begins the process of InterProScan. You will receive an E-mail with a link to results when the scan is complete.
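Because Option A expects the sequence without any FASTA header, it can help to strip the header before pasting. The following is a minimal sketch of one way to do that; the function name and the example record are hypothetical and are not part of EST-Precompute.

```python
def strip_fasta_header(text: str) -> str:
    """Return the bare amino-acid sequence from FASTA-formatted text."""
    residues = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and FASTA header lines, which begin with '>'.
        if not line or line.startswith(">"):
            continue
        residues.append(line)
    # Join the remaining lines into one continuous sequence string.
    return "".join(residues)


# Made-up record, for illustration only.
example = """>hypothetical_protein example record
MKTAYIAKQR
QISFVKSHFS
"""
print(strip_fasta_header(example))  # MKTAYIAKQRQISFVKSHFS
```

The printed string is what would be pasted into the sequence window.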
In Option B, you may browse a list of available Pfam families by identifier or family size. You may also input the identifier directly into the provided "Search" box to jump to a particular Pfam family. Select a Pfam family, enter your E-mail address, and hit GO. You will be instantly redirected to a family-specific set of statistics.
InterProScan is a tool for comparing protein sequences to multiple database signatures using several protein signature recognition methods. This tool was developed by the European Molecular Biology Laboratory - European Bioinformatics Institute and is implemented primarily on the InterPro homepage. The current implementation of InterProScan within EST-Precompute compares user sequences to the following databases: InterPro, TIGRFAM, ProDom, Panther, SMART, Prosite, Pfam, SuperFamily, PRINTS, Gene3d, PIRSF, HAMAP, and Coils. Users who choose Option A (input sequence) will be brought to a page that informs them that the scan is underway and that they may close their browser window. An E-mail will be sent to the designated address when scan results are ready.
A user's sequence may return zero, one, or several database matches. For Pfam family matches, a link is provided to bring the user to that family's statistics page. If several Pfam families match, the user should pursue each link in a separate window for separate analyses. If no Pfam families match the user sequence, the user should go to EFI-EST to generate a homology-based Sequence Similarity Network.
Several statistical analyses are precomputed on Pfam family sequences:
- A count of the number of edges as a function of the alignment score cutoff (histogram)
- A count of sequence length (histogram)
- A relationship between percent identity and alignment score (quartile plot)
- A relationship between alignment length and alignment score (quartile plot)
Thorough descriptions of the information contained within these graphs, as well as instructions on using them to guide the selection of an alignment score cutoff and length restrictions, can be found at the EFI-EST website and in the EFI-EST review article.
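To illustrate why the edge count depends on the alignment score cutoff, the sketch below counts how many edges survive at several cutoffs. It assumes an edge is kept when its alignment score meets or exceeds the cutoff (the "lower limit" described above). The scores are made up, and the function is hypothetical rather than the code behind the EST-Precompute histograms.

```python
def edges_at_cutoffs(scores, cutoffs):
    """For each cutoff, count the edges whose alignment score is >= that cutoff."""
    return {c: sum(1 for s in scores if s >= c) for c in cutoffs}


# Made-up pairwise alignment scores, one per candidate edge.
scores = [5, 7, 12, 12, 30, 45, 80]
print(edges_at_cutoffs(scores, [5, 10, 30, 50]))
# {5: 7, 10: 5, 30: 3, 50: 1}
```

Raising the cutoff can only remove edges, which is why a more stringent cutoff yields a smaller network file.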
There are three options for moving forward:
- Unfiltered networks are precomputed at the minimum cutoff (Alignment Score = 5) and available for immediate download. Simply select the "See Unfiltered SSN with Minimum Score of 5" button. An alignment score this relaxed draws nearly all sequence relationships as edges in the SSN. Since the number of edges contributes to the overall size of the SSN file, the unfiltered network is only a viable option for families of ~3,500 sequences or fewer (which includes ~10,000 of the 14,831 Pfam families).
- Filtered networks can be computed with user-defined restrictions. Use the statistical graphs to select a more stringent alignment score cutoff and input this value in the "alignment score" window. Then, designate sequence length restrictions, if desired. Click "Filter SSN" to regenerate filtered networks. You will receive an E-mail when networks are ready for download.
- Alternatively, if very little is known about the sequence-function relationship within a given protein (super)family, we recommend starting with a network filtered at 30% sequence identity. Input "30" into the "Minimum % identity" window and click "Filter SSN". Again, you will receive an E-mail when networks are ready for download.
After selecting a new alignment score or percent identity cutoff and hitting "Filter SSN", you will be advised to close the browser window. You should receive an E-mail with a link to results within 30 minutes.
Unfiltered or filtered networks are available for individual download. File sizes are listed for uncompressed and compressed formats, along with the numbers of nodes and edges. Please consider the file size and the number of edges when selecting a network to work with: the average computer with 4 GB of RAM can open an SSN with 500,000 edges. Files over 2 GB in size may not download correctly. All networks are in the .XGMML format accepted by Cytoscape.
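Since XGMML is an XML format, the node and edge counts of a downloaded network can be checked before opening it in Cytoscape. The sketch below streams a file with Python's standard library and counts `node` and `edge` elements. The function name and the toy graph are hypothetical, and real EFI networks carry additional attributes not shown here.

```python
import io
import xml.etree.ElementTree as ET


def count_nodes_and_edges(source):
    """Stream an XGMML file (path or binary file object) and count nodes and edges."""
    nodes = edges = 0
    for _event, elem in ET.iterparse(source, events=("end",)):
        # Drop any XML namespace prefix, e.g. '{http://www.cs.rpi.edu/XGMML}node'.
        tag = elem.tag.rsplit("}", 1)[-1]
        if tag == "node":
            nodes += 1
        elif tag == "edge":
            edges += 1
        elem.clear()  # discard the element's contents once it has been counted
    return nodes, edges


# Toy network, for illustration only.
sample = b"""<?xml version="1.0" encoding="UTF-8"?>
<graph label="toy" xmlns="http://www.cs.rpi.edu/XGMML">
  <node id="A" label="A"/>
  <node id="B" label="B"/>
  <node id="C" label="C"/>
  <edge source="A" target="B" label="A,B"/>
  <edge source="B" target="C" label="B,C"/>
</graph>"""
print(count_nodes_and_edges(io.BytesIO(sample)))  # (3, 2)
```

For a downloaded file, a path can be passed instead of a file object, for example `count_nodes_and_edges("network.xgmml")`, where the file name is a placeholder.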
Full networks: a full network is provided (where every node represents one sequence) unless the file contains >10 million edges.
Representative node (a.k.a. repnode) networks: repnode networks are provided (where every node represents sequences consolidated by sequence identity) for sequence identity cutoffs ranging from 40-100% in increments of 5%, regardless of the number of edges.
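As a rough aid for choosing among the repnode networks, the sketch below lists the available identity cutoffs and picks the highest one whose edge count stays within a limit. It makes two assumptions: that higher identity cutoffs consolidate fewer sequences and therefore keep more detail, and that the 500,000-edge guideline for a 4 GB computer is the limit of interest. The function and the edge counts are hypothetical.

```python
# Repnode identity cutoffs: 40, 45, ..., 100 (percent).
REPNODE_CUTOFFS = list(range(40, 101, 5))


def pick_network(edge_counts, max_edges=500_000):
    """Return the highest repnode cutoff whose network has <= max_edges edges, or None."""
    fitting = [
        c for c in REPNODE_CUTOFFS
        if c in edge_counts and edge_counts[c] <= max_edges
    ]
    return max(fitting) if fitting else None


# Made-up edge counts as they might be read off a download page.
edge_counts = {40: 12_000, 60: 180_000, 80: 450_000, 100: 2_300_000}
print(pick_network(edge_counts))  # 80
```

If no listed network fits, the function returns `None`, which suggests filtering at a more stringent alignment score first.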
New to Cytoscape?
See our separate tutorial on using Cytoscape to visualize EFI SSNs here.
1. InterProScan 5: genome-scale protein function classification.
2. Enzyme Function Initiative-Enzyme Similarity Tool (EFI-EST): A web tool for generating protein sequence similarity networks.