Create SSN from FASTA

A Primer for using the EFI's Unix scripts for sequence similarity networks - from FASTA

 

If you are new to generating SSNs via the command line, please first review the information here.

This primer is for generating Sequence Similarity Networks using a user-defined FASTA file.

 

step 0: prepare a fasta file

 

Prepare a single text or FASTA file containing all sequences that you would like to include in the final sequence similarity network. Standard FASTA formatting is accepted.
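A quick sanity check before uploading can catch a malformed file. The following is a minimal sketch that uses only standard grep and head; the function names count_fasta_records and starts_with_header and the example file contents are hypothetical and are not part of the EFI scripts.

```shell
#!/usr/bin/env bash
# Count the sequences in a FASTA file by counting header lines (lines starting with ">").
count_fasta_records() {
    grep -c '^>' "$1"
}

# Succeed only if the first line of the file is a FASTA header.
starts_with_header() {
    head -n 1 "$1" | grep -q '^>'
}

# Tiny hypothetical example file with two sequences, written to a temporary location.
example=$(mktemp)
printf '>seq1 example protein\nMKTAYIAKQR\n>seq2 another protein\nMSLLTEVETY\n' > "$example"

if starts_with_header "$example"; then
    echo "file contains $(count_fasta_records "$example") sequences"
else
    echo "file does not start with a FASTA header line" >&2
fi
```

To check your own file, call the two functions with its path in place of the temporary example file.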

Move this file to your personal Biocluster account via SFTP (for example, with WinSCP or Cyberduck).

 

IN A BIOCLUSTER-CONNECTED TERMINAL:

 

Start an interactive session so that the next commands run on a compute node:

 

-bash-4.1$ qsub -I -q efi       (that is a capital "i")

qsub: waiting for job 638255.biocluster.igb.illinois.edu to start

qsub: job 638255.biocluster.igb.illinois.edu ready

 

Navigate to the directory containing your FASTA file. Run one of the following commands to convert the line endings of text files created on non-Linux operating systems (dos2unix for files from Windows, mac2unix for files from classic Mac OS):

 

-bash-4.1$ dos2unix "yourfilename".fa

or

-bash-4.1$ mac2unix "yourfilename".fa

dos2unix: converting file Modified.doro.txt to UNIX format ...
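If you are unsure which converter applies, the line endings can be inspected first. The sketch below is an illustration only: the function name line_ending_style is hypothetical, it relies on standard tr and wc, and a file with mixed line endings is simply reported as "dos".

```shell
#!/usr/bin/env bash
# Report the line-ending style of a text file:
#   unix - LF only, no conversion needed
#   dos  - contains CR and LF (typical of Windows; dos2unix applies)
#   mac  - CR only, no LF at all (classic Mac OS; mac2unix applies)
line_ending_style() {
    local file=$1 cr lf
    # Count carriage returns and line feeds; $(( )) strips any padding from wc.
    cr=$(( $(tr -cd '\r' < "$file" | wc -c) ))
    lf=$(( $(tr -cd '\n' < "$file" | wc -c) ))
    if [ "$cr" -eq 0 ]; then
        echo unix
    elif [ "$lf" -eq 0 ]; then
        echo mac
    else
        echo dos
    fi
}

# Hypothetical example files, one per style, in a temporary directory.
tmp=$(mktemp -d)
printf '>seq1\r\nMKT\r\n' > "$tmp/windows.fa"
printf '>seq1\rMKT\r'     > "$tmp/oldmac.fa"
printf '>seq1\nMKT\n'     > "$tmp/linux.fa"

for f in "$tmp"/*.fa; do
    echo "$(basename "$f"): $(line_ending_style "$f")"
done
```

A result of "unix" means the file can be used as-is; running dos2unix on such a file is harmless.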

 

 
Run the following program to process your FASTA file into EST-ready FASTA and DAT files. Designate your FASTA file as the input, and designate "output.fa" and "output.dat" as the EST-ready FASTA and DAT files.
 
-bash-4.1$ module load efiest/alpha
-bash-4.1$ formatcustomfasta.pl -in "yourfilename".fa -out output.fa -dat output.dat
-bash-4.1$ exit
 
DESCRIPTION OF VARIABLES:
 
-in    the input FASTA file that you prepared
-out  the FASTA file to be created, with the proper UniProt-like numbering
-dat  the DAT file to be created; it stores the original sequence information from the input FASTA in the description field
 
The output FASTA file will now contain EST-friendly header information, and the output DAT file will contain the contents of your original FASTA file headers. The original FASTA file headers will be written into the Description node attribute of the final SSN. The "exit" command will end your interactive session and return you to a Biocluster head node.
 

step 1: blast and generate plots

Navigate to the directory containing your FASTA and DAT outputs. Run generatedata.pl as shown below to BLAST your FASTA sequences and generate statistical plots. At this point, you may supply only the FASTA as input, or you may combine your FASTA sequences with any Pfam or InterPro identifier by also designating a -pfam or -ipro variable in the command below.

 

At the command prompt, enter the following command (the red text indicates required, network-specific input). The text in square brackets is optional; omit the brackets when including optional parameters:
 
 
-bash-4.1$ module load efiest/alpha
-bash-4.1$ /home/groups/efi/alpha/generatedata.pl -queue efi -np 36 -tmp "SSN-results-directory-name" -userfasta output.fa -userdat output.dat [-pfam PFXXXXX] [-ipro IPRXXXXXX]
 
Replace "SSN-results-directory-name" with any name you choose.
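To illustrate how the optional parameters are written without their brackets, here is a small sketch that assembles the command line and prints it for review instead of running it. The helper name build_generatedata_cmd and the directory name my-ssn-results are hypothetical; only options that appear in the command above are used, and PFXXXXX and IPRXXXXXX remain placeholders for real identifiers.

```shell
#!/usr/bin/env bash
# Assemble the generatedata.pl command line from the options shown above.
# Arguments: results directory name, optional Pfam ID, optional InterPro ID.
# The command is only printed, not executed, so it can be checked first.
build_generatedata_cmd() {
    local tmpdir=$1 pfam=${2:-} ipro=${3:-}
    local cmd=(/home/groups/efi/alpha/generatedata.pl -queue efi -np 36
               -tmp "$tmpdir" -userfasta output.fa -userdat output.dat)
    # Optional parameters are appended without the square brackets.
    if [ -n "$pfam" ]; then cmd+=(-pfam "$pfam"); fi
    if [ -n "$ipro" ]; then cmd+=(-ipro "$ipro"); fi
    echo "${cmd[*]}"
}

# FASTA input only.
build_generatedata_cmd my-ssn-results
# FASTA input combined with a Pfam family (placeholder identifier).
build_generatedata_cmd my-ssn-results PFXXXXX
```

Printing the command first makes it easy to confirm the directory name and identifiers before submitting the job on Biocluster.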
 
For more details regarding the generatedata.pl variables, please see the bottom of this page.
 
 

step 3: generate networks

Run the analyzedata.pl program as described here.