Create SSN from Sequence

 

A PRIMER FOR USING THE EFI'S UNIX SCRIPTS FOR SEQUENCE SIMILARITY NETWORKS - FROM SEQUENCE

 

If you are new to generating SSNs from the command line, please first review the information here.

This primer is for generating Sequence Similarity Networks using a single user-input protein sequence.

 

 

STEP 1: BLAST AND GENERATE PLOTS

 

Run the following program to BLAST your sequence against the TrEMBL database and collect homologous sequences, perform a pairwise BLAST of the entire dataset, and generate statistical plots. The dataset maximum is 5,000 sequences. If your sequence is highly divergent, the search may return fewer than 5,000 sequences.

 

At the command prompt, enter the following script (the red text indicates required, network-specific input). The text in brackets is optional; omit the brackets when including optional parameters:
 
 
-bash-4.1$ module load efiest/alpha
-bash-4.1$ blasthits-new.pl -seq "amino acid sequence" -queue efi -np 12 -tmpdir "SSN-results-directory-name" -evalue 5
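
The -seq argument expects the sequence as a single unbroken string. If your sequence is stored in a FASTA file, a small helper along these lines could flatten it first. This is only a sketch: flatten_fasta is a hypothetical name, not part of the EFI scripts, and it assumes the sequence of interest is the first record in the file.

```shell
# Hypothetical helper (not part of the EFI scripts): print the residues of the
# first record in a FASTA file as one line, with the header and whitespace removed.
flatten_fasta() {
    awk '
        /^>/ { if (seen) exit; seen = 1; next }    # skip first header, stop at the second
        { gsub(/[ \t\r]/, ""); printf "%s", $0 }   # strip spaces, tabs, carriage returns
        END { print "" }                           # finish with a single newline
    ' "$1"
}
```

The result could then be passed on, for example with seq=$(flatten_fasta my_protein.fasta) followed by blasthits-new.pl -seq "$seq" and the remaining options shown above (my_protein.fasta is a placeholder file name).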
 
 
 
 
Example:
 

[kwhalen2@biocluster MultiCPU-BLAST]$ blasthits-new.pl -seq MLVIACNTAAAVVLDEVKEKLPIPVVGVVQPGAISALKVTKNEHVAVIGTTGTIQSGAYEHTLKKINNRVQVESLACPPFVPLVERGIFSGPEALETCSETLKPLQGKGFDTLILGCTHYPLLKPVIQEVMGPDIQIISSGDETAREVSGLLYHKQRLRTCNEAPEHHFYTTGDAKSFKRLADVWLEMNVPGVETITLES -queue efi -np 12 -tmpdir GR-test-results -evalue 5

#sqlite3

#$combined=$ENV{'EFIEST'}."/data_files/combined.fasta";

#$db=$ENV{'EFIEST'}."/data_files/uniprot_combined.db";

#$dbh = DBI->connect("dbi:SQLite:$db","","");

 

#mysql

#$db="efi_20150212";

#$db="efi_20150403";

$db=$ENV{'EFIDB'};

$username='efignn';

$password='c@lcgnn';

$dbh = DBI->connect("DBI:mysql:$db;host=10.1.1.3;port=3307", $username, $password, { RaiseError => 1 });

$dbh->{mysql_auto_reconnect} = 1;

 

#universal

#$data_files="/home/groups/efi/databases/20150212";

#$data_files="/home/groups/efi/databases/20150403";

$data_files=$ENV{'EFIDBPATH'};

$perpass=1000;

 

 

db is: /home/groups/efi/databases/20150522/combined.fasta

-memqueue not specifiied, using default

 

Blast for similar sequences and sort based off bitscore

initial blast job is:

 1485699.biocluster.igb.illinois.edu

getmatches job is:

 1485700.biocluster.igb.illinois.edu

createfasta job is:

 1485701.biocluster.igb.illinois.edu

annotation job is:

 1485702.biocluster.igb.illinois.edu

multiplex job is:

 1485703.biocluster.igb.illinois.edu

fracfile job is:

 1485704.biocluster.igb.illinois.edu

createdb job is:

 1485705.biocluster.igb.illinois.edu

blast job is:

 1485706[].biocluster.igb.illinois.edu

Cat job is:

 1485707.biocluster.igb.illinois.edu

Blastreduce job is:

 1485708.biocluster.igb.illinois.edu

Demux job is:

 1485709.biocluster.igb.illinois.edu

Graph job is:

 1485710.biocluster.igb.illinois.edu
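
Each stage in the output above reports the identifier of the cluster job it submitted. If you save that output to a file, the identifiers can be pulled out for later reference. The following is a sketch only: extract_job_ids is a hypothetical helper, and it assumes the identifiers appear on their own lines in the form shown above.

```shell
# Hypothetical helper: read saved script output on stdin and print one job ID
# per line. Matches plain IDs ("1485699.host...") and array jobs ("1485706[].host...").
extract_job_ids() {
    sed -nE 's/^[[:space:]]*([0-9]+(\[\])?)\.[[:alnum:].-]+[[:space:]]*$/\1/p'
}
```

For example, extract_job_ids < run-output.txt would list the identifiers from a saved log (run-output.txt is a placeholder name). How you then query those jobs depends on the scheduler your cluster uses.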

 

 

 

Description of Input

 

Required:

  • -seq (string) Protein (amino acid) sequence without spaces.
  • -tmpdir (string) Specifies the name of a folder that will be created in your home directory for your datasets and networks; in the example above, GR-test-results. Do not include spaces.
  • -queue (string) Specifies the queue to submit to; enter efi to use the EFI machines.
  • -np (integer) Specifies the number of processors to use for parallel BLAST; 12 is the recommended value.
  • -evalue (integer) Specifies the maximum alignment E-value for returned sequences. The integer is the absolute value of the exponent of a BLAST E-value; for example, 5 corresponds to an E-value of 1e-5. Increasing this value makes the cutoff more stringent, so the resulting dataset contains more closely related sequences.
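
To make the -evalue convention concrete, the sketch below prints the E-value cutoff implied by a given integer. The function name evalue_cutoff is hypothetical and is not part of the EFI scripts; it only illustrates the exponent rule described above.

```shell
# Hypothetical helper: show the BLAST E-value cutoff implied by an -evalue integer.
evalue_cutoff() {
    case "$1" in
        # Reject empty input and anything containing a non-digit character.
        ''|*[!0-9]*) echo "expected a non-negative integer, got: $1" >&2; return 1 ;;
    esac
    # The integer is the absolute value of the exponent, so N means 1e-N.
    printf '1e-%s\n' "$1"
}

evalue_cutoff 5     # prints 1e-5
evalue_cutoff 20    # prints 1e-20
```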
 
 
 

STEP 3: GENERATE NETWORKS

Run the analyzedata.pl program as described here.