OXFORD UNIVERSITY COMPUTING LABORATORY


IMSA V1.00

(Online Documentation)

Sumedha Gunewardena,
Oxford University Computing Laboratory,
Wolfson Building, Parks Road,
Oxford, OX1 3QD, UK.

Sumedha.Gunewardena@comlab.ox.ac.uk

Introduction

IMSA is a multiple sequence alignment tool that accepts prior knowledge as input: known homologous segments, or known structural or functional elements. The program annotates the input sequences based on this knowledge and then uses the annotations to guide the alignment. It tries to capture two biologically reasonable conjectures that can greatly improve the sensitivity of the alignments. The first is that certain biologically distinguishable structures should be preserved during the alignment process. The second is that residues in certain distinguishable segments of sequence should be aligned with each other with higher probability than the substitution matrix alone would specify.

The multiple sequence alignment algorithm used in IMSA is modified from the standard iterative pair-wise alignment algorithm. The input sequences are annotated with what we call 'sequence tags', an efficient and robust method for tagging biological sequences that we developed for this purpose.
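The source does not describe how sequence tags are represented internally, so the following Python sketch is only an illustration of the general idea of annotating residues with the motif set that covers them. The function name tag_sequence and the list-of-indices representation are assumptions made for this example and are not taken from tag.c. The sequence is the first 40 residues of FABVL from the example file later in this document.

```python
def tag_sequence(sequence, motif_sets):
    """Return one tag per residue: the index of the motif set that
    covers the residue, or None if no motif covers it.

    Illustrative only; this is not IMSA's own tagging scheme.
    """
    tags = [None] * len(sequence)
    for set_index, motifs in enumerate(motif_sets):
        for motif in motifs:
            start = sequence.find(motif)
            while start != -1:
                # Mark every residue of this motif occurrence.
                for pos in range(start, start + len(motif)):
                    tags[pos] = set_index
                start = sequence.find(motif, start + 1)
    return tags


# First 40 residues of FABVL, with the FABVL motif from each of the
# first three motif sets of the example file.
fabvl_prefix = "ASVLTQPPSVSGAPGQRVTISCTGSSSNIGAGHNVKWYQQ"
motif_sets = [["VLTQPP"], ["TISCTG"], ["NVKWY"]]
tags = tag_sequence(fabvl_prefix, motif_sets)
```

In this representation, two residues from different sequences that carry the same tag belong to motifs of the same motif set, which is the kind of information an alignment step could use to favour pairing them.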

Downloading and Installing IMSA

IMSA is written in ANSI C. It consists of the source files imsa.c, utility.c, autobind.c, readfile.c and tag.c, and the header file imsa.h. These files and their Makefile can be downloaded from here. Download them into a directory on your UNIX system and run the make command in that directory to compile the source files.

IMSA Input File Format

The following is an illustration of the file format used by IMSA. Strings appearing within [ ] are identifiers and must appear exactly as shown (including the square brackets). Strings appearing within { } define motif sets or, in the [MAT] section, the alphabet. Each |text| placeholder must be replaced by a value of the named kind: a motif, a character, a number or a string. The scoring matrix should be given as a lower (left) triangular matrix, including the diagonal, with its rows and columns in the order in which the alphabet is defined. Text following a % sign is a comment on the illustration and is not part of the actual input file.

[SPS]
{ |motif| |motif| · · · |motif| }
{ |motif| |motif| · · · |motif| }        % Motif sets
        ·
        ·
{ |motif| |motif| · · · |motif| }
[E-SPS]
[MAT]
{ |char| |char| · · · |char| }           % The alphabet
|digit|
|digit| |digit|                          % The substitution matrix
        ·
        ·
|digit| |digit| · · · |digit|
[E-MAT]
> |string|                               % Sequence name
|string|                                 % Sequence
> |string|                               % Sequence name
|string|                                 % Sequence
        ·
        ·
> |string|                               % Sequence name
|string|                                 % Sequence
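The triangular layout of the substitution matrix can be illustrated with a short Python sketch. It shows how a score for any pair of characters can be looked up when only the lower triangle is stored: row i holds the scores of the i-th alphabet character against characters 1 to i. The function name make_scorer is hypothetical, and this is not the lookup code used in IMSA itself. The three-letter alphabet and the values are the first three rows of the example matrix given below.

```python
def make_scorer(alphabet, rows):
    """Build a symmetric score(a, b) lookup from a lower triangular matrix.

    rows[i] must hold i + 1 scores: character i against characters 0..i.
    """
    if len(rows) != len(alphabet):
        raise ValueError("matrix must have one row per alphabet character")
    for i, row in enumerate(rows):
        if len(row) != i + 1:
            raise ValueError(f"row {i + 1} must contain {i + 1} scores")
    index = {ch: i for i, ch in enumerate(alphabet)}

    def score(a, b):
        i, j = index[a], index[b]
        if i < j:
            # Only the lower triangle is stored, so swap to keep i >= j.
            i, j = j, i
        return rows[i][j]

    return score


alphabet = ["C", "S", "T"]
rows = [[14], [-1, 6], [-5, 2, 7]]
score = make_scorer(alphabet, rows)
```

Because the matrix is symmetric, score("C", "S") and score("S", "C") both read the same stored entry.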

Example:

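The following is an example file containing 8 protein sequences to be aligned. There are 7 motif sets given as input, with 8 motifs in each set. In this example, each motif set contains one motif from each sequence, listed in the same order as the sequences.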

[SPS]
{ VLTQPP VLTQPP QLVQSG QLEQSG SVFLFP SVTLFP SVFPLA QVYTLP }
{ TISCTG TISCSG RLSCSS SLTCTV EVTCVV TLVCLI ALGCLV SLTCLV }
{ NVKWY TVNWY AMYWV YWTWV KFNWY TVAWK TVSWN AVEWE }
{ SVSKS SGSKS TISRN TMLVN KTKPR GVETT GVHTF NYKTT }
{ TSATLAI ASASLAI NTLFLQM NQFSLRL VVSVLTV ASSYLSL LSSVVTV LYSKLTV }
{ YYCQSY YYCAAW YFCARD YYCARN YKCKVS YSCQVT YICNVN FSCSVM }
{ VFG VFG YWG VWG IEK VEK VDK TQK }
[E-SPS]

[MAT]
{ C S T P A G N D E Q H R K M I L V F Y W }

14
-1 6
-5 2 7
-6 1 -1 10
-5 2 2 1 6
-8 1 -3 -3 1 8
-8 2 0 -3 -1 -1 7
-11 -1 -2 -4 -1 -1 4 8
-11 -2 -3 -3 0 -2 1 5 8
-11 -3 -3 -1 -2 -5 -1 1 4 9
-6 -4 -5 -2 -5 -7 2 -1 -2 4 11
-6 -1 -4 -2 -5 -8 -3 -6 -5 1 1 10
-11 -2 -1 -4 -4 -5 1 -2 -2 -1 -3 3 8
-11 -4 -2 -6 -3 -8 -5 -8 -6 -2 -7 -2 1 13
-5 -4 -1 -6 -3 -7 -4 -6 -5 -5 -7 -4 4 2 9
-12 -7 -5 -5 -5 -8 -6 -9 -7 -3 -5 -7 -6 4 2 9
-4 -4 -1 -4 0 -4 -5 -6 -5 -5 -6 -6 -6 1 5 1 8
-10 -5 -6 -9 -7 -8 -6 -11 -11 -10 -4 -7 -11 -2 0 0 -5 12
-2 -6 -6 -11 -6 -11 -3 -9 -7 -9 -1 -10 -10 -8 -4 -5 -6 6 13
-13 -4 -10 -11 -11 -13 -8 -13 -14 -11 -7 1 -9 -11 -12 -7 -14 -2 -2 19
[E-MAT]

> FABVL
ASVLTQPPSVSGAPGQRVTISCTGSSSNIGAGHNVKWYQQLPGTAPKLLIFHNNARFSVSKSGTSATLAITGLQA
EDEADYYCQSYDRSLRVFGGGTKLTVLR

> FB4VL
QSVLTQPPSASGTPGQRVTISCSGTSSNIGSSTVNWYQQLPGMAPKLLIYRDAMRPSGVPDRFSGSKSGASASLA
IGGLQSEDETDYYCAAWDVSLNAYVFGTGTKVTVLGQ

> FB4VH
EVQLVQSGGGVVQPGRSLRLSCSSSGFIFSSYAMYWVRQAPGKGLEWVAIIWDDGSDQHYADSVKGRFTISRNDS
KNTLFLQMDSLRPEDTGVYFCARDGGHGFCSSASCFGPDYWGQGTPVTVSS

> FABVH
AVQLEQSGPGLVRPSQTLSLTCTVSGTSFDDYYWTWVRQPPGRGLEWIGYVFYTGTTLLDPSLRGRVTMLVNTSK
NQFSLRLSSVTAADTAVYYCARNLIAGGIDVWGQGSLVTVSS

> FCCH2
PSVFLFPPKPKDTLMISRTPEVTCVVVDVSHEDPQVKFNWYVDGVQVHNAKTKPREQQYNSTYRVVSVLTVLHQN
WLDGKEYKCKVSNKALPAPIEKTISKAKG

> FABCL
QPKAAPSVTLFPPSSEELQANKATLVCLISDFYPGAVTVAWKADGSPVKAGVETTTPSKQSNNKYAASSYLSLTP
EQWKSHKSYSCQVTHEGSTVEKTVAPTSCS

> FABCH1
ASTKGPSVFPLAPTAALGCLVKDYFPEPVTVSWNSGALTSGVHTFPAVLQSSGLYSLSSVVTVPSSSLGTQTYIC
NVNHKPSNTKVDKKVEPKSA

> FCCH3
QPREPQVYTLPPSREEMTKNQVSLTCLVKGFYPSDIAVEWESNGQPENNYKTTPPVLDSDGSFFLYSKLTVDKSRW
QQGNVFSCSVMHEALHNHYTQKSLSL
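To make the file layout concrete, here is a minimal Python sketch of a reader for this format. It is not a port of readfile.c; the function name parse_imsa_input is hypothetical, and the sketch assumes that tokens are separated by whitespace and that a sequence wrapped over several lines (as in the example above) is joined into one string. The sample input is a cut-down toy file, not a meaningful alignment job.

```python
def parse_imsa_input(text):
    """Parse IMSA-style input into (motif_sets, alphabet, matrix, sequences).

    A sketch under stated assumptions; not IMSA's own reader.
    """
    motif_sets, alphabet, matrix, sequences = [], [], [], []
    section = None  # "SPS", "MAT", or None for the sequence records
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "[SPS]":
            section = "SPS"
        elif line == "[MAT]":
            section = "MAT"
        elif line in ("[E-SPS]", "[E-MAT]"):
            section = None
        elif section == "SPS":
            # One motif set per line: { motif motif ... }
            motif_sets.append(line.strip("{}").split())
        elif section == "MAT":
            if line.startswith("{"):
                alphabet = line.strip("{}").split()
            else:
                matrix.append([int(x) for x in line.split()])
        elif line.startswith(">"):
            sequences.append([line[1:].strip(), ""])
        elif sequences:
            # Continuation line: append to the current sequence.
            sequences[-1][1] += line
    return motif_sets, alphabet, matrix, [tuple(s) for s in sequences]


sample = """\
[SPS]
{ VLTQPP QLVQSG }
[E-SPS]
[MAT]
{ C S T }
14
-1 6
-5 2 7
[E-MAT]
> FABVL
ASVLTQPP
SVSGAPGQ
> FB4VH
EVQLVQSG
"""
motif_sets, alphabet, matrix, sequences = parse_imsa_input(sample)
```

A reader along these lines could also check that the matrix has one row per alphabet character and that row i contains i scores before the data is used.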

Parameters

The input to IMSA is a set of parameters followed by the file containing the sequences to be aligned. Five parameters can be specified at the command line:

  • Gap start-up cost (g) : The gap start-up cost is the first part of a two-tier gap cost. It specifies the cost of starting a new gap in the alignment.
  • Gap extension cost (s) : The gap extension cost is the second part of the two-tier gap cost. It specifies the cost of aligning a residue with a gap. It is usually set to a value much smaller than the gap start-up cost.
  • Binding penalty (b) : The binding penalty specifies how tightly the given motifs are to be preserved throughout the alignment. The higher the binding penalty, the more tightly the motifs are preserved.
  • Attraction weight (w) : The attraction weight defines the strength of attraction between residues of motifs in the same motif set.
  • Number of realignments (r) : The number of realignments specifies the number of times each sequence should be removed from the alignment and realigned afresh with the existing set of sequences.
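The two-tier gap cost described above can be sketched as a small Python function. The exact formula used by IMSA is not stated in this document, so the sketch assumes that a gap pays the start-up cost once and the extension cost for every gapped residue, including the first. Another common convention charges the extension cost only from the second residue onwards. The function name gap_cost is hypothetical.

```python
def gap_cost(length, start_up, extension):
    """Cost of one gap of `length` residues under a two-tier scheme.

    Assumed formula: start_up + length * extension.
    This is an illustration, not IMSA's confirmed scoring rule.
    """
    if length < 0:
        raise ValueError("gap length cannot be negative")
    if length == 0:
        return 0
    return start_up + length * extension


# With a start-up cost of 8 and an extension cost of 4:
single = gap_cost(1, 8, 4)   # one gapped residue
triple = gap_cost(3, 8, 4)   # one gap spanning three residues
```

Under this assumption, one gap of three residues costs 20, whereas three separate single-residue gaps cost 36, which is why such a scheme favours a few long gaps over many short ones.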

Running IMSA

IMSA is run from the command line. The program name should be followed by the list of parameters (optional) and the input file name. Here is an example command line statement to run IMSA on the sequence file "alignment_file.msa".

$ imsa -g8 -s4 -b5 -w25 alignment_file.msa
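When IMSA is driven from a script, the command line above can be assembled programmatically. The Python sketch below only builds the argument list; the helper name build_imsa_command is hypothetical. The -g, -s, -b and -w forms follow the example command, while the -r form for the number of realignments is assumed by analogy and is not shown in this document.

```python
def build_imsa_command(input_file, g=None, s=None, b=None, w=None, r=None,
                       program="imsa"):
    """Return the argument list for an IMSA run.

    Parameters left as None are omitted, since they are optional.
    The "-r<value>" spelling is an assumption made for this sketch.
    """
    command = [program]
    for flag, value in (("g", g), ("s", s), ("b", b), ("w", w), ("r", r)):
        if value is not None:
            command.append(f"-{flag}{value}")
    # The input file name comes last, after the optional parameters.
    command.append(input_file)
    return command


command = build_imsa_command("alignment_file.msa", g=8, s=4, b=5, w=25)
```

The resulting list can be passed to subprocess.run(command, check=True) from the Python standard library, provided the compiled imsa program is on the search path.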


Please send questions, comments and bug reports to the author.


