-----Multiple Sequence Alignment by Parallel Iterative Aligner-----


KEY FEATURES

The system iteratively refines parts of a provisional alignment and,
as a result, can effectively reach a high-quality alignment.

Multiple branches of a search tree in this combinatorial problem are
evaluated in parallel by using many processing elements in each
iteration.

The heuristic method, "Restricted Partitioning Technique," prunes a
large number of branches in the search tree and makes it possible to
solve the combinatorial problem in a practical amount of time.


INTRODUCTION

An important way of discovering new biological information is to
infer the unknown structure of a protein from its sequence. This is
possible because, fortunately, proteins that have similar amino acid
sequences tend to have similar structures. Multiple sequence alignment
is one of the most common methods of sequence similarity analysis.
The alignment of several protein sequences can provide valuable
information for research into the function or structure of proteins,
especially if one of the aligned proteins has been well characterized.

Consider an example of multiple sequence alignment. The following
set of sequences represents an alignment of six different protein
sequences. Each letter represents an amino acid. For instance,
HEKL stands for the run histidine, glutamic acid, lysine, and
leucine.

---------------HEKLLHPGIQKTTKLF-GET---YYFPNSQLLIQNIINECSICNLAKTEHRNTDM--P-TKTT
--------------LHQ-LTHLSFSKMKALLERSHSPYYMLNRDRTL-KNITETCKAC--AQVNASKSAVKQG-TR--
-PVLQ---LSPA-ELHS-FTHCG---QTAL--TLQ----GATTTEA--SNILRSCHAC---RGGNPQHQMPRGHI---
QATFQAYPLREAKDLHT-ALHIG---PRAL--SKA---CNISMQQA--REVVQTCPHC------NSAPALEAG-VN--
--ISD--PIHEATQAHT-LHHLN---AHTL--RLL---YKITREQA--RDIVKACKQC---VVATPVPHL--G-VN--
--ILT--ALESAQESHA-LHHQN---AAAL--RFQ---FHITREQA--REIVKLCPNC---PDWGSAPQL--G-VN--
               ^    ^                                 ^  ^

Each sequence is shifted by inserting gaps (dash characters) so that
each column of the resulting alignment holds the same or similar amino
acids. Conserved patterns such as H....H and C..C, marked with carets
above, are considered important sites called sequence motifs, or
simply motifs, because important sites in a protein sequence tend to
be conserved through evolutionary cycles of mutation and natural
selection. Multiple sequence alignment is useful not only for
inferring the structure and function of proteins but also for drawing
a phylogenetic tree that traces the evolutionary history of the
organisms.


DYNAMIC PROGRAMMING

Dynamic programming (DP) is the basic method for finding an optimal
alignment. The method can be regarded as a best-path search in an
N-dimensional network. When two groups of sequences are given, a
two-dimensional network of nodes connected by arrows is formed, and a
score is assigned to each arrow. We search for the path from the top
left node to the bottom right node that maximizes the total score of
its arrows. This best path corresponds to the optimal alignment.
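
The best-path search can be sketched for the simplest case, two single
sequences. This is a minimal illustration, not the system's own
implementation: the name align_pair and the match, mismatch, and gap
scores are assumptions chosen for readability, and a real protein
aligner would take its arrow scores from a substitution matrix.

```python
def align_pair(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of two strings by two-dimensional DP.

    Returns (best_score, aligned_a, aligned_b). Scores are toy values.
    """
    def sub(i, j):
        # Score of the diagonal arrow that pairs a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    # Fill the network: each node keeps the best total over its three
    # incoming arrows (diagonal, down, right).
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace the best path back from the bottom right node.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append('-')
            i -= 1
        else:
            out_a.append('-'); out_b.append(b[j - 1])
            j -= 1
    return score[n][m], ''.join(reversed(out_a)), ''.join(reversed(out_b))
```

With these toy scores, align_pair("HEKL", "HKL") returns
(1, "HEKL", "H-KL"): three matching columns and one gap.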

Scores on arrows should reflect similarity between compared
characters.  In the case of protein sequence alignment, Dayhoff's odds
matrix PAM250 is the most popular way of obtaining the scores. The
matrix was obtained by statistical analysis of the mutation
probability of amino acids.

Theoretically, N-dimensional DP provides an optimal alignment of N
groups of sequences. However, the running time of N-dimensional DP
grows exponentially with N; when N is more than three, it does not
finish in a realistic time frame. Conventionally, researchers in
biology build a multiple sequence alignment by merging groups of
aligned sequences. A conventional algorithm, called the tree-based
algorithm, merges them in tree-like order using two-dimensional DP.
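
A little arithmetic shows why N-dimensional DP becomes impractical.
The sketch below assumes N sequences of equal length and counts the
nodes of the N-dimensional network and the arrows entering each
interior node; the function name dp_network_size is illustrative only.

```python
def dp_network_size(seq_len, n):
    """Return (nodes, arrows into each interior node) for n sequences."""
    # One node per tuple of prefix lengths, each between 0 and seq_len.
    nodes = (seq_len + 1) ** n
    # On each arrow every sequence either advances or takes a gap,
    # and at least one sequence must advance.
    arrows_per_node = 2 ** n - 1
    return nodes, arrows_per_node


for n in (2, 3, 6):
    nodes, arrows = dp_network_size(100, n)
    print(f"N={n}: {nodes:,} nodes, {arrows} arrows into each interior node")
```

For sequences of length 100, two sequences need about ten thousand
nodes, three need about a million, and six need about 10^12.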

Though the execution time of the conventional algorithm is manageable,
the quality of the resulting alignment is often not high enough. Thus,
researchers fairly often have to do multiple sequence alignments by
hand. The large number of sequence sets to be aligned has become a
burden on those researchers.


PARALLEL ITERATIVE ALIGNER

We developed a parallel iterative aligner to improve the quality of
automatic multiple sequence alignment. Its algorithm is based on the
Berger-Munson (B-M) algorithm. We first introduce the B-M algorithm
and then explain the parallel iterative aligner.

The B-M algorithm features a novel randomized iterative strategy for
generating a high-scoring multiple sequence alignment. The procedure
is as follows. The initially aligned sequences are randomly divided
into two groups. With the alignment of the members within each group
held fixed, the alignment between the groups is optimized using
two-dimensional DP. The resulting alignment, in turn, is the starting
point for the next iteration, which uses a different pair of groups.
Each iteration that improves the alignment between two sequence groups
also improves the global alignment.
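
One iteration of this strategy can be sketched as follows. It is a
simplified illustration under stated assumptions, not the B-M code
itself: the alignment is scored by a toy sum-of-pairs function
(pair_score, with a linear gap penalty), and all function names are
hypothetical.

```python
import random


def pair_score(x, y):
    """Toy score (assumption): gap/gap 0, residue/gap -2, match 1, mismatch -1."""
    if x == '-' and y == '-':
        return 0
    if x == '-' or y == '-':
        return -2
    return 1 if x == y else -1


def sp_score(rows):
    """Sum-of-pairs score of an alignment given as equal-length strings."""
    return sum(pair_score(r[c], s[c])
               for c in range(len(rows[0]))
               for i, r in enumerate(rows) for s in rows[i + 1:])


def drop_gap_columns(rows):
    """Remove columns that hold only gaps within this group."""
    keep = [c for c in range(len(rows[0])) if any(r[c] != '-' for r in rows)]
    return [''.join(r[c] for c in keep) for r in rows]


def align_groups(g1, g2):
    """Align two groups by 2-D DP, keeping each group's columns intact."""
    col1 = [[r[c] for r in g1] for c in range(len(g1[0]))]
    col2 = [[r[c] for r in g2] for c in range(len(g2[0]))]
    gap1, gap2 = ['-'] * len(g1), ['-'] * len(g2)

    def cross(u, v):
        # Score of one aligned column, counting only pairs across groups.
        return sum(pair_score(x, y) for x in u for y in v)

    n, m = len(col1), len(col2)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = S[i - 1][0] + cross(col1[i - 1], gap2)
    for j in range(1, m + 1):
        S[0][j] = S[0][j - 1] + cross(gap1, col2[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            S[i][j] = max(S[i - 1][j - 1] + cross(col1[i - 1], col2[j - 1]),
                          S[i - 1][j] + cross(col1[i - 1], gap2),
                          S[i][j - 1] + cross(gap1, col2[j - 1]))

    out1, out2 = [], []
    i, j = n, m
    while i > 0 or j > 0:  # trace the best path back to the origin
        if (i > 0 and j > 0 and
                S[i][j] == S[i - 1][j - 1] + cross(col1[i - 1], col2[j - 1])):
            out1.append(col1[i - 1]); out2.append(col2[j - 1]); i -= 1; j -= 1
        elif i > 0 and S[i][j] == S[i - 1][j] + cross(col1[i - 1], gap2):
            out1.append(col1[i - 1]); out2.append(gap2); i -= 1
        else:
            out1.append(gap1); out2.append(col2[j - 1]); j -= 1
    out1.reverse(); out2.reverse()
    rows1 = [''.join(col[k] for col in out1) for k in range(len(g1))]
    rows2 = [''.join(col[k] for col in out2) for k in range(len(g2))]
    return rows1, rows2


def bm_iteration(rows, rng):
    """Randomly split the rows into two groups and realign the groups."""
    idx = list(range(len(rows)))
    rng.shuffle(idx)
    k = rng.randint(1, len(rows) - 1)
    left, right = idx[:k], idx[k:]
    a1, a2 = align_groups(drop_gap_columns([rows[i] for i in left]),
                          drop_gap_columns([rows[i] for i in right]))
    result = [None] * len(rows)
    for i, r in zip(left + right, a1 + a2):
        result[i] = r  # restore the original row order
    return result
```

Under this toy scoring the pairs inside each group keep their score,
and the DP finds the best score between the groups, so calling
bm_iteration repeatedly with rng = random.Random(0) never lowers
sp_score.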

This iterative strategy often results in much better multiple
alignments than those obtained by conventional algorithms. However,
the B-M algorithm needs a large amount of time when the number of
sequences is large.

We can reduce the execution time when a parallel machine is
available. The algorithm of our parallel iterative aligner is as
follows. In each iteration, every possible partitioning of the aligned
sequences into two groups is evaluated by two-dimensional DP, with the
evaluations running in parallel, and the alignment with the best score
is selected as the starting point for the next iteration. The parallel
iterative aligner performs better than the original B-M method in
terms of execution time.
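
The selection step of one parallel iteration might look like the
sketch below. Several things here are assumptions: evaluate is a
hypothetical callback that realigns one partitioning by
two-dimensional DP and returns its score, and a thread pool stands in
for the processing elements of a real parallel machine (in CPython,
threads do not speed up CPU-bound work; the point is only the
structure of the computation).

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def all_bipartitions(n):
    """Yield each split of range(n) into two non-empty groups exactly once."""
    items = range(n)
    for k in range(1, n // 2 + 1):
        for group in combinations(items, k):
            # When both groups have the same size, every split would be
            # produced twice; keep only the copy whose first group has item 0.
            if 2 * k == n and 0 not in group:
                continue
            rest = tuple(i for i in items if i not in group)
            yield group, rest


def best_split(n, evaluate, workers=4):
    """Evaluate every partitioning concurrently and return the best one."""
    splits = list(all_bipartitions(n))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        scores = list(pool.map(evaluate, splits))
    best = max(range(len(splits)), key=scores.__getitem__)
    return splits[best], scores[best]
```

For N sequences there are 2^(N-1) - 1 such partitionings, 31 for the
six sequences in the example above, so the work per iteration grows
quickly with N.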

Furthermore, we have developed an effective search heuristic, the
restricted partitioning technique. While applying the iterative
strategy, we found that the number of sequences in the divided groups
matters. When a partitioning divides N sequences into groups of k and
N-k sequences, a smaller k tends to yield a larger improvement from
two-dimensional DP. The restricted partitioning technique
preferentially selects partitionings with a small k, such as one or
two. This restricts the search space and reduces the execution time
remarkably. A parallel iterative aligner with this technique can
handle more sequences at a time than one without it.
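
The effect on the search space can be illustrated by counting
partitionings. The sketch below assumes the simplest reading of the
technique, keeping only partitionings whose smaller group has at most
k_max members; the names restricted_partitions and k_max are
illustrative, and the actual selection rule may be more refined.

```python
from itertools import combinations


def restricted_partitions(n, k_max=2):
    """Yield splits of range(n) whose smaller group has at most k_max members."""
    items = range(n)
    for k in range(1, min(k_max, n // 2) + 1):
        for group in combinations(items, k):
            # Avoid listing an even split twice (see k == n - k).
            if 2 * k == n and 0 not in group:
                continue
            yield group, tuple(i for i in items if i not in group)


for n in (6, 10, 20):
    restricted = sum(1 for _ in restricted_partitions(n))
    unrestricted = 2 ** (n - 1) - 1
    print(n, restricted, unrestricted)
# prints:
# 6 21 31
# 10 55 511
# 20 210 524287
```

With k limited to one or two, the number of candidates per iteration
grows only quadratically with N instead of exponentially.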
