為了手上一個project, 不得不涉獵Sequence alignment, 這個類分析一個非常複雜的問題, 但又十分重要。在生物學上,我們是用這個方法來比對基因序列,例如貓跟狗的基因像不像,小貓熊寶寶的爹有沒有可能是隔壁籠的台灣黑熊。聽其來很有意思,不過,事實上,非常的枯燥,我們工作的結果像是下面這段序列比對一樣,基本上,是拿兩本有字天書,逐字比對。
且慢,如果你不是這個領域的朋友,別急著轉台....
##ReadMore##其實是 sequence alignment 也是很有趣及應用價值的。Alignment 的目標在於找尋兩段序列(文章也是一種序列)相似處,比方說,最近我發現了網路上有一篇文章,有種似曾相識的感覺,我懷疑是有人人惡意抄襲了我Blog 的文章,他把裡面的內容改一改字(mutant),加加減減(insert, delete),就好像自己的新文章一樣。他一定看準了,我一個字一個字對,可能會把頭髮給對白了,不過,他沒想到,我剛寫好了 Needleman-wunsche (NW)的演算法,不用一秒,NW 演算法可以找出兩篇文章的脈絡,並且告訴我們兩篇有多像了。
不過有些盜文者很聰明,他們知道我會 NW 演算法,所以他們採取截錄的方法,只偷一段,然後把幾段拚起來,就成了一篇新報告,因為文章結構不同,NW 就沒有那麼靈敏。不過,他們沒想到,我也順便寫了,Smith-waterman(SW) 的演算法,SW法會找出兩篇文章內,最相似的段落,只要有盜文,就抓的到。
這些看似來無趣的數學運算,用對了地方也是很有趣的,又到了期末交報告的季節了,以前幫忙改過學生的作業,改著改著常有似曾相識的感覺,天下文章一大抄,後 來發現其實原創的版本有限,大多都是由原版演化出來的新版本。有了這兩個演算法,我想應該可以幫學生報告建立親源樹,找出原創版本是誰,然後誰跟誰借來 抄,第二版又借給誰,那一個版本比較流行.....好像想太多了,好了,如果你是非生資領域的朋友,可以轉台了,接下來聊聊演算法。
Sequence alignment 的演算法, 一個非常複雜的問題, 因為兩段序列要對在一起可能有非常多種組合,任何一個位子,都可以突變,插入或剔除,使用遞迴的方法會產生非常多重複的計算,不但非常沒有效率,序列長了甚至癱瘓系統。為了解決這個問題,因而衍生出dynamic programming的寫法,有關 alignment 的原理及程式寫法,在這篇有詳細的介紹,用看的他的說明可能有點難懂,不過,如果有程式基礎,看source code 應該對了解原理會很有幫助。
Dynamic programming 的理論出來非常久了,我想要找公開的程式碼來用應該不難,所以我問了孤狗大神,大神說有!但只有JAVA 跟 C++的 版本,就是沒有C# 版(用DOT NET 在科學界是孤鳥啊!),我也不想為了這一小段碼,把之前寫的程式,全翻成 JAVA or C++ 。只能參考他們程式碼用C# 重寫一次,個人不太喜歡JAVA 的版本,物件化過了頭,結構很漂亮,但是效能上郤不如一個C++版簡單的陣列有效。我沒學過C/C++ ,為了看懂C++特別去圖館借了兩本入門書來修綀,經過一夜的閉關,C++的天書開始顯出影像,雖然還沒有練到十成功力,但至少可以把C++版的 Code 改寫成C#了,而且原本洛洛長的碼,用純物件的C#,變得很簡潔易讀,念在我用的MSDN是免錢的份上,我把這段碼放迴饋給科學界其他的孤鳥們,如果你想用C# 開發輕量級Alignmemt 程式,這個class 或許對你有些幫助。
使用方式:
For Needleman-Wunsch method
NeedlemanWunsch nw = new NeedlemanWunsch(2, 2, -1); \\ 參數設定參見程式碼
AlignResult newResult = nw.Align(seq1, seq2);
Needleman-Wunsch 會傳回一個最佳的 alignment 結果。
For Smith-Waterman method
SmithWaterman sw = new SmithWaterman(2, 2, -1);
AlignResult[] newResults = sw.Align(seq1, seq2);
Smith-Waterman可能會有多個最佳的 alignment 結果,所以陣列方式傳回。
For present result
AlignResult.MatchLength;
會傳回比對結果共有多少的base 可以對上。
AlignResult.MatchString;
傳回比對果字串,並三行,第一三行是 sequence, 第二行是兩 sequence間,base-base 的關係(| match, : mismatch, . gap)
好了,廢話不多說,看碼.....
=========================== 以下是程式碼 ======================
public class AlignResult
{
public string Name = string.Empty;
public override string ToString()
{
return Name;
}
public AlignResult(string seq1, string seq2)
{
seq_1_aligned = seq1;
seq_2_aligned = seq2;
}
private string seq_1_aligned = string.Empty;
private string seq_2_aligned = string.Empty;
public string MatchString
{
get
{
StringBuilder sb = new StringBuilder();
char[] array1 = seq_1_aligned.ToCharArray();
char[] array2 = seq_2_aligned.ToCharArray();
for (int i = 0; i < array1.Length; i++)
{
if (array1[i] == '-' || array2[i] == '-')
{
sb.Append('.');
}
else
{
sb.Append((array1[i] == array2[i]) ? '|' : ':');
}
}
StringBuilder rsb = new StringBuilder();
rsb.Append(seq_1_aligned + "\n");
rsb.Append(sb.ToString() + "\n");
rsb.Append(seq_2_aligned + "\n");
return rsb.ToString();
}
}
public int MatchLength
{
get
{
StringBuilder sb = new StringBuilder();
char[] array1 = seq_1_aligned.ToCharArray();
char[] array2 = seq_2_aligned.ToCharArray();
int k = 0;
for (int i = 0; i < array1.Length; i++)
{
if (array1[i] == array2[i]) k++;
}
return k;
}
}
}
public class NeedlemanWunsch
{
public int GapPenalty = 2;
public int MatchScore = 2;
public int MisMatchScore = -1;
public NeedlemanWunsch() { }
/// <summary>
/// constructor of Needleman Wunsch aligner
/// </summary>
/// <param name="gap"> minus score when a gap happend</param>
/// <param name="match">add score when bases are perfect-match</param>
/// <param name="mismatch">add score when bases are mis-match</param>
public NeedlemanWunsch(int gap, int match, int mismatch)
{
GapPenalty = gap;
MatchScore = match;
MisMatchScore = mismatch;
}
/// <summary>
/// Align two sequence and return align result
/// </summary>
/// <param name="seq1">sequence 1 as string (case sensitive)</param>
/// <param name="seq2">sequence 2 as string (case sensitive)</param>
/// <returns></returns>
public AlignResult Align(string seq1, string seq2)
{
#region component
int l1 = seq1.Length;
int l2 = seq2.Length;
int[,] dpMatrix = new int[l2 + 1, l1 + 1]; //dynamic program matrix
char[,] pointerMatrix = new char[l2 + 1, l1 + 1]; //trace back char matrix
#endregion
#region fill dynamic program and traceBack matrix
char[] seq_1 = seq1.ToCharArray();
char[] seq_2 = seq2.ToCharArray();
this.dpm_init(dpMatrix, pointerMatrix, l1, l2, GapPenalty); //initial matrix
for (int i = 1; i <= l2; i++)
{
for (int j = 1; j <= l1; j++)
{
char ptr; //pointer of each cell
int fU, fD, fL; //score of upper, diagon and left cells
int match = (seq_1[j - 1] == seq_2[i - 1]) ? MatchScore : MisMatchScore;
fU = dpMatrix[i - 1, j] - GapPenalty; //if up shift will generate a gap, so give a penalty.
fD = dpMatrix[i - 1, j - 1] + match; //if not shift, whether base is macth or not, give a match score
fL = dpMatrix[i, j - 1] - GapPenalty; //if left shift will generate a gap, so give a penalty
// Assing max score among fU, fD, fL to current cell
int max = 0;
if (fD >= fU && fD >= fL) // diagonal is prefered
{ max = fD; ptr = '\\'; } // next cell is diagonal cell
else if (fU > fL)
{ max = fU; ptr = '|'; } // next cell is upper cell
else
{ max = fL; ptr = '-'; } // next cell is left cell
dpMatrix[i, j] = max;
pointerMatrix[i, j] = ptr;
}
}
#endregion
/* Debug code for present dynamic program score and pointer matrix
Console.WriteLine("DynamicMatrix");
print_matrix(dpMatrix, seq1, seq2);
Console.WriteLine("TrackBackMatrix");
print_traceback (pointerMatrix, seq1, seq2);
*/
#region Trace Back and create aligned string
// track back form the most left-bottom cornner
int ti = l2--;
int tj = l1--;
// string builder for align string
StringBuilder seq1Builder = new StringBuilder();
StringBuilder seq2Builder = new StringBuilder();
/*
pointerMatrix[x,y]
* n # S E Q 1 ( y axis)
#
S
E
Q
2
(x-axis)
*/
while (ti > 0 || tj > 0) // walk to the most right-top cornner
{
switch (pointerMatrix[ti, tj])
{
//move to up cell, shift x, Seq2 get a base and Seq1 get a "-"
case '|':
seq1Builder.Insert(0, '-');
seq2Builder.Insert(0, seq2[ti - 1]);
ti--;
break;
//move to diagonal cell, both Seq1 and Seq 2 add a base
case '\\':
seq1Builder.Insert(0, seq_1[tj - 1]);
seq2Builder.Insert(0, seq2[ti - 1]);
ti--;
tj--;
break;
//move to left cell, Seq 1 get a base, Seq 2 get a "-"
case '-':
seq1Builder.Insert(0, seq_1[tj - 1]);
seq2Builder.Insert(0, '-');
tj--;
break;
default:
break;
}
}
#endregion
return new AlignResult(seq1Builder.ToString(), seq2Builder.ToString());
}
public void dpm_init(int[,] dpMatrix, char[,] traceBackMatrix, int seq1Length, int seq2Length, int gapPenalty)
{
dpMatrix[0, 0] = 0;
traceBackMatrix[0, 0] = 'n';
for (int i = 1; i <= seq1Length; i++)
{
dpMatrix[0, i] = -i * gapPenalty;
traceBackMatrix[0, i] = '-';
}
for (int i = 1; i <= seq2Length; i++)
{
dpMatrix[i, 0] = -i * gapPenalty;
traceBackMatrix[i, 0] = '|';
}
}
// methods for debug
public void print_matrix(int[,] intArray, string seq1, string seq2)
{
StringBuilder sb = new StringBuilder();
char[] seq1Array = seq1.ToCharArray();
char[] seq2Array = seq2.ToCharArray();
//Create Title
sb.Append('#');
sb.Append('\t');
for (int j = 0; j < seq1Array.Length; j++)
{
sb.Append('\t');
sb.Append(seq1Array[j]);
}
sb.Append('\n');
// dump array
for (int i = 0; i < seq2Array.Length + 1; i++)
{
sb.Append((i > 0) ? seq2Array[i - 1] : '-');
for (int j = 0; j < seq1Array.Length + 1; j++)
{
sb.Append('\t' + intArray[i, j].ToString());
}
sb.Append('\n');
}
System.Console.Write(sb.ToString());
}
public void print_traceback(char[,] charArray, string seq1, string seq2)
{
StringBuilder sb = new StringBuilder();
char[] seq1Array = seq1.ToCharArray();
char[] seq2Array = seq2.ToCharArray();
//Create Title
sb.Append('#');
sb.Append('\t');
for (int j = 0; j < seq1Array.Length; j++)
{
sb.Append('\t');
sb.Append(seq1Array[j]);
}
sb.Append('\n');
// dump array
for (int i = 0; i < seq2Array.Length + 1; i++)
{
sb.Append((i > 0) ? seq2Array[i - 1] : '-');
for (int j = 0; j < seq1Array.Length + 1; j++)
{
sb.Append('\t');
sb.Append(charArray[i, j]);
}
sb.Append('\n');
}
System.Console.Write(sb.ToString());
}
}
public class SmithWaterman
{
//internal class to store entry points fo local align fragments
class seed
{
private int x = 0;
private int y = 0;
public seed(int sx, int sy)
{
x = sx;
y = sy;
}
public int X
{ get { return x; } }
public int Y
{ get { return y; } }
}
public int GapPenalty = 2;
public int MatchScore = 2;
public int MisMatchScore = -1;
public SmithWaterman() { }
/// <summary>
/// constructor of Smith Waterman aligner
/// </summary>
/// <param name="gap"> minus score when a gap happend</param>
/// <param name="match">add score when bases are perfect-match</param>
/// <param name="mismatch">add score when bases are mis-match</param>
public SmithWaterman(int gap, int match, int mismatch)
{
GapPenalty = gap;
MatchScore = match;
MisMatchScore = mismatch;
}
/// <summary>
/// Align two sequence and return align result
/// </summary>
/// <param name="seq1">sequence 1 as string (case sensitive)</param>
/// <param name="seq2">sequence 2 as string (case sensitive)</param>
/// <returns>align result array of all highest score entry point</returns>
public AlignResult[] Align(string seq1, string seq2)
{
#region component
int l1 = seq1.Length;
int l2 = seq2.Length;
int[,] dpMatrix = new int[l2 + 1, l1 + 1]; //dynamic program matrix
char[,] pointerMatrix = new char[l2 + 1, l1 + 1]; //pointer matrix
#endregion
#region fill dynamic program
char[] seq_1 = seq1.ToCharArray();
char[] seq_2 = seq2.ToCharArray();
// inital dynamic matrix with zero
this.dpm_init(dpMatrix, pointerMatrix, l1, l2);
// fill dpMatrix and pointer matrix as Needleman Wunsch algorithmn
// two different point are marked below
int fU, fD, fL;
for (int i = 1; i <= l2; i++)
{
for (int j = 1; j <= l1; j++)
{
int match = (seq_1[j - 1] == seq_2[i - 1]) ? MatchScore : MisMatchScore;
fU = dpMatrix[i - 1, j] - GapPenalty;
fD = dpMatrix[i - 1, j - 1] + match;
fL = dpMatrix[i, j - 1] - GapPenalty;
int max = 0;
char ptr;
if (fD >= fU && fD >= fL)
{
max = fD;
ptr = '\\';
}
else if (fU > fL)
{
max = fU;
ptr = '|';
}
else
{
max = fL;
ptr = '-';
}
dpMatrix[i, j] = (max < 0) ? 0 : max; //S.W. method score will be not smaller than Zero;
pointerMatrix[i, j] = (dpMatrix[i, j] == 0) ? '.' : ptr; //S.W. method trace back will stop at cell with zero score (marked with ".");
}
}
#endregion
/* Debug code for present dynamic program ad pointer matrix
Console.WriteLine("DynamicMatrix");
print_matrix(dpMatrix, seq1, seq2);
Console.WriteLine("PointerMatrix");
print_traceback(pointerMatrix,seq1,seq2);
*/
// S.W method walk from the high score entry point (seed)
// The high score cell might be not unique.
#region find highest score seed
List<seed> highList = new List<seed>();
int highestScore = 0;
for (int i = 0; i <= l2; i++)
{
for (int j = 0; j <= l1; j++)
{
if (dpMatrix[i, j] >= highestScore)
{
if (dpMatrix[i, j] > highestScore)
{
highList.Clear();
highestScore = dpMatrix[i, j];
}
seed newEntry = new seed(i, j);
highList.Add(newEntry);
}
}
}
#endregion
//One seed will have one align result
// All aligned result will return as a AlignResult array
#region track back to zero
List<AlignResult> align = new List<AlignResult>();
foreach (seed s in highList)
{
align.Add(traceBack(pointerMatrix, seq1, seq2, s.X, s.Y));
}
return align.ToArray();
#endregion
}
/// <summary>
/// The trace back procedure of Smith Wunsch algorithm is similar to needlemen
/// but:
/// 1. start from the high score cells
/// 2. end at zero-score cell or 0,0 point
/// </summary>
public AlignResult traceBack(char[,] traceBackMatrix, string seq1, string seq2, int x, int y)
{
// start from the highest cells
int ti = x;
int tj = y;
char[] seq_1 = seq1.ToCharArray();
char[] seq_2 = seq2.ToCharArray();
StringBuilder seq1Builder = new StringBuilder();
StringBuilder seq2Builder = new StringBuilder();
while (ti > 0 || tj > 0) // stop at (0,0) point
{
if (traceBackMatrix[ti, tj] == '.') break; //stop at zero cell
switch (traceBackMatrix[ti, tj])
{
case '|':
seq1Builder.Insert(0, '-');
seq2Builder.Insert(0, seq_2[ti - 1]);
ti--;
break;
case '\\':
seq1Builder.Insert(0, seq_1[tj - 1]);
seq2Builder.Insert(0, seq_2[ti - 1]);
ti--;
tj--;
break;
case '-':
seq1Builder.Insert(0, seq_1[tj - 1]);
seq2Builder.Insert(0, '-');
tj--;
break;
default:
break;
}
}
return new AlignResult(seq1Builder.ToString(), seq2Builder.ToString());
}
public void dpm_init(int[,] dpMatrix, char[,] traceBackMatrix, int seq1Length, int seq2Length)
{
dpMatrix[0, 0] = 0;
traceBackMatrix[0, 0] = '.';
for (int i = 1; i <= seq1Length; i++)
{
dpMatrix[0, i] = 0;
traceBackMatrix[0, i] = '.';
}
for (int i = 1; i <= seq2Length; i++)
{
dpMatrix[i, 0] = 0;
traceBackMatrix[i, 0] = '.';
}
}
// methods for debug
public void print_traceback(char[,] charArray, string seq1, string seq2)
{
StringBuilder sb = new StringBuilder();
char[] seq1Array = seq1.ToCharArray();
char[] seq2Array = seq2.ToCharArray();
//Create Title
sb.Append('#');
sb.Append('\t');
for (int j = 0; j < seq1Array.Length; j++)
{
sb.Append('\t');
sb.Append(seq1Array[j]);
}
sb.Append('\n');
// dump array
for (int i = 0; i < seq2Array.Length + 1; i++)
{
sb.Append((i > 0) ? seq2Array[i - 1] : '-');
for (int j = 0; j < seq1Array.Length + 1; j++)
{
sb.Append('\t');
sb.Append(charArray[i, j]);
}
sb.Append('\n');
}
System.Console.Write(sb.ToString());
}
public void print_matrix(int[,] intArray, string seq1, string seq2)
{
StringBuilder sb = new StringBuilder();
char[] seq1Array = seq1.ToCharArray();
char[] seq2Array = seq2.ToCharArray();
//Create Title
sb.Append('#');
sb.Append('\t');
for (int j = 0; j < seq1Array.Length; j++)
{
sb.Append('\t');
sb.Append(seq1Array[j]);
}
sb.Append('\n');
// dump array
for (int i = 0; i < seq2Array.Length + 1; i++)
{
sb.Append((i > 0) ? seq2Array[i - 1] : '-');
for (int j = 0; j < seq1Array.Length + 1; j++)
{
sb.Append('\t' + intArray[i, j].ToString());
}
sb.Append('\n');
}
System.Console.Write(sb.ToString());
}
}
=========================== 以上是程式碼 ======================
且慢,如果你不是這個領域的朋友,別急著轉台....
##ReadMore##其實是 sequence alignment 也是很有趣及應用價值的。Alignment 的目標在於找尋兩段序列(文章也是一種序列)相似處,比方說,最近我發現了網路上有一篇文章,有種似曾相識的感覺,我懷疑是有人人惡意抄襲了我Blog 的文章,他把裡面的內容改一改字(mutant),加加減減(insert, delete),就好像自己的新文章一樣。他一定看準了,我一個字一個字對,可能會把頭髮給對白了,不過,他沒想到,我剛寫好了 Needleman-wunsche (NW)的演算法,不用一秒,NW 演算法可以找出兩篇文章的脈絡,並且告訴我們兩篇有多像了。
不過有些盜文者很聰明,他們知道我會 NW 演算法,所以他們採取截錄的方法,只偷一段,然後把幾段拚起來,就成了一篇新報告,因為文章結構不同,NW 就沒有那麼靈敏。不過,他們沒想到,我也順便寫了,Smith-waterman(SW) 的演算法,SW法會找出兩篇文章內,最相似的段落,只要有盜文,就抓的到。
這些看似來無趣的數學運算,用對了地方也是很有趣的,又到了期末交報告的季節了,以前幫忙改過學生的作業,改著改著常有似曾相識的感覺,天下文章一大抄,後 來發現其實原創的版本有限,大多都是由原版演化出來的新版本。有了這兩個演算法,我想應該可以幫學生報告建立親源樹,找出原創版本是誰,然後誰跟誰借來 抄,第二版又借給誰,那一個版本比較流行.....好像想太多了,好了,如果你是非生資領域的朋友,可以轉台了,接下來聊聊演算法。
Sequence alignment 的演算法, 一個非常複雜的問題, 因為兩段序列要對在一起可能有非常多種組合,任何一個位子,都可以突變,插入或剔除,使用遞迴的方法會產生非常多重複的計算,不但非常沒有效率,序列長了甚至癱瘓系統。為了解決這個問題,因而衍生出dynamic programming的寫法,有關 alignment 的原理及程式寫法,在這篇有詳細的介紹,用看的他的說明可能有點難懂,不過,如果有程式基礎,看source code 應該對了解原理會很有幫助。
Dynamic programming 的理論出來非常久了,我想要找公開的程式碼來用應該不難,所以我問了孤狗大神,大神說有!但只有JAVA 跟 C++的 版本,就是沒有C# 版(用DOT NET 在科學界是孤鳥啊!),我也不想為了這一小段碼,把之前寫的程式,全翻成 JAVA or C++ 。只能參考他們程式碼用C# 重寫一次,個人不太喜歡JAVA 的版本,物件化過了頭,結構很漂亮,但是效能上郤不如一個C++版簡單的陣列有效。我沒學過C/C++ ,為了看懂C++特別去圖館借了兩本入門書來修綀,經過一夜的閉關,C++的天書開始顯出影像,雖然還沒有練到十成功力,但至少可以把C++版的 Code 改寫成C#了,而且原本洛洛長的碼,用純物件的C#,變得很簡潔易讀,念在我用的MSDN是免錢的份上,我把這段碼放迴饋給科學界其他的孤鳥們,如果你想用C# 開發輕量級Alignmemt 程式,這個class 或許對你有些幫助。
使用方式:
For Needleman-Wunsch method
NeedlemanWunsch nw = new NeedlemanWunsch(2, 2, -1); \\ 參數設定參見程式碼
AlignResult newResult = nw.Align(seq1, seq2);
Needleman-Wunsch 會傳回一個最佳的 alignment 結果。
For Smith-Waterman method
SmithWaterman sw = new SmithWaterman(2, 2, -1);
AlignResult[] newResults = sw.Align(seq1, seq2);
Smith-Waterman可能會有多個最佳的 alignment 結果,所以陣列方式傳回。
For present result
AlignResult.MatchLength;
會傳回比對結果共有多少的base 可以對上。
AlignResult.MatchString;
傳回比對果字串,並三行,第一三行是 sequence, 第二行是兩 sequence間,base-base 的關係(| match, : mismatch, . gap)
好了,廢話不多說,看碼.....
=========================== 以下是程式碼 ======================
public class AlignResult
{
public string Name = string.Empty;
public override string ToString()
{
return Name;
}
public AlignResult(string seq1, string seq2)
{
seq_1_aligned = seq1;
seq_2_aligned = seq2;
}
private string seq_1_aligned = string.Empty;
private string seq_2_aligned = string.Empty;
public string MatchString
{
get
{
StringBuilder sb = new StringBuilder();
char[] array1 = seq_1_aligned.ToCharArray();
char[] array2 = seq_2_aligned.ToCharArray();
for (int i = 0; i < array1.Length; i++)
{
if (array1[i] == '-' || array2[i] == '-')
{
sb.Append('.');
}
else
{
sb.Append((array1[i] == array2[i]) ? '|' : ':');
}
}
StringBuilder rsb = new StringBuilder();
rsb.Append(seq_1_aligned + "\n");
rsb.Append(sb.ToString() + "\n");
rsb.Append(seq_2_aligned + "\n");
return rsb.ToString();
}
}
public int MatchLength
{
get
{
StringBuilder sb = new StringBuilder();
char[] array1 = seq_1_aligned.ToCharArray();
char[] array2 = seq_2_aligned.ToCharArray();
int k = 0;
for (int i = 0; i < array1.Length; i++)
{
if (array1[i] == array2[i]) k++;
}
return k;
}
}
}
public class NeedlemanWunsch
{
public int GapPenalty = 2;
public int MatchScore = 2;
public int MisMatchScore = -1;
public NeedlemanWunsch() { }
/// <summary>
/// constructor of Needleman Wunsch aligner
/// </summary>
/// <param name="gap"> minus score when a gap happend</param>
/// <param name="match">add score when bases are perfect-match</param>
/// <param name="mismatch">add score when bases are mis-match</param>
public NeedlemanWunsch(int gap, int match, int mismatch)
{
GapPenalty = gap;
MatchScore = match;
MisMatchScore = mismatch;
}
/// <summary>
/// Align two sequence and return align result
/// </summary>
/// <param name="seq1">sequence 1 as string (case sensitive)</param>
/// <param name="seq2">sequence 2 as string (case sensitive)</param>
/// <returns></returns>
public AlignResult Align(string seq1, string seq2)
{
#region component
int l1 = seq1.Length;
int l2 = seq2.Length;
int[,] dpMatrix = new int[l2 + 1, l1 + 1]; //dynamic program matrix
char[,] pointerMatrix = new char[l2 + 1, l1 + 1]; //trace back char matrix
#endregion
#region fill dynamic program and traceBack matrix
char[] seq_1 = seq1.ToCharArray();
char[] seq_2 = seq2.ToCharArray();
this.dpm_init(dpMatrix, pointerMatrix, l1, l2, GapPenalty); //initial matrix
for (int i = 1; i <= l2; i++)
{
for (int j = 1; j <= l1; j++)
{
char ptr; //pointer of each cell
int fU, fD, fL; //score of upper, diagon and left cells
int match = (seq_1[j - 1] == seq_2[i - 1]) ? MatchScore : MisMatchScore;
fU = dpMatrix[i - 1, j] - GapPenalty; //if up shift will generate a gap, so give a penalty.
fD = dpMatrix[i - 1, j - 1] + match; //if not shift, whether base is macth or not, give a match score
fL = dpMatrix[i, j - 1] - GapPenalty; //if left shift will generate a gap, so give a penalty
// Assing max score among fU, fD, fL to current cell
int max = 0;
if (fD >= fU && fD >= fL) // diagonal is prefered
{ max = fD; ptr = '\\'; } // next cell is diagonal cell
else if (fU > fL)
{ max = fU; ptr = '|'; } // next cell is upper cell
else
{ max = fL; ptr = '-'; } // next cell is left cell
dpMatrix[i, j] = max;
pointerMatrix[i, j] = ptr;
}
}
#endregion
/* Debug code for present dynamic program score and pointer matrix
Console.WriteLine("DynamicMatrix");
print_matrix(dpMatrix, seq1, seq2);
Console.WriteLine("TrackBackMatrix");
print_traceback (pointerMatrix, seq1, seq2);
*/
#region Trace Back and create aligned string
// track back form the most left-bottom cornner
int ti = l2--;
int tj = l1--;
// string builder for align string
StringBuilder seq1Builder = new StringBuilder();
StringBuilder seq2Builder = new StringBuilder();
/*
pointerMatrix[x,y]
* n # S E Q 1 ( y axis)
#
S
E
Q
2
(x-axis)
*/
while (ti > 0 || tj > 0) // walk to the most right-top cornner
{
switch (pointerMatrix[ti, tj])
{
//move to up cell, shift x, Seq2 get a base and Seq1 get a "-"
case '|':
seq1Builder.Insert(0, '-');
seq2Builder.Insert(0, seq2[ti - 1]);
ti--;
break;
//move to diagonal cell, both Seq1 and Seq 2 add a base
case '\\':
seq1Builder.Insert(0, seq_1[tj - 1]);
seq2Builder.Insert(0, seq2[ti - 1]);
ti--;
tj--;
break;
//move to left cell, Seq 1 get a base, Seq 2 get a "-"
case '-':
seq1Builder.Insert(0, seq_1[tj - 1]);
seq2Builder.Insert(0, '-');
tj--;
break;
default:
break;
}
}
#endregion
return new AlignResult(seq1Builder.ToString(), seq2Builder.ToString());
}
public void dpm_init(int[,] dpMatrix, char[,] traceBackMatrix, int seq1Length, int seq2Length, int gapPenalty)
{
dpMatrix[0, 0] = 0;
traceBackMatrix[0, 0] = 'n';
for (int i = 1; i <= seq1Length; i++)
{
dpMatrix[0, i] = -i * gapPenalty;
traceBackMatrix[0, i] = '-';
}
for (int i = 1; i <= seq2Length; i++)
{
dpMatrix[i, 0] = -i * gapPenalty;
traceBackMatrix[i, 0] = '|';
}
}
// methods for debug
public void print_matrix(int[,] intArray, string seq1, string seq2)
{
StringBuilder sb = new StringBuilder();
char[] seq1Array = seq1.ToCharArray();
char[] seq2Array = seq2.ToCharArray();
//Create Title
sb.Append('#');
sb.Append('\t');
for (int j = 0; j < seq1Array.Length; j++)
{
sb.Append('\t');
sb.Append(seq1Array[j]);
}
sb.Append('\n');
// dump array
for (int i = 0; i < seq2Array.Length + 1; i++)
{
sb.Append((i > 0) ? seq2Array[i - 1] : '-');
for (int j = 0; j < seq1Array.Length + 1; j++)
{
sb.Append('\t' + intArray[i, j].ToString());
}
sb.Append('\n');
}
System.Console.Write(sb.ToString());
}
public void print_traceback(char[,] charArray, string seq1, string seq2)
{
StringBuilder sb = new StringBuilder();
char[] seq1Array = seq1.ToCharArray();
char[] seq2Array = seq2.ToCharArray();
//Create Title
sb.Append('#');
sb.Append('\t');
for (int j = 0; j < seq1Array.Length; j++)
{
sb.Append('\t');
sb.Append(seq1Array[j]);
}
sb.Append('\n');
// dump array
for (int i = 0; i < seq2Array.Length + 1; i++)
{
sb.Append((i > 0) ? seq2Array[i - 1] : '-');
for (int j = 0; j < seq1Array.Length + 1; j++)
{
sb.Append('\t');
sb.Append(charArray[i, j]);
}
sb.Append('\n');
}
System.Console.Write(sb.ToString());
}
}
public class SmithWaterman
{
//internal class to store entry points fo local align fragments
class seed
{
private int x = 0;
private int y = 0;
public seed(int sx, int sy)
{
x = sx;
y = sy;
}
public int X
{ get { return x; } }
public int Y
{ get { return y; } }
}
public int GapPenalty = 2;
public int MatchScore = 2;
public int MisMatchScore = -1;
public SmithWaterman() { }
/// <summary>
/// constructor of Smith Waterman aligner
/// </summary>
/// <param name="gap"> minus score when a gap happend</param>
/// <param name="match">add score when bases are perfect-match</param>
/// <param name="mismatch">add score when bases are mis-match</param>
public SmithWaterman(int gap, int match, int mismatch)
{
GapPenalty = gap;
MatchScore = match;
MisMatchScore = mismatch;
}
/// <summary>
/// Align two sequence and return align result
/// </summary>
/// <param name="seq1">sequence 1 as string (case sensitive)</param>
/// <param name="seq2">sequence 2 as string (case sensitive)</param>
/// <returns>align result array of all highest score entry point</returns>
public AlignResult[] Align(string seq1, string seq2)
{
#region component
int l1 = seq1.Length;
int l2 = seq2.Length;
int[,] dpMatrix = new int[l2 + 1, l1 + 1]; //dynamic program matrix
char[,] pointerMatrix = new char[l2 + 1, l1 + 1]; //pointer matrix
#endregion
#region fill dynamic program
char[] seq_1 = seq1.ToCharArray();
char[] seq_2 = seq2.ToCharArray();
// inital dynamic matrix with zero
this.dpm_init(dpMatrix, pointerMatrix, l1, l2);
// fill dpMatrix and pointer matrix as Needleman Wunsch algorithmn
// two different point are marked below
int fU, fD, fL;
for (int i = 1; i <= l2; i++)
{
for (int j = 1; j <= l1; j++)
{
int match = (seq_1[j - 1] == seq_2[i - 1]) ? MatchScore : MisMatchScore;
fU = dpMatrix[i - 1, j] - GapPenalty;
fD = dpMatrix[i - 1, j - 1] + match;
fL = dpMatrix[i, j - 1] - GapPenalty;
int max = 0;
char ptr;
if (fD >= fU && fD >= fL)
{
max = fD;
ptr = '\\';
}
else if (fU > fL)
{
max = fU;
ptr = '|';
}
else
{
max = fL;
ptr = '-';
}
dpMatrix[i, j] = (max < 0) ? 0 : max; //S.W. method score will be not smaller than Zero;
pointerMatrix[i, j] = (dpMatrix[i, j] == 0) ? '.' : ptr; //S.W. method trace back will stop at cell with zero score (marked with ".");
}
}
#endregion
/* Debug code for present dynamic program ad pointer matrix
Console.WriteLine("DynamicMatrix");
print_matrix(dpMatrix, seq1, seq2);
Console.WriteLine("PointerMatrix");
print_traceback(pointerMatrix,seq1,seq2);
*/
// S.W method walk from the high score entry point (seed)
// The high score cell might be not unique.
#region find highest score seed
List<seed> highList = new List<seed>();
int highestScore = 0;
for (int i = 0; i <= l2; i++)
{
for (int j = 0; j <= l1; j++)
{
if (dpMatrix[i, j] >= highestScore)
{
if (dpMatrix[i, j] > highestScore)
{
highList.Clear();
highestScore = dpMatrix[i, j];
}
seed newEntry = new seed(i, j);
highList.Add(newEntry);
}
}
}
#endregion
//One seed will have one align result
// All aligned result will return as a AlignResult array
#region track back to zero
List<AlignResult> align = new List<AlignResult>();
foreach (seed s in highList)
{
align.Add(traceBack(pointerMatrix, seq1, seq2, s.X, s.Y));
}
return align.ToArray();
#endregion
}
/// <summary>
/// The trace back procedure of Smith Wunsch algorithm is similar to needlemen
/// but:
/// 1. start from the high score cells
/// 2. end at zero-score cell or 0,0 point
/// </summary>
public AlignResult traceBack(char[,] traceBackMatrix, string seq1, string seq2, int x, int y)
{
// start from the highest cells
int ti = x;
int tj = y;
char[] seq_1 = seq1.ToCharArray();
char[] seq_2 = seq2.ToCharArray();
StringBuilder seq1Builder = new StringBuilder();
StringBuilder seq2Builder = new StringBuilder();
while (ti > 0 || tj > 0) // stop at (0,0) point
{
if (traceBackMatrix[ti, tj] == '.') break; //stop at zero cell
switch (traceBackMatrix[ti, tj])
{
case '|':
seq1Builder.Insert(0, '-');
seq2Builder.Insert(0, seq_2[ti - 1]);
ti--;
break;
case '\\':
seq1Builder.Insert(0, seq_1[tj - 1]);
seq2Builder.Insert(0, seq_2[ti - 1]);
ti--;
tj--;
break;
case '-':
seq1Builder.Insert(0, seq_1[tj - 1]);
seq2Builder.Insert(0, '-');
tj--;
break;
default:
break;
}
}
return new AlignResult(seq1Builder.ToString(), seq2Builder.ToString());
}
public void dpm_init(int[,] dpMatrix, char[,] traceBackMatrix, int seq1Length, int seq2Length)
{
dpMatrix[0, 0] = 0;
traceBackMatrix[0, 0] = '.';
for (int i = 1; i <= seq1Length; i++)
{
dpMatrix[0, i] = 0;
traceBackMatrix[0, i] = '.';
}
for (int i = 1; i <= seq2Length; i++)
{
dpMatrix[i, 0] = 0;
traceBackMatrix[i, 0] = '.';
}
}
// methods for debug
public void print_traceback(char[,] charArray, string seq1, string seq2)
{
StringBuilder sb = new StringBuilder();
char[] seq1Array = seq1.ToCharArray();
char[] seq2Array = seq2.ToCharArray();
//Create Title
sb.Append('#');
sb.Append('\t');
for (int j = 0; j < seq1Array.Length; j++)
{
sb.Append('\t');
sb.Append(seq1Array[j]);
}
sb.Append('\n');
// dump array
for (int i = 0; i < seq2Array.Length + 1; i++)
{
sb.Append((i > 0) ? seq2Array[i - 1] : '-');
for (int j = 0; j < seq1Array.Length + 1; j++)
{
sb.Append('\t');
sb.Append(charArray[i, j]);
}
sb.Append('\n');
}
System.Console.Write(sb.ToString());
}
public void print_matrix(int[,] intArray, string seq1, string seq2)
{
StringBuilder sb = new StringBuilder();
char[] seq1Array = seq1.ToCharArray();
char[] seq2Array = seq2.ToCharArray();
//Create Title
sb.Append('#');
sb.Append('\t');
for (int j = 0; j < seq1Array.Length; j++)
{
sb.Append('\t');
sb.Append(seq1Array[j]);
}
sb.Append('\n');
// dump array
for (int i = 0; i < seq2Array.Length + 1; i++)
{
sb.Append((i > 0) ? seq2Array[i - 1] : '-');
for (int j = 0; j < seq1Array.Length + 1; j++)
{
sb.Append('\t' + intArray[i, j].ToString());
}
sb.Append('\n');
}
System.Console.Write(sb.ToString());
}
}
=========================== 以上是程式碼 ======================
我沒有轉台...我是快轉 ^^
回覆刪除哇~這對我來說簡直是天書!
回覆刪除前面的中文我還看得懂
後面的「希臘文」我就不明瞭了
ㄟ~跳過!
你這篇寫的真的太好了
回覆刪除容我引用在我的 blog
http://tw.myblog.yahoo.com/huiyi-li/
我會寫上來源出處 感謝你的大方提供喔~~~
感激不盡XD
可否請問一下用動態規畫的方式去找LCS的時候,如果兩個字串有兩種一上相同的字串 那麼這樣的作法是不是只找出其中的一種而已。是否會有所影響,因為最近剛學到演算法的時候,發現有這樣的可能性,所以不太了解用於序列的比對上面會有影響嗎?? 非常謝謝
回覆刪除SW 會找出所有可能的 相同的字串,所以是可行的,但尋找 LCS 有更厲害的演算法,你可以參考 http://www.csie.ntnu.edu.tw/~u91029/LongestCommonSubsequence.html
回覆刪除