Sequence alignment by C#

為了手上一個project, 不得不涉獵Sequence alignment, 這個類分析一個非常複雜的問題, 但又十分重要。在生物學上,我們是用這個方法來比對基因序列,例如貓跟狗的基因像不像,小貓熊寶寶的爹有沒有可能是隔壁籠的台灣黑熊。聽其來很有意思,不過,事實上,非常的枯燥,我們工作的結果像是下面這段序列比對一樣,基本上,是拿兩本有字天書,逐字比對。



且慢,如果你不是這個領域的朋友,別急著轉台....


##ReadMore##其實是 sequence alignment 也是很有趣及應用價值的。Alignment 的目標在於找尋兩段序列(文章也是一種序列)相似處,比方說,最近我發現了網路上有一篇文章,有種似曾相識的感覺,我懷疑是有人人惡意抄襲了我Blog 的文章,他把裡面的內容改一改字(mutant),加加減減(insert, delete),就好像自己的新文章一樣。他一定看準了,我一個字一個字對,可能會把頭髮給對白了,不過,他沒想到,我剛寫好了 Needleman-wunsche (NW)的演算法,不用一秒,NW 演算法可以找出兩篇文章的脈絡,並且告訴我們兩篇有多像了。




不過有些盜文者很聰明,他們知道我會 NW 演算法,所以他們採取截錄的方法,只偷一段,然後把幾段拚起來,就成了一篇新報告,因為文章結構不同,NW 就沒有那麼靈敏。不過,他們沒想到,我也順便寫了,Smith-waterman(SW) 的演算法,SW法會找出兩篇文章內,最相似的段落,只要有盜文,就抓的到。



這些看似來無趣的數學運算,用對了地方也是很有趣的,又到了期末交報告的季節了,以前幫忙改過學生的作業,改著改著常有似曾相識的感覺,天下文章一大抄,後 來發現其實原創的版本有限,大多都是由原版演化出來的新版本。有了這兩個演算法,我想應該可以幫學生報告建立親源樹,找出原創版本是誰,然後誰跟誰借來 抄,第二版又借給誰,那一個版本比較流行.....好像想太多了,好了,如果你是非生資領域的朋友,可以轉台了,接下來聊聊演算法。

Sequence alignment 的演算法, 一個非常複雜的問題, 因為兩段序列要對在一起可能有非常多種組合,任何一個位子,都可以突變,插入或剔除,使用遞迴的方法會產生非常多重複的計算,不但非常沒有效率,序列長了甚至癱瘓系統。為了解決這個問題,因而衍生出dynamic programming的寫法,有關 alignment 的原理及程式寫法,在這篇有詳細的介紹,用看的他的說明可能有點難懂,不過,如果有程式基礎,看source code 應該對了解原理會很有幫助。

Dynamic programming 的理論出來非常久了,我想要找公開的程式碼來用應該不難,所以我問了孤狗大神,大神說有!但只有JAVA C++的 版本,就是沒有C# 版(用DOT NET 在科學界是孤鳥啊!),我也不想為了這一小段碼,把之前寫的程式,全翻成 JAVA or C++ 。只能參考他們程式碼用C# 重寫一次,個人不太喜歡JAVA 的版本,物件化過了頭,結構很漂亮,但是效能上郤不如一個C++版簡單的陣列有效。我沒學過C/C++ ,為了看懂C++特別去圖館借了兩本入門書來修綀,經過一夜的閉關,C++的天書開始顯出影像,雖然還沒有練到十成功力,但至少可以把C++版的 Code 改寫成C#了,而且原本洛洛長的碼,用純物件的C#,變得很簡潔易讀,念在我用的MSDN是免錢的份上,我把這段碼放迴饋給科學界其他的孤鳥們,如果你想用C# 開發輕量級Alignmemt 程式,這個class 或許對你有些幫助。


使用方式:
For Needleman-Wunsch method

NeedlemanWunsch nw = new NeedlemanWunsch(2, 2, -1); \\ 參數設定參見程式碼
AlignResult newResult = nw.Align(seq1, seq2);


Needleman-Wunsch 會傳回一個最佳的 alignment 結果。

For Smith-Waterman method

SmithWaterman sw = new SmithWaterman(2, 2, -1);
AlignResult[] newResults = sw.Align(seq1, seq2);


Smith-Waterman可能會有多個最佳的 alignment 結果,所以陣列方式傳回。

For present result

AlignResult.MatchLength;

會傳回比對結果共有多少的base 可以對上。

AlignResult.MatchString;

傳回比對果字串,並三行,第一三行是 sequence, 第二行是兩 sequence間,base-base 的關係(| match, : mismatch, . gap)

好了,廢話不多說,看碼.....

=========================== 以下是程式碼 ======================

    public class AlignResult
    {
        public string Name = string.Empty;
        public override string ToString()
        {
            return Name;
        }
        public AlignResult(string seq1, string seq2)
        {
            seq_1_aligned = seq1;
            seq_2_aligned = seq2;
        }
        private string seq_1_aligned = string.Empty;
        private string seq_2_aligned = string.Empty;
        public string MatchString
        {
            get
            {
                StringBuilder sb = new StringBuilder();
                char[] array1 = seq_1_aligned.ToCharArray();
                char[] array2 = seq_2_aligned.ToCharArray();
                for (int i = 0; i < array1.Length; i++)
                {
                    if (array1[i] == '-' || array2[i] == '-')
                    {
                        sb.Append('.');
                    }
                    else
                    {
                        sb.Append((array1[i] == array2[i]) ? '|' : ':');
                    }
                }
                StringBuilder rsb = new StringBuilder();
                rsb.Append(seq_1_aligned + "\n");
                rsb.Append(sb.ToString() + "\n");
                rsb.Append(seq_2_aligned + "\n");
                return rsb.ToString();
            }
        }
        public int MatchLength
        {
            get
            {
                StringBuilder sb = new StringBuilder();
                char[] array1 = seq_1_aligned.ToCharArray();
                char[] array2 = seq_2_aligned.ToCharArray();
                int k = 0;
                for (int i = 0; i < array1.Length; i++)
                {
                    if (array1[i] == array2[i]) k++;
                }
                return k;
            }
        }
    }
    public class NeedlemanWunsch
    {
        public int GapPenalty = 2;
        public int MatchScore = 2;
        public int MisMatchScore = -1;
        public NeedlemanWunsch() { }
        /// <summary>
        /// constructor of Needleman Wunsch aligner
        /// </summary>
        /// <param name="gap"> minus score when a gap happend</param>
        /// <param name="match">add score when bases are perfect-match</param>
        /// <param name="mismatch">add score when bases are mis-match</param>
        public NeedlemanWunsch(int gap, int match, int mismatch)
        {
            GapPenalty = gap;
            MatchScore = match;
            MisMatchScore = mismatch;
        }

        /// <summary>
        /// Align two sequence and return align result
        /// </summary>
        /// <param name="seq1">sequence 1 as string (case sensitive)</param>
        /// <param name="seq2">sequence 2 as string (case sensitive)</param>
        /// <returns></returns>
        public AlignResult Align(string seq1, string seq2)
        {
            #region component
            int l1 = seq1.Length;
            int l2 = seq2.Length;
            int[,] dpMatrix = new int[l2 + 1, l1 + 1];   //dynamic program matrix
            char[,] pointerMatrix = new char[l2 + 1, l1 + 1]; //trace back char matrix
            #endregion

            #region fill dynamic program and traceBack matrix
            char[] seq_1 = seq1.ToCharArray();
            char[] seq_2 = seq2.ToCharArray();
            this.dpm_init(dpMatrix, pointerMatrix, l1, l2, GapPenalty); //initial matrix
            for (int i = 1; i <= l2; i++)
            {
                for (int j = 1; j <= l1; j++)
                {
                    char ptr; //pointer of each cell
                    int fU, fD, fL; //score of upper, diagon and left cells
                    int match = (seq_1[j - 1] == seq_2[i - 1]) ? MatchScore : MisMatchScore;
                    fU = dpMatrix[i - 1, j] - GapPenalty;     //if up shift will generate a gap, so give a penalty.
                    fD = dpMatrix[i - 1, j - 1] + match;        //if not shift, whether base is macth or not, give a match score
                    fL = dpMatrix[i, j - 1] - GapPenalty;      //if left shift will generate a gap, so give a penalty

                    // Assing max score among fU, fD, fL to current cell                          
                    int max = 0;
                    if (fD >= fU && fD >= fL)    // diagonal is prefered
                    { max = fD; ptr = '\\'; }      // next cell is diagonal cell
                    else if (fU > fL)
                    { max = fU; ptr = '|'; }        // next cell is upper cell
                    else
                    { max = fL; ptr = '-'; }         // next cell is left cell

                    dpMatrix[i, j] = max;
                    pointerMatrix[i, j] = ptr;
                }
            }
            #endregion

            /* Debug code for present dynamic program score and pointer matrix
                    Console.WriteLine("DynamicMatrix");
                    print_matrix(dpMatrix, seq1, seq2);
                    Console.WriteLine("TrackBackMatrix");
                    print_traceback (pointerMatrix, seq1, seq2);
                     */

            #region Trace Back and create aligned string
            // track back form the most left-bottom cornner
            int ti = l2--;
            int tj = l1--;
            // string builder for align string
            StringBuilder seq1Builder = new StringBuilder();
            StringBuilder seq2Builder = new StringBuilder();

            /*
            pointerMatrix[x,y]                    
             * n # S E Q 1 ( y axis)
            #
            S
            E
            Q
            2                   
             (x-axis)
            */

            while (ti > 0 || tj > 0) // walk to the most right-top cornner
            {
                switch (pointerMatrix[ti, tj])
                {
                    //move to up cell, shift x, Seq2 get a base and Seq1 get a "-"
                    case '|':
                        seq1Builder.Insert(0, '-');
                        seq2Builder.Insert(0, seq2[ti - 1]);
                        ti--;
                        break;
                    //move to diagonal cell, both Seq1 and Seq 2 add a base
                    case '\\':
                        seq1Builder.Insert(0, seq_1[tj - 1]);
                        seq2Builder.Insert(0, seq2[ti - 1]);
                        ti--;
                        tj--;
                        break;
                    //move to left cell, Seq 1 get a base, Seq 2 get a "-"
                    case '-':
                        seq1Builder.Insert(0, seq_1[tj - 1]);
                        seq2Builder.Insert(0, '-');
                        tj--;
                        break;
                    default:
                        break;
                }
            }
            #endregion

            return new AlignResult(seq1Builder.ToString(), seq2Builder.ToString());
        }
        public void dpm_init(int[,] dpMatrix, char[,] traceBackMatrix, int seq1Length, int seq2Length, int gapPenalty)
        {
            dpMatrix[0, 0] = 0;
            traceBackMatrix[0, 0] = 'n';
            for (int i = 1; i <= seq1Length; i++)
            {
                dpMatrix[0, i] = -i * gapPenalty;
                traceBackMatrix[0, i] = '-';
            }
            for (int i = 1; i <= seq2Length; i++)
            {
                dpMatrix[i, 0] = -i * gapPenalty;
                traceBackMatrix[i, 0] = '|';
            }
        }
        // methods for debug
        public void print_matrix(int[,] intArray, string seq1, string seq2)
        {
            StringBuilder sb = new StringBuilder();
            char[] seq1Array = seq1.ToCharArray();
            char[] seq2Array = seq2.ToCharArray();
            //Create Title
            sb.Append('#');
            sb.Append('\t');
            for (int j = 0; j < seq1Array.Length; j++)
            {
                sb.Append('\t');
                sb.Append(seq1Array[j]);
            }
            sb.Append('\n');
            // dump array
            for (int i = 0; i < seq2Array.Length + 1; i++)
            {
                sb.Append((i > 0) ? seq2Array[i - 1] : '-');
                for (int j = 0; j < seq1Array.Length + 1; j++)
                {
                    sb.Append('\t' + intArray[i, j].ToString());
                }
                sb.Append('\n');
            }
            System.Console.Write(sb.ToString());

        }
        public void print_traceback(char[,] charArray, string seq1, string seq2)
        {
            StringBuilder sb = new StringBuilder();
            char[] seq1Array = seq1.ToCharArray();
            char[] seq2Array = seq2.ToCharArray();
            //Create Title
            sb.Append('#');
            sb.Append('\t');
            for (int j = 0; j < seq1Array.Length; j++)
            {
                sb.Append('\t');
                sb.Append(seq1Array[j]);
            }
            sb.Append('\n');
            // dump array
            for (int i = 0; i < seq2Array.Length + 1; i++)
            {
                sb.Append((i > 0) ? seq2Array[i - 1] : '-');
                for (int j = 0; j < seq1Array.Length + 1; j++)
                {
                    sb.Append('\t');
                    sb.Append(charArray[i, j]);
                }
                sb.Append('\n');
            }
            System.Console.Write(sb.ToString());


        }
    }
    public class SmithWaterman
    {
        //internal class to store entry points fo local align fragments
        class seed
        {
            private int x = 0;
            private int y = 0;
            public seed(int sx, int sy)
            {
                x = sx;
                y = sy;
            }
            public int X
            { get { return x; } }
            public int Y
            { get { return y; } }
        }

        public int GapPenalty = 2;
        public int MatchScore = 2;
        public int MisMatchScore = -1;
        public SmithWaterman() { }
        /// <summary>
        /// constructor of Smith Waterman aligner
        /// </summary>
        /// <param name="gap"> minus score when a gap happend</param>
        /// <param name="match">add score when bases are perfect-match</param>
        /// <param name="mismatch">add score when bases are mis-match</param>
        public SmithWaterman(int gap, int match, int mismatch)
        {
            GapPenalty = gap;
            MatchScore = match;
            MisMatchScore = mismatch;
        }
        /// <summary>
        /// Align two sequence and return align result
        /// </summary>
        /// <param name="seq1">sequence 1 as string (case sensitive)</param>
        /// <param name="seq2">sequence 2 as string (case sensitive)</param>
        /// <returns>align result array of all highest score entry point</returns>
        public AlignResult[] Align(string seq1, string seq2)
        {
            #region component
            int l1 = seq1.Length;
            int l2 = seq2.Length;
            int[,] dpMatrix = new int[l2 + 1, l1 + 1];   //dynamic program matrix
            char[,] pointerMatrix = new char[l2 + 1, l1 + 1];  //pointer matrix
            #endregion

            #region fill dynamic program
            char[] seq_1 = seq1.ToCharArray();
            char[] seq_2 = seq2.ToCharArray();
            // inital dynamic matrix with zero
            this.dpm_init(dpMatrix, pointerMatrix, l1, l2);

            // fill dpMatrix and pointer matrix as Needleman Wunsch algorithmn
            // two different point are marked below
            int fU, fD, fL;
            for (int i = 1; i <= l2; i++)
            {
                for (int j = 1; j <= l1; j++)
                {
                    int match = (seq_1[j - 1] == seq_2[i - 1]) ? MatchScore : MisMatchScore;
                    fU = dpMatrix[i - 1, j] - GapPenalty;
                    fD = dpMatrix[i - 1, j - 1] + match;
                    fL = dpMatrix[i, j - 1] - GapPenalty;
                    int max = 0;
                    char ptr;
                    if (fD >= fU && fD >= fL)
                    {
                        max = fD;
                        ptr = '\\';
                    }
                    else if (fU > fL)
                    {
                        max = fU;
                        ptr = '|';
                    }
                    else
                    {
                        max = fL;
                        ptr = '-';
                    }
                    dpMatrix[i, j] = (max < 0) ? 0 : max;  //S.W. method score will be not smaller than Zero;
                    pointerMatrix[i, j] = (dpMatrix[i, j] == 0) ? '.' : ptr; //S.W. method trace back will stop at cell with zero score (marked with ".");
                }
            }
            #endregion

            /* Debug code for present dynamic program ad pointer matrix
                   
                     Console.WriteLine("DynamicMatrix");
                     print_matrix(dpMatrix, seq1, seq2);                   
                     Console.WriteLine("PointerMatrix");
                     print_traceback(pointerMatrix,seq1,seq2);                    
                     */


            // S.W method walk from the high score entry point (seed)
            // The high score cell might be not unique.
            #region find highest score seed
            List<seed> highList = new List<seed>();
            int highestScore = 0;
            for (int i = 0; i <= l2; i++)
            {
                for (int j = 0; j <= l1; j++)
                {
                    if (dpMatrix[i, j] >= highestScore)
                    {
                        if (dpMatrix[i, j] > highestScore)
                        {
                            highList.Clear();
                            highestScore = dpMatrix[i, j];
                        }
                        seed newEntry = new seed(i, j);
                        highList.Add(newEntry);
                    }
                }
            }
            #endregion

            //One seed will have one align result
            // All aligned result will return as a AlignResult array
            #region track back to zero
            List<AlignResult> align = new List<AlignResult>();
            foreach (seed s in highList)
            {
                align.Add(traceBack(pointerMatrix, seq1, seq2, s.X, s.Y));
            }
            return align.ToArray();
            #endregion
        }

        /// <summary>
        /// The trace back procedure of Smith Wunsch algorithm is similar to needlemen
        /// but:
        /// 1. start from the high score cells
        /// 2. end at zero-score cell or 0,0 point
        /// </summary>
        public AlignResult traceBack(char[,] traceBackMatrix, string seq1, string seq2, int x, int y)
        {
            // start from the highest cells
            int ti = x;
            int tj = y;
            char[] seq_1 = seq1.ToCharArray();
            char[] seq_2 = seq2.ToCharArray();
            StringBuilder seq1Builder = new StringBuilder();
            StringBuilder seq2Builder = new StringBuilder();
            while (ti > 0 || tj > 0) // stop at (0,0) point
            {
                if (traceBackMatrix[ti, tj] == '.') break; //stop at zero cell
                switch (traceBackMatrix[ti, tj])
                {
                    case '|':
                        seq1Builder.Insert(0, '-');
                        seq2Builder.Insert(0, seq_2[ti - 1]);
                        ti--;
                        break;
                    case '\\':
                        seq1Builder.Insert(0, seq_1[tj - 1]);
                        seq2Builder.Insert(0, seq_2[ti - 1]);
                        ti--;
                        tj--;
                        break;
                    case '-':
                        seq1Builder.Insert(0, seq_1[tj - 1]);
                        seq2Builder.Insert(0, '-');
                        tj--;
                        break;
                    default:
                        break;
                }
            }
            return new AlignResult(seq1Builder.ToString(), seq2Builder.ToString());
        }

        public void dpm_init(int[,] dpMatrix, char[,] traceBackMatrix, int seq1Length, int seq2Length)
        {
            dpMatrix[0, 0] = 0;
            traceBackMatrix[0, 0] = '.';
            for (int i = 1; i <= seq1Length; i++)
            {
                dpMatrix[0, i] = 0;
                traceBackMatrix[0, i] = '.';
            }
            for (int i = 1; i <= seq2Length; i++)
            {
                dpMatrix[i, 0] = 0;
                traceBackMatrix[i, 0] = '.';
            }
        }

        // methods for debug
        public void print_traceback(char[,] charArray, string seq1, string seq2)
        {
            StringBuilder sb = new StringBuilder();
            char[] seq1Array = seq1.ToCharArray();
            char[] seq2Array = seq2.ToCharArray();
            //Create Title
            sb.Append('#');
            sb.Append('\t');
            for (int j = 0; j < seq1Array.Length; j++)
            {
                sb.Append('\t');
                sb.Append(seq1Array[j]);
            }
            sb.Append('\n');
            // dump array
            for (int i = 0; i < seq2Array.Length + 1; i++)
            {
                sb.Append((i > 0) ? seq2Array[i - 1] : '-');
                for (int j = 0; j < seq1Array.Length + 1; j++)
                {
                    sb.Append('\t');
                    sb.Append(charArray[i, j]);
                }
                sb.Append('\n');
            }
            System.Console.Write(sb.ToString());


        }
        public void print_matrix(int[,] intArray, string seq1, string seq2)
        {
            StringBuilder sb = new StringBuilder();
            char[] seq1Array = seq1.ToCharArray();
            char[] seq2Array = seq2.ToCharArray();
            //Create Title
            sb.Append('#');
            sb.Append('\t');
            for (int j = 0; j < seq1Array.Length; j++)
            {
                sb.Append('\t');
                sb.Append(seq1Array[j]);
            }
            sb.Append('\n');
            // dump array
            for (int i = 0; i < seq2Array.Length + 1; i++)
            {
                sb.Append((i > 0) ? seq2Array[i - 1] : '-');
                for (int j = 0; j < seq1Array.Length + 1; j++)
                {
                    sb.Append('\t' + intArray[i, j].ToString());
                }
                sb.Append('\n');
            }
            System.Console.Write(sb.ToString());

        }
    }
=========================== 以上是程式碼 ======================

留言

  1. 我沒有轉台...我是快轉 ^^

    回覆刪除
  2. 哇~這對我來說簡直是天書!
    前面的中文我還看得懂
    後面的「希臘文」我就不明瞭了
    ㄟ~跳過!

    回覆刪除
  3. 你這篇寫的真的太好了
    容我引用在我的 blog
    http://tw.myblog.yahoo.com/huiyi-li/
    我會寫上來源出處 感謝你的大方提供喔~~~
    感激不盡XD

    回覆刪除
  4. 可否請問一下用動態規畫的方式去找LCS的時候,如果兩個字串有兩種一上相同的字串 那麼這樣的作法是不是只找出其中的一種而已。是否會有所影響,因為最近剛學到演算法的時候,發現有這樣的可能性,所以不太了解用於序列的比對上面會有影響嗎?? 非常謝謝

    回覆刪除
  5. SW 會找出所有可能的 相同的字串,所以是可行的,但尋找 LCS 有更厲害的演算法,你可以參考 http://www.csie.ntnu.edu.tw/~u91029/LongestCommonSubsequence.html

    回覆刪除

張貼留言