Faculty of Science, MSc in Computer Science

accademic year 2012-2013, fall semester

This course is worth 6 EC.

Lectures take place in room 405 of Snellius Building on Mondays from 13h45 until 15h30.

The goal of this course is to give the student an overview of algorithms on strings. We will range from basic problems to more specific ones, and we will show different solutions and corresponding example for the same (or related) problem highlighting how, besides theoretical complexity issues, things change in practice, leading the student to have hands-on the difference between worst case theoretical complexity of a problem, practical behavior of algorithms, and fixed parameter tractability.

String algorithms are well suited for educational purposes as they nicely combine elegant combinatorial results with practical applications of growing common interest. Very important data of huge size can be treated as a text. One example is the web, for which indexing and search issues have clear practical impact and deserve the use of the most sophisticated algorithmic techniques. Another main application is the treitment of biological sequences for which storage and analysis tasks raise many challenging algorithmic problems.

- SEQUENCES ALIGNMENTS: Dynamic Programming approaches for Sequences Alignments: global alignments, local alignments, semiglobal alignments. Alignment with affine gap penalty function; alignment of similar strings; alignment in linear space. Computing the Longest Common Subsequence with dynamic programming. Multiple Alignments.
- PATTERN MATCHING: Quadratic time of brute force approach. Introduction to linear time and space solutions: Algorithms with preprocessing of the pattern (Boyer-Moore, Knuth-Morris-Pratt, Aho-Corasick automata) amd algorithms with preprocessing of the text (use of indexes: suffix trees).
- MOTIFS ARE REPETITIONS SEARCH: Finding motifs with the suffix trees. Motifs maximality. KMR algorithm for finding exact motifs. Methods to approximate motifs inference, and relative algorithmic approaches. The problem of finding long repetitions: possible exact approaches using filters.

- JEWELS OF STRINGOLOGY, M.Crochemore and W.Rytter, World Scientific, 2002.
- INTRODUCTION TO COMPUTATIONAL MOLECULAR BIOLOGY, J.Setubal and J.Meidanis, PWS, 1997.

Herein is the chapter on sequences alignments: alignments.pdf.

- ALGORITHMS ON STRINGS, TREES AND SEQUENCES: COMPUTER SCIENCE AND COMPUTATIONAL BIOLOGY, Dan Gusfield, Cambridge University Press, 1997.

Herein is the material on pattern matching: patternmatching1.pdf and patternmatching2.pdf.

Herein is also the material on suffix trees and suffix arrays: SuffixTree.pdf, GeneralizedST.pdf, ApplicationST.pdf, and SuffixArray.pdf.

- This paper shows how to find (exacy and approximated) motifs using a suffix tree: SagotLatin98.pdf.

- This paper shows how to find motifs using the KMR algorithm: kmr.pdf.

- This paper shows how to find motifs using the KMRC algorithm: kmrc.pdf.

**September 10th**. Introduction to the course: description of the program. Comparing strings: Hamming Distance and Edit Distance. Exponential number of possible pairwise sequences alignments. Recalling dynamic programming.

**September 17th**. The dynamic programming approach to global alignment of strings: description of the algorithm, its complexity, and and some examples.

**September 24th**. The local alignemt of two strings: dynamic programming solution, example. The semiglobal alignment of two strings: dynamic programming solution and an example.

**October 1st**. Optimizing pairwise alignments: alignment of similar strings, and alignments in linear space. How to reconstruct the alignment when using linear space. Computing the longest common subsequence of two strings with dynanic programming.

**October 8th**. Global pairwise alignments with affine gap penalty function: new way to count on an optimal substructure and consequent dynamic programming solution. An example.

**October 22nd**. Some possible approaches to multiple alignments: sum of pairs, progressive alignments possibly using trees, and star alignments. Introduction to the problem of exact pattern matching: definition, brute-force approach, and idea of the Morris Pratt algorithm.

**October 29th**. The algorithm of Morris and Pratt: the idea, the algorithm, an example, correctness proof, complexity analysis.

**November 5th**. The algorithm of Knuth, Morris, and Pratt: the idea, an example. Comparison with Morris-Pratt. The algorithm of Boyer and Moore. The algorithm of Karp and Rabin. Some words on the Aho-Corasick automata.

**November 12th**. Pattern matching with preprocessing of the text: indexing. The suffix tree: definition, properties, and examples. Generalized suffix tree.

**November 19th** The suffix array: definition and example. Properties and use for exact pattern matching. Some words on approximate pattern matching techniques. Motifs extraction: the problem, a basic solution with suffix tree.

**November 26th** The KMR algorithm (naming technique) to find exact motifs: algorithm, complexity, an example. Finding approximated motifs on the suffix trees: approach description.

**December 3rd** Finding approximated motifs on the suffix tree: an example, and complexity analysis. Finding approximate motifs using a degenerate alphabet: the KMRC algorithm.