Algorithms for Big Data
November 2014 - March 2015
- Lecture 1 (Ferragina) - Data compression: From 0th order to kth order entropy compressors: Defition of 0-th order Entropy, Arithmetic compressor and its limitations, Move-To-Front coder, RLE-coder, Burrows-Wheeler transform, bzip compressor, Definition of k-th order entropy, Compression booster. [slides, paper1, paper2]
- Lecture 2 (Ferragina) - Text indexing: Suffix arrays and their properties, pattern search, text mining. Decompressing substrings from BWT. Searching in BWT. [slides, paper1]
- Lecture 3 (Ferragina) - Rank and Select data structures for binary arrays: Solution m + o(m) which untouches the binary array, solution based on Elias-Fano's encoding. Rank and Select data structures for arbitrary array: Wavelet tree. 2d range queries, solution based on wavelet trees and their applications to two string problems. [slides, paper1]
- Lecture 4 (Ferragina) - Auto-completion search: Trie (with pointers), RMQ data structure, Recursive-solution with max-heap. Patricia Trie. Tree encodings without pointers (LOUDS). [slides]
- Lecture 5 (Ferragina) - Locality-sensitive hashing and its applications: near-duplicate detection among documents, similarity among users, nearest-neighbor search in euclidean spaces (and beyond). Shingling, min-hashing, document sketches, and full example on hamming-distance. Definition of LSH-family and proof about the performance of the LSH-algorithm. [slides, paper]
- Lecture 6 (Ferragina) - Various Pitches on the project ideas of the students.
- Lecture 7 (Marino, 18 March 2015, 15:00, Gerace) - Diameter computation in real-world huge graphs. [slides]
- Lecture 8 (Marino, 20 March 2015, 11:00, Seminari EST) - Distance distribution approximation. [slides]
- Lecture 9 (Marino, 25 March 2015, 15:00, Seminari EST) - Probabilistic counting and sketches with applications. [slides]
- Lecture 10 (Marino, 31 March 2015, 11:00, Seminari EST) - Overview on centrality measures in complex networks. Web Crawling. [slides crawling, slides centrality]
Regarding the exam, there are two possibilities: (i) a pitch in which (at most) 2 students (per group) present a problem/idea spurring from their thesis/interests and that could deploy the algorithmic techniques described in the course could (commenting deeply on the advantage/limits/scenario); (ii) a written exam with few exercises. As far as the second part of the exam (Prof. Marino) is concerned, it will rely on possibility (i); any appointment or the date of the exam can be fixed writing to andrea.marino at for.unipi.it until November 2015.