Training and Research
PhD Programme Courses/classes
Non monotonic reasoning
Credits: 3
Language: English
Teacher: Matteo Cristani
Sustainable Embodied Mechanical Intelligence
Credits: 3
Language: English
Teacher: Giovanni Gerardo Muscolo
Brain Computer Interfaces
Credits: 3
Language: English
Teacher: Silvia Francesca Storti
A practical interdisciplinary PhD course on exploratory data analysis
Credits: 4
Language: English
Teacher: Prof. Vincenzo Bonnici (Università di Parma)
Multimodal Learning and Applications
Credits: 5
Language: English
Teacher: Cigdem Beyan
Introduction to Blockchain
Credits: 3
Language: English
Teacher: Sara Migliorini
Autonomous Agents and Multi-Agent Systems
Credits: 5
Language: English
Teacher: Alessandro Farinelli
Cyber-physical systems security
Credits: 3
Language: English/Italian
Teacher: Massimo Merro
Foundations of quantum languages
Credits: 3
Language: English
Teacher: Margherita Zorzi
Advanced Data Structures for Textual Data
Credits: 3
Language: English
Teacher: Zsuzsanna Liptak
AI and explainable models
Credits: 5
Language: English
Teacher: Gloria Menegaz, Lorenza Brusini
Automated Software Testing
Credits: 4
Language: English
Teacher: Mariano Ceccato
Elements of Machine Teaching: Theory and Appl.
Credits: 3
Language: English
Teacher: Ferdinando Cicalese
Introduction to Quantum Machine Learning
Credits: 4
Language: English
Teacher: Alessandra Di Pierro
Laboratory of quantum information in classical wave-optics analogy
Credits: 3
Language: English
Teacher: Claudia Daffara
Advanced Data Structures for Textual Data (2023/2024)
Teacher
Referent
Credits
3
Language
English
Class attendance
Free Choice
Location
VERONA
Learning objectives
Introduction to text indexes (data structures for very large textual data)
Prerequisites and basic notions
Algorithms and Data Structures
Program
Genome-scale textual data, i.e. strings of many giga- or even terabytes, are everywhere in today's world. This includes biological sequences (genomic data, protein sequences), digital books, web crawl data, emails, musical data, and many others. The main challenge nowadays is not so much how to store this data, but how to store it in such a way that it can be processed and queried efficiently. Text indexes are dedicated data structures for handling very large amounts of textual data. Propelled forward by the need arising from computational biology on the one hand, and from web search on the other, enormous progress has been made in this area in recent decades.
In this course, we will study some of these text indexes. We will start with a brief introduction to the suffix array, a classic data structure for strings, study its properties, some of its uses in string processing, and its efficient construction. We then introduce two supporting data structures, the LCP-array and the Burrows- Wheeler-Transform (BWT). The so-called ‘clustering property’ of the BWT allows compressed indexes to be built on it. We will close with several BWT-based text indexes: the FM-index, the RLFM-index, and the r- index.
Depending on the background of the students, some of the above topics may be replaced by others, such as a more thorough introduction to wavelet trees (a versatile data structure for efficient rank/select queries), or the extended Burrows-Wheeler-Transform (eBWT, a generalization of the BWT to string collections). String collections are of fundamental interest in many of the most common applications today, such as pangenomes, version control data, or web crawl data, where many different copies of highly similar strings are given in input.
Day 1: Introduction to problems on textual data, pattern matching, suffix arrays (SA)
Day 2: efficient SA construction, LCP-array
Day 3: BWT, backward search, wavelet trees
Day 4: FM-index, RLFM-index, r-index
Bibliography
Didactic methods
lectures, using slides and/or blackboard
Learning assessment procedures
takehome exam
Assessment
ability to apply the algorithms and data structures studied on concrete examples; ability to develop new algorithms using the text indexes studied
Criteria for the composition of the final grade
n.a.