
OpenUniud  Archivio istituzionale delle tesi di dottorato >
Udine Thesis Repository >
01  Tesi di dottorato >
Utilizza questo identificativo per citare o creare un link a questo documento:
http://hdl.handle.net/10990/816

Autori:  Nicola, Prezza 
Supervisore afferente all'Università:  POLICRITI, ALBERTO 
Centro di ricerca:  DIPARTIMENTO DI SCIENZE MATEMATICHE, INFORMATICHE E FISICHE  DMIF 
Titolo:  Compressed Computation for Text Indexing 
Abstract (in inglese):  This thesis deals with spaceefficient algorithms to compress and index texts. The aim
of compression is to reduce the size of a text by exploiting regularities such as repeti
tions or skewed character distributions. Indexing, on the other hand, aims at accelerating
pattern matching queries (such as locating all occurrences of a pattern) with the help of
data structures (indexes) on the text. Despite being apparently distant concepts, com
pression and indexing are deeply interconnected: both exploit the inner structure of the
text to, respectively, reduce its size and speed up pattern matching queries. It should not
be surprising, therefore, that compression and indexing can be “forced” (actually, quite
naturally) to coincide: compressed fulltext indexes support fast pattern matching queries
while taking almost the same size of the compressed text.
In the last two decades, several compressed text indexes based on the most efficient
text compressors have been designed. These indexes can be roughly classified in two main
categories: those based on suffixsorting (BurrowsWheeler transform indexes, compressed
suffix arrays/trees) and those based on the replacement of repetitions by pointers (Lempel
Ziv indexes, grammar indexes, block trees). Indexes based on a combination of the two
techniques have also been proposed. In general, suffix sortingbased methods support very
fast queries at the expense of space usage. This is due to several factors, ranging from
weak compression methods (e.g. entropy compression, used in early FMindexes, is not
able to capture long repetitions), to heavy structures (e.g. suffix array sampling) flanked
to the compressed text representation to speed up queries. The second class of indexes, on
the other hand, offers strong compression rates, achieving up to exponential compression
on very repetitive texts at the expense of query times—often quadratic in the pattern
length or—in the worst case—linear in the text length.
Among the most used compression techniques, runlength compression of the Burrows
Wheeler transform and LempelZiv parsing (LZ77) have been proved to be superior in
the compression of very repetitive datasets. In this thesis we show that these two tools
can be combined in a single index gathering the best features of the abovediscussed
indexes: fast queries (linear in the pattern length and logarithmic in the text length), and
strong compression rates (up to exponential compression can be achieved). We describe an
efficient implementation of our index and compare it with state of the art alternatives. Our
solution turns out to be as spaceefficient as the lightest index described in the literature
while supporting queries up to three orders of magnitude faster.
Apart from index definition and design, a third concern regards index construction
complexity. Often, the input text is too big to be fully loaded into main memory. Even
when this is feasible, classic compression/indexing algorithms use heavy data structures
such as suffix trees/arrays which can easily take several times the space of the text. This is
unsatisfactory, especially in cases where (i) the text is streamed and not stored anywhere
(e.g. because of its size) and (ii) the compressed text is small enough to fit into main
memory. A further contribution of this thesis consists in five algorithms compressing text
within compressed working space and in two recompression techniques (i.e. algorithms toconvert between different compression formats without full decompression). The complete
picture we offer consists of a set of algorithms to spaceefficiently convert among:
• the plain text
• two compressed selfindexes, and
• three compressedfile formats (entropy, LZ77, and runlength BWT)
The general idea behind all our compression algorithms is to read text characters from
left to right and build a compressed dynamic BurrowsWheeler transform of the reversed
text. This structure is augmented with a dynamic suffix array sampling to support fast
locate of text substrings. We employ three types of suffix array sampling: (i) evenlyspaced
(ii) based on BurrowsWheeler transform equalletter runs, and (iii) based on LempelZiv
factors. Strategy (i) allows us to achieve entropycompressed working space. Strategies
(ii) and (iii) are novel and allow achieving a space usage proportional to the output size
(i.e. the compressed file/index).
As a last contribution of this thesis, we turn our attention to a practical and usable
implementation of our suite of algorithmic tools. We introduce DYNAMIC, an opensource
C++11 library implementing dynamic compressed datastructures. We prove almost
optimal theoretical bounds for the resources used by our structures, and show that our
theoretical predictions are empirically tightly verified in practice. The implementation of
the compression algorithms described in this thesis using DYNAMIC meets our expectations:
on repetitive datasets our solutions turn out to be up to three orders of magnitude more
spaceefficient than stateofthe art algorithms performing the same tasks. 
Parole chiave:  BurrowsWheeler transform; Compressed computation; Recompression; LempelZiv; Repetitive collections; Runlength encoding; Indexing 
MIUR :  Settore INF/01  Informatica 
Lingua:  eng 
Data:  3apr2017 
Corso di dottorato:  Dottorato di ricerca in Informatica e scienze matematiche e fisiche 
Ciclo di dottorato:  29 
Università di conseguimento titolo:  Università degli Studi di Udine 
Luogo di discussione:  Udine 
Citazione:  Nicola, P. Compressed Computation for Text Indexing. (Doctoral Thesis, Università degli Studi di Udine, 2017). 
In  01  Tesi di dottorato

Full text:
File 
Descrizione 
Dimensioni  Formato  Consultabilità 
Compressed Computation for Text Indexing.pdf  Tesi di dottorato  1,53 MB  Adobe PDF  Visualizza/apri


Tutti i documenti archiviati in DSPACE sono protetti da copyright. Tutti i diritti riservati.
