TY - GEN

T1 - Bit-parallel approach to approximate string matching in compressed texts

AU - Matsumoto, T.

AU - Kida, T.

AU - Takeda, M.

AU - Shinohara, A.

AU - Arikawa, S.

PY - 2000/1/1

Y1 - 2000/1/1

N2 - Addresses the problem of approximate string matching on compressed text. We consider this problem for a text string described in terms of a collage system, which is a formal system proposed by T. Kida et al. (1999) that captures various dictionary-based compression methods. We present an algorithm that exploits bit-parallelism, assuming that our problem fits in a single machine word, e.g. (m - k)(k + 1) ≤ L, where m is the pattern length, k is the number of allowed errors and L is the length, in bits, of the machine word. For a class of simple collage systems, the algorithm runs in O(k2(∥D∥ + |S|) + km) time using O(k2∥D∥) space, where ∥D∥ is the size of dictionary D and |S| is the number of tokens in the sequence S. The LZ78 (Lempel-Ziv, 1978) and the LZW (Lempel-Ziv-Welch, 1984) compression methods are covered by this class. Since we can regard ∥D∥ + |S| as the compressed length, the time and space complexities are O(k2n + km) and O(k2n), respectively. For general k and m, they become O(k3mn/L + km) and O(k3mn/L). Thus, our algorithm is competitive to the algorithm proposed by J. Kärkkäinen, et al. (2000), which runs in O(km) time using O(kmn) space, when k = O(√L).

AB - Addresses the problem of approximate string matching on compressed text. We consider this problem for a text string described in terms of a collage system, which is a formal system proposed by T. Kida et al. (1999) that captures various dictionary-based compression methods. We present an algorithm that exploits bit-parallelism, assuming that our problem fits in a single machine word, e.g. (m - k)(k + 1) ≤ L, where m is the pattern length, k is the number of allowed errors and L is the length, in bits, of the machine word. For a class of simple collage systems, the algorithm runs in O(k2(∥D∥ + |S|) + km) time using O(k2∥D∥) space, where ∥D∥ is the size of dictionary D and |S| is the number of tokens in the sequence S. The LZ78 (Lempel-Ziv, 1978) and the LZW (Lempel-Ziv-Welch, 1984) compression methods are covered by this class. Since we can regard ∥D∥ + |S| as the compressed length, the time and space complexities are O(k2n + km) and O(k2n), respectively. For general k and m, they become O(k3mn/L + km) and O(k3mn/L). Thus, our algorithm is competitive to the algorithm proposed by J. Kärkkäinen, et al. (2000), which runs in O(km) time using O(kmn) space, when k = O(√L).

UR - http://www.scopus.com/inward/record.url?scp=84964461893&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84964461893&partnerID=8YFLogxK

U2 - 10.1109/SPIRE.2000.878198

DO - 10.1109/SPIRE.2000.878198

M3 - Conference contribution

AN - SCOPUS:84964461893

T3 - Proceedings - 7th International Symposium on String Processing and Information Retrieval, SPIRE 2000

SP - 221

EP - 228

BT - Proceedings - 7th International Symposium on String Processing and Information Retrieval, SPIRE 2000

PB - Institute of Electrical and Electronics Engineers Inc.

T2 - 7th International Symposium on String Processing and Information Retrieval, SPIRE 2000

Y2 - 27 September 2000 through 29 September 2000

ER -