TY - JOUR
T1 - On bit-parallel processing of multi-byte text
AU - Hyyrö, Heikki
AU - Takaba, Jun
AU - Shinohara, Ayumi
AU - Takeda, Masayuki
PY - 2005
Y1 - 2005
N2 - There exist practical bit-parallel algorithms for several types of pair-wise string processing, such as longest common subsequence computation or approximate string matching. The bit-parallel algorithms typically use a size-σ table of match bit-vectors, where the bits in the vector for a character λ identify the positions where the character λ occurs in one of the processed strings, and σ is the alphabet size. The time or space cost of computing the match table is not prohibitive with reasonably small alphabets such as ASCII text. However, for example in the case of general Unicode text the possible numerical code range of the characters is roughly one million. This makes using a simple table impractical. In this paper we evaluate three different schemes for overcoming this problem. First we propose to replace the character code table by a character code automaton. Then we compare this method with two other schemes: using a hash table, and the binary-search based solution proposed by Wu, Manber and Myers. We find that the best choice is to use either the automaton-based method or a hash table.
AB - There exist practical bit-parallel algorithms for several types of pair-wise string processing, such as longest common subsequence computation or approximate string matching. The bit-parallel algorithms typically use a size-σ table of match bit-vectors, where the bits in the vector for a character λ identify the positions where the character λ occurs in one of the processed strings, and σ is the alphabet size. The time or space cost of computing the match table is not prohibitive with reasonably small alphabets such as ASCII text. However, for example in the case of general Unicode text the possible numerical code range of the characters is roughly one million. This makes using a simple table impractical. In this paper we evaluate three different schemes for overcoming this problem. First we propose to replace the character code table by a character code automaton. Then we compare this method with two other schemes: using a hash table, and the binary-search based solution proposed by Wu, Manber and Myers. We find that the best choice is to use either the automaton-based method or a hash table.
UR - http://www.scopus.com/inward/record.url?scp=24344508165&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=24344508165&partnerID=8YFLogxK
U2 - 10.1007/978-3-540-31871-2_25
DO - 10.1007/978-3-540-31871-2_25
M3 - Conference article
AN - SCOPUS:24344508165
VL - 3411
SP - 289
EP - 300
JO - Lecture Notes in Computer Science
JF - Lecture Notes in Computer Science
SN - 0302-9743
T2 - Asia Information Retrieval Symposium, AIRS 2004
Y2 - 18 October 2004 through 20 October 2004
ER -