FI Abstracts vol. 56

56(1-2)  2003


  1. Preface

    A string or word or sequence is an extremely simply mathematical object (data structure) that has only been deemed worthy of particular study - first by mathematicians, then later by computer scientists - since Axel Thue's groundbreaking work nearly a century ago. Until about 35 years ago, the published work discussed mathematical properties of strings or of special classes of strings; however, the simultaneous advent of the computer and of major application areas (especially molecular biology and massive data storage/transmission) shifted the emphasis toward computer science and algorithms on strings. Today there are at least several hundred research papers published every year that deal with algorithms on strings and related methodology; an annual conference, Combinatorial Pattern Matching (CPM), has since 1987 published research contributions to this field of the highest quality. Since the mid-1990s there have been several books [1-8] published that seek to provide a unified view of algorithms on strings and/or their numerous applications. These application areas include natural and obvious ones such as computational biology, data compression and cryptography, but also many others: the identification of musical motifs, internet routing, pattern recognition, computer graphics, and so on. Meanwhile, during the last 10 years, in many countries around the world, undergraduate and graduate programmes have been established whose subject matter includes heavy emphasis on string algorithms - perhaps chiefly bio-informatics programmes, but also others.

    Thus a special issue of Fundamenta Informaticae dealing with the computation of patterns in strings is both timely and relevant within the broader contexts of computer science and combinatorial algorithms. It is my hope that this issue will provide a snapshot of current research in string algorithms, perhaps motivate other researchers to take an interest in a dynamic and exciting research area. Certainly the papers collected here reflect the range of topics mentioned above: altogether six of them deal with subjects of particular interest in a biological context (four with some form of approximate pattern-matching, two with string alignment); tbree others treat aspects of data compression; and two final papers provide fresh insights into standard string-processing calculations (longest common prefix, suffix array).

    Altogether 17 papers were submitted to the special issue, all of them of high quality; after an extended refereeing process, 11 were accepted for publication. It is a pleasure to thank the authors for their interest and for their original contributions, also to express my particular appreciation of the careful and diligent, but of necessity unattributable, work of a cadre of some 35 referees.

     
    Bill Smyth
    Algorithms Research Group Department of Computing
    McMaster University Curtin University of Technology

    [1] Maxime Crochemore & Wojciech Rytter, Text Algorithms, Oxford University Press (1994) 412 pp.

    [2] Graham A. Stephen, String Searching Algorithms, World Scientific Publishing (1994) 243 pp.

    [3] Dan Gusfield, Algorithms on Strings, Trees & Sequences, Cambridge University Press (1997) 534 pp.

    [4] João Setubal & João Meidanis, Introduction to Computational Molecular Biology, PWS Publishing (1997) 296 pp.

    [5] Maxime Crochemore, Christophe Hancart & Thierry Lecroq, Algorithmique du Texte, Vuibert (2001) 347 pp.

    [6] Wojciech Szpankowski, Average Case Analysis of Algorithms on Sequences, Wiley-Interscience (2001) 551 pp.

    [7] Gonzalo Navarro & Mathieu Raffinot, Flexible Pattern Matching in Strings, Cambridge University Press (2002) 221 pp.

    [8] Bill Smyth, Computing Patterns in Strings, Pearson Addison-Wesley (2003) 423 pp.

  2. Occurrence and Substring Heuristics for d-Matching
    Maxime Crochemore, Costas S. Iliopoulos, Thierry Lecroq, Yoan J. Pinzon, Wojciech Plandowski and Wojciech Rytter 1-21

    We consider a version of pattern matching useful in processing large musical data: d-matching, which consists in finding matches which are d-approximate in the sense of the distance measured as maximum difference between symbols. The alphabet is an interval of integers, and the distance between two symbols a, b is measured as |a-b|. We also consider (d,g)-matching, where g is a bound on the total sum of the differences. We first consider ``occurrence heuristics'' by adapting exact string matching algorithms to the two notions of approximate string matching. The resulting algorithms are efficient in practice. Then we consider ``substring heuristics''. We present d-matching algorithms fast on the average providing that the pattern is ``non-flat'' and the alphabet interval is large. The pattern is ``flat'' if its structure does not vary substantially. The algorithms, named d-BM1, d-BM2 and d-BM3 can be thought as members of the generalized Boyer-Moore family of algorithms. The algorithms are fast on average. This is the first paper on the subject, previously only ``occurrence heuristics'' have been considered. Our substring heuristics are much stronger and refer to larger parts of texts (not only to single positions). We use d-versions of suffix tries and subword graphs. Surprisingly, in the context of d-matching subword graphs appear to be superior compared with compact suffix trees.

  3. Fast Multipattern Search Algorithms for Intrusion Detection
    Josué Kuri, Gonzalo Navarro and Ludovic Mé 23-49

    We present new search algorithms to detect the occurrences of any pattern from a given pattern set in a text, allowing in the occurrences a limited number of spurious text characters among those of the pattern. This is a common requirement in intrusion detection applications. Our algorithms exploit the ability to represent the search state of one or more patterns in the bits of a single machine word and update all the search states in a single operation. We show analytically and experimentally that the algorithms are able of fast searching for large sets of patterns allowing a wide number of spurious characters, yielding in our machine about a 75-fold improvement over the classical dynamic programming algorithm.

  4. Better Filtering with Gapped q-Grams
    Stefan Burkhardt and Juha Kärkkäinen 51-70

    A popular and well-studied class of filters for approximate string matching compares substrings of length q, the q-grams, in the pattern and the text to identify text areas that contain potential matches. A generalization of the method that uses gapped q-grams instead of contiguous substrings is mentioned a few times in literature but has never been analyzed in any depth. In this paper, we report the first results of a study on gapped q-grams. We show that gapped q-grams can provide orders of magnitude faster and/or more efficient filtering than contiguous q-grams. To achieve these results the arrangement of the gaps in the q-gram and a filter parameter called threshold have to be optimized. Both of these tasks are nontrivial combinatorial optimization problems for which we present efficient solutions. We concentrate on the k mismatches problem, i.e, approximate string matching with the Hamming distance.

  5. Regexpcount, a symbolic package for counting problems on regular expressions and words
    Pierre Nicodème 71-88

    In previous work [10], we considered algorithms related to the statistics of matches with words and regular expressions in texts generated by Bernoulli or Markov sources. In this work these algorithms are extended for two purposes: to determine the statistics of simultaneous counting of different motifs, and to compute the waiting time for the first match with a motif in a model which may be constrained. This extension also handles matches with errors. The package is fully implemented and gives access to high and low level commands. We also consider an example corresponding to a practical biological problem: getting the statistics for the number of matches of words of size 8 in a genome (a Markovian sequence), knowing that an (overrepresented DNA protecting) pattern named Chi occurs a given number of times.

  6. Speeding-up Hirschberg and Hunt-Szymanski LCS Algorithms
    Maxime Crochemore, Costas S. Iliopoulos and Yoan J. Pinzon 89-103

    Two algorithms are presented that solve the problem of recovering the longest common subsequence of two strings. The first algorithm is an improvement of Hirschberg's divide-and-conquer algorithm. The second algorithm is an improvement of Hunt-Szymanski algorithm based on an efficient computation of all dominant match points. These two algorithms use bit-vector operations and are shown to work very efficiently in practice.

  7. A Fast Algorithm for Optimal Alignment between Similar Ordered Trees
    Jesper Jansson and Andrzej Lingas 105-120

    We present a fast algorithm for optimal alignment between two similar ordered trees with node labels. Let S and T be two such trees with |S| and |T| nodes, respectively. If there exists an optimal alignment between S and T which uses at most d blank symbols and d is known in advance, it can be constructed in O(n logn ·(maxdeg)3 ·d2) time, where n = max{|S|,|T|} and maxdeg is the maximum degree of all nodes in S and T. If d is not known in advance, we can construct an optimal alignment in O(n logn ·(maxdeg)3·f2) time, where f is the difference between the highest possible score for any alignment between two trees having a total of |S| + |T| nodes and the score of an optimal alignment between S and T, if the scoring scheme satisfies some natural assumptions. In particular, if the degrees of both input trees are bounded by a constant, the running times reduce to O(n logn ·d2) and O(n logn ·f2), respectively.

  8. Computing forbidden words of regular languages
    Marie-Pierre Béal, Maxime Crochemore, Filippo Mignosi, Antonio Restivo, and Marinella Sciortino 121-135

    We give a quadratic-time algorithm to compute the set of minimal forbidden words of a factorial regular language. We give a linear-time algorithm to compute the minimal forbidden words of a finite set of words. This extends a previous result given for the case of a single word only.
    We also give quadratic-time algorithms to check whether a regular language is factorial or anti-factorial.

  9. Time/Space Efficient Compressed Pattern Matching
    Leszek G±sieniec and Igor Potapov 137-154

    An exact pattern matching problem is to find all occurrences of a pattern p in a text t. We say that the pattern matching algorithm is optimal if its running time is linear in the sizes of t and p, i.e., O(t+p). Perhaps one of the most interesting settings of the pattern matching problem is when one has to design an efficient algorithm with a help of a small extra space. In this paper we explore this setting to the extreme. We work under an assumption that the text t is available only in a compressed form, represented by a straight-line program. The compression methods based on efficient construction of straight-line programs are as competitive as the compression standards, including the Lempel-Ziv compression scheme and recently intensively studied text compression via block sorting, due to Burrows and Wheeler. Our main result is an algorithm that solves the compressed string matching problem in an optimal linear time, with a help of a constant extra space. We also discuss an efficient implementation of a version our algorithm showing that the new concept may have also some interesting real applications. Our result is in contrast with many other compressed pattern matching algorithms where the goal is to find all pattern occurrences in time related to the size of the compressed text. However one must remember that all previous algorithms used at least a linear (in a compressed text, a dictionary, or a pattern) extra memory while our algorithm can be implemented in a constant size extra space. Also from the practical point of view, when the compression ratio is constant (very rarely smaller than 25%), there is no dramatic difference between the running time based on the size of the compressed text and the size of the original text, while an extra space resources might be strictly limited.

  10. Definability and Compression
    Foto Afrati, Hans Leiß  and Michel de Rougemont 155-180

    A compression algorithm takes a finite structure of a class K as input and produces a finite structure of a different class K' as output. Given a property P on the class K defined in a logic L, we study the definability of property P on the class K'. We consider two compression schemes on unary ordered structures (strings), compression by run-length encoding and the classical Lempel-Ziv-78 scheme.
    First-order properties of strings are first-order on run-length compressed strings, but this fails for images, i.e. 2-dimensional strings. We present simple first-order properties of strings which are not first-order definable on strings compressed with the Lempel-Ziv-78 compression scheme.
    We show that all properties of strings that are first-order definable on strings are definable on Lempel-Ziv compressed strings in FO(TC), the extension of first-order logic with the transitive closure operator. We define a subclass F of the first-order properties of strings such that if P is a property in F, it is also first-order definable on the Lempel-Ziv compressed strings. Monadic second-order properties of strings, i.e. regular languages, are dyadic second-order definable on Lempel-Ziv compressed strings.

  11. A Simple and Scalable Algorithm for the IP Address Lookup Problem
    Inbok Lee, Kunsoo Park, Sung Kwon Chung and Yanghee Choi 181-190

    The IP address lookup problem is to find the longest matching IP prefix from a routing table for a given IP address and it has been a central bottleneck in speeding up the Internet. In this paper we propose a new algorithm for this problem based on the segment tree data structure. Given n IP prefixes, our algorithm does the IP address lookup in O(logn) time. It also handles the insertion and deletion of IP prefixes efficiently without rebuilding the whole data structure.

  12. Compact Suffix Array - A Space-Efficient Full-Text Index
    Veli Mäkinen 191-210

    Suffix array is a widely used full-text index that allows fast searches on the text. It is constructed by sorting all suffixes of the text in the lexicographic order and storing pointers to the suffixes in this order. Binary search is used for fast searches on the suffix array. Compact suffix array is a compressed form of the suffix array that still allows binary searches, but the search times are also dependent on the compression. In this paper, we give efficient methods for constructing and querying compact suffix arrays. We also study practical issues, such as the trade off between compression and search times, and show how to reduce the space requirement of the construction. Experimental results are provided in comparison with other search methods. With a large text corpora, the index took 1.6 times the size of the text, while the searches were only two times slower than from a suffix array.


56(3)  2003


  1. Eliminating Unorthodox Derivation Rules in an Axiom System for Iteration-free PDL with Intersection
    Philippe Balbiani 211-242

    We devote this paper to the completeness of an axiom system for PDLÇ0 - a variant of PDL which includes the program operations of composition and intersection. Most of the difficulty in the proof of the completeness theorem for PDLÇ0 lies in the fact that intersection of accessibility relations is not modally definable. We overcome this difficulty by considering the concepts of theory and large program. Theories are sets of formulas that contain PDLÇ0 and are closed under the inference rule of modus ponens. Large programs are built up from program variables and theories by means of the operations of composition and intersection, just as programs are built up from program variables and tests. Adapting these concepts to the subordination method, we can prove the completeness of our deductive system for PDLÇ0.

     

  2. An Efficient and Practical (t, n) Threshold Proxy Signature Scheme with Known Signers
    Hui-Feng Huang and Chin-Chen Chang 243-253

    Previously, all of the proposed threshold proxy signature schemes which have been based on the discrete logarithms required a protocol to generate and verify a shared secret among the proxy group. Therefore, it is necessary for the proxy signers to execute a lot of expensive modular exponential computations and communications to obtain and verify a shared secret. Thus, it is very time-consuming to construct the proxy signature. Moreover, some of the existing threshold proxy signature schemes reveal that the receiver cannot find out who signed the proxy signatures. In this paper, we proposed a practical, efficient, and low communications (t, n) threshold proxy signature scheme based on RSA cryptosystem. By using our way, not only the original signer can know who generated the proxy signature, but also everyone can be a verifier to certify the actuality of the group signers who made it. So, it is very convenient to clarify the responsibility of the document's signers fairly.

  3. On the Infinigons of the Hyperbolic Plane, A combinatorial approach
    Maurice Margenstern 255-272

    In this paper, we pay a new visit to an object of hyperbolic geometry which, perhaps, did not draw on itself all the attention it may deserve. The paper gives a simple definition of this object, infinigons, which was implicit in general considerations about tilings of the hyperbolic plane, and which was not definied in its all possible extensions. From the simple construction of the infinigons and using the ideas of the splitting method being introduced by the author in the case of tilings being based on the replication of regular polygons, we give an algorithmic construction of the infinigrids. On the way, we give a simple geometrical characterisation of the infinigons in terms of pencils of lines.

  4. On the existence of a new family of Diophantine equations for Omega
    Toby Ord and Tien D. Kieu 273-284

    We show how to determine the k-th bit of Chaitin's algorithmically random real number W by solving k instances of the halting problem. From this we then reduce the problem of determining the k-th bit of W to determining whether a certain Diophantine equation with two parameters, k and N, has solutions for an odd or an even number of values of N. We also demonstrate two further examples of W in number theory: an exponential Diophantine equation with a parameter k which has an odd number of solutions iff the k-th bit of W is 1, and a polynomial of positive integer variables and a parameter k that takes on an odd number of positive values iff the k-th bit of W is 1.

  5. Center-Based Indexing in Vector and Metric Spaces
    Arkadiusz Wojna 285-310

    The paper addresses the problem of indexing data for k nearest neighbors (k-nn) search. Given a collection of data objects and a similarity measure the searching goal is to find quickly the k most similar objects to a given query object. We present a top-down indexing method that employs a widely used scheme of indexing algorithms. It starts with the whole set of objects at the root of an indexing tree and iteratively splits data at each level of indexing hierarchy. In the paper two different data models are considered. In the first, objects are represented by vectors from a multi-dimensional vector space. The second, more general, is based on an assumption that objects satisfy only the axioms of a metric space. We propose an iterative k-means algorithm for tree node splitting in case of a vector space and an iterative k-approximate-centers algorithm in case when only a metric space is provided. The experiments show that the iterative k-means splitting procedure accelerates significantly k-nn searching over the one-step procedure used in other indexing structures such as GNAT, SS-tree and M-tree and that the relevant representation of a tree node is an important issue for the performance of the search process. We also combine different search pruning criteria used in BST, GHT nad GNAT structures into one and show that such a combination outperforms significantly each single pruning criterion. The experiments are performed for benchmark data sets of the size up to several hundreds of thousands of objects. The indexing tree with the k-means splitting procedure and the combined search criteria is particularly effective for the largest tested data sets for which this tree accelerates searching up to several thousands times.

     


56(4)  2003


  1. Tissue-like P Systems with Active Membranes for Picture Generation
    Rodica Ceterchi, Radu Gramatovici, Natasa Jonoska and K.G. Subramanian 311-328

    We propose a variant of tissue-like P systems with symport rules and active membranes that generates two-dimensional picture languages. The method is unconventional in that, instead of using the membranes as regions for computation/writing, it uses them to hold elements of pictures. Thus, the picture itself, and actually the whole supporting rectangular grid, is composed of membranes, each one containing (among other symbols) a letter of the picture's alphabet. The method is illustrated by its application to the generation of local and recognizable picture languages.

  2. Evolution of Collective Commitment during Teamwork
    Barbara Dunin-K êplicz and Rineke Verbrugge 329-371

    In this paper we aim to describe dynamic aspects of social and collective attitudes in teams of agents involved in Cooperative Problem Solving (CPS). Particular attention is given to the strongest motivational attitude, collective commitment, and its evolution during team action. First, building on our previous work, a logical framework is sketched in which a number of relevant social and collective attitudes is formalized, leading to the plan-based definition of collective commitments. Moreover, a dynamic logic component is added to this framework in order to capture the effects of the complex actions that are involved in the consecutive stages of CPS, namely potential recognition, team formation, plan formation and team action.

    During team action, the collective commitment leads to the execution of agent-specific actions. A dynamic and unpredictable environment may, however, cause the failure of some of these actions, or present the agents with new opportunities. The abstract reconfiguration algorithm, presented in a previous paper, is designed to handle the re-planning needed in such situations in an efficient way. In this paper, the dynamic logic component of the logical framework addresses issues pertaining to adjustments in collective commitment during the reconfiguration process.

  3. On the Notion of Elementary Description
    Tomasz Terlikowski 373-387

    In this paper, we deal with systems S including set theory. The concept of an elementary description was introduced in [11] as an auxiliary notion. Such a notion occurs to be necessary to define risk function and risk-type description, that will be subject of the next part of this work. The present paper is a continuation of our research in [13], [14] and [15]; the reader will find all relevant notions in the quoted works.

  4. On the Finiteness of Picture Languages of synchronous, simple non-deterministic Chain Code Picture Systems
    Bianca Truthe 389-409

    In the present paper, synchronous, simple non-deterministic chain code picture systems based on Lindenmayer systems () are studied the finiteness of their picture languages. The finiteness is proved to be decidable. Additionally, a method is given for deciding whether an  generates a finite picture language or not.


File translated from TEX by TTH, version 2.00.
On 07 Aug 2003, 13:58.