Thesis Details

Automata with Counting in Regular Expression Matching

Ph.D. Thesis Student: Holíková Lenka Academic Year: 2021/2022 Supervisor: Holík Lukáš, doc. Mgr., Ph.D.
Czech title
Čítačové automaty ve vyhledávání podle regulárních výrazů
Language
English
Abstract

Matching of regular expressions (regexes) is widely used, e.g., for searching, data validation, parsing, finding and replacing, data scraping, or syntax highlighting in many programming languages. It is a computationally intensive process often applied on large texts. Predictability of its efficiency has a significant impact on the overall usability of software applications in practice. A problem is that standard approaches for regex matching suffer from high worst case complexity. An unlucky combination of a regex and text may increase the matching time by orders of magnitude. This can be a doorway for the so-called Regular expression Denial of Service (ReDoS) attack in which the attacker causes a denial of service by providing a specially crafted regex or text.

Automata-based matchers are the most efficient regex matching engines used nowadays in practice, especially in performance-critical industrial applications. There are years of empirical evidence showing that their performance is much more stable than that of the more traditional backtracking-based matchers. But automata matchers may run into troubles too. Bounded repetition, i.e., expressions such as [ab]{100} with a specified number of repetitions of a certain pattern, has been recognised as a major source of problems for even the fastest matchers. This thesis studies this issue systematically.In this thesis, we present a large-scale study of vulnerability of automata-based matching focused on bounded repetition. To this end, we propose a new ReDoS generator. It is the first generator capable of utilising bounded repetition to attack automata-based matchers, in fact the first generator that can attack them at all. We were then able to prove experimentally that bounded repetition indeed poses a serious security threat, for automata-based as well as backtracking-based matchers.We then propose a solution to the problem of efficient matching of regexes with bounded repetition. The approach is to compile the regexes into nondeterministic counting automata (CAs) and then to determinise them. The main problem is to find a succinct deterministic representation that can perform fast matching (naive determinisation builds a deterministic finite automata (DFAs) exponentially large to the size of the regex and of the repetition bounds in it). In the first step, we propose a determinisation algorithm based on general subset construction that generates deterministic CAs. They are exponentially more succinct than the corresponding DFAs. The main contribution of this thesis was then obtained when we elaborated the determinisation using the idea representing many counters with counting sets. We propose succinct transformation of a CA into a deterministic counting-set automaton (CsA), an automaton with a special type of registers that can hold a set of integer values. We also propose a novel compilation of regexes to CAs that generalizes the Antimirov's derivative construction. We design a framework for matching based on CsA simulation and the Antimirov's derivative construction. We compare the speed of matching of individual matching engines on a comprehensive set of real-world regexes with bounded repetition. We found that our algorithm is much more robust, outperforms the state-of-the-art matchers on regexes with bounded repetition, and is not dependent on the size of repetition bounds. It easily solves most cases in which the existing matchers struggle due to bounded repetition.

Keywords

Regular expression matching, bounded repetition, ReDoS, determinisation, Antimirov's derivatives, counting automata, counting-set automata.

Department
Degree Programme
Computer Science and Engineering, Field of Study Computer Science and Engineering
Files
Status
defended
Date
29 June 2022
Citation
HOLÍKOVÁ, Lenka. Automata with Counting in Regular Expression Matching. Brno, 2021. Ph.D. Thesis. Brno University of Technology, Faculty of Information Technology. 2022-06-29. Supervised by Holík Lukáš. Available from: https://www.fit.vut.cz/study/phd-thesis/1416/
BibTeX
@phdthesis{FITPT1416,
    author = "Lenka Hol\'{i}kov\'{a}",
    type = "Ph.D. thesis",
    title = "Automata with Counting in Regular Expression Matching",
    school = "Brno University of Technology, Faculty of Information Technology",
    year = 2022,
    location = "Brno, CZ",
    language = "english",
    url = "https://www.fit.vut.cz/study/phd-thesis/1416/"
}
Back to top