DFA and NFA

  

1. Origins

Regular expressions sprouted from the neurophysiological studies of the 1940s and were first formally described by the mathematician Stephen Kleene. Kleene summarized those neurophysiological studies, defined the notion of a "regular set", built an algebraic system over such sets, and introduced a notation for describing them, which he called "regular expressions". After being studied in theoretical mathematics circles for decades, regular expressions entered the computer field in 1968, when Ken Thompson, who later co-created the UNIX system, used them in two practical and highly successful text-processing tools, qed and grep. Over the following decade or so, many leading computer scientists and hackers researched and experimented with regular expressions intensively. In the early 1980s, the two centers of the UNIX movement, Bell Labs and the University of California at Berkeley, studied and implemented regular expression engines around the grep tool. At about the same time, Alfred Aho, co-author of the "Dragon Book" on compilers, developed the egrep tool, which greatly extended and strengthened the functionality of regular expressions. He later created the popular awk text-processing language together with Peter Weinberger and Brian Kernighan, co-author of The C Programming Language.

By 1986, regular expressions had made a leap forward. First, the top C hacker Henry Spencer released a regular expression library written in C in source-code form (the term "open source" did not exist yet), bringing regular expressions out of obscurity and into ordinary programmers' hands. Then the technical wizard Larry Wall appeared and released the first version of the Perl language. Perl has been the standard-bearer of regular expressions ever since; it is fair to say that the form and status of today's regular expressions were shaped by Perl. After the release of Perl 5.x, regular expressions entered a period of stable maturity, and their power has conquered almost every major language platform, making them a basic tool that every professional developer must master.

2. DFA and NFA
Regular expression engines fall into two categories: DFA (deterministic finite automaton) and NFA (non-deterministic finite automaton). To do its job, either kind of engine needs two things, a regular expression and a text string: it holds one in its hand and "eats" the other. A DFA holds the text and checks it against the regular expression: for each sub-expression it sees, it marks all the strings that could possibly match, then looks at the next part of the expression and updates the marks according to the new results. An NFA holds the regular expression and checks it against the text: it eats one character, compares it with the expression, and if it matches it makes a note ("matched here, at such-and-such a place!") and carries on. As soon as something fails to match, it spits out the characters it just ate, one by one, until it is back at the last recorded match.

This difference in mechanism between DFA and NFA has five practical consequences:

1. A DFA scans each character of the text only once, so it is faster, but it offers fewer features. An NFA has to eat and spit characters back and forth, so it is slower, but it is rich in features and therefore widely used. Today's major regular expression engines, such as Perl, Ruby, Python's re module, and the Java and .NET regex libraries, are all NFAs.
2. Only NFAs support features such as lazy quantifiers and backreferences.
3. An NFA is eager to claim success, so the leftmost alternative that matches wins, and the best overall match is occasionally missed. A DFA instead returns the "longest leftmost" match.
4. NFA quantifiers are greedy by default.
5. An NFA can fall into the trap of catastrophic backtracking and perform very badly.

Here is an example to illustrate the third consequence: matching the text 'perlman book' with the regular expression /perl|perlman/. An NFA is directed by the regular expression: it holds the expression in its hand, looks at the text, and eats one character at a time. After it has eaten 'perl', that matches the first alternative /perl/, so it records the match and looks ahead; it eats an 'm', which does not fit /perl/, so it spits the 'm' back out, reports upward that 'perl' has been matched successfully, pays no further attention, never tries the later alternative /perlman/, and so never sees the better answer. A DFA is directed by the text: it holds the text in its hand and works through the expression bite by bite. After eating /p/ it puts a check mark on the 'p' it is holding, notes that the character has been matched, and keeps eating. When it has seen all of /perl/, the DFA does not stop; it tries another bite. The first alternative is now used up, so it turns to the second alternative and eats its /m/. That matches too, so it keeps eating until the whole expression is finished, and only then does it report, satisfied, that it has successfully matched 'perlman'. From this we can see that to make the NFA find the longer match, you should write the pattern as /perlman|perl/, with the longer alternative first.

This example shows why an NFA performs "leftmost alternative" matching while a DFA performs "longest leftmost" matching. Analysing it carefully reveals the essential difference between NFA and DFA, and understanding that difference matters a great deal for using regular expressions effectively.
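To see this behaviour in a real NFA engine, here is a minimal sketch using Python's re module (one of the NFA engines listed above). The strings and patterns are illustrative examples, not taken from any particular implementation.

import re

text = "perlman book"

# Consequence 3: in an NFA, the leftmost alternative that matches wins.
print(re.search(r"perl|perlman", text).group())        # perl
print(re.search(r"perlman|perl", text).group())        # perlman

# Consequences 2 and 4: lazy quantifiers and backreferences are NFA-only
# features, and quantifiers are greedy unless written as lazy.
print(re.search(r"<.+>", "<a><b>").group())            # <a><b>  (greedy: longest)
print(re.search(r"<.+?>", "<a><b>").group())           # <a>     (lazy: shortest)
print(re.search(r"(\w+) \1", "go go gadget").group())  # go go   (backreference)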


The formal definition of regular expressions is deliberately kept minimal and avoids defining the redundant quantifiers ? and +, which can be expressed as a+ = aa* and a? = (a|ε). Sometimes a complement operator ~ is added: ~R denotes the set of all strings over Σ* that are not in R. The complement operator is redundant, because it can be expressed using the other operators (although computing such a representation is complex, and the result may be exponentially larger).

Regular expressions in this sense express exactly the regular languages, that is, the class of languages accepted by finite state automata. There are, however, important differences in compactness: some classes of regular languages can only be described by automata whose size grows exponentially, while the length of the corresponding regular expressions grows only linearly. Regular expressions correspond to the type-3 grammars of the Chomsky hierarchy. On the other hand, there is a simple mapping between regular expressions and non-deterministic finite state automata (NFAs) that does not cause such a blow-up in size; for this reason NFAs are often used as an alternative representation of regular expressions.

It is also worth studying expressive power within this formalism. Different regular expressions can express the same language, for example a+ and aa*, so the formalism is redundant. It is possible to write an algorithm that, given two regular expressions, decides whether they describe the same language: reduce each expression to a minimal deterministic finite automaton and check whether the two automata are isomorphic (equivalent). How far can this redundancy be reduced? Can we find an interesting subset of regular expressions that is still fully expressive? Kleene star and union are clearly needed, but perhaps their use can be restricted. This turns out to be a surprisingly difficult problem: as simple as regular expressions are, there is no method to systematically rewrite them into some normal form. The lack of an axiomatization in the past led to the star height problem; Dexter Kozen later axiomatized regular expressions with Kleene algebra. Finally, note that many real-world "regular expression" engines implement features that cannot be expressed in the regular expression algebra at all.
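As a concrete illustration of this redundancy, the following sketch compares two regular expressions on every string over {a, b} up to a fixed length, using Python's re module. The helper name agree_on_short_strings is made up for this example, and checking a finite sample is of course not the minimal-DFA equivalence algorithm described above, only a quick demonstration that different expressions can describe the same language.

import re
from itertools import product

def agree_on_short_strings(pattern_a, pattern_b, alphabet="ab", max_len=6):
    # Compare the two expressions on every string over the alphabet whose
    # length is at most max_len (a finite sample, not a proof of equivalence).
    ra, rb = re.compile(pattern_a), re.compile(pattern_b)
    for length in range(max_len + 1):
        for chars in product(alphabet, repeat=length):
            s = "".join(chars)
            if bool(ra.fullmatch(s)) != bool(rb.fullmatch(s)):
                return False
    return True

print(agree_on_short_strings(r"a+", r"aa*"))          # True:  a+ = aa*
print(agree_on_short_strings(r"(a|b)*", r"(a*b*)*"))  # True:  same language
print(agree_on_short_strings(r"a+", r"a*"))           # False: a* also matches ""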

The engine types used by some common tools and languages are listed below:

Engine type        Programs
DFA                awk (most versions), egrep (most versions), flex, lex, MySQL, Procmail
Traditional NFA    GNU Emacs, Java, grep (most versions), less, more, .NET languages, the PCRE library, Perl, PHP (all three sets of regex functions), Python, Ruby, sed (most versions), vi
POSIX NFA          mawk, Mortice Kern Systems' utilities, GNU Emacs (when explicitly requested)
DFA/NFA hybrid     GNU awk, GNU grep/egrep, Tcl
