
It's worth pointing out that the resource

https://swtch.com/~rsc/regexp/regexp1.html

linked above is by the same author, Russ Cox, as the reference I included at the end of the blog post:

https://swtch.com/~rsc/regexp/regexp2.html

All of the insights that I presented through this visualization tool are based upon the knowledge found in Russ Cox's articles. Also, the link above on RE2:

https://github.com/google/re2

is a project that was also (started by|heavily contributed to by) Russ Cox. His writings on the topic of regular expressions are absolutely world-class, and I have never found a better resource for understanding the deep, low-level theory of how they work in practice.



But Cox would dislike this formulation of regular expressions. His whole point is that Perl-style regexes are not regular languages, and RE2 goes to great lengths to work around this fact.

I'm taking up his arguments with some more traditional terminology that I think may help:

http://www.oilshell.org/blog/2020/07/eggex-theory.html

The entire O'Reilly book "Mastering Regular Expressions" by Friedl is based on the "debugging regexes by backtracking" approach you present here. It talks about "optimizing" regexes by reordering clauses and optimizing the backtracking.

Russ Cox HATES THAT APPROACH and he wrote the entire RE2 project to advocate that it NOT be used! You do not need to optimize, reorder clauses, debug, or trace backtracking when you're using automata-based engines, because they run in guaranteed linear time and constant space (with respect to the length of the input). There is no backtracking.

(I was at Google when he wrote RE2 and used it shortly afterward on big data, and regular languages are indeed better than regexes for such engineering. This is also made clear by the Hyperscan work.)

Since his site is down I can't find the original comments, but here's a twitter exchange from 2018 that backs this up:

https://twitter.com/geofflangdale/status/1060313603268501504

https://twitter.com/_rsc/status/1060702201159565313

Also, I believe your article would be improved if you simply take out this sentence:

In computer science theory, this node type should remind you of the difference between DFAs and NFAs.

It is treading close to the confusion that Cox is talking about. So I would just remove it, add a proper explanation, or add a link to an authoritative source.

----

I would love to see a followup that goes through the rest of the article with the Thompson VM and Pike VM approaches! :) I think the whole point is that you don't have to backtrack to implement this VM, and this page confuses that issue.
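To make the "no backtracking" point concrete, here is a toy sketch of the Thompson/Pike VM idea in Python (my own illustration, not code from Cox's articles; the bytecode for /a*b/ is compiled by hand):

```python
# A toy Thompson-style VM: breadth-first simulation over a small bytecode.
# Each input character is processed once, and the thread list is
# deduplicated, so the running time is O(len(s) * len(prog)): linear in
# the input, with no backtracking.

def run(prog, s):
    def add(threads, seen, pc):
        # Follow jmp/split instructions eagerly; keep char/match threads.
        if pc in seen:
            return
        seen.add(pc)
        op = prog[pc]
        if op[0] == 'jmp':
            add(threads, seen, op[1])
        elif op[0] == 'split':
            add(threads, seen, op[1])
            add(threads, seen, op[2])
        else:
            threads.append(pc)

    clist, seen = [], set()
    add(clist, seen, 0)
    for ch in s:
        nlist, seen = [], set()
        for pc in clist:
            op = prog[pc]
            if op[0] == 'char' and op[1] == ch:
                add(nlist, seen, pc + 1)
        clist = nlist  # all surviving threads advance in lockstep
    return any(prog[pc][0] == 'match' for pc in clist)

# Bytecode for /a*b/, compiled by hand:
prog = [
    ('split', 1, 3),  # 0: a* -- try the loop body, or skip it
    ('char', 'a'),    # 1
    ('jmp', 0),       # 2: back to the split
    ('char', 'b'),    # 3
    ('match',),       # 4
]

assert run(prog, 'b')
assert run(prog, 'aaab')
assert not run(prog, 'ba')
```

The key property: the set of live threads is bounded by the program size, so the worst case is linear time and constant space, exactly the guarantee RE2 makes.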


The goal with this article was never to show the best or most efficient way to match regular expressions, but rather to convey the correct interpretation of what a given regular expression will do when you try to use it. Even discussions of slow and simple backtracking regular expression matchers are a pretty niche topic and likely too much for casual users of regular expressions.

Yes, I was thinking about also allowing it to optionally cross-compile to the more asymptotically efficient matching algorithm, also documented by Russ Cox. Everything takes time though, and I figured it would make sense to put this out there before investing too much time in something that possibly no one would use.


Sure, but automata-based engines are also "correct" and used in common tools like grep, awk, and sed.

It would be nice to make that clear in the article.

-----

It's a matter of preference, but to me debugging by backtracking is a pointless exercise. It's easier to reason about regexes as sets of strings, composed by sequence and alternation.

Write unit tests that enumerate what you accept and reject, and build them up by composition. There's no need for debugging or tracing.
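For instance (a made-up sketch in Python; the pattern names are mine):

```python
import re

# Build a pattern by composition, then test it with accept/reject lists.
INT    = r'[0-9]+'
FLOAT  = rf'{INT}\.{INT}'        # sequence: digits, dot, digits
NUMBER = rf'(?:{FLOAT}|{INT})'   # alternation
pat = re.compile(rf'^{NUMBER}$')

for accept in ['0', '42', '3.14']:
    assert pat.match(accept), accept
for reject in ['', '3.', '.5', 'x1']:
    assert not pat.match(reject), reject
```

Each sub-pattern can get its own accept/reject list too, so when a composed pattern fails, the failing piece identifies itself.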

I understand that the imperative style is more natural to many programmers, but I don't think it requires too much of a leap of thinking. And once you get used to the other style, it saves effort overall (as well as bounding your worst case running time).


Agreed on the point that there are better mental models out there. The difficulty is to make content that is simultaneously correct, but also not over-complicated. I finished off the post with a reference to Russ Cox's work so any sufficiently motivated individual will find the right models there.


How does the "sets of strings" model express the difference between:

   /x*/
and

   /x*?/


Update: I tested it out a little more here.

https://github.com/oilshell/blog-code/blob/master/regular-la...

Basically everything you want with regard to greedy-nongreedy can be done through automata-based engines.

They jump through a lot of hoops to make them work!

However I would also argue that when you're using them, you're doing it as a performance hack in a Perl-style regex engine. In an automata-based engine, you can just write

    /x*/
    /(x*)/
and it's what you want, and it runs fast.
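For reference, the behavioral difference between the two in a Perl-style engine (a quick check in Python's re; my example):

```python
import re

# Both patterns accept exactly the same set of strings; they differ
# only in which match is *preferred* when several are possible.
assert re.match(r'x*', 'xxx').group() == 'xxx'   # greedy: longest match
assert re.match(r'x*?', 'xxx').group() == ''     # non-greedy: shortest match
```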


Good question, and here's something I wrote a couple weeks ago to the author of the Regex Lingua Franca paper [1]:

I admit I occasionally use non-greedy repetition in Python, but I think that is the only construct I use that's not in the shell-ish engines. I believe it's usually pretty easy to rewrite without non-greedy, but I haven't tested that theory.
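A typical rewrite, sketched in Python: replace the non-greedy skip with a negated character class, which POSIX-style engines like awk can express directly.

```python
import re

text = '<b>bold</b> text'
# Non-greedy skip, as you'd write it in Python or Perl:
assert re.search(r'<.*?>', text).group() == '<b>'
# Equivalent rewrite without non-greedy, usable in awk/grep-style engines:
assert re.search(r'<[^>]*>', text).group() == '<b>'
```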

----

OK so I started testing that theory. I collected all the times I used

    .*?
https://github.com/oilshell/blog-code/blob/master/regular-la...

On cursory inspection, most of them are unnecessary in an automata-based "regular language" engine.

They are "nice to have" in a Perl-style regex engine for PERFORMANCE. But automata-based engines don't have that performance problem. They don't backtrack.

The distinction between .* and .*? doesn't affect whether it matches or not, by definition. It does affect how much it backtracks. (And it affects where the submatches are, which I need to test out.)

So I think most of my own usages ARE "nice to have" in Python, but unnecessary in awk because it uses an automata-based engine.

But I'd like to look at it more carefully, and I'm interested in answering this question more thoroughly (feel free to contact me about it, e-mail in profile).

I started a demo here. Still needs work.

https://github.com/oilshell/blog-code/blob/master/regular-la...

----

So here's another conjecture: regexes are bad because they force you to make UNNECESSARY DISTINCTIONS in the name of performance.

You have to care about greedy vs. nongreedy, because there is backtracking.

In contrast, with regular languages, you just write what you mean, and they do it quickly for you. You can't write a slow expression for a regular language.
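A quick way to see the difference (my example, not one from Cox's articles): an ambiguous pattern like (a+)+$ sends a backtracking engine into an exponential search on a failing input, while an automata-based engine would reject the same input in one linear pass.

```python
import re

# (a+)+$ is ambiguous: a backtracking engine tries every way to split
# the run of a's between the inner and outer repetition before giving up.
# Keep the input short -- the work roughly doubles with every extra 'a'.
s = 'a' * 18 + 'b'
assert re.match(r'(a+)+$', s) is None  # correct answer, reached slowly
```

In a regular-language engine the same pattern is just a* from the matching point of view, and the ambiguity costs nothing.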

There are some caveats around that, like submatch extraction with nongreedy repetition, which I will test out. If anyone has specific examples or use cases, send them my way.

I can't see where I really rely on nongreedy submatches, but I'll have to look more closely. I think the nongreedy use case is usually just "skip quickly until this other string appears", e.g. the one in pgen2/tokenize.py that I didn't write.

----

As far as I remember, submatch extraction works fine in automata-based engines, and Cox's articles are in large part about the (then) little-known algorithms for doing it in automata-based engines. re2c also has a 2017 paper on their algorithm for doing it in DFAs:

http://re2c.org/2017_trofimovich_tagged_deterministic_finite...

----

[1] thread: https://news.ycombinator.com/item?id=23687478


I don't agree it's primarily about performance. /".+"/ will find a quoted string, but you will certainly prefer the non-greedy match instead.

Now you may torture the greedy regex to make it work (/"[^"]+"/), but the point stands. "Did it match" is rarely enough; the regex author usually wishes to know where it matched, and to exercise control over which match is preferred if there are multiple possibilities. It's indeed possible to do this with DFAs but the theory is comparatively hard, as a cursory reading of your linked paper will show!
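To make the quoted-string example concrete (in Python):

```python
import re

s = 'say "foo" then "bar"'
# Greedy .+ over-matches across both strings:
assert re.search(r'".+"', s).group() == '"foo" then "bar"'
# Non-greedy stops at the first closing quote:
assert re.search(r'".+?"', s).group() == '"foo"'
# The "tortured" greedy rewrite gets the same answer:
assert re.search(r'"[^"]+"', s).group() == '"foo"'
```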


Yeah that is a good example, and I recorded it here:

https://github.com/oilshell/blog-code/blob/master/regular-la...

I don't mind the rewrite, but I'm not surprised that some would be inconvenienced by it. (Though that is part of the reason for the Eggex syntax.)

----

On the other side I'll offer that I wrote a lexer for bash and all its sublanguages without any nongreedy repetitions. It's dozens if not hundreds of regexes.

https://github.com/oilshell/oil/blob/master/frontend/lexer_d...

Though, regular languages are more like a building block for the lexer than the lexer itself. As mentioned elsewhere, they're too limited to express the entire lexer. They're too limited to express syntax highlighters (but so are regexes! They also need some logic outside the model, e.g. for lexing shell here docs)

-----

So I pose the question of whether there's a more powerful abstraction that's still linear time and constant space:

http://www.oilshell.org/blog/2020/07/ideas-questions.html#mo...

Of course there is, but it's a matter of picking one that will address a lot of use cases effectively.

I think re2c has grown something like

    .*?
for example. grep needs something like that too. Often you want to be able to skip input in a linear fashion, without invoking the pattern matching logic at all.

Regexes are OK if you know how to use them, but evidently most people have lots of problems with them (the necessity of OP's demo being an example of that). I think there's something better out there built on top of regular languages.


You hit upon an important cleavage: features vs. worst-case performance. A generous lexer supports partial input, syntax highlighting, helpful error messages, etc., while a strict one is not so nice, but guarantees protection against malicious input. But the way you write these is very different! I also would like to write one lexer that can operate in both modes, but with one syntax.

BTW, high five from one shell author to another :)


Yeah I've been looking at fish. Definitely has the cleanest code out of any shell I've looked at :) (e.g. bash, dash, zsh, mksh)

And I came to realize what a huge effort it is on top of the language:

http://www.oilshell.org/blog/2020/02/recap.html#fish-oil-is-...

Also I wonder if you knew of the existence of ble.sh, a fish-like UI written in pure bash! I think it's the biggest shell program in the world:

http://www.oilshell.org/cross-ref.html?tag=ble.sh#ble.sh

https://github.com/oilshell/oil/wiki/The-Biggest-Shell-Progr...

----

I did add a little hack in the lexer for autocompletion of partial input. My hope, though, is to keep most of that logic in the parser and not the lexer. It is true I was thinking of breaking up the exact regex we talked about for similar reasons, e.g. a single token:

    ' [^']* '
vs. 3 tokens

    '
    [^']*
    '

I think the latter could be better for completion of paths in quotes, which bash does, but I haven't revisited it yet.
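A sketch of the difference (hypothetical patterns in Python, not the actual lexer code):

```python
import re

# One-token style: the whole quoted string is a single token.
ONE = re.compile(r"'[^']*'")
# Three-token style: open quote, body, close quote are separate tokens.
THREE = re.compile(r"'|[^']+")

assert ONE.match("'abc'").group() == "'abc'"
assert [m.group() for m in THREE.finditer("'abc'")] == ["'", 'abc', "'"]

# On partial input, the one-token lexer has nothing to offer, while the
# three-token lexer still sees an open quote plus a prefix to complete:
assert ONE.match("'ab") is None
assert [m.group() for m in THREE.finditer("'ab")] == ["'", 'ab']
```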



