safe regexp escape.md

September 26, 2023

Suppose we have a RegExp.escape which escapes:

  • every ASCII punctuator except _, i.e. (){}[]|,.?*+-^$=<>\/#&!%:;@~'"`.
  • whitespace
  • 0-9 if at the start of the string (with a hex escape)

And we make \- and other currently-illegal escape sequences which would be produced by this function legal in u/v-mode RegExps (to mean the unescaped char), including inside of character classes.
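
The escaping rules above can be sketched roughly as follows. This is an illustration, not spec text: the name `escapeSketch` and the helper `hexEscape` are hypothetical, and hex-escaping whitespace is an assumption (the text only specifies a hex escape for leading digits). The round-trip check uses a non-`u` RegExp, where sequences such as `\-` are already accepted as identity escapes.

```javascript
// ASCII punctuators other than "_":
// 0x21-0x2F, 0x3A-0x40, 0x5B-0x5E, 0x60, 0x7B-0x7E.
const PUNCTUATOR = /[!-\/:-@\[-\^`{-~]/;

// Hypothetical helper: \xHH for code points up to 0xFF, \uHHHH otherwise.
function hexEscape(ch) {
  const code = ch.codePointAt(0);
  return code <= 0xff
    ? '\\x' + code.toString(16).padStart(2, '0')
    : '\\u' + code.toString(16).padStart(4, '0');
}

// Hypothetical sketch of the escape function described above.
function escapeSketch(str) {
  let out = '';
  let index = 0;
  for (const ch of str) {
    if (index === 0 && /[0-9]/.test(ch)) {
      out += hexEscape(ch);   // digit at the start of the string
    } else if (/\s/.test(ch)) {
      out += hexEscape(ch);   // whitespace (assumption: hex-escaped)
    } else if (PUNCTUATOR.test(ch)) {
      out += '\\' + ch;       // punctuator: backslash + the character itself
    } else {
      out += ch;              // everything else passes through unchanged
    }
    index++;
  }
  return out;
}

console.log(escapeSketch('1+1 = 2')); // \x31\+1\x20\=\x202

// Round trip in non-u mode: the escaped text matches exactly the original string.
const input = 'a-b (c)';
console.log(new RegExp('^' + escapeSketch(input) + '$').test(input)); // true
```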

And you don't put the output in a place where it would obviously mean something else, i.e. not

  • immediately after \x, \x0, \u00, \c, etc
  • immediately after an odd number of backslashes
  • in (?${here}:asdf) (because of regexp modifiers)

Then escape is safe, i.e. it cannot lead to context escapes.
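
The placement restrictions matter because the surrounding template can change what the escaped output means. A small sketch of the "odd number of backslashes" case, using a hand-written string for what the hypothetical `RegExp.escape` would return for `.`:

```javascript
// What the hypothetical RegExp.escape described above would return for ".".
const escapedDot = '\\.';

// Safe placement: the escaped output stands on its own and matches a literal dot.
const safe = new RegExp('^a' + escapedDot + 'b$');
console.log(safe.test('a.b')); // true
console.log(safe.test('axb')); // false

// Unsafe placement: the template ends with a single backslash, which pairs up
// with the backslash from the escaped output and leaves the "." bare.
// The resulting pattern is ^a\\.b$ : a literal backslash, then any character.
const unsafe = new RegExp('^a\\' + escapedDot + 'b$');
console.log(unsafe.test('a\\xb')); // true
console.log(unsafe.test('a.b'));   // false
```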

Specifically, we have the following contexts:

| context | cannot leave context because |
| --- | --- |
| "base" context | trivial |
| character class | can't output unescaped `]`, `^`, `-`, `&`, `\` (etc) |
| `(...)` group | can't output unescaped `)` or `?` |
| `\u{...}` | can't output unescaped `}` |
| `\k<...>` | can't output unescaped `>` |
| `(?<...>)` | can't output unescaped `>` |
| `foo{...}` | can't output unescaped `}` or `,` |
| `\p{...}` | can't output unescaped `}` or `=` |
| `\q{...}` | can't output unescaped `}` or `\|` |
| after `\1` | numbers at the start of strings are escaped |
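
Two of these contexts can be exercised in engines today. The sketch below uses `escapePunctuators`, a hypothetical and simplified stand-in for the escape function that handles only ASCII punctuators (no whitespace or leading digits), and builds non-`u` RegExps so that every escape it produces is already legal.

```javascript
// Simplified stand-in (assumption): backslash-escape every ASCII punctuator except "_".
const escapePunctuators = (s) => s.replace(/[!-\/:-@\[-\^`{-~]/g, '\\$&');

// Character class: "-" and "]" arrive escaped, so no range is formed and the
// class is not closed early. The pattern is ^[a\-z\]]$.
const cls = new RegExp('^[' + escapePunctuators('a-z]') + ']$');
console.log(cls.test('-')); // true
console.log(cls.test(']')); // true
console.log(cls.test('m')); // false: "a-z" did not become a range

// Group: ")" , "|" and "(" arrive escaped, so the input cannot close the group
// or introduce an alternative. The pattern is ^(x\)\|\(y)$.
const grp = new RegExp('^(' + escapePunctuators('x)|(y') + ')$');
console.log(grp.test('x)|(y')); // true
console.log(grp.test('x'));     // false
```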

And the following proposed future contexts:

| context | cannot leave context because |
| --- | --- |
| `(?#...)` | can't output unescaped `)` |
| `#...` line comments | can't output unescaped line terminator |
| x-mode regexps | can't output unescaped whitespace |
| `(?(...)...)` conditions | can't output unescaped `)` or `\|` |
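
All of the rows above rest on one property: the escaped output never contains whitespace or an ASCII punctuator that is not part of an escape sequence. A rough checker for that property, where the name `hasUnescapedDelimiter` is hypothetical and the inputs are hand-written examples of escaped output:

```javascript
// Returns true if `escaped` contains whitespace or an ASCII punctuator
// (other than "_") that is not immediately preceded by an escaping backslash.
function hasUnescapedDelimiter(escaped) {
  const DELIMITER = /[\s!-\/:-@\[-\^`{-~]/;
  for (let i = 0; i < escaped.length; i++) {
    if (escaped[i] === '\\') {
      i++; // skip the character that this backslash escapes
      continue;
    }
    if (DELIMITER.test(escaped[i])) return true;
  }
  return false;
}

// Escaped form of "a #b\nc)": no raw whitespace, "#", line terminator or ")".
console.log(hasUnescapedDelimiter('a\\x20\\#b\\x0ac\\)')); // false

// The raw text would end a line comment and close a group.
console.log(hasUnescapedDelimiter('a #b\nc)')); // true
```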

This would be a commitment to only entering/exiting new contexts using whitespace or ASCII punctuators. That seems like it will not be a significant impediment to language evolution.