safe regexp escape.md
September 26, 2023
Suppose we have a `RegExp.escape` which escapes:
- every ASCII punctuator except `_`, i.e. `` (){}[]|,.?*+-^$=<>\/#&!%:;@~'"` ``
- whitespace
- `0-9` if at the start of the string (with a hex escape)
And we make `\-` and other currently-illegal escape sequences which would be produced by this function legal in `u`/`v`-mode RegExps (to mean the unescaped char), including inside of character classes.
And you don't put the output in a place where it would obviously mean something else, i.e. not
- immediately after `\x`, `\x0`, `\u00`, `\c`, etc
- immediately after an odd number of backslashes
- in `(?${here}:asdf)` (because of regexp modifiers)
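Then `RegExp.escape` is safe, i.e. it cannot lead to context escapes: its output cannot terminate or otherwise break out of the syntactic context it is interpolated into.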
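The rules above could be sketched roughly as follows. This is only an illustration, not spec text: `escapeSketch`, `hexEscape`, and `PUNCTUATORS` are hypothetical names, and the choice of `\xHH`/`\uHHHH` for whitespace is an assumption (the rules above only require a hex escape for leading digits). The demonstration at the end uses non-`u` mode, where the backslash escapes are already legal today.

```javascript
// Hypothetical sketch of the escaping rules; names are illustrative only.
// Every ASCII punctuator except "_".
const PUNCTUATORS = new Set("(){}[]|,.?*+-^$=<>\\/#&!%:;@~'\"`");

// Assumption: \xHH for code points up to 0xFF, \uHHHH otherwise.
function hexEscape(ch) {
  const code = ch.codePointAt(0);
  return code <= 0xff
    ? "\\x" + code.toString(16).padStart(2, "0")
    : "\\u" + code.toString(16).padStart(4, "0");
}

function escapeSketch(str) {
  let out = "";
  let first = true;
  for (const ch of str) {
    if (first && ch >= "0" && ch <= "9") {
      out += hexEscape(ch);   // leading digit: hex escape
    } else if (PUNCTUATORS.has(ch)) {
      out += "\\" + ch;       // punctuator: backslash escape
    } else if (/\s/.test(ch)) {
      out += hexEscape(ch);   // whitespace: hex escape (assumed form)
    } else {
      out += ch;              // everything else passes through
    }
    first = false;
  }
  return out;
}

// Base context: the escaped dot matches only a literal ".".
const safe = new RegExp("^a" + escapeSketch(".") + "b$");

// Directly after a lone backslash: the two backslashes pair up into an
// escaped backslash, and the dot is a wildcard again.
const unsafe = new RegExp("^a\\" + escapeSketch(".") + "b$");
```

The `unsafe` pattern shows why the "odd number of backslashes" restriction is needed: its source is `^a\\.b$`, which matches `a`, a literal backslash, any character, then `b`.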
Specifically, we have the following contexts:
| context | cannot leave context because |
|---|---|
| "base" context | trivial |
| character class | can't output unescaped `]`, `^`, `-`, `&`, `\` (etc) |
| `(...)` group | can't output unescaped `)` or `?` |
| `\u{...}` | can't output unescaped `}` |
| `\k<...>` | can't output unescaped `>` |
| `(?<...>)` | can't output unescaped `>` |
| `foo{...}` | can't output unescaped `}` or `,` |
| `\p{...}` | can't output unescaped `}` or `=` |
| `\q{...}` | can't output unescaped `}` or `\|` |
| after `\1` | numbers at the start of strings are escaped |
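As a small illustration of the character class row, the sketch below compares interpolating the input `a-z]` into a class with and without escaping. The `escaped` string is written by hand as a stand-in for what the hypothetical escape function would return, and the patterns use non-`u` mode so that they run in current engines.

```javascript
// Hand-written stand-in for the escaped form of the input "a-z]".
const raw = "a-z]";
const escaped = "a\\-z\\]";

// Source: ^[a\-z\]]+$  (the class contains exactly a, -, z and ])
const withEscape = new RegExp("^[" + escaped + "]+$");

// Source: ^[a-z]]+$  (the input closed the class early: a range a-z,
// followed by one or more literal "]")
const withoutEscape = new RegExp("^[" + raw + "]+$");

console.log(withEscape.test("a-z]"));    // true: every char is a class member
console.log(withEscape.test("b"));       // false: no a-z range was created
console.log(withoutEscape.test("b]"));   // true: the input left the class context
```

Without escaping, the `]` in the input ends the class and the `-` forms a range; with escaping, neither character has any syntactic effect.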
And the following proposed future contexts:
| context | cannot leave context because |
|---|---|
| `(?#...)` | can't output unescaped `)` |
| `#...` line comments | can't output unescaped line terminator |
| `x`-mode regexps | can't output unescaped whitespace |
| `(?(...)...)` conditions | can't output unescaped `)` or `\|` |
This would be a commitment to only entering/exiting new contexts using whitespace or ASCII punctuators. That seems like it will not be a significant impediment to language evolution.