Escape codes \x80 through \xFF get encoded as two bytes #989
The compiler supports escape codes like "\x40", turning two arbitrary hexadecimal digits into one byte in the resulting string. This is useful for inserting non-printable characters, like "\x00" for a null character, but it behaves strangely when given 'extended ASCII' values, from 128 through to 255, or \x80 to \xFF.
The expectation would be that these byte values would be inserted faithfully, and that if the user wanted a multi-byte UTF-8 sequence, they could simply enter each byte of the UTF-8 encoding directly. Instead, the compiler takes the extended ASCII value, interprets it as a Unicode code point between 128 and 255, and encodes that in UTF-8, which then takes two bytes instead of one. This makes no sense: the \x escape only accepts two hex digits, so it clearly isn't meant to represent a Unicode code point, but a single byte.

Consider the code point U+0100, which in UTF-8 is represented as <<16#C4, 16#80>>. If we try to enter it into a string literal as "\x0100", it is interpreted as U+0001, U+0030, U+0030, i.e. <<1, $0, $0>>. Whereas if we enter the UTF-8 bytes ourselves, manually, as "\xC4\x80", that is interpreted as U+00C4, U+0080, producing <<16#C3, 16#84, 16#C2, 16#80>>, so the bytes get re-encoded no matter what we write. Instead, these hex escape codes should not be passed through any Unicode encoding at all, and should just be dumped faithfully into the resulting binary/immediate.
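For reference, the behaviour described above is easy to reproduce in Python, whose `str.encode` applies exactly the code-point-then-UTF-8 interpretation this compiler seems to be using. Python is only an illustration of the encoding arithmetic here, not the compiler in question:

```python
# The compiler's (mis)behaviour: treat \x80 as code point U+0080
# and UTF-8 encode it, yielding two bytes instead of one.
doubled = "\x80".encode("utf-8")
assert doubled == b"\xc2\x80"          # two bytes: <<16#C2, 16#80>>

# What the escape should mean: one raw byte.
expected = bytes([0x80])
assert len(expected) == 1              # one byte: <<16#80>>

# The U+0100 example. Its real UTF-8 encoding is C4 80:
assert "\u0100".encode("utf-8") == b"\xc4\x80"

# "\x0100" parses as \x01 followed by the literal characters "00",
# i.e. <<1, $0, $0>>:
assert "\x0100".encode("utf-8") == bytes([1, ord("0"), ord("0")])

# And "\xC4\x80" parses as code points U+00C4, U+0080, which UTF-8
# re-encodes to FOUR bytes instead of the intended two:
assert "\xC4\x80".encode("utf-8") == b"\xc3\x84\xc2\x80"
```

So under this interpretation there is no sequence of \x escapes that can place a single byte in the range 128-255 into the output.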
It seems the intended design is that U+0100 should be written as "\x{0100}", with the curly braces included explicitly in the string literal. That syntax doesn't work in this compiler either, though, so perhaps that deserves a separate issue?