Escape codes \x80 through \xFF get encoded as two bytes #989

Open
opened 2026-01-29 15:44:49 +09:00 by spivee · 1 comment

The compiler supports escape codes like "\x40", turning two hexadecimal digits into one byte in the resulting string. This is useful for inserting non-printable characters, like "\x00" for a null character, but it behaves strangely when given 'extended ASCII' values, from 128 through 255, i.e. \x80 to \xFF.

The expectation would be that these byte values would be inserted faithfully, and if the user wanted to escape a complex UTF-8 sequence, they could simply enter each byte of the UTF-8 encoding directly. Instead, the compiler takes the extended ASCII value, interprets it as a Unicode code point between 128 and 255, and encodes *that* in UTF-8, which then takes two bytes instead of one. This is nonsense: the \x escape only accepts two hex digits, so it clearly isn't meant to represent a Unicode code point, but a single byte. Consider the code point U+0100, which in UTF-8 is represented as <<16#C4, 16#80>>. If we try to enter this into a string literal as "\x0100", it is interpreted as U+0001, U+0030, U+0030, i.e. <<1, $0, $0>>. Whereas if we try to enter the UTF-8 bytes ourselves, manually, as "\xC4\x80", they are interpreted as U+00C4, U+0080, producing <<16#C3, 16#84, 16#C2, 16#80>>, ad infinitum!
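The double-encoding can be sketched in Python (not Sophia; the function names here are purely illustrative, not the compiler's actual lexer functions):

```python
def escape_buggy(hex_digits: str) -> bytes:
    """Current behaviour: treat the \\xNN value as a Unicode code
    point and UTF-8-encode it, so values >= 0x80 become two bytes."""
    return chr(int(hex_digits, 16)).encode("utf-8")

def escape_expected(hex_digits: str) -> bytes:
    """Expected behaviour: emit the byte verbatim."""
    return bytes([int(hex_digits, 16)])

# "\xC4\x80" is meant to spell out the UTF-8 encoding of U+0100,
# but the current behaviour re-encodes each byte as a code point:
assert escape_buggy("C4") + escape_buggy("80") == b"\xc3\x84\xc2\x80"   # 4 bytes
assert escape_expected("C4") + escape_expected("80") == b"\xc4\x80"     # 2 bytes
```

This reproduces exactly the <<16#C3, 16#84, 16#C2, 16#80>> result described above.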

Instead these hex escape codes should not be passed to any unicode functions at all, and should just be dumped faithfully into the resulting binary/immediate.


It seems the intended design is that U+0100 is written as "\x{0100}", with the curly braces included explicitly in the string literal. That doesn't *work* in this compiler, though, so maybe there should be a separate issue for that?
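Under that design, the two forms could coexist: braced "\x{...}" escapes denote a Unicode code point (UTF-8-encoded into the string), while bare "\xNN" escapes denote one raw byte. A hypothetical lexer-rule sketch in Python (again, not the compiler's actual code):

```python
import re

def expand_escapes(literal: str) -> bytes:
    """Expand \\x{...} (code point, UTF-8-encoded) and \\xNN (raw byte)
    escapes in a string literal; other characters pass through as UTF-8."""
    out = bytearray()
    i = 0
    while i < len(literal):
        m = re.match(r"\\x\{([0-9A-Fa-f]+)\}", literal[i:])
        if m:  # braced form: a Unicode code point, encode as UTF-8
            out += chr(int(m.group(1), 16)).encode("utf-8")
            i += m.end()
            continue
        m = re.match(r"\\x([0-9A-Fa-f]{2})", literal[i:])
        if m:  # bare form: exactly two hex digits, one verbatim byte
            out.append(int(m.group(1), 16))
            i += m.end()
            continue
        out += literal[i].encode("utf-8")
        i += 1
    return bytes(out)

# Both spellings of U+0100 would then produce the same two bytes:
assert expand_escapes(r"\x{0100}") == b"\xc4\x80"
assert expand_escapes(r"\xC4\x80") == b"\xc4\x80"
```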

Reference: QPQ-AG/sophia#989