Files with invalid UTF-8 data are cut off

Created by 12Me21:
If you hack a file containing invalid characters into SB, it will only show the part of the file up to the first invalid character. Also, it will FREEZE SmileBASIC if you try to PRGEDIT a line that comes after that character. L4CNN3Z4 (the crossed-out 4 is still a regular 4 ;( )

If you ask me, it's not a SmileBASIC bug if you have to use external tools/modding to cause it.

What is an 'invalid character'?

I think he meant the null character, which is 0000 in hex in UTF-16.

The character that caused it was an accented letter (part of some Japanese text that got corrupted somehow). It is a SmileBASIC bug because you can download the file from the server and trigger the bug without hacking.

If I remember correctly, TXTs are encoded as UTF-8 externally. It would be really easy to somehow write a token outside the UTF-16 range, so SB should ideally handle it. I assume that's what happened here.
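To illustrate the kind of failure being described, here is a rough Python sketch, not SmileBASIC's actual loader. It assumes the file is decoded as UTF-8 and that the bad character is a lone accented-letter byte (0xE9) left over from a different encoding; both the sample data and the function names are made up for the example. One function stops at the first invalid byte, which matches the "file is cut off" symptom, and the other swaps the bad byte for U+FFFD and keeps going.

```python
def decode_prefix(data: bytes) -> str:
    """Decode UTF-8 but stop at the first invalid byte (the truncation symptom)."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the index of the first byte that could not be decoded.
        return data[:err.start].decode("utf-8")


def decode_lenient(data: bytes) -> str:
    """Decode UTF-8, replacing each invalid sequence with U+FFFD instead of stopping."""
    return data.decode("utf-8", errors="replace")


# Hypothetical file contents: 0xE9 is an accented letter in Latin-1, but on its
# own it is not valid UTF-8 (it starts a 3-byte sequence that never continues).
sample = b'PRINT "caf' + b"\xe9" + b'"\nPRINT 2\n'

truncated = decode_prefix(sample)   # everything after the bad byte is lost
repaired = decode_lenient(sample)   # the rest of the file survives
```

With the lenient version the second line of the program is still there, so an editor would have something sane to work with.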

UTF-16 is still supposed to handle all Unicode characters.
Q: What is UTF-16? A: UTF-16 uses a single 16-bit code unit to encode the most common 63K characters, and a pair of 16-bit code units, called surrogates, to encode the 1M less commonly used characters in Unicode. Originally, Unicode was designed as a pure 16-bit encoding, aimed at representing all modern scripts. (Ancient scripts were to be represented with private-use characters.) Over time, and especially after the addition of over 14,500 composite characters for compatibility with legacy sets, it became clear that 16-bits were not sufficient for the user community. Out of this arose UTF-16.
Basically, Unicode will never support character codes higher than 1,114,111 (10FFFF in hex), and all encodings of Unicode (UTF-8, UTF-16, and UTF-32) are able to represent ALL Unicode characters.
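As a quick check of the surrogate-pair idea from the FAQ quote, here is a small Python sketch. The helper name to_surrogates is just for illustration. It splits a code point above FFFF into its two 16-bit code units, then confirms that the highest code point, 1,114,111, round-trips through all three encodings.

```python
def to_surrogates(cp: int):
    """Split a code point in 0x10000..0x10FFFF into a UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point does not need a surrogate pair")
    offset = cp - 0x10000              # 20 bits remain after the subtraction
    high = 0xD800 + (offset >> 10)     # top 10 bits go in the high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits go in the low surrogate
    return high, low


MAX_CP = 0x10FFFF  # 1,114,111, the highest code point Unicode allows

high, low = to_surrogates(MAX_CP)
utf16_bytes = chr(MAX_CP).encode("utf-16-be")

# The same character survives a round trip through every encoding form.
round_trips = [
    chr(MAX_CP).encode(enc).decode(enc) == chr(MAX_CP)
    for enc in ("utf-8", "utf-16-be", "utf-32-be")
]
```

Here high and low come out as DBFF and DFFF, the last possible surrogate pair, which is why nothing above 10FFFF can ever be encoded in UTF-16.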

I've always disliked Unicode. It's so messy and disorganized, and there are many symbols that look identical. Very few fonts have 100% Unicode support, and I STILL can't find a symbol I need sometimes.

Well, they don't really have a choice. The goal is to have a single code for every language in the world, and some languages just have completely identical-looking characters with completely different meanings. I don't really know about messy and disorganized, you're probably right about that, and fonts pretty much never try to support more than two or three languages (one of them nearly always being English). But there's really no better alternative. I think it's amazing that we have one single code that can be shared with anyone in the world and any software without losing any data. Before Unicode, everyone had a completely different system.

Gah, Unicode makes my head spin... If they encounter a character out of their range they could just cap it at &HFFFF or remove it...or something.
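For what it's worth, the "cap it or remove it" idea is only a few lines. This is a Python sketch of the suggestion, not anything SmileBASIC is known to do, and squash_to_bmp is a made-up name. It substitutes U+FFFD (the standard replacement character) rather than literally capping at &HFFFF, since FFFF is not a real character; pass an empty string to drop the character instead.

```python
def squash_to_bmp(text: str, replacement: str = "\ufffd") -> str:
    """Replace every character above U+FFFF so the text fits in single 16-bit units."""
    return "".join(ch if ord(ch) <= 0xFFFF else replacement for ch in text)


# U+1F600 is above FFFF, so it would need a surrogate pair in UTF-16.
replaced = squash_to_bmp("A\U0001F600B")      # bad character swapped for U+FFFD
removed = squash_to_bmp("A\U0001F600B", "")   # bad character dropped entirely
```

Either way the rest of the text stays intact instead of being cut off.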