Files with invalid UTF-8 data are cut off

12Me21 Created: ~9 years ago

If you hack a file so that it contains invalid characters, SB will only show the file up to the first invalid character. It will also FREEZE SmileBASIC if you try to PRGEDIT a line that comes after that character. ~~L4CNN3Z4~~ Crossed out 4 is still regular 4 ;(

~9 years ago (Edited ~8 years ago by 12Me21)

NeatNit #2

If you ask me, it's not a SmileBASIC bug if you have to use external tools/modding to cause it.

~9 years ago

SquareFingers #3

What is an 'invalid character'?

~9 years ago

NeatNit #4

I think he meant the null character, which is 0000 in hex in UTF-16.

~9 years ago (Edited ~9 years ago by NeatNit)

12Me21 #5

The character that caused it was an accented letter (part of some Japanese text that got corrupted somehow). It is a SmileBASIC bug because you can download the file from the server and trigger the bug without hacking.
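To picture the symptom, here is a minimal Python sketch of a loader that stops at the first invalid UTF-8 byte, next to one that substitutes a replacement character and keeps going. This is only a model of the behavior described above, not SmileBASIC's actual loader. The sample bytes and both function names are made up for illustration, and the lone `0xE9` byte (an accented "é" in Latin-1) is just an assumed example of how an accented letter could end up as invalid UTF-8.

```python
def decode_until_invalid(data: bytes) -> str:
    """Decode UTF-8 and keep only the text before the first invalid byte."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the byte offset of the first undecodable byte.
        return data[:err.start].decode("utf-8")


def decode_with_replacement(data: bytes) -> str:
    """Decode UTF-8, turning each invalid byte into U+FFFD and continuing."""
    return data.decode("utf-8", errors="replace")


# Hypothetical file contents: 0xE9 followed by '"' is not valid UTF-8.
sample = b'PRINT "caf\xe9"\nPRINT "second line"\n'

truncated = decode_until_invalid(sample)
recovered = decode_with_replacement(sample)
```

With the first function, everything after the bad byte is lost, including the second line. With the second, the rest of the file survives and only the damaged character is replaced.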

~9 years ago

snail_ #6

If I remember correctly, TXTs are encoded as UTF-8 externally. It would be really easy to somehow write a token outside the UTF-16 range, so SB should ideally handle it. I assume that's what happened here.

~9 years ago

NeatNit #7

UTF-16 is still supposed to handle all Unicode characters.

Q: What is UTF-16? A: UTF-16 uses a single 16-bit code unit to encode the most common 63K characters, and a pair of 16-bit code units, called surrogates, to encode the 1M less commonly used characters in Unicode. Originally, Unicode was designed as a pure 16-bit encoding, aimed at representing all modern scripts. (Ancient scripts were to be represented with private-use characters.) Over time, and especially after the addition of over 14,500 composite characters for compatibility with legacy sets, it became clear that 16-bits were not sufficient for the user community. Out of this arose UTF-16.

Basically, Unicode will never support character codes higher than 1,114,111 (10FFFF in hex), and all of the Unicode encodings (UTF-8, UTF-16, and UTF-32) are able to represent ALL Unicode characters.
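As a quick check of the surrogate arithmetic, here is a small Python sketch that splits a code point above FFFF into a UTF-16 surrogate pair and round-trips one character through all three encodings. The helper name `to_surrogates` and the sample character are chosen for illustration only.

```python
MAX_CODE_POINT = 0x10FFFF  # 1,114,111, the highest Unicode code point


def to_surrogates(cp: int) -> tuple:
    """Split a code point above 0xFFFF into a (high, low) UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= MAX_CODE_POINT:
        raise ValueError("code point does not need a surrogate pair")
    v = cp - 0x10000                     # 20-bit value
    high = 0xD800 + (v >> 10)            # top 10 bits
    low = 0xDC00 + (v & 0x3FF)           # bottom 10 bits
    return high, low


ch = chr(0x1F600)  # a character outside the 16-bit range
pair = to_surrogates(ord(ch))

# Every encoding can represent the character and decode it back unchanged.
round_trips = {
    enc: ch.encode(enc).decode(enc) == ch
    for enc in ("utf-8", "utf-16-le", "utf-32-le")
}
```

Python itself refuses anything past the limit: `chr(MAX_CODE_POINT + 1)` raises `ValueError`.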

~9 years ago

12Me21 #8

I've always disliked Unicode. It's so messy and disorganized, and there are many symbols that look identical. Very few fonts have 100% Unicode support, and I STILL can't find a symbol I need sometimes.

~9 years ago

NeatNit #9

Well, they don't really have a choice. The goal is to have a single code for every language in the world, and some languages just have completely identical-looking characters with completely different meanings. I don't really know about messy and disorganized, you're probably right about that, and fonts pretty much never try to support more than two or three languages (one of them nearly always being English). But there's really no better alternative. I think it's amazing that we have one single code that can be shared with anyone in the world and any software without losing any data. Before Unicode, everyone had a completely different system.

~9 years ago

snail_ #10

Gah, Unicode makes my head spin... If they encounter a character out of their range they could just cap it at &HFFFF or remove it...or something.
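The "cap it or remove it" idea could look something like this in Python. This is a sketch of the suggestion only, not anything SmileBASIC is known to do, and the function name `clamp_to_bmp` and its `mode` parameter are invented for the example.

```python
def clamp_to_bmp(text: str, mode: str = "cap") -> str:
    """Handle characters above 0xFFFF by capping them at U+FFFF or removing them."""
    if mode not in ("cap", "remove"):
        raise ValueError("mode must be 'cap' or 'remove'")
    out = []
    for ch in text:
        if ord(ch) <= 0xFFFF:
            out.append(ch)          # fits in one 16-bit unit, keep as-is
        elif mode == "cap":
            out.append("\uffff")    # cap at &HFFFF
        # mode == "remove": drop the character entirely
    return "".join(out)


capped = clamp_to_bmp("A\U0001F600B", mode="cap")
removed = clamp_to_bmp("A\U0001F600B", mode="remove")
```

Either way the rest of the text stays readable. Note that U+FFFF is a noncharacter, so a real implementation might prefer U+FFFD, the standard replacement character, as the stand-in.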

~9 years ago