That's a good challenge; if you can make a faster version, that'd be fantastic!
T[48] is useless, so just delete it. I was testing whether filling the last 48 entries of W would make it function the same, and it does.
W[64] doesn't have to be that long; it can have a length of 16 and still function the same, but I made it 64 to avoid having to check for "SUBSCRIPT OUT OF RANGE" errors.
Lots of implementations do that differently. JavaScript uses a length of 16 because indexing out of range (negative indices included) does not error; it just gives undefined, which turns into 0 under bitwise operations. That is perfect behavior for this algorithm, since the code needs no explicit bounds checks and stays fast.
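To illustrate the two points above, here is a rough JavaScript sketch, not the code from this thread. The first part shows what an out-of-range read does. The second part assumes one common way of getting away with 16 words, a rolling buffer indexed with `& 15`, and compares it against a full 64-word schedule. The names `scheduleFull` and `scheduleRolling` are made up for this example.

```javascript
// Out-of-range reads do not throw in JavaScript.
const probe = new Array(16).fill(0);
const missing = probe[-1];   // undefined, no error
const asBits = missing | 0;  // bitwise operators coerce undefined to 0
const asSum = missing + 1;   // plain arithmetic gives NaN instead
console.log(missing, asBits, asSum); // undefined 0 NaN

// Rotate a 32-bit word right by n bits.
const rotr = (x, n) => (x >>> n) | (x << (32 - n));
// SHA-256 "small sigma" functions used by the message schedule.
const s0 = (x) => rotr(x, 7) ^ rotr(x, 18) ^ (x >>> 3);
const s1 = (x) => rotr(x, 17) ^ rotr(x, 19) ^ (x >>> 10);

// Reference version: expand 16 block words into all 64 schedule words.
function scheduleFull(block) {
  const w = block.slice(0, 16);
  for (let t = 16; t < 64; t++) {
    w[t] = (s1(w[t - 2]) + w[t - 7] + s0(w[t - 15]) + w[t - 16]) | 0;
  }
  return w;
}

// Rolling version: 16 words of storage, every index wrapped with & 15.
// Slot (t & 15) still holds W[t-16] when round t overwrites it.
function scheduleRolling(block) {
  const w = block.slice(0, 16);
  const out = [];
  for (let t = 0; t < 64; t++) {
    if (t >= 16) {
      w[t & 15] =
        (s1(w[(t - 2) & 15]) + w[(t - 7) & 15] + s0(w[(t - 15) & 15]) + w[t & 15]) | 0;
    }
    out.push(w[t & 15]); // the word round t would consume
  }
  return out;
}
```

Both functions should return the same 64 words for any 16-word block, which is why a 16-entry W can work as long as every index is wrapped. In SmileBASIC the same idea would need the wrapped index written out explicitly, since an out-of-range subscript is an error there.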
K is full of 64 constants (derived from the first 64 prime numbers), and I generate those every time it's run. Now that I think of it, that's most certainly what's making it slow...
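For reference, the standard SHA-256 round constants are the first 32 bits of the fractional parts of the cube roots of the first 64 primes. A minimal JavaScript sketch of generating them once and reusing the result could look like this; `firstPrimes`, `getK` and `cachedK` are hypothetical names for this example, not anything from the SmileBASIC program.

```javascript
// First n primes by trial division (plenty fast for n = 64).
function firstPrimes(n) {
  const primes = [];
  for (let c = 2; primes.length < n; c++) {
    if (primes.every((p) => c % p !== 0)) primes.push(c);
  }
  return primes;
}

let cachedK = null; // hypothetical cache, filled on first use

// Returns the 64 round constants, computing them only once.
function getK() {
  if (cachedK === null) {
    cachedK = firstPrimes(64).map((p) => {
      const root = Math.cbrt(p);
      const frac = root - Math.floor(root); // fractional part of the cube root
      return Math.floor(frac * 2 ** 32) >>> 0; // keep its first 32 bits
    });
  }
  return cachedK;
}

console.log(getK()[0].toString(16)); // 428a2f98
```

Later calls to `getK()` return the same array without recomputing anything, which is the same idea as a one-time initialization step that runs before any hashing.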
Instructions:
Usage:
DIM H%[8],K%[64]
INITSHA256 'One-time initialization
PRINT SHA256("FOO")
PRINT SHA256("BAR")
Prints out:
"9520437CE8902EB379A7D8AAA98FC4C94EEB07B6684854868FA6F72BF34B0FD3"
"81F5F5515E670645C30C6340FE397157BBD2D42CAA6968FD296A725EC9FAC4ED"
@New_3DS: 382.37 Hashes/Second
@Original_3DS: 100.67 Hashes/Second