Performance and the Cache

Often when people try to optimize code, they think very deeply about the logic of the code itself but forget to consider the hardware it runs on. One such consideration is the cache. Understanding how the cache works can sometimes let you make your code run just a little bit faster.

What is the cache?

Imagine you are writing a research paper and need some sources to reference, so you go to a library and start looking for books. There are three fundamental reasons why you would check out books from a library:

(1) To bring the information closer to you. By moving the book from the library to a bookshelf in your home, you can access it much more quickly. It no longer requires an entire drive to the library, but simply a walk to your bookshelf.

(2) You assume you will use that book not just once, but multiple times. When you look up information in a book, you are likely going to want to look up more information in that same book in the near future. This is known as the principle of temporal locality.

(3) You may also consider checking out not just one book, but several of the books shelved around it. Why? Because maybe one book turns out not to be enough, and you don't want to go all the way back to the library. So instead, beforehand, you checked out multiple books on the same topic that you may or may not need; that way, if you do need more information, you have a few more books to look through. This is known as the principle of spatial locality.

Caches operate on these same principles. The registers of your CPU hold a tiiiiny amount of memory and are where all the major CPU operations are performed. The RAM of a computer is actually pretty "far away" from the CPU, which means it takes a lot of time to load data from RAM into the registers to actually carry out an operation. Caches act as an intermediate step: the cache is like the bookshelf in the analogy that is close to you, while RAM is the library that's far away.

When you write code that fetches data from RAM, you assume it always goes to RAM, but it actually doesn't. When the CPU is told to get data from RAM, it fetches that data, plus the data around it, and brings it all into the cache. Then, if you try to fetch the same data again, it remembers it's stored in the cache and reads it from there instead, avoiding the long trip to RAM every time.

Caches rely on two assumptions: (1) the principle of temporal locality, the assumption that any data you access will likely be accessed again in the near future, and (2) the principle of spatial locality, the assumption that data near previously accessed data will likely be accessed in the near future. As long as these two principles hold true (which they almost always do), caches provide a performance boost. The CPU no longer needs to run back and forth to the "library" (the RAM) every time it needs some memory; instead it can take some of that information home to its bookshelf (the cache) and simply check that bookshelf when it needs the information again. (Note that real CPUs have multiple caches, not just one, but the same principles apply to all of them.)
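To see both principles in one place, here is a minimal sketch (the array size and variable names are mine, not from the discussion above) of an ordinary summing loop, with comments marking where each principle applies:

ACLS
DIM A%[100000]
VAR SUM%=0
VAR I%
FOR I%=0 TO LEN(A%)-1
 ' Temporal locality: SUM% and I% are touched on every single
 ' iteration, so they stay in fast storage the whole time.
 ' Spatial locality: I% walks A% one element at a time, so the
 ' neighbours fetched alongside A%[I%] are exactly what we read next.
 SUM%=SUM%+A%[I%]
NEXT
PRINT SUM%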

Performance

When the CPU needs some information from RAM, it first checks whether that information already exists in the cache. If it does not, this is called a cache miss, and the CPU has to go to RAM to find the information. If it does find the information in the cache, this is called a cache hit, and it can simply pull the data from the cache and save some time. So, to optimize performance, we want to maximize the number of cache hits and minimize the number of misses.

When your program first starts, none of the information in the cache is going to be useful to it, so the first time the CPU runs code that fetches from RAM, it will check the cache only to find nothing there, and all of those initial fetches will be cache misses. These are called cold misses, and there's nothing you can do about them. It's like checking your bookshelf for a book even though you've never been to the library before. Of course nothing will be there.

Take a look at this very simple program:

ACLS
DIM A[500,500]
DIM B[500,500]
DIM C[500,500]
VAR T=MILLISEC
VAR I, J
FOR I=0 TO 500-1
 FOR J=0 TO 500-1
  C[J,I]=A[J,I]+B[J,I]
 NEXT
NEXT
PRINT MILLISEC-T

This shouldn't be too hard to understand. All it does is add the two matrices A and B and store the result in C. We also record how many milliseconds this took to run. Running it on the new3DS ten times, these are the results:

Test 1: 1246
Test 2: 1235
Test 3: 1241
Test 4: 1241
Test 5: 1233
Test 6: 1236
Test 7: 1242
Test 8: 1241
Test 9: 1240
Test 10: 1236
Average: 1239.1

I pose a simple question to you: how can we improve the performance of this program without changing the program's logic? By this I mean the program should, on paper, still do the exact same thing and perform exactly the same. But in practice, is there a way we can use our knowledge of caches to speed up the program? Let's change the program to this:

ACLS
DIM A[500,500]
DIM B[500,500]
DIM C[500,500]
VAR T=MILLISEC
VAR I, J
FOR I=0 TO 500-1
 FOR J=0 TO 500-1
  C[I,J]=A[I,J]+B[I,J]
 NEXT
NEXT
PRINT MILLISEC-T

It might not be obvious what I even changed: the indexes inside the addition operation in the two for-loops. On paper, this doesn't change anything about our code. The logic is exactly the same; it should run through all the elements and sum them up. Let's run it ten times again and see if there's any change in performance:

Test 1: 1228
Test 2: 1229
Test 3: 1235
Test 4: 1234
Test 5: 1240
Test 6: 1243
Test 7: 1224
Test 8: 1229
Test 9: 1231
Test 10: 1230
Average: 1232.3

There is a change... albeit a very slight one, only about a 0.5% speedup. But it is there; keep collecting data points for as long as you like and you'll see it's not due to random chance. The performance boost is really there.

So why did this improve performance ever so slightly? When we add two elements of the matrices, those matrices are stored in RAM, so the CPU has to go find the data in RAM, which is slow. But it also puts that data into the cache, and not only that, it brings in the data around it and puts that in the cache as well. When we access C[I,J], looping through J in the inner loop and I in the outer loop, we are accessing memory in RAM in a linear fashion: the data referenced by C[I,J] is stored right next to C[I,J+1], which is stored right next to C[I,J+2]. These pieces of data are adjacent to each other in RAM. But when we access data with C[J,I], this is not the case. C[J,I] is not adjacent to C[J+1,I]. Instead, it is an entire 500 memory addresses away, and C[J+2,I] is another 500 memory addresses away from that. Since these accesses are not near each other, the principle of spatial locality doesn't hold. You are not accessing data that sits close together, so when the cache brings in the data around the data you accessed, that extra data is useless, because the next element you access will be quite far away. But when the accesses are adjacent, as with C[I,J], the principle of spatial locality does hold: since each memory access is right after the previous one, the data brought into the cache will indeed be useful for predicting what data you'll need next. One program promotes cache misses while the other promotes cache hits. Thus, an ever so slight performance improvement.

Now, it's not exactly an amazing performance increase, but it is an increase, so it's worth understanding. It's pretty hard to see a cache-driven performance improvement in something like BASIC, because there's so much other processing going on: when you type "C[I,J]", that isn't directly converted to machine code that loads the data. Instead, it's processed by a ton of interpreter overhead that also uses the cache, so its behaviour isn't truly predictable. But you will typically see a very slight performance improvement overall if you make your programs follow the principles of locality when applicable.
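To see where the number 500 comes from: a two-dimensional array is ultimately stored as one long run of memory, and the adjacency observed above implies that element [I,J] of a [500,500] array sits at offset I*500+J, with the last index being the contiguous one. The sketch below (my own construction, not part of the original tests) emulates the 2D array with a 1D one so the address arithmetic, and therefore the stride of each inner loop, is explicit:

ACLS
VAR N=500
DIM D[250000] ' emulates D[N,N]: element [I,J] lives at index I*N+J
VAR I, J
VAR SUM=0
VAR T=MILLISEC
FOR I=0 TO N-1
 FOR J=0 TO N-1
  SUM=SUM+D[I*N+J] ' inner loop moves 1 address per step (cache-friendly)
 NEXT
NEXT
PRINT "Row order:    ";MILLISEC-T
T=MILLISEC
FOR I=0 TO N-1
 FOR J=0 TO N-1
  SUM=SUM+D[J*N+I] ' inner loop jumps N=500 addresses per step
 NEXT
NEXT
PRINT "Column order: ";MILLISEC-T

If the cache behaves as described, the first pass should come out slightly ahead for exactly the same reason C[I,J] beat C[J,I], though the interpreter overhead mentioned above will mute the gap.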

Extreme Example

The biggest violation of the principles of locality is random memory access. The principles of locality assume some structure to memory accesses, namely that they have a temporal and spatial relation to prior accesses. To make an extreme case, we can pit random memory accesses against predictable ones. Here are two example programs:

ACLS
DIM A%[1000000]
VAR I%=0
VAR SUM%=0
VAR T%=MILLISEC
FOR I%=1 TO 1000000
 VAR P%=RND(LEN(A%)-1)
 P%=P% OR P%
 SUM%=SUM%+A%[P%]
NEXT
PRINT MILLISEC-T%

ACLS
DIM A%[1000000]
VAR I%=0
VAR SUM%=0
VAR T%=MILLISEC
FOR I%=1 TO 1000000
 VAR P%=RND(LEN(A%)-1)
 P%=P% XOR P%
 SUM%=SUM%+A%[P%]
NEXT
PRINT MILLISEC-T%

These two programs are nearly identical; the only difference is the OR versus XOR line. "P% XOR P%" always returns 0, while "P% OR P%" always returns P%. So both programs make 1,000,000 memory accesses to calculate the sum, but the first pulls each value from a random address, while the second always pulls from address 0. Since the programs are otherwise identical, on paper they should perform identically. Running each five times, the first averages 3867.6 milliseconds and the second averages 3715.4 milliseconds. That's a 3.9% performance difference!

You might think "well, maybe OR is just really slow and XOR is just really fast", so let me disprove that. Simply delete the "SUM%=SUM%+A%[P%]" line. This removes the memory accesses from the code, so the cache should no longer play a role, and the only remaining difference is the XOR versus OR line. Doing this, I ran each version five times again, the first with OR and the second with XOR. The averages were 2302.4 and 2298.2, a difference of about 0.2%. That is less than half the difference in even our non-extreme example and cannot explain a 3.9% gap. The 3.9% difference is mostly due to the cache.
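Random access is the worst case, but you don't need randomness to defeat spatial locality; a large fixed stride does it too. As a final sketch (the stride of 16 and the array size are arbitrary choices of mine, and how visible the gap is through SmileBASIC's interpreter overhead will vary), the following makes the same number of reads in both passes, but the second pass skips past the neighbours the cache pulled in:

ACLS
DIM A%[1000000]
VAR I%
VAR K%
VAR SUM%=0
VAR T%=MILLISEC
' Pass 1: stride 1. Each read's neighbours, fetched into the
' cache alongside it, are used on the very next iterations.
FOR I%=0 TO LEN(A%)-1
 SUM%=SUM%+A%[I%]
NEXT
PRINT "Stride 1:  ";MILLISEC-T%
T%=MILLISEC
' Pass 2: stride 16, repeated 16 times so the total number of
' reads matches pass 1. Each read jumps past its neighbours.
FOR K%=0 TO 15
 FOR I%=K% TO LEN(A%)-1 STEP 16
  SUM%=SUM%+A%[I%]
 NEXT
NEXT
PRINT "Stride 16: ";MILLISEC-T%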
Author: amihart
13 Comment(s)

Simeon: Interesting

ToadIsTheBest: "Often when people are trying to optimize code, they think very deeply about the logic of their code itself, but forget to actually consider the hardware their code is running on." But what if someone plays a game made on 3DS on Wii U and the game is optimized this way. Didn't think about that, did you?

snail_: But the CPU cache on both systems is likely to behave in pretty much the exact same manner, because they all do.

ToadIsTheBest: but what if it's too different, snail_ YOU didn't think about THAT did you?

amihart: uh okay

CosmicTacoCat: Ooh! I like this a lot. Thanks, this makes a lot of sense.

snail_: Making the point of contiguous array access is very good because I think cache and memory access is something a lot of people don't consider. Good submission.

snail_: related reading

Shelly: Then again, wouldn't it make more sense for you to just let the CPU determine what goes in the cache by itself? Doesn't it have a fairly efficient process for doing that?

amihart: I really don't know what you mean. This post is about the CPU's cache and how you can write code to be slightly more efficient by knowing how it works. The CPU is hardware, not software; it can't change what it does on the fly. It's up to the programmer to change the code that goes into the CPU, and knowing a bit about how CPUs are built will allow you to write slightly more efficient code. In this case, writing code that promotes cache hits in the CPU over cache misses is a little faster.

amihart: I really have no clue what you're on about. The CPU puts data in the cache exactly how I described. It's a very simple process based on the principles of locality, which are assumptions the CPU makes. You can deliberately write software that accesses data randomly and the CPU's cache will be worthless because it would violate those principles, but if you write code that follows these principles then the CPU's cache will be useful and the CPU will run faster. Your argument seems to be that "we shouldn't care what the CPU does when writing our code and should just let the CPU do its thing". That's incredibly silly and I see no logic in it.

snail_: @Shelly You're misunderstanding the notion of cache here. SB isn't controlling its own cache memory; every CPU has its own onboard cache (sometimes multiple levels), and memory is fetched into it automatically whenever something outside the CPU cache is loaded in from RAM. The processes running on the CPU have no control over this at all, and it's done to improve performance when reading commonly used or adjacent memory values. Of course, the way the process uses memory will affect the way the CPU handles cache, because all memory reads go directly through cache first. Taking advantage of contiguous memory will make your program faster regardless of any other factors, because you'll be utilizing CPU cache.

amihart: Your question doesn't make any sense. Did you actually read the post? You don't seem to understand what's being discussed at all.