Often when people are trying to optimize code, they think very deeply about the logic of the code itself, but forget to consider the hardware the code is running on. One such consideration is the cache. Having an understanding of how the cache works can sometimes let you make code run just a little bit faster.
What is the cache?
Imagine you are writing a research paper and need some sources to reference, so you go to a library and start looking for books.
There are three fundamental reasons why you would check out books from a library:
(1) To bring the information closer to you. By moving the book from the library to a bookshelf in your home, you can access it much more quickly. It no longer requires an entire drive to the library, just a walk to your bookshelf.
(2) You assume you will use that book not just once, but multiple times. When you look up information in a book, you are likely going to want to look up more information in that same book in the near future. This is known as the principle of temporal locality.
(3) You may also consider checking out not just one book, but several books around the one you checked out. Why? Because maybe you find later that one book wasn't enough, and you don't want to go all the way back to the library. So instead, you check out multiple books on the same topic beforehand, whether or not you end up needing them; that way, if you do need more information, you have a few more books to look through. This is known as the principle of spatial locality.
Caches operate on these same principles. The registers of your CPU hold a tiiiiny amount of memory and are where all the major CPU operations are performed. The RAM of a computer is actually pretty "far away" from the CPU, which means it takes a lot of time to load data from RAM into the registers to actually carry out an operation.
Caches act as an intermediate step: kinda like the bookshelf in the analogy that is close to you, while RAM is the library that's far away. When you write some code to fetch data from RAM, you might assume it always goes to RAM, but it doesn't. When the CPU is told to get data from RAM, it fetches that data along with the data around it and brings it all into the cache. Then, if you try to fetch the same data again, it remembers it's stored in the cache and reads it from there instead, preventing it from having to make the long trip to RAM every time.
Caches rely on two assumptions: (1) the principle of temporal locality, the assumption that any data you access will likely be accessed again in the near future, and (2) the principle of spatial locality, the assumption that data around data you previously accessed will likely be accessed in the near future.
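To make these two assumptions a bit more concrete, here is a rough sketch (the array and variable names are just for illustration) of what each access pattern looks like:
DIM D[1000]
VAR I
VAR S=0
' Temporal locality: the same element is read over and over,
' so after the first trip to RAM it can come from the cache.
FOR I=0 TO 999
S=S+D[0]
NEXT
' Spatial locality: consecutive elements are read one after another,
' so the neighbouring data pulled in alongside D[0] gets used too.
FOR I=0 TO 999
S=S+D[I]
NEXT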
As long as these two principles hold true (which they almost always do), caches will provide a performance boost. The CPU no longer needs to run back and forth to the "library" (the RAM) every time it needs some memory; instead it can take some of that information and put it on its bookshelf at home (the cache), then simply check that bookshelf when it needs the information again.
(Note that real CPUs have several caches arranged in levels, not just one, but the same principles apply to all of them.)
Performance
When the CPU needs some information from RAM, it first checks whether it already exists in the cache. If it does not, this is called a cache miss, and the CPU has to go to RAM to find the information. If it does find the information in the cache, this is called a cache hit, and it can simply pull the data from the cache and save some time.
So, to optimize performance, we want to maximize the number of cache hits and minimize the number of misses.
When you first start your program, none of the information in the cache is going to be useful to it, so the first time the CPU sees code that calls for fetches from RAM, it will check the cache only to find nothing there, and all of those initial fetches will be cache misses. These are called cold misses, and there's nothing you can do about them. It's like checking your bookshelf for a book even though you've never been to the library before. Of course nothing will be there.
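If you want to try observing this yourself, a rough sketch would be to time the same pass over an array twice: the first pass includes the cold misses, while the second pass runs against a cache that has already seen the data. (Whether the difference is actually measurable through BASIC's interpreter overhead is another matter.)
ACLS
DIM D[10000]
VAR I
VAR S=0
VAR T=MILLISEC
' First pass: the cache starts out knowing nothing about D,
' so these accesses are cold misses.
FOR I=0 TO 9999
S=S+D[I]
NEXT
PRINT MILLISEC-T
T=MILLISEC
' Second pass: the same data was just touched, so much of it
' may still be sitting in the cache.
FOR I=0 TO 9999
S=S+D[I]
NEXT
PRINT MILLISEC-T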
Take a look at this very simple program:
ACLS
DIM A[500,500]
DIM B[500,500]
DIM C[500,500]
VAR T=MILLISEC
VAR I, J
FOR I=0 TO 500-1
FOR J=0 TO 500-1
C[J,I]=A[J,I]+B[J,I]
NEXT
NEXT
PRINT MILLISEC-T
This shouldn't be too hard to understand. All it does is add the two matrices A and B and store the result in C.
We also record how many milliseconds this took to run.
Running it on the new3DS ten times, these are the results:
Test 1: 1246
Test 2: 1235
Test 3: 1241
Test 4: 1241
Test 5: 1233
Test 6: 1236
Test 7: 1242
Test 8: 1241
Test 9: 1240
Test 10: 1236
Average: 1239.1
I pose a simple question to you: how can we improve the performance of this program without changing the program's logic?
By this I mean that, on paper, the program should still do the exact same thing and the exact same amount of work. But in practice, is there a way we can use our knowledge of caches to speed it up?
Let's change the program now to this...
ACLS
DIM A[500,500]
DIM B[500,500]
DIM C[500,500]
VAR T=MILLISEC
VAR I, J
FOR I=0 TO 500-1
FOR J=0 TO 500-1
C[I,J]=A[I,J]+B[I,J]
NEXT
NEXT
PRINT MILLISEC-T
It might not be obvious what I even changed: I swapped the order of the indices used in the addition inside the two for-loops, so C[J,I] became C[I,J] (and likewise for A and B).
On paper, this doesn't change anything about our code. The logic is exactly the same; it still runs through all the elements and sums them up.
Let's run it 10 times again and see if there's any change in performance.
Test 1: 1228
Test 2: 1229
Test 3: 1235
Test 4: 1234
Test 5: 1240
Test 6: 1243
Test 7: 1224
Test 8: 1229
Test 9: 1231
Test 10: 1230
Average: 1232.3
There is a change... albeit a very slight one. Only about a 0.5% speedup. But it is there: keep collecting data points and you'll see it's not due to random chance, the performance boost really is there.
So why did this improve the performance ever so slightly? When we add two elements of the matrices, those matrices are stored in RAM, so the CPU has to go find the data in RAM, which is slow. But it also puts that data into the cache, and not only that, it brings in the data around it and puts that in the cache as well.
When we access C[I,J], looping through J in the inner loop and I in the outer loop, we are accessing memory in RAM in a linear fashion. The data referenced by C[I,J] is stored right next to C[I,J+1], which is stored right next to C[I,J+2]. These pieces of data are adjacent to each other in RAM.
But when we access data with C[J,I], this is not the case. C[J,I] is not adjacent to C[J+1,I]. Instead, it is an entire 500 elements away in memory, and C[J+2,I] is another 500 elements away from that.
Since these accesses are not near each other, the principle of spatial locality doesn't hold. When the cache brings in the data around the data you accessed, that data is useless, since the next element you access will be quite far away. But with C[I,J] the accesses are adjacent, so the principle of spatial locality does hold: each access is one element after the other, and the data brought into the cache really will be useful in the near future.
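A rough way to picture this is to flatten the 2D array yourself. Assuming the last index is the one laid out contiguously (which is what the timings above suggest), the element [I,J] of a 500x500 array sits at linear offset I*500+J:
VAR I=3
VAR J=7
' In this flattened view, [I,J] lives at linear offset I*500+J.
' Stepping J by 1 moves to the very next element (1507, then 1508):
PRINT I*500+J
PRINT I*500+(J+1)
' Stepping I by 1 jumps a whole 500 elements away (2007):
PRINT (I+1)*500+J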
One program promotes cache misses while the other promotes cache hits. Thus, an ever so slight performance improvement.
Now, it's not exactly an amazing performance increase, but it does increase performance, so it's worth understanding.
It's pretty hard to see a performance improvement from the cache with something like BASIC, since there's so much other processing going on. When you type "C[I,J]", that's not directly converted to machine code that loads the data; instead, it's processed by a ton of interpreter overhead that also uses the cache, so its behaviour isn't truly predictable. But you will typically see a very slight performance improvement overall if you make your programs follow the principles of locality when applicable.
Extreme Example
The biggest violation of the principles of locality is random memory access. The principles of locality assume some structure to memory accesses, namely that they have a temporal and spatial relation to prior accesses. To make an extreme case, we can compare random memory accesses with predictable ones.
Here are two example programs:
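' Program 1: P% OR P% leaves P% unchanged, so a random element of A% is read each time.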
ACLS
DIM A%[1000000]
VAR I%=0
VAR SUM%=0
VAR T%=MILLISEC
FOR I%=1 TO 1000000
VAR P%=RND(LEN(A%)-1)
P%=P% OR P%
SUM%=SUM%+A%[P%]
NEXT
PRINT MILLISEC-T%
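' Program 2: identical, except P% XOR P% forces P% to 0, so A%[0] is read every time.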
ACLS
DIM A%[1000000]
VAR I%=0
VAR SUM%=0
VAR T%=MILLISEC
FOR I%=1 TO 1000000
VAR P%=RND(LEN(A%)-1)
P%=P% XOR P%
SUM%=SUM%+A%[P%]
NEXT
PRINT MILLISEC-T%
These two programs are nearly identical. The only difference is the "XOR" vs "OR" line. "P% XOR P%" always returns 0, while "P% OR P%" always returns P%. So in both programs we are making 1,000,000 memory accesses to calculate our sum; in the first we pull the value from a random position in the array, while in the second we always pull it from position 0. On paper, the two programs do the same amount of work, so they should perform identically.
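If you want to convince yourself of the OR/XOR behaviour, a quick check (with an arbitrary value) is:
VAR P%=1234
PRINT P% OR P%
PRINT P% XOR P%
' prints 1234, then 0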
Running each version five times, the first program averages 3867.6 milliseconds and the second averages 3715.4 milliseconds.
That's a 3.9% performance difference!
You might think "well maybe OR is just really slow and XOR is just really fast"... well let me disprove that.
Simply delete the "SUM%=SUM%+A%[P%]" line. This removes the array accesses from the code, so the cache should no longer play a role, meaning the only difference left is the XOR vs OR line.
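For reference, the stripped-down loop would look something like this (shown for the OR version; the XOR version is the same with OR swapped for XOR):
ACLS
DIM A%[1000000]
VAR I%=0
VAR SUM%=0
VAR T%=MILLISEC
FOR I%=1 TO 1000000
VAR P%=RND(LEN(A%)-1)
P%=P% OR P%
NEXT
PRINT MILLISEC-T%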
Doing this, I ran the code 5 times again, first with OR and then with XOR. The averages were 2302.4 and 2298.2 milliseconds. The difference here is about 0.2%, which is not even half of our non-extreme example and nowhere near enough to explain a 3.9% gap. That 3.9% difference is mostly due to the cache.