I figure I should post something, but I haven't had a chance to really come up with anything about OpenCL or GL to post. I have been looking at information, I just haven't had the chance to build anything of my own. I watched all 6 videos on OpenCL at Mac Research, which are fantastic, and I highly advise you to go watch them if you are reading this. One of the videos comments that, depending on how you use the memory banks referred to as "Local" on the graphics card, they can essentially act as registers. Then it hit me: not every programmer knows what that means. Not many really understand how memory flows when you are dealing with a CPU, let alone a GPU. So I thought I'd take a moment to explain how this works.
I'm going to actually explain it, but I thought I'd give you a little analogy to reference before I get going. There are 4 kinds of memory: registers, cache, RAM, and disk. If you were to equate them to notebooks, registers would be the last couple and the next couple of letters you wrote (a 32-bit x86 processor has 6 general-purpose 4-byte registers you can freely use, a 64-bit one has 14 8-byte registers). The cache is the current page of the notebook you are writing on. RAM is the stack of notebooks on your desk. And disk is the dozens of bookshelves filled with notebooks around the room.
The reason for this is simply put here:

          Register      Cache      RAM        Disk
Speed     Blazing fast  Fast       Slow       Painfully slow
Size      Bytes         Kilobytes  Gigabytes  Terabytes
Now that you get the general idea for these various kinds of memory, let's start with where they are located. Both the registers and the cache are in the actual processor, so when you buy a computer, the processor is the piece that determines the size of these two memories. When the processor executes a machine-level instruction, the only memory it is capable of accessing is the registers. Therefore it makes sense that these are the fastest. The reason for the tiny size is a matter of addressing and locality. To keep the instructions small, the number of registers must be small. A typical instruction has to address 3 registers. Take addition, for example: a+b, store in c. It has to address a, b, and c. The other reason is that the further the information has to travel, the longer it takes to get there. Registers are right there, ready to go into the circuitry that is about to be executed.
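To put rough numbers on the addressing point, here is a small Python sketch. It assumes a made-up three-operand instruction format where each operand is just a register number; the function names are mine, and real instruction sets encode things in more complicated ways.

```python
def bits_per_register(register_count):
    """Bits needed to give each of register_count registers a unique number."""
    return max(1, (register_count - 1).bit_length())


def operand_bits(register_count, operands=3):
    """Bits spent naming registers in one hypothetical a+b -> c style instruction."""
    return operands * bits_per_register(register_count)


# 6 and 14 are the register counts mentioned earlier; 4096 is a "what if" count.
for count in (6, 14, 4096):
    print(count, "registers ->", operand_bits(count), "bits just to name the operands")
```

With a handful of registers the three operands fit in 9 to 12 bits. With thousands of registers, naming the operands alone would take 36 bits, which is already bigger than a whole 32-bit instruction.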
Cache is the next level in memory. Because of its proximity to the processing, it is also very fast. But when the processor wants to use information saved in the cache, it must do a load command to pull the information from the cache into the registers so it can be manipulated. Registers run on the order of 752 gigabits per second, while cache runs closer to 16-24 gigabits per second. A bit of a speed difference.
Now, to the processor, the cache and RAM actually look like the same thing. The hardware will direct the processor's load command to the correct location, be it cache or RAM. So why not have a larger cache? Take a look at a processor die (the die is the actual circuitry of the processor), which you can see here: http://www.chipsetc.com/uploads/1/2/4/4/1244189/3396835_orig.jpg?158 When you look at this you can see there is a large dark portion on the left that is just a repeating pattern, and a lighter portion with lots of paths and a bunch of constructs of some sort. The right side is the actual processor; the left side is the cache. Where are the registers, you might ask? Well, they are far too small to even see. Each of the 4-byte registers in this processor is the same size as each of the 4k registers that make up the cache. 4k registers versus 6 registers: you can imagine why you might not be able to see them. But back to my point. The cache is half of this chip. Half. This is why the cache isn't bigger. Memory, compared to the actual processing circuitry, is quite large.
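The idea that the processor issues one kind of load and the hardware decides whether cache or RAM answers it can be sketched as a toy model in Python. Everything here is an assumption for illustration: the class name, the 4-slot cache, and the simple "address modulo slot count" placement rule. Real caches work on whole lines of bytes and are far more elaborate.

```python
class ToyMemory:
    """Toy model: one load() call, served by cache or RAM behind the scenes."""

    def __init__(self, ram, cache_slots=4):
        self.ram = ram                  # backing store: a list of values
        self.cache_slots = cache_slots  # how many entries the tiny cache holds
        self.cache = {}                 # slot index -> (address, value)
        self.hits = 0
        self.misses = 0

    def load(self, address):
        slot = address % self.cache_slots  # each address maps to exactly one slot
        entry = self.cache.get(slot)
        if entry is not None and entry[0] == address:
            self.hits += 1                 # served from the cache
            return entry[1]
        self.misses += 1                   # had to go all the way to RAM
        value = self.ram[address]
        self.cache[slot] = (address, value)  # keep a copy for next time
        return value


memory = ToyMemory(ram=list(range(100, 116)))
values = [memory.load(address) for address in (0, 1, 0, 1, 4, 0)]
print(values, "hits:", memory.hits, "misses:", memory.misses)
```

The caller never says "cache" or "RAM", it just calls load(). Addresses 0 and 4 fight over the same slot, so the last load of address 0 misses even though it was cached earlier. That is the price of a cache that is much smaller than the memory behind it.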
The next level of memory is the system memory, or RAM. RAM stands for Random Access Memory. The cache and system memory are technically both RAM, but we make the distinction between the two based mostly on the cache being housed in the processor. In terms of speed, modern RAM is typically capable of about 8-16 gigabits per second, which isn't much slower than the cache. It is slower, but it's nothing like the speed difference between the cache and the registers. RAM is where we start working with latency, though. Latency is the amount of time from when you give a system input to when it starts giving output. If you go buy RAM online, latency will be one of the statistics listed for it, usually a number between 6 and 9. That number is counted in memory clock cycles; the full delay between a request from the processor and the response from RAM works out to tens of nanoseconds. That is huge. Staggeringly huge.
Let's say a trip to RAM takes about 60 ns and you have a fast processor at 3 GHz. 3 GHz just means that in 1 second it makes 3,000,000,000 cycles, or 3 cycles per ns. If your latency is 60 ns, that means 3*60 or 180 processor cycles pass in the time it takes to get your request back from the RAM. That is a lot of wasted resources. Modern processors have memory controllers that deal with getting information from the RAM to the cache to minimize the frequency of what are referred to as "cache misses", where you have to wait for information to be transferred from RAM to the cache.
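The wasted-cycle arithmetic is just latency multiplied by clock rate. Here is a rough Python sketch of it. The function name and the two sample waits are my own illustrative picks, not measurements; the millisecond one is the sort of delay you would expect from a magnetic disk seek.

```python
def cycles_elapsed(latency_ns, clock_hz):
    """Processor cycles that tick by while waiting latency_ns nanoseconds."""
    return latency_ns * clock_hz // 1_000_000_000


CLOCK_HZ = 3_000_000_000  # a 3 GHz processor

# Sample waits in nanoseconds (illustrative assumptions).
SAMPLE_WAITS_NS = {"60 ns wait": 60, "6 ms wait": 6_000_000}

stalls = {label: cycles_elapsed(ns, CLOCK_HZ) for label, ns in SAMPLE_WAITS_NS.items()}
for label, cycles in stalls.items():
    print(label, "->", cycles, "cycles")
```

Either way, the processor could have done a lot of work in the time it spent waiting, which is why keeping data close matters so much.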
The last level of memory is disk. This may be solid state drives or magnetic disk drives. Solid state drives are actually similar to RAM. They use a type of memory called flash memory. One of the features of RAM is that if it loses its power, all the information is lost. Flash memory, on the other hand, does not lose its information if it loses power. Solid state drives are pretty fast, capable of 1-3 gigabits per second (though you'll typically see them listed in megabytes per second; translating what I said into that form, 125-375 megabytes per second). Magnetic disks are sluggish in comparison at 40-200 megabits per second. That's 0.04-0.2 gigabits per second.
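Since drive speeds get quoted in bits by some people and bytes by others, here is the conversion as a tiny Python sketch. It assumes decimal units (1 gigabit = 1000 megabits) and 8 bits per byte; the function name is just mine.

```python
def megabits_to_megabytes(megabits):
    """Convert a megabit figure to megabytes (8 bits per byte)."""
    return megabits / 8


# Speeds in megabits per second, taken from the figures above.
ssd_range = (1000, 3000)   # 1-3 gigabits per second
disk_range = (40, 200)     # magnetic disk

ssd_megabytes = tuple(megabits_to_megabytes(speed) for speed in ssd_range)
disk_megabytes = tuple(megabits_to_megabytes(speed) for speed in disk_range)
print("SSD:", ssd_megabytes, "MB/s")
print("Magnetic disk:", disk_megabytes, "MB/s")
```

The SSD range comes out to 125-375 megabytes per second, and the same division puts the magnetic disk figures at 5-25 megabytes per second.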
So why not SSDs all the way? This one I'm sure most of you know, as these systems aren't quite as obscure and mysterious as registers and caches. For $100 you can get a 64 gigabyte SSD, about an order of magnitude larger than your RAM. For $100 you could also get a 750 GB magnetic disk drive. So the constraining feature of SSDs is cost.
So to summarize what I've said. Registers: extremely fast, capable of dumping their contents every cycle, but because of limitations imposed by the size of instructions and the speed at which light travels, there can only be 6 of them. Cache: quite fast, resides on the processor itself, but due to size limitations, can't be bigger. RAM: slow, holds vastly more data than the cache, but again because of limitations of size and the speed of light (a photon at one end of a stick of RAM will only be able to travel about as far as the other end of the stick in the time it takes for your CPU to complete a cycle) as well as heat (if we try to pack RAM closer, it is going to start to melt itself due to an inability to disperse the heat it produces), RAM is limited to a couple of gigabytes. Lastly there are hard drives. These hold huge amounts of data, but are quite slow in getting to and reading that information. So when you compare the two ends of the spectrum, the registers hold about 32 bytes of information and can transfer about 752 gigabits per second. In contrast, a magnetic disk can hold a terabyte or more, which is 31,250,000,000 times as much data, but can only transfer about 0.2 gigabits of it per second, which makes registers about 3,760 times faster.
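The two ratios at the end of that summary are plain division. Here is a quick Python sketch using the figures from this post (32 bytes versus a decimal terabyte, 752 gigabits per second versus 0.2). The constant names are mine, and the speeds are written in megabits so everything stays in whole numbers.

```python
REGISTER_BYTES = 32                 # total register storage, per the summary
DISK_BYTES = 10**12                 # one terabyte, decimal units assumed

REGISTER_MEGABITS_PER_S = 752_000   # 752 gigabits per second
DISK_MEGABITS_PER_S = 200           # 0.2 gigabits per second

capacity_ratio = DISK_BYTES // REGISTER_BYTES
speed_ratio = REGISTER_MEGABITS_PER_S // DISK_MEGABITS_PER_S

print("Disk holds", capacity_ratio, "times as much data")
print("Registers move data", speed_ratio, "times faster")
```

This gives 31,250,000,000 for capacity and 3,760 for speed. So the capacity gap is millions of times wider than the speed gap.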
All the kinds of memory are necessary to get your computer running quickly and to hold the large quantities of data that we have come to enjoy. In my mind it is kind of like the orbits of planets. Farther planets orbit very slowly but have huge orbits, and planets that are painfully close to the sun have tiny orbits but orbit ridiculously fast.
So, when the Mac Research guy said that the local memory buffers on the GPU are so fast they can function as registers, it is amazing. It turns the small handful of registers into a large group of almost 4k. That has amazing potential. Storing information in the local memory, you can get calculations to just scream. Typically GPUs are clocked at about a fifth of the speed of a CPU. If you can get things running in the local memory buffer, then each individual stream processor can start to give the CPU a run for its money. That is crazy fast.
That was a lot longer than I had intended to write, and there are some things that I left out. But I'll wrap this up here for now. Besides, it's not like anyone reads this, so it doesn't really matter. :p