Monday, February 27, 2012

Tutorial Architecture

A friend of mine critiqued my post on parallelism. It gave me some good insight. into what needs to be done and what was done well. It has made me think about what makes a good tutorial/guide/instructional writing of any kind. One thing I thought of was how much detail is necessary. If I put forth to much detail the reading becomes overwhelming, if I don't put in enough detail then it may not get the important information across. Also, Gadget commented on the value of examples and summaries. So this is the approach I came up with.

I start with a topic, in this case parallelism. A topic is a piece that is meant to be read together as one chunk. Kind of like section in a chapter or a post on a blog. Each topic is made up of concepts and key words. At the beginning of a topic introduce what the concepts to be covered and define the key words for that topic. Give a summary over view of the topic and then go into the concepts.

Each concept has several levels of explanation. The first paragraph of the concept is the summary of the concept for some one who already knows the concept. This way, if some one reads this summary and feels confident that they understand the concept, they can continue on to the next one. In this way some one who knows a lot but maybe missing a few things can skim really easily.

The next section of the concept is optional. It tries to explain the concept carefully. Some one who this is a new field would should be able to understand this explanation. And there may be more than one attempt. this part of the concept is attempt after attempt to explain the concept. coming from different angles. pictures, analogies, detailed explanations, what ever it takes to get help anyone understand the concept. the reader need only read how ever much it takes to understand the summarized earlier.

After the explanations, it continues on to the next concept. There isn't a summary because the reader is expected to read the summary that came first. I may change that and just restate the summary in different words.

After each concept has been covered there is a summary of the topic, how all the concepts fit together and what not as well as Links or references to further reading. What I hope to accomplish with this approach is that those who know can skip, the important information is presented up front and hashed over as many times as it takes to make it clear, the information is represented in a clear pattern that is carefully followed so that the user can keep focused and keep track of where they are. and if they still don't understand or want to learn more, there are references to further study. It isn't a format that standard textbooks would be able to support well, and will take a lot of work from me. but if I do it on my blog, I can have sections of text expand and collapse. and the time I spend writing the tutorial I'm sure I'll get the time i spend back in the confidence in the field I have found in myself.

Thursday, February 23, 2012

What is Parallelism?

I've decided that this is where I need to start with my new OpenCL tutorial. For me to be able to easily and accurately explain how OpenCL works, I need you to understand what the different kind of parallelisms are.

If you are reading this, you are interested in learning OpenCL, but have no experience programing in such environments. If you are reading this you probably don't have a background in electrical engineering or computer engineering. If you are reading this, you may be familiar with the concept of multi-threading, but may not be aware that there are more kinds of parallelism. Hopefully, you'll be more in the know here soon.

In computer science we can do things in 1 of two ways. In serial and in parallel. These typically speak for themselves, but just so we are on the same page I'll elaborate. Doing something in serial looks like:
A happens, then when its done, B happens, then when it is done, C happens, and D, and so on and so forth. In some cases there are whats called dependencies from A to B and down the line. or in other words, D can't execute before C can, it depends on some of the output values of C. C can't execute before B does, nor could B before A because there because C and B depend on the outputs of the step before them. In the case of dependencies, you must do things in serial.

However, if there are no dependencies A, B, C, and D could all execute at the same time. this is called executing in Parallel.
In order to do this, there are certain physical parameters. The easiest to understand (and we'll stick with the CPU for this example) is the number of processors. If you have only 2 processors you could only do:
If you only had one processor you would have to do something like
A->D->C->B (represented out of order here for the point that execution order doesn't matter)
The first parallel example would require 4 processors (or a 4-core processor). so the number of things you can do at once depends on the number of processors you have. This is the concept of multi-threading or multi-processing (the distinction between these two I will write about later).

Multi-threading/multi-processing is an example of task-level-parallelism. There are 4 levels of parallelism: (in order from the bottom up)
Bit Level
Instruction level
Data Level
Task Level

Bit level parallelism is not really a concern anymore. It was a concern back in the 1970s when 8-bit processors were the common processor. Inorder to add two 16-bit numbers in an 8-bit processor it would take 2 steps: add lower 8 bits, add upper 8 bits with a carry bit from the first step. This meant that doing this one things, adding 2 16-bit numbers required 2 instructions. now, it is about as likely that you are using a 64-bit processor as a 32-bit one. in either of these cases, the size of the integers are so large that this isn't a concern anymore. So I won't talk about this any further.

Instruction Level:
This level of parallelism gets divided into 2 categories. the one we aren't really concerned about because it happens under the hood is called instruction pipelines. in the average processor, there are typically several instructions being executed at once. But the way this works doesn't create multi-core processor like we know it today. This is just the common place way that instructions execute in a processor. Again, we aren't concerned with writing operating systems right now, so I'm going to leave this here. This isn't something we can really exploit or even interface with at all.

The other kind of instruction level parallelism is called "Superscalar" processors. In this case, a single processor executes one instruction more than onces at the same time. (Instruction: In machine language, binary essentially, what your processor actually sees, each function or step is called an instruction.) This is also sometimes called "SIMD" which stands for "Single Instruction Multiple Data". SIMD is what I want you to remember and wrap your mind around with this level.

Say I have 8 values, i want to add every 2 together, so first + second, third + fourth, fifth + sixth, and the seventh + the eighth. In a normal processor instruction it would happen just like that. group 1, group 2, group 3, group 4. with SIMD, which is actually available to you in most modern processors, the values in 1st, 2nd, 3rd, and 4th all get placed in special registers(a register is the lowest level of storage. RAM and the cache are made up of groups of registers. When ever your processor executes an instruction, it reads the data from a register to a register. if it wants to use something in RAM or cache, it reads it from them to a register and then it executes the instruction. see my post on memory access for further details.) in the processor and then the processor does executes the add instruction once, and it does both group 1 and group 2 at the same time. in this manner the number of instructions is halved.

A graphics card is an example of the first word I used. it is a "Superscalar" processor. Instead of loading group 1 and 2 into special registers, the graphics card loads all the groups in clumps of 2 into individual stream processors. so 4 stream processors get used. The thing about a stream processor is it has no control systems. A stream processor does what it is told, it never figures out what to do. So the GPU says add your first and your second to the stream processors. It is important to note here that the GPU need not and further cannot even if you wanted to, address each stream processor individually. When the GPU says "ADD!" all the stream processors add. when the GPU says "Load from your location + 1" stream processor 1 loads from memory location 2, stream processor 2 loads from memory location 3, etc. Its called performing instructions in "lock step" (meaning, each stream processor is locked into the same step as all the others) This may seem like it restricts what the graphics card can do, if you think that, you are right.

Here's the thing. Because we've restricted what the graphics card can do, there are fewer things to concern our selves about. We can then make those few instructions go super fast. So great amounts of speed are gained by applying these restrictions. This is how your graphics card can paint 1980x1080 pixels 60 times a second. (if the GPU were to do 1 instruction per pixel, that would be 128 million instructions a second. To put it in context. A GPU with 1024 stream processors [middle high end] can execute 640,000,000,000 instructions per second. A quad core CPU can execute 9,600,000,000 instructions instructions per second. see how restrictions can be useful?)

So, summarize, instruction level parallelism as we are concerned with, takes multiple data and performs the exact same function (the exact same function being the important point here) on them all at once.

Data level parallelism:
Data level parallelism is a theoretical concept. It actually rides on top of task level parallelism, but with some key differences. The easiest way to imagine data level parallelism is to think of a for loop. Assuming that one iterations of the for loop has no dependencies on any of the other iterations, a data level parallelism will execute all (or as many as physically possible) of the iterations of the for loop at once. Each processor is running the exact same code, but on different data. how does this differ from instruction level parallelism? The big thing is that while the same code is being executed, each execution is not in lock step. There is an efficiency gain because there is a sharing of code, but they don't have to all execute at exactly the same time.

Task level parallelism:
This is were a lot of attention is being put currently in computer engineering and software engineering. Here's what makes task level parallelism so important: there need not be anything in common between the separate tasks. Here's what makes it so hard: what if there is something common between the separate tasks?

Ok, now decoding my last two statements a little. Instruction level parallelism had each separate instance executing in lock step. data level it wasn't in lock step, but all instances were within a for loop. With task level there can be two completely un-related tasks running at the same time. consider your computer's network driver and a text editor. nothing at all in common, but they can run at the same time. This creates a huge amount of flexibility, but extra flexibility brings extra overhead.

But this also begs the question, what happens if/when the tasks share? There are a lot of things such as locks and semaphores that ensure that different tasks don't step on each others toes. But those control structures are not free. Some times there isn't concern over stepping on each others toes, but it can some times be hard to even share the information.

So my point is that while task level parallelism sounds attractive because you can execute what ever you want when ever you want, be careful, that freedom comes at a price. Instruction level parallelism may feel restricting, but because of those restrictions it avoids the overhead involved with task level parallelism.

So this, as with everything, is a trade off. So remember that appropriateness is key. Complete independent tasks? Task level. Large data sets with identical instructions? Instruction level.

Saturday, February 18, 2012


I wrote down a while back a file I called "Game Engine Guide" which were my original thoughts on making a game engine for my own academic purposes. I realized after I posted my last post that I referenced "Delta" with out any context for my 0 readers. (I consider this blog more of a journal than anything) In my game engine guide I planned for 3 engines. Delta, Epsilon, and Zeta. Delta was to learn the technology. OpenGL, OpenCL, OpenAL, OpenMP, etc. Epsilon was designed to be my fantastic failure. I'm over the learning curve of the technology, now lets over-plan and be over ambitious and see what lessons I can learn, what my limits really are. Then Finally Zeta is the acceptance of those limits and the design and creation of a usable engine that I am capable of creating.

In the interest of what I dreamed up last time, I want to accomplish Delta, and I want to do so soon. So I'm taking a moment to decide what Delta needs to be able to be considered complete, as well as musing a bit about its architecture.

Delta is written entirely in C and C++. It uses simple polygons for its graphics and physics, both of which are computed on the GPU. The CPU is left in charge of managing the GPU and the Audio through OpenAL. Collision detection and resolution happen on the GPU, something that will take some effort for me to master, but when done will be very powerful. I had originally said that it was to have mouse looking and keyboarding. I'm removing mouse looking. There will still be the ability to trackball items, but mouse looking is not possible through glut.

So, what do I need to complete Delta?

Finish reimplementing the graphics interface class to allow for a vector of models and the interchangable cameras and what not.
Importing .obj files
Sharing location data with Physics.

Collision detection
Collision resolution

Learn OpenAL
Triggers from GPU for sounds
Triggers from CPU for sounds
Internal hash library for sound effects

Loading main game
Loading levels

A fair bit to do. But I think I'll be able to accomplish a large chunk of this by the end of spring break. I've done a lot of work already over Christmas break. One important choice I've decided to make deals with the actual running of the engine. I've decided that the engine is to be 100% data driven. Meaning that the difference between one game and another game for the engine is solely in the assets. One game will have one set of models, sounds, scripts, etc. another will have a completely different set. The engine will be a single standalone executable. This is highly portable, highly flexible, and very simple. To play a different game you just send a different command line argument. Theoretically this would be just a different short cut to the end user that the theoretical developer would have made for the user. yes, there are flaws with this approach, but it will be sufficient for now. I just want to get Delta done so I can have something to show for my work.

Thursday, February 16, 2012

Kick starter

I was going to try to accomplish something on Facere, but a recent change in Ubuntu One broke it on my netbook, and things are currently re-assembling themselves, mean I can't access my files for Facere right now. Not wanting to waste time, I decided to move I decided to have a mental walk around an idea I had for next summer.

Between now and this time next year I hope to have completed Delta and Epsilon, and be working on Zeta. If such comes to pass, I might be able to make a step towards a kick starter. For a long time I've been interested in making a truly freeing space what-ever-you-want-to-do game which has a heavy focus on physics and custom ships. the custom ships being by far the singly most important feature. If I have zeta in alpha, then I might be able get a little interest in funding me towards that objective.

I estimated that I earned about 3k$ this summer working for computing services. If I were to try for a kick starter, I'd probably live with my mom over the summer which would cut costs a bit. But this project would essentially be my summer job. If I could get a kick starter fund of 2k$ I could probably justify it as a reasonable choice. If I could get 3-4k$ it not only would be a reasonable choice, but quite probably the most prudent one.

And my imagination flies off the hook here: If, and it is a big if I know, but If I accomplish something that is would be of the status of minecraft's first release, or dwarf fort, or overgrowth, and get something where people are purchasing pre-orders, I may very well be able to turn this in to my lively hood while I'm working towards my masters. That would be uber cool. I could settle where ever I need to and just work on what I'm truly passionate about.

Ok, the last paragraph maybe be a little far fetched, but as we have seen is entirely possible. I'm going to keep this idea in the back of my head, and press forward with Delta and Epsilon. If I can complete them and start working on Zeta by christmas break next year. I may just try to do this kick starter business.

Saturday, February 11, 2012

Simple Introduction

I recently had a little explosion on stack overflow. While I'm not afraid to admit I was probably in the wrong and the one being unreasonable, I think my point still stands. I wanted a simple introduction geometry shaders that would help me wrap my mind around how they work. Most of what I ran across was either very poor in quality (poor explanations, poor examples, specific to systems or environments) or far to technical (lots of words or phrases that aren't familiar to unskilled users, a feeling that you already understand how the rest of the system works and this is just adding on a new feature to what you already know, etc). So I wanted to see if anyone out there new of a good tutorial. My question was closed as not actually a question.

I'm taking a potentially permanent hiatus from any interactions with SO till I calm down. But also, it made me think. Of what quality is my OpenCL tutorial? I think Winkle was trying to get me to understand this too. It wasn't helpful for some one who doesn't already know how the system works. So. I'm taking a step back with my openCL intro. I'm going to see if I might be able to incorporate pictures and explain better how parallelization works. This will be a good exercise for me I think, and if I can explain it on such a level, that also means that I'll probably have to clarify things in my head. I may end up reading my own tutorial as I make it. xD