• Register

Indie gamedev for IfThen Software. I usually handle technical tasks, but I also pitch in with PR, marketing, and community management.

RSS My Blogs  (0 - 10 of 20)

Because I'm getting into the section on memory (and some parts of yesterday's example made me curious, particularly the RAS-to-CAS delay) I have decided to research RAM again. I did quite a bit of research in this area about a month ago, so a lot of this is review and should go by relatively quickly.

Data is stored in RAM in an array of bits called a "bank".

A horizontal line of bits make up a "word line" and a vertical line of bits make up a "bit line".

The array is split up into rows and columns. A row is equivalent to a word line, so that's easy enough. A column is made up of multiple contiguous bit lines, either 4, 8, or 16 depending on the architecture of the chip. The number of bit lines which make up a column is the word width of the chip, or just simply the "width". This is usually written with an "x" followed by the width: x4, x8, and x16.

The number of rows depends on the generation of DDR and the "density" of the chip. The density is the number of bits the chip has total across all banks. The number of columns depends on the density and width of the chip. The exact row and column counts for the possible densities and widths are specified by the DDR standard for the generation in question, although my experience is that you have to perform some calculations to find them. Here is a table I put together of the row and column counts for the first generation of DDR:

Reposted from Invisiblegdev.blogspot.com

A predicated instruction is an instruction whose execution depends on the result of a true/false test. Another way to look at it is a single instruction for code like the following:

if (a > b) c = 6;

Predicated instructions can help to reduce the number of branches in your code, which may increase how fast your program executes.

On a slight tangent, I also learned what a transistor is: A resistor whose value (resistance) changes. I still don't know how they are used or why there are so many in a processor, but I've satisfied my curiosity for the moment. I highly recommend this video on the subject: Youtube.c...h?v=CkX8SkTgB0g

You can classify a processor as having either a Brainiac design or a Speed-Demon design based on how much it pushes for ILP. A Brainiac design throws as much hardware at the problem as possible, sacrificing simplicity and size for more ILP. A Speed-Demon design relies on the compiler to schedule instructions in a way that extracts as much ILP out of the code as possible. A Speed-Demon design is relatively simple and small, allowing for higher clock speeds (until the thermal brick wall was hit in the early 2000s) which is how it got its name.

I finally started learning about memory access. One of the reasons I started researching CPU architecture was to find out why a Load-Hit-Store on a Xenon (XBox360 processor) could cause a stall of up to 80 cycles, and I think I am getting close to an answer. If I could reiterate an example from Modern Processors - A 90 Minute Guide, lets say you have a 2.0GHz CPU and 400MHz SDRAM. Lets also say that it takes 1 cycle for the memory address to be sent from the CPU to the memory controller, 1 cycle to get to the DIMM, 5 cycles from the RAS-to-CAS delay (assuming there is a page miss, as is likely with the memory hierarchy we have today), another 5 cycles from CAS, then 1 cycle to send the data to the prefetch buffer, 1 cycle to the memory controller, and finally 1 cycle to the CPU. In total we have 15 memory-clock cycles (assuming the FSB:DRAM ratio is 1:1) to get data from the main memory. To convert this into CPU clock cycles, multiply it by the CPU:FSB ratio (CPU multiplier), which in this case is 5. So 15*5 = 75 CPU clock cycles before the data is received from the main memory. A diagram is definitely easier for me to understand, so here is a drawing I created to help me understand how this latency was calculated:
Posted Image

Reposted from Invisiblegdev.blogspot.com

Trying out a different work schedule, so my blog entries may be a little erratic while I work out some of the kinks. So anyways, this blog post is for yesterday.

Today I learned about a couple new techniques. The first one is something called "very long instruction word" (VLIW). In VLIW, a single instruction is actually composed of multiple smaller instructions which have been packed together by the compiler. The fetch and decode stages of the pipeline can effectively work with multiple instructions in parallel, but they only have to deal with a single instruction. The decode stage unpacks the sub-instructions and sends them to the appropriate functional units. The decode stage does not detect hazards and the pipeline can generally only be stalled on a cache miss, so it is the job of the compiler to insert NOP instructions (no-operation) to prevent hazards. A processor which makes use of VLIW is said to have an explicitly parallel design.

The other technique is something called an "interlock". I am still researching the specifics, but an interlock in general is some mechanism which prevents harm to either the operator or the machine (an example would be the mechanism which locks the oven door during a self-clean cycle). In the case of processors, an interlock can detect a data hazard in the decode stage of the pipeline and stall the previous stages while sending NOPs out of the decode stage until the hazard has been resolved. I am assuming that an interlock is also used to prevent one stage from passing its output to the next stage in the event that the next stage is stalled (for example, a particular instruction is taking multiple clock cycles to execute and the fetch and decode stages must be stalled).

Reposted from Invisiblegdev.blogspot.com

I spent a good chunk of the day organizing my Google Docs page; it was becoming impossible to find anything. Whenever I came across a resource while researching something, I would tell the Google Viewer to save it to my documents page but that caused things to become quite cluttered.

The remainder of the day was spent reading Pipelining: An Overview at ars technica. It's mostly been review, so I don't have much to report. However, I did learn that a non-pipelined processor is also called a "Single-Cycle Processor". The article series also did a good job of explaining what a stall and pipeline bubble is, including a few interesting graphs.

Reposted from Invisiblegdev.blogspot.com

I finally have an answer to my question! PleasingFungus from the #tigIRC channel on Esper discussed the question with me for two hours straight today and this is the answer I ended up with:

If a scalar (not superscalar) processor has multiple nonredundant functional units (a single FPU and a single ALU for example) and it issues an instruction to the FPU at a certain clock tick, could it issue another instruction to the ALU at the next clock tick so that both the FPU and the ALU are executing at the same time (lets say that FPU instructions take longer to execute than ALU instructions)? Or would that make the processor superscalar?
A processor is only superscalar if its maximum instruction throughput is greater than 1. Because only a single instruction is being issued each clock cycle, the maximum instruction throughput is 1, so the processor is not superscalar. A processor can have multiple pipelines, but if the maximum instruction throughput is equal to or less than 1 the processor is not considered to be superscalar.

Instruction throughput is a measure of the number of instructions which can be processed by the pipeline (committed) per cycle. Because all instructions don't always get through the pipeline in the same amount of time, the "average instruction throughput" can change depending on which instructions were executed. The maximum instruction throughput is a static measurement however, representing the best case scenario.

Even if the instruction pipeline splits into two separate pipelines (one for each functional unit) at the issue stage (or some other point), if only one instruction is dispatched per clock cycle the processor has an instruction throughput less than or equal to 1. Two instructions might complete at the same time (if one functional unit is slower than the other), thus giving the illusion of having an instruction throughput of 2, but it will take at least one cycle to "recover" from this, during which no instructions will complete; so it evens out.

I am still doing some research into this topic, although I currently consider myself to be about 50% finished with the question. I plan on reading the Pipelining: An Overview series at ars technica after working through some example pipelines on paper.

Reposted from Invisiblegdev.blogspot.com

I spent most of the day trying to find the answer to the following question:

If a scalar (not superscalar) processor has multiple nonredundant functional units (a single FPU and a single ALU for example) and it issues an instruction to the FPU at a certain clock tick, could it issue another instruction to the ALU at the next clock tick so that both the FPU and the ALU are executing at the same time (lets say that FPU instructions take longer to execute than ALU instructions)? Or would that make the processor superscalar?

Unfortunately, I have not yet found a definite answer. After a rather long discussion in the #hardware channel on Freenode, one person said that such a processor would have to be superscalar. I need to get the opinion of several people on this matter though, especially since he seemed a little unsure of the answer he gave.

I asked for help on a hardware forum, but there have been no replies as of yet. I may search for a more technical hardware forum tomorrow.

Reposted from Invisiblegdev.blogspot.com

"Scoreboarding" is also called "Thornton's Algorithm". Tomasulo's Algorithm and Thornton's Algorithm were developed around the same time with roughly the same goals.

  • They are both dynamic scheduling methods.
  • They both resolve dependencies (RAW, WAR, and WAW).
  • They are both pipelined.
  • They both have limited out-of-order execution capabilities.
  • Neither was designed to be superscalar.

Tomasulo's Algorithm, while solving dependencies better, was more complicated than Thornton's Algorithm.

Reposted from Invisiblegdev.blogspot.com

Welp, I am still trying to figure out how an out-of-order scalar processor would work. The out-of-order execution part is easy enough, but I am having trouble figuring out how in-order completion would work. To this end, I am currently researching scoreboarding in the hopes of gleaning some hint as to how such a processor would work. If anyone knows of a processor which had out-of-order execution and in-order completion (different from commit), please let me know!

Research Questions

What is forwarding?
Forwarding is what you call the technique of passing the result of an instruction directly to a dependent instruction before the result is committed. This allows an instruction to execute before its operands are available in the register file.

Reposted from Invisiblegdev.blogspot.com

Today I was able to find an answer to the following research question:

Can a processor have out-of-order execution and in-order completion?
Yes; with one pipeline per active instruction. In other words, for every instruction in the active instruction window (every instruction which can be selected for execution) there needs to be a functional unit. This seems a little impractical and I doubt this sort of processor has been used much.

I am currently working out how the various combinations of in-order/out-of-order and scalar/superscalar affect the design of a processor. I have the superscalar combinations figured out, and I am currently working on the scalar combinations.

Reposted from Invisiblegdev.blogspot.com

Today I was able to find answers to the following questions:

What is the reorder buffer?
It is a buffer which is used to ensure that instructions commit in-order. It is also used in one method of register renaming. There is one entry for every active instruction. An instruction is active after it has been dispatched and before it has been committed. A reorder buffer is a circular FIFO queue, meaning it has a head pointer and a tail pointer which can wrap around once they have hit the end of the buffer, and entries are removed in the order they are added. When an instruction is dispatched, an entry for it is placed in the reorder buffer where the tail is pointing (and the tail pointer is incremented to the next entry address). When an instruction is being pointed to by the head pointer and the instruction has completed execution, it is committed and the head pointer is incremented to the next entry address.

Can a superscalar processor have in-order execution?
Yes. If sequential instructions are being executed in parallel, it is still considered to be in-order execution.

I also added the following unanswered questions to my research list:

  • What does "dynamically encountered" mean regarding memory operations in the dynamical reordering process of a processor?
  • Where is the active instruction window?
  • Can a processor have out-of-order execution and in-order completion?

One of our community members and moderators, Acaceol, created a rather cool IRC-based project tracking system. I am currently using it to track the various projects and tasks I have to do, and the best part is that anyone can check up on the current status of development/research by simply joining our online chat room located at Irc.ifthensoftware.net and either asking the project management system for the current status or watching as we update it throughout the day. This will allow you to keep up-to-date on exactly what is being worked on, including how much time has been spent on the various tasks and projects!

Reposted from Invisiblegdev.blogspot.com

Last Online
United States 🇺🇸
Become friends
Member watch
Blog Statistics
Views Today