|
Dear Open Source Hardware lovers,
I wrote a small 3-stages pipeline example in Verilog and tested it using Icarus Verilog. This particular pipeline example has no real interest, all it does is adding 3 to the input integer, it is for educational purposes only and to serve as a basis for more advanced things in the future. What is a pipeline ? Quoting Wikipedia : "In computing, a pipeline is a set of data processing elements connected in series, so that the output of one element is the input of the next one. The elements of a pipeline are often executed in parallel"
A drawing is better than 1 000 words, let assume we have a typical hardware block looking like this :

So we have basically :
- A clock (input) because we are doing a synchronous design, right ?
- A reset (input) because we want to be able to restart/reset our design.
- A "Data in" input, which can be a bus of several lines, for example 8 input lines for an 8-bits input : this is the path for data coming to the block in order to be processed.
- A "Data out" output, which can be a bus of several lines as well : this is the path for data coming out from the block, the result of the block's processing.
Actually this is usually not enough. As users of this block, we need to know :
- when it is available for computing (i.e not busy with another computing).
- when the block has sampled our input data (and started working on it) so that we can present another input data.
Moreover, the block needs to know : - when input data is correctly set, in order to sample it.
- when output data has been received/sampled, in order to start working on another input data and then be able to present the next results at the output data pins.
Therefore, the previous block would usually look a little bit more like this :

This block is doing a job on its input data, therefore producing it's output data. Very simple so far, right ?
Let assume this block does its computation in 10 clock cycles (10 periods T of the clock signal), this means that after feeding this block with data, you will wait for 10*T seconds to get the output out of it.
This means the block has a 10*T seconds latency AND that the block will only output data each 10*T seconds.
Pipelining is a way of improving the block throughput.
We won't be able to improve the block latency, because we are not going to optimize the algorithm itself used inside this hardware block : the computation needed to transform an input into an output will still take 10*T seconds.
What we can do with pipelining is making it possible for this block to reduce the time between two outputs, therefore increasing the throughput of the block.
"But how is it possible?" "You said the block can only compute in 10*T seconds and cannot accept another input while it is still computing!"
Yes! That's the trick! You have to let the block accept another input BEFORE it has totally computed the previous input.
The idea is to break the algorithm down to smaller blocks, all chained (pipelined) together : this is pipelining.
Each of these smaller blocks constituting the pipeline is called "a stage".
A 3-stages pipelined version of this hardware block would look like this :

This is a nice simplified drawing of what a typical pipeline looks like. Simplified? Yes. This just shows the idea of a chain of elements with a few control lines, I removed the clock and reset lines to make it simpler but we still need those.
"So now what? We now have 3 blocks instead of 1, and that's it? Why is this any better?"
Well, yes. That's it.
Let assume the following statements are correct :
- The first stage takes 2 clock cycles to do its job.
- The second stage takes 3 clock cycles to do his share of the job.
- The third stage of the pipeline takes a little bit longer : 5 clock cycles to finish the job.
2 + 3 + 5 = 10 OK, that sounds logical, we didn't change the algorithm so it's not any better, we just split it up in 3 parts.
But something has changed.
Now when the "stage 1" is done with its data, it can give the output to the "stage 2" whenever the latter is ready and then start processing a new output. And this applies to the following stages too.
Indeed none of the smaller block can start a new processing until it has passed its output data to the next smaller block, i.e untill the next smaller block is ready
As a result, the pipeline "speed" will be the "speed" of its slowest stage.
Which means that in our example the pipeline will output data every 5*T seconds! That's an improvement, we have twice more data coming out from the pipelined block than from the non-pipelined block in a given time period.
This is why pipelines are widely used in the conception of CPUs. Usually, a CPU contains an Instruction Pipeline whose goal is to fetch the machine code instructions from main memory, decode them, execute them and write the results back into registers and memory.
Naturally, the different stages of this Instruction Pipeline are :
- Fetch
- Decode
- Execute
- Write Back
Here is a an example of such an Instruction Pipeline :
IF : Instruction Fetch ID : Instruction Decode EX : Execute MEM : Memory access WB : Register write back
For more informations about pipelines you can look at the Instruction Pipeline Wikipedia page, it's pretty well documented.
In this code each stage is doing exactly the same thing, I just duplicated the code and renamed the stages.
Each stage takes an 8-bits integer as input, increments it and then outputs it.
I guess you can therefore easily conclude that all this pipeline does is adding 3 to a given 8-bits integer.
How to run the code ?
- Install Icarus Verilog
- git clone git://github.com/fallen/tinycpu.git && cd tinycpu/examples/pipeline/
- make run
Thanks for reading me !
|