Hardware pipelining example
Written by Yann Sionneau
Tuesday, 01 November 2011 11:24
Dear Open Source Hardware lovers,
I wrote a small 3-stage pipeline example in Verilog and tested it using Icarus Verilog. This particular pipeline is not useful in itself: all it does is add 3 to the input integer. It is for educational purposes only and is meant to serve as a basis for more advanced things in the future.
What is a pipeline ? Quoting Wikipedia : "In computing, a pipeline is a set of data processing elements connected in series, so that the output of one element is the input of the next one. The elements of a pipeline are often executed in parallel"
A drawing is better than 1 000 words, so let's assume we have a typical hardware block looking like this :

So we have basically :
- A clock (input) because we are doing a synchronous design, right ?
- A reset (input) because we want to be able to restart/reset our design.
- A "Data in" input, which can be a bus of several lines, for example 8 input lines for an 8-bit input : this is the path for data coming to the block in order to be processed.
- A "Data out" output, which can be a bus of several lines as well : this is the path for data coming out from the block, the result of the block's processing.
Actually this is usually not enough. As users of this block, we need to know :
- when it is available for computing (i.e. not busy with another computation).
- when the block has sampled our input data (and started working on it), so that we can present the next input.
Moreover, the block needs to know :
- when input data is correctly set, in order to sample it.
- when output data has been received/sampled, in order to start working on the next input and then present the next result on the output data pins.
Therefore, the previous block would usually look a little bit more like this :

This block does a job on its input data and thereby produces its output data. Very simple so far, right ?
Let's assume this block does its computation in 10 clock cycles (10 periods T of the clock signal). This means that after feeding this block with data, you will wait 10*T seconds to get the output out of it.
This means the block has a latency of 10*T seconds AND that the block will only output data every 10*T seconds.
Pipelining is a way of improving the block throughput.
We won't be able to improve the block latency, because we are not going to optimize the algorithm itself used inside this hardware block : the computation needed to transform an input into an output will still take 10*T seconds.
What we can do with pipelining is reduce the time between two outputs, therefore increasing the throughput of the block.
"But how is it possible?" "You said the block can only compute in 10*T seconds and cannot accept another input while it is still computing!"
Yes! That's the trick! You have to let the block accept another input BEFORE it has totally computed the previous input.
The idea is to break the algorithm down into smaller blocks, all chained (pipelined) together : this is pipelining.
Each of these smaller blocks constituting the pipeline is called "a stage".
A 3-stage pipelined version of this hardware block would look like this :

This is a nice simplified drawing of what a typical pipeline looks like. Simplified? Yes. This just shows the idea of a chain of elements with a few control lines, I removed the clock and reset lines to make it simpler but we still need those.
"So now what? We now have 3 blocks instead of 1, and that's it? Why is this any better?"
Well, yes. That's it.
Let's assume the following statements are correct :
- The first stage takes 2 clock cycles to do its job.
- The second stage takes 3 clock cycles to do its share of the job.
- The third stage of the pipeline takes a little bit longer : 5 clock cycles to finish the job.
2 + 3 + 5 = 10. OK, that sounds logical: we didn't change the algorithm, so the latency is not any better, we just split the work up in 3 parts.
But something has changed.
Now when "stage 1" is done with its data, it can give its output to "stage 2" whenever the latter is ready, and then start processing a new input. And this applies to the following stages too.
However, none of the smaller blocks can start processing new data until it has passed its output data to the next smaller block, i.e. until the next smaller block is ready.
As a result, the pipeline "speed" will be the "speed" of its slowest stage.
Which means that in our example the pipeline will output data every 5*T seconds ! That's an improvement: in a given time period, twice as much data comes out of the pipelined block as out of the non-pipelined one.
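To put numbers on the latency versus throughput reasoning, here is a small C sketch of the arithmetic. It is not part of the Verilog example: the function names are made up for illustration, and it assumes every item goes through identical stages and that each stage hands its result over as soon as the next stage is free.

```c
#include <stddef.h>

/* Latency of one item: it has to traverse every stage, so the cycle
   counts of all stages add up. */
unsigned pipeline_latency(const unsigned *stage_cycles, size_t n)
{
    unsigned total = 0;
    for (size_t i = 0; i < n; i++)
        total += stage_cycles[i];
    return total;
}

/* Steady-state number of cycles between two outputs: it is set by the
   slowest stage. */
unsigned pipeline_interval(const unsigned *stage_cycles, size_t n)
{
    unsigned slowest = 0;
    for (size_t i = 0; i < n; i++)
        if (stage_cycles[i] > slowest)
            slowest = stage_cycles[i];
    return slowest;
}

/* Cycles needed to get `items` results out of a pipeline that starts
   empty: the first result pays the full latency, every following one
   arrives one interval later. */
unsigned pipeline_total_cycles(const unsigned *stage_cycles, size_t n,
                               unsigned items)
{
    if (items == 0)
        return 0;
    return pipeline_latency(stage_cycles, n)
         + (items - 1) * pipeline_interval(stage_cycles, n);
}
```

With the stages of the example (2, 3 and 5 cycles), the latency is still 10 cycles and the interval is 5 cycles, so three results take 10 + 2*5 = 20 cycles instead of the 3*10 = 30 cycles of the non-pipelined block.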
This is why pipelines are widely used in the design of CPUs. Usually, a CPU contains an Instruction Pipeline whose goal is to fetch the machine code instructions from main memory, decode them, execute them and write the results back into registers and memory.
Naturally, the different stages of this Instruction Pipeline are :
- Fetch
- Decode
- Execute
- Write Back
Here is an example of such an Instruction Pipeline :
- IF : Instruction Fetch
- ID : Instruction Decode
- EX : Execute
- MEM : Memory access
- WB : Register write back
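To see how instructions overlap in such a pipeline, here is a tiny C sketch. It assumes an ideal pipeline (one new instruction enters every cycle, each stage takes exactly one cycle, no stalls), which real CPUs do not always achieve; `stage_at` is a name invented for this illustration.

```c
#include <stddef.h>

static const char *const STAGE_NAMES[] = { "IF", "ID", "EX", "MEM", "WB" };
#define NUM_STAGES (sizeof STAGE_NAMES / sizeof STAGE_NAMES[0])

/* Returns the stage occupied by instruction number `instr` (0-based,
   in program order) during clock cycle `cycle` (0-based), or NULL if
   the instruction has not entered the pipeline yet or has already
   left it. Instruction i enters IF at cycle i. */
const char *stage_at(unsigned instr, unsigned cycle)
{
    if (cycle < instr || cycle - instr >= NUM_STAGES)
        return NULL;
    return STAGE_NAMES[cycle - instr];
}
```

For example, during cycle 4 the first instruction is in WB while the second one is in MEM and the fifth one is only being fetched: five instructions are in flight at the same time.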
For more information about pipelines, you can look at the Instruction Pipeline Wikipedia page, which is pretty well documented.
In this code each stage is doing exactly the same thing, I just duplicated the code and renamed the stages.
Each stage takes an 8-bit integer as input, increments it and then outputs it.
I guess you can therefore easily conclude that all this pipeline does is add 3 to a given 8-bit integer.
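If you prefer to read C rather than Verilog, the behaviour can be approximated by the software model below. This is only a sketch, not a translation of the Verilog from the repository: the type and function names are invented, the valid bit is an assumption of mine, and every stage is assumed to take exactly one clock cycle.

```c
#include <stdint.h>

#define STAGES 3

/* One output register per stage, plus a "valid" bit telling whether
   the register currently holds real data. */
typedef struct {
    uint8_t data[STAGES];
    int     valid[STAGES];
} pipeline_t;

/* Equivalent of asserting the reset line: empty every stage. */
void pipeline_reset(pipeline_t *p)
{
    for (int i = 0; i < STAGES; i++) {
        p->data[i]  = 0;
        p->valid[i] = 0;
    }
}

/* One rising clock edge. Every stage registers "its input + 1".
   Stages are updated from last to first so that each one still sees
   the previous value of the stage before it, as flip-flops would. */
void pipeline_tick(pipeline_t *p, uint8_t in, int in_valid)
{
    for (int i = STAGES - 1; i > 0; i--) {
        p->data[i]  = (uint8_t)(p->data[i - 1] + 1);
        p->valid[i] = p->valid[i - 1];
    }
    p->data[0]  = (uint8_t)(in + 1);
    p->valid[0] = in_valid;
}
```

After a reset, feeding 10 and then 20 on two consecutive ticks makes 13 appear in the last stage after the third tick and 23 after the fourth one: the latency is 3 cycles, but a new result comes out every cycle. Since the data is 8 bits wide, an input of 254 wraps around to 1.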
How to run the code ?
- Install Icarus Verilog
- git clone git://github.com/fallen/tinycpu.git && cd tinycpu/examples/pipeline/
- make run
Thanks for reading !
Last Updated on Friday, 04 November 2011 11:58
Touchsurface Android app now PC compatible !
Written by Yann Sionneau
Friday, 21 January 2011 15:57
Hi guys !
The touchsurface application (port of the JGroups Draw demo for Android phones) I talked about in my last blog post is now PC compatible.
Which means you can now play with touchsurface on several phones AND on several computers at the same time !
Colors are now supported on the phone application :)
Source code of the app : https://github.com/fallen/touchsurface-android-jgroups
Source code of the JGroups port to Android : https://github.com/fallen/JGroups
Wanna try the application on your Android phone ? Just scan the following QRCode with your favourite barcode scanner :

Application link : http://sionneau.net/touchsurface.apk
All you have to do :
- Install the app on your Android phone ( >= 2.1 )
- Connect your phone to a WiFi Access Point
- The Access Point must not have a feature like "Access Point Isolation" activated
- The Access Point must forward broadcast packets to its clients for the discovery protocol (BPING) to work
- Each device connected to the same WiFi Access Point and running the app (or the Draw demo from JGroups) should be able to participate in the game.
Enjoy :)
See ya !
Last Updated on Friday, 21 January 2011 17:29
Written by Yann Sionneau
Tuesday, 11 January 2011 22:15
Hi guys !
It's been a long time ... I really don't have the time and the motivation to blog, I guess blogging is not for me !
Anyway, I am doing some relatively nice stuff on a school project lately.
The goal of the project : making games on smartphones where several players can play together, with automatic discovery of the available players (auto-configuration). Basically all players connect to a WiFi Access Point or set up a Bluetooth PAN, begin a game and play in a mobile context where the connection can be lost, data can be lost, and anyone could be disconnected at any second because of distance, a bug, the device shutting down or an empty battery.
What's interesting about that ?
Several points :
- No Client/Server architecture, the different instances of the game on each phone will exchange the objects they need and communicate with each other, using group communication (broadcast / multicast) or unicast.
- Nobody can be disconnected because of the disconnection (or bug, battery failure, whatever ...) of the server phone, since there will be *no* server phone.
- No need to enter an IP address or choose a phone to connect to or whatever, as soon as the phones are connected to the same subnetwork (wifi, bluetooth or whatever) they discover each other and can join or leave a game.
- Each game instance has the same code running, no server-side, no single point of failure.
So all of this is beautiful theory, but how do you do it in practice ?
We are trying to use JGroups as the lowest level communication API (over IP), it is a Java API to do "multicast communications".
Basically JGroups allows you to create a group (represented by a name). Everyone in the same group (the same name) can automatically discover the other participants of the group and begin to communicate with them, either in one-to-one (unicast) mode or in one-to-many (multicast/broadcast) mode.
In this case the group name can be the name of the game, so that all phones trying to play the same game would be "connected" to the same group and could communicate with each other.
So the game would use a game API (defined by some researchers from Télécom SudParis), and this game API would be implemented using the JGroups API.
So what is this blog post about ?
I successfully ran some demo programs using JGroups on Android phones (HTC Desire, HTC Hero and Nexus One), some demos were also run on an Ubuntu machine and on a Mac OS X machine.
I ported and ran on those three phones a modified version of the "Draw" JGroups demo program.
It's a whiteboard, you can draw on the whiteboard with your finger touching the screen of the phone, and each point you draw is then transmitted to other group members. Several players can draw on the same whiteboard using their own phone.
Then I ported the SimpleChat JGroups demo program. There is no GUI, but this time it is compatible with the computer version.
It has been tested simultaneously with 3 different phones and 2 computers (1 Mac OS X and 1 Ubuntu Linux).
You can write in the consoles of the two computers and the messages will be transmitted to everyone, you will be able to read them in the phones' syslogs (via adb logcat).
The phones send, in an infinite loop, a message to the group with "Hello world from *phone name*", you will be able to read these messages in the phones' syslogs as well as in the computers' consoles.
I will keep you posted if I have something new about this project !
Last Updated on Monday, 17 January 2011 09:26
Written by Yann Sionneau
Friday, 23 July 2010 11:52
Hi again :)
This is great news guys, the Ethernet driver is now able to send and receive Ethernet frames :)
That's basically all we need an Ethernet driver to be able to do !
Multicast isn't implemented yet, and neither are the statistics about the number of rx and tx frames, errors and collisions !
But it does work under qemu and on real hardware on the Milkymist One board :)
I have written a simple sample application that configures the network interface of RTEMS with these settings :
IP address statically set : 192.168.101.100
netmask : 255.255.255.0
Default gateway : 192.168.101.254
So I start the network and just send a UDP packet to 4.2.2.1:1234 with "toto" as payload :)
FYI RTEMS is using the BSD network stack, which works really well and has functionality similar to Linux's. It even has the same "packet structure" idea in order not to copy data over and over when passing a packet from one network stack layer to another (BSD uses struct mbuf and Linux uses struct sk_buff)
So the code of the sample looks like this :
char string[] = "toto"; // The string we want to send over the network to 4.2.2.1:1234
struct sockaddr_in farAddr;
int sock, ret;

rtems_bsdnet_initialize_network(); // initializes the network stack and the network driver, sets the IP address, default gateway and such

sock = socket(AF_INET, SOCK_DGRAM, 0);
if (sock == -1)
    perror("socket:");
else
    printf("socket:OK\n");

memset(&farAddr, 0, sizeof farAddr);
farAddr.sin_addr.s_addr = inet_addr("4.2.2.1"); // inet_addr() already returns network byte order, so no htonl() on top of it
farAddr.sin_port = htons(1234);
farAddr.sin_family = AF_INET;

ret = sendto(sock, string, strlen(string), 0, (struct sockaddr *)&farAddr, sizeof farAddr); // we send the UDP packet to 4.2.2.1:1234
if (ret == -1)
    perror("sendto:");
else
    printf("sendto:OK\n");
This code is very portable since it uses the socket API on top of the BSD network stack, it can run on Linux or BSD (or even Mac OS I guess ...)
The only thing that you would have to remove is the rtems_bsdnet_initialize_network(); call, since it's not usually the application's job to set up the network configuration :)
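One detail to watch when moving this kind of code to a little-endian machine (a typical PC) : the address stored in sin_addr has to be in network byte order, and inet_addr() / inet_pton() already return it that way, whereas the port does need htons(). The helper below is a sketch for Linux or BSD, `make_udp_dest` is a name I made up for the illustration and is not part of the sample application.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/* Fills `addr` with an IPv4 destination.
   Returns 0 on success, -1 if `ip` is not a valid dotted-decimal address. */
int make_udp_dest(struct sockaddr_in *addr, const char *ip, unsigned short port)
{
    memset(addr, 0, sizeof *addr);
    addr->sin_family = AF_INET;
    addr->sin_port   = htons(port);   /* host order -> network order */
    /* inet_pton() writes the address in network byte order already,
       so it must not be passed through htonl() afterwards. */
    return inet_pton(AF_INET, ip, &addr->sin_addr) == 1 ? 0 : -1;
}
```

The resulting structure can be passed to sendto() exactly like farAddr above. On any machine, the four bytes of sin_addr in memory are then 4, 2, 2, 1 in that order, which is what goes on the wire.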
Here is a Wireshark capture screenshot of the networking sample application I just described :

Have fun !
Last Updated on Friday, 23 July 2010 12:13
Framebuffer driver works in triple buffered mode
Written by Yann Sionneau
Friday, 23 July 2010 11:29
Hi folks !
The Framebuffer driver is now able to work in triple buffered mode in order to avoid flickering !
That's good news, but I have even better news !
The framebuffer has been tested with a more complex program than one that just fills the screen with a single color : the graphic toolkit for embedded systems Genode-FX has been successfully run on top of the framebuffer driver with RTEMS in single buffered mode :)
Here is a screenshot of Genode-FX running on qemu-lm32 :

It also works on the Milkymist One board :)
By default the driver uses single buffered mode, you can switch to triple buffered mode using an ioctl.
Here is a piece of example code to show how to use the framebuffer :
int fb;
struct fb_fix_screeninfo fb_fix;
unsigned short int *screen; // a pointer to the framebuffer memory area
fb = open("/dev/fb", O_RDWR); // We open the framebuffer
ioctl(fb, FBIOSETBUFFERMODE, FB_TRIPLE_BUFFERED); // we switch to triple buffered mode (optional)
ioctl(fb, FBIOGET_FSCREENINFO, &fb_fix);
screen = (unsigned short int *)fb_fix.smem_start; // Here we assign the memory address to our pointer
screen[50 * 640 + 50] = 0xffff; // We set the pixel (x;y) = (50;50) to white
ioctl(fb, FBIOSWAPBUFFERS); // We swap the buffers when we have finished to draw the next frame
Then we can go on and repeat this pattern : get the new address of the framebuffer (yes, it changed because we swapped !), write the next frame, then swap again :)
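The index 50 * 640 + 50 and the value 0xffff deserve a word. Here is a small sketch of two helpers that make them explicit. The 640-pixel line width comes from the snippet above; the RGB565 layout (5 bits of red, 6 of green, 5 of blue) is an assumption on my part, the snippet only shows that a pixel is 16 bits wide and that 0xffff is white. The function names are invented for the example.

```c
#include <stddef.h>
#include <stdint.h>

#define FB_WIDTH 640 /* pixels per line, as in the 50 * 640 + 50 index */

/* Index of pixel (x, y) in the framebuffer array: lines are stored
   one after another, so we skip y full lines and then x pixels. */
size_t fb_index(unsigned x, unsigned y)
{
    return (size_t)y * FB_WIDTH + x;
}

/* Packs 8-bit red, green and blue components into one 16-bit pixel,
   assuming the RGB565 layout: rrrrrggg gggbbbbb. */
uint16_t rgb565(uint8_t r, uint8_t g, uint8_t b)
{
    return (uint16_t)(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
}
```

With these, the white pixel of the example would be written as screen[fb_index(50, 50)] = rgb565(255, 255, 255);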
Have fun !
Last Updated on Friday, 23 July 2010 11:47
Framebuffer driver of Milkymist for RTEMS works !
Written by Yann Sionneau
Monday, 05 July 2010 01:31
Hi again folks !
A short post on my blog just to say that I have written the framebuffer driver of the Milkymist SoC for RTEMS.
I wrote a small sample framebuffer test program to check that the framebuffer driver was really functional, and it indeed confirmed that the driver is working :)
This program only writes red pixels in the framebuffer, resulting in showing a totally red screen in lm32-qemu !
The driver is operating in single buffer mode for now, so using it right now may result in flicker, since changes to the buffer are made directly to the front buffer, which is scanned at 60 Hz by the Milkymist SoC VGA IP core. So a change in the framebuffer can (and most of the time will) happen while the screen is refreshing, for example in the middle of a refresh: only half of the screen would then be updated, and we would have to wait for the next refresh to have the full screen updated with the new framebuffer content. This wait (about 0.016 seconds, one refresh period at 60 Hz) is actually visible to the eye and produces what we call the "flicker effect".
This driver will be improved to add support for double and triple buffering to have a totally flicker-free experience with RTEMS on Milkymist :)
That's all for now, stay tuned ;)
PS : the RTEMS repository is no longer hosted on the Milkymist github account, if you are searching for the driver please see the first link of this article. If you are searching for the test program please download my source archive from the GSoC google web page : http://code.google.com/p/google-summer-of-code-2010-rtems/downloads/list
Last Updated on Wednesday, 23 November 2011 22:34
RTEMS runs on Milkymist !
Written by Yann Sionneau
Saturday, 03 July 2010 15:19
Hi !
It's been a while since I last posted on my blog about my GSoC... and I apologize for it !
I have made some progress in the port of RTEMS to the Milkymist System-on-Chip :)
Thanks to Sebastien Bourdeauducq (alias lekernel), who took the time to work with me on June the 19th, I managed to write the Clock, Timer, UART and Console drivers for the Milkymist SoC !
We went through some problems understanding the autotools crapware used to configure and build RTEMS, and lost some time bootstrapping the entire source tree because we didn't know at the beginning that we could just bootstrap the working directory !
But anyway, now my Development Environment is all set-up and I am more aware of how RTEMS BSP (board support package) development works !
Because reading the documentation is one thing ... actual coding is another !
So finally we managed to get the hello and ticker RTEMS sample programs to run both on the REAL hardware (on the ML401 dev board) and on the lm32-qemu simulator emulating the Milkymist SoC using the parameter '-M milkymist' on command line ! :)
Which is really great news, the port is at last being born !
The hello sample test is a really basic program which just needs the console driver (which uses the UART driver) and whose only task is to print some text over the serial console, as a classic "Hello World" would do :)
The ticker sample test (init.c , tasks.c) does a little more, it tests the timer driver by launching 3 tasks (the equivalent of POSIX threads in terms of RTEMS API) simultaneously.
These tasks are named "TA1" , "TA2" and "TA3".
Each task enters an infinite loop which prints the time of day, then suspends itself and asks to be rescheduled respectively 5, 10 and 15 seconds later.
So in the end each task runs independently from the others, and prints a message with the current time every 5 seconds for TA1, every 10 seconds for TA2 and every 15 seconds for TA3.
This tests the timer for two reasons :
- The rtems_task_wake_after() function needs the timer to reschedule the task at desired time.
- The RTEMS internal scheduler needs the timer to do the scheduling between the different tasks, accounting for the time slice given to each task.
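To give an idea of the output to expect from the ticker sample, here is a plain C simulation of the wake-up pattern. It does not use the RTEMS API at all and is not the code of the sample: it only models three tasks that print once at start and then every 5, 10 and 15 seconds, and the message format is invented.

```c
#include <stdio.h>

typedef struct {
    const char *name;
    unsigned    period_s; /* seconds between two wake-ups */
} task_t;

static const task_t TASKS[] = { { "TA1", 5 }, { "TA2", 10 }, { "TA3", 15 } };
#define NUM_TASKS (sizeof TASKS / sizeof TASKS[0])

/* Walks through seconds 0..duration_s, prints one line each time a
   task would wake up, and returns the number of lines printed. */
unsigned simulate_ticker(unsigned duration_s)
{
    unsigned printed = 0;
    for (unsigned t = 0; t <= duration_s; t++) {
        for (size_t i = 0; i < NUM_TASKS; i++) {
            if (t % TASKS[i].period_s == 0) {
                printf("%s - time since start: %u s\n", TASKS[i].name, t);
                printed++;
            }
        }
    }
    return printed;
}
```

Over 30 seconds, TA1 prints 7 times, TA2 4 times and TA3 3 times. If the timer driver were broken, the real tasks would either never wake up or wake up at the wrong moments, which is why this sample is a decent first test.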
Click on "Read more..." to see screenshots of those tests running !
Last Updated on Saturday, 03 July 2010 18:49
GSoC 2010 welcome/gift package arrived !
Written by Yann Sionneau
Friday, 18 June 2010 20:47
Hi !
I have just received my welcome/gift package from Google for the GSoC (Google Summer of Code) 2010 :)
Here is the content of the Fedex package :
- A letter with information about the Google Credit Card
- A Google plastic ball pen
- Two Google Summer of Code 2010 stickers, one of which is shiny :)
- A nice Google Summer of Code 2010 Notebook (to use in combination with the plastic ball pen I guess ;))
- A very nice and custom transparent Google Credit Card (VISA)
Click on "Read more..." to see the pictures I have taken of the content of the Fedex package :)
Last Updated on Saturday, 03 July 2010 15:08
MD5 hardware bruteforcer ported to Xilinx Spartan-3AN Starter Kit
Written by Yann Sionneau
Thursday, 20 May 2010 09:12
Some news again !
We received a nice little package here at MiNET, 2 Xilinx FPGA development boards !
A quick port, done with a lot of excitement, and a test allow me to say that IT WORKS !
The MD5-hbf project works as well on this development board as on the first one (Avnet sp3Aeval).
The ported design even runs at 25 MHz, which is better than in previous tests where the frequency was 16 MHz only ! :)
"It was predictable" you would say, and you would be right ! Indeed I didn't take a big risk with this test, since the Spartan-3A is very similar to the Spartan-3AN and my design doesn't use any external peripheral other than the crystal oscillator and the RS-232 link !
Nevertheless, on the Avnet Spartan-3A Evaluation Kit the USB-UART bridging between the computer and the FPGA is done by a CY8C24894-24LFXI Cypress microcontroller, whereas the Xilinx Spartan-3AN Starter Kit uses an ICL 3232E component to do the TTL<->RS-232 signal conversion.
However, the RS-232 serial link works well on both boards anyway, which makes me think that my usart.v module isn't that buggy, since it works at the moment on 2 boards with 2 different RS-232 drivers.
News of the 4ed32520 commit : It is now possible to synthesize md5-hbf from a shell console, without starting the huge and buggy ISE Webpack GUI.
Upcoming in the md5-hbf project : a conditional compilation system which will make it easy to choose the target development board when synthesizing, using a DCM ( Digital Clock Manager ) to cope with the different oscillator frequencies on the different boards.