Parallel processing on FPGA. How to start with? - parallel-processing

I have a computational intensive task which I used CUDA to implement it and now I want to make it even faster with FPGAs (if possible)
The system I want to implement is a series of computations each similar to matrix multiplication in sense of being parallel. It also has some non-parallel parts in between. It works with big amounts of data.
Although I want it as fast as possible, I have enough time to learn and explore with FPGAs.
here I'm asking for suggestions on how I start my path? Which FPGA to choose and where to learn about it. any website or online class or books? I've decided to do this anyway but your idea of whether this will be faster on FPGA or not would be helpful too.

The big wins from an FPGA over using a GPU come from:
Using non-standard word widths optimised to your application. This allows denser logic, which allows more parallel processing blocks
using your knowledge of the required accesses to external RAM to schedule them in hardware more efficiently than a general purpose memory controller can.
The downside is getting data to and from the FPGA. Draw a data-transfer diagram before you start. Even if the FPGA provides infinite speedup, you might still find it's not worth the effort if there's loads of data to be shuffled to and fro!
It's likely you'll be wanting a PCI express based board. Which is (I imagine) a whole new learning-curve before you get to do anything with the FPGA - but if you're up for it, it'll be a very interesting task!
In terms of choosing FPGAs, have a play with the software tools from the various vendors - at the learning stage that's much more important than the chips themselves. You won't find (at this early learning-stage) a show-stopper feature in any of the various chips. Also take into account the availability of boards with your required interfaces on, and any IP-core you might need to do the high-speed interfacing (eg PCIe)

You can get a substantial speedup on most parallel problems with an FPGA.
However, in addition to implementing your computation on the FPGA, there's a lot of work involved in getting the data back and forth from the CPU/main memory. This will require implementation of (for example) a PCI Express endpoint in the FPGA logic (bus mastering for maximum speed) and custom drivers on the software side. Most operating systems will require those drivers to be developed in kernel mode.
And you can't just use the most straightforward approach for FPGA programming either. You're going to need to worry about pipelining and clock synchronization in order to maximize throughput.
In other words, it's a substantially difficult task even for engineers with years of FPGA experience. I strongly suggest you find someone to work with on this. Depending on how proprietary your project is, you might find skilled academics willing to work with you as long as you provide them with all materials and publication rights.
If you're determined to go it alone, you'll need some hardware. Many different companies offer FPGA wired up as accelerators, for example http://www.nallatech.com/pci-express-cards.html
Depending on whether you choose a Xilinx or Altera FPGA, you'll find considerable sample code and tutorials for getting PCI express working.

Related

What is the more common way to build up a robot control structure?

I’m a college student and I’m trying to build an underwater robot with my team.
We plan to use stm32 and RPi. We will put our controller on stm32 and high-level algorithm (like path planning, object detection…) on Rpi. The reason we design it this way is that the controller needs to be calculated fast and high-level algorithms need more overhead.
But later I found out there is tons of package on ROS that support IMU and other attitude sensors. Therefore, I assume many people might build their controller on a board that can run ROS such as RPi.
As far as I know, RPi is slower than stm32 and has less port to connect to sensor and motor which makes me think that Rpi is not a desired place to run a controller.
So I’m wondering if I design it all wrong?
Robot application could vary so much, the suitable structure shall be very much according to use case, so it is difficult to have a standard answer, I just share my thoughts for your reference.
In general, I think Linux SBC(e.g. RPi) + MCU Controller(e.g. stm32/esp32) is a good solution for many use cases. I personally use RPi + ESP32 for a few robot designs, the reason is,
Linux is not a good realtime OS, MCU is good at handling time critical tasks, like motor control, IMU filtering;
Some protection mechnism need to be reliable even when central "brain" hang or whole system running into low voltage;
MCU is cheaper, smaller and flexible to distribute to any parts inside robot, it also helps our modularized design thinking;
Many new MCU is actually powerful enough to handle sophisticated tasks and could offload a lot from the central CPU;

DE1-SoC Board FPGA for evolvable hardware

I would like to reproduce the experiment from Dr. Adrian Thompson, who used genetic algorithm to produce a chip (FPGA) which can distinguish between two different sound signals in a extreme efficient way. For more information please visit this link:
http://archive.bcs.org/bulletin/jan98/leading.htm
After some research I found this FPGA board:
http://www.terasic.com.tw/cgi-bin/page/archive.pl?Language=English&CategoryNo=167&No=836&PartNo=1
Is this board capable of reproducing Dr. Adrian Thompsons experiment or am I in need of another?
Thank you for your support.
In terms of programmable logic, the DE1-SoC is about ~20x bigger, and has ~70x as much embedded memory. Practically any modern FPGA is bigger than the "Xilinx XC6216" cited by his papers, as was linked to you in the other instance of this question you asked.
That said, most modern FPGAs don't allow for the same fine-granularity of configuration, as compared to older FPGAs - the internal routing and block structures are more complex, and FPGA vendors want to protect their products and compel you to use their CAD tools.
In short, yes, the DE1-SoC will be able to contain any design from 12+ years ago. As for replicating the specific functions, you should do some more research to determine if the methods used are still feasible with modern chips and CAD tools.
Edit:
user1155120 elaborated on the features of the XC6216 (see link below) that were of value to Thompson.
Fast Configuration: A larger device will generally take longer to configure, as you have to send more configuration data. That said, I/O interfaces are faster than they were 15 years ago, so it depends on your definition of "fast".
Reconfiguration: Cyclone V chips (like the one in the DE1-SoC) do support partial reconfiguration, but the subscription version of the Quartus II software is required, in addition to a separate license to support PR. I don't believe it supports wildcard reconfiguration, though I could be mistaken.
Memory-Mapped Addressing: The DE1-SoC's internal data can be access through the USB Blaster interface. However, this requires using the SystemConsole on the host PC, so it's not a direct access.

Most effective method to use parallel computing on different architectures

I am planning to write something to take advantages of the many devices that I have at home.
Basically my aim is to use the laptop to execute calculations, and also to use my main desktop computer to add more power (and finish the task quicker). I work with cellular simulation and chemical interactions, so to me would be great to take advantage of all that I have available at home.
I am using mainly OSX, so I need something that may work with that OS. I can code in objective-C, C and C++.
I am aware of GCD, OpenCL and MPI, but I am not sure which way to go.
I was planning to not use the full power of my desktop but only some of the available cores (in this way I can continue to work on the desktop doing other tasks that are not so resource intensive). In particular I would love to use the graphic card power (it is an ATI card, so no CUDA), since all that I do mainly is spreadsheet, word and coding with Xcode, and the graphic card resources are basically unused in that scenario.
Is there a specific set of libraries or API, among the aforementioned 3, that would allow me to selectively route tasks, and use resources on another machine without leaving the control totally to the compiler? I've heard that GCD is great but it has very limited control on where the blocks are executed, while MPI is on the other side of the spectrum....OpenCL seems to be in the middle.
Before diving in one of these technologies I would like to know which one would most likely suit my needs; I am sure that some other researcher has already used successfully parallel computing to achieve what I am trying to achieve.
Thanks in advance.
MPI is more for scientific computing large scale many processors many nodes exc not for a weekend project, for what you describe I would suggest using OpenCl or any one the more distributed framework of AMQP protocol families, such as zeromq or rabbitMQ, or a combination of OpenCl and AMQP , or even simpler consider multithreading , i would suggest OpenMP for that. I'm not sure if you are looking for direct solvers or parallel functions but there are many that exist as well for gpu's and cpu's which you can find on the web
Sorry, but this question simply cannot be meaningfully answered as posed. To be sure, I could toss out a collection of buzzwords describing various technologies to look at like GCD, OpenMPI, OpenCL, CUDA and any number of other technologies which allow one to run a single program on multiple cores, multiple programs on different cooperating computers, or a single program distributed across CPU and GPU, and it sounds like you know about a number of those already so I wouldn't even be adding much value in listing the buzzwords.
To simply toss out such terms without knowing the full specifics of the problem you're trying to solve, however, is a bit like saying that you know English, French and a little German so sure, by all means - mix them all together in a single paragraph without knowing anything about the target audience! Similarly, you can parallelize a given computation in any number of ways, across any number of different processing elements, but whether that parallelization is actually a win or not is going to be entirely dependent on the nature of the algorithm, its data dependencies, how much computation is expected for each reasonable "work chunk", and whether it can be executed on a GPU with sufficient numeric precision, among many other factors. The more complex the technology you choose, the more those factors matter and the greater the possibility that the resulting code will actually be slower than its single-threaded, single machine counterpart. IPC overhead and data copying can, and frequently do, swamp all of the gains one might realize from trying to naively parallelize something and then add additional overhead on top of that, resulting in a net loss. This is why engineers who can do this kind of work meaningfully and well are in such high demand. :)
Without knowing anything about your calculations, I would move in baby steps. First try a simple multi-processor framework like GCD (which is already built in to OS X and requires no additional dependencies to use) and figure out how to factor your code such that it can effectively use all of the available cores on a single machine. Once you've learned where the wins are (and if there even are any - if multi-threading isn't helping, multi-machine parallelization almost certainly won't either), try setting up several instances of the calculation on several machines with a simple IPC model that allows for distributing the work. Having already factored your algorithm(s) for multiple threads, it should be comparatively straight-forward to further generalize the approach across multiple machines (though it bears noting that the two are NOT the same problem and either way you still want to use all the cores available on any of the given target machines, so the two challenges are both complimentary and orthogonal).

Where to learn about low-level, hard-core performance stuffs?

This is actually a 2 part question:
For people who want to squeeze every clock cycle, people talk about pipelines, cache locality, etc.
I have seen these low level performance techniques mentioned here and there but I have not seen a good introduction to the subject, from start to finish. Any resource recommendations? (Google gave me definitions and papers, where I'd really appreciate some kind of worked examples/tutorials real-life hands-on kind of materials)
How does one actually measure this kind of things? Like, as in a profiler of some sort? I know we can always change the code, see the improvement and theorize in retrospect, I am just wondering if there are established tools for the job.
(I know algorithm optimization is where the orders of magnitudes are. I am interested in the metal here)
The chorus of replies is, "Don't optimize prematurely." As you mention, you will get a lot more performance out of a better design than a better loop, and your maintainers will appreciate it, as well.
That said, to answer your question:
Learn assembly. Lots and lots of assembly. Don't MUL by a power of two when you can shift. Learn the weird uses of xor to copy and clear registers. For specific references,
http://www.mark.masmcode.com/ and http://www.agner.org/optimize/
Yes, you need to time your code. On *nix, it can be as easy as time { commands ; } but you'll probably want to use a full-features profiler. GNU gprof is open source http://www.cs.utah.edu/dept/old/texinfo/as/gprof.html
If this really is your thing, go for it, have fun, and remember, lots and lots of bit-level math. And your maintainers will hate you ;)
EDIT/REWRITE:
If it is books you need Michael Abrash did a good job in this area, Zen of Assembly language, a number of magazine articles, big black book of graphics programming, etc. Much of what he was tuning for is no longer a problem, the problems have changed. What you will get out of this is the ideas of the kinds of things that can cause bottle necks and the kinds of ways to solve. Most important is to time everything, and understand how your timing measurements work so that you are not fooling yourself by measuring incorrectly. Time the different solutions and try crazy, weird solutions, you may find an optimization that you were not aware of and didnt realize until you exposed it.
I have only just started reading but See MIPS Run (early/first edition) looks good so far (note that ARM took over MIPS as the leader in the processor market, so the MIPS and RISC hype is a bit dated). There are a number of text books old and new to be had about MIPS. Mips being designed for performance (At the cost of the software engineer in some ways).
The bottlenecks today fall into the categories of the processor itself and the I/O around it and what is connected to that I/O. The insides of the processor chips themselves (for higher end systems) run much faster than the I/O can handle, so you can only tune so far before you have to go off chip and wait forever. Getting off the train, from the train to your destination half a minute faster when the train ride was 3 hours is not necessarily a worthwhile optimization.
It is all about learning the hardware, you can probably stay within the ones and zeros world and not have to get into the actual electronics. But without really knowing the interfaces and internals you really cannot do much performance tuning. You might re-arrange or change a few instructions and get a little boost, but to make something several hundred times faster you need more than that. Learning a lot of different instruction sets (assembly languages) helps get into the processors. I would recommend simulating HDL, for example processors at opencores, to get a feel for how some folks do their designs and getting a solid handle on how to really squeeze clocks out of a task. Processor knowledge is big, memory interfaces are a huge deal and need to be learned, media (flash, hard disks, etc) and displays and graphics, networking, and all the types of interfaces between all of those things. And understanding at the clock level or as close to it as you can get, is what it takes.
Intel and AMD provide optimization manuals for x86 and x86-64.
http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html/
http://developer.amd.com/documentation/guides/pages/default.aspx
Another excellent resource is agner.
http://www.agner.org/optimize/
Some of the key points (in no particular order):
Alignment; memory, loop/function labels/addresses
Cache; non-temporal hints, page and cache misses
Branches; branch prediction and avoiding branching with compare&move op-codes
Vectorization; using SSE and AVX instructions
Op-codes; avoiding slow running op-codes, taking advantage of op-code fusion
Throughput / pipeline; re-ordering or interleaving op-codes to perform separate tasks avoiding partial stales and saturating the processor's ALUs and FPUs
Loop unrolling; performing multiple iterations for a single "loop comparison, branch"
Synchronization; using atomic op-code (or LOCK prefix) to avoid high level synchronization constructs
Yes, measure, and yes, know all those techniques.
Experienced people will tell you "don't optimize prematurely", which I relate as simply "don't guess".
They will also say "use a profiler to find the bottleneck", but I have a problem with that. I hear lots of stories of people using profilers and either liking them a lot or being confused with their output.
SO is full of them.
What I don't hear a lot of is success stories, with speedup factors achieved.
The method I use is very simple, and I've tried to give lots of examples, including this case.
I'd suggest Optimizing subroutines in assembly
language
An optimization guide for x86 platforms.
It's quite heavy stuff though ;)

Digital Circuit understanding

In my quest for getting some basics down before I start going into programming I am looking for essential knowledge about how the computer works down at the core level.
I have a theory that actually understanding what for instance a stackoverflow let alone a stack is, instead of my sporadic knowledge about computer systems, will help me longer term.
Is there any books or sites that take you through how processors are structured and give a holistic overview and that somehow relates to good to know about digital logic?
Am i making sense?
Yes, you should read some topics of
John L. Hennessy & David A. Patterson, "Computer Architecture: A quantitative Approach"
It has microprocessors' history and theory , (starting with RISC archs - MIPS), pipelining, memory, storage, etc.
David Patterson is a Professor of Computer of Computer Science on EECS Department - U. Berkeley. http://www.eecs.berkeley.edu/~pattrsn/
Hope it helps, here's the link
Tanenbaum's Structured Computer Organization is a good book about how computers work. You might find it hard to get through the book, but that's mostly due to the subject, not the author.
However, I'm not sure I would recommend taking this approach. Understanding how the computer works can certainly be useful, but if you don't really have any programming knowledge, you can't really put your knowledge to good use - and you probably don't need that knowledge yet anyway. You would be better off learning about topics like object-oriented programming and data structures to learn about program design, because unless you're looking at doing embedded programming on very limited systems, you'll find those skills far more useful than knowledge of a computer's inner workings.
In my opinion, 20 years ago it was possible to understand the whole spectrum from BASIC all the way through operating system, hardware, down to the transistor or even quantum level. I don't know that it's possible for one person to understand that whole spectrum with today's technology. (Years ago, everyone serviced their own car. Today it's too hard.)
Some of the "layers" that you might be interested in:
http://en.wikipedia.org/wiki/Boolean_logic (this will be helpful for programming)
http://en.wikipedia.org/wiki/Flip-flop_%28electronics%29
http://en.wikipedia.org/wiki/Finite-state_machine
http://en.wikipedia.org/wiki/Static_random_access_memory
http://en.wikipedia.org/wiki/Bus_%28computing%29
http://en.wikipedia.org/wiki/Microprocessor
http://en.wikipedia.org/wiki/Computer_architecture
It's pretty simple really - the cpu loads instructions and executes them, most of those instructions revolve around loading values into registers or memory locations, and then manipulating those values. Certain memory ranges are set aside for communicating with the peripherals that are attached to the machine, such as the screen or hard drive.
Back in the days of Apple ][ and Commodore 64 you could put a value directly in to a memory location and that would directly change a pixel on the screen - those days are long gone, it is abstracted away from you (the programmer) by several layers of code, such as drivers and the operating system.
You can learn about this sort of stuff, or assembly language (which i am a huge fan of), or AND/NAND gates at the hardware level, but knowing this sort of stuff is not going to help you code up a web application in ASP.NET MVC, or write a quick and dirty Python or Powershell script.
There are lots of resources out there sprinkled around the net that will give you insight into how the CPU and the rest of the hardware works, but if you want to get down and dirty i honestly think you should buy one of those older machines off eBay or somewhere, and learn its particular flavour of assembly language (i understand there are also a lot of programmable PIC controllers out there that might also be good to learn on). Picking up an older machine is going to eliminate the software abstractions and make things way easier to learn. You learn way better when you get instant gratification, like making sprites move around a screen or generating sounds by directly toggling the speaker (or using a PIC controller to control a small robot). With those older machines, the schematics for an Apple ][ motherboard fit on to a roughly A2 size sheet of paper that was folded into the back of one of the Apple manuals - i would hate to imagine what they look like these days.
While I agree with the previous answers insofar as it is incredibly difficult to understand the entire process, we can at least break it down into categories, from lowest (closest to electrons) to highest (closest to what you actually see).
Lowest
Solid State Device Physics (How transistors work physically)
Circuit Theory (How transistors are combined to create logic gates)
Digital Logic (How logic gates are put together to create digital functions or digital structures i.e. multiplexers, full adders, etc.)
Hardware Organization (How the data path is laid out in the CPU, the components of a Von Neuman machine -> memory, processor, Arithmetic Logic Unit, fetch/decode/execute)
Microinstructions (Bit level programming)
Assembly (Programming with words, but directly specifying registers and takes forever to program even simple things)
Interpreted/Compiled Languages (Programming languages that get compiled or interpreted to assembly; the operating system may be in one of these)
Operating System (Process scheduling, hardware interfaces, abstracts lower levels)
Higher level languages (these kind of appear twice; it depends on the language. Java is done at a very high level, but C goes straight to assembly, and the C compiler is probably written in C)
User Interfaces/Applications/Gui (Last step, making it look pretty)
You can find out a lot about each of these. I'm only somewhat expert in the digital logic side of things. If you want a thorough tutorial on digital logic from the ground up, go to the electrical engineering menu of my website:
affablyevil.wordpress.com
I'm teaching the class, and adding online lessons as I go.

Resources