I'm trying to accelerate some of my software using an FPGA or a GPU, and I'm a little confused about how to choose between the two. Which areas are suitable for FPGAs and which for GPUs (for example, image processing is suitable for GPUs)? It would also be good to know which areas can be accelerated by more than 20x. I'm more interested in GPUs, as they are cheap and easier to program than FPGAs.
The main difference between an FPGA and a GPU is that today's GPGPU is, in effect, like a CPU. It handles pointers and functions easily, and programming is straightforward because CUDA/OpenCL use a subset of C/C++ (OpenCL, for example, uses C99 plus some special functions).
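To make that concrete, here is a minimal sketch of an OpenCL C kernel (the kernel name and arguments are invented for the example); it is ordinary C99 plus a couple of qualifiers and a built-in:

    /* Plain C99 plus OpenCL's __kernel/__global qualifiers and the
     * get_global_id() built-in: one work-item scales one element. */
    __kernel void scale(__global float *data, float factor)
    {
        size_t i = get_global_id(0);
        data[i] *= factor;
    }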
An FPGA is more hardware oriented. You define the gates and the whole logic yourself, which is faster, and there is parallelism too, but it's achieved by different means.
An FPGA is much better at streaming operations, where it has a constant flow of data (streaming encryption, video decoding, ...) and where reprogramming is infrequent. You can close an FPGA in a box and let it do its job with nothing but input/output cables and a power supply connected.
A GPGPU is always connected through PCI Express, and sending a program to it is routine (games use sets of shaders, i.e. GPU programs, which they switch between quickly), so it is more of a batch-processing device. Today's GPGPUs have large amounts of RAM and many cores/multiprocessors, so they really are more like CPUs than like FPGAs.
There is one thing (maybe more, but I can't remember others) at which an FPGA will be a lot faster: huge amounts of varied bit operations. I don't know of much demand for this (other than (de)cryption), but GPUs are specialized for working with floats and ints (32-bit floating-point and integer values) and are quite slow when you have to do binary magic. By exploiting the FPGA architecture, that binary magic can be done in parallel in a single tick.

On a GPU, you would have to break the work into individual binary operations (AND, OR, XOR, ...) and work out the order in which they must execute.
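To make the contrast concrete, a classic example (mine, not from the answer): reversing the bits of a 32-bit word takes a chain of dependent mask-and-shift steps on a CPU or GPU, while on an FPGA the same permutation is free wiring:

    #include <stdint.h>

    /* Classic mask-and-shift bit reversal: five dependent steps on a
     * CPU or GPU. In FPGA fabric the same permutation is pure routing,
     * costing zero logic and zero extra cycles. */
    uint32_t reverse_bits(uint32_t x)
    {
        x = ((x >> 1)  & 0x55555555u) | ((x & 0x55555555u) << 1);
        x = ((x >> 2)  & 0x33333333u) | ((x & 0x33333333u) << 2);
        x = ((x >> 4)  & 0x0F0F0F0Fu) | ((x & 0x0F0F0F0Fu) << 4);
        x = ((x >> 8)  & 0x00FF00FFu) | ((x & 0x00FF00FFu) << 8);
        x = (x >> 16) | (x << 16);
        return x;
    }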
TL;DR: if you don't have a specific need for an FPGA, choose a GPGPU.
Related
I have extracted how many FLOPs (floating-point operations) each of my algorithms performs.
If I implement these algorithms on an FPGA or on a CPU, can I predict (at least roughly) how much power will be consumed?
A power estimate for either a CPU or an ASIC/FPGA would work for me; I am looking for something like a formula. I have a journal paper for Intel CPUs that gives power consumption per instruction (not only floating-point operations but also all the addressing, control, etc. instructions), so I need something more general that gives power based on FLOPs rather than the instruction count of code on one particular processor.
Re CPU: it's not really possible with modern architectures. Let's assume your program is running on bare metal (i.e., avoiding the complexities of modern OSes, other applications, interrupt processing, optimizing compilers, etc.). A modern processor will run circuitry that isn't being used at a reduced power level. There are also hardware power-conservation states, such as P (performance) and C (sleep) states, that are instruction independent and will vary your power consumption even for the same instruction sequence. Even if we assume your app is CPU-bound (meaning there are no periods long enough to let the processor drop into a hardware power-saving state), we can't predict power usage except at a gross statistical level. Instruction streams are pipelined, executed out of order, fused, etc. And none of this even includes the memory hierarchy.
FPGA: oh, heck. My experience with FPGAs is so old that I really don't want to say from when. All I can say is that way back, when huge monsters roamed the earth, you could estimate power usage because you knew the circuit design and the power consumption of on and off gates. I can't imagine that modern FPGAs don't have modern power-conservation technologies built in. Even so, what little literature I scanned implies that a lot of FPGA power technology is based upon a priori analysis and optimization. See "Design techniques for FPGA power optimization" and "40-nm FPGA Power Management and Advantages". (I only did a quick search and scan of the papers, by the way, so don't pay too much attention to my conclusions.)
This is the Parallella:
http://anycpu.org/forum/viewtopic.php?f=13&t=66
It has 64 cores, 1 GB of RAM, runs Linux, and has Ethernet; everyone is shouting about it...
My question is: from a performance/capability perspective, how does the Parallella compare with more expensive FPGAs? Do the FPGAs just have wider buses, more memory, faster clocks, or more processors on the chip?

I understand GPUs are for massively parallel simple operations and CPUs are better for more complicated single-threaded computation, so where do expensive FPGAs and the Parallella fit on this curve?

The Parallella runs Linux, yet I was always under the impression that FPGAs have their logic flashed onto them by writing Verilog or VHDL?
A partial answer: FPGAs tend not to have ANY processors on the chip (there are exceptions), but if you think about processing as fetching instructions and executing them one after the other, you haven't really grasped FPGAs. If you can see how to execute one complete iteration of your inner loop in a single clock cycle, you're getting there.
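As a toy illustration (my example, not the answerer's): in the dot-product loop below, the multiply, add, and accumulate of the body are exactly what an FPGA designer would wire into a single pipeline stage so that one iteration retires per clock.

    #include <stdint.h>

    /* On a CPU this body is several instructions per iteration; on an
     * FPGA the multiply, add, and accumulate can be wired as a single
     * pipeline stage that retires one iteration per clock. */
    int32_t dot(const int16_t *a, const int16_t *b, int n)
    {
        int32_t acc = 0;
        for (int i = 0; i < n; i++)
            acc += (int32_t)a[i] * b[i];
        return acc;
    }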
There will be tasks where this is easy, and the FPGA can wipe the floor with any other solution. There will be tasks where it is impossible, and the Parallella will be a contender. I don't see any single high-performance solution as an overall winner; there are impressive things being done with GPUs (low power isn't one of them!), and many-core XMOS or Parallella solutions have their place too.
The only Parallellas available now have 16 cores. They carry a Xilinx Zynq 7010 or 7020, which is a dual-core 800 MHz/1 GHz ARM plus an 80k-logic-cell FPGA that is used to communicate with the Parallella chip. I don't know how much of the FPGA is free to play with, though.
If the Parallella has 16 cores, and we assume each core has a hardware multiplier running at 1 GHz, its overall compute capability is roughly comparable to a $200 FPGA, and definitely worse than a $1000 FPGA. However, in most applications math computation is not the main processor's job; it is handled by an ASIC (or an IP core or DSP coprocessor inside the main processor), for example an H.264 codec or WiFi data modulation. For applications with ASIC support, a high-performance processor plus the corresponding ASIC is always the best solution. Only if you want to be unique in some part, for example better image-processing algorithms, would you implement your own signal-processing algorithm, and this is where multi-core DSPs, GPUs, and high-end FPGAs compete.
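To put rough numbers on that comparison (my own back-of-the-envelope, assuming one multiply-accumulate per core per cycle): 16 cores x 1 GHz is about 16 GMAC/s, while a mid-range FPGA with, say, 100 DSP slices clocked at 300 MHz would reach about 30 GMAC/s, and high-end parts with thousands of multipliers go far beyond that.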
I read the OpenCL overview, and it states that it is suitable for code that runs on CPUs, GPGPUs, DSPs, etc. However, from looking through the command reference, it seems to be all math- and image-type operations; I didn't see anything for, say, strings.

This makes me wonder: what would you run on a CPU via OpenCL?

Further, I know OpenCL can be used to perform sorting on GPGPUs. But would one ever use it (or, for that matter, a current GPGPU) to perform string processing such as pattern matching, metaphone extraction, dictionary lookup, or anything else that requires processing arrays of strings?
EDIT
I noticed that Intel's upcoming Ivy Bridge is touted as "OpenCL compliant" with reference to its graphics units. Does this imply that the CPU cores are not OpenCL compliant, or is there no such implication?
EDIT
In the interests of non-debate and constructiveness, I would appreciate it if anyone could point me to official references that would answer my question.
You can think of OpenCL as a combination of a runtime (for device discovery and queueing) and a C-based programming language. The language has native vector types, plus built-in functions and operations for doing all sorts of fun stuff to those vectors. This is nice in that you can write a vectorized kernel in OpenCL, and it is the responsibility of the implementation to map it to the actual vector ISA of your hardware.
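As a sketch of those native vector types (the kernel name and arguments here are invented for illustration):

    /* float4 arithmetic written once; the OpenCL implementation maps it
     * to SSE, NEON, or GPU vector hardware as appropriate. */
    __kernel void saxpy4(__global const float4 *x,
                         __global float4 *y,
                         float a)
    {
        size_t i = get_global_id(0);
        y[i] = a * x[i] + y[i];   /* operates on all four lanes at once */
    }

Written this way once, the same float4 code runs on whichever vector ISA the implementation targets.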
From this 4/2011 article, which might vanish:
There are two major CPU architectures out there, x86 and ARM, both of which should soon run OpenCL code.
If you write an OpenCL application that targets both of these architectures, you won't have to worry about writing two versions, one for SSE and one for NEON. Just write OpenCL C and be done with it. Yes, I know, this assumes the vendor has done his job and written a solid implementation that fully utilizes the underlying ISA. But if he hasn't, complain!
In addition, some CL implementations offer auto-vectorization of scalar kernels, which are usually easier to write. A good auto-vectorizer would give you a solid performance increase for no effort. Since CL kernels are compiled "online," obtaining such a benefit wouldn't require shipping rebuilt code.
No links, but I would assume this is because algorithms that use strings tend to involve a lot of dynamic memory allocation and branching, neither of which GPGPUs are well suited to. GPGPUs also have a lot in common with vector processors, so doing units of work on differently sized blocks of memory (which a string algorithm will generally involve; you usually don't have a homogeneous group of strings) yields poorer performance and is hard to program.

GPUs were designed to do the same work, with little to no branching, on a homogeneous group of data (such as per-vertex or per-pixel operations). Algorithms that can mimic this type of behavior are great on GPUs.
This makes me wonder: what would you run on a CPU via OpenCL?
I prefer to use OpenCL to offload work from the CPU to my graphics hardware. Sometimes my video card is the limiting factor, so I like having a backup kernel for CPU use. Such limitations can be memory size, a memory bottleneck, a low clock speed, or cases where the PCIe bus gets in the way.

I say a separate kernel for the CPU because I think every kernel should be tweaked to run on its target hardware. I even like to have an OpenMP backup plan, as most algorithms I use get tested out that way ahead of time.

I suppose it is also best practice to test a GPU kernel on the CPU to make sure it runs as expected. Then, if a user of your software has OpenCL installed but only a CPU (or a low-end GPU), it's nice to be able to execute the same code on the different devices.
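A minimal host-side sketch of that fallback (error handling trimmed; assuming the first platform is the one you want):

    #include <stdio.h>
    #include <CL/cl.h>

    /* Minimal fallback: ask for a GPU device first, take the CPU if no
     * GPU is available. Error handling is trimmed for brevity. */
    int main(void)
    {
        cl_platform_id platform;
        cl_device_id device;

        clGetPlatformIDs(1, &platform, NULL);        /* first platform only */

        cl_int err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU,
                                    1, &device, NULL);
        if (err != CL_SUCCESS)                       /* no usable GPU */
            err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU,
                                 1, &device, NULL);

        printf(err == CL_SUCCESS ? "got a device\n"
                                 : "no OpenCL device found\n");
        return 0;
    }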
I'm relatively new to the FPGA scene and am looking to get experience with FPGAs and VHDL. I'm not quite sure what the benefit would be over using a standard MCU, but I'm looking for the experience since many companies ask for it.

What would be a good platform to start out on and gain experience with, for not too much money? I've been looking, and all I can find are $200-$300 boards, if not boards costing thousands. What should one look for in an FPGA development board? I hear high-speed peripheral interfaces matter. What I'm really confused about is that an MCU dev board with around 50-100 GPIOs can go for around $100, while the same functionality on an FPGA board is much more expensive! I know you can reprogram an FPGA, but so can an MCU. Should I even fiddle with FPGAs? Will the market keep using them, or are we moving toward MCUs only?
Hmm...I was able to find three evaluation boards under $100 pretty quickly:
$79: http://www.terasic.com.tw/cgi-bin/page/archive.pl?Language=English&No=593
$79: http://www.arrownac.com/solutions/bemicro-sdk/
$89: http://www.xilinx.com/products/boards-and-kits/AES-S6MB-LX9.htm
As for what to look for in an evaluation board, that depends entirely on what you want to do. If you have a specific design task to accomplish, you want a board that supports as many of the same functions and I/O as your final circuit. You can get boards with various memory options (SRAM, DDR2, DDR3, Flash, etc.), Ethernet, PCI/PCIe bus, high-speed optical transceivers, and more. If you just want to get started, just about any board will work for you. Virtually anything sold today should have enough space for even non-trivial example designs (i.e., build your own microcontroller with a soft-core CPU and a design/select-your-own peripheral mix).
Even if your board only has a few switches and LEDs you can get started designing a hardware "Hello World" (a.k.a. the blinking LED :), simple state machines, and many other applications. Where you start and what you try to do should depend on your overall goals. If you're just looking to gain general experience with FPGAs, I suggest:
Start with any of the low-cost evaluation boards
Run through their demo application (typically already programmed into the HW) to get familiar with what it does
Build the demo program from source and verify it works to get familiar with the FPGA tool chain
Modify the demo application in some way to get familiar with designing hardware for FPGAs
Use your new-found experience to determine what to try next
As for the market continuing to use FPGAs, they are definitely here to stay, but that does not mean they are suitable for every application. An MCU by itself is fine for many applications, but it cannot handle everything. For example, you can easily "bit-bang" an I2C bus or even a serial UART with most microcontrollers, but you would be hard-pressed to talk to an Ethernet port, a VGA display, or a PCI/PCIe bus without some custom hardware. It's up to you to decide how to mix the available technology (MCUs, FPGAs, custom logic designed in-house, licensed IP cores, off-the-shelf standard hardware chips, etc.) to create a functional product or device, and there typically isn't any single 'right' answer.
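To illustrate just how little a bit-banged UART demands of an MCU, here is a hedged sketch; gpio_write() and delay_us() are assumed stand-ins for whatever your part's HAL provides, and the pin and baud rate are arbitrary:

    #include <stdint.h>

    #define TX_PIN      5     /* hypothetical GPIO pin number */
    #define BIT_TIME_US 104   /* ~1/9600 s per bit at 9600 baud */

    extern void gpio_write(int pin, int level);  /* assumed HAL function */
    extern void delay_us(uint32_t us);           /* assumed busy-wait */

    /* 8N1 transmit: start bit, eight data bits LSB first, stop bit. */
    void uart_tx_byte(uint8_t b)
    {
        gpio_write(TX_PIN, 0);                   /* start bit (line low) */
        delay_us(BIT_TIME_US);
        for (int i = 0; i < 8; i++) {
            gpio_write(TX_PIN, (b >> i) & 1);    /* data bit */
            delay_us(BIT_TIME_US);
        }
        gpio_write(TX_PIN, 1);                   /* stop bit (line high) */
        delay_us(BIT_TIME_US);
    }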
FPGAs win over microcontrollers if you need some or all of:
Huge amounts of maths to be done (even more than a DSP makes sense for)

Huge amounts of memory bandwidth (often goes hand in hand with the previous point: not much point having lots of maths to do if you have no data to do it on!)

Extremely predictable hard real-time performance: the timing analyser will tell you how fast you can clock your device given the logic you've designed. You can (with a certain, high, statistical likelihood) "guarantee" it will operate at that speed, and therefore you can design logic which you know will always meet certain real-time response times, even when those deadlines are in the nanosecond realm.

If not, then you are likely better off with a micro or a DSP.
The OpenCores web site is an excellent resource, especially the Programming Tools section. The articles link on the site is a good place to start to survey FPGA boards.
The biggest advantage of an FPGA over a microprocessor is architecture. A microprocessor has a fixed set of functional units that solve most problems reasonably well. I've seen computational-efficiency figures for microprocessors from 6% to 15%. In an FPGA, you create functional units specifically for your problem and nothing else, so you can reach 90-100% computational efficiency.
As for the difference in cost, think of sales volume: high-volume microprocessor sales vs. relatively low-volume FPGA sales.
For most of my life I've programmed CPUs, and although for most algorithms the big-O running time remains the same on CPUs and FPGAs, the constants are quite different (for example, lots of CPU power is wasted shuffling data around, whereas FPGA designs are often compute bound).
I would like to learn more about this; does anyone know of good books / reference papers / tutorials that deal with:

what tasks FPGAs dominate CPUs on (in terms of pure speed)

what tasks FPGAs dominate CPUs on (in terms of work per joule)
Note: marked community wiki
[no links, just my musings]
FPGAs are essentially interpreters for hardware!

The architecture is like a dedicated ASIC's, but to get rapid development you pay a factor of ~10 in frequency and a [don't know, at least 10?] factor in power efficiency.
So take any task where dedicated HW can massively outperform CPUs, divide by the FPGA's 10/[?] factors, and you'll probably still have a winner. Typical qualities of such tasks:

Massive opportunities for fine-grained parallelism. (Doing 4 operations at once doesn't count; 128 does.)

Opportunity for deep pipelining. This is also a kind of parallelism, but it's hard to apply to a single task, so it helps if you can get many separate tasks to work on in parallel.

(Mostly) fixed data-flow paths. Some muxes are OK, but massive random accesses are bad, because you can't parallelize them. But see below about memories.

High total bandwidth to many small memories. FPGAs have hundreds of small (O(1 KB)) internal memories (BlockRAMs in Xilinx parlance), so if you can partition your memory usage into many independent buffers, you can enjoy a data bandwidth that CPUs never dreamed of.

Small external bandwidth (compared to internal work). The ideal FPGA task has small inputs and outputs but requires a lot of internal work; this way your FPGA won't starve waiting for I/O. (CPUs already suffer from starving, and they alleviate it with very sophisticated (and big) caches, unmatchable in FPGAs.) It's perfectly possible to connect a huge I/O bandwidth to an FPGA (~1000 pins nowadays, some with high-rate SERDESes), but doing that requires a custom board architected for such bandwidth; in most scenarios, your external I/O will be a bottleneck.

Simple enough for HW (a.k.a. good SW/HW partitioning). Many tasks consist of 90% irregular glue logic and only 10% hard work (the "kernel" in the DSP sense). If you put all of that onto an FPGA, you'll waste precious area on logic that does no work most of the time. Ideally, you want all the muck to be handled in SW and the HW fully utilized for the kernel. ("Soft-core" CPUs inside FPGAs are a popular way to pack lots of slow irregular logic into medium area, if you can't offload it to a real CPU.)

Weird bit manipulations are a plus; see the sketch after this list. Things that don't map well onto traditional CPU instruction sets, such as unaligned access to packed bits, hash functions, coding & compression... However, don't overestimate the factor this gives you: most data formats and algorithms you'll meet have already been designed to go easy on CPU instruction sets, and CPUs keep adding specialized instructions for multimedia.

Lots of floating point, specifically, is a minus, because both CPUs and GPUs crunch floats on extremely optimized dedicated silicon. (So-called "DSP" FPGAs also have lots of dedicated mul/add units, but AFAIK these only do integers?)

Low-latency / real-time requirements are a plus. Hardware can really shine under such demands.
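To illustrate the packed-bits point, a toy sketch of mine (the LSB-first bit order and the padding assumption are my own choices): extracting an arbitrary bit field from a byte stream costs a CPU loads, shifts, and masks, while in FPGA fabric the same extraction is just a choice of which wires to connect.

    #include <stddef.h>
    #include <stdint.h>

    /* Extract `len` bits (len <= 24) starting at arbitrary bit offset
     * `pos` from an LSB-first packed buffer. A CPU pays loads, shifts,
     * and masks; in an FPGA this is merely routing. Assumes `buf` has
     * at least 3 readable bytes of padding past the field. */
    uint32_t get_bits(const uint8_t *buf, size_t pos, unsigned len)
    {
        size_t byte    = pos >> 3;   /* first byte containing the field */
        unsigned shift = pos & 7;    /* bit offset within that byte */

        uint32_t window = (uint32_t)buf[byte]
                        | (uint32_t)buf[byte + 1] << 8
                        | (uint32_t)buf[byte + 2] << 16
                        | (uint32_t)buf[byte + 3] << 24;

        return (window >> shift) & ((1u << len) - 1);
    }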
EDIT: Several of these conditions, especially fixed data flows and many separate tasks to work on, also enable bit slicing on CPUs, which somewhat levels the field.
Well, the newest generation of Xilinx parts, just announced, boasts 4.7 TMACs and general-purpose logic at 600 MHz. (These are basically Virtex-6s fabbed on a smaller process.)

On a beast like this, if you can implement your algorithms in fixed-point operations, primarily multiplies, adds, and subtracts, and take advantage of both wide parallelism and pipelined parallelism, you can eat most PCs alive in terms of both power and processing.

You can do floating point on these, but there will be a performance hit. The DSP blocks contain a 25x18-bit MACC with a 48-bit accumulator. If you can get away with oddball formats and bypass some of the floating-point normalization that normally occurs, you can still eke a truckload of performance out of them (e.g., use the 18-bit input as straight fixed point, or as a float with a 17-bit mantissa instead of the normal 24 bits). Double-precision floats are going to eat a lot of resources, so if you need those, you will probably do better on a PC.
If your algorithms can be expressed in terms of add and subtract operations, then the general-purpose logic in these parts can be used to implement a gazillion adders. Things like Bresenham's line/circle/yadda/yadda/yadda algorithms are VERY good fits for FPGA designs.
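For reference, a hedged C sketch of the kind of algorithm meant (the plot() routine is assumed, not defined here); note that the loop body is nothing but integer adds, subtracts, and compares:

    #include <stdlib.h>

    extern void plot(int x, int y);   /* assumed pixel-write routine */

    /* Bresenham's line algorithm: nothing but integer adds, subtracts,
     * and compares, exactly the operation mix that maps onto plain FPGA
     * adder logic. */
    void bresenham_line(int x0, int y0, int x1, int y1)
    {
        int dx =  abs(x1 - x0), sx = x0 < x1 ? 1 : -1;
        int dy = -abs(y1 - y0), sy = y0 < y1 ? 1 : -1;
        int err = dx + dy;

        for (;;) {
            plot(x0, y0);
            if (x0 == x1 && y0 == y1)
                break;
            int e2 = 2 * err;
            if (e2 >= dy) { err += dy; x0 += sx; }
            if (e2 <= dx) { err += dx; y0 += sy; }
        }
    }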
IF you need division... eh... it's painful, and probably going to be relatively slow unless you can implement your divides as multiplies.
If you need lots of high-precision trig functions, not so much... Again, it CAN be done, but it's not going to be pretty or fast. (Just as it can be done on a 6502.) If you can cope with just using a lookup table over a limited range, then you're golden!
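A sketch of the lookup-table approach (the 256-entry size and Q1.15 format are my own choices):

    #include <math.h>
    #include <stdint.h>

    /* 256-entry sine table in Q1.15 fixed point. On an FPGA the same
     * table would live in a BlockRAM (or even LUT-RAM) as a precomputed
     * ROM and cost one clock cycle per read. */
    static int16_t sine_lut[256];

    void sine_lut_init(void)   /* on hardware this step happens offline */
    {
        for (int i = 0; i < 256; i++)
            sine_lut[i] = (int16_t)lround(32767.0 *
                                          sin(6.283185307179586 * i / 256.0));
    }

    /* phase 0..255 maps to 0..2*pi; returns sin(phase) in Q1.15. */
    int16_t sine_q15(uint8_t phase)
    {
        return sine_lut[phase];
    }

On the FPGA the init step disappears: the table contents are simply baked into the block RAM image at build time.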
Speaking of the 6502, a 6502 demo coder could make one of these things sing. Anybody familiar with all the old math tricks that programmers used on old-school machines like that will find they still apply. All the tricks that modern programmers are told to "let the library do for you" are exactly the things you need to know to implement maths on these. If you can find a book that talks about doing 3D on a 68000-based Atari or Amiga, it will discuss a lot of how to implement things in integer only.
ACTUALLY, any algorithm that can be implemented using lookup tables will be VERY well suited to FPGAs. Not only do you have BlockRAMs distributed throughout the part, but the logic cells themselves can be configured as various-sized LUTs and mini-RAMs.

You can view fixed bit manipulations as FREE! They're simply handled by routing: fixed shifts and bit reversals cost nothing. Dynamic bit operations, like a shift by a variable amount, cost a minimal amount of logic and can be done till the cows come home!

The biggest part has 3,960 multipliers! And 142,200 slices, EACH of which can be an 8-bit adder (four 6-bit LUTs per slice, or eight 5-bit LUTs per slice, depending on configuration).
Pick a gnarly SW algorithm. Our company does HW acceleration of SW algorithms for a living.
We've done HW implementations of regular-expression engines that run thousands of rule sets in parallel at speeds up to 10 Gb/s. The target market for that is routers, where anti-virus and IPS/IDS can run in real time as the data streams by, without slowing the router down.

We've done HD video encoding in HW. It used to take several hours of processing per second of film to convert it to HD; now we can do it almost in real time: it takes almost 2 seconds of processing to convert 1 second of film. Netflix used our HW almost exclusively for their video-on-demand product.

We've even done simple stuff like RSA, 3DES, and AES encryption and decryption in HW, and simple zip/unzip. The target market for that is security video cameras. The government has a massive number of video cameras generating huge streams of real-time data. They zip it down in real time before sending it over their network, and then unzip it in real time on the other end.

Heck, another company I worked for used to build radar receivers with FPGAs. They would sample the digitized enemy radar data directly from several different antennas and, from the time delta of arrival, figure out what direction and how far away the enemy transmitter is. Heck, we could even check the unintended modulation on pulse of the signals in the FPGAs to fingerprint specific transmitters, so we could know that a given signal was coming from a specific Russian SAM site that used to be stationed at a different border, letting us track weapons movements and sales.
Try doing that in software!! :-)
For pure speed:
- Parallelizable tasks
- DSP, e.g. video filters
- Moving data, e.g. DMA