I'm very interested in learning FPGA development. I've found a bunch of "getting started with FPGA" questions here, and other tutorials and resources on the internet. But I'm primarily interested in using FPGAs as accelerators, and I can't figure out which devices will actually offer a speedup over a desktop CPU (say, a recent i7).
My particular interest at the moment is cellular automata (and other parallel environments like neural networks and agent-based modeling). I'd like to experiment with 3D or higher-dimensional cellular automata. My question is: will the low-cost $100-$200 starter kits provide something that has the potential to produce a significant speedup over a desktop CPU? Or would I need to spend more and get a higher-end FPGA?
The FPGA can be a very good accelerator, but (and this is a big BUT) it is usually very expensive. We have machines here like a BEEcube, a Convey, or Dini's Godzilla's Part-Time Nanny, and they are all very expensive (>$10k), and even with these machines many applications can be better accelerated with a standard CPU cluster or with GPUs. The FPGA looks a bit better when the total cost of ownership is considered, since it usually offers better energy efficiency.
But there are applications that you can accelerate. At the lower end of the scale, you can and should make a rough estimate of whether it is worth it for your application, but for that you need more concrete numbers about the application. Consider a standard desktop CPU: it usually has at least 4 cores (or two with hyperthreading, not to mention the vector units) and clocks at, say, 3 GHz. That gives 12 Gcycles per second of compute power. With the cheap FPGAs you can get to about 250 MHz (better ones can reach up to 500 MHz, but that requires very friendly designs and very good speed grades), so you need roughly 50 operations in parallel to compete with the CPU (in reality the FPGA comes off a bit better because the CPU usually doesn't have single-cycle operations, but the CPU also has vector operations, so call it even).
Fifty operations sounds like a lot, and it is hard, but it is doable (the magic word here is pipelining). So you should know exactly how you are going to implement your design in hardware and what degree of parallelism you can exploit.
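To make that back-of-the-envelope estimate concrete, here is a minimal sketch; the core count, clock rates and FPGA frequency are just the example figures from above, not measurements:

```python
# Rough estimate: how many operations per cycle an FPGA design must perform
# to match a desktop CPU, using the example figures from the text.

cpu_cores = 4            # assumed desktop CPU core count
cpu_clock_hz = 3.0e9     # assumed 3 GHz clock
fpga_clock_hz = 250e6    # typical clock for a design on a cheap FPGA

cpu_cycles_per_sec = cpu_cores * cpu_clock_hz            # ~12 Gcycles/s
ops_needed_per_cycle = cpu_cycles_per_sec / fpga_clock_hz

print(f"CPU cycle budget: {cpu_cycles_per_sec / 1e9:.0f} Gcycles/s")
print(f"FPGA operations per cycle needed to keep up: {ops_needed_per_cycle:.0f}")
# -> about 48, i.e. roughly the "50 operations in parallel" mentioned above
```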
Even if you solve the parallelism problem, we now come to the real problem: memory.
The accelerators mentioned above have so much compute capacity that they could do thousands of things in parallel, but the real problem with such computation power is getting the data into and out of them. And you have this problem at your small scale too. In your desktop PC the CPU transfers more than 20 GB/s to and from memory (good GPU cards manage 100 GB/s and more), while your small $100-$200 accelerator gets at most (if you are lucky) 1-2 GB/s over PCI Express.
Whether it is worth it for you depends completely on your application, and here you need far more detail than "3D cellular automata": you must know the neighbourhoods, the required precision (double, single float, integer, or fixed point?), and your use case (do you transfer the initial cell values, let the machine compute for two days, and then transfer the cell values back, or do you need the cell values after every step? The latter makes a huge difference in the bandwidth required during computation).
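For a feel of the numbers, here is a hedged back-of-the-envelope sketch for a hypothetical 3D automaton; the grid size, cell width and update rate are invented purely for illustration:

```python
# Back-of-envelope bandwidth check for a hypothetical 3D cellular automaton
# whose cell values must leave the accelerator after every step.

grid = 256                 # hypothetical 256^3 grid
cells = grid ** 3
bytes_per_cell = 4         # e.g. a 32-bit integer or single-precision float
steps_per_second = 100     # hypothetical rate at which you want results back

bytes_per_step = cells * bytes_per_cell
bandwidth_bytes_per_s = bytes_per_step * steps_per_second

print(f"{bytes_per_step / 2**20:.0f} MiB per step, "
      f"{bandwidth_bytes_per_s / 2**30:.2f} GiB/s if every step is read back")
# ~64 MiB per step and ~6 GiB/s: already well beyond the 1-2 GB/s of a cheap
# PCIe card, but irrelevant if the cells stay on the accelerator for two days.
```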
But overall, without knowing more, I would say: it is worth the $100-$200.
Not because you will compute your cellular automata faster (which I don't believe), but because you will learn. And you will not only learn to design hardware and to develop for FPGAs; with the students we have here, I always see that, along with hardware design knowledge, they gain a far better understanding of what hardware actually looks like and how it behaves. Sure, nothing you do on your FPGA is directly related to the interior of a CPU, but many get a better feeling for what hardware in general is capable of, which in turn makes them even more effective software developers.
But I also have to admit: you are going to pay a much higher price than just the $100-$200: you will have to spend a great deal of time on it.
Disclaimer: I work for a reconfigurable system developer/manufacturer.
A short answer to your question "will the low-cost $100-$200 starter kits provide something that has the potential to produce a significant speedup over a desktop CPU" is: probably not.
A longer answer:
A microprocessor is a set of fixed, shared functional units tuned to perform reasonably well across a broad range of applications. The operating system and compilers do a good job of making sure that these fixed, shared functional units are utilized appropriately.
FPGA based systems get their performance from dedicated, dense, computational efficiency. You create exactly what you need to execute your application, no more, no less - and whatever you create is not shared with any other user, process, operating system, whatever. If you need 80 floating point units, you create 80 dedicated floating point units that run in parallel. Compare that to a microprocessor scheduling floating point operations across some smaller number of floating point units. To get performance faster than a microprocessor, you have to instantiate enough dedicated FPGA-based functional units to make a performance difference vs. a microprocessor. This often requires the resources in the larger FPGA devices.
An FPGA alone is not enough. If you create a large number of efficient computational engines in an FPGA, you -have- to keep those engines fed with data. This requires a number of high-bandwidth connections to large amounts of data memory around the FPGA. What you often see with I/O-based FPGA cards is that some of the potential performance gain is lost moving data back and forth across the I/O bus.
As a data point, my company uses the '530 Stratix IV FPGA from Altera. We surround it with several directly coupled memories and tie this subsystem directly into the microprocessor memory. We get several advantages over microprocessor systems for many applications, but this is not a $100-$200 starter kit, this is a full-blown integrated system.
Let's take a trivial CPU-bound program, such as brute-forcing prime numbers, which perhaps occasionally saves them to an SD card.
Inefficiencies in today's programs include interpretation, virtual machines, etc. So in the interest of speed, let's throw those away and use a compiled language.
Now that we have code that can run directly on the processor, we still have the operating system, which will multiplex between different processes, run its own code, manage memory, and do other things that slow down the execution of our program.
If we were to write our own operating system which solely runs our program, what factor of speedup could we expect to see?
I'm sure there might be a number of variables, so please elaborate if you want.
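For reference, a minimal sketch of the kind of trivial CPU-bound workload the question has in mind (brute-force primality testing by trial division, occasionally flushed to storage); it is written in Python only for brevity, and the batch size and output path are arbitrary:

```python
# Minimal sketch of the kind of CPU-bound workload described above:
# brute-force primality testing by trial division, occasionally saving results.

def is_prime(n: int) -> bool:
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

def brute_force(limit: int, flush_every: int = 10_000, path: str = "primes.txt") -> None:
    batch = []
    for n in range(2, limit):
        if is_prime(n):
            batch.append(n)
        if len(batch) >= flush_every:          # the occasional save to storage
            with open(path, "a") as f:
                f.write("\n".join(map(str, batch)) + "\n")
            batch.clear()
    if batch:
        with open(path, "a") as f:
            f.write("\n".join(map(str, batch)) + "\n")
    # nearly all of the time is spent in the arithmetic loop, not in the OS

if __name__ == "__main__":
    brute_force(1_000_000)
```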
Take a look at products by Return Infinity http://www.returninfinity.com/ (I'm not affiliated in any way), and experiment.
My own supercomputing experience demonstrates that skipping the TLB (almost entirely) by running a flat memory model, combined with the lack of context switching between kernel and userland, can and does accelerate some tasks - especially those related to message passing in networking (at the MAC level, not even TCP - why bother), as well as brute-force computation (due to the lack of memory management).
On brute-force computation that exceeds the TLB or cache size, you can expect approximately a 5-15% performance gain compared to having to do RAM-based translation table lookups - the penalty is that each software error is entirely unguarded (you can lock some pages statically with monolithic linking, though).
On high-bandwidth work, especially with a lot of small message passing, you can easily obtain even 500% acceleration by going kernel-space, either by completely removing the (multi-tasking) OS or by loading your application as a kernel driver, circumventing the entire abstraction as well. We've been able to push the network latency of MAC-layer pings from 18 µs down to 1.3 µs.
On computation that does fit inside L1 cache, I'd expect minimal improvement (around 1%).
Does it all matter? Yes and no. If your hardware costs vastly exceed your engineering costs, and you have done all the algorithmic improvements you can think of (better yet, proved that the computation done is exactly the computation required for the result!), this can give meaningful performance benefits. An extra 3% (overall average success) on a supercomputer costing approximately $8M/y in electricity, not including hardware amortization, is worth $24k/y. Enough to pay an engineer for a month to optimize the most common task it runs :).
Assuming you're running a decent machine and the OS is not doing much else: Not a large factor, I'd expect less than a 10% improvement.
Just the OS 'idling' doesn't (shouldn't) take up much of the CPU's processing power. If it does, you need a better machine, a better OS, a reformat, or some combination of these.
If, on the other hand, you're running a bunch of other resource-intensive things, obviously expect that this can be sped up a lot by just not running those other things.
If you're not a super-user, you may be surprised to find that there are a ton of (non-OS) processes running in the background; these are more likely to take up CPU processing power than the OS.
Slightly off topic but related, keep in mind that, if you're running 8 cores, you can, in a perfect world, speed up the process by 8x by multi-threading.
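As a minimal illustration of that per-core scaling, assuming the work really is independent (the prime-counting chunks below are just a stand-in workload):

```python
# Sketch: spreading an embarrassingly parallel workload across cores.
# In the ideal case, 8 worker processes give close to an 8x speedup.

from multiprocessing import Pool

def count_primes(bounds):
    lo, hi = bounds
    def is_prime(n):
        return n > 1 and all(n % d for d in range(2, int(n ** 0.5) + 1))
    return sum(is_prime(n) for n in range(lo, hi))

if __name__ == "__main__":
    chunks = [(i * 125_000, (i + 1) * 125_000) for i in range(8)]
    with Pool(processes=8) as pool:          # one worker per core
        total = sum(pool.map(count_primes, chunks))
    print(total)
    # real speedup depends on load balance and per-task overhead, of course
```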
Expect a way bigger improvement from known solutions to problems and making better use of data structures and algorithms, and, to a lesser extent, the choice of language and micro-optimizations.
From my experience:
Not the most scientific or trustworthy result, but most of the time when I open up Task Manager on Windows, all the OS processes are below 1% of the CPU.
There is a super-computer answer, and a multi-cores answer already, so here is the GPGPU answer.
When a super-computer is overkill, but a multi-core CPU is under-powered, and your algorithm is sensibly parallelizable, consider adapting it to a GPGPU. Many of the benefits of a super-computer solution are available, in reduced form at reduced cost, by performing CPU-intensive tasks on a GPGPU.
Here is a link to an analysis I performed last year on implementing, and tuning, a brute-force solution to the Travelling Salesman Problem using a compute capability 2.0 NVIDIA Graphics card, CUDAfy, and C#.
What is the performance of a modern FPGA relative to a CPU, in absolute terms (GFLOPS/GIOPS), and what is the cost of one billion integer operations per second on an FPGA?
And for which tasks is it now beneficial to use an FPGA?
I only found this:
http://www.hpcwire.com/hpcwire/2010-11-22/the_expanding_floating-point_performance_gap_between_fpgas_and_microprocessors.html
And an old article:
http://www.mouldy.org/fpgas-in-cryptanalysis.pdf
Disclaimer: I work for SRC Computers, a heterogeneous CPU/FPGA system manufacturer.
"It depends", of course, is the answer.
A microprocessor is a fixed set of functional units. These perform reasonably well across a broad range of applications.
An FPGA is programmed by a designer with a specific set of functional units designed solely to execute a specific application. As such, it (often) performs very well for a given application.
"How much is the performance of modern FPGA relative to CPU, absolutely in (GFlops/GIops)" becomes a meaningless question. It can be answered for the the microprocessor as it has a fixed set of floating point units. However, for an FPGA, the question evolves into 1) how large if the FPGA, 2) how many floating point units can I pack into it and still do useful work, what is the memory/support architecture around the FPGA and 4) what are the sustained system bandwidths between the FPGA, its memory and the rest of the system?
The answer to "what is the cost of one billion integer operations per second on the FPGA" is similarly addressed by the preceding paragraph.
An interesting thing to keep in mind around performance is that in an FPGA, peak performance equals sustained performance since the FPGA is dedicated to executing a given application. As long as other system parameters do not interfere, of course.
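As a hedged illustration of why the question is only answerable per design, here is a trivial sketch that turns "how many units can I pack, and at what clock" into a peak figure; the unit counts and clock rates are placeholders, not numbers for any particular device:

```python
# Peak throughput of an FPGA design is simply (units you managed to fit) x (clock),
# so the answer depends on the design, not just the device.

def peak_gflops(fp_units: int, clock_hz: float, ops_per_unit_per_cycle: int = 1) -> float:
    return fp_units * clock_hz * ops_per_unit_per_cycle / 1e9

# Placeholder numbers, for illustration only:
print(peak_gflops(fp_units=80, clock_hz=250e6))    # 20.0 "peak" GFLOPS
print(peak_gflops(fp_units=400, clock_hz=300e6))   # 120.0
# Sustained performance then hinges on whether memory bandwidth can feed the units.
```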
Your question "And in which tasks now beneficial to use FPGA?" is a very broad question and grows with every large FPGA device release. In extremely broad non-exclusive terms, parallel and streaming applications benefit, although the application performance is to a large extent determined by the system architecture.
We are measuring how fast an algorithm runs on an FPGA vs. a normal quad-core x86 computer.
On the x86 we run the algorithm many times and take the median, in order to eliminate OS overhead; this also "cleans" the curve of errors.
That's not the problem.
The FPGA algorithm is measured in cycles, which are then converted to time; with the FSMD it is trivial to count cycles anyway...
We think that counting cycles is too "pure" a measure: it could be done theoretically, without making a real measurement or running the algorithm on the real FPGA.
I want to know whether there is a paper, or some idea, for doing a real time measurement.
If you are trying to establish that your FPGA implementation is competitive or superior, and therefore might be useful in the real world, then I encourage you to compare **wall clock times** on the multiprocessor vs. the FPGA implementation. That will also help ensure you do not overlook performance effects beyond the FSM + datapath (such as I/O delays).
I agree that reporting cycle counts only is not representative because the FPGA cycle time can be 10X that of off the shelf commodity microprocessors.
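To make that concrete, here is a minimal sketch of the conversion and comparison being argued for; the cycle count, clock rate and CPU timing are placeholders:

```python
# Convert an FPGA cycle count into wall-clock time and compare it with a
# measured CPU wall-clock time; cycle counts alone hide the clock difference.

def fpga_wall_clock_s(cycles: int, clock_hz: float) -> float:
    return cycles / clock_hz

# Placeholder numbers for illustration:
fpga_time = fpga_wall_clock_s(cycles=2_000_000, clock_hz=100e6)   # 20 ms
cpu_time = 0.015   # e.g. median of many timed runs on the quad-core x86, in seconds

print(f"FPGA: {fpga_time * 1e3:.1f} ms, CPU: {cpu_time * 1e3:.1f} ms, "
      f"speedup x{cpu_time / fpga_time:.2f}")
# Remember to add PCIe/I/O transfer time to the FPGA figure for a fair comparison.
```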
Now for some additional unsolicited advice. I have been to numerous FCCM conferences, and similar, and I have heard many dozens of FPGA implementation vs. CPU implementation performance comparison papers. All too often, a paper compares a custom FPGA implementation that took months, vs. a CPU+software implementation wherein the engineer just took the benchmark source code off the shelf, compiled it, and ran it in one afternoon. I do not find such presentations particularly compelling.
A fair comparison would evaluate a software implementation that uses best practices, best available libraries (e.g. Intel MKL or IPP), that used multithreading across multiple cores, that used vector SIMD (e.g. SSE, AVX, ...) instead of scalar computation, that used tools like profilers to eliminate easily fixed waste and like Vtune to understand and tune the cache+memory hierarchy. Also please be sure to report the actual amount of engineering time spent on the FPGA vs. the software implementations.
More free advice: In these energy focused times where results/joule may trump results/second, consider also reporting the energy efficiency of your implementations.
More free advice: to get the most repeatable times on the "quad x86", be sure to quiesce the machine, shut down background processes, daemons, services, etc., and disconnect the network.
Happy hacking!
I'm designing a system that will be online in 2016 and run on commodity 1U or 2U server boxes. I'd like to understand how parallel the software will need to be, so I'd like to estimate the number of cores per physical machine. I'm not interested in more exotic hardware like video game console processors, GPUs or DSPs. I could extrapolate based on when chips were issued by Intel or AMD, but this historical information seems scarce.
Thanks.
I found the following charts from Design for Manycore Systems:
As the great computer scientist Yogi Berra said, "It's tough to make predictions, especially about the future.". Given the relative recency of multicore systems, I think you're right to be wary of extrapolations. Still, you need a number to aim for.
M. Spinelli's graphs are very valuable, and (I think) have the benefit of being based on real plans out to 2014. Other than that, if you want a simple, easily calculable and defensible number, I'd take as a starting point the number of cores in current (say) 2U systems at your price point (high-range systems: 24-32 cores at $15k; mid-range: 12-16 cores at $8k; lower-end: 8-12 cores at $5k). Then note that Moore's law suggests 8-16x as many transistors per unit silicon in 2016 as now, and that on current trends these mainly go into more cores. That suggests 64-512 cores per node depending on how much you're spending on each - and these numbers are consistent with the graphs Matt Spinelli posted above.
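The arithmetic behind that estimate, as a small hedged sketch (the doubling period and starting core counts are the assumptions stated above):

```python
# Extrapolate core counts to 2016, assuming the transistor budget doubles
# roughly every 2 years and the extra transistors mostly become more cores.

def projected_cores(cores_now: int, years: float, doubling_period: float = 2.0) -> int:
    return round(cores_now * 2 ** (years / doubling_period))

years_ahead = 2016 - 2010            # roughly "now" when this answer was written
for cores_now in (8, 16, 32):        # low / mid / high-end 2U boxes today
    print(cores_now, "->", projected_cores(cores_now, years_ahead))
# 8 -> 64, 16 -> 128, 32 -> 256 cores per node, consistent with the
# 64-512 range quoted above (the upper end assumes a 16x transistor budget).
```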
Cores per physical machine doesn't seem to be a particularly good metric, I think. We haven't really seen that number grow in particularly non-linear ways, and many-core hardware has been available COTS since the 90's (though it was relatively specialized at that point). If your task is really that parallel, quadrupling the number of cores shouldn't change it that much. We've always had the option of faster-but-fewer-cores, which should still be available to you in 6 years if you find that you don't scale well with the current number of cores.
If your application is really embarrassingly parallel, why are you unwilling to consider GPU solutions?
How quickly do you plan to rotate the hardware? Leave old machines till they die, or replace them proactively as they start to slow the cluster down? How many machines are we talking about? What kind of interconnect technology are you considering? For many cluster applications that is the limiting factor.
The Dr. Dobb's article above is not a bad analysis, but I think it misses the point just a tad. It's going to be a significant while before many mainstream apps can take advantage of really parallel general compute hardware (and many tasks simply can't be parallelized much), and when they do, they'll be using graphics cards and (to a lesser extent) sound cards as the specialized hardware they use to do it.
For most of my life, I've programmed CPUs; and although for most algorithms, the big-Oh running time remains the same on CPUs / FPGAs, the constants are quite different (for example, lots of CPU power is wasted shuffling data around; whereas for FPGAs it's often compute bound).
I would like to learn more about this -- anyone know of good books / reference papers / tutorials that deals with the issue of:
what tasks do FPGAs dominate CPUs on (in terms of pure speed)
what tasks do FPGAs dominate CPUs on (in terms of work per joule)
Note: marked community wiki
[no links, just my musings]
FPGAs are essentially interpreters for hardware!
The architecture is like a dedicated ASIC's, but you get rapid development, and in exchange you pay a factor of ~10 in frequency and a [don't know, at least 10?] factor in power efficiency.
So take any task where dedicated HW can massively outperform CPUs, divide by the FPGA 10/[?] factors, and you'll probably still have a winner. Typical qualities of such tasks:
- Massive opportunities for fine-grained parallelism. (Doing 4 operations at once doesn't count; 128 does.)
- Opportunity for deep pipelining. This is also a kind of parallelism, but it's hard to apply to a single task, so it helps if you can get many separate tasks to work on in parallel.
- (Mostly) fixed data flow paths. Some muxes are OK, but massive random accesses are bad, because you can't parallelize them. But see below about memories.
- High total bandwidth to many small memories. FPGAs have hundreds of small (O(1 KB)) internal memories (BlockRAMs in Xilinx parlance), so if you can partition your memory usage into many independent buffers, you can enjoy a data bandwidth that CPUs never dreamed of.
- Small external bandwidth (compared to internal work). The ideal FPGA task has small inputs and outputs but requires a lot of internal work. That way your FPGA won't starve waiting for I/O. (CPUs already suffer from starving, and they alleviate it with very sophisticated (and big) caches, unmatchable in FPGAs.) It's perfectly possible to connect a huge I/O bandwidth to an FPGA (~1000 pins nowadays, some with high-rate SERDESes), but doing that requires a custom board architected for such bandwidth; in most scenarios your external I/O will be a bottleneck.
- Simple enough for HW (aka good SW/HW partitioning). Many tasks consist of 90% irregular glue logic and only 10% hard work (the "kernel" in the DSP sense). If you put all of that onto an FPGA, you'll waste precious area on logic that does no work most of the time. Ideally, you want all the muck to be handled in SW and the HW fully utilized for the kernel. ("Soft-core" CPUs inside FPGAs are a popular way to pack lots of slow irregular logic onto medium area, if you can't offload it to a real CPU.)
- Weird bit manipulations are a plus. Things that don't map well onto traditional CPU instruction sets, such as unaligned access to packed bits, hash functions, coding & compression... However, don't overestimate the factor this gives you - most data formats and algorithms you'll meet have already been designed to go easy on CPU instruction sets, and CPUs keep adding specialized instructions for multimedia. Lots of floating point, specifically, is a minus, because both CPUs and GPUs crunch it on extremely optimized dedicated silicon. (So-called "DSP" FPGAs also have lots of dedicated mul/add units, but AFAIK these only do integers?)
- Low latency / real-time requirements are a plus. Hardware can really shine under such demands.

EDIT: Several of these conditions - especially fixed data flows and many separate tasks to work on - also enable bit slicing on CPUs, which somewhat levels the field.
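As a minimal sketch of what bit slicing on a CPU looks like, assuming a 1D elementary cellular automaton (Rule 110) where one 64-bit word holds 64 cells, so each bitwise operation updates all of them at once; the rule and word width are only illustrative:

```python
# Bit-sliced Rule 110: one 64-bit word holds 64 cells, so every bitwise
# operation below updates all 64 cells at once - the CPU analogue of the
# fine-grained parallelism an FPGA gets for free.

WIDTH = 64
MASK = (1 << WIDTH) - 1

def rotl(x: int, n: int) -> int:
    """Rotate x left by n bits within a WIDTH-bit word."""
    return ((x << n) | (x >> (WIDTH - n))) & MASK

def step_rule110(cells: int) -> int:
    left = rotl(cells, WIDTH - 1)    # each cell's left neighbour (wrap-around)
    right = rotl(cells, 1)           # each cell's right neighbour
    return ((cells ^ right) | (cells & ~left)) & MASK

state = 1                            # a single live cell
for _ in range(8):
    print(format(state, f"0{WIDTH}b").replace("0", ".").replace("1", "#"))
    state = step_rule110(state)
```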
Well, the newest generation of Xilinx parts, just announced, boasts 4.7 TMACs and general-purpose logic at 600 MHz. (These are basically Virtex-6s fabbed on a smaller process.)
On a beast like this, if you can implement your algorithms as fixed-point operations, primarily multiplies, adds and subtracts, and take advantage of both wide parallelism and pipelined parallelism, you can eat most PCs alive, in terms of both power and processing.
You can do floating point on these, but there will be a performance hit. The DSP blocks contain a 25x18-bit MACC with a 48-bit sum. If you can get away with oddball formats and bypass some of the floating-point normalization that normally occurs, you can still eke out a truckload of performance from these (e.g. use the 18-bit input as straight fixed point, or as a float with a 17-bit mantissa instead of the normal 24-bit one). Double-precision floats are going to eat a lot of resources, so if you need that, you will probably do better on a PC.
If your algorithms can be expressed in terms of add and subtract operations, then the general-purpose logic in these can be used to implement a gazillion adders. Things like Bresenham's line/circle/yadda/yadda/yadda algorithms are VERY good fits for FPGA designs.
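For reference, a minimal software sketch of the kind of inner loop meant here; the point is that every operation in Bresenham's line algorithm is an integer add, subtract or comparison, which maps directly onto FPGA adders:

```python
# Bresenham's line algorithm: nothing but integer adds, subtracts and
# comparisons in the inner loop - exactly what FPGA fabric is good at.

def bresenham_line(x0, y0, x1, y1):
    dx, dy = abs(x1 - x0), -abs(y1 - y0)
    sx = 1 if x0 < x1 else -1
    sy = 1 if y0 < y1 else -1
    err = dx + dy                      # running error term, integer only
    points = []
    while True:
        points.append((x0, y0))
        if x0 == x1 and y0 == y1:
            break
        e2 = 2 * err
        if e2 >= dy:
            err += dy
            x0 += sx
        if e2 <= dx:
            err += dx
            y0 += sy
    return points

print(bresenham_line(0, 0, 6, 3))
# [(0, 0), (1, 1), (2, 1), (3, 2), (4, 2), (5, 3), (6, 3)]
```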
IF you need division... EH... it's painful, and probably going to be relatively slow unless you can implement your divides as multiplies.
If you need lots of high-precision trig functions, not so much... Again, it CAN be done, but it's not going to be pretty or fast. (Just like it can be done on a 6502.) If you can cope with just using a lookup table over a limited range, then you're golden!
Speaking of the 6502, a 6502 demo coder could make one of these things sing. Anybody who is familiar with all the old math tricks that programmers used to use on old-school machines like that will find they still apply. All the tricks that modern programmers are told to "let the library do for you" are the kinds of things you need to know to implement maths on these. If you can find a book that talks about doing 3D on a 68000-based Atari or Amiga, it will discuss a lot of how to implement things in integer only.
ACTUALLY, any algorithm that can be implemented using lookup tables will be VERY well suited to FPGAs. Not only do you have block RAMs distributed throughout the part, but the logic cells themselves can be configured as various-sized LUTs and mini-RAMs.
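A minimal sketch of the lookup-table idea: precompute sine over one period and just index into the table at run time; the table size and fixed-point scaling are arbitrary choices:

```python
# Table-driven sine: precompute once, then every evaluation is just an index
# operation - exactly the kind of thing FPGA block RAMs and LUTs do well.

import math

TABLE_BITS = 10
TABLE_SIZE = 1 << TABLE_BITS                    # 1024 entries for one full turn
SCALE = 1 << 15                                 # fixed-point scaling factor

SIN_TABLE = [round(math.sin(2 * math.pi * i / TABLE_SIZE) * SCALE)
             for i in range(TABLE_SIZE)]

def sin_lut(angle_turns: float) -> float:
    """Sine of a non-negative angle given in turns (1.0 == 360 degrees)."""
    index = int(angle_turns * TABLE_SIZE) & (TABLE_SIZE - 1)   # wrap around
    return SIN_TABLE[index] / SCALE

print(sin_lut(0.125), math.sin(math.pi / 4))    # ~0.707 either way
```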
You can view things like fixed bit manipulations as FREE! They're simply handled by routing. Fixed shifts or bit reversals cost nothing. Dynamic bit operations, like a shift by a variable amount, will cost a minimal amount of logic and can be done till the cows come home!
The biggest part has 3,960 multipliers! And 142,200 slices, EACH of which can be an 8-bit adder (4 6-bit LUTs per slice or 8 5-bit LUTs per slice, depending on configuration).
Pick a gnarly SW algorithm. Our company does HW acceleration of SW algo's for a living.
We've done HW implementations of regular expression engines that will do 1000's of rule-sets in parallel at speeds up to 10Gb/sec. The target market for that is routers where anti-virus and ips/ids can run real-time as the data is streaming by without it slowing down the router.
We've done HD video encoding in HW. It used to take several hours of processing time per second of film to convert it to HD. Now we can do it almost in real time... it takes almost 2 seconds of processing to convert 1 second of film. Netflix used our HW almost exclusively for their video-on-demand product.
We've even done simple stuff like RSA, 3DES, and AES encryption and decryption in HW. We've done simple zip/unzip in HW. The target market for that is for security video cameras. The government has some massive amount of video cameras generating huge streams of real-time data. They zip it down in real-time before sending it over their network, and then unzip it in real-time on the other end.
Heck, another company I worked for used to do radar receivers using FPGAs. They would sample the digitized enemy radar data directly from several different antennas, and from the time delta of arrival figure out what direction and how far away the enemy transmitter is. Heck, we could even check the unintended modulation on pulse of the signals in the FPGAs to figure out the fingerprint of specific transmitters, so we could know that this signal was coming from a specific Russian SAM site that used to be stationed at a different border, and so track weapons movements and sales.
Try doing that in software!! :-)
For pure speed:
- Parallelizable ones
- DSP, e.g. video filters
- Moving data, e.g. DMA