Xilinx FPGA resource estimation

I am trying to understand how to estimate FPGA resource requirement for a design/application.
Let's say a Spartan-7 part has:
Logic Cells - 52160
DSP Slices - 120
Memory - 2700
How do I find out the number of CLBs, RAM, and flash available?
Let's say my design needs an SPI interface in the FPGA.
How do I estimate the CLB, RAM and flash requirements for this design?
Thanks

Estimation of a block of logic can be done in a couple of ways. One method is to actually pen out the logic on paper and look at what registers you are planning on creating. Then you need to look at the part you are working with. In this case the Spartan-7 CLB configuration (from the Xilinx UG474 7 Series document, pg 17) is: 2 slices per CLB, 8 LUTs and 16 flip-flops per CLB, and up to 256 bits of distributed RAM plus 128 bits of shift register in SLICEM CLBs.
So now you can see the quantity of flops and memory per CLB. Once you look at the registers in the code and count up the memory in the design, you can figure out the number of CLBs. You can generally share memory and flops in a single CLB without issue; however, if you have multiple memories, quantization takes over. Two separate memories generally can't occupy the same CLB. There are other quantization effects as well. Memories come in perfect binary sizes, so if you build a 33-bit-wide memory with 128K locations, you will really absorb 64 x 128K bits of memory, where 31 bits x 128K are unused and untouchable for other uses.
The second method of estimating size is more experience based, as practiced by larger FPGA teams, where previous designs are looked at and engineers make basic comparisons of logic to identify previous blocks that are similar to what you are designing next. You might argue that an I2C interface isn't 100% like a SPI interface, but they are similar enough that you could say 125% of the I2C would be a good estimate of a SPI with some margin for error. You then just throw that number into a spreadsheet along with estimates for the 100 other modules that are in the design and you call that the rough estimate.
If the estimate needs a second pass to make it more accurate, then you should throw a little code together and validate that it is functional enough to NOT be optimizing flops, gates and memory away, and then use that to shore up the estimate. This is tougher because optimization (read: the dropping of unused flops) can happen all too easily, so you need to be certain that flops and gates are twiddle-able enough to not let them be interpreted as unused, or always 1, or always 0.
To figure out the number of CLBs you can use the CLB configuration above. Take the number of flops and divide by 16 (for the 7 Series devices) and this will give you the flop-based CLB number. Take the memory bits, and divide each memory by 256 (again for 7 Series devices) and you will get the total CLBs based on memory. At that point just take the larger of the two CLB counts and that will be your CLB estimate.
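To make that arithmetic concrete, here is a minimal C sketch of the same back-of-the-envelope calculation (the flop count and memory sizes are made-up placeholders, not real numbers for an SPI block):

/* 7-series rule of thumb from above: 16 flip-flops per CLB, 256 bits of
   distributed RAM per (SLICEM) CLB, each separate memory rounded up on its
   own because two memories generally won't share a CLB. */
#include <stdio.h>

static unsigned ceil_div(unsigned a, unsigned b) { return (a + b - 1) / b; }

int main(void) {
    unsigned flops = 120;                 /* placeholder: registers counted from the design */
    unsigned mem_bits[] = { 256, 1024 };  /* placeholder: each separate memory, in bits */
    unsigned num_mems = sizeof mem_bits / sizeof mem_bits[0];

    unsigned clb_flops = ceil_div(flops, 16);
    unsigned clb_mem = 0;
    for (unsigned i = 0; i < num_mems; ++i)
        clb_mem += ceil_div(mem_bits[i], 256);

    unsigned estimate = clb_flops > clb_mem ? clb_flops : clb_mem;
    printf("flop-based: %u CLBs, memory-based: %u CLBs, estimate: %u CLBs\n",
           clb_flops, clb_mem, estimate);
    return 0;
}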


Are bytes real?

I know that this question may sound stupid, but let me just explain. So...
Everyone knows that a byte is 8 bits. Simple, right? But where exactly is it specified? I mean, physically you don't really use bytes, but bits. For example, drives. As I understand, it's just a reaaaaly long string of ones and zeros and NOT bytes. Sure, there are sectors, but, as far as I know, they are programmed at the software level (at least in SSDs, I think). Also RAM, which is again - a long stream of ones and zeros. Another example is the CPU. It doesn't process 8 bits at a time, but only one.
So where exactly is it specified? Or is it just a general rule, which everyone follows? If so, could I make a system (either an operating system or even something at a lower level) that would use, let's say, 9 bits in a byte? Or wouldn't I have to? Also - why can't you use less than a byte of memory? Or maybe you can? For example: is it possible for two applications to use the same byte (e.g. the first one uses 4 bits and the second one uses the other 4)? And last, but not least - do computer drives really use bytes? Or is it that, for example, bits 1-8 belong to something, next to them there are some 3 random bits and bits 12-20 belong to something different?
I know that there are a lot of questions here, and knowing the answers to them doesn't change anything, but I was just wondering.
EDIT: OK, I might not have expressed myself clearly enough. I know that a byte is just a concept (well, even a bit is just a concept that we make real). I'm NOT asking why there are 8 bits in a byte and why bytes exist as a term. What I'm asking is where in a computer a byte is defined, or if it even is defined. If bytes really are defined somewhere, at what level (hardware level, OS level, programming language level or just application level)? I'm also asking if computers even care about bytes (in that concept that we've made real), whether they use bytes consistently (like, in between two bytes, can there be some 3 random bits?).
Yes, they’re real insofar as they have a definition and a standardised use/understanding. The Wikipedia article for byte says:
The modern de-facto standard of eight bits, as documented in ISO/IEC 2382-1:1993, is a convenient power of two permitting the values 0 through 255 for one byte (2 to the power of 8 = 256, where zero signifies a number as well).[7] The international standard IEC 80000-13 codified this common meaning. Many types of applications use information representable in eight or fewer bits and processor designers optimize for this common usage. The popularity of major commercial computing architectures has aided in the ubiquitous acceptance of the eight-bit size.[8] Modern architectures typically use 32- or 64-bit words, built of four or eight bytes.
The full article is probably worth reading. No one set out a stall 50+ years ago, banged a fist on the desk and said ‘a byte shalt be 8 bits’, but it became that way over time, with popular microprocessors being able to carry out operations on 8 bits at a time. Subsequent processor architectures carry out ops on multiples of this. While I’m sure Intel could make their next chip a 100-bit capable one, I think the next bitness revolution we’ll encounter will be 128-bit.
Everyone knows that a byte is 8 bits?
These days, yes
But where exactly is it specified?
See above for the ISO code
I mean, physically you don't really use bytes, but bits.
Physically we don’t use bits either, but a threshold of detectable magnetic field strength on a rust coated sheet of aluminium, or an amount of electrical charge storage
As I understand, it's just a reaaaaly long string of ones and zeros and NOT bytes.
True, everything to a computer is a really long stream of 0 and 1. What is important in defining anything else is where to stop counting this group of 0 or 1, and start counting the next group, and what you call the group. A byte is a group of 8 bits. We group things for convenience. It’s a lot more inconvenient to carry 24 tins of beer home than a single box containing 24 tins
Sure, there are sectors, but, as far as I know, they are programmed at the software level (at least in SSDs, I think)
Sectors and bytes are analogous in that they represent a grouping of something, but they aren’t necessarily directly related in the way that bits and bytes are because sectors are a level of grouping on top of bytes. Over time the meaning of a sector as a segment of a track (a reference to a platter number and a distance from the centre of the platter) has changed as the march of progress has done away with positional addressing and later even rotational storage. In computing you’ll typically find that there is a base level that is hard to use, so someone builds a level of abstraction on top of it, and that becomes the new “hard to use”, so it’s abstracted again, and again.
Also RAM, which is again - a long stream of ones and zeros
Yes, and is consequently hard to use, so it’s abstracted, and abstracted again. Your program doesn’t concern itself with raising the charge level of some capacitive area of a memory chip, it uses the abstractions it has access to, and that abstraction fiddles the next level down, and so on until the magic happens at the bottom of the hierarchy. Where you stop on this downward journey is largely a question of definition and arbitrary choice. I don’t usually consider my RAM chips as something like ice cube trays full of electrons, or the subatomic quanta, but I could, I suppose. We normally stop when it ceases to be useful for solving the problem.
Another example is the CPU. It doesn't process 8 bits at a time, but only one.
That largely depends on your definition of ‘at a time’ - most of this question is about the definitions of various things. If we arbitrarily decide that ‘at a time’ is the unit block of the multiple picoseconds it takes the CPU to complete a single cycle, then yes, a CPU can operate on multiple bits of information at once - it’s the whole idea of having a multiple-bit CPU that can add two 32-bit numbers together and not forget bits. If you want to slice time up so precisely that we can determine that enough charge has flowed to here but not there, then you could say which bit the CPU is operating on right at this pico (or smaller) second, but it’s not useful to go so fine-grained because nothing will happen until the end of the time slice the CPU is waiting for.
Suffice to say, when we divide time just enough to observe a single CPU cycle from start to finish, we can say the CPU is operating on more than one bit.
If you write at one letter per second, and I close my eyes for 2 out of every 3 seconds, I’ll see you write a whole 3-letter word “at the same time” - you write “the cat sat on the mat” and to the observer, you generated each word simultaneously.
CPUs run cycles for similar reasons, they operate on the flow and buildup of electrical charge and you have to wait a certain amount of time for the charge to build up so that it triggers the next set of logical gates to open/close and direct the charge elsewhere. Faster CPUs are basically more sensitive circuitry; the rate of flow of charge is relatively constant, it’s the time you’re prepared to wait for input to flow from here to there, for that bucket to fill with just enough charge, that shortens with increasing MHz. Once enough charge has accumulated, bump! Something happens, and multiple things are processed “at the same time”
So where exactly is it specified? Or is it just a general rule, which everyone follows?
It was the general rule, then it was specified to make sure it carried on being the general rule
If so, could I make a system (either an operating system or even something at a lower level) that would use, let's say, 9 bits in a byte? Or wouldn't I have to?
You could, but you’d essentially have to write an adaptation (abstraction) of an existing processor architecture, and you’d use nine 8-bit bytes to achieve your presentation of eight 9-bit bytes. You’re creating an abstraction on top of an abstraction, and the boundaries of the basic building blocks don’t align. You’d have a lot of work to do to see the system out to completion, and you wouldn’t bother.
In the real world, if ice cube trays made 8 cubes at a time but you thought the optimal number for a person to have in the freezer was 9, you’d buy 9 trays, freeze them and make 72 cubes, then divvy them up into 8 bags, and sell them that way. If someone turned up with 9 cubes worth of water (it melted), you’d have to split it over 2 trays, freeze it, give it back.. this constant adaptation between your industry provided 8 slot trays and your desire to process 9 cubes is the adaptive abstraction
If you do do it, maybe call it a nyte? :)
Also - why can't you use less than a byte of memory? Or maybe you can?
You can, you just have to work with the limitation that the existing abstraction is 8 bits. If you have 8 Boolean values to store you can code things up so you flip bits of the byte on and off, so even though you’re stuck with your 8-cube ice tray you can selectively fill and empty each cube. If your program only ever needs 7 Booleans, you might have to accept the wastage of the other bit. Or maybe you’ll use it in combination with a regular 32-bit int to keep track of a 33-bit integer value. A lot of work though, writing an adaptation that knows to progress onto the 33rd bit rather than just throw an overflow error when you try to add 1 to 4,294,967,295. Memory is plentiful enough that you’d waste the bit, and waste another 31 bits using a 64-bit integer to hold your 4,294,967,296 value.
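A minimal C sketch of that bit twiddling (the flag positions are arbitrary, purely for illustration):

/* Pack 8 booleans into one byte and flip them individually. */
#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint8_t flags = 0;

    flags |= (1u << 3);               /* set boolean #3 */
    flags ^= (1u << 0);               /* toggle boolean #0 */
    flags &= (uint8_t)~(1u << 3);     /* clear boolean #3 again */
    int bit0 = (flags >> 0) & 1u;     /* test boolean #0 */

    printf("flags = 0x%02X, boolean #0 = %d\n", flags, bit0);
    return 0;
}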
Generally, resources are so plentiful these days that we don’t care about wasting a few bits. It isn’t always so, of course: take credit card terminals sending data over slow lines. Every bit counts for speed, so the ancient protocols for information interchange with the bank might well use different bits of the same byte to code up multiple things.
For example: is it possible for two applications to use the same byte (e.g. the first one uses 4 bits and the second one uses the other 4)?
No, because hardware and OS memory management these days keep programs separate for security and stability. In the olden days though, one program could write to another program’s memory (it’s how we cheated at games: see the lives counter go down, just overwrite a new value), so in those days, if two programs could behave, and one would only write to the 4 high bits and the other the 4 low bits, then yes, they could have shared a byte. Access would probably be a whole byte at a time though, so each program would have to read the whole byte, change only its own bits of it, then write the entire result back.
And last, but not least - do computer drives really use bytes? Or is it that, for example, bits 1-8 belong to something, next to them there are some 3 random bits and bits 12-20 belong to something different?
Probably not, but you’ll never know, because you don’t get to peek at that level of abstraction enough to see the disk laid out as a sequence of bits and know where the byte boundaries are, or sector boundaries, and whether this logical sector follows that logical sector, or whether a defect in the disk surface means the sectors don’t follow on from each other. You don’t typically care though, because you treat the drive as a contiguous array of bytes (etc.) and let its controller worry about where the bits are.

FPGA verilog code upload speed and size limit

I have two questions about FPGAs.
1. I would like to know how large an FPGA chip would need to be if I create a full CPU with a pipeline.
Any calculation method or paper that describes how I can calculate the chip size?
2. If I upload fairly reasonable functions (or modules) to the FPGA after compilation, how long would it actually take to write the logic to the FPGA? That is, excluding the compilation time and counting just the upload time.
This would depend on the particular CPU, but to give you an idea a Xilinx MicroBlaze Soft Processor Core takes up around 1000 logic cells, maybe up to around 6000 logic cells with peripherals. A high end FPGA like the Xilinx Zynq-7100 has 444K logic cells.
Configuring an FPGA is very quick; the Z-7100 takes about 1-2 minutes to program.

How to select the most powerful OpenCL device?

My computer has both an Intel GPU and an NVIDIA GPU. The latter is much more powerful and is my preferred device when performing heavy tasks. I need a way to programmatically determine which one of the devices to use.
I'm aware of the fact that it is hard to know which device is best suited for a particular task. What I need is to (programmatically) make a qualified guess using the variables listed below.
How would you rank these two devices? Intel HD Graphics 4400 to the left, GeForce GT 750M to the right.
GlobalMemoryCacheLineSize 64 vs 128
GlobalMemoryCacheSize 2097152 vs 32768
GlobalMemorySize 1837105152 vs 4294967296
HostUnifiedMemory true vs false
Image2DMaxHeight 16384 vs 32768
Image2DMaxWidth 16384 vs 32768
Image3DMaxDepth 2048 vs 4096
Image3DMaxHeight 2048 vs 4096
Image3DMaxWidth 2048 vs 4096
LocalMemorySize 65536 vs 49152
MaxClockFrequency 400 vs 1085
MaxComputeUnits 20 vs 2
MaxConstantArguments 8 vs 9
MaxMemoryAllocationSize 459276288 vs 1073741824
MaxParameterSize 1024 vs 4352
MaxReadImageArguments 128 vs 256
MaxSamplers 16 vs 32
MaxWorkGroupSize 512 vs 1024
MaxWorkItemSizes [512, 512, 512] vs [1024, 1024, 64]
MaxWriteImageArguments 8 vs 16
MemoryBaseAddressAlignment 1024 vs 4096
OpenCLCVersion 1.2 vs 1.1
ProfilingTimerResolution 80 vs 1000
VendorId 32902 vs 4318
Obviously, there are hundreds of other devices to consider. I need a general formula!
You cannot have a simple formula to calculate an index from those parameters.
Explanation
First of all, let me assume you can trust the collected data; of course, if you read 2 for MaxComputeUnits but in reality it's 80, then there is nothing you can do (unless you have your own database of cards with all their specifications).
How can you guess if you do not know the task you have to perform? It may be something highly parallel (then more units may be better) or a raw brute-force calculation (then a higher clock frequency or a bigger cache may be better). As with a normal CPU, the number of threads isn't the only factor you have to consider for parallel tasks. Just to mention a few things you have to consider:
Cache: how much local data does each task work with?
Memory: is it shared with the CPU? How many concurrent accesses compared to parallel tasks?
Instruction set: do you need something specific that increases speed even if other parameters aren't so good?
Misc stuff: do you have some specific requirement, for example a size of something that must be supported, where a fallback method makes everything terribly slow?
To make it short: you cannot calculate an index in a reliable way because there are too many factors and they're strongly correlated (for example, high parallelism may be slowed down by a small cache or slow memory access, but a specific instruction, if supported, may give you great performance even if all other parameters are poor).
One Possible Solution
If you need a raw comparison you may even simply do MaxComputeUnits * MaxClockFrequency (and it may even be enough for many applications), but if you need a more accurate index then don't think it'll be an easy task or that you'll get a general-purpose formula like (a + b / 2)^2: it won't be, and the results will be very specific to the task you have to accomplish.
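For what it's worth, the raw comparison above looks like this as plain OpenCL host code (a sketch, and remember the caveat that compute units aren't comparable across vendors, so treat the score as a crude heuristic rather than a benchmark):

/* Pick the GPU with the highest MaxComputeUnits * MaxClockFrequency score. */
#include <CL/cl.h>
#include <stdio.h>

int main(void) {
    cl_platform_id platforms[8];
    cl_uint num_platforms = 0;
    clGetPlatformIDs(8, platforms, &num_platforms);

    cl_device_id best = NULL;
    unsigned long best_score = 0;

    for (cl_uint p = 0; p < num_platforms; ++p) {
        cl_device_id devices[8];
        cl_uint num_devices = 0;
        if (clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_GPU,
                           8, devices, &num_devices) != CL_SUCCESS)
            continue;
        for (cl_uint d = 0; d < num_devices; ++d) {
            cl_uint units = 0, mhz = 0;
            clGetDeviceInfo(devices[d], CL_DEVICE_MAX_COMPUTE_UNITS,
                            sizeof(units), &units, NULL);
            clGetDeviceInfo(devices[d], CL_DEVICE_MAX_CLOCK_FREQUENCY,
                            sizeof(mhz), &mhz, NULL);
            unsigned long score = (unsigned long)units * mhz;
            if (score > best_score) { best_score = score; best = devices[d]; }
        }
    }

    if (best != NULL) {
        char name[256] = {0};
        clGetDeviceInfo(best, CL_DEVICE_NAME, sizeof(name), name, NULL);
        printf("picked %s (score %lu)\n", name, best_score);
    }
    return 0;
}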
Write a small test (as similar as possible to your real task; take a look at this post on SO) and run it with many cards; with big enough statistics you may extrapolate an index from an unknown set of parameters. The algorithms can become pretty complex and there is a vast literature about this topic, so I won't even try to repeat it here. I would start with the Wikipedia article as a summary pointing to other, more specific papers. If you need an example of what you have to do, you may read Exploring the Multiple-GPU Design Space.
Remember that the more variables you add to your study, the more unstable your results will be; the fewer parameters you use, the less accurate your results will be. To better support extrapolation:
After you have collected enough data you should first select and reduce the variables, with some pre-analysis, to a subset of them including only what most influences your benchmark results (for example MaxGroupSize may not be so relevant). This phase is really important and decisions should be made with statistical tools (you may, for example, calculate the p-value).
Some parameters may have great variability (memory size, number of units) but analysis is easier with fewer values (for example [0..5) units, [5..10) units, [10..*) units). You should then partition the data (watching its distribution). Different partitions may lead to very different results, so you should try different combinations.
There are many other things to consider; a good book about data mining would help you more than 1000 words written here.
As @Adriano has pointed out, there are many things to take into consideration...too many things.
But I can think of a few things (easier things that could be done) to help you out (not to completely solve your problem):
OCL Version
First things first: which version of OCL do you need (not really related to performance)? But if you need some feature of OCL 1.2... well, problem solved.
Memory or computation bound
You can usually (and crudely) categorize your algorithms into one of these two categories: memory bound or computation bound. If it's memory bound (with a lot of transfers between host and device), probably the most interesting info would be the device with HostUnifiedMemory set to true. If not, the most powerful processors would most probably be more interesting.
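A tiny helper for that check might look like the following (my own sketch; CL_DEVICE_HOST_UNIFIED_MEMORY is the OpenCL 1.x query behind the HostUnifiedMemory property listed in the question, and it would slot into a device-selection loop like the one sketched earlier):

/* Returns non-zero if the device shares memory with the host
   (the hint suggested above for memory-bound workloads). */
#include <CL/cl.h>

int has_unified_memory(cl_device_id dev) {
    cl_bool unified = CL_FALSE;
    clGetDeviceInfo(dev, CL_DEVICE_HOST_UNIFIED_MEMORY,
                    sizeof(unified), &unified, NULL);
    return unified == CL_TRUE;
}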
Rough benchmark
But most probably it won't be that easy to choose which category your application falls into.
In that case you could make a small benchmark. Roughly, this benchmark would test different sizes of data (if your app has to deal with that) on dummy computations which more or less match the amount of computation your application requires (estimated by you after you have completed the development of your kernels). You could log the point where the amount of data is so big that it cancels out the advantage of the more powerful device that is connected via PCIe.
GPU Occupancy
Another very important thing when programming on GPUs is GPU occupancy. The higher, the better. NVIDIA provides an Excel file that calculates the occupancy based on some inputs. Based on these concepts, you could more or less reproduce the calculation of the occupancy (some adjustments will most probably be needed for other vendors) for both GPUs and choose the one with the highest.
Of course, you need to know the values of these inputs. Some of them are based on your code, so you can calculate them beforehand. Some of them are linked to the specs of the GPU. You can query some of them as you already did; for some others you might need to hardcode the values in some files after some googling (but at least you don't need to have these GPUs at hand to test on them). Last but not least, don't forget that OCL provides clGetKernelWorkGroupInfo(), which can give you some info such as the amount of local or private memory needed by a specific kernel.
Regarding the info about the local memory please note that remark from the standard:
If the local memory size, for any pointer argument to the kernel
declared with the __local address qualifier, is not specified, its
size is assumed to be 0.
So it means that this info could be useless if you first have to dynamically compute the size from the host side. A workaround for that could be to use the fact that the kernels are compiled JIT. The idea here would be to use the preprocessor option -D when calling clBuildProgram(), as I explained here. This would give you something like:
// SIZE is defined at build time via the -D option passed to clBuildProgram()
__kernel void mykernel(__global float *args)
{
    __local float myLocalMem[SIZE];
    ....
}
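On the host side, the size computed at run time gets baked in through the build options, along these lines (a sketch; program and device are assumed to come from the usual clCreateProgramWithSource() / clGetDeviceIDs() setup):

#include <CL/cl.h>
#include <stdio.h>

cl_int build_with_size(cl_program program, cl_device_id device, size_t size)
{
    char options[64];
    snprintf(options, sizeof options, "-DSIZE=%zu", size);  /* becomes #define SIZE <n> in the kernel */
    cl_int err = clBuildProgram(program, 1, &device, options, NULL, NULL);
    if (err != CL_SUCCESS)
        fprintf(stderr, "clBuildProgram failed: %d\n", err);
    return err;
}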
And what if the easiest option was:
After all the blabla, I'm guessing that you worry about this because you might want to ship your application to some users without knowing what hardware they have. Would it be very inconvenient (at install time, or maybe afterwards by providing them a command or a button) to simply run your application with dummy generated data to measure which device performs better and simply log it in a config file?
Or maybe:
Sometimes, depending on your specific problem (one that doesn't involve too many syncs), you don't have to choose. Sometimes you could just split the work between the two devices and use both...
Why guess? Choose dynamically on your hardware of the day: Take the code you wish to run on the "best" GPU and run it, on a small amount of sample data, on each available GPU. Whichever finishes first: use it for the rest of your calculations.
I'm loving all of the solutions so far. If it is important to make the best device selection automatically, that's how to do it (weight the values based on your usage needs and take the highest score).
Alternatively, and much simpler, is to just take the first GPU device, but also have a way for the user to see the list of compatible devices and change it (either right away or on the next run).
This alternative is reasonable because most systems only have one GPU.

bit-wise matrix transposition in VHDL using blockram

I've been trying to figure out a nice way of transposing a large amount of data in VHDL using a block ram (or similar).
Using a vector of vectors it's relatively easy, but it gets icky for large amounts of data.
I want to use a dual-channel block RAM so that I can write to the one block and read out the other: write in 8-bit std_logic_vectors, read out 32-bit std_logic_vectors, where the 32 bits are (for the first rotation at least) the MSBs of input vectors 0 - 31, then 32 - 63, all the way to 294911; then MSB-1, etc.
The case described above is my ideal scenario. Is this even possible? I can't seem to find a nice way of doing this.
In this answer, I'm assuming 18kbit Xilinx-style BRAMs. I'm most familiar with the Virtex-4, so I'm referring to UG070 for the following. The answer will map trivially to other Xilinx FPGAs, and probably other vendors' parts as well.
The first thing to note is that Virtex-4 BRAMs can be used in dual-port mode with different geometries for each port. However, with 1-bit-wide ports, these RAMs' parity bits aren't useful. Thus, an 18kbit BRAM is only effectively 16kbits here.
Next, consider the number of BRAMs you need just for storage (irrespective of your design.) 294912x8 bits maps to 144 BRAMs, which is a pretty large resource commitment. There's always a balance between throughput, design complexity, and resource requirements; if you need to squeeze every MBit/sec out of the design, maybe a BRAM-based approach is ideal. If not, you should consider whether your throughput and latency requirements allow you to use off-chip RAM instead of BRAM.
If you do plan on using BRAMs, then you should consider an array of 18x8 BRAMs. The 8 columns each store a single input bit. Each BRAM is written via a 1-bit port, and read out with a 32-bit port.
Every 8-bit write is mapped to eight one-bit writes (one write to a BRAM in each column.)
Every 32-bit read is mapped to a single 32-bit read from a single BRAM.
You should need very little sequencing logic (around a RAMB16 primitive) to get this working. The major complexity is how to map address bits between the 1-bit port and the 32-bit port.
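To make that address mapping concrete, here is a small C model of the scheme described above: 8 columns of storage, column b holding bit b of every input byte, seen through a 1-bit write address and a 32-bit read address. The depth is shrunk to a toy size; a real 18 kbit BRAM would give 16384 one-bit locations (512 32-bit words) per column, and the real design strings 18 of them per column.

/* Software model of the 1-bit-write / 32-bit-read transpose addressing. */
#include <stdio.h>
#include <stdint.h>

#define DEPTH 64   /* toy number of 8-bit input words */

static uint8_t column[8][DEPTH];   /* column[b][n] models 1-bit location n of BRAM column b */

/* 8-bit write port: one 1-bit write into each of the 8 columns */
static void write8(unsigned n, uint8_t data)
{
    for (unsigned b = 0; b < 8; ++b)
        column[b][n] = (data >> b) & 1u;
}

/* 32-bit read port of column b: word r gathers 1-bit locations 32r..32r+31,
   i.e. bit b of input words 32r..32r+31, which is one row of the transpose */
static uint32_t read32(unsigned b, unsigned r)
{
    uint32_t word = 0;
    for (unsigned i = 0; i < 32; ++i)
        word |= (uint32_t)column[b][32 * r + i] << i;
    return word;
}

int main(void) {
    for (unsigned n = 0; n < DEPTH; ++n)
        write8(n, (uint8_t)n);                          /* fill with a recognizable pattern */
    printf("bit 5 of words 0..31:  0x%08X\n", read32(5, 0));   /* prints 0x00000000 */
    printf("bit 5 of words 32..63: 0x%08X\n", read32(5, 1));   /* prints 0xFFFFFFFF */
    return 0;
}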
After a bit of research, and thought, this is my answer to this problem:
Due to the nature of the block RAM addressing, the ideal scenario mentioned in the OP is not possible with the current block RAM addressing implementation. In order to perform a bitwise matrix transposition in the manner described, the block RAM addressing would need to be able to switch between horizontal and vertical. That is, the RAM must be accessible both row-wise and column-wise, and the addressing mode must be switchable in real time. Since a bit-wise data transposition is not really a particularly "useful" transform, there wouldn't really be a reason to implement such a switching scheme, especially since the whole point of block RAM is to store data in chunks of more than 1 bit, and such a transform would scramble the data.
I have discovered a way of changing my design such that 294911 x 8 bits do not need to be transformed at once, but rather done in stages using a process. This does not rely on block ram to perform the transform.

Algorithms FPGAs dominate CPUs on

For most of my life, I've programmed CPUs; and although for most algorithms, the big-Oh running time remains the same on CPUs / FPGAs, the constants are quite different (for example, lots of CPU power is wasted shuffling data around; whereas for FPGAs it's often compute bound).
I would like to learn more about this -- anyone know of good books / reference papers / tutorials that deals with the issue of:
what tasks do FPGAs dominate CPUs on (in terms of pure speed)
what tasks do FPGAs dominate CPUs on (in terms of work per joule)
Note: marked community wiki
[no links, just my musings]
FPGAs are essentially interpreters for hardware!
The architecture is like a dedicated ASIC, but you get rapid development, and you pay a factor of ~10 in frequency and a [don't know, at least 10?] factor in power efficiency.
So take any task where dedicated HW can massively outperform CPUs, divide by the FPGA 10/[?] factors, and you'll probably still have a winner. Typical qualities of such tasks:
Massive opportunities for fine-grained parallelism. (Doing 4 operations at once doesn't count; 128 does.)
Opportunity for deep pipelining. This is also a kind of parallelism, but it's hard to apply it to a single task, so it helps if you can get many separate tasks to work on in parallel.
(Mostly) Fixed data flow paths. Some muxes are OK, but massive random accesses are bad, because you can't parallelize them. But see below about memories.
High total bandwidth to many small memories. FPGAs have hundreds of small (O(1KB)) internal memories (BlockRAMs in Xilinx parlance), so if you can partition your memory usage into many independent buffers, you can enjoy a data bandwidth that CPUs never dreamed of.
Small external bandwidth (compared to internal work). The ideal FPGA task has small inputs and outputs but requires a lot of internal work. This way your FPGA won't starve waiting for I/O. (CPUs already suffer from starving, and they alleviate it with very sophisticated (and big) caches, unmatchable in FPGAs.) It's perfectly possible to connect a huge I/O bandwidth to an FPGA (~1000 pins nowadays, some with high-rate SERDESes), but doing that requires a custom board architected for such bandwidth; in most scenarios, your external I/O will be a bottleneck.
Simple enough for HW (aka good SW/HW partitioning). Many tasks consist of 90% irregular glue logic and only 10% hard work ("kernel" in the DSP sense). If you put all that onto an FPGA, you'll waste precious area on logic that does no work most of the time. Ideally, you want all the muck to be handled in SW and to fully utilize the HW for the kernel. ("Soft-core" CPUs inside FPGAs are a popular way to pack lots of slow irregular logic onto medium area, if you can't offload it to a real CPU.)
Weird bit manipulations are a plus. Things that don't map well onto traditional CPU instruction sets, such as unaligned access to packed bits, hash functions, coding & compression... However, don't overestimate the factor this gives you - most data formats and algorithms you'll meet have already been designed to go easy on CPU instruction sets, and CPUs keep adding specialized instructions for multimedia.
Lots of floating point specifically is a minus, because both CPUs and GPUs crunch it on extremely optimized dedicated silicon. (So-called "DSP" FPGAs also have lots of dedicated mul/add units, but AFAIK these only do integers?)
Low latency / real-time requirements are a plus. Hardware can really shine under such demands.
EDIT: Several of these conditions — esp. fixed data flows and many separate tasks to work on — also enable bit slicing on CPUs, which somewhat levels the field.
Well, the newest generation of Xilinx parts, just announced, brags 4.7 TMACS and general-purpose logic at 600 MHz. (These are basically Virtex-6s fabbed on a smaller process.)
On a beast like this, if you can implement your algorithms in fixed-point operations, primarily multiplies, adds and subtracts, and take advantage of both wide parallelism and pipelined parallelism, you can eat most PCs alive, in terms of both power and processing.
You can do floating point on these, but there will be a performance hit. The DSP blocks contain a 25x18 bit MACC with a 48-bit sum. If you can get away with oddball formats and bypass some of the floating-point normalization that normally occurs, you can still eke out a truckload of performance from these (i.e. use the 18-bit input as straight fixed point, or as a float with a 17-bit mantissa instead of the normal 24 bits). Double-precision floats are going to eat a lot of resources, so if you need that, you will probably do better on a PC.
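In C terms, the operation those DSP blocks are built around is just a fixed-point multiply-accumulate. A toy model of mine (using int64_t where the silicon has a 25x18 multiplier feeding a 48-bit accumulator; the sample and coefficient values are arbitrary placeholders):

/* Fixed-point multiply-accumulate, the bread and butter of the DSP slices. */
#include <stdio.h>
#include <stdint.h>

static int64_t mac(int64_t acc, int32_t a, int32_t b)
{
    return acc + (int64_t)a * b;   /* hardware: 25x18-bit multiply into a 48-bit accumulator */
}

int main(void) {
    int64_t acc = 0;
    int32_t sample[4] = { 1000, -2000, 3000, -4000 };
    int32_t coeff[4]  = { 3, -5, 7, 11 };            /* e.g. a tiny FIR filter */
    for (int i = 0; i < 4; ++i)
        acc = mac(acc, sample[i], coeff[i]);
    printf("accumulator = %lld\n", (long long)acc);
    return 0;
}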
If your algorithms can be expressed in terms of add and subtract operations, then the general-purpose logic in these can be used to implement a gazillion adders. Things like Bresenham's line/circle/yadda/yadda/yadda algorithms are VERY good fits for FPGA designs.
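For the record, here is Bresenham's line algorithm in plain C: nothing but integer adds, subtracts and compares in the loop, which is exactly why it drops so nicely onto FPGA fabric.

/* Generalized Bresenham line: integer-only error accumulation. */
#include <stdio.h>
#include <stdlib.h>

static void draw_line(int x0, int y0, int x1, int y1)
{
    int dx = abs(x1 - x0), sx = x0 < x1 ? 1 : -1;
    int dy = -abs(y1 - y0), sy = y0 < y1 ? 1 : -1;
    int err = dx + dy;                           /* running error term */

    for (;;) {
        printf("(%d,%d)\n", x0, y0);             /* "plot" the pixel */
        if (x0 == x1 && y0 == y1)
            break;
        int e2 = 2 * err;
        if (e2 >= dy) { err += dy; x0 += sx; }   /* step in x */
        if (e2 <= dx) { err += dx; y0 += sy; }   /* step in y */
    }
}

int main(void) {
    draw_line(0, 0, 7, 3);
    return 0;
}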
IF you need division... EH... it's painful, and probably going to be relatively slow unless you can implement your divides as multiplies.
If you need lots of high-precision trig functions, not so much... Again, it CAN be done, but it's not going to be pretty or fast. (Just like it can be done on a 6502.) If you can cope with just using a lookup table over a limited range, then you're golden!
Speaking of the 6502, a 6502 demo coder could make one of these things sing. Anybody who is familiar with all the old math tricks that programmers used to use on old-school machines like that will find they still apply. All the tricks that modern programmers are told to "let the library do for you" are the types of things that you need to know to implement maths on these. If you can find a book that talks about doing 3D on a 68000-based Atari or Amiga, it will discuss a lot of how to implement stuff in integer only.
ACTUALLY, any algorithm that can be implemented using lookup tables will be VERY well suited to FPGAs. Not only do you have blockrams distributed throughout the part, but the logic cells themselves can be configured as various-sized LUTs and mini RAMs.
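As a flavour of the lookup-table style, here is a first-quadrant sine table in C with fixed-point (Q15) outputs; on an FPGA the same table would just be a ROM initialised into a blockram or into LUT RAM (the table size and scaling here are arbitrary choices of mine):

/* 64-entry sine table covering 0..90 degrees, values in Q15 fixed point. */
#include <stdio.h>
#include <stdint.h>
#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

#define LUT_SIZE 64

static int16_t sine_lut[LUT_SIZE];

static void init_lut(void)   /* in hardware this would be a ROM init, not run-time code */
{
    for (int i = 0; i < LUT_SIZE; ++i)
        sine_lut[i] = (int16_t)lround(32767.0 * sin((M_PI / 2.0) * i / (LUT_SIZE - 1)));
}

int main(void) {
    init_lut();
    printf("sin(0 deg)   ~ %d / 32767\n", sine_lut[0]);             /* 0 */
    printf("sin(~45 deg) ~ %d / 32767\n", sine_lut[LUT_SIZE / 2]);  /* close to 0.707 * 32767 */
    printf("sin(90 deg)  ~ %d / 32767\n", sine_lut[LUT_SIZE - 1]);  /* 32767 */
    return 0;
}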
You can view things like fixed bit manipulations as FREE! They're simply handled by routing. Fixed shifts or bit reversals cost nothing. Dynamic bit operations like shifting by a variable amount will cost a minimal amount of logic and can be done till the cows come home!
The biggest part has 3960 multipliers! And 142,200 slices, EACH of which can be an 8-bit adder (4 6-bit LUTs per slice or 8 5-bit LUTs per slice, depending on configuration).
Pick a gnarly SW algorithm. Our company does HW acceleration of SW algos for a living.
We've done HW implementations of regular expression engines that will do thousands of rule-sets in parallel at speeds up to 10 Gb/sec. The target market for that is routers, where anti-virus and IPS/IDS can run in real time as the data is streaming by without slowing down the router.
We've done HD video encoding in HW. It used to take several hours of processing time per second of film to convert it to HD. Now we can do it almost in real time... it takes almost 2 seconds of processing to convert 1 second of film. Netflix used our HW almost exclusively for their video-on-demand product.
We've even done simple stuff like RSA, 3DES, and AES encryption and decryption in HW. We've done simple zip/unzip in HW. The target market for that is for security video cameras. The government has some massive amount of video cameras generating huge streams of real-time data. They zip it down in real-time before sending it over their network, and then unzip it in real-time on the other end.
Heck, another company I worked for used to do radar receivers using FPGAs. They would sample the digitized enemy radar data directly from several different antennas, and from the time delta of arrival, figure out what direction and how far away the enemy transmitter is. Heck, we could even check the unintended modulation on pulse of the signals in the FPGAs to figure out the fingerprint of specific transmitters, so we could know that this signal is coming from a specific Russian SAM site that used to be stationed at a different border, so we could track weapons movements and sales.
Try doing that in software!! :-)
For pure speed:
- Parallelizable ones
- DSP, e.g. video filters
- Moving data, e.g. DMA
