How to select the most powerful OpenCL device?

How to select the most powerful OpenCL device? - performance

My computer has both an Intel GPU and an NVIDIA GPU. The latter is much more powerful and is my preferred device when performing heavy tasks. I need a way to programmatically determine which one of the devices to use.
I'm aware of the fact that it is hard to know which device is best suited for a particular task. What I need is to (programmatically) make a qualified guess using the variables listed below.
How would you rank these two devices? Intel HD Graphics 4400 to the left, GeForce GT 750M to the right.
GlobalMemoryCacheLineSize 64 vs 128
GlobalMemoryCacheSize 2097152 vs 32768
GlobalMemorySize 1837105152 vs 4294967296
HostUnifiedMemory true vs false
Image2DMaxHeight 16384 vs 32768
Image2DMaxWidth 16384 vs 32768
Image3DMaxDepth 2048 vs 4096
Image3DMaxHeight 2048 vs 4096
Image3DMaxWidth 2048 vs 4096
LocalMemorySize 65536 vs 49152
MaxClockFrequency 400 vs 1085
MaxComputeUnits 20 vs 2
MaxConstantArguments 8 vs 9
MaxMemoryAllocationSize 459276288 vs 1073741824
MaxParameterSize 1024 vs 4352
MaxReadImageArguments 128 vs 256
MaxSamplers 16 vs 32
MaxWorkGroupSize 512 vs 1024
MaxWorkItemSizes [512, 512, 512] vs [1024, 1024, 64]
MaxWriteImageArguments 8 vs 16
MemoryBaseAddressAlignment 1024 vs 4096
OpenCLCVersion 1.2 vs 1.1
ProfilingTimerResolution 80 vs 1000
VendorId 32902 vs 4318
Obviously, there are hundreds of other devices to consider. I need a general formula!

You can not have a simple formula to calculate an index from that parameters.
Explanation
First of all let me assume you can trust collected data, of course if you read 2 for MaxComputeUnits but in reality it's 80 then there is nothing you can do (unless you have your own database of cards with all their specifications).
How can you guess if you do not know task you have to perform? It may be something highly parallel (then more units may be better) or a raw brute calculation (then higher clock frequency or bigger cache may be better). As for normal CPU number of threads isn't the only factor you have to consider for parallel tasks. Just to mention few things you have to consider:
Cache: how much local data each task works with?
Memory: shared with CPU? How many concurrent accesses compared to parallel tasks?
Instruction set: do you need something specific that increases speed even if other parameters aren't so good?
Misc stuff: do you have some specific requirement, for example size of something that must be supported and a fallback method makes everything terribly slow?
To make it short: you can not calculate an index in a reliable way because factors are too many and they're strongly correlated (for example high parallelism may be slowed by small cache or slow memory access but a specific instruction, if supported, may give you great performance even if all other parameters are poor).
One Possible Solution
If you need a raw comparison you may even simply do MaxComputeUnits * MaxClockFrequency (and it may even be enough for many applications) but if you need a more accurate index then don't think it'll be an easy task and you'll get a general purpose formula like (a + b / 2)^2, it's not and results will be very specific to task you have to accomplish.
Write a small test (as much similar as possible to what your task is, take a look to this post on SO) and run it with many cards, with a big enough statistic you may extrapolate an index from an unknown set of parameters. Algorithms can become pretty complex and there is a vast literature about this topic so I won't even try to repeat them here. I would start with Wikipedia article as summary to other more specific papers. If you need an example of what you have to do you may read Exploring the Multiple-GPU Design Space.
Remember that more variables you add to your study more results quality will be unstable, less parameters you use less results will be accurate. To better support extrapolation:
After you collected enough data you should first select and reduce variables with some pre-analysis to a subset of them including only what influences more your benchmark results (for example MaxGroupSize may not be so relevant). This phase is really important and decisions should be made with statistic tools (you may for example calculate p-value).
Some parameters may have a great variability (memory size, number of units) but analysis would be easier with less values (for example [0..5) units, [5..10) units, [10..*) units). You should then partition data (watching their distribution). Different partitions may lead to very different results so you should try different combinations.
There are many other things to consider, a good book about data mining would help you more than 1000 words written here.

As #Adriano as pointed out, there are many things to take into considerations...too many things.
But I can think of few things (and easier things that could be done) to help you out (not to completely solve your problem) :
OCL Version
First thing first, which version of OCL do you need (not really related to performance). But if you use some feature of OCL 1.2...well problem solved
Memory or computation bound
You can usually (and crudely) categorized your algorithms in one of these two categories: memory bounded or computation bounded. In the case it's memory bound (with a lot of transfers between host and device) probably the most interesting info would be the device with Host Unified Memory. If not, the most powerful processors most probably would be more interesting.
Rough benchmark
But most probably it wouldn't be as easy to choose in which category put your application.
In that case you could make a small benchmark. Roughly, this benchmark would test different size of data (if your app has to deal with that) on dummy computations which would more or less match the amount of computations your application requires (estimated by you after you completed the development of your kernels). You could log the point where the amount of data is so big that it cancels the device most powerful but connected via PCIe.
GPU Occupancy
Another very important thing when programming on GPUs is the GPU occupancy. The higher, the best. NVIDIA provides an Excel file that calculates the occupancy based on some input. Based on these concepts, you could more or less reproduce the calculation of the occupancy (some adjustment will most probably needed for other vendors) for both GPUs and choose the one with the highest.
Of course, you need to know the values of these inputs. Some of them are based on your code, so you can calculate them before hands. Some of them are linked to the specs of the GPU. You can query some of them as you already did, for some others you might need to hardcode the values in some files after some googling (but at least you don't need to have these GPUs at hands to test on them). Last but not least, don't forget that OCL provides the clGetKernelWorkGroupInfo() which can provide you some info such as the amount of local or private memory needed by a specific kernel.
Regarding the info about the local memory please note that remark from the standard:
If the local memory size, for any pointer argument to the kernel
declared with the __local address qualifier, is not specified, its
size is assumed to be 0.
So, it means that this info could be useless if you have first to dynamically compute the size from the host side. A work-around for that could be to use the fact that the kernels are compiled in JIT. The idea here would be to use the preprocessor option -D when calling clBuildProgram() as I explained here. This would give you something like:
#define SIZE
__mykernel(args){
local myLocalMem[SIZE];
....
}
And what if the easier was:
After all the blabla. I'm guessing that you worry about this because you might want to ship your application to some users without knowing what hardware they have. Would it be very inconvenient (at install time or maybe after by providing them a command or a button) to simply run you application with dummy generated data to measure which device performed better and simply log it in a config file?
Or maybe:
Sometime, depending on you specific problem (that could not involve to many syncs) you don't have to choose. Sometime, you could just simply split the work between the two devices and use both...

Why guess? Choose dynamically on your hardware of the day: Take the code you wish to run on the "best" GPU and run it, on a small amount of sample data, on each available GPU. Whichever finishes first: use it for the rest of your calculations.

I'm loving all of the solutions so far. If it is important to make the best device selection automatically, that's how to do it (weight the values based on your usage needs and take the highest score).
Alternatively, and much simpler, is to just take the first GPU device, but also have a way for the user to see the list of compatible devices and change it (either right away or on the next run).
This alternative is reasonable because most systems only have one GPU.

Related

Drhystone benchmark on 32 bit micrcontroller

Currently I am doing a performance comparison on two 32bit microcontrollers. I used Dhrystone benchmark to run on both microcontrollers. One microcontroller has 4KB I-cache while second coontroller has 8KB of I-cache. Both microcontrollers are using same tool chain. As much as possible I kept same static and run-time settings onboth microcontrollers. But microcontroller with 4KB cache are faster than 8KB cache microcontroller. Both microcontroller are from same vendor and based on same CPU.
Could anyone provide some information why microcontroller with 4KB cache is faster than other?

Benchmarks are in general useless. dhrystone being one of the oldest ones and it may have had a little bit of value back then, before pipelines and too much compiler optimization. I think I gave up on dhrystone 15 years ago roughly about the time I started using it.
It is trivial to demonstrate that this code
.globl ASMDELAY
ASMDELAY:
sub r0,r0,#1
bne ASMDELAY
bx lr
Which is primarily two INSTRUCTIONS, can vary WIDELY in execution time on the same chip if you understand how modern processors work. The simple trick to seeing this is turn off the caches and prefetchers and such, and place this code at offset 0x0000, call it with some value. place it at 0x0004 repeat, then at 0x0008, repeat. keep doing this. You can put one, two, etc nops between the subtract and branch. try it at various offsets.
THEN, turn on and of caches for each of these alignments, if you have a prefetch outside the processor for the flash, turn that on and off.
THEN, vary your clocks, esp for those MCUs where you have to adjust the wait states based on clock rate.
On a SINGLE MCU, you are going to see those essentially two instructions vary in execution time by a very large amount. Lets just say 20 times longer in some cases than others.
Now take a small program or a small fraction of the dhrystone program. Compile that for your mcu, how many instructions do you see? Make minor to major optimization and other variations on the compile command line, how much does the code change. If two instructions can vary by lets call it 20 times in execution time, how bad can it get for 200 instructions or 2000 instructions? It can get pretty bad.
If you take the dhrystone programs you have right now with the compiler options you have right now, go into your bootstrap, add one nop (causing the entire binary to shift by one instruction in flash) run again. Add two, three, four. You are still not comparing different mcus just running your benchmark on one system.
Run with and without d cache, with and without i cache if you have each of those. turn on and off the flash prefetch if you have one and if you have a write buffer you can turn on and of try that. Still remaining on the same compiler same options same mcu.
Take the different functions in the dhrystone source code, and re-arrange them in the source code. instead of Proc_1, Proc_2, Proc_3 make it Proc_1, Proc_3, Proc_2. Do all of the above again. Re-arrange again, repeat.
Before leaving this mcu you should now see that the execution time of the same source code which is completely unmodified (other than perhaps re-arranging functions) can and will have vastly different execution times.
if you then start changing compiler options or keep the same source and change compilers, you will see even more of a difference in execution time.
How is it possible that dhrystone benchmarks today or from way back had a single result for each platform? Simple, that result was just one of a wide range, not really representing the platform.
So then if you try to compare two different hardware platforms be it the same arm core inside a different mcu from the same vendor or different vendors. the arm core, assuming (which is not a safe assumption) it is the same source with the same compile/build options, even assuming the same verilog compile and synthesis was used. you can have that same core change based on arm provided options. Anyway, how the vendor be it the same vendor in two instances or two different vendors wraps that same core, you will see variations. Then take a completely different core be it another arm or a mips, etc. How could you gain any value comparing those using a program like this that itself varies widely on each platform?
You cant. What you can do is use benchmarks to give the illusion of one thing being better than another, one computer is faster than another, one compiler is faster than another. In order to sell computers or compilers. Sprints coverage is within one percent of Verizons...does that tell us anything useful? Nope.
If you were to eliminate the compiler from the equation, and if these are truly the same "CPU", same rev of source from ARM, built the same way, then they should fetch the same, but the size of the cache is part of that so it may already be a different cpu implementation as the width or depth of the cache can affect things. In software it is like needing a 32 bit pointer rather than a 16 bit pointer (17 bit instead of 16, but you cant have a 17 bit generally in logic you can).
Anyway, if you compile the code under test one time for an address space that is common to both platforms, use that same binary exactly for that space, can attach different bootstrap code as needed, note the C library calls strcpy, etc also have to be the same in the same space between platforms to eliminate the compiler and alignment from messing you up. this may or may not level the playing field.
If you want to believe these are the same cpu, then turn the caches off, eliminate the compiler variations by doing the above. See if they execute the same. Copy the program to ram and run in ram, eliminate the flash issues. I assume you have them both clocked the same with the same wait states in the flash?
if they are the same cpu and the chip vendor has with these two chips made the memory system take the same number of clocks say for ram accesses, and it is really the same cpu, you should be able to get the same time by eliminating optimizations (caching, flash prefetching, alignment).
what you are probably seeing is some form of alignment with how the code lies in memory from the compiler vs cache lines, or it could be much much simpler it could be just the differences in the caches, how the hits and misses work and the 4KB is just more lucky than the 8KB for this particular program compiled a certain way, aligned in memory a certain way, etc.
With the simple two instruction loop above it is easy to see some of the reasons why performance varies on the same system, if your modern cpu fetches 8 instructions at a time and your loop gets too close to the tail end of that fetch, the prefetch may think it needs to fetch another 8 beyond that costing you those clock cycles. Certainly as you exactly straddle two "fetch lines" as I call them with those two instructions it is going to cost you a lot more cycles per loop even with a cache. Same problem happens when those two instructions approach a cache line (as you vary their alignment per test) eventually it takes two cache line reads instead of one to fetch those two instructions. At least the first time through there is an extra cache line read. The extra clocks for that first time through is something you can see using a simple benchmark like that while playing with alignment.
Michael Abrash, Zen of Assembly Language. There is an epub/etc you can build from github of this book. the 8088 was obsolete when this book came out if that is all you see 8088 stuff, then you are completely missing the point. It applies to the most modern processors today, how to view the problem, how to test, how to time the test, how to interpret the results. All the stuff I have mentioned so far, and all the things I know about this that I have not mentioned, all came from that books knowledge applied for however many decades I have been doing this.
So again if you have truly eliminated the compiler, alignment, the cpu, the memory system tied to that cpu, etc and it is down to only the size of the cache varies. Then it is probably related to how the cache line fetches hit and miss differently based on alignment of the code relative to the cache lines for the two caches. One is hitting more and missing less and/or evicting better for this particular binary. You can try rearranging the functions, can add nops, or if you cant get at the bootstrap then add whole functions or more code (another printf, etc) at a lower address in the binary causing the linker to slide the code under test through to different addresses changing how the program lines up with the cache lines. Since the functions in the code under test are so big (more than a few instructions to implement) you would have to start modifying the program in order to get a finer grained adjustment of binary relative to cache lines.
You most definitely should see execution time differences on the two platforms if you adjust alignment and or wholesale re-arrange the binary based on functions being re-arranged.
Bottom line benchmarks dont really tell you much, the results have more of a negative stink to them than a positive joy. Without a re-write a particular benchmark or application may just do better on one platform (be it everything is the same but the size of the cache or two completely different architectures) even when you try to vary the results due to alignment, turning on and off prefetching, write buffering, branch prediction, etc. Pulling out all the tricks you can come up with one may vary from x to y the other from n to m and maybe there is some overlap in the ranges. For these two platforms that you say are similiar except for cache size I would hope you could find a combination where sometimes A is faster than B and at least one combination where B is faster than A, with the same features turned on and off for both in that comparision. If/when you switch to some other application/benchmark this all starts over again, no reason to assume dhrystone results predict any other code under test. The only program that matters, esp on an MCU is the final build of your application. Just remember changing a single line of code, or even adding a single nop in the bootstrap can have dramatic results in performance sometimes several TIMES slower or faster from a single nop.

How to compare two implementations of the same algorithm? (by examine their Assembly code)

Assume I have two implementations of the same algorithm in assembly. I would like to know by examining the two snippets codes which one is faster.
The parameters I thought one might take into account are: number of op-codes, number of branches, number of function frames.
My questions are:
Can I assume each opcode execution is one cycle ?
What is the overhead of branch which break the pipeline ?
What are the effects and overhead of calling a function ?
Is there a difference in the analysis between ARM and x86 ?
The question is theoretical since I have two implementations; one 130 instructions long and one is 184 instructions long.
And I would like to know if it is definitely true to say the 130 instructions long snippet is faster than the 184 instructions long implementation?
"BETTER == FASTER"

Without wanting to be flippant, the answers are
no
that depends on your hardware
that depends on your hardware
yes
You would really need to test things on your target hardware, or have a simulator that understands your hardware fully, in order to answer your question the way you meant to...
For the last part of your question, you need to define "better"…better.

Since you asked about a Cortex A9, the data sheet has instruction cycle counts in appendix B. These counts generally assume that the memory bus is fast enough to keep the CPU busy. In reality this is rarely the case. Many video/audio algorithms will have a big win in how they access memory.
One cycle per op
Of course you can't assume this if you want an exact count. However, if you are deciding which algorithm to choose, you can get a feel for the best algorithm by looking at the instructions in the inner loop. Here, your cache should allow the code to execute as per the instruction counts in the data sheet. If the counts are close, then you probably need to look at each instruction. Load/stores are more expensive and usually multiples, etc. Some algorithms, especially crytographic, will have big wins by using assembler that doesn't map well to C. For example, clz, ror, using the carry for multi-word arithmetic, etc.
Branch overhead
Look in Appendix B, or whatever data sheet has cycle counts for your processor. For an ARM926 it is about 3 cycles. The compiler only generates two conditional opcodes in a row to avoid branching, otherwise, it branches. If the algorithm is large, the branch may disrupt the cache. A hard answer depends on your CPU, cache, and memory. According to the Cortex A9 datasheet (B.5), there is only one cycle overhead to a fixed branch.
Function overhead
This is much the same as the branch overhead. However, the compiler will also have an influence. noted by Jim Does it cache align functions. Does the compiler perform leaf function optimizations, etc. With modern gcc versions, if all the functions are static, the compiler will generally in-line when it is advantageous. If the algorithms are particularly large, a register spill may be advantageous. However, with your example of 130/184 instructions, this seems unlikely. The compiler options will obviously effect the overhead. You can use objdump -S to examine the prologue/epilogue and then determine the number of cycles for your hardware.
ARM verus x86
Of course there is a technical difference in the cycle counts. The CISC x86 also has variable instruction size. This complicates the analysis. It is slightly easier on the ARM.
Normally, you want to ball park things and then actually run them with a profiler. The estimates can help guide development of the algorithms. Loop/memory tuning, etc for your hardware. Something like instruction emulation, page or alignment faults, etc may be dominant and make all the cycle count analysis meaningless. If the algorithm is in user space, per-emption, may negate cache wins from run to run. It is possible that one algorithm will work better in a little loaded system and the other will work better under a higher load.
A note on cycle counts
See the post-process objdump for some complications in getting cycle counts. Basically a typical CPU is several phases (a pipe line) and different conditions can cause stalls. As CPU's become more complex, the pipe line typically gets longer, meaning there are more conditions or phases which can stall. However, cycle count estimates can be helpful in guiding development of an algorithm and evaluating them. Things like memory timing or branch prediction can be just as important, depending on the algorithm. Ie, cycle counts are not completely useless, but they are not complete either. Profiling should confirm actual algorithm times. If they diverge, instruction re-ordering, pre-fetching and other techniques may bring them closer. The fact that cycle counts and active profiling diverge can be helpful in itself.

It is definitely not true to say that the 130 instruction code is faster than the 184 instruction code. it is very easy to have 1000 instructions run faster than 100 and vice versa on either of these platforms.
1 Can I assume each opcode execution is one cycle ?
Start by looking at the advertised mips/mhz, although a marketing number it gives a rough idea of what is possible. If the number is greater than one then more than one instruction per clock is possible.
2 What is the overhead of branch which break the pipeline ?
Anywhere from absolutely no affect to a very dramatic affect, on either system. one clock to hundreds are the potential penalty.
3 What are the effects and overhead of calling a function ?
Depends heavily on the function, and the function calling the function. Depending on the calling convention you might have to save registers to the stack, or rearrange the contents of registers to prepare for the parameters for the function to be called. If passing a struct by value a copy of the struct may need to be made on the stack, the bigger the struct passed the bigger the copy. once in the function a stack frame may need to be prepared, etc, etc. There are many factors involved. This question and answer are also independent of platform.
4 Is there a difference in the analysis between ARM and x86 ?
yes and no, both systems use all the modern tricks of pipelining, branch prediction, etc to keep the mips/mhz up. ARM is going to give a better mips per mhz than x86, x86 being variable instruction length might give more instructions per unit cache. How you analyze the cache, and memory and peripheral systems in the systems side of the analysis is roughly the same. The comparison of the instructions and core are similar and different depending on what aspects you are analyzing. The arm is not microcoded, the x86 likely is so you dont really see how many registers there really are, things like that. at the same time the x86 you can get a better look at the memory system with the arm, since they are generally not system on a chip. Depending on what ARM chip you buy you may lose a lot of the visibility in the boundaries of the chip, might not see all the memory and peripheral busses, for example. (x86 is changing that by putting pcie on chip now for example) in the case of something in the cortex-a class you mentioned you would have similar edge of chip visibility as those would use larger/cheaper dram based memory off chip rather than microcontroller like on chip resources.
Bottom line your final question:
"And I would like to know if it is definitely true to say the 130 instructions long snippet is faster than the 184 instructions long implementation?"
It is definitely NOT TRUE to say the 130 instruction snippet is faster than the 184 instruction snippet. It might be faster it might be slower and it might be about the same. With a lot more information we might be able to make a pretty good statement or it may still be non-deterministic. it is easy to choose 100 instructions that execute faster than 1000 instructions and likewise easy to choose 1000 instructions that execute faster than 100 instructions (even if I were to add no branching and no loops, just linear execution)

Your question is almost entirely meaningless: It probably depends on your input.
Most CPUs have something resembling a branch misprediction penalty (e.g. traditional ARM which throws away an instruction fetch/decode on any taken branch, IIRC). ARM and x86 also allow conditional execution, which can be faster than branching. If either of these are dependent on input data, then different inputs will follow different code paths.
Perhaps one version heavily uses conditional execution, which is wasteful when the condition is false. Perhaps another was compiled using some profiling information that performs no branches (except the return at the end) for a specific case. There are many, many reason why a compiler can take the same source and produce an "optimized" output which is faster for one input and slower for another.
Many optimizations have this characteristic — for example, aligning the start of a loop to 16 bytes helps on some processors, but not when the loop is only executed once.

Some text book answer to this question from Cortex
™
-A Series Programmer’s Guide, chapter 17.
Although cycle timing information can be found in the Technical Reference Manual (TRM) for the processor that you are using, it is very difficult to work out how many cycles even a trivial piece of code will take to execute. The movement of instructions through the pipeline is dependent on the progress of the surrounding instructions and can be significantly affected by memory system
activity. Pending loads or instruction fetches which miss in the cache can stall code for tens of cycles. Standard data processing instructions (logical and arithmetic) will take only one or two cycles to execute, but this does not give the full picture. Instead, we must use profiling tools, or the system performance monitor built-in to the processor, to extract useful information about performance.
Also read under 17.4 Cortex-A9 micro-architecture optimizations which answers your question very very much.

"Well-parallelized" algorithm not sped up by multiple threads

I'm sorry to ask a question one a topic that I know so little about, but this idea has really been bugging me and I haven't been able to find any answers on the internet.
Background:
I was talking to one of my friends who is in computer science research. I'm in mostly ad-hoc development, so my understanding of a majority of CS concepts is at a functional level (I know how to use them rather than how they work). He was saying that converting a "well-parallelized" algorithm that had been running on a single thread into one that ran on multiple threads didn't result in the processing speed increase that he was expecting.
Reasoning:
I asked him what the architecture of the computer he was running this algorithm on was, and he said 16-core (non-virtualized). According to what I know about multi-core processors, the processing speed increase of an algorithm running on multiple cores should be roughly proportional to how well it is parallelized.
Question:
How can an algorithm that is "well-parallelized" and programmed correctly to run on a true multi-core processor not run several times more quickly? Is there some information that I'm missing here, or is it more likely a problem with the implementation?
Other stuff: I asked if the threads were possibly taking up more power than any individual core had available and apparently each core runs at 3.4 GHz. This is much more than the algorithm should need, and when diagnostics are run the cores aren't maxed out during runtime.

It is likely sharing something. What is being shared may not be obvious.
One of the most common non-obvious shared resources is CPU cache. If the threads are updating the same cache line that cache line has to bounce between CPUs, slowing everything down.
That can happen because of accessing (even read-only) variables which are near to each other in memory. If all accesses are read-only it is OK, but if even one CPU is writing to that cache line it will force a bounce.
A brute-force method of fixing this is to put shared variables into structures that look like:
struct var_struct {
int value;
char padding[128];
};
Instead of hard-coding 128 you could research what system parameter or preprocessor macros define the cache-line size for your system type.
Another place that sharing can take place is inside system calls. Even seemingly innocent functions might be taking global locks. I seem to recall reading about Linux fixing an issue like this a while back with locks on the functions that return process and thread identifiers and parent identifiers.

Performance versus number of cores is often a S-like curve - first it obviously increases but as locking, shared cache and the like take they debt the further cores do not add so much and even may degrade. Hence nothing mysterious. If we would know more details about the algorithm it may be possible to find an idea to speed it up.

Algorithms FPGAs dominate CPUs on

For most of my life, I've programmed CPUs; and although for most algorithms, the big-Oh running time remains the same on CPUs / FPGAs, the constants are quite different (for example, lots of CPU power is wasted shuffling data around; whereas for FPGAs it's often compute bound).
I would like to learn more about this -- anyone know of good books / reference papers / tutorials that deals with the issue of:
what tasks do FPGAs dominate CPUs on (in terms of pure speed)
what tasks do FPGAs dominate CPUs on (in terms of work per jule)
Note: marked community wiki

[no links, just my musings]
FPGAs are essentially interpreters for hardware!
The architecture is like dedicated ASICs, but to get rapid development, and you pay a factor of ~10 in frequency and a [don't know, at least 10?] factor in power efficiency.
So take any task where dedicated HW can massively outperform CPUs, divide by the FPGA 10/[?] factors, and you'll probably still have a winner. Typical qualities of such tasks:
Massive opportunities for fine-grained parallelism.
(Doing 4 operations at once doesn't count; 128 does.)
Opportunity for deep pipelining.
This is also a kind of parallelism, but it's hard to apply it to a
single task, so it helps if you can get many separate tasks to
work on in parallel.
(Mostly) Fixed data flow paths.
Some muxes are OK, but massive random accesses are bad, cause you
can't parallelize them. But see below about memories.
High total bandwidth to many small memories.
FPGAs have hundreds of small (O(1KB)) internal memories
(BlockRAMs in Xilinx parlance), so if you can partition you
memory usage into many independent buffers, you can enjoy a data
bandwidth that CPUs never dreamed of.
Small external bandwidth (compared to internal work).
The ideal FPGA task has small inputs and outputs but requires a
lot of internal work. This way your FPGA won't starve waiting for
I/O. (CPUs already suffer from starving, and they alleviate it
with very sophisticated (and big) caches, unmatchable in FPGAs.)
It's perfectly possible to connect a huge I/O bandwidth to an
FPGA (~1000 pins nowdays, some with high-rate SERDESes) -
but doing that requires a custom board architected for such
bandwidth; in most scenarios, your external I/O will be a
bottleneck.
Simple enough for HW (aka good SW/HW partitioning).
Many tasks consist of 90% irregular glue logic and only 10%
hard work ("kernel" in the DSP sense). If you put all that
onto an FPGA, you'll waste precious area on logic that does no
work most of the time. Ideally, you want all the muck
to be handled in SW and fully utilize the HW for the kernel.
("Soft-core" CPUs inside FPGAs are a popular way to pack lots of
slow irregular logic onto medium area, if you can't offload it to
a real CPU.)
Weird bit manipulations are a plus.
Things that don't map well onto traditional CPU instruction sets,
such as unaligned access to packed bits, hash functions, coding &
compression... However, don't overestimate the factor this gives
you - most data formats and algorithms you'll meet have already
been designed to go easy on CPU instruction sets, and CPUs keep
adding specialized instructions for multimedia.
Lots of Floating point specifically is a minus because both
CPUs and GPUs crunch them on extremely optimized dedicated silicon.
(So-called "DSP" FPGAs also have lots of dedicated mul/add units,
but AFAIK these only do integers?)
Low latency / real-time requirements are a plus.
Hardware can really shine under such demands.
EDIT: Several of these conditions — esp. fixed data flows and many separate tasks to work on — also enable bit slicing on CPUs, which somewhat levels the field.

Well the newest generation of Xilinx parts just anounced brag 4.7TMACS and general purpose logic at 600MHz. (These are basically Virtex 6s fabbed on a smaller process.)
On a beast like this if you can implement your algorithms in fixed point operations, primarily multiply, adds and subtracts, and take advantage of both Wide parallelism and Pipelined parallelism you can eat most PCs alive, in terms of both power and processing.
You can do floating on these, but there will be a performance hit. The DSP blocks contain a 25x18 bit MACC with a 48bit sum. If you can get away with oddball formats and bypass some of the floating point normalization that normally occurs you can still eek out a truck load of performance out of these. (i.e. Use the 18Bit input as strait fixed point or float with a 17 bit mantissia, instead of the normal 24 bit.) Doubles floats are going to eat alot of resources so if you need that, you probably will do better on a PC.
If your algorithms can be expressed as in terms of add and subtract operations, then the general purpose logic in these can be used to implement gazillion adders. Things like Bresenham's line/circle/yadda/yadda/yadda algorithms are VERY good fits for FPGA designs.
IF you need division... EH... it's painful, and probably going to be relatively slow unless you can implement your divides as multiplies.
If you need lots of high percision trig functions, not so much... Again it CAN be done, but it's not going to be pretty or fast. (Just like it can be done on a 6502.) If you can cope with just using a lookup table over a limited range, then your golden!
Speaking of the 6502, a 6502 demo coder could make one of these things sing. Anybody who is familiar with all the old math tricks that programmers used to use on the old school machine like that will still apply. All the tricks that modern programmer tell you "let the libary do for you" are the types of things that you need to know to implement maths on these. If yo can find a book that talks about doing 3d on a 68000 based Atari or Amiga, they will discuss alot of how to implement stuff in integer only.
ACTUALLY any algorithms that can be implemented using look up tables will be VERY well suited for FPGAs. Not only do you have blockrams distributed through out the part, but the logic cells themself can be configured as various sized LUTS and mini rams.
You can view things like fixed bit manipulations as FREE! It's simply handle by routing. Fixed shifts, or bit reversals cost nothing. Dynamic bit operations like shift by a varable amount will cost a minimal amount of logic and can be done till the cows come home!
The biggest part has 3960 multipliers! And 142,200 slices which EACH one can be an 8 bit adder. (4 6Bit Luts per slice or 8 5bit Luts per slice depending on configuration.)

Pick a gnarly SW algorithm. Our company does HW acceleration of SW algo's for a living.
We've done HW implementations of regular expression engines that will do 1000's of rule-sets in parallel at speeds up to 10Gb/sec. The target market for that is routers where anti-virus and ips/ids can run real-time as the data is streaming by without it slowing down the router.
We've done HD video encoding in HW. It used to take several hours of processing time per second of film to convert it to HD. Now we can do it almost real-time...it takes almost 2 seconds of processing to convert 1 second of film. Netflix's used our HW almost exclusively for their video on demand product.
We've even done simple stuff like RSA, 3DES, and AES encryption and decryption in HW. We've done simple zip/unzip in HW. The target market for that is for security video cameras. The government has some massive amount of video cameras generating huge streams of real-time data. They zip it down in real-time before sending it over their network, and then unzip it in real-time on the other end.
Heck, another company I worked for used to do radar receivers using FPGA's. They would sample the digitized enemy radar data directly several different antennas, and from the time delta of arrival, figure out what direction and how far away the enemy transmitter is. Heck, we could even check the unintended modulation on pulse of the signals in the FPGA's to figure out the fingerprint of specific transmitters, so we could know that this signal is coming from a specific Russian SAM site that used to be stationed at a different border, so we could track weapons movements and sales.
Try doing that in software!! :-)

For pure speed:
- Paralizable ones
- DSP, e.g. video filters
- Moving data, e.g. DMA

Power Efficient Software Coding

In a typical handheld/portable embedded system device Battery life is a major concern in design of H/W, S/W and the features the device can support. From the Software programming perspective, one is aware of MIPS, Memory(Data and Program) optimized code.
I am aware of the H/W Deep sleep mode, Standby mode that are used to clock the hardware at lower Cycles or turn of the clock entirel to some unused circutis to save power, but i am looking for some ideas from that point of view:
Wherein my code is running and it needs to keep executing, given this how can I write the code "power" efficiently so as to consume minimum watts?
Are there any special programming constructs, data structures, control structures which i should look at to achieve minimum power consumption for a given functionality.
Are there any s/w high level design considerations which one should keep in mind at time of code structure design, or during low level design to make the code as power efficient(Least power consuming) as possible?

Like 1800 INFORMATION said, avoid polling; subscribe to events and wait for them to happen
Update window content only when necessary - let the system decide when to redraw it
When updating window content, ensure your code recreates as little of the invalid region as possible
With quick code the CPU goes back to deep sleep mode faster and there's a better chance that such code stays in L1 cache
Operate on small data at one time so data stays in caches as well
Ensure that your application doesn't do any unnecessary action when in background
Make your software not only power efficient, but also power aware - update graphics less often when on battery, disable animations, less hard drive thrashing
And read some other guidelines. ;)
Recently a series of posts called "Optimizing Software Applications for Power", started appearing on Intel Software Blogs. May be of some use for x86 developers.

Zeroith, use a fully static machine that can stop when idle. You can't beat zero Hz.
First up, switch to a tickless operating system scheduler. Waking up every millisecend or so wastes power. If you can't, consider slowing the scheduler interrupt instead.
Secondly, ensure your idle thread is a power save, wait for next interrupt instruction.
You can do this in the sort of under-regulated "userland" most small devices have.
Thirdly, if you have to poll or perform user confidence activities like updating the UI,
sleep, do it, and get back to sleep.
Don't trust GUI frameworks that you haven't checked for "sleep and spin" kind of code.
Especially the event timer you may be tempted to use for #2.
Block a thread on read instead of polling with select()/epoll()/ WaitForMultipleObjects().
Puts stress on the thread scheuler ( and your brain) but the devices generally do okay.
This ends up changing your high-level design a bit; it gets tidier!.
A main loop that polls all the things you Might do ends up slow and wasteful on CPU, but does guarantee performance. ( Guaranteed to be slow)
Cache results, lazily create things. Users expect the device to be slow so don't disappoint them. Less running is better. Run as little as you can get away with.
Separate threads can be killed off when you stop needing them.
Try to get more memory than you need, then you can insert into more than one hashtable and save ever searching. This is a direct tradeoff if the memory is DRAM.
Look at a realtime-ier system than you think you might need. It saves time (sic) later.
They cope better with threading too.

Do not poll. Use events and other OS primitives to wait for notifiable occurrences. Polling ensures that the CPU will stay active and use more battery life.

From my work using smart phones, the best way I have found of preserving battery life is to ensure that everything you do not need for your program to function at that specific point is disabled.
For example, only switch Bluetooth on when you need it, similarly the phone capabilities, turn the screen brightness down when it isn't needed, turn the volume down, etc.
The power used by these functions will generally far outweigh the power used by your code.

To avoid polling is a good suggestion.
A microprocessor's power consumption is roughly proportional to its clock frequency, and to the square of its supply voltage. If you have the possibility to adjust these from software, that could save some power. Also, turning off the parts of the processor that you don't need (e.g. floating-point unit) may help, but this very much depends on your platform. In any case, you need a way to measure the actual power consumption of your processor, so that you can find out what works and what not. Just like speed optimizations, power optimizations need to be carefully profiled.

Consider using the network interfaces the least you can. You might want to gather information and send it out in bursts instead of constantly send it.

Look at what your compiler generates, particularly for hot areas of code.

If you have low priority intermittent operations, don't use specific timers to wake up to deal with them, but deal with when processing other events.
Use logic to avoid stupid scenarios where your app might go to sleep for 10 ms and then have to wake up again for the next event. For the kind of platform mentioned it shouldn't matter if both events are processed at the same time.
Having your own timer & callback mechanism might be appropriate for this kind of decision making. The trade off is in code complexity and maintenance vs. likely power savings.

Simply put, do as little as possible.

Well, to the extent that your code can execute entirely in the processor cache, you'll have less bus activity and save power. To the extent that your program is small enough to fit code+data entirely in the cache, you get that benefit "for free". OTOH, if your program is too big, and you can divide your programs into modules that are more or less independent of the other, you might get some power saving by dividing it into separate programs. (I suppose it's also possible to make a toolchain that spreas out related bundles of code and data into cache-sized chunks...)
I suppose that, theoretically, you can save some amount of unnecessary work by reducing the number of pointer dereferencing, and by refactoring your jumps so that the most likely jumps are taken first -- but that's not realistic to do as a programmer.
Transmeta had the idea of letting the machine do some instruction optimization on-the-fly to save power... But that didn't seem to help enough... And look where that got them.

Set unused memory or flash to 0xFF not 0x00. This is certainly true for flash and eeprom, not sure about s or d ram. For the proms there is an inversion so a 0 is stored as a 1 and takes more energy, a 1 is stored as a zero and takes less. This is why you read 0xFFs after erasing a block.

Rather timely this, article on Hackaday today about measuring power consumption of various commands:
Hackaday: the-effect-of-code-on-power-consumption
Aside from that:
- Interrupts are your friends
- Polling / wait() aren't your friends
- Do as little as possible
- make your code as small/efficient as possible
- Turn off as many modules, pins, peripherals as possible in the micro
- Run as slowly as possible
- If the micro has settings for pin drive strengh, slew rate, etc. check them & configure them, the defaults are often full power / max speed.
- returning to the article above, go back and measure the power & see if you can drop it by altering things.

also something that is not trivial to do is reduce precision of the mathematical operations, go for the smallest dataset available and if available by your development environment pack data and aggregate operations.
knuth books could give you all the variant of specific algorithms you need to save memory or cpu, or going with reduced precision minimizing the rounding errors
also, spent some time checking for all the embedded device api - for example most symbian phones could do audio encoding via a specialized hardware

Do your work as quickly as possible, and then go to some idle state waiting for interrupts (or events) to happen. Try to make the code run out of cache with as little external memory traffic as possible.

On Linux, install powertop to see how often which piece of software wakes up the CPU. And follow the various tips that the powertop site links to, some of which are probably applicable to non-Linux, too.
http://www.lesswatts.org/projects/powertop/

Choose efficient algorithms that are quick and have small basic blocks and minimal memory accesses.
Understand the cache size and functional units of your processor.
Don't access memory. Don't use objects or garbage collection or any other high level constructs if they expands your working code or data set outside the available cache. If you know the cache size and associativity, lay out the entire working data set you will need in low power mode and fit it all into the dcache (forget some of the "proper" coding practices that scatter the data around in separate objects or data structures if that causes cache trashing). Same with all the subroutines. Put your working code set all in one module if necessary to stripe it all in the icache. If the processor has multiple levels of cache, try to fit in the lowest level of instruction or data cache possible. Don't use floating point unit or any other instructions that may power up any other optional functional units unless you can make a good case that use of these instructions significantly shortens the time that the CPU is out of sleep mode.
etc.

Don't poll, sleep
Avoid using power hungry areas of the chip when possible. For example multipliers are power hungry, if you can shift and add you can save some Joules (as long as you don't do so much shifting and adding that actually the multiplier is a win!)
If you are really serious,l get a power-aware debugger, which can correlate power usage with your source code. Like this

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio