quickest cpu instructions

Is there a list of instructions ordered by speed (generally speaking, as it will vary with the architecture)? I was told years ago by my assembly professor that shift was the quickest.
How are +, -, *, / ordered?

Agner Fog has a very nice list of instruction latencies and throughputs for x86 processors:
http://www.agner.org/optimize/instruction_tables.pdf
EDIT:
That said, you really have to be at this level of HPC to be able to make use of this information to improve performance.

Related

Do (sampling) profilers still "lie" these days?

Most of my limited experience with profiling native code is on a GPU rather than on a CPU, but I see some CPU profiling in my future...
Now, I've just read this blog post:
How profilers lie: The case of gprof and KCacheGrind
about what profilers measure and what they show you, which is likely not what you expect if you're interested in distinguishing between different call paths and the time spent in them.
My question is: Is this still the case today (5 years later)? That is, do sampling profilers (i.e. those that don't slow execution down terribly) still behave the way gprof used to (or callgrind without --separate-callers=N)? Or do profilers nowadays customarily record the entire call stack when sampling?

No, many modern sampling profilers don't exhibit the problem described regarding gprof.
In fact, even when that was written, the specific problem was actually more a quirk of the way gprof uses a mix of instrumentation and sampling and then tries to reconstruct a hypothetical call graph based on limited caller/callee information and combine that with the sampled timing information.
Modern sampling profilers, such as perf, VTune, and various language-specific profilers for languages that don't compile to native code, can capture the full call stack with each sample, which provides accurate times with respect to that issue. Alternatively, you might sample without collecting call stacks (which greatly reduces the sampling cost) and then present the information without any caller/callee information, which would still be accurate.
This was largely true even in the past, so I think it's fair to say that sampling profilers never, as a group, really exhibited that problem.
Of course, there are still various ways in which profilers can lie. For example, getting results accurate to the instruction level is a very tricky problem, given modern CPUs with hundreds of instructions in flight at once, possibly across many functions, and complex performance models where instructions may have a very different in-context cost as compared to their nominal latency and throughput values. Even that tricky issue can be helped with "hardware assist", such as on recent x86 chips with PEBS support and later related features that help you pinpoint an instruction in a less biased way.

Regarding gprof, yes, it's still the case today. This is by design, to keep the profiling overhead small. From the up-to-date documentation:
Some of the figures in the call graph are estimates—for example, the
children time values and all the time figures in caller and subroutine
lines.
There is no direct information about these measurements in the profile
data itself. Instead, gprof estimates them by making an assumption
about your program that might or might not be true.
The assumption made is that the average time spent in each call to any
function foo is not correlated with who called foo. If foo used 5
seconds in all, and 2/5 of the calls to foo came from a, then foo
contributes 2 seconds to a’s children time, by assumption.
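The arithmetic behind that assumption can be sketched in a few lines. This is only an illustration of the proportional split described in the quote, not gprof's actual code; the function name, the second caller "b", and the call counts are made up for the example.

```python
def estimate_children_time(total_callee_seconds, calls_by_caller):
    """Split a callee's total time among callers in proportion to call counts.

    This mirrors the quoted assumption: the average time per call is taken
    to be the same no matter who the caller is.
    """
    total_calls = sum(calls_by_caller.values())
    return {
        caller: total_callee_seconds * calls / total_calls
        for caller, calls in calls_by_caller.items()
    }


# The documentation's example: foo used 5 seconds in all, and 2/5 of the
# calls came from a. The caller "b" and the raw counts are hypothetical.
shares = estimate_children_time(5.0, {"a": 2, "b": 3})
print(shares)
```

If the two calls from a happened to be the expensive ones, the real split could be very different, and nothing in the call counts would reveal it.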
Regarding KCacheGrind, little has changed since the article was written. You can check out the change log and see that the latest version was published on April 5, 2013, and includes only unrelated changes. You can also refer to Josef Weidendorfer's comments under the article (Josef is the author of KCacheGrind).

If you noticed, I contributed several comments to that post you referenced, but it's not just that profilers give you bad information, it's that people fool themselves about what performance actually is.
What is your goal? Is it to A) find out how to make the program as fast as possible? Or is it to B) measure time taken by various functions, hoping that will lead to A? (Hint - it doesn't.) Here's a detailed list of the issues.
To illustrate: You could, for example, be calling a teeny innocent-looking little function somewhere that just happens to invoke nine yards of system code including reading a .dll to extract a string resource in order to internationalize it. This could be taking 50% of wall-clock time and therefore be on the stack 50% of wall-clock time. Would a "CPU-profiler" show it to you? No, because practically all of that 50% is doing I/O. Do you need many many stack samples to know to 3 decimal places exactly how much time it's taking? Of course not. If you only got 10 samples it would be on 5 of them, give or take. Once you know that teeny routine is a big problem, does that mean you're out of luck because somebody else wrote it? What if you knew what the string was that it was looking up? Does it really need to be internationalized, so much so that you're willing to pay a factor of two in slowness just for that? Do you see how useless measurements are when your real problem is to understand qualitatively what takes time?
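To put a number on "give or take": if each stack sample is assumed to land in that routine independently with probability equal to its share of wall-clock time (that independence is my assumption for this sketch), the number of hits follows a binomial distribution. The function names are made up.

```python
from math import comb, sqrt


def sample_hit_stats(n_samples, fraction):
    """Mean and standard deviation of the number of samples showing the routine."""
    mean = n_samples * fraction
    std_dev = sqrt(n_samples * fraction * (1.0 - fraction))
    return mean, std_dev


def prob_at_least(k, n_samples, fraction):
    """Probability that the routine appears on at least k of n_samples."""
    return sum(
        comb(n_samples, i) * fraction**i * (1.0 - fraction) ** (n_samples - i)
        for i in range(k, n_samples + 1)
    )


mean, std_dev = sample_hit_stats(10, 0.5)
print(f"expected hits: {mean:.1f} +/- {std_dev:.2f}")
print(f"P(seen on 2 or more of 10 samples): {prob_at_least(2, 10, 0.5):.3f}")
```

Under that assumption, 10 samples give about 5 hits with a standard deviation of roughly 1.6, and the routine shows up on two or more samples about 99% of the time, which is plenty to notice it.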
I could go on and on with examples like this...

Suggest an OpenMP program that has noticeable speedup and the most important concepts in it for a talk

I am going to give a lecture on OpenMP and I want to write a program using OpenMP lively. What program do you suggest that covers the most important concepts of OpenMP and has a noticeable speedup? I want an awesome example program; I would appreciate help from any of you who are experts in OpenMP.
I am looking for a technical and interesting example with nice output.
I want to write two programs lively: the first to illustrate the most important OpenMP concepts with an impressive speedup, and the second as a hands-on exercise that everyone writes at the same time.
My audience may be very amateur.

Personally I wouldn't say that the most impressive aspect of OpenMP is the scalability of the codes you can write with it. I'd say that a more impressive aspect is the ease with which one can take an existing serial program and, with only a few OpenMP directives, turn it into a parallel program with satisfactory scalability.
So I'd suggest that you take any program (or part of any program) of interest to your audience, better yet a program your audience is familiar with, and parallelise it right there and then in your lecture, lively as you put it. I'd be impressed if a lecturer could show me, say, a 4 times speedup on 8 cores with 5 minutes coding and a re-compilation. And that leads on to all sorts of interesting topics about why you don't (always, easily) get 8 times speedup on 8 cores.
Of course, like all stage illusionists, you'll have to choose your example carefully and rehearse to ensure that you do get an impressive-enough speedup to support your argument.
Personally I'd be embarrassed to use an embarrassingly parallel program for such a demo; the more perceptive members of the audience might be provoked into a response such as meh.
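One common way to open the "why not 8 times on 8 cores" discussion is Amdahl's law, which the answer above does not name; I am bringing it in here as a plain model. The sketch below ignores thread start-up, synchronization and memory bandwidth, so treat it as an upper bound rather than a prediction. The function names are made up.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Ideal speedup when only parallel_fraction of the runtime scales with cores."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores)


def parallel_fraction_for(speedup, cores):
    """Invert Amdahl's law: the parallel fraction implied by an observed speedup."""
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / cores)


p = parallel_fraction_for(4.0, 8)
print(f"parallel fraction implied by 4x on 8 cores: {p:.3f}")
print(f"speedup predicted for that fraction on 16 cores: {amdahl_speedup(p, 16):.2f}")
```

In this model a 4 times speedup on 8 cores corresponds to about 6/7 of the runtime being parallel, and doubling the core count again buys much less than another factor of two.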

(1) Matrix multiply
Perhaps it's the simplest example (though matrix addition would be simpler).
(2) Mandelbrot
http://en.wikipedia.org/wiki/Mandelbrot_set
Mandelbrot is also embarrassingly parallel, and OpenMP can achieve decent speedups. You can even use graphics to visualize it. Mandelbrot is also an interesting example because it has workload imbalance. You may see different speedups based on scheduling policies (e.g., schedule(dynamic,1) vs. schedule(static)), and different threading libraries (e.g., Cilk Plus or TBB).
(3) A couple of mathematical kernels
For example, FFT (non-recursive version) is also embarrassingly parallel.
Take a look at "OmpSCR" benchmarks: http://sourceforge.net/projects/ompscr/ This suite has simple OpenMP examples.
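The workload imbalance mentioned for Mandelbrot can be shown without any threads at all. The sketch below is a simulation in Python, not OpenMP code: it counts escape-time iterations per image row as a stand-in for work, then compares contiguous blocks of rows (roughly what schedule(static) does) with handing the next row to whichever worker is free (roughly schedule(dynamic,1)). The grid size, iteration limit, worker count and the zero-overhead scheduling model are all assumptions for the demo.

```python
import heapq


def row_cost(y, width=64, max_iter=50):
    """Total escape-time iterations for one image row (a proxy for its work)."""
    total = 0
    for col in range(width):
        c = complex(-2.0 + 2.5 * (col + 0.5) / width, y)
        z = 0j
        n = 0
        while n < max_iter and abs(z) <= 2.0:
            z = z * z + c
            n += 1
        total += n
    return total


def static_makespan(costs, workers):
    """Contiguous equal-sized blocks of rows, like schedule(static)."""
    block = (len(costs) + workers - 1) // workers
    return max(sum(costs[i:i + block]) for i in range(0, len(costs), block))


def dynamic_makespan(costs, workers):
    """Each idle worker grabs the next row, like schedule(dynamic,1)."""
    finish_times = [0] * workers
    heapq.heapify(finish_times)
    for cost in costs:
        earliest = heapq.heappop(finish_times)
        heapq.heappush(finish_times, earliest + cost)
    return max(finish_times)


height, workers = 64, 4
costs = [row_cost(-1.25 + 2.5 * (row + 0.5) / height) for row in range(height)]
ideal = sum(costs) / workers
print(f"static : {static_makespan(costs, workers) / ideal:.2f}x the ideal time")
print(f"dynamic: {dynamic_makespan(costs, workers) / ideal:.2f}x the ideal time")
```

The rows through the middle of the set cost far more than the rows near the top and bottom edges, so the workers that get the middle blocks finish last under the static split. That is a nice thing to let an audience discover before changing the schedule clause.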

Where to learn about low-level, hard-core performance stuff?

This is actually a two-part question:
(1) For people who want to squeeze every clock cycle, people talk about pipelines, cache locality, etc.
I have seen these low-level performance techniques mentioned here and there, but I have not seen a good introduction to the subject, from start to finish. Any resource recommendations? (Google gave me definitions and papers, whereas I'd really appreciate some kind of worked examples, tutorials, or real-life hands-on material.)
(2) How does one actually measure this kind of thing? Like, as in a profiler of some sort? I know we can always change the code, see the improvement and theorize in retrospect; I am just wondering if there are established tools for the job.
(I know algorithm optimization is where the orders of magnitude are. I am interested in the metal here.)

The chorus of replies is, "Don't optimize prematurely." As you mention, you will get a lot more performance out of a better design than a better loop, and your maintainers will appreciate it, as well.
That said, to answer your question:
Learn assembly. Lots and lots of assembly. Don't MUL by a power of two when you can shift. Learn the weird uses of xor to copy and clear registers. For specific references,
http://www.mark.masmcode.com/ and http://www.agner.org/optimize/
Yes, you need to time your code. On *nix, it can be as easy as time { commands ; } but you'll probably want to use a full-featured profiler. GNU gprof is open source http://www.cs.utah.edu/dept/old/texinfo/as/gprof.html
If this really is your thing, go for it, have fun, and remember, lots and lots of bit-level math. And your maintainers will hate you ;)

EDIT/REWRITE:
If it is books you need, Michael Abrash did a good job in this area: Zen of Assembly Language, a number of magazine articles, the big black book of graphics programming, etc. Much of what he was tuning for is no longer a problem; the problems have changed. What you will get out of this is an idea of the kinds of things that can cause bottlenecks and the kinds of ways to solve them. Most important is to time everything, and to understand how your timing measurements work so that you are not fooling yourself by measuring incorrectly. Time the different solutions and try crazy, weird solutions; you may find an optimization that you were not aware of and didn't realize was there until you exposed it.
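As a small illustration of "time everything, and understand how your timing works", here is a sketch of a repeat-and-take-the-minimum harness. The helper name best_of and the two candidate functions are made up for the example, and Python stands in for whatever language is actually being tuned.

```python
import time


def best_of(func, repeats=5):
    """Run func several times and return the shortest wall-clock time in seconds.

    Taking the minimum reduces noise from other processes and from one-off
    effects such as cold caches on the first run.
    """
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return min(timings)


def sum_with_loop(n=100_000):
    total = 0
    for i in range(n):
        total += i
    return total


def sum_with_builtin(n=100_000):
    return sum(range(n))


for candidate in (sum_with_loop, sum_with_builtin):
    print(f"{candidate.__name__}: {best_of(candidate) * 1e3:.3f} ms")
```

A single run can easily be off by more than the difference being measured, which is one of the ways to fool yourself.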
I have only just started reading it, but See MIPS Run (early/first edition) looks good so far (note that ARM took over from MIPS as the leader in the processor market, so the MIPS and RISC hype is a bit dated). There are a number of textbooks, old and new, to be had about MIPS, which was designed for performance (at the cost of the software engineer in some ways).
The bottlenecks today fall into the categories of the processor itself and the I/O around it and what is connected to that I/O. The insides of the processor chips themselves (for higher end systems) run much faster than the I/O can handle, so you can only tune so far before you have to go off chip and wait forever. Getting from the train to your destination half a minute faster when the train ride was 3 hours is not necessarily a worthwhile optimization.
It is all about learning the hardware; you can probably stay within the ones and zeros world and not have to get into the actual electronics. But without really knowing the interfaces and internals you really cannot do much performance tuning. You might re-arrange or change a few instructions and get a little boost, but to make something several hundred times faster you need more than that. Learning a lot of different instruction sets (assembly languages) helps get into the processors. I would recommend simulating HDL, for example processors at opencores, to get a feel for how some folks do their designs and to get a solid handle on how to really squeeze clocks out of a task. Processor knowledge is big, memory interfaces are a huge deal and need to be learned, as do media (flash, hard disks, etc.), displays and graphics, networking, and all the types of interfaces between all of those things. Understanding at the clock level, or as close to it as you can get, is what it takes.

Intel and AMD provide optimization manuals for x86 and x86-64.
http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html/
http://developer.amd.com/documentation/guides/pages/default.aspx
Another excellent resource is Agner Fog's site.
http://www.agner.org/optimize/
Some of the key points (in no particular order):
Alignment; memory, loop/function labels/addresses
Cache; non-temporal hints, page and cache misses
Branches; branch prediction and avoiding branching with compare&move op-codes
Vectorization; using SSE and AVX instructions
Op-codes; avoiding slow running op-codes, taking advantage of op-code fusion
Throughput / pipeline; re-ordering or interleaving op-codes to perform separate tasks, avoiding partial stalls and saturating the processor's ALUs and FPUs
Loop unrolling; performing multiple iterations for a single "loop comparison, branch"
Synchronization; using atomic op-codes (or the LOCK prefix) to avoid high-level synchronization constructs

Yes, measure, and yes, know all those techniques.
Experienced people will tell you "don't optimize prematurely", which I relate as simply "don't guess".
They will also say "use a profiler to find the bottleneck", but I have a problem with that. I hear lots of stories of people using profilers and either liking them a lot or being confused with their output.
SO is full of them.
What I don't hear a lot of is success stories, with speedup factors achieved.
The method I use is very simple, and I've tried to give lots of examples, including this case.

I'd suggest Optimizing subroutines in assembly language: An optimization guide for x86 platforms.
It's quite heavy stuff though ;)

What is the platform independent algorithm that returns a measurable value to test Moore's Law? [closed]

Closed 11 years ago.
Just out of curiosity...
Is there a platform-independent algorithm that produces a comparable value, so that I can implement the algorithm on different machines that were introduced to the market two years apart and see how well they fit Moore's Law by comparing the values the algorithm returns on those machines?

Most of the transistors that are put onto your CPU by Intel and AMD are put there with the purpose of speeding it up one way or another, so a possible proxy for "how many transistors are on there?" is, "how fast is it?". Often when people talk about Moore's law in relation to a CPU it's performance that they're talking about, even though that's not what Moore said.
Benchmarking a CPU is notoriously arbitrary, though. What weightings do you give to your various speed tests? Suppose that next year, Intel invents 20 new SIMD instructions, and adds corresponding silicon to their chips to implement them. Unless your code uses those instructions, there's no way it's going to notice that they're there, so they won't affect your results and you won't report an increase in your performance/transistor index. Since they were invented after you wrote your code, you can't execute them explicitly, so the only way they're going to be used is if an up-to-date compiler, with options to target the new version of the CPU, finds some code in your benchmark that it thinks will benefit from the new instructions. Not very reliable, you simply can't detect new transistors if you can't find a way to use them.
Performance of a single core of a CPU on simple benchmarks has in any case hit something of a roadblock in the last few years. CPU manufacturers are adding cores, and adding special-purpose instructions and silicon, so programs have more resources to draw on if they know how to use them, but boring old arithmetic isn't getting much faster. It's hard to know for what special purposes CPU manufacturers will be adding transistors in 5 or 10 years time, but if you can do that then you could possibly write benchmarks now that will tell you when they've done it.
I don't know much about GPUs, but if you can somehow detect the number of GPU cores on your machine (counting parallel shaders and whatnot), that might actually be the best proxy for raw number of transistors. I guess the number of transistors in each core does go up over time too, but the number of cores on modern graphics cards is rocketing, so actually that might account for the bulk of the new transistors related to processing. Whether that will still be the case in 5 or 10 years, again, who knows.
Another big transistor count is RAM - presumably for a given type of RAM, the number of transistors is pretty much proportional to capacity, and that at least is easily measured using OS-specific functions.
If you stick a SSD in a machine, I bet you pile on the transistor count too. Is that the sort of thing you're interested in, though? Really Moore's law was about single ICs, not the total contents of a beige (well, white or silver these days) box at a given price point.
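If you do settle on some measured value as your proxy, the comparison itself is simple arithmetic. The sketch below assumes plain exponential growth and uses the popular "doubles every two years" form of the law as its default; the function names and the scores are made up.

```python
from math import log2


def doubling_period(value_old, value_new, years_apart):
    """Years needed for the measured value to double, assuming exponential growth."""
    return years_apart / log2(value_new / value_old)


def expected_ratio(years_apart, period_years=2.0):
    """Growth factor predicted if the value doubles every period_years."""
    return 2.0 ** (years_apart / period_years)


# Hypothetical scores from two machines released four years apart.
print(doubling_period(100.0, 400.0, 4))
print(expected_ratio(4))
```

A measured doubling period well above two years would say more about the benchmark's blind spots described above than about the transistor count.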

Well, the algorithm could be really simple, like calculating FLOPS (floating-point operations per second). Just get the system time, perform 1kk (one million) floating-point operations, get the time again, and take the difference; dividing the operation count by the elapsed time gives the rate (or use the LINPACK benchmarks, which are used to rate supercomputers). However, implementing this in a platform-independent way would be tricky.
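A rough sketch of that recipe follows. The function name and constants are made up, and in an interpreted language like Python the loop overhead dominates, so the figure it prints says more about the interpreter than about the hardware, which is a small taste of why the platform-independent part is tricky.

```python
import time


def estimate_flops(n_ops=1_000_000):
    """Time n_ops floating-point operations and return operations per second."""
    x = 1.0
    start = time.perf_counter()
    for _ in range(n_ops // 2):
        x = x * 1.0000001 + 0.0000001  # one multiply and one add
    elapsed = time.perf_counter() - start
    # x is returned so the arithmetic has a visible result.
    return n_ops / elapsed, x


flops, _ = estimate_flops()
print(f"about {flops / 1e6:.1f} million floating-point operations per second")
```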

What are the advantages and disadvantages of GPGPU (general-purpose GPU) development? [closed]

Closed 10 years ago.
I am wondering what the key thing is that helps you in GPGPU development and, of course, what constraints you find unacceptable.
What comes to mind for me:
Key advantage: the raw power of these things
Key constraint: the memory model
What's your view?

You have to be careful with how you interpret Tim Sweeney's statements in that Ars interview. He's saying that having two separate platforms (the CPU and GPU), one suitable for single-threaded performance and one suitable for throughput-oriented computing, will soon be a thing of the past, as our applications and hardware grow towards one another.
The GPU grew out of technology limitations with the CPU, which made the arguably more natural algorithms like ray-tracing and photon mapping nigh-undoable at reasonable resolutions and framerates. In came the GPU, with a wildly different and restrictive programming model, but maybe 2 or 3 orders of magnitude better throughput for applications painstakingly coded to that model. The two machine models had (and still have) essentially different coding styles, languages (OpenGL, DirectX, shader languages vs. traditional desktop languages), and workflows. This makes code reuse, and even algorithm/programming skill reuse, extremely difficult, and hamstrings any developer who wants to make use of a dense parallel compute substrate into this restrictive programming model.
Finally, we're coming to a point where this dense compute substrate is similarly programmable to a CPU. Although there is still a sizeable performance delta between one "core" of these massively-parallel accelerators (though the threads of execution within, for example, an SM on the G80, are not exactly cores in the traditional sense) and a modern x86 desktop core, two factors drive convergence of these two platforms:
Intel and AMD are moving towards more, simpler cores on x86 chips, converging the hardware with the GPU (where units are becoming more coarse-grained and programmable over time).
This and other forces are spawning many new applications that can take advantage of Data- or Thread-Level Parallelism (DLP/TLP), effectively utilizing this kind of substrate.
So, what Tim was saying is that the two distinct platforms will converge, to an even greater extent than, for instance, OpenCL affords. A salient quote from the interview:
TS: No, I see exactly where you're
heading. In the next console
generation you could have consoles
consist of a single non-commodity
chip. It could be a general processor,
whether it evolved from a past CPU
architecture or GPU architecture, and
it could potentially run
everything—the graphics, the AI,
sound, and all these systems in an
entirely homogeneous manner. That's a
very interesting prospect, because it
could dramatically simplify the
toolset and the processes for creating
software.
Right now, in the course of shipping
Unreal 3, we have to use multiple
programming languages. We use one
programming language for writing pixel
shaders, another for writing gameplay
code, and then on PlayStation 3 we use
yet another compiler to write code to
run on the Cell processor. So the
PlayStation 3 ends up being a
particular challenge, because there
you have three completely different
processors from different vendors with
different instruction sets and
different compilers and different
performance techniques. So, a lot of
the complexity is unnecessary and
makes load-balancing more difficult.
When you have, for example, three
different chips with different
programming capabilities, you often
have two of those chips sitting idle
for much of the time, while the other
is maxed out. But if the architecture
is completely uniform, then you can
run any task on any part of the chip
at any time, and get the best
performance tradeoff that way.

I found this article interesting; it is about how GPUs won't be as necessary with the speed of CPUs and the number of cores ever increasing.
http://arstechnica.com/articles/paedia/gpu-sweeney-interview.ars

GPUs used to be interesting for their parallel architectures and extra silicon that was mostly idle and hence could be used on the side for general-purpose programming tasks; see http://en.wikipedia.org/wiki/CUDA
but that might not be too relevant in the face of Lou's answer above.

The key advantage is gigaflops: raw power. Disadvantages include a limited, non-orthogonal instruction set and programming model.
Here's a survey paper:
http://graphics.idav.ucdavis.edu/publications/print_pub?pub_id=907
The Wikipedia article is a pretty good start.
Lou Franco points to an interview with Tim Sweeney; here are the slides of a talk he gave, which have more detail:
http://www.scribd.com/doc/5687/The-Next-Mainstream-Programming-Language-A-Game-Developers-Perspective-by-Tim-Sweeney
Might also nose around:
http://gpgpu.org
