Do modern processors suffer from slowdowns due to instruction dependencies? [duplicate] - performance

This question already has answers here:
Why doesn't GCC optimize a*a*a*a*a*a to (a*a*a)*(a*a*a)?
(12 answers)
Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators)
(1 answer)
About the branchless binary search
(4 answers)
While I was studying computer organization, we talked about data dependencies and how they limit pipeline throughput, because the execution of one instruction is blocked until the instruction it depends on has completed.
Is this still the case in modern processors? Is it possible to create a scenario in a real-world program where the CPU has enough data (it is not waiting for data from memory), but due to data dependencies it is not running at full speed (maximum instructions per cycle)?
I believe that compilers will try to break dependency chains. Are there cases where this is not possible?
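As an illustration (a sketch in C, not from the original thread; the function names are made up), the first loop below is one long chain of dependent floating-point adds, so its speed is bounded by the latency of a single add, while the second splits the work into two independent chains that an out-of-order core can overlap. Compilers generally cannot perform this transformation on floating-point code on their own without flags like -ffast-math, because it changes rounding:

    #include <stddef.h>

    /* Latency-bound: every add waits for the previous one to finish,
     * because s from iteration i is an input to iteration i+1. */
    double sum_serial(const double *a, size_t n) {
        double s = 0.0;
        for (size_t i = 0; i < n; i++)
            s += a[i];
        return s;
    }

    /* Two independent dependency chains: the out-of-order core can keep
     * both adds in flight, so this typically runs closer to the FP-add
     * throughput limit rather than its latency limit. */
    double sum_two_chains(const double *a, size_t n) {
        double s0 = 0.0, s1 = 0.0;
        size_t i = 0;
        for (; i + 1 < n; i += 2) {
            s0 += a[i];      /* chain 0 */
            s1 += a[i + 1];  /* chain 1, independent of chain 0 */
        }
        if (i < n)
            s0 += a[i];      /* leftover element when n is odd */
        return s0 + s1;
    }

Timed on the same data at -O2, the second version usually finishes noticeably faster on recent x86 cores, which is exactly the dependency-limited scenario the question asks about.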

Related

Does hyperthreading work in a concurrency or a parallelism pattern? [duplicate]

This question already has answers here:
Can a hyper-threaded processor core execute two threads at the exact same time?
(1 answer)
Does a hyperthreading CPU implement parallelism or just concurrency?
(2 answers)
How does hyperthreading affect parallelization?
(1 answer)
I've read some articles explaining this, but I'm not very sure about my understanding. Here is what I think: it shows both patterns. When there are no execution-resource conflicts between the two threads, they work in parallel (both threads executing at the same time). Otherwise, they work concurrently (only one thread executing at any given moment).
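One rough way to probe this yourself (a sketch, not from the original thread; it assumes Linux with GCC, and that logical CPUs 0 and 1 are SMT siblings, which you should verify with lscpu, as the pairing differs per machine): time the same busy loop on one logical CPU and then on both siblings at once. If the two-thread run takes about as long as the one-thread run, the siblings behaved like truly parallel cores for this workload; if it takes noticeably longer (but under twice as long), they were contending for shared execution resources, i.e. running partly concurrently.

    /* Build with: gcc -O2 -pthread smt_probe.c */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>
    #include <time.h>

    #define ITERS 400000000ULL      /* enough work to take a fraction of a second */

    static void *busy_loop(void *arg) {
        volatile unsigned long long x = 0;   /* volatile keeps the loop from being optimized away */
        for (unsigned long long i = 0; i < ITERS; i++)
            x += i;
        (void)arg;
        return NULL;
    }

    static pthread_t start_on_cpu(int cpu) {
        pthread_t t;
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        pthread_create(&t, NULL, busy_loop, NULL);
        pthread_setaffinity_np(t, sizeof(set), &set);   /* pin to one logical CPU */
        return t;
    }

    static double seconds_between(struct timespec a, struct timespec b) {
        return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
    }

    int main(void) {
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        pthread_t a = start_on_cpu(0);
        pthread_join(a, NULL);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("one thread : %.2f s\n", seconds_between(t0, t1));

        clock_gettime(CLOCK_MONOTONIC, &t0);
        pthread_t b = start_on_cpu(0);
        pthread_t c = start_on_cpu(1);   /* assumed SMT sibling of logical CPU 0 */
        pthread_join(b, NULL);
        pthread_join(c, NULL);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("two threads: %.2f s\n", seconds_between(t0, t1));
        return 0;
    }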

How can different blocks run in the same SM (streaming multiprocessor)? [duplicate]

This question already has an answer here:
How Concurrent blocks can run a single GPU streaming multiprocessor?
(1 answer)
I've read these two pages: Understanding Streaming Multiprocessors (SM) and Streaming Processors (SP), How Concurrent blocks can run a single GPU streaming multiprocessor?
But I am still confused about the hardware structure.
Is an SM a SIMT (single instruction, multiple threads) structure?
Suppose there are 8 SPs in a given SM. If different blocks can be executed in the same SM, these SPs will be running different instructions. So my understanding is: the SM will give different SPs different instructions.
Are the threads in the same warp executed simultaneously?
Suppose there are 8 SPs in a given SM and a warp is resident in that SM. Since several warps may run in the SM, I suppose 4 SPs are running this warp. There are 32 threads in this warp, but only 4 SPs can run them. So will it actually take 8 cycles to run this warp?
I also heard someone say that all the threads in a warp run serially. I don't know what the truth is...
Several blocks can run in a single SM. According to this presentation (slide 19 - thanks @RobertCrovella), blocks can be emitted from different kernels. When running from the same kernel, the block index can be seen as a supplemental level of thread index, up to some limit (different for each architecture and kernel). However, I have never observed two different streams running at the same time on a given SM.
Depending on the architecture, a single warp instruction may be run by the SPs in a single cycle, hence simultaneously. However, this can only be true for an SM with 32 SPs, and thus not in double precision, for example. Also, there is no guarantee of this. Finally, we have seen configurations where some threads with a high warp index were running before ones with lower indices. Besides synchronization functions and other tools, there is no hard rule on how the instruction scheduler behaves.
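As a back-of-the-envelope check of the numbers in the question (a sketch, not from the original answers; the 8-SP figure matches older compute-capability-1.x SMs): all 8 SPs work on the same warp's instruction at once, so one instruction for a 32-thread warp is issued over 32 / 8 = 4 clock cycles, not 8.

    #include <stdio.h>

    int main(void) {
        int threads_per_warp = 32;   /* fixed by the CUDA execution model */
        int sps_per_sm       = 8;    /* the figure assumed in the question */

        /* The warp's single instruction is applied to 8 threads per clock,
         * so the SM needs 32 / 8 = 4 clocks to cover the whole warp. */
        int issue_cycles = threads_per_warp / sps_per_sm;

        printf("%d threads on %d SPs -> %d clock cycles per warp instruction\n",
               threads_per_warp, sps_per_sm, issue_cycles);
        return 0;
    }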

The maximum number of threads [duplicate]

This question already has answers here:
What's the maximum number of threads in Windows Server 2003?
(8 answers)
What is the maximum number of threads in a 32-bit or a 64-bit application developed in Delphi?
I need to know the limit on threads running simultaneously in a 32-bit application, because I'm doing performance analysis and I want to let the OS manage the execution order of the threads that are waiting.
You might want to read this answer: https://stackoverflow.com/a/481919/1560865
Still, what I wrote in my comment above stays partially true (but please also notice Martin James' objection to it below).
Notice that - generally speaking - if you create way more threads than processor cores (or virtual equivalents), you will not gain any performance advantage. If you create too many, you'll even end up with pretty bad results like these: thedailywtf.com/Articles/Less-is-More.aspx So are you completely sure that you'll need the theoretically possible maximum number of threads?
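To see where the real ceiling comes from, here is an illustrative Win32 sketch in C (not from the original answers; the same idea applies from Delphi): in a 32-bit process the practical limit is usually how many stack reservations fit into the roughly 2 GB of user address space, so with the default 1 MB stack you run out somewhere around 2000 threads. The program keeps creating suspended threads with a small reserved stack until CreateThread fails, then prints the count:

    #include <windows.h>
    #include <stdio.h>

    static DWORD WINAPI idle_thread(LPVOID param) {
        (void)param;
        return 0;                    /* never runs: every thread stays suspended */
    }

    int main(void) {
        unsigned count = 0;

        for (;;) {
            /* Reserve only 64 KB of stack instead of the default 1 MB; the
             * flag makes dwStackSize a reservation, not just an initial commit.
             * Handles are deliberately not closed -- we only want the count. */
            HANDLE h = CreateThread(NULL, 64 * 1024, idle_thread, NULL,
                                    CREATE_SUSPENDED | STACK_SIZE_PARAM_IS_A_RESERVATION,
                                    NULL);
            if (h == NULL)
                break;               /* address space (or another resource) exhausted */
            count++;
        }
        printf("Created %u suspended threads before CreateThread failed\n", count);
        return 0;
    }

With the default stack size the same loop stops far earlier, which is the point of the linked answer: the number is a consequence of address space and stack size, not a fixed constant to design around.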

Why are GPUs more powerful than CPUs [closed]

How are GPUs so much faster than CPUs? I've read articles that talk about how GPUs are much faster at breaking passwords than CPUs. If that's the case, then why can't CPUs be designed the same way as GPUs to be just as fast?
GPUs get their speed at a cost. A single GPU core actually works much slower than a single CPU core. For example, the Fermi GTX 580 has a core clock of 772 MHz. You wouldn't want your CPU to run at such a low core clock nowadays...
The GPU, however, has several cores (up to 16), each operating in 32-wide SIMD mode. That means around 512 (16 × 32) operations executed in parallel. Common CPUs, however, have up to 4 or 8 cores and can operate in 4-wide SIMD, which gives much lower parallelism.
Certain types of algorithms (graphics processing, linear algebra, video encoding, etc...) can be easily parallelized on such a huge number of cores. Breaking passwords falls into that category.
Other algorithms, however, are really hard to parallelize, and there is ongoing research in this area... Those algorithms would perform really badly if they were run on the GPU.
CPU companies are now trying to approach GPU parallelism without sacrificing the capability of running single-threaded programs, but the task is not an easy one. The Larrabee project (currently abandoned) is a good example of the problems: Intel worked on it for years, but it is still not available on the market.
GPUs are designed with one goal in mind: process graphics really fast. Since this is the only concern they have, there have been some specialized optimizations in place that allow for certain calculations to go a LOT faster than they would in a traditional processor.
In the case of password cracking (or the molecular-dynamics Folding@home project), what has happened is that programmers have found ways of leveraging these optimized processing paths to do things like crunch passwords at a much faster rate.
Your standard CPU has to handle many more kinds of calculation and processing than a graphics processor does, so it can't be optimized in the same manner.

How to measure x86 and x86-64 assembly commands execution time in processor cycles? [duplicate]

This question already has answers here:
How many CPU cycles are needed for each assembly instruction?
(5 answers)
I want to write a bunch of optimizations for gcc using genetic algorithms.
I need to measure the execution time of assembly functions for some statistics and fitness functions.
The usual time measurement can't be used, because it is influenced by the cache.
So I need a table where I can see something like this:
command | operands | operand sizes | execution cycles
Am I misunderstanding something?
Sorry for my bad English.
With modern CPUs, there are no simple tables to look up how long an instruction will take to complete (although such tables exist for some old processors, e.g. the 486). Your best information on what each instruction does and how long it might take comes from the chip manufacturer; e.g. Intel's documentation manuals are quite good (there's also an optimisation manual on that page).
On pretty much all modern CPUs there's also the RDTSC instruction, which reads the time stamp counter of the processor the code is running on into EDX:EAX. There are pitfalls with this too, but essentially, if the code you are profiling is representative of a real use situation and its execution doesn't get interrupted or shifted to another CPU core, then you can use this instruction to get the timings you want. That is, surround the code you are optimising with two RDTSC instructions and take the difference in TSC as the timing. (Variance in timings across different tests/situations can be large; statistics is your friend.)
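A minimal sketch of that RDTSC recipe in C (GCC or Clang on x86/x86-64 assumed; the measured function is a made-up stand-in for whatever code the genetic algorithm is scoring). __rdtsc() wraps the RDTSC instruction, the lfence intrinsics are one common way to keep the out-of-order core from letting the measured work drift across the reads, and repeating the measurement while keeping the minimum filters out interrupts:

    #include <stdio.h>
    #include <stdint.h>
    #include <x86intrin.h>

    /* Stand-in for the function under test. */
    static volatile uint64_t sink;
    static void work_under_test(void) {
        uint64_t x = 1;
        for (int i = 0; i < 1000; i++)
            x = x * 6364136223846793005ULL + 1;   /* dependent multiply chain */
        sink = x;
    }

    int main(void) {
        uint64_t best = UINT64_MAX;

        /* Interrupts and other disturbances only ever add ticks, so the
         * minimum over many runs is the least-polluted sample. */
        for (int r = 0; r < 1000; r++) {
            _mm_lfence();
            uint64_t t0 = __rdtsc();
            _mm_lfence();

            work_under_test();

            _mm_lfence();
            uint64_t t1 = __rdtsc();
            _mm_lfence();

            if (t1 - t0 < best)
                best = t1 - t0;
        }
        printf("min TSC ticks for work_under_test: %llu\n",
               (unsigned long long)best);
        return 0;
    }

Note that on recent CPUs the TSC ticks at a constant reference frequency rather than the current core clock, so treat the numbers as relative timings rather than exact core cycles.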
reading the system clock value?
You can instrument your code using assembly (rdtsc and friends) or using an instrumentation API like PAPI. Accurately measuring the clock cycles spent executing a single instruction is not possible, however - you can refer to your architecture's developer manuals for the best estimates.
In both cases, you should be careful to take into account the effects of running in an SMP environment.

Resources