parallel code slower on multicore AMD - gcc

Parallelized code (OpenMP), compiled on an Intel machine (Linux) with gcc, runs much faster on an Intel computer than on an AMD with twice as many cores. I see that all the cores are in use, but it takes about 10 times more CPU time on the AMD. I had heard about the "cripple AMD" behaviour in the Intel compiler, but I am using gcc! Thanks in advance.

Intel has hyper-threading technology in their modern processor cores, which essentially means that you have multiple hardware contexts running on a single core simultaneously. Are you taking this into account when you make the comparison?
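Before comparing the two machines, it may help to check that both runs really use the same number of OpenMP threads and a comparable thread placement. Below is a minimal sketch (not the asker's code; it assumes gcc with -fopenmp) that prints the thread count and times an identical parallel loop; placement can then be matched on both boxes via standard environment variables such as OMP_NUM_THREADS, OMP_PROC_BIND and OMP_PLACES.

```cpp
// Minimal sketch (not the asker's code). Build: g++ -O2 -fopenmp omp_check.cpp
#include <omp.h>
#include <cstdio>
#include <vector>

int main() {
    const long long n = 50000000;            // ~400 MB of doubles
    std::vector<double> a(n, 1.0);
    double sum = 0.0;

    std::printf("OpenMP will use up to %d threads\n", omp_get_max_threads());

    double t0 = omp_get_wtime();
    #pragma omp parallel for reduction(+:sum)
    for (long long i = 0; i < n; ++i)
        sum += a[i] * 1.000000001;           // simple memory-bound work
    double t1 = omp_get_wtime();

    std::printf("sum = %.3f, elapsed = %.3f s\n", sum, t1 - t0);
    return 0;
}
```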

Related

Can CUDA cores run things absolutely parallel or do they need context switching?

Can a CUDA INT32 core process two different integer instructions completely in parallel, without context switching? I know that it is not possible on a CPU, but is it on an NVIDIA GPU? I know that an SM can run warps, and if a core has to wait for some information, then it gets another thread from the dispatch unit.
I know that it is not possible on a CPU, but is it on an NVIDIA GPU?
This assertion is wrong on modern mainstream CPUs (e.g. for nearly all x86-64 processors of at least the last decade, starting from Intel Skylake or AMD Zen 2). Indeed, modern x86-64 Intel/AMD processors can generally compute two 256-bit AVX SIMD vectors in parallel, since there are generally two SIMD units. Processors like Intel Skylake also have 4 ALUs capable of computing 4 basic arithmetic operations (e.g. add, sub, and, xor) in parallel per cycle. Some instructions, like division, are far more expensive and do not run in parallel on such architectures, though they are well pipelined. The instructions can come from the same thread on the same logical core, or possibly from 2 threads (of possibly 2 different processes) scheduled on 2 logical cores, without any context switches. Note that recent high-end ARM processors can also do this (even some mobile processors).
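To make the superscalar point concrete, here is a hypothetical micro-benchmark (my own sketch, not part of the answer): the same number of shift+xor updates arranged first as one serial dependency chain and then as four independent chains. On a core with several ALUs the four chains can be issued in parallel within a single thread, so the second loop usually finishes several times faster; the exact ratio depends on the CPU and compiler.

```cpp
// Hypothetical micro-benchmark: identical work as one serial dependency chain
// vs. four independent chains. Build: g++ -O2 -fno-tree-vectorize ilp_demo.cpp
#include <chrono>
#include <cstdint>
#include <cstdio>

int main() {
    const std::uint64_t N = 400000000;

    auto t0 = std::chrono::steady_clock::now();
    std::uint64_t d = 1;
    for (std::uint64_t i = 0; i < N; ++i)
        d = (d >> 1) ^ i;                        // every update waits for the previous one

    auto t1 = std::chrono::steady_clock::now();
    std::uint64_t a = 1, b = 2, c = 3, e = 4;
    for (std::uint64_t i = 0; i < N; i += 4) {   // four chains with no dependency between them
        a = (a >> 1) ^ i;
        b = (b >> 1) ^ (i + 1);
        c = (c >> 1) ^ (i + 2);
        e = (e >> 1) ^ (i + 3);
    }
    auto t2 = std::chrono::steady_clock::now();

    using ms = std::chrono::milliseconds;
    std::printf("serial chain     : %lld ms (result %llu)\n",
                (long long)std::chrono::duration_cast<ms>(t1 - t0).count(),
                (unsigned long long)d);
    std::printf("4 parallel chains: %lld ms (result %llu)\n",
                (long long)std::chrono::duration_cast<ms>(t2 - t1).count(),
                (unsigned long long)(a ^ b ^ c ^ e));
    return 0;
}
```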
Can a CUDA INT32 core process two different integer instructions completely in parallel, without context switching?
NVIDIA GPUs execute groups of threads known as warps in SIMT (Single Instruction, Multiple Thread) fashion. Thus, one instruction operates on 32 items in parallel (though, theoretically, the hardware is free not to do that completely in parallel). A kernel launch basically consists of many blocks, and blocks are scheduled onto SMs. An SM can operate on many blocks concurrently, so there is a massive amount of parallelism available.
Whether a specific GPU can execute two INT32 warps in parallel depends on the target architecture, not on CUDA itself. On modern NVIDIA GPUs, each SM can be split into multiple partitions that can each execute instructions on blocks independently of the other partitions. For example, AFAIK, on a Pascal GP104 there are 20 SMs, and each SM has 4 partitions capable of running SIMD instructions operating on 1 warp (32 items) at a time. In practice, things can be a bit more complex on newer architectures. You can get more information here.
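For reference, the SM count and warp size discussed above can be queried at runtime. The sketch below is a plain host-side C++ program using the CUDA runtime API (illustrative only; it assumes the CUDA toolkit is installed and is built with nvcc).

```cpp
// Illustrative host-side query (assumes the CUDA toolkit is installed).
// Build: nvcc sm_info.cu -o sm_info
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        std::printf("no CUDA device found\n");
        return 1;
    }
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        std::printf("device %d: %s (compute capability %d.%d)\n",
                    dev, prop.name, prop.major, prop.minor);
        std::printf("  SMs: %d, warp size: %d, max threads per SM: %d\n",
                    prop.multiProcessorCount, prop.warpSize,
                    prop.maxThreadsPerMultiProcessor);
    }
    return 0;
}
```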

Xcode compile times: which Mac configuration delivers noticeably better performance? [closed]

I looked at the different configurations of Macs available: MacBook Pro, iMac and iMac Pro.
Are the huge configurations of e.g. the iMac Pro (Xeon, 18 cores, etc.) noticeably speeding up Xcode compilation times? Or are those specs tailored to video editing?
Also if I compare
3.2 GHz 8-Core Intel Xeon W Processor
4.2 GHz Quad-Core Intel Core i7 Processor
more cores with less GHz, or the other way round? What's most important for Xcode compilation performance: cores? Processor? GHz?
It's super easy.
Xcode uses processor power for compiling tasks.
CPU specification formula:
3.2 GHz * 8 cores = 25.6 GHz
4.2 GHz * 4 cores = 16.8 GHz
So, answering your question: the most important thing for Xcode compilation performance is processor power.
The first, Xeon-based processor will be much more productive for the Xcode routine. Use that formula.
P.S. My answer is based on the assumption that both processors are from the same or nearly the same production year. It's also important to keep the age of the CPU in mind.
To be 100% sure, check your processors on Geekbench.
A higher clock speed allows more work to be executed in a given time frame, whereas multiple cores allow for parallel processing. However, the benefits of extra cores do not double performance, because not everything will be able to run in parallel the whole time.
4 cores sounds like plenty. You could maybe go to 6 and be able to justify it, but 8 would be overkill and a waste of money. A higher clock speed will be much more useful, both for compilation and when using the computer for other tasks as well. Also, as regards the type of processor, it doesn't matter too much: as long as you are getting the performance, the implementation matters little compared to the other metrics.
Edit
It is also important to take into account the Turbo Boost speeds. Turbo Boost lets a processor run at a lower clock speed when non-intensive tasks are running, in order to save energy; for intensive tasks, it is the Turbo Boost speed that you actually get. This is managed automatically by macOS, but it can be manually controlled using an app such as Turbo Boost Switcher.
The Quad-Core i7 has a Turbo Boost of 4.5 GHz, whereas the 8-core Xeon has a Turbo Boost of 4.2 GHz. This makes them much closer in terms of clock speed. However, the i7 still beats the Xeon in terms of outright clock speed. It also beats it in terms of base speed, which will benefit other tasks performed on the computer and will help with any 'turbo lag' if it is managed by the system. Finally, it has the additional benefit of beating the Xeon on price. This means that for compiling and other Xcode tasks, the i7 is a clear winner.
Look at your current machine. Open Activity Monitor while you are building. If everything is perfect, you would have 100% CPU usage. On a good day you come to 70%, because nothing is perfect.
I have some third party build-scripts that are highly inefficient and use only one core. So your 18 core Mac won't benefit from that at all.
The first and cheapest approach is to make sure you use pre-compiled headers, especially for C++ code, and that your build scripts use all available processors. I have one C++ library that I could build four times faster after doing that.
Note that "GHz" numbers don't tell you what really happens. As your Mac uses more cores, it heats up, and has to reduce the clock speed. So that 3.2 GHz eight core model can run four threads at a much higher speed, probably the same speed as the 4.2 GHz quad core model.
Right now I would recommend you get an M1 Mac for highest single core performance and good multi-core performance, or wait a little bit for their second generation with 8 performance cores. That's what I will be doing.
I suggest you take the i7 one. (If the two processors do not have the same release date, always take the one with the newer release date.)
If you are comparing processor performance, you need to know what each processor was built for. The Intel Xeon is a server processor, and the Intel i7 is a high-end PC processor.
When comparing the 4.2 GHz Quad-Core Intel Core i7 against the 3.2 GHz 8-Core Intel Xeon W for a single app, the answer is simply the i7. The Xcode build process may only fully load one core, parallelizing some of its work onto other cores.
The 8-core Xeon is better used for running computing processes the way a server does.

How many CPUs/cores are really in a multicore?

I have an Intel Core i7 processor (CPU name: Intel(R) Core(TM) i7-4500U CPU @ 1.80GHz, CPU type: Intel Core Haswell processor).
I wonder about the output of the cpuid command, as it shows 4 CPUs, each having 2 cores!
Do I really have 4 CPUs?
The output includes 4 CPUs (cpu0 to cpu3):
(multi-processing synth): multi-core (c=2), hyper-threaded (t=2)
This matters because I want to use hardware performance counters to test my app. However, I am confused about how many cores I have to monitor and profile.
Your Intel i7-4500U is a dual-core CPU with Hyper-Threading support, so you see 4 logical cores.
The U stands for ultrabook, so this is a CPU designed for long battery life in slim ultrabooks.
First, as mentioned before, your system is a dual-core with Hyperthreading (Hyperthreading means each core can execute from two simultaneous hardware threads). Therefore, your OS sees 4 "logical CPUs" even though there's only one "physical CPU". Read more below:
If you're on linux, look at /proc/cpuinfo using cat or less as follows:
cat /proc/cpuinfo
That will list all the info you need. However, to answer your question and to make sense of the information, you need to know that there is a difference between a 'logical CPU' and a 'physical CPU'. A physical CPU is the actual hardware, made by Intel for example, that is installed in your system. A logical CPU is what is seen by the OS and basically refers to a 'hardware thread' on one processor core. So, let's say you have one physical CPU with 4 cores and each core supports one (hardware) thread; then your OS will see 4 CPUs, and those will be listed in /proc/cpuinfo having different 'processor' numbers but the same 'physical id', because they all belong to the same physical processor.
Another example: let's say that each of the cores above supports two threads (again, hardware threads, not software threads). Then your OS will see 8 CPUs. If you have a dual-socket (multi-node) server with two physical CPUs and all of the above, then your OS will see 16 CPUs; each group of 8 will have the same 'physical id'.
Info about your system is here: http://ark.intel.com/products/75460/Intel-Core-i7-4500U-Processor-4M-Cache-up-to-3_00-GHz
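As an illustration (my own sketch, not from the answer above), a small Linux-only C++ program can count exactly the fields described: 'processor' entries are logical CPUs, distinct 'physical id' values are packages/sockets, and 'cpu cores' reports the physical cores per package.

```cpp
// Linux-only illustration: summarise /proc/cpuinfo the way the answer describes.
#include <fstream>
#include <iostream>
#include <set>
#include <string>

int main() {
    std::ifstream cpuinfo("/proc/cpuinfo");
    std::string line;
    int logical = 0, coresPerPackage = 0;
    std::set<std::string> packages;

    while (std::getline(cpuinfo, line)) {
        if (line.rfind("processor", 0) == 0)          // one entry per logical CPU
            ++logical;
        else if (line.rfind("physical id", 0) == 0)   // one id per physical package
            packages.insert(line.substr(line.find(':') + 1));
        else if (line.rfind("cpu cores", 0) == 0)     // physical cores in that package
            coresPerPackage = std::stoi(line.substr(line.find(':') + 1));
    }

    std::cout << "logical CPUs      : " << logical << '\n'
              << "physical packages : " << packages.size() << '\n'
              << "cores per package : " << coresPerPackage << '\n';
    return 0;
}
```

On the i7-4500U described above this would report 4 logical CPUs, 1 physical package and 2 cores per package.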

Is there any performance difference between g++-mp-4.8 and g++-4.8?

I'm compiling the same program on two different machines and then running tests to compare performance.
There is a difference in the power of the two machines: one is a MacBook Pro with four 2.3 GHz processors, the other is a Dell server with twelve 2.9 GHz processors.
However, the Mac runs the test programs in less time!
The only difference in the compilation is that I run g++-mp-4.8 on the Mac, and g++-4.8 on the other machine.
EDIT: There is NO parallel computing going on, and my process was the only one run on the server. Also, I've updated the number of cores on the Dell.
EDIT 2: I ran three tests of increasing complexity, the times obtained were, in the format (Dell,Mac) in seconds: (1.67,0.56), (45,35), (120,103). These differences are quite substantial!
EDIT 3: Regarding the actual processor speed, we considered this with the system administrator and still came up with no good reason. Here is the spec for the MacBook processor:
http://ark.intel.com/fr/products/71459/intel-core-i7-3630qm-processor-6m-cache-up-to-3_40-ghz
and here for the server:
http://ark.intel.com/fr/products/64589/Intel-Xeon-Processor-E5-2667-15M-Cache-2_90-GHz-8_00-GTs-Intel-QPI
I would like to highlight a feature that particularly skews results of single-threaded code on mobile processors:
Note that while there's a 500 MHz difference in base speed (the question mentioned 2.3 GHz, are we looking at the same CPU?), there's only a 100 MHz difference in single-threaded speed, when Turbo Boost is running at maximum.
The Core-i7 also uses faster DDR than its server counterpart, which normally runs at a lower clock speed with more buffers to support much larger capacities of RAM. Normally the number of channels on the Xeon and difference in L3 cache size makes up for this, but different workloads will make use of cache and main memory differently.
Of course generational improvements can make a difference as well. The significance of Ivy Bridge vs Sandy Bridge varies greatly with application.
A final possibility is that the program runtime isn't CPU-bound. The I/O subsystem, GPGPU speed, etc. can affect performance over multiple orders of magnitude for applications that exercise them.
The compilers are practically identical (-mp just signifies that this gcc version was installed via macports).
The performance difference you observed results from the different CPUs: The server is a "Sandy Bridge" microarchitecture, running at 3.5 GHz, while the MacBook has a newer "Ivy Bridge" CPU running at 3.4 GHz (single-thread turbo boost speeds).
The step from Sandy Bridge to Ivy Bridge is just a "tick" in Intel parlance, meaning that the process was changed (from 32 nm to 22 nm) with almost no changes to the microarchitecture. Still, there are some changes in Ivy Bridge that improve the IPC (instructions per clock cycle) for some workloads. In particular, the throughput of division operations, both integer and floating-point, was doubled. (For more changes, see the review on AnandTech: http://www.anandtech.com/show/5626/ivy-bridge-preview-core-i7-3770k/2 )
As your workload contains lots of divisions, this fits your results quite nicely: the "small" testcase shows the largest improvement, while in the larger testcases, the improved core performance is probably shadowed by memory access, which seems roughly the same speed in both systems.
Note that this is purely educated guessing given the current information - one would need to look at your benchmark code, the compiler flags, and maybe analyze it using the CPU performance counters to verify this.
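One cheap way to test the division-throughput hypothesis without the full application is a tiny division-bound micro-benchmark compiled with identical flags on both machines. The sketch below is purely illustrative (my own, not the asker's code); absolute numbers vary with the exact CPU, but if the divider is the bottleneck, the Ivy Bridge i7 should show a noticeably lower time per division than the Sandy Bridge Xeon.

```cpp
// Hypothetical division-bound micro-benchmark, not the asker's program.
// Build identically on both machines: g++ -O2 div_bench.cpp -o div_bench
#include <chrono>
#include <cstdio>

int main() {
    const long long iters = 50000000;        // 4 divisions per iteration
    double a = 1e9, b = 2e9, c = 3e9, d = 4e9;

    auto t0 = std::chrono::steady_clock::now();
    for (long long i = 0; i < iters; ++i) {
        // four independent division chains, so divider throughput is the limiter
        a /= 1.0000000001;
        b /= 1.0000000002;
        c /= 1.0000000003;
        d /= 1.0000000004;
    }
    auto t1 = std::chrono::steady_clock::now();

    double secs = std::chrono::duration<double>(t1 - t0).count();
    std::printf("checksum %.3f, %.2f s total, %.2f ns per division\n",
                a + b + c + d, secs, 1e9 * secs / (4.0 * iters));
    return 0;
}
```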

What is the difference between 'Cores across processors' and 'Number of CPUs'?

E.g., consider the following processor configuration of my machine:
Intel(R) Core(TM) i5 CPU 650 @ 3.20GHz (4 CPUs)
Then how should I find out how many 'cores across processors' my machine has?
Is it the 4 cores [i.e. the number of CPUs]?
I have referred to the following link but still do not get a clear idea:
http://www.ehow.com/how_6873203_do-number-core-processors-windows_.html
Can anyone please clear up my doubt?
'Cores across processors' means nothing, or at least nothing in particular; it's a generic, non-technical phrase with no exact meaning, or no meaning at all.
According to Intel, this CPU provides 2 physical cores with Hyper-Threading, which means that you get 4 logical cores, or so-called threads.
Hyper-Threading is an Intel technology that provides 2 threads for each core, so 2 * 2 = 4 threads.
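If you want to see that count programmatically, a tiny portable check (illustrative only) is std::thread::hardware_concurrency(), which reports the number of logical processors the OS exposes; for a 2-core Hyper-Threaded chip like the i5 650 this is typically 4.

```cpp
// Illustrative only: prints the number of logical processors the OS exposes.
#include <iostream>
#include <thread>

int main() {
    std::cout << "logical processors: "
              << std::thread::hardware_concurrency() << '\n';   // typically 4 on an i5 650
    return 0;
}
```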
I think that this is the closest answer to what you are asking here.
Let's first clarify what a CPU is and what a core is. A central processing unit (CPU) can have multiple core units; each of those cores is a processor by itself, capable of executing a program, but self-contained on the same chip.
In the past a CPU was spread across quite a few chips, but as Moore's Law progressed it became possible to fit a complete CPU inside one chip (die). Since the 90s manufacturers have started to fit more cores into the same die, and that is the concept of multi-core.
These days it is possible to have hundreds of cores on the same chip or die (GPUs, Intel Xeon parts). Another technique, developed in the 90s, was simultaneous multi-threading: basically, it was found possible to run another thread on the same single core, since most of the resources, like the ALU and multiple registers, were already duplicated.
So basically a CPU can have multiple cores, each of them capable of running one or more threads at the same time. We can expect more cores in the future, but programming them efficiently will become harder.

Resources