**cpu :** E5-2630L * 2
**os :** Linux CentOS 6.3
physical cores: 12
logical cores: 24 (from grep -c processor /proc/cpuinfo, with Hyper-Threading enabled)
Each E5-2630L has 6 cores, so the total is 24 logical cores (6 cores * 2 sockets * 2 threads).
But /proc/<pid>/status shows:
- Cpus_allowed: ffffffff,ffffffff
- Cpus_allowed_list: 0-63
The machine has 24 logical cores, so why does Cpus_allowed cover 64 CPUs?
This is the default; it just means there is no restriction beyond the available hardware. I think the mask is sized in multiples of 32 bits, but it always starts at twice that (64 bits).
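To see what the mask in the question encodes, here is a minimal Python sketch that decodes a Cpus_allowed string into CPU indices. The helper name parse_cpus_allowed is made up for illustration, and the 24 is the logical core count reported in the question.

```python
def parse_cpus_allowed(mask):
    """Return the CPU indices whose bits are set in a Cpus_allowed hex mask."""
    # The mask is printed as comma-separated 32-bit hex words, most significant
    # word first, so removing the commas leaves a single hex number.
    value = int(mask.replace(",", ""), 16)
    return [bit for bit in range(value.bit_length()) if (value >> bit) & 1]


allowed = parse_cpus_allowed("ffffffff,ffffffff")
print(len(allowed), allowed[0], allowed[-1])  # 64 0 63

# Only CPUs that actually exist can be scheduled on; the question reports 24.
usable = [cpu for cpu in allowed if cpu < 24]
print(len(usable))  # 24
```

The mask allows CPUs 0-63, but intersecting it with the CPUs that exist leaves the 24 logical cores the hardware provides.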
My university has computational nodes with 128 total cores, each made up of two individual AMD processors (i.e., sockets) with 64 cores apiece. This leads to anomalous simulation runtimes in ABAQUS when using crystal plasticity models implemented in a User MATerial subroutine (UMAT). For instance, if I run a simulation using 1 node and 128 cores, it takes around 14 hours. If I submit the same job to run across two nodes with 128 cores (i.e., using 64 cores/1 processor on two separate nodes), the job finishes in only 9 hours. One would expect the simulation running on a single host node to be faster than on two separate nodes for the same total number of cores, but this is not the case. In the latter configuration, the job uses one 64-core processor on each of the two host nodes, and the abaqus_v6.env file therefore contains:
mp_host_list=[['node_1', 64],['node_2', 64]]
for the 2 node/128 core simulation. The ABAQUS .msg file then accordingly splits the job into two processes each with 64 threads:
PROCESS 1 THREAD NUMBER OF ELEMENTS
1 3840
2 3840
...
63
64 3840
PROCESS 2 THREAD NUMBER OF ELEMENTS
1 3584
2 4096
...
63
64 3840
The problem arises when I specify a single host node with 128 cores because ABAQUS has no way of recognizing that the host node consists of two separate processors. I can modify the abaqus_v6.env file accordingly as:
mp_host_list=[['node_1', 64],['node_1', 64]]
but ABAQUS just clumps this into one process with 128 threads, and I believe this is why my simulations actually run quicker on two nodes instead of one with the same number of cores, because ABAQUS does not recognize that it should treat the single node as two processors/processes.
Is there a way to specify two processes on the same host node in ABAQUS?
As a note, the amount of memory/RAM reserved per core does not change (~2 GB per core).
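The mp_host_list lines quoted above follow a regular pattern, so they can be generated from a list of (host, cores) pairs. This is a minimal Python sketch; format_mp_host_list is a hypothetical helper, not part of ABAQUS, and it only builds the text of the line. It does not change how ABAQUS groups threads into processes.

```python
def format_mp_host_list(hosts):
    """Build an mp_host_list line from (hostname, cores) pairs."""
    entries = ",".join(f"['{name}', {cores}]" for name, cores in hosts)
    return f"mp_host_list=[{entries}]"


# Two nodes, using 64 cores on each (the 2 node/128 core case)
print(format_mp_host_list([("node_1", 64), ("node_2", 64)]))

# The same node listed twice (the attempt on a single 128-core node)
print(format_mp_host_list([("node_1", 64), ("node_1", 64)]))
```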
Final update: able to reduce runtimes using multiple nodes
I found that running these types of simulations across multiple nodes reduces run times. Wall times for two models across various numbers of cores, nodes, and cores/processor are listed in the tables below.
The smaller model finished in 9.7 hours on two nodes with 64 cores/node = total of 128 cores. The runtime dropped by about 25% when simulated over four nodes with 32 cores/node for the same total of 128 cores. Interestingly, the simulation took longer using three nodes with 64 cores/node (total of 192 cores), and there could be many reasons for this. One surprising result was that the simulation ran quicker using 64 cores split over two nodes (32 cores/processor) than using 64 cores on a single processor, which suggests that the extra memory bandwidth of using multiple nodes helps (details of which I do not fully understand).
The larger model finished in ~32.5 hours using 192 cores, and there was little difference between using three (64 cores/processor) or six (32 cores/processor) nodes, which means that at some point, using more nodes does not help. However, this larger model finished in 36.7 hours using 128 cores with 32 cores/processor (four nodes). Thus, the most efficient use of nodes for both the larger and smaller model is with 128 CPUs split over four nodes.
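The percentages behind these comparisons are easy to check. This is a small Python sketch using only wall times quoted in this update; percent_reduction is an illustrative helper name.

```python
def percent_reduction(before_hr, after_hr):
    """Relative drop in wall time, in percent."""
    return 100.0 * (before_hr - after_hr) / before_hr


# Smaller model, 128 cores: 9.7 h as quoted above vs. 7.2 h on four nodes
small = percent_reduction(9.7, 7.2)

# Larger model: 36.7 h on 128 cores (four nodes) vs. ~32.5 h on 192 cores
large = percent_reduction(36.7, 32.5)

print(f"smaller model: {small:.1f}% shorter")
print(f"larger model:  {large:.1f}% shorter")
```

For the larger model, going from 128 to 192 cores adds 50% more cores for a reduction of roughly 11%.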
Simulation details for a model with 477,956 tetrahedral elements and 86,153 nodes. Model is cyclically strained to a strain of 1.3% for 10 cycles with a strain ratio R = 0.

| # CPUs | # Nodes | # ABAQUS processes: actual | # ABAQUS processes: ideal | Notes on cores per processor | Wall time (hr) | Notes |
|---|---|---|---|---|---|---|
| 64 | 1 | 1 | 2 | 32 cores/processor | 13.8 | Using cores on two processors but unable to specify two separate processes |
| 64 | 1 | 1 | 1 | 64 cores/processor | 11.5 | No need to specify two processes |
| 64 | 2 | 2 | 2 | 32 cores/processor | 10.5 | Correctly specifies two processes. Surprisingly faster than the scenario directly above! |
| 128 | 1 | 1 | 2 | 64 cores/processor | 14.5 | Unable to specify two separate processes |
| 128 | 2 | 2 | 2 | 64 cores/processor | 8.9 | Correctly specifies two processes |
| 128 | 2 | 2 | 4 | ~32 cores/processor; 4 total processors | 9.9 | Specifies two processes but should be four processes |
| 128 | 2 | 2 | 3 | 64 cores/processor | 9.7 | Specifies two processes over three processors |
| 128 | 4 | 4 | 4 | 32 cores/processor | 7.2 | 32 cores per node. Most efficient! |
| 192 | 3 | 3 | 3 | 64 cores/processor | 7.6 | Three nodes with three processors |
| 192 | 2 | 2 | 4 | 64 and 32 cores/processor on both node | 10.5 | Four processors over two nodes |

Simulation details for a model with 4,104,272 tetrahedral elements and 702,859 nodes. The model is strained to 1.3% strain and then back to 0% strain (one cycle).

| # CPUs | # Nodes | # ABAQUS processes: actual | # ABAQUS processes: ideal | Notes on cores per processor | Wall time (hr) | Notes |
|---|---|---|---|---|---|---|
| 64 | 1 | 1 | 1 | 64 cores/processor | 53.0 | Using a single processor |
| 128 | 1 | 1 | 2 | 64 cores/processor | 57.3 | Using two processors on one node |
| 128 | 2 | 2 | 2 | 64 cores/processor | 40.9 | |
| 128 | 4 | 4 | 4 | 32 cores/processor | 36.7 | Most efficient! |
| 192 | 2 | 2 | 4 | 64 and 32 cores/processor on both node | 42.7 | |
| 192 | 3 | 3 | 3 | 64 cores/processor | 32.4 | |
| 192 | 6 | 6 | 6 | 32 cores/processor | 32.6 | |

I am a beginner in cluster configuration. I know that our cluster has two types of worker nodes:
16 x 4TB Disks
128 RAM
2 x 8 Core CPUs
12 x 1.2 TB Disks
256 RAM
2 x 10 Core CPUs
I am confused about the configuration. What does 2 x 8 cores mean? Does it mean 2 processors with 8 physical cores each? So if my processors support hyperthreading, will I have 2 x 8 x 2 = 32 virtual cores?
And does 12 x 1.2 TB mean 12 disks of 1.2 TB each?
Usually, 2 x 8 Core CPUs means that you have 2 physical chips on your motherboard, each with 8 cores. If you enable hyperthreading, you then have 32 virtual cores.
The disk count either means what you stated, or it is the number of nodes. In that case you would have 16 nodes with a 4 TB disk each and 12 nodes with a 1.2 TB disk each.
I am just wondering how someone can get hold of this hardware without knowing what it means. Can you send me some nodes? :)
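As a rough illustration of that reading, here is a Python sketch that works out the per-node totals for both worker types. The NodeSpec class and its field names are made up for this example. It assumes the disk counts are per node, and that hyperthreading gives two threads per core.

```python
from dataclasses import dataclass


@dataclass
class NodeSpec:
    disks: int             # number of disks in the node (assumed per-node)
    disk_tb: float         # capacity of each disk in TB
    sockets: int           # physical CPU chips on the motherboard
    cores_per_socket: int  # physical cores per chip

    def physical_cores(self):
        return self.sockets * self.cores_per_socket

    def virtual_cores(self, threads_per_core=2):
        # With hyperthreading each physical core shows up as two logical cores
        return self.physical_cores() * threads_per_core

    def storage_tb(self):
        return self.disks * self.disk_tb


type_a = NodeSpec(disks=16, disk_tb=4.0, sockets=2, cores_per_socket=8)
type_b = NodeSpec(disks=12, disk_tb=1.2, sockets=2, cores_per_socket=10)

for name, node in (("type A", type_a), ("type B", type_b)):
    print(name, node.physical_cores(), node.virtual_cores(), round(node.storage_tb(), 1))
```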
With the intention of comparing the speed of GPU vs CPU computing, I ran the example codes available here (a Mandelbrot set on the GPU) from MATLAB central. Below are the results that I obtained:
Case 1 (without GPU): 6.2 secs
Case 2 (using parallel.gpu.GPUArray): 6.518 secs (1.39 secs in the example)
Case 3 (Using Element-wise Operation): 1.259 secs (0.14 secs in the example)
As can be seen, there is no improvement in case 2 and only a modest improvement of around 5 times in case 3. Since the example did not state the details of the GPU used, may I know whether this is simply due to the limitations of my graphics card, or am I missing something important?
The graphics card is also responsible for driving my display (HP Z Display Z23i 23-inch IPS LED Backlit Monitor).
CPU: Intel i7-4790, 3.6 GHz (4 cores, 8 threads)
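The speedups can be read straight off the three timings. This is a short Python sketch using the numbers measured above; the dictionary keys are just labels.

```python
cpu_seconds = 6.2  # Case 1, without GPU

gpu_seconds = {
    "Case 2 (GPUArray)": 6.518,
    "Case 3 (element-wise)": 1.259,
}

# Speedup relative to the CPU-only run; values below 1 mean the GPU was slower
speedups = {case: cpu_seconds / t for case, t in gpu_seconds.items()}

for case, factor in speedups.items():
    print(f"{case}: {factor:.2f}x")
```

Case 2 comes out slightly below 1x (marginally slower than the CPU), while Case 3 is a little under 5x.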
GPU:
Name: 'NVS 510'
Index: 1
ComputeCapability: '3.0'
SupportsDouble: 1
DriverVersion: 6
ToolkitVersion: 5
MaxThreadsPerBlock: 1024
MaxShmemPerBlock: 49152
MaxThreadBlockSize: [1024 1024 64]
MaxGridSize: [2.1475e+09 65535 65535]
SIMDWidth: 32
TotalMemory: 2.1475e+09
FreeMemory: 1.6934e+09
MultiprocessorCount: 1
ClockRateKHz: 797000
ComputeMode: 'Default'
GPUOverlapsTransfers: 1
KernelExecutionTimeout: 1
CanMapHostMemory: 1
DeviceSupported: 1
DeviceSelected: 1
Thank you!
Edit
The GPU used in the example here is a Tesla C2050. (Credit to @Sam Roberts)
The times on that link are most likely for a different GPU than yours. They don't specify what kind of graphics card they're using, but my guess is that they're using a higher-end card.
From Googling the NVS 510, its specs look similar to those of the card in my machine. However, your card is geared towards business use while mine is geared towards gaming. I have a GTX 660, which is one of the higher-end GPUs available on the market.
These are the attributes of my graphics card:
CUDADevice with properties:
Name: 'GeForce GTX 660'
Index: 1
ComputeCapability: '3.0'
SupportsDouble: 1
DriverVersion: 6.5000
ToolkitVersion: 5.5000
MaxThreadsPerBlock: 1024
MaxShmemPerBlock: 49152
MaxThreadBlockSize: [1024 1024 64]
MaxGridSize: [2.1475e+09 65535 65535]
SIMDWidth: 32
TotalMemory: 2.1475e+09
FreeMemory: 1.5357e+09
MultiprocessorCount: 5
ClockRateKHz: 1084500
ComputeMode: 'Default'
GPUOverlapsTransfers: 1
KernelExecutionTimeout: 1
CanMapHostMemory: 1
DeviceSupported: 1
DeviceSelected: 1
The differences between my card and yours are that I have 5 multiprocessors where you have 1, and my clock rate is about 300 MHz faster than yours. For a side-by-side comparison, see the specification pages for both cards:
NVS 510: http://www.nvidia.ca/object/nvs-510-graphics-card.html#pdpContent=2
GTX 660: http://www.geforce.com/hardware/desktop-gpus/geforce-gtx-660/specifications
Upon further inspection, I have a much higher memory bandwidth than your card. I also have 960 GPU cores in comparison to your 192.
I decided to run these tests to compare my performance with your timings. My CPU is an Intel i7-4770 at 3.6 GHz and I have 16 GB of RAM on my machine.
The times that I get by running those examples are the following:
Case #1 - Without GPU: 6.46 seconds
Case #2 - Naive GPU: 0.82 seconds - 7.9x faster
Case #3 - Through CUDA: 0.09 seconds - 71.7x faster
With this, my guess is that your graphics card is lower-end than the one MathWorks used for those tests. Maybe try updating your graphics drivers and see if that helps. However, I suspect my performance is much better because of the higher multiprocessor count, faster clock, larger number of cores and higher memory bandwidth.
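For a back-of-envelope feel of how far apart the two cards are, one can multiply core count by clock rate. This Python sketch uses only the figures quoted above. Treating throughput as proportional to cores times clock is an assumption made for illustration; it ignores memory bandwidth and everything else that affects real performance.

```python
# Figures quoted in this thread
nvs_510 = {"cores": 192, "clock_khz": 797000}
gtx_660 = {"cores": 960, "clock_khz": 1084500}


def raw_throughput(gpu):
    # Assumption: throughput scales with cores x clock (a crude proxy only)
    return gpu["cores"] * gpu["clock_khz"]


estimated_ratio = raw_throughput(gtx_660) / raw_throughput(nvs_510)

# Measured ratios between the two machines, from the timings in this thread
measured_case2 = 6.518 / 0.82
measured_case3 = 1.259 / 0.09

print(f"estimated from cores x clock: {estimated_ratio:.1f}x")
print(f"measured, Case 2: {measured_case2:.1f}x")
print(f"measured, Case 3: {measured_case3:.1f}x")
```

The crude estimate lands in the same range as the measured gap, which is consistent with the hardware difference explaining most of it.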
What I know is:
Number of Logical Processors = Cores x Sockets x HT
Is that right? How many virtual machines can be provisioned with these logical processors?
Exactly, so if the host has 2 processors with 4 cores each and HT enabled, then
Number of logical processors = 2 x 4 x 2 = 16
ESX will also use a core, so if you take the simplistic view of
allocating cores to VMs, you only have 7 to "allocate".
Have a look at this VMware community question on ESXi CPU.
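The arithmetic in this answer can be written down as a short Python sketch. The function names are illustrative. Reading the 7 as 8 physical cores minus one reserved for ESX is an assumption about what the answer means, not a VMware sizing rule.

```python
def logical_processors(sockets, cores_per_socket, threads_per_core):
    """Logical processors = Cores x Sockets x HT."""
    return sockets * cores_per_socket * threads_per_core


def cores_left_for_vms(sockets, cores_per_socket, reserved_for_hypervisor=1):
    """Simplistic view: physical cores minus what the hypervisor keeps."""
    # Assumption: one whole physical core is set aside for ESX
    return sockets * cores_per_socket - reserved_for_hypervisor


print(logical_processors(sockets=2, cores_per_socket=4, threads_per_core=2))  # 16
print(cores_left_for_vms(sockets=2, cores_per_socket=4))  # 7
```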
I recently bought a server with 2 x X5550 processors. They are quad-core (4 cores each), for a total of 8 cores.
When I check Task Manager, the CPU usage history shows 16 graphs.
Shouldn't it be 8, since I have 2 quad-core processors?
Or do the graphs maybe show the threads of the CPU?
The CPUs support Hyper-Threading, so each core appears as 2 logical CPUs (2 x 4 x 2 = 16).
You can always look up the chip specs on Intel's site.