Training the following GBM model on 2 cores vs. 96 cores (on EC2 c5.large and c5.metal) results in faster training times with fewer cores. I checked the water meter to verify that all cores were in use.
Training times:
c5.large (2 cores): ~1min
c5.metal (96 cores): ~2min
Training details:
training set size 6840 rows x 95 cols
seed 1
ntrees 1000
max_depth 50
min_rows 10
learn_rate 0.005
sample_rate 0.5
col_sample_rate 0.5
stopping_rounds 2
stopping_metric "MSE"
stopping_tolerance 1.0E-5
score_tree_interval 500
histogram_type "UniformAdaptive"
nbins 800
nbins_top_level 1024
Any thoughts on why this is happening?
I think the reason is that parallel runtime is composed of two main components:
computation time on each core
communication time to distribute work and collect results
If you have small data and a lot of cores, the algorithm can actually slow down because the communication overhead dominates. Try, for example, 4, 6, or 10 cores instead of 96 to speed things up; a sketch of capping the thread count is below.
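For illustration, here is a minimal sketch using the Python h2o client; the thread count, file path, and response column name are placeholders, not from the original post:

import h2o
from h2o.estimators import H2OGradientBoostingEstimator

h2o.init(nthreads=4)  # cap the H2O cluster at a few threads instead of all 96 cores
# train = h2o.import_file("train.csv")  # hypothetical 6840 x 95 training frame
gbm = H2OGradientBoostingEstimator(
    ntrees=1000, max_depth=50, min_rows=10,
    learn_rate=0.005, sample_rate=0.5, col_sample_rate=0.5,
    stopping_rounds=2, stopping_metric="MSE", stopping_tolerance=1e-5,
    score_tree_interval=500, histogram_type="UniformAdaptive",
    nbins=800, nbins_top_level=1024, seed=1)
# gbm.train(y="response", training_frame=train)  # "response" is a placeholder column name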
We have been running Proxmox VE since 5.0 (now on 6.4-15) and we have noticed a drop in performance whenever there is heavy reading/writing.
We have 9 nodes, 7 with Ceph and 56 OSDs (8 on each node). The OSDs are hard drives (HDD), WD Gold or better (4-12 TB). The nodes have 64/128 GB of RAM and dual-Xeon mainboards (various models).
We already tried simple tests like "ceph tell osd.* bench", getting a stable 110 MB/s transfer rate to each of them with a +/- 10 MB/s spread during normal operations. Apply/Commit latency is normally below 55 ms, with a couple of OSDs reaching 100 ms and one third below 20 ms.
The front and back networks are both 1 Gbps (separated in VLANs); we are trying to move to 10 Gbps but ran into problems we are still trying to solve (unstable OSD disconnections).
The pool is defined as "replicated" with 3 copies (2 needed to keep running). The total amount of disk space is 305 TB (72% used); reweight is in use because some OSDs were getting much more data than others.
Virtual machines run on the same 9 nodes, most are not CPU intensive:
Avg. VM CPU Usage < 6%
Avg. Node CPU Usage < 4.5%
Peak VM CPU Usage 40%
Peak Node CPU Usage 30%
But I/O Wait is a different story:
Avg. Node IO Delay 11
Max. Node IO Delay 38
Disk write load is around 4 MB/s on average, with peaks up to 20 MB/s.
Anyone with experience in getting better Proxmox+CEPH performance?
Thank you all in advance for taking the time to read,
Ruben.
Got some Ceph pointers that you could follow...
Get some good NVMe drives (one or two per server; with 8 HDDs per server, one should be enough) and put the DB/WAL on them (make sure they have power-loss protection).
The "ceph tell osd.* bench" test is not that relevant for the real world; I suggest running some fio tests instead (see here).
Set osd_memory_target to at least 8 GB of RAM per OSD (a minimal ceph.conf sketch is below).
To save some writes on your HDDs (so data is not replicated X times), create your RBD pool as EC (erasure coded), but please do some research on that first because there are trade-offs; recovery takes some extra CPU calculations.
All in all, hyper-converged clusters are good for training, small projects, and medium projects without too big a workload on them... Keep in mind that planning is gold.
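As a rough sketch of the memory suggestion above (the 8 GiB value is just an example; adjust to your RAM budget), the target can be set per OSD in ceph.conf:

[osd]
# example only: ~8 GiB per OSD daemon, specified in bytes
osd_memory_target = 8589934592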
Just my 2 cents,
B.
I have a CUDA program with multiple kernels that run in series (in the same stream, the default one). I want to do a performance analysis of the program as a whole, specifically of the GPU portion. I'm doing the analysis with metrics such as achieved_occupancy, inst_per_warp, gld_efficiency, and so on, using the nvprof tool.
But the profiler reports metric values separately for each kernel, while I want to compute them across all kernels to see the total usage of the GPU by the program.
Should I take the average, the largest value, or the total over all kernels for each metric?
One possible approach would be to use a weighted average method.
Suppose we had 3 non-overlapping kernels in our timeline. Let's say kernel 1 runs for 10 milliseconds, kernel 2 runs for 20 milliseconds, and kernel 3 runs for 30 milliseconds. Collectively, all 3 kernels are occupying 60 milliseconds in our overall application timeline.
Let's also suppose that the profiler reports the gld_efficiency metric as follows:
kernel duration gld_efficiency
1 10ms 88%
2 20ms 76%
3 30ms 50%
You could compute the weighted average as follows:
"overall" global load efficiency = (88% * 10 + 76% * 20 + 50% * 30) / 60 = 65%
I'm sure there are other approaches that make sense as well. For example, a better approach might be to have the profiler report the total number of global load transactions for each kernel, and do your weighting based on that rather than on kernel duration:
kernel gld_transactions gld_efficiency
1 1000 88%
2 2000 76%
3 3000 50%
"overall" global load efficiency = (88% * 1000 + 76% * 2000 + 50% * 3000) / 6000 = 65%
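As a minimal sketch, both weightings can be computed in a few lines of Python using the hypothetical numbers from the tables above (nvprof does not aggregate this for you):

durations_ms = [10, 20, 30]            # kernel run times from the first table
gld_transactions = [1000, 2000, 3000]  # per-kernel global load transactions from the second table
gld_efficiency = [88.0, 76.0, 50.0]    # percent, as reported per kernel

def weighted_average(values, weights):
    # weighted mean: sum(v_i * w_i) / sum(w_i)
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

print(weighted_average(gld_efficiency, durations_ms))      # 65.0, duration-weighted
print(weighted_average(gld_efficiency, gld_transactions))  # 65.0, transaction-weighted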
With the intention of comparing the speed of GPU vs. CPU computing, I ran the example code available here (a Mandelbrot set on the GPU) from MATLAB Central. Below are the results that I obtained:
Case 1 (without GPU): 6.2 secs
Case 2 (using parallel.gpu.GPUArray): 6.518 secs (1.39 secs in the example)
Case 3 (Using Element-wise Operation): 1.259 secs (0.14 secs in the example)
As can be seen, there is no improvement in case 2 and only a modest improvement of around 5 times in case 3. As the example did not state the details of the GPU they used, may I know if this is simply due to the "incompetency" of my graphics card, or am I missing something important?
The graphics card is also responsible for driving my display (HP Z Display Z23i 23-inch IPS LED Backlit Monitor).
CPU: Intel i7-4790, 3.6 GHz (4 cores, 8 threads)
GPU:
Name: 'NVS 510'
Index: 1
ComputeCapability: '3.0'
SupportsDouble: 1
DriverVersion: 6
ToolkitVersion: 5
MaxThreadsPerBlock: 1024
MaxShmemPerBlock: 49152
MaxThreadBlockSize: [1024 1024 64]
MaxGridSize: [2.1475e+09 65535 65535]
SIMDWidth: 32
TotalMemory: 2.1475e+09
FreeMemory: 1.6934e+09
MultiprocessorCount: 1
ClockRateKHz: 797000
ComputeMode: 'Default'
GPUOverlapsTransfers: 1
KernelExecutionTimeout: 1
CanMapHostMemory: 1
DeviceSupported: 1
DeviceSelected: 1
Thank you!
Edit
The GPU used in the example here is a Tesla C2050. (Credits to @Sam Roberts)
The times at that link are most likely for a different GPU than yours. They don't specify what graphics card they're using, but my guess is that it's a higher-end card.
Googling NVS 510, the specs are similar to the card I have in my machine. However, your card is geared towards business while mine is geared towards gaming. I have a GTX 660, which is one of the higher-end GPUs available on the market.
These are the attributes of my graphics card:
CUDADevice with properties:
Name: 'GeForce GTX 660'
Index: 1
ComputeCapability: '3.0'
SupportsDouble: 1
DriverVersion: 6.5000
ToolkitVersion: 5.5000
MaxThreadsPerBlock: 1024
MaxShmemPerBlock: 49152
MaxThreadBlockSize: [1024 1024 64]
MaxGridSize: [2.1475e+09 65535 65535]
SIMDWidth: 32
TotalMemory: 2.1475e+09
FreeMemory: 1.5357e+09
MultiprocessorCount: 5
ClockRateKHz: 1084500
ComputeMode: 'Default'
GPUOverlapsTransfers: 1
KernelExecutionTimeout: 1
CanMapHostMemory: 1
DeviceSupported: 1
DeviceSelected: 1
The differences between my card and yours are that I have 5 multiprocessors, and my clock rate is about 300 MHz faster than yours. For a side-by-side comparison, check out my card in comparison to yours:
NVS 510: http://www.nvidia.ca/object/nvs-510-graphics-card.html#pdpContent=2
GTX 660: http://www.geforce.com/hardware/desktop-gpus/geforce-gtx-660/specifications
Upon further inspection, I have a much higher memory bandwidth than your card. I also have 960 GPU cores in comparison to your 192.
I decided to run these tests to compare my performance with your timings. My CPU is an i7-4770 3.6 GHz Intel and I have 16 GB of RAM on my machine.
The times that I get by running those examples are the following:
Case #1 - Without GPU: 6.46 seconds
Case #2 - Naive GPU: 0.82 seconds - 7.9x faster
Case #3 - Through CUDA: 0.09 seconds - 71.7x faster
With this, my guess is that your graphics card is simply less capable than the one used in the tests that MathWorks performed. Maybe try updating your graphics drivers and see if that helps. However, my guess is that my better performance is due to the higher multiprocessor count, the faster clock, the larger number of cores, and the higher memory bandwidth.
I was taking an exam earlier and memorized the questions that I didn't know how to answer but somehow got correct (the online exam, done through an electronic classroom (eclass), was multiple choice; it was coded so each of us was given random questions in a random order with the answer choices shuffled, so yeah).
Anyway, back to my questions:
1.)
There is a CPU with a clock frequency of 1 GHz. When the instructions consist of two
types as shown in the table below, what is the performance in MIPS of the CPU?
Execution time (clocks)   Frequency of appearance (%)
Instruction 1 10 60
Instruction 2 15 40
Answer: 125
2.)
There is a hard disk drive with specifications shown below. When a record of 15
Kbytes is processed, which of the following is the average access time in milliseconds?
Here, the record is stored in one track.
[Specifications]
Capacity: 25 Kbytes/track
Rotation speed: 2,400 revolutions/minute
Average seek time: 10 milliseconds
Answer: 37.5
3.)
Assume a magnetic disk has a rotational speed of 5,000 rpm, and an average seek time of 20 ms. The recording capacity of one track on this disk is 15,000 bytes. What is the average access time (in milliseconds) required in order to transfer one 4,000-byte block of data?
Answer: 29.2
4.)
When a color image is stored in video memory at a tonal resolution of 24 bits per pixel,
approximately how many megabytes (MB) are required to display the image on the
screen with a resolution of 1024 x 768 pixels? Here, 1 MB is 10^6 bytes.
Answer: 18.9
5.)
When a microprocessor works at a clock speed of 200 MHz and the average CPI
(“cycles per instruction” or “clocks per instruction”) is 4, how long does it take to
execute one instruction on average?
Answer: 20 nanoseconds
I don't expect someone to answer everything, and they are indeed already answered, but I want to know how those answers were arrived at. It's not enough for me to just know the answer. I've tried solving them myself, trial-and-error style, to arrive at those numbers, but that was taking minutes to hours, so I need some professional help.
1.)
n = 1/f = 1 / 1 GHz = 1 ns.
n*10 * 0.6 + n*15 * 0.4 = 12 ns (average instruction time), so 1 / 12 ns ≈ 83.3 MIPS.
2.) 3.)
I don't get these, honestly; the standard access-time model sketched at the end does reproduce the stated answers, though.
4.)
Here, 1 MB is 10^6 bytes.
3 Bytes * 1024 * 768 = 2359296 Bytes = 2.36 MB
But often these 24 bits are packed into 32 bits because of the memory layout (word width), so it will often be 4 Bytes * 1024 * 768 = 3145728 Bytes = 3.15 MB.
5)
CPI / f = 4 / 200 MHz = 20 ns.
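For 2.) and 3.), the standard model is: average access time = average seek time + average rotational latency (half a revolution) + transfer time for the fraction of the track that the record occupies. Plugging in the given numbers reproduces the stated answers; a quick check in Python:

# Q2: 2,400 rpm -> one revolution takes 60000 / 2400 = 25 ms
seek = 10.0                          # average seek time (ms)
rotational = 25.0 / 2                # half a revolution = 12.5 ms
transfer = (15 / 25) * 25.0          # 15 KB of a 25 KB track = 15 ms
print(seek + rotational + transfer)  # 37.5 ms

# Q3: 5,000 rpm -> one revolution takes 60000 / 5000 = 12 ms
seek = 20.0
rotational = 12.0 / 2                # 6 ms
transfer = (4000 / 15000) * 12.0     # 3.2 ms
print(seek + rotational + transfer)  # 29.2 ms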
I was wondering how MATLAB can multiply two matrices so fast. When multiplying two NxN matrices, N^3 multiplications are performed. Even with the Strassen algorithm it takes ~N^2.8 multiplications, which is still a large number. I was running the following test program:
a = rand(2160);
b = rand(2160);
tic;a*b;toc
2160 was used because 2160^3 ≈ 10^10 (a*b should take about 10^10 multiplications)
I got:
Elapsed time is 1.164289 seconds.
(I'm running on a 2.4 GHz notebook and no multithreading occurs)
which means my computer did ~10^10 operations in a little more than 1 second.
How can this be?
It's a combination of several things:
Matlab does indeed multi-thread.
The core is heavily optimized with vector instructions.
Here are the numbers on my machine: Core i7 920 @ 3.5 GHz (4 cores)
>> a = rand(10000);
>> b = rand(10000);
>> tic;a*b;toc
Elapsed time is 52.624931 seconds.
Task Manager shows 4 cores of CPU usage.
Now for some math:
Number of multiplies = 10000^3 = 1,000,000,000,000 = 10^12
Max multiplies in 53 secs =
(3.5 GHz) * (4 cores) * (2 mul/cycle via SSE) * (52.6 secs) = 1.47 * 10^12
So Matlab is achieving about 1 / 1.47 = 68% efficiency of the maximum possible CPU throughput.
I see nothing out of the ordinary.
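The same back-of-the-envelope estimate, written out in Python with the clock, core count, and 2-multiplies-per-cycle SSE assumption stated above:

clock_hz = 3.5e9        # Core i7 920 as reported above
cores = 4
muls_per_cycle = 2      # assumed SSE throughput (packed double-precision multiplies)
elapsed_s = 52.6

multiplies_needed = 10000 ** 3                                       # ~1e12 scalar multiplies for the 10000x10000 product
multiplies_possible = clock_hz * cores * muls_per_cycle * elapsed_s  # ~1.47e12 possible in that window
print(multiplies_needed / multiplies_possible)                       # ~0.68 -> about 68% of peak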
To check whether or not you are using multithreading in MATLAB, use this command:
maxNumCompThreads(n)
This sets the number of cores to use to n. Now I have a Core i7-2620M, which has a maximum frequency of 2.7 GHz, but it also has a turbo mode at 3.4 GHz. The CPU has two cores. Let's see:
A = rand(5000);
B = rand(5000);
maxNumCompThreads(1);
tic; C=A*B; toc
Elapsed time is 10.167093 seconds.
maxNumCompThreads(2);
tic; C=A*B; toc
Elapsed time is 5.864663 seconds.
So there is multi-threading.
Let's look at the single CPU results. A*B executes approximately 5000^3 multiplications and additions. So the performance of single-threaded code is
5000^3 * 2 / 10.17 ≈ 24.6 GFLOP/s
Now for the CPU: at 3.4 GHz, Sandy Bridge can do at most 8 FLOPs per cycle with AVX:
3.4 [Gcycles/second] * 8 [FLOPs/cycle] = 27.2 GFLOP/s peak performance
So single-core performance is around 90% of peak, which is to be expected for this problem.
You really need to look deeply into the capabilities of your CPU to get accurate performance estimates.