I wanted to test the execution time of the same program on a Windows 10 virtual machine, so I ran this program 100 times:
for(int i=1;i<=1e8;i++)
{
}
Then I drew a distribution plot (I have tested many times and the results are similar):
[Plot: distribution of execution time on the virtual machine]
The x-axis is the running time and the y-axis is the density.
It is strange because there are several peaks in this picture, and the x-coordinates of the peaks seem to form an arithmetic progression.
But on the host machine, it is a normal distribution, which is in line with our intuition.
I guess this is caused by some periodic process occupying the CPU, since I could only allocate one CPU to the virtual machine. Could you help me explain it?
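For what it's worth, here is a rough sketch of the same measurement methodology in Python (the loop body, the 100 repetitions, and the matplotlib histogram are my stand-ins for the original C++ test and plotting tool):
import time
import matplotlib.pyplot as plt

def busy_loop():
    # Stand-in for the original empty C++ counting loop.
    # (Lower the bound if this is too slow in pure Python.)
    i = 0
    while i < 10**8:
        i += 1

samples = []
for _ in range(100):
    start = time.perf_counter()
    busy_loop()
    samples.append(time.perf_counter() - start)

# Density plot of the 100 wall-clock timings.
plt.hist(samples, bins=30, density=True)
plt.xlabel("running time (s)")
plt.ylabel("density")
plt.show()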
I need to draw 53000000 observations from a standard normal distribution. My current code takes a long time to run in Julia (in fact, it's been running for the past twenty minutes) and I'm wondering if there's anything I can do to speed it up. Here's what I tried:
using Distributions
d = Normal()
shock = rand(d, 1, 53000000)
The code works instantaneously when I execute it in the REPL (I am working in Juno/Atom), but lags at this point (drawing from the standard normal) when I step through using the debugger. So I think the debugger may be the real culprit here.
It may be that the roughly half gigabyte of memory allocated for the variable shock (53,000,000 Float64 values × 8 bytes ≈ 424 MB) is sometimes causing swapping when the debugger is loaded.
Try running this to see, in the debugger:
using Distributions, Base.Sys
println("Free memory is $(Int(Sys.free_memory()))")
d = Normal()
shock = rand(d, 1, 53000000)
println("shock uses $(sizeof(shock)) bytes.")
println("Free memory is $(Int(Sys.free_memory()))")
Are you close to running out of memory, in gigabytes?
I'm testing and comparing the GPU speed-up with different numbers of work-items (no work-groups). The kernel I'm using is a very simple but long operation. When I test with multiple work-items, I use a barrier function and split the work into smaller chunks to get the same result as with just one work-item. I measure the kernel execution time using cl_event and the results are the following:
1 work-item: 35735 ms
2 work-items: 11822 ms (3 times faster than with 1 work-item)
10 work-items: 2380 ms (5 times faster than with 2 work-items)
100 work-items: 239 ms (10 times faster than with 10 work-items)
200 work-items: 122 ms (2 times faster than with 100 work-items)
CPU takes about 580 ms on average to do the same operation.
The only result I don't understand and can't explain is the one with 2 work-items. I would expect the speed-up to be about 2 times compared to the result with just one work-item, so why is it 3 times?
I'm trying to make sense of these numbers by looking at how these work-items were distributed on processing elements. I'm assuming if I have just one kernel, only one compute unit (or multiprocessor) will be activated and the work items distributed on all processing elements (or CUDA cores) of that compute unit. What I'm also not sure about is whether a processing element can process multiple work-items at the same time, or is it just one work-item per processing element?
CL_DEVICE_MAX_WORK_ITEM_SIZES is 1024 / 1024 / 64 and CL_DEVICE_MAX_WORK_GROUP_SIZE is 1024. Since I'm using just one dimension, does that mean I can have 1024 work-items running at the same time per processing element or per compute unit? When I tried with 1000 work-items, the result was a smaller number, so I figured not all of them got executed, but why would that be?
My GPU info: Nvidia GeForce GT 525M, 96 CUDA cores (2 compute units, 48 CUDA cores per unit)
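For reference, those device limits can also be queried programmatically; here is a minimal sketch using pyopencl (this assumes pyopencl is installed and that the GPU is the first device of the first platform; the original host code may well be C/C++ instead):
import pyopencl as cl

# Pick the first device of the first platform; adjust indices as needed.
device = cl.get_platforms()[0].get_devices()[0]

print(device.name)
print("max compute units:  ", device.max_compute_units)
print("max work-group size:", device.max_work_group_size)
print("max work-item sizes:", device.max_work_item_sizes)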
The only result I don't understand and can't explain is the one with 2 work-items. I would expect the speed-up to be about 2 times compared to the result with just one work-item, so why is it 3 times?
The exact reasons will probably be hard to pin down, but here are a few suggestions:
GPUs aren't optimised at all for small numbers of work items. Benchmarking that end of the scale isn't especially useful.
35 seconds is a very long time for a GPU. Your GPU probably has other things to do, so your work-item is probably being interrupted many times, with its context saved and resumed every time.
It will depend very much on your algorithm. For example, if your kernel uses local memory, or a work-size dependent amount of private memory, it might "spill" to global memory, which will slow things down.
Depending on your kernel's memory access patterns, you might be running into the effects of read/write coalescing: with more work-items, neighbouring accesses can be combined into fewer memory transactions.
What I'm also not sure about is whether a processing element can process multiple work-items at the same time, or is it just one work-item per processing element?
Most GPU hardware supports a form of SMT to hide memory access latency. So a compute core will have up to some fixed number of work items in-flight at a time, and if one of them is blocked waiting for a memory access or barrier, the core will continue executing commands on another work item. Note that the maximum number of simultaneous threads can be further limited if your kernel uses a lot of local memory or private registers, because those are a finite resource shared by all cores in a compute unit.
Work-groups will normally run on only one compute unit at a time, because local memory and barriers don't work across units. So you don't want to make your groups too large.
One final note: compute hardware tends to be grouped in powers of 2, so it's usually a good idea to make your work group sizes a multiple of e.g. 16 or 64. 1000 is neither, which usually means some cores will be doing nothing.
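To illustrate that, here is a tiny helper (plain Python; the name and defaults are mine) that rounds a global size up to the next multiple of the chosen group size; out-of-range work-items would then be masked off inside the kernel with a check on get_global_id(0):
def padded_global_size(n_items, group_size=64):
    """Round n_items up to the next multiple of group_size."""
    n_groups = (n_items + group_size - 1) // group_size
    return n_groups * group_size

print(padded_global_size(1000))      # 1024 (1000 is not a multiple of 64)
print(padded_global_size(1000, 16))  # 1008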
When I tried with 1000 work-items, the result was a smaller number so I figured not all of them got executed, but why would that be?
Please be more precise in this question, it's not clear what you're asking.
I'm writing a very simple program to calculate the factorial of a number.
Here it is:
import time

def factorial1(n):
    fattoriale = 1
    while (n > 0):
        fattoriale = fattoriale * n
        n = n - 1
    return fattoriale

start_time = time.clock()
factorial1(v)
print float(time.clock() - start_time), "seconds"
The strange point (for me) is the results in terms of execution time (for a given value):
1st run: 0.000301 seconds
2nd run: 0.000430 seconds
3rd run: 0.000278 seconds
Why do you think it's so variable?
Does it have something to do with the float type approximation?
Thanks, Gianluca
On Unix-based systems, time.clock returns CPU time, not wall-clock time.
Your program is deterministic (even the print is) and on an ideal system should always run in the same amount of time. I believe that in your tests your program was interrupted and some interrupt handler was executed or the scheduler paused your process and gave the CPU to some other process. When your process is allowed to run again the CPU cache might have been filled by the other process, so the processor needs to load your code from memory into the cache again. This takes a small amount of time - which you see in your test.
For a good measurement of how fast your program is, you should consider calling factorial1 not just once but thousands of times (or with larger input values). When your program runs for multiple seconds, scheduling effects have less (relative) impact than in your test, where the whole run took less than a millisecond.
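As a concrete illustration, here is a minimal sketch using Python's timeit module (the argument 1000 and the repetition counts are arbitrary choices of mine; factorial1 is copied from the question):
import timeit

def factorial1(n):
    fattoriale = 1
    while n > 0:
        fattoriale = fattoriale * n
        n = n - 1
    return fattoriale

# Time 1000 calls, repeat the measurement 5 times, and keep the best total;
# the minimum is the measurement least disturbed by scheduling noise.
best = min(timeit.repeat(lambda: factorial1(1000), number=1000, repeat=5))
print("best of 5 runs of 1000 calls: {:.6f} seconds".format(best))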
It probably has a lot to do with sharing of resources. Your program runs as a separate process, so it has to contend with the other processes running on your computer at the same time for resources like CPU and RAM. Because those resources are shared, acquiring them (in concurrency terms) takes a variable amount of time, especially if high-priority processes are running in parallel, and things like interrupts may have higher priority still.
As for your idea: from what I know, the approximation process should not take a variable amount of time, since it runs a deterministic algorithm. However, the approximation process may again have to contend for resources.
In the VTune results, what do the numbers 0, 1, 2 (and 3) actually represent?
What is the meaning of the blue bar over 0?
It's a histogram - each column represents the portion of time you spend while the variable (the one appearing below the graph) is at any given value.
The left one states that you spend roughly 1/3 of the time with 0 utilized logical CPUs (fully idle), and 2/3 of the time with 1 logical core operating. You never reach 2 simultaneously operating cores.
In the same manner, the right histogram says you spend ~25% of the time with zero active threads, and ~75% with one thread (there's a negligible portion with 2 threads).
Note that the total times are slightly different, and the portion of fully-idle time also varies a bit - if this is taken over the exact same run, then this discrepancy might be explained by the difference between the time when a core becomes active (waking up from a low power state), and the moment that the OS can schedule a thread to actually start running on it.
I'm trying to save a 'big' rational matrix in SAGE, but I'm running into problems. After computing my matrix A, which has size 5 x 10,000 and whose entries are rational numbers in fraction form whose numerators and denominators together run to more than 10 pages of digits, I run the following command:
save(A, DATA + 'A').
This gives me the following error message:
Traceback (most recent call last):
...
RuntimeError: Segmentation fault.
I tried the same save command with a 'smaller' matrix and that worked fine. I should also note that I'm using a laptop with a 64-bit operating system, an x64-based processor, Windows 8, an i7 CPU @ 2.40 GHz and 8 GB RAM. I'm running SAGE on a virtual machine to which I allocated 5237 MB. Let me know if you need further information. My questions are:
Why can't I save my matrix? Why do I get the above error message? What does it mean?
How can I save my matrix A using this command? Is there any other way I can save it?
I have asked these same questions in another forum which specifically deals with SAGE, but I'm not getting an answer there. I have also spent a lot of time searching online about this question, but haven't seen anyone with this same problem.
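One workaround worth trying (an untested sketch; the helper names are mine, and it assumes the entries survive a round trip through their string form): skip the pickle-based save entirely and serialize the entries as fraction strings, which QQ can parse back.
import json

def dump_matrix(A, path):
    # Each Sage rational prints as "numerator/denominator" (or just "numerator").
    rows = [[str(x) for x in row] for row in A.rows()]
    with open(path, 'w') as f:
        json.dump(rows, f)

def load_matrix(path):
    with open(path) as f:
        rows = json.load(f)
    # QQ('p/q') turns a fraction string back into a Sage rational.
    return matrix(QQ, [[QQ(s) for s in row] for row in rows])
This trades a larger plain-text file and a slower reload for avoiding whatever the pickling code is tripping over.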