I have been using TickCount() to measure the time between events or the time needed to run a piece of code, but it is deprecated as of OS X 10.8.
Therefore, I need an alternative.
If you want to measure absolute time, use gettimeofday(). This gives you the date, e.g., "Thu Nov 22 07:48:52 UTC 2012". This is not always suitable for measuring differences between events because the time reported by gettimeofday() can jump forwards or backwards if the user changes the clock.
If you want to measure relative time, use mach_absolute_time(). This lets you measure the difference between two events, e.g., "15.410 s". This does not give absolute times, but is always monotonic.
If you want to measure CPU time, use clock(). This is often but not always the way you measure the performance of a piece of code. It doesn't count time spent on IO, or impact on system speed, so it should only be used when you know you are measuring something CPU bound.
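As a rough illustration of the relative-time and CPU-time cases, here is a minimal sketch. It uses POSIX clock_gettime(CLOCK_MONOTONIC) as a portable stand-in for mach_absolute_time() (a substitution, not what the answer names), and busy_work is a hypothetical workload invented for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

/* Monotonic reading in nanoseconds. It does not jump when the user changes
   the system clock. Assumes a POSIX system that provides CLOCK_MONOTONIC. */
uint64_t monotonic_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
}

/* CPU time consumed by this process so far, in seconds (standard C clock()). */
double cpu_seconds(void)
{
    return (double)clock() / CLOCKS_PER_SEC;
}

/* Hypothetical CPU-bound workload, present only so there is something to time. */
uint64_t busy_work(uint64_t n)
{
    uint64_t acc = 0;
    for (uint64_t i = 0; i < n; i++)
        acc += i * i;
    return acc;
}
```

To time a region, read both clocks before and after it and subtract. If the elapsed monotonic time is much larger than the CPU time, the code probably spent most of its time waiting (I/O, sleeping, or being descheduled) rather than computing.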
I'm surprised that TickCount() wasn't deprecated earlier. It's really an OS 9 and earlier thing.
While this API may not be suitable for new development, if you find yourself in need of an identical API, it can be re-implemented as follows:
#include <stdint.h>
#include <mach/mach_time.h>

uint32_t TickCount() {
    uint64_t mat = mach_absolute_time();
    uint32_t mul = 0x80d9594e;
    return ((((0xffffffff & mat) * mul) >> 32) + (mat >> 32) * mul) >> 23;
}
The above implementation was created through analysis of /System/Library/Frameworks/CoreServices.framework/Versions/A/Frameworks/CarbonCore.framework/Versions/A/CarbonCore, and was briefly unit-tested against the deprecated TickCount with LLDB by altering the registers returned by mach_absolute_time.
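The arithmetic in that function can be read as a fixed-point multiplication: the constant 0x80d9594e is 60 * 2^55 / 10^9 rounded up, so the expression computes floor(mat * mul / 2^55) in two 32-bit halves to avoid a 128-bit product. The sketch below repeats the same arithmetic on a plain parameter so it can be checked without mach_absolute_time(). The name ticks_from_ns is made up for this example, and the interpretation assumes the timestamp is in nanoseconds, which I believe holds on Intel Macs but not necessarily on other hardware (mach_timebase_info reports the actual ratio).

```c
#include <stdint.h>

/* Same arithmetic as the TickCount() replacement, with the timestamp passed
   in (assumed to be in nanoseconds) instead of read from the system. */
uint32_t ticks_from_ns(uint64_t ns)
{
    const uint64_t mul = 0x80d9594e;          /* 60 * 2^55 / 1e9, rounded up */
    uint64_t lo = (ns & 0xffffffffu) * mul;   /* low 32 bits times mul  */
    uint64_t hi = (ns >> 32) * mul;           /* high 32 bits times mul */
    /* (lo >> 32) + hi equals floor(ns * mul / 2^32); shifting right by 23
       more bits gives floor(ns * mul / 2^55). */
    return (uint32_t)(((lo >> 32) + hi) >> 23);
}
```

Under that assumption, one second (1000000000 ns) maps to 60 ticks, and the first tick boundary falls between 16666666 ns and 16666667 ns, i.e. at 1/60 of a second.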
I would like to measure the total simulation and initialization time of a system of DAEs. I am interested in the wall-clock time (like the one given in Matlab by the function tic-toc).
I noticed that Modelica has different flags for the simulation time, but the time I get is very small compared to the time that elapses from when I press the simulation button to the end of the simulation (measured approximately with the clock on my phone).
I guess this short time is just the time required for the simulation and does not include the initialization of the system of equations.
Is there a way to calculate this total time?
Thank you so much in advance,
Gabriele
Dear Marco,
Thank you so much for your extremely detailed and useful reply!
I am actually using OpenModelica and not Dymola, so unfortunately I have to build the function that does this myself, and I am very new to the OpenModelica language.
So far, I have a model that simulates the physical behavior based on a system of DAEs. Now I am trying to build what you suggest here:
With getTime() you can build a function that: reads the system time as t_start; translates the model and simulates for 0 seconds; reads the system time again as t_stop; computes the difference between t_start and t_stop.
Could you please give me more details? Which command can I use to read the system time at t_start and to simulate the model for 0 seconds? Do I need two different functions for t_start and t_stop?
Once I have done this, do I have to call the function (or functions) inside the OpenModelica model whose time I want to measure?
Thank you so much again for your precious help!
Very best regards, Gabriele
Depending on the tool you have, this could mean a lot of work.
The first problem is that the MSL allows you to retrieve the system time, but there is nothing included to easily compute time deltas. Therefore the Testing library in Dymola features the operator records DateTime and Duration. Note that it is planned to integrate them in future MSL versions, but at the moment this is only available via the Testing library for Dymola users.
The second problem is that there is no standardized way to translate and simulate models. Every tool has its own way to do that from scripts. So without knowing what tool you are using, it's not possible to give an exact answer.
What Modelica offers in the MSL
In the current Modelica Standard Library version 3.2.3 you can read the actual system time via Modelica.Utilities.System.getTime().
This small example shows how to use it:
function printSystemTime
protected
  Integer ms, s, min, h, d, mon, a;
algorithm
  (ms, s, min, h, d, mon, a) := Modelica.Utilities.System.getTime();
  Modelica.Utilities.Streams.print("Current time is: "+String(h)+":"+String(min)+":"+String(s));
end printSystemTime;
You see it gives the current system date and time via 7 return values. These variables are not very nice to deal with if you want to compute a time delta, as you will end up with 14 variables, each with its own value range.
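One way around the 14-variable problem is to collapse each 7-value reading into a single millisecond count and subtract the two counts. The arithmetic is language-neutral; it is sketched here in C rather than Modelica, with made-up function names, using the well-known days-from-civil-date calculation. It assumes both readings come from the same local clock and ignores time zone and daylight-saving changes between them.

```c
#include <stdint.h>

/* Days since 1970-01-01 for a Gregorian date (month 1..12, day 1..31). */
int64_t days_from_civil(int64_t y, int m, int d)
{
    y -= (m <= 2);                               /* count Jan/Feb as part of the previous year */
    int64_t era = (y >= 0 ? y : y - 399) / 400;  /* index of the 400-year cycle */
    int64_t yoe = y - era * 400;                 /* year within the cycle, 0..399 */
    int64_t doy = (153 * (m > 2 ? m - 3 : m + 9) + 2) / 5 + d - 1; /* day of March-based year */
    int64_t doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;           /* day within the cycle */
    return era * 146097 + doe - 719468;
}

/* Collapse one reading, in the argument order returned by getTime()
   (ms, s, min, h, d, mon, a), into milliseconds since 1970-01-01. */
int64_t to_millis(int ms, int s, int min, int h, int d, int mon, int a)
{
    int64_t days = days_from_civil(a, mon, d);
    return (((days * 24 + h) * 60 + min) * 60 + s) * 1000 + ms;
}
```

The time delta is then to_millis(stop reading) minus to_millis(start reading), a single integer in milliseconds that stays correct across midnight, month ends, and leap days.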
How to measure translation and simulation time in general
With getTime() you can build a function that:
1. reads the system time as t_start
2. translates the model and simulates for 0 seconds
3. reads the system time again as t_stop
4. computes the difference between t_start and t_stop.
Step 2 depends on the tool. In Dymola you would call
DymolaCommands.SimulatorAPI.simulateModel("path-to-model", 0, 0);
which translates your model and simulates it for 0 seconds, so it only runs the initialization section.
For Dymola users
The Testing library contains the function Testing.Utilities.Simulation.timing, which does almost exactly what you want.
To translate and simulate your model call it as follows:
Testing.Utilities.Simulation.timing(
"Modelica.Blocks.Examples.PID_Controller",
task=Testing.Utilities.Simulation.timing.Task.fullTranslate_simulate,
loops=3);
This will translate your model and simulate for 1 second three times and compute the average.
To simulate for 0s, duplicate the function and change this
if simulate then
_ :=simulateModel(c);
end if;
to
if simulate then
_ :=simulateModel(c, 0, 0);
end if;
Being new to GLSL shaders, I noticed on my old netbook that adding a single line to a perfectly running shader could suddenly multiply the execution time by thousands.
For example, this fragment shader runs instantly while limit's value is 32 or below, and takes 10 seconds to run once limit's value is 33:
void main()
{
    float limit=33.;//runs instantly if =32.
    float useless=0.5;
    for(float i=0.;i<limit;i++) useless=useless*useless;
    gl_FragColor=useless*vec4(1.,1.,1.,1.);
}
What confuses me as well is that adding one or more useless self-multiplications outside the 32-iteration loop does not cause the same sharp increase in time.
Here is an example without a for loop. It runs within a millisecond on my computer with 6 sin computations, and adding the seventh one suddenly makes the program take about 500 ms to run:
void main()
{
    float useless=gl_FragCoord.x;
    useless=sin(useless);
    useless=sin(useless);
    useless=sin(useless);
    useless=sin(useless);
    useless=sin(useless);
    useless=sin(useless);
    useless=sin(useless);//the straw that breaks the shader's back
    gl_FragColor=useless*vec4(1.,1.,1.,1.);
}
On a less outdated computer I own, the compilation time becomes too big before I can find such a breaking point.
On my netbook, I'd expect the running times to increase continuously as I add operations.
I'd like to know what causes those sudden leaps and, consequently, whether it's a problem I should address, given that I plan to target the widest reasonable Steam audience. If useful, here is the netbook I'm doing my tests on http://support.hp.com/ch-fr/document/c01949780 and its chipset http://ark.intel.com/products/36549/Intel-82945GSE-Graphics-and-Memory-Controller
Also I don't know if it matters but I'm using SFML to run shaders.
According to Intel, the GMA 950 supports shader model 2 in hardware and shader model 3 in software. According to Microsoft, shader model 2 has a rather harsh limit on instruction count (64 ALU and 32 tex instructions).
My guess would be that, when the shader exceeds this instruction count, the Intel driver decides to do shading in software, which would match the abysmal performance you're seeing.
The sin function might expand to multiple instructions. The loop likely gets unrolled, so a higher limit means a higher instruction count. Why adding the 33rd multiplication outside the loop does not trigger this, I don't know.
To decide whether you should fix this, I can recommend the Unity hardware stats and the Steam hardware survey. In short, I'd say that shader model 2 is nothing you need to support :)
I'd like to know if someone has experience in writing a HAL AudioUnit rendering callback that takes advantage of multi-core processors and/or symmetric multiprocessing.
My scenario is the following:
A single audio component of sub-type kAudioUnitSubType_HALOutput (together with its rendering callback) takes care of additively synthesizing n sinusoid partials with independent individually varying and live-updated amplitude and phase values. In itself it is a rather straightforward brute-force nested loop method (per partial, per frame, per channel).
However, upon reaching a certain upper limit for the number of partials "n", the processor gets overloaded and starts producing drop-outs, while three other processors remain idle.
Aside from the general discussion about additive synthesis being "processor expensive" compared to, say, "wavetable", I need to know whether this can be resolved the right way, by taking advantage of multiprocessing on a multi-processor or multi-core machine. Breaking the rendering thread into sub-threads does not seem the right way, since the render callback is already a time-constraint thread in itself, and the final output has to be sample-accurate in terms of latency. Has anyone had positive experience with, and valid methods for, resolving such an issue?
System: 10.7.x
CPU: quad-core i7
Thanks in advance,
CA
This is challenging because OS X is not designed for something like this. There is a single audio thread - it's the highest priority thread in the OS, and there's no way to create user threads at this priority (much less get the support of a team of systems engineers who tune it for performance, as with the audio render thread). I don't claim to understand the particulars of your algorithm, but if it's possible to break it up such that some tasks can be performed in parallel on larger blocks of samples (enabling absorption of periods of occasional thread starvation), you certainly could spawn other high priority threads that process in parallel. You'd need to use some kind of lock-free data structure to exchange samples between these threads and the audio thread. Convolution reverbs often do this to allow reasonable latency while still operating on huge block sizes. I'd look into how those are implemented...
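For the lock-free exchange between a worker thread and the audio thread, one common shape is a single-producer, single-consumer ring buffer. The following is a minimal sketch using C11 atomics, not code from any Apple framework; SampleRing, ring_init, ring_write and ring_read are names invented for this example, and it assumes exactly one writer thread and one reader thread.

```c
#include <stdatomic.h>
#include <stddef.h>

#define RING_CAPACITY 1024u  /* must be a power of two */

typedef struct {
    float data[RING_CAPACITY];
    atomic_size_t head;  /* advanced only by the producer (worker thread) */
    atomic_size_t tail;  /* advanced only by the consumer (audio thread)  */
} SampleRing;

void ring_init(SampleRing *r)
{
    atomic_init(&r->head, 0);
    atomic_init(&r->tail, 0);
}

/* Producer: stores up to n samples and returns how many were stored. */
size_t ring_write(SampleRing *r, const float *src, size_t n)
{
    size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    size_t free_slots = RING_CAPACITY - (head - tail);
    if (n > free_slots)
        n = free_slots;
    for (size_t i = 0; i < n; i++)
        r->data[(head + i) & (RING_CAPACITY - 1)] = src[i];
    /* Publish the samples only after they have been written. */
    atomic_store_explicit(&r->head, head + n, memory_order_release);
    return n;
}

/* Consumer: copies out up to n samples and returns how many were read.
   It never blocks, so it can be called from the render callback. */
size_t ring_read(SampleRing *r, float *dst, size_t n)
{
    size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    size_t available = head - tail;
    if (n > available)
        n = available;
    for (size_t i = 0; i < n; i++)
        dst[i] = r->data[(tail + i) & (RING_CAPACITY - 1)];
    /* Release the slots only after they have been copied out. */
    atomic_store_explicit(&r->tail, tail + n, memory_order_release);
    return n;
}
```

If ring_read returns fewer samples than the callback needs, the worker has fallen behind; the callback would then have to fill the remainder with silence or with samples it computes itself.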
Have you looked into the Accelerate.framework? You should be able to improve the efficiency by performing operations on vectors instead of using nested for-loops.
If you have vectors (of length n) for the sinusoidal partials, the amplitude values, and the phase values, you could apply a vDSP_vadd or vDSP_vmul operation, then vDSP_sve.
As far as I know, AU threading is handled by the host. A while back, I tried a few ways to multithread an AU render using various methods (GCD, OpenCL, etc.), and they were all either a no-go or unpredictable. There is (or at least was; I have not checked recently) a built-in AU called 'deferred renderer', I believe, and it threads the input and output separately, but I seem to remember that there was latency involved, so that might not help.
Also, if you are testing in AULab, I believe that it is set up specifically to only call on a single thread (I think that is still the case), so you might need to tinker with another test host to see if it still chokes when the load is distributed.
Sorry I couldn't help more, but I thought those few bits of info might be helpful.
Sorry for replying to my own question; I don't know another way of adding some relevant information. Edit doesn't seem to work, and a comment is way too short.
First of all, sincere thanks to jtomschroeder for pointing me to the Accelerate.framework.
This would perfectly work for so called overlap/add resynthesis based on IFFT. Yet I haven't found a key to vectorizing the kind of process I'm using which is called "oscillator-bank resynthesis", and is notorious for its processor taxing (F.R. Moore: Elements of Computer Music). Each momentary phase and amplitude has to be interpolated "on the fly" and last value stored into the control struct for further interpolation. Direction of time and time stretch depend on live input. All partials don't exist all the time, placement of breakpoints is arbitrary and possibly irregular. Of course, my primary concern is organizing data in a way to minimize the number of math operations...
If someone could point me at an example of positive practice, I'd be very grateful.
// Here's the simplified code snippet:
OSStatus AdditiveRenderProc(
    void *inRefCon,
    AudioUnitRenderActionFlags *ioActionFlags,
    const AudioTimeStamp *inTimeStamp,
    UInt32 inBusNumber,
    UInt32 inNumberFrames,
    AudioBufferList *ioData)
{
    // local variables' declaration and behaviour-setting conditional statements
    // some local variables are here for debugging convenience
    // {... ... ...}
    // Get the time-breakpoint parameters out of the gen struct
    AdditiveGenerator *gen = (AdditiveGenerator*)inRefCon;
    // compute interpolated values for each partial's each frame
    // {deltaf[p]... ampf[p][frame]... ...}
    //here comes the brute-force "processor eater" (single channel only!)
    Float32 *buf = (Float32 *)ioData->mBuffers[channel].mData;
    for (UInt32 frame = 0; frame < inNumberFrames; frame++)
    {
        buf[frame] = 0.;
        for(UInt32 p = 0; p < candidates; p++){
            if(gen->partialFrequencyf[p] < NYQUISTF)
                buf[frame] += sinf(phasef[p]) * ampf[p][frame];
            phasef[p] += (gen->previousPartialPhaseIncrementf[p] + deltaf[p]*frame);
            if (phasef[p] > TWO_PI) phasef[p] -= TWO_PI;
        }
        buf[frame] *= ovampf[frame];
    }
    for(UInt32 p = 0; p < candidates; p++){
        //store the updated parameters back to the gen struct
        //{... ... ...}
        ;
    }
    return noErr;
}
Can you give me some tips to optimize this CUDA code?
I'm running this on a device with compute capability 1.3 (I need it for a Tesla C1060, although I'm testing it now on a GTX 260, which has the same compute capability), and I have several kernels like the one below. The number of threads I need to execute this kernel is given by long SUM and depends on size_t M and size_t N, the dimensions of a rectangular image received as a parameter. The image size can vary greatly, from 50x50 to 10000x10000 pixels or more, although I'm mostly interested in processing the bigger images with CUDA.
Now each image has to be traced in all directions and angles, and some computations must be done over the values extracted from the tracing. So, for example, for a 500x500 image I need 229080 threads computing the kernel below; that number is the value of SUM (that's why I check that the thread id idHilo doesn't go over it). I copied several arrays, all of length SUM, into the global memory of the device one after another, since I need to access them for the calculations. Like this:
cudaMemcpy(xb_cuda,xb_host,(SUM*sizeof(long)),cudaMemcpyHostToDevice);
cudaMemcpy(yb_cuda,yb_host,(SUM*sizeof(long)),cudaMemcpyHostToDevice);
...etc
So each value of every array can be accessed by one thread. All are done before the kernel calls. According to the Cuda Profiler on Nsight the highest memcopy duration is 246.016 us for a 500x500 image so that is not taking so long.
But the kernels like the one I copied below are taking too long for any practical use (3.25 seconds according to the Cuda profiler for the kernel below for a 500x500 image and 5.052 seconds for the kernel with the highest duration) so I need to see if I can optimize them.
I arrange the data this way
First the block dimension
dim3 dimBlock(256,1,1);
then the number of blocks per Grid
dim3 dimGrid((SUM+255)/256);
For a number of 895 blocks for a 500x500 image.
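The block count follows from rounding SUM up to a whole number of blocks. A small host-side sketch of that arithmetic (blocks_for is a name made up here, not part of the CUDA API):

```c
#include <assert.h>

/* Number of blocks needed so that blocks * threads_per_block >= threads_needed. */
unsigned long blocks_for(unsigned long threads_needed, unsigned long threads_per_block)
{
    assert(threads_per_block > 0);
    return (threads_needed + threads_per_block - 1) / threads_per_block;
}
```

With 229080 threads and 256 threads per block this gives 895 blocks, i.e. 229120 launched threads, so the last 40 threads fall outside SUM and are the ones the idHilo<SUM check filters out.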
I'm not sure how to use coalescing and shared memory in my case or even if it's a good idea to call the kernel several times with different portions of the data. The data is independent one from the other so I could in theory call that kernel several times and not with the 229080 threads all at once if needs be.
Now take into account that the outer for loop
for(t=15;t<=tendbegin_cuda[idHilo]-15;t++){
depends on
tendbegin_cuda[idHilo]
the value of which depends on each thread but most threads have similar values for it.
According to the CUDA Profiler, the Global Store Efficiency is 0.619 and the Global Load Efficiency is 0.951 for this kernel. Other kernels have similar values.
Is that good? Bad? How can I interpret those values? Sadly, devices of compute capability 1.3 don't provide other useful info for assessing the code, like the Multiprocessor and Kernel Memory or Instruction analysis. The only results I get after the analysis are "Low Global Memory Store Efficiency" and "Low Global Memory Load Efficiency", but I'm not sure how I can optimize those.
void __global__ t21_trazo(long SUM,int cT, double Bn, size_t M, size_t N, float* imagen_cuda, double* vector_trazo_cuda, long* xb_cuda, long* yb_cuda, long* xinc_cuda, long* yinc_cuda, long* tbegin_cuda, long* tendbegin_cuda){
    long xi;
    long yi;
    int t;
    int k;
    int a;
    int ji;
    long idHilo=blockIdx.x*blockDim.x+threadIdx.x;
    int neighborhood[31];
    int v=0;
    if(idHilo<SUM){
        for(t=15;t<=tendbegin_cuda[idHilo]-15;t++){
            xi = xb_cuda[idHilo] + floor((double)t*xinc_cuda[idHilo]);
            yi = yb_cuda[idHilo] + floor((double)t*yinc_cuda[idHilo]);
            neighborhood[v]=floor(xi/Bn);
            ji=floor(yi/Bn);
            if(fabs((double)neighborhood[v]) < M && fabs((double)ji)<N)
            {
                if(tendbegin_cuda[idHilo]>30 && v==30){
                    if(t==0)
                        vector_trazo_cuda[20+idHilo*31]=0;
                    for(k=1;k<=15;k++)
                        vector_trazo_cuda[20+idHilo*31]=vector_trazo_cuda[20+idHilo*31]+fabs(imagen_cuda[ji*M+(neighborhood[v-(15+k)])]-
                            imagen_cuda[ji*M+(neighborhood[v-(15-k)])]);
                    for(a=0;a<30;a++)
                        neighborhood[a]=neighborhood[a+1];
                    v=v-1;
                }
                v=v+1;
            }
        }
    }
}
EDIT:
Changing the DP flops for SP flops only slightly improved the duration. Loop unrolling the inner loops practically didn't help.
Sorry for the unstructured answer, I'm just going to throw out some generally useful comments with references to your code to make this more useful to others.
Algorithm changes are always number one for optimizing. Is there another way to solve the problem that requires less math, fewer iterations, less memory, etc.?
If precision is not a big concern, use single-precision floating point (or half precision on newer architectures). Part of the reason switching didn't affect your performance much when you briefly tried it is that you were still doing double-precision calculations on your float data: fabs takes a double, so if you call it with a float, the float is converted to a double, the math is done in double, and the double result is converted back to float. Use fabsf instead.
If you don't need the full precision of float, use fast math (a compiler option).
Multiply is much faster than divide (especially for full precision/non-fast math). Calculate 1/var outside the kernel and then multiply instead of dividing inside the kernel.
I don't know if it gets optimized out, but you should use increment and decrement operators. v=v-1; could be v--; etc.
Casting to an int truncates toward zero. floor() rounds toward negative infinity. You probably don't need the explicit floor() (and if you do, use floorf() for float, as above). When you apply it to intermediate computations on integer types, they are already truncated, so you are converting to double and back for no reason. Use the appropriately typed function (abs, fabs, fabsf, etc.).
if(fabs((double)neighborhood[v]) < M && fabs((double)ji)<N)
change to
if(abs(neighborhood[v]) < M && abs(ji)<N)
vector_trazo_cuda[20+idHilo*31]=vector_trazo_cuda[20+idHilo*31]+
fabs(imagen_cuda[ji*M+(neighborhood[v-(15+k)])]-
imagen_cuda[ji*M+(neighborhood[v-(15-k)])]);
change to
vector_trazo_cuda[20+idHilo*31] +=
fabsf(imagen_cuda[ji*M+(neighborhood[v-(15+k)])]-
imagen_cuda[ji*M+(neighborhood[v-(15-k)])]);
Similarly,
xi = xb_cuda[idHilo] + floor((double)t*xinc_cuda[idHilo]);
change to
xi = xb_cuda[idHilo] + t*xinc_cuda[idHilo];
The above line is needlessly complicated. In essence you are doing this,
convert t to double,
convert xinc_cuda to double and multiply,
floor it (returns double),
convert xb_cuda to double and add,
convert to long.
The new line stores the same result in much, much less time (and is also safer: once values exceed what a double can represent exactly, the original version would round to the nearest representable value). Also, the global loads in those lines (xb_cuda[idHilo], xinc_cuda[idHilo], and their y counterparts) do not depend on t, so read them into local variables once, before the for loop, instead of on every iteration. Together, I wouldn't be surprised if this cuts your run time by a factor of 10-30.
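The distinction between truncation and floor only matters for negative values, which is worth checking before dropping a floor() call. A minimal host-side sketch in plain C (the function names are invented for this example; floor_div is written with integer arithmetic so that no math library is needed):

```c
#include <assert.h>

/* Integer division in C truncates toward zero. */
long trunc_div(long a, long b)
{
    assert(b != 0);
    return a / b;
}

/* Floor division rounds toward negative infinity, like floor(a / (double)b). */
long floor_div(long a, long b)
{
    assert(b != 0);
    long q = a / b;
    /* Step down by one when there is a remainder and the signs differ. */
    if ((a % b != 0) && ((a < 0) != (b < 0)))
        q -= 1;
    return q;
}

/* Converting a double to long also truncates toward zero. */
long cast_to_long(double x)
{
    return (long)x;
}
```

For example, trunc_div(-7, 2) is -3 while floor_div(-7, 2) is -4, and cast_to_long(-2.5) is -2; for non-negative operands the two roundings agree. So where the kernel's coordinates can go negative, replacing floor() with a cast would change the result, and where they are always non-negative it would not.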
Your structure results in a lot of global memory reads, try to read once from global, handle calculations on local memory, and write once to global (if at all possible).
Compile with -lineinfo always. It makes profiling easier, and I haven't been able to measure any overhead whatsoever (using kernels in the 0.1 to 10 ms execution time range).
Figure out with the profiler if you're compute or memory bound and devote time accordingly.
Try to allow the compiler to use registers when possible; this is a big topic.
As always, don't change everything at once. I typed all this out without compiling or testing, so I may have made an error.
You may be running too many threads simultaneously. The optimum performance seems to come when you run the right number of threads: enough threads to keep busy, but not so many as to over-fragment the local memory available to each simultaneous thread.
Last fall I built a tutorial to investigate optimization of the Travelling Salesman problem (TSP) using CUDA with CUDAFY. The steps I went through in achieving a several-times speed-up from a published algorithm may be useful in guiding your endeavours, even though the problem domain is different. The tutorial and code is available at CUDA Tuning with CUDAFY.
I need to do some timing to compare the performance of some Fortran vs. C code.
In C I can get both user time and system time independently.
When using gfortran's cpu_time(), what does it represent?
With IBM's Fortran compiler one can choose what to output by setting an environment variable (see CPU_TIME()).
I found no reference to something similar in gfortran's documentation.
So, does anybody know if gfortran's cpu_time() returns user time, system time, or the sum of both?
Gfortran CPU_TIME returns the sum of the user and system time.
On MINGW it uses GetProcessTimes(), on other platforms getrusage() or if getrusage() is not available, times().
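For the C side of the comparison, the same sum can be formed from getrusage() on POSIX systems. This is a minimal sketch, not libgfortran's actual code; CpuTimes, process_cpu_times and cpu_time_total are names invented for the example.

```c
#include <sys/time.h>
#include <sys/resource.h>

typedef struct {
    double user;  /* seconds spent executing in user mode   */
    double sys;   /* seconds spent executing in kernel mode */
} CpuTimes;

/* Read user and system CPU time for the calling process.
   Both fields stay at zero if getrusage() fails. */
CpuTimes process_cpu_times(void)
{
    CpuTimes t = {0.0, 0.0};
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) == 0) {
        t.user = (double)ru.ru_utime.tv_sec + (double)ru.ru_utime.tv_usec / 1e6;
        t.sys  = (double)ru.ru_stime.tv_sec + (double)ru.ru_stime.tv_usec / 1e6;
    }
    return t;
}

/* The quantity to compare against Fortran's CPU_TIME: user plus system. */
double cpu_time_total(const CpuTimes *t)
{
    return t->user + t->sys;
}
```

Calling process_cpu_times() before and after the code under test and subtracting field by field gives user and system time separately; adding the two differences gives a figure comparable to a CPU_TIME difference.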
See
http://gcc.gnu.org/git/?p=gcc.git;a=blob;f=libgfortran/intrinsics/cpu_time.c;h=619f8d25246409e0f32c96299db724213aa62b45;hb=refs/heads/master
and
http://gcc.gnu.org/git/?p=gcc.git;a=blob;f=libgfortran/intrinsics/time_1.h;h=12d79ebc12fecf52baa0895c7ab8accc41dab500;hb=refs/heads/master
FWIW, if you wish to measure the wallclock time rather than CPU time, please use the SYSTEM_CLOCK intrinsic instead of CPU_TIME.
My guess: the total of user and system time, otherwise it would be mentioned? It probably depends on the OS anyway; maybe not all of them make the distinction. As far as I know, CPU time is the time which the OS assigns to your process, be it in user mode or in kernel mode executed on behalf of the process.
Is it important for you to have that distinction?
For performance comparison, I would probably go for wall-time anyway, and use CPU time to guess how much I/O it is doing by subtracting it from the wall-time.
If you need wallclock time, you may use date_and_time, http://gcc.gnu.org/onlinedocs/gcc-4.0.2/gfortran/DATE_005fAND_005fTIME.html
I'm not sure how standard it is, but in my experience it works on at least four different platforms, including exotic Cray designs.
One gotcha here is to take care of the midnight, like this:
character*8 :: date
character*10 :: time
character*5 :: zone
integer :: tvalues(8)
real*8 :: time_prev, time_curr, time_elapsed, time_limit, dt
integer :: hr_curr, hr_prev
! set the clock
call date_and_time(date, time, zone, tvalues)
time_curr = tvalues(5)*3600.d0 + tvalues(6)*60.d0 + tvalues(7) ! seconds
hr_curr = tvalues(5)
time_prev=0.d0; time_elapsed = 0.d0; hr_prev = 0
!... do something...
time_prev = time_curr; hr_prev = hr_curr
call date_and_time(date, time, zone, tvalues)
time_curr = tvalues(5)*3600.d0 + tvalues(6)*60.d0 + tvalues(7) ! seconds
hr_curr = tvalues(5)
dt = time_curr - time_prev
if( hr_curr < hr_prev )dt = dt + 24*3600.d0 ! across the midnight
time_elapsed = time_elapsed + dt
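The midnight correction can be isolated into a small function. The sketch below is in C and is a variation on the Fortran above rather than a translation: it tests the sign of the difference instead of comparing the hour fields, and it also accepts milliseconds. The function names are invented for this example, and it assumes the two readings are less than 24 hours apart.

```c
/* Seconds since local midnight from hour, minute, second and millisecond fields. */
double seconds_of_day(int h, int min, int s, int ms)
{
    return h * 3600.0 + min * 60.0 + s + ms / 1000.0;
}

/* Elapsed seconds between two time-of-day readings taken less than 24 h apart.
   A negative raw difference means the clock passed midnight in between. */
double elapsed_across_midnight(double prev, double curr)
{
    double dt = curr - prev;
    if (dt < 0.0)
        dt += 24 * 3600.0;
    return dt;
}
```

For instance, a reading at 23:59:58 followed by one at 00:00:03.500 yields 5.5 seconds instead of a large negative number.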
@Emanual Ey - In continuation of your comment on @steabert's post (what follows goes for Intel's compiler; I don't know whether something differs on other compilers): user CPU time + system CPU time should equal CPU time. Elapsed, real, or "wall clock" time should be greater than the total charged CPU time. To measure wallclock time, it is best to put the time command before and after the tricky part. Ugh, I'm going to make this more complicated than it should be. Could you read the part on timing your application in Intel's manual (you'll have to find "Timing your application" in the index)? It should clear up a few things.
As I said before, that goes for Intel's. I don't have access to gfortran's compiler.