I'm looking for a way to occupy exactly 80% (or any other number) of a single CPU in a consistent manner.
I need this for a unit test of a component that triggers under specific CPU-utilization conditions.
For this purpose I can assume that the machine is otherwise idle.
What's a robust and, ideally, OS-independent way to do this?
There is no such thing as occupying the CPU 80% of the time: at any instant the CPU is either busy or idle. Over some period of time, you can get the average usage to be 80%. Is there a specific time period you want it to be averaged over? This Python loop should work across platforms and average about 80% CPU usage over each one-second window:
import math
import time

while True:
    start_time = time.perf_counter()
    # stay busy for 0.8 s of every 1 s window...
    while time.perf_counter() - start_time < 0.8:
        math.factorial(100)  # or any other computation here
    # ...then sleep for the remaining 0.2 s
    time.sleep(0.2)
It's pretty easy to write a program that alternately spins and sleeps to hit any particular load level you want. I threw this together in a couple of minutes:
#include <stdlib.h>
#include <signal.h>
#include <string.h>
#include <time.h>
#include <sys/time.h>
#define INTERVAL 500000 /* timer period in microseconds (0.5 s) */
volatile sig_atomic_t flag;
void setflag(int sig) { flag = 1; }
int main(int ac, char **av) {
int load = 80;
struct sigaction sigact;
struct itimerval interval = { { 0, INTERVAL }, { 0, INTERVAL } };
struct timespec pausetime = { 0, 0 };
memset(&sigact, 0, sizeof(sigact));
sigact.sa_handler = setflag;
sigaction(SIGALRM, &sigact, 0);
setitimer(ITIMER_REAL, &interval, 0);
if (ac == 2) load = atoi(av[1]);
pausetime.tv_nsec = INTERVAL*(100 - load)*10; /* idle fraction of each period, converted from us to ns */
while (1) {
flag = 0;
nanosleep(&pausetime, 0);
while (!flag) { /* spin */ } }
return 0;
}
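For example (assuming the file above is saved as cpuload.c), it can be built and run like this:
gcc -O2 -o cpuload cpuload.c
./cpuload 80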
The trick is that if you want to occupy 80% of the CPU, you keep the processor busy for 80% of some interval (say 0.8 seconds out of every second) and let it sleep for the remaining 0.2 seconds. Be aware that driving utilization that high will make everything else on the machine run slowly, so for testing you may want a lower figure, say around 20%.
Here is an example done in Python:
import time
import math

time_of_run = 0.1
percent_cpu = 80  # should ideally be a smaller number, e.g. 20
cpu_time_utilisation = float(percent_cpu) / 100
on_time = time_of_run * cpu_time_utilisation
off_time = time_of_run * (1 - cpu_time_utilisation)

while True:
    start_time = time.perf_counter()  # time.clock() was removed in Python 3.8
    while time.perf_counter() - start_time < on_time:
        math.factorial(100)  # do any computation here
    time.sleep(off_time)
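Note that this loads only whichever core the scheduler happens to place the process on. If the test really needs a specific CPU held at 80%, pin the process first; that part is inherently OS-specific (on Linux, for example, by starting the script under taskset).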
Related
This is my CUDA code:
#include<stdio.h>
#include<stdint.h>
#include <chrono>
#include <cuda.h>
__global__ void test(int base, int* out)
{
int curTh = threadIdx.x+blockIdx.x*blockDim.x;
{
int tmp = base * curTh;
#pragma unroll
for (int i = 0; i<1000*1000*100; ++i) {
tmp *= tmp;
}
out[curTh] = tmp;
}
}
typedef std::chrono::high_resolution_clock Clock;
int main(int argc, char *argv[])
{
cudaStream_t stream;
cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);
int data = rand();
int* d_out;
void* va_args[10] = {&data, &d_out};
int nth = 10;
if (argc > 1) {
nth = atoi(argv[1]);
}
int NTHREADS = 128;
printf("nth: %d\n", nth);
cudaMalloc(&d_out, nth*sizeof(int));
for (int i = 0; i < 10; ++i) {
auto start = Clock::now();
cudaLaunchKernel((const void*) test,
nth>NTHREADS ? nth/NTHREADS : 1,
nth>NTHREADS ? NTHREADS : nth, va_args, 0, stream);
cudaStreamSynchronize(stream);
printf("use :%ldms\n", (Clock::now()-start)/1000/1000);
}
cudaDeviceReset();
printf("host Hello World from CPU!\n");
return 0;
}
I compiled my code and ran it on a 2080 Ti. I found that the elapsed time per launch stays around 214 ms even when the thread count is 3 times the number of GPU cores (4352 on a 2080 Ti):
root@d114:~# ./cutest 1
nth: 1
use :255ms
use :214ms
use :214ms
use :214ms
use :214ms
use :214ms
use :214ms
use :214ms
use :214ms
use :214ms
root@d114:~# ./cutest 13056
nth: 13056
use :272ms
use :223ms
use :214ms
use :214ms
use :214ms
use :214ms
use :214ms
use :214ms
use :214ms
use :214ms
root@d114:~# ./cutest 21760
nth: 21760
use :472ms
use :424ms
use :424ms
use :424ms
use :424ms
use :424ms
use :424ms
use :424ms
use :424ms
use :428ms
So my question is: why does the elapsed time stay the same when the number of threads increases to 3 times the number of GPU cores?
Does that mean the NVIDIA GPU's computing power is 3 times its core count?
Even though the GPU pipeline can issue a new instruction at a rate of one per cycle, it can overlap the instructions of multiple threads, at least 3-4 deep for simple math operations, so an increased number of threads only adds a few cycles of extra latency per thread. But as is visible at nth=21760, issuing even more of the same instruction eventually fills the pipeline completely and threads start waiting.
21760 / 13056 = 1.67, yet 424 ms / 214 ms = 1.98.
This difference between the ratios most likely comes from the tail effect. When the pipelines are fully filled, adding even a small amount of extra work doubles the latency, because the new work is computed as a second wave that starts only after all the others have completed: every thread executes exactly the same instructions, so the waves finish in lock-step. You could add some more threads and it should stay at 424 ms until you get a third wave of waiting threads; since there is no branching between threads, they behave like blocks that wait as a unit.
The loop of 100 million iterations with a complete dependency chain also limits memory traffic: a single memory operation per 100 M iterations means the card's memory bandwidth is barely used.
The kernel is therefore neither compute- nor memory-bound (if you don't count the integer multiplication, which has no latency hiding within its own thread, as real computation). Because of this, all SM units of the GPU run with nearly the same timing, apart from a thread-launch latency that grows linearly with the number of threads but is negligible next to the 100 M-iteration loop.
When the algorithm is a real-world one that uses multiple parts of the pipeline (not just integer multiplication), an SM can find more threads to overlap in the pipeline. For example, if an SM supports 1024 threads per block (with at most 2 blocks in flight) but has only 128 execution pipelines, there are 2048 / 128 = 16 slots in which to overlap operations such as reading main memory, floating-point multiply/add, reading the constant cache, register shuffles, and so on, which lets it finish a task more quickly.
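If you want to see exactly where the next wave begins on your card, a simple sweep over the thread count makes the steps visible. Here is a minimal sketch (my own rewrite, not your program: spin() repeats the same dependent-multiply loop as your test() kernel, and CORES = 4352 is the 2080 Ti core count from your question):
#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>
// Same kind of dependent integer-multiply loop as the test() kernel above.
__global__ void spin(int base, int n, int *out) {
    int t = threadIdx.x + blockIdx.x * blockDim.x;
    if (t >= n) return;
    int tmp = base * t;
    for (int i = 0; i < 1000 * 1000 * 100; ++i) tmp *= tmp;
    out[t] = tmp;
}
int main() {
    const int NTHREADS = 128;
    const int CORES = 4352;          // 2080 Ti core count, taken from the question
    for (int nth = CORES; nth <= 8 * CORES; nth += CORES) {
        int *d_out;
        cudaMalloc(&d_out, nth * sizeof(int));
        auto start = std::chrono::high_resolution_clock::now();
        spin<<<(nth + NTHREADS - 1) / NTHREADS, NTHREADS>>>(1, nth, d_out);
        cudaDeviceSynchronize();
        auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                      std::chrono::high_resolution_clock::now() - start).count();
        printf("nth=%6d -> %lldms\n", nth, (long long)ms);
        cudaFree(d_out);
    }
    return 0;
}
Based on your numbers, the printed time should stay near 214 ms up to roughly 3 x 4352 threads and then step up to around 424 ms for the next wave.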
So, I am using the LTC 2366, a 3 MSPS ADC, and with the code given below I was able to achieve a sampling rate of about 380 kSPS.
#include <stdio.h>
#include <time.h>
#include <bcm2835.h>
int main(int argc, char**argv) {
FILE *f_0 = fopen("adc_test.dat", "w");
clock_t start, end;
double time_taken;
if (!bcm2835_init()) {
return 1;
}
bcm2835_spi_begin();
bcm2835_spi_setBitOrder(BCM2835_SPI_BIT_ORDER_MSBFIRST);
bcm2835_spi_setDataMode(BCM2835_SPI_MODE0);
bcm2835_spi_setClockDivider(32);
bcm2835_spi_chipSelect(BCM2835_SPI_CS0);
bcm2835_spi_setChipSelectPolarity(BCM2835_SPI_CS0, LOW);
int i;
char buf_0[3] = {0x01, (0x08|0)<<4, 0x00}; // really doesn't matter what this is
char readBuf_0[2];
start = clock();
for (i=0; i<380000; i++) {
bcm2835_spi_transfernb(buf_0, readBuf_0, 2);
fprintf(f_0, "%d\n", (readBuf_0[0]<<6) + (readBuf_0[1]>>2));
}
end = clock();
time_taken = ((double)(end-start)/CLOCKS_PER_SEC);
printf("%f", (double)(time_taken));
printf(" seconds \n");
bcm2835_spi_end();
bcm2835_close();
return 0;
}
This returns about 1 second every time.
When I use the exact same code with the LTC 2315, I still get a sampling rate of about 380 kSPS. How come? First of all, why is the 3 MSPS ADC giving me only 380 kSPS and not something like 2 MSPS? Second, when I change the ADC to one that's about 70% faster, why do I get the same sampling rate? Is that the limit of the Pi? Is there any way of improving this to get at least 1 MSPS?
Thank you
I have tested the Raspberry Pi SPI a bit and found out that it has some overhead. In my case, I tried pyspi, where one byte seems to take at least 15 µs, with 75 µs between two words (see these captures). That's slower than what you measure, so good for you!
Increasing the SPI clock shortens the exchange itself, but not the overhead; the total time per transfer therefore barely changes once the overhead is the limiting factor. 380 kSPS means about 2.6 µs per sample, which may well be close to your overhead.
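As a rough sanity check (assuming a Pi whose SPI is clocked from a 250 MHz core clock, so divider 32 gives 7.8125 MHz): shifting the 16 bits of one sample takes 16 / 7.8125 MHz ≈ 2.05 µs, and 1 / 380 kSPS ≈ 2.63 µs per sample, leaving only about 0.6 µs of per-transfer overhead; even with zero overhead that SPI clock would cap the loop at roughly 488 kSPS, nowhere near 3 MSPS.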
The easier way to improve the ADC speed would be to use a parallel ADC instead of a serial one - that has the potential to increase the overall speed to 20 MSPS+.
I'd like to measure the time a bit of code within my kernel takes. I've followed this question along with its comments so that my kernel looks something like this:
__global__ void kernel(..., long long int *runtime)
{
long long int start = 0;
long long int stop = 0;
asm volatile("mov.u64 %0, %%clock64;" : "=l"(start));
/* Some code here */
asm volatile("mov.u64 %0, %%clock64;" : "=l"(stop));
runtime[threadIdx.x] = stop - start;
...
}
The answer says to do a conversion as follows:
The timers count the number of clock ticks. To get the number of milliseconds, divide this by the number of GHz on your device and multiply by 1000.
For which I do:
for(long i = 0; i < size; i++)
{
fprintf(stdout, "%d:%ld=%f(ms)\n", i,runtime[i], (runtime[i]/1.62)*1000.0);
}
Where 1.62 is the GPU Max Clock rate of my device. But the time I get in milliseconds does not look right because it suggests that each thread took minutes to complete. This cannot be correct as execution finishes in less than a second of wall-clock time. Is the conversion formula incorrect or am I making a mistake somewhere? Thanks.
The correct conversion in your case is not GHz:
fprintf(stdout, "%d:%ld=%f(ms)\n", i, runtime[i], (runtime[i]/1.62)*1000.0);
but hertz:
fprintf(stdout, "%d:%ld=%f(ms)\n", i, runtime[i], (runtime[i]/1620000000.0f)*1000.0);
In the dimensional analysis:
clock cycles / (clock cycles / second) = seconds
the first term is the clock-cycle measurement, the second term is the frequency of the GPU (in hertz, not GHz), and the result is the desired measurement in seconds. You can convert to milliseconds by multiplying the seconds by 1000.
Here's a worked example that shows a device-independent way to do it (so you don't have to hard-code the clock frequency):
$ cat t1306.cu
#include <stdio.h>
const long long delay_time = 1000000000;
const int nthr = 1;
const int nTPB = 256;
__global__ void kernel(long long *clocks){
int idx=threadIdx.x+blockDim.x*blockIdx.x;
long long start=clock64();
while (clock64() < start+delay_time);
if (idx < nthr) clocks[idx] = clock64()-start;
}
int main(){
int peak_clk = 1;
int device = 0;
long long *clock_data;
long long *host_data;
host_data = (long long *)malloc(nthr*sizeof(long long));
cudaError_t err = cudaDeviceGetAttribute(&peak_clk, cudaDevAttrClockRate, device);
if (err != cudaSuccess) {printf("cuda err: %d at line %d\n", (int)err, __LINE__); return 1;}
err = cudaMalloc(&clock_data, nthr*sizeof(long long));
if (err != cudaSuccess) {printf("cuda err: %d at line %d\n", (int)err, __LINE__); return 1;}
kernel<<<(nthr+nTPB-1)/nTPB, nTPB>>>(clock_data);
err = cudaMemcpy(host_data, clock_data, nthr*sizeof(long long), cudaMemcpyDeviceToHost);
if (err != cudaSuccess) {printf("cuda err: %d at line %d\n", (int)err, __LINE__); return 1;}
printf("delay clock cycles: %ld, measured clock cycles: %ld, peak clock rate: %dkHz, elapsed time: %fms\n", delay_time, host_data[0], peak_clk, host_data[0]/(float)peak_clk);
return 0;
}
$ nvcc -arch=sm_35 -o t1306 t1306.cu
$ ./t1306
delay clock cycles: 1000000000, measured clock cycles: 1000000210, peak clock rate: 732000kHz, elapsed time: 1366.120483ms
$
This uses cudaDeviceGetAttribute to get the clock rate, which returns a result in kHz, which allows us to easily compute milliseconds in this case.
In my experience, the above method works generally well on datacenter GPUs that have the clock rate running at the reported rate (may be affected by settings you make in nvidia-smi.) Other GPUs such as GeForce GPUs may be running at (unpredictable) boost clocks that will make this method inaccurate.
Also, more recently, CUDA has the ability to preempt activity on the GPU. This can come about in a variety of circumstances, such as debugging, CUDA dynamic parallelism, and other situations. If preemption occurs for whatever reason, attempting to measure anything based on clock64() is generally not reliable.
clock64 returns a value in graphics clock cycles. The graphics clock is dynamic so I would not recommend using a constant to try to convert to seconds. If you want to convert to wall time then the better option is to use globaltimer, which is a 64-bit clock register accessible as:
asm volatile("mov.u64 %0, %%globaltimer;" : "=l"(start));
The unit is in nanoseconds.
The default resolution is 32ns with update every µs. The NVIDIA performance tools force the update to every 32 ns (or 31.25 MHz). This clock is used by CUPTI for start time when capturing concurrent kernel trace.
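For illustration, here is a minimal sketch of timing a region of device code with %globaltimer (the inline asm line is the one given above; the clock64()-based delay loop is just a hypothetical stand-in for the code being measured):
#include <cstdio>
#include <cuda_runtime.h>
__global__ void timed_kernel(long long *elapsed_ns)
{
    long long start, stop;
    asm volatile("mov.u64 %0, %%globaltimer;" : "=l"(start));
    // stand-in for "some code here": busy-wait roughly 10^6 device clock ticks
    long long t0 = clock64();
    while (clock64() - t0 < 1000000) { }
    asm volatile("mov.u64 %0, %%globaltimer;" : "=l"(stop));
    elapsed_ns[threadIdx.x] = stop - start;   // wall time in nanoseconds
}
int main()
{
    long long *d_ns, h_ns[32];
    cudaMalloc(&d_ns, 32 * sizeof(long long));
    timed_kernel<<<1, 32>>>(d_ns);
    cudaMemcpy(h_ns, d_ns, 32 * sizeof(long long), cudaMemcpyDeviceToHost);
    printf("thread 0: %lld ns (%.3f ms)\n", h_ns[0], h_ns[0] / 1e6);
    cudaFree(d_ns);
    return 0;
}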
My question is about understanding the Linux perf tool's metrics. I made an optimisation related to prefetching/cache misses in my code, and it is now faster. However, perf does not show me that (or, more likely, I do not understand what perf shows me).
To take it back to where it all began: I did an investigation in order to speed up random memory accesses using prefetching.
Here is what my program does:
It uses two int buffers of the same size
It reads one-by-one all the values of the first buffer
each value is a random index in the second buffer
It reads the value at the index in the second buffer
It sums all the values taken from the second buffer
It does all the previous steps for bigger and bigger prefetch steps
At the end, I print the number of voluntary and involuntary CPU context switches
After my last round of tuning, my code is the following:
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <limits.h>
#include <sys/time.h>
#include <math.h>
#include <sched.h>
#define BUFFER_SIZE ((unsigned long) 4096 * 50000)
#define PADDING 256
unsigned int randomUint()
{
int value = rand() % UINT_MAX;
return value;
}
unsigned int * createValueBuffer()
{
unsigned int * valueBuffer = (unsigned int *) malloc(BUFFER_SIZE * sizeof(unsigned int));
for (unsigned long i = 0 ; i < BUFFER_SIZE ; i++)
{
valueBuffer[i] = randomUint();
}
return (valueBuffer);
}
unsigned int * createIndexBuffer()
{
unsigned int * indexBuffer = (unsigned int *) malloc((BUFFER_SIZE + PADDING) * sizeof(unsigned int));
for (unsigned long i = 0 ; i < BUFFER_SIZE ; i++)
{
indexBuffer[i] = rand() % BUFFER_SIZE;
}
return (indexBuffer);
}
double computeSum(unsigned int * indexBuffer, unsigned int * valueBuffer, unsigned short prefetchStep)
{
double sum = 0;
for (unsigned int i = 0 ; i < BUFFER_SIZE ; i++)
{
__builtin_prefetch((char *) &valueBuffer[indexBuffer[i + prefetchStep]], 0, 0);
unsigned int index = indexBuffer[i];
unsigned int value = valueBuffer[index];
double s = sin(value);
sum += s;
}
return (sum);
}
unsigned int computeTimeInMicroSeconds(unsigned short prefetchStep)
{
unsigned int * valueBuffer = createValueBuffer();
unsigned int * indexBuffer = createIndexBuffer();
struct timeval startTime, endTime;
gettimeofday(&startTime, NULL);
double sum = computeSum(indexBuffer, valueBuffer, prefetchStep);
gettimeofday(&endTime, NULL);
printf("prefetchStep = %d, Sum = %f - ", prefetchStep, sum);
free(indexBuffer);
free(valueBuffer);
return ((endTime.tv_sec - startTime.tv_sec) * 1000 * 1000) + (endTime.tv_usec - startTime.tv_usec);
}
void testWithPrefetchStep(unsigned short prefetchStep)
{
unsigned int timeInMicroSeconds = computeTimeInMicroSeconds(prefetchStep);
printf("Time: %u micro-seconds = %.3f seconds\n", timeInMicroSeconds, (double) timeInMicroSeconds / (1000 * 1000));
}
void iterateOnPrefetchSteps()
{
printf("sizeof buffers = %ldMb\n", BUFFER_SIZE * sizeof(unsigned int) / (1024 * 1024));
for (unsigned short prefetchStep = 0 ; prefetchStep < 250 ; prefetchStep++)
{
testWithPrefetchStep(prefetchStep);
}
}
void setCpuAffinity(int cpuId)
{
int pid=0;
cpu_set_t mask;
unsigned int len = sizeof(mask);
CPU_ZERO(&mask);
CPU_SET(cpuId,&mask);
sched_setaffinity(pid, len, &mask);
}
int main(int argc, char ** argv)
{
setCpuAffinity(7);
if (argc == 2)
{
testWithPrefetchStep(atoi(argv[1]));
}
else
{
iterateOnPrefetchSteps();
}
}
At the end of my previous Stack Overflow question I thought I had all the elements: to avoid cache misses I made my code prefetch data (using __builtin_prefetch) and my program became faster. Everything looked as expected.
However, I wanted to study it using the Linux perf tool. So I launched a comparison between two executions of my program:
./TestPrefetch 0: here the prefetch is inefficient because it targets the data that is read immediately afterwards (by the time the data is accessed, it cannot yet have been loaded into the CPU cache). Run duration: 21.346 seconds
./TestPrefetch 1: Here the prefetch is far more efficient because data is fetched one loop-iteration before it is read. Run duration: 12.624 seconds
The perf outputs are the following ones:
$ gcc -O3 TestPrefetch.c -o TestPrefetch -lm && for cpt in 0 1; do echo ; echo "### Step=$cpt" ; sudo perf stat -e task-clock,cycles,instructions,cache-references,cache-misses ./TestPrefetch $cpt; done
### Step=0
prefetchStep = 0, Sum = -1107.523504 - Time: 21346278 micro-seconds = 21.346 seconds
Performance counter stats for './TestPrefetch 0':
24387,010283 task-clock (msec) # 1,000 CPUs utilized
97 274 163 155 cycles # 3,989 GHz
59 183 107 508 instructions # 0,61 insn per cycle
425 300 823 cache-references # 17,440 M/sec
249 261 530 cache-misses # 58,608 % of all cache refs
24,387790203 seconds time elapsed
### Step=1
prefetchStep = 1, Sum = -1107.523504 - Time: 12623665 micro-seconds = 12.624 seconds
Performance counter stats for './TestPrefetch 1':
15662,864719 task-clock (msec) # 1,000 CPUs utilized
62 159 134 934 cycles # 3,969 GHz
59 167 595 107 instructions # 0,95 insn per cycle
484 882 084 cache-references # 30,957 M/sec
321 873 952 cache-misses # 66,382 % of all cache refs
15,663437848 seconds time elapsed
Here, I have difficulty understanding why I am faster:
The number of cache-misses is almost the same (I even have slightly more). I can't understand why, and if that is so, why am I faster?
What are cache-references?
What are task-clock and cycles? Do they include the time spent waiting for data access in case of a cache miss?
I wouldn't trust the perf summary, as it's not very clear what each name represents and which perf counters they are programmed to follow. The default settings have also been known to count the wrong things (see https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/557604).
What could happen here is that your cache-miss counter may also count the prefetch instructions (which may look like loads to the machine, especially as you go down the cache hierarchy). In that case, having more cache references (lookups) makes sense, and you would expect those requests to be misses (the whole point of a prefetch is to miss...).
Instead of relying on some ambiguous counter, find out the counter IDs and masks for your specific machine that represent demand-read lookups and misses, and see if they improved.
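For example (just a sketch: which symbolic event names are available depends on your CPU and kernel, so check perf list first), comparing demand data-cache and last-level-cache loads and misses directly would look something like:
perf stat -e L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses ./TestPrefetch 1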
Edit: looking at your numbers again, I see an increase of roughly 60M cache references but roughly 70M more misses. It's possible that there are more misses due to cache thrashing caused by the prefetches.
I can't understand why, and if that is so, why am I faster?
Because you execute more instructions per unit of time. The old run:
0,61 insn per cycle
and the new one:
0,95 insn per cycle
What are cache-references?
They count how many times the cache was asked whether it contains the data you were loading or storing.
What are task-clock and cycles? Do they include the time spent waiting for data access in case of a cache miss?
Yes. But note that in today's processors there is no idle waiting for any of this. Instructions are executed out of order and usually prefetched; if the next instruction needs data that is not ready, other instructions get executed in the meantime.
I recently made progress on my perf issues. I discovered a lot of new events, some of which are really interesting.
Regarding the current problem, the following event has to be considered: L1-icache-load-misses
When I monitor my test application with perf under the same conditions as before, I get the following values for this event:
1 202 210 L1-icache-load-misses
against
530 127 L1-icache-load-misses
For the moment, I do not yet understand why cache-misses events are not impacted by prefetches while L1-icache-load-misses are...
Let me take hardware with compute capability 1.3 as an example.
30 SMs are available, so at most 240 blocks can be running at the same time (considering the limits on registers and shared memory, the real restriction on the number of blocks may be much lower). Blocks beyond those 240 have to wait for hardware resources to become available.
My question is when the blocks beyond 240 are assigned to SMs: as soon as some of the first 240 blocks complete, or only when all of the first 240 blocks have finished?
I wrote the following piece of code:
#include<stdio.h>
#include<string.h>
#include<cuda_runtime.h>
#include<cutil_inline.h>
const int BLOCKNUM = 1024;
const int N=240;
__global__ void kernel ( volatile int* mark ) {
if ( blockIdx.x == 0 ) while ( mark[N] == 0 );
if ( threadIdx.x == 0 ) mark[blockIdx.x] = 1;
}
int main() {
int * mark;
cudaMalloc ( ( void** ) &mark, sizeof ( int ) *BLOCKNUM );
cudaMemset ( mark, 0, sizeof ( int ) *BLOCKNUM );
kernel <<< BLOCKNUM, 1>>> ( mark );
cudaFree ( mark );
return 0;
}
This code causes a deadlock and fails to terminate. But if I change N from 240 to 239, the code is able to terminate. So I want to know some details about the scheduling of blocks.
On the GT200, it has been demonstrated through micro-benchmarking that new blocks are scheduled whenever an SM has retired all the blocks it was running. So the answer is: when some blocks are finished, and the scheduling granularity is the SM level. There seems to be a consensus that Fermi GPUs have a finer scheduling granularity than previous generations of hardware.
I can't find any reference about this for compute capabilities < 1.3.
Fermi architectures introduce a new block dispatcher called GigaThread engine.
GigaThread enables immediate replacement of blocks on an SM when one completes executing and also enables concurrent kernel execution.
While there is no official answer to this, you can measure through atomic operations when your blocks begin their work and when they end.
Try playing with the following code:
#include <stdio.h>
const int maxBlocks=60; //Number of blocks of size 512 threads on current device required to achieve full occupancy
__global__ void emptyKernel() {}
__global__ void myKernel(int *control, int *output) {
if (threadIdx.x==1) {
//register that we enter
int enter=atomicAdd(control,1);
output[blockIdx.x]=enter;
//some intensive and long task
int &var=output[blockIdx.x+gridDim.x]; //var references global memory
var=1;
for (int i=0; i<12345678; ++i) {
var+=1+tanhf(var);
}
//register that we quit
var=atomicAdd(control,1);
}
}
int main() {
int *gpuControl;
cudaMalloc((void**)&gpuControl, sizeof(int));
int cpuControl=0;
cudaMemcpy(gpuControl,&cpuControl,sizeof(int),cudaMemcpyHostToDevice);
int *gpuOutput;
cudaMalloc((void**)&gpuOutput, sizeof(int)*maxBlocks*2);
int cpuOutput[maxBlocks*2];
for (int i=0; i<maxBlocks*2; ++i) //clear the host array just to be on the safe side
cpuOutput[i]=-1;
// play with these values
const int thr=479;
const int p=13;
const int q=maxBlocks;
//I found that this may actually affect the scheduler! Try with and without this call.
emptyKernel<<<p,thr>>>();
cudaEvent_t timerStart;
cudaEvent_t timerStop;
cudaEventCreate(&timerStart);
cudaEventCreate(&timerStop);
cudaThreadSynchronize();
cudaEventRecord(timerStart,0);
myKernel<<<q,512>>>(gpuControl, gpuOutput);
cudaEventRecord(timerStop,0);
cudaEventSynchronize(timerStop);
cudaMemcpy(cpuOutput,gpuOutput,sizeof(int)*maxBlocks*2,cudaMemcpyDeviceToHost);
cudaThreadSynchronize();
float thisTime;
cudaEventElapsedTime(&thisTime,timerStart,timerStop);
cudaEventDestroy(timerStart);
cudaEventDestroy(timerStop);
printf("Elapsed time: %f\n",thisTime);
for (int i=0; i<q; ++i)
printf("%d: %d-%d\n",i,cpuOutput[i],cpuOutput[i+q]);
}
What you get in the output is the block ID, followed by the enter "time" and exit "time". This way you can learn in which order those events occurred.
On Fermi, I'm sure that a block is scheduled on an SM as soon as there is room for it. That is, whenever an SM finishes executing one block, it will take another block if any are left. (However, the actual order is not deterministic.)
In older versions, I don't know. But you can verify it by using the built-in clock() function.
For example, I used the following OpenCL kernel code (you can easily convert it to CUDA):
__kernel void test(__global uint* start, __global uint* end, __global float* buffer)
{
int id = get_global_id(0);
start[id] = clock();
__do_something_here;
end[id] = clock();
}
Then write the output to a file and build a graph from it; the ordering of the events becomes very easy to see.