DirectX 11 Compute Shader device synchronization?

Background: I am benchmarking and comparing GPGPU platforms.
Problem: Device synchronization when dispatching a DirectX 11 Compute Shader.
I am looking for the equivalent of cudaDeviceSynchronize() or clFinish(...) to make a fair comparison of how my algorithm performs.
CUDA and OpenCL are clearer about which calls are blocking and which are non-blocking. DirectCompute, however, is more closely tied to the graphics pipeline (which I am still learning and am very unfamiliar with), so I have trouble finding out whether a Dispatch call is blocking and whether previous memory allocations/transfers have finished.
Code DX_1:
// Setup
...
for (...) {
startTimer();
context->Dispatch(number_of_groups, 1, 1);
times[i] = stopTimer();
}
// Release
...
Code DX_2:
for (...) {
// Setup
...
startTimer();
context->Dispatch(number_of_groups, 1, 1);
times[i] = stopTimer();
// Release
...
}
Results (average times of 2^2 to 2^11 elements):
DX_1 DX_2 CUDA
1.6 205.5 24.8
1.8 133.4 24.8
29.1 186.5 25.6
18.6 175.0 25.6
11.4 187.5 26.6
85.2 127.7 26.3
166.4 151.1 28.1
98.2 149.5 35.2
26.8 203.5 31.6
Notice: these timings were taken on a desktop GPU with a screen connected, so some erratic timings are expected. The times are not supposed to include host-to-device buffer transfers.
Notice 2: These are very short sequences (4 - 2048 elements); the interesting tests are performed on problem sizes of up to 2^26 elements.

My new solution is to avoid synchronizing with the device. I have looked into some methods of retrieving timestamps instead; the results look OK and I'm fairly sure the comparisons are fair enough. I compared my CUDA times (Event Record vs. QPC) and the difference is small, a seemingly constant overhead.
CUDA Event Host QPC
4,6 30,0
4,8 30,0
5,0 31,0
5,2 32,0
5,6 34,0
6,1 34,0
6,9 31,0
8,3 47,0
9,2 34,0
12,0 39,0
16,7 46,0
20,5 55,0
32,1 69,0
48,5 111,0
86,0 134,0
182,4 237,0
419,0 473,0
In case my question brings someone in hopes of finding how to do gpgpu benchmarking I will leave some code behind demonstrating my current benchmarking strategy.
Code Examples, CUDA
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
float milliseconds = 0;
cudaEventRecord(start);
...
// Launch my algorithm
...
cudaEventRecord(stop);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&milliseconds, start, stop);
OpenCL
cl_event start_event, end_event;
cl_ulong start = 0, end = 0;
// Enqueue a dummy kernel for the start event.
clEnqueueNDRangeKernel(..., &start_event);
...
// Launch my algorithm
...
// Enqueue a dummy kernel for the end event.
clEnqueueNDRangeKernel(..., &end_event);
clWaitForEvents(1, &end_event);
clGetEventProfilingInfo(start_event, CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &start, NULL);
clGetEventProfilingInfo(end_event, CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &end, NULL);
timeInMS = (double)(end - start)*(double)(1e-06);
DirectCompute
Here I followed the suggestion from Adam Miles and looked into that source. Will look something like this:
ID3D11Device* device = nullptr;
...
// Setup
...
ID3D11QueryPtr disjoint_query;
ID3D11QueryPtr q_start;
ID3D11QueryPtr q_end;
...
if (disjoint_query == NULL)
{
D3D11_QUERY_DESC desc;
desc.Query = D3D11_QUERY_TIMESTAMP_DISJOINT;
desc.MiscFlags = 0;
device->CreateQuery(&desc, &disjoint_query);
desc.Query = D3D11_QUERY_TIMESTAMP;
device->CreateQuery(&desc, &q_start);
device->CreateQuery(&desc, &q_end);
}
context->Begin(disjoint_query);
context->End(q_start);
...
// Launch my algorithm
...
context->End(q_end);
context->End(disjoint_query);
UINT64 start, end;
D3D11_QUERY_DATA_TIMESTAMP_DISJOINT q_freq;
while (S_OK != context->GetData(q_start, &start, sizeof(UINT64), 0)){};
while (S_OK != context->GetData(q_end, &end, sizeof(UINT64), 0)){};
while (S_OK != context->GetData(disjoint_query, &q_freq, sizeof(D3D11_QUERY_DATA_TIMESTAMP_DISJOINT), 0)){};
timeInMS = (((double)(end - start)) / ((double)q_freq.Frequency)) * 1000.0;
C/C++/OpenMP
static LARGE_INTEGER StartingTime, EndingTime, ElapsedMicroseconds, Frequency;
static void __inline startTimer()
{
QueryPerformanceFrequency(&Frequency);
QueryPerformanceCounter(&StartingTime);
}
static double __inline stopTimer()
{
QueryPerformanceCounter(&EndingTime);
ElapsedMicroseconds.QuadPart = EndingTime.QuadPart - StartingTime.QuadPart;
ElapsedMicroseconds.QuadPart *= 1000000;
ElapsedMicroseconds.QuadPart /= Frequency.QuadPart;
return (double)ElapsedMicroseconds.QuadPart;
}
My code examples are taken out of context and I tried to do some clean-up but errors might be present.

If you're interested in how long a particular Draw or Dispatch is taking on the GPU then you should take a look at DirectX 11's Timestamp queries. You can query the GPU's clock frequency and current clock value before and after some GPU work and figure out how long that took in wall time.
This is probably a good primer / example on how to do it:
https://mynameismjp.wordpress.com/2011/10/13/profiling-in-dx11-with-queries/

Related

Formulas in perf stat

I am wondering about the formulas used in perf stat to calculate figures from the raw data.
perf stat -e task-clock,cycles,instructions,cache-references,cache-misses ./myapp
1080267.226401 task-clock (msec) # 19.062 CPUs utilized
1,592,123,216,789 cycles # 1.474 GHz (50.00%)
871,190,006,655 instructions # 0.55 insn per cycle (75.00%)
3,697,548,810 cache-references # 3.423 M/sec (75.00%)
459,457,321 cache-misses # 12.426 % of all cache refs (75.00%)
In this context, how do you calculate M/sec from cache-references?

The formulas do not seem to be implemented in builtin-stat.c (where the default event sets for perf stat are defined); they are probably calculated (and averaged with stddev) in perf_stat__print_shadow_stats(), with some stats collected into arrays in perf_stat__update_shadow_stats():
http://elixir.free-electrons.com/linux/v4.13.4/source/tools/perf/util/stat-shadow.c#L626
When HW_INSTRUCTIONS is counted:
"Instructions per clock" = HW_INSTRUCTIONS / HW_CPU_CYCLES; "stalled cycles per instruction" = HW_STALLED_CYCLES_FRONTEND / HW_INSTRUCTIONS
if (perf_evsel__match(evsel, HARDWARE, HW_INSTRUCTIONS)) {
total = avg_stats(&runtime_cycles_stats[ctx][cpu]);
if (total) {
ratio = avg / total;
print_metric(ctxp, NULL, "%7.2f ",
"insn per cycle", ratio);
} else {
print_metric(ctxp, NULL, NULL, "insn per cycle", 0);
}
Branch misses are from print_branch_misses as HW_BRANCH_MISSES / HW_BRANCH_INSTRUCTIONS
There are several cache miss ratio calculations in perf_stat__print_shadow_stats() too like HW_CACHE_MISSES / HW_CACHE_REFERENCES and some more detailed (perf stat -d mode).
Stalled percents are computed as HW_STALLED_CYCLES_FRONTEND / HW_CPU_CYCLES and HW_STALLED_CYCLES_BACKEND / HW_CPU_CYCLES
GHz is computed as HW_CPU_CYCLES / runtime_nsecs_stats, where runtime_nsecs_stats is updated from either of the software events task-clock or cpu-clock (SW_TASK_CLOCK and SW_CPU_CLOCK; the exact difference between the two remains unclear, see the discussions on LKML in 2010 and on SO in 2014)
if (perf_evsel__match(counter, SOFTWARE, SW_TASK_CLOCK) ||
perf_evsel__match(counter, SOFTWARE, SW_CPU_CLOCK))
update_stats(&runtime_nsecs_stats[cpu], count[0]);
There are also several formulas for transactions (perf stat -T mode).
"CPUs utilized" is task-clock or cpu-clock divided by walltime_nsecs_stats, where the wall time is measured by perf stat itself in userspace (elapsed real time, read via clock_gettime):
static inline unsigned long long rdclock(void)
{
struct timespec ts;
clock_gettime(CLOCK_MONOTONIC, &ts);
return ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}
...
static int __run_perf_stat(int argc, const char **argv)
{
...
/*
* Enable counters and exec the command:
*/
t0 = rdclock();
clock_gettime(CLOCK_MONOTONIC, &ref_time);
if (forks) {
....
}
t1 = rdclock();
update_stats(&walltime_nsecs_stats, t1 - t0);
There are also some estimations from the Top-Down methodology (Tuning Applications Using a Top-down Microarchitecture Analysis Method; Software Optimizations Become Simple with Top-Down Analysis .. Name Skylake, IDF2015; #22 in Gregg's Methodology List). It was described in 2016 by Andi Kleen in "Add top down metrics to perf stat", https://lwn.net/Articles/688335/ (perf stat --topdown -I 1000 cmd mode).
And finally, if there is no specific formula for the event currently being printed, there is a universal "%c/sec" (K/sec or M/sec) metric: http://elixir.free-electrons.com/linux/v4.13.4/source/tools/perf/util/stat-shadow.c#L845 The count is divided by the runtime in nsec (from the task-clock or cpu-clock events, if they were present in the perf stat event set):
} else if (runtime_nsecs_stats[cpu].n != 0) {
char unit = 'M';
char unit_buf[10];
total = avg_stats(&runtime_nsecs_stats[cpu]);
if (total)
ratio = 1000.0 * avg / total;
if (ratio < 0.001) {
ratio *= 1000;
unit = 'K';
}
snprintf(unit_buf, sizeof(unit_buf), "%c/sec", unit);
print_metric(ctxp, NULL, "%8.3f", unit_buf, ratio);
}
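To answer the M/sec question concretely, here is a minimal Python sketch that applies the formulas quoted above to the numbers from the question's output. The function name rate_per_sec and the variable names are made up for illustration; only the arithmetic mirrors the perf source shown above. It assumes task-clock is the runtime source, converted from msec to nsec.

```python
def rate_per_sec(count, runtime_ns):
    """Generic '%c/sec' fallback metric, mirroring the branch quoted above."""
    # count / nsec * 1000 == events per microsecond == millions per second
    ratio = 1000.0 * count / runtime_ns
    unit = "M"
    if ratio < 0.001:
        ratio *= 1000
        unit = "K"
    return ratio, unit + "/sec"

# Raw counters copied from the perf stat output in the question.
task_clock_msec = 1080267.226401
cycles = 1_592_123_216_789
instructions = 871_190_006_655
cache_references = 3_697_548_810
cache_misses = 459_457_321

runtime_ns = task_clock_msec * 1e6            # task-clock is printed in msec

ghz = cycles / runtime_ns                      # cycles per nsec == GHz
ipc = instructions / cycles                    # "insn per cycle"
miss_pct = 100.0 * cache_misses / cache_references
value, unit = rate_per_sec(cache_references, runtime_ns)

print(f"{ghz:.3f} GHz")                        # 1.474 GHz
print(f"{ipc:.2f} insn per cycle")             # 0.55 insn per cycle
print(f"{value:.3f} {unit}")                   # 3.423 M/sec
print(f"{miss_pct:.3f} % of all cache refs")   # 12.426 % of all cache refs
```

So the 3.423 M/sec figure is simply cache-references divided by the task-clock runtime, i.e. per second of CPU time rather than per second of wall time.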

Faster lookup tables using AVX2

I'm trying to speed up an algorithm which performs a series of lookup tables. I'd like to use SSE2 or AVX2. I've tried using the _mm256_i32gather_epi32 command but it is 31% slower. Does anyone have any suggestions to any improvements or a different approach?
Timings:
C code = 234
Gathers = 340
static const int32_t g_tables[2][64]; // values between 0 and 63
template <int8_t which, class T>
static void lookup_data(int16_t * dst, T * src)
{
const int32_t * lut = g_tables[which];
// Leave this code for Broadwell or Skylake since it's 31% slower than C code
// (gather is 12 for Haswell, 7 for Broadwell and 5 for Skylake)
#if 0
if (sizeof(T) == sizeof(int16_t)) {
__m256i avx0, avx1, avx2, avx3, avx4, avx5, avx6, avx7;
__m128i sse0, sse1, sse2, sse3, sse4, sse5, sse6, sse7;
__m256i mask = _mm256_set1_epi32(0xffff);
avx0 = _mm256_loadu_si256((__m256i *)(lut));
avx1 = _mm256_loadu_si256((__m256i *)(lut + 8));
avx2 = _mm256_loadu_si256((__m256i *)(lut + 16));
avx3 = _mm256_loadu_si256((__m256i *)(lut + 24));
avx4 = _mm256_loadu_si256((__m256i *)(lut + 32));
avx5 = _mm256_loadu_si256((__m256i *)(lut + 40));
avx6 = _mm256_loadu_si256((__m256i *)(lut + 48));
avx7 = _mm256_loadu_si256((__m256i *)(lut + 56));
avx0 = _mm256_i32gather_epi32((int32_t *)(src), avx0, 2);
avx1 = _mm256_i32gather_epi32((int32_t *)(src), avx1, 2);
avx2 = _mm256_i32gather_epi32((int32_t *)(src), avx2, 2);
avx3 = _mm256_i32gather_epi32((int32_t *)(src), avx3, 2);
avx4 = _mm256_i32gather_epi32((int32_t *)(src), avx4, 2);
avx5 = _mm256_i32gather_epi32((int32_t *)(src), avx5, 2);
avx6 = _mm256_i32gather_epi32((int32_t *)(src), avx6, 2);
avx7 = _mm256_i32gather_epi32((int32_t *)(src), avx7, 2);
avx0 = _mm256_and_si256(avx0, mask);
avx1 = _mm256_and_si256(avx1, mask);
avx2 = _mm256_and_si256(avx2, mask);
avx3 = _mm256_and_si256(avx3, mask);
avx4 = _mm256_and_si256(avx4, mask);
avx5 = _mm256_and_si256(avx5, mask);
avx6 = _mm256_and_si256(avx6, mask);
avx7 = _mm256_and_si256(avx7, mask);
sse0 = _mm_packus_epi32(_mm256_castsi256_si128(avx0), _mm256_extracti128_si256(avx0, 1));
sse1 = _mm_packus_epi32(_mm256_castsi256_si128(avx1), _mm256_extracti128_si256(avx1, 1));
sse2 = _mm_packus_epi32(_mm256_castsi256_si128(avx2), _mm256_extracti128_si256(avx2, 1));
sse3 = _mm_packus_epi32(_mm256_castsi256_si128(avx3), _mm256_extracti128_si256(avx3, 1));
sse4 = _mm_packus_epi32(_mm256_castsi256_si128(avx4), _mm256_extracti128_si256(avx4, 1));
sse5 = _mm_packus_epi32(_mm256_castsi256_si128(avx5), _mm256_extracti128_si256(avx5, 1));
sse6 = _mm_packus_epi32(_mm256_castsi256_si128(avx6), _mm256_extracti128_si256(avx6, 1));
sse7 = _mm_packus_epi32(_mm256_castsi256_si128(avx7), _mm256_extracti128_si256(avx7, 1));
_mm_storeu_si128((__m128i *)(dst), sse0);
_mm_storeu_si128((__m128i *)(dst + 8), sse1);
_mm_storeu_si128((__m128i *)(dst + 16), sse2);
_mm_storeu_si128((__m128i *)(dst + 24), sse3);
_mm_storeu_si128((__m128i *)(dst + 32), sse4);
_mm_storeu_si128((__m128i *)(dst + 40), sse5);
_mm_storeu_si128((__m128i *)(dst + 48), sse6);
_mm_storeu_si128((__m128i *)(dst + 56), sse7);
}
else
#endif
{
for (int32_t i = 0; i < 64; i += 4)
{
*dst++ = src[*lut++];
*dst++ = src[*lut++];
*dst++ = src[*lut++];
*dst++ = src[*lut++];
}
}
}

You're right that gather is slower than a PINSRD loop on Haswell. It's probably nearly break-even on Broadwell. (See also the x86 tag wiki for perf links, especially Agner Fog's insn tables, microarch pdf, and optimization guide)
If your indices are small, or you can slice them up, pshufb can be used as parallel LUT with 4bit indices. It gives you sixteen 8bit table entries, but you can use stuff like punpcklbw to combine two vectors of byte results into one vector of 16bit results. (Separate tables for high and low halves of the LUT entries, with the same 4bit indices).
This kind of technique gets used for Galois Field multiplies, when you want to multiply every element of a big buffer of GF16 values by the same value. (e.g. for Reed-Solomon error correction codes.) Like I said, taking advantage of this requires taking advantage of special properties of your use-case.
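To make the split-table idea concrete, here is a scalar Python model of it. This is only a sketch of the instruction semantics, not SIMD code: the functions pshufb and punpcklbw imitate the behaviour of the instructions of the same name on 16-byte vectors, and lookup16 and the sample table are hypothetical names chosen for illustration.

```python
def pshufb(table, indices):
    """Scalar model of a 128-bit pshufb: 16 parallel byte lookups."""
    assert len(table) == 16 and len(indices) == 16
    # An index byte with its high bit set yields 0;
    # otherwise its low 4 bits select a table byte.
    return [0 if idx & 0x80 else table[idx & 0x0F] for idx in indices]

def punpcklbw(a, b):
    """Scalar model of punpcklbw: interleave the low 8 bytes of a and b."""
    out = []
    for x, y in zip(a[:8], b[:8]):
        out += [x, y]
    return out

def lookup16(table16, indices):
    """Look up eight 16-bit entries using two byte-wide tables."""
    lo_table = [v & 0xFF for v in table16]   # low halves of the entries
    hi_table = [v >> 8 for v in table16]     # high halves of the entries
    lo = pshufb(lo_table, indices)           # same 4-bit indices for both
    hi = pshufb(hi_table, indices)
    interleaved = punpcklbw(lo, hi)          # lo0, hi0, lo1, hi1, ...
    # Read the interleaved bytes back as little-endian 16-bit words.
    return [interleaved[2 * i] | (interleaved[2 * i + 1] << 8) for i in range(8)]

table16 = [(i * 4111) & 0xFFFF for i in range(16)]   # sixteen 16-bit entries
indices = [3, 0, 15, 7, 7, 1, 12, 9, 0, 0, 0, 0, 0, 0, 0, 0]
result = lookup16(table16, indices)
```

Only the first eight indices contribute here, because punpcklbw consumes the low half of each vector; punpckhbw would produce the other eight results.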
AVX2 can do two 128b pshufbs in parallel, in each lane of a 256b vector. There is nothing better until AVX512F: __m512i _mm512_permutex2var_epi32 (__m512i a, __m512i idx, __m512i b). There are byte (vpermi2b in AVX512VBMI), word (vpermi2w in AVX512BW), dword (this one, vpermi2d in AVX512F), and qword (vpermi2q in AVX512F) element size versions. This is a full cross-lane shuffle, indexing into two concatenated source registers. (Like AMD XOP's vpperm).
The two different instructions behind the one intrinsic (vpermt2d / vpermi2d) give you a choice of overwriting the table with the result, or overwriting the index vector. The compiler will pick based on which inputs are reused.
Your specific case:
*dst++ = src[*lut++];
The lookup-table is actually src, not the variable you've called lut. lut is actually walking through an array which is used as a shuffle-control mask for src.
You should make g_tables an array of uint8_t for best performance. The entries are only 0..63, so they fit. Zero-extending loads into full registers are as cheap as normal loads, so it just reduces the cache footprint. To use it with AVX2 gathers, use vpmovzxbd. The intrinsic is frustratingly difficult to use as a load, because there's no form that takes an int64_t *, only __m256i _mm256_cvtepu8_epi32 (__m128i a) which takes a __m128i. This is one of the major design flaws with intrinsics, IMO.
I don't have any great ideas for speeding up your loop. Scalar code is probably the way to go here. The SIMD code shuffles 64 int16_t values into a new destination, I guess. It took me a while to figure that out, because I didn't find the if (sizeof...) line right away, and there are no comments. :( It would be easier to read if you used sane variable names, not avx0... Using x86 gather instructions for elements smaller than 4B certainly requires annoying masking. However, instead of pack, you could use a shift and OR.
You could make an AVX512 version for sizeof(T) == sizeof(int8_t) or sizeof(T) == sizeof(int16_t), because all of src will fit into one or two zmm registers.
If g_tables was being used as a LUT, AVX512 could do it easily, with vpermi2b. You'd have a hard time without AVX512, though, because a 64 byte table is too big for pshufb. Using four lanes (16B) of pshufb for each input lane could work: Mask off indices outside 0..15, then indices outside 16..31, etc, with pcmpgtb or something. Then you have to OR all four lanes together. So this sucks a lot.
possible speedups: design the shuffle by hand
If you're willing to design a shuffle by hand for a specific value of g_tables, there are potential speedups that way. Load a vector from src, shuffle it with a compile-time constant pshufb or pshufd, then store any contiguous blocks in one go. (Maybe with pextrd or pextrq, or even better movq from the bottom of the vector. Or even a full-vector movdqu).
Actually, loading multiple src vectors and shuffling between them is possible with shufps. It works fine on integer data, with no slowdowns except on Nehalem (and maybe also on Core2). punpcklwd / dq / qdq (and the corresponding punpckhwd etc) can interleave elements of vectors, and give different choices for data movement than shufps.
If it doesn't take too many instructions to construct a few full 16B vectors, you're in good shape.
If g_tables can take on too many possible values, it might be possible to JIT-compile a custom shuffle function. This is probably really hard to do well, though.

Cuda, why I cannot use more than one streaming processor?

I implemented an RNS Montgomery exponentiation in CUDA.
Everything works fine. It runs on just one SM.
But so far I have focused on parallelizing just a single exponentiation. What I want to do now is test with several exponentiations in flight. That is, I want the i-th exponentiation to be assigned to a free SM.
I tried, and the total time always grew linearly, which means all the exponentiations were assigned to the same SM.
Then I switched to streams, but nothing changed.
However, I have never used them before, so maybe I am doing something wrong.
This is the code:
void __smeWrapper() {
cudaEvent_t start, stop;
cudaStream_t stream0, stream1, stream2;
float time;
unsigned int j, i, tmp;
cudaEventCreate(&start);
cudaEventCreate(&stop);
dim3 threadsPerBlock(SET_SIZE, (SET_SIZE+1)/2);
setCudaDevice();
s_transferDataToGPU();
if(cudaDeviceSetCacheConfig(cudaFuncCachePreferL1) != cudaSuccess)
printf("cudaDeviceSetCacheConfig ERROR!");
cudaEventRecord( start, 0 );
//for(i=0; i<EXPONENTIATION_NUMBER; i++) {
i=0;
__me<<< 1, threadsPerBlock, 0, stream0 >>>(&__s_x[i*(2*SET_SIZE + 1)], __B2modN, __bases, __mmi_NinB, __mmi_Bimodbi, __Bi_inAUar, __dbg, __NinAUar,
__mmi_BinAUar, __mmi_Ajmodaj, __Ajmodar, __mmi_Armodar, __AjinB, __minusAinB, &__z[i*(2*SET_SIZE + 1)], __e);
i=1;
__me<<< 1, threadsPerBlock, 0, stream1 >>>(&__s_x[i*(2*SET_SIZE + 1)], __B2modN, __bases, __mmi_NinB, __mmi_Bimodbi, __Bi_inAUar, __dbg, __NinAUar,
__mmi_BinAUar, __mmi_Ajmodaj, __Ajmodar, __mmi_Armodar, __AjinB, __minusAinB, &__z[i*(2*SET_SIZE + 1)], __e);
i=2;
__me<<< 1, threadsPerBlock, 0, stream2 >>>(&__s_x[i*(2*SET_SIZE + 1)], __B2modN, __bases, __mmi_NinB, __mmi_Bimodbi, __Bi_inAUar, __dbg, __NinAUar, __mmi_BinAUar,
__mmi_Ajmodaj, __Ajmodar, __mmi_Armodar, __AjinB, __minusAinB, &__z[i*(2*SET_SIZE + 1)], __e);
//printf("\n%s\n\n", cudaGetErrorString(cudaGetLastError()));
//}
cudaEventRecord( stop, 0 );
cudaEventSynchronize( stop );
cudaEventElapsedTime( &time, start, stop );
printf("GPU %f µs : %f ms\n", time*1000, time);
cudaEventDestroy( start );
cudaEventDestroy( stop );
Ubuntu 11.04 64b, Cuda 5 RC, 560 Ti (8 SM)

All threads from a block always run on the same SM. You need to launch more than one block to use other SMs.
There seems to be something wrong with your streams - do you call cudaStreamCreate for every stream? On my system it crashes with SEGFAULT if I don't use one though.

Program execution taking almost same usertime on CPU as well as GPU?

The program for finding prime numbers using OpenCL 1.1 gave the following benchmarks :
Device : CPU
Realtime : approx. 3 sec
Usertime : approx. 32 sec
Device : GPU
Realtime - approx. 37 sec
Usertime - approx. 32 sec
Why is the usertime of execution on the GPU not less than that on the CPU? Is data/task parallelization not occurring?
System specifications: 64-bit CentOS 5.3 system with two ATI Radeon 5970 graphics cards + Intel Core i7 processor (12 cores)

Your kernel is rather inefficient; I have an adjusted one below for you to consider. As to why it runs better on a CPU device...
Using your algorithm, the work items take varying amounts of time to execute. They take longer as the numbers tested grow larger. A work group on a GPU will not finish until all of its items are finished, so some of the hardware will be left idle until the last item is done. On a CPU, it behaves more like a loop iterating over the kernel items, so the difference in cycles needed to compute each item won't drastically affect the performance.
'A' is not used by the kernel. It should not be copied unless it is used. It looks like you wanted to test A[i] rather than 'i' itself, though.
I think the GPU would be much better at FFT-based prime calculations, or even a sieve algorithm.
{
int t;
int i = get_global_id(0);
int end = sqrt(i);
if(i%2){
B[i] = 1; // odd: still a candidate, so mark it non-zero
}else{
B[i] = 0; // even: not prime (the special case i == 2 is not handled here)
}
for ( t = 3; (t<=end)&&(B[i] > 0) ; t+=2 ) {
if ( i % t == 0 ) {
B[ i ] = 0;
}
}
}

Calculating frames per second in a game

What's a good algorithm for calculating frames per second in a game? I want to show it as a number in the corner of the screen. If I just look at how long it took to render the last frame the number changes too fast.
Bonus points if your answer updates each frame and doesn't converge differently when the frame rate is increasing vs decreasing.

You need a smoothed average. The easiest way is to take the current answer (the time to draw the last frame) and combine it with the previous answer.
// eg.
float smoothing = 0.9; // larger=more smoothing
measurement = (measurement * smoothing) + (current * (1.0-smoothing))
By adjusting the 0.9 / 0.1 ratio you can change the 'time constant', that is, how quickly the number responds to changes. A larger fraction in favour of the old answer gives a slower, smoother change; a larger fraction in favour of the new answer gives a more quickly changing value. Obviously the two factors must add up to one!
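To get a feel for that time constant, here is a small Python sketch of the same update rule. The function names smooth and frames_to_settle, and the 95% threshold, are my own choices for illustration. It counts how many frames the smoothed value needs to cover most of a sudden step in the measurement.

```python
def smooth(previous, current, smoothing=0.9):
    """One update of the smoothed average described above."""
    return previous * smoothing + current * (1.0 - smoothing)

def frames_to_settle(smoothing, fraction=0.95):
    """Count the updates needed to cover `fraction` of a step from 0 to 1."""
    value, target, frames = 0.0, 1.0, 0
    while value < fraction * target:
        value = smooth(value, target, smoothing)
        frames += 1
    return frames

for s in (0.5, 0.9):
    print(s, frames_to_settle(s))
```

With a smoothing of 0.9 the displayed value needs 29 frames to cover 95% of a sudden change; with 0.5 it needs only 5. After n frames the remaining error is smoothing to the power n, which is why values close to 1 respond so slowly.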

This is what I have used in many games.
#define MAXSAMPLES 100
int tickindex=0;
int ticksum=0;
int ticklist[MAXSAMPLES];
/* need to zero out the ticklist array before starting */
/* average will ramp up until the buffer is full */
/* returns average ticks per frame over the MAXSAMPLES last frames */
double CalcAverageTick(int newtick)
{
ticksum-=ticklist[tickindex]; /* subtract value falling off */
ticksum+=newtick; /* add new value */
ticklist[tickindex]=newtick; /* save new value so it can be subtracted later */
if(++tickindex==MAXSAMPLES) /* inc buffer index */
tickindex=0;
/* return average */
return((double)ticksum/MAXSAMPLES);
}

Well, certainly
frames / sec = 1 / (sec / frame)
But, as you point out, there's a lot of variation in the time it takes to render a single frame, and from a UI perspective updating the fps value at the frame rate is not usable at all (unless the number is very stable).
What you want is probably a moving average or some sort of binning / resetting counter.
For example, you could maintain a queue data structure which held the rendering times for each of the last 30, 60, 100, or what-have-you frames (you could even design it so the limit was adjustable at run-time). To determine a decent fps approximation you can determine the average fps from all the rendering times in the queue:
fps = # of rendering times in queue / total rendering time
When you finish rendering a new frame you enqueue a new rendering time and dequeue an old rendering time. Alternately, you could dequeue only when the total of the rendering times exceeded some preset value (e.g. 1 sec). You can maintain the "last fps value" and a last updated timestamp so you can trigger when to update the fps figure, if you so desire. Though with a moving average if you have consistent formatting, printing the "instantaneous average" fps on each frame would probably be ok.
Another method would be to have a resetting counter. Maintain a precise (millisecond) timestamp, a frame counter, and an fps value. When you finish rendering a frame, increment the counter. When the counter hits a pre-set limit (e.g. 100 frames) or when the time since the timestamp has passed some pre-set value (e.g. 1 sec), calculate the fps:
fps = # frames / (current time - start time)
Then reset the counter to 0 and set the timestamp to the current time.
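A minimal Python sketch of the resetting counter just described might look like this. The class name ResettingFpsCounter, its parameters, and the injectable clock are assumptions made for illustration; the defaults follow the 100 frames / 1 sec example above.

```python
import time

class ResettingFpsCounter:
    """Recomputes fps every `max_frames` frames or `max_seconds` seconds."""

    def __init__(self, max_frames=100, max_seconds=1.0, clock=time.perf_counter):
        self.max_frames = max_frames
        self.max_seconds = max_seconds
        self.clock = clock               # injectable so it can be tested
        self.start_time = clock()
        self.frames = 0
        self.fps = 0.0                   # last computed value, shown until the next update

    def frame_rendered(self):
        """Call once after each rendered frame; returns the current fps figure."""
        self.frames += 1
        now = self.clock()
        elapsed = now - self.start_time
        if self.frames >= self.max_frames or elapsed >= self.max_seconds:
            if elapsed > 0:
                self.fps = self.frames / elapsed
            # Start a new measurement window.
            self.frames = 0
            self.start_time = now
        return self.fps
```

In a render loop you would call frame_rendered() once per frame and draw the returned value; it only changes when a window closes, so the on-screen number stays readable.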

Increment a counter every time you render a screen, and clear that counter at the end of each time interval over which you want to measure the frame rate.
I.e., every 3 seconds, compute counter/3 and then clear the counter.

There are at least two ways to do it:
The first is the one others have mentioned here before me.
I think it's the simplest and preferred way. You just need to keep track of
cn: counter of how many frames you've rendered
time_start: the time at which you started counting
time_now: the current time
Calculating the fps in this case is as simple as evaluating this formula:
FPS = cn / (time_now - time_start).
Then there is the uber cool way you might like to use some day:
Let's say you have 'i' frames to consider. I'll use this notation: f[0], f[1],..., f[i-1] to describe how long it took to render frame 0, frame 1, ..., frame (i-1) respectively.
Example where i = 3
|f[0] |f[1] |f[2] |
+----------+-------------+-------+------> time
Then, mathematical definition of fps after i frames would be
(1) fps[i] = i / (f[0] + ... + f[i-1])
And the same formula but only considering i-1 frames.
(2) fps[i-1] = (i-1) / (f[0] + ... + f[i-2])
Now the trick here is to modify the right side of formula (1) in such a way that it contains the right side of formula (2), and then substitute the left side of (2) for it.
Like so (you should see it more clearly if you write it on a paper):
fps[i] = i / (f[0] + ... + f[i-1])
= i / ((f[0] + ... + f[i-2]) + f[i-1])
= (i/(i-1)) / ((f[0] + ... + f[i-2])/(i-1) + f[i-1]/(i-1))
= (i/(i-1)) / (1/fps[i-1] + f[i-1]/(i-1))
= ...
= (i*fps[i-1]) / (f[i-1] * fps[i-1] + i - 1)
So according to this formula (my math derivation skills are a bit rusty, though), to calculate the new fps you need to know the fps from the previous frame, the time it took to render the last frame, and the number of frames you've rendered.
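As a sanity check on the derivation, here is a short Python sketch that compares the incremental formula with the direct definition (1). The function names and the sample frame times are made up for illustration.

```python
def fps_direct(frame_times):
    """Formula (1): number of frames divided by their total duration."""
    return len(frame_times) / sum(frame_times)

def fps_incremental(prev_fps, last_frame_time, i):
    """fps[i] computed from fps[i-1], f[i-1] and i, as derived above."""
    return (i * prev_fps) / (last_frame_time * prev_fps + i - 1)

frame_times = [0.016, 0.020, 0.033, 0.017, 0.016]   # seconds, sample data

fps = 1.0 / frame_times[0]                           # fps[1]
for i in range(2, len(frame_times) + 1):
    fps = fps_incremental(fps, frame_times[i - 1], i)

print(fps, fps_direct(frame_times))                  # the two values agree
```

Note that this is an average over all frames since counting started, so it reacts more and more slowly as i grows.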

This might be overkill for most people, that's why I hadn't posted it when I implemented it. But it's very robust and flexible.
It stores a Queue with the last frame times, so it can accurately calculate an average FPS value much better than just taking the last frame into consideration.
It also allows you to ignore one frame, if you are doing something that you know is going to artificially screw up that frame's time.
It also allows you to change the number of frames to store in the Queue as it runs, so you can test it out on the fly what is the best value for you.
// Number of past frames to use for FPS smooth calculation - because
// Unity's smoothedDeltaTime, well - it kinda sucks
private int frameTimesSize = 60;
// A Queue is the perfect data structure for the smoothed FPS task;
// new values in, old values out
private Queue<float> frameTimes = new Queue<float>();
// Not really needed, but used for faster updating than processing
// the entire queue every frame
private float __frameTimesSum = 0;
// Flag to ignore the next frame when performing a heavy one-time operation
// (like changing resolution)
private bool _fpsIgnoreNextFrame = false;
//=============================================================================
// Call this after doing a heavy operation that will screw up with FPS calculation
void FPSIgnoreNextFrame() {
this._fpsIgnoreNextFrame = true;
}
//=============================================================================
// Smoothed FPS counter updating
void Update()
{
if (this._fpsIgnoreNextFrame) {
this._fpsIgnoreNextFrame = false;
return;
}
// Using while loops here allows the frameTimesSize member to be changed dynamically
while (this.frameTimes.Count >= this.frameTimesSize) {
this.__frameTimesSum -= this.frameTimes.Dequeue();
}
while (this.frameTimes.Count < this.frameTimesSize) {
this.__frameTimesSum += Time.deltaTime;
this.frameTimes.Enqueue(Time.deltaTime);
}
}
//=============================================================================
// Public function to get smoothed FPS values
public int GetSmoothedFPS() {
return (int)(this.frameTimesSize / this.__frameTimesSum * Time.timeScale);
}

Good answers here. Just how you implement it depends on what you need it for. I prefer the running average myself, "time = time * 0.9 + last_frame * 0.1", by the guy above.
However, I personally like to weight my average more heavily towards newer data, because in a game it is SPIKES that are the hardest to squash and thus of most interest to me. So I would use something more like a .7 / .3 split, which makes a spike show up much faster (though its effect will drop off-screen faster as well; see below).
If your focus is on RENDERING time, then the .9 / .1 split works pretty nicely because it tends to be smoother. For gameplay/AI/physics, though, spikes are much more of a concern, as THAT is usually what makes your game look choppy (which is often worse than a low frame rate, assuming we're not dipping below 20 fps).
So, what I would do is also add something like this:
#define ONE_OVER_FPS (1.0f/60.0f)
static float g_SpikeGuardBreakpoint = 3.0f * ONE_OVER_FPS;
if(time > g_SpikeGuardBreakpoint)
DoInternalBreakpoint();
(replace 3.0f with whatever magnitude you find to be an unacceptable spike)
This will let you find, and thus solve, FPS issues at the end of the frame in which they happen.

A much better system than using a large array of old framerates is to just do something like this:
avg_fps = avg_fps * 0.99 + current_fps * 0.01
This method uses far less memory, requires far less code, and places more importance upon recent framerates than old framerates while still smoothing the effects of sudden framerate changes.

You could keep a counter, increment it after each frame is rendered, then reset the counter when you are on a new second (storing the previous value as the last second's # of frames rendered)

JavaScript:
// Set the end and start times
var start = (new Date).getTime(), end, FPS;
/* ...
* the loop/block your want to watch
* ...
*/
end = (new Date).getTime();
// since the times are by millisecond, use 1000 (1000ms = 1s)
// then multiply the result by (MaxFPS / 1000)
// FPS = (1000 - (end - start)) * (MaxFPS / 1000)
FPS = Math.round((1000 - (end - start)) * (60 / 1000));

Here's a complete example, using Python (but easily adapted to any language). It uses the smoothing equation in Martin's answer, so almost no memory overhead, and I chose values that worked for me (feel free to play around with the constants to adapt to your use case).
import time
SMOOTHING_FACTOR = 0.99
MAX_FPS = 10000
avg_fps = -1
last_tick = time.time()
while True:
# <Do your rendering work here...>
current_tick = time.time()
# Ensure we don't get crazy large frame rates, by capping to MAX_FPS
current_fps = 1.0 / max(current_tick - last_tick, 1.0/MAX_FPS)
last_tick = current_tick
if avg_fps < 0:
avg_fps = current_fps
else:
avg_fps = (avg_fps * SMOOTHING_FACTOR) + (current_fps * (1-SMOOTHING_FACTOR))
print(avg_fps)

Set a counter to zero. Each time you draw a frame, increment the counter. After each second, print the counter. Lather, rinse, repeat. If you want extra credit, keep a running counter and divide by the total number of seconds for a running average.

In (C++-like) pseudocode, these two are what I used in industrial image processing applications that had to process images from a set of externally triggered cameras. Variations in "frame rate" had a different source (slower or faster production on the belt) but the problem is the same. (I assume that you have a simple timer.peek() call that gives you something like the number of msec (nsec?) since application start or since the last call.)
Solution 1: fast but not updated every frame
do while (1)
{
ProcessImage(frame)
if (frame.framenumber%poll_interval==0)
{
new_time=timer.peek()
framerate=poll_interval/(new_time - last_time)
last_time=new_time
}
}
Solution 2: updated every frame, requires more memory and CPU
do while (1)
{
ProcessImage(frame)
new_time=timer.peek()
delta=new_time - last_time
last_time = new_time
total_time += delta
delta_history.push(delta)
framerate= delta_history.length() / total_time
while (delta_history.length() > avg_interval)
{
oldest_delta = delta_history.pop()
total_time -= oldest_delta
}
}

qx.Class.define('FpsCounter', {
    extend: qx.core.Object
    ,construct: function(){
        this.base(arguments);
        this.restart();
    }
    ,members: {
        restart: function(){
            this.__frames = [];
        }
        ,addFrame: function(){
            this.__frames.push(new Date());
        }
        // Average FPS over the last `averageFrames` frame intervals (default 2).
        ,getFps: function(averageFrames){
            if(!averageFrames){
                averageFrames = 2;
            }
            var time = 0;    // milliseconds covered by the counted intervals
            var counted = 0; // intervals actually available
            var l = this.__frames.length;
            var i = averageFrames;
            while(i > 0){
                if(l - i - 1 >= 0){
                    time += this.__frames[l - i] - this.__frames[l - i - 1];
                    counted++;
                }
                i--;
            }
            // Avoid dividing by zero before enough frames have been recorded.
            return time > 0 ? counted / time * 1000 : 0;
        }
    }
});

How I do it:
boolean run = false;
int ticks = 0;
long tickstart;
int fps;
public void loop()
{
    if (this.ticks == 0)
    {
        this.tickstart = System.currentTimeMillis();
    }
    this.ticks++;
    long elapsedMs = System.currentTimeMillis() - this.tickstart;
    // Guard against division by zero on the first tick, and scale ms to seconds.
    if (elapsedMs > 0)
    {
        this.fps = (int) (this.ticks * 1000L / elapsedMs);
    }
}
In words: a tick counter tracks ticks. On the first tick, it stores the current time in 'tickstart'. On every tick after that, 'fps' is set to the number of ticks so far divided by the number of seconds elapsed since the first tick.
fps is an integer, hence the "(int)" cast.

Here's how I do it (in Java):
private static long ONE_SECOND = 1000000L * 1000L; //1 second is 1000 ms, which is 1,000,000,000 ns
LinkedList<Long> frames = new LinkedList<>(); //List of frames within 1 second
public int calcFPS(){
    long time = System.nanoTime(); //Current time in nanoseconds
    frames.add(time); //Add this frame to the list
    while(true){
        long f = frames.getFirst(); //Look at the first element in frames
        if(time - f > ONE_SECOND){ //If it was more than 1 second ago
            frames.remove(); //Remove it from the list of frames
        } else break;
        /*If it was within 1 second we know that all other frames in the list
         * are also within 1 second
         */
    }
    return frames.size(); //Return the size of the list
}

In TypeScript, I use this algorithm to calculate framerate and frametime averages:
let getTime = () => {
return new Date().getTime();
}
let frames: any[] = [];
let previousTime = getTime();
let framerate:number = 0;
let frametime:number = 0;
let updateStats = (samples:number=60) => {
samples = Math.max(samples, 1) >> 0;
if (frames.length === samples) {
let currentTime: number = getTime() - previousTime;
frametime = currentTime / samples;
framerate = 1000 * samples / currentTime;
previousTime = getTime();
frames = [];
}
frames.push(1);
}
Usage:
updateStats();
// Print
stats.innerHTML = Math.round(framerate) + ' FPS ' + frametime.toFixed(2) + ' ms';
Tip: If samples is 1, the result is real-time framerate and frametime.

This is based on KPexEA's answer and gives the Simple Moving Average. Tidied and converted to TypeScript for easy copy and paste:
Variable declaration:
fpsObject = {
maxSamples: 100,
tickIndex: 0,
tickSum: 0,
tickList: []
}
Function:
calculateFps(currentFps: number): number {
this.fpsObject.tickSum -= this.fpsObject.tickList[this.fpsObject.tickIndex] || 0
this.fpsObject.tickSum += currentFps
this.fpsObject.tickList[this.fpsObject.tickIndex] = currentFps
if (++this.fpsObject.tickIndex === this.fpsObject.maxSamples) this.fpsObject.tickIndex = 0
const smoothedFps = this.fpsObject.tickSum / this.fpsObject.maxSamples
return Math.floor(smoothedFps)
}
Usage (may vary in your app):
this.fps = this.calculateFps(this.ticker.FPS)
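Note that the function above always divides by maxSamples, so until the buffer has filled the result reads lower than the real average. A small Python sketch of the same ring-buffer moving average that divides by the number of samples seen so far instead (the class name MovingAverage and its method names are hypothetical, not part of the answer above):

```python
class MovingAverage:
    """Simple moving average over the last `max_samples` values, using a ring buffer."""

    def __init__(self, max_samples=100):
        self._samples = [0.0] * max_samples
        self._index = 0      # slot that will be overwritten next
        self._count = 0      # how many slots hold real data (caps at max_samples)
        self._sum = 0.0      # running sum of the stored samples

    def add(self, value):
        """Store a new sample and return the average of the stored samples."""
        self._sum -= self._samples[self._index]   # remove the value being overwritten
        self._sum += value
        self._samples[self._index] = value
        self._index = (self._index + 1) % len(self._samples)
        self._count = min(self._count + 1, len(self._samples))
        return self._sum / self._count
```

Once the buffer is full this behaves the same as dividing by the buffer size; the only difference is during the first max_samples calls.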

I adapted @KPexEA's answer to Go, moved the globals into struct fields, allowed the number of samples to be configurable, and used time.Duration instead of plain integers and floats.
type FrameTimeTracker struct {
samples []time.Duration
sum time.Duration
index int
}
func NewFrameTimeTracker(n int) *FrameTimeTracker {
return &FrameTimeTracker{
samples: make([]time.Duration, n),
}
}
func (t *FrameTimeTracker) AddFrameTime(frameTime time.Duration) (average time.Duration) {
// algorithm adapted from https://stackoverflow.com/a/87732/814422
t.sum -= t.samples[t.index]
t.sum += frameTime
t.samples[t.index] = frameTime
t.index++
if t.index == len(t.samples) {
t.index = 0
}
return t.sum / time.Duration(len(t.samples))
}
The use of time.Duration, which has nanosecond precision, eliminates the need for floating-point arithmetic to compute the average frame time. However, because a time.Duration is a 64-bit integer, it needs twice as much memory as 32-bit values for the same number of samples.
You'd use it like this:
// track the last 60 frame times
frameTimeTracker := NewFrameTimeTracker(60)
// main game loop
for frame := 0;; frame++ {
// ...
if frame > 0 {
// prevFrameTime is the duration of the last frame
avgFrameTime := frameTimeTracker.AddFrameTime(prevFrameTime)
fps := 1.0 / avgFrameTime.Seconds()
}
// ...
}
Since the context of this question is game programming, I'll add some more notes about performance and optimization. The above approach is idiomatic Go but always involves two heap allocations: one for the struct itself and one for the array backing the slice of samples. If used as indicated above, these are long-lived allocations so they won't really tax the garbage collector. Profile before optimizing, as always.
However, if performance is a major concern, some changes can be made to eliminate the allocations and indirections:
Change samples from a slice of []time.Duration to an array of [N]time.Duration where N is fixed at compile time. This removes the flexibility of changing the number of samples at runtime, but in most cases that flexibility is unnecessary.
Then, eliminate the NewFrameTimeTracker constructor function entirely and use a var frameTimeTracker FrameTimeTracker declaration (at the package level or local to main) instead. Unlike C, Go will pre-zero all relevant memory.

Unfortunately, most of the answers here give FPS measurements that are either not accurate enough or not "slow responsive" enough (the displayed value changes too quickly to read). Here's how I do it in Rust using a measurement queue:
use std::collections::VecDeque;
use std::time::{Duration, Instant};
pub struct FpsCounter {
sample_period: Duration,
max_samples: usize,
creation_time: Instant,
frame_count: usize,
measurements: VecDeque<FrameCountMeasurement>,
}
#[derive(Copy, Clone)]
struct FrameCountMeasurement {
time: Instant,
frame_count: usize,
}
impl FpsCounter {
pub fn new(sample_period: Duration, samples: usize) -> Self {
assert!(samples > 1);
Self {
sample_period,
max_samples: samples,
creation_time: Instant::now(),
frame_count: 0,
measurements: VecDeque::new(),
}
}
pub fn fps(&self) -> f32 {
match (self.measurements.front(), self.measurements.back()) {
(Some(start), Some(end)) => {
let period = (end.time - start.time).as_secs_f32();
if period > 0.0 {
(end.frame_count - start.frame_count) as f32 / period
} else {
0.0
}
}
_ => 0.0,
}
}
pub fn update(&mut self) {
self.frame_count += 1;
let current_measurement = self.measure();
let last_measurement = self
.measurements
.back()
.copied()
.unwrap_or(FrameCountMeasurement {
time: self.creation_time,
frame_count: 0,
});
if (current_measurement.time - last_measurement.time) >= self.sample_period {
self.measurements.push_back(current_measurement);
while self.measurements.len() > self.max_samples {
self.measurements.pop_front();
}
}
}
fn measure(&self) -> FrameCountMeasurement {
FrameCountMeasurement {
time: Instant::now(),
frame_count: self.frame_count,
}
}
}
How to use:
Create the counter:
let mut fps_counter = FpsCounter::new(Duration::from_millis(100), 5);
Call fps_counter.update() on every frame drawn.
Call fps_counter.fps() whenever you like to display current FPS.
Now, the key is in the parameters to FpsCounter::new(): sample_period is how responsive fps() is to changes in framerate, and samples controls how quickly fps() ramps up or down to the actual framerate. So if you choose 10 ms and 100 samples, fps() would react almost instantly to any change in framerate (basically, the FPS value on the screen would jitter like crazy), but since it's 100 samples, it would take 1 second to match the actual framerate.
So my choice of 100 ms and 5 samples means that the displayed FPS counter doesn't make your eyes bleed by changing crazy fast, and it matches your actual framerate about half a second after it changes, which is sensible enough for a game.
Since sample_period * samples is roughly the averaging time span, you don't want it to be too short if you want a reasonably accurate FPS counter.
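For readers who don't use Rust, here is a rough Python sketch of the same measurement-queue idea. It is not a line-by-line port: the class name SampledFpsCounter and the injectable clock parameter are assumptions for illustration, and the bounded deque (maxlen) stands in for the manual pop_front loop.

```python
import time
from collections import deque


class SampledFpsCounter:
    """Records (time, frame count) at most once per sample_period and averages across the queue."""

    def __init__(self, sample_period=0.1, samples=5, clock=time.perf_counter):
        if samples < 2:
            raise ValueError("need at least two samples")
        self._clock = clock
        self._sample_period = sample_period
        self._frame_count = 0
        self._last_time = clock()   # time of the newest measurement (or creation time)
        # Each entry is (timestamp, total frame count at that timestamp);
        # maxlen makes the deque discard the oldest entry automatically.
        self._measurements = deque(maxlen=samples)

    def update(self):
        """Call once per drawn frame."""
        self._frame_count += 1
        now = self._clock()
        if now - self._last_time >= self._sample_period:
            self._measurements.append((now, self._frame_count))
            self._last_time = now

    def fps(self):
        """Frames between the oldest and newest measurement, divided by the time between them."""
        if len(self._measurements) < 2:
            return 0.0
        start_time, start_frames = self._measurements[0]
        end_time, end_frames = self._measurements[-1]
        period = end_time - start_time
        return (end_frames - start_frames) / period if period > 0 else 0.0
```

With the defaults this corresponds to the 100 ms and 5 samples chosen above: call update() every frame and fps() whenever the display is refreshed.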

Store a start time and increment your frame counter once per loop. Every few seconds you can print framecount / (now - starttime) and then reinitialize both.
edit: oops. double-ninja'ed
