I am using a stm32f103c8 and I need a function that will return the correct time in microseconds when called from within an interrupt handler. I found the following bit of code online which proports to do that:
uint32_t microsISR()
{
uint32_t ret;
uint32_t st = SysTick->VAL;
uint32_t pending = SCB->ICSR & SCB_ICSR_PENDSTSET_Msk;
uint32_t ms = UptimeMillis;
if (pending == 0)
ms++;
return ms * 1000 - st / ((SysTick->LOAD + 1) / 1000);
}
My understanding of how this works is uses the system clock counter which repeatedly counts down from 8000 (LOAD+1) and when it reaches zero, an interrupt is generated which increments the variable UptimeMills. This gives the time in milliseconds. To get microseconds we get the current value of the system clock counter and divide it by 8000/1000 to give the offset in microseconds. Since the counter is counting down we subtract it from the current time in milliseconds * 1000. (Actually to be correct I believe one should have be added to the # milliseconds in this calculation).
This is all fine and good unless, when this function is called (in an interrupt handler), the system clock counter has already wrapped but the system clock interrupt has not yet been called, then UptimeMillis count will be off by one. This is the purpose of the following lines:
if (pending == 0)
ms++;
Looking at this does not make sense, however. It is incrementing the # ms if there is NO pending interrupt. Indeed if I use this code, I get a large number of glitches in the returned time at the points at which the counter rolls over. So I changed the lines to:
if (pending != 0)
ms++;
This produced much better results but I still get the occasional glitch (about 1 in every 2000 interrupts) which always occurs at a time when the counter is rolling over.
During the interrupt, I log the current value of milliseconds, microseconds and counter value. I find there are two situations where I get an error:
Milli Micros DT Counter Pending
1 1661 1660550 826 3602 0
2 1662 1661374 824 5010 0
3 1663 1662196 822 6436 0
4 1663 1662022 -174 7826 0
5 1664 1663847 1825 1228 0
6 1665 1664674 827 2614 0
7 1666 1665501 827 3993 0
The interrupts are comming in at a regular rate of about 820us. In this case what seems to be happening between interrupt 3 and 4 is that the counter has wrapped but the pending flag is NOT set. So I need to be adding 1000 to the value and since I fail to do so I get a negative elapsed time.
The second situation is as follows:
Milli Micros DT Counter Pending
1 1814 1813535 818 3721 0
2 1815 1814357 822 5151 0
3 1816 1815181 824 6554 0
4 1817 1817000 1819 2 1
5 1817 1816817 -183 1466 0
6 1818 1817637 820 2906 0
This is a very similar situation except in this case the counter has NOT yet wrapped and yet I am already getting the pending interrupt flag which causes me to erronously add 1000.
Clearly there is some kind of race condition between the two competing interrupts. I have tried setting the clock interrupt priority both above and below that of the external interrupt but the problem persists.
Does anyone have any suggestions how to deal with this problem or a suggestion for a different approach to get the time is microseconds within an interrupt handler.
Read UptimeMillis before and after SysTick->VAL to ensure a rollover has not occurred.
uint32_t microsISR()
{
uint32_t ms = UptimeMillis;
uint32_t st = SysTick->VAL;
// Did UptimeMillis rollover while reading SysTick->VAL?
if (ms != UptimeMillis)
{
// Rollover occurred so read both again.
// Must read both because we don't know whether the
// rollover occurred before or after reading SysTick->VAL.
// No need to check for another rollover because there is
// no chance of another rollover occurring so quickly.
ms = UptimeMillis;
st = SysTick->VAL;
}
return ms * 1000 - st / ((SysTick->LOAD + 1) / 1000);
}
Or here is the same idea in a do-while loop.
uint32_t microsISR()
{
uint32_t ms;
uint32_t st;
// Read UptimeMillis and SysTick->VAL until
// UptimeMillis doesn't rollover.
do
{
ms = UptimeMillis;
st = SysTick->VAL;
} while (ms != UptimeMillis);
return ms * 1000 - st / ((SysTick->LOAD + 1) / 1000);
}
Related
I am trying to parallelize Monte Carlo simulation by using OpenCL. I use the MWC64X as a uniform random number generator. The code runs well on different Intel CPUs, since the output of parallel computation is very close to the sequential one.
Using OpenCL device: Intel(R) Xeon(R) CPU E5-2630L v3 # 1.80GHz
Literal influence running time: 0.029048 seconds r1 seqInfl= 0.4771
Literal influence running time: 0.029762 seconds r2 seqInfl= 0.4771
Literal influence running time: 0.029742 seconds r3 seqInfl= 0.4771
Literal influence running time: 0.02971 seconds ra seqInfl= 0.4771
Literal influence running time: 0.029225 seconds trust1-57 seqInfl= 0.6001
Literal influence running time: 0.04992 seconds trust110-1 seqInfl= 0
Literal influence running time: 0.034636 seconds trust4-57 seqInfl= 0
Literal influence running time: 0.049079 seconds trust57-110 seqInfl= 0
Literal influence running time: 0.024442 seconds trust57-4 seqInfl= 0.8026
Literal influence running time: 0.04946 seconds trust33-1 seqInfl= 0
Literal influence running time: 0.049071 seconds trust57-33 seqInfl= 0
Literal influence running time: 0.053117 seconds trust4-1 seqInfl= 0.1208
Literal influence running time: 0.051642 seconds trust57-1 seqInfl= 0
Literal influence running time: 0.052052 seconds trust57-64 seqInfl= 0
Literal influence running time: 0.052118 seconds trust64-1 seqInfl= 0
Literal influence running time: 0.051998 seconds trust57-7 seqInfl= 0
Literal influence running time: 0.052069 seconds trust7-1 seqInfl= 0
Total number of literals: 17
Sequential influence running time: 0.71728 seconds
Sequential maxInfluence Literal: trust57-4 0.8026
index1= 17 size= 51 dim1_size= 6
sum0:4781 influence0:0.478100 sum2:4781 influence2:0.478100 sum6:0 influence6:0.000000 sum10:0 sum12:0 influence12:0.000000 sum7:0 influence7:0.000000 influence10:0.000000 sum4:5962 influence4:0.596200 sum8:7971 influence8:0.797100 sum1:4781 influence1:0.478100 sum3:4781 influence3:0.478100 sum13:0 influence13:0.000000 sum11:1261 influence11:0.126100 sum9:0 influence9:0.000000 sum14:0 influence14:0.000000 sum5:0 influence5:0.000000 sum15:0 influence15:0.000000 sum16:0 influence16:0.000000
Parallel influence running time: 0.054391 seconds
Parallel maxInfluence Literal: trust57-4 Infl=0.7971
However, when I run the code on GeForce GTX 1080 Ti, with NVIDIA-SMI 430.40 and CUDA 10.1 and OpenCL 1.2 CUDA installed, the output is as below:
Using OpenCL device: GeForce GTX 1080 Ti
Influence:
Literal influence running time: 0.011119 seconds r1 seqInfl= 0.4771
Literal influence running time: 0.011238 seconds r2 seqInfl= 0.4771
Literal influence running time: 0.011408 seconds r3 seqInfl= 0.4771
Literal influence running time: 0.01109 seconds ra seqInfl= 0.4771
Literal influence running time: 0.011132 seconds trust1-57 seqInfl= 0.6001
Literal influence running time: 0.018978 seconds trust110-1 seqInfl= 0
Literal influence running time: 0.013093 seconds trust4-57 seqInfl= 0
Literal influence running time: 0.018968 seconds trust57-110 seqInfl= 0
Literal influence running time: 0.009105 seconds trust57-4 seqInfl= 0.8026
Literal influence running time: 0.018753 seconds trust33-1 seqInfl= 0
Literal influence running time: 0.018583 seconds trust57-33 seqInfl= 0
Literal influence running time: 0.02005 seconds trust4-1 seqInfl= 0.1208
Literal influence running time: 0.01957 seconds trust57-1 seqInfl= 0
Literal influence running time: 0.019686 seconds trust57-64 seqInfl= 0
Literal influence running time: 0.019632 seconds trust64-1 seqInfl= 0
Literal influence running time: 0.019687 seconds trust57-7 seqInfl= 0
Literal influence running time: 0.019859 seconds trust7-1 seqInfl= 0
Total number of literals: 17
Sequential influence running time: 0.272032 seconds
Sequential maxInfluence Literal: trust57-4 0.8026
index1= 17 size= 51 dim1_size= 6
sum0:10000 sum1:10000 sum2:10000 sum3:10000 sum4:10000 sum5:0 sum6:0 sum7:0 sum8:10000 sum9:0 sum10:0 sum11:0 sum12:0 sum13:0 sum14:0 sum15:0 sum16:0
Parallel influence running time: 0.193581 seconds
The "Influence" value equals sum*1.0/10000, thus the parallel influence only composes of 1 and 0, which is incorrect (in GPU runs) and doesn't happen when parallelizing on a Intel CPU.
When I check the output of the random number generator if(flag==0) printf("randint=%u",randint);, it seems the outputs are all zero on GPU. Below is the clinfo and the .cl code:
Device Name GeForce GTX 1080 Ti
Device Vendor NVIDIA Corporation
Device Vendor ID 0x10de
Device Version OpenCL 1.2 CUDA
Driver Version 430.40
Device OpenCL C Version OpenCL C 1.2
Device Type GPU
Device Topology (NV) PCI-E, 68:00.0
Device Profile FULL_PROFILE
Device Available Yes
Compiler Available Yes
Linker Available Yes
Max compute units 28
Max clock frequency 1721MHz
Compute Capability (NV) 6.1
Device Partition (core)
Max number of sub-devices 1
Supported partition types None
Max work item dimensions 3
Max work item sizes 1024x1024x64
Max work group size 1024
Preferred work group size multiple 32
Warp size (NV) 32
Preferred / native vector sizes
char 1 / 1
short 1 / 1
int 1 / 1
long 1 / 1
half 0 / 0 (n/a)
float 1 / 1
double 1 / 1 (cl_khr_fp64)
Half-precision Floating-point support (n/a)
Single-precision Floating-point support (core)
Denormals Yes
Infinity and NANs Yes
Round to nearest Yes
Round to zero Yes
Round to infinity Yes
IEEE754-2008 fused multiply-add Yes
Support is emulated in software No
Correctly-rounded divide and sqrt operations Yes
Double-precision Floating-point support (cl_khr_fp64)
Denormals Yes
Infinity and NANs Yes
Round to nearest Yes
Round to zero Yes
Round to infinity Yes
IEEE754-2008 fused multiply-add Yes
Support is emulated in software No
Address bits 64, Little-Endian
Global memory size 11720130560 (10.92GiB)
Error Correction support No
Max memory allocation 2930032640 (2.729GiB)
Unified memory for Host and Device No
Integrated memory (NV) No
Minimum alignment for any data type 128 bytes
Alignment of base address 4096 bits (512 bytes)
Global Memory cache type Read/Write
Global Memory cache size 458752 (448KiB)
Global Memory cache line size 128 bytes
Image support Yes
Max number of samplers per kernel 32
Max size for 1D images from buffer 134217728 pixels
Max 1D or 2D image array size 2048 images
Max 2D image size 16384x32768 pixels
Max 3D image size 16384x16384x16384 pixels
Max number of read image args 256
Max number of write image args 16
Local memory type Local
Local memory size 49152 (48KiB)
Registers per block (NV) 65536
Max number of constant args 9
Max constant buffer size 65536 (64KiB)
Max size of kernel argument 4352 (4.25KiB)
Queue properties
Out-of-order execution Yes
Profiling Yes
Prefer user sync for interop No
Profiling timer resolution 1000ns
Execution capabilities
Run OpenCL kernels Yes
Run native kernels No
Kernel execution timeout (NV) Yes
Concurrent copy and kernel execution (NV) Yes
Number of async copy engines 2
printf() buffer size 1048576 (1024KiB)
#define N 70 // N > index, which is the total number of literals
#define BASE 4294967296UL
//! Represents the state of a particular generator
typedef struct{ uint x; uint c; } mwc64x_state_t;
enum{ MWC64X_A = 4294883355U };
enum{ MWC64X_M = 18446383549859758079UL };
void MWC64X_Step(mwc64x_state_t *s)
{
uint X=s->x, C=s->c;
uint Xn=MWC64X_A*X+C;
uint carry=(uint)(Xn<C); // The (Xn<C) will be zero or one for scalar
uint Cn=mad_hi(MWC64X_A,X,carry);
s->x=Xn;
s->c=Cn;
}
//! Return a 32-bit integer in the range [0..2^32)
uint MWC64X_NextUint(mwc64x_state_t *s)
{
uint res=s->x ^ s->c;
MWC64X_Step(s);
return res;
}
__kernel void setInfluence(const int literals, const int size, const int dim1_size, __global int* lambdas, __global float* lambdap, __global int* dim2_size, __global float* influence){
int flag=get_global_id(0);
int sum=0;
int count=10000;
int assignment[N];
//or try to get newlambda like original version does
if(flag < literals){
mwc64x_state_t rng;
for(int i=0; i<count; i++){
for(int j=0; j<size; j++){
uint randint=MWC64X_NextUint(&rng);
float rand=randint*1.0/BASE;
//if(flag==0)
// printf("randint=%u",randint);
if(lambdap[j]<rand)
assignment[lambdas[j]]=0;
else
assignment[lambdas[j]]=1;
}
//the true case
assignment[flag]=1;
int valuet=0;
int index=0;
for(int m=0; m<dim1_size; m++){
int valueMono=1;
for(int n=0; n<dim2_size[m]; n++){
if(assignment[lambdas[index+n]]==0){
valueMono=0;
index+=dim2_size[m];
break;
}
}
if(valueMono==1){
valuet=1;
break;
}
}
//the false case
assignment[flag]=0;
int valuef=0;
index=0;
for(int m=0; m<dim1_size; m++){
int valueMono=1;
for(int n=0; n<dim2_size[m]; n++){
if(assignment[lambdas[index+n]]==0){
valueMono=0;
index+=dim2_size[m];
break;
}
}
if(valueMono==1){
valuef=1;
break;
}
}
sum += valuet-valuef;
}
influence[flag] = 1.0*sum/count;
printf("sum%d:%d\t", flag, sum);
}
}
What might be the problem when running the code on GPU? Is it MWC64X? According to its author, it can perform well on NVIDIA GPUs. If so, how can I fix it; if not, what might be the problem?
(This started out as a comment, it turns out this was the source of the problem so I'm turning it into an answer.)
You're not initialising your mwc64x_state_t rng; variable before reading from it, so any results will be undefined:
mwc64x_state_t rng;
for(int i=0; i<count; i++){
for(int j=0; j<size; j++){
uint randint=MWC64X_NextUint(&rng);
Where MWC64X_NextUint() immediately reads from the rng state before updating it:
uint MWC64X_NextUint(mwc64x_state_t *s)
{
uint res=s->x ^ s->c;
Note that you will probably want to seed your RNG differently for each work-item, otherwise you will get nasty correlation artifacts in your results.
All use-cases of a pseudo-random number are a next-level challenge in true-[PARALLEL] computing platforms (not languages, platforms).
Either, there is some source-of-randomness, which gets us into a trouble once massively-parallel requests are to get fair-handled in a truly [PARALLEL] fashion (here, hardware resources may help, yet at a cost of not being able to reproduce the same behaviour "outside" of this very same platform ( and moment-in-time, if such a source is not software-operated with some seed-injection feature, that may setup the "just"-pseudo-random algorithm that creates a pure-[SERIAL] sequence-of-produced "just"-pseudo-random numbers ) )
Or,there is some "shared"-generator of pseudo-random numbers, which enjoys of a higher level of system-wide level-of-entropy (which is good for the resulting "quality" of pseudo-randomness) but at a cost of pure-serial dependence (no parallel execution possible,serial sequence gets served one after another in a sequential manner) and having close to zero chance for repeatable runs (a must for reproducible science) providing repeatably same sequences, needed for testing and for method-validation cases.
RESUME :
The code may employ a work-item-"private" pseudo-random generating function(s) ( privacy is a must for the sake of both the parallel code-execution and the mutual independence (non-intervening processes) of generating these pseudo-random numbers ) , yet each of instances must be a) independently initialised, so as to provide the expected level of randomness achievable in parallelised code-runs and b) any such initialisation ought be performed in a repeatably reproducible manner, for the sake of running the test on different times, often using different OpenCL target computing-platforms.
For __kernel-s, that do not rely on hardware-specific sources-of-randomness, meeting the conditions a && b will suffice for receiving repeatably reproducible (same) results for testing "in vitro" and thus providing a reasonably random method for generating results during generic production-level use-case code-runs "in vivo".
The comparison of net-run-times (benchmarked above) seems to show that Amdahl's law add-on overhead costs plus a tail-end effect of the atomicity-of-work have finally decided the net-run-time was ~ 3.6x faster on XEON compared to GPU:
index1 = 17
size = 51
dim1_size = 6
sum0: 4781 influence0: 0.478100
sum2: 4781 influence2: 0.478100
sum6: 0 influence6: 0.000000
sum10: 0 influence10: 0.000000
sum12: 0 influence12: 0.000000
sum7: 0 influence7: 0.000000
sum4: 5962 influence4: 0.596200
sum8: 7971 influence8: 0.797100
sum1: 4781 influence1: 0.478100
sum3: 4781 influence3: 0.478100
sum13: 0 influence13: 0.000000
sum11: 1261 influence11: 0.126100
sum9: 0 influence9: 0.000000
sum14: 0 influence14: 0.000000
sum5: 0 influence5: 0.000000
sum15: 0 influence15: 0.000000
sum16: 0 influence16: 0.000000
Parallel influence running time: 0.054391 seconds on XEON E5-2630L v3 # 1.80GHz using OpenCL
|....
index1 = 17 |....
size = 51 |....
dim1_size = 6 |....
sum0: 10000 |....
sum1: 10000 |....
sum2: 10000 |....
sum3: 10000 |....
sum4: 10000 |....
sum5: 0 |....
sum6: 0 |....
sum7: 0 |....
sum8: 10000 |....
sum9: 0 |....
sum10: 0 |....
sum11: 0 |....
sum12: 0 |....
sum13: 0 |....
sum14: 0 |....
sum15: 0 |....
sum16: 0 |....
Parallel influence running time: 0.193581 seconds on GeForce GTX 1080 Ti using OpenCL
What is the maximum value of Program Clock Reference(PCR) in MPEG?
I understand that it is derived from a 27MHz clock, periodically loaded into a 42bit register.
PCR(i)=PCR_Base(i) * 300 + PCR_Ext(i)
where PCR_Base is loaded into a 33 bits register
PCR_Ext is loaded into a 9-bit register.
So, the maximum value of PCR w.r.t 27MHz clock is:
PCR = (2^33 - 1)*300 + (2^9 - 1) = 2,576,980,374,811.
=> (2,576,980,374,811/27,000,000) = 95443.7s = 1590.7 min = 26.5 hours
The register overflow happens after 26.5 hours of continuous streaming. Is this understanding correct?
PCR_ext(i) value should be 0 .. 299.
So the maximum PCR = (2^33-1)*300+299 = 2,576,980,377,599
I have a program I am trying to parallelize using OpenMP - it makes a very large loop over some data. Since incrementing a shared variable (so I can report progress as it goes) is somewhat of an issue, I thought I'd break the loop up into smaller chunks, loop over those multiple times, and just report the status at the end of/outside the openmp loop.
Problem is, before the OpenMP for loop starts for the 3rd time, the program locks up. Just sits there, does nothing. I've stripped out all but the simplest code. Here it is:
some other variable declarations for removed code above here
int dbl = 0;
int lasttime = 0;
int seedbase = 0;
const char *pl = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
const double mm = 62.0 / 2147483647.0;
for(dbl = 0; dbl < 2048 && !abort; dbl++) {
seedbase = dbl; //(dbl * 2097152) - 2147483648;
printf("Loop %d %d\n", dbl, abort);
#pragma omp parallel for private(seed) shared(dbl)
for(seed = 0; seed < 20971; seed++) { //52
if(dbl == 2)
printf("oo\n");
}
if(abort)
break;
lasttime = time();
hps = (double)((dbl*2097152) * clk_tck) / (double)((times(&tms) - start_time));
printf("So far: %0.2fsec (%0.2fhps) %0.2f sec left\n", (double)(times(&tms) - start_time) / (double)clk_tck, hps, (((long)1 << 32) - (dbl * 2097152)) / hps);
}
}
When compiled and run, I get:
Loop 0 0
So far: 0.02sec (0.00hps) inf sec left
Loop 1 0
So far: 0.02sec (104857600.00hps) 40.94 sec left
Loop 2 0
^C
Loop 0 starts, and the openmp runs (and does nothing) then exits, and the "So far:" is printed.
Loop 1 starts, same thing.
Loop 2 starts, and everything hangs. The printf("oo"); never happens. If I change the line to be if(dbl <= 2) my screen fills with looped "oo"'s as the loop runs.
But before the seed loop ever happens the third time - it's dead. Just sits there chewing up CPU time doing nothing.
Can you not quickly loop over a openmp loop? Is that the issue? I find it odd it's ALWAYS stopping before the 3rd run, regardless of how complex the code inside the seed loop is (I removed 200 lines of code - it had no effect)
Hi I am taking in data in real time where the value goes from 1009 , 1008 o 1007 to 0. I am trying to count the number of distinct times this occurs, for example the snippet below should count 2 distinct periods of change.
1008
1009
1008
0
0
0
1008
1007
1008
1008
1009
9
0
0
1009
1008
I have written a for loop as below but I can't figure out if the logic is correct as I get multiple increments instead of just the one
if(current != previous && current < 100)
x++;
else
x = x;
You tagged this with the LabVIEW tag. Is this actually supposed to be LabVIEW code?
Your logic has a bug related to the noise you say you have - if the value is less than 100 and it changes (for instance from 9 to 0), you log that as a change. You also have a line which doesn't do anything (x=x), although if this is supposed to be LV code, then this could make sense.
The code you posted here does not seem to make sense to me if I understand your goal. My understanding is that you want to identify this specific pattern:
1009
1008
1007
0
And that any deviation from this sequence of numbers would constitute data that should be ignored. To this end, you should be monitoring the history of the past 3 numbers. In C you might write this logic in the following way:
#include <stdio.h>
//Function to get the next value from our data stream.
int getNext(int *value) {
//Variable to hold our return code.
int code;
//Replace following line to get gext number from the stream. Possible read from a file?
*value = 0;
//Replace following logic to set 'code' appropriately.
if(*value == -1)
code = -1;
else
code = 0;
//Return 'code' to the caller.
return code;
}
//Example application for counting the occurrences of the sequence '1009','1008','1007','0'.
int main(int argc, char **argv) {
//Declare 4 items to store the past 4 items in the sequence (x0-x3)
//Declare a count and a flag to monitor the occurrence of our pattern
int x0 = 0, x1 = 0, x2 = 0, x3 = 0, count = 0, occurred = 0;
//Run forever (just as an example, you would provide your own looping structure or embed the algorithm in your app).
while(1) {
//Get the next element (implement getNext to provide numbers from your data source).
//If the function returns non-zero, exit the loop and print the count.
if( getNext(&x0) != 0 )
break;
//If the newest element is 0, we can trigger a check of the prior 3.
if(x0 == 0) {
//Set occurred to 0 if the prior elements don't match our pattern.
occurred = (x1 == 1007) && (x2 == 1008) && (x3 == 1009);
if(occurred) {
//Occurred was 1, meaning the pattern was matched. Increment our count.
count++;
//Reset occurred
occurred = 0;
}
//If the newest element is not 0, dont bother checking. Just shift the elements down our list.
} else {
x3 = x2; //Shift 3rd element to 4th position
x2 = x1; //Shift 2nd element to 3rd position
x1 = x0; //Shift 1st element to 2nd position
}
}
printf("The pattern count is %d\n", count);
//Exit application
return 0;
}
Note that the getNext function is just shown here as an example but obviously what I have implemented will not work. This function should be implemented based on how you are extracting data from the stream.
Writing the application in this way might not make sense within your larger application but the algorithm is what you should take away from this. Essentially you want to buffer 4 elements in a rolling window. You push the newest element into x0 and shift the others down. After this process you check the four elements to see if they match your desired pattern and increment the count accordingly.
If the requirement is to count falling edges and you don't care about the specific level, and want to reject noise band or ripple in the steady state then just make the conditional something like
if ((previous - current) > threshold)
No complex shifting, history, or filtering required. Depending on the application you can follow up with a debounce (persistency check) to ignore spurious samples (just keep track of falling/rising, or fell/rose as simple toggling state spanning a desired number of samples).
Code to the pattern, not the specific values; use constant or adjustable parameters to control the value sensitivity.
A simple question:
Which is the QueryPerformanceFrequency unit?
Hz (ticks per second)?
Thank you very much,
Bruno
Q: Units of QueryPerformanceFrequency?
A: KILO-HERTZ (NOT Hz)
=========== DETAILS ==============================================
My research indicates that both Counters and Freq are in KILOs, KILO-clock-ticks and KILO-HERTZ!
The counters register KILO-Clicks (KLICKS) and the freq is either in kHz or I am woefully UnderClocked. When you divide the Clock_Ticks by Clock_Frequency, kclicks/(kclicks*sec^-1), everything wipes out except for seconds.
Here is an example C program stripped to just the essentials:
#include "stdio.h"
#include <windows.h> // Needed for LARGE_INTEGER
// gcc cpu.freq.test.c -o cft.exe
// cft.exe -> Sleep d_KLICKS=3417790, d_time=0.999182880 sec, CPU_Freq=3420585 KILO-Hz
void main(int argc, char *argv[]) {
// Clock KILO-ticks start, end, CPU_Freq in kHz. KILOs cancel
LARGE_INTEGER sklick, eklick, cpu_khz;
double delta_time; // Expected time in SECONDS. All units above are k.
QueryPerformanceFrequency(&cpu_khz); // Gets clock KILO-tics, Klicks/sec
QueryPerformanceCounter(&sklick); // Capture cpu Start Klicks
Sleep(1000); // Sleep 1000 MILLI-seconds
QueryPerformanceCounter(&eklick); // Capture cpu End Klicks
delta_time = (eklick.QuadPart-sklick.QuadPart) / (double)cpu_khz.QuadPart;
printf("Sleep d_KLICKS=%lld, d_time=%4.9lf sec, CPU_Freq=%lld KILO-Hz\n",
eklick.QuadPart-sklick.QuadPart, delta_time, cpu_khz.QuadPart);
}
It actually compiles! Running...
Sleep d_KLICKS=3418803, d_time=0.999479036 sec, CPU_Freq=3420585 KILO-Hz
The CPU freq reads 3420585 or 3.420585E6 or 3.4 M-Hertz? <- MEGA-HURTS !OUCH!
The actual CPU freq is 3.4 Mega-Kilo-Hz or 3.4 GHz
microsoft appears to be confused (some things Never Change):
https://msdn.microsoft.com/en-us/library/windows/desktop/dn553408%28v=vs.85%29.aspx
QueryPerformanceFrequency(&Frequency);
QueryPerformanceCounter(&StartingTime);
// Activity to be timed
QueryPerformanceCounter(&EndingTime);
ElapsedMicroseconds.QuadPart = EndingTime.QuadPart - StartingTime.QuadPart;
// We now have the elapsed number of ticks, along with the
// number of ticks-per-second.
The number of "elapsed ticks" in 1 second is in the MILLIONS, NOT BILLIONS so they are NOT UNIT-CPU-CLOCK-TICKS but KILO-CPU-CLOCK-TICKS
Same off-by-3-orders-of-magnitude error for FREQ: 3.4 MILLION is not "ticks-per-second" but THOUSAND-ticks-per-second.
As long as you divide one by the other, the ?clicks cancel with a result in seconds. If one were so fatuous as to take ms at their document and try to use their "ticks-per-second" in some other calculation, you would wind up off by a factor of 1000 or ~1 standard_ms_error!
Perhaps we should call Heinrich in to check HIS units? Oops! 153 years too late. :(