Something faster than GetSystemTime?

Something faster than GetSystemTime? - windows

I'm writing a DDE logging applet in visual c++ that logs several hundred events per minute and I need a faster way to keep time than calling GetSystemTime in winapi. Do you have any ideas?
(asking this because in testing under load, all exceptions were caused by a call to getsystemtime)

There is a mostly undocumented struct named USER_SHARED_DATA at a high usermode readable address that contains the time and other global things, but GetSystemTime just extracts the time from there and calls RtlTimeToTimeFields so there is not much you can gain by using the struct directly.

Possibly crazy thought: do you definately need an accurate timestamp? Suppose you only got the system time say every 10th call - how bad would that be?

As per your comments; calling GetSystemDateFormat and GetSystemTimeFormat each time will be a waste of time. These are not likely to change, so those values could easily be cached for improved performance. I would imagine (without having actually tested it) that these two calls are far more time-consuming than a call to GetSystemTime.

first of all, find out why your code is throwing exceptions (assuming you have described it correctly: i.e. a real exception has been thrown, and the app descends down into kernel mode - which is really slow btw.)
then you will most likely solve any performance bottleneck.
Chris J.

First, The fastest way I could think is using RDTSC instruction. RDTSC is an Intel time-stamp counter instruction with a nanosecond precision. Use it as a combination with CPUID instruction. To use those instructions properly, you can read "Using the RDTSC Instruction for Performance Monitoring", then you can convert the nanoseconds to seconds for your purpose.
Second, consider using QueryPerformanceFrequency() and QueryPerformanceCounter().
Third, not the fastest but much faster than the standard GetSystemTime(), that is, using timeGetTime().

Related

Performance profiling a KEXT

How to measure performance impact of a kext in OS X in terms of CPU, memory or thread usage during some user defined activities ? Any particular method tool that can be use from user land ? OR any approach/method that can be considered?

You've essentially got 2 options:
Instrumenting your kext with time measurements. Take stamps before and after the operation you're trying to measure using mach_absolute_time(), convert to a human-readable unit using absolutetime_to_nanoseconds() and take the difference, then collect that information somewhere in your kext where it can be extracted from userspace.
Sampling kernel stacks using dtrace (iprofiler -kernelstacks -timeprofiler from the command line, or using Instruments.app)
Personally, I've had a lot more success with the former method, although it's definitely more work. Most kext code runs so briefly that a sampling profiler barely catches any instances of it executing, unless you reduce the sampling interval so far that measurements start interfering with the system, or your kext is seriously slow. It's pretty easy to do though, so it's often a valid sanity check.
You can also get your compiler to instrument your code with counters (-fprofile-arcs), which in theory will allow you to combine the sampling statistics with the branch counters to determine the runtime of each branch. Extracting this data is a pain though (my code may help) and again, the statistical noise has made this useless for me in practice.
The explicit method also allows you to measure asynchronous operations, etc., but of course also comes with some intrinsic overhead. Accumulating the data safely is also a little tricky. (I use atomic operations, but you could use spinlocks too. Don't forget to not just measure means but also standard deviation, and minimum/maximum times.) And extracting the data can be a pain because you have to add a userspace interface to your kext for it. But it's definitely worth it!

Who calls delay_tsc() on Linux

I used OProfile to profiling my Linux box. During the profiling processes, I've found that besides "native_safe_halt" function, the "delay_tsc" is the second most significant function consuming cpu cycles (around 10%). It seems delay_tsc() is a busy loop. But who calls it and what is its function?

Nobody calls it directly since it's a local function inside that piece of source you link to. The way to call it is by the published __delay() function.
When you call __delay(), this will use the delay_fn function pointer (also local to that file) to select one of several delay functions. By default, the one selected is delay_loop(), which uses x86 instructions to try and mark time.
However, if use_tsc_delay() has been called (at boot time), it switches the function pointer to delay_tsc(), which uses the time stamp counter (a CPU counter) to mark time.
It's called by any kernel code that wants a reasonably reliable, high-resolution delay function. You can see all the code in the kernel that references __delay here (quite a few places).
I think it's probably pretty safe, in terms of profiling, to ignore the time spent in that function since its intent is to delay. In other words, it's not useful work that's taking a long time to perform - if callers didn't want to delay, they wouldn't call it.
Some examples from that list:
A watchdog timer uses it to pace the cores so that their output is not mixed up with each other, by delaying for some multiple of the current core ID.
The ATI frame buffer driver appears to use it for delays between low-level accesses to the hardware. In fact, it's used quite a bit for that purpose in many device drivers.
It's used during start-up to figure out the relationship between jiffies and the actual hardware speeds.

When should we use a scatter/gather(vectored) IO?

Windows file system supports scatter/gather IO.(Of course, other platform does)
But I don't know when do I use the IO mechanism.
Could you explain me a proper case?
And what benefit can we get from using the I/O mechanism?(Just a little IO request?)

You use Scatter/Gather IO when you are doing lots of random (i.e. non-sequential) reads / writes, and you want to save on context switches / syscalls - Scatter/Gather is a form of batching in this sense. However, unless you've got a very fast disk (or more likely, a large array of disks), the syscall cost is negligible.
If you were writing a Database server, you might care about this, but anything less than a big-iron machine handling thousands or millions of requests a second won't see any benefit.

Paul -- one extra note: one additional advantage is that you hand multiple requests to the disk driver at the same time. The driver then can sort the requests and issue them in the optimal order. While syscall time is small, seek time (many milliseconds) can be punitive (that's less than 1000 I/O's/sec).
Chris's comment about demonstrating the efficiency is pragmatic. Mother nature never lies. Well, almost never.

I would imagine that you would use scatter gatehr IO when you (a) suspected your application had a performance bottleneck, and (b) you built a performance analysis framework that could show significant improvment using it.
Unless you can show a provable improvement, the additional code complexity is just a risk, and theres no magic recipe that says that, when some condition is met, and application will automatically benefit in a significant way from some programming cleverness.
Or - to put it another way - dont base major architectural decisions based on the statements of 'some guy on an internet forum'. Create a test, and find out.

in posix, readv and writev read from or write to discontinuous memory but to read and write discontinuous file ranges from discontinuous memory in one go you want readx and writex which were one of the proposed posix additions
doing a readx is faster then doing a lot of reads as it's only one system call and it lets the disk scheduler have the most io's to reorder i remember some one saying that for the ext2/3/.. fsck program that they wanted this as it knows what ranges it wants

How do I obtain CPU cycle count in Win32?

In Win32, is there any way to get a unique cpu cycle count or something similar that would be uniform for multiple processes/languages/systems/etc.
I'm creating some log files, but have to produce multiple logfiles because we're hosting the .NET runtime, and I'd like to avoid calling from one to the other to log. As such, I was thinking I'd just produce two files, combine them, and then sort them, to get a coherent timeline involving cross-world calls.
However, GetTickCount does not increase for every call, so that's not reliable. Is there a better number, so that I get the calls in the right order when sorting?
Edit: Thanks to #Greg that put me on the track to QueryPerformanceCounter, which did the trick.

Heres an interesting article! says not to use RDTSC, but to instead use QueryPerformanceCounter.
Conclusion:
Using regular old timeGetTime() to do
timing is not reliable on many
Windows-based operating systems
because the granularity of the system
timer can be as high as 10-15
milliseconds, meaning that
timeGetTime() is only accurate to
10-15 milliseconds. [Note that the
high granularities occur on NT-based
operation systems like Windows NT,
2000, and XP. Windows 95 and 98 tend
to have much better granularity,
around 1-5 ms.]
However, if you call
timeBeginPeriod(1) at the beginning of
your program (and timeEndPeriod(1) at
the end), timeGetTime() will usually
become accurate to 1-2 milliseconds,
and will provide you with extremely
accurate timing information.
Sleep() behaves similarly; the length
of time that Sleep() actually sleeps
for goes hand-in-hand with the
granularity of timeGetTime(), so after
calling timeBeginPeriod(1) once,
Sleep(1) will actually sleep for 1-2
milliseconds,Sleep(2) for 2-3, and so
on (instead of sleeping in increments
as high as 10-15 ms).
For higher precision timing
(sub-millisecond accuracy), you'll
probably want to avoid using the
assembly mnemonic RDTSC because it is
hard to calibrate; instead, use
QueryPerformanceFrequency and
QueryPerformanceCounter, which are
accurate to less than 10 microseconds
(0.00001 seconds).
For simple timing, both timeGetTime
and QueryPerformanceCounter work well,
and QueryPerformanceCounter is
obviously more accurate. However, if
you need to do any kind of "timed
pauses" (such as those necessary for
framerate limiting), you need to be
careful of sitting in a loop calling
QueryPerformanceCounter, waiting for
it to reach a certain value; this will
eat up 100% of your processor.
Instead, consider a hybrid scheme,
where you call Sleep(1) (don't forget
timeBeginPeriod(1) first!) whenever
you need to pass more than 1 ms of
time, and then only enter the
QueryPerformanceCounter 100%-busy loop
to finish off the last < 1/1000th of a
second of the delay you need. This
will give you ultra-accurate delays
(accurate to 10 microseconds), with
very minimal CPU usage. See the code
above.

You can use the RDTSC CPU instruction (assuming x86). This instruction gives the CPU cycle counter, but be aware that it will increase very quickly to its maximum value, and then reset to 0. As the Wikipedia article mentions, you might be better off using the QueryPerformanceCounter function.

System.Diagnostics.Stopwatch.GetTimestamp() return the number of CPU cycle since a time origin (maybe when the computer start, but I'm not sure) and I've never seen it not increased between 2 calls.
The CPU Cycles will be specific for each computer so you can't use it to merge log file between 2 computers.

RDTSC output may depend on the current core's clock frequency, which for modern CPUs is neither constant nor, in a multicore machine, consistent.
Use the system time, and if dealing with feeds from multiple systems use an NTP time source. You can get reliable, consistent time readings that way; if the overhead is too much for your purposes, using the HPET to work out time elapsed since the last known reliable time reading is better than using the HPET alone.

Use the GetTickCount and add another counter as you merge the log files. Won't give you perfect sequence between the different log files, but it will at least keep all logs from each file in the correct order.

How can you insure your code runs with no variability in execution time due to cache?

In an embedded application (written in C, on a 32-bit processor) with hard real-time constraints, the execution time of critical code (specially interrupts) needs to be constant.
How do you insure that time variability is not introduced in the execution of the code, specifically due to the processor's caches (be it L1, L2 or L3)?
Note that we are concerned with cache behavior due to the huge effect it has on execution speed (sometimes more than 100:1 vs. accessing RAM). Variability introduced due to specific processor architecture are nowhere near the magnitude of cache.

If you can get your hands on the hardware, or work with someone who can, you can turn off the cache. Some CPUs have a pin that, if wired to ground instead of power (or maybe the other way), will disable all internal caches. That will give predictability but not speed!
Failing that, maybe in certain places in the software code could be written to deliberately fill the cache with junk, so whatever happens next can be guaranteed to be a cache miss. Done right, that can give predictability, and perhaps could be done only in certain places so speed may be better than totally disabling caches.
Finally, if speed does matter - carefully design the software and data as if in the old day of programming for an ancient 8-bit CPU - keep it small enough for it all to fit in L1 cache. I'm always amazed at how on-board caches these days are bigger than all of RAM on a minicomputer back in (mumble-decade). But this will be hard work and takes cleverness. Good luck!

Two possibilities:
Disable the cache entirely. The application will run slower, but without any variability.
Pre-load the code in the cache and "lock it in". Most processors provide a mechanism to do this.

It seems that you are referring to x86 processor family that is not built with real-time systems in mind, so there is no real guarantee for constant time execution (CPU may reorder micro-instructions, than there is branch prediction and instruction prefetch queue which is flushed each time when CPU wrongly predicts conditional jumps...)

This answer will sound snide, but it is intended to make you think:
Only run the code once.
The reason I say that is because so much will make it variable and you might not even have control over it. And what is your definition of time? Suppose the operating system decides to put your process in the wait queue.
Next you have unpredictability due to cache performance, memory latency, disk I/O, and so on. These all boil down to one thing; sometimes it takes time to get the information into the processor where your code can use it. Including the time it takes to fetch/decode your code itself.
Also, how much variance is acceptable to you? It could be that you're okay with 40 milliseconds, or you're okay with 10 nanoseconds.
Depending on the application domain you can even further just mask over or hide the variance. Computer graphics people have been rendering to off screen buffers for years to hide variance in the time to rendering each frame.
The traditional solutions just remove as many known variable rate things as possible. Load files into RAM, warm up the cache and avoid IO.

If you make all the function calls in the critical code 'inline', and minimize the number of variables you have, so that you can let them have the 'register' type.
This should improve the running time of your program. (You probably have to compile it in a special way since compilers these days tend to disregard your 'register' tags)
I'm assuming that you have enough memory not to cause page faults when you try to load something from memory. The page faults can take a lot of time.
You could also take a look at the generated assembly code, to see if there are lots of branches and memory instuctions that could change your running code.
If an interrupt happens in your code execution it WILL take longer time. Do you have interrupts/exceptions enabled?

Understand your worst case runtime for complex operations and use timers.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio