Is calling an API function considered an expensive operation (compared to calling a function that is part of my own application)?
I understand that some API functions cause a transition to kernel mode, but is that true for all of them?
More specifically: how expensive is it to call the GetThreadId function?
GetThreadId on an arbitrary thread requires a kernel transition, so it is more expensive than GetCurrentThreadId, which retrieves cached data entirely in user mode.
The cost of an API call depends heavily on the nature of the API itself. An API that retrieves a cached value in user mode is very cheap; an API that transitions to kernel mode is much more expensive, on the order of microseconds. APIs that read or write to disk can cost tens of milliseconds. APIs that work over the network can take seconds or minutes.
Moderately cheap. There's a user/kernel mode transition, but in the kernel, it's just a simple memory read. Dereference the handle, grab the ID from the structure.
More specifically: what viable alternatives do you see?
EDIT: unless it's the current thread ID that you're after. GetCurrentThreadId() is completely user-mode, therefore very, very cheap.
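For illustration, a minimal sketch of the two calls side by side (here the handle is just the pseudo-handle to the current thread, so both calls return the same ID):

```cpp
#include <windows.h>
#include <cstdio>

int main()
{
    // Cheap: reads the thread ID from user-mode data, no kernel transition.
    DWORD myId = GetCurrentThreadId();

    // More expensive: takes a handle and asks the kernel to look up the ID.
    DWORD sameId = GetThreadId(GetCurrentThread());

    std::printf("current: %lu, via handle: %lu\n", myId, sameId);
    return 0;
}
```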
This is not so for all of them. Most API calls are just calls into a DLL and do not require a mode transition. In that case the cost is a few to tens of nanoseconds per call.
When programming with OpenGL, glGet functions should supposedly be avoided because they force the GPU and CPU to synchronize. Does this also apply to the wgl function wglGetCurrentContext(), which obtains a handle to the current OpenGL context? If not, are there any other performance problems around wglGetCurrentContext()?
There is some misconception implied in your question:
glGet functions should be avoided because they force the GPU and CPU to synchronize
The first part is true. The second part is not. Most glGet*() calls do not force a GPU/CPU synchronization. They only read state stored in the driver, and do not involve the GPU at all.
There are some exceptions, which include the glGet*() calls that actually get data produced by the GPU. Typical examples include:
glGetBufferSubData(): Has to block if data is produced by the GPU (e.g. using transform feedback).
glGetTexImage(): Blocks if the texture data is produced by the GPU (e.g. if the texture is used as a render target).
glGetQueryObjectiv(..., GL_QUERY_RESULT, ...): Blocks if the query has not finished (see the sketch after this list).
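As a minimal sketch of the last case (assuming a current GL context, a loader such as GLEW/GLAD, and a query object whose glEndQuery() has already been issued), you can poll GL_QUERY_RESULT_AVAILABLE first and only fetch the result once it will not block:

```cpp
// Sketch: non-blocking retrieval of a query result.
// 'query' is assumed to be a query object that has already been ended.
GLint available = 0;
glGetQueryObjectiv(query, GL_QUERY_RESULT_AVAILABLE, &available);

if (available) {
    GLint samples = 0;
    glGetQueryObjectiv(query, GL_QUERY_RESULT, &samples); // will not block now
    // ... use 'samples' ...
} else {
    // Result not ready yet: do other work and poll again later
    // instead of stalling on GL_QUERY_RESULT.
}
```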
Now, it's still true that you should avoid glGet*() calls where possible. Mainly for two reasons:
Most of them are unnecessary, since they query state that you set yourself, so you should already know what it is. And any unnecessary call is a waste.
They may cause synchronization between threads in multi-threaded driver implementations. So they may result in synchronization, but not with the GPU.
There are of course some good uses for glGet*() calls, for example:
To get implementation limits, like maximum texture sizes, etc. Call once during startup.
Calls like glGetUniformLocation() and glGetAttribLocation(). Make sure you call them only once, after shader linkage, and cache the result (see the sketch after this list). Even these are avoidable by using layout qualifiers in the shader code, at least in recent GLSL versions.
glGetError(), during debugging only.
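As a minimal sketch of the caching idea (the struct and the uniform names are just placeholders; a current GL context and a loader such as GLEW/GLAD are assumed):

```cpp
// Sketch: look up uniform locations once, right after linking, and cache them.
struct ShaderLocations {
    GLint mvpMatrix  = -1;
    GLint diffuseTex = -1;
};

ShaderLocations queryLocations(GLuint program)
{
    ShaderLocations loc;
    loc.mvpMatrix  = glGetUniformLocation(program, "u_mvpMatrix");  // hypothetical names
    loc.diffuseTex = glGetUniformLocation(program, "u_diffuseTex");
    return loc;
}

// Later, every frame: use the cached values instead of calling glGetUniformLocation() again,
// e.g. glUniformMatrix4fv(loc.mvpMatrix, 1, GL_FALSE, mvp);
```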
As for wglGetCurrentContext(), I doubt that it would be very expensive. The current context is typically stored in some kind of thread local storage, which can be accessed very efficiently.
Still, I don't see why calling it would be necessary. If you need the context again later, you can store it away when you create it. And if that's not possible for some reason, you can call wglGetCurrentContext() once, and store the result. There definitely shouldn't be a need to call it repeatedly.
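A minimal sketch of the "store it once" approach (error handling omitted, and hdc is assumed to be a valid device context):

```cpp
// Sketch: keep the HGLRC around instead of querying it repeatedly.
HGLRC g_glContext = nullptr;

void initGl(HDC hdc)
{
    g_glContext = wglCreateContext(hdc);   // store it at creation time
    wglMakeCurrent(hdc, g_glContext);
}

// Anywhere else that needs the context, use g_glContext
// rather than calling wglGetCurrentContext() every time.
```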
All of the WGL functions vary widely in their performance characteristics depending on the vendor and driver.
I wouldn't expect wglGetCurrentContext to be an especially expensive call (unless you make it a huge number of times), since the WGL functions are generally divorced from the GL context's state vector.
That being said, SETTING the current context will cause all manner of syncing between contexts, often in deeply undocumented ways. I've dealt with a couple of AMD and Intel driver bugs where some things that were supposed to be synchronized via other means could ONLY be synchronized with a redundant MakeCurrent call every frame.
In a Windows application I have a class which wraps up a filename and a buffer. You construct it with a filename, and you can query the object to see if the buffer is filled yet, returning nullptr if not and the buffer address if so. When the object falls out of scope, the buffer is released:
class file_buffer
{
public:
    file_buffer(const std::string& file_name);
    ~file_buffer();
    void* buffer();
private:
    ...
};
I want to put the data into memory asynchronously, and as far as I see it I have two choices: either create a buffer and use overlapped IO through ReadFileEx, or use MapViewOfFile and touch the address on another thread.
At the moment I'm using ReadFileEx which presents some problems, as requests greater than about 16MB are prone to failure: I can try splitting up the request but then I get synchronisation issues, and if the object falls out of scope before the IO is complete I have buffer-cleanup issues. Also, if multiple instances of the class are created in quick succession things get very fiddly.
Mapping and touching the data on another thread would seem to be considerably easier since I won't have the upper limit issues: also if the client absolutely has to have the data right now, they can simply dereference the address, let the OS worry about page faults and take the blocking hit.
This application needs to support single core machines, so my question is: will page faults on another software thread be any more expensive than overlapped IO on the current thread? Will they stall the process? Does overlapped IO stall the process in the same way or is there some OS magic I don't understand? Are page faults carried out using overlapped IO anyway?
I've had a good read of these topics:
http://msdn.microsoft.com/en-us/library/aa365199(v=vs.85).aspx (IO Concepts in File Management)
http://msdn.microsoft.com/en-us/library/windows/desktop/aa366556(v=vs.85).aspx (File mapping)
but I can't seem to infer how to make a performance tradeoff.
You will definitely want to go with memory-mapped files. Overlapped IO (with FILE_FLAG_NO_BUFFERING) has been advocated as "the fastest way to get data into RAM" by some people for years, but this is only true in very contrived cases with very specific conditions. In the normal, average case, turning off the buffer cache is a serious anti-optimization.
Now, overlapped IO without FILE_FLAG_NO_BUFFERING has all the quirks of overlapped IO, and is about 50% slower (for a reason I still cannot understand).
I did some rather extensive benchmarking a year ago. The bottom line is: Memory mapped files are faster, better, less surprising.
Overlapped IO uses more CPU, is much slower when using the buffer cache, asynchronous reverts to synchronous under some well-documented and some undocumented conditions (e.g. encryption, compression, and... pure chance? request size? number of requests?), stalling your application at unpredictable times.
Submitting requests can sometimes take "funny" amounts of time, and CancelIO sometimes doesn't cancel anything but waits for completion. Processes with outstanding requests are unkillable. Managing buffers with outstanding overlapped writes is non-trivial extra work.
File mapping just works. Fullstop. And it works nicely. No surprises, no funny stuff. Touching every page has very little overhead and delivers as fast as the disk is able to deliver, and it takes advantage of the buffer cache. Your concern about a single-core CPU is no problem. If the touch-thread faults, it blocks, and as always when a thread blocks, another thread gets CPU time instead.
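For the file_buffer scenario in the question, a minimal sketch of the mapping approach might look like this (the helper name is mine; error handling and large-file details are omitted for brevity):

```cpp
#include <windows.h>
#include <string>

// Sketch: map a file read-only and hand out a pointer to its contents.
// The OS pages the data in on demand; touching it from a worker thread
// warms the cache without blocking the rest of the process.
void* map_file_readonly(const std::wstring& path, HANDLE& file, HANDLE& mapping)
{
    file = CreateFileW(path.c_str(), GENERIC_READ, FILE_SHARE_READ,
                       nullptr, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
    if (file == INVALID_HANDLE_VALUE)
        return nullptr;

    mapping = CreateFileMappingW(file, nullptr, PAGE_READONLY, 0, 0, nullptr);
    if (!mapping)
        return nullptr;

    // Map the whole file; pages are faulted in lazily as they are touched.
    return MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, 0);
}

// Cleanup (e.g. in the wrapper class's destructor):
//   UnmapViewOfFile(view); CloseHandle(mapping); CloseHandle(file);
```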
I'm even using file mapping for writing now, whenever I have more than a few bytes to write. This is somewhat non-trivial (you have to manually grow/preallocate files and mappings, and truncate to the actual length when closing), but with some helper classes it's entirely doable. Write 500 MiB of data, and it takes "zero time" (you basically do a memcpy; the actual write happens in the background, any time later, even after your program has finished). It's stunning how well this works, even though you know that it's the natural thing for an operating system to do.
Of course you had better not have a power failure before the OS has written out all pages, but that's true for any kind of writing. What's not on the disk yet is not on the disk -- there's really not much more to say about it than that. If you must be sure, you have to wait for a disk sync to complete, and even then you can't be sure the lights won't go out while you wait for the sync. That's life.
I don't claim to understand this better than you, as it seems you have done some investigation of your own, and to be totally sure you will need to experiment. But this is my understanding of the issues, in reverse order:
File mapping and overlapped IO in Windows are different implementations and neither relies on the other under the hood, but both use the asynchronous block device layer. As I imagine it, in the kernel every IO is actually asynchronous; some user operations simply wait for it to finish and so create the illusion of synchronicity.
From point 1, if a thread does IO, other threads in the same process will not stall, unless system resources are scarce or those other threads do IO themselves and face some kind of contention. This is true regardless of the kind of IO the first thread does: blocking, non-blocking, overlapped, or memory-mapped.
With memory-mapped files, the data is read at least one page at a time, probably more because of read-ahead, but you cannot be sure about that. So the probing thread will have to touch the mapped memory at least once on every page. That will look something like probe/block-probe-probe-probe-probe/block-probe... It might be a bit less efficient than a big overlapped read of several MB, or maybe the kernel programmers were smart and it is even more efficient. You will have to do a little profiling... and you could even go without the probing thread and see what happens.
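A minimal sketch of such a probing thread's work (the function name is mine; the volatile read just makes the per-page touch explicit):

```cpp
#include <windows.h>
#include <cstddef>

// Sketch: touch one byte per page so the OS faults the file in ahead of use.
void touch_pages(const void* base, size_t length)
{
    SYSTEM_INFO si;
    GetSystemInfo(&si);
    const size_t pageSize = si.dwPageSize;

    const volatile char* p = static_cast<const volatile char*>(base);
    for (size_t offset = 0; offset < length; offset += pageSize)
        (void)p[offset];  // each read may block on a page fault; that is the point
}

// Run it on a worker thread, e.g.:
//   std::thread([=]{ touch_pages(view, fileSize); }).detach();
```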
Cancelling overlapped operations is a PITA, so my recommendation is to go with memory-mapped files. That is way easier to set up, and you get extra functionality:
the mapping is usable even before all the data is in memory
the memory can/will be shared by several instances of the process
if the memory is in the cache, it will be ready instantaneously instead of just quickly.
if the data is read-only, you can protect the memory from writing, catching bugs.
I used OProfile to profile my Linux box. During the profiling process, I found that besides the "native_safe_halt" function, "delay_tsc" is the second most significant consumer of CPU cycles (around 10%). It seems delay_tsc() is a busy loop. But who calls it, and what is its function?
Nobody calls it directly since it's a local function inside that piece of source you link to. The way to call it is by the published __delay() function.
When you call __delay(), this will use the delay_fn function pointer (also local to that file) to select one of several delay functions. By default, the one selected is delay_loop(), which uses x86 instructions to try and mark time.
However, if use_tsc_delay() has been called (at boot time), it switches the function pointer to delay_tsc(), which uses the time stamp counter (a CPU counter) to mark time.
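Roughly, the dispatch looks like this (a simplified sketch of the pattern in arch/x86/lib/delay.c, not the actual kernel source, which also varies between versions):

```cpp
// Simplified sketch of the delay dispatch; the real code lives in arch/x86/lib/delay.c.
static void delay_loop(unsigned long loops);  // calibrated busy loop
static void delay_tsc(unsigned long loops);   // busy loop reading the time stamp counter

// Selected once: delay_loop by default, delay_tsc after use_tsc_delay().
static void (*delay_fn)(unsigned long) = delay_loop;

void use_tsc_delay(void)
{
    delay_fn = delay_tsc;   // done at boot when the TSC is usable
}

void __delay(unsigned long loops)
{
    delay_fn(loops);        // kernel callers end up here
}
```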
It's called by any kernel code that wants a reasonably reliable, high-resolution delay function. You can see all the code in the kernel that references __delay here (quite a few places).
I think it's probably pretty safe, in terms of profiling, to ignore the time spent in that function since its intent is to delay. In other words, it's not useful work that's taking a long time to perform - if callers didn't want to delay, they wouldn't call it.
Some examples from that list:
A watchdog timer uses it to pace the cores so that their output is not mixed up with each other, by delaying for some multiple of the current core ID.
The ATI frame buffer driver appears to use it for delays between low-level accesses to the hardware. In fact, it's used quite a bit for that purpose in many device drivers.
It's used during start-up to figure out the relationship between jiffies and the actual hardware speeds.
I'm writing a DDE logging applet in Visual C++ that logs several hundred events per minute, and I need a faster way to keep time than calling GetSystemTime in the Win32 API. Do you have any ideas?
(I'm asking because in testing under load, all the exceptions were caused by a call to GetSystemTime.)
There is a mostly undocumented struct named USER_SHARED_DATA at a high user-mode-readable address that contains the time and other global data, but GetSystemTime just extracts the time from there and calls RtlTimeToTimeFields, so there is not much you can gain by reading the struct directly.
Possibly crazy thought: do you definitely need an accurate timestamp? Suppose you only got the system time, say, every 10th call - how bad would that be?
As per your comments: calling GetSystemDateFormat and GetSystemTimeFormat each time is a waste of time. The formats are not likely to change, so those values could easily be cached for improved performance. I would imagine (without having actually tested it) that these two calls are far more time-consuming than a call to GetSystemTime.
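A minimal sketch of the caching idea (the helper name is mine, it is not thread-safe as written, and the refresh threshold should be tuned to whatever staleness the log can tolerate):

```cpp
#include <windows.h>

// Sketch: cache the last SYSTEMTIME and refresh it only when the tick count moves on.
SYSTEMTIME cached_system_time()
{
    static SYSTEMTIME cached = {};
    static ULONGLONG  lastTick = 0;

    ULONGLONG now = GetTickCount64();
    if (now != lastTick)            // refresh at most once per tick (~1-16 ms)
    {
        GetSystemTime(&cached);
        lastTick = now;
    }
    return cached;
}
```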
First of all, find out why your code is throwing exceptions (assuming you have described it correctly, i.e. a real exception is being thrown and the app drops down into kernel mode - which is really slow, by the way).
Then you will most likely solve the performance bottleneck.
Chris J.
First, the fastest way I can think of is the RDTSC instruction. RDTSC reads the CPU's time-stamp counter, which ticks at roughly the CPU clock rate, giving roughly nanosecond-level resolution. Use it in combination with the CPUID instruction (to serialize execution). To use these instructions properly, you can read "Using the RDTSC Instruction for Performance Monitoring"; you then convert the counted cycles into seconds for your purpose.
Second, consider using QueryPerformanceFrequency() and QueryPerformanceCounter() (a sketch follows below).
Third, not the fastest, but still much faster than the standard GetSystemTime(): use timeGetTime().
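A minimal sketch of the second option, converting QueryPerformanceCounter ticks to seconds (the helper name is mine; the frequency is fixed at boot, so it only needs to be queried once):

```cpp
#include <windows.h>

// Sketch: high-resolution elapsed time using the performance counter.
double elapsed_seconds(const LARGE_INTEGER& start)
{
    static LARGE_INTEGER freq = {};        // counts per second, fixed at boot
    if (freq.QuadPart == 0)
        QueryPerformanceFrequency(&freq);

    LARGE_INTEGER now;
    QueryPerformanceCounter(&now);
    return double(now.QuadPart - start.QuadPart) / double(freq.QuadPart);
}

// Usage:
//   LARGE_INTEGER t0; QueryPerformanceCounter(&t0);
//   ... work ...
//   double secs = elapsed_seconds(t0);
```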
I'm wondering which approach is faster, and why.
While writing a Win32 server I have read a lot about completion ports and overlapped I/O, but I have not read anything that suggests which set of APIs yields the best results in a server.
Should I use completion routines, or should I use the WaitForMultipleObjects API, and why?
You suggest two methods of doing overlapped I/O and ignore the third (or I'm misunderstanding your question).
When you issue an overlapped operation, a WSARecv() for example, you can specify an OVERLAPPED structure which contains an event and you can wait for that event to be signalled to indicate the overlapped I/O has completed. This, I assume, is your WaitForMultipleObjects() approach and, as previously mentioned, this doesn't scale well as you're limited to the number of handles that you can pass to WaitForMultipleObjects().
Alternatively you can pass a completion routine which is called when completion occurs. This is known as 'alertable I/O' and requires that the thread that issued the WSARecv() call is in an 'alertable' state for the completion routine to be called. Threads can put themselves in an alertable state in several ways (calling SleepEx() or the various EX versions of the Wait functions, etc). The Richter book that I have open in front of me says "I have worked with alertable I/O quite a bit, and I'll be the first to tell you that alertable I/O is horrible and should be avoided". Enough said IMHO.
There's a third way, before issuing the call you should associate the handle that you want to do overlapped I/O on with a completion port. You then create a pool of threads which service this completion port by calling GetQueuedCompletionStatus() and looping. You issue your WSARecv() with an OVERLAPPED structure WITHOUT an event in it and when the I/O completes the completion pops out of GetQueuedCompletionStatus() on one of your I/O pool threads and can be handled there.
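A minimal sketch of that third approach (socket setup, error handling, and the per-operation bookkeeping you would need in practice are omitted):

```cpp
#include <winsock2.h>
#include <windows.h>

// Sketch: one I/O pool thread servicing a completion port.
DWORD WINAPI IoThread(LPVOID param)
{
    HANDLE iocp = static_cast<HANDLE>(param);
    for (;;)
    {
        DWORD       bytes = 0;
        ULONG_PTR   key   = 0;           // per-connection context you supplied
        OVERLAPPED* ov    = nullptr;     // the OVERLAPPED from the completed WSARecv()

        if (!GetQueuedCompletionStatus(iocp, &bytes, &key, &ov, INFINITE))
        {
            if (!ov) break;              // port closed or fatal error
            // else: the I/O itself failed; handle the per-operation error here
        }
        // ... process 'bytes' of data for the connection identified by 'key' ...
    }
    return 0;
}

// Setup (per accepted socket 's'):
//   HANDLE iocp = CreateIoCompletionPort(INVALID_HANDLE_VALUE, nullptr, 0, 0);
//   CreateIoCompletionPort(reinterpret_cast<HANDLE>(s), iocp, connectionKey, 0);
//   then issue WSARecv() with an OVERLAPPED that has no event in it.
```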
As previously mentioned, Vista/Server 2008 have cleaned up how IOCPs work a little and removed the problem whereby you had to make sure that the thread that issued the overlapped request continued to run until the request completed. Link to a reference to that can be found here. But this problem is easy to work around anyway; you simply marshal the WSARecv over to one of your I/O pool threads using the same IOCP that you use for completions...
Anyway, IMHO using IOCPs is the best way to do overlapped I/O. Yes, getting your head around the overlapped/async nature of the calls can take a little time at the start but it's well worth it as the system scales very well and offers a simple "fire and forget" method of dealing with overlapped operations.
If you need some sample code to get you going then I have several articles on writing IO completion port systems and a heap of free code that provides a real-world framework for high performance servers; see here.
As an aside: IMHO, you really should read "Windows Via C/C++ (PRO-Developer)" by Jeffrey Richter and Christophe Nasarre, as it deals with all you need to know about overlapped I/O and most other advanced Windows platform techniques and APIs.
WaitForMultipleObjects is limited to 64 handles; in a highly concurrent application this could become a limitation.
Completion ports fit better with a model of having a pool of threads all of which are capable of handling any event, and you can queue your own (non-IO based) events into the port, whereas with waits you would need to code your own mechanism.
However, completion ports, and the event-based programming model, are a more difficult concept to really work with.
I would not expect any significant performance difference, but in the end you can only make your own measurements to reflect your usage. Note that Vista/Server 2008 made a change to completion ports so that the originating thread is no longer needed to complete IO operations; this may make a bigger difference (see this article by Mark Russinovich).
Table 6-3 in the book Network Programming for Microsoft Windows, 2nd Edition compares the scalability of overlapped I/O via completion ports vs. other techniques. Completion ports blow all the other I/O models out of the water when it comes to throughput, while using far fewer threads.
The difference between WaitForMultipleObjects() and I/O completion ports is that IOCP scales to thousands of objects, whereas WFMO() does not and should not be used for anything more than 64 objects (even though you could).
You can't really compare them for performance, because in the domain of < 64 objects, they will be essentially identical.
WFMO(), however, checks its handle array in index order and returns the lowest-index signalled object, so busy objects with low index numbers can starve objects with high index numbers. (E.g. if object 0 is going off constantly, it will starve objects 1, 2, 3, etc.) This is obviously undesirable.
I wrote an IOCP library (for sockets) to solve the C10K problem and put it in the public domain. On a 512 MB W2K machine I was able to get 4,000 sockets concurrently transferring data. (You can get a lot more sockets if they're idle - a busy socket consumes more non-paged pool, and that's the ultimate limit on how many sockets you can have.)
http://www.45mercystreet.com/computing/libiocp/index.html
The API should give you exactly what you need.
Not sure, but I use WaitForMultipleObjects and/or WaitForSingleObject. It's very convenient.
Either routine works, and I don't think one is significantly faster than the other.
These two approaches exist to satisfy different programming models.
WaitForMultipleObjects is there to facilitate the async completion pattern (like the UNIX select() function), while completion ports are geared more towards an event-driven model.
I personally think the WaitForMultipleObjects() approach results in cleaner, more thread-safe code.