programatically determine amount of time remaining before preemption - windows

i am trying to implement some custom lock-free structures. its operates similar to a stack so it has a take() and a free() method and operates on pointer and underlying array. typically it uses optimistic conncurrency. free() writes a dummy value to pointer+1 increments the pointer and writes the real value to the new address. take() reads the value at pointer in a spin/sleep style until it doesnt read the dummy value and then decrements the pointer. in both operations changes to the pointer are done with compare and swap and if it fails, the whole operation starts again. the purpose of the dummy value is to insure consistency since the write operation can be preempted after the pointer is incremented.
this situation leads me to wonder weather it is possible to prevent preemtion in that critical place by somhow determining how much time is left before the thread will be preempted by the scheduler for another thread. im not worried about hardware interrupts. im trying to eliminate the possible sleep from my reading function so that i can rely on a pure spin.
is this at all possible?
are there other means to handle this situation?
EDIT: to clarify how this may be helpful, if the critical operation is interrupted, it will effectively be like taking out an exclusive lock, and all other threads will have to sleep before they could continue with their operations
EDIT: i am not hellbent on having it solved like this, i am merely trying to see if its possible. the probability of that operation being interrupted in that location for a very long time is extremely unlikely and if it does happen it will be OK if all the other operations need to sleep so that it can complete.
some regard this as premature optimization, but this is just my pet project. regardless - that does not exclude research and sience from attempting to improve techniques. even though computer sience has reasonably matured and every new technology we use today is just an implementation of what was already known 40 years ago, we should not stop to be creative to address even the smallest of concerns, like trying to make a reasonable set of operations atomic woithout too much performance implications.

Such information surely exists somewhere, but it is of no use for you.
Under "normal conditions", you can expect upwards of a dozen DPCs and upwards of 1,000 interrupts per second. These do not respect your time slices, they occur when they occur. Which means, on the average, you can expect 15-16 interrupts within a time slice.
Also, scheduling does not strictly go quantum by quantum. The scheduler under present Windows versions will normally let a thread run for 2 quantums, but may change its opinion in the middle if some external condition changes (for example, if an event object is signalled).
Insofar, even if you know that you still have so and so many nanoseconds left, whatever you think you know might not be true at all.

Cnnot be done without time-travel. You're stuffed.

Related

Does context switching usually happen between calling a function, and executing it?

So I have been working on the source code of a complex application (written by hundreds of programmers) for a while now. And among other things, I have created some time checking functions, along with suitable data structures to measure execution periods of different segments of the main loop and run some analysis on these measurements.
Here's a pseudocode that helps explaining:
main()
{
TimeSlicingSystem::AddTimeSlice(0);
FunctionA();
TimeSlicingSystem::AddTimeSlice(3);
FuncitonB();
TimeSlicingSystem::AddTimeSlice(6);
PrintTimeSlicingValues();
}
void FunctionA()
{
TimeSlicingSystem::AddTimeSlice(1);
//...
TimeSlicingSystem::AddTimeSlice(2);
}
FuncitonB()
{
TimeSlicingSystem::AddTimeSlice(4);
//...
TimeSlicingSystem::AddTimeSlice(5);
}
PrintTimeSlicingValues()
{
//Prints the different between each slice, and the slice before it,
//starting from slice number 1.
}
Most measurements were very reasonable, for instance assigning a value to a local variable will cost less than a fraction of a microsecond. Most functions will execute from start to finish in a few microseconds, and rarely ever reach one millisecond.
I then ran a few tests for half an hour or so, and I found some strange results that I couldn't quite understand. Certain functions will be called, and when measuring the time from the moment of calling the function (last line in 'calling' code) to the first line inside the 'called' function will take a very long time, up to a 30 milliseconds period. That's happening in a loop that would otherwise complete a full iteration in less than 8 milliseconds.
To get a picture of that, in the pseudocode I included, the time period between the slice number 0, and the slice number 1, or the time between the slice number 3, and the slice number 4 is measured. This the sort of periods I am referring to. It is the measured time between calling a function, and running the first line inside the called function.
QuestionA. Could this behavior be due to thread, or process switching by the OS? Does calling a function is a uniquely vulnerable spot to that? The OS I am working on is Windows 10.
Interestingly enough, there was never a last line in a function returning to the first line after the call in the 'calling' code problem at all ( periods from slice number 2 to 3 or from 5 to 6 in pseudocode)! And all measurements were always less than 5 microseconds.
QuestionB. Could this be, in any way, due to the time measurement method I am using? Could switching between different cores gives an allusion of slower than actually is context switching due to clock differences? (although I never found a single negative delta time so far, which seems to refute this hypothesis altogether). Again, the OS I am working on is Windows 10.
My time measuring function looks looks this:
FORCEINLINE double Seconds()
{
Windows::LARGE_INTEGER Cycles;
Windows::QueryPerformanceCounter(&Cycles);
// add big number to make bugs apparent where return value is being passed to float
return Cycles.QuadPart * GetSecondsPerCycle() + 16777216.0;
}
QuestionA. Could this behavior be due to thread, or process switching by the OS?
Yes. Thread switches can happen at any time (e.g. when a device sends an IRQ that causes a different higher priority thread to unblock and preempt your thread immediately) and this can/will cause unexpected time delays in your thread.
Does calling a function is a uniquely vulnerable spot to that?
There's nothing particularly special about calling your own functions that makes them uniquely vulnerable. If the function involves the kernel's API a thread switch can be more likely, and some things (e.g. calling "sleep()") are almost guaranteed to cause a thread switch.
Also there's potential interaction with virtual memory management - often things (e.g. your executable file, your code, your data) use "memory mapped files" where accessing it for the first time may cause OS to fetch the code or data from disk (and your thread can be blocked until the code or data it wanted arrived from disk); and rarely used code or data can also be sent to swap space and need to be fetched.
QuestionB. Could this be, in any way, due to the time measurement method I am using?
In practice it's likely that Windows' QueryPerformanceCounter() is implemented with an RDTSC instruction (assuming 80x86 CPU/s) and doesn't involve the kernel at all, and for modern hardware it's likely that this is monatomic. In theory Windows could emulate RDTSC and/or implement QueryPerformanceCounter() in another way to guard against security problems (timing side channels), as has been recommended by Intel for about 30 years now, but this is unlikely (modern operating systems, including but not limited to Windows, tend to care more about performance than security); and in theory your hardware/CPU could be so old (about 10+ years old) that Windows has to implement QueryPerformanceCounter() in a different way, or you could be using some other CPU (e.g. ARM and not 80x86).
In other words; it's unlikely (but not impossible) that the time measurement method you're using is causing any timing problems.

V8 isolates mapped memory leaks

V8 developer is needed.
I've noticed that the following code leaks mapped memory (mmap, munmap), concretely the amount of mapped regions within cat /proc/<pid>/maps continuously grows and hits the system limit pretty quickly (/proc/sys/vm/max_map_count).
void f() {
auto platform = v8::platform::CreateDefaultPlatform();
v8::Isolate::CreateParams create_params;
create_params.array_buffer_allocator =
v8::ArrayBuffer::Allocator::NewDefaultAllocator();
v8::V8::InitializePlatform(platform);
v8::V8::Initialize();
for (;;) {
std::shared_ptr<v8::Isolate> isolate(v8::Isolate::New(create_params), [](v8::Isolate* i){ i->Dispose(); });
}
v8::V8::Dispose();
v8::V8::ShutdownPlatform();
delete platform;
delete create_params.array_buffer_allocator;
}
I've played a little bit with platform-linux.cc file and have found that UncommitRegion call just remaps region with PROT_NONE, but not release it. Probably thats somehow related to that problem..
There are several reasons why we recreate isolates during the program execution.
The first one is that creating new isolate along with discarding the old one is more predictable in terms of GC. Basically, I found that doing
auto remoteOldIsolate = std::async(
std::launch::async,
[](decltype(this->_isolate) isolateToRemove) { isolateToRemove->Dispose(); },
this->_isolate
);
this->_isolate = v8::Isolate::New(cce::Isolate::_createParams);
//
is more predictable and faster than call to LowMemoryNotification. So we monitor memory consumptions using GetHeapStatistics and recreate isolate when it hits the limit. Turns out we cannot consider GC activity as a part of code execution, this leads to bad user experience.
The second reason is that having isolate per code allows as to run several codes in parallel, otherwise v8::Locker will block second code for that particular isolate.
Looks like at this stage I have no choices and will rewrite application to have a pool of isolates and persistent context per code..of course this way code#1 may affect code#2 by doing many allocations and GC will run on code2 with no allocations at all, but at least it will not leak.
PS. I've mentioned that we use GetHeapStatistics for memory monitoring. I want to clarify a little bit that part.
In our case its a big problem when GC works during code execution. Each code has execution timeout (100-500ms). Having GC activity during code execution locks code and sometimes we have timeouts just for assignment operation. GC callbacks don't give you enough accuracy, so we cannot rely on them.
What we actually do, we specify --max-old-space-size=32000 (32GB). That way GC don't want to run, cuz it should see that a lot of memory exists. And using GetHeapStatistics (along with isolate recreation I've mentioned above) we have manual memory monitoring.
PPS. I also mentioned that sharing isolate between codes may affect users.
Say you have user#1 and user#2. Each of them have their own code, both are unrelated. code#1 has a loop with tremendous memory allocation, code#2 is just an assignment operation. Chances are GC will run during code#2 and user#2 will receive timeout.
V8 developer is needed.
Please file a bug at crbug.com/v8/new. Note that this issue will probably be considered low priority; we generally assume that the number of Isolates per process remains reasonably small (i.e., not thousands or millions).
have a pool of isolates
Yes, that's probably the way to go. In particular, as you already wrote, you will need one Isolate per thread if you want to execute scripts in parallel.
this way code#1 may affect code#2 by doing many allocations and GC will run on code2 with no allocations at all
No, that can't happen. Only allocations trigger GC activity. Allocation-free code will spend zero time doing GC. Also (as we discussed before in your earlier question), GC activity is split into many tiny (typically sub-millisecond) steps (which in turn are triggered by allocations), so in particular a short-running bit of code will not encounter some huge GC pause.
sometimes we have timeouts just for assignment operation
That sounds surprising, and doesn't sound GC-related; I would bet that something else is going on, but I don't have a guess as to what that might be. Do you have a repro?
we specify --max-old-space-size=32000 (32GB). That way GC don't want to run, cuz it should see that a lot of memory exists. And using GetHeapStatistics (along with isolate recreation I've mentioned above) we have manual memory monitoring.
Have you tried not doing any of that? V8's GC is very finely tuned by default, and I would assume that side-stepping it in this way is causing more problems than it solves. Of course you can experiment with whatever you like; but if the resulting behavior isn't what you were hoping for, then my first suggestion is to just let V8 do its thing, and only interfere if you find that the default behavior is somehow unsatisfactory.
code#1 has a loop with tremendous memory allocation, code#2 is just an assignment operation. Chances are GC will run during code#2 and user#2 will receive timeout.
Again: no. Code that doesn't allocate will not be interrupted by GC. And several functions in the same Isolate can never run in parallel; only one thread may be active in one Isolate at the same time.

Pre-emption can occur if the code exceeds the time slice intended for it, then how do we ensure code length/execution time in the spinlock?

--> Re-editing my question. I thought to picture my understanding. Here is the picture. Please correct me here. By task, I mean process only. A picture is worth a thousand words.
What will happen in the multi-processor, if the third process wants to acquire the lock.
Since, it is two processor. The third processor may try to acquire the lock on CPU A since CPU B is busy polling. This can lead to a problem. Is my understanding correct? So, while using spin lock - one should ensure that the number of process contending for the critical region should not be greater than the number of CPU's available in the system? Also, if my system is uniprocessor, I shouldn't use spin lock at all. As it is compiled off? Is my understanding correct? We do not use sleep inside spinlocks is that because we don't want the code to do pre-empted during sleep when actually inside the spinlock () - which disables pre-emption. But pre-emption can occur if the code execution time exceeds the time slice intended for it. Thus, my question is basic - should we use a short length of code inside critical region? Because a long execution code can cause also pre-emption as the sleep would cause?
The confusion is because you looking everything from "process context" only and totally forget Intr context, premption
http://www.makelinux.net/ldd3/chp-5-sect

Is there ever a situation where an infinite loop may be desired?

Infinite loops are taught as evil. Is there ever a good use?
When coding them by accident, the CPU peaks and I imagine memory does too, especially if assigning variables inside the loop.
If there is a good use, how are those issues prevented?
Basicly every operating system or server spins in an infinte loop.
To avoid these memory issues normally you wouldn't allocate memory inside the loop unless it can be freed later inside the same loop. For example you would allocate memory for a request and delete it once it was served.
To avoid cpu peaks you would wait for interrupts in case of an os or call a blocking function like poll() which waits for a new event once per iteration.
First of all, the word "infinite" in this phrase should be taken a bit more loosely. I am presuming you are talking about a while (true) loop with a break instruction, which will eventually end, as opposed to a loop which will run until the end of time and all humanity.
In the former sense, yes, there are use cases where it's appropriate:
Games use infinite game loops.
Embedded programs use infinite main loops.
Windows applications use infinite message loops.
One example where they might be used inappropriately is when they are used to create time delays by spinning the CPU, which is what novice programmers tend to do to avoid dealing with timer interrupts (or timer events, or other non-procedural constructs). However, when spinning the CPU is done to acquire a shared resource, then the "infinite loop" is also a perfectly valid implementation choice. Even the .NET CLR Monitor, for example, tries spinning for several hundred cycles before issuing a true wait on a kernel event handle and creating a more expensive thread switch.
In addition to programs that run on event loops (like the the system processes that #Christoph mentions), some languages have a concept known as a generator, that allow and even encourage you to write an infinite loop. The trick is that the object only runs for a finite time when it "yields" (returns) some expression. After that its state is "frozen" until it is needed again. For example, in Python you can have an object that alternates between LEFT and RIGHT:
def side():
while True:
yield "LEFT"
yield "RIGHT"
a = side()
print a.next()
print a.next()
print a.next()
Which would give LEFT RIGHT LEFT. The side function looks like an infinite loop with the statement While True:, but it will only ever run for a finite amount of time per call.
All the applications on your handset run in infinite event loops.

Haskell: Concurrent data structure guidelines

I've been trying to get a understanding of concurrency, and I've been trying to work out what's better, one big IORef lock or many TVars. I've came to the following guidelines, comments will be appreciated, regarding whether these are roughly right or whether I've missed the point.
Lets assume our concurrent data structure is a map m, accessed like m[i]. Lets also say we have two functions, f_easy and f_hard. The f_easy is quick, f_hard takes a long time. We'll assume the arguments to f_easy/f_hard are elements of m.
(1) If your transactions look roughly like this m[f_easy(...)] = f_hard(...), use an IORef with atomicModifyIORef. Laziness will ensure that m is only locked for a short time as it's updated with a thunk. Calculating the index effectively locks the structure (as something is going to get updated, but we don't know what yet), but once it's known what that element is, the thunk over the entire structure moves to a thunk only over that particular element, and then only that particular element is "locked".
(2) If your transactions look roughly like this m[f_hard(...)] = f_easy(...), and the don't conflict too much, use lots of TVars. Using an IORef in this case will effectively make the app single threaded, as you can't calculate two indexes at the same time (as there will be an unresolved thunk over the entire structure). TVars let you work out two indexes at the same time, however, the negative is that if two concurrent transactions both access the same element, and one of them is a write, one transaction must be scrapped, which wastes time (which could have been used elsewhere). If this happens a lot, you may be better with locks that come (via blackholing) from IORef, but if it doesn't happen very much, you'll get better parallelism with TVars.
Basically in case (2), with IORef you may get 100% efficiency (no wasted work) but only use 1.1 threads, but with TVar if you have a low number of conflicts you might get 80% efficiency but use 10 threads, so you still end up 7 times faster even with the wasted work.
Your guidelines are somewhat similar to the findings of [1] (Section 6) where the performance of the Haskell STM is analyzed:
"In particular, for programs that do not perform much work inside transactions, the commit overhead appears to be very high. To further observe this overhead, an analysis needs to be conducted on the performance of commit-time course-grain and fine-grain STM locking mechanisms."
I use atomicModifyIORef or an MVar when all the synchronization I need is something that simple locking will ensure. When looking at concurrent accesses to a data structure, it also depends on how this data structure is implemented. For example, if you store your data inside a IORef Data.Map and frequently perform read/write access then I think atmoicModifyIORef will degrade to a single thread performance, as you have conjectured, but the same will be true for a TVar Data.Map. My point is that it's important to use a data structure that is suitable for concurrent programming (balanced trees aren't).
That said, in my opinion the winning argument for using STM is composability: you can combine multiple operations into a single transactions without headaches. In general, this isn't possible using IORef or MVar without introducing new locks.
[1] The limits of software transactional memory (STM): dissecting Haskell STM applications on a many-core environment.
http://dx.doi.org/10.1145/1366230.1366241
Answer to #Clinton's comment:
If a single IORef contains all your data, you can simply use atomicModifyIORef for composition. But if you need to process lots of parallel read/write requests to that data, the performance loss might become significant, since every pair of parallel read/write requests to that data might cause a conflict.
The approach that I would try is to use a data structure where the entries themselves are stored inside a TVar (vs putting the whole data structure into a single TVar). That should reduce the possibility of livelocks, as transactions won't conflict that often.
Of course, you still want to keep your transactions as small as possible and use composability only if it's absolutely necessary to guarantee consistency. So far I haven't encountered a scenario where combining more than a few insert/lookup operations into a single transaction was necessary.
Beyond performance, I see a more fundamental reason to using TVar--the type system ensures you dont do any "unsafe" operations like readIORef or writeIORef. That your data is shared is a property of the type, not of the implementation. EDIT: unsafePerformIO is always unsafe. readIORef is only unsafe if you are also using atomicModifyIORef. At the very least wrap your IORef in a newtype and only expose a wrapped atomicModifyIORef
Beyond that, don't use IORef, use MVar or TVar
The first usage pattern you describe probably does not have nice performance characteristics. You likely end up being (almost) entirely single threaded--because of laziness no actual work happens each time you update the shared state, but whenever you need to use this shared state, the entire accumulated pile of thunks needs to be forced, and has a linear data dependency structure.
Having 80% efficiency but substantially higher parallelism allows you to exploit growing number of cores. You can expect minimal performance improvements over the coming years on single threaded code.
Many word CAS is likely coming to a processor near you in the form of "Hardware Transactional Memory" allowing STMs to become far more efficient.
Your code will be more modular--every piece of code has to be changed if you add more shared state when your design has all shared state behind a single reference. TVars and to a lesser extent MVars support natural modularity.

Resources