How many continuous cycles per process? - multiprocessing

I've just read that in multiprocessing, when a waiting process comes back into context, the entire cache becomes invalid and we see a lot of cache misses. I'm wondering how long a process runs continuously before it goes into the wait state. Is it long enough that the newly updated cache can be used meaningfully? But then, won't the other processes be waiting too long? Appreciate any help. Thanks!

The simple answer is "yes, it is usually long enough". If it weren't long enough, they would not put a cache on the chip. The time is controlled by the operating system; you can see a post on it here: Linux Scheduler Time Slice.
However, there are cases where it might not be long enough. This would typically be where you have a lot of I/O and not much computation in between. A good example would be a program like telnet that can read a character, output a character and then go to sleep. However, I don't think that it is correct to say that the entire cache is invalidated. This would only happen if the new process used enough memory to overwrite all of the cache entries from the previous process.
To avoid issues like this, use buffered file I/O or database access, keep caches in the application, and avoid character-by-character processing.
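As an illustration of the buffered-I/O advice (my own sketch, not from the answer above): read the file in large chunks so the process does a useful amount of work per time slice instead of sleeping after every character. The 64 KiB chunk size is an arbitrary choice.
#include <cstddef>
#include <fstream>
#include <vector>

// Reads the whole file in 64 KiB chunks; each chunk is handled in one go
// instead of issuing one read per character.
std::size_t read_buffered(const char* path) {
    std::ifstream in(path, std::ios::binary);
    std::vector<char> buf(1 << 16);
    std::size_t total = 0;
    while (in.read(buf.data(), static_cast<std::streamsize>(buf.size())) || in.gcount() > 0) {
        total += static_cast<std::size_t>(in.gcount());  // "process" the chunk here
    }
    return total;
}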

Related

V8 isolates mapped memory leaks

V8 developer is needed.
I've noticed that the following code leaks mapped memory (mmap, munmap); concretely, the number of mapped regions listed in cat /proc/<pid>/maps grows continuously and hits the system limit pretty quickly (/proc/sys/vm/max_map_count).
#include <libplatform/libplatform.h>
#include <v8.h>
#include <memory>

void f() {
  auto platform = v8::platform::CreateDefaultPlatform();
  v8::Isolate::CreateParams create_params;
  create_params.array_buffer_allocator =
      v8::ArrayBuffer::Allocator::NewDefaultAllocator();
  v8::V8::InitializePlatform(platform);
  v8::V8::Initialize();
  for (;;) {
    // Each iteration creates an isolate and immediately disposes it again.
    std::shared_ptr<v8::Isolate> isolate(
        v8::Isolate::New(create_params),
        [](v8::Isolate* i) { i->Dispose(); });
  }
  v8::V8::Dispose();
  v8::V8::ShutdownPlatform();
  delete platform;
  delete create_params.array_buffer_allocator;
}
I've played a little bit with the platform-linux.cc file and have found that the UncommitRegion call just remaps the region with PROT_NONE, but does not release it. Probably that's somehow related to the problem.
There are several reasons why we recreate isolates during program execution.
The first one is that creating a new isolate and discarding the old one is more predictable in terms of GC. Basically, I found that doing
auto remoteOldIsolate = std::async(
    std::launch::async,
    [](decltype(this->_isolate) isolateToRemove) { isolateToRemove->Dispose(); },
    this->_isolate
);
this->_isolate = v8::Isolate::New(cce::Isolate::_createParams);
is more predictable and faster than a call to LowMemoryNotification. So we monitor memory consumption using GetHeapStatistics and recreate the isolate when it hits the limit. It turns out we cannot treat GC activity as part of code execution; that leads to a bad user experience.
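For reference, a minimal sketch of the kind of check described above; GetHeapStatistics and used_heap_size() are real V8 API, while the helper name and the threshold parameter are made up for illustration:
#include <v8.h>

// Returns true when the isolate's used heap exceeds the given limit,
// signalling that it should be swapped out for a fresh isolate.
bool ShouldRecreate(v8::Isolate* isolate, size_t limit_bytes) {
    v8::HeapStatistics stats;
    isolate->GetHeapStatistics(&stats);
    return stats.used_heap_size() > limit_bytes;
}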
The second reason is that having an isolate per code allows us to run several codes in parallel; otherwise v8::Locker will block the second code for that particular isolate.
Looks like at this stage I have no choice and will rewrite the application to have a pool of isolates and a persistent context per code. Of course, this way code#1 may affect code#2 by doing many allocations and GC will run on code2 with no allocations at all, but at least it will not leak.
PS. I've mentioned that we use GetHeapStatistics for memory monitoring. I want to clarify that part a little bit.
In our case it's a big problem when GC works during code execution. Each code has an execution timeout (100-500 ms). GC activity during code execution locks the code, and sometimes we have timeouts just for an assignment operation. GC callbacks don't give you enough accuracy, so we cannot rely on them.
What we actually do is specify --max-old-space-size=32000 (32 GB). That way the GC doesn't want to run, because it sees that a lot of memory exists. And using GetHeapStatistics (along with the isolate recreation I've mentioned above) we do manual memory monitoring.
PPS. I also mentioned that sharing an isolate between codes may affect users.
Say you have user#1 and user#2. Each of them has their own code; both are unrelated. code#1 has a loop with tremendous memory allocation, code#2 is just an assignment operation. Chances are GC will run during code#2 and user#2 will receive a timeout.
V8 developer is needed.
Please file a bug at crbug.com/v8/new. Note that this issue will probably be considered low priority; we generally assume that the number of Isolates per process remains reasonably small (i.e., not thousands or millions).
have a pool of isolates
Yes, that's probably the way to go. In particular, as you already wrote, you will need one Isolate per thread if you want to execute scripts in parallel.
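For illustration, here is a minimal sketch of such an isolate pool. This is my own sketch under the assumptions of the question (one isolate per concurrently running script, a shared array-buffer allocator), not an official V8 pattern, and the class and method names are made up:
#include <v8.h>
#include <mutex>
#include <vector>

// A fixed-size pool of isolates. A thread acquires an isolate, runs its
// script inside it, and returns it to the pool afterwards.
class IsolatePool {
public:
    IsolatePool(size_t size, const v8::Isolate::CreateParams& params) {
        for (size_t i = 0; i < size; ++i)
            free_.push_back(v8::Isolate::New(params));
    }
    ~IsolatePool() {
        for (v8::Isolate* iso : free_) iso->Dispose();
    }
    // Returns nullptr if every isolate is currently in use; callers can
    // retry, queue, or fall back as they see fit.
    v8::Isolate* Acquire() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (free_.empty()) return nullptr;
        v8::Isolate* iso = free_.back();
        free_.pop_back();
        return iso;
    }
    void Release(v8::Isolate* iso) {
        std::lock_guard<std::mutex> lock(mutex_);
        free_.push_back(iso);
    }
private:
    std::mutex mutex_;
    std::vector<v8::Isolate*> free_;
};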
this way code#1 may affect code#2 by doing many allocations and GC will run on code2 with no allocations at all
No, that can't happen. Only allocations trigger GC activity. Allocation-free code will spend zero time doing GC. Also (as we discussed before in your earlier question), GC activity is split into many tiny (typically sub-millisecond) steps (which in turn are triggered by allocations), so in particular a short-running bit of code will not encounter some huge GC pause.
sometimes we have timeouts just for an assignment operation
That sounds surprising, and doesn't sound GC-related; I would bet that something else is going on, but I don't have a guess as to what that might be. Do you have a repro?
we specify --max-old-space-size=32000 (32 GB). That way the GC doesn't want to run, because it sees that a lot of memory exists. And using GetHeapStatistics (along with the isolate recreation I've mentioned above) we do manual memory monitoring.
Have you tried not doing any of that? V8's GC is very finely tuned by default, and I would assume that side-stepping it in this way is causing more problems than it solves. Of course you can experiment with whatever you like; but if the resulting behavior isn't what you were hoping for, then my first suggestion is to just let V8 do its thing, and only interfere if you find that the default behavior is somehow unsatisfactory.
code#1 has a loop with tremendous memory allocation, code#2 is just an assignment operation. Chances are GC will run during code#2 and user#2 will receive a timeout.
Again: no. Code that doesn't allocate will not be interrupted by GC. And several functions in the same Isolate can never run in parallel; only one thread may be active in one Isolate at the same time.
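For completeness, entering an isolate from a worker thread looks roughly like this with the public API (the helper name is made up; v8::Locker is what enforces the one-active-thread-per-isolate rule mentioned above):
#include <v8.h>

// Runs work inside the given isolate from the calling thread.
void RunInIsolate(v8::Isolate* isolate) {
    v8::Locker locker(isolate);                  // serialize access to the isolate
    v8::Isolate::Scope isolate_scope(isolate);
    v8::HandleScope handle_scope(isolate);
    v8::Local<v8::Context> context = v8::Context::New(isolate);
    v8::Context::Scope context_scope(context);
    // ... compile and run the user's script here ...
}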

Pre-emption can occur if the code exceeds the time slice intended for it, so how do we ensure the length/execution time of the code inside a spinlock?

--> Re-editing my question. I tried to picture my understanding. Here is the picture; please correct me. By task, I mean a process only. A picture is worth a thousand words.
What will happen on a multiprocessor if a third process wants to acquire the lock?
Since there are two processors, the third process may try to acquire the lock on CPU A while CPU B is busy polling. This can lead to a problem. Is my understanding correct? So, while using a spin lock, should one ensure that the number of processes contending for the critical region is not greater than the number of CPUs available in the system? Also, if my system is a uniprocessor, should I not use a spin lock at all, since it is compiled away? Is my understanding correct? We do not sleep inside a spinlock because we don't want the code to be pre-empted while asleep but still inside the spinlock, which disables pre-emption. But pre-emption can also occur if the code's execution time exceeds the time slice intended for it. Thus, my question is basic: should we keep the code inside the critical region short, since long-running code can cause pre-emption just as sleeping would?
The confusion is because you are looking at everything from the "process context" only and totally forgetting about interrupt context and pre-emption.
http://www.makelinux.net/ldd3/chp-5-sect
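As an aside on the "keep the critical section short" point: here is a minimal user-space sketch of a spinlock (std::atomic<bool>, not the kernel's spin_lock(), which additionally disables pre-emption). It shows why the critical section must stay short: every other contender burns its whole time slice spinning for as long as the holder stays inside the lock.
#include <atomic>

class Spinlock {
public:
    void lock() {
        // Busy-wait until we flip the flag from false to true.
        while (locked_.exchange(true, std::memory_order_acquire)) {
            // Spinning: this thread does no useful work while it waits.
        }
    }
    void unlock() { locked_.store(false, std::memory_order_release); }
private:
    std::atomic<bool> locked_{false};
};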

Overlapped IO or file mapping?

In a Windows application I have a class which wraps up a filename and a buffer. You construct it with a filename, and you can query the object to see if the buffer is filled yet: it returns nullptr if not and the buffer address if so. When the object falls out of scope, the buffer is released:
class file_buffer
{
public:
    file_buffer(const std::string& file_name);
    ~file_buffer();        // releases the buffer
    void* buffer();        // nullptr until the data has been loaded
private:
    ...
};
I want to put the data into memory asynchronously, and as far as I see it I have two choices: either create a buffer and use overlapped IO through ReadFileEx, or use MapViewOfFile and touch the address on another thread.
At the moment I'm using ReadFileEx which presents some problems, as requests greater than about 16MB are prone to failure: I can try splitting up the request but then I get synchronisation issues, and if the object falls out of scope before the IO is complete I have buffer-cleanup issues. Also, if multiple instances of the class are created in quick succession things get very fiddly.
Mapping and touching the data on another thread would seem to be considerably easier since I won't have the upper limit issues: also if the client absolutely has to have the data right now, they can simply dereference the address, let the OS worry about page faults and take the blocking hit.
This application needs to support single core machines, so my question is: will page faults on another software thread be any more expensive than overlapped IO on the current thread? Will they stall the process? Does overlapped IO stall the process in the same way or is there some OS magic I don't understand? Are page faults carried out using overlapped IO anyway?
I've had a good read of these topics:
http://msdn.microsoft.com/en-us/library/aa365199(v=vs.85).aspx (IO Concepts in File Management)
http://msdn.microsoft.com/en-us/library/windows/desktop/aa366556(v=vs.85).aspx (File mapping)
but I can't seem to infer how to make a performance tradeoff.
You will definitely want to go with memory-mapped files. Overlapped IO (with FILE_FLAG_NO_BUFFERING) has been advocated as "the fastest way to get data into RAM" by some people for years, but this is only true in very contrived cases with very specific conditions. In the normal, average case, turning off the buffer cache is a serious anti-optimization.
Now, overlapped IO without FILE_FLAG_NO_BUFFERING has all the quirks of overlapped IO, and is about 50% slower (for a reason I still cannot understand).
I did some rather extensive benchmarking a year ago. The bottom line is: memory-mapped files are faster, better, less surprising.
Overlapped IO uses more CPU, is much slower when using the buffer cache, asynchronous reverts to synchronous under some well-documented and some undocumented conditions (e.g. encryption, compression, and... pure chance? request size? number of requests?), stalling your application at unpredictable times.
Submitting requests can sometimes take "funny" amounts of time, and CancelIO sometimes doesn't cancel anything but waits for completion. Processes with outstanding requests are unkillable. Managing buffers with outstanding overlapped writes is non-trivial extra work.
File mapping just works. Fullstop. And it works nicely. No surprises, no funny stuff. Touching every page has very little overhead and delivers as fast as the disk is able to deliver, and it takes advantage of the buffer cache. Your concern about a single-core CPU is no problem. If the touch-thread faults, it blocks, and as always when a thread blocks, another thread gets CPU time instead.
I'm even using file mapping for writing now, whenever I have more than a few bytes to write. This is somewhat non-trivial (have to manually grow/preallocate files and mappings, and truncate to actual length when closing), but with some helper classes it's entirely doable. Write 500 MiB of data, and it takes "zero time" (you basically do a memcpy, the actual write happens in the background, any time later, even after your program has finished). It's stunning how well this works, even if you know that it's the natural thing for an operating system to do.
Of course you had better not have a power failure before the OS has written out all pages, but that's true for any kind of writing. What's not on the disk yet is not on the disk; there's really not much more to say about it than that. If you must be sure, you have to wait for a disk sync to complete, and even then you can't be sure the lights won't go out while you wait for the sync. That's life.
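To make the recommendation concrete, here is a hedged sketch of mapping a file read-only and touching its pages on a background thread, as the question describes. Error handling and cleanup are trimmed, and the struct/function names are purely illustrative.
#include <windows.h>

struct Mapping { HANDLE file; HANDLE mapping; const BYTE* view; SIZE_T size; };

// Touches one byte per page so the pages stream into memory in the background.
DWORD WINAPI TouchPages(LPVOID param) {
    const Mapping* m = static_cast<const Mapping*>(param);
    SYSTEM_INFO si;
    GetSystemInfo(&si);
    volatile BYTE sink = 0;
    for (SIZE_T off = 0; off < m->size; off += si.dwPageSize)
        sink = m->view[off];          // the page fault pulls this page into RAM
    return sink;                      // returning it keeps the reads from being optimized away
}

bool MapAndPrefetch(const wchar_t* path, Mapping* m) {
    m->file = CreateFileW(path, GENERIC_READ, FILE_SHARE_READ, nullptr,
                          OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
    if (m->file == INVALID_HANDLE_VALUE) return false;
    LARGE_INTEGER size;
    if (!GetFileSizeEx(m->file, &size)) return false;
    m->size = static_cast<SIZE_T>(size.QuadPart);
    m->mapping = CreateFileMappingW(m->file, nullptr, PAGE_READONLY, 0, 0, nullptr);
    if (!m->mapping) return false;
    m->view = static_cast<const BYTE*>(MapViewOfFile(m->mapping, FILE_MAP_READ, 0, 0, 0));
    if (!m->view) return false;
    CreateThread(nullptr, 0, TouchPages, m, 0, nullptr);   // prefetch in the background
    return true;
}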
I don't claim to understand this better than you, as it seems you have done some investigation, and to be totally sure you will need to experiment. But this is my understanding of the issues, in reverse order:
File mapping and overlapped IO in Windows are different implementations and neither of them relies on the other under the hood. But both use the asynchronous block device layer. As I imagine it, in the kernel every IO is actually asynchronous, but some user operations wait for it to finish and so create the illusion of synchronicity.
From point 1, if a thread does IO, other threads of the same process will not stall. That is, unless system resources are scarce or those other threads do IO themselves and face some kind of contention. This is true no matter what kind of IO the first thread does: blocking, non-blocking, overlapped, memory-mapped.
In memory-mapped files, the data is read at least one page at a time, probably more because of read-ahead, but you cannot be sure about that. So the probing thread will have to touch the mapped memory at least once on every page. That will be something like probe/block, probe, probe, probe, probe/block, probe... That might be a bit less efficient than one big overlapped read of several MB. Or maybe the kernel programmers were smart and it is even more efficient. You will have to do a little profiling... Hey, you could even go without the probing thread and see what happens.
Cancelling overlapping operations is a PITA, so my recommendation will be to go with the memory-mapped files. That is way easier to set up and you get extra functionality:
the memory is usable even before it is fully in memory
the memory can/will be shared by several instances of the process
if the memory is in the cache, it will be ready instantaneously instead of just quickly.
if the data is read-only, you can protect the memory from writing, catching bugs.

Preventing a heavy process from sinking in the swap file

Our service tends to fall asleep during the nights on our client's server, and then have a hard time waking up. What seems to happen is that the process heap, which is sometimes several hundreds of MB, is moved to the swap file. This happens at night, when our service is not used, and others are scheduled to run (DB backups, AV scans etc). When this happens, after a few hours of inactivity the first call to the service takes up to a few minutes (consequent calls take seconds).
I'm quite certain it's an issue of virtual memory management, and I really hate the idea of forcing the OS to keep our service in the physical memory. I know doing that will hurt other processes on the server, and decrease the overall server throughput. Having that said, our clients just want our app to be responsive. They don't care if nightly jobs take longer.
I vaguely remember there's a way to force Windows to keep pages in physical memory, but I really hate that idea. I'm leaning more towards some internal or external watchdog that will initiate higher-level functionality (there is already an internal scheduler that does very little and makes no difference). If there were a 3rd-party tool that provided that kind of service, it would be just as good.
I'd love to hear any comments, recommendations and common solutions to this kind of problem. The service is written in VC2005 and runs on Windows servers.
As you mentioned, forcing the app to stay in memory isn't the best way to share resources on the machine. A quick solution that you might find works well is to simply schedule an event that wakes your service up at a specific time each morning, before your clients start to use it. You can schedule it in the Windows Task Scheduler with a simple script or EXE call.
I'm not saying you want to do this, or that it is best practice, but you may find it works well enough for you. It seems to match what you've asked for.
Summary: Touch every page in the process, one page at a time, on a regular basis.
What about a thread that runs in the background and wakes up once every N seconds? Each time the thread wakes up, it attempts to read from address X, with the attempt protected by an exception handler in case you read a bad address. Then it increments X by the size of a page.
With the usual 4 KiB page size there are 1,048,576 pages in 4 GB, 786,432 pages in 3 GB and 524,288 pages in 2 GB. Divide your idle time (overnight dead time) by how often you want to (attempt to) hit each page.
SYSTEM_INFO si;
GetSystemInfo(&si);
SIZE_T sizeOfVMPage = si.dwPageSize;    // typically 4096 bytes

BYTE *ptr = NULL;
while (TRUE)
{
    __try
    {
        // Touch the page; 'volatile' stops the compiler from optimizing the read away.
        volatile BYTE b = *ptr;
        (void)b;
    }
    __except (EXCEPTION_EXECUTE_HANDLER)
    {
        // ignore, some pages won't be accessible
    }
    ptr += sizeOfVMPage;    // (wrap ptr back to NULL once the address space has been covered)
    Sleep(N * 1000);
}
You can get the sizeOfVMPage value from the dwPageSize field of the SYSTEM_INFO structure filled in by GetSystemInfo(), as shown above.
Don't try to avoid the exception handler by using if (!IsBadReadPtr(ptr)), because other threads in the app may be modifying memory protections at the same time. If you get bitten by this, it will be almost impossible to identify why (it will most likely be a non-repeatable race condition), so don't waste time on it.
Of course, you'd want to turn this thread off during the day and only run it during your dead-time.
A third approach could be to have your service run a thread that does something trivial, like incrementing a counter, and then sleeps for a fairly long period, say 10 seconds. This should have minimal effect on other applications but keep at least some of your pages available.
The other thing to ensure is that your data is localized.
In other words: do you really need all 300 MiB of the memory before you can do anything? Can the data structures you use be rearranged so that any particular request could be satisfied with only a few megabytes?
For example
if your 300 MiB of heap memory contains facial-recognition data, can the data internally be arranged so that male face data and female face data are each stored together? Or big noses separate from small noses?
if it has some sort of logical structure to it, can it be sorted so that a binary search can be used to skip over a lot of pages?
if it's a proprietary, in-memory database engine, can the data be better indexed/clustered so that it doesn't require so many memory page hits?
if they're image textures, can commonly used textures be located near each other?
Again: do you really need all 300 MiB of the memory before you can do anything? Can you not service a request without all that data back in memory?
Otherwise: scheduled task at 6 ᴀᴍ to wake it up.
In terms of cost, the cheapest and easiest solution is probably just to buy more RAM for that server, and then you can disable the page file entirely. If you're running 32-bit Windows, just buy 4GB of RAM. Then the entire address space will be backed with physical memory, and the page file won't be doing anything anyway.

A question about cache of filesystem

When I read a large file in the file system, can the cache improve the speed of the operation?
I think there are two different answers:
1. Yes, because the cache can prefetch, so the performance gets improved.
2. No. Even though reading from the cache is faster than reading from the disk, the data still has to come off the disk in the first place, so in the end the reading speed is still the speed of the disk and the cache doesn't help.
Which one is correct? How can I verify the answer?
[edit]
Another question:
What I am not sure about is this: when you turn on the cache, is the disk bandwidth used for
1. prefetching only, or
2. prefetching and reading?
Which one is correct? Whereas if you turn off the cache, the disk bandwidth is just used for reading.
If I turn off the cache and access the disk randomly, is the time needed comparable to the time needed to read sequentially with the cache turned on?
1 is definitely correct. The operating system can fetch from the disk to the cache while your code is processing the data it's already received. Yes, the disk may well still be the bottleneck - but you won't have read, process, read, process, read, process, but read+process, read+process, read+process. For example, suppose we have processing which takes half the time of reading. Representing time going down the page, we might have this sort of activity without prefetching:
Read
Read
Process
Read
Read
Process
Read
Read
Process
Whereas with prefetching, this is optimised to:
Read
Read
Read    Process
Read
Read    Process
Read
        Process
Basically the total time will be "time to read whole file + time to process last piece of data" instead of "time to read whole file + time to process whole file".
Testing it is tricky - you'll need to have an operating system where you can tweak or turn off the cache. Another alternative is to change how you're opening the file - for instance, in .NET if you open the file with FileOptions.SequentialScan the cache is more likely to do the right thing. Try with and without that option.
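On Linux, one way to approximate a cold-versus-cached comparison without rebooting is to ask the kernel to drop the file's cached pages with posix_fadvise(POSIX_FADV_DONTNEED) before one of the runs and time both. A rough sketch (the function name and buffer size are arbitrary, and timing is left to the caller):
#include <fcntl.h>
#include <unistd.h>
#include <vector>

// Reads the whole file and returns the number of bytes read. Call it twice,
// once with drop_cache_first = true (cold read) and once with false (cached
// read), and compare the wall-clock times.
long long read_all(const char* path, bool drop_cache_first) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;
    if (drop_cache_first)
        posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);  // drop this file's cached pages
    std::vector<char> buf(1 << 20);                    // 1 MiB per read() call
    long long total = 0;
    ssize_t n;
    while ((n = read(fd, buf.data(), buf.size())) > 0)
        total += n;
    close(fd);
    return total;
}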
This has spoken mostly about prefetching - general caching (keeping the data even after it's been delivered to the application) is a different matter, and obviously acts as a big win if you want to use the same data more than once. There's also "something in between" where the application has only requested a small amount of data, but the disk has read a whole block - the OS isn't actively prefetching blocks which haven't been requested, but can cache the whole block so that if the app then requests more data from the same block it can return that data from the cache.
The first answer is correct.
The disk has a fixed underlying performance, but that underlying performance differs in different circumstances. You obtain better real performance from a drive when you read long sections of data, e.g. when you cache ahead. So caching permits the drive to achieve a genuine improvement in its real performance.
In the general case, it will be faster with the cache. Some points to consider:
The data on the disk is organized in surfaces (aka heads), tracks and blocks. It takes the disk some time to position the reading heads so that you can start reading a track. Now you need five blocks from that track. Unfortunately, you ask for them in a different order than they appear on the physical media. The cache will help greatly by reading the whole track into memory (lots more blocks than you need), then reindexing them (when the head starts to read, it will probably be somewhere in the middle of the track, not at the start of the first block). Without this, you'd have to wait until the first block of the track rotates under the head before starting to read -> the time to read a track would be effectively doubled. So with a cache, you can read the blocks of a track in any order and you start reading as soon as the head arrives over the track.
If the file system is pretty full, the OS will start to squeeze your data into various empty spaces. Imagine block 1 is on track 5, block 2 is on track 7, block 3 is again on track 5. Without a cache, you'd lose a lot of time positioning the head. With a cache, track 5 is read and kept in RAM as the head goes to track 7, and when you ask for block 3, you get it immediately.
Large files need a lot of meta-data, namely where the data blocks for the file are. In this case, the cache will keep this data live as you read the file, saving you from a lot more head thrashing.
The cache will allow other programs to access their data in an efficient way as you hog the disk. So overall performance will be better. This is very important when a second program starts to write as you read. In this case, the cache will collect some writes before it interrupts your reads. Also, most programs read data, process it and then write it back. Without the cache, a program would either get in its own way or it would have to implement its own caching scheme to avoid head thrashing.
A cache allows the OS to reorder the disk I/O. Say you have blocks on track 5, 7 and 13 but file order asks for track 5, 13 and then 7. Obviously, it's more efficient to read track 7 on the way to 13 rather than going all the way to 13 and then come back to 7.
So while theoretically, reading lots of data would be faster without a cache, this is only true if your file is the only one on the disk and all meta-data is ordered perfectly, the physical layout of the data is in such a way that the reading heads always start reading a track at the start of the first block, etc.
Jon Skeet has a very interesting benchmark with .NET about this topic. The basic result was that the more processing you have to do per unit of data read, the more prefetching helps.
If the files are larger than your memory, then keeping them in the cache for later re-reads definitely has no way of helping.
Another point: chances are, frequently used files will already be in the cache before you even start to read one of them.
