TL;DR: Does it make sense to write multiple dumps for the same crash event, and if yes, what do you need to look out for?
We're using MiniDumpWriteDump to write a crash dump when there is an unhandled exception / abort / you-name-it in our application.
The code so far actually writes two dumps:
One with MiniDumpWithDataSegs to get a small one that can be sent even by crappy email without problems once zipped.
A full one with MiniDumpWithFullMemory to have the full info available should we need it.
To make this work, we call MiniDumpWriteDump twice:
1. Open/create file for small dump
2. Write small dump
3. Open/create file for large dump
4. Write large dump
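In code, the sequence looks roughly like this (a sketch only: error handling is elided, and the file names and exception-pointer plumbing are placeholders, not our actual code):

#include <windows.h>
#include <dbghelp.h>

void write_dumps(EXCEPTION_POINTERS* ep)
{
    MINIDUMP_EXCEPTION_INFORMATION mei = {};
    mei.ThreadId = GetCurrentThreadId();
    mei.ExceptionPointers = ep;
    mei.ClientPointers = FALSE;

    // Steps 1+2: the small dump
    HANDLE small_file = CreateFileW(L"crash_small.dmp", GENERIC_WRITE, 0,
                                    nullptr, CREATE_ALWAYS,
                                    FILE_ATTRIBUTE_NORMAL, nullptr);
    MiniDumpWriteDump(GetCurrentProcess(), GetCurrentProcessId(), small_file,
                      MiniDumpWithDataSegs, &mei, nullptr, nullptr);
    CloseHandle(small_file);

    // Steps 3+4: the full dump -- the other threads keep running in between
    HANDLE full_file = CreateFileW(L"crash_full.dmp", GENERIC_WRITE, 0,
                                   nullptr, CREATE_ALWAYS,
                                   FILE_ATTRIBUTE_NORMAL, nullptr);
    MiniDumpWriteDump(GetCurrentProcess(), GetCurrentProcessId(), full_file,
                      MiniDumpWithFullMemory, &mei, nullptr, nullptr);
    CloseHandle(full_file);
}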
As far as I can tell, one additional idea behind this scheme was that writing the small dump is faster. It's basically always sub-second, while writing the large dump often takes quite a few seconds, especially when the application is fully loaded and the large dump can easily be 1.2 GB or more.
The idea behind writing the small dump first, as far as I can tell, was that because it's faster, it would capture a snapshot closer to the moment of the crash, the process being heavily multithreaded.
Obviously, the threads of the process continue to run between the end of the first call and the start of the second call to MiniDumpWriteDump, so we do have quite a few cases where the info in the small dump is actually more accurate than the info in the large dump.
After thinking about this, however, I would assume that MiniDumpWriteDump has to suspend the threads of the process anyway in order to write the dump, so if we were to write the large dump first, the large dump would be the more accurate one.
Question
Should we write the large dump before the small one? Should we even be writing two dumps? Could we somehow have the system first suspend the threads of the process and then write two dumps that are completely "synchronous"?
I analyzed dumps from various customers for a couple of years; the following is only my personal perspective on your question. Hope this helps.
Should we write the large dump before the small one?
I don't consider the order important for typical issues like crashes and hangs. The crash spot is there, the deadlock is there in the dump, whether captured first or later.
Should we even be writing two dumps?
I would suggest writing at least one full dump. The small dump is very convenient for getting an initial impression of the problem, but it's very limited, especially when your application crashes. So you might ask the customer to email you the small dump for a first round of triage; if that doesn't reveal the root cause, then ask for the full dump. Technically you can strip a small dump out of a full dump, but you may not want your customer to do that sort of work for you. So this depends on how you interact with your customers.
Could we somehow have the system first suspend the threads of the process and then write two dumps that are completely "synchronous"?
Technically this is doable. For example, it's relatively easy out-of-process: a single NtSuspendProcess() suspends all target threads, but it has to be called from another process. If you prefer to do it in-process, you have to enumerate all threads and call SuspendThread() on each; this is how MiniDumpWriteDump() works. However, I don't think sync/async affects the accuracy of the dump.
Writing two dumps at the same time is not advisable because MiniDumpWriteDump is not thread safe.
All DbgHelp functions, such as this one, are single threaded. Therefore, calls from more than one thread to this function will likely result in unexpected behavior or memory corruption.
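If dumps can be triggered from more than one thread, the calls have to be serialized; a minimal sketch (the lock and wrapper are placeholders of mine, not part of DbgHelp):

#include <windows.h>
#include <dbghelp.h>

CRITICAL_SECTION g_dump_lock;  // InitializeCriticalSection() once at startup

BOOL write_dump_serialized(HANDLE file, MINIDUMP_TYPE type,
                           MINIDUMP_EXCEPTION_INFORMATION* mei)
{
    EnterCriticalSection(&g_dump_lock);  // DbgHelp is single-threaded
    BOOL ok = MiniDumpWriteDump(GetCurrentProcess(), GetCurrentProcessId(),
                                file, type, mei, nullptr, nullptr);
    LeaveCriticalSection(&g_dump_lock);
    return ok;
}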
Whether you should write a large dump along with a small dump depends on your application and the kind of bugs you expect. A minidump only contains stack information; it does not contain heap memory, handle information, or recently unloaded modules.
Obtaining stack information will obviously give you a stack trace, but if the stack trace only tells you that your last action was to reference some memory on the heap, the trace isn't much use. Your expected failure modes dictate which makes more sense. If you have legacy code that perhaps isn't using RAII to manage handles, or the handling of heap-allocated memory isn't as disciplined as you'd like, then a full dump will be useful.
You should also consider the person who will submit the memory dump. If your customers are on the Internet, they might not appreciate submitting a sizable memory dump. They might also be worried about the private data that may be submitted along with a full memory dump. A minidump is much smaller, easier to submit, and less likely (though not impossible) to contain private data. If your customers are running on an internal network, then a full memory dump is more acceptable.
It is better to write a minidump first and then the large dump. This way you are more likely to get some data out quickly rather than waiting for a full dump. A full dump can take a while, and users are often impatient; they may decide to kill the process so they can get back to work. Also, if the disk is getting full (potentially the cause of the crash), it's slightly more likely that you have room for a minidump than for a full dump.
DbgHelp.dll imports SuspendThread and ResumeThread. You can do the same thing. Call SuspendThread for all threads (minus the current one, of course), call MiniDumpWriteDump as many times as you need, then call ResumeThread on each thread you suspended. This should give you consistently accurate dumps.
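A sketch of that approach (error handling elided; the function name is mine):

#include <windows.h>
#include <tlhelp32.h>
#include <vector>

void suspend_write_resume()
{
    std::vector<HANDLE> suspended;
    HANDLE snap = CreateToolhelp32Snapshot(TH32CS_SNAPTHREAD, 0);
    THREADENTRY32 te = { sizeof(te) };
    for (BOOL ok = Thread32First(snap, &te); ok; ok = Thread32Next(snap, &te)) {
        if (te.th32OwnerProcessID == GetCurrentProcessId() &&
            te.th32ThreadID != GetCurrentThreadId()) {
            HANDLE h = OpenThread(THREAD_SUSPEND_RESUME, FALSE, te.th32ThreadID);
            if (h && SuspendThread(h) != (DWORD)-1)
                suspended.push_back(h);
        }
    }
    CloseHandle(snap);

    // ... call MiniDumpWriteDump twice here (small, then full);
    // both dumps now see the same frozen thread state ...

    for (HANDLE h : suspended) {
        ResumeThread(h);
        CloseHandle(h);
    }
}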
V8 developer is needed.
I've noticed that the following code leaks mapped memory (mmap, munmap); specifically, the number of mapped regions shown in cat /proc/<pid>/maps grows continuously and hits the system limit (/proc/sys/vm/max_map_count) pretty quickly.
#include <memory>

#include <v8.h>
#include <libplatform/libplatform.h>

void f() {
  auto platform = v8::platform::CreateDefaultPlatform();
  v8::Isolate::CreateParams create_params;
  create_params.array_buffer_allocator =
      v8::ArrayBuffer::Allocator::NewDefaultAllocator();
  v8::V8::InitializePlatform(platform);
  v8::V8::Initialize();
  for (;;) {
    // Each iteration creates an isolate and disposes it immediately,
    // yet the number of mapped regions keeps growing.
    std::shared_ptr<v8::Isolate> isolate(
        v8::Isolate::New(create_params),
        [](v8::Isolate* i) { i->Dispose(); });
  }
  v8::V8::Dispose();
  v8::V8::ShutdownPlatform();
  delete platform;
  delete create_params.array_buffer_allocator;
}
I've played a little bit with the platform-linux.cc file and found that the UncommitRegion call just remaps the region with PROT_NONE but does not release it. Probably that's somehow related to the problem.
There are several reasons why we recreate isolates during the program execution.
The first one is that creating a new isolate and discarding the old one is more predictable in terms of GC. Basically, I found that doing
auto remoteOldIsolate = std::async(
    std::launch::async,
    [](decltype(this->_isolate) isolateToRemove) { isolateToRemove->Dispose(); },
    this->_isolate
);
this->_isolate = v8::Isolate::New(cce::Isolate::_createParams);
is more predictable and faster than a call to LowMemoryNotification. So we monitor memory consumption using GetHeapStatistics and recreate the isolate when it hits the limit. It turns out we cannot treat GC activity as part of code execution; that leads to a bad user experience.
The second reason is that having an isolate per code allows us to run several codes in parallel; otherwise v8::Locker will block the second code for that particular isolate.
Looks like at this stage I have no choice and will rewrite the application to have a pool of isolates and a persistent context per code. Of course, this way code#1 may affect code#2 by doing many allocations, and GC will run on code#2 with no allocations at all, but at least it will not leak.
PS. I mentioned that we use GetHeapStatistics for memory monitoring. I want to clarify that part a little bit.
In our case it's a big problem when GC runs during code execution. Each code has an execution timeout (100-500 ms). GC activity during code execution blocks the code, and sometimes we get timeouts for a mere assignment operation. GC callbacks don't give enough accuracy, so we cannot rely on them.
What we actually do is specify --max-old-space-size=32000 (32 GB). That way the GC doesn't want to run, because it sees that plenty of memory is available. And using GetHeapStatistics (along with the isolate recreation mentioned above) we do manual memory monitoring.
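For reference, the check itself is tiny; a sketch (the helper name and limit are placeholders, not our real code):

#include <v8.h>

bool ShouldRecreateIsolate(v8::Isolate* isolate, size_t limit_bytes)
{
    v8::HeapStatistics stats;
    isolate->GetHeapStatistics(&stats);
    return stats.used_heap_size() > limit_bytes;  // recreate above the limit
}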
PPS. I also mentioned that sharing an isolate between codes may affect users.
Say you have user#1 and user#2. Each of them has their own code, and the two are unrelated. code#1 has a loop with tremendous memory allocation; code#2 is just an assignment operation. Chances are GC will run during code#2, and user#2 will receive a timeout.
V8 developer is needed.
Please file a bug at crbug.com/v8/new. Note that this issue will probably be considered low priority; we generally assume that the number of Isolates per process remains reasonably small (i.e., not thousands or millions).
have a pool of isolates
Yes, that's probably the way to go. In particular, as you already wrote, you will need one Isolate per thread if you want to execute scripts in parallel.
this way code#1 may affect code#2 by doing many allocations and GC will run on code2 with no allocations at all
No, that can't happen. Only allocations trigger GC activity. Allocation-free code will spend zero time doing GC. Also (as we discussed before in your earlier question), GC activity is split into many tiny (typically sub-millisecond) steps (which in turn are triggered by allocations), so in particular a short-running bit of code will not encounter some huge GC pause.
sometimes we have timeouts just for assignment operation
That sounds surprising, and doesn't sound GC-related; I would bet that something else is going on, but I don't have a guess as to what that might be. Do you have a repro?
we specify --max-old-space-size=32000 (32GB). That way GC don't want to run, cuz it should see that a lot of memory exists. And using GetHeapStatistics (along with isolate recreation I've mentioned above) we have manual memory monitoring.
Have you tried not doing any of that? V8's GC is very finely tuned by default, and I would assume that side-stepping it in this way is causing more problems than it solves. Of course you can experiment with whatever you like; but if the resulting behavior isn't what you were hoping for, then my first suggestion is to just let V8 do its thing, and only interfere if you find that the default behavior is somehow unsatisfactory.
code#1 has a loop with tremendous memory allocation, code#2 is just an assignment operation. Chances are GC will run during code#2 and user#2 will receive timeout.
Again: no. Code that doesn't allocate will not be interrupted by GC. And several functions in the same Isolate can never run in parallel; only one thread may be active in one Isolate at the same time.
I have 3 OCaml modules; the last one of the three does the actual computation and uses functions defined in the other 2.
The basic idea of the program is to have a port as the initial state, which contains boats, and to move those boats until we reach either a winning situation or no more moves are possible.
The actual code is not really the problem here. It is probably not very efficient, but it finds a solution... except when it does not.
While the program takes a few seconds for complex situations, in some cases, when I add a bit more complexity, it takes forever to find a solution.
When I execute the program like this:
$ ocamlc -o test.exe port.ml moves.ml solver.ml
$ ./test.exe > file
The resulting file is really huge, but its size stops increasing after some time.
It seems to me that after a while the program stops running, but without terminating: no stack overflow or out-of-memory error is thrown. The program simply does not continue execution. In other words, the command
$ ./test.exe > file
is still running, but no new lines are added to the file. If I log to the shell itself rather than to a file, I get the same result: no new lines appear after some time.
What could this be?
The main function (the one responsible for finding a solution) uses a depth-first search algorithm and contains a lot of List operations, such as List.fold, List.map, List.iter, List.partition, and List.filter. I was thinking that maybe these functions have problems dealing with huge lists of complex types at some point, but again, no error is thrown; the execution just stops.
I explained this very vaguely, but I really don't understand the problem here. I don't know whether it has to do with my shell (Ubuntu subsystem on Windows) running out of memory, or with OCaml's List functions being limited at some point... If you have any suggestions, feel free to comment.
To debug such cases you should use diagnostic utilities provided by your operating system and the OCaml infrastructure itself.
First of all, you should look into the state of your process. You can use the top or htop utilities if you're running a Unix machine; otherwise, you can use the task manager.
If the process is running out of physical memory, it could be swapped out by the operating system. In that case, all memory operations turn into hard drive reads and writes, so garbage collecting a heap stored on a hard drive will take a long time. If this is the case, you can use a memory profiler to identify the crux of the problem.
If the process is running constantly without a change in its memory footprint, then it looks like you either hit a bug in your code, i.e., an infinite loop, or some of your algorithms have exponential complexity, as Konstantin mentioned in the comment. Use debugging output or tracing to identify the location where the program stalls.
Finally, if your program is in the sleeping state, then it could be a deadlock. For example, if you're reading and writing to the same file, this can end up in a race condition. In general, if your program is multithreaded or operates multiple processes, there are lots of ways to end up with a race condition.
In a Windows application I have a class which wraps up a filename and a buffer. You construct it with a filename, and you can query the object to see if the buffer is filled yet, returning nullptr if not and the buffer address if so. When the object falls out of scope, the buffer is released:
class file_buffer
{
public:
    file_buffer(const std::string& file_name);
    ~file_buffer();
    void* buffer();
private:
    ...
};
I want to put the data into memory asynchronously, and as far as I see it I have two choices: either create a buffer and use overlapped IO through ReadFileEx, or use MapViewOfFile and touch the address on another thread.
At the moment I'm using ReadFileEx, which presents some problems: requests greater than about 16 MB are prone to failure. I can try splitting up the request, but then I get synchronisation issues, and if the object falls out of scope before the IO is complete I have buffer-cleanup issues. Also, if multiple instances of the class are created in quick succession, things get very fiddly.
Mapping and touching the data on another thread would seem to be considerably easier, since I won't have the upper-limit issues: also, if the client absolutely has to have the data right now, they can simply dereference the address, let the OS worry about page faults, and take the blocking hit.
This application needs to support single core machines, so my question is: will page faults on another software thread be any more expensive than overlapped IO on the current thread? Will they stall the process? Does overlapped IO stall the process in the same way or is there some OS magic I don't understand? Are page faults carried out using overlapped IO anyway?
I've had a good read of these topics:
http://msdn.microsoft.com/en-us/library/aa365199(v=vs.85).aspx (IO Concepts in File Management)
http://msdn.microsoft.com/en-us/library/windows/desktop/aa366556(v=vs.85).aspx (File mapping)
but I can't seem to infer how to make a performance tradeoff.
You will definitely want to go with memory-mapped files. Overlapped IO (with FILE_FLAG_NO_BUFFERING) has been advocated as "the fastest way to get data into RAM" by some people for years, but this is only true in very contrived cases with very specific conditions. In the normal, average case, turning off the buffer cache is a serious anti-optimization.
Now, overlapped IO without FILE_FLAG_NO_BUFFERING has all the quirks of overlapped IO, and is about 50% slower (for a reason I still cannot understand).
I've done some rather extensive benchmarking a year ago. The bottom line is: Memory mapped files are faster, better, less surprising.
Overlapped IO uses more CPU, is much slower when using the buffer cache, asynchronous reverts to synchronous under some well-documented and some undocumented conditions (e.g. encryption, compression, and... pure chance? request size? number of requests?), stalling your application at unpredictable times.
Submitting requests can sometimes take "funny" amounts of time, and CancelIO sometimes doesn't cancel anything but waits for completion. Processes with outstanding requests are unkillable. Managing buffers with outstanding overlapped writes is non-trivial extra work.
File mapping just works. Full stop. And it works nicely. No surprises, no funny stuff. Touching every page has very little overhead, delivers data as fast as the disk is able to deliver it, and takes advantage of the buffer cache. Your concern about a single-core CPU is no problem. If the touch-thread faults, it blocks, and as always when a thread blocks, another thread gets CPU time instead.
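A minimal sketch of the map-and-touch approach (error handling elided; the names are mine):

#include <windows.h>

// Map a whole file read-only; returns the view base address.
void* map_file_for_read(const wchar_t* path, SIZE_T* size_out)
{
    HANDLE file = CreateFileW(path, GENERIC_READ, FILE_SHARE_READ, nullptr,
                              OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
    LARGE_INTEGER size;
    GetFileSizeEx(file, &size);
    HANDLE mapping = CreateFileMappingW(file, nullptr, PAGE_READONLY, 0, 0, nullptr);
    void* view = MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, 0);
    // The handles can be closed now; the view keeps the mapping alive.
    CloseHandle(mapping);
    CloseHandle(file);
    *size_out = (SIZE_T)size.QuadPart;
    return view;
}

// Run this on a worker thread: each read may fault, blocking only that thread.
void touch_pages(volatile const char* p, SIZE_T size)
{
    SYSTEM_INFO si;
    GetSystemInfo(&si);
    for (SIZE_T i = 0; i < size; i += si.dwPageSize)
        (void)p[i];
}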
I'm even using file mapping for writing now, whenever I have more than a few bytes to write. This is somewhat non-trivial (have to manually grow/preallocate files and mappings, and truncate to actual length when closing), but with some helper classes it's entirely doable. Write 500 MiB of data, and it takes "zero time" (you basically do a memcpy, the actual write happens in the background, any time later, even after your program has finished). It's stunning how well this works, even if you know that it's the natural thing for an operating system to do.
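A rough sketch of that write path (error handling elided; for brevity it assumes sizes below 4 GiB and a known final length):

#include <windows.h>
#include <cstring>

void write_via_mapping(const wchar_t* path, const void* data, SIZE_T size)
{
    HANDLE file = CreateFileW(path, GENERIC_READ | GENERIC_WRITE, 0, nullptr,
                              CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, nullptr);
    // CreateFileMapping grows the file to the requested size up front.
    HANDLE mapping = CreateFileMappingW(file, nullptr, PAGE_READWRITE,
                                        0, (DWORD)size, nullptr);
    void* view = MapViewOfFile(mapping, FILE_MAP_WRITE, 0, 0, 0);
    memcpy(view, data, size);  // the actual disk write happens later, in the background
    UnmapViewOfFile(view);
    CloseHandle(mapping);
    // If we had over-allocated, SetFilePointerEx + SetEndOfFile would
    // truncate the file to its real length here.
    CloseHandle(file);
}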
Of course you had better not have a power failure before the OS has written out all pages, but that's true for any kind of writing. What's not on the disk yet is not on the disk -- there's really not much more to say about it than that. If you must be sure, you have to wait for a disk sync to complete, and even then you can't be sure the lights aren't going out while you wait for the sync. That's life.
I don't claim to understand this better than you, as it seems you have done some investigation. And to be totally sure you will need to experiment. But this is my understanding of the issues, in reverse order:
File mapping and overlapped IO in Windows are different implementations, and neither relies on the other under the hood. But both use the asynchronous block device layer. As I imagine it, in the kernel every IO is actually asynchronous, but some user operations wait for it to finish and so create the illusion of synchronicity.
From point 1, if a thread does IO, other threads of the same process will not stall. That is, unless system resources are scarce, or these other threads do IO themselves and face some kind of contention. This holds no matter what kind of IO the first thread does: blocking, non-blocking, overlapped, memory-mapped.
In memory-mapped files, the data is read at least one page at a time, probably more because of read-ahead, but you cannot be sure about that. So the probing thread will have to touch the mapped memory at least once on every page. That will be something like probe/block, probe, probe, probe, probe/block, probe... That might be a bit less efficient than one big overlapped read of several MB. Or maybe the kernel programmers were smart and it is even more efficient. You will have to do a little profiling... Hey, you could even go without the probing thread and see what happens.
Cancelling overlapping operations is a PITA, so my recommendation will be to go with the memory-mapped files. That is way easier to set up and you get extra functionality:
the memory is usable even before it is fully in memory
the memory can/will be shared by several instances of the process
if the memory is in the cache, it will be ready instantaneously instead of just quickly.
if the data is read-only, you can protect the memory from writing, catching bugs.
I understand that delete returns memory to the heap that was allocated on the heap, but what is the point? Computers have plenty of memory, don't they? And all of the memory is returned as soon as you "X" out of the program.
Example:
Consider a server that allocates a Packet object for each packet it receives (this is bad design, for the sake of the example).
A server, by nature, is intended to never shut down. If you never delete the thousands of Packet objects your server handles per second, your system is going to be swamped and crash within minutes; see the sketch below.
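In code, that pattern might look like this (purely illustrative; Packet and the loop are made up for this example):

struct Packet { char payload[1500]; };

void serve_forever()
{
    for (;;) {
        Packet* p = new Packet;   // one allocation per incoming packet
        // ... fill in and handle the packet ...
        // missing: delete p; -- at thousands of packets per second,
        // available memory is exhausted within minutes
    }
}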
Another example:
Consider a video game that allocates particles for special effects every time a new explosion is created (and never deletes them). In a game like StarCraft (or other recent ones), after a few minutes of hilarity and destruction (and hundreds of thousands of particles), lag will be so huge that your game turns into a PowerPoint slideshow, effectively making your player unhappy.
Not all programs exit quickly.
Some applications may run for hours, days or longer. Daemons may be designed to run without cease. Programs can easily consume more memory over their lifetime than available on the machine.
In addition, not all programs run in isolation. Most need to share resources with other applications.
There are a lot of reasons why you should manage your memory usage, as well as any other computer resources you use:
What might start off as a lightweight program could soon become more complex; depending on your design, areas of memory consumption may grow exponentially.
Remember you are sharing memory resources with other programs. Being a good neighbour allows other processes to use the memory you free up, and helps to keep the entire system stable.
You don't know how long your program might run for. Some people hibernate their session (or never shut their computer down) and might keep your program running for years.
There are many other reasons, I suggest researching on memory allocation for more details on the do's and don'ts.
I see your point that computers have lots of memory, but you are wrong. As an engineer you have to create programs that use computer resources properly.
Imagine you made a program that runs the whole time the computer is on. It sometimes creates objects/variables with "new". After some time you don't need them anymore, but you don't delete them. Such a situation occurs again and again, and you gradually take RAM out of circulation. After a while the user has to terminate your program and launch it again. That is not fatal, but it is not comfortable either; what is more, your program may take a while to load. Because of this, the user resents your silly decision.
Another thing: when you use "new" to create an object you call its constructor, and "delete" calls its destructor. Let's say the object opens a file and its destructor closes it, making the file accessible to other processes again. In that case you would steal not only memory but also files from other processes.
If you don't want to use "delete" you can use shared pointers (they delete the object automatically via reference counting).
They can be found in the STL as std::shared_ptr. One disadvantage: Windows XP SP2 and older do not support this, so if you want to create something for the public you can use Boost, which also has boost::shared_ptr. To use Boost you need to download it and configure your development environment to use it.
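A minimal sketch of the difference (illustrative only):

#include <memory>

struct Resource { /* e.g., an open file */ };

void manual()
{
    Resource* r = new Resource;
    // ... use r ...
    delete r;  // forget this and the memory (and file) stays taken
}

void automatic()
{
    auto r = std::make_shared<Resource>();  // reference counted
    // ... use r ...
}   // last shared_ptr goes out of scope here; Resource is destroyed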
I am working on an analysis tool that reads output from a process and continuously converts this to an internal format. After the "logging phase" is complete, analysis is done on the data. The data is all held in memory.
However, due to the fact that all logged information is held in memory, there is a limit on the duration of the logging. For most use cases this is ok, but it should be possible to run for longer, even if this will hurt performance.
Ideally, the program should be able to start using hard drive space in addition to RAM once the RAM usage reaches a certain limit.
This leads to my question:
Are there any existing solutions for doing this? It has to work on both Unix and Windows.
To use the disk after memory is full, we use caching technologies such as EhCache. They can be configured with the amount of memory to use, and to overflow to disk.
But they also have smarter algorithms you can configure as needed, such as sending to disk data that has not been used in the last 10 minutes. This could be a plus for you.
Without knowing more about your application it is not possible to provide a perfect answer. However it does sound a bit like you are re-inventing the wheel. Have you considered using an in-process database library like sqlite?
If you use that or something similar, it will take care of moving the data between disk and memory, and give you powerful SQL query capabilities at the same time. Even if your logging data is in a custom format, if each item has a key or index of some kind, a small light database may be a good fit.
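For illustration, a minimal sketch using the sqlite3 C API (the table layout and names are mine, not from the question):

#include <sqlite3.h>

// Append one log record to an on-disk database instead of keeping it in RAM.
void log_record(sqlite3* db, long long ts, const char* payload)
{
    sqlite3_stmt* stmt = nullptr;
    sqlite3_prepare_v2(db,
        "INSERT INTO log(ts, payload) VALUES(?, ?)", -1, &stmt, nullptr);
    sqlite3_bind_int64(stmt, 1, ts);
    sqlite3_bind_text(stmt, 2, payload, -1, SQLITE_TRANSIENT);
    sqlite3_step(stmt);
    sqlite3_finalize(stmt);
}

int main()
{
    sqlite3* db = nullptr;
    sqlite3_open("log.db", &db);  // one file; works on Unix and Windows
    sqlite3_exec(db,
        "CREATE TABLE IF NOT EXISTS log(ts INTEGER, payload TEXT)",
        nullptr, nullptr, nullptr);
    log_record(db, 42, "example event");
    sqlite3_close(db);
}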
This might seem too obvious, but what about memory-mapped files? This does what you want and even allows a 32-bit application to use much more than 4 GB of memory. The principle is simple: you allocate the memory you need (on disk) and then map just a portion of it into system memory. You could, for example, map something like 75% of the available physical memory size. Then work on it, and when you need another portion of the data, just re-map. The downside is that you have to do the mapping manually, but that's not necessarily bad. The good thing is that you can use more data than fits into physical memory and into the per-process memory limit. It works really great if you actually use only part of the data at any given time.
There may be libraries that do this automatically, like the one KLE suggested (though I do not know that one). Doing it manually means you'll learn a lot about it and have more control, though I'd prefer a library if it does exactly what you want with regard to how and when the disk is being used.
This works similarly on both Windows and Unix. For Windows, here is an article by Raymond Chen that shows a simple example.