I am making an application that reads a target process's memory. The memory that I read needs to be as up-to-date as possible, and the total memory read time preferably <1ms (right now I'm hovering around 0.8ms).
Due to the nature of my application, I have lots of ReadProcessMemory() calls that can't be reduced down through caching or similar methods. I'm wondering, is it possible to read process memory faster? I'm willing to put my application inside a kernel driver if the performance boost is big enough.
What are my choices?
Calling NtReadVirtualMemory will give you a 1% increase in speed, not very exciting.
You can increase efficiency by reading large blocks of memory, and accesses each individual variable you need. If you're reading offset 0x8 and 0x200 in 2 different calls, then you should just read the entire structure instead and then locally parse it.
NtWriteVirtualMemory can give you an almost 100% increase in speed over regular WriteProcessMemory in certain situations.
Other than that, injecting and having direct access to the memory is the best route.
Related
In a Windows application I have a class which wraps up a filename and a buffer. You construct it with a filename and you can query the object to see if the buffer is filled yet, returning nullptr if not and the buffer addres if so. When the object falls out of scope, the buffer is released:
class file_buffer
{
public:
file_buffer(const std::string& file_name);
~file_buffer();
void* buffer();
private:
...
}
I want to put the data into memory asynchronously, and as far as I see it I have two choices: either create a buffer and use overlapped IO through ReadFileEx, or use MapViewOfFile and touch the address on another thread.
At the moment I'm using ReadFileEx which presents some problems, as requests greater than about 16MB are prone to failure: I can try splitting up the request but then I get synchronisation issues, and if the object falls out of scope before the IO is complete I have buffer-cleanup issues. Also, if multiple instances of the class are created in quick succession things get very fiddly.
Mapping and touching the data on another thread would seem to be considerably easier since I won't have the upper limit issues: also if the client absolutely has to have the data right now, they can simply dereference the address, let the OS worry about page faults and take the blocking hit.
This application needs to support single core machines, so my question is: will page faults on another software thread be any more expensive than overlapped IO on the current thread? Will they stall the process? Does overlapped IO stall the process in the same way or is there some OS magic I don't understand? Are page faults carried out using overlapped IO anyway?
I've had a good read of these topics:
http://msdn.microsoft.com/en-us/library/aa365199(v=vs.85).aspx (IO Concepts in File Management)
http://msdn.microsoft.com/en-us/library/windows/desktop/aa366556(v=vs.85).aspx (File mapping)
but I can't seem to infer how to make a performance tradeoff.
You will definitively want to go with memory-mapped files. Overlapped IO (with FILE_FLAG_NO_BUFFERING) has been advocated as "the fastest way to get data into RAM" by some people for years, but this is only true in very contrieved cases with very specific conditions. In the normal, average case, turning off the buffer cache is a serious anti-optimization.
Now, overlapped IO without FILE_FLAG_NO_BUFFERINGhas all the quirks of overlapped IO, and is about 50% slower (for a reason I still cannot understand).
I've done some rather extensive benchmarking a year ago. The bottom line is: Memory mapped files are faster, better, less surprising.
Overlapped IO uses more CPU, is much slower when using the buffer cache, asynchronous reverts to synchronous under some well-documented and some undocumented conditions (e.g. encryption, compression, and... pure chance? request size? number of requests?), stalling your application at unpredictable times.
Submitting requests can sometimes take "funny" amounts of time, and CancelIO sometimes doesn't cancel anything but waits for completion. Processes with outstanding requests are unkillable. Managing buffers with outstanding overlapped writes is non-trivial extra work.
File mapping just works. Fullstop. And it works nicely. No surprises, no funny stuff. Touching every page has very little overhead and delivers as fast as the disk is able to deliver, and it takes advantage of the buffer cache. Your concern about a single-core CPU is no problem. If the touch-thread faults, it blocks, and as always when a thread blocks, another thread gets CPU time instead.
I'm even using file mapping for writing now, whenever I have more than a few bytes to write. This is somewhat non-trivial (have to manually grow/preallocate files and mappings, and truncate to actual length when closing), but with some helper classes it's entirely doable. Write 500 MiB of data, and it takes "zero time" (you basically do a memcpy, the actual write happens in the background, any time later, even after your program has finished). It's stunning how well this works, even if you know that it's the natural thing for an operating system to do.
Of course you had better not have a power failure before the OS has written out all pages, but that's true for any kind of writing. What's not on the disk yet is not on the disk -- there's really not much more to say to it than that. If you must be sure about that, you have to wait for a disk sync to complete, and even then you can't be sure the lights aren't going out while you wait for the sync. That's life.
I don't claim to understand this better than you, as it seem you made some inventigation. And to be totally sure you will need to experiment. But this is my understanding of the issues, in reverse order:
File mapping and overlapped IO in Windows are different implentations and none of them rely on the other under the hood. But both use the asynchronous block device layer. As I imagine it, in the kernel every IO is actually asynchronous, but some user operations wait for it to finish and so they create the illusion of synchronicity.
From point 1, if a thread does IO, other threads from the same process will not stall. That, unless the system resources are scarce or these other threads do IO themselves and face some kind of contention. This will be true no matter the kind of IO the first thread does: blocking, non-blocking, overlapped, memory-mapped.
In memory-mapped files, the data is read at least one page at a time, probably more because of the read-ahead, but you cannot be sure about that. So the probing thread will have to touch the mapped memory at least one on every page. That will be something like probe/block-probe-probe-probe-probe/block-probe... That might be a bit less efficient than a big overlapped read of several MB. Or maybe the kernel programmers were smart and it is even more efficient. You will have to make a little profiling... Hey, you could even go without the probing thread and see what happens.
Cancelling overlapping operations is a PITA, so my recommendation will be to go with the memory-mapped files. That is way easier to set up and you get extra functionality:
the memory is usable even before it is fully in memory
the memory can/will be shared by several instances of the process
if the memory is in the cache, it will be ready instantaneously instead of just quickly.
if the data is read-only, you can protect the memory from writing, catching bugs.
Paging creates illusion that each process has infinite RAM by moving pages to and from disk. So if we have infinite memory(in some hypothetical situation), do we still need Paging? If yes, then why? I faced this question in an interview.
Assuming that "infinite memory" means infinite randomly accessable memory, or RAM, we will still need paging. Although paging is often associated with the ability to swap pages in and out of RAM to a hard disk to conserve memory, this is merely one aspect of paging. Here are some other reasons to have paging:
Security. Paging is a method to enforce operating system security and memory protection by ensuring that a processes cannot access the memory of another process and that it cannot modify the resident kernel.
Multitasking. Paging aids in multitasking by virtualizing the memory space, that is, address 0xFOO in Process A can be something completely different than 0xFOO in Process B
Memory Allocation. Paging aids in memory allocation by reducing fragmentation and ensuring RAM is only allocated when accessed. What this means is that although a process needs, suppose, 100MB of continuous RAM space, this need not be continuous physically. Additionally, when a program requests 100MB of space, the operating system will tell the program it's safe to use that 100MB of space, yet it will not be actually allocated until the program uses that space to its fullest.
Admittedly, the latter would not be entirely necessary if one had infinite RAM, nonetheless; it is always good practice to be eifficient even when we are not resource constrained. It also demonstrates a use for paging that is sometimes not considered.
This is a philosophical question, so here's a philosophical answer :)
The trick in this question is that you make assumptions about the infinite memory. It's ok to say "no, no need to use paging, BUT". And follow by:
The infinite memory has to be accessible within the acceptable time limit for memory access. If it's not (because infinity takes a lot of space, and the memory sits farther away from the processing unit), there's no difference between it and the disk, both are not satisfying the readily available memory requirement, which is what caching via pages tries to solve.
Take for example Amazon's S3, which for all practical purposes is infinite. If you can rely on S3 to satisfy all your memory requirements in the sense that when you need something fetched within time x you can fetch it from S3, there's no need to page anything or even hold it in "local" memory. Just get it from S3 whenever you need it, as many times as you need it. (Obviously this would have other repercussions like cost and network, but let's ignore that for now).
Of course you can always say that optimally you want memory access to be as fast as possible, and "fast enough" is probably slower than "fastest", so local memory access would give you better performance etc.
And last, if I had to envision a memory which is infinite and has the same access time no matter how "far" the memory unit is from the fetching unit, I would have to envision a sphere where the processing unit is in the middle, so that you can't argue that one memory unit is slower than the other because of the distance. Otherwise you could say that paging would be done internally within the memory so that access is faster to the memory units that are most used (or whatever algorithm you choose to use).
Our professor asked us to think of an embedded system design where caches cannot be used to their full advantage. I have been trying to find such a design but could not find one yet. If you know such a design, can you give a few tips?
Caches exploit the fact data (and code) exhibit locality.
So an embedded system wich does not exhibit locality, will not benefit from a cache.
Example:
An embedded system has 1MB of memory and 1kB of cache.
If this embedded system is accessing memory with short jumps it will stay long in the same 1kB area of memory, which could be successfully cached.
If this embedded system is jumping in different distant places inside this 1MB and does that frequently, then there is no locality and cache will be used badly.
Also note that depending on architecture you can have different caches for data and code, or a single one.
More specific example:
If your embedded system spends most of its time accessing the same data and (e.g.) running in a tight loop that will fit in cache, then you're using cache to a full advantage.
If your system is something like a database that will be fetching random data from any memory range, then cache can not be used to it's full advantage. (Because the application is not exhibiting locality of data/code.)
Another, but weird example
Sometimes if you are building safety-critical or mission-critical system, you will want your system to be highly predictable. Caches makes your code execution being very unpredictable, because you can't predict if a certain memory is cached or not, thus you don't know how long it will take to access this memory. Thus if you disable cache it allows you to judge you program's performance more precisely and calculate worst-case execution time. That is why it is common to disable cache in such systems.
I do not know what you background is but I suggest to read about what the "volatile" keyword does in the c language.
Think about how a cache works. For example if you want to defeat a cache, depending on the cache, you might try having your often accessed data at 0x10000000, 0x20000000, 0x30000000, 0x40000000, etc. It takes very little data at each location to cause cache thrashing and a significant performance loss.
Another one is that caches generally pull in a "cache line" A single instruction fetch may cause 8 or 16 or more bytes or words to be read. Any situation where on average you use a small percentage of the cache line before it is evicted to bring in another cache line, will make your performance with the cache on go down.
In general you have to first understand your cache, then come up with ways to defeat the performance gain, then think about any real world situations that would cause that. Not all caches are created equal so there is no one good or bad habit or attack that will work for all caches. Same goes for the same cache with different memories behind it or a different processor or memory interface or memory cycles in front of it. You also need to think of the system as a whole.
EDIT:
Perhaps I answered the wrong question. not...full advantage. that is a much simpler question. In what situations does the embedded application have to touch memory beyond the cache (after the initial fill)? Going to main memory wipes out the word full in "full advantage". IMO.
Caching does not offer an advantage, and is actually a hindrance, in controlling memory-mapped peripherals. Things like coprocessors, motor controllers, and UARTs often appear as just another memory location in the processor's address space. Instead of simply storing a value, those locations can cause something to happen in the real world when written to or read from.
Cache causes problems for these devices because when software writes to them, the peripheral doesn't immediately see the write. If the cache line never gets flushed, the peripheral may never actually receive a command even after the CPU has sent hundreds of them. If writing 0xf0 to 0x5432 was supposed to cause the #3 spark plug to fire, or the right aileron to tilt down 2 degrees, then the cache will delay or stop that signal and cause the system to fail.
Similarly, the cache can prevent the CPU from getting fresh data from sensors. The CPU reads repeatedly from the address, and cache keeps sending back the value that was there the first time. On the other side of the cache, the sensor waits patiently for a query that will never come, while the software on the CPU frantically adjusts controls that do nothing to correct gauge readings that never change.
In addition to almost complete answer by Halst, I would like to mention one additional case where caches may be far from being an advantage. If you have multiple-core SoC where all cores, of course, have own cache(s) and depending on how program code utilizes these cores - caches can be very ineffective. This may happen if ,for example, due to incorrect design or program specific (e.g. multi-core communication) some data block in RAM is concurrently used by 2 or more cores.
I am currently looking into malloc() implementation under Windows. But in my research I have stumbled upon things that puzzled me:
First, I know that at the API level, windows uses mostly the HeapAlloc() and VirtualAlloc() calls to allocate memory. I gather from here that the Microsoft implementation of malloc() (that which is included in the CRT - the C runtime) basically calls HeapAlloc() for blocks > 480 bytes and otherwise manage a special area allocated with VirtualAlloc() for small allocations, in order to prevent fragmentation.
Well that is all good and well. But then there are other implementation of malloc(), for instance nedmalloc, which claim to be up to 125% faster than Microsoft's malloc.
All this makes me wonder a few things:
Why can't we just call HeapAlloc() for small blocks? Does is perform poorly in regard to fragmentation (for example by doing "first-fit" instead of "best-fit")?
Actually, is there any way to know what is going under the hood of the various API allocation calls? That would be quite helpful.
What makes nedmalloc so much faster than Microsoft's malloc?
From the above, I got the impression that HeapAlloc()/VirtualAlloc() are so slow that it is much faster for malloc() to call them only once in a while and then to manage the allocated memory itself. Is that assumption true? Or is the malloc() "wrapper" just needed because of fragmentation? One would think that system calls like this would be quick - or at least that some thoughts would have been put into them to make them efficient.
If it is true, why is it so?
On average, how many (an order of magnitude) memory reads/write are performed by a typical malloc call (probably a function of the number of already allocated segments)? I would intuitively says it's in the tens for an average program, am I right?
Calling HeapAlloc doesn't sound cross-platform. MS is free to change their implementation when they wish; advise to stay away. :)
It is probably using memory pools more effectively, much like the Loki library does with its "small object allocator"
Heap allocations, which are general purpose by nature, are always slow via any implementation. The more "specialized" the allocator, the faster it will be. This returns us to point #2, which deals with memory pools (and the allocation sizes used that are specific to your application).
Don't know.
From the above, I got the impression that HeapAlloc()/VirtualAlloc() are so slow that it is much faster for malloc() to call them only once in a while and then to manage the allocated memory itself. Is that assumption true?
The OS-level system calls are designed and optimized for managing the entire memory space of processes. Using them to allocate 4 bytes for an integer is indeed suboptimal - you get overall better performance and memory usage by managing tiny allocations in library code, and letting the OS optimize for larger allocations. At least as far as I understand it.
Problem Description
I need to stream large files from disk. Assume the files are larger than will fit in memory. Furthermore, suppose that I'm doing some calculation on the data and the result is small enough to fit in memory. As a hypothetical example, suppose I need to calculate an md5sum of a 200GB file and I need to do so with guarantees about how much ram will be used.
In summary:
Needs to be constant space
Fast as possible
Assume very large files
Result fits in memory
Question
What are the fastest ways to read/stream data from a file using constant space?
Ideas I've had
If the file was small enough to fit in memory, then mmap on POSIX systems would be very fast, unfortunately that's not the case here. Is there any performance advantage to using mmap with a small buffer size to buffer successive chunks of the file? Would the system call overhead of moving the mmap buffer down the file dominate any advantages Or should I use a fixed buffer that I read into with fread?
I wouldn't be so sure that mmap would be very fast (where very fast is defined as significantly faster than fread).
Grep used to use mmap, but switched back to fread. One of the reasons was stability (strange things happen with mmap if the file shrinks whilst it is mapped or an IO error occurs). This page discusses some of the history about that.
You can compare the performance on your system with the option --mmap to grep. On my system the difference in performance on a 200GB file is negligible, but your mileage might vary!
In short, I'd use fread with a fixed size buffer. It's simpler to code, easier to handle errors and will almost certainly be fast enough.
Depending on the language you are using, a C-like fread() loop based on a file for which you declared a particular buffer size will require exactly this buffer size, no more no less.
We typically choose a buffer size of 4 to 128 kBytes, there is little gain if any with bigger buffers.
If performance was extremely important, for relatively little gain (and at the risk of re-inventing something), one could consider using a two-thread implementation, whereby one thread reads the file in a set of two buffers, and the other thread perform calculations sequential fashion in one of the buffers at a time. In this fashion the disk access delays can be removed.
mjv is right. You can use double-buffers and overlapped I/O. That way your crunching and the disk reading can be happening at the same time. Then I would profile or stack-shot the crunching to make it as fast as possible. With luck it will be faster than the I/O, so you will end up running the I/O at top speed without pause. Then things like file fragmentation come into the picture.