Shared memory without memory-mapped files - Windows

Is it possible to share memory between two MFC C++ applications without using Memory mapped files? Currently we are using this method to share structs, and it is too slow for our requirements. Is there a better way?

Are you sure it is the memory-mapped files that are slow? The OS maps the same piece of RAM into both process address spaces (when it's paged in). Other culprits for the performance issues could be mutexes and other synchronization primitives, volatile reads, and the cache invalidation needed to propagate concurrent changes to memory between processes.
You might try making changes locally to a non-shared region, and then bulk copying that, rather than repeatedly writing to the shared memory.
Other alternatives are message passing, RPC or DCOM, but I doubt these will be more performant, especially if the amount of data being transferred/referenced is large.
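For reference, a minimal sketch of named, page-file-backed shared memory on Windows (the struct layout and mapping name are illustrative, not taken from the question); once the mapping is established, access is plain memory access:

    #include <windows.h>

    // Hypothetical struct shared between the two applications.
    struct SharedData {
        LONG counter;
        char message[256];
    };

    int main()
    {
        // INVALID_HANDLE_VALUE means "back the mapping with the paging file",
        // i.e. pure shared RAM with no file on disk involved.
        HANDLE hMap = CreateFileMappingW(INVALID_HANDLE_VALUE, NULL,
                                         PAGE_READWRITE, 0,
                                         (DWORD)sizeof(SharedData),
                                         L"Local\\MySharedStruct");
        if (!hMap) return 1;

        SharedData* data = (SharedData*)MapViewOfFile(
            hMap, FILE_MAP_ALL_ACCESS, 0, 0, sizeof(SharedData));
        if (!data) { CloseHandle(hMap); return 1; }

        // The other process opens the same name; both see the same physical
        // pages, so a write here is immediately visible there.
        lstrcpynA(data->message, "hello from process A", sizeof(data->message));
        InterlockedIncrement(&data->counter);

        UnmapViewOfFile(data);
        CloseHandle(hMap);
        return 0;
    }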

I would have thought that once you'd established the memory mapping (with the MapViewOfFile), that would be pretty fast.
Is your performance problem with actually setting up the mapped memory, rather than using it once it's set up?
If you do genuinely have a verified problem with the memory mapped files, this is another technique: http://msdn.microsoft.com/en-us/library/h90dkhs0%28VS.80%29.aspx (DLL shared memory segments), but I doubt it's really going to help you.
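For completeness, a rough sketch of the DLL shared-data-segment technique that page describes (variable and section names are illustrative); only statically initialized data can live there, and pointers are not meaningful across processes:

    // Variables placed in a named section marked Read/Write/Shared are the
    // same physical memory in every process that loads this DLL.
    #pragma data_seg(".shared")
    volatile LONG g_sharedCounter = 0;   // must be initialized to land in .shared
    char g_sharedBuffer[256] = {0};
    #pragma data_seg()

    // Tell the linker to make the section shared across processes.
    #pragma comment(linker, "/SECTION:.shared,RWS")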

Related

Does GetWriteWatch work with Memory-Mapped Files?

I'm working with memory-mapped files (MMFs) on very large datasets (depending on the input file), where each file is ~50GB and around 40 files are open at the same time. Of course this varies; I can also have smaller or larger files, so the system should scale itself.
The MMF acts as a backing buffer, so as long as I have enough free memory no paging should occur. The problem is that the Windows memory manager and my application are two autonomous processes. In good conditions everything works fine, but when I enter low-memory conditions the memory manager is obviously too slow: memory fills up, the system starts to page (which is good), but I keep allocating, because I get no information about the paging.
In the end I reach a state where the system stalls: the memory manager pages and I keep allocating.
So I came to the point where I need to advise the memory manager, check current memory conditions, and invoke the paging myself. For that reason I wanted to use GetWriteWatch to inspect the memory region I can flush.
Interestingly, GetWriteWatch does not work in my situation; it returns -1 without filling the structures. So my question is: does GetWriteWatch work with MMFs?
Does GetWriteWatch work with Memory-Mapped Files?
I don't think so.
GetWriteWatch accepts memory allocated via the VirtualAlloc function with the MEM_WRITE_WATCH flag.
File mappings are created with the MapViewOfFile* functions, which do not take this flag.
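A minimal sketch of where GetWriteWatch does work, assuming an ordinary VirtualAlloc'd region (the region size is arbitrary for illustration):

    #include <windows.h>
    #include <stdio.h>

    int main()
    {
        const SIZE_T size = 1 << 20;   // 1 MiB region; size is arbitrary here
        // Write-watching must be requested at allocation time.
        char* base = (char*)VirtualAlloc(NULL, size,
                                         MEM_RESERVE | MEM_COMMIT | MEM_WRITE_WATCH,
                                         PAGE_READWRITE);
        if (!base) return 1;

        base[0] = 1;        // dirty the first page
        base[100000] = 1;   // and one further in

        void* dirty[256];            // enough entries for 1 MiB of 4 KiB pages
        ULONG_PTR count = 256;
        DWORD granularity = 0;
        // Returns 0 on success; on a view from MapViewOfFile it fails instead.
        UINT rc = GetWriteWatch(0, base, size, dirty, &count, &granularity);
        if (rc == 0)
            printf("%Iu pages written since allocation\n", count);

        VirtualFree(base, 0, MEM_RELEASE);
        return 0;
    }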

Does accessing shared memory simultaneously cause a performance hit?

I have a simple multi-threaded app for my multi-core system. This app has a parallel region in which no threads write to a given memory address, but some may read simultaneously.
Will there still be some type of overhead or performance hit associated with several threads accessing the same memory, even though no locking is used? If so, why? How big an impact can it have and what can be done about it?
This can depend on the specific cache synchronization protocol in use, but most modern CPUs support having the same cache line shared in multiple processor caches, provided there is no write activity to the cache line. That said, make sure you align your allocations to the cache line size; if you don't, it's possible that data that's being written to could share the same cache line as your read-only data, resulting in a performance hit when the dirtied cache line is flushed on other processors (false sharing).
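A small sketch of the alignment point, assuming a 64-byte cache line (a common size, but worth verifying on your hardware):

    #include <atomic>
    #include <thread>
    #include <vector>

    // A frequently written counter and a read-only table, each forced onto its
    // own 64-byte cache line so writes to one never invalidate the other.
    struct alignas(64) PaddedCounter { std::atomic<long> value{0}; };
    struct alignas(64) ReadOnlyTable { int data[16] = {1, 2, 3, 4}; };

    int main()
    {
        PaddedCounter counter;
        ReadOnlyTable table;

        std::thread writer([&] {
            for (int i = 0; i < 1000000; ++i) counter.value++;
        });

        std::vector<std::thread> readers;
        for (int t = 0; t < 4; ++t)
            readers.emplace_back([&] {
                long local = 0;   // readers only read shared data; no locks needed
                for (int i = 0; i < 1000000; ++i) local += table.data[i % 16];
                (void)local;
            });

        writer.join();
        for (auto& r : readers) r.join();
        return 0;
    }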
I would say there wouldn't be. However, the problem arises when you have multiple writers to the same references.

What pitfalls should I be wary of when memory mapping BIG files?

I have a bunch of big files, each file can be over 100GB, the total amount of data can be 1TB and they are all read-only files (just have random reads).
My program does small reads in these files on a computer with about 8GB main memory.
In order to increase performance (no seek() and no buffer copying) I thought about using memory mapping, and basically memory-mapping the whole 1TB of data.
Although it sounds crazy at first, as main memory << disk, with an insight on how virtual memory works you should see that on 64bit machines there should not be problems.
All the pages read from disk to satisfy my read()s will be considered "clean" by the OS, as these pages are never overwritten. This means that all these pages can go straight onto the list of pages the OS can reuse, without being written back to disk or swapped out. This means that the operating system could actually keep just the LRU pages in physical memory and only issue read()s when a page is not in main memory.
This would mean no swapping and no increase in i/o because of the huge memory mapping.
This is theory; what I'm looking for is anyone who has ever tried or used such an approach for real in production and can share their experience: are there any practical issues with this strategy?
What you are describing is correct. With a 64-bit OS you can map 1TB of address space to a file and let the OS manage reading and writing to the file.
You didn't mention what CPU architecture you are on, but on most of them (including amd64) the CPU maintains a bit in each page table entry indicating whether the data in the page has been written to. The OS can indeed use that flag to avoid writing pages that haven't been modified back to disk.
There would be no increase in IO just because the mapping is large. The amount of data you actually access would determine that. Most OSes, including Linux and Windows, have a unified page cache model in which cached blocks use the same physical pages of memory as memory mapped pages. I wouldn't expect the OS to use more memory with memory mapping than with cached IO. You're just getting direct access to the cached pages.
One concern you may have is with flushing modified data to disk. I'm not sure what the policy is on your OS specifically, but the time between modifying a page and when the OS will actually write that data to disk may be a lot longer than you're expecting. Use a flush API to force the data to be written to disk if it's important to have it written by a certain time.
I haven't used file mappings quite that large in the past but I would expect it to work well and at the very least be worth trying.
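For illustration, a minimal sketch of the Windows version of this approach (the path is a placeholder); in a 64-bit process the whole file can be mapped in a single view:

    #include <windows.h>
    #include <stdio.h>

    int main()
    {
        HANDLE hFile = CreateFileW(L"D:\\data\\huge.bin", GENERIC_READ,
                                   FILE_SHARE_READ, NULL, OPEN_EXISTING,
                                   FILE_FLAG_RANDOM_ACCESS, NULL);
        if (hFile == INVALID_HANDLE_VALUE) return 1;

        // Size 0/0 means "map the whole file"; PAGE_READONLY keeps every page
        // clean, so the OS can simply discard pages under memory pressure
        // instead of writing them to the pagefile.
        HANDLE hMap = CreateFileMappingW(hFile, NULL, PAGE_READONLY, 0, 0, NULL);
        if (!hMap) { CloseHandle(hFile); return 1; }

        const unsigned char* data = (const unsigned char*)MapViewOfFile(
            hMap, FILE_MAP_READ, 0, 0, 0);   // a view over the entire file
        if (!data) { CloseHandle(hMap); CloseHandle(hFile); return 1; }

        printf("first byte: %u\n", data[0]);   // touching a page triggers a read fault

        UnmapViewOfFile(data);
        CloseHandle(hMap);
        CloseHandle(hFile);
        return 0;
    }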

Memory-mapped files cause low physical memory

I have 2GB of RAM and am running a memory-intensive application; the system gets into a low-available-physical-memory state and stops responding to user actions, like opening any application or invoking a menu.
How do I trigger or tell the system to swap the memory to pagefile and free physical memory?
I'm using Windows XP.
If I run the same application on a 4GB RAM machine this is not the case; system response is good. Once available physical memory is exhausted, the system automatically swaps to the page file and frees physical memory, and it is not nearly as bad as on the 2GB system.
To overcome this problem (on the 2GB machine) I attempted to use memory-mapped files for the large datasets allocated by the application. In this case the virtual memory of the application (process) is fine, but the system cache is high and the same problem as above appears: physical memory is low.
Even though the memory-mapped file is not mapped into the process's virtual memory, the system cache is high. Why?
Any help is appreciated.
Thanks.
If your data access pattern for using the memory mapped file is sequential, you might get slightly better page recycling by specifying the FILE_FLAG_SEQUENTIAL_SCAN flag when opening the underlying file. If your data pattern accesses the mapped file in random order, this won't help.
You should consider decreasing the size of your map view. That's where all the memory is actually consumed and cached. Since it appears that you need to handle files that are larger than available contiguous free physical memory, you can probably do a better job of memory management than the virtual memory page swapper since you know more about how you're using the memory than the virtual memory manager does. If at all possible, try to adjust your design so that you can operate on portions of the large file using a smaller view.
Even if you can't get rid of the need for full random access across the entire range of the underlying file, it might still be beneficial to tear down and recreate the view as needed to move the view to the section of the file that the next operation needs to access. If your data access patterns tend to cluster around parts of the file before moving on, then you won't need to move the view as often.

You'll take a hit to tear down and recreate the view object, but since tearing down the view also releases all the cached pages associated with the view, it seems likely you'd see a net gain in performance because the smaller view significantly reduces memory pressure and page swapping system wide. Try setting the size of the view based on a portion of the installed system RAM and move the view around as needed by your file processing. The larger the view, the less you'll need to move it around, but the more RAM it will consume, potentially impacting system responsiveness.
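A rough sketch of that sliding-view idea, assuming a read-only mapping handle and a 64 MiB window (both arbitrary choices):

    #include <windows.h>

    // Hypothetical per-window work; here it just touches the data.
    static unsigned long long Checksum(const BYTE* p, SIZE_T n)
    {
        unsigned long long sum = 0;
        for (SIZE_T i = 0; i < n; ++i) sum += p[i];
        return sum;
    }

    void ProcessFileInWindows(HANDLE hMap, ULONGLONG fileSize)
    {
        // 64 MiB view, an assumption; it is a multiple of the 64 KiB allocation
        // granularity, so each offset below is a valid mapping offset.
        const ULONGLONG window = 64ull * 1024 * 1024;

        for (ULONGLONG offset = 0; offset < fileSize; offset += window) {
            SIZE_T bytes = (SIZE_T)((fileSize - offset < window)
                                        ? (fileSize - offset) : window);

            const BYTE* view = (const BYTE*)MapViewOfFile(
                hMap, FILE_MAP_READ,
                (DWORD)(offset >> 32), (DWORD)(offset & 0xFFFFFFFF), bytes);
            if (!view) break;

            Checksum(view, bytes);

            // Unmapping releases the cached pages for this window, capping the
            // physical memory the mapping can pin at any one time.
            UnmapViewOfFile(view);
        }
    }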
As I think you are hinting in your post, the slow response time is probably at least partially due to delays in the system while the OS writes the contents of memory to the pagefile to make room for other processes in physical memory.
The obvious solution (and possibly not practical) is to use less memory in your application. I'll assume that is not an option or at least not a simple option. The alternative is to try to proactively flush data to disk to continually keep available physical memory for other applications to run. You can find the total memory on the machine with GlobalMemoryStatusEx. And GetProcessMemoryInfo will return current information about your own application's memory usage. Since you say you are using a memory mapped file, you may need to account for that in addition. For example, I believe the PageFileUsage information returned from that API will not include information about your own memory mapped file.
If your application is monitoring the usage, you may be able to use FlushViewOfFile to proactively force data to disk from memory. There is also an API (EmptyWorkingSet) that I think attempts to write as many dirty pages to disk as possible, but that seems like it would very likely hurt performance of your own application significantly. Although, it could be useful in a situation where you know your application is going into some kind of idle state.
And, finally, one other API that might be useful is SetProcessWorkingSetSizeEx. You might consider using this API to give a hint on an upper limit for your application's working set size. This might help preserve more memory for other applications.
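A small sketch tying those APIs together; the 85% threshold and the decision to empty the working set are illustrative assumptions, not recommendations:

    #include <windows.h>
    #include <psapi.h>   // EmptyWorkingSet (link with Psapi.lib)

    void MaybeRelievePressure(void* mappedView, SIZE_T viewSize)
    {
        MEMORYSTATUSEX ms = { sizeof(ms) };
        if (!GlobalMemoryStatusEx(&ms)) return;

        if (ms.dwMemoryLoad > 85) {                 // 85% load is an arbitrary cutoff
            FlushViewOfFile(mappedView, viewSize);  // write dirty mapped pages back to the file
            EmptyWorkingSet(GetCurrentProcess());   // aggressive: evicts as much as possible
        }
    }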
Edit: This is another obvious statement, but I forgot to mention it earlier. It also may not be practical for you, but it sounds like one of the best things you might do considering that you are running into 32-bit limitations is to build your application as 64-bit and run it on a 64-bit OS (and throw a little bit more memory at the machine).
Well, it sounds like your program needs more than 2GB of working set.
Modern operating systems are designed to use most of the RAM for something at all times, only keeping a fairly small amount free so that it can be immediately handed out to processes that need more. The rest is used to hold memory pages and cached disk blocks that have been used recently; whatever hasn't been used recently is flushed back to disk to replenish the pool of free pages. In short, there isn't supposed to be much free physical memory.
The principal difference between using a normal memory allocation and a memory-mapped file is where the data gets stored when it must be paged out of memory. It doesn't necessarily have any effect on when the memory will be paged out, and will have little effect on the time it takes to page it out.
The real problem you are seeing is probably not that you have too little free physical memory, but that the paging rate is too high.
My suggestion would be to attempt to reduce the amount of storage needed by your program, and see if you can increase the locality of reference to reduce the amount of paging needed.

Do static classes cause performance issues on multi-core systems?

The other day a colleague of mine stated that using static classes can cause performance issues on multi-core systems, because the static instance cannot be shared between the processor caches. Is that right? Are there any benchmarks proving this statement? The statement was made in the context of .NET development (with C#), but it sounds to me like a language- and environment-independent problem.
Thanks for your comments.
I would push your colleague for data or at least references.
The thing is, if you've got shared data, you've got shared data. Whether that's exposed through static classes, a singleton, whatever, isn't terribly important. If you don't need the shared data in the first place, I expect you wouldn't have a static class anyway.
Besides all of this, in any given application there's likely to be a much bigger bottleneck than processor caches for shared data in static classes.
As ever, write the most sensible, readable, maintainable code first - then work out if you have a performance bottleneck and act accordingly.
"[a] static instance cannot be shared between the processor caches. Is that right?"
That statement doesn't make much sense to me. The point of each processor's dedicated cache is that it contains a private copy of a small patch of memory, so that if the processor is running an algorithm that only needs to access that particular memory region, it doesn't have to keep going back to external memory. If we're talking about the static fields inside a static class, the memory for those fields may all fit into a contiguous chunk of memory that will in turn fit into a single processor's (or core's) dedicated cache. But each core has its own cached copy - it's not "shared". That's the point of caches.
If an algorithm's working set is bigger than a cache then it will defeat that cache. Meaning that as the algorithm runs, it repeatedly causes the processor to pull data from external memory, because all the necessary pieces won't fit in the cache at once. But this is a general problem that doesn't apply specifically to static classes.
I wonder if your colleague was actually talking not about performance but about the need to apply correct locking if multiple threads are reading/writing the same data?
If multiple threads are writing to that data, you'll have cache thrashing (the write on one CPU's cache invalidates the caches of the other CPUs). Your friend is technically correct, but there's a good chance it's not your primary bottleneck, so it doesn't matter.
If multiple threads are reading the data, your friend is flat-out wrong.
If you don't use any kind of locks or synchronization then static-vs.-non-static won't have any influence on your performance.
If you're using synchronization then you could run into a problem if all threads need to acquire the same lock, but that's only a side-effect of the static-ness and not a direct result of the methods being static.
In any "virtual machine" controlled language (.NET, Java, etc) this control is likely delegated to the underlying OS and likely further down to the BIOS and other scheduling controls. That being said, in the two biggies, .NET and Java, static vs. non-static is a memory issue, not a CPU issue.
Re-iterating saua's point, the impact on the CPU comes from the synchronization and thread control, not the access to the static information.
The problem with CPU cache management is not limited to only static methods. Only one CPU can update any memory address at a time. An object in your virtual machine, and specifically a field in your object, is a pointer to said memory address. Thus, even if I have a mutable object Foo, calling setBar(true) on Foo will only be allowed on a single CPU at a time.
All that being said, the point of .NET and Java is that you shouldn't be spending your time sweating these problems until you can prove that you have a problem and I doubt you will.
If you share mutable data between threads, you need either a lock or a lock-free algorithm (seldom available, and sometimes hard to use, unfortunately).
Having a few widely used, lock-arbitrated resources can lead you into bottlenecks.
Static data is similar to a single-instance resource.
Therefore:
If many threads access static data, and you use a lock to arbitrate, your threads are going to fight for access.
When designing a highly multithreaded app, try to use many fine-grained locks. Split your data so that a thread can grab one piece and run with it; hopefully no other thread will need to wait for it, because they're busy with their own pieces of data.
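A small sketch of that fine-grained locking idea in C++ terms (the shard count and key type are arbitrary): the data is split into shards, each with its own lock, so two threads only contend when they touch the same shard.

    #include <array>
    #include <mutex>
    #include <string>
    #include <unordered_map>

    class ShardedMap {
        static const size_t kShards = 16;            // illustrative shard count
        struct Shard {
            std::mutex lock;
            std::unordered_map<std::string, int> data;
        };
        std::array<Shard, kShards> shards_;

        Shard& shardFor(const std::string& key) {
            return shards_[std::hash<std::string>{}(key) % kShards];
        }

    public:
        void put(const std::string& key, int value) {
            Shard& s = shardFor(key);
            std::lock_guard<std::mutex> g(s.lock);   // only this shard is blocked
            s.data[key] = value;
        }
        bool get(const std::string& key, int& out) {
            Shard& s = shardFor(key);
            std::lock_guard<std::mutex> g(s.lock);
            auto it = s.data.find(key);
            if (it == s.data.end()) return false;
            out = it->second;
            return true;
        }
    };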
The x86 architecture implements cache snooping to keep data caches in sync on writes, should they happen to cache the same thing... Not all architectures do that in hardware; some depend on software to make sure the case never occurs.
Even if it were true, I suspect you have plenty of better ways to improve performance. When it gets down to changing static to instance, for processor caching, you'll know you are really pushing the envelope.
