Are sockets communicating on the same PC that much slower than using shared memory? - windows

I have a Windows DLL that provides video to an external application. My main application creates each video frame and I use globally shared memory backed by the system page file to pass that frame to the DLL. The video frame is subsequently retrieved by the external application and then displayed. I do not own the external application, just the DLL it loads to get video from. I am considering switching to a socket based approach to talk between my main application and the DLL and getting rid of the shared memory approach. I do not like watching the "soft page faults" pile up as I repetitively invalidate the shared memory location each time I write a new video frame to it. I believe that the soft page faults are harmless, just a side effect of the memory paging involved, but I would be more comfortable without it.
Since the video is being delivered at a frame rate of about 25 frames per second, I have approximately 1/25th of a second to transfer the frame. The frames are never larger than 640 x 480 and they are compressed JPEG frames so they aren't very large at all, usually about 10,000 bytes. So here's my question:
With an already open and persistent socket connection between two sockets on the same PC, will the time to transfer a video frame be significantly longer using a socket instead of a shared memory location? Or at the O/S level is it just a fast memory write with some insignificant "window dressing" around it to support the socket communication?

The main advantage of using shared memory is avoiding memory copies from application to kernel buffers (and back on the receiving end) and getting rid of user to kernel mode switching via system calls. You still need synchronization between cooperating processes, but that could be done in userland avoiding the kernel. All this is far from trivial and few people get it right, but my point is that switching to sockets will make your system slower. By how much and if that is acceptable is for you to measure and judge.
There's another side to socket-based vs. shared memory-based setup - flexibility - with sockets it's easy to switch to a distributed setup. With networks getting faster and faster that's what might be in store for you down the road.
Hope this helps.

Strictly speaking, shared memory would be faster as the socket communication adds a layer of indirection and instructions. Do you need backing for your shared memory? Windows allows shared memory without disk backing. I believe there's a way to keep the region from getting swapped as well, but don't know off-hand.

Related

What library or API should I use to implement a linux kernel module doing asynchronous IO?

First I will tell environment of my PC, background of my question, my problem, than I will explain my exact question.
Environment:
OS: Ubuntu 16.04
Kernel: 4.17.1
CPU: i7-6700k
Memory: 8GB DRAM
Storage: SSD 120GB
Background:
I'm trying to optimizing linux kernel for my specific application. Following is abstract logic of this application.
1. call malloc, allocate the memory space which size is exactly 4KB(page size)
2. Copy predefined data(also, size is 4KB) to allocated memory space.
3. Do computation
4. Free allocated memory space.
This sequence occurs about several thousands to ten thousands times a second.
So I thought copy predefined data to allocated memory space using memcpy() thousands of times every second is very inefficient. But I cannot fix the code of this application.
My problem:
I want to do these copies asynchronously by kernel module, using less CPU cycles as possible. So I'm trying to implement a kernel module that copy this predefined data to free page frames asynchronously in kernel, and managing a pool page frames which has predefined data on them. When my specific application request a page frame, my kernel will give a page frame from this pool.
To copy data asynchronously, I first considered DMA, but intel idma64 of my CPU cannot copy data memory to memory asynchronously. Now, I'm trying to copy this data from secondary storage(SSD) to memory. I found that there is library for asynchronous IO named libaio in linux.
My question:
1. Can I use libaio libraries in kernel module? If not, what kind of library or APIs do I have to use to copy asynchronously in my kernel module?
2. Will libaio(or something else) really do copies without exploiting CPU cycles?
I don't think you need to write a kernel module. A user space thread pool of CPU pinned threads working with a collection of memory maps of files will be as efficient as is possible to implement. Just be careful of "TLB shootdown" i.e. avoid modifying the address space of the process, and throw as much virtual address space as you can at the problem to avoid that. Perhaps a little bit of hinting to the kernel as to what written pages will never be used again via madvise(), and you should be optimal - sufficient multiple threads will maximise queue depth to the SSD, you want to aim for QD8 to QD16, and you should easily saturate a NVMe link whilst keeping CPU usage below 100%.
Things get harder if you have many NVMe linked SSDs, you may need to consider replacing Linux will something with more scalable storage i/o, but there is a throughput vs scalability tradeoff there. Windows and FreeBSD will scale better with lots of devices if you partition the work up right, but Linux will do much better with a few devices. Good luck!

D3D11_USAGE_STAGING, what kind of GPU/CPU memory is used?

I read the D3D11 Usage page and coming from a CUDA background I'm wondering what kind of memory would a texture marked as D3D11_USAGE_STAGING be stored into.
I suppose in CUDA it would be pinned page-locked zero-copy memory. I measured the transfer time from a ID3D11Texture2D with D3D11_USAGE_STAGING to a host buffer allocated with malloc and it took almost 7 milliseconds (quite a lot in streaming/gaming) and I thought this would be the time required from GPU global memory to that memory area.
Is any of my suppositions correct? What is D3D11_USAGE_STAGING using as GPU memory?
The primary use for D3D11_USAGE_STAGING is as a way to load data into other D3D11_USAGE_DEFAULT pool resources. Another common usage is for 'readback' of a render target to CPU accessible memory. You can use CopySubResourceRegion to move data between DEFAULT and STAGING resources (discrete hardware often uses Direct Memory Access to handle the moving of data between system memory and VRAM).
This is a generalization because it depends on the architecture and driver choices, but in short::
D3D11_USAGE_STAGING means put it in system memory, and the GPU can't access it.
D3D11_USAGE_DEFAULT put it in the VRAM, the CPU can't access it. To put data into it requires copying data from a STAGING resource. You can think of UpdateSubresource as a convenience function that creates a STAGING resource, copies the data to it, copies from STAGING to DEFAULT, then releases the STAGING copy.
There is an optional feature in DirectX 11.2 or later than can be implemented by the driver if they choose to allow CPU access even to D3D11_USAGE_DEFAULT pool stuff. This depends on how they have their memory system setup (i.e. in Unified Memory Architectures, system RAM and VRAM are the same thing).
D3D11_USAGE_IMMUTABLE is basically the same as D3D11_USAGE_DEFAULT, but you are saying you are only going to initialize it once in the creation call.
D3D11_USAGE_DYNAMIC means put it in shared system memory, the CPU & GPU both need access to it. There's usually a performance hit for the GPU to read from this compared to DEFAULT, so you want to use it sparingly. It's really for stuff you generate every frame on the CPU and then need to render (such as terrain systems or procedural geometry).
Games that use content streaming typically have a set of STAGING textures they load data into in the background, and then copy it to DEFAULT textures as they become available for efficient rendering. This lets them reuse both STAGING and DEFAULT textures without the overhead of destroying/creating resources every frame.
See this somewhat dated article on Microsoft Docs: Resource Management Best Practices

kmalloc'ed memory is slow

We have an app that requires ~1MB buffers for a hardware device to fill, therefore we wrote a kernel module that allocates buffers using kmalloc(). We did not use dma_alloc_coherent() as we need to manipulative the buffers and therefore wanted them to be cached (we flush the cache when needed). One of the manipulations that is done is the kernel module copies one buffer to another buffer. In timing these copies we see it takes about ~2ms to copy a buffer. The time does not include any cache flushing.
As this seemed slow we wrote a standard userspace test app, that used malloc() to create 1MB buffers and copied them. The userspace copies took about .5ms, which is about the correct time to move this amount of memory on the processor/memory config we are using.
Thinks we tried: To make sure it wasn't a different memcpy() in kernel space and user space we wrote our own NEON optimized copy, but made no difference. Changed the buffer size from 100KB to 10MB and made no difference. All times were over 10 copies, but always very very consistent. Time routine used gettimeofday() in userspace.
Only thing we can think of is that the data cache is setup up different for kmalloc()'ed memory then malloc()'ed memory???
We are working on iMX6 ARM, Linaro kerne.
The kmalloc() memory will be contiguous in physical space. The user space will definitely not (mlock() may result in closer to contiguous). If you have several SDRAM chips, it is possible that your memory controller allow pipelining or multiple issue reads/writes to different chips simultaneously. It may even be faster with multiple banks. vmalloc() will not use contiguous pages.Ref You should be able to write a test to swap kmalloc() with vmalloc(). If something has changed with the newer ARMs and the cache is not VIVT, the difference in physical addresses could cause cache (aliasing?) effects on some processors.
I do not think that the cache are setup differently for kernel memory versus user memory; at least with 2.6.34 variants; but they may come from different pools. Also, for a memcpy() a large cache is not needed; you just need enough to make sure the SDRAM will burst.
Another issue is peripherals. For instance, a large graphics buffer on one chip maybe stealing cycles via DMA. If you can change your machine file or device table to disable as many drivers as possible, this can be eliminated. This combined with the pipelining could account for the type of slow-down observed.
I believe this is a platform issue. If it was strictly Linux, I think that one of the millions of users may have encountered it. However, you haven't given a specific version of Linux. It could be an ARM based issue; so I tagged it as such. I think it is your platform/ARM combination; simply because others would observe this. Can you also provide a specific machine file or device table that your design was based upon and the Linux version.

How much memory should a caching system use on Windows?

I'm developing a client/server application where the server holds large pieces of data such as big images or video files which are requested by the client and I need to create an in-memory client caching system to hold a few of those large data to speed up the process. Just to be clear, each individual image or video is not that big but the overall size of all of them can be really big.
But I'm faced with the "how much data should I cache" problem and was wondering if there are some kind of golden rules on Windows about what strategy I should adopt. The caching is done on the client, I do not need caching on the server.
Should I stay under x% of global memory usage at all time ? And how much would that be ? What will happen if another program is launched and takes up a lot of memory, should I empty the cache ?
Should I request how much free memory is available prior to caching and use a fixed percentage of that memory for my needs ?
I hope I do not have to go there but should I ask the user how much memory he is willing to allocate to my application ? If so, how can I calculate the default value for that property and for those who will never use that setting ?
Rather than create your own caching algorithms why don't you write the data to a file with the FILE_ATTRIBUTE_TEMPORARY attribute and make use of the client machine's own cache.
Although this approach appears to imply that you use a file, if there is memory available in the system then the file will never leave the cache and will remain in memory the whole time.
Some advantages:
You don't need to write any code.
The system cache takes account of all the other processes running. It would not be practical for you to take that on yourself.
On 64 bit Windows the system can use all the memory available to it for the cache. In a 32 bit Delphi process you are limited to the 32 bit address space.
Even if your cache is full and your files to get flushed to disk, local disk access is much faster than querying the database and then transmitting the files over the network.
It depends on what other software runs on the server. I would make it possible to configure it manually at first. Develop a system that can use a specific amount of memory. If you can, build it so that you can change that value while it is running.
If you got those possibilities, you can try some tweaking to see what works best. I don't know any golden rules, but I'd figure you should be able to set a percentage of total memory or total available memory with a specific minimum amount of memory to be free for the system at all times. If you save a miminum of say 500 MB for the server OS, you can use the rest, or 90% of the rest for your cache. But those numbers depend on the version of the OS and the other applications running on the server.
I think it's best to make the numbers configurable from the outside and create a management tool that lets you set the values manually first. Then, if you found out what works best, you can deduct formulas to calculate those values, and integrate them in your management tool. This tool should not be an integral part of the cache program itself (which will probably be a service without GUI anyway).
Questions:
One image can be requested by multiple clients? Or, one image can be requested by multiple times in a short interval?
How short is the interval?
The speed of the network is really high? Higher than the speed of the hard drive?? If you have a normal network, then the harddrive will be able to read the files from disk and deliver them over network in real time. Especially that Windows is already doing some good caching so the most recent files are already in cache.
The main purpose of the computer that is running the server app is to run the server? Or is just a normal computer used also for other tasks? In other words is it a dedicated server or a normal workstation/desktop?
but should I ask the user how much
memory he is willing to allocate to my
application ?
I would definitively go there!!!
If the user thinks that the server application is not a important application it will probably give it low priority (low cache). Else, it it thinks it is the most important running app, it will allow the app to allocate all RAM it needs in detriment of other less important applications.
Just deliver the application with that setting set by default to a acceptable value (which will be something like x% of the total amount of RAM). I will use like 70% of total RAM if the main purpose of the computer to hold this server application and about 40-50% if its purpose is 'general use' computer.
A server application usually needs resources set aside for its own use by its administrator. I would not care about others application behaviour, I would care about being a "polite" application, thereby it should allow memory cache size and so on to be configurable by the administator, which is the only one who knows how to configure his systems properly (usually...)
Defaults values should anyway take into consideration how much memory is available overall, especially on 32 bit systems with less than 4GB of memory (as long as Delphi delivers only 32 bit apps), to leave something free to the operating systems and avoids too frequent swapping. Asking the user to select it at setup is also advisable.
If the application is the only one running on a server, a value between 40 to 75% of available memory could be ok (depending on how much memory is needed beyond the cache), but again, ask the user because it's almost impossible to know what other applications running may need. You can also have a min cache size and a max cache size, start by allocating the lower value, and then grow it when and if needed, and shrink it if necessary.
On a 32 bit system this is a kind of memory usage that could benefit from using PAE/AWE to access more than 3GB of memory.
Update: you can also perform a monitoring of cache hits/misses and calculate which cache size would fit the user needs best (it could be too small but too large as well), and the advise the user about that.
To be honest, the questions you ask would not be my main concern. I would be more concerned with how effective my cache would be. If your files are really that big, how many can you hold in the cache? And if your client server app has many users, what are the chances that your cache will actually cache something someone else will use?
It might be worth doing an analysis before you burn too much time on the fine details.

Average performance measurements for local IPC

We are now assessing different IPC (or rather RPC) methods for our current project, which is in its very early stages. Performance is a big deal, and so we are making some measurements to aid our choice. Our processes that will be communicating will reside on the same machine.
A separate valid option is to avoid IPC altogether (by encapsulating the features of one of the processes in a .NET DLL and having the other one use it), but this is an option we would really like to avoid, as these two pieces of software are developed by two separate companies and we find it very important to maintain good "fences", which make good neighbors.
Our tests consisted of passing messages (which contain variously sized BLOBs) across process boundaries using each method. These are the figures we get (performance range correlates with message size range):
Web Service (SOAP over HTTP):
25-30 MB/s when binary data is encoded as Base64 (default)
70-100 MB/s when MTOM is utilized
.NET Remoting (BinaryFormatter over TCP): 100-115 MB/s
Control group - DLL method call + mem copy: 800-1000 MB/s
Now, we've been looking all over the place for some average performance figures for these (and other) IPC methods, including performance of raw TCP loopback sockets, but couldn't find any. Do these figures look sane? Why is the performance of these local IPC methods at least 10 times slower than copying memory? I couldn't get better results even when I used raw sockets - is the overhead of TCP that big?
Shared memory is the fastest.
A producer process can put its output into memory shared between processes and notify other processes that the shared data has been updated. On Linux you naturally put a mutex and a condition variable in that same shared memory so that other processes can wait for updates on the condition variable.
Memory-mapped files + synchronization objects is the right way to go (almost the same as shared memory, but with more control). Sockets are way too slow for local communications. Especially it sometimes happens that network drivers are slower with localhost, than over network.
Several parts of our system have been redesigned so that we don't have to pass 30MB messages around, but rather 3MB. This allowed us to choose .NET Remoting with BinaryFormatter over named pipes (IpcChannel), which gives satisfactory results.
Our contingency plan (in case we ever do need to pass 30MB messages around) is to pass protobuf-serialized messages over named pipes manually. We have determined that this also provides satisfactory results.

Resources