GPU memory allocation with multiple CUDA streams - memory-management

sorry for bothering!
I am trying to deploy one of our company's models, which has dynamic connections, to a production environment. Since the model is dynamically activated, batching requests and running inference on the batch is not a good idea. Instead, I want to use multiple CUDA streams to handle several requests on one GPU concurrently (one stream per request).
I have tried libtorch, since it supports multiple streams. However, I found that with libtorch, the memory allocated by each stream is cached by that stream and cannot be reused by other streams. (Suppose there is 2 GB of memory on one GPU, and stream A caches 1 GB after handling request1. Now, when stream B wants to handle request2, stream A first has to return the memory to the OS, and stream B needs to call cudaMalloc, which is very slow.)
I am wondering whether I could use tf-serving to serve my model. Will the same thing happen with tf-serving? Can different streams in tf-serving reuse cached GPU memory?
I am looking forward to your reply! Thank you so much!
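
For reference, the same per-stream caching behaviour can be observed from PyTorch's Python API, which sits on top of the same libtorch allocator: as far as I know, the caching allocator keys free blocks by the stream they were allocated on. Below is a minimal sketch, assuming a recent PyTorch build with CUDA available; the model and tensor sizes are placeholders. It only illustrates the behaviour described above and the torch.cuda.empty_cache() workaround (which returns cached blocks to the driver at the cost of fresh cudaMalloc calls later); it is not a fix.

    import torch

    # Placeholder model and sizes; the real model is assumed to be much larger.
    model = torch.nn.Linear(4096, 4096).cuda().eval()
    streams = [torch.cuda.Stream() for _ in range(2)]

    def handle_request(batch, stream):
        # Allocations made inside this context are associated with `stream`
        # by the caching allocator.
        with torch.cuda.stream(stream), torch.no_grad():
            return model(batch.cuda(non_blocking=True))

    out_a = handle_request(torch.randn(64, 4096).pin_memory(), streams[0])
    streams[0].synchronize()
    print("reserved after stream A:", torch.cuda.memory_reserved())

    # Blocks cached for stream A are not directly reused for stream B's allocations;
    # empty_cache() hands them back to the driver (so stream B pays for cudaMalloc again).
    torch.cuda.empty_cache()
    out_b = handle_request(torch.randn(64, 4096).pin_memory(), streams[1])
    torch.cuda.synchronize()
    print("reserved after stream B:", torch.cuda.memory_reserved())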

Related

Advantage of using a CUDA Stream

I am trying to understand where a Stream might help me with processing multiple Regions of Interest on a video frame. If using NPP functions that support a stream, is this a case where one would launch as many streams as there are ROIs? Possibly even creating a CPU thread for each Stream? Or is the benefit in using one stream to process all the ROIs and possibly using this single stream from multiple threads in the CPU?
In CUDA, using streams generally helps to better utilize the GPU in two ways. First, memory copies between host and device can be overlapped with kernel execution if the copy and the kernel run in different streams. Second, individual kernels running in different streams can overlap if there are enough resources on the GPU.
Further, whether creating a thread for each ROI would help depends on the relative GPU vs. CPU utilization. If there is a lot of processing on the CPU and the CPU holds up GPU computation, creating more threads helps.
There are further details (see the documentation for your version of CUDA) which constrain overlapping of operations in streams. A memory copy overlaps with a kernel execution only if the source or destination in host RAM is page-locked. Also, synchronization between streams occurs when the host thread issues command(s) in the default stream. (Since CUDA 7, each thread has its own default stream, so processing ROIs in different threads would help again.)
Hence, provided certain conditions are satisfied, processing the ROIs in different streams should improve the performance of your algorithm, up to a certain limit (depending on the resource consumption of the kernels, the ratio of memory copies to computation, etc.).
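
To make the first two points concrete, here is a small sketch of the copy/compute overlap pattern. It uses PyTorch only as a convenient way to drive the CUDA runtime from Python; the ROI sizes and the elementwise operation standing in for the per-ROI kernel are made up. Each ROI gets its own stream, the host buffers are page-locked (pinned), and the host-to-device copies are issued asynchronously so they can overlap with work in the other streams.

    import torch

    num_rois = 4
    streams = [torch.cuda.Stream() for _ in range(num_rois)]
    # Page-locked (pinned) host buffers: a requirement for copy/compute overlap.
    host_rois = [torch.empty(3, 256, 256).pin_memory() for _ in range(num_rois)]
    results = [None] * num_rois

    for i, (roi, s) in enumerate(zip(host_rois, streams)):
        with torch.cuda.stream(s):
            dev = roi.to("cuda", non_blocking=True)  # async H2D copy in stream s
            results[i] = dev * 2.0                   # stand-in for the per-ROI kernel

    torch.cuda.synchronize()  # wait for all streams before using the results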

Understanding tensorflow queues and cpu <-> gpu transfer

After reading this github issue I feel like I'm missing something in my understanding of queues:
https://github.com/tensorflow/tensorflow/issues/3009
I thought that when loading data into a queue, it will get pre-transferred to the GPU while the last batch is getting computed, so that there is virtually no bandwidth bottleneck, assuming computation takes longer than the time to load the next batch.
But the above link suggests that there is an expensive copy from queue into the graph (numpy <-> TF) and that it would be faster to load the files into the graph and do preprocessing there instead. But that doesn't make sense to me. Why does it matter if I load a 256x256 image from file vs a raw numpy array? If anything, I would think that the numpy version is faster. What am I missing?
There's no implementation of a GPU queue, so it only loads data into main memory and there's no asynchronous prefetching onto the GPU. You could make something like a GPU-based queue using variables pinned to gpu:0.
The documentation suggests that it is possible to pin a queue to a device:
N.B. Queue methods (such as q.enqueue(...)) must run on the same device as the queue. Incompatible device placement directives will be ignored when creating these operations.
But the above implies to me that any variables one is attempting to enqueue should already be on the GPU.
This comment suggests it may be possible to use tf.identity to perform the prefetch.
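
The discussion above predates the tf.data API; in later TensorFlow versions the same goal (keeping the next batch staged on the GPU while the current one is computed) is usually reached with prefetch_to_device rather than hand-built queues or tf.identity tricks. A minimal sketch, assuming TensorFlow 2.x and a toy loader in place of real file reading:

    import tensorflow as tf

    def load_example(_):
        # Placeholder for real file reading / decoding / preprocessing.
        return tf.random.uniform([256, 256, 3])

    ds = (tf.data.Dataset.range(1000)
          .map(load_example, num_parallel_calls=tf.data.experimental.AUTOTUNE)
          .batch(32)
          # Must be the last transformation: stages one batch on the GPU ahead of time.
          .apply(tf.data.experimental.prefetch_to_device("/gpu:0", buffer_size=1)))

    for batch in ds:                # batches arrive already staged on the GPU
        _ = tf.reduce_mean(batch)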

Are sockets communicating on the same PC that much slower than using shared memory?

I have a Windows DLL that provides video to an external application. My main application creates each video frame and I use globally shared memory backed by the system page file to pass that frame to the DLL. The video frame is subsequently retrieved by the external application and then displayed. I do not own the external application, just the DLL it loads to get video from. I am considering switching to a socket based approach to talk between my main application and the DLL and getting rid of the shared memory approach. I do not like watching the "soft page faults" pile up as I repetitively invalidate the shared memory location each time I write a new video frame to it. I believe that the soft page faults are harmless, just a side effect of the memory paging involved, but I would be more comfortable without it.
Since the video is being delivered at a frame rate of about 25 frames per second, I have approximately 1/25th of a second to transfer the frame. The frames are never larger than 640 x 480 and they are compressed JPEG frames so they aren't very large at all, usually about 10,000 bytes. So here's my question:
With an already open and persistent socket connection between two sockets on the same PC, will the time to transfer a video frame be significantly longer using a socket instead of a shared memory location? Or at the O/S level is it just a fast memory write with some insignificant "window dressing" around it to support the socket communication?
The main advantage of using shared memory is avoiding memory copies from application to kernel buffers (and back on the receiving end) and getting rid of user to kernel mode switching via system calls. You still need synchronization between cooperating processes, but that could be done in userland avoiding the kernel. All this is far from trivial and few people get it right, but my point is that switching to sockets will make your system slower. By how much and if that is acceptable is for you to measure and judge.
There's another side to the socket-based vs. shared-memory-based setup: flexibility. With sockets it's easy to switch to a distributed setup, and with networks getting faster and faster, that might be what's in store for you down the road.
Hope this helps.
Strictly speaking, shared memory would be faster, as socket communication adds a layer of indirection and extra instructions. Do you need disk backing for your shared memory? Windows allows shared memory without disk backing. I believe there's also a way to keep the region from getting swapped, but I don't know it off-hand.
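
For a rough feel for the difference, here is a small benchmark sketch in Python (the frame size and rate come from the question; everything else is made up). It times copying a ~10 KB frame into a shared-memory segment versus pushing the same frame through a loopback TCP socket. It is not representative of the Windows DLL setup described above, only of the relative order of magnitude.

    import socket
    import time
    from multiprocessing import shared_memory

    FRAME = bytes(10_000)        # ~10 KB compressed JPEG frame
    ITERATIONS = 25 * 60         # one minute of video at 25 fps

    # Shared-memory path: a single copy into the mapped region.
    shm = shared_memory.SharedMemory(create=True, size=len(FRAME))
    t0 = time.perf_counter()
    for _ in range(ITERATIONS):
        shm.buf[:len(FRAME)] = FRAME
    shm_time = time.perf_counter() - t0
    shm.close()
    shm.unlink()

    # Loopback-socket path: send + recv through the kernel on the same machine.
    server = socket.socket()
    server.bind(("127.0.0.1", 0))
    server.listen(1)
    client = socket.create_connection(server.getsockname())
    conn, _ = server.accept()
    t0 = time.perf_counter()
    for _ in range(ITERATIONS):
        client.sendall(FRAME)
        received = 0
        while received < len(FRAME):
            received += len(conn.recv(65536))
    sock_time = time.perf_counter() - t0
    client.close(); conn.close(); server.close()

    print(f"shared memory: {shm_time:.4f}s   loopback socket: {sock_time:.4f}s")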

RTP: recommend strategy in order to achieve fluent audio stream

Let me explain what I mean when I say "fluent audio stream".
I have a VoIP application which transfers PCMU-encoded audio wrapped in RTP packets over UDP. I have already implemented mechanisms which deal with packet loss (as suggested in RFC 3550).
The problem is that due to platform limitations (BlackBerry OS) I need to maintain a constant flow of data, i.e. I need to pass X bytes every S milliseconds.
Because of network delays, undelivered datagrams, etc., I can't guarantee that constant data flow, so I created a separate thread which compensates for packets that were dropped or delivered late by substituting fake packets ("silence").
So my question is: can anyone suggest a good way to combine the fake packets with the real ones? I realize that adding a fake packet automatically increases the lag, and maybe I should drop a real RTP packet after that, but as I said, this is because of platform limitations and I am willing to compromise on audio quality and accept some additional speech loss.
You need to read up on:
Jitter Buffers
Packet Loss Concealment
These exist to handle exactly the sort of problems you're dealing with.
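
As a starting point, here is a sketch of a very small playout (jitter) buffer in Python with trivial loss concealment: the playout side asks for one frame every 20 ms and gets a silence frame whenever the expected packet has not arrived. The 160-byte, 0xFF-filled PCMU silence frame and the lack of sequence-number wraparound handling are simplifications.

    # 20 ms of PCMU at 8 kHz = 160 samples; 0xFF is (roughly) mu-law silence.
    SILENCE = bytes([0xFF]) * 160

    class JitterBuffer:
        def __init__(self):
            self.frames = {}        # seq -> payload
            self.next_seq = None    # next sequence number to play out

        def push(self, seq, payload):
            """Called from the network thread for every received RTP payload."""
            if self.next_seq is None:
                self.next_seq = seq
            if seq >= self.next_seq:        # drop packets we already concealed
                self.frames[seq] = payload

        def pop(self):
            """Called by the playout thread every 20 ms; never blocks."""
            if self.next_seq is None:
                return SILENCE              # nothing received yet
            payload = self.frames.pop(self.next_seq, SILENCE)  # conceal a gap
            self.next_seq += 1
            return payload

A real implementation would also pre-fill a few frames before starting playout and handle 16-bit sequence-number wraparound.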

Average performance measurements for local IPC

We are now assessing different IPC (or rather RPC) methods for our current project, which is in its very early stages. Performance is a big deal, and so we are making some measurements to aid our choice. Our processes that will be communicating will reside on the same machine.
A separate valid option is to avoid IPC altogether (by encapsulating the features of one of the processes in a .NET DLL and having the other one use it), but this is an option we would really like to avoid, as these two pieces of software are developed by two separate companies and we find it very important to maintain good "fences", which make good neighbors.
Our tests consisted of passing messages (which contain variously sized BLOBs) across process boundaries using each method. These are the figures we get (performance range correlates with message size range):
Web Service (SOAP over HTTP):
25-30 MB/s when binary data is encoded as Base64 (default)
70-100 MB/s when MTOM is utilized
.NET Remoting (BinaryFormatter over TCP): 100-115 MB/s
Control group - DLL method call + mem copy: 800-1000 MB/s
Now, we've been looking all over the place for some average performance figures for these (and other) IPC methods, including performance of raw TCP loopback sockets, but couldn't find any. Do these figures look sane? Why is the performance of these local IPC methods at least 10 times slower than copying memory? I couldn't get better results even when I used raw sockets - is the overhead of TCP that big?
Shared memory is the fastest.
A producer process can put its output into memory shared between processes and notify other processes that the shared data has been updated. On Linux you naturally put a mutex and a condition variable in that same shared memory so that other processes can wait for updates on the condition variable.
Memory-mapped files + synchronization objects are the right way to go (almost the same as shared memory, but with more control). Sockets are way too slow for local communication. In particular, it sometimes happens that network drivers are slower over localhost than over the actual network.
Several parts of our system have been redesigned so that we don't have to pass 30MB messages around, but rather 3MB. This allowed us to choose .NET Remoting with BinaryFormatter over named pipes (IpcChannel), which gives satisfactory results.
Our contingency plan (in case we ever do need to pass 30MB messages around) is to pass protobuf-serialized messages over named pipes manually. We have determined that this also provides satisfactory results.
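
If you want a quick sanity check of local-IPC throughput without writing any .NET code, the sketch below pushes 3 MB blobs back and forth with Python's multiprocessing.connection (which uses named pipes on Windows and Unix domain sockets or TCP elsewhere). The address, round count, and zero-filled payload are arbitrary; the result is only meant to show whether your own figures are in the right ballpark.

    import time
    from multiprocessing import Process
    from multiprocessing.connection import Client, Listener

    ADDRESS = ("localhost", 6000)       # on Windows, an AF_PIPE address such as r"\\.\pipe\ipc_bench" also works
    MESSAGE = bytes(3 * 1024 * 1024)    # stand-in for a 3 MB serialized message
    ROUNDS = 100

    def server():
        with Listener(ADDRESS, authkey=b"bench") as listener:
            with listener.accept() as conn:
                for _ in range(ROUNDS):
                    conn.send_bytes(conn.recv_bytes())   # echo each message back

    if __name__ == "__main__":
        Process(target=server, daemon=True).start()
        time.sleep(0.5)                                  # crude wait for the listener to come up
        with Client(ADDRESS, authkey=b"bench") as conn:
            start = time.perf_counter()
            for _ in range(ROUNDS):
                conn.send_bytes(MESSAGE)
                conn.recv_bytes()
            elapsed = time.perf_counter() - start
        total_mb = 2 * ROUNDS * len(MESSAGE) / 2**20
        print(f"{total_mb / elapsed:.0f} MB/s round-trip")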