Advantage of using a CUDA Stream - parallel-processing

I am trying to understand where a Stream might help me with processing multiple Regions of Interest on a video frame. If using NPP functions that support a stream, is this a case where one would launch as many streams as there are ROIs? Possibly even creating a CPU thread for each Stream? Or is the benefit in using one stream to process all the ROIs and possibly using this single stream from multiple threads in the CPU?

In CUDA, using streams generally helps utilize the GPU better in two ways. First, memory copies between host and device can overlap with kernel execution if the copy and the kernel run in different streams. Second, individual kernels running in different streams can overlap if there are enough resources on the GPU.
Whether creating a CPU thread for each ROI would help depends on how CPU and GPU utilization compare. If there is a lot of processing on the CPU and the CPU is holding up the GPU, creating more threads helps.
There are further details (see the documentation for your version of CUDA) that constrain overlapping of operations across streams. A memory copy overlaps with a kernel execution only if the source or destination in host RAM is page-locked (pinned). Also, streams are implicitly synchronized whenever the host thread issues commands in the default stream. (Since CUDA 7, each host thread can have its own default stream, via the per-thread default stream option, so processing ROIs in different threads helps here as well.)
Hence, if those conditions are satisfied, processing the ROIs in different streams should improve the performance of your algorithm, up to a certain limit (depending on the resource consumption of the kernels, the ratio of memory copies to computation, etc.).
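As an illustration of those conditions, here is a minimal sketch (not taken from NPP or your code; the kernel processRoi, the ROI sizes and the buffer layout are made-up placeholders) of processing each ROI in its own stream, with page-locked host memory so that copies and kernels can overlap:

    #include <cuda_runtime.h>
    #include <vector>

    // Trivial stand-in for the real ROI processing (e.g. an NPP call that takes a stream).
    __global__ void processRoi(unsigned char* roi, int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < width && y < height)
            roi[y * width + x] = 255 - roi[y * width + x];
    }

    int main()
    {
        const int numRois = 4, w = 256, h = 256, roiBytes = w * h;

        // Page-locked host memory is required for copies to overlap with kernels.
        unsigned char* hostRois = nullptr;
        cudaMallocHost((void**)&hostRois, numRois * roiBytes);
        unsigned char* devRois = nullptr;
        cudaMalloc((void**)&devRois, numRois * roiBytes);

        std::vector<cudaStream_t> streams(numRois);
        for (auto& s : streams) cudaStreamCreate(&s);

        dim3 block(16, 16), grid((w + 15) / 16, (h + 15) / 16);
        for (int i = 0; i < numRois; ++i) {
            unsigned char* hRoi = hostRois + i * roiBytes;
            unsigned char* dRoi = devRois  + i * roiBytes;
            // Copy in, process, copy out: all asynchronous, ordered within stream i,
            // free to overlap with work in the other streams.
            cudaMemcpyAsync(dRoi, hRoi, roiBytes, cudaMemcpyHostToDevice, streams[i]);
            processRoi<<<grid, block, 0, streams[i]>>>(dRoi, w, h);
            cudaMemcpyAsync(hRoi, dRoi, roiBytes, cudaMemcpyDeviceToHost, streams[i]);
        }
        cudaDeviceSynchronize();

        for (auto& s : streams) cudaStreamDestroy(s);
        cudaFree(devRois);
        cudaFreeHost(hostRois);
        return 0;
    }

Whether a CPU thread per stream is also needed depends, as noted above, on how much CPU-side work each ROI requires; the asynchronous launches themselves can usually be issued from a single thread.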

Related

Improved image storage/import performance on HDD [closed]

Hello, I have a question about thread pools and simultaneous HDD reads/writes. This is my first time posting a question, so I apologize in advance for the lengthy write-up...
On one PC, an image processing/storage program and an image loading program are both running.
When image storage and image loading run simultaneously on the same HDD, the image processing work seems to slow down.
The HDD has only one disk head, so I know it is fastest to do one operation at a time... I cannot change that, so I want to minimize the slowdown.
Next, the development environment and the current implementation.
I work with MFC + OpenCV (Windows 10.0.19044).
The image processing program runs 24 hours a day and repeats whenever an instruction is received.
Each image is 16384 x 40000 pixels at 1 byte per pixel, and there are 2 of them.
Since the images are so large, both image processing and image storage (after splitting the image into regions) are done in a thread pool.
The image loading program runs only when the user needs it.
On a query, it looks up the image information in the DB and retrieves the images from the HDD.
The PC is equipped with an SSD and two 13 TB HDDs.
The processor is an i9-12900KF (16 cores, 24 threads).
Jobs are pulled from a queue, and both image processing and image storage jobs are handled by the same thread pool.
Since they share the same thread pool, I suspect that during image storage the number of threads available for image processing decreases.
I set the number of threads to 40 for both programs, for no particular reason. I have heard the count should be tuned to the number of cores; I am still considering that.
I store each image in both PNG format and JPG format.
By default, image loading uses the small JPG; the function is split so that the user can load the PNG directly if necessary.
When saving a split image, the image encoding work runs concurrently in the thread pool, and the memory -> HDD transfers are done sequentially, one at a time, on a single thread.
For image loading, the HDD -> memory reads are done sequentially, one at a time, and the image decoding work runs concurrently in the thread pool.
The image processing results must be stored in the DB, and those results should be delivered quickly.
It does not matter much if the image storage slows down.
The image loading speed is not satisfactory to the user, but some compromise is possible there. (Still, I want it to return results as quickly as possible...)
So here is what I thought:
1. If the image storage/loading threads are given lower thread priority, will the image processing threads get more work done?
2. Is it meaningful to split the single thread pool into separate pools for image storage and image processing?
3. What about saving the image to the SSD first and having a separate service program move it slowly to the HDD?
4. Or is there actually a problem with the disk itself?
Ideas 1 and 2 will be developed and released. (It is difficult to reproduce the problem in the office...)
For the third idea, writing to the SSD first and then to the HDD in one pass would still overlap with the HDD reads, so I think it only complicates the development. However, the SSD is significantly faster than the HDD for storing images.
Regarding idea 4, JPG loading is not slow because of the small file size... it is the decoding that is slow, and I thought decoding has nothing to do with the HDD.
So, both programs had 40 threads in their thread pools. I reduced the image loading program to two threads and shipped an update, but the report was that only the image loading became slower and the issue remained.
The situation is complicated and there are many suspicious things, but I am asking because I think there are things I do not know or have gotten wrong...
First of all, you use a thread pool with far more threads than the number of cores of the i9-12900KF processor. Having two threads running on the same physical core generally makes both of them slower. If they run on the same logical core, they cannot run simultaneously at all (they will constantly interrupt each other). In fact, even if they run on different physical cores, one thread can significantly slow down another if it makes intensive use of the L3 cache or of memory, which is likely your case. Operating on a large buffer can cause cache lines belonging to other cores to be evicted and reloaded later. This is known as cache thrashing. The problem can become critical with non-contiguous loads/stores.
The target processor is a hybrid (big/little) one, so scheduling threads on it is more complex than usual. In fact, many libraries do not yet support such architectures well (they do not run efficiently on them), and even the OS schedulers are barely suited to this kind of architecture (at least on Windows and Linux). The number of threads per core is not the same for all cores: a big core can execute 2 hardware threads simultaneously (sharing the available resources), while a little core can execute only 1 thread at a time. The frequencies also differ: 2.4 GHz vs 3.2 GHz base frequency and 3.9 GHz vs 5.1 GHz turbo frequency for the little and big cores respectively. Depending on which core a thread is scheduled to, its performance can therefore change.
The frequency of the cores running the threads depends on the number of active cores and on the work done on each of them. For example, running computationally intensive code using the FP AVX2 units (or the not officially supported AVX-512 units) on one core can significantly reduce the frequency of other cores. The more cores are active, the lower the frequency. This dynamic frequency scaling hurts the scalability of applications, but it is important for the processor to stay within its power budget (and not melt).
Caching also matters a lot. Mainstream OSes tend to keep data read from or written to the HDD in memory so that later accesses are faster. This uses additional memory that is not counted as allocated. When a process requests a large amount of memory, the OS flushes/invalidates that IO cache to make room, and later file accesses then reload the data from the storage device (much slower). The solution is to check the amount of truly free memory (the part not used by caching) and avoid consuming so much memory that the storage-device cache gets evicted.
Having two threads doing IO operations is generally not faster than one thread on an HDD (especially one with a single head). Some OS storage stacks use locks, sometimes even a single global lock. Because of that, one loading thread using asynchronous IO can be faster than blocking IO on one or several threads: the OS can then reorder the requests so they are more contiguous (reducing seek time by reading data along the way).
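As a rough illustration of that last point, here is a minimal sketch of issuing several read requests from a single thread with Windows overlapped (asynchronous) I/O so the OS can reorder them; the file name, chunk size and request count are placeholders, not taken from your setup:

    #include <windows.h>
    #include <vector>

    int main()
    {
        // FILE_FLAG_OVERLAPPED enables asynchronous reads on this handle.
        HANDLE file = CreateFileA("D:\\images\\tile_000.png", GENERIC_READ,
                                  FILE_SHARE_READ, nullptr, OPEN_EXISTING,
                                  FILE_FLAG_OVERLAPPED, nullptr);
        if (file == INVALID_HANDLE_VALUE) return 1;

        const DWORD chunk = 1 << 20;              // 1 MiB per request (placeholder)
        const int   requests = 4;
        std::vector<std::vector<char>> buffers(requests, std::vector<char>(chunk));
        std::vector<OVERLAPPED> ov(requests);

        for (int i = 0; i < requests; ++i) {
            ZeroMemory(&ov[i], sizeof(OVERLAPPED));
            ov[i].Offset = i * chunk;             // byte offset of this request
            ov[i].hEvent = CreateEvent(nullptr, TRUE, FALSE, nullptr);
            // Returns FALSE with ERROR_IO_PENDING while the read runs in the background.
            if (!ReadFile(file, buffers[i].data(), chunk, nullptr, &ov[i]) &&
                GetLastError() != ERROR_IO_PENDING) {
                return 1;
            }
        }

        // The thread is free to do other work here; then collect the results.
        for (int i = 0; i < requests; ++i) {
            DWORD bytesRead = 0;
            GetOverlappedResult(file, &ov[i], &bytesRead, TRUE);   // TRUE = wait
            CloseHandle(ov[i].hEvent);
        }
        CloseHandle(file);
        return 0;
    }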

GPU memory allocation with multiple CUDA streams

Sorry for bothering!
I am trying to deploy one of our company's models, which has dynamic connections, to a production environment. Since the model is activated dynamically, batching requests for inference is not a good idea. Instead, I want to use multiple CUDA streams to handle several requests concurrently on one GPU (one stream per request).
I have tried libtorch, since it supports multiple streams. However, I found that with libtorch the memory allocated on each stream is cached by that stream and cannot be reused by other streams. (Suppose there are 2 GB of memory on one GPU, and stream A caches 1 GB after handling request 1. When stream B wants to handle request 2, stream A first has to return its memory to the system, and stream B has to call cudaMalloc again, which is very slow.)
I am wondering whether I could use tf-serving for my model instead. Does the same thing happen with tf-serving? Can different streams in tf-serving reuse cached GPU memory?
I am looking forward to your reply! Thank you so much!
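For reference, below is a minimal sketch of the one-stream-per-request setup with libtorch, assuming a TorchScript model loaded from a file; the model path, input shape and threading are placeholders, and whether the caching allocator reuses memory across those streams is exactly the separate issue described above:

    #include <torch/script.h>
    #include <c10/cuda/CUDAStream.h>
    #include <c10/cuda/CUDAGuard.h>
    #include <thread>
    #include <vector>

    void handleRequest(torch::jit::script::Module& model) {
        // Each request gets its own stream from libtorch's stream pool.
        c10::cuda::CUDAStream stream = c10::cuda::getStreamFromPool();
        c10::cuda::CUDAStreamGuard guard(stream);   // ops below are issued on this stream

        torch::NoGradGuard noGrad;
        std::vector<torch::jit::IValue> inputs;
        inputs.push_back(torch::randn({1, 3, 224, 224}, torch::kCUDA));  // placeholder input
        auto output = model.forward(inputs).toTensor();
        stream.synchronize();                       // wait before returning the result
    }

    int main() {
        torch::jit::script::Module model = torch::jit::load("model.pt");  // placeholder path
        model.to(torch::kCUDA);
        model.eval();

        // Two concurrent requests, each on its own thread and stream.
        std::vector<std::thread> workers;
        for (int i = 0; i < 2; ++i)
            workers.emplace_back(handleRequest, std::ref(model));
        for (auto& t : workers) t.join();
        return 0;
    }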

Are OpenCL workgroups executed simultaneously?

My understanding was that each workgroup is executed on the GPU and then the next one is executed.
Unfortunately, my observations lead to the conclusion that this is not correct.
In my implementation, all workgroups share a big global memory buffer.
All workgroups perform read and write operations to various positions on this buffer.
If the kernel operates on it directly, no conflicts arise.
If a workgroup loads a chunk into local memory, performs some computation, and copies the result back, the global memory gets corrupted by other workgroups.
So how can I avoid this behaviour?
Can I somehow tell OpenCL to only execute one workgroup at once or rearrange the execution order, so that I somehow don't get conflicts?
The answer is that it depends. A whole workgroup must be executed concurrently (though not necessarily in parallel) on the device, at least when barriers are present, because the workgroup must be able to synchronize and communicate. There is no rule that says work-groups must be concurrent - but there is no rule that says they cannot. Usually hardware will place a single work-group on a single compute core. Most hardware has multiple cores, which will each get a work-group, and to cover latency a lot of hardware will also place multiple work-groups on a single core if there is capacity available.
You have no way to control the order in which work-groups execute. If you want them to serialize you would be better off launching just one work-group and writing a loop inside to serialize the series of work chunks in that same work-group. This is often a good strategy in general even with multiple work-groups.
If you really only want one work-group at a time, though, you will probably be using only a tiny part of the hardware. Most hardware cannot spread a single work-group across the entire device - so if you're stuck to one core on a 32-core GPU you're not getting much use of the device.
You need to set the global size and dimensions to that of a single work group, and enqueue a new NDRange for each group. Essentially, breaking up the call to your kernel into many smaller calls. Make sure your command queue is not allowing out of order execution, so that the kernel calls are blocking.
This will likely result in poorer performance, but you will get the dedicated global memory access you are looking for.
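A minimal sketch of that approach as host code, assuming an already-built kernel, an in-order command queue, and OpenCL 1.1+ for the global work offset (names and sizes are placeholders):

    #include <CL/cl.h>

    // Enqueue the kernel once per chunk, with a global size of exactly one
    // work-group, on an in-order queue, so the groups run one after another.
    void runGroupsSerially(cl_command_queue inOrderQueue, cl_kernel kernel,
                           size_t groupSize, size_t numGroups)
    {
        const size_t local  = groupSize;
        const size_t global = groupSize;             // one work-group per launch
        for (size_t g = 0; g < numGroups; ++g) {
            const size_t offset = g * groupSize;     // shifts get_global_id() to this chunk
            clEnqueueNDRangeKernel(inOrderQueue, kernel, 1, &offset,
                                   &global, &local, 0, nullptr, nullptr);
        }
        clFinish(inOrderQueue);                      // all launches ran back to back
    }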
Yes, the groups can be executed in parallel; this is normally a very good thing.
The number of workgroups that can be concurrently launched on a ComputeUnit (AMD) or SMX (Nvidia) depends on the availability of GPU hardware resources, important ones being vector-registers and workgroup-level-memory** (called LDS for AMD and shared memory for Nvidia). If you want to launch just one workgroup on the CU/SMX, make sure that the workgroup consumes a bulk of these resources and blocks further workgroups on the same CU/SMX. You would, however, still have other workgroups executing on other CUs/SMXs - a GPU normally has multiple of these.
I am not aware of any API which lets you pin a kernel to a single CU/SMX.
** It also depends on the number of concurrent wavefronts/warps the scheduler can handle.
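As a sketch of that resource-blocking idea, the kernel below (purely illustrative) statically allocates a large block of local memory so that a second work-group is unlikely to fit on the same CU/SMX; the exact size has to be tuned against CL_DEVICE_LOCAL_MEM_SIZE:

    // Local memory per CU/SMX is typically 32-64 KB, so one work-group using
    // most of it leaves little or no room for a second one on the same unit.
    static const char* kHeavyGroupKernel = R"CLC(
    __kernel void heavy_group(__global float* data)
    {
        __local float scratch[8192];        /* ~32 KB of local memory per work-group */
        size_t lid = get_local_id(0);
        size_t gid = get_global_id(0);
        scratch[lid] = data[gid];
        barrier(CLK_LOCAL_MEM_FENCE);
        data[gid] = scratch[lid];
    }
    )CLC";

    // After building, the kernel's actual local-memory usage can be checked with:
    //   cl_ulong usedLocal = 0;
    //   clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_LOCAL_MEM_SIZE,
    //                            sizeof(usedLocal), &usedLocal, NULL);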

Is it possible to use the same memory buffers for different kernels in OpenCL?

I am implementing kernel functions to which memory is transferred from the host side. The program has three kernel functions. Is it possible to share the same memory buffers with the kernels at different times?
Yes, multiple kernels can use the same memory objects, as long as there is no risk of the kernels being executed at the same time. This is the case for the usual setup of a single command queue not created with out-of-order execution.
Yes, I do this with my ray tracer. I have three kernels: a preprocessor which changes geometry, a ray tracer, and a post processor which does image processing. I share memory buffers with all three of them, and I make sure each kernel finishes before I start the next one.
You can share memory without any problem. If the memory is read-only, you can even use that memory object as an input for 2 kernels running concurrently (i.e. different GPUs / same context).
However, if you want to overwrite the memory zones, then be careful and use events to sync your kernels. I strongly recommend the events mechanism, since it enables parallel I/O reads and writes to the memory zones in another queue.
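A minimal sketch of that event-based approach, with two hypothetical kernels (a producer and a consumer, already built) sharing one buffer:

    #include <CL/cl.h>

    void runTwoKernelsOnSameBuffer(cl_context ctx, cl_command_queue queue,
                                   cl_kernel producer, cl_kernel consumer,
                                   size_t n)
    {
        cl_int err = CL_SUCCESS;
        cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, n * sizeof(float),
                                    nullptr, &err);

        // Both kernels take the same cl_mem object as argument 0.
        clSetKernelArg(producer, 0, sizeof(cl_mem), &buf);
        clSetKernelArg(consumer, 0, sizeof(cl_mem), &buf);

        cl_event producedEvt = nullptr;
        clEnqueueNDRangeKernel(queue, producer, 1, nullptr, &n, nullptr,
                               0, nullptr, &producedEvt);

        // The consumer waits on the producer's event, so the shared buffer is
        // never read and overwritten at the same time, even with an
        // out-of-order queue or a second queue.
        clEnqueueNDRangeKernel(queue, consumer, 1, nullptr, &n, nullptr,
                               1, &producedEvt, nullptr);

        clFinish(queue);
        clReleaseEvent(producedEvt);
        clReleaseMemObject(buf);
    }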

OpenCL - resources for multiple command queues

I'm working on an application where I process a video feed on my GPU in real time, and once in a while I need to do some resource-intensive calculations on the GPU besides that. My problem is that I want to keep my video processing at real-time speed while doing the extra work in parallel once it comes up.
The way I think this should be done is with two command queues, one for the real-time video processing and one for the intensive calculations. However, I have no idea how this will turn out with respect to the GPU's compute resources: will equally many workers be assigned to the two command queues during parallel execution (so I could expect a slowdown of about 50% in my real-time computations)? Or is it device dependent?
The OpenCL specification leaves it up to the vendor to decide how to balance execution resources between multiple command queues. So a vendor could implement OpenCL in such a way that causes the GPU to work on only one kernel at a time. That would be a legal implementation, in my opinion.
If you really want to solve your problem in a device-independent way, I think you need to figure out how to break up your large non-real-time computation into smaller computations.
AMD has some extensions (some of which I think got adopted in OpenCL 1.2) for device fission, which means you can reserve some portion of the device for one context and use the rest for others.
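A minimal sketch of that device-fission idea, assuming a device that reports support for CL_DEVICE_PARTITION_BY_COUNTS; the compute-unit counts (24 and 8) are placeholders that must match your device:

    #include <CL/cl.h>

    void makeQueuesWithFission(cl_device_id device)
    {
        // Partition the device, e.g. 24 compute units for the video work and 8
        // for the batch work (check CL_DEVICE_PARTITION_PROPERTIES and the
        // device's compute-unit count before relying on these numbers).
        cl_device_partition_property props[] = {
            CL_DEVICE_PARTITION_BY_COUNTS, 24, 8,
            CL_DEVICE_PARTITION_BY_COUNTS_LIST_END, 0
        };
        cl_device_id subDevices[2] = {};
        cl_uint numSub = 0;
        clCreateSubDevices(device, props, 2, subDevices, &numSub);

        cl_int err = CL_SUCCESS;
        cl_context ctx = clCreateContext(nullptr, numSub, subDevices,
                                         nullptr, nullptr, &err);

        // One in-order queue per sub-device: real-time video on the larger
        // partition, the occasional heavy computation on the smaller one.
        // (On OpenCL 2.0+ clCreateCommandQueueWithProperties would be used instead.)
        cl_command_queue videoQueue = clCreateCommandQueue(ctx, subDevices[0], 0, &err);
        cl_command_queue batchQueue = clCreateCommandQueue(ctx, subDevices[1], 0, &err);
        // ... enqueue work, then release the queues, context and sub-devices ...
    }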
