Inter-block synchronization in CUDA

I've spent a month searching for a solution to this problem: I cannot synchronize blocks in CUDA.
I've read a lot of posts about atomicAdd, cooperative groups, etc. I decided to use a global array so that each block writes to one element of it. After this write, a thread of the block waits (i.e., spins in a while loop) until all blocks have written to the global array.
When I use 3 blocks my synchronization works well (because I have 3 SMs), but using only 3 blocks gives me 12% occupancy. So I need to use more blocks, but then they can't be synchronized.
The problem is that a block resident on an SM waits for blocks that have not been scheduled yet, so the SM can never take another block.
What can I do? How can I synchronize blocks when there are more blocks than SMs?
GPU/system specification: compute capability 6.1, 3 SMs, Windows 10, VS2015, GeForce MX150 graphics card.
Please help; I've tried a lot of code but none of it works.

The CUDA programming model offers two methods for inter-block synchronization:
(implicit) Use the kernel launch itself. Before the kernel launch, or after it completes, all blocks (in the launched kernel) are synchronized to a known state. This is conceptually true whether the kernel is launched from host code or as part of a CUDA Dynamic Parallelism launch.
(explicit) Use a grid sync in CUDA cooperative groups. This has a variety of requirements for support, which you are starting to explore in your other question. The simplest test for support is whether the appropriate device property (cooperativeLaunch) is set; you can query it programmatically with cudaGetDeviceProperties(), as in the sketch below.
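
As a minimal host-side sketch (my own illustration, assuming device 0 and a CUDA 9 or newer toolkit, compiled with nvcc), this is how the cooperativeLaunch property can be checked:

    /* Query whether device 0 supports cooperative launch, which is
     * required for grid-wide synchronization with cooperative groups. */
    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void)
    {
        struct cudaDeviceProp prop;
        cudaError_t err = cudaGetDeviceProperties(&prop, 0);

        if (err != cudaSuccess) {
            fprintf(stderr, "cudaGetDeviceProperties: %s\n",
                    cudaGetErrorString(err));
            return 1;
        }

        printf("%s: %d SMs, compute capability %d.%d, cooperativeLaunch = %d\n",
               prop.name, prop.multiProcessorCount, prop.major, prop.minor,
               prop.cooperativeLaunch);

        if (!prop.cooperativeLaunch)
            printf("Grid-wide sync via cooperative groups is not available; "
                   "fall back on the kernel-launch boundary (method 1).\n");

        return 0;
    }

If the property reads 0 (as it may for a GeForce part under the Windows WDDM driver model, depending on the driver), grid-wide sync is not available there, and the kernel-launch boundary of method 1 is the practical fallback.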

Related

Can I use boost named_semaphore in place of ACE_SEMAPHORE as I am trying to move from ACE to boost libraries?

I am moving my code from ACE library support to Boost library support, and I need to replace ACE_Semaphore. It seems C++11 doesn't provide semaphores. I have seen named_semaphore in Boost. Another choice I saw was the POCO semaphore, which would require including the POCO libraries. Please give me suggestions as to which is the best way to move forward.
Edit: this is for intra-process thread synchronization.
The short answer is: yes.
If it's for intra-process synchronization, you can simply emulate a semaphore with a mutex + condition variable:
C++0x has no semaphores? How to synchronize threads?
Note, though, that usually a plain mutex + condition variable will do, because the concrete condition doesn't usually take the form of a counter (e.g. Synchronizing three threads with Condition Variable).
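
To make the mutex + condition variable emulation concrete, here is a minimal sketch of a counting semaphore. It is written with POSIX primitives for brevity; the same structure maps one-to-one onto std::mutex / std::condition_variable (or their Boost equivalents), as the linked answer shows.

    #include <pthread.h>

    typedef struct {
        pthread_mutex_t lock;
        pthread_cond_t  nonzero;
        unsigned        count;
    } counting_sem;

    static void csem_init(counting_sem *s, unsigned initial)
    {
        pthread_mutex_init(&s->lock, NULL);
        pthread_cond_init(&s->nonzero, NULL);
        s->count = initial;
    }

    static void csem_wait(counting_sem *s)      /* P / acquire */
    {
        pthread_mutex_lock(&s->lock);
        while (s->count == 0)                   /* guards against spurious wakeups */
            pthread_cond_wait(&s->nonzero, &s->lock);
        s->count--;
        pthread_mutex_unlock(&s->lock);
    }

    static void csem_post(counting_sem *s)      /* V / release */
    {
        pthread_mutex_lock(&s->lock);
        s->count++;
        pthread_cond_signal(&s->nonzero);       /* wake one waiter */
        pthread_mutex_unlock(&s->lock);
    }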
For inter-process synchronization you would use the named semaphore. An example: How to limit the number of running instances in C++. Note that there are implementation differences¹.
¹ E.g. named_semaphore in Boost allocates its own shared resource, while ACE assumes the user allocates it from existing shared space. In Boost you can obviously do that too, as long as your platform supports native synchronization primitives in shared memory.

Driving motors with image processing on raspberry pi

I have a question about processing images while driving a motor. I did some research, and it seems I probably need to use multiprocessing; however, I couldn't figure out how to run the two tasks together.
Let's say I have two functions, imageProcessing() and DrivingMotor(). As information comes in from imageProcessing(), I need to update DrivingMotor() simultaneously. How can I handle this?
With multiprocessing you must create two processes (a process is a program in execution) and implement inter-process communication so that the processes can talk to each other; this is tedious, hard, and inefficient. Multiprocessing is less efficient than multithreading, so I think you should use multithreading instead: it is efficient, communication between threads is easy, and you can use global data for communication.
Create two threads: one thread handles imageProcessing(), the other DrivingMotor(). The operating system schedules the threads and runs them concurrently.
There is a basic tutorial on multithreading at the link below:
https://www.tutorialspoint.com/python/python_multithreading.htm
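
Purely as an illustration of the two-thread shape described above (in Python the threading module from the linked tutorial plays the same role), here is a sketch in C with POSIX threads; the names image_processing, driving_motor, and latest_frame_info are hypothetical stand-ins for the question's imageProcessing()/DrivingMotor() and their shared data.

    #include <pthread.h>
    #include <unistd.h>

    static pthread_mutex_t state_lock = PTHREAD_MUTEX_INITIALIZER;
    static int latest_frame_info;          /* data shared between the two threads */

    static void *image_processing(void *arg)
    {
        (void)arg;
        for (;;) {
            int result = 42;               /* placeholder for real image analysis */
            pthread_mutex_lock(&state_lock);
            latest_frame_info = result;    /* publish the newest result */
            pthread_mutex_unlock(&state_lock);
            usleep(10000);
        }
        return NULL;
    }

    static void *driving_motor(void *arg)
    {
        (void)arg;
        for (;;) {
            pthread_mutex_lock(&state_lock);
            int info = latest_frame_info;  /* read the newest result */
            pthread_mutex_unlock(&state_lock);
            (void)info;                    /* placeholder: update motor commands */
            usleep(10000);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, image_processing, NULL);
        pthread_create(&t2, NULL, driving_motor, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
    }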

What memory allocation functions can be called from the interrupt environment in AIX?

When I write an AIX kernel extension, xmalloc can be used only in the process environment.
What memory allocation functions can be called from the interrupt environment in AIX?
Thanks.
The network memory allocation routines. Look in /usr/include/net/net_malloc.h; the lowest-level routines are net_malloc and net_free.
I don't see much documentation in IBM's pubs or on the internet; there are a few examples in various header files.
There is no public prototype that I can find for these.
If you look in net_malloc.h, you will see MALLOC and NET_MALLOC macros defined that call it. If you then grep through all the files under /usr/include, you will see uses of these macros. From those uses you can deduce the arguments to the macros, and thus the arguments to net_malloc itself. I would write one routine that acts as a pass-through to net_malloc, so that you control the interface.
On your target system, run "netstat -m". The last bucket size you see is the largest size you can request from net_malloc with the M_NOWAIT flag. M_WAIT can be used only at process time and waits for netm to allocate more memory if necessary; M_NOWAIT returns 0 if there is not enough pinned memory. At interrupt time you must use M_NOWAIT.
There is no real checking for the "type" but it is good to pick an appropriate type for debugging purposes later on. The netm output from kdb shows the type.
In a similar fashion, you can figure out how to call net_free.
It's sad that IBM has chosen not to document this. An alternative way to get this information officially is to pay for an "ISV" question. If you are doing serious AIX development, you will want to become an ISV/Partner; it will save you a lot of heartbreak. I don't know the cost, but it is within reach of small companies and even individuals.
This book is nice to have too.

make_request and queue limits

I'm writing a linux kernel module that emulates a block device.
There are various calls that can be used to tell the kernel the block size, so that it aligns and sizes every request toward the driver accordingly. This is well documented in the "Linux Device Drivers 3" book.
The book describes two methods of implementing a block device: using a "request" function, or using a "make_request" function.
It is not clear whether the queue limit calls apply when using the minimalistic "make_request" approach (which is also the more efficient one if the underlying device really gains nothing from sequential over random IO, as is the case for me).
I would really like to get the kernel to talk to me in 4K block sizes, but I see smaller bios hitting my make_request function.
My question: should the blk_queue_limit_* calls affect the bio size when using make_request?
Thank you in advance.
I think I've found enough evidence in the kernel code that if you use make_request, you'll get correctly sized and aligned bios.
The answer is:
You must call blk_queue_make_request first, because it sets queue limits to defaults. After this, set queue limits as you'd like.
It seems that every part of the kernel that submits bios does check for validity, and it's up to the submitter to do these checks. I've found only incomplete validation in submit_bio and generic_make_request, but as long as no one does tricks, it's fine.
Since submitting correct bios is a policy left to the submitter, and nothing in the middle enforces it, I think I have to implement explicit checks and fail the offending bios. Since it's a policy, it's fine to fail on violation, and since the kernel doesn't enforce it, explicit checks are a good idea.
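
A sketch of what those explicit checks could look like, written against a kernel of roughly the same era as the post (~3.x); the make_request_fn signature and the bio fields have changed in later kernels, so treat this as illustrative rather than definitive:

    /* Illustrative only, ~3.x-era API. blk_queue_make_request() resets the
     * queue limits to defaults, so it is called before the limit setters. */
    #include <linux/blkdev.h>
    #include <linux/bio.h>

    #define MY_BLOCK_SIZE    4096
    #define MY_BLOCK_SECTORS (MY_BLOCK_SIZE >> 9)

    static struct request_queue *my_queue;

    static void my_make_request(struct request_queue *q, struct bio *bio)
    {
        /* Fail bios that violate the 4 KiB size/alignment policy. */
        if ((bio->bi_size % MY_BLOCK_SIZE) ||
            (bio->bi_sector % MY_BLOCK_SECTORS)) {
            bio_io_error(bio);          /* completes the bio with -EIO */
            return;
        }

        /* ... translate and service the bio here ... */

        bio_endio(bio, 0);              /* report success */
    }

    static int my_init_queue(void)
    {
        my_queue = blk_alloc_queue(GFP_KERNEL);
        if (!my_queue)
            return -ENOMEM;

        /* Install the make_request function first (this resets the limits), */
        /* then declare the limits we want submitters to respect.            */
        blk_queue_make_request(my_queue, my_make_request);
        blk_queue_logical_block_size(my_queue, MY_BLOCK_SIZE);
        blk_queue_physical_block_size(my_queue, MY_BLOCK_SIZE);
        blk_queue_io_min(my_queue, MY_BLOCK_SIZE);

        return 0;
    }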
If you want to read a bit more on the story, see http://tlfabian.blogspot.com/2012/01/linux-block-device-drivers-queue-and.html.

Linux Kernel - programmatically retrieve block numbers as they are written to

I want to maintain a list of block numbers as they are physically written, working from the Linux kernel source. I plan to modify the kernel source to do this. I just need to find the structures and functions in the kernel that handle writes to physical partitions, so I can capture the block numbers as they are written.
Is there any way of doing this? Any help is appreciated. If I can find where the kernel actually writes to the partitions and get the block numbers back from there, that would work.
I believe you could do this entirely from userspace, without modifying the kernel, using the blktrace interface.
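
For example, here is a hedged userspace sketch of the ioctl interface that blktrace is built on (the device path, buffer sizes, and write-only mask below are illustrative; it needs root and debugfs mounted). Once tracing is armed, the per-CPU binary trace files appear under /sys/kernel/debug/block/<dev>/ and can be decoded with blkparse.

    #include <stdio.h>
    #include <string.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/fs.h>               /* BLKTRACESETUP, BLKTRACESTART, ... */
    #include <linux/blktrace_api.h>     /* struct blk_user_trace_setup, BLK_TC_WRITE */

    int main(void)
    {
        struct blk_user_trace_setup buts;
        int fd = open("/dev/sda", O_RDONLY | O_NONBLOCK);
        if (fd < 0) { perror("open"); return 1; }

        memset(&buts, 0, sizeof(buts));
        buts.buf_size = 512 * 1024;     /* size of each relay buffer */
        buts.buf_nr   = 4;              /* number of buffers per CPU */
        buts.act_mask = BLK_TC_WRITE;   /* only trace writes */

        if (ioctl(fd, BLKTRACESETUP, &buts) < 0) { perror("BLKTRACESETUP"); return 1; }
        if (ioctl(fd, BLKTRACESTART) < 0)        { perror("BLKTRACESTART"); return 1; }

        sleep(10);                      /* trace for a while */

        ioctl(fd, BLKTRACESTOP);
        ioctl(fd, BLKTRACETEARDOWN);
        close(fd);
        return 0;
    }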
It isn't just one place to check. For instance, if the block device were an iSCSI or AoE target, you would be looking at their respective drivers, and then ultimately at the corresponding code on the other end.
The same would go for normal SCSI, misc flash devices, etc, minus network interaction.
VFS just pulls these all together into a convenient, unified, and consistent interface for calls like read() and write() to work against, while providing buffering. The actual magic, including ordering and write barriers, is handled by the block device drivers themselves.
When the device mapper is in use, the path alters slightly: it goes from VFS -> dm_(target) -> blockdev_driver.
