When creating a logical device, we can specify multiple queues for one queue family index, as we can see in VkDeviceQueueCreateInfo, and we can pass an array of this struct in VkDeviceCreateInfo.
So my first idea for creating, for instance, a transfer queue and a graphics queue is to use the same logical device with two different VkDeviceQueueCreateInfo structs at device creation.
But can I create two logical devices from the same physical device, each with a different VkDeviceQueueCreateInfo (one for graphics, one for transfer)?
And if so, what would be the benefits or drawbacks of one solution compared to the other?
Generally speaking, when assessing possible ways to do things in Vulkan, you should pick the way that seems to require the least stuff.
In this case, you're trying to select between multiple queues and multiple devices. Well, the multi-queue method obviously requires less stuff; you have one device with many queues (in theory), rather than many devices, each providing one of those queues. Same number of queues, but more devices. So pick the option with less.
The Vulkan API is not trying to trick you into taking the slower path. If using multiple devices with one queue per device were the best option, then Vulkan wouldn't offer multiple queues as an option at all.
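For concreteness, here's a rough sketch of the single-device, multi-queue setup (not taken from the question; error handling, features, and extensions are omitted, and graphicsFamily/transferFamily are placeholder indices assumed to have been picked earlier via vkGetPhysicalDeviceQueueFamilyProperties):

```c
#include <vulkan/vulkan.h>

/* Sketch: build one logical device exposing a graphics queue and a
 * transfer queue.  graphicsFamily / transferFamily are assumed to have
 * been chosen beforehand; error handling, enabled features and
 * extensions are omitted for brevity. */
VkDevice create_device_with_two_queues(VkPhysicalDevice physicalDevice,
                                       uint32_t graphicsFamily,
                                       uint32_t transferFamily,
                                       VkQueue *graphicsQueue,
                                       VkQueue *transferQueue)
{
    float priority = 1.0f;

    VkDeviceQueueCreateInfo queueInfos[2] = {{0}, {0}};
    queueInfos[0].sType            = VK_STRUCTURE_TYPE_DEVICE_QUEUE_CREATE_INFO;
    queueInfos[0].queueFamilyIndex = graphicsFamily;
    queueInfos[0].queueCount       = 1;
    queueInfos[0].pQueuePriorities = &priority;

    queueInfos[1] = queueInfos[0];
    queueInfos[1].queueFamilyIndex = transferFamily;

    VkDeviceCreateInfo deviceInfo = {0};
    deviceInfo.sType                = VK_STRUCTURE_TYPE_DEVICE_CREATE_INFO;
    /* Duplicate family indices are not allowed, so only pass two
     * structs when the families really differ. */
    deviceInfo.queueCreateInfoCount = (graphicsFamily == transferFamily) ? 1 : 2;
    deviceInfo.pQueueCreateInfos    = queueInfos;

    VkDevice device = VK_NULL_HANDLE;
    vkCreateDevice(physicalDevice, &deviceInfo, NULL, &device);

    /* Both queues come from the same logical device. */
    vkGetDeviceQueue(device, graphicsFamily, 0, graphicsQueue);
    vkGetDeviceQueue(device, transferFamily, 0, transferQueue);
    return device;
}
```

If both kinds of work end up in the same family and that family offers more than one queue, you would instead set queueCount = 2 in a single VkDeviceQueueCreateInfo and retrieve queue indices 0 and 1; passing the same family index in two structs is not allowed.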
To get into more detail, you say that you want to do memory transfers outside of graphics operations. OK, fine.
Physical devices do not have to provide multiple queues. Some devices provide exactly one queue family, which can have exactly one VkQueue created from it. Obviously if a device allows for multiple queues, you should use them. But if it only allows for one, then you might have reason to think that you should just create multiple devices and work that way.
Even in this case, do not do this.
Here's the thing: if a GPU could actually do multiple operations independently such that they overlap... they'd expose multiple queues. The fact that a physical device does not do so therefore suggests that independent execution of different operations simply is not possible at the GPU level.
This means that, even if you use multiple devices, the transfer and graphics operation will almost certainly execute in some order. That is, whichever vkQueueSubmit is issued first will be the one that does its work first.
So using multiple devices gives you no actual GPU execution overlap (in theory). You've gained nothing, and you've lost explicit control over the order in which these operations are issued.
Now, it may be that the execution of a transfer operation on the graphics queue will not inhibit the execution of rendering commands on that same queue. That is, transfer operations can start, then rendering commands can start while the transfer completes via DMA or something. So they start executing in order, but finish executing in any order.
Even if that's the case, working across devices doesn't give you any advantages here. As previously noted, you lose control over the order in which these commands are submitted. Graphics commands tend to hog the command queue, while a single transfer command could (on such a system) be processed and then execute in the background while processing unrelated commands. In such a case, it's important to send any transfer commands before graphics commands for a particular frame.
And if you have two devices, you have to make two vkQueueSubmit calls rather than one. And vkQueueSubmit calls are not known for being fast.
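To illustrate (a hedged sketch, not the poster's code): on a single device you can even fold both workloads into one vkQueueSubmit call, with the transfer batch listed first; transferCmd, graphicsCmd and frameFence are placeholder names.

```c
#include <vulkan/vulkan.h>

/* Sketch: submit a frame's transfer and graphics command buffers with a
 * single vkQueueSubmit call, transfer batch first.  Synchronization
 * between the two batches (semaphores/barriers) is omitted here. */
void submit_frame(VkQueue queue,
                  VkCommandBuffer transferCmd,
                  VkCommandBuffer graphicsCmd,
                  VkFence frameFence)   /* may be VK_NULL_HANDLE */
{
    VkSubmitInfo submits[2] = {{0}, {0}};

    submits[0].sType              = VK_STRUCTURE_TYPE_SUBMIT_INFO;
    submits[0].commandBufferCount = 1;
    submits[0].pCommandBuffers    = &transferCmd;   /* transfer first */

    submits[1].sType              = VK_STRUCTURE_TYPE_SUBMIT_INFO;
    submits[1].commandBufferCount = 1;
    submits[1].pCommandBuffers    = &graphicsCmd;   /* then the rendering */

    /* One submission call instead of two (or instead of two devices). */
    vkQueueSubmit(queue, 2, submits, frameFence);
}
```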
There are a host of other reasons not to try multi-device stuff for this. For example, if later rendering operations need access to the transferred data, this means you need external memory and external synchronization primitives to synchronize access between the devices. And so on.
At work we are building an airline machine: a machine which holds bicycle frames and has several stations.
Depending on how many stations there are, the number of physical IO blocks on the EtherCAT bus may differ. This may vary per customer.
The number of stations can be entered via a user interface, so the Beckhoff controller can calculate how much IO should be present... in theory, that is.
We would like a single program for this machine that still works when not all IO is present on the EtherCAT bus, but we do not know how to achieve this.
We have found out about conditional pragmas, but that is our last resort.
This is possible to achieve. I've worked on a project where parts of the EtherCAT topology were changing every minute.
You achieve this by a combination of EtherCAT couplers/junctions with identity switches (such as the EK1101-0010) and the Hot Connect functionality of EtherCAT. Depending on your real-time requirements and how fast you want to be able to do the switching of the EtherCAT fieldbus slaves, you might also want to consider fast hot connect.
Using the above you can change your hardware configuration during runtime.
I don't think it is possible to change the number of IO links while the program is executing. Whenever a change is made to some IO links, you have to reactivate the configuration.
As you mentioned, you can use conditional pragmas in combination with TcLinkTo attributes to change IO links.
When using events in CUDA, I typically create an event and immediately record it on some stream. After synchronizing, I don't bother to hold on to that cudaEvent_t to use it elsewhere; I just destroy it.
Other than avoiding the overhead of event creation and destruction, is there any other benefit to "recycling" events? If not, why did NVIDIA bother to separate cudaEventCreate() from cudaEventRecord()?
First, let me try to answer the question of what the overhead could be. Since we don't have the source code for CUDA events, everything here is based on reasonable guesses; one could make entirely different design decisions and still implement CUDA events with the same or similar behavior.
For the timing use case, we know that at least the timestamp of the event is recorded somewhere. As the event happens on the device side, I think the timestamp is recorded in device-side memory to avoid going over PCIe (high overhead) while recording. Since you eventually read the time from the host side, the recorded timestamp must be transferred over PCIe at some point (probably during cudaEventSynchronize()).
So during the whole procedure you need some space in both host and device memory to store the timestamp. cudaEventCreate()/cudaEventDestroy() look like the perfect places to allocate and release that memory, just like malloc()/free(). That allocation is also exactly the kind of overhead you want to avoid when recording the time repeatedly (i.e., when reusing the event).
So there are two types of overhead here: allocating device and host memory, and the PCIe transfer. This is my guess; maybe there is another way to implement the timing functionality without these overheads.
Finally, letting you avoid these overheads by reusing events seems like a good reason for NVIDIA to keep cudaEventCreate() separate from cudaEventRecord().
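To make the reuse pattern concrete, here's a minimal sketch (my own illustration, not anything from NVIDIA's documentation) that creates a pair of events once and reuses them to time repeated device work; an async device-to-device copy stands in for whatever you are actually timing:

```c
#include <cuda_runtime.h>
#include <stdio.h>

/* Sketch: create the events once, reuse them across many timed
 * iterations, destroy them once. */
int main(void)
{
    const size_t n = 1 << 24;
    float *src, *dst;
    cudaMalloc((void **)&src, n * sizeof(float));
    cudaMalloc((void **)&dst, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);          /* pay the creation cost once */
    cudaEventCreate(&stop);

    for (int i = 0; i < 100; ++i) {
        cudaEventRecord(start, 0);    /* reuse the same events ...    */
        cudaMemcpyAsync(dst, src, n * sizeof(float),
                        cudaMemcpyDeviceToDevice, 0);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);   /* ... on every iteration       */

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("iteration %d: %.3f ms\n", i, ms);
    }

    cudaEventDestroy(start);          /* pay the destruction cost once */
    cudaEventDestroy(stop);
    cudaFree(src);
    cudaFree(dst);
    return 0;
}
```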
My understanding was that one workgroup is executed on the GPU, and then the next one is executed.
Unfortunately, my observations lead to the conclusion that this is not correct.
In my implementation, all workgroups share a big global memory buffer.
All workgroups perform read and write operations to various positions on this buffer.
If the kernel operates on the buffer directly, no conflicts arise.
If a workgroup loads a chunk into local memory, performs some computation, and copies the result back, the global memory gets corrupted by other workgroups.
So how can I avoid this behaviour?
Can I somehow tell OpenCL to execute only one workgroup at a time, or rearrange the execution order so that I don't get conflicts?
The answer is that it depends. A whole workgroup must be executed concurrently (though not necessarily in parallel) on the device, at least when barriers are present, because the workgroup must be able to synchronize and communicate. There is no rule that says work-groups must be concurrent - but there is no rule that says they cannot. Usually hardware will place a single work-group on a single compute core. Most hardware has multiple cores, which will each get a work-group, and to cover latency a lot of hardware will also place multiple work-groups on a single core if there is capacity available.
You have no way to control the order in which work-groups execute. If you want them to serialize you would be better off launching just one work-group and writing a loop inside to serialize the series of work chunks in that same work-group. This is often a good strategy in general even with multiple work-groups.
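As a hedged sketch of that single-work-group strategy (OpenCL C; CHUNK, the buffer layout and the trivial computation are placeholders, and the kernel assumes it is launched with exactly one work-group of CHUNK work-items covering num_chunks * CHUNK elements):

```c
/* OpenCL C sketch: a single work-group walks over all chunks itself,
 * so the chunks are processed strictly one after another. */
#define CHUNK 256

__kernel void process_serially(__global float *buf, int num_chunks)
{
    __local float tile[CHUNK];
    int lid = get_local_id(0);

    for (int c = 0; c < num_chunks; ++c) {
        /* cooperative load of chunk c into local memory */
        tile[lid] = buf[c * CHUNK + lid];
        barrier(CLK_LOCAL_MEM_FENCE);   /* all loads done before compute */

        /* placeholder computation on the tile */
        tile[lid] = tile[lid] * 2.0f;
        barrier(CLK_LOCAL_MEM_FENCE);   /* all computation done */

        /* write the result back before moving to the next chunk */
        buf[c * CHUNK + lid] = tile[lid];
        barrier(CLK_GLOBAL_MEM_FENCE);
    }
}
```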
If you really only want one work-group at a time, though, you will probably be using only a tiny part of the hardware. Most hardware cannot spread a single work-group across the entire device - so if you're stuck to one core on a 32-core GPU you're not getting much use of the device.
You need to set the global size and dimensions to that of a single work group, and enqueue a new NDRange for each group. Essentially, you break up the call to your kernel into many smaller calls. Make sure your command queue does not allow out-of-order execution, so that the enqueued kernels execute strictly one after another.
This will likely result in poorer performance, but you will get the dedicated global memory access you are looking for.
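A rough host-side sketch of that approach (assuming an in-order command queue; queue, kernel, num_groups and group_size are placeholders, and the kernel is assumed to locate its chunk via the global work offset):

```c
#include <CL/cl.h>

/* Sketch: launch one work-group per clEnqueueNDRangeKernel call on an
 * in-order queue, so the groups execute strictly one after another. */
void run_groups_serially(cl_command_queue queue, cl_kernel kernel,
                         size_t num_groups, size_t group_size)
{
    for (size_t g = 0; g < num_groups; ++g) {
        size_t offset = g * group_size;   /* where this group's work starts */
        size_t global = group_size;       /* exactly one work-group          */
        size_t local  = group_size;

        clEnqueueNDRangeKernel(queue, kernel, 1,
                               &offset, &global, &local,
                               0, NULL, NULL);
        /* in-order queue: the next enqueued kernel will not start on the
         * device until this one has finished */
    }
    clFinish(queue);                      /* wait for everything to complete */
}
```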
Yes, the groups can be executed in parallel; this is normally a very good thing.
The number of workgroups that can be concurrently launched on a ComputeUnit (AMD) or SMX (Nvidia) depends on the availability of GPU hardware resources, important ones being vector-registers and workgroup-level-memory** (called LDS for AMD and shared memory for Nvidia). If you want to launch just one workgroup on the CU/SMX, make sure that the workgroup consumes a bulk of these resources and blocks further workgroups on the same CU/SMX. You would, however, still have other workgroups executing on other CUs/SMXs - a GPU normally has multiple of these.
I am not aware of any API which lets you pin a kernel to a single CU/SMX.
** It also depends on the number of concurrent wavefronts/warps the scheduler can handle.
Which of these two different models would be more efficient (considering thrashing, utilization of processor cache, overall design, everything, etc.)?
1 IOCP and spinning up X threads (where X is the number of processors the computer has). This would mean that my "server" would only have 1 IOCP (queue) for all requests and X threads to serve/handle them. I have read many articles discussing the efficiency of this design. With this model I would have 1 listener that would also be associated with the IOCP. Let's assume that I could figure out how to keep the packets/requests synchronized.
X IOCPs (where X is the number of processors the computer has), each with 1 thread. This would mean that each processor has its own queue and 1 thread to serve/handle it. With this model I would have a separate listener (not using IOCP) that would handle incoming connections and assign each SOCKET to the proper IOCP (one of the X that were created). Let's assume that I could figure out the load balancing.
Using an overly simplified analogy for the two designs (a bank):
One line with several cashiers to handle the transactions. Each person is in the same line and each cashier takes the next available person in line.
Each cashier has their own line and the people are "placed" into one of those lines.
Between these two designs, which one is more efficient? In each model the overlapped I/O structures would be allocated with VirtualAlloc with MEM_COMMIT (as opposed to "new"), so the swap file should not be an issue (no paging). Based on how it has been described to me, using VirtualAlloc with MEM_COMMIT, the memory is reserved and is not paged out. This would allow the SOCKETs to write the incoming data right into my buffers without going through intermediate layers. So I don't think thrashing should be a factor, but I might be wrong.
Someone was telling me that #2 would be more efficient but I have not heard of this model. Thanks in advance for your comments!
I assume that for #2 you plan to manually associate your sockets with an IOCP that you decide is 'best' based on some measure of 'goodness' at the time the socket is accepted? And that somehow this measure of 'goodness' will persist for the life of the socket?
With IOCP used the 'standard' way, i.e. your option number 1, the kernel works out how best to use the threads you have and allows more to run if any of them block. With your method, assuming you somehow work out how to distribute the work, you are going to end up with more threads running than with option 1.
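For reference, the "standard" shape of option 1 looks roughly like this (a sketch only: error handling, WSAStartup, AcceptEx plumbing and per-socket context types are omitted, and handle_io is a placeholder for your own completion handling):

```c
#include <winsock2.h>
#include <windows.h>

/* Sketch of option 1: one completion port, one worker thread per CPU. */
static HANDLE g_iocp;

static DWORD WINAPI worker_thread(LPVOID unused)
{
    (void)unused;
    for (;;) {
        DWORD bytes = 0;
        ULONG_PTR key = 0;
        LPOVERLAPPED ov = NULL;

        /* Block until any socket associated with the port completes I/O;
         * the kernel decides which worker thread to wake. */
        if (GetQueuedCompletionStatus(g_iocp, &bytes, &key, &ov, INFINITE)) {
            /* handle_io((PER_SOCKET_CONTEXT *)key, ov, bytes); */
        }
    }
    return 0;
}

void start_server(void)
{
    SYSTEM_INFO si;
    GetSystemInfo(&si);

    /* One IOCP shared by every socket and every worker. */
    g_iocp = CreateIoCompletionPort(INVALID_HANDLE_VALUE, NULL, 0, 0);

    for (DWORD i = 0; i < si.dwNumberOfProcessors; ++i)
        CreateThread(NULL, 0, worker_thread, NULL, 0, NULL);
}

/* Each accepted socket is associated with the same port, e.g.:
 * CreateIoCompletionPort((HANDLE)sock, g_iocp, (ULONG_PTR)context, 0); */
```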
Your #2 option also prevents you from using AcceptEx() for overlapped accepts and this is more efficient than using a normal accept loop as you remove a thread (and the resulting context switching and potential contention) from the scene.
Your analogy breaks down. It's actually more a case of having one queue with X bank tellers, where you join the queue and know that you'll be served in an efficient order, as opposed to each teller having their own queue and you having to guess that the queue you join doesn't contain a whole bunch of people who want to open new accounts while the one next to you contains only people who want to do some paying in. The single queue ensures that you get handled efficiently.
I think you're confused about MEM_COMMIT. It doesn't mean that the memory isn't in the paging file and won't be paged. The usual reason for using VirtualAlloc for overlapped buffers is to ensure alignment on page boundaries and so reduce the number of pages that are locked for I/O (a page-sized buffer can be allocated on a page boundary and so only take one page, rather than happening to span two because the memory manager decided to use a block that doesn't start on a page boundary).
In general I think you're attempting to optimise something way ahead of schedule. Get an efficient server working using IOCP the normal way first and then profile it. I seriously doubt that you'll even need to worry about building your #2 version ... Likewise, use new to allocate your buffers to start with and then switch to the added complexity of VirtualAlloc() when you find that your server fails due to ENOBUFS and you're sure that's caused by the I/O locked page limit and not lack of non-paged pool (you do realise that you have to allocate in 'allocation granularity' sized chunks for VirtualAlloc()?).
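If you do eventually move to VirtualAlloc(), the allocation being described is just this (a minimal sketch; buffer size and lifetime management are up to you):

```c
#include <windows.h>

/* Sketch: reserve + commit a page-aligned buffer for overlapped I/O.
 * VirtualAlloc with a NULL base address returns memory aligned to at
 * least a page, so a page-sized buffer occupies exactly one page
 * instead of spanning two. */
void *alloc_io_buffer(SIZE_T size)
{
    return VirtualAlloc(NULL, size, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
}

void free_io_buffer(void *p)
{
    VirtualFree(p, 0, MEM_RELEASE);   /* size must be 0 with MEM_RELEASE */
}
```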
Anyway, I have a free IOCP server framework that's available here: http://www.serverframework.com/products---the-free-framework.html which might help you get started.
Edited: The complex version that you suggest could be useful in some NUMA architectures where you use NIC teaming to have the switch split your traffic across multiple NICs, bind each NIC to a different physical processor, and then bind your IOCP threads to the same processor. You then allocate memory from that NUMA node and effectively have your network switch load balance your connections across your NUMA nodes. I'd still suggest that it's better, IMHO, to get a working server which you can profile using the "normal" method of using IOCP first, and only move towards the more complex architecture once you know that cross-NUMA-node issues are actually affecting your performance.
Queuing theory tells us that a single queue has better characteristics than multiple queues. You could possibly get around this with work-stealing.
The multiple-queues method should have better cache behavior. Whether it is significantly better depends on how many received packets are associated with a single transaction. If a request fits in a single incoming packet, then it'll be associated with a single thread even with the single-IOCP approach.
My MPI application has some processes that generate large amounts of data. Say we have N+1 processes (one for master control, the others are workers), and each worker process generates large data, which is currently simply written to a normal file, named file1, file2, ..., fileN. The size of each file may be quite different. Now I need to send each fileM to the rank-M process for the next job, so it's essentially an all-to-all data transfer.
My problem is how I should use the MPI API to send these files efficiently. I used to transfer them via a Windows shared folder, but I don't think that's a good idea.
I have thought about MPI_File and MPI_Alltoall, but these functions don't seem well suited to my case. Simple MPI_Send and MPI_Recv seem hard to use because every process needs to transfer large data, and I don't want to use a distributed file system for now.
It's not possible to answer your question precisely without a lot more data, data that only you have right now. So here are some generalities, you'll have to think about them and see if and how to apply them in your situation.
If your processes are generating large data sets they are unlikely to be doing so instantaneously. Instead of thinking about waiting until the whole data set is created, you might want to think about transferring it chunk by chunk.
I don't think that MPI_Send and _Recv (or the variations on them) are hard to use for large amounts of data. But you need to give some thought to finding the right amount to transfer in each communication between processes. With MPI it is not a simple case of there being a message startup time plus a message transfer rate which apply to all messages sent. Some IBM implementations, for example, on some of their hardware had different latencies and bandwidths for small and large messages. However, you have to figure out for yourself what the tradeoffs between bandwidth and latency are for your platform. The only general advice I would give here is to parameterise the message sizes and experiment until you maximise the ratio of computation to communication.
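As a hedged sketch of the chunked approach (CHUNK_BYTES is exactly the kind of parameter you would tune per platform; tags and error handling are simplified, and the total size is sent first so the receiver can allocate):

```c
#include <mpi.h>
#include <stdlib.h>

/* Sketch: transfer a large buffer between two ranks in fixed-size
 * chunks.  CHUNK_BYTES is the tunable message size. */
#define CHUNK_BYTES (4L * 1024 * 1024)

void send_large(const char *buf, long total, int dest, MPI_Comm comm)
{
    MPI_Send(&total, 1, MPI_LONG, dest, 0, comm);   /* announce the size */
    for (long off = 0; off < total; off += CHUNK_BYTES) {
        int n = (int)((total - off < CHUNK_BYTES) ? total - off : CHUNK_BYTES);
        MPI_Send(buf + off, n, MPI_CHAR, dest, 1, comm);
    }
}

char *recv_large(long *total, int src, MPI_Comm comm)
{
    MPI_Recv(total, 1, MPI_LONG, src, 0, comm, MPI_STATUS_IGNORE);
    char *buf = malloc((size_t)*total);
    for (long off = 0; off < *total; off += CHUNK_BYTES) {
        int n = (int)((*total - off < CHUNK_BYTES) ? *total - off : CHUNK_BYTES);
        MPI_Recv(buf + off, n, MPI_CHAR, src, 1, comm, MPI_STATUS_IGNORE);
    }
    return buf;
}
```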
As an aside, one of the tests you should already have done is measured message transfer rates for a wide range of sizes and communications patterns on your platform. That's kind of a basic shake-down test when you start work on a new system. If you don't have anything more suitable, the STREAMS benchmark will help you get started.
I think that all-to-all transfers of large amounts of data are an unusual scenario in the kinds of programs for which MPI is typically used. You may want to give some serious thought to redesigning your application to avoid such transfers. Of course, only you know if that is feasible or worthwhile. From what little information you provide, it seems as if you might be implementing some kind of pipeline; in such cases the usual pattern of communication is from process 0 to process 1, process 1 to process 2, 2 to 3, and so on.
Finally, if you happen to be working on a computer with shared memory (such as a multicore PC) you might think about using a shared memory approach, such as OpenMP, to avoid passing large amounts of data around.