After reading this GitHub issue I feel like I'm missing something in my understanding of queues:
https://github.com/tensorflow/tensorflow/issues/3009
I thought that when loading data into a queue, it will get pre-transferred to the GPU while the last batch is getting computed, so that there is virtually no bandwidth bottleneck, assuming computation takes longer than the time to load the next batch.
But the above link suggests that there is an expensive copy from queue into the graph (numpy <-> TF) and that it would be faster to load the files into the graph and do preprocessing there instead. But that doesn't make sense to me. Why does it matter if I load a 256x256 image from file vs a raw numpy array? If anything, I would think that the numpy version is faster. What am I missing?
There is no GPU implementation of queues, so a queue only loads data into main memory and there is no asynchronous prefetching onto the GPU. You could build something like a GPU-based queue using variables pinned to gpu:0.
The documentation suggests that it is possible to pin a queue to a device:
N.B. Queue methods (such as q.enqueue(...)) must run on the same device as the queue. Incompatible device placement directives will be ignored when creating these operations.
But the above implies to me that any variables one is attempting to enqueue should already be on the GPU.
This comment suggests it may be possible to use tf.identity to perform the prefetch.
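The overlap the question expects (loading the next batch while the current one is being computed) can be sketched on the host side in plain Python. This is only an illustration of the prefetching idea with a bounded buffer and a background thread. It does not use any TensorFlow API and does not move anything to the GPU, and `load_batch` and `compute` are hypothetical stand-ins.

```python
import queue
import threading
import time

def load_batch(i):
    # Hypothetical loader: stands in for reading and decoding one batch.
    time.sleep(0.01)
    return [i] * 4

def compute(batch):
    # Hypothetical compute step: stands in for one training step.
    time.sleep(0.02)
    return sum(batch)

def prefetching_batches(num_batches, depth=2):
    """Yield batches while a background thread loads the next ones."""
    buf = queue.Queue(maxsize=depth)  # bounded, so the loader cannot run far ahead
    sentinel = object()               # marks the end of the stream

    def producer():
        for i in range(num_batches):
            buf.put(load_batch(i))    # blocks while the buffer is full
        buf.put(sentinel)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buf.get()              # blocks only if the loader has fallen behind
        if item is sentinel:
            return
        yield item

# While compute() works on one batch, the producer thread is loading the next.
results = [compute(batch) for batch in prefetching_batches(5)]
```

As long as `compute` takes longer than `load_batch`, the consumer rarely waits on the buffer. The copy from host memory to the device is a separate step that this sketch does not cover.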
I am currently using the function "getdata" from the imaqtool library to get my camera data, and I do some post-processing on my GPU.
Hence, I would like the data to be transferred directly from the CPU buffer memory to my GPU memory.
It is my understanding that "getdata" moves data from CPU memory (buffer) to CPU memory. Hence, it should be trivial to transfer these data to my GPU directly.
However, I cannot find anything about it.
Any help is appreciated.
In short: MATLAB is not the right tool for what you want. MATLAB provides quite an easy interface, but that means you don't have full control over some things, the main one being memory allocation and management. This is generally a good thing, as handling memory is non-trivial, but in your case it is exactly what you are asking for.
If you want to build a fast acquisition system where the memory is fully controlled by you, you will need to use low-level languages such as C++/CUDA, and work with asynchronous operations and threads.
In MATLAB, the most flexibility you can get is calling gpuArray(captured_data) once the data is on the CPU.
The motivation for this question is to understand how software memory prefetching affects my program.
I'm building a multi-threaded data partitioner. Each thread sequentially reads a local source array and writes to random positions in another local destination array. As the content of the source array won't be used in the near future, I'd like to use the prefetchnta instruction to keep it from accumulating in the caches. On the other hand, each thread has a local write combiner that combines writes and commits them to the local destination array using _mm_stream_si64. The intuition and goal is to make sure each thread has a fixed amount of data cache to work with that is never occupied by unused data.
Is this design reasonable? I'm not familiar with how CPUs work and cannot be sure whether this strategy actually disables the hardware prefetchers that would presumably defeat this approach. If this is just me being naive, what's the right way to achieve this goal?
I am new to boost geometry. In my case, I need to handle a large number of data nodes, so they cannot all be held in memory.
Is it possible to use boost geometry together with local file system?
A generic answer is: use a memory mapped file from Boost Interprocess (IPC) with (boost) containers that use the IPC allocators. [1]
This will make it possible to work with /virtually/ unlimited data sizes transparently (at least in 64bit processes).
However Paging Is Expensive.
Boost Geometry is likely not optimized for sequential access patterns, so you may need to be very careful about which algorithms you use and in what order you apply them. Otherwise, scaling to this kind of volume (I'm assuming >16 GB for simplicity) will in practice turn out unbearably slow due to paging.
In all usual circumstances, scaling to non-trivial volumes involves tuning the algorithms or even writing targeted ones for the purpose.
Without any knowledge of the actual task at hand, you could try:
- starting with memory mapped data allocation
- slowly building up the algorithmic steps, one at a time
- at each step, incrementally growing your data set while keeping a close eye on the profiler
Your profiler will tell you which step introduces a performance bottleneck and at what volume it becomes discernible.
[1] this gives you persistence for "free"; however, keep in mind you are responsible for transactions and fsync-ing at proper times. Also, contiguous/sequential containers work best.
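To make the memory-mapped, sequential-access idea concrete, here is a minimal sketch using Python's standard `mmap` module rather than Boost Interprocess. The flat file layout (pairs of doubles), the helper names, and the bounding-box computation are assumptions chosen only for illustration. The point is that the operating system pages data in on demand while the code scans the file front to back.

```python
import mmap
import os
import struct
import tempfile

POINT = struct.Struct("<dd")  # one 2-D point: x and y as little-endian doubles

def create_point_file(path, count):
    """Write `count` hypothetical points (i, 2*i) to a flat binary file."""
    with open(path, "wb") as f:
        for i in range(count):
            f.write(POINT.pack(float(i), 2.0 * i))

def bounding_box(path):
    """Scan all points sequentially through a read-only memory map."""
    min_x = min_y = float("inf")
    max_x = max_y = float("-inf")
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Front-to-back access keeps paging predictable; only the pages
            # currently being touched need to be resident in memory.
            for offset in range(0, len(mm), POINT.size):
                x, y = POINT.unpack_from(mm, offset)
                min_x, max_x = min(min_x, x), max(max_x, x)
                min_y, max_y = min(min_y, y), max(max_y, y)
    return (min_x, min_y, max_x, max_y)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "points.bin")
    create_point_file(path, 1000)
    box = bounding_box(path)
```

An algorithm that jumps around the file at random would touch far more pages per unit of work, which is where the paging cost mentioned above shows up.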
I've always been wondering about the caching behaviour of global data in OpenCL.
Let's say I have a pointer to global memory in a kernel.
Now I read the location the pointer points to.
Later in the kernel I might need the same data again, so I read it again through the pointer.
Now the question is, will this data be cached, or will it be reread from global memory every single time because other threads could have modified it?
If it's not cached, then I'd have to make a local copy every time so I don't lose tons of performance by constantly accessing global memory.
I know this might be vendor specific, but what do the specs say about this?
There is some caching, but the key to great GPU compute performance is to move data that is accessed many times into private or shared local memory rather than re-reading it. In a way, you can think of this as "you control the caching". In OpenCL this is done in your kernel (in parallel!), followed by a memory barrier to ensure all work items have finished the copy; after that, your algorithm has access to the data in fast memory. See the matrix multiply example: since each column and row contributes to multiple output values, copying them to shared local memory accelerates the algorithm.
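The effect of copying tiles into fast memory can be illustrated with a small simulation in plain Python. This is not OpenCL code: `CountingMatrix` is a hypothetical stand-in for global memory that simply counts element reads, and the Python lists `a_tile` and `b_tile` play the role of local memory. In a real kernel the tiles would live in `__local` memory and the copy would be followed by `barrier(CLK_LOCAL_MEM_FENCE)`.

```python
class CountingMatrix:
    """Square matrix that counts element reads, standing in for global memory."""

    def __init__(self, rows):
        self.rows = rows
        self.reads = 0

    def __getitem__(self, idx):
        i, j = idx
        self.reads += 1
        return self.rows[i][j]

def matmul_naive(a, b, n):
    """Every multiply-add reads both operands from 'global memory'."""
    c = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            c[i][j] = sum(a[i, k] * b[k, j] for k in range(n))
    return c

def matmul_tiled(a, b, n, t):
    """Tiled variant; assumes n is a multiple of the tile size t."""
    c = [[0] * n for _ in range(n)]
    for i0 in range(0, n, t):
        for j0 in range(0, n, t):
            for k0 in range(0, n, t):
                # Copy one tile of each input into "local memory" once...
                a_tile = [[a[i0 + i, k0 + k] for k in range(t)] for i in range(t)]
                b_tile = [[b[k0 + k, j0 + j] for j in range(t)] for k in range(t)]
                # ...then reuse every copied element t times from the fast copy.
                for i in range(t):
                    for j in range(t):
                        c[i0 + i][j0 + j] += sum(
                            a_tile[i][k] * b_tile[k][j] for k in range(t)
                        )
    return c

n, t = 8, 4
rows_a = [[i + j for j in range(n)] for i in range(n)]
rows_b = [[i * j for j in range(n)] for i in range(n)]

a1, b1 = CountingMatrix(rows_a), CountingMatrix(rows_b)
c_naive = matmul_naive(a1, b1, n)
naive_reads = a1.reads + b1.reads

a2, b2 = CountingMatrix(rows_a), CountingMatrix(rows_b)
c_tiled = matmul_tiled(a2, b2, n, t)
tiled_reads = a2.reads + b2.reads
```

Both versions produce the same product, but the naive one performs 2·n³ global reads while the tiled one performs 2·n³/t, so larger tiles (as far as local memory allows) mean proportionally fewer trips to global memory.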
Those who want the benefits of local caching for work-items within a work-group, for example on FPGAs, can read this paper by Andrew Ling at IWOCL2017 https://www.iwocl.org/wp-content/uploads/iwocl2017-andrew-ling-fpga-sdk.pdf. It is a good example of correct usage of local caches and clever communication for dataflow computing. Those who want the convenience of a cache in a parallel peer-to-peer setting, and still want the hardware to do this for them, should consider POWER8 or POWER9 chips. These conveniences come at a cost: to cache global or virtual memory, the cluster interconnect may need several TBs of bandwidth. The real question is: what is the value of caching for dataflow compute (e.g. ML), especially on clusters, versus reducing communication and increasing data reuse by other means?
I'm working on an application where I process a video feed in real time on my GPU, and once in a while I also need to do some resource-intensive calculations on the same GPU. My problem is that I want to keep my video processing at real-time speed while doing the extra work in parallel whenever it comes up.
The way I think this should be done is with two command-queues, one for the real-time video processing and one for the intensive calculations. However, I have no idea how this will turn out with the computing resources of the GPU: will there be equally many workers assigned to each command-queue during parallel execution (so I could expect a slowdown of about 50% in my real-time computations)? Or is it device dependent?
The OpenCL specification leaves it up to the vendor to decide how to balance execution resources between multiple command queues. So a vendor could implement OpenCL in such a way that causes the GPU to work on only one kernel at a time. That would be a legal implementation, in my opinion.
If you really want to solve your problem in a device-independent way, I think you need to figure out how to break up your large non-real-time computation into smaller computations.
AMD has some extensions (some of which I think got adopted in OpenCL 1.2) for device fission, which means you can reserve some portion of the device for one context and use the rest for others.
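The device-independent suggestion above (break the large computation into smaller ones) can be sketched as a simple scheduling loop. This is plain Python rather than OpenCL host code, and the names `heavy_job_chunks` and `run` as well as the sum-of-squares workload are assumptions for illustration. In OpenCL terms, each chunk would correspond to one small kernel enqueue that fits between two frames.

```python
def heavy_job_chunks(data, chunk_size):
    """Yield the partial result of one small chunk of the big job at a time."""
    for start in range(0, len(data), chunk_size):
        yield sum(x * x for x in data[start:start + chunk_size])

def run(frames, job):
    """Process every frame first, then do at most one chunk of background work."""
    log = []
    heavy_total = 0
    for frame in frames:
        log.append(("frame", frame))   # real-time work always goes first
        partial = next(job, None)      # None once the background job is finished
        if partial is not None:
            heavy_total += partial
            log.append(("chunk", partial))
    return log, heavy_total

log, heavy_total = run(range(5), heavy_job_chunks(list(range(10)), chunk_size=4))
```

The chunk size bounds how long any single piece of background work can delay the next frame, so it is the knob to tune against the frame budget.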