Does anyone know how to declare that an array of data in ArrayFire should be stored in shared memory instead of global memory? Is this possible? I have a small set of data that needs to be randomly accessible by all threads. It's a constant look-up table that should be available for the life of the application. Maybe I am just missing the obvious or something, but reading the ArrayFire docs and googling have not turned up any info on how I tell ArrayFire that my data needs to go into shared memory.
In CUDA, shared memory (local memory in OpenCL) is a very fast type of memory located on the GPU. It has the same lifetime as a thread block and can only be accessed by threads within that block, so even in raw CUDA it cannot be used to store persistent data that needs to be used by multiple kernels. You might want to look into constant or texture memory to implement a look-up table (LUT); those memory types are usually better suited to the access pattern of a LUT.
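For illustration only (the kernel and argument names below are invented, not ArrayFire API), a small read-only LUT can be placed in the constant address space. In OpenCL C that looks roughly like this; the CUDA analogue would be a file-scope __constant__ array filled once with cudaMemcpyToSymbol:

// Hypothetical OpenCL C kernel: 'lut' lives in the __constant address space,
// which is read-only to the kernel and typically served by a small on-chip cache.
__kernel void apply_lut(__global const uchar *in,
                        __global uchar *out,
                        __constant uchar *lut)  // filled once from the host
{
    size_t i = get_global_id(0);
    out[i] = lut[in[i]];  // random access; the same table is visible to all work-items
}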
ArrayFire provides a high-level API that makes GPU programming easy, with some of the fastest implementations of many commonly used functions. With ArrayFire you cannot specify which type of memory is used, but you are free to use the data in your own kernel. If you are using one of our functions, it is very likely that we already make use of shared/texture/constant memory where it makes sense.
Umar
Disclosure: I am one of the developers of ArrayFire
I am currently using the function "getdata" from the imaqtool library to get my camera data and do some post-processing on my GPU.
Hence, I would like the data to be transferred directly from the buffer (CPU memory) to my GPU memory.
It is my understanding that "getdata" moves data from one region of CPU memory (the buffer) to another, so it should then be trivial to transfer these data to my GPU directly.
However, I cannot find anything about how to do this.
Any help is appreciated.
In short: MATLAB is not the right tool for what you want. MATLAB provides quite an easy interface, but that means you don't have full control over some things, the main one being memory allocation and management. This is generally a good thing, as handling memory is non-trivial, but in your case that control is exactly what you are asking for.
If you want to make a fast acquisition system where the memory is fully controlled by you, you will need to use low level languages such as C++/CUDA, and play with asynchronous operations and threads.
In MATLAB, the most flexibility you can get is to call gpuArray(captured_data) once the data is on the CPU.
Recently I wrote a C application for a MicroBlaze, and I used uC/OS-II. uC/OS-II offers memory pools to allocate and deallocate blocks of memory of fixed size. I'm now writing a C application for an STM32, this time using FreeRTOS. It seems like FreeRTOS doesn't offer the same mechanism, or did I miss something? I think the five heap implementations are not what I am looking for.
If there are actually no memory pools, is there any specific reason why?
The original version of FreeRTOS used memory pools. However it was found that users struggled to dimension the pools, which led to a constant stream of support requests. Also, as the original versions of FreeRTOS were intended for very RAM constrained systems, it was found that the RAM wasted by the use of oversized pools was not acceptable. It was therefore decided to move memory allocation to the portable layer, on the understanding that no one scheme is suitable for more than a subset of applications, and allowing users to provide their own scheme. As you mention, there are five example implementations provided, which cover nearly all applications, but if you absolutely must use a memory pool implementation, then you can easily add this by providing your own pvPortMalloc() and vPortFree() implementations (memory pools being one of the easier ones to implement).
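As a rough sketch (not FreeRTOS-provided code: the block size and count are arbitrary example values, and a production version would need more care around alignment and start-up), a fixed-block pool behind the standard pvPortMalloc()/vPortFree() prototypes could look like this:

/* Hypothetical fixed-block allocator plugged into the FreeRTOS portable layer. */
#include <stddef.h>
#include "FreeRTOS.h"
#include "task.h"

#define POOL_BLOCK_SIZE   64   /* example value: payload bytes per block */
#define POOL_NUM_BLOCKS   32   /* example value: number of blocks in the pool */

typedef union block {
    union block *next;                     /* free-list link while the block is unused */
    unsigned char data[POOL_BLOCK_SIZE];   /* payload while the block is allocated */
} block_t;

static block_t xPool[POOL_NUM_BLOCKS];
static block_t *pxFreeList = NULL;
static BaseType_t xPoolInitialised = pdFALSE;

static void prvPoolInit(void)
{
    for (size_t i = 0; i < POOL_NUM_BLOCKS - 1; i++) {
        xPool[i].next = &xPool[i + 1];
    }
    xPool[POOL_NUM_BLOCKS - 1].next = NULL;
    pxFreeList = &xPool[0];
    xPoolInitialised = pdTRUE;
}

void *pvPortMalloc(size_t xWantedSize)
{
    void *pvReturn = NULL;

    taskENTER_CRITICAL();
    {
        if (xPoolInitialised == pdFALSE) {
            prvPoolInit();
        }
        /* Only requests that fit in one fixed-size block are served. */
        if ((xWantedSize <= POOL_BLOCK_SIZE) && (pxFreeList != NULL)) {
            pvReturn = pxFreeList;
            pxFreeList = pxFreeList->next;
        }
    }
    taskEXIT_CRITICAL();

    return pvReturn;
}

void vPortFree(void *pv)
{
    if (pv != NULL) {
        taskENTER_CRITICAL();
        {
            ((block_t *)pv)->next = pxFreeList;
            pxFreeList = (block_t *)pv;
        }
        taskEXIT_CRITICAL();
    }
}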
Also note that, in FreeRTOS V9, you need not have any memory allocation scheme as everything can be statically allocated.
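For example (assuming configSUPPORT_STATIC_ALLOCATION is set to 1 in FreeRTOSConfig.h; the task name and stack depth below are arbitrary), a task can be created entirely from caller-supplied buffers:

/* Sketch: creating a task without any dynamic allocation. */
#include "FreeRTOS.h"
#include "task.h"

#define MY_STACK_DEPTH 256                    /* in StackType_t words, example value */

static StackType_t xStack[MY_STACK_DEPTH];    /* task stack supplied by the application */
static StaticTask_t xTaskBuffer;              /* holds the task's TCB */

static void vMyTask(void *pvParameters)
{
    for (;;) {
        /* task body */
    }
}

void vCreateMyTask(void)
{
    xTaskCreateStatic(vMyTask, "MyTask", MY_STACK_DEPTH,
                      NULL,                   /* pvParameters */
                      tskIDLE_PRIORITY + 1,
                      xStack, &xTaskBuffer);
}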
I have a Go application which requires around 600GB of memory. The machine on which it will run has 128GB of RAM. I'm trying to decide how best to handle this.
The options are:
Just load everything into memory (pretend I have 600GB of RAM) and let the OS page the infrequently accessed parts out to disk. I like this idea because I don't have to do anything special in the code; the OS will just handle everything. However, I'm not sure this is a good idea.
Have the data stored on disk and use mmap (a memory-mapped file), which I guess is similar to the above but will require a lot more coding. It also appears to mean that the data will have to be stored as []byte and parsed every time I need it, rather than already being in whatever type I need for the actual calculations.
Build a caching system in which the data is kept on the HDD and loaded when it's needed, with the most frequently accessed data held in memory and the least frequently accessed data purged whenever the memory limit is exceeded.
What are the advantages and disadvantages with these? I'd prefer to go with (1) if possible due to its simplicity... is there anything wrong with that?
It all depends on the nature of the data access. Will the accesses to those 600GB be uniformly distributed? If that's not the case then a solution where you cache part of your content in memory and keep the rest of it on the HDD will likely be sufficient since you have enough RAM to cache more than 20% of your data. Keeping everything in virtual memory space may come with surprising drawbacks such as the need for a huge swap partition.
To cache the data on disk you could use a DB engine, as Dave suggests, since they usually do a good job of caching the most frequently accessed content. You could also keep the hottest items in a dedicated in-memory cache such as memcached.
The bottom line is that optimizing performance without knowing the exact usage patterns is hard. Luckily, with Go you don't have to guess. You can test and measure.
You can define an interface similar to
type Index interface {
    Lookup(query string) Result
}
And then try all of your solutions, starting with the easiest to implement.
type inMemoryIndex struct {...}
func (*inMemoryIndex) Lookup(query string) Result {...}
type memcachedIndex struct {...}
type dbIndex struct {...}
Then you can use Go's built-in benchmarking tools to benchmark your application and see if it lives up to your standards. You can even benchmark on that machine, using real data and mocked user queries.
You're correct in assuming that mmap would require more coding, so I would save that option until I had tried all the others.
OpenCL is of course designed to abstract away the details of hardware implementation, so going down too much of a rabbit hole with respect to worrying about how the hardware is configured is probably a bad idea.
Having said that, I am wondering how much local memory it is efficient to use for any particular kernel. For example, if I have a work group containing 64 work items, then presumably more than one such work group may run simultaneously within a compute unit. However, the local memory size returned by CL_DEVICE_LOCAL_MEM_SIZE queries appears to apply to the whole compute unit, whereas it would be more useful if this information were per work group. Is there a way to know how many work groups will need to share this same memory pool if they coexist on the same compute unit?
I had thought that making sure that my work group memory usage was below one quarter of total local memory size was a good idea. Is this too conservative? Is tuning by hand the only way to go? To me that means that you are only tuning for one GPU model.
Lastly, I would like to know whether the whole local memory size is available for user allocation, or whether there are other system overheads that make the usable amount smaller. I hear that if you allocate too much, the data is just placed in global memory. Is there a way of determining if this is the case?
Is there a way to know how many work groups will need to share this same memory pool if they coexist on the same compute unit?
Not in one step, but you can compute it. First, you need to know how much local memory a workgroup will require. To do so, you can use clGetKernelWorkGroupInfo with the flag CL_KERNEL_LOCAL_MEM_SIZE (strictly speaking, the local memory used by one work-group running that kernel). Since you know how much local memory there is per compute unit, you can then work out the maximum number of workgroups that can coexist on one compute unit.
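As a sketch (error checking omitted; the cl_kernel and cl_device_id handles are assumed to already exist), the two queries could be combined like this:

#include <CL/cl.h>

cl_ulong kernel_local = 0;   /* local memory used by one work-group of the kernel */
cl_ulong device_local = 0;   /* local memory available per compute unit */

clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_LOCAL_MEM_SIZE,
                         sizeof(kernel_local), &kernel_local, NULL);
clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,
                sizeof(device_local), &device_local, NULL);

/* Upper bound imposed by local memory alone; if kernel_local is 0,
   local memory is not the limiting factor. */
cl_ulong groups_per_cu = (kernel_local != 0) ? device_local / kernel_local : 0;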
Actually, this is not that simple. You have to take into consideration other parameters, such as the max number of threads that can reside on one compute unit.
This is a problem of occupancy (that you should try to maximize). Unfortunately, occupancy will vary depending of the underlying architecture.
AMD publishes an article on how to compute occupancy for different architectures here.
NVIDIA provides an .xls sheet that computes occupancy for its different architectures.
Not all the necessary information to do the calculation can be queried with OCL (if I recall correctly), but nothing stops you from storing info about different architectures in your application.
I had thought that making sure that my work group memory usage was below one quarter of total local memory size was a good idea. Is this too conservative?
It is quite rigid, and with clGetKernelWorkGroupInfo you don't need to do that. However there is something about CL_KERNEL_LOCAL_MEM_SIZE that needs to be taken into account:
If the local memory size, for any pointer argument to the kernel declared with the __local address qualifier, is not specified, its size is assumed to be 0.
Since you might need to compute the required local memory per workgroup dynamically, here is a workaround based on the fact that the kernels are compiled just-in-time (JIT).
You can define a constant in your kernel file and then use the -D option to set its value (computed beforehand) when calling clBuildProgram.
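A minimal sketch of that trick; the macro name TILE_SIZE and the surrounding variable names are invented for the example. On the host side (program and device assumed to be already created):

#include <stdio.h>   /* snprintf */

char build_options[64];
size_t tile_elems = 256;   /* computed at run time, e.g. from CL_DEVICE_LOCAL_MEM_SIZE */
snprintf(build_options, sizeof(build_options), "-D TILE_SIZE=%zu", tile_elems);
clBuildProgram(program, 1, &device, build_options, NULL, NULL);

And in the kernel source, the __local array is statically sized, so it is counted by CL_KERNEL_LOCAL_MEM_SIZE:

__kernel void process(__global const float *in, __global float *out)
{
    __local float tile[TILE_SIZE];   /* size baked in at build time */
    /* ... */
}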
I would like to know if the whole local memory size is available for user allocation for local memory, or if there are other system overheads that make it less?
Again, CL_KERNEL_LOCAL_MEM_SIZE is the answer. The standard states:
This includes local memory that may be needed by an implementation to execute the kernel...
If your work items are fairly independent and don't re-use input data, you can safely ignore everything about work groups and shared local memory. However, if your work items can share input data (the classic example is a 3x3 or 5x5 convolution that re-reads input data), then the optimal implementation will need shared local memory. Non-independent work can also benefit. One way to think of shared local memory is as a programmer-managed cache.
I've always been wondering about the caching behaviour of global data in OpenCL.
Let's say I have a pointer to global memory in a kernel.
Now I read the location the pointer points to.
Later in the kernel I might need the same data again, so I read it again through the pointer.
Now the question is, will this data be cached, or will it be reread from global memory every single time because other threads could have modified it?
If it's not cached, then I'd have to make a local copy every time so I don't lose tons of performance by constantly accessing global memory.
I know this might be vendor specific, but what do the specs say about this?
There is some caching, but the key to great GPU compute performance is to move data that is accessed many times into private or shared local memory and not re-read it from global memory. In a way, you can think of this as "you control the caching". In OpenCL this copy would be done in your kernel (in parallel!), followed by a memory barrier (to ensure all work items have finished the copy); after that, your algorithm has access to the data in fast memory. See the matrix multiply example: since each column and row contributes to multiple output values, copying them to shared local memory accelerates the algorithm.
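A rough OpenCL sketch of that pattern, simpler than matrix multiply (the kernel name and the GROUP_SIZE macro, assumed to be passed with -D at build time, are invented): each input element is re-read by up to three work-items, so the work-group stages its slice, plus a one-element halo on each side, in local memory, synchronises, then computes from the fast copy.

__kernel void blur3(__global const float *in, __global float *out, const int n)
{
    const int gid = get_global_id(0);
    const int lid = get_local_id(0);
    const int lsz = get_local_size(0);

    __local float tile[GROUP_SIZE + 2];   /* GROUP_SIZE must equal the work-group size */

    /* Cooperative copy: one interior element per work-item; the edge work-items
       also fetch the halo elements from the neighbouring slices. */
    tile[lid + 1] = (gid < n) ? in[gid] : 0.0f;
    if (lid == 0)
        tile[0] = (gid > 0) ? in[gid - 1] : 0.0f;
    if (lid == lsz - 1)
        tile[lsz + 1] = (gid + 1 < n) ? in[gid + 1] : 0.0f;

    barrier(CLK_LOCAL_MEM_FENCE);         /* all copies finished before anyone reads */

    if (gid < n)
        out[gid] = (tile[lid] + tile[lid + 1] + tile[lid + 2]) / 3.0f;
}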
Those who want the benefits of local caching for work-items within a work-group, for example on FPGAs, can read this paper by Andrew Ling at IWOCL 2017: https://www.iwocl.org/wp-content/uploads/iwocl2017-andrew-ling-fpga-sdk.pdf. It is a good example of correct usage of local caches and clever communication for dataflow computing. Those who want the convenience of caches in a parallel peer-to-peer setting, and still want the hardware to handle this for them, should consider POWER8 or POWER9 chips. These conveniences come at a cost: to cache global or virtual memory across a cluster, the interconnect may need several TB/s of bandwidth. The real question is: what is the value of caching for dataflow compute (e.g. ML), especially on clusters, versus reducing communication and increasing data reuse by other means?