I have the following system parameters:
CL_DEVICE_TYPE_GPU
Device maximum compute units = 20
Device maximum Work Item Dimensions = 3
Device maximum Work Item Sizes = 512 x 512 x 512
Device maximum Work Group Size = 512
As I understand it, if Work Item Dimensions = 1, there is a one-dimensional array of work-items in a work group. If Work Item Dimensions = 2, there is a two-dimensional array (matrix) of work-items in a work group, and so on. The work groups, in turn, together make up the whole index space (the NDRange).
But I cannot understand how to:
1) determine the maximum number of work-items in the array or matrix inside a work group
2) determine the maximum number of work-groups inside the set (NDRange)
I tried to find similar questions and clear answers, but without success.
Thanks for any help!
Just focus on the device maximum Work Item / Work Group limits.
The number of compute units only matters for device fission functionality.
The limits in work group size are given by:
Device maximum Work Group Size = 512
This is the maximum number of work items in a work group, and it matches the hardware limit.
Then you have to add an extra constraint on the "shape" of the group; in your case:
Device maximum Work Item Dimensions = 3
Device maximum Work Item Sizes = 512 x 512 x 512
This means the limit is 3 dimensions, with up to 512 work items in each dimension, so no limit for you! You can shape the 512 work items any way you like: 512x1x1, 256x2x1, etc.
However, the limit could be, e.g., 16x16x16. In that case, even though you can still run 512 work items in total, you are limited to shapes such as 16x16x2, 8x8x8 or 16x8x4, etc. Values such as 32x16x1 or 512x1x1 would not be allowed.
NOTE: It is not so uncommon to be limited by the shape. nVIDIA devices usually have 4096 work group sizes and 1024x1024x1024 per-dimension limits. My guess is that they do it that way so they can store the work dimension id in a single register, while 4096x4096x4096 would need 2 registers.
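For completeness, here is a minimal sketch (plain C, error checking omitted, assuming the 3 dimensions reported above) of how you could query these limits at runtime with clGetDeviceInfo and check a candidate work-group shape against them; the shape_fits helper is just an illustrative name:

#include <CL/cl.h>

/* Returns 1 if the local work-group shape fits the device limits, 0 otherwise. */
int shape_fits(cl_device_id device, const size_t shape[3])
{
    size_t max_group_size;     /* total work items allowed per group (512 here)     */
    size_t max_item_sizes[3];  /* per-dimension limits (512 x 512 x 512 here);
                                  assumes CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS == 3   */

    clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                    sizeof(max_group_size), &max_group_size, NULL);
    clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_ITEM_SIZES,
                    sizeof(max_item_sizes), max_item_sizes, NULL);

    size_t total = shape[0] * shape[1] * shape[2];

    return total <= max_group_size
        && shape[0] <= max_item_sizes[0]
        && shape[1] <= max_item_sizes[1]
        && shape[2] <= max_item_sizes[2];
}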
Related
I am trying to do some operations on MIDI tracks, such as increasing/decreasing the playback speed.
For those who want more detail: to change the playback speed, I need to divide the 'delta times' of each track by the multiplier. E.g. if I want to speed the track up x2, I divide the delta times by 2. Delta times are stored as variable-length quantities, so if I divide the delta times I also need to update the track's size by reducing its size in bytes, to keep the track consistent (because shorter delta times may need fewer bytes to store as a variable-length quantity).
In my struct, the track length (size in bytes of the entire track) is stored as a uint32_t. The problem occurs when I try to store the changed track size back. So let's say my original track size was 3200 and, after reducing the delta times, the difference in bytes is 240; then I simply subtract this difference from the original length. However, when I use the 'du' command to check the new file size, the file size inflates heavily, going from something like 16 kB to 2000 kB. I don't understand why.
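As background for the size bookkeeping described above, here is a minimal sketch (in C) of how the byte length of a MIDI variable-length quantity can be computed; the helper name vlq_length is hypothetical, but the 7-bits-per-byte encoding it assumes is the standard MIDI one:

#include <stdint.h>

/* Bytes needed to store a value as a MIDI variable-length quantity:
   7 payload bits per byte, continuation bit set on all but the last byte. */
uint32_t vlq_length(uint32_t value)
{
    uint32_t bytes = 1;            /* even 0 takes one byte        */
    while (value > 0x7F) {         /* more than 7 bits remaining?  */
        value >>= 7;
        bytes++;
    }
    return bytes;
}

/* Example: a delta time of 200 (2 bytes as a VLQ) halved to 100 (1 byte)
   shrinks the track by vlq_length(200) - vlq_length(100) = 1 byte. */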
I have a shader program in OpenGL ES. I want to adjust the local / global workgroup sizes to complete a 1-dimensional task with a compute shader.
I have a total task size (the total number of threads, which can change between runs), say [task_size]. Say I specify the local workgroup size, let it be [local_size]. I also know how many workgroups I have, say [workgroups].
I specify local size as here:
layout(local_size_x = [local_size]) in;
And I specify number of workgroups in glDispatchCompute:
glDispatchCompute([workgroups], 1, 1);
If local_size * workgroups == task_size, I clearly understand what happens: each part of the task is computed by a separate group.
But what happens if task_size is not evenly divisible by local_size? I understand that the minimum number of workgroups I need is then task_size / local_size + 1. But how does it work? Is the last workgroup actually smaller than the others? Does it affect performance? Is it a good idea to make task_size evenly divisible by local_size?
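For reference, a minimal host-side sketch (plain C, GL context setup omitted) of the usual approach: round the group count up, dispatch, and rely on a bounds check against task_size inside the shader so the extra invocations in the last group do nothing. The names task_size and local_size are the ones from the question above:

#include <GLES3/gl31.h>   /* glDispatchCompute requires OpenGL ES 3.1+ */

void dispatch_1d(GLuint task_size, GLuint local_size)
{
    /* Round up so the dispatch covers the whole task even when
       task_size is not a multiple of local_size. */
    GLuint workgroups = (task_size + local_size - 1) / local_size;

    /* The last group is still dispatched at full size; invocations with
       gl_GlobalInvocationID.x >= task_size must simply return in the shader. */
    glDispatchCompute(workgroups, 1, 1);
}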
I was reading Ext2 file system details, and I am not clear on why the number of blocks in a block group is (b x 8), where b is the block size.
How did they arrive at this figure? What is the significance of 8?
For each group in an ext2 filesystem there is a block bitmap, which keeps track of which blocks are used (bit equals 1) and which are still free (bit equals 0). This structure is designed to occupy exactly one block. Hence, the number of bits in the block bitmap is equal to b x 8, where b is the block size expressed in bytes.
Blocks in the group must not outnumber the bits in the block bitmap, otherwise we would not be able to keep track of their availability. At the same time, we want each group to manage the maximum possible number of blocks in order to limit the space occupied by metadata. Therefore, the number of blocks in a group equals that maximum: b x 8.
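As a concrete illustration (the 4096-byte block size here is just an example, not something from the question):

#include <stdio.h>

int main(void)
{
    unsigned long block_size = 4096;                 /* b, in bytes (example value)         */
    unsigned long blocks_per_group = block_size * 8; /* one bit per block, one-block bitmap */
    unsigned long group_bytes = blocks_per_group * block_size;

    /* 4096 * 8 = 32768 blocks per group, covering 128 MiB of data blocks. */
    printf("%lu blocks per group, %lu MiB per group\n",
           blocks_per_group, group_bytes >> 20);
    return 0;
}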
I have known the ideas of block and grid in cuda, and I'm wondering if there is any helper function well written that can help me determine the best block and grid size for any given 2D image.
For example, for a 512x512 image mentioned in this thread. Grid is 64x64 and block is 8x8.
However, sometimes my input image may not have power-of-2 dimensions; it may be 317x217 or something like that. In this case, maybe the grid should be 317x1 and the block should be 1x217.
So if I have an application that accepts an image from the user and uses CUDA to process it, how can it automatically determine the size and dimensions of the block and grid, given that the user can input an image of any size?
Is there any existing helper function or class that handles this problem?
Usually you want to choose the size of your blocks based on your GPU architecture, with the goal of maintaining 100% occupancy on the Streaming Multiprocessor (SM). For example, the GPUs at my school can run 1536 threads per SM and up to 8 blocks per SM, but each block can only have up to 1024 threads in each dimension. So if I were to launch a 1D kernel on the GPU, I could max out a block with 1024 threads, but then only 1 block would fit on the SM (66% occupancy). If I instead chose a smaller number, like 192 or 256 threads per block, then I could have 100% occupancy with 8 and 6 blocks respectively on the SM.
Another thing to consider is the amount of memory that must be accessed versus the amount of computation to be done. In many imaging applications, you don't just need the value at a single pixel; you need the surrounding pixels as well. CUDA groups its threads into warps, which step through every instruction simultaneously (currently there are 32 threads to a warp, though that may change). Making your blocks square generally minimizes the amount of memory that needs to be loaded relative to the amount of computation that can be done, making the GPU more efficient. Likewise, blocks whose dimensions are a power of 2 load memory more efficiently (if properly aligned with memory addresses), since CUDA loads memory a line at a time instead of by single values.
So for your example, even though it might seem more effective to have a grid that is 317x1 and blocks that are 1x217, your code will likely be more efficient if you launch blocks that are 16x16 on a grid that is 20x14 as it will lead to better computation/memory ratio and SM occupancy. This does mean, though, that you will have to check within the kernel to make sure the thread is not out of the picture before trying to access memory, something like
const int thread_id_x = blockIdx.x * blockDim.x + threadIdx.x;
const int thread_id_y = blockIdx.y * blockDim.y + threadIdx.y;
if (thread_id_x < pic_width && thread_id_y < pic_height)
{
    // Do stuff
}
Lastly, you can determine the lowest number of blocks you need in each grid dimension to completely cover your image as (N+M-1)/M, where N is the total number of threads in that dimension and M is the number of threads per block in that dimension.
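Putting that together for the 317x217 example (plain C; the resulting values would be used as the CUDA grid and block dimensions):

#include <stdio.h>

int main(void)
{
    int pic_width = 317, pic_height = 217;   /* image size from the question */
    int block_x = 16, block_y = 16;          /* threads per block            */

    /* Round up so the grid covers the whole image: (N + M - 1) / M */
    int grid_x = (pic_width  + block_x - 1) / block_x;   /* = 20 */
    int grid_y = (pic_height + block_y - 1) / block_y;   /* = 14 */

    printf("grid %dx%d, block %dx%d\n", grid_x, grid_y, block_x, block_y);
    return 0;
}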
It depends on how you deal with the image. If each thread only processes one pixel independently, for example adding 3 to each pixel value, you can just assign one image dimension to your block size and the other to your grid size (just do not go out of range). But if you want to do something like a filter or erode, this kind of operation often needs to access the pixels near the center pixel, like a 3*3 or 9*9 neighborhood. Then the block should be 8*8 as you mentioned, or some other value, and you had better use texture memory, because when threads access global memory they are always grouped 32 at a time into a warp.
So there isn't a function like the one you described. The number of threads and blocks depends on how you process the data; it is not universal.
Preface: There are many different design patterns that are important to a cache's overall performance. Below are listed the parameters for different direct-mapped cache designs.
Cache data size: 32 KiB
Cache block size: 2 words
Cache access time: 1 cycle
Question: Calculate the number of bits required for the cache listed above, assuming a 32-bit address. Given that total size, find the total size of the closest direct-mapped cache with 16-word blocks of equal size or greater. Explain why the second cache, despite its larger data size, might provide slower performance than the first cache.
Here's the formula:
Number of bits in a cache = 2^n × (block size + tag size + valid field size)
Here's what I got: 65536(1+14X(32X2)..
is this correct?
Using: (2^index bits) × (valid bits + tag bits + (data bits × 2^offset bits))
For the first one I get:
total bits = 2^15 × (1 + 14 + (32 × 2^1)) = 2,588,672 bits
For the cache with 16-word blocks I get:
total bits = 2^13 × (1 + 13 + (32 × 2^4)) = 4,308,992 bits
The next smallest cache with 16-word blocks and a 32-bit address works out to be 2,158,592 bits, smaller than the first cache.
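A small sketch of the same arithmetic in C, using the field widths assumed in this answer (index, tag and valid bits plus 32-bit words), just to make the formula mechanical:

#include <stdio.h>

/* total bits = 2^index_bits * (valid + tag + 32 * words_per_block) */
unsigned long cache_bits(unsigned index_bits, unsigned tag_bits,
                         unsigned words_per_block)
{
    unsigned long sets = 1UL << index_bits;
    unsigned long per_set = 1 + tag_bits + 32UL * words_per_block;
    return sets * per_set;
}

int main(void)
{
    printf("2-word blocks         : %lu bits\n", cache_bits(15, 14, 2));  /* 2588672 */
    printf("16-word blocks        : %lu bits\n", cache_bits(13, 13, 16)); /* 4308992 */
    printf("smaller 16-word cache : %lu bits\n", cache_bits(12, 14, 16)); /* 2158592 */
    return 0;
}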
I'm stuck on the same problem too but I have the answer to the first part.
To calculate the total number of bits required:
You need to convert the KB to words and get the index bits.
Use the answer from part 1 to get your tag bits.
Plug them into this formula:
(2^(index bits)) × ((tag bits) + (valid bits) + (data size))
Hint: data size is 64 bits in this case and valid bit is 1. So just find the index and tag bits.
And I don't think your answer is right. I didn't check but I can see you are multiplying 1+14 and (32x2) instead of adding them.
I think the formula you were using is correct. According to my textbook, "Computer Organization and Design: The Hardware/Software Interface, 5th edition", the total number of bits in a direct-mapped cache is:
2^(index bits) × (block size + tag size + valid field size).
block size was given by the question: 2 words = 64 bits
tag size: 32 - offset in bits - index in bits
valid field size is usually 1 valid bit
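For what it's worth, here is one possible worked reading of the question's numbers with that formula, assuming byte addressing (32 KiB of data, 2-word = 8-byte blocks, hence 4096 blocks, a 12-bit index and a 3-bit block offset); other answers in this thread slice the address differently:

#include <stdio.h>

int main(void)
{
    unsigned blocks      = (32 * 1024) / 8;  /* 32 KiB of data / 8-byte blocks = 4096 */
    unsigned index_bits  = 12;               /* log2(4096)                            */
    unsigned offset_bits = 3;                /* 8 bytes per block                     */
    unsigned tag_bits    = 32 - index_bits - offset_bits;   /* = 17                   */

    unsigned long total = (unsigned long)blocks * (64 + tag_bits + 1);
    printf("total = %lu bits\n", total);     /* 4096 * 82 = 335872 bits */
    return 0;
}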