How are global and local work sizes distributed in this function?

This is from a sample program for OpenCL programming.
I am confused about how global and local work size are computed.
They are computed based on the image size.
Image size is 1920 x 1080 (w x h).
What I assumed is that global_work_size[0] and global_work_size[1] form a grid over the image.
But global_work_size is actually {128, 1088}.
Then local_work_size[0] and local_work_size[1] form a grid over global_work_size.
local_work_size is {128, 32}.
But the total number of groups is num_groups = 34, not 128 x 1088.
The maximum workgroup_size available on the device is 4096.
How is the image distributed into such global and local work group sizes?
They are calculated in the following function.
clGetKernelWorkGroupInfo(histogram_rgba_unorm8, device, CL_KERNEL_WORK_GROUP_SIZE, sizeof(size_t), &workgroup_size, NULL);
size_t gsize[2];
int w;
if (workgroup_size <= 256) {
    gsize[0] = 16; // workgroup_size is formed into row & col
    gsize[1] = workgroup_size / 16;
} else if (workgroup_size <= 1024) {
    gsize[0] = workgroup_size / 16;
    gsize[1] = 16;
} else {
    gsize[0] = workgroup_size / 32;
    gsize[1] = 32;
}
local_work_size[0] = gsize[0];
local_work_size[1] = gsize[1];
// work-items needed per row; adding (num_pixels_per_work_item - 1) rounds up so all pixels are covered
w = (image_width + num_pixels_per_work_item - 1) / num_pixels_per_work_item;
global_work_size[0] = ((w + gsize[0] - 1) / gsize[0]); // number of group columns
global_work_size[1] = ((image_height + gsize[1] - 1) / gsize[1]); // number of group rows
num_groups = global_work_size[0] * global_work_size[1];
global_work_size[0] *= gsize[0];
global_work_size[1] *= gsize[1];
err = clEnqueueNDRangeKernel(queue, histogram_rgba_unorm8, 2, NULL, global_work_size, local_work_size, 0, NULL, NULL);
if (err) {
    printf("clEnqueueNDRangeKernel() failed for histogram_rgba_unorm8 kernel. (%d)\n", err);
}

I don't see any great mystery here. If you follow the calculation, the values do indeed end up as you say. (Not that the group size is particularly efficient in my opinion.)
If workgroup_size is indeed 4096, gsize will end up as { 128, 32 } as it follows the else logic. (>1024)
w is the number of columns that are num_pixels_per_work_item = 32 pixels wide, i.e. the minimum number of work-items needed to cover the entire width; for an image width of 1920 that is 60. In other words, we require an absolute minimum of 60 x 1080 work-items to cover the entire image.
Next, the number of group columns and rows is calculated and temporarily stored in global_work_size. As group width has been set to 128, a w of 60 means we end up with 1 column of groups. (This seems a waste of resources, more than half of the 128 work-items in each group will not be doing anything.) The number of group rows is simply image_height divided by gsize[1] (32) and rounding up. (33.75 -> 34)
Total number of groups can now be determined by multiplying out the grid: num_groups = global_work_size[0] * global_work_size[1]
To get the true total number of work-items in each dimension, each dimension of global_work_size is now multiplied by the group size in this dimension. 1, 34 multiplied by 128, 32 yields 128, 1088.
This actually covers an area of 4096 x 1088 pixels so about 53% of that is wastage. This is mainly because the algorithm for group dimensions favours wide groups, and each work-item works on a 32x1 pixel slice of the image. It would be better to favour tall work groups to reduce the amount of rounding.
For example, if we reverse gsize[0] and gsize[1], in this case we'd get a group size of { 32, 128 }, giving us a global work size of { 64, 1152 } and only 12% wastage. It would also be worth checking if always picking the largest possible group size is even a good idea; it quite possibly isn't, but I've not looked into the kernel's computation in detail, let alone run any measurements, to say if that's the case or not.
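The arithmetic above can be replayed with a small sketch. This is only an illustration in Python, not part of the OpenCL sample: the function name `work_sizes` and the `swap` flag are made up here, and `pixels_per_item = 32` is the value assumed in this answer.

```python
def ceil_div(a, b):
    """Integer division rounding up, as in (a + b - 1) / b in the C code."""
    return (a + b - 1) // b


def work_sizes(image_width, image_height, workgroup_size, pixels_per_item, swap=False):
    """Mirror the sample's sizing logic; swap=True tries tall groups instead."""
    if workgroup_size <= 256:
        gsize = [16, workgroup_size // 16]
    elif workgroup_size <= 1024:
        gsize = [workgroup_size // 16, 16]
    else:
        gsize = [workgroup_size // 32, 32]
    if swap:
        gsize.reverse()  # hypothetical variant: favour tall groups

    w = ceil_div(image_width, pixels_per_item)  # work-items needed per row
    groups = [ceil_div(w, gsize[0]), ceil_div(image_height, gsize[1])]
    global_size = [groups[0] * gsize[0], groups[1] * gsize[1]]

    # Pixels covered by the launch versus pixels actually in the image.
    covered = global_size[0] * pixels_per_item * global_size[1]
    waste = 1.0 - (image_width * image_height) / covered
    return gsize, global_size, groups[0] * groups[1], waste


for swap in (False, True):
    local, glob, num_groups, waste = work_sizes(1920, 1080, 4096, 32, swap)
    print(local, glob, num_groups, f"{waste:.1%}")
```

Whether the taller groups are actually faster would still need measuring on the device; the sketch only reproduces the rounding.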


How can I tile an image into patches of (constant) arbitrary size at (constant) arbitrary stride in MATLAB?

I have an image of arbitrary dimensions ROWS and COLS. I want to tile this image into patches of arbitrary, but constant size blockSize = [blockSizeR, blockSizeC], given an arbitrary, but constant stride stride = [strideR, strideC]. When the number of patches in row or column direction times the respective block size doesn't equal the number of rows or columns, respectively (i.e. if there were spare rows or columns), I don't care about them (i.e. they can be ignored). It's sufficient if the image is tiled into all possible patches that fit completely into the image starting from the top left pixel.
There is a bunch of possible solutions floating around the web, but some don't allow overlap, some don't allow outputs if there are spare rows or columns, some are making inefficient use of for loops.
The closest thing to what I need is probably the solution posted on
%img: source image
stride = [5, 5]; %height, width
blocksize = [11, 11]; %height, width
tilescount = (size(img(:, :, 1)) - blocksize - 1) / stride + 1;
assert(all(mod(tilescount, 1) == 0), 'cannot divide image into tile evenly')
tiles = cell(tilescount);
tileidx = 1;
for col = 1 : stride(2) : size(img, 2 ) - blocksize(2)
    for row = 1 : stride(1) : size(img, 1) - blocksize(1)
        tiles{tileidx} = img(row:row+stride(1)-1, col:col+stride(2)-1, :);
        tileidx = tileidx + 1;
    end
end
However, it also seems to work only if there are no spare rows or columns. How can I adapt that to an efficient solution for images with an arbitrary number of channels (I seek to apply it on both single-channel images and RGB images)?
The code above did not fully work, so I came up with the following solution based on it. Variable names are chosen such that they are self-explanatory.
% Number of blocks that fit completely; the last valid start index is ROWS - rowBlockSize + 1
tilesCountR = floor((ROWS - rowBlockSize) / rowStride) + 1;
tilesCountC = floor((COLS - colBlockSize) / colStride) + 1;
tiles = cell(tilesCountR * tilesCountC,1);
tileidx = 1;
for col = 1 : colStride : COLS - colBlockSize + 1
    for row = 1 : rowStride : ROWS - rowBlockSize + 1
        % the trailing colon keeps all channels, so this works for grayscale and RGB
        tiles{tileidx} = img(row:row+rowBlockSize-1, col:col+colBlockSize-1, :);
        tileidx = tileidx + 1;
    end
end
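To check which patches should come out for a given image size, the index bookkeeping can be sketched on its own. This is a minimal Python illustration (0-based indices, hypothetical function name `tile_origins`), not a replacement for the MATLAB code; it only lists the top-left corner of every block that fits completely.

```python
def tile_origins(rows, cols, block, stride):
    """Return 0-based (row, col) top-left corners of every block that fits."""
    block_r, block_c = block
    stride_r, stride_c = stride
    if block_r > rows or block_c > cols:
        return []
    # The last valid start is rows - block_r (0-based), hence the "+ 1" in range().
    row_starts = range(0, rows - block_r + 1, stride_r)
    col_starts = range(0, cols - block_c + 1, stride_c)
    # Column-major order, matching the loop nesting in the MATLAB code.
    return [(r, c) for c in col_starts for r in row_starts]


origins = tile_origins(rows=23, cols=30, block=(11, 11), stride=(5, 5))
print(len(origins), origins[:3])
```

Spare rows and columns are simply never visited, and an image exactly one block in size yields a single patch.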

Sampling pixels from an Image - increasing performance

I am sampling some pixels from a reference image Ir and then moving them on a secondary image In. The first function I have written is as follows:
[r,c,d] = size(Ir);
rSample = fix(r * 0.4); % sample 40 percent of the rows
cSample = fix(c * 0.4); % sample 40 percent of the columns
rIdx = randi(r,rSample,1); % uniformly sample indices for rows
cIdx = randi(c,cSample,1); % uniformly sample indices for columns
kk = 1;
for ii = 1:length(rIdx)
    for jj=1:length(cIdx)
        In(rIdx(ii),cIdx(jj),:) = Ir(rIdx(ii),cIdx(jj),:) * fcn(rIdx(ii),cIdx(jj));
        kk = kk + 1;
    end
end
Another method to increase the performance (speed) of the code, which I came across, is as follows:
nSample = fix(r*c*0.4);
Idx = randi(r*c,nSample,1);
for ii = 1:nSample
    [I,J] = ind2sub([r,c],Idx(ii,1));
    In(I,J,:) = Ir(I,J,:) * fcn(I,J);
end
In both codes, fcn(I,J) is a function that performs some computation on the pixel at [I,J] and the process can be different depending on the indices of the pixel.
Although I have removed one for-loop, I guess there is a better technique to increase the performance of the code even more.
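For reference, the linear-index bookkeeping of the second version can be sketched outside MATLAB. This Python illustration uses made-up helper names (`ind2sub_colmajor`, `sample_pixels`), draws indices with replacement like randi, and leaves out the per-pixel fcn entirely; it only shows how a 1-based column-major linear index maps back to (row, col).

```python
import random


def ind2sub_colmajor(n_rows, k):
    """Convert a 1-based column-major linear index to 1-based (row, col)."""
    col, row = divmod(k - 1, n_rows)
    return row + 1, col + 1


def sample_pixels(n_rows, n_cols, fraction, seed=None):
    """Draw linear indices with replacement and convert them to subscripts."""
    rng = random.Random(seed)
    n_samples = int(n_rows * n_cols * fraction)
    return [ind2sub_colmajor(n_rows, rng.randint(1, n_rows * n_cols))
            for _ in range(n_samples)]


coords = sample_pixels(n_rows=4, n_cols=6, fraction=0.4, seed=0)
print(coords)
```

Because sampling is with replacement, the same pixel can be drawn more than once, exactly as in the randi-based code.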
As suggested by @Daniel, the following line of code does the job.
But the point is, I would prefer to have only the sampled pixels so that I can process them faster, for instance by having the samples in vector format with 3 layers for RGB.
Io = Ir(rIdx,cIdx,:);
Io1 = Io(:,:,1);
Io1v = Io1(:);
[r,c,d] = size(Ir);
rSamples = fix(r * 0.4); % sample 40 percent of pixels
cSamples = fix(c * 0.4); % sample 40 percent of pixels
rIdx = randi(r,rSamples,1); % uniformly sample indices for rows
cIdx = randi(c,cSamples,1); % uniformly sample indices for columns

How "bytesPerRow" is calculated from an NSBitmapImageRep

I would like to understand how "bytesPerRow" is calculated when building up an NSBitmapImageRep (in my case from mapping an array of floats to a grayscale bitmap).
Clarifying this detail will help me to understand how memory is being mapped from an array of floats to a byte array (0-255, unsigned char; neither of these arrays are shown in the code below).
The Apple documentation says that this number is calculated "from the width of the image, the number of bits per sample, and, if the data is in a meshed configuration, the number of samples per pixel."
I had trouble following this "calculation" so I setup a simple loop to find the results empirically. The following code runs just fine:
int Ny = 1; // Ny is arbitrary, note that BytesPerPlane is calculated as we would expect = Ny*BytesPerRow;
for (int Nx = 0; Nx<320; Nx+=64) {
    // greyscale image representation:
    NSBitmapImageRep *dataBitMapRep = [[NSBitmapImageRep alloc]
        initWithBitmapDataPlanes: nil // allocate the pixel buffer for us
        pixelsWide: Nx
        pixelsHigh: Ny
        bitsPerSample: 8
        samplesPerPixel: 1
        hasAlpha: NO
        isPlanar: NO
        colorSpaceName: NSCalibratedWhiteColorSpace // 0 = black, 1 = white
        bytesPerRow: 0 // 0 means "you figure it out"
        bitsPerPixel: 8]; // should agree with bitsPerSample * samplesPerPixel
    long rowBytes = [dataBitMapRep bytesPerRow];
    printf("Nx = %d; bytes per row = %ld \n",Nx, rowBytes);
}
and produces the result:
Nx = 0; bytes per row = 0
Nx = 64; bytes per row = 64
Nx = 128; bytes per row = 128
Nx = 192; bytes per row = 192
Nx = 256; bytes per row = 256
So we see that the bytes/row jumps in 64 byte increments, even when Nx incrementally increases by 1 all the way to 320 (I didn't show all of those Nx values). Note also that Nx = 320 (max) is arbitrary for this discussion.
So from the perspective of allocating and mapping memory for a byte array, how are the "bytes per row" calculated from first principles? Is the result above so the data from a single scan-line can be aligned on a "word" length boundary (64 bit on my MacBook Pro)?
Thanks for any insights, having trouble picturing how this works.
Passing 0 for bytesPerRow: means more than you said in your comment. From the documentation:
If you pass in a rowBytes value of 0, the bitmap data allocated may be padded to fall on long word or larger boundaries for performance. … Passing in a non-zero value allows you to specify exact row advances.
So you're seeing it increase by 64 bytes at a time because that's how AppKit decided to round it up.
The minimum requirement for bytes per row is much simpler. It's bytes per pixel times pixels per row. That's all.
For a bitmap image rep backed by floats, you'd pass sizeof(float) * 8 for bitsPerSample, and bytes-per-pixel would be sizeof(float) * samplesPerPixel. Bytes-per-row follows from that; you multiply bytes-per-pixel by the width in pixels.
Likewise, if it's backed by unsigned bytes, you'd pass sizeof(unsigned char) * 8 for bitsPerSample, and bytes-per-pixel would be sizeof(unsigned char) * samplesPerPixel.
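The two quantities can be compared side by side in a short sketch. This is a Python illustration with hypothetical function names; the 64-byte alignment is only inferred from the output shown in the question and is not a documented AppKit guarantee.

```python
def min_bytes_per_row(pixels_wide, bits_per_sample, samples_per_pixel):
    """Tightly packed row size for meshed (non-planar) data, in bytes."""
    bits_per_pixel = bits_per_sample * samples_per_pixel
    return (pixels_wide * bits_per_pixel + 7) // 8  # round bits up to whole bytes


def padded_bytes_per_row(pixels_wide, bits_per_sample, samples_per_pixel, alignment=64):
    """Round the minimum up to a multiple of `alignment` (assumed value)."""
    minimum = min_bytes_per_row(pixels_wide, bits_per_sample, samples_per_pixel)
    return ((minimum + alignment - 1) // alignment) * alignment


for nx in (0, 1, 64, 65, 128, 300):
    print(nx, min_bytes_per_row(nx, 8, 1), padded_bytes_per_row(nx, 8, 1))
```

When copying your own data into the rep, advance by the rep's reported bytesPerRow per scan line rather than by the tightly packed minimum, since the two can differ.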

How to randomly fill a space in one dimension?

I would like to know how I can randomly fill a space given a set number of items and a target size. For example, given number of columns = 15 and a target width = 320, how can I randomly distribute the column widths to fill the space, as shown in the image below? If possible, any sort of pseudo-code or algorithm will do.
One way to partition your 320 pixels into 15 random "columns" is to do it uniformly, i.e., every column width follows the same distribution.
For this, you actually need a uniform distribution on the simplex. The first way to achieve this is the one described by yi_H, and is probably the way to go:
Generate 14 uniform integers between 0 and 320.
Keep regenerating any number that has already been chosen, so that you end up with 14 distinct numbers
Sort them
Your column bounds are given by each pair of consecutive sorted numbers (with 0 and 320 as the outer bounds).
If you have a minimum width requirement (e.g., 1 for non-empty columns), remove it 15 times from your 320 pixels, generate the numbers in the new range and make the necessary adjustments.
The second way to achieve a uniform point on a simplex is a bit more involved, and not very well suited to discrete settings such as pixels, but here it is in brief anyway:
Generate 15 exponential random variables with the same rate parameter (e.g. 1)
Divide each number by the total, so that each is in [0,1]
Rescale those numbers by multiplying them by 320, and round them. These are your column widths
This is not as nice as the first way, since with the rounding you may end with a total bigger or smaller than 320, and you may have columns with 0 width... The only advantage is that you don't need to perform any sort (but you have to compute logarithms... so all in all, the first way is the way to go).
I should add that if you do not necessarily want uniform random filling, then you have a lot more algorithms at your disposal.
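A minimal Python sketch of the first approach, including the minimum-width adjustment, could look like this. The function name `random_column_widths` is made up for illustration, and duplicate cut points are simply allowed instead of regenerated, so the result is only approximately uniform over integer partitions.

```python
import random


def random_column_widths(total, n, min_width=1, rng=random):
    """Split `total` into `n` random integer widths, each >= min_width."""
    slack = total - n * min_width
    if slack < 0:
        raise ValueError("min_width is too large for the requested total")
    # n - 1 cut points in the reduced range; duplicates are harmless here
    # because the minimum width is added back to every column afterwards.
    cuts = sorted(rng.randint(0, slack) for _ in range(n - 1))
    bounds = [0] + cuts + [slack]
    return [b - a + min_width for a, b in zip(bounds, bounds[1:])]


widths = random_column_widths(total=320, n=15, min_width=2)
print(widths, sum(widths))
```

The widths always sum to the target because the gaps between consecutive bounds add up to the reduced range, and the removed minimum widths are restored one per column.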
Edit: Here is a quick implementation of the first algorithm in Mathematica. Note that in order to avoid generating points until they are all different, you can just consider that an empty column has a width of 1, and then a minimum width of 2 will give you columns with non-empty interior:
min = 2;
total = 320;
height = 50;
n = 15;
x = Sort[RandomInteger[total - n*min - 1, n - 1]] + Range[n - 1]*min
Graphics[{Rectangle[{-2, 0}, {0, height}], (*left margin*)
Rectangle[{#, 0}, {# + 1, height}] & /@ x, (*columns borders*)
Rectangle[{total, 0}, {total + 2, height}]}, (*right margin*)
PlotRange -> {{-2, total + 2}, {0, height}},
ImageSize -> {total + 4, height}]
which gives the following example output:
Edit: Here is the modified JavaScript algorithm (beware, I have never written JavaScript before, so there might be some errors or poor style):
function sortNumber(a,b) {
    return a - b;
}

function draw() {
    var canvas = document.getElementById( "myCanvas" );
    var numberOfStrips = 15;
    var initPosX = 10;
    var initPosY = 10;
    var width = 320;
    var height = 240;
    var minColWidth = 2;
    var reducedWidth = width - numberOfStrips * minColWidth;
    var separators = new Array();
    // Random cut points in the reduced range
    for ( var n = 0; n < numberOfStrips - 1; n++ ) {
        separators[n] = Math.floor(Math.random() * reducedWidth);
    }
    // Sort numerically before adding the minimum widths back in
    separators.sort(sortNumber);
    for ( var n = 0; n < numberOfStrips - 1; n++ ) {
        separators[n] += (n+1) * minColWidth;
    }
    if ( canvas.getContext ) {
        var ctx = canvas.getContext( "2d" );
        // Draw lines
        ctx.lineWidth = 1;
        ctx.strokeStyle = "rgb( 120, 120, 120 )";
        ctx.beginPath();
        for ( var n = 0; n < numberOfStrips - 1; n++ ) {
            var newPosX = separators[n];
            ctx.moveTo( initPosX + newPosX, initPosY );
            ctx.lineTo( initPosX + newPosX, initPosY + height );
        }
        ctx.stroke();
        // Draw enclosing rectangle
        ctx.lineWidth = 4;
        ctx.strokeStyle = "rgb( 0, 0, 0 )";
        ctx.strokeRect( initPosX, initPosY, width, height );
    }
}
Additionally, note that minColWidth should not be bigger than a certain value (reducedWidth should not be negative...), but this is not checked in the algorithm. As stated before, use a value of 0 if you don't mind two lines on top of one another, a value of 1 if you don't mind two lines next to each other, and a value of 2 or more if you want non-empty columns only.
Create 14 unique numbers in the range (0,320). Those will be the x positions of the bars.
Create a random number, compare it with the previous ones, and store it if it is new.
If consecutive lines aren't allowed, also check that it doesn't equal any previous number ± 1.

Viola-Jones' face detection claims 180k features

I've been implementing an adaptation of Viola-Jones' face detection algorithm. The technique relies upon placing a subframe of 24x24 pixels within an image, and subsequently placing rectangular features inside it in every position with every size possible.
These features can consist of two, three or four rectangles. The following example is presented.
They claim the exhaustive set is more than 180k (section 2):
Given that the base resolution of the detector is 24x24, the exhaustive set of rectangle features is quite large, over 180,000. Note that unlike the Haar basis, the set of rectangle features is overcomplete.
The following statements are not explicitly stated in the paper, so they are assumptions on my part:
1. There are only 2 two-rectangle features, 2 three-rectangle features and 1 four-rectangle feature. The logic behind this is that we are observing the difference between the highlighted rectangles, not explicitly the color or luminance or anything of that sort.
2. We cannot define feature type A as a 1x1 pixel block; it must be at least 1x2 pixels. Also, type D must be at least 2x2 pixels, and this rule holds accordingly for the other features.
3. We cannot define feature type A as a 1x3 pixel block as the middle pixel cannot be partitioned, and subtracting it from itself is identical to a 1x2 pixel block; this feature type is only defined for even widths. Also, the width of feature type C must be divisible by 3, and this rule holds accordingly for the other features.
4. We cannot define a feature with a width and/or height of 0. Therefore, we iterate x and y to 24 minus the size of the feature.
Based upon these assumptions, I've counted the exhaustive set:
const int frameSize = 24;
const int features = 5;
// All five feature types:
const int feature[features][2] = {{2,1}, {1,2}, {3,1}, {1,3}, {2,2}};

int count = 0;
// Each feature:
for (int i = 0; i < features; i++) {
    int sizeX = feature[i][0];
    int sizeY = feature[i][1];
    // Each position:
    for (int x = 0; x <= frameSize-sizeX; x++) {
        for (int y = 0; y <= frameSize-sizeY; y++) {
            // Each size fitting within the frameSize:
            for (int width = sizeX; width <= frameSize-x; width+=sizeX) {
                for (int height = sizeY; height <= frameSize-y; height+=sizeY) {
                    count++;
                }
            }
        }
    }
}
The result is 162,336.
The only way I found to approximate the "over 180,000" Viola & Jones speak of, is dropping assumption #4 and by introducing bugs in the code. This involves changing four lines respectively to:
for (int width = 0; width < frameSize-x; width+=sizeX)
for (int height = 0; height < frameSize-y; height+=sizeY)
The result is then 180,625. (Note that this will effectively prevent the features from ever touching the right and/or bottom of the subframe.)
Now of course the question: have they made a mistake in their implementation? Does it make any sense to consider features with a surface of zero? Or am I seeing it the wrong way?
Upon closer look, your code looks correct to me; which makes one wonder whether the original authors had an off-by-one bug. I guess someone ought to look at how OpenCV implements it!
Nonetheless, one suggestion to make it easier to understand is to flip the order of the for loops by going over all sizes first, then looping over the possible locations given the size:
#include <stdio.h>
int main()
{
    int i, x, y, sizeX, sizeY, width, height, count, c;
    /* All five shape types */
    const int features = 5;
    const int feature[][2] = {{2,1}, {1,2}, {3,1}, {1,3}, {2,2}};
    const int frameSize = 24;

    count = 0;
    /* Each shape */
    for (i = 0; i < features; i++) {
        sizeX = feature[i][0];
        sizeY = feature[i][1];
        printf("%dx%d shapes:\n", sizeX, sizeY);
        /* each size (multiples of basic shapes) */
        for (width = sizeX; width <= frameSize; width+=sizeX) {
            for (height = sizeY; height <= frameSize; height+=sizeY) {
                printf("\tsize: %dx%d => ", width, height);
                c = count; /* remember the running total before this size */
                /* each possible position given size */
                for (x = 0; x <= frameSize-width; x++) {
                    for (y = 0; y <= frameSize-height; y++) {
                        count++;
                    }
                }
                printf("count: %d\n", count-c);
            }
        }
    }
    printf("%d\n", count);
    return 0;
}
with the same results as the previous 162336
To verify it, I tested the case of a 4x4 window and manually checked all cases (easy to count since 1x2/2x1 and 1x3/3x1 shapes are the same only 90 degrees rotated):
2x1 shapes:
        size: 2x1 => count: 12
        size: 2x2 => count: 9
        size: 2x3 => count: 6
        size: 2x4 => count: 3
        size: 4x1 => count: 4
        size: 4x2 => count: 3
        size: 4x3 => count: 2
        size: 4x4 => count: 1
1x2 shapes:
        size: 1x2 => count: 12         +-----------------------+
        size: 1x4 => count: 4          |     |     |     |     |
        size: 2x2 => count: 9          |     |     |     |     |
        size: 2x4 => count: 3          +-----+-----+-----+-----+
        size: 3x2 => count: 6          |     |     |     |     |
        size: 3x4 => count: 2          |     |     |     |     |
        size: 4x2 => count: 3          +-----+-----+-----+-----+
        size: 4x4 => count: 1          |     |     |     |     |
3x1 shapes:                            |     |     |     |     |
        size: 3x1 => count: 8          +-----+-----+-----+-----+
        size: 3x2 => count: 6          |     |     |     |     |
        size: 3x3 => count: 4          |     |     |     |     |
        size: 3x4 => count: 2          +-----------------------+
1x3 shapes:
        size: 1x3 => count: 8              Total Count = 136
        size: 2x3 => count: 6
        size: 3x3 => count: 4
        size: 4x3 => count: 2
2x2 shapes:
        size: 2x2 => count: 9
        size: 2x4 => count: 3
        size: 4x2 => count: 3
        size: 4x4 => count: 1
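The per-shape totals in this listing can also be cross-checked without nested position loops, since a feature of a given width and height has (window - width + 1) * (window - height + 1) placements. The following Python sketch (the function name `count_features` is made up for illustration) applies that to both the 4x4 test window and the 24x24 detector window:

```python
def count_features(window, shapes):
    """Count placements of every integer multiple of each base shape."""
    per_shape = {}
    for sx, sy in shapes:
        n = 0
        for width in range(sx, window + 1, sx):
            for height in range(sy, window + 1, sy):
                # Number of top-left positions where this size still fits.
                n += (window - width + 1) * (window - height + 1)
        per_shape[(sx, sy)] = per_shape.get((sx, sy), 0) + n
    return per_shape


SHAPES = [(2, 1), (1, 2), (3, 1), (1, 3), (2, 2)]
for window in (4, 24):
    counts = count_features(window, SHAPES)
    print(window, counts, sum(counts.values()))
```

For the 4x4 window this gives 40 + 40 + 20 + 20 + 16 = 136, and for the 24x24 window the same 162336 as the loops above.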
There is still some confusion in Viola and Jones' papers.
In their CVPR'01 paper it is clearly stated that
"More specifically, we use three kinds of features. The value of a two-rectangle feature is the difference between the sum of the pixels within two rectangular regions. The regions have the same size and shape and are horizontally or vertically adjacent (see Figure 1). A three-rectangle feature computes the sum within two outside rectangles subtracted from the sum in a center rectangle. Finally a four-rectangle feature".
In the IJCV'04 paper, exactly the same thing is said. So altogether, 4 features. But strangely enough, they stated this time that the exhaustive feature set is 45396! That does not seem to be the final version. Here I guess that some additional constraints were introduced, such as min_width, min_height, width/height ratio, and even position.
Note that both papers are downloadable on his webpage.
Having not read the whole paper, the wording of your quote sticks out at me:
Given that the base resolution of the detector is 24x24, the exhaustive set of rectangle features is quite large, over 180,000. Note that unlike the Haar basis, the set of rectangle features is overcomplete.
"The set of rectangle features is overcomplete"
"Exhaustive set"
It sounds to me like a setup, where I expect the paper's authors to follow up with an explanation of how they cull the search space down to a more effective set, for example by getting rid of trivial cases such as rectangles with zero surface area.
Edit: or by using some kind of machine learning algorithm, as the abstract hints at. "Exhaustive set" implies all possibilities, not just "reasonable" ones.
There is no guarantee that any author of any paper is correct in all their assumptions and findings. If you think that assumption #4 is valid, then keep that assumption, and try out your theory. You may be more successful than the original authors.
Quite good observation, but they might implicitly zero-pad the 24x24 frame, or "overflow" and start using first pixels when it gets out of bounds, as in rotational shifts, or as Breton said they might consider some features as "trivial features" and then discard them with the AdaBoost.
In addition, I wrote Python and Matlab versions of your code so I could test it myself (easier for me to debug and follow), and I post them here in case anyone finds them useful sometime.
frameSize = 24
features = 5
# All five feature types:
feature = [[2,1], [1,2], [3,1], [1,3], [2,2]]

count = 0
# Each feature:
for i in range(features):
    sizeX = feature[i][0]
    sizeY = feature[i][1]
    # Each position:
    for x in range(frameSize-sizeX+1):
        for y in range(frameSize-sizeY+1):
            # Each size fitting within the frameSize:
            for width in range(sizeX,frameSize-x+1,sizeX):
                for height in range(sizeY,frameSize-y+1,sizeY):
                    count += 1
print(count)
frameSize = 24;
features = 5;
% All five feature types:
feature = [[2,1]; [1,2]; [3,1]; [1,3]; [2,2]];

count = 0;
% Each feature:
for ii = 1:features
    sizeX = feature(ii,1);
    sizeY = feature(ii,2);
    % Each position:
    for x = 0:frameSize-sizeX
        for y = 0:frameSize-sizeY
            % Each size fitting within the frameSize:
            for width = sizeX:sizeX:frameSize-x
                for height = sizeY:sizeY:frameSize-y
                    count = count + 1;
                end
            end
        end
    end
end
disp(count)
In their original 2001 paper they only state that they used three kinds of features:
we use three kinds of features
with two, three and four rectangles respectively.
Since each kind has two orientations (that differ by 90 degrees), perhaps for the computation of the total number of features they used 2*3 types of features: 2 two-rectangle features, 2 three-rectangle features and 2 four-rectangle features. With this assumption there are indeed over 180,000 features:
feature_types = [(1,2), (2,1), (1,3), (3,1), (2,2), (2,2)]
window_size = (24,24)

total_features = 0
for f_type in feature_types:
    for f_height in range(f_type[0], window_size[0] + 1, f_type[0]):
        for f_width in range(f_type[1], window_size[1] + 1, f_type[1]):
            total_features += (window_size[0] - f_height + 1) * (window_size[1] - f_width + 1)

print(total_features)
# 183072
The second four-rectangle feature differs from the first only by a sign, so there is no need to keep it and if we drop it then the total number of features reduces to 162,336.
