How "bytesPerRow" is calculated from an NSBitmapImageRep - cocoa

I would like to understand how "bytesPerRow" is calculated when building up an NSBitmapImageRep (in my case from mapping an array of floats to a grayscale bitmap).
Clarifying this detail will help me understand how memory is being mapped from an array of floats to a byte array (0-255, unsigned char; neither of these arrays is shown in the code below).
The Apple documentation says that this number is calculated "from the width of the image, the number of bits per sample, and, if the data is in a meshed configuration, the number of samples per pixel."
I had trouble following this "calculation", so I set up a simple loop to find the results empirically. The following code runs just fine:
int Ny = 1; // Ny is arbitrary; note that bytesPerPlane is calculated as we would expect: Ny*bytesPerRow
for (int Nx = 0; Nx < 320; Nx += 64) {
    // greyscale image representation:
    NSBitmapImageRep *dataBitMapRep = [[NSBitmapImageRep alloc]
        initWithBitmapDataPlanes: nil // allocate the pixel buffer for us
                      pixelsWide: Nx
                      pixelsHigh: Ny
                   bitsPerSample: 8
                 samplesPerPixel: 1
                        hasAlpha: NO
                        isPlanar: NO
                  colorSpaceName: NSCalibratedWhiteColorSpace // 0 = black, 1 = white
                     bytesPerRow: 0 // 0 means "you figure it out"
                    bitsPerPixel: 8]; // must agree with bitsPerSample * samplesPerPixel
    long rowBytes = [dataBitMapRep bytesPerRow];
    printf("Nx = %d; bytes per row = %lu \n", Nx, rowBytes);
}
and produces the result:
Nx = 0; bytes per row = 0
Nx = 64; bytes per row = 64
Nx = 128; bytes per row = 128
Nx = 192; bytes per row = 192
Nx = 256; bytes per row = 256
So we see that the bytes per row jumps in 64-byte increments, even when Nx increases by just 1 all the way up to 320 (I haven't shown all of those Nx values). Note also that the maximum of Nx = 320 is arbitrary for this discussion.
So, from the perspective of allocating and mapping memory for a byte array, how are the "bytes per row" calculated from first principles? Is the result above padded so that the data for a single scan line can be aligned on a "word"-length boundary (64-bit on my MacBook Pro)?
Thanks for any insights; I'm having trouble picturing how this works.

Passing 0 for bytesPerRow: means more than you said in your comment. From the documentation:
If you pass in a rowBytes value of 0, the bitmap data allocated may be padded to fall on long word or larger boundaries for performance. … Passing in a non-zero value allows you to specify exact row advances.
So you're seeing it increase by 64 bytes at a time because that's how AppKit decided to round it up.
The minimum requirement for bytes per row is much simpler. It's bytes per pixel times pixels per row. That's all.
For a bitmap image rep backed by floats, you'd pass sizeof(float) * 8 for bitsPerSample, and bytes-per-pixel would be sizeof(float) * samplesPerPixel. Bytes-per-row follows from that; you multiply bytes-per-pixel by the width in pixels.
Likewise, if it's backed by unsigned bytes, you'd pass sizeof(unsigned char) * 8 for bitsPerSample, and bytes-per-pixel would be sizeof(unsigned char) * samplesPerPixel.
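As a concrete illustration, here is that arithmetic in plain C (this is just the calculation, not AppKit code; the 64-byte rounding simply mirrors what AppKit happened to choose in the experiment above and is not something to rely on):
#include <stdio.h>

int main(void)
{
    // Same parameters as the question's grayscale rep: 8 bits per sample, 1 sample per pixel.
    int bitsPerSample   = 8;
    int samplesPerPixel = 1;
    int bytesPerPixel   = (bitsPerSample / 8) * samplesPerPixel;

    for (int Nx = 0; Nx <= 320; Nx += 50) {
        int minBytesPerRow    = bytesPerPixel * Nx;                // the hard requirement
        int paddedBytesPerRow = ((minBytesPerRow + 63) / 64) * 64; // what AppKit chose here
        printf("Nx = %3d; minimum = %3d; padded = %3d\n", Nx, minBytesPerRow, paddedBytesPerRow);
    }

    // Float-backed example from the answer: sizeof(float) * 8 = 32 bits per sample.
    int floatBytesPerRow = (int)sizeof(float) * samplesPerPixel * 320;
    printf("float-backed, width 320: minimum bytes per row = %d\n", floatBytesPerRow);
    return 0;
}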

Related

Turn off sw_scale conversion to planar YUV 32 byte alignment requirements

I am experiencing artifacts on the right edge of scaled and converted images when converting into planar YUV pixel formats with sw_scale. I am reasonably sure (although I cannot find it anywhere in the documentation) that this is because sw_scale is using an optimization for 32-byte-aligned lines in the destination. However, I would like to turn this off because I am using sw_scale for image composition, so even though the destination lines may be 32-byte aligned, the output image may not be.
Example.
Full output frame is 1280x720 yuv422p10le (this is 32-byte aligned).
However, into the top left corner I am scaling an image with an output width of 1280 / 3 = 426.
426 in this format is not 32-byte aligned, but I believe sw_scale sees that the output linesize is 32-byte aligned and writes past the width of 426, putting garbage in the next 22 bytes of data, thinking this is simply padding, when in my case it is displayable area.
This is why I need to actually disable this optimization, or somehow trick sw_scale into believing it does not apply, while keeping intact the way the program works, which is otherwise fine.
I have tried adding extra padding to the destination lines so they are no longer 32-byte aligned; this did not help as far as I can tell.
Edit with code example (rendering omitted for ease of use).
Also, here is a similar issue; unfortunately, as I stated there, the fix will not work for my use case: https://github.com/obsproject/obs-studio/pull/2836
Use the commented-out line of code to switch between an output width that is and isn't 32-byte aligned.
#include "libswscale/swscale.h"
#include "libavutil/imgutils.h"
#include "libavutil/pixelutils.h"
#include "libavutil/pixfmt.h"
#include "libavutil/pixdesc.h"
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char **argv) {
/// Set up a 1280x720 window, and an item with 1/3 width and height of the window.
int window_width, window_height, item_width, item_height;
window_width = 1280;
window_height = 720;
item_width = (window_width / 3);
item_height = (window_height / 3);
int item_out_width = item_width;
/// This line sets the item width to be 32 byte aligned uncomment to see uncorrupted results
/// Note %16 because outformat is 2 bytes per component
//item_out_width -= (item_width % 16);
enum AVPixelFormat outformat = AV_PIX_FMT_YUV422P10LE;
enum AVPixelFormat informat = AV_PIX_FMT_UYVY422;
int window_lines[4] = {0};
av_image_fill_linesizes(window_lines, outformat, window_width);
uint8_t *window_planes[4] = {0};
window_planes[0] = calloc(1, window_lines[0] * window_height);
window_planes[1] = calloc(1, window_lines[1] * window_height);
window_planes[2] = calloc(1, window_lines[2] * window_height); /// Fill the window with all 0s, this is green in yuv.
int item_lines[4] = {0};
av_image_fill_linesizes(item_lines, informat, item_width);
uint8_t *item_planes[4] = {0};
item_planes[0] = malloc(item_lines[0] * item_height);
memset(item_planes[0], 100, item_lines[0] * item_height);
struct SwsContext *ctx;
ctx = sws_getContext(item_width, item_height, informat,
item_out_width, item_height, outformat, SWS_FAST_BILINEAR, NULL, NULL, NULL);
/// Check a block in the normal region
printf("Pre scale normal region %d %d %d\n", (int)((uint16_t*)window_planes[0])[0], (int)((uint16_t*)window_planes[1])[0],
(int)((uint16_t*)window_planes[2])[0]);
/// Check a block in the corrupted region (should be all zeros) These values should be out of the converted region
int corrupt_offset_y = (item_out_width + 3) * 2; ///(item_width + 3) * 2 bytes per component Y PLANE
int corrupt_offset_uv = (item_out_width + 3); ///(item_width + 3) * (2 bytes per component rshift 1 for horiz scaling) U and V PLANES
printf("Pre scale corrupted region %d %d %d\n", (int)(*((uint16_t*)(window_planes[0] + corrupt_offset_y))),
(int)(*((uint16_t*)(window_planes[1] + corrupt_offset_uv))), (int)(*((uint16_t*)(window_planes[2] + corrupt_offset_uv))));
sws_scale(ctx, (const uint8_t**)item_planes, item_lines, 0, item_height,window_planes, window_lines);
/// Preform same tests after scaling
printf("Post scale normal region %d %d %d\n", (int)((uint16_t*)window_planes[0])[0], (int)((uint16_t*)window_planes[1])[0],
(int)((uint16_t*)window_planes[2])[0]);
printf("Post scale corrupted region %d %d %d\n", (int)(*((uint16_t*)(window_planes[0] + corrupt_offset_y))),
(int)(*((uint16_t*)(window_planes[1] + corrupt_offset_uv))), (int)(*((uint16_t*)(window_planes[2] + corrupt_offset_uv))));
return 0;
}
Example Output:
//No alignment
Pre scale normal region 0 0 0
Pre scale corrupted region 0 0 0
Post scale normal region 400 400 400
Post scale corrupted region 512 36865 36865
//With alignment
Pre scale normal region 0 0 0
Pre scale corrupted region 0 0 0
Post scale normal region 400 400 400
Post scale corrupted region 0 0 0
I believe sw_scale sees that the output linesize is 32 byte aligned and overwrites the width of 426 putting garbage in the next 22 bytes of data thinking this is simply padding when in my case this is displayable area.
That's actually correct, swscale indeed does that; good analysis. There are two ways to get rid of this:
1. Disable all SIMD code using av_set_cpu_flags_mask(0).
2. Write the re-scaled 426xN image to a temporary buffer and then manually copy the pixels into the unpadded destination plane.
The reason ffmpeg/swscale overwrite the destination is for performance. If you don't care about runtime and want the simplest code, use the first solution. If you do want performance and don't mind slightly more complicated code, use the second solution.
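For the second solution, here is a rough, untested sketch that reuses the names from the question's example (item_out_width, item_height, outformat, ctx, item_planes, item_lines, window_planes, window_lines); the 32-byte rounding mirrors the question's assumption, and error handling is omitted:
/* Let swscale write into padded temporary planes, then copy only the
 * displayable part of each row into the composited window planes. */
int tmp_lines[4] = {0};
av_image_fill_linesizes(tmp_lines, outformat, item_out_width);
for (int i = 0; i < 4; i++)
    tmp_lines[i] = (tmp_lines[i] + 31) & ~31; /* round each linesize up to 32 bytes */

uint8_t *tmp_planes[4] = {0};
tmp_planes[0] = calloc(1, tmp_lines[0] * item_height);
tmp_planes[1] = calloc(1, tmp_lines[1] * item_height);
tmp_planes[2] = calloc(1, tmp_lines[2] * item_height);

sws_scale(ctx, (const uint8_t**)item_planes, item_lines, 0, item_height,
          tmp_planes, tmp_lines);

/* yuv422p10le: 2 bytes per sample, chroma subsampled by 2 horizontally,
 * so the visible byte width is item_out_width*2 for Y and item_out_width for U/V. */
av_image_copy_plane(window_planes[0], window_lines[0],
                    tmp_planes[0], tmp_lines[0], item_out_width * 2, item_height);
av_image_copy_plane(window_planes[1], window_lines[1],
                    tmp_planes[1], tmp_lines[1], item_out_width, item_height);
av_image_copy_plane(window_planes[2], window_lines[2],
                    tmp_planes[2], tmp_lines[2], item_out_width, item_height);
/* free(tmp_planes[0..2]) when done */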

How are global and local work sizes distributed in this function?

This is from a sample program for OpenCL programming.
I am confused about how global and local work size are computed.
They are computed based on the image size.
Image size is 1920 x 1080 (w x h).
What I assumed is that global_work_size[0] and global_work_size[1] define a grid over the image.
But now global_work_size is {128, 1088}.
Then local_work_size[0] and local_work_size[1] define a grid within global_work_size.
local_work_size is {128, 32}.
But the total number of groups, num_groups, is 34; that is not 128 x 1088.
The maximum workgroup_size available on the device is 4096.
How is the image distributed into such global and local work group sizes?
They are calculated in the following code.
clGetKernelWorkGroupInfo(histogram_rgba_unorm8, device, CL_KERNEL_WORK_GROUP_SIZE, sizeof(size_t), &workgroup_size, NULL);
{
    size_t gsize[2];
    int w;

    if (workgroup_size <= 256)
    {
        gsize[0] = 16;                  // workgroup_size is formed into rows & columns
        gsize[1] = workgroup_size / 16;
    }
    else if (workgroup_size <= 1024)
    {
        gsize[0] = workgroup_size / 16;
        gsize[1] = 16;
    }
    else
    {
        gsize[0] = workgroup_size / 32;
        gsize[1] = 32;
    }

    local_work_size[0] = gsize[0];
    local_work_size[1] = gsize[1];

    w = (image_width + num_pixels_per_work_item - 1) / num_pixels_per_work_item; // round up so all pixels are included
    global_work_size[0] = ((w + gsize[0] - 1) / gsize[0]);            // group columns, for now
    global_work_size[1] = ((image_height + gsize[1] - 1) / gsize[1]); // group rows, for now

    num_groups = global_work_size[0] * global_work_size[1];
    global_work_size[0] *= gsize[0];
    global_work_size[1] *= gsize[1];
}

err = clEnqueueNDRangeKernel(queue, histogram_rgba_unorm8, 2, NULL, global_work_size, local_work_size, 0, NULL, NULL);
if (err)
{
    printf("clEnqueueNDRangeKernel() failed for histogram_rgba_unorm8 kernel. (%d)\n", err);
    return EXIT_FAILURE;
}
I don't see any great mystery here. If you follow the calculation, the values do indeed end up as you say. (Not that the group size is particularly efficient in my opinion.)
If workgroup_size is indeed 4096, gsize will end up as { 128, 32 }, since it follows the else branch (workgroup_size > 1024).
w is the number of num_pixels_per_work_item = 32 wide columns, or the minimum number of work-items to cover the entire width, which for an image width of 1920 is 60. In other words, we require an absolute minimum of 60 x 1080 work-items to cover the entire image.
Next, the number of group columns and rows is calculated and temporarily stored in global_work_size. As group width has been set to 128, a w of 60 means we end up with 1 column of groups. (This seems a waste of resources, more than half of the 128 work-items in each group will not be doing anything.) The number of group rows is simply image_height divided by gsize[1] (32) and rounding up. (33.75 -> 34)
Total number of groups can now be determined by multiplying out the grid: num_groups = global_work_size[0] * global_work_size[1]
To get the true total number of work-items in each dimension, each dimension of global_work_size is now multiplied by the group size in this dimension. 1, 34 multiplied by 128, 32 yields 128, 1088.
This actually covers an area of 4096 x 1088 pixels so about 53% of that is wastage. This is mainly because the algorithm for group dimensions favours wide groups, and each work-item works on a 32x1 pixel slice of the image. It would be better to favour tall work groups to reduce the amount of rounding.
For example, if we reverse gsize[0] and gsize[1], in this case we'd get a group size of { 32, 128 }, giving us a global work size of { 64, 1152 } and only 12% wastage. It would also be worth checking if always picking the largest possible group size is even a good idea; it quite possibly isn't, but I've not looked into the kernel's computation in detail, let alone run any measurements, to say if that's the case or not.
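To make the group-size arithmetic above concrete, here is the same calculation with the dimensions swapped (plain C; variable names mirror the question's code, but this is only an illustration of the rounding, not code from the sample):
size_t workgroup_size = 4096;
size_t gsize[2] = { 32, workgroup_size / 32 };   /* {32, 128}: narrow, tall groups */

int num_pixels_per_work_item = 32;
int image_width = 1920, image_height = 1080;

/* minimum work-items needed across the width: ceil(1920 / 32) = 60 */
int w = (image_width + num_pixels_per_work_item - 1) / num_pixels_per_work_item;

size_t global_work_size[2];
global_work_size[0] = (w + gsize[0] - 1) / gsize[0];             /* 2 group columns  */
global_work_size[1] = (image_height + gsize[1] - 1) / gsize[1];  /* 9 group rows     */

int num_groups = (int)(global_work_size[0] * global_work_size[1]); /* 18 groups       */

global_work_size[0] *= gsize[0];   /* 64   work-items wide */
global_work_size[1] *= gsize[1];   /* 1152 work-items tall */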

Matlab: Reorder 5 bytes to 4 times 10 bits RAW10

I would like to import a RAW10 file into MATLAB. The raw data is attached directly to the JPEG file produced by the Raspberry Pi camera.
Four pixels are saved as 5 bytes.
Each of the first four bytes contains bits 9-2 of one pixel.
The last byte contains the missing LSBs (the two low bits of each of the four pixels).
sizeRAW = 6404096;
sizeHeader = 32768;
I = ones(1944,2592);

fin = fopen('0.jpeg','r');
off1 = dir('0.jpeg');
offset = off1.bytes - sizeRAW + sizeHeader;
fseek(fin, offset, 'bof');

pixel = ones(1944,2592);
I = fread(fin,1944,'ubit10','l');
for col = 1:2592
    I(:,col) = fread(fin,1944,'ubit8','l');
    col = col+4;
end
fclose(fin);
This is as far as I have gotten, but it's not right.
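For reference, here is a minimal sketch in plain C (not MATLAB code, and not a complete answer) of how one 5-byte group could be unpacked into four 10-bit values, assuming the packing described above; the bit order within the fifth byte is an assumption and may need to be checked against the camera's actual output.
#include <stdint.h>

/* Unpack one RAW10 group: bytes 0..3 hold bits 9..2 of pixels 0..3,
 * byte 4 holds the two missing LSBs of each pixel (assumed order:
 * pixel 0 in the lowest two bits). */
static void unpack_raw10_group(const uint8_t in[5], uint16_t out[4])
{
    for (int k = 0; k < 4; k++) {
        uint16_t high = (uint16_t)in[k] << 2;                  /* bits 9..2 */
        uint16_t low  = (uint16_t)((in[4] >> (2 * k)) & 0x3);  /* bits 1..0 */
        out[k] = high | low;
    }
}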

Sampling pixels from an Image - increasing performance

I am sampling some pixels from a reference image Ir and then transferring them onto a secondary image In. The first function I have written is as follows:
[r,c,d] = size(Ir);
rSample = fix(r * 0.4);     % sample 40 percent of pixels
cSample = fix(c * 0.4);     % sample 40 percent of pixels
rIdx = randi(r,rSample,1);  % uniformly sample indices for rows
cIdx = randi(c,cSample,1);  % uniformly sample indices for columns
kk = 1;
for ii = 1:length(rIdx)
    for jj = 1:length(cIdx)
        In(rIdx(ii),cIdx(jj),:) = Ir(rIdx(ii),cIdx(jj),:) * fcn(rIdx(ii),cIdx(jj));
        kk = kk + 1;
    end
end
Another method I came across to increase the performance (speed) of the code is as follows:
nSample = fix(r*c*0.4);
Idx = randi(r*c,nSample,1);
for ii = 1:nSample
    [I,J] = ind2sub([r,c],Idx(ii,1));
    In(I,J,:) = Ir(I,J,:) * fcn(I,J);
end
In both codes, fcn(I,J) is a function that performs some computation on the pixel at [I,J] and the process can be different depending on the indices of the pixel.
Although I have removed one for-loop, I guess there is a better technique to increase the performance of the code even more.
Update:
As suggested by @Daniel, the following line of code does the job.
In(rIdx,cIdx,:) = Ir(rIdx,cIdx,:);
But the point is, I prefer to have only the sampled pixels so I can process them faster, for instance having the samples in vector format with 3 layers for RGB.
Io = Ir(rIdx,cIdx,:);
Io1 = Io(:,:,1);
Io1v = Io1(:);
Ir=ones(30,30,3);
In=Ir*.5;
[r,c,d] = size(Ir);
rSamples = fix(r * 0.4); % sample 40 percent of pixels
cSamples = fix(c * 0.4); % sample 40 percent of pixels
rIdx = randi(r,rSamples,1); % uniformly sample indices for rows
cIdx = randi(c,cSamples,1); % uniformly sample indices for columns
In(rIdx,cIdx,:)=Ir(rIdx,cIdx,:);

Indexing pixels in a monochrome FreeType glyph buffer

I want to translate a monochrome FreeType glyph to an RGBA unsigned byte OpenGL texture. The colour of the texture at pixel (x, y) would be (255, 255, alpha), where
alpha = glyph->bitmap.buffer[pixelIndex(x, y)] * 255
I load my glyph using
FT_Load_Char(face, glyphChar, FT_LOAD_RENDER | FT_LOAD_MONOCHROME | FT_LOAD_TARGET_MONO)
The target texture has dimensions of glyph->bitmap.width * glyph->bitmap.rows. I've been able to index a greyscale glyph (loaded using just FT_Load_Char(face, glyphChar, FT_LOAD_RENDER)) with
glyph->bitmap.buffer[(glyph->bitmap.width * y) + x]
This does not appear to work on a monochrome buffer, though, and the characters in my final texture are scrambled.
What is the correct way to get the value of pixel (x, y) in a monochrome glyph buffer?
Based on this thread I started on Gamedev.net, I've come up with the following function to get the filled/empty state of the pixel at (x, y):
bool glyphBit(const FT_GlyphSlot &glyph, const int x, const int y)
{
    int pitch = abs(glyph->bitmap.pitch);
    unsigned char *row = &glyph->bitmap.buffer[pitch * y]; // start of row y in the packed buffer
    char cValue = row[x >> 3];                             // byte containing pixel x
    return (cValue & (128 >> (x & 7))) != 0;               // bits are stored MSB-first
}
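Building on that helper, a rough sketch of filling the RGBA texture described at the top of the question could look like the following (glyphToRGBA and the malloc'd buffer are illustrative names of mine, not FreeType API; the OpenGL upload itself is omitted):
#include <stdlib.h>

/* Fill a width*rows RGBA8 buffer: white text, alpha taken from the mono bitmap. */
unsigned char *glyphToRGBA(FT_GlyphSlot glyph)
{
    const int width  = glyph->bitmap.width;
    const int height = glyph->bitmap.rows;
    unsigned char *texels = (unsigned char *)malloc((size_t)width * height * 4);

    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            unsigned char *p = texels + 4 * (y * width + x);
            p[0] = 255;                             /* R */
            p[1] = 255;                             /* G */
            p[2] = 255;                             /* B */
            p[3] = glyphBit(glyph, x, y) ? 255 : 0; /* A */
        }
    }
    return texels; /* caller frees after uploading, e.g. with glTexImage2D */
}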
I had a similar question some time ago, so I will try to help you.
The target texture has dimensions of glyph->bitmap.width * glyph->bitmap.rows
This is a very specific dimension for OpenGL. It would be better to round it up to a power of two.
The common way is a loop over every glyph, then a loop over every row from 0 to glyph->bitmap.rows, then a loop over every byte (unsigned char) in the row from 0 to glyph->bitmap.pitch, where you get the byte with glyph->bitmap.buffer[pitch * row + i] (i is the index of the inner loop and row the index of the outer). For example:
// Excerpt from a larger text-rendering loop; s, i, left, kerning, data,
// strSize and g (the current glyph slot) come from the surrounding code.
if (s[i] == ' ')
    left += 20;
else
    for (int row = 0; row < g->bitmap.rows; ++row) {
        if (kerning)
            for (int b = 0; b < pitch; b++) {
                if (data[left + 64*(strSize*(row + 64 - g->bitmap_top)) + b] + g->bitmap.buffer[pitch * row + b] < UCHAR_MAX)
                    data[left + 64*(strSize*(row + 64 - g->bitmap_top)) + b] += g->bitmap.buffer[pitch * row + b];
                else
                    data[left + 64*(strSize*(row + 64 - g->bitmap_top)) + b] = UCHAR_MAX;
            }
        else
            std::memcpy(data + left + 64*(strSize*(row + 64 - g->bitmap_top)), g->bitmap.buffer + pitch * row, pitch);
    }
left += g->advance.x >> 6;
This code is relevant to an 8-bit bitmap (the standard FT_Load_Char(face, glyphChar, FT_LOAD_RENDER)).
I then tried to use the monochrome flag and it caused me trouble, so my answer is not a solution to your problem. If you just want to display the letter, you should see my question.
The following Python function unpacks a FT_LOAD_TARGET_MONO glyph bitmap into a more convenient representation where each byte in the buffer maps to one pixel.
I've got some more info on monochrome font rendering with Python and FreeType plus additional example code on my blog: http://dbader.org/blog/monochrome-font-rendering-with-freetype-and-python
def unpack_mono_bitmap(bitmap):
    """
    Unpack a freetype FT_LOAD_TARGET_MONO glyph bitmap into a bytearray where each
    pixel is represented by a single byte.
    """
    # Allocate a bytearray of sufficient size to hold the glyph bitmap.
    data = bytearray(bitmap.rows * bitmap.width)

    # Iterate over every byte in the glyph bitmap. Note that we're not
    # iterating over every pixel in the resulting unpacked bitmap --
    # we're iterating over the packed bytes in the input bitmap.
    for y in range(bitmap.rows):
        for byte_index in range(bitmap.pitch):

            # Read the byte that contains the packed pixel data.
            byte_value = bitmap.buffer[y * bitmap.pitch + byte_index]

            # We've processed this many bits (=pixels) so far. This determines
            # where we'll read the next batch of pixels from.
            num_bits_done = byte_index * 8

            # Pre-compute where to write the pixels that we're going
            # to unpack from the current byte in the glyph bitmap.
            rowstart = y * bitmap.width + byte_index * 8

            # Iterate over every bit (=pixel) that's still a part of the
            # output bitmap. Sometimes we're only unpacking a fraction of a byte
            # because glyphs may not always fit on a byte boundary. So we make sure
            # to stop if we unpack past the current row of pixels.
            for bit_index in range(min(8, bitmap.width - num_bits_done)):

                # Unpack the next pixel from the current glyph byte.
                bit = byte_value & (1 << (7 - bit_index))

                # Write the pixel to the output bytearray. We ensure that `off`
                # pixels have a value of 0 and `on` pixels have a value of 1.
                data[rowstart + bit_index] = 1 if bit else 0

    return data

Resources