Turn off sw_scale conversion to planar YUV 32 byte alignment requirements - ffmpeg

I am seeing artifacts on the right edge of scaled and converted images when converting into planar YUV pixel formats with sw_scale. I am reasonably sure (although I cannot find it anywhere in the documentation) that this is because sw_scale uses an optimization that assumes 32-byte-aligned lines in the destination. I would like to turn this off because I am using sw_scale for image composition, so even though the destination lines may be 32-byte aligned, the output image written into them may not be.
Example.
Full output frame is 1280x720 yuv422p10le. (this is 32 byte aligned)
However into the top left corner I am scaling an image with an outwidth of 1280 / 3 = 426.
426 in this format is not 32 byte aligned, but I believe sw_scale sees that the output linesize is 32 byte aligned and overwrites the width of 426 putting garbage in the next 22 bytes of data thinking this is simply padding when in my case this is displayable area.
This is why I need to either disable this optimization or somehow trick sw_scale into believing it does not apply, while keeping the rest of the program, which is otherwise fine, intact.
I have tried adding extra padding to the destination lines so they are no longer 32-byte aligned; this did not help as far as I can tell.
Edit with code example. Rendering omitted for ease of use.
Also, here is a similar issue; unfortunately, as I stated, their fix will not work for my use case. https://github.com/obsproject/obs-studio/pull/2836
Use the commented-out line of code to switch between an output width that is and isn't 32-byte aligned.
#include "libswscale/swscale.h"
#include "libavutil/imgutils.h"
#include "libavutil/pixelutils.h"
#include "libavutil/pixfmt.h"
#include "libavutil/pixdesc.h"
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h> /* memset */
int main(int argc, char **argv) {
    /// Set up a 1280x720 window, and an item with 1/3 width and height of the window.
    int window_width, window_height, item_width, item_height;
    window_width = 1280;
    window_height = 720;
    item_width = (window_width / 3);
    item_height = (window_height / 3);
    int item_out_width = item_width;
    /// This line sets the item width to be 32 byte aligned; uncomment to see uncorrupted results
    /// Note %16 because outformat is 2 bytes per component
    //item_out_width -= (item_width % 16);
    enum AVPixelFormat outformat = AV_PIX_FMT_YUV422P10LE;
    enum AVPixelFormat informat = AV_PIX_FMT_UYVY422;
    int window_lines[4] = {0};
    av_image_fill_linesizes(window_lines, outformat, window_width);
    uint8_t *window_planes[4] = {0};
    /// Fill the window with all 0s, this is green in yuv.
    window_planes[0] = calloc(1, window_lines[0] * window_height);
    window_planes[1] = calloc(1, window_lines[1] * window_height);
    window_planes[2] = calloc(1, window_lines[2] * window_height);
    int item_lines[4] = {0};
    av_image_fill_linesizes(item_lines, informat, item_width);
    uint8_t *item_planes[4] = {0};
    item_planes[0] = malloc(item_lines[0] * item_height);
    memset(item_planes[0], 100, item_lines[0] * item_height);
    struct SwsContext *ctx;
    ctx = sws_getContext(item_width, item_height, informat,
                         item_out_width, item_height, outformat, SWS_FAST_BILINEAR, NULL, NULL, NULL);
    /// Check a block in the normal region
    printf("Pre scale normal region %d %d %d\n", (int)((uint16_t*)window_planes[0])[0], (int)((uint16_t*)window_planes[1])[0],
           (int)((uint16_t*)window_planes[2])[0]);
    /// Check a block in the corrupted region (should be all zeros). These values should be outside the converted region
    int corrupt_offset_y = (item_out_width + 3) * 2; ///(item_width + 3) * 2 bytes per component Y PLANE
    int corrupt_offset_uv = (item_out_width + 3); ///(item_width + 3) * (2 bytes per component rshift 1 for horiz scaling) U and V PLANES
    printf("Pre scale corrupted region %d %d %d\n", (int)(*((uint16_t*)(window_planes[0] + corrupt_offset_y))),
           (int)(*((uint16_t*)(window_planes[1] + corrupt_offset_uv))), (int)(*((uint16_t*)(window_planes[2] + corrupt_offset_uv))));
    sws_scale(ctx, (const uint8_t**)item_planes, item_lines, 0, item_height, window_planes, window_lines);
    /// Perform the same tests after scaling
    printf("Post scale normal region %d %d %d\n", (int)((uint16_t*)window_planes[0])[0], (int)((uint16_t*)window_planes[1])[0],
           (int)((uint16_t*)window_planes[2])[0]);
    printf("Post scale corrupted region %d %d %d\n", (int)(*((uint16_t*)(window_planes[0] + corrupt_offset_y))),
           (int)(*((uint16_t*)(window_planes[1] + corrupt_offset_uv))), (int)(*((uint16_t*)(window_planes[2] + corrupt_offset_uv))));
    return 0;
}
Example Output:
//No alignment
Pre scale normal region 0 0 0
Pre scale corrupted region 0 0 0
Post scale normal region 400 400 400
Post scale corrupted region 512 36865 36865
//With alignment
Pre scale normal region 0 0 0
Pre scale corrupted region 0 0 0
Post scale normal region 400 400 400
Post scale corrupted region 0 0 0

"I believe sw_scale sees that the output linesize is 32 byte aligned and overwrites the width of 426 putting garbage in the next 22 bytes of data thinking this is simply padding when in my case this is displayable area."
That's actually correct; swscale does indeed do that. Good analysis. There are two ways to get rid of this:
1. Disable all SIMD code using av_set_cpu_flags_mask(0) (in newer FFmpeg releases, where that function is no longer available, av_force_cpu_flags(0) should serve the same purpose).
2. Write the rescaled 426xN image into a temporary buffer and then manually copy the pixels into the unpadded destination plane.
The reason ffmpeg/swscale overwrites the destination is performance. If you don't care about runtime and want the simplest code, use the first solution. If you do want performance and don't mind slightly more complicated code, use the second solution.
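The copy step of the second solution can be sketched in plain C as follows. libavutil already ships av_image_copy_plane(), which takes the same parameters, so the hand-written loop is only here to show what the copy does; copy_plane_rows is a hypothetical name. The sketch assumes sws_scale() was pointed at a separate scratch image whose rows it is free to overrun, and that both linesizes are positive.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Copy `height` rows of `bytewidth` visible bytes from a padded scratch
 * plane into a destination plane. Destination bytes beyond `bytewidth`
 * on each row are left untouched. Assumes positive linesizes. */
static void copy_plane_rows(uint8_t *dst, int dst_linesize,
                            const uint8_t *src, int src_linesize,
                            int bytewidth, int height)
{
    assert(bytewidth <= dst_linesize && bytewidth <= src_linesize);
    for (int y = 0; y < height; y++) {
        memcpy(dst, src, (size_t)bytewidth);
        dst += dst_linesize;   /* next row of the composition buffer */
        src += src_linesize;   /* next row of the scratch buffer */
    }
}
```

For the 426-pixel-wide yuv422p10le item from the question, bytewidth would be 426 * 2 = 852 for the Y plane and 213 * 2 = 426 for the U and V planes, with one call per plane.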

Related

Very unexpected behavior of C++ win32 BitBlt

I noticed when I try to run BitBlt, the resulting data buffer is unexpected in two ways:
It is flipped along the y axis (the origin seems to be bottom left instead of top left)
In each RGBA grouping, the R and B values seem to be switched.
For the first issue, I noticed it when testing with my command prompt; if my command prompt was in the upper left portion of the screen, it would only say it was black when my cursor was in the lower left portion. I had to fix the inversion of the y axis by changing int offset = (y * monitor_width + x) * 4; to int offset = ((monitor_height - 1 - y) * monitor_width + x) * 4; this fixed the pixel location issue because it was showing black where I expected black.
However, the colors were still wrong. I tested by trying to get the color of known pixels. I noticed every blue pixel had a very high R value and every red pixel had a very high B value. That's when I compared with an existing tool I had and found out that the red and blue values seem to be switched in every pixel. At first I thought it was reading backwards or that there was a byte alignment issue, but I also verified against a cluster of non-uniform pixels to make sure it was picking the right pixel position, and it did perfectly well, just with the colors switched.
Full simplified code below (originally my tool was getting my cursor position and printing the pixel color via hotkey press; this is a simplified version that gets one specific point).
BYTE* my_pixel_data;
HDC hScreenDC = GetDC(GetDesktopWindow());
int BitsPerPixel = GetDeviceCaps(hScreenDC, BITSPIXEL);
HDC hMemoryDC = CreateCompatibleDC(hScreenDC);
int monitor_width = GetSystemMetrics(SM_CXSCREEN);
int monitor_height = GetSystemMetrics(SM_CYSCREEN);
std::cout << std::format("monitor width height: {}, {}\n", monitor_width, monitor_height);
BITMAPINFO info = {}; // zero-initialize so the header fields not set below are not garbage
info.bmiHeader.biSize = sizeof(BITMAPINFOHEADER);
info.bmiHeader.biWidth = monitor_width; // client_width;
info.bmiHeader.biHeight = monitor_height; // client_height;
info.bmiHeader.biPlanes = 1;
info.bmiHeader.biBitCount = BitsPerPixel;
info.bmiHeader.biCompression = BI_RGB;
HBITMAP hbitmap = CreateDIBSection(hMemoryDC, &info, DIB_RGB_COLORS, (void**)&my_pixel_data, 0, 0);
SelectObject(hMemoryDC, hbitmap);
BitBlt(hMemoryDC, 0, 0, monitor_width, monitor_height, hScreenDC, 0, 0, SRCCOPY);
int x = 12, y = 12;
int offset = ((monitor_height - 1 - y) * monitor_width + x) * 4;
std::cout << std::format("debug: ({}, {}): ({}, {}, {})\n", x, y, (int)my_pixel_data[offset], (int)my_pixel_data[offset + 1], (int)my_pixel_data[offset + 2], (int)my_pixel_data[offset + 3]);
system("pause");
The output of this will be debug: (12, 12): (199, 76, 133) even though another program has verified the colors are actually (133, 76, 199).
I can easily fix this in my code by flipping the y axis and switching each R and B value and the program will work perfectly well. However, I am just baffled by how this happened and whether there's a more elegant fix.
I can answer the RGB part (and it looks like Hans answered the inverted Y axis in a comment). Remember that a 32-bit RGB pixel is stored as the value 0xAARRGGBB, so on a little-endian machine BB is byte 0, GG is byte 1, and RR is byte 2 (alpha is byte 3 if you use it). When you index in at +0, +1 and +2 you are therefore reading the correct values, just in B, G, R order. When we say RGB, we name the colors in the opposite order from how they are stored in memory.
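Both effects can be handled in one small accessor. The sketch below is an illustration only: rgba_t and read_bottom_up_bgra are hypothetical names, and it assumes a 32-bit DIB created with a positive biHeight, which Windows stores bottom-up (a negative biHeight requests a top-down DIB instead).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { uint8_t r, g, b, a; } rgba_t;

/* Read the pixel at screen position (x, y), with y = 0 at the top, from a
 * 32-bit bottom-up DIB: rows are stored last-to-first and each pixel is
 * laid out in memory as B, G, R, A. */
static rgba_t read_bottom_up_bgra(const uint8_t *pixels, int width, int height,
                                  int x, int y)
{
    assert(x >= 0 && x < width && y >= 0 && y < height);
    size_t offset = ((size_t)(height - 1 - y) * (size_t)width + (size_t)x) * 4;
    rgba_t px = { pixels[offset + 2],   /* R is the third byte  */
                  pixels[offset + 1],   /* G is the second byte */
                  pixels[offset + 0],   /* B is the first byte  */
                  pixels[offset + 3] }; /* A (or unused) is last */
    return px;
}
```

With this accessor the bytes 199, 76, 133 from the question come back as r = 133, g = 76, b = 199, matching what the other tool reported.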

Transition shader

I have created a transition shader.
This is what it does:
On each update, the color value that should become transparent (the 'alpha' value) changes.
Then perform a check for each pixel:
If the color of the pixel is more than the 'alpha' value
    Set this pixel to transparent.
Else if the color of the pixel is more than the 'alpha' value - 50
    Set this pixel to partly transparent.
Else
    Set the color to black.
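As a CPU-side illustration of the check above, here is a minimal C sketch. The function name transition_alpha is hypothetical, values are assumed to be 8-bit (0..255), and the linear ramp inside the 50-level band is an assumption, since the description only says "partly transparent".

```c
#include <assert.h>
#include <stdint.h>

/* Returns the output opacity for one pixel: 0 = fully transparent,
 * 255 = opaque black. `value` is the pixel's grey level and `threshold`
 * the current 'alpha' value, both in 0..255. The 50-level band comes from
 * the description; the linear ramp inside it is an assumption. */
static uint8_t transition_alpha(int value, int threshold)
{
    assert(value >= 0 && value <= 255);
    if (value > threshold)
        return 0;                                         /* transparent */
    if (value > threshold - 50)
        return (uint8_t)((threshold - value) * 255 / 50); /* partly transparent */
    return 255;                                           /* opaque black */
}
```

In a shader the same ramp would be a subtraction, a scale and a clamp, so it needs no branching.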
EDIT (DELETED OLD PARTS):
I tried converting my GLSL into AGAL (using http://cmodule.org/glsl2agal):
Fragment shader:
const float alpha = 0.8;
varying vec2 TexCoord; //not used but required for converting
uniform sampler2D transition;//not used but required for converting
void main()
{
    vec4 color = texture2D(transition, TexCoord.st);//not used but required for converting
    color.a = float(color.r < alpha);
    if(color.r >= (alpha - 0.1)){
        color.a = 0.2 * (color.r - alpha - 0.1);
    }
    gl_FragColor = vec4(0, 0, 0, color.a);
}
And I've customized the output and added that to a (custom) Starling filter:
var fragmentShader:String =
"tex ft0, v0, fs0 <2d, clamp, linear, mipnone> \n" + // copy color to ft0
"slt ft0.w, ft0.x, fc0.x \n" + // alpha = red < inputAlpha
"mov ft0.xyz, fc1.xyzz \n" + // set color to black
"mov oc, ft0";
mShaderProgram = target.registerProgramFromSource(PROGRAM_NAME, vertexShader, fragmentShader);
It works, and when I set the filter's alpha it updates accordingly. The only thing left is the partly transparent band, but I have no idea how to do that.
Swap the cycle on the Y and X coordinates. By using the X in the inner loop you optimize the L1 cache and the prefetcher of the CPU.
Some minor hints:
Remove the zeros for a cleaner code:
const c:uint = a << 24
Verify that 255/50 is collapsed into a single constant by the compiler.
There is no good reason to do this with BitmapData when you are already using Starling.
I couldn't tell whether you are grayscaling it yourself or not. If not, just create a Starling filter for grayscale (the pixel shader below will do the trick):
tex ft0, v0, fs0 <2d,linear,clamp>
add ft1.x, ft0.x, ft0.y
add ft1.x, ft1.x, ft0.z
div ft1.x, ft1.x, fc0.x
mov ft0.xyz, ft1.xxx
mov oc, ft0
And for the alpha transition, just extend the Image class, implement IAnimatable, and add it to the Juggler. In advanceTime, just do this.alpha -= VALUE;
Simple as that :)
Just going to elaborate a bit on @Paxel's answer. I discussed the L1 caching with another developer, Jackson Dunstan: where the speed improvement comes from, and what other improvements can be made to code like this to gain performance.
Afterwards Jackson posted a blog entry, which can be read here: Take Advantage of CPU caching
I'll post some of the relevant items. First, the bitmap data is stored in memory by rows. The rows' memory addresses might look something like this:
row 1: 0 1 2 3 4 5
row 2: 6 7 8 9 10 11
row 3: 12 13 14 15 16 17
Now running your inner loop along the rows lets you take advantage of the L1 cache, since you read the memory in order. So with X in the inner loop, you'll read the first row as:
0 1 2 3 4 5
But if you were to do it Y first you'd read it as:
0 6 12 1 7 13
As you can see you are bouncing around memory addresses making it a slower process.
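The two access patterns can be reproduced with a short C sketch that records which linear offsets each loop order touches. The function name visit_order is hypothetical, and the image is assumed to be stored row-major as in the table above.

```c
#include <assert.h>
#include <stddef.h>

/* Record the linear offsets visited when walking a row-major
 * width x height image. A non-zero x_inner puts X in the inner loop
 * (sequential addresses); zero puts Y in the inner loop (strided). */
static void visit_order(int width, int height, int x_inner, int *out)
{
    assert(width > 0 && height > 0);
    size_t n = 0;
    if (x_inner) {
        for (int y = 0; y < height; y++)
            for (int x = 0; x < width; x++)
                out[n++] = y * width + x;
    } else {
        for (int x = 0; x < width; x++)
            for (int y = 0; y < height; y++)
                out[n++] = y * width + x;
    }
}
```

For the 6x3 example, the X-inner order starts 0 1 2 3 4 5 and the Y-inner order starts 0 6 12 1 7 13, which is the jumping pattern described above.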
As for other optimizations, the suggestion is to cache your width and height getters by storing the properties in local variables. Also, Math.round() is pretty slow; replacing it would give a speed increase.

How to get raw frame data from AVFrame.data[] and AVFrame.linesize[] without specifying the pixel format?

I get the general idea that frame.data[] is interpreted according to the pixel format of the video (RGB or YUV). But is there a general way to get all the pixel data from the frame? I just want to compute a hash of the frame data, without interpreting it to display the image.
According to AVFrame.h:
uint8_t* AVFrame::data[AV_NUM_DATA_POINTERS]
pointer to the picture/channel planes.
int AVFrame::linesize[AV_NUM_DATA_POINTERS]
For video, size in bytes of each picture line.
Does this mean that if I just extract from data[i] for linesize[i] bytes then I get the full pixel information about the frame?
linesize[i] contains stride for the i-th plane.
To obtain the whole buffer, use the function from avcodec.h
/**
* Copy pixel data from an AVPicture into a buffer, always assume a
* linesize alignment of 1. */
int avpicture_layout(const AVPicture* src, enum AVPixelFormat pix_fmt,
int width, int height,
unsigned char *dest, int dest_size);
Use
int avpicture_get_size(enum AVPixelFormat pix_fmt, int width, int height);
to calculate the required buffer size.
The avpicture_* API is deprecated. You can now use av_image_copy_to_buffer() and av_image_get_buffer_size() to get the image buffer.
You can also avoid allocating a new buffer as above (av_image_copy_to_buffer()) by reading AVFrame::data[] directly; the size of each plane can be obtained from av_image_fill_plane_sizes(). Only do this if you clearly understand the pixel format.
Find more here: https://www.ffmpeg.org/doxygen/trunk/group__lavu__picture.html
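For the hashing use case, the point about stride can be illustrated without any FFmpeg calls. The sketch below hashes only the visible bytes of each row and skips the padding, so two frames with identical pixels but different linesize values hash the same. FNV-1a is used purely as an example hash, hash_plane is a hypothetical name, and positive linesizes are assumed. With FFmpeg, the per-plane byte width could come from av_image_get_linesize(), and chroma planes may have fewer rows than the luma plane depending on the pixel format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FNV_OFFSET_BASIS 14695981039346656037ULL

/* Fold one plane into a running 64-bit FNV-1a hash, reading only the
 * first `bytewidth` bytes of each row and skipping the stride padding.
 * Assumes linesize is positive. */
static uint64_t hash_plane(uint64_t hash, const uint8_t *data, int linesize,
                           int bytewidth, int height)
{
    assert(bytewidth >= 0 && bytewidth <= linesize);
    for (int y = 0; y < height; y++) {
        const uint8_t *row = data + (size_t)y * (size_t)linesize;
        for (int x = 0; x < bytewidth; x++) {
            hash ^= row[x];
            hash *= 1099511628211ULL;   /* FNV-1a 64-bit prime */
        }
    }
    return hash;
}
```

Start with FNV_OFFSET_BASIS and feed the result of each plane's call into the next call to get one hash for the whole frame.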

Upscaling images on Retina devices

I know images upscale by default on retina devices, but the default scaling makes the images blurry.
I was wondering if there was a way to scale it in nearest-neighbor mode, where there are no transparent pixels created, but rather each pixel multiplied by 4, so it looks like it would on a non retina device.
Example of what I'm talking about can be seen in the image below.
example http://cclloyd.com/downloads/sdfsdf.png
CoreGraphics will not do a 2x scale like that; you need to write a bit of explicit pixel mapping logic to do something like this. The following is some code I used for this operation. You would of course need to fill in the details, as it operates on an input buffer of pixels and writes to an output buffer of pixels that is 2x larger.
// Use special case "DOUBLE" logic that will simply duplicate the exact
// RGB value from the indicated pixel into the 2x sized output buffer.
int numOutputPixels = resizedFrameBuffer.width * resizedFrameBuffer.height;
uint32_t *inPixels32 = (uint32_t*)cgFrameBuffer.pixels;
uint32_t *outPixels32 = (uint32_t*)resizedFrameBuffer.pixels;
int outRow = 0;
int outColumn = 0;
for (int i=0; i < numOutputPixels; i++) {
    if ((i > 0) && ((i % resizedFrameBuffer.width) == 0)) {
        outRow += 1;
        outColumn = 0;
    }
    // Divide by 2 to get the column/row in the input framebuffer
    int inColumn = outColumn / 2;
    int inRow = outRow / 2;
    // Get the pixel for the row and column this output pixel corresponds to
    int inOffset = (inRow * cgFrameBuffer.width) + inColumn;
    uint32_t pixel = inPixels32[inOffset];
    outPixels32[i] = pixel;
    //fprintf(stdout, "Wrote 0x%.10X for 2x row/col %d %d (%d), read from row/col %d %d (%d)\n", pixel, outRow, outColumn, i, inRow, inColumn, inOffset);
    outColumn += 1;
}
This code of course depends on you creating a buffer of pixels and then wrapping it back up into a CGImageRef. But you can find all the code to do that kind of thing easily.

How "bytesPerRow" is calculated from an NSBitmapImageRep

I would like to understand how "bytesPerRow" is calculated when building up an NSBitmapImageRep (in my case from mapping an array of floats to a grayscale bitmap).
Clarifying this detail will help me to understand how memory is being mapped from an array of floats to a byte array (0-255, unsigned char; neither of these arrays are shown in the code below).
The Apple documentation says that this number is calculated "from the width of the image, the number of bits per sample, and, if the data is in a meshed configuration, the number of samples per pixel."
I had trouble following this "calculation", so I set up a simple loop to find the results empirically. The following code runs just fine:
int Ny = 1; // Ny is arbitrary, note that BytesPerPlane is calculated as we would expect = Ny*BytesPerRow;
for (int Nx = 0; Nx<320; Nx+=64) {
    // greyscale image representation:
    NSBitmapImageRep *dataBitMapRep = [[NSBitmapImageRep alloc]
        initWithBitmapDataPlanes: nil // allocate the pixel buffer for us
        pixelsWide: Nx
        pixelsHigh: Ny
        bitsPerSample: 8
        samplesPerPixel: 1
        hasAlpha: NO
        isPlanar: NO
        colorSpaceName: NSCalibratedWhiteColorSpace // 0 = black, 1 = white
        bytesPerRow: 0 // 0 means "you figure it out"
        bitsPerPixel: 8]; // bitsPerSample must agree with samplesPerPixel
    long rowBytes = [dataBitMapRep bytesPerRow];
    printf("Nx = %d; bytes per row = %ld \n",Nx, rowBytes); // %ld matches the long argument
}
and produces the result:
Nx = 0; bytes per row = 0
Nx = 64; bytes per row = 64
Nx = 128; bytes per row = 128
Nx = 192; bytes per row = 192
Nx = 256; bytes per row = 256
So we see that the bytes/row jumps in 64 byte increments, even when Nx incrementally increases by 1 all the way to 320 (I didn't show all of those Nx values). Note also that Nx = 320 (max) is arbitrary for this discussion.
So from the perspective of allocating and mapping memory for a byte array, how are the "bytes per row" calculated from first principles? Is the result above so the data from a single scan-line can be aligned on a "word" length boundary (64 bit on my MacBook Pro)?
Thanks for any insights, having trouble picturing how this works.
Passing 0 for bytesPerRow: means more than you said in your comment. From the documentation:
If you pass in a rowBytes value of 0, the bitmap data allocated may be padded to fall on long word or larger boundaries for performance. … Passing in a non-zero value allows you to specify exact row advances.
So you're seeing it increase by 64 bytes at a time because that's how AppKit decided to round it up.
The minimum requirement for bytes per row is much simpler. It's bytes per pixel times pixels per row. That's all.
For a bitmap image rep backed by floats, you'd pass sizeof(float) * 8 for bitsPerSample, and bytes-per-pixel would be sizeof(float) * samplesPerPixel. Bytes-per-row follows from that; you multiply bytes-per-pixel by the width in pixels.
Likewise, if it's backed by unsigned bytes, you'd pass sizeof(unsigned char) * 8 for bitsPerSample, and bytes-per-pixel would be sizeof(unsigned char) * samplesPerPixel.
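The arithmetic above can be written out as a small C sketch. Both function names are hypothetical, and the 64-byte alignment in the usage note is only what the output in the question suggests, not a documented AppKit guarantee.

```c
#include <assert.h>
#include <stddef.h>

/* Smallest row size that can hold `pixels_wide` pixels of
 * `bits_per_pixel` bits each, rounded up to whole bytes. */
static size_t min_bytes_per_row(size_t pixels_wide, size_t bits_per_pixel)
{
    return (pixels_wide * bits_per_pixel + 7) / 8;
}

/* Round a row size up to the next multiple of `alignment` bytes, which is
 * the kind of padding a framework may add when bytesPerRow is passed as 0. */
static size_t padded_bytes_per_row(size_t min_bytes, size_t alignment)
{
    assert(alignment > 0);
    return (min_bytes + alignment - 1) / alignment * alignment;
}
```

For example, a 65-pixel-wide 8-bit greyscale row needs at least 65 bytes, and padding it to an assumed 64-byte boundary gives 128, while a 100-pixel-wide row of single float samples (32 bits per pixel) needs at least 400 bytes.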

Resources