How to store an image in the FPGA for real-time video processing?

I am implementing a real-time video processing project whose input comes from HDMI. The video input has a green background, which is to be replaced by an image stored in the FPGA in order to generate a new video with a different background. I am using a PYNQ-Z2 board.
So far, I have tried the following:
Storing the whole image in BRAM: not possible, because there is not enough space.
Using a second stream for the image and then trying to mix the two streams (video + image): I cannot synchronize the two streams.
Storing the image in DDR RAM and using a double-buffering scheme to load part of it into BRAM. The first buffer is used for processing one row of the image, while the second one is loaded with the next row from DDR memory via the DMA (the DMA is controlled by the CPU). When a row is done, an interrupt is sent from the FPGA to the CPU so that the next line can be fetched from DDR memory, and I swap the buffers so that new data starts loading. This solution has too much latency in the DMA transfer, and the image in the video output is broken.
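For reference, the CPU-side control flow of that ping-pong scheme looks roughly like the sketch below. This is only an illustration of the buffering logic: dma_start_row_transfer, dma_done and process_row are hypothetical placeholders for the real DMA driver calls and the FPGA processing hand-off, and the 1280-pixel row size is an assumption.

#include <cstdint>

// Hypothetical hooks standing in for the real DMA driver and processing logic.
void dma_start_row_transfer(const uint8_t *src, uint8_t *dst, int len);
bool dma_done();                       // set by the FPGA "row loaded" interrupt
void process_row(const uint8_t *row);  // chroma-key one video line against this row

constexpr int ROW_BYTES = 1280 * 3;    // assumption: 1280-pixel rows, 24-bit pixels
static uint8_t row_buf[2][ROW_BYTES];  // models the two BRAM-backed row buffers

void stream_background_image(const uint8_t *ddr_image, int rows) {
    int active = 0;
    dma_start_row_transfer(ddr_image, row_buf[active], ROW_BYTES);   // prefetch row 0
    for (int y = 0; y < rows; ++y) {
        while (!dma_done()) { /* wait for the "row loaded" interrupt */ }
        int next = active ^ 1;
        if (y + 1 < rows)              // kick off the next row while this one is in use
            dma_start_row_transfer(ddr_image + (y + 1) * ROW_BYTES,
                                   row_buf[next], ROW_BYTES);
        process_row(row_buf[active]);  // use the freshly loaded row for this video line
        active = next;                 // swap buffers
    }
}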

Related

Camera and screen sync

Problem Description:
We have a camera that is sending video of a live sports game at 30 frames per second.
On the other side we have a screen that immediately displays every frame that comes in.
Assumptions
* frames will arrive in order
1. What will be the experience for a person who is watching the screen?
2. What can we do in order to improve it?
Your playback will have a very variable frame rate, which would cause visible artifacts during any smooth movement ...
To remedy this you need to implement an image FIFO that covers more time than your worst delay difference (ideally at least 2x more). So if you have a 300ms-100ms delay difference and 30 fps, then the minimal FIFO size is:
n = 2 * (300-100) * 0.001 * 30 = 12 images
Now the playback should work like this:
1. init playback
Simply start collecting images into the FIFO until the FIFO is half full (it then holds enough images to cover the biggest delay difference).
2. playback
Any incoming image is inserted into the FIFO at the time it is received (unless the FIFO is full, in which case you either wait until you have room for the new image or skip the frame). Meanwhile, in some thread or timer that runs in parallel, you fetch an image from the FIFO every 1/30 seconds and render it (if the FIFO is empty, you reuse the last image, and you can even go back to step #1 again).
3. playback stop
Once the FIFO has been empty for longer than some threshold (no new frames are incoming), you stop the playback.
The FIFO size reserve and the point at which to start playback depend on the timing properties of the image source (so that it neither overruns nor underruns the FIFO)...
In case you need to implement your own FIFO class, a cyclic buffer of constant size is your friend (so you do not need to copy all the stored images on FIFO in/out operations).
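A minimal sketch of such a constant-size cyclic frame FIFO might look like this, assuming a generic Frame image type (an assumption made for illustration); the indices wrap around, so push/pop never move the stored images:

#include <vector>

// Fixed-capacity cyclic FIFO sketch; Frame is whatever image type you use.
template <typename Frame>
class FrameFifo {
public:
    explicit FrameFifo(size_t capacity) : buf_(capacity) {}

    bool push(const Frame& f) {                    // called when a frame arrives
        if (count_ == buf_.size()) return false;   // full: caller waits or drops
        buf_[(head_ + count_) % buf_.size()] = f;
        ++count_;
        return true;
    }

    bool pop(Frame& out) {                         // called every 1/30 s by the renderer
        if (count_ == 0) return false;             // empty: caller reuses the last frame
        out = buf_[head_];
        head_ = (head_ + 1) % buf_.size();
        --count_;
        return true;
    }

    size_t size() const { return count_; }
    bool half_full() const { return count_ * 2 >= buf_.size(); }  // playback start point

private:
    std::vector<Frame> buf_;
    size_t head_ = 0;   // index of the oldest frame
    size_t count_ = 0;  // number of stored frames
};

With the numbers above you would construct it with a capacity of 12 and start the 1/30 s render timer once half_full() returns true.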

FFMPEG Frame to DirectX Surface Hardware Accelerated

I use ffmpeg functions to decode H.264 frames and display them in a window on the Windows platform. The approach I use is as below (from FFMPEG Frame to DirectX Surface):
AVFrame *frame;
avcodec_decode_video(_ffcontext, frame, etc...);
lockYourSurface();
uint8_t *buf = getPointerToYourSurfacePixels();
// Create an AVPicture structure which contains a pointer to the RGB surface.
AVPicture pict;
memset(&pict, 0, sizeof(pict));
avpicture_fill(&pict, buf, PIX_FMT_RGB32,
               _ffcontext->width, _ffcontext->height);
// Convert the image into RGB and copy to the surface.
img_convert(&pict, PIX_FMT_RGB32, (AVPicture *)frame,
            _context->pix_fmt, _context->width, _context->height);
unlockYourSurface();
In the code, I use sws_scale instead of img_convert.
When I pass the surface data pointer to sws_scale (in fact in avpicture_fill), it seems that the data pointer is actually in RAM, not in GPU memory, and when I want to display the surface, it seems that the data is moved to the GPU and then displayed. As far as I know, CPU utilization is high when data is copied between RAM and GPU memory.
How can I tell ffmpeg to render directly to a surface in GPU memory (not to a data pointer in RAM)?
I have found the answer to this problem. To prevent extra CPU usage when displaying frames with ffmpeg, we must not decode the frame to RGB. Almost all video files are decoded to YUV (this is the original image format inside the video file). The point here is that the GPU is able to display YUV data directly, without any need to convert it to RGB. As far as I know, with the usual ffmpeg build, decoded data always ends up in RAM. For a frame, the amount of YUV data is much smaller than the RGB equivalent of the same frame (for 4:2:0 video it is 12 bits per pixel versus 32 bits per pixel for RGB32). So when we move the YUV data to the GPU instead of converting it to RGB and then moving that, we speed up the operation in two ways:
No conversion to RGB
The amount of data moved between RAM and GPU is decreased
So finally the overall CPU usage is decreased.
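As a rough illustration (not code from the original answer), packing the planes of a decoded AV_PIX_FMT_YUV420P frame into one contiguous buffer, which can then be handed to a YUV-capable surface, might look like this; the destination layout and the actual surface upload are assumptions and depend on your Direct3D setup:

#include <cstdint>
#include <cstring>
extern "C" {
#include <libavutil/frame.h>
}

// Sketch: pack an AV_PIX_FMT_YUV420P AVFrame into one contiguous buffer.
// dst must hold width*height*3/2 bytes; how that buffer reaches the GPU
// (e.g. a YUV-format Direct3D surface) is outside this sketch.
void pack_yuv420p(const AVFrame *frame, uint8_t *dst, int width, int height)
{
    // Y plane: width x height
    for (int y = 0; y < height; ++y)
        memcpy(dst + y * width, frame->data[0] + y * frame->linesize[0], width);
    dst += width * height;

    // U and V planes: (width/2) x (height/2) each
    for (int p = 1; p <= 2; ++p) {
        for (int y = 0; y < height / 2; ++y)
            memcpy(dst + y * (width / 2),
                   frame->data[p] + y * frame->linesize[p], width / 2);
        dst += (width / 2) * (height / 2);
    }
}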

Make DirectShow play sound from a memory buffer

I want to play sound "on-demand". A simple drum machine is what I want to program.
Is it possible to make DirectShow read from a memory buffer (an object created by C++)?
I am thinking:
Create a buffer of, let's say, 40000 positions, of type double (I don't know the actual data type to use for sound, so I might be wrong with double).
40000 positions can be 1 second of playback.
The DirectShow object is supposed to read this buffer position by position, over and over again, and the buffer will contain the actual value of the output of the sound. For example (a sine-looking output):
{0, 0.4, 0.7, 0.9, 0.99, 0.9, 0.7, 0.4, 0, -0.4, -0.7, -0.9, -0.99, -0.9, -0.7, -0.4, 0}
The resolution of this sound sequence is probably not that good, but it is only to illustrate what I mean.
Is this possible? I cannot find any examples or information about it on Google.
edit:
When working with DirectShow and streaming video (USB camera), I used something called Sample Grabber, which called a method for every frame from the camera. I am looking for something similar, but for music, and something that is called before the music is played.
Thanks
You want to stream your own data, and injecting data into a DirectShow pipeline is possible.
By design, the outer DirectShow interface does not provide access to the streamed data. The controlling code builds the topology, connects filters, sets them up and controls the state of the pipeline. All data is streamed behind the scenes; filters pass pieces of data from one to another, and this adds up to data streaming.
Sample Grabber is a helper filter that lets you grab a copy of the data passing through a certain point in the graph. Because payload data is otherwise not available to controlling code, Sample Grabber gained popularity, especially for grabbing video frames out of the "inaccessible" stream, in live or file-backed playback.
Now when you want to do the opposite and put your own data into the pipeline, the Sample Grabber concept does not work. Taking a copy of the data is one thing; proactively putting your own data into the stream is a different one.
To inject your own data you typically put a custom filter into the pipeline that generates the data. You want to generate PCM audio data. You choose where you take it from: generation, reading from a file, memory, the network, looping, whatever. You fill the buffers, you add time stamps, and you deliver the audio buffers to the downstream filters. A typical starting point is the PushSource Filters Sample, which introduces the concept of a filter producing video data. In a similar way you want to produce PCM audio data.
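As a rough sketch of that filling step (not the linked sample's code): in your pin's CSourceStream::FillBuffer override you would do something like the helper below, which writes 16-bit mono PCM sine samples into a media sample and timestamps it. The free-function form, the sample rate and the tone frequency parameters are assumptions made for illustration.

#include <cmath>
#include <dshow.h>   // IMediaSample, REFERENCE_TIME

// Fill one DirectShow media sample with 16-bit mono PCM sine data.
// In a real push-source filter this logic lives in CSourceStream::FillBuffer,
// and sampleRate/frequency/samplesWritten would be member state of the pin.
HRESULT FillPcmSample(IMediaSample *pSample, int sampleRate, double frequency,
                      long long &samplesWritten)
{
    BYTE *pData = nullptr;
    HRESULT hr = pSample->GetPointer(&pData);
    if (FAILED(hr)) return hr;

    const double kPi = 3.141592653589793;
    long samples = pSample->GetSize() / (long)sizeof(short);   // 16-bit mono PCM
    short *pcm = reinterpret_cast<short *>(pData);
    for (long i = 0; i < samples; ++i) {                        // synthesize a sine wave
        double t = double(samplesWritten + i) / sampleRate;
        pcm[i] = short(30000.0 * sin(2.0 * kPi * frequency * t));
    }

    // Time stamps are in 100 ns units, relative to the start of the stream.
    REFERENCE_TIME tStart = (samplesWritten * 10000000LL) / sampleRate;
    REFERENCE_TIME tStop  = ((samplesWritten + samples) * 10000000LL) / sampleRate;
    pSample->SetTime(&tStart, &tStop);
    pSample->SetSyncPoint(TRUE);
    pSample->SetActualDataLength(samples * (long)sizeof(short));

    samplesWritten += samples;
    return S_OK;
}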
A related question:
How do I inject custom audio buffers into a DirectX filter graph using DSPACK?

How to read video frame buffer in windows

I am trying to create a small project wherein I need to capture/read the video frame buffer and calculate the average RGB value of the screen.
I don't need to write anything to the screen. I'm doing this on Windows.
Can anyone help me with any Windows API which will read the video frame buffer and calculate the average RGB value?
What I came to know is that I need to write a kernel driver which will have access to read the frame buffer.
Is this the only solution?
Is there any other way of reading the frame buffer?
Is there an algorithm to calculate the RGB value from frame buffer data?
If you want really good performance, you might have to use DirectX and capture the back buffer to a texture. Using mipmaps, it will automatically create downsampled versions all the way to 1x1. Just grab the color of that one pixel and you're good to go.
Good luck, though. I'm working on implementing this as we speak. I'm creating an ambient light control for my room. I was getting about 15 FPS using device contexts and StretchBlt, and only got decent performance when I grabbed a single pixel with GetPixel(). That's on an i5 3570K @ 4.5 GHz.
But with the DirectX method, you could technically get hundreds if not thousands of frames per second. (When I render a spinning triangle, my 660 gets about 24,000 FPS. It couldn't be TOO much slower, minus the CPU calls.)
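If you do end up with raw pixel data on the CPU instead (for example a captured 32-bit BGRA bitmap), the averaging itself is trivial; a minimal sketch, assuming a tightly packed BGRA buffer:

#include <cstdint>
#include <cstddef>

// Average the R, G and B channels of a tightly packed 32-bit BGRA buffer.
struct AvgRgb { double r, g, b; };

AvgRgb average_rgb(const uint8_t *bgra, size_t pixelCount)
{
    uint64_t r = 0, g = 0, b = 0;
    for (size_t i = 0; i < pixelCount; ++i) {
        b += bgra[i * 4 + 0];
        g += bgra[i * 4 + 1];
        r += bgra[i * 4 + 2];
        // bgra[i * 4 + 3] is the (usually unused) alpha byte
    }
    double n = double(pixelCount ? pixelCount : 1);
    return { r / n, g / n, b / n };
}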

glTexSubImage2D extremely slow on Intel video card

My video card is a Mobile Intel 4 Series. I'm updating a texture with changing data every frame. Here's my main loop:
for (;;) {
    Timer timer;
    glBindTexture(GL_TEXTURE_2D, tex);
    glBegin(GL_QUADS); ... /* draw textured quad */ ... glEnd();
    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, 512, 512,
                    GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV, data);
    swapBuffers();
    cout << timer.Elapsed();
}
Every iteration takes 120ms. However, inserting glFlush before glTexSubImage2D brings the iteration time to 2ms.
The issue is not in the pixel format. I've tried the pixel formats BGRA, RGBA and ABGR_EXT together with the pixel types UNSIGNED_BYTE, BYTE, UNSIGNED_INT_8_8_8_8 and UNSIGNED_INT_8_8_8_8_EXT. The texture's internal pixel format is RGBA.
The order of calls matters. Moving the texture upload before the quad drawing, for example, fixes the slowness.
I also tried this on a GeForce GT 420M card, and it works fast there. My real app does have performance problems on non-Intel cards that are fixed by glFlush calls, but I haven't distilled those into a test case yet.
Any ideas on how to debug this?
One issue is that glTexImage2D performs a full reinitialization of the texture object. If only the data changes but the format remains the same, use glTexSubImage2D to speed things up (just a reminder).
The other issue is that, despite its name, immediate mode, i.e. glBegin(…) … glEnd(), is not synchronous: the drawing calls return long before the GPU is done drawing. Adding a glFinish() will synchronize, but so will calls to anything that modifies data still required by queued operations. So in your case glTexImage2D (and glTexSubImage2D) must wait for the drawing to finish.
Usually it's best to do all volatile resource uploads either at the beginning of the drawing function, or during the SwapBuffers block in a separate thread, through buffer objects. Buffer objects were introduced for that very reason: to allow asynchronous yet tight operation.
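For illustration (a sketch, assuming a GL 2.1+ context with a loader such as GLEW, and the 512x512 BGRA size from the question), an upload through a pixel buffer object bound to GL_PIXEL_UNPACK_BUFFER lets glTexSubImage2D read from driver-owned memory instead of waiting on your pointer:

#include <cstring>
#include <GL/glew.h>   // assumption: GLEW or another loader exposing GL 2.1+ entry points

GLuint pbo = 0;

void init_pbo()
{
    glGenBuffers(1, &pbo);
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
    glBufferData(GL_PIXEL_UNPACK_BUFFER, 512 * 512 * 4, nullptr, GL_STREAM_DRAW);
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);
}

void upload_frame(GLuint tex, const void *data)
{
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
    // Orphan the previous storage so the driver need not wait for pending reads.
    glBufferData(GL_PIXEL_UNPACK_BUFFER, 512 * 512 * 4, nullptr, GL_STREAM_DRAW);
    if (void *dst = glMapBuffer(GL_PIXEL_UNPACK_BUFFER, GL_WRITE_ONLY)) {
        memcpy(dst, data, 512 * 512 * 4);
        glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER);
    }
    glBindTexture(GL_TEXTURE_2D, tex);
    // With a PBO bound, the last parameter is a byte offset into the buffer.
    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, 512, 512,
                    GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV, (const void *)0);
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);
}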
I assume you're actually using that texture for one or more of your quads?
Uploading textures is one of the most expensive operations possible. Since your texture data changes every frame, the upload is unavoidable, but you should try to do it when the texture isn't in use by shaders. Remember that glBegin(GL_QUADS); ... glEnd(); doesn't actually draw quads; it requests that the GPU render the quads. Until the rendering completes, the texture will be locked. Depending on the implementation, this might cause the texture upload to wait (like an implicit glFlush), but it could also cause the upload to fail, in which case you've wasted megabytes of PCIe bandwidth and the driver has to retry.
It sounds like you already have a solution: upload all new textures at the beginning of the frame. So what's your question?
NOTE: Intel integrated graphics are horribly slow anyway.
When you make a draw call (glDrawElements, or others), the driver simply adds this call to a buffer and lets the GPU consume these commands when it can.
If this buffer had to be consumed entirely at SwapBuffers, the GPU would be idle after that, waiting for you to send new commands.
Drivers solve this by letting the GPU lag one frame behind. This is the first reason why glTexSubImage2D blocks: the driver waits for the GPU to stop using the texture (in the previous frame) before beginning the transfer, so that you never get half-updated data.
The other reason is that glTexSubImage2D is synchronous: it will also block during the whole transfer.
You can solve the first issue by keeping two textures: one for the current frame, one for the previous frame. Upload into the former, but draw with the latter.
You can solve the second issue by streaming the upload through a buffer object (a pixel buffer object bound to GL_PIXEL_UNPACK_BUFFER), which allows asynchronous transfers.
In your case, I suspect that calling glTexSubImage2D just before SwapBuffers adds an extra synchronization in the driver, whereas drawing the quad just before SwapBuffers simply appends the command to the buffer. 120 ms is probably a driver bug, though: even an Intel GMA doesn't need 120 ms to upload a 512x512 texture.
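A sketch of the two-texture idea from this answer, where tex[] holds two already-created 512x512 textures and drawTexturedQuad()/swapBuffers() are stand-ins for the question's drawing and swap code:

// Stand-ins for the question's drawing and buffer-swap code.
void drawTexturedQuad();
void swapBuffers();

GLuint tex[2];   // two pre-created 512x512 textures
int front = 0;   // index of the texture drawn this frame

void render_frame(const void *newData)
{
    int back = front ^ 1;

    // Upload the new data into the texture that is NOT drawn this frame.
    glBindTexture(GL_TEXTURE_2D, tex[back]);
    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, 512, 512,
                    GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV, newData);

    // Draw with the texture that was filled during the previous frame.
    glBindTexture(GL_TEXTURE_2D, tex[front]);
    drawTexturedQuad();

    swapBuffers();
    front = back;    // the roles alternate every frame
}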
