Are B, P frames results of the motion estimation/compression? - mpeg

Confused about the relationship between MPEG frames and motion estimation/compensation :(
Are B and P frames the result of motion estimation/compensation? But motion estimation/compensation uses those frames. Then how, and by what, are the frame types decided?

There are actually 3 types of frames:
I-frame: intra-coded (independent) frame, the least compressed
P-frame: predicted frame, uses preceding frames to improve compression
B-frame: bi-directional frame, uses both previous and next frames for the best compression
These frames are indeed the result of compression by the encoder, which works in several steps, roughly (not exhaustive):
Reduce color nuances & resolution of images
Remove imperceptible details
Compare adjacent images and remove redundant information (i.e. items unchanged between two images)
[...]
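To make the ordering concrete, here is a minimal Python sketch of how an encoder might lay out frame types in a fixed GOP (group of pictures) pattern. Real encoders decide this adaptively (scene changes, rate control), so the fixed pattern and the gop_size/b_frames parameters below are purely illustrative:

    # Assign I/P/B labels using a classic fixed IBBP... pattern (illustration only;
    # real encoders choose frame types adaptively).
    def assign_frame_types(n_frames, gop_size=12, b_frames=2):
        types = []
        for i in range(n_frames):
            pos = i % gop_size
            if pos == 0:
                types.append('I')             # start of a GOP: independent frame
            elif pos % (b_frames + 1) == 0:
                types.append('P')             # predicted from the previous I/P frame
            else:
                types.append('B')             # bi-directional, needs a future reference too
        return types

    print(''.join(assign_frame_types(24)))    # IBBPBBPBBPBBIBBPBBPBBPBB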
There is a good article on Wikipedia. You can find more by asking Google about I/P/B frames ;)
Also check out this answer related to your question.

Related

Please clarify the gif image format's intended behavior

If I have a gif89a which has multiple image blocks that are identical (and small, say 40x40 or 1600 pixels in size), should these continue to increase the final size of the gif file (assuming a sane encoder)?
I'm trying to understand how the LZW compression works. According to the W3C spec, I thought the entire data stream itself (consisting of multiple image blocks) should be compressed, and thus repeating the same image frame multiple times would incur very little overhead (just the size of the symbol for the repeated image block). This does not seem to be the case, and I've tested with several encoders (Gimp, Photoshop).
Is this to be expected with all encoders, or are these two just doing it poorly?
With gimp, my test gif was 23k in size when it had 240 identical image blocks, and 58k in size with 500 image blocks, which seems less impressive than my intuition is telling me (my intuition's pretty dumb, so I won't be shocked if/when someone tells me it's incredibly wrong).
[edit]
I need to expand on what it is I'm getting at, I think, to receive a proper answer. I want to handcraft a gif image (and possibly write an encoder if I'm up to it) that will take advantage of some quirks to compress it better than would happen otherwise.
I would like to include multiple sub-images in the gif that are used repeatedly in a tiling fashion. If the image is large (in this case, 1700x2200), gif can't compress the tiles well because it doesn't see them as tiles; it rasters from the top left to the bottom right, so at most a 30-pixel horizontal slice of any given tile will be given a symbol and compressed, not the 30x35 tile itself.
The tiles themselves are just the alphabet and some punctuation in this case, from a scan of a magazine. Of course in the original scan, each "a" is slightly different than every other, which doesn't help for compression, and there's plenty of noise in the scan too, and that can't help.
As each tile will be repeated somewhere in the image anywhere from dozens to hundreds of times, and each is 30 or 40 times as large as any given slice of a tile, it looks like there are some gains to be had (supposing the gif file format can be bent towards my goals).
I've hand-created another gif in gimp, that uses 25 sub-images repeatedly (about 700 times, but I lost count). It is 90k in size unzipped, but zipping it drops it back down to 11k. This is true even though each sub-image has a different top/left coordinate (but that's only what, 4 bytes up in the header of the sub-image).
In comparison, a visually identical image with a single frame is 75k. This image gains nothing from being zipped.
There are other problems I've yet to figure out with the file (it's gif89a, and treats this as an animation even though I've set each frame to be 0ms in length, so you can't see it all immediately). I can't even begin to think how you might construct an encoder to do this... it would have to select the best-looking (or at least one of the better-looking) versions of any glyph, and then figure out the best x,y to overlay it even though it doesn't always line up very well.
Its primary use (I believe) would be for magazines scanned in as cbr/cbz ebooks.
I'm also going to embed my hand-crafted gif; it's easier to see what I'm getting at than to read my writing as I stumble over the explanation.
LZW (and GIF) compression is one-dimensional. An image is treated as a stream of symbols where any area-to-area (blocks in your terminology) symmetry is not used. An animated GIF image is just a series of images that are compressed independently and can be applied to the "main" image with various merging options. Animated GIF was more like a hack than a standard and it wasn't well thought out for efficiency in image size.
There is a good explanation for why you see smaller files after ZIP'ing your GIF with repeated blocks. ZIP uses several techniques, including a "repeated block" form of compression that works well when identical runs of LZW data are small (<32K) or separated by only small distances.
GIF-generating software can't overcome the basic limitation of how GIF images are compressed without writing a new standard. A slightly better approach is used by PNG, which applies simple 2-dimensional filters to take advantage of horizontal and vertical symmetries and then compresses the result with DEFLATE compression. It sounds like what you're looking for is a more fractal or video-like approach, which can have the concept of a set of compressed primitives that can be repeated at different positions in the final image. GIF and PNG cannot accomplish this.
GIF compression is stream-based. That means to maximize compression, you need to maximize the repeatability of the stream. Rather than square tiles, I'd use narrow strips to minimize the amount of data that passes before it starts repeating, then keep the repeats within the same stream.
The LZW code size is capped at 12 bits, which means the compression table fills up relatively quickly. A typical encoder will output a clear code when this happens so that the compression can start over, giving good adaptability to fresh content. If you do your own custom encoder you can skip the clear code and keep reusing the existing table for higher compression results.
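For illustration, here is a simplified LZW encoder sketch in Python. It emits a list of integer codes rather than a packed GIF bitstream, and it assumes 8-bit pixel data (so 256 is the clear code and 257 the end-of-information code, as in GIF); the point is only to show the 12-bit cap and the choice between restarting the table and freezing it:

    MAX_CODES = 4096        # 12-bit limit from the GIF spec
    CLEAR_CODE = 256        # GIF reserves 256 (clear) and 257 (end of information)

    def lzw_encode(data, reset_on_full=True):
        dictionary = {bytes([i]): i for i in range(256)}
        next_code = 258
        out = []
        w = b""
        for byte in data:
            wc = w + bytes([byte])
            if wc in dictionary:
                w = wc
            else:
                out.append(dictionary[w])
                if next_code < MAX_CODES:
                    dictionary[wc] = next_code
                    next_code += 1
                elif reset_on_full:
                    out.append(CLEAR_CODE)    # typical encoder: clear and start over
                    dictionary = {bytes([i]): i for i in range(256)}
                    next_code = 258
                # else: table is frozen, keep reusing the existing entries
                w = bytes([byte])
        if w:
            out.append(dictionary[w])
        return out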
The GIF spec does not specify the behavior when a delay time of 0 is given, so you're at the mercy of the decoder implementation. For consistent results you should use a delay of 1 and accept that the entire image won't show up immediately.

How to estimate GIF file size?

We're building an online video editing service. One of the features allows users to export a short segment from their video as an animated gif. Imgur has a file size limit of 2MB per uploaded animated gif.
Gif file size depends on the number of frames, the color depth and the image content itself: a solid flat color results in a very lightweight gif, while a random-colors TV-noise animation would be quite heavy.
First I export each video frame as a PNG of the final GIF frame size (fixed, 384x216).
Then, to maximize gif quality I undertake several gif render attempts with slightly different parameters - varying number of frames and number of colors in the gif palette. The render that has the best quality while staying under the file size limit gets uploaded to Imgur.
Each render takes time and CPU resources — this I am looking to optimize.
Question: what could be a smart way to estimate the best render settings depending on the actual images, to fit as close as possible to the filesize limit, and at least minimize the number of render attempts to 2–3?
The GIF image format uses LZW compression, infamous because Unisys, the owner of the algorithm patent, aggressively pursued royalty payments just as the image format got popular. That turned out well in the end; we have PNG to thank for it.
The amount by which LZW can compress the image is extremely hard to predict and greatly depends on the image content. At best you can provide the user with a heuristic that estimates the final image file size, displaying, say, a success prediction with a colored bar. You can color it pretty quickly by converting just the first frame. That won't take long on a 384x216 image; it runs in human time, a fraction of a second.
Then extrapolate the effective compression rate of that first image to the subsequent frames, which ought to encode only small differences from the first frame and so should have comparable compression rates.
You can't truly know whether it exceeds the site's size limit until you've encoded the entire sequence. So be sure to emphasize in your UI design that your prediction is just an estimate, so your user isn't too disappointed. And of course provide them with tools to get the size down, something like a nearest-neighbor interpolation that makes the pixels in the image bigger. Focusing on making the later frames smaller can pay off handsomely as well; GIF encoders don't normally do this well by themselves. YMMV.
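A rough sketch of that estimate-from-the-first-frame idea, assuming Pillow is available; the function name and the 0.5 delta factor are made up and would need calibrating against your own renders:

    import io
    from PIL import Image

    def estimate_gif_size(frame_paths, colors=128, delta_factor=0.5):
        """Encode only the first frame as a GIF and extrapolate to the whole clip."""
        first = Image.open(frame_paths[0]).convert("RGB").quantize(colors=colors)
        buf = io.BytesIO()
        first.save(buf, format="GIF")
        first_size = buf.tell()
        # Later frames mostly encode small differences from the previous one, so
        # assume each costs some fraction of the first frame (calibrate this!).
        return first_size + int(first_size * delta_factor) * (len(frame_paths) - 1)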
There's no simple answer to this. Single-frame GIF size mainly depends on image entropy after quantization, and you could try using stddev as an estimator, computed with e.g. ImageMagick:
identify -format "%[fx:standard_deviation]" imagename.png
You can very probably get better results by running a smoothing kernel on the image in order to eliminate some high-frequency noise that's unlikely to be informational, and very likely to mess up compression performance. This goes much better with JPEG than with GIF, anyway.
Then, in general, you want to run a great many samples in order to come up with something of this kind (let's say you have a single compression parameter Q):
STDDEV    SIZE W/Q=1    SIZE W/Q=2    SIZE W/Q=3    ...
value1    v1,1          v1,2          v1,3
After running several dozens of tests (but you need to do this only once, not "at runtime"), you will get both an estimate of, say, the expected size for a given stddev and Q, and a measurement of its error. You'll then see that an image with stddev 0.45 that compresses to 108 Kb when Q=1 will compress to 91 Kb plus or minus 5 when Q=2, and 88 Kb plus or minus 3 when Q=3, and so on.
At that point you take an unknown image, get its stddev and its compressed size at Q=1, and you can interpolate the probable size when Q equals, say, 4, without actually running the encoding.
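As a sketch of that lookup, using the figures from the example above (the calibration table and the 104 Kb test value are only illustrative; a real table would cover many stddev buckets and Q values, and an untested Q would need a fitted curve rather than a direct lookup):

    # stddev bucket -> {Q: average measured size in Kb} from the calibration runs
    calibration = {
        0.45: {1: 108, 2: 91, 3: 88},
    }

    def predict_size(stddev_bucket, size_at_q1, target_q):
        ref = calibration[stddev_bucket]
        ratio = ref[target_q] / ref[1]      # how much smaller Q=target is vs Q=1
        return size_at_q1 * ratio

    print(predict_size(0.45, 104, 3))       # ~84.7 Kb expected for this image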
While your service is active, you can store statistical data (i.e., after you really do the encoding, you store the actual results) to further improve estimation; after all you'd only store some numbers, not any potentially sensitive or personal information that might be in the video. And acquiring and storing those numbers would come nearly for free.
Backgrounds
It might be worthwhile to recognize images with a fixed background; in that case you can run some adaptations to make all the frames identical in some areas, and have the GIF animation algorithm not store that information. When you do get such a video (e.g. a talking head), this can lead to huge savings, but it would throw the parameter estimation completely off unless you could also estimate the actual extent of the background area. In that case, let that area be B and let the frame area be A; the compressed "image" size for five frames would be A+(A-B)*(5-1) instead of A*5, and you could apply this correction factor to the estimate.
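The same correction factor, written out as a trivial helper (A and B are in whatever units your size estimate uses; the function name is just for illustration):

    def background_corrected_estimate(A, B, n_frames):
        # A = estimated compressed size of a full frame, B = size of the static
        # background part; frames after the first only pay for the non-background area.
        return A + (A - B) * (n_frames - 1)   # instead of A * n_frames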
Compression optimization
Then there are optimization techniques that slightly modify the image and adapt it for a better compression, but we'd stray from the topic at hand. I had several algorithms that worked very well with paletted PNG, which is similar to GIF in many regards, but I'd need to check out whether and which of them may be freely used.
Some thoughts: the LZW algorithm processes the image line by line, as one continuous stream. So whenever a sequence of N pixels is "less than X% different" (perceptually or arithmetically) from an already encountered sequence, rewrite the sequence:
018298765676523456789876543456787654
987678656755234292837683929836567273
here the 656765234 sequence in the first row is almost matched by the 656755234 sequence in the second row. By changing the mismatched 5 to 6, the LZW algorithm is likely to pick up the whole sequence and store it with one symbol instead of three (6567,5,5234) or more.
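A sketch of that pre-filtering idea in Python, here comparing each run only against the run directly above it. The run_len and max_diff values are arbitrary, and comparing palette indices arithmetically only makes sense if the palette is ordered by similarity (e.g. a greyscale ramp):

    def prefilter_rows(rows, run_len=8, max_diff=1):
        """rows: list of lists of palette indices, one list per image row.
        If a run of pixels is 'close enough' to the run directly above it, copy
        the run from the row above so the LZW stream sees an exact repeat."""
        for y in range(1, len(rows)):
            prev, cur = rows[y - 1], rows[y]
            for x in range(0, len(cur) - run_len + 1, run_len):
                if all(abs(cur[x + i] - prev[x + i]) <= max_diff for i in range(run_len)):
                    cur[x:x + run_len] = prev[x:x + run_len]
        return rows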
Also, LZW works with bits, not bytes. This means, very roughly speaking, that the more the 0's and 1's are balanced, the worse the compression will be. The more unpredictable their sequence, the worse the results.
So if we can find a way of making the distribution more asymmetrical, we win.
And we can do it, and we can do it losslessly (the same works with PNG). We choose the most common colour in the image, once we have quantized it. Let that color be color index 0. That's 00000000, eight fat zeroes. Now we choose the most common colour that follows that one, or the second most common colour; and we give it index 1, that is, 00000001. Another seven zeroes and a single one. The next colours will be indexed 2, 4, 8, 16, 32, 64 and 128; each of these has only a single bit 1, all others are zeroes.
Since colors will be very likely distributed following a power law, it's reasonable to assume that around 20% of the pixels will be painted with the first nine most common colours; and that 20% of the data stream can be made to be at least 87.5% zeroes. Most of them will be consecutive zeroes, which is something that LZW will appreciate no end.
Best of all, this intervention is completely lossless; the reindexed pixels will still be the same colour, it's only the palette that is permuted accordingly. I developed such a codec for PNG some years ago, and in my use-case scenario (PNG street maps) it yielded very good results, ~20% gain in compression. With more varied palettes and with the LZW algorithm the results will probably not be as good, but the processing is fast and not too difficult to implement.
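A minimal sketch of that reindexing in pure Python (no GIF or PNG I/O; reindex_palette is a made-up name, the pixels are assumed to be a flat list of palette indices, and writing the result back out is left to whatever encoder you already use):

    from collections import Counter

    def reindex_palette(pixels, palette):
        """Give the most frequent colours the indices with the fewest 1-bits."""
        n = len(palette)
        sparse = [i for i in (0, 1, 2, 4, 8, 16, 32, 64, 128) if i < n]
        new_order = sparse + [i for i in range(n) if i not in sparse]
        by_freq = [i for i, _ in Counter(pixels).most_common()]
        by_freq += [i for i in range(n) if i not in by_freq]    # unused colours last
        remap = dict(zip(by_freq, new_order))
        new_palette = [palette[old] for old in sorted(remap, key=remap.get)]
        return [remap[p] for p in pixels], new_palette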

C#: Fast and smart algorithm to compare two byte arrays (images)

I'm running a process on a webcam image. I'd like to wake up that process only if there are major changes:
Something moving in the image
Lights turn on
...
So I'm looking for a fast, efficient algorithm in C# to compare two byte[] (Kinect images) of the same size.
I just need a kind of "diff size" with a threshold.
I found some motion detection algorithms, but they're "too much".
I found some XOR algorithms, but they might be too simple? It would be great if I could ignore small changes like sunlight, vibration, etc.
Mark as 'changed' all pixels that differ from the previous image (based on a threshold, i.e. if a pixel has changed only slightly, ignore it as noise)
Filter out noise pixels, i.e. if a pixel was marked as changed but none of its neighbors were, consider it noise and unmark it
Count how many pixels are changed in the image and compare with a threshold (you need to calibrate it manually)
Make sure you are operating on greyscale images (not RGB), i.e. convert to the YUV colour space and do the comparison only on Y.
This would be the simplest and fastest algorithm; you just need to tune the two thresholds (see the sketch below).
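The question asks for C#, but a short Python sketch of those steps may help pin down the structure (it translates directly; both thresholds below are placeholder values you would calibrate):

    def motion_detected(prev, cur, width, pixel_thresh=25, count_thresh=500):
        """prev, cur: greyscale frames of equal size, one byte per pixel."""
        changed = [abs(a - b) > pixel_thresh for a, b in zip(prev, cur)]

        def neighbour_changed(i):
            # crude noise filter: a changed pixel only counts if a 4-neighbour
            # changed too (row wrap-around ignored for brevity)
            for j in (i - 1, i + 1, i - width, i + width):
                if 0 <= j < len(changed) and changed[j]:
                    return True
            return False

        n_changed = sum(1 for i, c in enumerate(changed) if c and neighbour_changed(i))
        return n_changed > count_thresh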
A concept: the MPEG standards involve motion estimation. Maybe you can monitor an MPEG stream's bandwidth: if there's no motion, the bandwidth is very low (except during key frames (I-frames)); if something changes and motion is going on, the bandwidth increases.
So what you can do is grab the JPEGs and feed them into an MPEG encoder codec, then just look at the encoded stream. You can tune the frame rate and the bandwidth within a range, and you decide what bitrate threshold on the codec's output counts as "motion".
Advantage: very generic, and there are libraries available that often offer hardware acceleration (GPUs help with JPEG encoding/decoding and with some MPEG codecs). It's also pretty standard.
Disadvantage: more computationally demanding than an XOR.

What is a 6-tap filter and how does it differ across codecs?

I found in one paper on VP8 decoding the phrase "a 6-tap filter in any case will be a 6-tap filter, and the difference is usually only in the coefficients". So what is a 6-tap filter, and how does it work?
Can anyone please explain what a 6-tap filter is and how they differ across codecs?
There are two places in video codecs where these filters are typically used:
Motion estimation/compensation
Video codecs compress much better than still image codecs because they also remove the redundancy between frames. They do this using motion estimation and motion compensation. The encoder splits the image into rectangular blocks of image data (typically 16x16) and then tries to find the block in a previously coded frame that is as similar as possible to the block currently being coded. The encoder then transmits only the difference, plus a pointer to where it found this good match. This is the main reason why video codecs get about 1:100 compression, whereas image codecs get about 1:10. Now, you can imagine that sometimes the camera or an object in the scene didn't move by a full pixel, but by half or a quarter of a pixel. A better match is then found if the reference image is scaled/interpolated, and these filters are used to do that. The exact way they do this filtering often differs per codec.
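As a concrete example, H.264 uses the 6-tap kernel (1, -5, 20, 20, -5, 1), normalised by 32, for half-pixel luma interpolation; VP8's sub-pixel filters have the same 6-tap structure with different coefficient sets. A minimal Python sketch that applies such a kernel along one row (edge handling and each codec's exact rounding rules are ignored):

    KERNEL = (1, -5, 20, 20, -5, 1)          # H.264 half-pel kernel, sums to 32

    def half_pel_row(row):
        """row: list of pixel values; returns the interpolated half-pixel positions."""
        out = []
        for x in range(2, len(row) - 3):                   # skip the edges for simplicity
            acc = sum(c * row[x - 2 + i] for i, c in enumerate(KERNEL))
            out.append(min(255, max(0, (acc + 16) >> 5)))  # divide by 32 with rounding, clip
        return out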
Deblocking
Another reason for using such a filter is to remove artifacts from the transform that's being used. Just like in still-image coding, there's a transform that maps the image data into a different space that "compacts the energy". For instance, after this transform, image sections that have a uniform color, like a blue sky, will result in data that has just a single number for the color and then all zeros for the rest. Compared to the original data, which stores blue for every pixel, a lot of redundancy has been removed. After the transform (Google for DCT, KLT, integer transform), the zeros are typically thrown away, and the other, less relevant data that's left is coded with fewer bits than in the original. During decoding, since data has been thrown away, this often results in visible edges between neighboring 8x8 or 16x16 blocks. A separate smoothing filter then smooths these edges out again.
A 6-tap filter is a filter, almost certainly FIR, with six coefficients (taps). The coefficients determine the frequency response of the filter. Without knowing the structure, the coefficients and the sample rate you can't really say much more about it.

How does MPEG4 compression work?

Can anyone explain in a simple, clear way how MPEG4 works to compress data? I'm mostly interested in video. I know there are different standards or parts to it. I'm just looking for the predominant overall compression method, if there is one with MPEG4.
MPEG-4 is a huge standard, and employs many techniques to achieve the high compression rates that it is capable of.
In general, video compression is concerned with throwing away as much information as possible whilst having a minimal effect on the viewing experience for an end user. For example, using subsampled YUV instead of RGB cuts the video size in half straight away. This is possible as the human eye is less sensitive to colour than it is to brightness. In YUV, the Y value is brightness, and the U and V values represent colour. Therefore, you can throw away some of the colour information which reduces the file size, without the viewer noticing any difference.
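The "half" figure follows from 4:2:0 subsampling (U and V kept at half resolution in each direction), assuming 8-bit samples; a quick check:

    w, h = 1920, 1080
    rgb_bytes = w * h * 3                             # R, G, B per pixel
    yuv420_bytes = w * h + 2 * (w // 2) * (h // 2)    # full-res Y + quarter-res U and V
    print(yuv420_bytes / rgb_bytes)                   # 0.5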
After that, most compression techniques take advantage of 2 redundancies in particular. The first is temporal redundancy and the second is spatial redundancy.
Temporal redundancy notes that successive frames in a video sequence are very similar. Typically a video is in the order of 20-30 frames per second, and not much changes in 1/30 of a second. Take any DVD and pause it, then advance it one frame and note how similar the two images are. So, instead of encoding each frame independently, MPEG-4 (and other compression standards) only encodes the difference between successive frames (using motion estimation to find that difference).
Spatial redundancy takes advantage of the fact that in general the colour spread across images tends to be quite low frequency. By this I mean that neighbouring pixels tend to have similar colours. For example, in an image of you wearing a red jumper, all of the pixels that represent your jumper would have very similar colours. It is possible to use the DCT to transform the pixel values into the frequency space, where some high-frequency information can be thrown away. Then, when the inverse DCT is performed (during decoding), the image no longer contains the discarded high-frequency information.
To view the effects of throwing away high-frequency information, open MS Paint and draw a series of overlapping horizontal and vertical black lines. Save the image as a JPEG (which also uses the DCT for compression). Now zoom in on the pattern and notice how the edges of the lines are not as sharp anymore and are kind of blurry. This is because some high-frequency information (the transition from black to white) has been thrown away during compression. Read this for an explanation with nice pictures.
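You can reproduce that effect numerically on a single 8x8 block; this sketch assumes numpy and scipy are available and uses a crude "keep only the low-frequency quarter" rule instead of a real quantization matrix:

    import numpy as np
    from scipy.fft import dctn, idctn

    block = np.zeros((8, 8))
    block[:, 4:] = 255                 # a hard black-to-white vertical edge

    coeffs = dctn(block, norm="ortho")
    coeffs[4:, :] = 0                  # crude "quantisation": drop the upper half of
    coeffs[:, 4:] = 0                  # the vertical and horizontal frequencies
    restored = idctn(coeffs, norm="ortho")

    print(np.round(restored[0]))       # the hard edge is now a gradual, slightly ringing ramp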
For further reading, this book is quite good, if a little heavy on the maths.
Like any other popular video codec, MPEG4 uses a variation of discrete cosine transform and a variety of motion-compensation techniques (which you can think of as motion-prediction if that helps) that reduce the amount of data needed for subsequent frames. This page has an overview of what is done by plain MPEG4.
It's not totally dissimilar to the techniques used by JPEG.
MPEG4 uses a variety of techniques to compress video.
If you haven't already looked at Wikipedia, this would be a good starting point.
There is also this article from the IEEE which explains these techniques in more detail.
Sharp edges certainly DO contain high frequencies. Reducing or eliminating high frequencies reduces the sharpness of edges. Fine detail, including sharp edges, is removed along with the high frequencies; the ability to resolve two small objects is lost as well, so you see just one.

Resources