How does MPEG4 compression work?

How does MPEG4 compression work? - algorithm

Can anyone explain in a simple clear way how MPEG4 works to compress data. I'm mostly interested in video. I know there are different standards or parts to it. I'm just looking for the predominant overall compression method, if there is one with MPEG4.

MPEG-4 is a huge standard, and employs many techniques to achieve the high compression rates that it is capable of.
In general, video compression is concerned with throwing away as much information as possible whilst having a minimal effect on the viewing experience for an end user. For example, using subsampled YUV instead of RGB cuts the video size in half straight away. This is possible as the human eye is less sensitive to colour than it is to brightness. In YUV, the Y value is brightness, and the U and V values represent colour. Therefore, you can throw away some of the colour information which reduces the file size, without the viewer noticing any difference.
After that, most compression techniques take advantage of 2 redundancies in particular. The first is temporal redundancy and the second is spatial redundancy.
Temporal redundancy notes that successive frames in a video sequence are very similar. Typically a video would be in the order of 20-30 frames per second, and nothing much changes in 1/30 of a second. Take any DVD and pause it, then move it on one frame and note how similar the 2 images are. So, instead of encoding each frame independently, MPEG-4 (and other compression standards) only encode the difference between successive frames (using motion estimation to find the difference between frames)
Spatial redundancy takes advantage of the fact that in general the colour spread across images tends to be quite low frequency. By this I mean that neighbouring pixels tend to have similar colours. For example, in an image of you wearing a red jumper, all of the pixels that represent your jumper would have very similar colour. It is possible to use the DCT to transform the pixel values into the frequency space, where some high frequency information can be thrown away. Then, when the reverse DCT is performed (during decoding), the image is now without the thrown away high-frequency information.
To view the effects of throwing away high frequency information, open MS paint and draw a series of overlapping horizontal and vertical black lines. Save the image as a JPEG (which also uses DCT for compression). Now zoom in on the pattern, notice how the edges of the lines are not as sharp anymore and are kinda blurry. This is because some high frequency information (the transition from black to white) has been thrown away during compression. Read this for an explanation with nice pictures
For further reading, this book is quite good, if a little heavy on the maths.

Like any other popular video codec, MPEG4 uses a variation of discrete cosine transform and a variety of motion-compensation techniques (which you can think of as motion-prediction if that helps) that reduce the amount of data needed for subsequent frames. This page has an overview of what is done by plain MPEG4.
It's not totally dissimilar to the techniques used by JPEG.

MPEG4 uses a variety of techniques to compress video.
If you haven't already looked at wikipedia, this would be a good starting point.
There is also this article from the IEEE which explains these techniques in more detail.

Sharp edges certainly DO contain high frequencies. Reducing or eliminating high frequencies reduces the sharpness of edges. Fine detail including sharp edges is removed with high frequency removal - bility to resolve 2 small objects is removed with high frequencies - then you see just one.

Related

How to estimate GIF file size?

We're building an online video editing service. One of the features allows users to export a short segment from their video as an animated gif. Imgur has a file size limit of 2Mb per uploaded animated gif.
Gif file size depends on number of frames, color depth and the image contents itself: a solid flat color result in a very lightweight gif, while some random colors tv-noise animation would be quite heavy.
First I export each video frame as a PNG of the final GIF frame size (fixed, 384x216).
Then, to maximize gif quality I undertake several gif render attempts with slightly different parameters - varying number of frames and number of colors in the gif palette. The render that has the best quality while staying under the file size limit gets uploaded to Imgur.
Each render takes time and CPU resources — this I am looking to optimize.
Question: what could be a smart way to estimate the best render settings depending on the actual images, to fit as close as possible to the filesize limit, and at least minimize the number of render attempts to 2–3?

The GIF image format uses LZW compression. Infamous for the owner of the algorithm patent, Unisys, aggressively pursuing royalty payments just as the image format got popular. Turned out well, we got PNG to thank for that.
The amount by which LZW can compress the image is extremely non-deterministic and greatly depends on the image content. You, at best, can provide the user with a heuristic that estimates the final image file size. Displaying, say, a success prediction with a colored bar. You'd can color it pretty quickly by converting just the first frame. That won't take long on 384x216 image, that runs in human time, a fraction of a second.
And then extrapolate the effective compression rate of that first image to the subsequent frames. Which ought to encode only small differences from the original frame so ought to have comparable compression rates.
You can't truly know whether it exceeds the site's size limit until you've encoded the entire sequence. So be sure to emphasize in your UI design that your prediction is just an estimate so your user isn't going to disappointed too much. And of course provide him with the tools to get the size lowered, something like a nearest-neighbor interpolation that makes the pixels in the image bigger. Focusing on making the later frames smaller can pay off handsomely as well, GIF encoders don't normally do this well by themselves. YMMV.

There's no simple answer to this. Single-frame GIF size mainly depends on image entropy after quantization, and you could try using stddev as an estimator using e.g. ImageMagick:
identify -format "%[fx:standard_deviation]" imagename.png
You can very probably get better results by running a smoothing kernel on the image in order to eliminate some high-frequency noise that's unlikely to be informational, and very likely to mess up compression performance. This goes much better with JPEG than with GIF, anyway.
Then, in general, you want to run a great many samples in order to come up with something of the kind (let's say you have a single compression parameter Q)
STDDEV SIZE W/Q=1 SIZE W/Q=2 SIZE W/Q=3 ...
value1 v1,1 v1,2 v1,3
After running several dozens of tests (but you need do this only once, not "at runtime"), you will get both an estimate of, say, , and a measurement of its error. You'll then see that an image with stddev 0.45 that compresses to 108 Kb when Q=1 will compress to 91 Kb plus or minus 5 when Q=2, and 88 Kb plus or minus 3 when Q=3, and so on.
At that point you get an unknown image, get its stddev and compression #Q=1, and you can interpolate the probable size when Q equals, say, 4, without actually running the encoding.
While your service is active, you can store statistical data (i.e., after you really do the encoding, you store the actual results) to further improve estimation; after all you'd only store some numbers, not any potentially sensitive or personal information that might be in the video. And acquiring and storing those numbers would come nearly for free.
Backgrounds
It might be worthwhile to recognize images with a fixed background; in that case you can run some adaptations to make all the frames identical in some areas, and have the GIF animation algorithm not store that information. This, when and if you get such a video (e.g. a talking head), could lead to huge savings (but would throw completely off the parameter estimation thing, unless you could estimate also the actual extent of the background area. In that case, let this area be B, let the frame area be A, the compressed "image" size for five frames would be A+(A-B)*(5-1) instead of A*5, and you could apply this correction factor to the estimate).
Compression optimization
Then there are optimization techniques that slightly modify the image and adapt it for a better compression, but we'd stray from the topic at hand. I had several algorithms that worked very well with paletted PNG, which is similar to GIF in many regards, but I'd need to check out whether and which of them may be freely used.
Some thoughts: LZW algorithm goes on in lines. So whenever a sequence of N pixels is "less than X%" different (perceptually or arithmetically) from an already encountered sequence, rewrite the sequence:
018298765676523456789876543456787654
987678656755234292837683929836567273
here the 656765234 sequence in the first row is almost matched by the 656755234 sequence in the second row. By changing the mismatched 5 to 6, the LZW algorithm is likely to pick up the whole sequence and store it with one symbol instead of three (6567,5,5234) or more.
Also, LZW works with bits, not bytes. This means, very roughly speaking, that the more the 0's and 1's are balanced, the worse the compression will be. The more unpredictable their sequence, the worse the results.
So if we can find out a way of making the distribution more **a**symmetrical, we win.
And we can do it, and we can do it losslessly (the same works with PNG). We choose the most common colour in the image, once we have quantized it. Let that color be color index 0. That's 00000000, eight fat zeroes. Now we choose the most common colour that follows that one, or the second most common colour; and we give it index 1, that is, 00000001. Another seven zeroes and a single one. The next colours will be indexed 2, 4, 8, 16, 32, 64 and 128; each of these has only a single bit 1, all others are zeroes.
Since colors will be very likely distributed following a power law, it's reasonable to assume that around 20% of the pixels will be painted with the first nine most common colours; and that 20% of the data stream can be made to be at least 87.5% zeroes. Most of them will be consecutive zeroes, which is something that LZW will appreciate no end.
Best of all, this intervention is completely lossless; the reindexed pixels will still be the same colour, it's only the palette that will be shifted accordingly. I developed such a codec for PNG some years ago, and in my use case scenario (PNG street maps) it yielded very good results, ~20% gain in compression. With more varied palettes and with LZW algorithm the results will be probably not so good, but the processing is fast and not too difficult to implement.

C#: Fast and Smart algorithm to Compare two byte array (image)

I'm running a process on a WebCam Image. I'd like to Wake Up that process only if there is major changes.
Something moving in the image
Lights turn on
...
So i'm looking for a fast efficient algorithm in C# to compare 2 byte[] (kinect image) of the same size.
I just need kind of "diff size" with a threashold
I found some motion detection algorithm but it's "too much"
I found some XOR algorithm but it might be too simple ? Would be great If I could ignore small change like sunlight, vibration, etc, ...

Mark all pixels which are different from previous image (based on threshold i.e. if Pixel has been changed only slightly - ignore it as noise) as 'changed'
Filter out noise pixels - i.e. if pixel was marked as changed but all its neighbors are not - consider it as noise and unmark as changed
Calculate how many pixels are changed on the image and compare with Threshold (you need to calibrate it manually)
Make sure you are operating on Greyscale images (not RGB). I.e. convert to YUV image space and do comparison only on Y.
This would be simplest and fastest algorithm - you just need to tune these two thresholds.

A concept: MPEG standards involve motion detections. Maybe you can monitor the MPEG stream's bandwidth. If there's no motion, than the bandwidth is very low (except during key frames (I frames)). If something changes and any move is going on, the bandwidth increases.
So what you can do is grab the JPEGs and feed it into an MPEG encoder codec. Then you can just look at the encoded stream. You can tune the frame-rate and the bandwidth too in a range, plus you decide what is the threshold for the output stream of the codec which means "motion".
Advantage: very generic and there are libraries available, often they offer hardware acceleration (VGAs/GPUs help with JPEG en/decoding and some or more MPEG). It's also pretty standard.
Disadvantage: more computation demanding than a XOR.

is jpg format good for image processing algorithms

most non-serious cameras (cameras on phones and webcams) provide lossy JPEG image as output.
while for a human eye they may not be noticed but the data loss could be critical for image processing algorithms.
If I am correct what is general approach you take when analyzing input images ?
(please note: using a industry standard camera may not be an option for hobbyist programmers)

JPG is an entire family of implementations, there are actually 4 methods. The most common method is the "normal" method, based on the Discrete Cosine Transform. This simply divides the image in 8x8 blocks and calculates the DCT of this. This results in a list of coefficients. To store these coefficients efficiently, they are multiplied by some other matrix (quantization matrix), such that the higher frequencies are usually rounded to zero. This is the only lossy step in the process. The reason this is done is to be able to store the coefficients more efficiently than before.
So, your question is not answered very easily. It also depends on the size of the input, if you have a sufficiently large image (say 3000x2000), stored at a relatively high precision, you will have no trouble with artefacts. A small image with a high compression rate might cause troubles.
Remember though that an image taken with a camera contains a lot of noise, which in itself is probably far more troubling than the jpg compression.
In my work I usually converted all images to pgm format, which is a raw format. This ensures that if I process the image in a pipeline fashion, all intermediate steps do not suffer from jpg compression.
Keep in mind that operations such as rotation, scaling, and repeated saving of JPG cause data loss each iteration.

Scaling an Image Based on Detail

I'm curious about whether there are approaches or algorithms one might use to downscale an image based on the amount of detail or entropy in the image such that the new size is determined to be a resolution at which most of the detail of the original image would be preserved.
For example, if one takes an out-of-focus or shaky image with a camera, there would be less detail or high frequency content than if the camera had taken the image in focus or from a fixed position relative to the scene being depicted. The size of the lower entropy image could be reduced significantly and still maintain most of the detail if one were to scale this image back up to the original size. However, in the case of the more detailed image, one wouldn't be able to reduce the image size as much without losing significant detail.
I certainly understand that many lossy image formats including JPEG do something similar in the sense that the amount of data needed to store an image of a given resolution is proportional to the entropy of the image data, but I'm curious, mostly for my own interest, if there might be a computationally efficient approach for scaling resolution to image content.

It's possible, and one could argue that most lossy image compression schemes from JPEG-style DCT stuff to fractal compression are essentially doing this in their own particular ways.
Note that such methods almost always operate on small image chunks rather than the big picture, so as to maximise compression in lower detail regions rather than being constrained to apply the same settings everywhere. The latter would likely make for poor compression and/or high loss on most "real" images, which commonly contain a mixture of detail levels, though there are exceptions like your out-of-focus example.
You would need to define what constitutes "most of the detail of the original image", since perfect recovery would only be possible for fairly contrived images. And you would also need to specify the exact form of rescaling to be used each way, since that would have a rather dramatic impact on the quality of the recovery. Eg, simple pixel repetition would better preserve hard edges but ruin smooth gradients, while linear interpolation should reproduce gradients better but could wreak havoc with edges.
A simplistic, off-the-cuff approach might be to calculate a 2D power spectrum and choose a scaling (perhaps different vertically and horizontally) that preserves the frequencies that contain "most" of the content. Basically this would be equivalent to choosing a low-pass filter that keeps "most" of the detail. Whether such an approach counts as "computationally efficient" may be a moot point...

Finding an interesting frame in a video

Does anyone know of an algorithm that I could use to find an "interesting" representative thumbnail for a video?
I have say 30 bitmaps and I would like to choose the most representative one as the video thumbnail.
The obvious first step would be eliminate all black frames. Then perhaps look for the "distance" between the various frames and choose something that is close to the avg.
Any ideas here or published papers that could help out?

If the video contains structure, i.e. several shots, then the standard techniques for video summarisation involve (a) shot detection, then (b) use the first, mid, or nth frame to represent each shot. See [1].
However, let us assume you wish to find an interesting frame in a single continuous stream of frames taken from a single camera source. I.e. a shot. This is the "key frame detection" problem that is widely discussed in IR/CV (Information Retrieval, Computer Vision) texts. Some illustrative approaches:
In [2] a mean colour histogram is computed for all frames and the key-frame is that with the closest histogram. I.e. we select the best frame in terms of it's colour distribution.
In [3] we assume that camera stillness is an indicator of frame importance. As suggested by Beds, above. We pick the still frames using optic-flow and use that.
In [4] each frame is projected into some high dimensional content space, we find those frames at the corners of the space and use them to represent the video.
In [5] frames are evaluated for importance using their length and novelty in content space.
In general, this is a large field and there are lots of approaches. You can look at the academic conferences such as The International Conference on Image and Video Retrieval (CIVR) for the latest ideas. I find that [6] presents a useful detailed summary of video abstraction (key-frame detection and summarisation).
For your "find the best of 30 bitmaps" problem I would use an approach like [2]. Compute a frame representation space (e.g. a colour histogram for the frame), compute a histogram to represent all frames, and use the frame with the minimum distance between the two (e.g. pick a distance metric that's best for your space. I would try Earth Mover's Distance).
M.S. Lew. Principles of Visual Information Retrieval. Springer Verlag, 2001.
B. Gunsel, Y. Fu, and A.M. Tekalp. Hierarchical temporal video segmentation and content characterization. Multimedia Storage and Archiving Systems II, SPIE, 3229:46-55, 1997.
W. Wolf. Key frame selection by motion analysis. In IEEE International Conference on Acoustics, Speech, and Signal Processing, pages 1228-1231, 1996.
L. Zhao, W. Qi, S.Z. Li, S.Q. Yang, and H.J. Zhang. Key-frame extraction and shot retrieval using Nearest Feature Line. In IW-MIR, ACM MM, pages 217-220, 2000.
S. Uchihashi. Video Manga: Generating semantically meaningful video summaries.
In Proc. ACM Multimedia 99, Orlando, FL, Nov., pages 383-292, 1999.
Y. Li, T. Zhang, and D. Tretter. An overview of video abstraction techniques. Technical report, HP Laboratory, July 2001.

You asked for papers so I found a few. If you are not on campus or on VPN connection to campus these papers might be hard to reach.
PanoramaExcerpts: extracting and packing panoramas for video browsing
http://portal.acm.org/citation.cfm?id=266396
This one explains a method for generating a comicbook style keyframe representation.
Abstract:
This paper presents methods for automatically creating pictorial video summaries that resem- ble comic books. The relative importance of video segments is computed from their length and novelty. Image and audio analysis is used to automatically detect and emphasize mean- ingful events. Based on this importance mea- sure, we choose relevant keyframes. Selected keyframes are sized by importance, and then efficiently packed into a pictorial summary. We present a quantitative measure of how well a summary captures the salient events in a video, and show how it can be used to improve our summaries. The result is a compact and visually pleasing summary that captures semantically important events, and is suitable for printing or Web access. Such a summary can be further enhanced by including text cap- tions derived from OCR or other methods. We describe how the automatically generated sum- maries are used to simplify access to a large collection of videos.
Automatic extraction of representative keyframes based on scenecontent
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=751008
Abstract:
Generating indices for movies is a tedious and expensive process which we seek to automate. While algorithms for finding scene boundaries are readily available, there has been little work performed on selecting individual frames to concisely represent the scene. In this paper we present novel algorithms for automated selection of representative keyframes, based on scene content. Detailed description of several algorithms is followed by an analysis of how well humans feel the selected frames represent the scene. Finally we address how these algorithms can be integrated with existing algorithms for finding scene boundaries.

I think you should only look at key frames.
If the video is not encoded using a compression which is based on key frames, you create an algorithm based on the following article: Key frame selection by motion analysis.
Depending on the compression of the video you can have key frames every 2 seconds or 30 seconds. Than I think you should use the algorithm in the article to find the "most" keyframe out of all the key frames.

It may also be beneficial to favor frames that are aesthetically pleasing. That is, look for common attributes of photography-- aspect ratio, contrast, balance, etc.
It would be hard to find a representative shot if you don't know what you're looking for. But with some heuristics and my suggestion, at least you could come up with something good looking.

I worked on a project recently where we did some video processing, and we used OpenCV to do the heavy lifting as far as video processing was concerned. We had to extract frames, calculate differences, extract faces, etc. OpenCV has some built-in algorithms that will calculate differences between frames. It works with a variety of video and image formats.

Wow, what a great question - I guess a second step would be to iteratively remove frames where there's little or no change between it and it's successors. But all you're really doing there is reducing the set of potentially interesting frames. How exactly you determine "interestingness" is the special sauce I suppose as you don't have the user interaction statistics to rely on like Flickr does.

Directors will sometimes linger on a particularly 'insteresting' or beautiful shot so how about finding a 5 second section that doesn't change and then eliminating those sections that are almost black?

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio