Does anyone know of an algorithm that I could use to find an "interesting" representative thumbnail for a video?
I have say 30 bitmaps and I would like to choose the most representative one as the video thumbnail.
The obvious first step would be to eliminate all black frames. Then perhaps compute the "distance" between the various frames and choose one that is close to the average.
Any ideas here or published papers that could help out?
If the video contains structure, i.e. several shots, then the standard techniques for video summarisation involve (a) shot detection, then (b) using the first, middle, or nth frame to represent each shot. See [1].
However, let us assume you wish to find an interesting frame in a single continuous stream of frames taken from a single camera source, i.e. a shot. This is the "key frame detection" problem that is widely discussed in IR/CV (Information Retrieval, Computer Vision) texts. Some illustrative approaches:
In [2] a mean colour histogram is computed over all frames and the key frame is the one whose histogram is closest to that mean, i.e. we select the best frame in terms of its colour distribution.
In [3] we assume that camera stillness is an indicator of frame importance (as suggested by Beds, above): still frames are picked out using optic flow and used as key frames.
In [4] each frame is projected into some high dimensional content space, we find those frames at the corners of the space and use them to represent the video.
In [5] frames are evaluated for importance using their length and novelty in content space.
In general, this is a large field and there are lots of approaches. You can look at the academic conferences such as The International Conference on Image and Video Retrieval (CIVR) for the latest ideas. I find that [6] presents a useful detailed summary of video abstraction (key-frame detection and summarisation).
For your "find the best of 30 bitmaps" problem I would use an approach like [2]. Compute a frame representation space (e.g. a colour histogram for the frame), compute a histogram to represent all frames, and use the frame with the minimum distance between the two (e.g. pick a distance metric that's best for your space. I would try Earth Mover's Distance).
[1] M.S. Lew. Principles of Visual Information Retrieval. Springer Verlag, 2001.
[2] B. Gunsel, Y. Fu, and A.M. Tekalp. Hierarchical temporal video segmentation and content characterization. Multimedia Storage and Archiving Systems II, SPIE, 3229:46-55, 1997.
[3] W. Wolf. Key frame selection by motion analysis. In IEEE International Conference on Acoustics, Speech, and Signal Processing, pages 1228-1231, 1996.
[4] L. Zhao, W. Qi, S.Z. Li, S.Q. Yang, and H.J. Zhang. Key-frame extraction and shot retrieval using Nearest Feature Line. In IW-MIR, ACM MM, pages 217-220, 2000.
[5] S. Uchihashi. Video Manga: Generating semantically meaningful video summaries. In Proc. ACM Multimedia 99, Orlando, FL, Nov., pages 383-392, 1999.
[6] Y. Li, T. Zhang, and D. Tretter. An overview of video abstraction techniques. Technical report, HP Laboratory, July 2001.
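As promised, a rough sketch of that approach, assuming OpenCV and NumPy are available; the Bhattacharyya distance from cv2.compareHist is used here as a simple stand-in for Earth Mover's Distance, and the black-frame threshold is arbitrary:

```python
# Rough sketch: pick the frame whose colour histogram is closest to the mean
# histogram of all frames. Bhattacharyya distance stands in for EMD.
import cv2
import numpy as np

def colour_histogram(frame, bins=(8, 8, 8)):
    # 3D histogram over the HSV colour space, normalised
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1, 2], None, list(bins), [0, 180, 0, 256, 0, 256])
    return cv2.normalize(hist, hist).flatten()

def pick_representative_frame(frames):
    frames = [f for f in frames if f.mean() > 10]           # drop near-black frames
    hists = [colour_histogram(f) for f in frames]
    mean_hist = np.mean(hists, axis=0).astype(np.float32)   # "average" frame content
    dists = [cv2.compareHist(h, mean_hist, cv2.HISTCMP_BHATTACHARYYA) for h in hists]
    return frames[int(np.argmin(dists))]                    # frame closest to the average
```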
You asked for papers, so I found a few. If you are not on campus or on a VPN connection to campus, these papers might be hard to reach.
PanoramaExcerpts: extracting and packing panoramas for video browsing
http://portal.acm.org/citation.cfm?id=266396
This one explains a method for generating a comic-book-style keyframe representation.
Abstract:
This paper presents methods for automatically creating pictorial video summaries that resemble comic books. The relative importance of video segments is computed from their length and novelty. Image and audio analysis is used to automatically detect and emphasize meaningful events. Based on this importance measure, we choose relevant keyframes. Selected keyframes are sized by importance, and then efficiently packed into a pictorial summary. We present a quantitative measure of how well a summary captures the salient events in a video, and show how it can be used to improve our summaries. The result is a compact and visually pleasing summary that captures semantically important events, and is suitable for printing or Web access. Such a summary can be further enhanced by including text captions derived from OCR or other methods. We describe how the automatically generated summaries are used to simplify access to a large collection of videos.
Automatic extraction of representative keyframes based on scene content
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=751008
Abstract:
Generating indices for movies is a tedious and expensive process which we seek to automate. While algorithms for finding scene boundaries are readily available, there has been little work performed on selecting individual frames to concisely represent the scene. In this paper we present novel algorithms for automated selection of representative keyframes, based on scene content. Detailed description of several algorithms is followed by an analysis of how well humans feel the selected frames represent the scene. Finally we address how these algorithms can be integrated with existing algorithms for finding scene boundaries.
I think you should only look at key frames.
If the video is not encoded with a compression scheme that is based on key frames, you can create your own key frames with an algorithm like the one in the following article: Key frame selection by motion analysis.
Depending on the compression of the video, you may have key frames every 2 seconds or every 30 seconds. Then I think you should use the algorithm in the article to find the most representative one out of all the key frames.
It may also be beneficial to favor frames that are aesthetically pleasing. That is, look for common attributes of good photography: aspect ratio, contrast, balance, etc.
It would be hard to find a representative shot if you don't know what you're looking for, but with some heuristics and my suggestion, at least you could come up with something good-looking.
I worked on a project recently where we did some video processing, and we used OpenCV to do the heavy lifting as far as video processing was concerned. We had to extract frames, calculate differences, extract faces, etc. OpenCV has some built-in algorithms that will calculate differences between frames. It works with a variety of video and image formats.
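For example, here is a minimal OpenCV sketch of frame differencing (illustrative only, not our project code; the file name is a placeholder):

```python
# Minimal sketch: per-frame differencing with OpenCV, e.g. to drop
# near-duplicate or near-black frames before picking a thumbnail.
import cv2

cap = cv2.VideoCapture("input.mp4")   # placeholder file name
prev_gray = None
scores = []                           # mean absolute difference between consecutive frames
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    if prev_gray is not None:
        diff = cv2.absdiff(gray, prev_gray)
        scores.append(diff.mean())
    prev_gray = gray
cap.release()
```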
Wow, what a great question - I guess a second step would be to iteratively remove frames where there's little or no change between a frame and its successors. But all you're really doing there is reducing the set of potentially interesting frames. How exactly you determine "interestingness" is the special sauce, I suppose, as you don't have the user interaction statistics to rely on like Flickr does.
Directors will sometimes linger on a particularly 'interesting' or beautiful shot, so how about finding a 5-second section that doesn't change and then eliminating the sections that are almost black?
I have 2 images:
1. The initial image, in which I detected the car.
2. The IPM image, after transforming the image to another plane.
I don't have any information about the camera parameters.
I know the location of the car in the original image and in the IPM image, and I would like to know how I can determine the distance between the car and the camera. We can assume the height of the camera is 1 m.
Is there any formula or algorithm for that?
From your question, it appears that you are in the monocular case, where depth/scale information is naturally unavailable. I'm afraid there is no easy way to achieve what you want. Off the top of my head, here are a few options:
1- Using neural networks. This is the least expensive option (from a material and development-effort point of view). Plus, it will work if you do not have a video stream but only single images (although I'm guessing that is not your case). If performance is not an issue, you can take more or less any convolutional neural network and train it on depth data. Otherwise, a quick search will lead you to faster state-of-the-art networks which you can tailor to your needs. However, databases which contain depth-map ground truth are a bit scarce, and you usually have to build the depth maps for your training data yourself. In that case, many of the open source methods listed here (at the bottom of the page) come to mind. Once you have the depth maps, you can train for monocular depth estimation.
2- You can use stereo cameras. Those naturally give you the depth, via, for example, simple triangulation (see the sketch below).
3- You have a video stream and can use IMUs or car odometry. In that case, you can use one of the many (multi-sensor) Simultaneous Localization And Mapping (SLAM) methods. The literature on the subject is rich, but IMU calibration is usually a bit nightmarish.
4- You can use a cheap GPS receiver (the ublox EVK* series come to mind). In that case, you still need to use some SLAM variant (e.g. constrained bundle adjustment or any Kalman-based method). Assuming low GPS bias (since you show only peri-urban images, which are not affected by multipath), this will give you a decent approximation of the scale.
Note that methods three and four will give you an estimate of the reconstruction, so if you use sparse (i.e. feature-based) SLAM methods, you will end up having to densify the region where your car is detected (unless crude estimates are OK).
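To make option 2 concrete, here is a minimal sketch of depth from triangulation on a rectified stereo pair; the numbers are purely illustrative:

```python
# Minimal sketch: depth from disparity for a rectified stereo pair.
# Z = f * B / d, with f the focal length in pixels, B the baseline in metres
# and d the disparity in pixels. All numbers below are made-up examples.
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px

# Example: 700 px focal length, 12 cm baseline, 35 px disparity -> ~2.4 m
print(depth_from_disparity(35.0, 700.0, 0.12))
```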
I'm trying to figure out which are currently the two most efficient algorithms that, starting from an L/R pair of stereo images taken with a traditional camera (so affected by some epipolar-line misalignment), produce a pair of rectified images plus their depth information by looking at their disparity.
Actually I've found lots of papers about these two methods, like:
"Computing Rectifying Homographies for Stereo Vision" (Zhang - seems one of the best for rectification only)
"Three-step image rectification" (Monasse)
"Rectification and Disparity" (slideshow by Navab)
"A fast area-based stereo matching algorithm" (Di Stefano - seems a bit inaccurate)
"Computing Visual Correspondence with Occlusions via Graph Cuts" (Kolmogorov - this one produces a very good disparity map, with also occlusion informations, but is it efficient?)
"Dense Disparity Map Estimation Respecting Image Discontinuities" (Alvarez - toooo long for a first review)
Could anyone please give me some advice on how to orient myself in this wide topic?
What kind of algorithm/method should I look at first, considering that I'll work with a very simple input: a pair of left and right images and nothing else, no other information (some papers rely on additional, pre-computed calibration info)?
Speaking of working implementations, the only interesting results I've seen so far belong to this piece of software, but only for automatic rectification, not disparity: http://stereo.jpn.org/eng/stphmkr/index.html
I tried the "auto-adjustment" feature and it seems really effective. Too bad there is no source code...
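To make the pipeline I'm asking about concrete, here is a rough OpenCV sketch of what I mean (only an illustration of the general approach, not of any of the papers above; the file names are placeholders): uncalibrated rectification from matched features, then block-matching disparity.

```python
# Rough sketch: uncalibrated rectification + disparity with OpenCV.
import cv2
import numpy as np

left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)    # placeholder file names
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)
h, w = left.shape

# 1. Match features between the two images
orb = cv2.ORB_create(2000)
kp1, des1 = orb.detectAndCompute(left, None)
kp2, des2 = orb.detectAndCompute(right, None)
matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des1, des2)
pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

# 2. Fundamental matrix + rectifying homographies (no calibration needed)
F, mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC)
inl1, inl2 = pts1[mask.ravel() == 1], pts2[mask.ravel() == 1]
_, H1, H2 = cv2.stereoRectifyUncalibrated(inl1, inl2, F, (w, h))
left_r = cv2.warpPerspective(left, H1, (w, h))
right_r = cv2.warpPerspective(right, H2, (w, h))

# 3. Disparity on the rectified pair (semi-global block matching)
sgbm = cv2.StereoSGBM_create(minDisparity=0, numDisparities=128, blockSize=5)
disparity = sgbm.compute(left_r, right_r).astype(np.float32) / 16.0
```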
I have a set of videos of someone talking. I'm building a lip recognition system, so I need to perform some image processing on a specific region of the image (the lower chin and lips).
I have over 200 videos, each containing a sentence. It is natural conversation, so the head constantly moves and the lips aren't in a fixed place. I'm having difficulty specifying my region of interest in the image, as it is very tiresome having to watch through every video and mark out how big my box needs to be to ensure the lips stay within the ROI.
I was wondering if there is an easier way to check this, perhaps using MATLAB? I was thinking I could crop the video frame by frame and output an image for each frame, and then physically go through the images to see if the lips go out of frame.
I had to solve a similar problem, tracking the heads and limbs of students participating in class discussions on video. We experimented with state-of-the-art optical flow tracking from Thomas Brox (link; see the part about large-displacement optical flow). In our case, we had nearly 20 terabytes of video to work through, so we had no choice but to use a C++ and GPU implementation of the optical flow code; I think you will also discover that Matlab is impossibly slow for video analysis.
Optical flow returns detailed motion vectors. Then, if you mark the original bounding box for the mouth and chin in the first frame of the video, you can follow the tracks given by the optical flow of those pixels, and this will usually give you a good sequence of bounding boxes. You will probably have some errors to clean up, though. You could write a Python script that plays back the sequence of bounding boxes so you can quickly check for errors.
The code I wrote for this is in Python, and it's probably not easy to adapt to your data set-up or your problem, but you can find my affine-transformation based optical flow tracking code linked here, in the part called 'Object tracker using dense optical flow.'
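Not the linked code, but as a rough illustration of the idea using dense optical flow in OpenCV (the file name and the initial box are hypothetical):

```python
# Rough illustration: follow a hand-marked mouth box through a video using
# dense optical flow, shifting the box by the median flow inside it.
import cv2
import numpy as np

cap = cv2.VideoCapture("sentence_001.avi")     # hypothetical file name
ok, frame = cap.read()
prev = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
x, y, w, h = 300, 250, 120, 80                 # box marked by hand on frame 1
boxes = [(x, y, w, h)]

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(prev, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    # Median motion of the pixels currently inside the box
    dx = np.median(flow[y:y + h, x:x + w, 0])
    dy = np.median(flow[y:y + h, x:x + w, 1])
    x = int(np.clip(round(x + dx), 0, gray.shape[1] - w))
    y = int(np.clip(round(y + dy), 0, gray.shape[0] - h))
    boxes.append((x, y, w, h))
    prev = gray
cap.release()
```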
The short answer is that this is a very difficult and annoying problem for vision researchers. Most people "solve" it by placing their videos, frame by frame, onto Mechanical Turk and paying human workers about 2 cents per frame they analyze. This gives you pretty good results (you'll still have to clean them up after collecting them from the Mechanical Turkers), but it's not very helpful when you have tons of videos and you cannot wait for enough of them to get analyzed on Mechanical Turk.
There definitely isn't any 'out of the box' solution to region-of-interest annotation, though. You'd probably have to pay quite a lot for third-party software that did this automatically. My best guess for that is to check out what face.com would charge you and how well it would perform. Be careful that you don't violate any researcher confidentiality agreements with your data set though, for this or Mechanical Turk.
I'd like to know if there is any good (and freely available) text, on how to obtain motion vectors of macro blocks in raw video stream. This is often used in video compression, although my application of it is not video encoding.
Code that does this is available in OSS codecs, but understanding the method by reading the code is kinda hard.
My actual goal is to determine camera motion in 2D projection space, assuming the camera is only changing its orientation (NOT its position). What I'd like to do is divide the frames into macro blocks, obtain their motion vectors, and estimate the camera motion by averaging those vectors.
I guess OpenCV could help with this problem, but it's not available on my target platform.
The usual way is simple brute force: compare a macro block to each macro block from the reference frame and use the one that gives the smallest residual error. The code gets complex primarily because this is usually the slowest part of MV-based compression, so a lot of work goes into optimizing it, often at the expense of anything even approaching readability.
Especially for real-time compression, some reduce the workload a bit by (for example) restricting the search to the original position +/- some maximum delta. This can often gain quite a bit of compression speed in exchange for a fairly small loss of compression.
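A minimal sketch of that brute-force search with a limited delta, in plain NumPy since OpenCV isn't available to you (block size and search range are arbitrary defaults):

```python
# Minimal sketch of brute-force block matching with a limited search window.
# Frames are 2D greyscale arrays.
import numpy as np

def best_motion_vector(ref, cur, x, y, block=16, delta=8):
    """Motion vector for the block of `cur` at (x, y), searched in `ref`
    within +/- delta pixels, by minimising the sum of absolute differences."""
    h, w = ref.shape
    target = cur[y:y + block, x:x + block].astype(np.int32)
    best, best_mv = None, (0, 0)
    for dy in range(-delta, delta + 1):
        for dx in range(-delta, delta + 1):
            ry, rx = y + dy, x + dx
            if ry < 0 or rx < 0 or ry + block > h or rx + block > w:
                continue
            cand = ref[ry:ry + block, rx:rx + block].astype(np.int32)
            sad = np.abs(target - cand).sum()
            if best is None or sad < best:
                best, best_mv = sad, (dx, dy)
    return best_mv
```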
If you assume only camera motion, I suspect there is something possible with analysis of the FFT of successive images. For frequencies whose amplitudes have not changed much, the phase information will indicate the camera motion. Not sure if this will help with camera rotation, but lateral and vertical motion can probably be computed. There will be difficulties due to new information appearing on one edge and disappearing on the other and I'm not sure how much that will hurt. This is speculative thinking in response to your question, so I have no proof or references :-)
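That idea exists in a standard form as phase correlation; here is a minimal NumPy sketch (my addition, global translation only, no rotation handling):

```python
# Minimal sketch of phase correlation for estimating a global translation
# between two greyscale frames (NumPy only).
import numpy as np

def phase_correlation_shift(frame1, frame2):
    """Returns (dx, dy) such that frame2 is approximately frame1 shifted by (dx, dy)."""
    F1 = np.fft.fft2(frame1)
    F2 = np.fft.fft2(frame2)
    cross_power = F2 * np.conj(F1)
    cross_power /= np.abs(cross_power) + 1e-12   # keep the phase, drop the amplitude
    corr = np.fft.ifft2(cross_power).real
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    # Wrap shifts larger than half the frame into negative values
    if dy > frame1.shape[0] // 2:
        dy -= frame1.shape[0]
    if dx > frame1.shape[1] // 2:
        dx -= frame1.shape[1]
    return dx, dy
```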
Sounds like you're doing a very limited SLAM project?
Lots of reading matter at Bristol University, Imperial College and Oxford University, for example - you might find their approaches to finding and matching candidate features from frame to frame of interest; they are much more robust than simple sums of absolute differences.
For the most low-level algorithms of this type, the term you are looking for is optical flow, and one of the easiest algorithms of that class is the Lucas-Kanade algorithm.
This is a pretty good overview presentation that should give you plenty of ideas for an algorithm that does what you need.
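As a rough illustration of the Lucas-Kanade route (using OpenCV here purely for brevity, even though it isn't available on your target platform):

```python
# Rough illustration: sparse Lucas-Kanade tracking, then take the median
# track displacement as a global 2D camera-motion estimate.
import cv2
import numpy as np

def estimate_global_motion(prev_gray, gray):
    p0 = cv2.goodFeaturesToTrack(prev_gray, maxCorners=200,
                                 qualityLevel=0.01, minDistance=7)
    if p0 is None:
        return 0.0, 0.0
    p1, st, _ = cv2.calcOpticalFlowPyrLK(prev_gray, gray, p0, None)
    good = st.ravel() == 1
    if not good.any():
        return 0.0, 0.0
    d = (p1 - p0).reshape(-1, 2)[good]
    # Median is more robust to moving foreground objects than the mean
    return float(np.median(d[:, 0])), float(np.median(d[:, 1]))
```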
Can anyone explain in a simple, clear way how MPEG-4 works to compress data? I'm mostly interested in video. I know there are different standards, or parts, to it. I'm just looking for the predominant overall compression method, if there is one with MPEG-4.
MPEG-4 is a huge standard, and employs many techniques to achieve the high compression rates that it is capable of.
In general, video compression is concerned with throwing away as much information as possible whilst having a minimal effect on the viewing experience for an end user. For example, using subsampled YUV instead of RGB cuts the video size in half straight away. This is possible as the human eye is less sensitive to colour than it is to brightness. In YUV, the Y value is brightness, and the U and V values represent colour. Therefore, you can throw away some of the colour information which reduces the file size, without the viewer noticing any difference.
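A quick back-of-the-envelope calculation of that saving, assuming 4:2:0 chroma subsampling:

```python
# Back-of-the-envelope: raw frame size in RGB vs 4:2:0 subsampled YUV.
width, height = 1920, 1080
rgb_bytes = width * height * 3                                     # 3 bytes per pixel
yuv420_bytes = width * height + 2 * (width // 2) * (height // 2)   # Y full-res, U and V quarter-res
print(rgb_bytes, yuv420_bytes, yuv420_bytes / rgb_bytes)           # ratio = 0.5
```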
After that, most compression techniques take advantage of 2 redundancies in particular. The first is temporal redundancy and the second is spatial redundancy.
Temporal redundancy notes that successive frames in a video sequence are very similar. Typically a video runs at around 20-30 frames per second, and not much changes in 1/30 of a second. Take any DVD and pause it, then step forward one frame and note how similar the two images are. So, instead of encoding each frame independently, MPEG-4 (and other compression standards) only encodes the difference between successive frames (using motion estimation to find that difference).
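Conceptually, the encoder stores a motion vector plus a residual, and the decoder reverses the process. A toy sketch (whole-frame motion for simplicity, whereas real codecs do this per macroblock):

```python
# Toy sketch of motion-compensated prediction: a P-frame is stored as a motion
# vector plus a residual against the previous frame.
import numpy as np

def predict(prev_frame, dx, dy):
    # Shift the previous frame by the motion vector found during motion estimation
    return np.roll(np.roll(prev_frame, dy, axis=0), dx, axis=1)

def encode_p_frame(prev_frame, cur_frame, dx, dy):
    # Residual is mostly small values, which entropy-code cheaply
    return cur_frame.astype(np.int16) - predict(prev_frame, dx, dy)

def decode_p_frame(prev_frame, dx, dy, residual):
    # Prediction + residual reproduces the current frame exactly
    return (predict(prev_frame, dx, dy).astype(np.int16) + residual).astype(np.uint8)
```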
Spatial redundancy takes advantage of the fact that, in general, the colour spread across an image tends to be quite low frequency. By this I mean that neighbouring pixels tend to have similar colours. For example, in an image of you wearing a red jumper, all of the pixels that represent your jumper would have a very similar colour. It is possible to use the DCT to transform the pixel values into frequency space, where some high-frequency information can be thrown away. Then, when the inverse DCT is performed (during decoding), the image is reconstructed without the thrown-away high-frequency information.
To view the effects of throwing away high-frequency information, open MS Paint and draw a series of overlapping horizontal and vertical black lines. Save the image as a JPEG (which also uses the DCT for compression). Now zoom in on the pattern and notice how the edges of the lines are not as sharp anymore and are a bit blurry. This is because some high-frequency information (the transition from black to white) has been thrown away during compression. Read this for an explanation with nice pictures.
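You can reproduce the same effect numerically; a minimal sketch using SciPy's DCT on a single 8x8 block (the block values are made up):

```python
# Minimal sketch of the spatial-redundancy idea: 2D DCT of an 8x8 block,
# zero out the high-frequency coefficients, then invert.
import numpy as np
from scipy.fft import dctn, idctn

block = np.add.outer(np.arange(8.0), np.arange(8.0)) * 10   # a smooth made-up 8x8 block

coeffs = dctn(block, norm="ortho")
mask = np.zeros_like(coeffs)
mask[:4, :4] = 1                          # keep only the 4x4 low-frequency corner
approx = idctn(coeffs * mask, norm="ortho")

print(np.abs(block - approx).max())       # reconstruction error from the dropped detail
```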
For further reading, this book is quite good, if a little heavy on the maths.
Like any other popular video codec, MPEG4 uses a variation of discrete cosine transform and a variety of motion-compensation techniques (which you can think of as motion-prediction if that helps) that reduce the amount of data needed for subsequent frames. This page has an overview of what is done by plain MPEG4.
It's not totally dissimilar to the techniques used by JPEG.
MPEG4 uses a variety of techniques to compress video.
If you haven't already looked at wikipedia, this would be a good starting point.
There is also this article from the IEEE which explains these techniques in more detail.
Sharp edges certainly DO contain high frequencies. Reducing or eliminating high frequencies reduces the sharpness of edges. Fine detail, including sharp edges, is removed along with the high frequencies: the ability to resolve two small objects is lost when the high frequencies are removed, and then you see just one.