Real-time video (image) stitching

I'm thinking of stitching images from two or more cameras (currently three or four) in real time using OpenCV 2.3.1 in Visual Studio 2008.
However, I'm curious about how it is done.
Recently I've studied some feature-based image stitching techniques.
Most of them require at least the following steps:
1. Feature detection
2. Feature matching
3. Finding the homography
4. Transformation of the target images to the reference image
...etc
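Steps 3 and 4 above can be sketched in a few lines. This is a minimal pure-Python illustration, not OpenCV's implementation: it assumes exactly four already-matched point pairs (steps 1 and 2 are taken as given), fixes the last homography entry to 1, and all function names are hypothetical. If the cameras are fixed relative to one another, a matrix estimated this way could in principle be computed once and reused for every frame, leaving only the warp as per-frame work.

```python
def solve_linear(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate point configuration")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def find_homography(src, dst):
    """Step 3: estimate a 3x3 homography (h33 = 1) from four point pairs."""
    a, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        # u = (h1*x + h2*y + h3) / (h7*x + h8*y + 1), and likewise for v
        a.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        a.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    h = solve_linear(a, b)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]


def warp_point(H, x, y):
    """Step 4: map a target-image point into the reference image."""
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return ((H[0][0] * x + H[0][1] * y + H[0][2]) / w,
            (H[1][0] * x + H[1][1] * y + H[1][2]) / w)


# Hypothetical matches: a unit square scaled by 2 and shifted by (10, 5).
src = [(0, 0), (1, 0), (1, 1), (0, 1)]
dst = [(10, 5), (12, 5), (12, 7), (10, 7)]
H = find_homography(src, dst)
print(warp_point(H, 0.5, 0.5))
```

A real pipeline would use many matches with a robust estimator rather than four exact pairs, and would warp whole images instead of single points.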
Most of the techniques I've read about process a set of images only once, whereas I want to process a continuous series of images captured from several cameras, and I want it to run in real time.
That may still sound confusing, so here is the setup in more detail:
Three cameras are placed at different angles and positions, each with a field of view that overlaps its neighbor's, so that their streams can be stitched into a single video in real time.
What I would like to do is similar to the video at the following link, where ASIFT is used.
http://www.youtube.com/watch?v=a5OK6bwke3I
I tried to contact the owner of that video but got no reply.
Can I use image-stitching methods to deal with video stitching?
Video itself is composed of a series of images so I wonder if this is possible.
However, detecting feature points seems to be very time-consuming whichever feature detector (SURF, SIFT, ASIFT, etc.) you use. This makes me doubt that real-time video stitching is possible.

I have worked on a real-time video stitching system, and it is a difficult problem. I can't disclose the full solution we used due to an NDA, but I implemented something similar to the one described in this paper. The biggest problem is coping with objects at different depths: simple homographies are not sufficient, so depth disparities must be determined and the video frames warped accordingly so that common features are aligned. This is essentially a stereo vision problem. The images must first be rectified so that common features appear on the same scan line.
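To illustrate why rectification helps: once common features lie on the same scan line, finding the disparity of a pixel becomes a one-dimensional search. The toy block matcher below is only a sketch of that idea, not the system or the paper mentioned above. It assumes each row is a plain list of grayscale values from a rectified pair, and the function name, window size and disparity range are hypothetical.

```python
def scanline_disparity(left_row, right_row, x, half_window=2, max_disparity=16):
    """Disparity at column x of the left row, found by 1-D block matching.

    A feature at column x in the left row is assumed to appear at column
    x - d in the right row; the d with the lowest sum of absolute
    differences (SAD) over the window wins.
    """
    lo, hi = x - half_window, x + half_window
    if lo < 0 or hi >= len(left_row):
        raise ValueError("window does not fit inside the row")
    best_d, best_cost = 0, float("inf")
    for d in range(max_disparity + 1):
        if lo - d < 0:
            break  # shifted window would leave the right row
        cost = sum(abs(left_row[i] - right_row[i - d]) for i in range(lo, hi + 1))
        if cost < best_cost:
            best_d, best_cost = d, cost
    return best_d


# Synthetic rectified rows: the same bump, shifted 3 pixels in the right view.
left = [0] * 9 + [10, 50, 90, 50, 10] + [0] * 6
right = left[3:] + [0] * 3
print(scanline_disparity(left, right, x=11, max_disparity=8))
```

Nearby objects produce larger disparities than distant ones, which is why a single homography cannot align both at once.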

You might also be interested in my project from a few years back. It's a program which lets you experiment with different stitching parameters and watch the results in real-time.
Project page - https://github.com/lukeyeager/StitcHD
Demo video - https://youtu.be/mMcrOpVx9aY?t=3m38s

Related

Pass modified video stream into Vuforia Engine?

Is it possible to modify the Vuforia video stream for better tracking performance?
Step 1: Get the raw pixel data from the VuforiaBehaviour.Instance.CameraDevice.GetCameraImage();
Step 2: Modify the pixels with post processing via custom shaders in Unity. For example apply a threshold or edge detection.
Step 3: Vuforia Engine uses the modified video input to track images.
That's the idea, but I'm not sure whether Vuforia will then pass the modified video into the Vuforia Engine or still use the unmodified video input for tracking.
If anybody has experience with that I would be thankful for your help! :)
Vuforia Engine assumes that the input images look like "natural" images. Passing an image belonging to a different domain (e.g., the result of an edge detector) is unlikely to improve tracking performance.
That said, tracking performance is affected by image quality. For example, if images are blurry, tracking robustness is going to suffer. If this is the case you might want to look at trying to adjust system camera parameters via the platform API (iOS, Android, etc.). However, please note that this might or might not be possible depending on the platform. Also, on some platforms when a device tracker like ARKit or ARCore is used, the platform tracker itself adjusts the camera parameters for good tracking performance. For example it might keep the exposure time low to reduce blur.

Suggestions or methods for automatically finding/locating/detecting TV logos

Logo detection usually means both finding the logo and recognizing it. Some common approaches do the two steps together using SIFT/SURF matching, as detailed in
(1) Logo recognition in images
(2) Logo detection using OpenCV
But if the logo is tiny and blurry, the result is poor and the matching is rather time-consuming. I want to split the two steps: first find where the logo is in the video, then recognize the logo using template matching or another method, such as:
(3) Logo recognition - how to improve performance
(4) OpenCV logo recognition
My problem is mainly about finding the logo automatically in video. I tried two methods:
Brightness method. A TV logo usually stays on screen for the whole show, so I select frames at random and compute the difference between them; in the logo area the difference tends to be 0. I then count, per pixel, how often the difference falls below a threshold to decide whether the pixel belongs to the logo. This method usually works well but fails when the show has a static background.
Edge method. Similarly, if a logo is present, its border tends to be distinct. I gather the same statistics as in the brightness method, but on edges; the edges are sometimes unstable, for example against a very bright background.
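For reference, the brightness method described above could be sketched as follows. This is one reading of the description, not the asker's actual code: frames are assumed to be grayscale 2-D lists of equal size, and the function name, thresholds and the voting ratio are hypothetical choices.

```python
import random


def static_pixel_mask(frames, num_pairs=20, diff_threshold=5,
                      vote_ratio=0.9, seed=0):
    """Mark pixels whose brightness barely changes between random frame pairs.

    Returns a 2-D list of booleans; True means "candidate logo pixel".
    """
    rng = random.Random(seed)
    h, w = len(frames[0]), len(frames[0][0])
    votes = [[0] * w for _ in range(h)]
    for _ in range(num_pairs):
        a, b = rng.sample(frames, 2)  # two distinct frames
        for y in range(h):
            for x in range(w):
                if abs(a[y][x] - b[y][x]) <= diff_threshold:
                    votes[y][x] += 1
    needed = vote_ratio * num_pairs
    return [[votes[y][x] >= needed for x in range(w)] for y in range(h)]


# Synthetic clip: a constant 2x2 "logo" in the top-left corner, with the
# rest of the picture changing from frame to frame.
frames = [[[200 if (y < 2 and x < 2) else 20 * k for x in range(4)]
           for y in range(4)] for k in range(6)]
mask = static_pixel_mask(frames)
print(mask[0][0], mask[3][3])
```

As noted above, any static background region would pass the same test, so the mask only yields candidates.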
Are there any suggestions or state-of-the-art methods for automatically finding logo areas, and any logo recognition methods other than SIFT or template matching?
Let's assume your list of logos is known beforehand and you have access to examples (video streams/frames) of all logos.
The 2017 answer to your question is to train a logo classifier, most likely a deep neural network.
With sufficient training data, the network should be able to detect any logo that TV viewers can identify. It should also be able to handle local blurring and intensity changes (which may thwart "classic" image processing methods based on brightness and edges).
OpenCV can load and run network models from multiple frameworks such as Caffe, Torch and TensorFlow, so you can use one of their pre-trained models or train one yourself.
You could also try TensorFlow's object detection API here: https://github.com/tensorflow/models/tree/master/research/object_detection
The good thing about this API is that it includes state-of-the-art object detection and classification models. The models TensorFlow provides are free to train, and some of them promise impressive results. I have already trained a model for the company I work for that does a very good job of detecting logos in images and video streams. You can read more about my work here: https://github.com/kochlisGit/LogoLens
The problem with TV is that the logos will probably not be static and will move between frames. This produces motion blur, which may confuse your classifier or cause it to miss the logos. However, once you find a logo, you can use an object tracking algorithm (e.g. DeepSORT) to keep track of it.

Feature matching difficulty

I'm working on a 3D reconstruction project and I'm having trouble matching features in order to proceed with the reconstruction. To be more specific, when I match features in MATLAB's example images I get a high ratio of correct to wrong matches, but when I match features in my own photos, taken with a phone camera, I get almost only wrong matches. I've tried tuning the threshold, but the problem remains. Any ideas or suggestions about what is going wrong?
The descriptor I'm using is the SIFT descriptor from the VLFeat toolbox.
Edit: here is a dropbox link with the original images, the detected salient/corner points and the matches.
I think your main problems here are significant difference in lighting between the images, and specular reflections off the plastic casing. You are also looking at the inside of the USB drive through the transparent plastic, which doesn't help.
What feature detectors/descriptors have you tried? I would start with SURF, and then I would try MSER. It is also possible to use multiple detectors and descriptors, but you should be careful to keep them separate. Of course, there are also lots of parameters for you to tune.
Another thing that may be helpful is to take higher-resolution images.
If you are trying to do 3D reconstruction, can you assume that the camera does not move much between the images? In that case, try using vision.PointTracker to track points from one frame into the other instead of matching them.

Image Cropping - Region of Interest Query

I have a set of videos of someone talking. I'm building a lip recognition system, so I need to perform some image processing on a specific region of the image (the lower chin and lips).
I have over 200 videos, each containing a sentence. It is natural conversation, so the head moves constantly and the lips aren't in a fixed place. I'm having difficulty specifying my region of interest in the image, as it is very tiresome to watch every video and work out how big my box must be to ensure the lips stay within the ROI.
I was wondering if there would be an easier way to check this, perhaps using MATLAB? I was thinking I could crop the video frame by frame and output an image for each frame, and then manually go through the images to see if the lips go out of frame.
I had to solve a similar problem dealing with tracking the heads and limbs of students participating in class discussions on video. We experimented with state-of-the-art optical flow tracking from Thomas Brox (link; see the part about large-displacement optical flow). In our case, we had nearly 20 terabytes of video to work through, so we had no choice but to use a C++ and GPU implementation of the optical flow code; I think you will also discover that MATLAB is impossibly slow for video analysis.
Optical flow gives you detailed per-pixel motion vectors. If you mark the original bounding box for the mouth and chin in the first frame of the video, you can follow the tracks given by the optical flow of those pixels, and this will usually give you a good sequence of bounding boxes. You will probably have errors that you have to clean up, though. You could write a Python script that plays back the sequence of bounding boxes so that you can quickly check for errors.
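A much simplified sketch of following a box with optical flow is shown below. It assumes the dense flow field for each frame has already been computed by some other tool and is available as a 2-D list of (dx, dy) vectors. It moves the box by the mean flow inside it, which is a translation-only simplification and not the affine-transformation based tracker mentioned in this answer. The function names are hypothetical.

```python
def propagate_box(box, flow):
    """Shift a box by the mean optical flow of the pixels inside it.

    box  = (x0, y0, x1, y1), inclusive pixel coordinates
    flow = 2-D list where flow[y][x] is the (dx, dy) motion of that pixel
    """
    x0, y0, x1, y1 = box
    vectors = [flow[y][x] for y in range(y0, y1 + 1) for x in range(x0, x1 + 1)]
    dx = round(sum(v[0] for v in vectors) / len(vectors))
    dy = round(sum(v[1] for v in vectors) / len(vectors))
    h, w = len(flow), len(flow[0])
    # Clamp the shift so the box stays inside the frame.
    dx = max(-x0, min(dx, w - 1 - x1))
    dy = max(-y0, min(dy, h - 1 - y1))
    return (x0 + dx, y0 + dy, x1 + dx, y1 + dy)


def track(first_box, flows):
    """Return one box per frame, starting from a hand-marked first box."""
    boxes = [first_box]
    for flow in flows:
        boxes.append(propagate_box(boxes[-1], flow))
    return boxes


# Synthetic example: everything moves 2 px right and 1 px down per frame.
uniform_flow = [[(2, 1)] * 10 for _ in range(10)]
print(track((1, 1, 3, 3), [uniform_flow, uniform_flow]))
```

The resulting list of boxes is what a playback script would draw over the frames for a quick visual check.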
The code I wrote for this is in Python, and it's probably not easy to adapt to your data set-up or your problem, but you can find my affine-transformation based optical flow tracking code linked here, in the part called 'Object tracker using dense optical flow.'
The short answer is that this is a very difficult and annoying problem for vision researchers. Most people "solve" it by placing their videos, frame by frame, onto Mechanical Turk, and paying human workers about 2 cents per frame that they analyze. This gives you pretty good results (you'll still have to clean them after collecting it from the Mechanical Turkers), but it's not very helpful when you have tons o' videos and you cannot wait for enough of them to randomly get analyzed on Mechanical Turk.
There definitely isn't any 'out of the box' solution to region-of-interest annotation, though. You'd probably have to pay quite a lot for third-party software that did this automatically. My best guess for that is to check out what face.com would charge you and how well it would perform. Be careful that you don't violate any researcher confidentiality agreements with your data set though, for this or Mechanical Turk.

SURF and SIFT Alternative Object Tracking Algorithm for Augmented Reality

After asking here and trying both SURF and SIFT, neither of them seems to be efficient enough to generate interest points fast enough to track a stream from the camera.
SURF, for example, takes around 3 seconds to generate interest points for an image. That's way too slow to track video coming from a webcam, and it will be even worse on a mobile phone.
I just need an algorithm that tracks a certain area, its scale, tilt, etc., and I can build on top of that.
Thanks
I suspect your SURF usage may need some alteration?
Here is a link to an MIT paper on using SURF for augmented reality applications on mobile devices.
Excerpt:
In this section, we present our implementation of the SURF algorithm and its adaptation to the mobile phone. Next, we discuss the impact that accuracy has on the speed of the nearest-neighbor search and show that we can achieve an order of magnitude speed-up with minimal impact on matching accuracy. Finally, we discuss the details of the phone implementation of the image matching pipeline. We study the performance, memory use, and bandwidth consumption on the phone.
You might also want to look into OpenCV's algorithms because they are tried and tested.
Depending on the constraints of your application, you may be able to reduce the generality of those algorithms to look for known POIs and markers within the image.
Part of tracking a POI is estimating its motion vector from one position in the 2D image to the next, and then optionally confirming that it still exists there (through pixel characteristics). The same approach can be used to track changes in the perspective and rotation of a POI or a POI group/object without re-scanning the entire image.
There are tons of papers online on tracking objects in a 2D projection (up to a severe skew in many cases).
Good Luck!
You should try the FAST detector:
http://svr-www.eng.cam.ac.uk/~er258/work/fast.html
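To give an idea of why FAST is cheap, here is a minimal pure-Python sketch of its segment test: a pixel counts as a corner if at least n contiguous pixels on a 16-pixel circle of radius 3 are all brighter or all darker than the center by more than a threshold. The function names and default parameters are hypothetical, and the sketch leaves out the early-rejection tests and non-maximum suppression that a practical implementation would add.

```python
# Offsets (dx, dy) of the 16-pixel circle of radius 3, in clockwise order.
CIRCLE = [(0, -3), (1, -3), (2, -2), (3, -1), (3, 0), (3, 1), (2, 2), (1, 3),
          (0, 3), (-1, 3), (-2, 2), (-3, 1), (-3, 0), (-3, -1), (-2, -2), (-1, -3)]


def is_fast_corner(img, x, y, threshold=20, n=9):
    """Segment test for one pixel of a grayscale image given as a 2-D list."""
    p = img[y][x]
    ring = [img[y + dy][x + dx] for dx, dy in CIRCLE]
    for sign in (1, -1):  # first look for a brighter arc, then a darker one
        flags = [sign * (v - p) > threshold for v in ring]
        run = 0
        for f in flags + flags:  # doubled list handles arcs that wrap around
            run = run + 1 if f else 0
            if run >= n:
                return True
    return False


def fast_corners(img, threshold=20, n=9):
    """Return (x, y) for every pixel that passes the segment test."""
    h, w = len(img), len(img[0])
    return [(x, y) for y in range(3, h - 3) for x in range(3, w - 3)
            if is_fast_corner(img, x, y, threshold, n)]


# Synthetic image: a bright quadrant whose corner sits at (4, 4).
img = [[255 if (x >= 4 and y >= 4) else 0 for x in range(9)] for y in range(9)]
print((4, 4) in fast_corners(img))
```

Each pixel needs only a handful of comparisons and no filtering or scale-space construction, which is where the speed comes from.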
We are using SURF for a project and we found that OpenSURF outperforms OpenCV's SURF implementation in raw speed. We still haven't tested repeatability and accuracy, but it is way faster.
Update:
I just wanted to point out that you needn't perform a SURF match step in every frame; you could simply do it every other frame and interpolate the position of the object in the frames you don't execute SURF on.
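A minimal sketch of that interpolation, assuming the per-frame results are stored as a list in which skipped frames hold None (the function name and data layout are hypothetical):

```python
def fill_skipped(positions):
    """Linearly interpolate object positions for frames that were skipped.

    positions: list of (x, y) tuples, with None for frames where the
    detector was not run. Gaps before the first or after the last
    detection are left as None.
    """
    filled = list(positions)
    known = [i for i, p in enumerate(positions) if p is not None]
    for a, b in zip(known, known[1:]):
        (xa, ya), (xb, yb) = positions[a], positions[b]
        for i in range(a + 1, b):
            t = (i - a) / (b - a)
            filled[i] = (xa + t * (xb - xa), ya + t * (yb - ya))
    return filled


# Detector run on frames 0, 2 and 4 only.
print(fill_skipped([(0, 0), None, (4, 2), None, (8, 8)]))
```

Note that interpolating a skipped frame needs the detection from the following frame, so in a live stream this adds one frame of latency; extrapolating from the previous two detections would avoid that at the cost of accuracy.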
You can use a simpler algorithm if you place stricter restrictions on the area you want to track. As you surely know, ARToolKit is pretty fast, but it only tracks black-and-white markers with a very distinct frame.
If you want a (somewhat) general purpose tracker, you may want to check PTAM. The site (http://www.robots.ox.ac.uk/~gk/PTAM/) is currently down, but here's a snazzy video of it working on an iPhone (http://www.youtube.com/watch?v=pBI5HwitBX4)
As others have mentioned, three seconds seems unusually long. While testing the SURF implementation in the Mahotas library, I found that it took on average 0.36sec, even with some fairly large images (e.g. 1024x768). And that's with a mix of Python and C, so I'd imagine some other pure-C implementations would be even faster.
I found this nice comparison of the feature detection algorithms at http://computer-vision-talks.com/2011/01/comparison-of-the-opencvs-feature-detection-algorithms-2/
Have a look. It might be useful!
According to that comparison, and as mirror2image has also suggested, FAST is the best choice. But it depends on what you really want to achieve.
One option I've used in constrained embedded systems is to use a simpler interest point detector, for example FAST or Shi-Tomasi. I used Shi-Tomasi, as I was targeting an FPGA and could easily run it at pixel rate with no significant buffering required.
Then use SURF to generate the descriptors for the image patches around the identified features and use those for matching and tracking purposes.
