Problem statement:
I want to grab a smartphone, take a series of photos (or a video) of an object, and convert it to a 360 degree photo.
Some Research:
If we look at Facebook 360 Photos, this is exactly what I'm looking for, except that Facebook's solution is outward-facing 360 photos, and I'm looking for inward facing 360 photos.
This objective seems to be similar to 360 degree product photography. Important difference: I do not want to use any special equipment other than a smartphone. Just like you can create a 360 degree outward facing photo without needing a tripod or a turntable.
I want to understand from the community:
Does a solution like this exist? What's the best we can do at the moment?
What kind of technological expertise would a person require to create something like this? Consider yourself an investor or a CEO who needs to get this built. Who do you hire? Who do you consult?
Thanks a lot for the help.
There is a fundamental difference between the two cases:
"Outward": the translation of the camera is small. If the scene is far enough away, it can be ignored, and the camera motion can be approximated with a rotation about its focal point (there is almost no parallax between views). The mapping from one image to any other image is well approximated a homography, and the image set maps naturally to the inner surface of a sphere (or, aproximately, a cylinder, a cube, etc.). A scene far away will also appear to move slowly, therefore capture time is less of a factor when stitching images.
"Inward": the translation is large and cannot be ignored. There is parallax, the scene objects may self-occlude or mutually occlude each other in some of the images, making "stitching" highly nontrivial - mapping of one image onto the other depends on the scene content, unlike the outward case. If the content of the scene moves, stitching becomes an even harder problem.
In both cases, however, one normally relies on bundle adjustment for the final refinement of the camera poses/positions. In the second case the 3D geometry of the scene may need to be reconstructed, depending on the application.
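For the "outward" case, the homography-based pipeline above is already packaged in OpenCV's stitching module. A minimal sketch (the image file names are placeholders):

```python
# Minimal sketch of the "outward" (panorama) case with OpenCV's stitcher,
# which implements the homography-based pipeline described above.
import cv2

images = [cv2.imread(p) for p in ["pano_01.jpg", "pano_02.jpg", "pano_03.jpg"]]
stitcher = cv2.Stitcher_create(cv2.Stitcher_PANORAMA)
status, panorama = stitcher.stitch(images)
if status == cv2.Stitcher_OK:
    cv2.imwrite("panorama.jpg", panorama)
```

The "inward" case has no such off-the-shelf equivalent, for the reasons above.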
To your questions:
Of course a solution exists: have you seen "The Matrix" with its "bullet-time" effect? A Google search for "bullet time" shows several more or less successful attempts at reproducing it - the easiest involves tying an iPhone to a string and swinging it around.
Someone with background and expertise in photogrammetry, 3D computer vision (roughly, they have read and internalized Hartley & Zisserman's book or equivalents), and nontrivial image processing - there is some art involved in stitching correctly once you have solved the photogrammetry; it's not just "graph-cut it and then multiband-blend it".
I want to remove background and get deer as a foreground image.
This is my source image captured by trail camera:
This is what I want to get. This output image can be a binary image or RGB.
I worked on it and tried many methods to find a solution, but every time it failed at some specific point. So please first understand what my exact problem is.
The images are captured by a trail camera, and the camera is a motion detector: when a deer comes in front of the camera, it captures an image.
The scene changes with the weather, day and night, etc., so I can't use frame differencing or anything like that.
Segmentation may not work correctly because the foreground (deer) and the background have the same color in many cases.
If anyone still finds any ambiguity in my question, please ask me to clarify before answering; it will be appreciated.
Thanks in advance.
Here's what I would do:
As was commented on your question, you can detect the deer and then perform GrabCut to segment it from the picture.
To detect the deer, I would couple a classifier with a sliding-window approach. That means you have a classifier that, given a patch (it can be a large patch) in the image, outputs a score of how similar that patch is to a deer. The sliding-window approach means that you loop over the window size and then loop over the window location. For each position of the window in the image, you apply the classifier to that window and get a score of how much that window "looks like" a deer. Once you've done that, threshold all the scores to get the "best windows", i.e. the windows most similar to a deer. The rationale is that if a deer is present at some location in the image, the classifier will output a high score for all windows that are close to or overlap the actual deer location. We would like to merge all those locations into a single location. That can be done with the function groupRectangles from OpenCV:
http://docs.opencv.org/modules/objdetect/doc/cascade_classification.html#grouprectangles
Take a look at a face detection example from OpenCV; it basically does the same thing (sliding window + classifier), where the classifier is a Haar cascade.
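Here is a minimal, untested sketch of that sliding-window-plus-grouping idea; score_patch is a placeholder for whatever classifier you end up training, and the window sizes, stride, and threshold are assumptions to tune:

```python
# Minimal sketch of sliding-window detection followed by rectangle grouping.
# score_patch is a placeholder for whatever classifier you train (HOG+SVM,
# CNN features + SVM, ...); window sizes, stride, threshold are assumptions.
import cv2

def score_patch(patch):
    # Hypothetical classifier: returns a "deer-likeness" score for the patch.
    return 0.0

def detect(image, window_sizes=(128, 192, 256), stride=32, threshold=0.5):
    candidates = []
    for w in window_sizes:                                # loop over window size
        for y in range(0, image.shape[0] - w, stride):    # loop over location
            for x in range(0, image.shape[1] - w, stride):
                if score_patch(image[y:y + w, x:x + w]) > threshold:
                    candidates.append([x, y, w, w])
    if not candidates:
        return []
    # Merge the overlapping "best windows" into single detections.
    rects, weights = cv2.groupRectangles(candidates, groupThreshold=1, eps=0.2)
    return rects

print(detect(cv2.imread("trail_camera.jpg")))
```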
Now, I didn't mention what that "deer classifier" can be. You can use HOG+SVM (both included in OpenCV) or a much more powerful approach: running a deep convolutional neural network (deep CNN). Luckily, you don't need to train a deep CNN yourself. You can use the following packages with their "off the shelf" ImageNet networks (which are very powerful and might even be able to identify a deer without further training):
Decaf- which can be used only for research purposes:
https://github.com/UCB-ICSI-Vision-Group/decaf-release/
Or Caffe - which is BSD licensed:
http://caffe.berkeleyvision.org/
There are other packages of which you can read about here:
http://deeplearning.net/software_links/
The most common ones are Theano, Cuda ConvNet and OverFeat (but that's really opinion based; you should choose the best package from the list that I linked to).
The "off the shelf" ImageNet network were trained on roughly 10M images from 1000 categories. If those categories contain "dear", that you can just use them as is. If not, you can use them to extract features (as a 4096 dimensional vector in the case of Decaf) and train a classifier on positive and negative images to build a "dear classifier".
Now, once you have detected the deer, meaning you have a bounding box around it, you can apply GrabCut:
http://docs.opencv.org/trunk/doc/py_tutorials/py_imgproc/py_grabcut/py_grabcut.html
You'll need an initial scribble on the deer to perform GrabCut. You can just take a horizontal line in the middle of the bounding box and hope that it will be on the deer's torso. A more elaborate approach would be to find the symmetry axis of the deer and use that as a scribble, but you would have to google, research and implement some method to extract a symmetry axis from the image.
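A minimal GrabCut sketch, assuming the detector gave you a bounding box (the rectangle coordinates and file names below are made up); the horizontal-line scribble could be added by marking those mask pixels as cv2.GC_FGD before the call:

```python
# Minimal GrabCut sketch with a bounding-box initialization; the rectangle
# and file names are assumptions.
import cv2
import numpy as np

img = cv2.imread("trail_camera.jpg")
mask = np.zeros(img.shape[:2], np.uint8)
bgd_model = np.zeros((1, 65), np.float64)
fgd_model = np.zeros((1, 65), np.float64)

rect = (100, 80, 300, 220)  # hypothetical (x, y, w, h) from the detector
cv2.grabCut(img, mask, rect, bgd_model, fgd_model, 5, cv2.GC_INIT_WITH_RECT)

# Definite + probable foreground pixels form the deer segment.
fg = np.where((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD), 1, 0).astype("uint8")
cv2.imwrite("deer_segmented.png", img * fg[:, :, None])
```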
That's about it. Not straightforward, but so is the problem.
Please let me know if you have any questions.
Try OpenCV background subtraction with Mixture of Gaussians models. They should be adaptable enough for your scenes. Of course, the final performance will depend on the scenario, but it is worth trying.
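For reference, a quick sketch of the Mixture-of-Gaussians subtractor applied to a short sequence of frames (the file names and threshold are placeholders):

```python
# Sketch of OpenCV's MOG2 background subtractor on a frame sequence.
import cv2

subtractor = cv2.createBackgroundSubtractorMOG2(history=50, detectShadows=True)

for path in ["frame_001.jpg", "frame_002.jpg", "frame_003.jpg"]:
    frame = cv2.imread(path)
    fg_mask = subtractor.apply(frame)       # 255 = foreground, 127 = shadow
    fg_mask = cv2.threshold(fg_mask, 200, 255, cv2.THRESH_BINARY)[1]  # drop shadows
    cv2.imwrite(path.replace(".jpg", "_mask.png"), fg_mask)
```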
Since you just want to separate the background from the foreground, I think you do not need to recognize the deer. You need to recognize an object in motion in the scene: you just need to separate what is static over a significant interval of time (the background) from what is not static (the deer).
There are algorithms that combine multiple frames from the same scene in order to determine the background, like THIS ONE.
You mentioned that the scene changes with the weather and between day and night when considering photos of different deer.
You could implement a solution where, when motion is detected, the camera takes a few photos with some interval of time between them instead of a single one.
This interval has to be long enough to get the deer in different positions or out of the scene, and at the same time short enough not to be much affected by scene variations. Perhaps you will need to deal with some brightness variation, but I think it is feasible to determine the background from these frames and finally segment the deer in the "motion frame".
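A sketch of that idea, assuming a burst of a few frames was saved (file names and the threshold are made up): estimate the background as the per-pixel median of the burst, then diff the "motion frame" against it.

```python
# Median-of-burst background estimate, then a simple difference mask.
import cv2
import numpy as np

paths = ["burst_01.jpg", "burst_02.jpg", "burst_03.jpg", "burst_04.jpg", "burst_05.jpg"]
frames = [cv2.imread(p) for p in paths]
background = np.median(np.stack(frames), axis=0).astype("uint8")

motion_frame = frames[2]                                 # the frame with the deer
diff = cv2.absdiff(motion_frame, background)
gray = cv2.cvtColor(diff, cv2.COLOR_BGR2GRAY)
mask = cv2.threshold(gray, 30, 255, cv2.THRESH_BINARY)[1]
cv2.imwrite("deer_mask.png", mask)
```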
Is it possible to make a 3D representation of an object by capturing many different angles using a webcam? If it is, how is it possible and how is the image-processing done?
My plan is to make a 3D representation of a person using a webcam; then, from the 3D representation, I will be able to tell the person's vital statistics.
As Bart said (but did not post as an actual answer) this is entirely possible.
The research topic you are interested in is often called multi view stereo or something similar.
The basic idea revolves around using point correspondences between two (or more) images and then trying to find the best matching camera positions. When the positions are found, you can use stereo algorithms to back-project the image points into a 3D coordinate system and form a point cloud.
You can then process that point cloud further to get the measurements you are looking for.
If you are completely new to the subject you have some fascinating reading to look forward to!
Bart proposed Multiple view geometry by Hartley and Zisserman, which is a very nice book indeed.
As Bart and Kigurai pointed out, this process has been studied under the title of "stereo" or "multi-view stereo" techniques. To be able to get a 3D model from a set of pictures, you need to do the following:
a) You need to know the "internal" parameters of the camera. These include the focal length, the principal point of the image, and the radial distortion of the lens.
b) You also need to know the position and orientation of each camera with respect to each other or a "world" co-ordinate system. This is called the "pose" of the camera.
There are algorithms to perform (a) and (b) which are described in Hartley and Zisserman's "Multiple View Geometry" book. Alternatively, you can use Noah Snavely's "Bundler" http://phototour.cs.washington.edu/bundler/ software to also do the same thing in a very robust manner.
Once you have the camera parameters, you essentially know how a 3D point (X,Y,Z) in the world maps to an image co-ordinate (u,v) on the photo. You also know how to map an image co-ordinate back to the world. You can create a dense point cloud by searching, for each pixel in one photo, for a match in a photo taken from a different view-point. This requires a two-dimensional search. You can simplify this procedure by making the search 1-dimensional. This is called "rectification". You essentially take two photos and transform them so that their rows correspond to the same line in the world (simplified statement). Now you only have to search along image rows.
An algorithm for this can be also found in Hartley and Zisserman.
Finally, you need to do the matching based on some measure. There is a lot of literature out there on "stereo matching". Another word used is "disparity estimation". This is basically searching for the match of pixel (u,v) on one photo to its match (u, v') on the other photo. Once you have the match, the difference between them can be used to map back to a 3D point.
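A minimal sketch of that disparity step with OpenCV, assuming you already have a rectified pair (the rectification itself comes from the calibration described above; the file names and matcher settings are assumptions):

```python
# Disparity estimation on an already-rectified stereo pair.
import cv2
import numpy as np

left = cv2.imread("left_rectified.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right_rectified.png", cv2.IMREAD_GRAYSCALE)

# Semi-global block matching searches along image rows for the best match.
matcher = cv2.StereoSGBM_create(minDisparity=0, numDisparities=64, blockSize=7)
disparity = matcher.compute(left, right).astype(np.float32) / 16.0  # fixed point -> float

# With the Q matrix from cv2.stereoRectify the disparities map back to 3D:
# points_3d = cv2.reprojectImageTo3D(disparity, Q)
```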
You can use Yasutaka Furukawa's "CMVS" or "PMVS2" software to do this. Or, if you want to experiment by yourself, OpenCV is an open-source computer vision toolbox for many of the sub-tasks required here.
This can be done with two webcams, in the same way your eyes work. It is called stereoscopic vision.
Have a look at this:
http://opencv.willowgarage.com/documentation/camera_calibration_and_3d_reconstruction.html
An affordable alternative to get 3D data would be the Kinect camera system.
Maybe not the answer you are hoping for but Microsoft's Kinect is doing that exact thing, there are some open source drivers out there that allow you to connect it to your windows/linux box.
I have a web cam that takes a picture every N seconds. This gives me a collection of images of the same scene over time. I want to process that collection of images as they are created to identify events like someone entering into the frame, or something else large happening. I will be comparing images that are adjacent in time and fixed in space - the same scene at different moments of time.
I want a reasonably sophisticated approach. For example, naive approaches fail for outdoor applications. If you count the number of pixels that change, for example, or the percentage of the picture that has a different color or grayscale value, that will give false positive reports every time the sun goes behind a cloud or the wind shakes a tree.
I want to be able to positively detect a truck parking in the scene, for example, while ignoring lighting changes from sun/cloud transitions, etc.
I've done a number of searches, and found a few survey papers (Radke et al, for example) but nothing that actually gives algorithms that I can put into a program I can write.
Use color spectral analysis, without luminance: when the sun goes down for a while you will get a similar result, because the colors do not change (too much).
Don't look for big changes, but for quick changes. If the luminance of the image changes by -10% over 10 minutes, it's the usual evening effect. But if it changes by -5%, 0, +5% within seconds, it's a quick change.
Don't forget to adjust the reference values.
Split the image into smaller regions. Then, when all the regions change the same way, you know it's a global change, like an eclipse; but if only one region's parameters are changing, then something is happening there (see the sketch at the end of this answer).
Use masks to create smart regions. If you're watching a street, filter out the sky, the trees (blown by the wind), etc. You may set different trigger values for different regions. The regions should overlap.
A special case of a region is a line. A line (a narrow region) contains fewer and more homogeneous pixels than a flat area. Mark, say, a green fence: it's easy to detect whether someone crosses it, because it causes a bigger change in the line than in a flat area.
If you can, change the real world. Repaint the fence in a strange color to create a color spectrum that can be identified more easily. Paint tags on the floor and walls that can be OCRed by the program, so you can detect whether something hides them.
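Here is a rough sketch of the per-region comparison mentioned above; the grid size, thresholds, and file names are assumptions to tune for your scene:

```python
# Per-region change comparison: global vs. local changes.
import cv2
import numpy as np

def region_means(img, rows=6, cols=8):
    h, w = img.shape[:2]
    means = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            means[r, c] = img[r*h//rows:(r+1)*h//rows, c*w//cols:(c+1)*w//cols].mean()
    return means

prev = cv2.imread("frame_000.jpg", cv2.IMREAD_GRAYSCALE)
curr = cv2.imread("frame_001.jpg", cv2.IMREAD_GRAYSCALE)
delta = np.abs(region_means(curr) - region_means(prev))

if (delta > 10).all():
    print("global change (e.g. lighting) - ignore")
elif (delta > 10).any():
    print("local change in regions:", np.argwhere(delta > 10))
```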
I believe you are looking for Template Matching.
Also, I would suggest you look into OpenCV.
We had to contend with many of these issues in our interactive installations. It's tough to not get false positives without being able to control some of your environment (sounds like you will have some degree of control). In the end we looked at combining some techniques and we created an open piece of software named OpenTSPS (Open Toolkit for Sensing People in Spaces - http://www.opentsps.com). You can look at the C++ source in github (https://github.com/labatrockwell/openTSPS/).
We use ‘progressive background relearning’ to adjust to the changing background over time. Progressive relearning is particularly useful in variable lighting conditions – e.g. if lighting in a space changes from day to night. This, in combination with blob detection, works pretty well, and the only way we have found to improve on it is to use 3D cameras like the Kinect, which cast out IR and measure it.
There are other algorithms that might be relevant, like SURF (http://achuwilson.wordpress.com/2011/08/05/object-detection-using-surf-in-opencv-part-1/ and http://en.wikipedia.org/wiki/SURF) but I don't think it will help in your situation unless you know exactly the type of thing you are looking for in the image.
Sounds like a fun project. Best of luck.
The problem you are trying to solve is very interesting indeed!
I think that you would need to attack it in parts:
As you already pointed out, a sudden change in illumination can be problematic. This is an indicator that you probably need to achieve some sort of illumination-invariant representation of the images you are trying to analyze.
There are plenty of techniques around; one I have found very useful for illumination invariance (applied to face recognition) is DoG filtering (Difference of Gaussians).
The idea is that you first convert the image to grayscale. Then you generate two blurred versions of this image by applying a Gaussian filter, one a little more blurry than the other (you could use sigmas of 1.0 and 2.0 respectively). Then you subtract the pixel intensities of the more-blurry image from the less-blurry one. This operation enhances edges and produces a similar image regardless of strong variations in illumination intensity. These steps can be performed very easily using OpenCV (as others have stated). This technique has been applied and documented here.
This paper adds an extra step involving contrast equalization. In my experience this is only needed if you want to obtain "visible" images from the DoG operation (pixel values tend to be very low after the DoG filter and are viewed as black rectangles on screen), and performing a histogram equalization is an acceptable substitute if you want to be able to see the effect of the DoG filter.
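A minimal sketch of that DoG step in OpenCV, assuming the 1.0/2.0 sigmas mentioned above (the file name is a placeholder); the last three lines only make the result visible on screen:

```python
# Difference-of-Gaussians filtering for rough illumination invariance.
import cv2

gray = cv2.imread("frame.jpg", cv2.IMREAD_GRAYSCALE).astype("float32")
blur_small = cv2.GaussianBlur(gray, (0, 0), 1.0)   # less blurry
blur_large = cv2.GaussianBlur(gray, (0, 0), 2.0)   # more blurry
dog = blur_small - blur_large                      # enhances edges, damps illumination

vis = cv2.normalize(dog, None, 0, 255, cv2.NORM_MINMAX).astype("uint8")
vis = cv2.equalizeHist(vis)                        # substitute for contrast equalization
cv2.imwrite("dog_vis.png", vis)
```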
Once you have illumination-invariant images you could focus on the detection part. If your problem can afford a static camera that can be trained for a certain amount of time, then you could use a strategy similar to alarm motion detectors. Most of them work with an average thermal image - basically they record the average temperature of the "pixels" of a room view, and trigger an alarm when the heat signature varies greatly from one "frame" to the next. Here you wouldn't be working with temperatures, but with average, light-normalized pixel values. This would allow you to build up over time which areas of the image tend to have movement (e.g. the leaves of a tree in a windy environment), and which areas are fairly stable. Then you could trigger an alarm when a large number of pixels already flagged as stable show a strong variation from one frame to the next.
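One possible sketch of that "learn which pixels are stable" idea, using running averages over a stream of (already light-normalized) frames; the directory, learning rate, and thresholds are assumptions:

```python
# Keep a slow per-pixel running average and a per-pixel "jitteriness" estimate;
# alarm when many normally-stable pixels change at once.
import glob
import cv2
import numpy as np

avg = None
var = None
for path in sorted(glob.glob("frames/*.jpg")):
    frame = cv2.imread(path, cv2.IMREAD_GRAYSCALE).astype("float32")
    if avg is None:
        avg = frame.copy()
        var = np.zeros_like(frame)
        continue
    diff = np.abs(frame - avg)
    stable = var < 5.0                           # pixels that rarely change
    if stable.any() and (diff[stable] > 25.0).mean() > 0.05:
        print("alarm:", path)                    # many stable pixels changed at once
    cv2.accumulateWeighted(frame, avg, 0.05)     # slow-moving per-pixel average
    cv2.accumulateWeighted(diff, var, 0.05)      # slow-moving per-pixel change
```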
If you can't afford training your camera view, then I would suggest you take a look at the TLD tracker by Zdenek Kalal. His research is focused on object tracking with a single frame as training. You could probably use the semi-static view of the camera (with no foreign objects present) as a starting point for the tracker and flag a detection when the TLD tracker (a grid of points where local motion flow is estimated using the Lucas-Kanade algorithm) fails to track a large number of grid points from one frame to the next. This scenario would probably even allow a panning camera to work, as the algorithm is very resilient to motion disturbances.
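Not the full TLD tracker, but here is a sketch of the grid-of-points / Lucas-Kanade part of that idea (the grid spacing, file names, and failure threshold are made-up values):

```python
# Track a fixed grid of points with Lucas-Kanade and flag frames where many fail.
import cv2
import numpy as np

prev = cv2.imread("frame_000.jpg", cv2.IMREAD_GRAYSCALE)
curr = cv2.imread("frame_001.jpg", cv2.IMREAD_GRAYSCALE)

ys, xs = np.mgrid[20:prev.shape[0]:20, 20:prev.shape[1]:20]
grid = np.float32(np.dstack((xs, ys)).reshape(-1, 1, 2))   # regular grid of points

new_pts, status, err = cv2.calcOpticalFlowPyrLK(prev, curr, grid, None)
lost = 1.0 - float(status.mean())        # fraction of grid points that failed to track
if lost > 0.2:
    print("large change detected")
```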
Hope these pointers are of some help. Good luck and enjoy the journey! =D
Use one of the standard measures, like Mean Squared Error, to find the difference between two consecutive images. If the MSE is beyond a certain threshold, you know that there is some motion.
Also read about Motion Estimation.
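A quick sketch of that MSE check (the file names and threshold are placeholders to tune for your scene):

```python
# Mean Squared Error between consecutive frames as a crude motion check.
import cv2
import numpy as np

def mse(a, b):
    d = a.astype("float32") - b.astype("float32")
    return float(np.mean(d ** 2))

prev = cv2.imread("frame_000.jpg", cv2.IMREAD_GRAYSCALE)
curr = cv2.imread("frame_001.jpg", cv2.IMREAD_GRAYSCALE)
if mse(prev, curr) > 100.0:
    print("motion detected")
```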
If you know that the image will remain relatively static, I would recommend:
1) Look into neural networks. You can use them to learn what defines someone within the image, or what is a non-something in the image.
2) Look into motion detection algorithms; they are used all over the place.
3) Is your camera capable of thermal imaging? If so, it may be worthwhile to look for hotspots in the images. There may be existing algorithms to turn your webcam into a thermal imager.
Is it possible to construct a 3D model of a still object if various images, along with depth data, are gathered from various angles? What I was thinking of is a sort of circular conveyor belt on which a Kinect would be placed, while the real object to be reconstructed in 3D sits in the middle. The conveyor belt then rotates around the object in a circle and lots of images are captured (perhaps 10 images per second), which would allow the Kinect to capture an image from every angle, including the depth data. Theoretically this is possible. The model would also have to be recreated with its textures.
What I would like to know is whether there are any similar projects/software already available; any links would be appreciated.
Whether this is possible within perhaps 6 months.
How would I proceed to do this? For example, is there any similar algorithm you could point me to?
Thanks,
MilindaD
It is definitely possible, and there are a lot of 3D scanners out there which work on more or less the same principle of stereoscopy.
You probably know this, but just to contextualize: the idea is to get two images of the same point (from different viewpoints) and to use triangulation to compute the 3D coordinates of that point in your scene. Although this is quite easy, the big issue is to find the correspondence between the points in your two images, and this is where you need good software to extract and recognize similar points.
There is an open-source project called Meshlab for 3D vision, which includes 3D reconstruction algorithms. I don't know the details of the algorithms, but the software is definitely a good entry point if you want to play with 3D.
I used to know some other ones, I will try to find them and add them here:
Insight3d
Check out https://bitbucket.org/tobin/kinect-point-cloud-demo/overview which is a code sample for the Kinect for Windows SDK that does specifically this. Currently it uses the bitmaps captured by the depth sensor and iterates through the byte array to create a point cloud in PLY format that can be read by MeshLab. The next stage for us is to apply/refine a Delaunay triangulation algorithm to form a mesh instead of points, to which a texture can be applied. A third stage would then be a mesh-merging step to combine multiple captures from the Kinect into a full 3D object mesh.
This is based on some work I did in June using the Kinect for the purposes of 3D printing capture.
The .NET code in this source code repository will, however, get you started with what you want to achieve.
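For comparison, here is a hedged Python sketch of the depth-map-to-PLY step that the sample performs in .NET; the focal length, depth units, and file names are assumptions, not values taken from that repository:

```python
# Back-project a depth image into a point cloud and write a minimal ASCII PLY.
import numpy as np

depth = np.load("kinect_depth.npy")           # H x W depth in millimetres (assumed)
fx = fy = 575.0                               # rough Kinect focal length in pixels
cx, cy = depth.shape[1] / 2.0, depth.shape[0] / 2.0

v, u = np.mgrid[0:depth.shape[0], 0:depth.shape[1]]
z = depth.astype(np.float32)
x = (u - cx) * z / fx                         # back-project each pixel to 3D
y = (v - cy) * z / fy
points = np.dstack((x, y, z)).reshape(-1, 3)
points = points[points[:, 2] > 0]             # drop pixels with no depth reading

with open("cloud.ply", "w") as f:             # minimal ASCII PLY readable by MeshLab
    f.write("ply\nformat ascii 1.0\nelement vertex %d\n" % len(points))
    f.write("property float x\nproperty float y\nproperty float z\nend_header\n")
    np.savetxt(f, points, fmt="%.1f")
```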
Autodesk has a piece of software that will do what you are asking for; it is called "Photofly". It is currently in the Labs section. Using a series of images taken from multiple angles, the 3D geometry is created and then photo-mapped with your images to create the scene.
If you are interested more in the theoretical part of this problem (I mean, if you want to know how it works), here is a document from Microsoft Research about a moving depth camera and 3D reconstruction.
Try out VisualSfM (http://ccwu.me/vsfm/) by Changchang Wu (http://ccwu.me/)
It takes multiple images from different angles of the scene and outputs a 3D point cloud.
The algorithm is called "Structure from Motion".
Brief idea of the algorithm: it involves extracting feature points in each image, finding correspondences between them across images, building feature tracks, and estimating the camera matrices and thereby the 3D coordinates of the feature points.
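A small sketch of the first steps of that pipeline between two views with OpenCV; the file names and intrinsics matrix are made-up assumptions, and a real SfM system like VisualSfM also runs bundle adjustment over many views:

```python
# Feature extraction, matching, and relative pose recovery between two views.
import cv2
import numpy as np

img1 = cv2.imread("view_01.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("view_02.jpg", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create(2000)                         # feature points per image
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des1, des2)
pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

K = np.array([[700.0, 0, 320], [0, 700.0, 240], [0, 0, 1]])  # assumed intrinsics
E, inliers = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC)
_, R, t, _ = cv2.recoverPose(E, pts1, pts2, K)
# Triangulating the inlier matches with these poses gives the 3D point cloud.
```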
How hard would it be to take an image of an object (in this case a predefined object) and develop an algorithm to cut just that object out of a photo with a background of varying complexity?
Further to this, the photo's object (say a house, a car, or a dog - but always of one type) would need to be transformed into a 3D render. I know there are 3D rendering engines available (at a cost, free, or with some clause), but for this to work the object (subject) would need to be measured in all sorts of ways - e.g. if it is a person, we need to measure height, the curvature of the shoulders, the radius of the face, the length of each finger, etc.
What would the feasibility of solving this problem be? Does anyone know any good links specializing in this research area? I've seen open-source solutions to this problem, which leaves me with the question of how easy it is to measure the object while tracing around it to crop it out.
Thanks
Essentially I want to take a 2D image (a typical image, which is easier than a complex photo containing multiple objects, etc.), but effectively I want to turn it into a 3D image. So wouldn't what I want to do involve building a 3D rendering/modelling engine?
Furthermore, that link I have provided goes into 3ds max, with a few properties set, and a render is made.
It sounds like you want to do several things, all in the domain of computer vision.
Object Recognition (i.e. find the predefined object)
3D Reconstruction (make the 3d model from the image)
Image Segmentation (cut out just the object you are worried about from the background)
I've ranked them in order of easiest to hardest (according to my limited understanding). All together I would say it is a very complicated problem. I would look at the following Wikipedia links for more information:
Computer Vision Overview (Wikipedia)
The Eight Point Algorithm (for 3d reconstruction)
Image Segmentation
You're right, this is an extremely hard set of problems, particularly that of inferring 3D information from a 2D image. Only a very limited understanding exists of how our visual system extrapolates 3D information from 2D images; one such approach is known as "Shape from Shading", and the linked Google search shows how much (and consequently how little) we know.
Rob
This is a very difficult task. The hardest part is not recognising or segmenting the object from the image, but rather inferring the 3-D geometry of the object from the 2-D image. You will have more success if you can use a stereoscopic camera (or a laser scanner, if you have access to one ;).
For the case of 2-D images, try googling for "shape-from-shading". This is a method for inferring 3-D shape from a 2-D image. It does make assumptions about illumination conditions and surface properties (BRDF and geometry) that may fail in many cases, but if you are using it for only a predefined class of objects (e.g. human faces) it can work reasonably well.
Assuming it's possible, that would be extremely difficult, especially with only one image of the object. The rasterizer has to guess at the depth and distances of objects.
What you describe sounds very similar to Microsoft PhotoSynth.