Java implementation of FastDTW for gesture recognition - algorithm

I have an android application which is getting gesture coordinates (3 axis - x,y,z) from an external source. I need to compare them with coordinates which I have in my DB and determine whether gesture is the same/similar or not.
I also need to add some tolerance, since accelerometer (device which captures gestures) is very sensitive. It would be easy, but I also want to consider e.g. "big circle" drawn in the air, same as "small circle" drawn in the air. meaning that there would be different values, but structure of the graph would be the same, right?
I did some research and I found out I could use Dynamic time wraping possibly in combination with Hidden Markov model.
I tried to do some further reading on it, but I didn't find much. I just found a Java library called FastDTW but can't figure out which methods are the right ones for my problem.
Would you please provide me any lead on this?
See these images, they are the same gesture but one is done in a slower motion than another.
Faster one:
Slower one:
I haven't captured images of the same gesture where one would be smaller than another, might add that later.

Related

Remove background and get deer as a fore ground?

I want to remove background and get deer as a foreground image.
This is my source image captured by trail camera:
This is what I want to get. This output image can be a binary image or RGB.
I worked on it and try many methods to get solution but every time it failed at specific point. So please first understand what is my exact problem.
Image are captured by a trail camera and camera is motion detector. when deer come in front of camera it capture image.
Scene mode change with respect to weather changing or day and night etc. So I can't use frame difference or some thing like this.
Segmentation may be not work correctly because Foreground (deer) and Background have same color in many cases.
If anyone still have any ambiguity in my question then please first ask me to clear and then answer, it will be appreciated.
Thanks in advance.
Here's what I would do:
As was commented to your question, you can detect the dear and then perform grabcut to segment it from the picture.
To detect the dear, I would couple a classifier with a sliding window approach. That would mean that you'll have a classifier that given a patch (can be a large patch) in the image, output's a score of how much that patch is similar to a dear. The sliding window approach means that you loop on the window size and then loop on the window location. For each position of the window in the image, you should apply the classifier on that window and get a score of how much that window "looks like" a dear. Once you've done that, threshold all the scores to get the "best windows", i.e. the windows that are most similar to a dear. The rational behind this is that if we a dear is present at some location in the image, the classifier will output a high score at all windows that are close/overlap with the actual dear location. We would like to merge all that locations to a single location. That can be done by applying the functions groupRectangles from OpenCV:
http://docs.opencv.org/modules/objdetect/doc/cascade_classification.html#grouprectangles
Take a look at some face detection example from OpenCV, it basically does the same (sliding window + classifier) where the classifier is a Haar cascade.
Now, I didn't mention what that "dear classifier" can be. You can use HOG+SVM (which are both included in OpenCV) or use a much powerful approach of running a deep convulutional neural network (deep CNN). Luckily, you don't need to train a deep CNN. You can use the following packages with their "off the shelf" ImageNet networks (which are very powerful and might even be able to identify a dear without further training):
Decaf- which can be used only for research purposes:
https://github.com/UCB-ICSI-Vision-Group/decaf-release/
Or Caffe - which is BSD licensed:
http://caffe.berkeleyvision.org/
There are other packages of which you can read about here:
http://deeplearning.net/software_links/
The most common ones are Theano, Cuda ConvNet's and OverFeat (but that's really opinion based, you should chose the best package from the list that I linked to).
The "off the shelf" ImageNet network were trained on roughly 10M images from 1000 categories. If those categories contain "dear", that you can just use them as is. If not, you can use them to extract features (as a 4096 dimensional vector in the case of Decaf) and train a classifier on positive and negative images to build a "dear classifier".
Now, once you detected the dear, meaning you have a bounding box around it, you can apply grabcut:
http://docs.opencv.org/trunk/doc/py_tutorials/py_imgproc/py_grabcut/py_grabcut.html
You'll need an initial scribble on the dear to perform grabcu. You can just take a horizontal line in the middle of the bounding box and hope that it will be on the dear's torso. More elaborate approaches would be to find the symmetry axis of the dear and use that as a scribble, but you would have to google, research an implement some method to extract symmetry axis from the image.
That's about it. Not straightforward, but so is the problem.
Please let me know if you have any questions.
Try OpenCV Background Substraction with Mixture of Gaussians models. They should be adaptable enough for your scenes. Of course, the final performance will depend on the scenario, but it is worth trying.
Since you just want to separate the background from the foreground I think you do not need to recognize the deer. You need to recognize an object in motion in the scene. You just need to separate what is static in a significant interval of time (background) from what is not static: the deer.
There are algorithms that combine multiple frames from the same scene in order to determine the background, like THIS ONE.
You mentioned that the scene mode changes with respect to weather changing or day and night considering photos of different deers.
You could implement a solution when motion is detected, instead of taking a single photo, it could take a few ones with some interval of time.
This interval has to be long as to get the deer in different positions or out of the scene and at the same time short enough to not be much affected by scene variations. Perhaps you need to deal with some brightness variation, but I think it is feasible to determine the background using these frames and finally segment the deer in the "motion frame".

How do you make and recognize gestures?

I am using a web cam to get video feed and then performing motion tracking on this video feed. The motion tracker returns (x,y) co-ordinates continuously.
I want to use these (x,y) to recognize gestures such as "swipe left", "swipe right", "swipe up" or "swipe down".
How do i make and store templates of these gestures and how do i figure out/recognize if one
of the gestures has happened ?
Thank you in advance :)
PS: I am using Flex 4 and ActionScript 3.0. If someone could help me out with the logic, i can write it in ActionScript.
An approach I could think of working would be to have a series of (X,Y) coordinates representing points along the gesture. On a small scale if a gesture that passed through your screen was graphed as such:
|1|1|1|
|1|0|0|
|1|0|0|
and represented (from the upper left corner representing 0,0):
(0,2)(0,1)(0,0)(0,1)(0,2)
Break the x,y coordinates up into individual 2d arrays with total distance traveled between the current coordinate and the first point (in all cases in this example it would increment by 1) so you would have two arrays:
X:(0,0)(1,1)(2,2)
Y:(0,1)(1,1)(2,2)
Now do a least square fit on each array to find the closest representation of the change in x and change in y as quadratic functions. Do the same to your per-deteremined gestures and then plug in the x,y coordinates of your per-determined gestures into to the user's gesture's quadratic function and the per-determined gestures you designed and see which one it matches the closest. This is your gesture.
(I've never tried processing gestures, but I don't see why this wouldn't work)
You should divide your task into smaller subtasks. In computer vision there is no thing like a generic gesture detection that works out of the box in all environments.
First of all, you need to be able to detect motion at all. There are several ways to do this, e.g. background subtraction or blob tracking.
Then you need to extract certain features from your image, e.g. a hand. Again, there is more than one way to do this. Starting from skin color approximation/evaluation, which is very error prone to different lighting conditions, to more advanced techniques which really try to "analyze" the shape of an object. Those algorithms "learn" over time how a hand should look like.
I can only recommend to buy a decent book about computer vision and to research the web for articles etc. There are also libraries like OpenCV you can use for learning more about the implementation side. There should be several ports of OpenCV to ActionScript 3. I also can recommend the articles and tools from Eugene Zatepyakin (http://blog.inspirit.ru). He's doing great CV stuff with ActionScript 3.
Long story short, you should research motion tracking and feature extraction.
The best place to start is to read about how sign language recognition or trackpad input works, such as creating reference images and comparing them to user input. Specific to Adobe, there's the FLARToolKit, which is detailed in an augmented reality article on their website.
References:
Trackpad Science
Hand Gesture Recognition
Sign Language Recognition Research - PDF
Gesture Recognition Walkthrough - Video

Object detection + segmentation

I 'm trying to find an efficient way of acceptable complexity to
detect an object in an image so I can isolate it from its surroundings
segment that object to its sub-parts and label them so I can then fetch them at will
It's been 3 weeks since I entered the image processing world and I've read about so many algorithms (sift, snakes, more snakes, fourier-related, etc.), and heuristics that I don't know where to start and which one is "best" for what I'm trying to achieve. Having in mind that the image dataset in interest is a pretty large one, I don't even know if I should use some algorithm implemented in OpenCV or if I should implement one my own.
Summarize:
Which methodology should I focus on? Why?
Should I use OpenCV for that kind of stuff or is there some other 'better' alternative?
Thank you in advance.
EDIT -- More info regarding the datasets
Each dataset consists of 80K images of products sharing the same
concept e.g. t-shirts, watches, shoes
size
orientation (90% of them)
background (95% of them)
All pictures in each datasets look almost identical apart from the product itself, apparently. To make things a little more clear, let's consider only the 'watch dataset':
All the pictures in the set look almost exactly like this:
(again, apart form the watch itself). I want to extract the strap and the dial. The thing is that there are lots of different watch styles and therefore shapes. From what I've read so far, I think I need a template algorithm that allows bending and stretching so as to be able to match straps and dials of different styles.
Instead of creating three distinct templates (upper part of strap, lower part of strap, dial), it would be reasonable to create only one and segment it into 3 parts. That way, I would be confident enough that each part was detected with respect to each other as intended to e.g. the dial would not be detected below the lower part of the strap.
From all the algorithms/methodologies I've encountered, active shape|appearance model seem to be the most promising ones. Unfortunately, I haven't managed to find a descent implementation and I'm not confident enough that that's the best approach so as to go ahead and write one myself.
If anyone could point out what I should be really looking for (algorithm/heuristic/library/etc.), I would be more than grateful. If again you think my description was a bit vague, feel free to ask for a more detailed one.
From what you've said, here are a few things that pop up at first glance:
Simplest thing to do it binarize the image and do Connected Components using OpenCV or CvBlob library. For simple images with non-complex background this usually yeilds objects
HOwever, looking at your sample image, texture-based segmentation techniques may work better - the watch dial, the straps and the background are wisely variant in texture/roughness, and this could be an ideal way to separate them.
The roughness of a portion can be easily found by the Eigen transform (explained a bit on SO, check the link to the research paper provided there), then the Mean Shift filter can be applied on the output of the Eigen transform. This will give regions clearly separated according to texture. Both the pyramidal Mean Shift and finding eigenvalues by SVD are implemented in OpenCV, so unless you can optimize your own code its better (and easier) to use inbuilt functions (if present) as far as speed and efficiency is concerned.
I think I would turn the problem around. Instead of hunting for the dial, I would use a set of robust features from the watch to 'stitch' the target image onto a template. The first watch has a set of squares in the dial that are white, the second watch has a number of white circles. I would per type of watch:
Segment out the squares or circles in the dial. Segmentation steps can be tricky as they are usually both scale and light dependent
Estimate the centers or corners of the above found feature areas. These are the new feature points.
Use the Hungarian algorithm to match features between the template watch and the target watch. Alternatively, one can take the surroundings of each feature point in the original image and match these using cross correlation
Use matching features between the template and the target to estimate scaling, rotation and translation
Stitch the image
As the image is now in a known form, one can extract the regions simply via pre set coordinates

Differentiate objects?

i want to identify a ball in the picture. I am thiking of using sobel edge detection algorithm,with this i can detect the round objects in the image.
But how do i differentiate between different objects. For example, a foot ball is there in one picture and in another picture i have a picture of moon.. how to differentiate what object has been detected.
When i use my algorithm i get ball in both the cases. Any ideas?
Well if all the objects you would like to differentiate are round, you could even use a hough transformation for round objects. This is a very good way of distinguishing round objects.
But your basic problem seems to be classification - sorting the objects on your image into different classes.
For this you don't really need a Neural Network, you could simply try with a Nearest Neighbor match. It's functionalities are a bit like neural networks since you can give it several reference pictures where you tell the system what can be seen there and it will optimize itself to the best average values for each attribute you detected. By this you get a dictionary of clusters for the different types of objects.
But for this you'll of course first need something that distinguishes a ball from a moon.
Since they are all real round objects (which appear as circles) it will be useless to compare for circularity, circumference, diameter or area (only if your camera is steady and if you know a moon will always have the same size on your images, other than a ball).
So basically you need to look inside the objects itself and you can try to compare their mean color value or grayscale value or the contrast inside the object (the moon will mostly have mid-gray values whereas a soccer ball consists of black and white parts)
You could also run edge filters on the segmented objects just to determine which is more "edgy" in its texture. But for this there are better methods I guess...
So basically what you need to do first:
Find several attributes that help you distinguish the different round objects (assuming they are already separated)
Implement something to get these values out of a picture of a round object (which is already segmented of course, so it has a background of 0)
Build a system that you feed several images and their class to have a supervised learning system and feed it several images of each type (there are many implementations of that online)
Now you have your system running and can give other objects to it to classify.
For this you need to segment the objects in the image, by i.e Edge filters or a Hough Transformation
For each of the segmented objects in an image, let it run through your classification system and it should tell you which class (type of object) it belongs to...
Hope that helps... if not, please keep asking...
When you apply an edge detection algorithm you lose information.
Thus the moon and the ball are the same.
The moon has a diiferent color, a different texture, ... you can use these informations to differnentiate what object has been detected.
That's a question in AI.
If you think about it, the reason you know it's a ball and not a moon, is because you've seen a lot of balls and moons in your life.
So, you need to teach the program what a ball is, and what a moon is. Give it some kind of dictionary or something.
The problem with a dictionary of course would be that to match the object with all the objects in the dictionary would take time.
So the best solution would probably using Neural networks. I don't know what programming language you're using, but there are Neural network implementations to most languages i've encountered.
You'll have to read a bit about it, decide what kind of neural network, and its architecture.
After you have it implemented it gets easy. You just give it a lot of pictures to learn (neural networks get a vector as input, so you can give it the whole picture).
For each picture you give it, you tell it what it is. So you give it like 20 different moon pictures, 20 different ball pictures. After that you tell it to learn (built in function usually).
The neural network will go over the data you gave it, and learn how to differentiate the 2 objects.
Later you can use that network you taught, give it a picture, and it a mark of what it thinks it is, like 30% ball, 85% moon.
This has been discussed before. Have a look at this question. More info here and here.

What methods/algorithms are used for gesture recognition in a multi-touch environment?

In a multi-touch environment, how does gesture recognition work? What mathematical methods or algorithms are utilized to recognize or reject data for possible gestures?
I've created some retro-reflective gloves and an IR LED array, coupled with a Wii remote. The Wii remote does internal blob detection and tracks 4 points of IR light and transmits this information to my computer via a bluetooth dongle.
This is based off Johnny Chung Lee's Wii Research. My precise setup is exactly like the graduate students from the Netherlands displayed here. I can easily track 4 point's positions in 2d space and I've written my basic software to receive and visualize these points.
The Netherlands students have gotten a lot of functionality out of their basic pinch-click recognition. I'd like to take it a step further if I could, and implement some other gestures.
How is gesture recognition usually implemented? Beyond anything trivial, how could I write software to recognize and identify a variety of gestures: various swipes, circular movements, letter tracing, etc.
Gesture recognition, as I've seen it anyway, is usually implemented using machine learning techniques similar to image recognition software. Here's a cool project on codeproject about doing mouse gesture recognition in c#. I'm sure the concepts are quite similar since you can likely reduce the problem down to 2D space. If you get something working with this, I'd love to see it. Great project idea!
One way to look at it is as a compression / recognition problem. Basically, you want to take a whole bunch of data, throw out most of it, and categorize the remainder. If I were doing this (from scratch) I'd probably proceed as follows:
work with a rolling history window
take the center of gravity of the four points in the start frame, save it, and subtract it out of all the positions in all frames.
factor each frame into two components: the shape of the constellation and the movement of it's CofG relative to the last frame's.
save the absolute CofG for the last frame too
the series of CofG changes gives you swipes, waves, etc.
the series of constellation morphing gives you pinches, etc.
After seeing your photo (two points on each hand, not four points on one, doh!) I'd modify the above as follows:
Do the CofG calculation on pairs, with the caveats that:
If there are four points visible, pairs are chosen to minimize the product of the intrapair distances
If there are three points visible, the closest two are one pair, the other one is the other
Use prior / following frames to override when needed
Instead of a constellation, you've got a nested structure of distance / orientation pairs (i.e., one D/O between the hands, and one more for each hand).
Pass the full reduced data to recognizers for each gesture, and let them sort out what they care about.
If you want to get cute, do a little DSL to recognize the patterns, and write things like:
fire when
in frame.final: rectangle(points)
and
over frames.final(5): points.all (p => p.jerk)
or
fire when
over frames.final(3): hands.all (h => h.click)
A video of what has been done with this sort of technology, if anyone is interested?
Pattie Maes demos the Sixth Sense - TED 2009
Most simple gesture-recognition tools I've looked at use a vector-based template to recognize them. For example, you can define right-swipe as "0", a checkmark as "-45, 45, 45", a clockwise circle as "0, -45, -90, -135, 180, 135, 90, 45, 0", and so on.
Err.. I've been working on gesture recognition for the past year or so now, but I don't want to say too much because I'm trying to patent my technology :) But... we've had some luck with adaptive boosting, although what you're doing looks fundamentally different. You only have 4 points of data to process, so I don't think you really need to "reduce" anything.
What I would investigate is how programs like Flash turn a freehand drawn circle into an actual circle. It seems like you could track the points for duration of about a second, and then "smooth" the path in some fashion, and then you could probably get away with hardcoding your gestures (if you make them simple enough). Otherwise, yes, you're going to want to use a learning algorithm. Neural nets might work... I don't know. Just tossing out ideas :) Maybe look at how OCR is done too... or even Hough transforms. It looks to me like this is a problem of recognizing shapes more than it is of recognizing gestures.
I'm not very well versed in this type of mathematics, but I have read somewhere that people sometimes use Markov Chains or Hidden Markov Models to do Gesture Recognition.
Perhaps someone with a little more background in this side of Computer Science can illuminate it further and provide some more details.

Resources