HMM algorithm for gesture recognition - algorithm

I want to develop an app for gesture recognition using Kinect and hidden Markov models. I watched a tutorial here: HMM lecture
But I don't know how to start. What is the state set and how to normalize the data to be able to realize HMM learning? I know (more or less) how it should be done for signals and for simple "left-to-right" cases, but 3D space makes me a little confused. Could anyone describe how it should be begun?
Could anyone describe the steps, how to do this? Especially I need to know how to do the model and what should be the steps of HMM algorithm.

One set of methods for applying HMMs to gesture recognition would be to apply a similar architecture as commonly used for speech recognition.
The HMM would not be over space but over time, and each video frame (or set of extracted features from the frame) would be an emission from an HMM state.
Unfortunately, HMM-based speech recognition is a rather large area. Many books and theses have been written describing different architectures. I recommend starting with Jelinek's "Statistical Methods for Speech Recognition" (http://books.google.ca/books?id=1C9dzcJTWowC&pg=PR5#v=onepage&q&f=false) then following the references from there. Another resource is the CMU sphinx webpage (http://cmusphinx.sourceforge.net).
Another thing to keep in mind is that HMM-based systems are probably less accurate than discriminative approaches like conditional random fields or max-margin recognizers (e.g. SVM-struct).
For an HMM-based recognizer the overall training process is usually something like the following:
1) Perform some sort of signal processing on the raw data
For speech this would involve converting raw audio into mel-cepstrum format, while for gestures, this might involve extracting image features (SIFT, GIST, etc.)
2) Apply vector quantization (VQ) (other dimensionality reduction techniques can also be used) to the processed data
Each cluster centroid is usually associated with a basic unit of the task. In speech recognition, for instance, each centroid could be associated with a phoneme. For a gesture recognition task, each VQ centroid could be associated with a pose or hand configuration.
3) Manually construct HMMs whose state transitions capture the sequence of different poses within a gesture.
Emission distributions of these HMM states will be centered on the VQ vector from step 2.
In speech recognition these HMMs are built from phoneme dictionaries that give the sequence of phonemes for each word.
4) Construct an single HMM that contains transitions between each individual gesture HMM (or in the case of speech recognition, each phoneme HMM). Then, train the composite HMM with videos of gestures.
It is also possible at this point to train each gesture HMM individually before the joint training step. This additional training step may result in better recognizers.
For the recognition process, apply the signal processing step, find the nearest VQ entry for each frame, then find a high scoring path through the HMM (either the Viterbi path, or one of a set of paths from an A* search) given the quantized vectors. This path gives the predicted gestures in the video.

I implemented the 2d version of this for the Coursera PGM class, which has kinect gestures as the final unit.
https://www.coursera.org/course/pgm
Basically, the idea is that you can't use HMM to actually decide poses very well. In our unit, I used some variation of K-means to segment the poses into probabilistic categories. The HMM was used to actually decide what sequences of poses were actually viable as gestures. But any clustering algorithm run on a set of poses is a good candidate- even if you don't know what kind of pose they are or something similar.
From there you can create a model which trains on the aggregate probabilities of each possible pose for each point of kinect data.
I know this is a bit of a sparse interview. That class gives an excellent overview of the state of the art but the problem in general is a bit too difficult to be condensed into an easy answer. (I'd recommend taking it in april if you're interested in this field)

Related

Match Sketch(Drawing) face photo to digital color photo

I'm going to match the sketch face (drawing photo) in to the color photo. so for the research i want to find out what are the challenges that matching sketch drawing in to color faces. for now i have find out that
resolution pixel difference
texture difference
distance difference
and color (not much effect)
I want to know (in technical terms) what are other challenges and what are available OPEN CV and JAVA CV method and algorithms to overcome that challenges?
Here is some example of the sketches and the photos that are known to match them:
This problem is called multi-modal face recognition. There has been a lot of interest in comparing a high quality mugshot (modality 1) to low quality surveillance images (modality 2), another is frontal images to profiles, or pictures to sketches like the OP is interested in. Partial Least Squares (PLS) and Tied Factor Analysis (TFA) have been used for this purpose.
A key difficulty is computing two linear projections from the image in modality 1 (and modality 2) to a space where two points being close means that the individual is the same. This is the key technical step. Here are some papers on this approach:
Abhishek Sharma, David W Jacobs : Bypassing Synthesis: PLS for
Face Recognition with Pose, Low-Resolution and Sketch. CVPR
2011.
S.J.D. Prince, J.H. Elder, J. Warrell, F.M. Felisberti, Tied Factor
Analysis for Face Recognition across Large Pose Differences, IEEE
Patt. Anal. Mach. Intell, 30(6), 970-984, 2008. Elder is a specialist in this area and has a variety of papers on the topic.
B. Klare, Z. Li and A. K. Jain, Matching forensic sketches to
mugshot photos, IEEE Pattern Analysis and Machine Intelligence, 29
Sept. 2010.
As you can understand this is an active research area/problem. In terms using OpenCV to overcome the difficulties, let me give you an analogy: you need to build build a house (match sketches to photos) and you're asking how will having a Stanley hammer (OpenCV) will help. Sure, it will probably help. But you'll also need a lot of other resources: wood, time/money, pipes, cable, etc.
I think that James Elder's old work on the completeness of the edge map (using reconstruction by solving the Laplace equation) is quite relevant here. See the results at the end of this paper: http://elderlab.yorku.ca/~elder/publications/journals/ElderIJCV99.pdf
You could give Eigenfaces a try, though i never tested them with sketches i think they could a least be a good starting point for your research.
See Wiki: http://en.wikipedia.org/wiki/Eigenface and the Tutorial for OpenCV: http://docs.opencv.org/modules/contrib/doc/facerec/facerec_tutorial.html (including not only Eigenfaces!)
OpenCV can be used for feature extraction and machine learning required for this task. I guess you can start with the papers in the answers above, start with some basic features and prototype a classifier with OpenCV.
I guess you might also want to detect and match feature points on the faces. If you use this approach, you will have to do the feature point detectors on your own (training the Viola-Jones detector in OpenCV with your own data is an option).

image registration(non-rigid \ nonlinear)

I'm looking for some algorithm (preferably if source code available)
for image registration.
Image deformation can't been described by homography matrix(because I think that distortion not symmetrical and not
homogeneous),more specifically deformations are like barrel/distortion and trapezoid distortion, maybe some rotation of image.
I want to obtain pairs of pixel of two images and so i can obtain representation of "deformation field".
I google a lot and find out that there are some algorithm base on some phisics ideas, but it seems that they can converge
to local maxima, but not global.
I can affort program to be semi-automatic, it means some simple user interation.
maybe some algorithms like SIFT will be appropriate?
but I think it can't provide "deformation field" with regular sufficient density.
if it important there is no scale changes.
example of complicated field
http://www.math.ucla.edu/~yanovsky/Research/ImageRegistration/2DMRI/2DMRI_lambda400_grid_only1.png
What you are looking for is "optical flow". Searching for these terms will yield you numerous results.
In OpenCV, there is a function called calcOpticalFlowFarneback() (in the video module) that does what you want.
The C API does still have an implementation of the classic paper by Horn & Schunck (1981) called "Determining optical flow".
You can also have a look at this work I've done, along with some code (but be careful, there are still some mysterious bugs in the opencl memory code. I will release a corrected version later this year.): http://lts2www.epfl.ch/people/dangelo/opticalflow
Besides OpenCV's optical flow (and mine ;-), you can have a look at ITK on itk.org for complete image registration chains (mostly aimed at medical imaging).
There's also a lot of optical flow code (matlab, C/C++...) that can be found thanks to google, for example cs.brown.edu/~dqsun/research/software.html, gpu4vision, etc
-- EDIT : about optical flow --
Optical flow is divided in two families of algorithms : the dense ones, and the others.
Dense algorithms give one motion vector per pixel, non-dense ones one vector per tracked feature.
Examples of the dense family include Horn-Schunck and Farneback (to stay with opencv), and more generally any algorithm that will minimize some cost function over the whole images (the various TV-L1 flows, etc).
An example for the non-dense family is the KLT, which is called Lucas-Kanade in opencv.
In the dense family, since the motion for each pixel is almost free, it can deal with scale changes. Keep in mind however that these algorithms can fail in the case of large motions / scales changes because they usually rely on linearizations (Taylor expansions of the motion and image changes). Furthermore, in the variational approach, each pixel contributes to the overall result. Hence, parts that are invisible in one image are likely to deviate the algorithm from the actual solution.
Anyway, techniques such as coarse-to-fine implementations are employed to bypass these limits, and these problems have usually only a small impact. Brutal illumination changes, or large occluded / unoccluded areas can also be explicitly dealt with by some algorithms, see for example this paper that computes a sparse image of "innovation" alongside the optical flow field.
i found some software medical specific, but it's complicate and it's not work with simple image formats, but seems that it do that I need.
http://www.csd.uoc.gr/~komod/FastPD/index.html
Drop - Deformable Registration using Discrete Optimization

Obtaining motion vectors from raw video

I'd like to know if there is any good (and freely available) text, on how to obtain motion vectors of macro blocks in raw video stream. This is often used in video compression, although my application of it is not video encoding.
Code that does this is available in OSS codecs, but understanding the method by reading the code is kinda hard.
My actual goal is to determine camera motion in 2D projection space, assuming the camera is only changing it's orientation (NOT the position). What I'd like to do is divide the frames into macro blocks, obtain their motion vectors, and get the camera motion by averaging those vectors.
I guess OpenCV could help with this problem, but it's not available on my target platform.
The usual way is simple brute force: Compare a macro block to each macro block from the reference frame and use the one that gives the smallest residual error. The code gets complex primarily because this is usually the slowest part of mv-based compression, so they put a lot of work into optimizing it, often at the expense of anything even approaching readability.
Especially for real-time compression, some reduce the workload a bit by (for example) restricting the search to the original position +/- some maximum delta. This can often gain quite a bit of compression speed in exchange for a fairly small loss of compression.
If you assume only camera motion, I suspect there is something possible with analysis of the FFT of successive images. For frequencies whose amplitudes have not changed much, the phase information will indicate the camera motion. Not sure if this will help with camera rotation, but lateral and vertical motion can probably be computed. There will be difficulties due to new information appearing on one edge and disappearing on the other and I'm not sure how much that will hurt. This is speculative thinking in response to your question, so I have no proof or references :-)
Sounds like you're doing a very limited SLAM project?
Lots of reading matter at Bristol University, Imperial College, Oxford University for example - you might find their approaches to finding and matching candidate features from frame to frame of interest - much more robust than simple sums of absolute differences.
For the most low-level algorithms of this type the term you are looking for is optical flow and one of the easiest algorithms of that class is the Lucas Kanade algorithm.
This is a pretty good overview presentation that should give you plenty of ideas for an algorithm that does what you need

Face identification with opencv

I'm using the libraries OpenCV for image processing in C + + and this is my question: can you think possible to do a facial recognition (saying the name of a person based on a database of photos) by comparing the frame of videocamera with images in a database using the technique of image histograms comparison? (Note that i compare only the facial region of an image using an example included in the opecv libraries).
I'm asking this because i've just tried to do a program like above but i have a lot of problem (often i detect the wrong person)
You might want to start with compiling the Face Detection using OpenCV example. As others have pointed out, general facial recognition isn't exactly an easy problem to solve. EigenFaces is one common technique for face recognition that is fairly easy to understand and implement.
As others have stated, it's a hard problem, but this gives you a place to start.
Some method I had experience with them are
metric learning for comparing faces
naming video characters: they use SIFT descriptors computed at specific feducial points on each face. Their code worked quite well for me in the past.
A dataset and benchmark that is dedicated for this task is labeled faces in the wild. You can find there references to working methods for comparing faces after detection.
UPDATE:
I have a description of an experiment on face clustering: unsupervised face identification.
The experiment is described in Section 4.4 of my thesis.
The basic flow is as follows
Metric learning: how to determine if two faces are of the same person or not.
This part is supervised, in the sense that it requires as input face images labeled with the identity of the person who appears in each photo.
a. Detect fiducial points (eyes, corner of mouth, nose).
You may use this code, or more recent versions such as this one.
b. Extract SIFT descriptors at the detected fiducial points.
c. Construct a "face descriptor": each face is described using a single vector.
This vector is a concatenation of the sqrt of all the SIFT descriptors.
d. Use the method described here to learn a mahalanobis distance between faces of different persons.
Unsupervised face identification: Once a metric was learned, you may use new photos of new people (these people need not be part of the training set, you may use photos of unseen-before people!).
a. Repeat stages a-c to construct the same "face descriptor" vector for each input face.
b. Compare the descriptor vectors using the learned mahalanobis distance.
I suggest using an existing algorithm such as the one available in the Luxand FaceSDK: http://www.luxand.com/facesdk/ rather than trying to develop your own.
there are 3 builtin techniques for face-recognition in opencv now, pca(eigenfaces), lda(fisherfaces) and lbph.
nice example code:
https://github.com/Itseez/opencv/blob/master/samples/cpp/facerec_demo.cpp

What are the algorithms used behind filters in image editing softwares?

For example: What algorithm is used to generate the image by the fresco filter in Adobe Photoshop?
Do you know some place where I can read about the algorithms implemented in these filters?
Lode's Computer Graphics Tutorial
The source code for GIMP would be a good place to start. If the code for some filter doesn't make sense, at least you'll find jargon in the code and comments that can be googled.
The Photoshop algorithms can get very complex, and beyond simple blurring and sharpening, each one is a topic unto itself.
For the fresco filter, you might want to start with an SO question on how to cartoon-ify and image.
I'd love to read a collection of the more interesting algorithms, but I don't know of such a compilation.
Digital image processing is the use of computer algorithms to perform image processing on digital images. As a subcategory or field of digital signal processing, digital image processing has many advantages over analog image processing. It allows a much wider range of algorithms to be applied to the input data and can avoid problems such as the build-up of noise and signal distortion during processing. Since images are defined over two dimensions (perhaps more) digital image processing may be modeled in the form of multidimensional systems.
Digital image processing allows the use of much more complex algorithms, and hence, can offer both more sophisticated performance at simple tasks, and the implementation of methods which would be impossible by analog means.
In particular, digital image processing is the only practical technology for:
Classification
Feature extraction
Pattern recognition
Projection
Multi-scale signal analysis
Some techniques which are used in digital image processing include:
Pixelation,
Linear filtering,
Principal components analysis
Independent component analysis
Hidden Markov models
Anisotropic diffusion
Partial differential equations
Self-organizing maps
Neural networks
Wavelets

Resources