Reading a paper, I'm having difficulty understanding the algorithm described:
Given a black and white digital image of a handwriting sample, cut out a single character to analyze. Since this can be any size, the algorithm needs to take this into account (if it makes things easier, we can assume the size is 2^n x 2^m).
Now, the description states that given this image we will convert it to a 512-bit feature vector (a 512-bit hash) as follows:
(192 bits) The gradient feature generator computes the gradient of the image by convolving it with a 3x3 Sobel operator. The direction of the gradient at every edge is quantized to 12 directions.
(192 bits) The structural feature generator takes the gradient map and looks in a neighborhood for certain combinations of gradient values (these are used to compute 8 distinct features that represent lines and corners in the image).
(128 bits) The concavity feature generator uses an 8-point star operator to find coarse concavities in 4 directions, holes, and large-scale strokes.
The image feature maps are normalized with a 4x4 grid.
For now, I'm struggling with how to take an arbitrary image, split it into 16 sections, and use a 3x3 Sobel operator to come up with 12 bits for each section. (But if you have some insight into the other parts, feel free to comment.)
I'm struggling with the same paper by Srihari et al. (2002) for my Ph.D. thesis. I would say the text is not very specific, but the authors refer to a technical report (CEDAR-TR-01-1) for more details. This report does not seem to be accessible on the internet, so my suggestion is to contact the authors by e-mail and ask for it. If something is still not clear, you could ask them for clarification as well.
I see the question is very old, but maybe this can help someone: you apply Sobel operators for horizontal and vertical edge detection to the image. From the result you can calculate a gradient direction vector for every point of the image. In your case you need to map these vectors to 12 directions. Then you divide the image into 4x4 subimages (16 sections) and calculate the intensity of each direction in each section. This gives you 12 * 16 = 192 features. I can give a more detailed explanation if needed.
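For example, here is a rough sketch of how I would implement that pipeline with OpenCV and NumPy. The edge threshold and the rule for turning each direction histogram into bits are my own guesses, since the paper does not spell them out:

```python
import cv2
import numpy as np

def gradient_features(img, grid=4, n_dirs=12):
    """Sobel gradients, quantized to n_dirs directions, histogrammed
    over a grid x grid division of the (grayscale) image."""
    img = img.astype(np.float64)

    # Horizontal and vertical 3x3 Sobel responses.
    gx = cv2.Sobel(img, cv2.CV_64F, 1, 0, ksize=3)
    gy = cv2.Sobel(img, cv2.CV_64F, 0, 1, ksize=3)

    magnitude = np.hypot(gx, gy)
    angle = np.arctan2(gy, gx)                     # in (-pi, pi]

    # Quantize every gradient direction to one of n_dirs bins.
    bins = np.floor((angle + np.pi) / (2 * np.pi) * n_dirs).astype(int) % n_dirs

    # Keep only pixels with a significant gradient (edge pixels);
    # the 0.1 threshold is a guess, the technical report may specify one.
    edge_mask = magnitude > 0.1 * magnitude.max()

    h, w = img.shape
    features = []
    for i in range(grid):
        for j in range(grid):
            cell = np.s_[i * h // grid:(i + 1) * h // grid,
                         j * w // grid:(j + 1) * w // grid]
            cell_bins = bins[cell][edge_mask[cell]]
            hist = np.bincount(cell_bins, minlength=n_dirs)
            # One bit per direction: set it if that direction occurs
            # often enough in the cell (again, the rule is a guess).
            features.extend((hist > 0.1 * max(1, hist.sum())).astype(int))
    return np.array(features)                      # 4 * 4 * 12 = 192 bits
```

Dividing into a 4x4 grid regardless of the input size is what makes the feature length independent of the image dimensions.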
I have read two references about the SIFT algorithm here and here and I am not really understanding how only some key points are detected, considering that the algorithm works on differences of Gaussians calculated at several resolutions (they call them octaves). Here are the steps of the technique according to what I've understood from the paper.
Given the input image, blur it with Gaussian filters using different sigmas, resulting in a set of Gaussian-filtered images. In the paper, they use 5 Gaussian filters per octave (they state that two adjacent Gaussian-filtered images are filtered using sigma and k * sigma as the Gaussian filter parameters), and they consider 4 octaves in the algorithm. So there are a total of 20 Gaussian-filtered images (5 per octave), but we will act on the 5 Gaussian-filtered images from each octave individually.
For each octave, we calculate 4 Difference of Gaussian (DoG) images from the 5 Gaussian-filtered images by simply subtracting adjacent Gaussian-filtered images. So now we have a total of 16 DoG images, but we will consider the 4 DoG images from each octave individually.
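To check my own understanding of steps 1-2 (and of my question 4 below about down-sampling), here is a rough sketch of how I picture the pyramid construction; the choice of k, the number of scales per octave, and which blurred image gets down-sampled are my assumptions, not necessarily the paper's exact values:

```python
import cv2
import numpy as np

def build_dog_pyramid(img, n_octaves=4, n_scales=5, sigma=1.6):
    """Per octave: blur the base image with increasing sigmas, then
    subtract adjacent blurred images to obtain the DoG images."""
    k = 2 ** (1.0 / (n_scales - 2))        # scale multiplier between levels
    gaussians, dogs = [], []
    base = img.astype(np.float32)          # grayscale input assumed

    for _octave in range(n_octaves):
        octave_gauss = [cv2.GaussianBlur(base, (0, 0), sigma * (k ** i))
                        for i in range(n_scales)]
        octave_dog = [octave_gauss[i + 1] - octave_gauss[i]
                      for i in range(n_scales - 1)]
        gaussians.append(octave_gauss)
        dogs.append(octave_dog)

        # One of the blurred images of this octave is down-sampled by 2
        # and becomes the base image of the next octave.
        base = cv2.resize(octave_gauss[-3],
                          (base.shape[1] // 2, base.shape[0] // 2),
                          interpolation=cv2.INTER_NEAREST)
    return gaussians, dogs
```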
Find local extrema (maximum or minimum) pixels by comparing each pixel in each DoG image with its 26 neighbor pixels. Among these, 8 pixels are at the same scale as the pixel (in a 3x3 window), 9 are in a 3x3 window at the scale above (the adjacent DoG image from the same octave), and 9 others are in a 3x3 window at the scale below.
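My reading of that extremum test, in code form (the contrast threshold is just a placeholder):

```python
import numpy as np

def find_extrema(dogs_octave, threshold=0.03):
    """For one octave: a pixel is a candidate keypoint if it is an
    extremum of the 3x3x3 block spanning its own DoG image and the
    two adjacent DoG images (its 26 neighbours)."""
    keypoints = []
    # Only the "inner" DoG images have both a scale above and below them.
    for s in range(1, len(dogs_octave) - 1):
        below, current, above = dogs_octave[s - 1], dogs_octave[s], dogs_octave[s + 1]
        h, w = current.shape
        for y in range(1, h - 1):
            for x in range(1, w - 1):
                v = current[y, x]
                if abs(v) < threshold:             # skip very weak responses
                    continue
                block = np.stack([below[y - 1:y + 2, x - 1:x + 2],
                                  current[y - 1:y + 2, x - 1:x + 2],
                                  above[y - 1:y + 2, x - 1:x + 2]])
                if v >= block.max() or v <= block.min():
                    keypoints.append((x, y, s))
    return keypoints
```

Note that the loop only runs over the inner DoG images, which is also how I currently interpret my question 3 below.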
Having found these local extrema in different octaves, we must refine them, eliminating low-contrast points and weak edge points. They filter out bad candidates using a threshold on a Taylor expansion function and an eigenvalue ratio threshold computed on a Hessian matrix.
(this is the part I don't understand perfectly): For each interest point that survived (in each octave, I believe), they consider a neighborhood around it and calculate the gradient magnitude and orientation of each pixel in that region. They build a gradient orientation histogram covering 360 degrees and select the highest peak, and also any peaks that are higher than 80% of the highest peak. They state that the orientation of the keypoint is determined by fitting a parabola to the 3 histogram values closest to each peak to interpolate the peak position (I really don't understand this part perfectly).
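My best guess at what the parabola fit means, sketched in code (the 80% rule comes from the paper; the 3-point interpolation formula and the bin count are my interpretation):

```python
import numpy as np

def refined_orientations(hist, peak_ratio=0.8):
    """hist: orientation histogram covering 360 degrees (e.g. 36 bins).
    For every bin that is a local peak and at least 80% of the global
    maximum, fit a parabola through the peak bin and its two neighbours
    and take the vertex as the refined orientation."""
    hist = np.asarray(hist, dtype=float)
    n = len(hist)
    orientations = []
    for i in range(n):
        left, centre, right = hist[(i - 1) % n], hist[i], hist[(i + 1) % n]
        if centre < peak_ratio * hist.max():
            continue
        if centre <= left or centre <= right:      # not a local peak
            continue
        # Vertex of the parabola through (-1, left), (0, centre), (+1, right).
        offset = 0.5 * (left - right) / (left - 2 * centre + right)
        orientations.append(((i + offset) % n) * 360.0 / n)
    return orientations
```

As I understand it, each such orientation produces its own keypoint at the same location and scale.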
What I am not understanding
1- The tutorial and even the original paper are not clear on how to detect a single key point, as we are dealing with multiple octaves (image resolutions). For example, suppose I have detected 1000 key points in the first octave, 500 in the second, 250 in the third and 125 in the fourth octave. The SIFT algorithm will return the following data about each key point: 1- (x, y) coordinates, 2- scale (what is that?), 3- orientation and 4- the feature vector (which I easily understood how it is built). There are also Python functions from OpenCV that can draw these keypoints on the original image (thus, the first octave), but how, if the keypoints are detected in different octaves and thus the algorithm considers DoG images with different resolutions?
2- I don't understand part 5 of the algorithm very well. It is used for defining the orientation of the keypoint, right? Can somebody explain it to me in other words so that maybe I can understand?
3- To find the local extrema per octave (step 3), they don't explain how to do that in the first and last DoG images. As we are considering 4 DoG images, it is possible to do that only in the second and third DoG images.
4- There is another thing the author wrote that completely confused my understanding of the approach:
Figure 1: For each octave of scale space, the initial image is repeatedly convolved with Gaussians to produce the set of scale space images shown on the left. Adjacent Gaussian images are subtracted to produce the difference-of-Gaussian images on the right. After each octave, the Gaussian image is down-sampled by a factor of 2, and the process repeated.
What? Does he down-sample only one Gaussian image? How can the process be repeated by doing that? I mean, the difference of Gaussians is originally obtained by filtering the INPUT IMAGE with different sigmas, so I believe the INPUT IMAGE, and not a Gaussian image, should be resampled. Or did the author forget to write that THE GAUSSIAN IMAGES from a given octave are down-sampled and the process is repeated for the next octave?
I'm basically following a paper, "Using a statistical language model to improve the performance of an HMM-based cursive handwriting recognition system".
Here the author extracts a vector of 9 features from each sliding window. Quoting the paper:
The first three features are the weight of the window, its centre of gravity and the second order moment of the window. Features four and five define the position of the upper and lower contour in the window, features six and seven give the orientation of the upper and lower contour by the gradient of the contour at the window's position, feature eight gives the number of black to white transitions in vertical direction, while feature nine gives the number of black pixels between the upper and lower contour.
I managed to calculate the first three features the paper is talking about, but I seem to have trouble understanding features 4, 5, 6, 7 and 8.
I can calculate the contour of an image. Suppose this is a window from one of the text lines (the window is 14 pixels wide, as suggested by the paper):
And this is the extracted contour of the image:
So what exactly are the upper and lower contour here? From where should I take the limits? If it just refers to the topmost and bottommost pixels, then I could have extracted those without contour extraction. Similarly, the orientation of these contours is equally confusing.
I would really appreciate some guidance here.
I had a look at the paper, and I am pretty sure that "upper" and "lower" should be read as "uppermost" and "lowermost". This especially makes sense as the authors put a special focus on the preprocessing of their data, which they normalize in both the horizontal and vertical directions. They take care to have a kind of robustness to scale, writing angle, etc.
I guess that features 4 and 5 are the extremal ordinates of the contours, which, combined with features 6 and 7 (the gradients, i.e. the orientations), give a good idea of the shape of these parts of the contour.
Feature 9 will mostly be useful to distinguish between letters that can have similar vertical shapes, I guess, such as "i", "l" and "j".
This is my understanding. Hope this helps!
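To make that concrete, here is a rough sketch of how features 4 to 9 could be computed for one binary window under this reading; the exact definitions (mean vs. extremal contour position, fitted slope vs. simple difference, whether the contour pixels themselves count in feature 9) may differ in the paper:

```python
import numpy as np

def contour_features(window):
    """window: 2D binary array for one sliding window, 1 = ink, 0 = background."""
    window = (window > 0).astype(np.int8)
    h, w = window.shape
    cols = [np.flatnonzero(window[:, x]) for x in range(w)]
    upper = np.array([c[0] if c.size else h for c in cols])    # topmost ink row
    lower = np.array([c[-1] if c.size else 0 for c in cols])   # bottommost ink row

    f4 = upper.mean()                          # position of the upper contour
    f5 = lower.mean()                          # position of the lower contour
    # Orientation: slope of a line fitted to the contour positions.
    f6 = np.polyfit(np.arange(w), upper, 1)[0]
    f7 = np.polyfit(np.arange(w), lower, 1)[0]

    # Feature 8: black-to-white transitions counted down each column.
    f8 = sum(np.count_nonzero(np.diff(window[:, x]) == -1) for x in range(w))

    # Feature 9: black pixels between the upper and lower contour.
    f9 = sum(np.count_nonzero(window[c[0]:c[-1] + 1, x])
             for x, c in enumerate(cols) if c.size)
    return f4, f5, f6, f7, f8, f9
```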
Fooling around with OCR.
I have a set of binary images of numbers 0-9 that can be used as training data, and another set of unknown numbers within the same range. I want to be able to classify the numbers in the unknown set by using the k nearest neighbour algorithm.
I've done some studying on the algorithm, and I've read that the best approach is to take quantitative characteristics and plot each training sample in a feature space with those characteristics as the axes, do the same for each image in the unknown set, and use the k nearest neighbour algorithm to find the closest points, something like what is done here.
What characteristics would be best suited to something like this?
In a simple case, as phs mentioned in his comment, pixel intensities are used. The images are resized to a standard size like 20x20, 10x10, etc., and the whole image is expressed as a vector of 400 or 100 elements respectively.
Such an example is shown here: Simple Digit Recognition OCR in OpenCV-Python
Or you can look at features like moments, centroid, area, perimeter, Euler number, etc.
If your image is grayscale, you can go for Histogram of Oriented Gradients (HOG). Here is an example with SVM, which you can try adapting to kNN: http://docs.opencv.org/trunk/doc/py_tutorials/py_ml/py_svm/py_svm_opencv/py_svm_opencv.html#svm-opencv
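For the pixel-intensity route, a minimal sketch with OpenCV's kNN (using the newer cv2.ml API; the 20x20 size and k = 5 are just example values):

```python
import cv2
import numpy as np

def to_vectors(images, size=(20, 20)):
    """Resize every digit image to a fixed size and flatten it into one row."""
    return np.array([cv2.resize(img, size).flatten() for img in images],
                    dtype=np.float32)

def classify_knn(train_images, train_labels, test_images, k=5):
    """Train kNN on flattened pixel intensities and label the unknown images."""
    knn = cv2.ml.KNearest_create()
    knn.train(to_vectors(train_images), cv2.ml.ROW_SAMPLE,
              np.array(train_labels, dtype=np.float32))
    _ret, results, _neighbours, _dists = knn.findNearest(to_vectors(test_images), k)
    return results.ravel()
```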
In "Multiple View Geometry in Computer Vision" by R. Hartley and A. Zisserman, chapter 11 (on the computation of the fundamental matrix), one can read:
"11.7.3 The calibrated case
In the case of calibrated cameras normalized image coordinates may be used, and the essential matrix E computed instead of the fundamental matrix."
Does it mean that if I have the proper intrinsic camera matrices (is that what calibrated means in this case?) I can calculate the essential matrix directly (using the 8-point algorithm), skipping the calculation of the fundamental matrix?
And can I then get the matrices R and T from the calculated essential matrix to reconstruct a 3D model?
Regards,
Artik
Short answer: yes. See also the longer explanation on Wikipedia.
From your correspondences, using the 8-point algorithm you obtain the fundamental matrix F.
From the relation E = K'^T F K, assuming that you know both K' and K (in case both images were taken by the same camera, you have K' = K), you can compute E.
From E you get 4 possible camera pairs (P_0, P_0'), (P_1, P_1'), ..., (P_3, P_3'). Only one of these pairs satisfies the positive depth constraint (i.e. the 3D points lie in front of both cameras).
That pair will be your cameras.
Hope this helps!
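If you are using OpenCV (3.x or later), a rough sketch of the whole chain could look like this; the use of a single K for both views and the RANSAC option are assumptions, not requirements:

```python
import cv2
import numpy as np

def relative_pose(pts1, pts2, K):
    """pts1, pts2: Nx2 arrays of matched image points; K: intrinsic matrix."""
    # Route 1: estimate F with the 8-point algorithm, then E = K'^T F K (K' = K here).
    F, _mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_8POINT)
    E_from_F = K.T @ F @ K

    # Route 2: estimate E directly from the correspondences and K.
    E, _mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC)

    # recoverPose performs the positive-depth (cheirality) test and returns
    # the single (R, t) pair with the points in front of both cameras;
    # t is known only up to scale.
    _n_inliers, R, t, _mask = cv2.recoverPose(E, pts1, pts2, K)
    return E_from_F, E, R, t
```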
In general, a calibrated camera in visual odometry refers to a camera for which the intrinsic matrix is known.
In the case of a stereo visual odometry system, I typically take it to mean that the intrinsic matrix is known for both cameras; however, some of my co-workers take it to mean that the rotation and translation between the two cameras are known.
In practice, there is hardly any distinction between the two, as you can estimate the intrinsic matrix of a camera using various functions in MATLAB or OpenCV, and given the intrinsic matrices, you can determine the rotation and translation between the two cameras.
Furthermore, the derivation of the fundamental matrix relies upon the essential matrix and the intrinsic matrices of the two cameras (the intrinsic matrix can be the same in the case of monocular visual odometry). This means that it is often the case that the essential matrix is estimated and the fundamental matrix is not.
For an explanation of how to get the rotation and translation from the essential matrix, I recommend first watching a YouTube video on Singular Value Decomposition (SVD) and then reading: https://www.researchgate.net/publication/220556161_Visual_Odometry_Tutorial.
Good luck with your studies, young scholar.
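For reference, the standard SVD-based decomposition that the tutorial describes looks roughly like this (a sketch of the textbook recipe, not the tutorial's exact code):

```python
import numpy as np

def decompose_essential(E):
    """An essential matrix factors as E = U diag(1, 1, 0) V^T, which yields
    two candidate rotations and a translation direction known only up to sign."""
    U, _S, Vt = np.linalg.svd(E)
    # Enforce proper rotations (determinant +1).
    if np.linalg.det(U) < 0:
        U = -U
    if np.linalg.det(Vt) < 0:
        Vt = -Vt

    W = np.array([[0.0, -1.0, 0.0],
                  [1.0,  0.0, 0.0],
                  [0.0,  0.0, 1.0]])
    R1 = U @ W @ Vt
    R2 = U @ W.T @ Vt
    t = U[:, 2]                     # translation direction, up to sign and scale

    # Four candidate poses; the positive-depth test selects the correct one
    # (cv2.recoverPose does this selection for you).
    return [(R1, t), (R1, -t), (R2, t), (R2, -t)]
```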
Given a set of 2D points, I want to calculate a measure of how horizontally symmetrical and vertically symmetrical those points are.
Alternatively, for each set of points I will also have a rasterised image of the lines between those points, so is there any way to calculate a measure of symmetry for images?
BTW, this is for use in a feature vector that will be presented to a neural network.
Clarification
The image on the left is 'horizontally' symmetrical. If we imagine a vertical line running down the middle of it, the left and right parts are symmetrical. Likewise, the image on the right is 'vertically' symmetrical, if you imagine a horizontal line running across its center.
What I want is a measure of just how horizontally symmetrical they are, and another of just how vertically symmetrical they are.
This is just a guideline / idea, you'll need to work out the details:
To detect symmetry with respect to horizontal reflection:
reflect the image horizontally
pad the original (unreflected) image horizontally on both sides
compute the correlation of the padded and the reflected images
The position of the maximum in the result of the correlation will give you the location of the axis of symmetry. The value of the maximum will give you a measure of the symmetry, provided you do a suitable normalization first.
This will only work if your images are "symmetric enough", and it works for images only, not sets of points. But you can create an image from a set of points too.
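A rough sketch of that idea in NumPy; for brevity I use circular shifts of the mirrored image instead of explicit padding, and the normalization here is one reasonable choice among several:

```python
import numpy as np

def horizontal_symmetry(img, max_shift=None):
    """Return (score, shift): the best normalized correlation between the
    image and its left-right mirror over a range of horizontal shifts.
    A score near 1 means strongly symmetric; the shift locates the axis."""
    img = img.astype(np.float64)
    img = img - img.mean()
    mirrored = img[:, ::-1]

    h, w = img.shape
    if max_shift is None:
        max_shift = w // 2

    best_score, best_shift = -1.0, 0
    for shift in range(-max_shift, max_shift + 1):
        rolled = np.roll(mirrored, shift, axis=1)
        denom = np.linalg.norm(img) * np.linalg.norm(rolled)
        score = float((img * rolled).sum() / denom) if denom > 0 else 0.0
        if score > best_score:
            best_score, best_shift = score, shift
    return best_score, best_shift
```

For vertical symmetry, apply the same function to the transposed image. Both scores can then go straight into your feature vector.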
Leonidas J. Guibas from Stanford University talked about this at ETVC'08: Detection of Symmetries and Repeated Patterns in 3D Point Cloud Data.