Decision tree training for Multi-View face detection - algorithm

I am working on multi-view face detection and following Jones's multi-view face detection algorithm.
In the paper "Fast Multi-view Face Detection", Jones trains a C4.5 decision tree on images of different face poses. Section 3.3, Decision Tree Training, says: "the training algorithm is almost identical to the boosting algorithm. The two main differences are the criteria for feature selection and the splitting of the training set at each node."
I have learned and understood the C4.5 algorithm here.
I can't figure out how to train images of different face poses for a C4.5 decision tree.
EDIT 1:
The Stage 0 and Stage 1 features from training the AdaBoost cascade classifier are shown below.
<!-- stage 0 -->
<_>
<maxWeakCount>3</maxWeakCount>
<stageThreshold>-0.7520892024040222</stageThreshold>
<weakClassifiers>
<!-- tree 0 -->
<_>
<internalNodes>
0 -1 46 -67130709 -21569 -1426120013 -1275125205 -21585
-16385 587145899 -24005</internalNodes>
<leafValues>
-0.6543210148811340 0.8888888955116272</leafValues></_>
<!-- tree 1 -->
<_>
<internalNodes>
0 -1 13 -163512766 -769593758 -10027009 -262145 -514457854
-193593353 -524289 -1</internalNodes>
<leafValues>
-0.7739216089248657 0.7278633713722229</leafValues></_>
<!-- tree 2 -->
<_>
<internalNodes>
0 -1 2 -363936790 -893203669 -1337948010 -136907894
1088782736 -134217726 -741544961 -1590337</internalNodes>
<leafValues>
-0.7068563103675842 0.6761534214019775</leafValues></_></weakClassifiers></_>
<!-- stage 1 -->
<_>
<maxWeakCount>4</maxWeakCount>
<stageThreshold>-0.4872078299522400</stageThreshold>
<weakClassifiers>
<!-- tree 0 -->
<_>
<internalNodes>
0 -1 84 2147483647 1946124287 -536870913 2147450879
738132490 1061101567 243204619 2147446655</internalNodes>
<leafValues>
-0.8083735704421997 0.7685696482658386</leafValues></_>
<!-- tree 1 -->
<_>
<internalNodes>
0 -1 21 2147483647 263176079 1879048191 254749487 1879048191
-134252545 -268435457 801111999</internalNodes>
<leafValues>
-0.7698410153388977 0.6592915654182434</leafValues></_>
<!-- tree 2 -->
<_>
<internalNodes>
0 -1 106 -98110272 1610939566 -285484400 -850010381
-189334372 -1671954433 -571026695 -262145</internalNodes>
<leafValues>
-0.7506558895111084 0.5444605946540833</leafValues></_>
<!-- tree 3 -->
<_>
<internalNodes>
0 -1 48 -798690576 -131075 1095771153 -237144073 -65569 -1
-216727745 -69206049</internalNodes>
<leafValues>
-0.7775990366935730 0.5465461611747742</leafValues></_></weakClassifiers></_>
EDIT 2:
My idea of how to train the decision tree is described in the following picture.
I am still figuring out which features to use, but I think the training should proceed as shown in the attached image.
Thanks

I did not read the paper, but from what I know of early face recognition experiments, the attributes you are looking for are probably just the grey-level inputs of the face images. Usually, images are rescaled, say to 32x32 pixels, so you have a 1024-dimensional vector to train your decision tree. Have a closer look at the article: if they use other features, they will be stated, or at least a reference will be given.
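As a minimal sketch of that idea, assuming OpenCV and scikit-learn are available (note that sklearn's DecisionTreeClassifier is CART rather than C4.5, but the feature layout is the same; the file paths and pose labels are hypothetical):

import cv2
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def to_feature_vector(path, size=(32, 32)):
    # Load a face image, rescale it, and flatten to a 1024-dimensional vector.
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    img = cv2.resize(img, size)
    return img.astype(np.float32).ravel()

def train_pose_tree(samples):
    # samples is a hypothetical list of (image_path, pose_label) pairs,
    # e.g. labels 0..4 for frontal, left/right profile, up, down.
    X = np.vstack([to_feature_vector(path) for path, _ in samples])
    y = np.array([label for _, label in samples])
    tree = DecisionTreeClassifier(criterion="entropy")  # entropy split, as in C4.5
    tree.fit(X, y)
    return tree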

Related

Segment tree built on "light bulbs"

I have encountered following problem:
There are n numbers (0 or 1) and there are 2 operations. You can set all numbers in a specific range to 0 or to 1 (note that setting 001 to 0 gives 000, not 110; it is an assignment, not a toggle), and you can also ask how many elements are turned on in a specific range.
Example:
->Our array is 0100101
We set elements from 1 to 3 to 1:
->Our array is 1110101 now
We set elements from 2 to 5 to 0:
->Our array is 1000001 now
We ask for the sum from the 2nd to the 7th element
-> The answer is 1
A brute force solution is too slow (O(n*q), where q is the number of queries), so I assume there has to be a faster one, probably using a segment tree, but I cannot find it...
You could build a subsampling binary tree in the fashion of the mipmaps used in computer graphics.
Each node of the tree contains the sum of its children's values.
E.g.:
0100010011110100
1 0 1 0 2 2 1 0
1 1 4 1
2 5
7
This will bring down complexity for a single query to O(log₂n).
For an editing operation, you also get O(log₂n) by implementing a shortcut: instead of applying changes recursively, you stop at a node that is fully covered by the input range, thus creating a sparse representation of the sequence. Each node representing M light bulbs either
has value 0 and no children, or
has value M and no children, or
has a value in the range 1..M-1 and 2 children.
The tree above would actually be stored like this:
7
2 5
1 1 4 1
1 0 1 0 1 0
01 01 01
You end up with O(q*log₂n).
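A minimal sketch of that idea as an array-based segment tree with lazy range assignment (plain Python; the class and method names are just illustrative):

class LightBulbs:
    def __init__(self, n):
        self.n = n
        self.sums = [0] * (4 * n)      # node value = number of ones in its segment
        self.lazy = [None] * (4 * n)   # pending assignment: 0, 1, or None

    def _push(self, node, length):
        # Propagate a pending assignment one level down before descending.
        if self.lazy[node] is not None:
            for child, child_len in ((2 * node, (length + 1) // 2),
                                     (2 * node + 1, length // 2)):
                self.lazy[child] = self.lazy[node]
                self.sums[child] = self.lazy[node] * child_len
            self.lazy[node] = None

    def _set(self, node, node_lo, node_hi, lo, hi, value):
        if hi < node_lo or node_hi < lo:
            return
        if lo <= node_lo and node_hi <= hi:          # fully covered: stop here
            self.sums[node] = value * (node_hi - node_lo + 1)
            self.lazy[node] = value
            return
        self._push(node, node_hi - node_lo + 1)
        mid = (node_lo + node_hi) // 2
        self._set(2 * node, node_lo, mid, lo, hi, value)
        self._set(2 * node + 1, mid + 1, node_hi, lo, hi, value)
        self.sums[node] = self.sums[2 * node] + self.sums[2 * node + 1]

    def _query(self, node, node_lo, node_hi, lo, hi):
        if hi < node_lo or node_hi < lo:
            return 0
        if lo <= node_lo and node_hi <= hi:
            return self.sums[node]
        self._push(node, node_hi - node_lo + 1)
        mid = (node_lo + node_hi) // 2
        return (self._query(2 * node, node_lo, mid, lo, hi) +
                self._query(2 * node + 1, mid + 1, node_hi, lo, hi))

    def set_range(self, lo, hi, value):   # O(log n), 0-based inclusive indices
        self._set(1, 0, self.n - 1, lo, hi, value)

    def query_sum(self, lo, hi):          # O(log n)
        return self._query(1, 0, self.n - 1, lo, hi)

Replaying the example from the question:

bulbs = LightBulbs(7)                 # start from 0100101
for i, bit in enumerate("0100101"):
    bulbs.set_range(i, i, int(bit))
bulbs.set_range(0, 2, 1)              # elements 1..3 -> 1110101
bulbs.set_range(1, 4, 0)              # elements 2..5 -> 1000001
print(bulbs.query_sum(1, 6))          # sum of 2nd..7th element -> 1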

Is there an adaptation to the Marching Squares algorithm to make it lossless compression for constrained inputs?

I'm using the Marching Squares algorithm to take a lattice of values and turn them into a contour where the values exceed 50%. My values have the property that most are 0% or 100%, and the transition from 0% to 100% occurs across at most a single intervening value, so the contour created will pass through every lattice position where the value is greater than 0% and less than 100%. For example, consider this field of values, representing the approximate percentages shown in the greyscale squares of the following image:
0 0 0 0 0 0 0 0
0 0 6 71 71 20 0 0
0 28 35 100 100 48 20 0
0 100 100 100 100 100 71 0
0 100 100 100 100 100 71 0
0 9 18 100 100 35 6 0
0 0 9 100 100 28 0 0
0 0 0 0 0 0 0 0
The traditional Marching Squares algorithm would produce a contour as shown in this image:
The blue field represents the contour and the greyscale squares represent the lattice values for the above data.
Given the resulting contour, I can convert it back to a lattice of numbers again by taking the area covered by the contour for each lattice position as the recreated value for that lattice position. For the above contour, it looks like this image that shows the same contour and the resulting values converted back to a lattice of values shown by greyscale squares:
The new values are similar to but not exactly the same as the original; some are larger, others are smaller, so information has been lost and the algorithm is a lossy compression. The decompressed field of values looks approximately like this:
0 0 0 0 0 0 0 0
0 0 3 67 70 4 0 0
0 12 43 100 100 59 4 0
0 91 100 100 100 100 70 0
0 88 100 100 100 100 67 0
0 4 27 100 100 43 3 0
0 0 3 88 91 12 0 0
0 0 0 0 0 0 0 0
Is there a way to adjust the linear interpolation step so that it does not lose information, or at least comes much closer to the original data field? If not, can extra points be added to the contour to resolve this? For example, perhaps the interpolation step is left as is, but instead of a straight line between the points in the Marching Squares algorithm, extra points are added along the path to force the desired area in each corner of the four lattice squares considered at each step of the algorithm.
In the lower right area of the example, one step of the Marching Squares algorithm finds these four values:
100 28
0 0
The interpolation produces 50% on the left side and 70% on the top side. This means that on the left, point A is placed exactly on the border between the 0% square in the lower left and the 100% square in the upper left, and on the top, point B is placed 70% of the way toward the center of the 28% value in the upper right. The result is a diagonal line from A to B that cuts off the upper-left corner, whose value is 100.
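For reference, the standard interpolation that places A and B on this cell can be sketched like this (plain Python, threshold at 50%, using the corner values above):

def crossing(v0, v1, threshold=50.0):
    # Fraction of the way from v0 to v1 at which the threshold is crossed.
    return (threshold - v0) / (v1 - v0)

top_left, top_right, bottom_left = 100.0, 28.0, 0.0

a = crossing(top_left, bottom_left)   # left edge: 0.5, exactly on the cell border
b = crossing(top_left, top_right)     # top edge: ~0.69, about 70% of the way across
print(a, b)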
We could add additional intervening points between A and B such that the area values are not lost upon return back (decompression) from contour to lattice values. For example, consider this drawing:
The original Marching Squares step gives the points A and B in the drawing. The yellow highlight shows additional points X, Y, and Z that could be added so that the covered area is 100% in the upper left, 0% in the lower left, and 28% in the upper right. For the 28%, 14% is handled below point B and 14% above point B.
Is this a known problem that has existing solutions or are there similar problems in compression of images that can be drawn upon to help solve this problem? Does the proposed solution seem reasonable or can it be simplified further? I'm concerned that it will be pretty complex to handle the four quadrants for each of the 14 variations of Marching Squares that produce lines, so if there is a way to simplify this, I'd like to find it.
In summary, I would like to adjust the computation of the blue contour such that the area of each lattice square covered by the contour matches the original data used to create the contour, giving lossless compression from lattice to contour that is perfectly reversible.

Which algorithms are there to find the Smallest Set of Smallest Rings?

I have an unweighted, undirected, connected graph. Generally, it is a chemical compound with many cycles side by side. The problem is common in this field and is named as in the title. A good algorithm is Horton's, but I cannot find any exact, step-by-step description of it.
Strictly speaking, my problem is this one: Algorithm for finding minimal cycles in a graph, but unfortunately the link to the site no longer works.
I only found Python code for the Figueras algorithm, but Figueras does not work in every case; sometimes it doesn't find all rings.
The problem is also similar to this one: Find all chordless cycles in an undirected graph. I tried it, but it didn't work for more complex graphs like mine.
I found 4-5 sources with the needed information, but the algorithm is never fully explained.
I can't find any algorithm for SSSR, although it seems to be a common problem, mainly in the chemistry field.
Horton's algorithm is pretty simple. I'll describe it for your use case.
For each vertex v, compute a breadth-first search tree rooted at v. For each edge wx such that v, w, x are pairwise distinct and such that the least common ancestor of w and x is v, add a cycle consisting of the path from v to w, the edge wx, and the path from x back to v.
Sort these cycles by size nondecreasing and consider them in order. If the current cycle can be expressed as the "exclusive OR" of cycles considered before it, then it is not part of the basis.
The test in Step 2 is the most complicated part of this algorithm. What you need to do, basically, is write out the accepted cycles and the candidate cycle as a 0-1 incidence matrix whose rows are indexed by cycle and whose columns are indexed by edge, then run Gaussian elimination over GF(2) on this matrix to see whether it produces an all-zero row (if so, discard the candidate cycle).
With some effort, it's possible to save the cost of re-eliminating the accepted cycles every time, but that's an optimization.
For example, if we have a graph
a---b
|  /|
| / |
|/  |
c---d
then we have a matrix like
ab ac bc bd cd
abca 1 1 1 0 0
bcdb 0 0 1 1 1
abdca 1 1 0 1 1
where I'm cheating a bit because abdca is not actually one of the cycles generated in Step 1.
Elimination proceeds as follows:
ab ac bc bd cd
1 1 1 0 0
0 0 1 1 1
1 1 0 1 1
row[2] ^= row[0];
ab ac bc bd cd
1 1 1 0 0
0 0 1 1 1
0 0 1 1 1
row[2] ^= row[1];
ab ac bc bd cd
1 1 1 0 0
0 0 1 1 1
0 0 0 0 0
so that set of cycles is dependent (don't keep the last row).
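A minimal sketch of the Step 2 test, using Python integers as bit vectors over GF(2). Cycles are given as sets of edge names and edge_index maps each edge to a bit position; the function names are illustrative, not taken from Horton's paper:

def cycle_to_bits(cycle_edges, edge_index):
    bits = 0
    for e in cycle_edges:
        bits |= 1 << edge_index[e]
    return bits

def select_basis(candidate_cycles, edge_index):
    # Keep a candidate only if it is independent of the cycles kept so far.
    pivots = {}                                        # leading bit -> reduced row
    kept = []
    for cycle in sorted(candidate_cycles, key=len):    # size nondecreasing
        row = cycle_to_bits(cycle, edge_index)
        while row:
            top = row.bit_length() - 1                 # leading edge column
            if top not in pivots:
                pivots[top] = row
                kept.append(cycle)
                break
            row ^= pivots[top]                         # XOR away the leading bit
        # if row reduces to all zeros, the cycle is dependent and is discarded
    return kept

Running it on the example above, abdca is eliminated:

edge_index = {"ab": 4, "ac": 3, "bc": 2, "bd": 1, "cd": 0}
cycles = [("ab", "ac", "bc"),             # abca
          ("bc", "bd", "cd"),             # bcdb
          ("ab", "ac", "bd", "cd")]       # abdca
print(select_basis(cycles, edge_index))   # keeps abca and bcdb only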

Multiple Inputs for Backpropagation Neural Network

I've been working on this for about a week. There are no errors in my code; I just need to get the algorithm and concept right. I've implemented a neural network with one hidden layer, and I use the backpropagation algorithm to correct the weights.
My problem is that the network can only learn one pattern. If I train it with the same training data over and over again, it produces the desired outputs when given input that is numerically close to the training data.
training_input:1, 2, 3
training_output: 0.6, 0.25
after 300 epochs....
input: 1, 2, 3
output: 0.6, 0.25
input 1, 1, 2
output: 0.5853, 0.213245
But if I use multiple, varying training sets, it only learns the last pattern. Aren't neural networks supposed to learn multiple patterns? Is this a common beginner mistake? If so, point me in the right direction. I've looked at many online guides, but I've never seen one that goes into detail about dealing with multiple inputs. I'm using sigmoid for the hidden layer and tanh for the output layer.
+
Example training arrays:
13 tcp telnet SF 118 2425 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 26 10 0.38 0.12 0.04 0 0 0 0.12 0.3 anomaly
0 udp private SF 44 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4 3 0 0 0 0 0.75 0.5 0 255 254 1 0.01 0.01 0 0 0 0 0 anomaly
0 tcp telnet S3 0 44 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 1 0 0 255 79 0.31 0.61 0 0 0.21 0.68 0.6 0 anomaly
The last column (anomaly/normal) is the expected output. I turn everything into numbers, so each word is represented by a unique integer.
I give the network one array at a time, then I use the last column as the expected output to adjust the weights. I have around 300 arrays like these.
As for the hidden neurons, I tried 3, 6, and 20, but nothing changed.
+
To update the weights, I calculate the gradient for the output and hidden layers. Then I calculate the deltas and add them to their associated weights. I don't understand how that is ever going to learn to map multiple inputs to multiple outputs. It looks linear.
If you run too many back-propagation iterations on one data set, the weights will eventually converge to a state that gives the best outcome for that specific training set (overtraining, in machine-learning terms). The network will only learn the relationship between input and target for that specific set, not the broader, more general relationship you might be looking for. It's better to merge some distinctive sets and train your network on the full set.
Without seeing the code for the back-propagation algorithm, I cannot tell you whether it's working correctly. One problem I had when implementing back-propagation was not properly calculating the derivative of the activation function at the input value. This website was very helpful for me.
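As a minimal sketch of that advice (numpy only; one hidden layer with sigmoid and a tanh output as in the question; the second training pattern and all sizes are made up for illustration), the key point is that every epoch loops over the whole shuffled training set rather than repeating a single sample:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train(X, Y, n_hidden=6, epochs=300, lr=0.1, seed=0):
    rng = np.random.default_rng(seed)
    W1 = rng.normal(scale=0.5, size=(X.shape[1], n_hidden))
    W2 = rng.normal(scale=0.5, size=(n_hidden, Y.shape[1]))
    for _ in range(epochs):
        for i in rng.permutation(len(X)):      # shuffle the full set every epoch
            x, y = X[i:i+1], Y[i:i+1]
            h = sigmoid(x @ W1)                # hidden activations
            out = np.tanh(h @ W2)              # network output
            d_out = (out - y) * (1.0 - out ** 2)       # error through tanh'
            d_hid = (d_out @ W2.T) * h * (1.0 - h)     # error through sigmoid'
            W2 -= lr * h.T @ d_out
            W1 -= lr * x.T @ d_hid
    return W1, W2

X = np.array([[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]])   # two different patterns
Y = np.array([[0.6, 0.25], [0.2, 0.75]])
W1, W2 = train(X, Y)
print(np.tanh(sigmoid(X @ W1) @ W2))               # should approach both targets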
No, neural networks are not supposed to know multiple tricks.
You train them for a specific task.
Yes, they can be trained for other tasks as well,
but then they get optimized for that other task.
That is why you should create load and save functions for your network, so that you can easily switch brains and perform other tasks if required.
If you're not sure what the current task is, train a network to find the difference between the tasks.

General matrix definition for image filtering or transformation

I am looking for matrices I can generate to transform other matrices, but I am not talking about regular matrices like:
From this question: the canonical examples you'll find everywhere are the non-Gaussian box blur:
1 1 1
1 1 1
1 1 1
Image sharpening:
0 -1 0
-1 5 -1
0 -1 0
Edge detection:
0 1 0
1 -4 1
0 1 0
and emboss:
-2 -1 0
-1 1 1
0 1 2
Those are applied to each region of an image; I just want one big matrix. Is that possible?
For example: a 2560x2560 matrix that I can multiply directly with an image of 2560x2560 pixels.
Yes, it's possible, but maybe not in the way you would think. Take a look at the Gaussian blur example at http://scipy-lectures.github.io/intro/scipy.html#fast-fourier-transforms-scipy-fftpack
The thing is that convolution in the image domain is equivalent to multiplication in the frequency domain. This is the convolution theorem of the Fourier transform (https://en.wikipedia.org/wiki/Fourier_transform#Convolution_theorem). So it is possible, and in fact for huge images like the ones you're talking about it should be faster. But the matrices are no longer simple ones like the examples you posted above.
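A minimal sketch of that idea with numpy/scipy, turning the 3x3 box blur from the question into a single elementwise multiplication in the frequency domain (the result is a circular convolution of the whole image):

import numpy as np
from scipy import fft

image = np.random.rand(2560, 2560)            # stand-in for the 2560x2560 image

kernel = np.ones((3, 3)) / 9.0                # box blur from the question
padded = np.zeros_like(image)
padded[:3, :3] = kernel
padded = np.roll(padded, (-1, -1), axis=(0, 1))   # center the kernel at (0, 0)

# Convolution theorem: convolving in the image domain is the same as
# multiplying the two transforms elementwise in the frequency domain.
blurred = np.real(fft.ifft2(fft.fft2(image) * fft.fft2(padded)))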
