Are all decision trees binary trees?
Thanks.
No, they're not. What if the decision is "What is your favourite colour?" You could have about 16.7 million answers (2^24) if you accept 8 bits per channel RGB.
No, not all decision trees are binary trees, although if you really want a binary tree, you can always construct one. In fact, the number of branches depends on the nature of the data you are going to classify.
There are two types of data.
1. Discrete data
2. Continuous data
Suppose A is the splitting attribute and A has v distinct values {a_1, a_2, ..., a_v} based on the training data. Then there are 3 cases.
CASE 1: A is discrete valued - a branch is created for each known value a_j of A. For example, if age is a discrete valued attribute, it can be branched to young, medium or old.
CASE 2: A is continuous valued - the test node has two possible outcomes corresponding to the conditions A <= split_point and A > split_point, respectively. For example, if age is a continuous valued attribute, it can be branched by the conditions age <= 55 and age > 55.
CASE 3: A is discrete valued and a binary tree must be produced - the test at the node is of the form "is A in splitting_subset?", where splitting_subset is the set of values that leads to one of the two decisions. For example, if age is a discrete valued attribute, it can be branched by a test of the form "is age in {young, medium}?".
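The three cases can be written as tiny test predicates; this is only an illustrative sketch (the attribute names, threshold, and subset follow the age example above and are otherwise made up):

```python
# CASE 1: discrete attribute -- one branch per distinct known value
def branch_discrete(age_group):
    return age_group                  # 'young', 'medium' or 'old' selects the branch

# CASE 2: continuous attribute -- binary split at a threshold
def branch_continuous(age, split_point=55):
    return age <= split_point         # True -> left branch, False -> right branch

# CASE 3: discrete attribute forced into a binary tree -- subset membership test
def branch_subset(age_group, splitting_subset=frozenset({'young', 'medium'})):
    return age_group in splitting_subset
```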
Let's say I'm given the following table and need to use KNN to predict the class label of record 15, knowing beforehand that k is set to 3. What are the proper steps to do this, regardless of the table, the labels, or what k is set to?
The first 10 are training data, and the other 10 are testing data.
First you need to convert the categorical data to numeric data.
For example: In case of the Astigmatism column you may use 1 for 'Yes' and 0 for 'No'.
Similarly do this for Age, Spectacle Prescription and Tear Production Rate.
Now that you have converted your categorical data to numeric values, you are ready to apply KNN.
Considering the testing data, select each row one by one and calculate its distance (this can be the L1 or L2 distance) from each point in the training set. So for the 11th data point you calculate its distance from each of the 10 training points (rows 1 to 10).
Note that calculating a distance has only become possible because of the conversion of categorical data to numerical values.
Then, once you have the 10 distance values for the 11th data point, you select the 3 points (as k = 3) with the minimum distance, look at their labels, and pick the label with the majority.
Repeat this for all testing points.
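The steps above can be sketched in a few lines; this assumes the table has already been converted to a numeric NumPy array as described:

```python
import numpy as np

def knn_predict(train_X, train_y, test_row, k=3):
    """Majority-vote KNN for a single test row, using the L2 distance."""
    # distance from the test row to every training row
    dists = np.sqrt(((train_X - test_row) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]            # indices of the k closest points
    labels = [train_y[i] for i in nearest]
    return max(set(labels), key=labels.count)  # label with the majority

```

For the L1 distance, replace the squared sum with `np.abs(train_X - test_row).sum(axis=1)`.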
Suppose I have a a graph with 2^N - 1 nodes, numbered 1 to 2^N - 1. Node i "depends on" node j if all the bits in the binary representation of j that are 1, are also 1 in the binary representation of i. So, for instance, if N=3, then node 7 depends on all other nodes. Node 6 depends on nodes 4 and 2.
The problem is eliminating nodes. I can eliminate a node if no other nodes depend on it. No nodes depend on 7; so I can eliminate 7. After eliminating 7, I can eliminate 6, 5, and 3, etc. What I'd like is to find an efficient algorithm for listing all the possible unique elimination paths. (that is, 7-6-5 is the same as 7-5-6, so we only need to list one of the two). I have a dumb algorithm already, but I think there must be a better way.
I have three related questions:
Does this problem have a general name?
What's the best way to solve it?
Is there a general formula for the number of unique elimination paths?
Edit: I should note that a node cannot depend on itself, by definition.
Edit2: Let S = {s_1, s_2, s_3,...,s_m} be the set of all m valid elimination paths. s_i and s_j are "equivalent" (for my purposes) iff the two eliminations s_i and s_j would lead to the same graph after elimination. I suppose to be clearer I could say that what I want is the set of all unique graphs resulting from valid elimination steps.
Edit3: Note that elimination paths may be different lengths. For N=2, the 5 valid elimination paths are (),(3),(3,2),(3,1),(3,2,1). For N=3, there are 19 unique paths.
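For small N, these counts can be reproduced by a brute-force search over reachable states, identifying a path with the set of nodes it leaves behind (which matches the equivalence in Edit2). This is only a sketch, not an efficient algorithm:

```python
def count_elimination_states(N):
    """Count unique graphs reachable by valid eliminations (5 for N=2, 19 for N=3)."""
    seen = set()

    def dfs(remaining):
        if remaining in seen:
            return
        seen.add(remaining)
        for j in remaining:
            # j is removable iff no other remaining node i depends on it,
            # i.e. no i whose 1-bits include all the 1-bits of j
            if not any(i != j and (i & j) == j for i in remaining):
                dfs(remaining - {j})

    dfs(frozenset(range(1, 2 ** N)))
    return len(seen)
```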
Edit4: Re: my application - the application is in statistics. Given N factors, there are 2^N - 1 possible terms in statistical model (see http://en.wikipedia.org/wiki/Analysis_of_variance#ANOVA_for_multiple_factors) that can contain the main effects (the factors alone) and various (2,3,... way) interactions between the factors. But an interaction can only be present in a model if all sub-interactions (or main effects) are present. For three factors a, b, and c, for example, the 3 way interaction a:b:c can only be in present if all the constituent two-way interactions (a:b, a:c, b:c) are present (and likewise for the two-ways). Thus, the model a + b + c + a:b + a:b:c would not be allowed. I'm looking for a quick way to generate all valid models.
It seems easier to think about this in terms of sets: you are looking for families of subsets of {1, ..., N} such that for each set in the family, all of its subsets are also present. Each such family is determined by its inclusion-wise maximal sets, which form an antichain (no one of them contains another). Antichains of sets are called Sperner families, so you are looking for Sperner families; each of your families is the downward closure of one. Possibly known algorithms for enumerating Sperner families or antichains in general are useful; without knowing what you actually want to do with them, it's hard to tell.
Thanks to #FalkHüffner's answer, I saw that what I wanted to do was equivalent to finding monotonic Boolean functions for N arguments. If you look at the figure on the Wikipedia page for Dedekind numbers (http://en.wikipedia.org/wiki/Dedekind_number) the figure expresses the problem graphically. There is an algorithm for generating monotonic Boolean functions (http://www.mathpages.com/home/kmath094.htm) and it is quite simple to construct.
For my purposes, I use the algorithm, then eliminate the first column and last row of the resulting binary arrays. Starting from the top row down, each row has a 1 in the i-th column if one can eliminate the i-th node.
Thanks!
You can build a "heap", in which at depth X are all the nodes with X zeros in their binary representation.
Then, starting from the bottom layer, connect each item to a random parent at the layer above, until you get a single-component graph.
Note that this graph is a tree, i.e., each node except for the root has exactly one parent.
Then, traverse the tree (starting from the root) and count the total number of paths in it.
UPDATE:
The method above is bad, because you cannot just pick a random parent for a given item - there is only a limited set of items that can serve as a "legal" parent. But I'm leaving this method here for other people to give their opinion (perhaps it is not that bad).
In any case, why don't you take your graph, extract a spanning tree (you can use Prim's or Kruskal's algorithm for finding a minimum spanning tree), and then count the number of paths in it?
Let me start with the version of genetic algorithm I am implementing. I apologize in advance for any terminology errors that I make here. Please feel free to correct me.
The chromosome for my problem is two dimensional: three rows and thirty-two columns. Essentially the alleles (values) are indexes contained in this chromosome.
How an Index is formulated
Each row-and-column position of the chromosome refers to a single gene. Each gene contains an integer value (0 - 30). A single column (I believe this is referred to as a genome) therefore gives an index into a four dimensional array containing user data on which the fitness function operates.
This is what a chromosome would look like:
11 22 33 14 27 15 16 ...
3 29 1 7 18 24 22 ...
29 9 16 10 14 21 3 ...
e.g. column 0 ==> data[11][3][29]
where
11 -> (0, 0); 0th row, 0th column
3 -> (1, 0); 1st row, 0th column
29 -> (2, 0); 2nd row, 0th column
For completeness, the fitness function works as follows: (for a single chromosome)
for first 10 iterations: (user 0 to 9)
for each column (genome)
consider gene value for first row as the first index of data array
consider gene value for the second row as the second index of data array
consider gene value for the third row as the third index of data array
so if the first column contains [11][3][29] user = 0
it refers to data[0][11][3][29]
SUM the data array value for all columns and save it
Do the same for all iterations (users)
for second 10 iterations: (user 10 to 19)
for each column (genome)
consider gene value for the SECOND row as the FIRST index of data array
consider gene value for the THIRD row as the SECOND index of data array
consider gene value for FIRST row as the THIRD index of data array
SUM the data array value for all columns and save it
Do the same for all iterations (users)
for third 10 iterations: (user 20 to 29)
for each column (genome)
consider gene value for the THIRD row as the FIRST index of data array
consider gene value for FIRST row as the SECOND index of data array
consider gene value for the SECOND row as the THIRD index of data array
SUM the data array value for all columns and save it
Do the same for all iterations (users)
Out of the 30 (sum) values calculated so far, assign the minimum value as the fitness value of this chromosome.
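The loops above can be condensed into one function; this is a sketch of the fitness computation as I read it (the 4-D data array shape and the per-10-user row rotation are assumptions taken from the description):

```python
def fitness(chromosome, data):
    """chromosome: 3 rows x 32 columns of ints in 0..30.
    data: 4-D array indexed as data[user][i][j][k]."""
    sums = []
    for user in range(30):
        shift = user // 10            # rows are rotated once per block of 10 users
        total = 0
        for col in range(32):
            i = chromosome[(0 + shift) % 3][col]
            j = chromosome[(1 + shift) % 3][col]
            k = chromosome[(2 + shift) % 3][col]
            total += data[user][i][j][k]
        sums.append(total)
    return min(sums)                  # fitness = minimum of the 30 sums
```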
The point of explaining the fitness function here is to explain the optimization problem I am dealing with. I am sorry I could not formulate it in mathematical notation; anyone who can do so is more than welcome to comment. Essentially it is maximizing the minimum X, where X refers to data contained in the data array. (Maximization is done over generations, where the highest-fitness chromosomes are selected for the next generations.)
Q1) I am using a single random number generator for the crossover and mutation probabilities. Generally speaking, is it correct to implement this with a single generator? I ask because the crossover rate I chose is 0.7 and the mutation rate 0.01. My random number generator produces uniformly distributed integers between 0 and 2^31 - 1. If a number generated by the random function lies under the threshold where it satisfies mutation, the same number also satisfies crossover. Does this affect the evolution process?
NOTE: the highest number the random number generator produces is 2147483647, and 1% of this value is 21474836. So whenever a number is less than 21474836, it indicates that this gene can be mutated - but the same number also indicates that crossover must be done. Shouldn't there be different generators?
Q2) I see that there is a relation between the genes in a column when calculating fitness. But when performing mutation, should all the genes be considered independent of each other, or should all the rows of a genome (column) be affected by mutation together?
Explanation
As I learned, in a binary string of e.g. 1000 bits where each bit corresponds to a gene, a mutation rate of 1% would mean about 1 out of 100 bits might get flipped. In my case however I have a chromosome which is 2D (3 rows, 32 columns). Should I consider all 96 genes independent of each other, or simply consider 32 genes and, whenever I need a flip, flip the whole column together? How does mutation work in a 2D chromosome?
Q3) Do I really have a correlation between rows here? I am a bit confused.
Explanation
I have a 2D chromosome whose column values together point to the data I have to use to calculate the fitness of this chromosome. The genetic algorithm manipulates chromosomes, whereas fitness is assigned via the data associated with a chromosome. My question is how a genetic algorithm should treat a 2D chromosome. Should there be a relation between the genes in a column? Can I get a reference to some paper/code where a 2D chromosome is manipulated?
I'm not sure if I understood the chromosome structure, but it doesn't matter; the concepts are the same:
1 - You have a chromosome object, which you can access the individual genes
2 - You have a fitness function, which takes a chromosome and outputs a value
3 - You have a selection function, which selects chromosomes to mate
4 - You have a crossover function, which generally takes 2 chromosomes, exchange genes between them and outputs two new chromosomes
5 - You have a mutation operator, which acts randomly on the genes of a chromosome
So
Q1) You can use a single random generator, there's no problem at all. But why are you using integer numbers? It's much easier to generate a random number in [0, 1).
Q2) This is up to you, but generally the genes are mutated randomly, independent of each other (mutation happens after the crossover, but i think you already know that).
EDIT: Yes, you should consider all the 96 genes independent of each other: visit every row and column, and mutate that gene with some probability p, so:

import random

for row in range(len(chromosome)):
    for col in range(len(chromosome[row])):
        if random.random() < p:                       # mutate with probability p
            chromosome[row][col] = random.randint(0, 30)  # replace with random noise
Q3) It's up to you to decide what the fitness function will do. If this chromosome is 'good' or 'bad' at solving your problem, then you should return a value that reflects that.
All the random numbers you use would typically be independently generated, so use one RNG or many, it doesn't matter. You should, however, generate a new number for each gene and for each of the crossover and mutation steps; if you use the same single random number for multiple purposes you will limit the explorable solution space.
To make your algorithm easier to understand, generate uniformly distributed floats in [0, 1) as r() = rand()/(2^31 - 1); then you can express things simply, for example:
if r() < 0.3
mutate()
I don't understand your other questions. Please rewrite them.
An improvement you can make to the mutation and crossover probabilities is to build a GA that chooses these probabilities by itself. Because any given set of probabilities (or a function of the number of runs that produces them) is always arbitrary, encode your operators inside the chromosomes.
For example, say you have two operators. Add a bit to the end of the chromosome, where 1 codes for mutation and 0 for crossover. When you apply operators to the parents, you will obtain children that carry the code for the operator to apply. In this way, the GA performs a double search: in the space of solutions and in the space of operators. The choice of operators is driven by the nature of your problem and by the concrete conditions of the run. During the run, the probabilities of both operators will change automatically so as to maximize your objective function.
The same goes for an arbitrary number of operators; you simply need more bits to encode them. I generally use four operators (three for crossover and one for mutation), and this mechanism works fine.
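A minimal sketch of this idea for a binary chromosome with one operator bit appended at the end (the function name and the choice of one-point crossover are illustrative, not prescribed by the scheme above):

```python
import random

def make_child(parent_a, parent_b):
    """Apply the operator encoded in the last gene: 1 = mutation, 0 = crossover.
    The operator bit itself can mutate, so the operator mix evolves over time."""
    if parent_a[-1] == 1:                     # mutation
        child = parent_a[:]
        i = random.randrange(len(child))
        child[i] = 1 - child[i]               # flip one randomly chosen bit
    else:                                     # one-point crossover
        cut = random.randrange(1, len(parent_a))
        child = parent_a[:cut] + parent_b[cut:]
    return child
```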
The source data for the subject is an m-by-n binary matrix (only 0s and 1s are allowed).
The m rows represent observations, the n columns represent features. Some observations are marked as targets, which need to be separated from the rest.
While it looks like a typical NN, SVM, etc. problem, I don't need generalization. What I need is an efficient algorithm to find as many combinations of columns (features) as possible that completely separate the targets from the other observations - that is, classify them.
For example:
f1 f2 f3
o1 1 1 0
t1 1 0 1
o2 0 1 1
Here {f1, f3} is an acceptable combo which separates target t1 from the rest (o1, o2). (By the way, {f2} is NOT acceptable, since by the task definition a feature MUST be present in a target.) In other words,
t1(f1) & t1(f3) = 1 and o1(f1) & o1(f3) = 0, o2(f1) & o2(f3) = 0
where '&' represents logical conjunction (AND).
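The acceptability test above can be written directly; a small sketch using 0/1 tuples for rows (indices here are 0-based, so {f1, f3} becomes {0, 2}):

```python
def separates(others, target, features):
    """True iff every chosen feature is 1 in the target row and, for each
    other observation, the AND over the chosen features is 0."""
    if not all(target[f] == 1 for f in features):
        return False              # a feature must be present in the target
    return all(any(row[f] == 0 for f in features) for row in others)
```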
m is about 100,000 and n is 1,000. Currently the data is packed into 128-bit words along m and the search is optimized with SSE4 and whatnot. Yet it takes way too long to obtain those feature combos. After 2 billion calls to the tree-descent routine it has covered about 15% of the root nodes, and found about 8,000 combos, which is a decent result for my particular application.
I use some empirical criteria to cut off less promising descent paths, with limited success, but is there something radically better? I'm pretty sure there has to be. Any help, in whatever form (a reference or a suggestion), would be appreciated.
I believe the problem you describe is NP-hard, so you shouldn't expect to find the optimum solution in a reasonable time. I do not understand your current algorithm, but here are some suggestions off the top of my head:
1) Construct a decision tree. Label targets as A and non-targets as B and let the decision tree learn the categorization. At each node select the feature such that a function of P(target | feature) and P(target' | feature') is maximum. (i.e. as many targets as possible fall to positive side and as many non-targets as possible fall to negative side)
2) Use a greedy algorithm. Start from the empty set and at each time step add the feature that kills the most non-target rows.
3) Use a randomized algorithm. Start from a small subset of positive features of some target, use the set as the seed for the greedy algorithm. Repeat many times. Pick the best solution. Greedy algorithm will be fast so it will be ok.
4) Use a genetic algorithm. Generate random seeds for the greedy algorithm as in 3 to generate good solutions, and cross them over (probably with a bitwise AND) to generate new candidate seeds. Remember the best solution. Keep good solutions as the current population. Repeat for many generations.
You will need to find the answer "how many of the given rows have the given feature f" fast so probably you'll need specialized data structures, perhaps using a BitArray for each feature.
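Suggestion 2 might look like this on a bitmask representation (one int per row, one bit per feature), which also fits the 128-bit packing mentioned in the question; this is only a sketch:

```python
def greedy_separator(others, target, n_features):
    """Greedily pick features present in the target until every other row has
    at least one chosen feature equal to 0. Returns a set of feature indices,
    or None if the greedy choice gets stuck."""
    candidates = [f for f in range(n_features) if target >> f & 1]
    alive = list(others)                 # rows not yet separated from the target
    chosen = set()
    while alive:
        # feature that kills (zeroes out) the most remaining rows
        best = max(candidates,
                   key=lambda f: sum(1 for r in alive if not r >> f & 1),
                   default=None)
        if best is None or all(r >> best & 1 for r in alive):
            return None                  # no candidate kills any remaining row
        chosen.add(best)
        candidates.remove(best)
        alive = [r for r in alive if r >> best & 1]
    return chosen
```

On the 3-feature example above (with f1 as bit 0), this recovers the combo {f1, f3}.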
I have a graph-theoretic (which is also related to combinatorics) problem that is illustrated below, and wonder what is the best approach to design an algorithm to solve it.
Given 4 different graphs of 6 nodes (by different, I mean different structures, e.g. STAR, LINE, COMPLETE, etc), and 24 unique objects, design an algorithm to assign these objects to these 4 graphs 4 times, so that the number of repeating neighbors on the graphs over the 4 assignments is minimized. For example, if object A and B are neighbors on 1 of the 4 graphs in one assignment, then in the best case, A and B will not be neighbors again in the other 3 assignments.
Obviously, the degree to which such minimization can go is dependent on the specific graph structures given. But I am more interested in a general solution here so that given any 4 graph structures, such minimization is guaranteed as the result of the algorithm.
Any suggestion/idea of solving this problem is welcome, and some pseudo-code may well be sufficient to illustrate the design. Thank you.
Representation:
You have 24 elements, which I will name A to X (the first 24 letters).
Each of these elements will have a place on one of the 4 graphs. I will number the 24 nodes of the 4 graphs from 1 to 24.
I will identify the position of A by a 24-tuple (xa1, xa2, ..., xa24); if I want to assign A to node number 8, for example, I will write (xa1, xa2, ..., xa24) = (0,0,0,0,0,0,0,1,0,...,0), where the 1 is in position 8.
We can say that A = (xa1, ..., xa24).
e1 ... e24 are the unit vectors (1,0,...,0) to (0,0,...,1).
A note about the operator '.' (dot product):
A.e1 = xa1
...
X.e24 = xx24
With these notations there are some constraints on A, ..., X:
each xzi is in {0,1}
and
Sum(xa_i) = 1 ... Sum(xx_i) = 1 (each element is assigned to exactly one node)
Sum(xa1, xb1, ..., xx1) = 1 ... Sum(xa24, xb24, ..., xx24) = 1 (each node holds exactly one element)
I will define a graph by the neighbor relation of each node; let's say node 8 has neighbors node 7 and node 10.
To check that A and B are neighbors at node 8, for example, I need:
A.e8 = 1 and (B.e7 = 1 or B.e10 = 1), so I just need A.e8*(B.e7 + B.e10) == 1
In the function isNeighborInGraphs(A,B) I test this for every node and get one or zero depending on the neighborhood.
Notations:
4 graphs of 6 nodes, the position of each element is defined by an integer from 1 to 24.
(1 to 6 for first graph, etc...)
e1... e24 are the unit vectors (1,0,0...0) to (0,0...1)
Let A, B ...X be the N elements.
A=(0,0...,1,...,0)=(xa1,xa2...xa24)
B=...
...
X=(0,0...,1,...,0)
Graph descriptions:
isNeighborInGraphs(A,B) = A.e1*B.e2 + ...   // one term for each pair of nodes that are neighbors in one of the graphs; here, for example, nodes 1 and 2 are neighbors in one graph
State of the system:
L(A) = [B, B, C, E, G, ...]   // list of neighbors of A (can repeat)
actualise(L(A)):
for element in [B..X]
    if isNeighborInGraphs(A, element)
        L(A).append(element)
    endIf
endFor
Objective functions
N(A) = len(L(A)) + Sum(isNeighborInGraphs(A, i), i in L(A))
...
N(X)= ...
Description of the algorithm:
1. Start with an initial position: A = e1, ..., X = e24.
2. Update L(A), L(B), ..., L(X).
3. Solve this (with a solver; AMPL, for example, should work, since it is a nonlinear optimization problem):
Objective function:
min(Sum(N(Z), Z = A to X))
Constraints:
Sum(xa_i) = 1 ... Sum(xx_i) = 1
Sum(xa1, xb1, ..., xx1) = 1 ... Sum(xa24, xb24, ..., xx24) = 1
You get the best solution.
4. Repeat steps 2 and 3, 3 more times.
If all four graphs are K_6, then the best you can do is choose 4 set partitions of your 24 objects into 4 sets each of cardinality 6 so that the pairwise intersection of any two sets has cardinality at most 2. You can do this by choosing set partitions that are maximally far apart in the Hasse diagram of set partitions with partial order given by refinement. The general case is much harder, but perhaps you can still begin with this crude approximation of a solution and then be clever with which vertex is assigned which object in the four assignments.
Assuming you don't want to cycle through all combinations, calculating the sum every time and choosing the lowest, you can formulate this as a minimization problem (solved, depending on your constraints, with either a linear programming solver, e.g. a simplex-based engine, or a nonlinear solver, which is much harder in terms of time) with constraints on your 24 variables depending on the shape of your graphs. You can also use free software like LINGO/LINDO to rapidly create a decision-theory model and test its correctness (you need some decision theory notions, though).
If this has anything to do with the real world, then it's unlikely that you absolutely must have a solution that is the true minimum. Close to the minimum should be good enough, right? If so, you could repeatedly make the 4 assignments at random and check the results until you either run out of time, have a good-enough solution, or appear to have stopped improving your best solution.
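The random-restart idea can be sketched as follows; the encoding (each graph as an edge list on local nodes 0-5, each assignment as a permutation of the 24 objects over the 24 node slots) is my own choice for illustration:

```python
import random

def repeated_pairs(assignments, graphs):
    """Count neighbor pairs that show up in more than one of the 4 assignments."""
    seen, repeats = set(), 0
    for perm in assignments:
        for g, edges in enumerate(graphs):
            for u, v in edges:
                a, b = perm[6 * g + u], perm[6 * g + v]   # objects at the edge's ends
                pair = (min(a, b), max(a, b))
                if pair in seen:
                    repeats += 1
                seen.add(pair)
    return repeats

def random_search(graphs, tries=1000):
    """Keep the best of `tries` random 4-assignment candidates."""
    best, best_score = None, float('inf')
    for _ in range(tries):
        candidate = [random.sample(range(24), 24) for _ in range(4)]
        score = repeated_pairs(candidate, graphs)
        if score < best_score:
            best, best_score = candidate, score
    return best, best_score
```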