Given tabular data, I understand how to create a decision tree from scratch. I program in Python: we create a `Node` class, and inside the class we can add functions for making children, the root, splits, etc.
Maybe videos like this one explain the process:
https://www.youtube.com/watch?v=sgQAhG5Q7iY&ab_channel=NormalizedNerd
I want to write a class for a decision space: one that represents the data points of the tabular data in a plane or higher-dimensional space, and then creates a partition based on the decision tree.
So let's say we have the decision tree given in the following picture. The data set has two features, age and degree. There is a split at age = 4, and then degree has a split at degree = 2. I want to define a class that returns the right-hand side of the picture.
A decision space is more general than a decision tree. Is there an implementation in Python from scratch? I am guessing we convert the decision tree to a rule set, and then from the rule set we can form the decision space?
Definition 1. A decision space is an m-dimensional (product) space $Dattr_1$ × · · · × $Dattr_i$ × · · · × $Dattr_m$, where m is the number of attributes and $Dattr_i$ is the space covered by the i-th attribute, specified as a bounded poset (i.e., a partially ordered set with a least and a greatest element).
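For what it's worth, converting a tree into the partition it induces can be sketched as follows. This is a minimal sketch assuming axis-aligned binary splits; the `Node` layout and every name here are hypothetical, not from any library:

```python
class Node:
    # hypothetical tree node: internal nodes carry (feature, threshold),
    # leaves carry a predicted label
    def __init__(self, feature=None, threshold=None,
                 left=None, right=None, label=None):
        self.feature, self.threshold = feature, threshold
        self.left, self.right, self.label = left, right, label

class DecisionSpace:
    """Partition of the m-dimensional attribute space induced by a
    decision tree: one axis-aligned box (region) per leaf."""
    def __init__(self, tree, bounds):
        # bounds: one (low, high) interval per attribute, i.e. the Dattr_i
        self.regions = []                      # list of (box, label) pairs
        self._partition(tree, list(bounds))

    def _partition(self, node, box):
        if node.label is not None:             # leaf: record its region
            self.regions.append((box, node.label))
            return
        f, (lo, hi) = node.feature, box[node.feature]
        left_box, right_box = list(box), list(box)
        left_box[f] = (lo, node.threshold)     # feature <= threshold side
        right_box[f] = (node.threshold, hi)    # feature > threshold side
        self._partition(node.left, left_box)
        self._partition(node.right, right_box)

# the example from the question: split on age at 4, then on degree at 2
tree = Node(feature=0, threshold=4, left=Node(label="A"),
            right=Node(feature=1, threshold=2,
                       left=Node(label="B"), right=Node(label="C")))
space = DecisionSpace(tree, bounds=[(0, 10), (0, 5)])
```

`space.regions` then holds the three rectangles that tile the age × degree plane, which is exactly the right-hand side of the picture.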
Related
I came across the class-center-based fuzzification algorithm on page 16 of this research paper on TRFDT. However, I fail to understand what is happening in step 2 of this algorithm (titled in the paper as Algorithm 2: Fuzzification). If someone could explain it with a small example, that would certainly be helpful.
It is not clear from your question which parts of the article you understand and IMHO the article is written in not the clearest possible way, so this is going to be a long answer.
Let's start with some intuition behind this article. In short I'd say it is: "let's add fuzziness everywhere to decision trees".
How does a decision tree work? We have a classification problem, and we say that instead of analyzing all attributes of a data point in a holistic way, we analyze them one by one in an order defined by the tree, navigating the tree until we reach some leaf node. The label at that leaf node is our prediction. So the trick is how to build a good tree, i.e. a good order of attributes and good splitting points. This is a well-studied problem, and the idea is to build a tree that encodes as much information as possible by some metric. There are several such metrics, and this article uses entropy, which is similar to the widely used information gain.
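As a concrete illustration of how splitting points are scored (this is the generic entropy-drop computation, not code from the article):

```python
import math
from collections import Counter

def entropy(labels):
    # Shannon entropy of a list of class labels, in bits
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def information_gain(left, right):
    # entropy of the parent node minus the weighted average
    # entropy of the two children produced by a candidate split
    parent = left + right
    n = len(parent)
    children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - children
```

A split that separates the classes perfectly, e.g. `information_gain([0, 0], [1, 1])`, gets the maximal gain of 1 bit, while a useless split like `information_gain([0, 1], [0, 1])` gets 0.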
The next idea is that we can treat the classification (i.e. the split of the values into classes) as fuzzy rather than exact (aka "crisp"). The idea here is that in many real-life situations not all members of a class are equally representative: some are more "core" examples and some are more "edge" examples. If we can catch this difference, we can provide a better classification.
And finally there is a question of how similar the data points are (generally or by some subset of attributes) and here we can also have a fuzzy answer (see formulas 6-8).
So the idea of the main algorithm (Algorithm 1) is the same as in the ID3 tree: recursively find the attribute a* that classifies the data in the best way and perform the best split along that attribute. The main difference is in how the information gain for the best attribute selection is measured (see heuristic in formulas 20-24) and that because of fuzziness the usual stop rule of "only one class left" doesn't work anymore and thus another entropy (Kosko fuzzy entropy in 25) is used to decide if it is time to stop.
Given this skeleton of Algorithm 1, there are quite a few parts that you can (or should) select:
How do you measure μ(ai)τ(Cj)(x) used in (20) (this is a measure of how well x represents the class Cj with respect to attribute ai, note that here being not in Cj and far from the points in Cj is also good) with two obvious choices of the lower (16 and 18) and the upper bounds (17 and 19)
How do you measure μRτ(x, y) used in (16-19). Given that R is induced by ai this becomes μ(ai)τ(x, y) which is a measure of similarity between two points with respect to attribute ai. Here you can choose one of the metrics (6-8)
How do you measure μCi(y) used in (16-19). This is the measure of how well the point y fits in the class Ci. If you already have data as fuzzy classification, there is nothing you should do here. But if your input classification is crisp, then you should somehow produce μCi(y) from that and this is what the Algorithm 2 does.
There is a trivial solution of μCj(xi) = "1 if xi ∈ Cj and 0 otherwise" but this is not fuzzy at all. The process of building fuzzy data is called "fuzzification". The idea behind the Algorithm 2 is that we assume that every class Cj is actually some kind of a cluster in the space of attributes. And so we can measure the degree of membership μCj(xi) as the distance from the xi to the center of the cluster cj (the closer we are, the higher the membership should be so it is really some inverse of a distance). Note that since distance is measured by attributes, you should normalize your attributes somehow or one of them might dominate the distance. And this is exactly what the Algorithm 2 does:
it estimates the center of the cluster for class Cj as the center of mass of all the known points in that class i.e. just an average of all points by each coordinate (attribute).
it calculates the distances from each point xi to each estimated center of class cj
Looking at the formula at step #12, it uses the inverse square of the distance as a measure of proximity and just normalizes the value, because for fuzzy sets Sum[over all Cj](μCj(xi)) should be 1.
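A small from-scratch sketch of this reading of Algorithm 2 (my interpretation, not the paper's code; it assumes the attributes are already normalized, as discussed above):

```python
def fuzzify(X, y):
    # class-centre fuzzification: membership of x_i in class C_j falls with
    # the squared distance to the class centre c_j, normalised so that the
    # memberships of each point sum to 1
    classes = sorted(set(y))
    centers = {}
    for c in classes:                              # centre = mean of the class
        pts = [x for x, lab in zip(X, y) if lab == c]
        centers[c] = [sum(col) / len(pts) for col in zip(*pts)]
    memberships = []
    for x in X:
        d2 = [max(sum((a - b) ** 2 for a, b in zip(x, centers[c])), 1e-12)
              for c in classes]                    # squared distances
        inv = [1.0 / d for d in d2]                # inverse square (step #12)
        s = sum(inv)
        memberships.append([v / s for v in inv])
    return memberships
```

On a toy set with two well-separated classes, each point ends up with a membership close to 1 for its own class and close to 0 for the other.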
Normally, at each node of the decision tree, we consider all features and all splitting points for each feature. We calculate the difference between the entropy of the entire node and the weighted avg of the entropies of potential left and right branches, and the feature + splitting feature_value that gives us the greatest entropy drop is chosen as the splitting criterion for that particular node.
Can someone explain why the above process, which requires (2^m -2)/2 tries for each feature at each node, where m is the number of distinct feature_values at the node, is the same as trying ONLY m-1 splits:
sort the m distinct feature_values by the percentage of 1's among the samples within the node that take that feature_value for that feature.
Only try the m-1 ways of splitting the sorted list.
This 'trying only m-1 splits' method is mentioned as a 'shortcut' in the article below, which (by definition of 'shortcut') means the results of the two methods which differ drastically in runtime are exactly the same.
The quote:"For regression and binary classification problems, with K = 2 response classes, there is a computational shortcut [1]. The tree can order the categories by mean response (for regression) or class probability for one of the classes (for classification). Then, the optimal split is one of the L – 1 splits for the ordered list. "
The article:
http://www.mathworks.com/help/stats/splitting-categorical-predictors-for-multiclass-classification.html?s_tid=gn_loc_drop&requestedDomain=uk.mathworks.com
Note that I'm talking only about categorical variables.
Can someone explain why the above process, which requires (2^m -2)/2 tries for each feature at each node, where m is the number of distinct feature_values at the node, is the same as trying ONLY m-1 splits:
The answer is simple: the two procedures just aren't the same. As you noticed, splitting in the exact way is an NP-hard problem and thus hardly feasible for any problem in practice. Moreover, due to overfitting, it would usually not give the optimal result in terms of generalization.
Instead, the exhaustive search is replaced by some kind of greedy procedure which goes like: sort first, then try all ordered splits. In general this leads to different results than the exact splitting.
In order to improve on the greedy result, one often applies pruning afterwards (which can be seen as another greedy, heuristic method). And newer methods like random forests or BART deal with this problem effectively by averaging over several trees, so that the deviation of a single tree becomes less important.
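For the binary case, the sort-then-try-ordered-splits procedure itself can be sketched as follows (a sketch assuming entropy as the impurity measure; all names are mine):

```python
import math
from collections import Counter, defaultdict

def entropy(counts):
    # entropy of a node given a Counter of its class labels
    n = sum(counts.values())
    return -sum(c / n * math.log2(c / n) for c in counts.values() if c)

def best_categorical_split(values, labels):
    # order the categories by their fraction of positive labels, then
    # evaluate only the m - 1 contiguous splits of the ordered list
    by_cat = defaultdict(Counter)
    for v, l in zip(values, labels):
        by_cat[v][l] += 1
    cats = sorted(by_cat, key=lambda c: by_cat[c][1] / sum(by_cat[c].values()))
    parent = Counter(labels)
    n, h_parent = len(labels), entropy(parent)
    best_left, best_gain = None, -1.0
    left = Counter()
    for i in range(len(cats) - 1):        # the m - 1 ordered splits
        left = left + by_cat[cats[i]]
        right = parent - left
        nl = sum(left.values())
        h = (nl / n) * entropy(left) + ((n - nl) / n) * entropy(right)
        if h_parent - h > best_gain:
            best_left, best_gain = set(cats[:i + 1]), h_parent - h
    return best_left, best_gain
```

The function returns the set of categories sent to the left child and the entropy drop of the best ordered split, visiting only m - 1 candidates instead of (2^m - 2)/2.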
I stumbled across the following novelty website for generating "rotation anagrams" of a given input text called Sokumenzu / Side View Generator which produces animated versions of results such as the following:
I'm really keen to understand the algorithm behind it. I've tried seeing if any of the publicly exposed JavaScript can help me do so, but it is a mess of obfuscation, and my grasp of the language isn't firm enough for me to be sure that the real work isn't being done server-side.
I have a rough outline of how I think a similar system could be built, but it would have its own shortcomings (and maybe a small advantage, if the real approach is hard-coded):
Pre-computation
Define an nxnxn cube composed of equally sized sub-cubes
Each sub-cube may either contain a sphere or not
Create a virtual camera orthogonal to one of the cube's faces a fixed distance away
For each of the possible states of the cube:
Cast rays from the camera and build up an nxn matrix of which cells appear occupied from the camera's point of view.
Input this matrix into a neural network / other recognizer which has been pre-trained on the latin alphabet.
If the recognizer matches a character:
Add the state which triggered recognition to a hashtable indexed on the character it recognized.
Handle collisions (there should be many) by keeping the highest confidence recognition
For every key in the hashtable
Rotate the corresponding state in fixed increments recognizing characters as before
If a character other than the current key is recognized:
Add that character and the amount of rotation performed to a tuple in a list.
Store each of these lists in the hashtable indexed on the current key.
Query
Generate all of the permutations achieved by substituting, at each position, each of the characters linked in the list associated with the input character at that position.
Find the first dictionary word in the list of permutations
Visualize using the rotation information stored for each character
Obviously this isn't the same algorithm as is used since this operates on a character by character basis. I suppose that you could use a similar approach on a word by word basis taking the face of the entire volume as input to a text recognizer but I'm sure that they have probably done something simpler, cleverer, and far far more efficient.
The one advantage of this terrible terrible idea is that by retraining the recognizer you could support other character sets.
Anyone know how this actually works?
I think it is way simpler than that.
For each pair of letters (which are 2-D objects), you can try to find a 3-D object that projects to one or the other depending on whether it is seen from a 0° angle or a 90° one.
Finding a set of 3-D points on a grid that projects to two given sets of 2-D points, depending on the direction of projection, looks like a problem of discrete tomography, which you can read about on Wikipedia: https://en.wikipedia.org/wiki/Discrete_tomography
Note that you can process the 3-D shape line by line, and so actually only solve 2-D instances.
Once that pre-computation is done, and you have a graph of letters where two letters are linked if there is a 3-D shape that produces the two of them from different angles, I suspect the algorithm works as follows:
Compute the set of letters of the original word. Then explore all the sets of letters you can reach by changing a letter of the input to one it is linked to. When you find a set of letters that can make a word, stop. (There is probably a pre-computed dictionary that maps sets of letters to words.)
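That search might be sketched roughly like this (a guess at the procedure, not the site's actual code; the `linked` graph is assumed to come from the tomography pre-computation):

```python
from collections import deque

def find_partner_word(word, linked, dictionary):
    # BFS over letter multisets: each step swaps one letter for a letter
    # it is linked to (i.e. some 3-D shape projects to both letters);
    # words are matched by their sorted letter tuple, so order is ignored
    by_letters = {tuple(sorted(w)): w for w in dictionary}
    start = tuple(sorted(word))
    seen, queue = {start}, deque([start])
    while queue:
        s = queue.popleft()
        if s != start and s in by_letters:
            return by_letters[s]          # a letter set that makes a word
        for i, ch in enumerate(s):
            for alt in linked.get(ch, ()):
                cand = tuple(sorted(s[:i] + (alt,) + s[i + 1:]))
                if cand not in seen:
                    seen.add(cand)
                    queue.append(cand)
    return None                           # mirrors the "Lydia" failure case
```

Because it is breadth-first, it finds a partner word reachable with the fewest letter substitutions, which seems like a sensible tie-breaking choice.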
If a 3-D shape needs to project to a different part of the word (that is, you need a shape that projects to position 2 or position 4 of the word depending on whether it is the original word or the other one, like u and v in D(u)ncan - Une(v)en), you compute the appropriate permutation matrix. Like this for your name:
DU__________
______uv____
__nn________
____ce______
________ae__
__________nn
(first letter of each pair is the letter projected to the left, second letter the one projected to the bottom)
It is computed from the permutation matrix:
100000
000100
010000
001000
000010
000001
And the matching of the letters. (D-U, c-e ...).
The linked demo appears to work with precomputed or maybe even handcrafted letter pairs. A.N. has pointed out how to rearrange the letter pairs. Note that the Sokumenzu sometimes fails to generate another word, for example with "Lydia"; it then maps the word to itself. It also tries an all-uppercase version when it doesn't find a match with the given case.
If you want a more general solution, you can combine arbitrary bitmaps of the same height provided that each row that has at least one pixel in the first bitmap also has a pixel in the second bitmap. (So you can't have an i on one face and an l on the other, because the l has a pixel in the row between the tittle and the stem of the i, where the i has none.)
You can create the layers of your cube independently. Your aim is to have at least one pixel that projects to every pixel of the two bitmaps.
Create a list of the positions of the filled pixels of each bitmap. If one of the lists is empty, there is no solution. Create pairs out of the lists so that all pixels from the longer list are used and each pixel from the shorter list is used at least once.
Depending on the top-down appearance you want to achieve, you can use various algorithms. If you want to create a diagonal, you can use Bresenham arithmetic. You can also create a scattered appearance by assigning the pairs randomly, as long as you use every pixel at least once.
To illustrate:
· · · · · · · · · ·
· · ● · · · · · ○ ○ #
· · · · · · · · · ·
· ● · ○ · · ○ · ● · #
· ○ ○ ● · · ● · · ● #
# # # # # #
Both the discs and the circles will produce the hash patterns, although the circles will look tidier when seen from above and will probably also work better in perspective.
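One layer of that pairing could be sketched like this (an illustration of the idea above, not the demo's code; `pair_row` and its even-spreading rule are my own choices):

```python
def pair_row(front, side):
    # front, side: sorted column positions of the filled pixels in the
    # corresponding rows of the two bitmaps; returns the (x, z) cells to
    # fill so the layer projects onto both rows, or None when one row is
    # filled and the other empty (the i-vs-l case above)
    if not front or not side:
        return [] if not front and not side else None
    longer, shorter = (front, side) if len(front) >= len(side) else (side, front)
    cells = []
    for i, p in enumerate(longer):
        # spread the shorter list evenly over the longer one, so every
        # pixel of both lists is used at least once
        q = shorter[i * len(shorter) // len(longer)]
        cells.append((p, q) if longer is front else (q, p))
    return cells
```

Replacing the even spread with a random assignment (again covering every pixel) would give the scattered top-down look mentioned above.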
I am trying to implement a classification toolkit for univariate network data using igraph and python.
However, my question is actually more of an algorithms question in the relational classification area than a programming one.
I am following Classification in Networked Data paper.
I am having difficulty understanding what this paper refers to as the "Network-Only Bayes Classifier" (NBC), which is one of the relational classifiers explained in the paper.
I implemented Naive Bayes classifier for text data using bag of words feature representation earlier. And the idea of Naive Bayes on text data is clear on my mind.
I think this method (NBC) is a simple translation of the same idea to the relational classification area. However, I am confused by the notation used in the equations, so I couldn't figure out what is going on. I also have a question on the notation used in the paper here.
NBC is explained on page 14 of the paper.
Summary:
I need the pseudo-code of the "Network-Only Bayes Classifier"(NBC) explained in the paper, page 14.
Pseudo-code notation:
Let's call vs the list of vertices in the graph. len(vs) is the
length. vs[i] is the ith vertex.
Let's assume we have a univariate and binary scenario, i.e., vs[i].class is either 0 or 1 and there is no other given feature of a node.
Let's assume we run a local classifier first, so that every node has an initial label calculated by the local classifier. I am only interested in the relational classifier part.
Let's call v the vertex we are trying to predict, and v.neighbors() is the list of vertices which are neighbors of v.
Let's assume all the edge weights are 1.
Now, I need the pseudo-code for:
def NBC(vs, v):
    # v.class is 0 or 1
    # v.neighbors is list of neighbor vertices
    # vs is the list of all vertices
    # This function returns 0 or 1
Edit:
To make your job easier, I did this example. I need the answer for the last 2 equations.
In words...
The probability that node x_i belongs to the class c is equal to:
The probability of the neighbourhood of x_i (called N_i) if x_i indeed belonged to the class c; Multiplied by ...
The probability of the class c itself; Divided by ...
The probability of the neighbourhood N_i (of node x_i) itself.
As far as the probability of the neighbourhood N_i (of x_i) if x_i were to belong to the class c is concerned, it is equal to:
A product of some probability; (which probability?)
The probability that some node (v_j) of the neighbourhood (N_i) belongs to the class c if x_i indeed belonged to the class c
(raised to the weight of the edge connecting the node that is being examined and the node that is being classified... but you are not interested in this... yet). (The notation is a bit off here, I think: why do they define v_j and then never use it? ...Whatever.)
Finally, multiply the product by some 1/Z. Why? Because all the ps are probabilities and therefore lie within the range 0 to 1, but the weights w could be anything, meaning that in the end the calculated value could fall out of that range.
The probability that some x_i belongs to a class c GIVEN THE
EVIDENCE FROM ITS NEIGHBOURHOOD, is a posterior probability. (AFTER
something...What is this something? ... Please see below)
The probability of appearance of neighbourhood N_i if x_i
belonged to the class c is the likelihood.
The probability of the class c itself is the prior probability.
BEFORE something...What is this something? The evidence. The prior
tells you the probability of the class without any evidence
presented, but the posterior tells you the probability of a specific
event (that x_i belongs to c) GIVEN THE EVIDENCE FROM ITS
NEIGHBOURHOOD.
The prior can be subjective. That is, it can be derived from limited observations OR be an informed opinion. In other words, it doesn't have to be a population distribution; it only has to be accurate enough, not absolutely known.
The likelihood is a bit more challenging. Although we have a formula here, the likelihood must be estimated from a large enough population, or from as much "physical" knowledge about the phenomenon being observed as possible.
Within the product (capital letter Pi in the second equation that expresses the likelihood) you have a conditional. The conditional is the probability that a neighbourhood node belongs to some class if x belonged to class c.
In the typical application of the Naive Bayes classifier, that is, document classification (e.g. spam mail), the conditional that an email is spam GIVEN THE APPEARANCE OF SPECIFIC WORDS IN ITS BODY is derived from a huge database of observations, or, a huge database of emails that we really, absolutely know which class they belong to. In other words, I must have an idea of what a spam email looks like, and eventually the majority of spam emails converge to having some common theme ("I am some bank official and I have a money opportunity for you; give me your bank details so I can wire money to you and make you rich...").
Without this knowledge, we can't use Bayes rule.
So, to get back to your specific problem. In your PDF, you have a question mark in the derivation of the product.
Exactly.
So the real question here is: What is the likelihood from your Graph / data?
(...or Where are you going to derive it from? (obviously, either a large number of known observations OR some knowledge about the phenomenon. For example, what is the likelihood that a node is infected given that a proportion of its neighbourhood is infected too)).
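Putting the pieces together, here is a Python sketch of NBC under the assumptions stated in the question (unit edge weights, binary univariate labels) plus one of mine: the likelihood is estimated by counting over the labelled graph. The `Vertex` class and all names are mine, not from the paper or igraph:

```python
from collections import defaultdict

class Vertex:
    # minimal stand-in for an igraph vertex
    def __init__(self, cls):
        self.cls, self.neighbors = cls, []

def train_nbc(vs):
    # estimate the prior P(c) and the conditional
    # P(neighbour's class = k | node's class = c) by counting
    prior = defaultdict(float)
    cond = defaultdict(lambda: defaultdict(float))
    for v in vs:
        prior[v.cls] += 1
        for u in v.neighbors:
            cond[v.cls][u.cls] += 1
    for c in cond:
        total = sum(cond[c].values())
        for k in cond[c]:
            cond[c][k] /= total
    for c in prior:
        prior[c] /= len(vs)
    return prior, cond

def nbc(prior, cond, v):
    # posterior P(c | N_v) proportional to
    #   P(c) * product over neighbours u of P(u.cls | c)
    # (edge weights assumed 1, as in the question)
    scores = {}
    for c in prior:
        p = prior[c]
        for u in v.neighbors:
            p *= cond[c].get(u.cls, 1e-9)   # smooth unseen combinations
        scores[c] = p
    z = sum(scores.values())                # the 1/Z normalisation
    return {c: s / z for c, s in scores.items()}
```

The predicted label is then simply the class with the highest posterior, e.g. `max(post, key=post.get)`.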
I hope this helps.
Relaxed Radix Balanced Trees (RRB-trees) are a generalization of immutable vectors (used in Clojure and Scala) that have 'effectively constant' indexing and update times. RRB-trees maintain efficient indexing and update but also allow efficient concatenation (log n).
The authors present the data structure in a way that I find hard to follow. I am not quite sure what the invariant is that each node maintains.
In section 2.5, they describe their algorithm. I think they are ensuring that indexing into the node will only ever require e extra steps of linear search after radix searching. I do not understand how they derived their formula for the extra steps, and I think perhaps I'm not sure what each of the variables mean (in particular "a total of p sub-tree branches").
How does the RRB-tree concatenation algorithm work?
They do describe an invariant in section 2.4 "However, as mentioned earlier
B-Trees nodes do not facilitate radix searching. Instead we chose
the initial invariant of allowing the node sizes to range between m
and m - 1. This defines a family of balanced trees starting with
well known 2-3 trees, 3-4 trees and (for m=32) 31-32 trees. This
invariant ensures balancing and achieves radix branch search in the
majority of cases. Occasionally a few step linear search is needed
after the radix search to find the correct branch.
The extra steps required increase at the higher levels."
Looking at their formula, it looks like they have worked out the maximum and minimum possible number of values stored in a subtree. The difference between the two is the maximum possible difference between the maximum and minimum number of values underneath a point. If you divide this by the number of values underneath a slot, you have the maximum number of slots you could be off by when you work out which slot to look at to see if it contains the index you are searching for.
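In code, the radix-guess-then-linear-scan that this invariant enables looks roughly like this (a sketch modeled on published RRB implementations, not the paper's code):

```python
def find_slot(sizes, index, shift):
    # sizes: cumulative element counts of a relaxed node's children;
    # shift: the bit shift a strict node at this level would use
    slot = index >> shift        # radix guess, exact if all subtrees are full
    while sizes[slot] <= index:  # the "few step linear search": underfull
        slot += 1                # (relaxed) subtrees push the answer right
    return slot
```

The radix guess is always a lower bound, because relaxed subtrees hold at most as many elements as strict ones, so the scan only ever moves right, and the number of extra steps is bounded by the slot-count difference worked out above.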
@mcdowella is correct: that's what they say about relaxed nodes. But if you're splitting and joining nodes, a range from m to m-1 means you will sometimes have to adjust up to m-1 (m-2?) nodes in order to add or remove a single element from a node. This seems horribly inefficient. I think they meant between m and (2m) - 1, because this allows nodes to be split into 2 when they get too big, or 2 nodes to be joined into one when they are too small, without ever needing to change a third node. So it's a typo that the "2" is missing in "2m" in the paper. Jean Niklas L'orange's master's thesis backs me up on this.
Furthermore, all strict nodes have the same length which must be a power of 2. The reason for this is an optimization in Rich Hickey's Clojure PersistentVector. Well, I think the important thing is to pack all strict nodes left (more on this later) so you don't have to guess which branch of the tree to descend. But being able to bit-shift and bit-mask instead of divide is a nice bonus. I didn't time the get() operation on a relaxed Scala Vector, but the relaxed Paguro vector is about 10x slower than the strict one. So it makes every effort to be as strict as possible, even producing 2 strict levels if you repeatedly insert at 0.
Their tree also has an even height - all leaf nodes are equal distance from the root. I think it would still work if relaxed trees had to be within, say, one level of one-another, though not sure what that would buy you.
Relaxed nodes can have strict children, but not vice-versa.
Strict nodes must be filled from the left (low-index) without gaps. Any non-full Strict nodes must be on the right-hand (high-index) edge of the tree. All Strict leaf nodes can always be full if you do appends in a focus or tail (more on that below).
You can see most of the invariants by searching for the debugValidate() methods in the Paguro implementation. That's not their paper, but it's mostly based on it. Actually, the "display" variables in the Scala implementation aren't mentioned in the paper either. If you're going to study this stuff, you probably want to start by taking a good look at the Clojure PersistentVector, because the RRB tree has one inside it. The two differences between that and the RRB tree are: 1. the RRB tree allows "relaxed" nodes, and 2. the RRB tree may have a "focus" instead of a "tail." Both focus and tail are small buffers (maybe the same size as a strict leaf node), the difference being that the focus will probably be localized to whatever area of the vector was last inserted/appended to, while the tail is always at the end (PersistentVector can only be appended to, never inserted into). These two differences are what allow O(log n) arbitrary inserts and removals, plus O(log n) split() and join() operations.