I have to generate a random tree (the data structure, not the graphical one) given some parameters: at least the mean depth and the mean number of children of the nodes (as floats). There are no other constraints (for now at least).
I really don't know this field so maybe there is something obvious I missed when I googled but I couldn't find anything... Maze generation algorithms looked interesting but they don't have these parameters as far as I can tell.
So please, tell me if this is possible at all, and if it is, give me some pointers, or even keywords to search for.
Thanks
You can start by writing a procedure gen that creates a tree of a given height in which every node gets a random number of children. So that the mean comes out at roughly avgChildCount, the number of children is picked uniformly from the range:
[0, (avgChildCount*2).toInt]
Once you have this procedure, you can introduce a second one that takes the two averages, avgHeight and avgChildCount, and calls gen with a random height:
gen(random(0, (avgHeight*2).toInt), avgChildCount)
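The two procedures above could be sketched in Python roughly as follows. The nested-list tree representation, the function names and the rng parameter are assumptions made for illustration. Note that the averages are only approximate: truncating avg*2 to an integer shifts the mean, and a node that draws 0 children cuts its branch short of the requested height.

```python
import random

def gen(height, avg_child_count, rng=random):
    """Build a tree as nested lists: a node is simply the list of its children."""
    if height <= 0:
        return []  # leaf: no children
    # Uniform in [0, 2*avg], so the expected number of children is about avg.
    n_children = rng.randint(0, int(avg_child_count * 2))
    return [gen(height - 1, avg_child_count, rng) for _ in range(n_children)]

def gen_random_tree(avg_height, avg_child_count, rng=random):
    """Pick a random height with mean about avg_height, then delegate to gen."""
    height = rng.randint(0, int(avg_height * 2))
    return gen(height, avg_child_count, rng)

def depth(tree):
    """Number of edges on the longest root-to-leaf path."""
    return 0 if not tree else 1 + max(depth(child) for child in tree)
```

For example, gen_random_tree(3.0, 2.0) returns a tree whose depth is at most 6 and whose nodes have at most 4 children each.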
I am trying to use the Viterbi min-sum algorithm which tries to find the pathway through a bunch of nodes that minimizes the overall Hamming distance (fancy term for "xor two numbers and count the resulting bits") against some fixed input.
I understand how to use DP to compute the minimal overall distance, but I am having trouble using it to also capture the path that corresponds to that minimal distance.
It seems like memoizing the path at each node would be really memory-intensive. Is there a standard way to handle these kinds of problems?
Edit:
http://i.imgur.com/EugiEWG.jpg
Here is a sample trellis with what I am talking about. The general idea is to find the path through the trellis that most closely emulates the input bitstring, with minimal error (measured by minimizing overall Hamming distance, or the number of mismatched bits).
As you can see, the first chunk of my input string is 01, and I can traverse there in column 1 of the trellis. The next chunk is 10, and I can move there in column 2. Next chunk is 11. Fine so far. Next chunk is 10, which is a problem because I can't reach that state from where I am now, so I have to go to the next best thing (00) and the rest can be filled fine.
But this can become more complex. I'd need to be able to somehow get the corresponding path to the minimal Hamming distance.
(The point of this exercise is that the trellis represents what are ACTUALLY valid transitions, whereas the input string is something you receive through telecommunications and might get garbled and have incorrect bits here and there. This program tries to figure out what the input string SHOULD be by minimizing error).
There's the usual "follow path backwards" technique, requiring only the table of values (but the whole table of values, no cheating with "keep only the most recent part"). The algorithm is simple: start at the end, decide which way you came from. You can make that decision, because either there's exactly one way such that if you came from it you'd compute the value that matches the stored one, or several result in the same value and it wouldn't matter which one you chose.
Also storing a table of "back-pointers" doesn't take much space (about as much as the table of weights, but you can actually omit most of the table of weights if you do this). Doing it that way gives you a much simpler backwards phase: just follow the pointers. That really is the path, just stored backwards.
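As a rough illustration of the back-pointer variant, here is a minimal Python sketch. It assumes a simplified trellis in which each state is labelled with the bits it emits, transitions maps every state to the states reachable from it, and start_states lists the states allowed in the first column; all of these names are made up for the example. Only the latest column of costs is kept, plus one small table of back-pointers per step.

```python
def hamming(a, b):
    """Number of differing bits between two integers."""
    return bin(a ^ b).count("1")

def viterbi(chunks, transitions, start_states):
    """Return (minimal total Hamming distance, list of states on that path)."""
    INF = float("inf")
    states = list(transitions)
    cost = {s: hamming(s, chunks[0]) if s in start_states else INF for s in states}
    back = []  # back[t][s] = best predecessor of state s at step t + 1
    for chunk in chunks[1:]:
        new_cost = {s: INF for s in states}
        pointers = {}
        for prev in states:
            if cost[prev] == INF:
                continue  # prev is unreachable at this step
            for s in transitions[prev]:
                candidate = cost[prev] + hamming(s, chunk)
                if candidate < new_cost[s]:
                    new_cost[s] = candidate
                    pointers[s] = prev
        back.append(pointers)
        cost = new_cost  # older cost columns are no longer needed
    end = min(cost, key=cost.get)
    if cost[end] == INF:
        raise ValueError("no valid path through the trellis")
    path = [end]
    for pointers in reversed(back):  # backwards phase: just follow the pointers
        path.append(pointers[path[-1]])
    path.reverse()
    return cost[end], path
```

With the hypothetical transitions {0b00: [0b00, 0b01], 0b01: [0b10, 0b11], 0b10: [0b11, 0b00], 0b11: [0b00, 0b01]}, start states {0b00, 0b01} and input chunks 01, 10, 11, 10, the last chunk cannot be reached from 11, so the decoder settles for 00 at a total cost of one bit.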
You are correct that the straightforward approach to calculating the paths is expensive in space.
This problem comes up often in DNA sequencing, where the cost is prohibitive. There are a number of ways to overcome it (see more here):
You can reduce up to a square root of the space if you are willing to double the execution time (see 2.1.1 in the link above).
Using a compressed tree, you can reduce one of the dimensions logarithmically (see 2.1.2 in the link above).
I'm trying to create a decision tree from data. I'm using the tree for a guess-the-animal kind of application: the user answers questions with yes/no and the program guesses the answer. This program is for homework.
I don't know how to create a decision tree from data. I have no way of knowing what the root node will be. The data will be different every time, so I can't do it by hand. My data looks like this:
Animal1: property1, property3, property5
Animal2: property2, property3, property5, property6
Animal3: property1, property6
etc.
I searched Stack Overflow and I found the ID3 and C4.5 algorithms. But I don't know if I should use them.
Can someone tell me which algorithm I should use to build a decision tree in this situation?
"I searched Stack Overflow and I found the ID3 and C4.5 algorithms. But I don't know if I should use them."
Yes, you should. They are very commonly used decision trees, and have some nice open source implementations for them. (Weka's J48 is an example implementation of C4.5)
If you need to implement something from scratch, building a simple decision tree is fairly straightforward, and is done recursively:
1. Let the set of labeled samples be S, with the set of properties P={p1,p2,...,pk}
2. Choose a property pi
3. Split S into two sets S1,S2 - S1 holds the samples that have pi, and S2 those that do not. Create two children for the current node, and move S1 and S2 to them respectively
4. Repeat with S'=S1 and S'=S2, that is, for each of the subsets of samples, if it is not empty.
Some pointers:
At each iteration you basically split the current data to 2 subsets, the samples that hold pi, and the data that does not. You then create two new nodes, which are the current node's children, and repeat the process for each of them, each with the relevant subset of data.
A smart algorithm chooses the property pi (in step 2) in a way that minimizes the tree's height as much as it can (finding the best solution is NP-Hard, but there are greedy approaches, for example ones that minimize entropy).
After the tree is created, it is usually pruned in order to avoid overfitting.
A simple extension of this algorithm is to use multiple decision trees that work separately - this is called Random Forests, and empirically it usually gives pretty good results.
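The splitting procedure described above can be sketched in Python as follows. This is only an illustration under some assumptions: the samples are given as a dict mapping each animal to its set of properties, and the property in step 2 is chosen by a simple greedy rule (the most even yes/no split) standing in for the entropy criterion that ID3 and C4.5 use. The names build_tree and guess are made up for the example, and no pruning is done.

```python
def build_tree(samples, properties):
    """samples: dict name -> set of properties. Returns a leaf (sorted list of
    names) or an inner node (property, subtree_if_yes, subtree_if_no)."""
    if len(samples) <= 1:
        return sorted(samples)

    def count(prop):
        return sum(1 for props in samples.values() if prop in props)

    # Only properties that actually separate the samples are worth asking about.
    useful = [p for p in properties if 0 < count(p) < len(samples)]
    if not useful:
        return sorted(samples)  # the remaining samples are indistinguishable
    # Greedy choice: the property whose yes/no split is closest to half and half.
    best = min(useful, key=lambda p: abs(2 * count(p) - len(samples)))
    yes = {name: props for name, props in samples.items() if best in props}
    no = {name: props for name, props in samples.items() if best not in props}
    rest = [p for p in properties if p != best]
    return (best, build_tree(yes, rest), build_tree(no, rest))

def guess(tree, answers):
    """Walk the tree; answers is the set of properties the user said yes to."""
    while isinstance(tree, tuple):
        prop, yes, no = tree
        tree = yes if prop in answers else no
    return tree
```

In the game, each inner node corresponds to one yes/no question about a property, and a leaf that still holds several animals means the data cannot tell them apart.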
I have a collection of tuples (x,y) of 64-bit integers that make up my dataset. I have, say, trillions of these tuples; it is not feasible to keep the dataset in memory on any machine on earth. However, it is quite reasonable to store them on disk.
I have an on-disk store (a B+-tree) that allows for quick, concurrent querying of data in a single dimension. However, some of my queries rely on both dimensions.
Query examples:
Find the tuple whose x is greater than or equal to some given value
Find the tuple whose x is as small as possible s.t. its y is greater than or equal to some given value
Find the tuple whose x is as small as possible s.t. its y is less than or equal to some given value
Perform maintenance operations (insert some tuple, remove some tuple)
The best bet I have found is Z-order curves, but I cannot seem to figure out how to conduct the queries given my two-dimensional data set.
Solutions that are not acceptable include a sequential scan of the data, which could be far too slow.
I think the most appropriate data structures for your requirements are the R-tree and its variants (R*-tree, R+-tree, Hilbert R-tree). An R-tree is similar to a B+-tree, but also allows multidimensional queries.
Another relevant data structure is the Priority Search Tree. It is good for queries like your examples 1 .. 3, but not very efficient if you need frequent updates or an on-disk store. For details see this paper or this book: "Handbook of Data Structures and Applications" (Chapter 18.5).
Are you saying you don't know how to query z-order curves? The Wikipedia page describes how you do range searches.
A z-curve divides your space into nested rectangles, where each additional bit in the key divides the space in half. To search for a point:
Start with the largest rectangle that might contain your point.
Recursively:
  Create a result set of rectangles
  For each rectangle in your set
    If the rectangle is a single point, you are done, it is what you are looking for.
    Otherwise, divide the rectangle in two (specify one additional bit of the z-curve)
      If both halves contain a point
        If one half is better
          Add that rectangle to your result set of rectangles
        Otherwise
          Add both rectangles to your result set of rectangles
      Otherwise, only one half contains a point
        Add that rectangle to your result set of rectangles
  Search your result set of rectangles
Worst case performance is bad, of course. You can adjust it by changing how you construct your z-order index.
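For reference, a z-order key is obtained by interleaving the bits of the two coordinates, which is why every additional key bit halves the current rectangle in one dimension. The Python sketch below shows only this encoding step, not the range search itself; the function names and the choice of putting x on the even bit positions are assumptions made for illustration.

```python
def interleave(x, y, bits=64):
    """Build a z-order (Morton) key: x fills the even bits, y the odd bits."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z

def deinterleave(z, bits=64):
    """Recover (x, y) from a key produced by interleave."""
    x = y = 0
    for i in range(bits):
        x |= ((z >> (2 * i)) & 1) << i
        y |= ((z >> (2 * i + 1)) & 1) << i
    return x, y
```

Because Python integers are unbounded, the 128-bit key of two 64-bit coordinates fits in a single int; in an on-disk B+-tree it would be stored as a 16-byte key, so that all keys sharing a prefix lie in one contiguous key range.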
I'm currently working on designing a data structure which is essentially a 'stacked' B+ tree (or a d+ tree where d is the number of dimensions) for multidimensional data. I believe it would suit your data perfectly and is being designed specifically for your use case.
The basic idea is this:
Each dimension is a B+ tree and is linked to the next dimension's B+ tree. Search through the first dimension normally, once a leaf is reached it contains a pointer to the root of the next B+ tree which belongs to the next dimension. Everything in the second B+ tree belongs to the same x value.
The original plan was to store only the unique values for each dimension along with their counts. This employs a very simple compression algorithm (if you can even call it that) while still allowing for the entire data set to be represented. This 'linked' dimension scheme could allow for extra dimensions to be added later as they are simply added to the stack of B+ trees.
Total insert/search/delete time for 2 dimensions would be something similar to this:
log_b(card(x)) + log_b(card(y))
where b is the base of each B+ tree and card(x) would be the cardinality of the x dimension.
I hope that makes sense. I'm still working on an implementation; however, feel free to use or even augment the idea.
http://fallabs.com/tokyocabinet/
Tokyo Cabinet is a library of routines for managing a database. The database is a simple data file containing records, each is a pair of a key and a value. Every key and value is serial bytes with variable length. Both binary data and character string can be used as a key and a value. There is neither concept of data tables nor data types. Records are organized in hash table, B+ tree, or fixed-length array.
Tokyo Cabinet is written in the C language, and provided as API of C, Perl, Ruby, Java, and Lua. Tokyo Cabinet is available on platforms which have API conforming to C99 and POSIX. Tokyo Cabinet is a free software licensed under the GNU Lesser General Public License.
Maybe it would be easy for you to embed?
Suppose we are given data in a semi-structured format as a tree. As an example, the tree can be formed as a valid XML document or as a valid JSON document. You could imagine it being a lisp-like S-expression or an (G)Algebraic Data Type in Haskell or Ocaml.
We are given a large number of "documents" in the tree structure. Our goal is to cluster documents which are similar. By clustering, we mean a way to divide the documents into j groups, such that the elements in each group resemble one another.
I am sure there are papers out there which describe approaches, but since I am not very familiar with the area of AI/Clustering/MachineLearning, I want to ask somebody who is what to look for and where to dig.
My current approach is something like this:
I want to convert each document into an N-dimensional vector set up for a K-means clustering.
To do this, I recursively walk the document tree and for each level I calculate a vector. If I am at a tree vertex, I recur on all subvertices and then sum their vectors. Also, whenever I recur, a power factor is applied so that the data matters less and less the further down the tree I go. The document's final vector is the one computed at the root of the tree.
Depending on the data at a tree leaf, I apply a function which takes the data into a vector.
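To make the description concrete, the approach could be sketched in Python like this. Everything specific here is an assumption made for illustration: documents are JSON-like nested dicts and lists, dict keys are ignored, each leaf is hashed into one of N buckets with zlib.crc32, and the power factor is a constant DECAY applied once per level.

```python
import zlib

N = 8        # dimensionality of the document vectors (assumed)
DECAY = 0.5  # power factor applied at every level of recursion (assumed)

def leaf_vector(value):
    """Map leaf data to a vector by hashing it into one of N buckets."""
    vec = [0.0] * N
    vec[zlib.crc32(repr(value).encode()) % N] = 1.0
    return vec

def doc_vector(node):
    """Sum the children's vectors, damped by DECAY for each level descended."""
    if isinstance(node, dict):
        children = list(node.values())
    elif isinstance(node, list):
        children = node
    else:
        return leaf_vector(node)
    total = [0.0] * N
    for child in children:
        for i, v in enumerate(doc_vector(child)):
            total[i] += DECAY * v
    return total
```

The resulting vectors all have length N, so they can be fed directly to a K-means implementation.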
But surely, there are better approaches. One weakness of my approach is that it will only similarity-cluster trees whose top-level structures are much alike. If the similarity is present, but occurs farther down the tree, then my approach probably won't work very well.
I imagine there are solutions in full-text-search as well, but I do want to take advantage of the semi-structure present in the data.
Distance function
As suggested, one needs to define a distance function between documents. Without this function, we can't apply a clustering algorithm.
In fact, it may be that the question is about that very distance function and examples thereof. I want documents where elements near the root are the same to cluster close to each other. The farther down the tree we go, the less it matters.
The take-one-step-back viewpoint:
I want to cluster stack traces from programs. These are well-formed tree structures, where the functions close to the root are the innermost functions, the ones that failed. I need a decent distance function between stack traces, so that traces which probably occurred because the same event happened in the code end up close together.
Given the nature of your problem (stack trace), I would reduce it to a string matching problem. Representing a stack trace as a tree is a bit of overhead: for each element in the stack trace, you have exactly one parent.
If string matching would indeed be more appropriate for your problem, you can run through your data, map each node onto a hash and create for each 'document' its n-grams.
Example:
Mapping:
Exception A -> 0
Exception B -> 1
Exception C -> 2
Exception D -> 3
Doc A: 0-1-2
Doc B: 1-2-3
2-grams for doc A:
X0, 01, 12, 2X
2-grams for doc B:
X1, 12, 23, 3X
Using the n-grams, you will be able to cluster similar sequences of events regardless of the root node (in this example, the shared 2-gram 12).
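The example above can be reproduced with a few lines of Python. The helper names and the use of Jaccard similarity as the comparison are assumptions made for illustration; any set-based or vector-based measure over the n-grams would fit the same scheme.

```python
def ngrams(sequence, n=2, pad="X"):
    """Return the set of n-grams of a sequence, padded with pad at both ends."""
    padded = [pad] * (n - 1) + list(sequence) + [pad] * (n - 1)
    return {tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)}

def jaccard(a, b):
    """Share of n-grams the two documents have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

doc_a = ngrams([0, 1, 2])  # X0, 01, 12, 2X
doc_b = ngrams([1, 2, 3])  # X1, 12, 23, 3X
similarity = jaccard(doc_a, doc_b)  # the documents share only the 2-gram 12
```

A distance for clustering can then be taken as 1 - similarity.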
However, if you are still convinced that you need trees, instead of strings, you must consider the following: finding similarities for trees is a lot more complex. You will want to find similar subtrees, with subtrees that are similar over a greater depth resulting in a better similarity score. For this purpose, you will need to discover closed subtrees (subtrees that are the base subtrees for trees that extend it). What you don't want is a data collection containing subtrees that are very rare, or that are present in each document you are processing (which you will get if you do not look for frequent patterns).
Here are some pointers:
http://portal.acm.org/citation.cfm?id=1227182
http://www.springerlink.com/content/yu0bajqnn4cvh9w9/
Once you have your frequent subtrees, you can use them in the same way as you would use the n-grams for clustering.
Here you may find a paper that seems closely related to your problem.
From the abstract:
This thesis presents Ixor, a system which collects, stores, and analyzes stack traces in distributed Java systems. When combined with third-party clustering software and adaptive cluster filtering, unusual executions can be identified.
Problem description
There are different categories, each of which contains an arbitrary number of elements.
There are three attributes: A, B and C. Each element has its own distribution of these attributes, expressed as positive integer values. For example, element 1 has the attributes A: 42, B: 1337, C: 18. The sum of the attributes is not the same for every element; some elements have more than others.
Now the problem:
We want to choose exactly one element from each category so that
We hit a certain threshold on attributes A and B (going over it is also possible, but not necessary)
while getting a maximum amount of C.
Example: we want to hit at least 80 A and 150 B in sum over all chosen elements and want as many C as possible.
I've thought about this problem and cannot imagine an efficient solution. The sample sizes are about 15 categories, each of which contains up to ~30 elements, so brute-forcing doesn't seem to be very effective since there are potentially 30^15 possibilities.
My model is that I think of it as a tree with depth number of categories. Each depth level represents a category and gives us the choice of choosing an element out of this category. When passing over a node, we add the attributes of the represented element to our sum which we want to optimize.
If we hit the same attribute combination multiple times on the same level, we merge them so that we can strip away the repeated computation of already computed values. If, at some level, one path is worse than another in all three attributes, we don't follow it any further from there.
However, in the worst case this tree still has ~30^15 nodes in it.
Can any of you think of an algorithm which may help me solve this problem? Or could you explain why you think that no such algorithm exists?
This question is very similar to a variation of the knapsack problem. I would start by looking at solutions for that problem and see how well you can apply them to your stated problem.
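One knapsack-style technique that seems to fit is a dynamic program over the categories in which the running sums of A and B are capped at their thresholds, since anything beyond the threshold makes no difference. The Python sketch below is only an illustration under that assumption: elements are (a, b, c) tuples of non-negative integers, and best_selection is a made-up name. With thresholds of 80 and 150 there are at most 81 * 151 distinct states per category, far fewer than 30^15 paths.

```python
def best_selection(categories, need_a, need_b):
    """Pick one (a, b, c) element per category so that sum(a) >= need_a and
    sum(b) >= need_b while sum(c) is maximal.
    Returns (best c, chosen elements), or None if the thresholds are unreachable."""
    # State: (a capped at need_a, b capped at need_b) -> (best c, chosen so far)
    states = {(0, 0): (0, [])}
    for category in categories:
        next_states = {}
        for (a, b), (c, chosen) in states.items():
            for elem in category:
                ea, eb, ec = elem
                key = (min(need_a, a + ea), min(need_b, b + eb))
                candidate = c + ec
                # Merge paths that reach the same capped state: keep the best c.
                if key not in next_states or candidate > next_states[key][0]:
                    next_states[key] = (candidate, chosen + [elem])
        states = next_states
    return states.get((need_a, need_b))
```

For instance, with the categories [(50, 10, 5), (10, 100, 1)] and [(40, 150, 2), (30, 50, 9)] and thresholds of 80 A and 150 B, the only feasible choice is (50, 10, 5) plus (40, 150, 2), for a total C of 7.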
My first inclination is to try branch-and-bound. You can do it breadth-first or depth-first, and I prefer depth-first because I think it's cleaner.
To express it simply, you have a tree-walk procedure walk that can enumerate all possibilities (maybe it just has a 5-level nested loop). It is augmented with two things:
At every step of the way, it keeps track of the cost at that point, where the cost can only increase. (If the cost can also decrease, it becomes more like a minimax game tree search.)
The procedure has an argument budget, and it does not search any branches where the cost can exceed the budget.
Then you have an outer loop:
for (budget = 0; budget < ... ; budget++){
    walk(budget);
    // if walk finds a solution within the budget, halt
}
The amount of time it takes is exponential in the budget, so easier cases will take less time. The fact that you are re-doing the search doesn't matter much because each level of the budget takes as much or more time than all the previous levels combined.
Combine this with some sort of heuristic about the order in which you consider branches, and it may give you a workable solution for typical problems you give it.
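A minimal Python sketch of the budgeted walk and its outer loop is given below. The problem encoding is a toy assumption made for illustration: each level is a list of (cost, option) pairs with non-negative integer costs, and is_solution is a caller-supplied test on a complete selection. The names walk and search follow the description above but are otherwise made up.

```python
def walk(levels, budget, is_solution, chosen=(), cost=0):
    """Depth-first walk that abandons any branch whose cost exceeds the budget."""
    if cost > budget:
        return None  # bound: the cost can only increase from here
    if len(chosen) == len(levels):
        return chosen if is_solution(chosen) else None
    for option_cost, option in levels[len(chosen)]:
        found = walk(levels, budget, is_solution,
                     chosen + (option,), cost + option_cost)
        if found is not None:
            return found
    return None

def search(levels, max_budget, is_solution):
    """Outer loop: raise the budget until walk finds a solution within it."""
    for budget in range(max_budget + 1):
        found = walk(levels, budget, is_solution)
        if found is not None:
            return budget, found
    return None
```

Because the budget grows one unit at a time, the first solution found is also one of minimal cost.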
If that doesn't work, you can fall back on basic heuristic programming. That is, do some cases by hand, and pay attention to how you did it. Then program it the same way.
I hope that helps.