I'm probably just trying to do something crazy here, so let me explain my use case first:
I've got an object graph in Ruby, consisting only of the basic JSON types (strings, numbers, arrays, hashes, trues, falses, nils). I'd like ultimately to serialize this graph to JSON.
The problem is that I don't have control over the origin of all objects in the graph. This means that some of the strings contained in the object graph might be tagged with the wrong encodings (for example, a string that's actually just a bunch of random garbage bytes ends up tagged with a UTF-8 encoding). This will cause the JSON serialization to fail (since JSON only supports UTF-8 encoded strings).
I have a strategy for handling these problematic strings, which basically consists of replacing them with a transformed version of each string (the exact transformation isn't really relevant).
In order to apply this transformation to strings, I need to walk the entire object graph and find all of them. This is trivial to implement recursively using standard depth-first search. One wrinkle is that I'd like to avoid mutating the original object graph or any strings therein, so I'm basically building a copy of the object graph as I traverse it (with only the non-problematic leaf nodes being referenced directly from the new graph, and all other nodes being duped).
This all works, and is reasonably efficient, save the duplication of non-leaf nodes in the transformed object graph. The problem is that it sometimes gets fed very deeply-nested object graphs, so the recursion will on occasion produce a SystemStackError.
I've implemented a non-recursive solution using DFS with a stack of Enumerator objects, but it seems to be dramatically slower than the recursive solution (presumably on account of the extra object allocations for the Enumerators and the silly StopIteration exceptions that get raised at the end of each Enumerator).
Breadth-first search seems inappropriate, because I don't think there's a way to determine the path back up to the root when visiting a given node, which I think I need in order to build a copy of the tree.
Am I wrong about BFS here? Are there other techniques that I could be using to accomplish this traversal without recursion? Is this all just loony?
Instead of using recursion, you could use a stack explicitly. See here for more details:
Way to go from recursion to iteration
http://haacked.com/archive/2007/03/04/Replacing_Recursion_With_a_Stack.aspx/
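As a rough illustration of the explicit-stack idea, here is a minimal sketch in Python rather than Ruby. The names `transform_graph` and `fix_string` are made up for this example; `fix_string` stands in for the transformation the question leaves unspecified, and leaving hash keys untouched is a simplifying assumption. Each stack entry carries the container that the copied node must be stored into, so no path back to the root is needed.

```python
def transform_graph(root, fix_string):
    """Copy a JSON-like graph without recursion, passing each string value through fix_string."""
    holder = [None]                 # one-slot list acting as the "parent" of the root
    stack = [(root, holder, 0)]     # entries: (source node, container to fill, key or index)
    while stack:
        node, parent, key = stack.pop()
        if isinstance(node, dict):
            copy = {}
            # Push in reverse so children are popped, and inserted, in original order.
            for k, v in reversed(list(node.items())):
                stack.append((v, copy, k))   # assumption: keys are kept as-is
        elif isinstance(node, list):
            copy = [None] * len(node)        # preallocate so children can fill any slot
            for i, v in enumerate(node):
                stack.append((v, copy, i))
        elif isinstance(node, str):
            copy = fix_string(node)
        else:
            copy = node                      # numbers, True, False, None are shared
        parent[key] = copy
    return holder[0]


original = {"a": ["x", 1, {"b": "y"}], "c": None}
print(transform_graph(original, str.upper))
```

Because every entry already knows where its copy belongs, the same entries would presumably work in a FIFO queue as well, which suggests breadth-first order is not ruled out by the need to rebuild the tree.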
I have a representation of an object being for example
SubObjects: H1,H2,F1,F2
where each of the H and F entries represents a specific smaller object. I would like an easy way to query for all the representations that have 3 of the subobjects in common.
For example, when I query for objects that have 3 parts of the string representation in common with H1,H2,F1,F2, then H1,H4,F1,F2 would be returned, and so would H1,H2,F1,F5.
The string position is important therefore H2,H1,F1,F2 is different from H1,H2,F1,F2.
A brute-force approach is not feasible, as I have thousands of such strings to compare. I was thinking of hacking around the problem by using suffix trees.
Is there a more efficient data structure that I can use to solve the problem?
As I stated in my question, I resorted to using suffix trees. Such trees let me query really rapidly for particular substrings and get back all the objects that contain that substring. I don't know if a better solution exists, but suffix trees worked well for my problem.
suffix trees:
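For comparison only, and not what the answer above used: if "3 in common" means three of the four positions match exactly, a plain hash index over "leave-one-position-out" keys may be enough. This is a minimal Python sketch under that assumption; `build_index`, `query` and the `*` wildcard are invented for the example.

```python
from collections import defaultdict


def masked_keys(record):
    """Yield one key per position, with that position replaced by a wildcard."""
    parts = tuple(record.split(","))
    for skip in range(len(parts)):
        yield parts[:skip] + ("*",) + parts[skip + 1:]


def build_index(records):
    """Map every masked key to the set of records that produce it."""
    index = defaultdict(set)
    for record in records:
        for key in masked_keys(record):
            index[key].add(record)
    return index


def query(index, record):
    """Return all indexed records that agree with `record` in all but at most one position."""
    hits = set()
    for key in masked_keys(record):
        hits |= index.get(key, set())
    return hits


records = ["H1,H4,F1,F2", "H1,H2,F1,F5", "H2,H1,F1,F2", "H1,H2,F1,F2"]
index = build_index(records)
print(sorted(query(index, "H1,H2,F1,F2")))
```

Each query costs one hash lookup per position, independent of the number of stored records. Exact matches are returned too, since they also share at least three positions.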
I am looking into using an Edit Distance algorithm to implement a fuzzy search in a name database.
I've found a data structure that will supposedly help speed this up through a divide and conquer approach - Burkhard-Keller Trees. The problem is that I can't find very much information on this particular type of tree.
If I populate my BK-tree with arbitrary nodes, how likely am I to have a balance problem?
If it is possible or likely for me to have a balance problem with BK-Trees, is there any way to balance such a tree after it has been constructed?
What would the algorithm look like to properly balance a BK-tree?
My thinking so far:
It seems that child nodes are distinct on distance, so I can't simply rotate a given node in the tree without re-calibrating the entire tree under it. However, if I can find an optimal new root node this might be precisely what I should do. I'm not sure how I'd go about finding an optimal new root node though.
I'm also going to try a few methods to see if I can get a fairly balanced tree by starting with an empty tree, and inserting pre-distributed data.
Start with an alphabetically sorted list, then queue from the middle. (I'm not sure this is a great idea because alphabetizing is not the same as sorting on edit distance).
Completely shuffled data. (This relies heavily on luck to pick a "not so terrible" root by chance. It might fail badly and might be probabilistically guaranteed to be sub-optimal).
Start with an arbitrary word in the list and sort the rest of the items by their edit distance from that item. Then queue from the middle. (I feel this is going to be expensive, and still do poorly as it won't calculate metric space connectivity between all words - just each word and a single reference word).
Build an initial tree with any method, flatten it (basically like a pre-order traversal), and queue from the middle for a new tree. (This is also going to be expensive, and I think it may still do poorly as it won't calculate metric space connectivity between all words ahead of time, and will simply get a different and still uneven distribution).
Order by name frequency, insert the most popular first, and ditch the concept of a balanced tree. (This might make the most sense, as my data is not evenly distributed and I won't have pure random words coming in).
FYI, I am not currently worrying about the name-synonym problem (Bill vs William). I'll handle that separately, and I think completely different strategies would apply.
There is a Lisp example in the article: http://cliki.net/bk-tree. As for the tree becoming unbalanced, I think the data structure and the method are complicated enough already, and the author didn't say anything about unbalanced trees. If you do run into an unbalanced tree, maybe this structure is not for you?
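For readers who cannot find much material on the structure, here is a minimal Python sketch of a BK-tree over Levenshtein distance. It is an illustration written for this thread, not a port of the Lisp code in the linked article; the class and method names are invented. Each node keeps its children in a table keyed by their distance to that node, which is why a conventional rotation does not carry over.

```python
def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


class BKTree:
    def __init__(self, distance=edit_distance):
        self.distance = distance
        self.root = None            # a node is (word, {distance: child node})

    def add(self, word):
        if self.root is None:
            self.root = (word, {})
            return
        node = self.root
        while True:
            d = self.distance(word, node[0])
            if d == 0:
                return              # word is already stored
            child = node[1].get(d)
            if child is None:
                node[1][d] = (word, {})
                return
            node = child

    def search(self, word, max_dist):
        """Return sorted (distance, word) pairs within max_dist of word."""
        results = []
        stack = [self.root] if self.root else []
        while stack:
            candidate, children = stack.pop()
            d = self.distance(word, candidate)
            if d <= max_dist:
                results.append((d, candidate))
            # The triangle inequality limits which subtrees can hold matches.
            for k, child in children.items():
                if d - max_dist <= k <= d + max_dist:
                    stack.append(child)
        return sorted(results)


tree = BKTree()
for name in ["book", "books", "cake", "boo", "cape", "cart", "boon", "cook"]:
    tree.add(name)
print(tree.search("bok", 1))
```

With a small distance threshold, most subtrees are skipped whatever the tree's shape, so it may be worth measuring search cost on real name data before investing in any rebalancing scheme.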
In my lexical analyzer generator I use the McNaughton and Yamada algorithm for NFA construction. One of its properties is that a transition from I to J is marked with the character at position J.
So each node of the NFA can be represented simply as a list of next possible states.
Which data structure is best suited for storing this type of data? It must provide fast lookup of all possible states and use little space; insertion time is not so important.
My understanding is that you want to encode a graph, where the nodes are states and the edges are transitions, and where every edge is labelled with a character. Is that correct?
The dull but practical answer is to have an object for each state, and to encode the transitions in some little structure in that object.
The simplest one would be an array, indexed by character code: that's as fast as it gets, but not naturally space-efficient. You can make it more space efficient by using a sort of offset, truncated array: store only the part of the array which contains transitions, along with the start and end indices of that part. When looking up a character in it, check that its code is within the bounds; if it isn't, treat it as a null edge (or an edge back to the start state or whatever), and if it is, fetch the element at index (character code - start). Does that make sense?
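The offset, truncated array described above could be sketched like this in Python. The class name `TransitionTable` and the use of `None` as the "null edge" are assumptions made for the example.

```python
class TransitionTable:
    """Stores only the slice of the character range that actually has transitions."""

    def __init__(self, transitions):
        # transitions: dict mapping a single character to a target state id
        codes = [ord(c) for c in transitions]
        self.start = min(codes) if codes else 0
        size = (max(codes) - self.start + 1) if codes else 0
        self.targets = [None] * size
        for char, target in transitions.items():
            self.targets[ord(char) - self.start] = target

    def lookup(self, char):
        index = ord(char) - self.start
        if 0 <= index < len(self.targets):
            return self.targets[index]   # may still be None for a hole inside the range
        return None                      # outside the stored range: null edge


table = TransitionTable({"a": 1, "d": 2})
print(table.lookup("a"), table.lookup("b"), table.lookup("z"))
```

The stored array here has four slots (for `a` through `d`) instead of one slot per possible character code.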
A more complex option would be a little hashtable, which would be more compact but slightly slower. I would suggest closed hashing, because collision lists will use too much memory; linear probing should be enough. You could look into using perfect hashing (look it up), which takes a lot of time to generate the table but then gives collision-free lookup. The generation process is quite complex, though.
A clever approach is to use both arrays and hashtables, and to pick one or the other based on the number of edges: if the compacted array would be more than, say, a third full, use it, but if not, use a hashtable.
Now, something a bit more radical you could do would be to use arrays, but to overlap them: if they're sparse, they'll have lots of holes in them, and if you're clever, you can arrange them so that the entries in each array line up with holes in the others. That will give you fast lookups, but also excellent memory efficiency. You will need some scheme for distinguishing when a lookup has found something from when it has found a slot that is empty for this state but holds some other state's transition, but I'm sure you can think of something.
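One possible scheme for the overlapping arrays, sketched in Python: pack every state's sparse row into one shared array at the first offset where it fits, and keep a parallel "owner" array so a lookup can tell whether the slot it landed on really belongs to the state being queried. The function names and the first-fit placement are choices made for this example, not something the answer prescribes.

```python
def pack_rows(rows, alphabet_size):
    """rows[state] is a dict {symbol index: target state}. Returns (base, next_state, owner)."""
    base, next_state, owner = [], [], []
    for state, row in enumerate(rows):
        offset = 0
        # First fit: slide the row along until none of its entries hits an occupied slot.
        while any(offset + s < len(owner) and owner[offset + s] is not None for s in row):
            offset += 1
        needed = offset + alphabet_size
        if needed > len(owner):
            grow = needed - len(owner)
            owner.extend([None] * grow)
            next_state.extend([None] * grow)
        for symbol, target in row.items():
            owner[offset + symbol] = state
            next_state[offset + symbol] = target
        base.append(offset)
    return base, next_state, owner


def lookup(tables, state, symbol):
    base, next_state, owner = tables
    slot = base[state] + symbol
    # The owner check distinguishes "my transition" from "someone else's transition".
    if slot < len(owner) and owner[slot] == state:
        return next_state[slot]
    return None


rows = [{0: 1, 3: 2}, {1: 0}, {0: 2, 2: 1}]
tables = pack_rows(rows, alphabet_size=4)
print(lookup(tables, 2, 2), lookup(tables, 1, 0))
```

In this small case the three rows share one array of 6 slots instead of 3 separate arrays of 4. The cost is the extra owner array and one comparison per lookup.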
Suppose we are given data in a semi-structured format as a tree. As an example, the tree can be formed as a valid XML document or as a valid JSON document. You could imagine it being a Lisp-like S-expression or a (Generalized) Algebraic Data Type in Haskell or OCaml.
We are given a large number of "documents" in the tree structure. Our goal is to cluster documents which are similar. By clustering, we mean a way to divide the documents into j groups, such that the elements in each group resemble each other.
I am sure there are papers out there that describe approaches, but since I am not very knowledgeable in the area of AI/clustering/machine learning, I want to ask somebody who is what to look for and where to dig.
My current approach is something like this:
I want to convert each document into an N-dimensional vector set up for a K-means clustering.
To do this, I recursively walk the document tree and for each level I calculate a vector. If I am at a tree vertex, I recur on all subvertices and then sum their vectors. Also, whenever I recur, a power factor is applied so that nodes matter less and less the further down the tree I go. The document's final vector is the vector at the root of the tree.
Depending on the data at a tree leaf, I apply a function which takes the data into a vector.
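To make the approach concrete, here is a minimal Python sketch of the recursive vector computation. The leaf embedding (`leaf_vector`, a one-hot bucket chosen from the bytes of the value's `repr`), the vector size and the decay of 0.5 are all placeholders invented for this example; the question does not say what the real leaf function is.

```python
def leaf_vector(value, dims=8):
    """Hypothetical leaf embedding: a one-hot vector in a deterministic bucket."""
    vec = [0.0] * dims
    vec[sum(repr(value).encode("utf-8")) % dims] = 1.0
    return vec


def tree_vector(node, decay=0.5, dims=8):
    """Sum the children's vectors, scaling by `decay` at every step down the tree."""
    if isinstance(node, dict):
        children = list(node.values())
    elif isinstance(node, list):
        children = node
    else:
        return leaf_vector(node, dims)
    total = [0.0] * dims
    for child in children:
        child_vec = tree_vector(child, decay, dims)
        for i, x in enumerate(child_vec):
            total[i] += decay * x
    return total


doc = {"frame": "main", "callees": [{"frame": "parse"}, {"frame": "read"}]}
print(tree_vector(doc))
```

A leaf at depth d contributes with weight decay**d, which matches the stated goal that differences near the root dominate. The resulting fixed-length vectors could then be handed to a k-means implementation.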
But surely, there are better approaches. One weakness of my approach is that it will only similarity-cluster trees which have a top structure much like each other. If the similarity is present, but occurs farther down the tree, then my approach probably won't work very well.
I imagine there are solutions in full-text-search as well, but I do want to take advantage of the semi-structure present in the data.
Distance function
As suggested, one needs to define a distance function between documents. Without this function, we can't apply a clustering algorithm.
In fact, it may be that the question is about that very distance function and examples thereof. I want documents where elements near the root are the same to cluster close to each other. The farther down the tree we go, the less it matters.
The take-one-step-back viewpoint:
I want to cluster stack traces from programs. These are well-formed tree structures, where the functions close to the root are the inner functions that fail. I need a decent distance function that groups stack traces which probably occurred because the same event happened in the code.
Given the nature of your problem (stack trace), I would reduce it to a string matching problem. Representing a stack trace as a tree is a bit of overhead: for each element in the stack trace, you have exactly one parent.
If string matching would indeed be more appropriate for your problem, you can run through your data, map each node onto a hash and create for each 'document' its n-grams.
Example:
Mapping:
Exception A -> 0
Exception B -> 1
Exception C -> 2
Exception D -> 3
Doc A: 0-1-2
Doc B: 1-2-3
2-grams for doc A:
X0, 01, 12, 2X
2-grams for doc B:
X1, 12, 23, 3X
Using the n-grams, you will be able to cluster similar sequences of events regardless of the root node (in this example, event 12).
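The n-gram construction above can be reproduced with a few lines of Python. The Jaccard similarity used to compare the two documents is an assumption added for illustration; any set-based or vector-based similarity could take its place.

```python
def ngrams(events, n=2, pad="X"):
    """Return the padded n-grams of a sequence of event ids."""
    padded = [pad] * (n - 1) + list(events) + [pad] * (n - 1)
    return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]


def jaccard(a, b):
    """Size of the intersection divided by size of the union of two n-gram collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


doc_a = [0, 1, 2]   # Exception A, B, C
doc_b = [1, 2, 3]   # Exception B, C, D
print(ngrams(doc_a))
print(ngrams(doc_b))
print(jaccard(ngrams(doc_a), ngrams(doc_b)))
```

The two documents share exactly one 2-gram, `(1, 2)`, out of seven distinct ones, so their similarity under this measure is 1/7.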
However, if you are still convinced that you need trees, instead of strings, you must consider the following: finding similarities for trees is a lot more complex. You will want to find similar subtrees, with subtrees that are similar over a greater depth resulting in a better similarity score. For this purpose, you will need to discover closed subtrees (subtrees that are the base subtrees for trees that extend it). What you don't want is a data collection containing subtrees that are very rare, or that are present in each document you are processing (which you will get if you do not look for frequent patterns).
Here are some pointers:
http://portal.acm.org/citation.cfm?id=1227182
http://www.springerlink.com/content/yu0bajqnn4cvh9w9/
Once you have your frequent subtrees, you can use them in the same way as you would use the n-grams for clustering.
Here you may find a paper that seems closely related to your problem.
From the abstract:
This thesis presents Ixor, a system which collects, stores, and analyzes
stack traces in distributed Java systems. When combined with third-party
clustering software and adaptive cluster filtering, unusual executions can be
identified.
I was just looking at Eric Lippert's simple implementation of an immutable binary tree, and I have a question about it. After showing the implementation, Eric states that
Note that another nice feature of
immutable data structures is that it
is impossible to accidentally (or
deliberately!) create a tree which
contains a cycle.
It seems that this feature of Eric's implementation does not come from the immutability alone, but also from the fact that the tree is built up from the leaves. This naturally prevents a node from having any of its ancestors as children. It seems that if you built the tree in the other direction, you'd introduce the possibility of cycles.
Am I right in my thinking, or does the impossibility of cycles in this case come from the immutability alone? Considering the source, I wonder whether I'm missing something.
EDIT: After thinking it over a bit more, it seems that building up from the leaves might be the only way to create an immutable tree. Am I right?
If you're using an immutable data structure, in a strict (as opposed to lazy) language, it's impossible to create a cycle; as you must create the elements in some order, and once an element is created, you cannot mutate it to point at an element created later. So if you created node n, and then created node m which pointed at n (perhaps indirectly), you could never complete the cycle by causing n to point at m as you are not allowed to mutate n, nor anything that n already points to.
Yes, you are correct that you can only ever create an immutable tree by building up from the leaves; if you started from the root, you would have to modify the root to point at its children as you create them. Only by starting from the leaves, and creating each node to point to its children, can you construct a tree from immutable nodes.
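The ordering argument can be seen in a small sketch. This uses a frozen Python dataclass as a stand-in for an immutable node; it is not Eric Lippert's C# implementation, and the `Node` type is invented for the example.

```python
from dataclasses import dataclass, FrozenInstanceError
from typing import Optional


@dataclass(frozen=True)
class Node:
    value: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None


# Children must exist before the parent that points at them.
leaf_a = Node("a")
leaf_b = Node("b")
root = Node("root", leaf_a, leaf_b)

# Closing a cycle would mean changing a node that already exists.
try:
    leaf_a.left = root
except FrozenInstanceError:
    print("cannot mutate an existing node")
```

Python lets determined code bypass the freeze with `object.__setattr__`, which is the same kind of escape hatch as reflection or unsafe code in other languages; within the ordinary rules, the cycle cannot be closed.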
If you really want to try hard at it you could create a tree with cycles in it that is immutable. For example, you could define an immutable graph class and then say:
Graph g = Graph.Empty
.AddNode("A")
.AddNode("B")
.AddNode("C")
.AddEdge("A", "B")
.AddEdge("B", "C")
.AddEdge("C", "A");
And hey, you've got a "tree" with "cycles" in it - because of course you haven't got a tree in the first place, you've got a directed graph.
But with a data type that actually uses a traditional "left and right sub trees" implementation of a binary tree then there is no way to make a cyclic tree (modulo of course sneaky tricks like using reflection or unsafe code.)
When you say "built up from the leaves", I guess you're including the fact that the constructor takes children but never takes a parent.
It seems that if you built the tree in
the other direction, you'd introduce
the possibility of cycles.
No, because then you'd have the opposite constraint: the constructor would have to take a parent but never a child. Therefore you can never create a descendant until all its ancestors are created. Therefore no cycles are possible.
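A minimal sketch of such an upward-pointing tree, again using a frozen Python dataclass as a stand-in for an immutable node. The `UpNode` type and the sample names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class UpNode:
    name: str
    parent: Optional["UpNode"] = None   # the only link; it always points at an older node


def path_to_root(node):
    """Follow parent links upward; this always terminates because no cycle can exist."""
    path = []
    while node is not None:
        path.append(node.name)
        node = node.parent
    return path


root = UpNode("root")
child = UpNode("child", root)
grandchild = UpNode("grandchild", child)
print(path_to_root(grandchild))
```

Here the tree is built from the root downward, yet every node is complete at construction time: a node can only name a parent that already exists, so no later node can ever become one of its ancestors.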
After thinking it over a bit more, it
seems that building up from the leaves
might be the only way to create an
immutable tree. Am I right?
No... see my comments to Brian and ergosys.
For many applications, a tree whose child nodes point to their parents is not very useful. I grant that. If you need to traverse the tree in an order determined by its hierarchy, an upward-pointing tree makes that hard.
However for other applications, that sort of tree is exactly the sort we want. For example, we have a database of articles. Each article can have one or more translations. Each translation can have translations. We create this data structure as a relational database table, where each record has a "foreign key" (pointer) to its parent. None of these records need ever change its pointer to its parent. When a new article or translation is added, the record is created with a pointer to the appropriate parent.
A common use case is to query the table of translations, looking for translations for a particular article, or translations in a particular language. Ah, you say, the table of translations is a mutable data structure.
Sure it is. But it's separate from the tree. We use the (immutable) tree to record the hierarchical relationships, and the mutable table for iteration over the items. In a non-database situation, you could have a hash table pointing to the nodes of the tree. Either way, the tree itself (i.e. the nodes) never gets modified.
Here's another example of this data structure, including how to usefully access the nodes.
My point is that the answer to the OP's question is "yes": I agree with the rest of you that the prevention of cycles does come from immutability alone. While you can build a tree in the other direction (top-down), if you do, and it's immutable, it still cannot have cycles.
When you're talking about powerful theoretical guarantees like
another nice feature of immutable data structures is that
it is impossible to accidentally (or
deliberately!) create a tree which
contains a cycle [emphasis in original]
"such a tree wouldn't be very useful" pales in comparison -- even if it were true.
People create un-useful data structures by accident all the time, let alone creating supposedly-useless ones on purpose. The putative uselessness doesn't protect the program from the pitfalls of cycles in your data structures. A theoretical guarantee does (assuming you really meet the criteria it states).
P.S. one nice feature of upward-pointing trees is that you can guarantee one aspect of the definition of trees that downward-pointing tree data structures (like Eric Lippert's) don't: that every node has at most one parent. (See David's comment and my response.)
You can't build it from the root; that would require you to mutate nodes you have already added.