Finding common subtrees in a tree - algorithm

Consider the conceptual diagram below, which is for demonstration purposes only.
Abc    Foo
   \   /  \
    \ /   Foo2
    Bar      \
   /   \     Foo3
 Bar2   Bar3    \
  / \           Foo4
 X   Y
In the above tree, there is a unique "path", Foo->Bar->Bar2->X. This path is distinct from Abc->Bar->Bar2->X. Obviously this information is lost in the above representation, but assume I have all the individual unique paths stored.
They do, however, share some part of the path: "Bar->Bar2->X".
The purpose of the algorithm I'm trying to find or implement is to aggregate this information so that I don't have to store each individual path. More importantly, I'm trying to find all these common paths and give them weights. So, for example, in the above case, I could condense the information about "Bar->Bar2->X" and say it occurred 2 times. Obviously I'd require it to work for all cases.
And yes, the ultimate idea is to be able to quickly ask the question "Show me all the distinct paths from Foo". In this example there is only one: Foo->Bar->Bar2->X. Foo->Bar->Bar2->Y and Foo->Bar->Bar3 do not exist; the diagram is for viewing purposes only.
Any ideas?

This is just a starting point that I hope others will help me fill in, but I would think of the paths as strings and treat the common sub-paths as the common-substring problem, which has been studied quite a bit in the past. Off the top of my head, I might invert each path/string and then build a trie structure from those, because then, by counting the number of keys below a given node, you can see how many times that ending path gets used. There is probably a better and more efficient way, but that should work. Anyone else have ideas on treating them as strings?
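For illustration, a minimal sketch of that reversed-path trie with per-node counts (class and method names here are made up):

import java.util.*;

// A trie over node names; each path is inserted in reverse, so the
// count at a trie node says how many stored paths end with the
// sequence spelled out on the way down from the root.
class PathTrie {
    final Map<String, PathTrie> children = new HashMap<>();
    int count; // number of stored paths whose reversed form passes through here

    // Insert a path such as [Foo, Bar, Bar2, X] in reverse order.
    void insertReversed(List<String> path) {
        PathTrie node = this;
        for (int i = path.size() - 1; i >= 0; i--) {
            node = node.children.computeIfAbsent(path.get(i), k -> new PathTrie());
            node.count++;
        }
    }

    // How many stored paths end with the given suffix,
    // e.g. [Bar, Bar2, X] -> 2 for the example in the question.
    int suffixCount(List<String> suffix) {
        PathTrie node = this;
        for (int i = suffix.size() - 1; i >= 0; i--) {
            node = node.children.get(suffix.get(i));
            if (node == null) return 0;
        }
        return node.count;
    }
}

Inserting Foo->Bar->Bar2->X and Abc->Bar->Bar2->X and then calling suffixCount on [Bar, Bar2, X] returns 2, which is exactly the weight the question asks for.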

You could store each unique path separately. To answer questions such as “who does Foo call”, you could create an index in the form of a hash table.
As an alternative, you could try using a DAWG, but I'm not sure how much it would help in your case.
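A tiny sketch of the hash-table index mentioned above, keyed by the head node of each path (all names are illustrative):

import java.util.*;

// Illustrative index: map each path's head node to all distinct paths
// starting there, so "show me all distinct paths from Foo" is one lookup.
class PathIndex {
    private final Map<String, Set<List<String>>> byHead = new HashMap<>();

    void add(List<String> path) {
        byHead.computeIfAbsent(path.get(0), k -> new LinkedHashSet<>()).add(path);
    }

    Set<List<String>> pathsFrom(String head) {
        return byHead.getOrDefault(head, Collections.emptySet());
    }
}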

Related

How to find ALL optimal local alignments using Smith-Waterman?

If I understand this correctly, then it is possible that there is more than one maximum value in the local alignment matrix. So in order to get all optimal local alignments, instead of only one, I would have to find the locations of all these maximum values in the matrix and trace each of them back individually, right?
Example:
XGTCXXGTCX
 |||
AGTCA

XGTCXXGTCX
      |||
     AGTCA
There is no such thing as ALL optimal alignments; there should be only one optimal alignment. I guess there could be multiple paths for the same alignment, but they would have the same overall score, and it doesn't look like that's the kind of question you are asking.
What the diagram in your post shows is multiple (primer?) hits. In such a case, what I do is run Smith-Waterman once and get the optimal alignment. Then I generate a new alignment where the subject sequence has been trimmed to include only the downstream sequence. The advantage of this way is that I don't have to modify any S-W code or dig into the internals of third-party code.
So it would look like this:
Alignment 1
XGTCXXGTCX
 |||
AGTCA

Delete upstream subject sequence:
XGTCXXGTCX => XGTCX

Alignment 2
XGTCX
 |||
AGTCA
The only tricky part is that you have to keep track of how many bases have been deleted from the subject so you can correctly adjust the match coordinates.
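A sketch of that loop; the Aligner interface and Hit fields below are hypothetical stand-ins for whatever S-W implementation you already have, so only the trim-and-adjust bookkeeping is the point:

import java.util.*;

// Stand-in for your existing Smith-Waterman routine (not a real API).
interface Aligner {
    Hit align(String subject, String query); // null when nothing scores high enough
}

class Hit {
    int subjectStart, subjectEnd, score; // 0-based, end exclusive
}

class IterativeHits {
    // Repeatedly align, record the hit with corrected coordinates, then
    // trim the subject to the sequence downstream of the hit.
    static List<Hit> allHits(Aligner sw, String subject, String query) {
        List<Hit> hits = new ArrayList<>();
        int offset = 0; // bases trimmed off the front so far
        Hit h;
        while ((h = sw.align(subject, query)) != null) {
            h.subjectStart += offset; // map back to coordinates in the
            h.subjectEnd += offset;   // original, untrimmed subject
            hits.add(h);
            int cut = h.subjectEnd - offset;  // local end within current subject
            offset += cut;
            subject = subject.substring(cut); // keep only the downstream sequence
        }
        return hits;
    }
}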
I know this post is pretty old nowadays, but since I found it, other people might also find it while looking for help, and in my opinion the correct answer has not been given yet. So:
Clearly, there can be MULTIPLE optimal local alignments; you've just shown an example of such. Yet there is EXACTLY ONE optimal local alignment SCORE. Looking at the original paper that presented the Smith-Waterman algorithm, Smith and Waterman already indicate how to find the second-best alignment, third-best alignment, and so on.
Here's a reprint to read that stuff (for your problem, check page 196):
https://pdfs.semanticscholar.org/40c5/441aad96b366996e6af163ca9473a19bb9ad.pdf
So (in contrast to other answers on here), the Smith-Waterman algorithm also gives second-best local alignments and so on.
Just check for the next-best score within your scoring matrix (in your case there will be several entries with the same best score) that is not associated with your best local alignment, do the usual backtracking, and you have solved your problem. :)
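A self-contained sketch of that idea, assuming a linear gap penalty and illustrative match/mismatch/gap scores; it traces one path back from every cell that holds the maximum score:

// Smith-Waterman that reports one local alignment for every cell in the
// scoring matrix holding the maximum score, not just the first one found.
class MultiSW {
    static final int MATCH = 2, MISMATCH = -1, GAP = -2;

    public static void main(String[] args) {
        String s = "XGTCXXGTCX", t = "AGTCA";
        int n = s.length(), m = t.length();
        int[][] h = new int[n + 1][m + 1];
        int best = 0;
        for (int i = 1; i <= n; i++)
            for (int j = 1; j <= m; j++) {
                int diag = h[i - 1][j - 1]
                        + (s.charAt(i - 1) == t.charAt(j - 1) ? MATCH : MISMATCH);
                h[i][j] = Math.max(0, Math.max(diag,
                        Math.max(h[i - 1][j] + GAP, h[i][j - 1] + GAP)));
                best = Math.max(best, h[i][j]);
            }
        // Backtrack from each maximum-scoring cell; for the example strings
        // this prints the GTC hit at both sites in the subject.
        for (int i = 1; i <= n; i++)
            for (int j = 1; j <= m; j++)
                if (best > 0 && h[i][j] == best)
                    System.out.println(traceback(h, s, t, i, j));
    }

    // Follow one optimal path back until the running score drops to 0.
    static String traceback(int[][] h, String s, String t, int i, int j) {
        StringBuilder a = new StringBuilder(), b = new StringBuilder();
        while (i > 0 && j > 0 && h[i][j] > 0) {
            int sub = s.charAt(i - 1) == t.charAt(j - 1) ? MATCH : MISMATCH;
            if (h[i][j] == h[i - 1][j - 1] + sub) {
                a.append(s.charAt(--i)); b.append(t.charAt(--j));
            } else if (h[i][j] == h[i - 1][j] + GAP) {
                a.append(s.charAt(--i)); b.append('-');
            } else {
                a.append('-'); b.append(t.charAt(--j));
            }
        }
        return a.reverse() + " / " + b.reverse();
    }
}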

Storing arrangements of the same word efficiently

I have a question related to dictionary storage.
I have been reading about the Trie data structure, and so far I have read that it works pretty well as a prefix tree. But I came to tries in an effort to see whether they can efficiently reduce the storage of different arrangements of the same letters.
For example, the words "ANT", "TAN" and "NAT" have the same letters, but a trie goes on to create separate paths for these words. I can understand that a trie is meant for prefix storage and reducing redundancy, but can anyone help me in reducing the redundancy here?
One way I was thinking of was to change the behavior of the trie so that each node has a 'word complete' status; in addition, if I add a 'word start' status too, I can make this work as below:
A
N - A - T
T - A - N
Now, every time, I can check whether a word starts here and follow it to the end.
Does this make sense? And is it feasible?
Or is there any better method to do this?
Thanks
You can use 2 tries, the second one storing each word reversed. Then you can use a wildcard anywhere in the search: for example, you can split the search word into 2 halves and search for one half by its prefix and the other half by its suffix: http://phpir.com/tries-and-wildcards/. When you intersect the 2 result sets, you can search efficiently with a wildcard.
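A rough sketch of that idea, with sorted sets standing in for the two tries (a prefix lookup in a trie corresponds to a range scan over a sorted set; all names are illustrative):

import java.util.*;

// Forward set answers prefix queries; the reversed set answers suffix
// queries. Intersecting the two answers <prefix>*<suffix> wildcards.
class TwoTries {
    private final NavigableSet<String> forward = new TreeSet<>();
    private final NavigableSet<String> backward = new TreeSet<>();

    void add(String word) {
        forward.add(word);
        backward.add(new StringBuilder(word).reverse().toString());
    }

    // All words matching <prefix>*<suffix>.
    Set<String> wildcard(String prefix, String suffix) {
        Set<String> startsWith = new TreeSet<>(
                forward.subSet(prefix, prefix + Character.MAX_VALUE));
        String rev = new StringBuilder(suffix).reverse().toString();
        Set<String> endsWith = new TreeSet<>();
        for (String r : backward.subSet(rev, rev + Character.MAX_VALUE))
            endsWith.add(new StringBuilder(r).reverse().toString());
        startsWith.retainAll(endsWith); // both halves must match
        startsWith.removeIf(w -> w.length() < prefix.length() + suffix.length());
        return startsWith;
    }
}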
If you add a status field to each node, you will increase the memory cost of your tree (assuming 8-bit chars) by a possibly not insignificant amount.
I understand that you want to reduce the number of letters in the DS, but you have to consider what happens if some contents are subsets of other contents, e.g. how ANTAN would be represented. Think about the minimal number of chars (128) as nodes of a fully connected graph. Obviously all words are stored in this graph, yet it is not suitable for storing any specific words: there is no way of telling where words end. The information stored in a trie is not just letters, but complete and properly terminated words.
If you add a marker as you suggest, how will you be able to encode this: SUPERCHARGED, SUPER, PERCH? You would set word_start at S and P and word_end at R and H. How would you know that SUPERCH and PER are not contained? You could instead use a non-zero label and assign number pairs to the beginnings and ends of words: S:1 P:2 R:1 H:2. To make sure that a start and an end can occur at the same letter, you would have to use specific bits as labels.
You could then use NATANT as a minimal flat representation, with N:001 A:000 T:011 A:100 N:010 T:100. This requires #words bits for the marker in the worst case: A, AA, AAA, ... If you stored that in a tree, however, you would have to look for the matching marker, which is not an operation supported by trees. So I see no good way of using a marker.
From an information-theoretical point of view, I think the critical issue here is to properly encode the length, ordering and contents of a word in a way that is unique for each possible combination of these.
I originally meant to just comment, but it got a bit lengthy. I am not sure if this answers your question, but I hope it helps.
Are you hoping that any search for "ant" also brings up "tan" and "nat"?
If so, then use a TrieMap, always sort keys before reads/writes, and map each key to a container of all words in that "anagram class."
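A hedged sketch of that mapping, with a plain TreeMap standing in for the TrieMap; the principle of sorting the letters to form the key is the same:

import java.util.*;

// Group words by their sorted letters: "ANT", "TAN" and "NAT" all map
// to the key "ant", so each anagram class is keyed exactly once.
class AnagramIndex {
    private final Map<String, Set<String>> classes = new TreeMap<>();

    private static String key(String word) {
        char[] c = word.toLowerCase().toCharArray();
        Arrays.sort(c);
        return new String(c);
    }

    void add(String word) {
        classes.computeIfAbsent(key(word), k -> new TreeSet<>()).add(word);
    }

    Set<String> anagramsOf(String word) {
        return classes.getOrDefault(key(word), Collections.emptySet());
    }
}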
If you're just looking for ideas to reduce the space overhead of using a trie, then look no further: I've found burst tries to be very space-efficient. I wrote my own burst trie in Scala that also reuses some ideas that I found in GWT's trie implementation.

Cleaning doubles out of a massive word list

I have a word list which is 56GB and I would like to remove duplicates.
I've tried to approach this in Java, but I run out of space on my laptop after 2.5M words.
So I'm looking for an (online) program or algorithm which would allow me to remove all duplicates.
Thanks in advance,
Sir Troll
edit:
What I did in Java was put the words in a TreeSet so they would be ordered and duplicates removed.
I think the problem here is the huge amount of data. As a first step, I would try to split the data into several files: e.g. make a file for every first character, where you put words whose first character is 'a' into a.txt, words whose first character is 'b' into b.txt, and so on.
a.txt
b.txt
c.txt
...
Afterwards I would try using default sorting algorithms and check whether they work with the size of the files. After sorting, cleaning out duplicates should be easy (see the sketch after the list below).
If the files remain too big, you can also split using more than one character, e.g.:
aa.txt
ab.txt
ac.txt
...
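A minimal sketch of the whole split-then-dedupe approach (the file names and the single-character split are illustrative, and each bucket is assumed to fit in memory after splitting):

import java.io.*;
import java.nio.file.*;
import java.util.*;

// Pass 1: bucket words by first character so no single file is huge.
// Pass 2: dedupe each bucket independently with an in-memory TreeSet.
class ExternalDedupe {
    public static void main(String[] args) throws IOException {
        Map<Character, PrintWriter> buckets = new HashMap<>();
        try (BufferedReader in = Files.newBufferedReader(Paths.get("words.txt"))) {
            String word;
            while ((word = in.readLine()) != null) {
                char c = word.isEmpty() ? '_' : word.charAt(0);
                buckets.computeIfAbsent(c, k -> open(k + ".txt")).println(word);
            }
        }
        buckets.values().forEach(PrintWriter::close);

        try (PrintWriter out = new PrintWriter(new FileWriter("unique.txt"))) {
            for (char c : buckets.keySet()) {
                Set<String> unique = new TreeSet<>(Files.readAllLines(Paths.get(c + ".txt")));
                unique.forEach(out::println);
            }
        }
    }

    static PrintWriter open(String name) {
        try { return new PrintWriter(new BufferedWriter(new FileWriter(name))); }
        catch (IOException e) { throw new UncheckedIOException(e); }
    }
}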
Frameworks like MapReduce or Hadoop are perfect for such tasks. You'll need to write your own map and reduce functions, although I'm sure this must've been done before. A quick search on Stack Overflow gave this
I suggest you use a Bloom Filter for this.
For each word, check whether it's already present in the filter; otherwise, insert it (or, rather, some good hash values of it).
It should be fairly efficient, and you shouldn't need to provide it with more than a gigabyte or two for it to have practically no false positives. I leave it to you to work out the math.
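A toy sketch of such a filter with two hash values derived per word; the bit-array size and hash choices are illustrative, not tuned for a 56GB input:

import java.util.*;

// Approximate duplicate detection: a word is dropped if the filter says
// it has been seen. False positives (rarely dropping a unique word) are
// possible; false negatives are not.
class BloomDedupe {
    private final BitSet bits;
    private final int size;

    BloomDedupe(int bitCount) { size = bitCount; bits = new BitSet(bitCount); }

    // Returns true if the word was (probably) seen before; records it either way.
    boolean seenBefore(String word) {
        int h1 = Math.floorMod(word.hashCode(), size);
        int h2 = Math.floorMod(word.hashCode() * 31 + word.length(), size);
        boolean seen = bits.get(h1) && bits.get(h2);
        bits.set(h1);
        bits.set(h2);
        return seen;
    }
}

Streaming the word list through seenBefore and writing out only unseen words then takes a single pass, with memory proportional to the bit array rather than to the input.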
I do like the divide-and-conquer comments here, but I have to admit: if you're running into trouble with 2.5M words, something's going wrong with your original approach. Even if we assume each word is unique within those 2.5M (which basically rules out that what we're talking about is a text in a natural language), and assuming each word is on average 100 Unicode characters long, we're at 500MB for storing the unique strings, plus some overhead for storing the set structure. Meaning: you should be doing really fine, since those numbers are totally overestimated already. Maybe before installing Hadoop, you could try increasing your heap size?

What algorithm would allow building optimal "groups" of terms?

I have a table of data and I want to pull specific records. The records are indicated in various, nigh-random ways (how isn't important), but I want to be able to identify them using 11 specific terms. Essentially, I'm being given a lot of queries against non-indexed fields and having to rewrite them using specific indexed fields -- except, thanks to an Enterprisey System, it's not as simple as that: the data has to be packaged in a certain way that avoids touching SQL directly.
It might be easier to give an example in 2-dimensions, although the problem itself uses 11 that will probably change:
  123
 +---+
A|X O|
B| X |
C|X O|
 +---+
If I wanted to group all the X's in the above grid, I could say: A1 and B2 and C1. Better would be (A,C)1 and B2. Even better would be (A,B,C)(1,2) -- empty spaces can be included or excluded for this problem; they don't matter. What's important is keeping the number of groups down, getting all the X's and avoiding all the O's.
To give a hint on sizing, the actual problem will generally deal with anywhere between 100 and 5000 "good" records. It is also not necessary to have The Ideal Answer -- a Pretty Good answer would suffice.
This sounds a lot like Karnaugh maps, with X=true, O=false, and blank="don't care".

How to spot and analyse similar patterns like Excel does?

You know the functionality in Excel where you type 3 rows with a certain pattern, drag the column all the way down, and Excel tries to continue the pattern for you?
For example
Type...
test-1
test-2
test-3
Excel will continue it with:
test-4
test-5
test-n...
The same works for some other patterns, such as dates and so on.
I'm trying to accomplish a similar thing, but I also want to handle more exceptional cases, such as:
test-blue-somethingelse
test-yellow-somethingelse
test-red-somethingelse
Now, based on these entries, I want to say that the pattern is:
test-[DYNAMIC]-somethingelse
Continuing the [DYNAMIC] with other colours is a whole other deal; I don't really care about that right now. I'm mostly interested in detecting the [DYNAMIC] parts in the pattern.
I need to detect this from a large pool of entries. Assume that you have 10,000 strings with these kinds of patterns, and you want to group the strings based on similarity and also detect which part of the text is constantly changing ([DYNAMIC]).
Document classification can be useful in this scenario but I'm not sure where to start.
UPDATE:
I forgot to mention that also it's possible to have multiple [DYNAMIC] patterns.
Such as:
test_[DYNAMIC]12[DYNAMIC2]
I don't think it's important, but I'm planning to implement this in .NET; any hint about the algorithms to use would be quite helpful.
As soon as you start considering finding the dynamic parts of patterns of the form <const1><dynamic1><const2><dynamic2>... without any other assumptions, you would need to find the longest common subsequence (LCS) of the sample strings you have provided. For example, if I have test-123-abc and test-48953-defg, then the LCS would be test- and -. The dynamic parts would then be the gaps between the pieces of the LCS. You could then look up your dynamic parts in an appropriate data structure.
The problem of finding the LCS of more than two strings is very expensive, and this would be the bottleneck of your problem. At the cost of accuracy, you can make the problem tractable; for example, you could perform LCS between all pairs of strings and group together sets of strings having similar LCS results. However, this means that some patterns would not be correctly identified.
Of course, all this can be avoided if you can impose further restrictions on your strings, like Excel does, which only seems to allow patterns of the form <const><dynamic>.
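For reference, a standard dynamic-programming sketch of the two-string LCS this answer builds on (unoptimized):

// Classic LCS table plus a walk-back to recover one longest common
// subsequence; the dynamic parts are the gaps between its pieces.
class Lcs {
    static String lcs(String a, String b) {
        int[][] len = new int[a.length() + 1][b.length() + 1];
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++)
                len[i][j] = a.charAt(i - 1) == b.charAt(j - 1)
                        ? len[i - 1][j - 1] + 1
                        : Math.max(len[i - 1][j], len[i][j - 1]);
        // Walk back through the table to recover one LCS.
        StringBuilder out = new StringBuilder();
        for (int i = a.length(), j = b.length(); i > 0 && j > 0; ) {
            if (a.charAt(i - 1) == b.charAt(j - 1)) { out.append(a.charAt(--i)); j--; }
            else if (len[i - 1][j] >= len[i][j - 1]) i--;
            else j--;
        }
        return out.reverse().toString();
    }
}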
Finding [dynamic] isn't that big of a deal; you can do it with 2 strings - just start at the beginning and stop when they stop being equal, do the same from the end, and voila - you've got your [dynamic].
Something like this (the pseudocode, cleaned up into runnable Java):
String s1 = "asdf-1-jkl";
String s2 = "asdf-2-jkl";
// Scan from the front until the strings first differ.
int s1I = 0, s2I = 0;
for (; s1I < s1.length() && s2I < s2.length(); s1I++, s2I++)
    if (s1.charAt(s1I) != s2.charAt(s2I))
        break;
// Scan from the back until the strings differ, without crossing the front scan.
int s1E = s1.length(), s2E = s2.length();
for (; s1E > s1I && s2E > s2I; s1E--, s2E--)
    if (s1.charAt(s1E - 1) != s2.charAt(s2E - 1))
        break;
String dyn1 = s1.substring(s1I, s1E); // "1"
String dyn2 = s2.substring(s2I, s2E); // "2"
About your 10k data sets: you would need to call this (or maybe a slightly more optimized version) on each pair to figure out your patterns (10k x 10k calls), and then sort the results by pattern (i.e. save the beginning and the ending and sort by these fields).
I think what you need is to compute something like the Levenshtein distance to find the groups of similar strings, and then, within each group of similar strings, identify the dynamic part with a typical diff-like algorithm.
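A textbook sketch of that distance computation; the threshold for "similar enough to group" would be application-specific:

// Two-row dynamic-programming Levenshtein distance, as a similarity
// measure for the grouping step.
class Levenshtein {
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] cur = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int subst = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                cur[j] = Math.min(subst, Math.min(prev[j] + 1, cur[j - 1] + 1));
            }
            int[] t = prev; prev = cur; cur = t;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        // Strings from the same pattern group should have a small distance.
        System.out.println(distance("test-blue-somethingelse", "test-red-somethingelse")); // 4
    }
}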
Google Docs might be better than Excel for this sort of thing, believe it or not.
Google has collected massive amounts of data on sets - for example, in the case you gave, it would recognise blue, red, yellow, ... as part of the set 'colours'. It has far more complete pattern recognition than Excel, so it would stand a better chance of continuing the pattern.
