Binary search tree for strings - binary-tree

I've been doing a bit of research and can't for the life of me find out whether this is possible. Is it possible to use a binary search tree for strings? The way I see it, if I were to use a binary search tree for strings, I'd have to represent those strings with numbers to make the comparisons valid. I know it's probably better to use a suffix tree, but if I were to use a binary search tree for strings, what would be the best method for comparing string values such as names? Thanks.

I think there is no way other than what you already said. The other option would be to decompose the string and use part of it as the key; this is very common in databases, although not very recommended.
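For what it's worth, most languages already compare strings lexicographically with their built-in comparison operators, so a BST can use the strings themselves as keys without any numeric encoding. A minimal Python sketch (names are illustrative):

```python
class Node:
    def __init__(self, key):
        self.key = key          # the string itself is the key
        self.left = None
        self.right = None

def insert(root, key):
    # Python's < and > compare strings lexicographically,
    # so no numeric encoding is needed.
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    elif key > root.key:
        root.right = insert(root.right, key)
    return root                  # duplicate keys are ignored

def contains(root, key):
    while root is not None:
        if key == root.key:
            return True
        root = root.left if key < root.key else root.right
    return False

root = None
for name in ["Mallory", "Alice", "Bob"]:
    root = insert(root, name)
```

A lookup costs O(h · m), where h is the tree height and m the key length, since each comparison may scan a shared prefix of the two strings.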

Related

data structure for finding the substring from large number of strings

My problem statement is that I am given millions of strings, and I have to find the strings that contain a given substring.
E.g., given "xyzoverflowasxs", "werstackweq", etc., searching for the substring "stack" should return "werstackweq". What kind of data structure can we use to solve this problem?
I think we can use a suffix tree for this, but I wanted some more suggestions.
I think the way to go is with a dictionary holding the actual words, and another data structure pointing to entries within this dictionary. One way to go would be with suffix trees and their variants, as mentioned in the question and the comments. I think the following is a far simpler (heuristic) alternative.
Say you choose some integer k. For each of your strings, computing the Rabin fingerprints of all its length-k substrings should be efficient and easy (implementations exist in any language).
So, for a given k, you could hold two data structures:
A dictionary of the words, say a hash table based on collision lists
A dictionary mapping each fingerprint to an array of the linked-list node pointers in the first data structure.
Given a query word of length k or greater, you would choose a length-k subword, calculate its Rabin fingerprint, find the stored words which contain this fingerprint, and check whether they indeed contain the query word.
The question is which k to use, and whether to use multiple values of k. I would decide this experimentally (starting with a few small values simultaneously, say 1, 2, and 3, plus a couple of larger ones). The performance of this heuristic will in any case depend on the distribution of your dictionary and queries.
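As a rough illustration of the two-structure scheme for a single k, here is a hedged Python sketch; it uses Python's built-in hash of each length-k substring as a stand-in for a proper rolling Rabin fingerprint, and a plain set of words instead of linked-list node pointers:

```python
from collections import defaultdict

K = 3  # fingerprint length; tune experimentally as described above

def build_index(words):
    # Map the fingerprint of each length-K substring to the words containing it.
    index = defaultdict(set)
    for word in words:
        for i in range(len(word) - K + 1):
            index[hash(word[i:i + K])].add(word)
    return index

def find_containing(index, query):
    # Fingerprint one length-K subword of the query, then verify candidates,
    # since distinct substrings can collide on the same fingerprint.
    assert len(query) >= K
    candidates = index.get(hash(query[:K]), set())
    return [w for w in candidates if query in w]

words = ["xyzoverflowasxs", "werstackweq"]
index = build_index(words)
```

The verification step in `find_containing` is what makes the heuristic safe: a fingerprint hit is only a candidate until the actual substring check passes.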

Given a prefix, how to find the most frequent words efficiently

This is an interview question extended from this one: http://www.geeksforgeeks.org/find-the-k-most-frequent-words-from-a-file/
But this question requires one more thing: a GIVEN PREFIX.
For example, given "Bl", return the most frequent words, such as "bloom, blame, bloomberg" etc.
So using a TRIE is a must. But then, how do you efficiently construct the heap? It's not right or practical to construct a heap for each prefix at run time. What could be a good solution?
[Suppose this TRIE or data structure is static, pre-built.]
Thanks!
Keep a trie of all the words appearing in the file, along with their counts. Now if you are asked to return all the words with the prefix "Bl", you can do that efficiently using the trie. But since you are interested in the most frequently occurring words with the prefix "Bl", you can then use a min-heap to produce the answer efficiently, just as in the post you linked to.
Also note that since the space usage of a trie can grow very large, you can suggest to the interviewer using a ternary search tree instead of a trie.
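Since the trie is static and pre-built, one way to avoid building a heap per prefix at run time is to precompute the top-k words under every trie node at build time, trading memory for O(len(prefix)) queries. A hedged Python sketch (names and the word counts are illustrative):

```python
from collections import defaultdict
import heapq

class TrieNode:
    def __init__(self):
        self.children = defaultdict(TrieNode)
        self.top = []            # precomputed (count, word) pairs, best first

def build(counts, k):
    # counts: {word: frequency}. Every node accumulates all words passing
    # through it, then keeps only its k most frequent ones.
    root = TrieNode()
    for word, c in counts.items():
        node = root
        for ch in word.lower():
            node = node.children[ch]
            node.top.append((c, word))
    def prune(node):
        node.top = heapq.nlargest(k, node.top)
        for child in node.children.values():
            prune(child)
    prune(root)
    return root

def top_k(root, prefix):
    # Walk down the prefix, then read off the precomputed answer.
    node = root
    for ch in prefix.lower():
        if ch not in node.children:
            return []
        node = node.children[ch]
    return [word for _, word in node.top]

root = build({"bloom": 20, "blame": 5, "bloomberg": 12, "bird": 50}, k=3)
```

The memory cost is k entries per node, which is exactly the trade-off that makes a ternary search tree (or pruning k aggressively) worth mentioning to the interviewer.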

Efficient Querying for common substrings

I have a representation of an object, for example
SubObjects: H1,H2,F1,F2
where each of the H and F tokens represents a specific smaller object. I want an easy query to find all the representations which have 3 of the sub-objects in common.
E.g., when I query for objects which have 3 parts of the string representation in common with H1,H2,F1,F2, both H1,H4,F1,F2 and H1,H2,F1,F5 would be returned.
The string position is important, so H2,H1,F1,F2 is different from H1,H2,F1,F2.
A brute-force plan of action is not possible, as I have thousands of such strings to compare. I was thinking of hacking around the problem with suffix trees.
Is there a more efficient data structure I can use to solve this problem?
As I stated in my question, I resorted to using suffix trees. Such trees let me query really rapidly for particular substrings and get back all the objects which contain that substring. I don't know if a better solution exists, but suffix trees worked well for my problem.
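For the fixed-slot representation in this question there is also a simpler alternative to suffix trees (a hedged sketch, not what the answer above used): an inverted index keyed on (position, sub-object) pairs, which respects the positional constraint and counts how many slots each stored object shares with the query:

```python
from collections import defaultdict

def build_index(objects):
    # objects: list of tuples like ("H1", "H2", "F1", "F2").
    # Index each (slot position, sub-object) pair, since position matters.
    index = defaultdict(list)
    for obj_id, obj in enumerate(objects):
        for pos, part in enumerate(obj):
            index[(pos, part)].append(obj_id)
    return index

def query(index, objects, probe, min_common=3):
    # Count shared (position, part) pairs per object; keep those >= threshold.
    scores = defaultdict(int)
    for pos, part in enumerate(probe):
        for obj_id in index.get((pos, part), ()):
            scores[obj_id] += 1
    return [objects[i] for i, s in scores.items() if s >= min_common]

objects = [("H1", "H4", "F1", "F2"),
           ("H1", "H2", "F1", "F5"),
           ("H2", "H1", "F1", "F2")]
index = build_index(objects)
```

Each query only touches the posting lists of the probe's own slots, so with thousands of objects this stays far cheaper than the brute-force pairwise comparison the asker wanted to avoid.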

Trie based addressbook and efficient search by name and contact number

It is a known approach to build an address book on a trie data structure, which is efficient for strings. Suppose we want an efficient search mechanism for an address book based on names, numbers, etc. What is the right data structure to enable memory-efficient and fast search on any type of search term, irrespective of data type?
This is a strange question, and maybe you should add more information, but you can use a trie data structure not only for strings but also for many other data types. A trie is essentially a dictionary built on an adjacency-tree model. I know of a kart-trie, which is similar to a trie but uses a binary tree model; so it is the same data structure with a different tree model. The kart-trie uses a clever key-alternating algorithm to hide a trie data structure in a binary tree. It's not a PATRICIA trie or a radix trie.
Good algorithm for managing configuration trees with wildcards?
http://code.dogmap.org/kart/
But I think a ternary tree would do the same trick:
http://en.wikipedia.org/wiki/Ternary_search_tree
http://igoro.com/archive/efficient-auto-complete-with-a-ternary-search-tree/
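A ternary search tree keeps the trie's character-by-character search while storing only three child pointers per node. A minimal Python sketch of insert and exact lookup (names are illustrative; prefix enumeration for autocomplete is omitted for brevity):

```python
class TSTNode:
    def __init__(self, ch):
        self.ch = ch
        self.lo = self.eq = self.hi = None   # three-way branching
        self.is_end = False                  # marks a complete key

def insert(node, key, i=0):
    ch = key[i]
    if node is None:
        node = TSTNode(ch)
    if ch < node.ch:
        node.lo = insert(node.lo, key, i)        # try this character lower
    elif ch > node.ch:
        node.hi = insert(node.hi, key, i)        # try this character higher
    elif i + 1 < len(key):
        node.eq = insert(node.eq, key, i + 1)    # match: advance to next char
    else:
        node.is_end = True
    return node

def contains(node, key, i=0):
    while node is not None:
        ch = key[i]
        if ch < node.ch:
            node = node.lo
        elif ch > node.ch:
            node = node.hi
        elif i + 1 < len(key):
            node = node.eq
            i += 1
        else:
            return node.is_end
    return False

root = None
for name in ["alice", "bob", "ali"]:
    root = insert(root, name)
```

Compared with an array-based trie node (one pointer per possible character), each TST node holds exactly three pointers, which is where the memory saving comes from.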

Find all (english word) substrings of a given string

This is an interview question: find all (English word) substrings of a given string (e.g. "every" contains "every", "ever", and "very").
Obviously, we can loop over all substrings and check each one against an English dictionary, organized as a set. I believe the dictionary is small enough to fit in RAM. How should we organize the dictionary? As far as I remember, the original spell command loaded the words file into a bitmap representing the set of word hash values. I would start from that.
Another solution is a trie built from the dictionary. Using the trie we can loop over all the string's characters and follow the trie character by character. I guess the complexity of this solution would be the same in the worst case (O(n^2)).
Does it make sense? Would you suggest other solutions?
Use the Aho-Corasick string matching algorithm, which "constructs a finite state machine that resembles a trie with additional links between the various internal nodes."
But all things considered, "build a trie from the English dictionary and do a simultaneous search on it for all suffixes of the given string" should be pretty good for an interview.
I'm not sure a trie will easily match subwords that begin in the middle of the string.
Another solution with a similar concept is to use a state machine or a regular expression.
The regular expression is just word1|word2|...
I'm not sure whether standard regular expression engines can handle an expression covering the whole English language, but it shouldn't be hard to build the equivalent state machine given the dictionary.
Once the regular expression is compiled / the state machine is built, the complexity of analyzing a specific string is O(n).
The first solution can be refined to use a different hash map for each word length (to reduce collisions), but other than that I can't think of anything significantly better.
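The "build a trie from the dictionary and search it from every suffix of the input" idea above can be sketched as follows (a hedged Python sketch; the tiny word list is illustrative, a real dictionary would be loaded from a file):

```python
def build_trie(words):
    # Nested dicts as trie nodes; "$" marks the end of a complete word.
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = True
    return root

def word_substrings(text, trie):
    # From every start index, follow the trie as far as it matches;
    # each end-of-word marker hit along the way is a dictionary substring.
    found = set()
    for i in range(len(text)):
        node = trie
        for j in range(i, len(text)):
            if text[j] not in node:
                break
            node = node[text[j]]
            if "$" in node:
                found.add(text[i:j + 1])
    return found

trie = build_trie(["every", "ever", "very"])
```

Each inner walk stops as soon as no dictionary word continues the current path, so in practice this tends to be much cheaper than the O(n^2) worst case, though Aho-Corasick gives the stronger guarantee.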
