Efficient Querying for common substrings

Efficient Querying for common substrings - algorithm

I have a representation of an object being for example
SubObjects: H1,H2,F1,F2
where each of the H anf F represent a specific smaller object. I wish to query easily to check all the representations which have 3 of the subobject in common
eg H1,H4,F1,F2 would be returned back, even H1,H2,F1,F5. when i query for Objects which have 3 parts of the string representation in common for H1,H2,F1,F2.
The string position is important therefore H2,H1,F1,F2 is different from H1,H2,F1,F2.
A brute force plan of action is not possible as I have thousands of such strings to compare. Was thinking of some way hacking round the problem by the use of suffix trees.
Is there any more efficient data structure which i can use to solve the problem?

As i stated in my question i resorted to use suffix trees. Such trees could let me query the tree really rapidly for particular substrings and get back all the objects which contain that particular substring. I dont know if a better solution exists but suffix trees worked well for my problem.
suffix trees:

Related

What would be a good data structure to store a dictionary(of words) to optimize the search time?

Provided a list of valid words, and a search word, I want to find whether the search word is a valid word or not ALLOWING 2 typo characters.
What would be a good data structure to store a dictionary of words(assumingly it contains a million words) and algorithm to find whether the word exists in the dictionary(allowing 2 typo characters).
If no typo characters where allowed, then a Trie would be a good way to store the words but not sure if it stays the best way to store dictionary when typos are allowed. Not sure what the complexity for a backtracking algorithm(to search for a word in Trie allowing 2 typos) would be. Any idea about it?

You might want to checkout Directed Acyclic Word Graph or DAWG. It has more of an automata structure than a tree of graph structure. Multiple possibilities from one place may provide you with your solution.

If there is no need to also store all mistyped words I would consider to use a two-step approach for this problem.
Build a set containing hashes of all valid words (not including
typos). So probably we are talking here about some 10.000 entries,
which should still allow quite fast lookups with a binary search. If
the hash of a word is found in the set it is typed correctly.
If a words hash is not found in the set the word is probably
mistyped. So calculate a the Damerau-Levenshtein distance between
the word and all known words to figure out what the user might have
meant. To gain some performance here modify the DL-algorithm to
abort calculation if the distance gets bigger than your allowed
threshold of 2 typos.

Hash-maps or search tree?

The problem is as follows: Given is a list of cities and their countries, population and geo-coordinates. You should read this data, save it and answer it in an endless loop of the following type:
Request: a prefix (e.g., free).
Answer: all states beginning with this prefix ("case-insensitive")
and their associated data (country + population + geo-coordinates).
The cities should be sorted by population (highest population first).
Which data structure are the most suitable for the described problem ?
First Part : My Thoughts are hanging between Trie and Hashmap. Although i tend to the Trie more because i'm dealing with prefix requests , and Trie is basically according to Wikipedia :
"a trie, also called digital tree and sometimes radix tree or prefix tree (as they can be searched by prefixes), is a kind of search tree—an ordered tree data structure that is used to store a dynamic set or associative array where the keys are usually strings".
in addition to that in terms of Storage and reading data Trie has the advantage over Hash-maps.
Second part: returning the sorted cities by population would be a little bit challenging when we speak about Time Complexity.If i'm thinking in the right direction i should save the values of the keys as lists and it will be easier to sort just the returning list , so i don't have to save it sorted to save some times.
Please share you thoughts and correct me if i'm wrong .

There are pros of cons of picking vanilla tries and vanilla hashmaps. In general, for autocomplete systems, the structure of a trie is extremely useful because you're usually searching for prefixes and the user would like to see the words that begin with the string that they have just entered.
However, there is a method to make the best use of both of these data structures, it is called a Hash Trie (implementation: http://www.sanfoundry.com/java-program-implement-hash-trie/). So the way you would implement this is by using the structure of the trie, but the final node is the actual string it refers to. In python, this is done using dictionaries instead of lists while implementing the trie.
For the second half of the question, a list would be your best bet, in essence a list of tuples (population, city) and sort by the population and return the cities. Regarding it being "easier" to sort, I'm not sure if I agree with this, easy is a relevant term and there's really no way of saying that it's easier than, maybe storing it in a tree and then returning the Pre-Order Traversal of the tree. Essentially, if you're using comparison based sort, it won't get better than nlog (n).

data structure for finding the substring from large number of strings

My problem statement is that I am given millions of strings, and I have to find one sub-string which can be present in any of those strings.
e.g. given is "xyzoverflowasxs, werstackweq" etc. and I have to find a given sub string named as "stack", which should return "werstackweq". What kind of data structure we can use for solving this problem ?
I think we can use suffix tree for this , but wanted some more suggestions for this problem.

I think the way to go is with a dictionary holding the actual words, and another data structure pointing to entries within this dictionary. One way to go would be with suffix trees and their variants, as mentioned in the question and the comments. I think the following is a far simpler (heuristic) alternative.
Say you choose some integer k. For each of your strings, finding the k Rabin Fingerprints of length-k within each string should be efficient and easy (any language has an implementation).
So, for a given k, you could hold two data structures:
A dictionary of the words, say a hash table based on collision lists
A dictionary mapping each fingerprint to an array of the linked-list node pointers in the first data structure.
Given a word of length k or greater, you would choose a k subword, calculate its Rabin fingerprint, find the words which contain this fingerprint, and check if they indeed contain this word.
The question is which k to use, and whether to use multiple such k. I would try this experimentally (starting with simultaneously a few small k values for, say, 1, 2, and 3, and also a couple of larger ones). The performance of this heuristic anyway depends on the distribution of your dictionary and queries.

what are some other possible use cases of a Trie data structure other than T9/Spell checker dictionaries?

I am trying to understand Trie data structure & I've understood that they are used in Spell checkers/Auto suggest or correct spellings etc. i.e. especially used in the context of language dictionaries. I wonder if there are any other possible use cases for a Trie data structure (as it is or in any augmented form).
Thanks for advance.
PS: This is not a homework problem & I am here trying to better understand possible usecases for a Trie data structure and that's it.

Tries are integral in routing systems.
Most routers stores IP address in a form of a trie (Patricia Trees) which are well suited for lookups etc.
Tries are useful as a lookup structure where you are dealing with strings (of bytes/bits etc).
Suffix trees are essentially tries and have wide string related applications, like substring checks, finding repeated substrings, palindrome finding etc.
Here are a couple of algorithm puzzles for you to try out.
Given an nxn binary matrix (of zeroes and ones), eliminate the duplicate rows.
Given n numbers, find two numbers x,y among those such that x XOR y (the exclusive OR) is maximum among all the n^2 possibilities.

Binary search tree for strings

I've been doing a bit of research and can't for the life in me find out if this is possible. Is it possible to use a binary search tree for strings? The way I see it is, if I was to use a binary search tree for strings I'd have to represent those strings with numbers to validate the comparing. I know it's probably better to use a Suffix tree, but if I was to use a binary search tree for strings, what would be the best method for comparing string values such as names? Thanks.

i think there is no other way besides what you already said, the other way would be to decompose the string and use part of the string as a key, this is very common in databases, althought not very recommended.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

Efficient Querying for common substrings - algorithm

Related

What would be a good data structure to store a dictionary(of words) to optimize the search time?

Hash-maps or search tree?

data structure for finding the substring from large number of strings

what are some other possible use cases of a Trie data structure other than T9/Spell checker dictionaries?

Binary search tree for strings

Categories

Resources