Data structure for finding a substring in a large number of strings - algorithm

My problem is this: I am given millions of strings, and I have to find a substring that may be present in any of those strings.
e.g. given "xyzoverflowasxs, werstackweq" etc., and a given substring such as "stack", the search should return "werstackweq". What kind of data structure can we use to solve this problem?
I think we can use a suffix tree for this, but I wanted some more suggestions for this problem.

I think the way to go is with a dictionary holding the actual words, and another data structure pointing to entries within this dictionary. One option would be suffix trees and their variants, as mentioned in the question and the comments. I think the following is a far simpler (heuristic) alternative.
Say you choose some integer k. For each of your strings, computing the Rabin fingerprints of all its length-k substrings should be efficient and easy (rolling-hash implementations exist in practically any language).
So, for a given k, you could hold two data structures:
A dictionary of the words, say a hash table based on collision lists
A dictionary mapping each fingerprint to an array of the linked-list node pointers in the first data structure.
Given a word of length k or greater, you would choose a length-k subword, calculate its Rabin fingerprint, find the words which contain this fingerprint, and check whether they indeed contain the queried word.
The question is which k to use, and whether to use multiple such k. I would decide this experimentally (start with a few small values of k simultaneously, say 1, 2, and 3, plus a couple of larger ones). The performance of this heuristic depends on the distribution of your dictionary and your queries anyway.
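To make the heuristic concrete, here is a minimal Python sketch (my own illustration: the names build_index and query are invented, it uses a plain polynomial rolling hash rather than a true Rabin fingerprint, and it stores string indices instead of linked-list node pointers):

BASE, MOD = 257, (1 << 61) - 1   # arbitrary rolling-hash parameters for this sketch

def fingerprints(s, k):
    # Yield (position, hash) for every length-k substring of s.
    if len(s) < k:
        return
    top = pow(BASE, k - 1, MOD)          # weight of the outgoing character
    h = 0
    for ch in s[:k]:
        h = (h * BASE + ord(ch)) % MOD
    yield 0, h
    for i in range(k, len(s)):
        h = ((h - ord(s[i - k]) * top) * BASE + ord(s[i])) % MOD
        yield i - k + 1, h

def build_index(strings, k):
    # Map each fingerprint to the set of indices of strings containing it.
    index = {}
    for idx, s in enumerate(strings):
        for _, h in fingerprints(s, k):
            index.setdefault(h, set()).add(idx)
    return index

def query(strings, index, word, k):
    # Fingerprint one length-k subword, then verify the candidate strings.
    _, h = next(fingerprints(word[:k], k))
    return [strings[i] for i in index.get(h, ()) if word in strings[i]]

# strings = ["xyzoverflowasxs", "werstackweq"]
# query(strings, build_index(strings, 3), "stack", 3)  ->  ["werstackweq"]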

Related

What is the fastest way to lookup an item from a small set of items by key?

Say I have a class with a fields array. The fields each have a name. Basically, like a SQL table.
class X {
foo: String
bar: String
...
}
What is the way to construct a data structure and algorithm to fetch a field by key such that it is (a) fast, in terms of number of operations, and (b) minimal, in terms of memory / data-structure size?
Obviously if you know the index of the field the fastest would be to lookup the field by index in the array. But I need to find these by key.
Now, the number of keys will be relatively small for each class. In this example there are only 2 keys/fields.
One way to do this would be to create a hash table, such as this one in JS. You give it the key, and it iterates through each character in the key and runs it through some mixing function. But this is, for one thing, dependent on the size of the key. That's not too bad for the kinds of field names I am expecting, which shouldn't be too large; let's say they usually aren't longer than 100 characters.
Another way to do this would be to create a trie. You first have to build the trie; then, on lookup, each node of the trie holds one character, so it would take name.length steps to find the field.
But I'm wondering: since the number of fields will be small, why do we need to iterate over the characters of the key string at all? A possibly simpler approach, as long as the number of fields is small, is to just iterate through the fields and do a direct string match against each field name.
But all of these 3 techniques would be roughly the same in terms of number of iterations.
Is there any other type of magic that will give you the fewest number of iterations/steps?
It seems there could be a hashing algorithm that takes advantage of the fact that the number of items in the hash table will be small. You would create a new hash table for each class, giving it a "size" (the number of fields on the specific class used for this hash table). Maybe it could somehow use this size information to construct a simple hashing algorithm that minimizes the number of iterations.
Is anything like that possible? If so, how would you do it? If not, it would be interesting to know why it's not possible to do any better than these.
How "small" is the field list?
If you keep the field list sorted by key, you can use binary search.
For a very small number of fields (e.g. 4) it will perform about the same number of iterations and key comparisons as linear search, if you consider the worst case of linear search. (Linear search would be very efficient, in speed and memory, for this case.)
To beat the average case of linear search, you'd need more fields (e.g. 8).
This is as memory-efficient as your linear-search solution, and more memory-efficient than the trie solution.
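For illustration, here is a minimal Python sketch of the sorted-field-list idea (the field names and layout are my own example, not taken from the question):

from bisect import bisect_left

# Field names sorted by key, each paired with its position in the values array.
FIELDS = sorted([("bar", 1), ("foo", 0)])
KEYS = [name for name, _ in FIELDS]

def get(values, name):
    # Binary-search the sorted key list, then index directly into the values array.
    i = bisect_left(KEYS, name)
    if i < len(KEYS) and KEYS[i] == name:
        return values[FIELDS[i][1]]
    raise KeyError(name)

# get(["foo value", "bar value"], "bar")  ->  "bar value"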

Hash-maps or search tree?

The problem is as follows: Given is a list of cities with their countries, populations and geo-coordinates. You should read this data, store it, and then answer requests in an endless loop of the following type:
Request: a prefix (e.g., free).
Answer: all cities beginning with this prefix (case-insensitive)
and their associated data (country + population + geo-coordinates).
The cities should be sorted by population (highest population first).
Which data structures are most suitable for the described problem?
First part: my thoughts are hanging between a trie and a hash map, although I tend more towards the trie because I'm dealing with prefix requests, and a trie is, according to Wikipedia, basically:
"a trie, also called digital tree and sometimes radix tree or prefix tree (as they can be searched by prefixes), is a kind of search tree—an ordered tree data structure that is used to store a dynamic set or associative array where the keys are usually strings".
In addition, in terms of storing and reading the data, the trie has the advantage over hash maps.
Second part: returning the cities sorted by population would be a little more challenging when we speak about time complexity. If I'm thinking in the right direction, I should store the values of the keys as lists, so it will be easier to sort just the returned list; that way I don't have to keep the data sorted, which saves some time.
Please share your thoughts and correct me if I'm wrong.
There are pros and cons to picking vanilla tries and vanilla hash maps. In general, for autocomplete systems, the structure of a trie is extremely useful, because you're usually searching for prefixes and the user would like to see the words that begin with the string they have just entered.
However, there is a way to make the best use of both of these data structures; it is called a hash trie (implementation: http://www.sanfoundry.com/java-program-implement-hash-trie/). You implement it by using the structure of the trie, but the final node holds the actual string it refers to. In Python, this is done by using dictionaries instead of lists while implementing the trie.
For the second half of the question, a list would be your best bet: in essence, a list of tuples (population, city); sort by the population and return the cities. Regarding it being "easier" to sort, I'm not sure I agree; easy is a relative term, and there's really no way of saying that it's easier than, say, storing it in a tree and returning the pre-order traversal of the tree. Essentially, if you're using a comparison-based sort, it won't get better than O(n log n).
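As an illustration of the dictionary-based trie described above, here is a rough Python sketch (the City record, the "$cities" marker key and the city data are invented for the example; case-insensitivity is handled by lower-casing):

from collections import namedtuple

City = namedtuple("City", "name country population coords")

def insert(root, city):
    node = root
    for ch in city.name.lower():
        node = node.setdefault(ch, {})
    node.setdefault("$cities", []).append(city)     # records live under a marker key

def prefix_query(root, prefix):
    node = root
    for ch in prefix.lower():
        if ch not in node:
            return []
        node = node[ch]
    found, stack = [], [node]
    while stack:                                     # collect every city below the prefix node
        n = stack.pop()
        found.extend(n.get("$cities", []))
        stack.extend(child for key, child in n.items() if key != "$cities")
    return sorted(found, key=lambda c: c.population, reverse=True)

# root = {}
# insert(root, City("Freetown", "SL", 1_055_000, (8.48, -13.23)))   # illustrative data
# insert(root, City("Freeport", "BS", 45_000, (26.53, -78.70)))
# prefix_query(root, "free")  ->  [Freetown, Freeport], highest population first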

what are some other possible use cases of a Trie data structure other than T9/Spell checker dictionaries?

I am trying to understand the trie data structure, and I understand that tries are used in spell checkers, auto-suggest and auto-correct, etc., i.e. especially in the context of language dictionaries. I wonder if there are any other possible use cases for a trie data structure (as it is, or in any augmented form).
Thanks in advance.
PS: This is not a homework problem; I am just trying to better understand possible use cases for a trie data structure, and that's it.
Tries are integral in routing systems.
Most routers store IP addresses in a form of trie (Patricia trees), which is well suited for lookups, etc.
Tries are useful as a lookup structure wherever you are dealing with strings (of bytes, bits, etc.).
Suffix trees are essentially tries and have wide string-related applications, like substring checks, finding repeated substrings, palindrome finding, etc.
Here are a couple of algorithm puzzles for you to try out.
Given an nxn binary matrix (of zeroes and ones), eliminate the duplicate rows.
Given n numbers, find two numbers x, y among them such that x XOR y (the exclusive OR) is maximum among all the n^2 possibilities. (A trie over the bits helps here; see the sketch below.)
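For the second puzzle, here is a rough Python sketch of the standard bitwise-trie approach (my own illustration; it assumes non-negative integers that fit in 32 bits):

BITS = 31   # assumption: non-negative 32-bit integers

def max_xor_pair(nums):
    # Insert every number into a binary trie, one bit per level.
    root = {}
    for x in nums:
        node = root
        for i in range(BITS, -1, -1):
            node = node.setdefault((x >> i) & 1, {})
    # For each number, greedily walk towards the opposite bit at every level.
    best = 0
    for x in nums:
        node, cur = root, 0
        for i in range(BITS, -1, -1):
            b = (x >> i) & 1
            if (1 - b) in node:
                cur |= 1 << i
                node = node[1 - b]
            else:
                node = node[b]
        best = max(best, cur)
    return best

# max_xor_pair([3, 10, 5, 25, 2, 8])  ->  28  (5 XOR 25)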

What's the best way to traverse a large dictionary of words?

Let's say I'm looking for a word that may or may not be in a dictionary of 95k words, and I cannot use the word length to facilitate the search. My question is about the fastest way to find the word without doing an O(n) lookup.
Here are my two thoughts:
First, store the words in a hash table; lookup of a word is O(1). This seems the best scenario in my mind, but going through different websites, using a trie was also suggested. My question regarding this is whether it's practical to have a trie that holds so many words.
The lookup would be O(k) in this case.
So what is the most optimal way of finding a word in a large dictionary?
Optimality depends on your use case: do you care about lookup time or space? (Also, do you care about inserting new words?)
The best you can do time-wise is to use a hash table, but for a dictionary it is space-inefficient. A trie compresses the space requirement because it stores shared prefixes rather than each entire word, but it takes longer to look up. So, to answer your question, it is more space-efficient to have a trie with a large number of words than a hash table.
If you are just searching for a single word, the cost of setting up a hash table or tree structure would exceed a linear search. These structures become (very) efficient when their costs are amortized over (very) many uses.
If the dictionary is sorted (and why wouldn't a dictionary be?), then you can look for a single word in O(log n) time with a binary search through the file; no additional structures are needed.
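A minimal sketch of that binary search in Python (it assumes the word list has been read into memory and sorted; the file name is illustrative):

from bisect import bisect_left

def contains(sorted_words, word):
    # O(log n) membership test on a sorted list of words.
    i = bisect_left(sorted_words, word)
    return i < len(sorted_words) and sorted_words[i] == word

# with open("words.txt") as f:                    # illustrative file name
#     words = sorted(line.strip() for line in f)
# contains(words, "stack")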
I think the best way to find a word in a dictionary is a B+ tree, and let me explain the reason.
Let's say you have a root block of 10 strings. The strings in the block are sorted. Each of these 10 strings is followed by a pointer to another cell of 10 strings, and so on. So the only thing you have to do is string-compare your key word against them, starting from the first one, until you find a word that compares smaller.
If we take it as standard that each string has, next to it, a pointer to a cell with words that compare smaller, it will take you 5 steps and 5 comparisons to reach the final block of data that may or may not contain your key word.
In 5 comparisons, plus the comparisons inside the final block, you are searching a dictionary of 10*10*10*10*10 = 100,000 words.
The algorithm is logarithmic: log of 100,000 with the number of strings per cell as the base. If each cell has 10 words, you need 5 steps.
I must mention that only the root of the tree needs to be stored in RAM; all the other blocks can be stored on the hard drive without significant loss in performance, because of the few steps needed.
Hope I explained it right :D At least I tried! Have fun.
A trie is preferable because this data structure can be faster than a hash table. Hash tables are O(1) only in the ideal case; in real-world applications collisions can occur. The various trie data structures don't suffer from this.
Another point is compactness. Tries are much more compact than hash tables. A hash table requires some spare space for efficient insert operations; if the load factor of the hash table is close to 100%, insert operations take a very long time.
With hash tables you must compare your key with at least one key from the dictionary, and a key comparison in this case takes O(k), where k is the key length. With a trie you are doing essentially the same work, so your lookup operation is also O(k).
Tries allow ordered traversal; hash tables don't.
There are many types of tries out there; for example, a ternary search trie is very good in this particular case. Array-mapped tries are also very fast compared to a regular hash table.
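For comparison, here is a minimal dictionary-of-dictionaries trie in Python showing the O(k) lookup referred to above (the "$end" marker key is an arbitrary choice for this sketch):

END = "$end"    # marks the end of a stored word

def add(root, word):
    node = root
    for ch in word:
        node = node.setdefault(ch, {})
    node[END] = True

def lookup(root, word):
    # O(k) in the length of the word, independent of how many words are stored.
    node = root
    for ch in word:
        node = node.get(ch)
        if node is None:
            return False
    return END in node

# root = {}; add(root, "stack"); lookup(root, "stack")  ->  True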

Data structure for range query

I was recently asked a coding question on the below problem.
I have some solutions to this problem, but I am not sure they are the most efficient.
Problem:
Write a program to track a set of text ranges. Start and end points will be strings.
Text range example : [AbA-Ef]
Aa would fall before this range
AB would fall inside this range
etc.
String comparison would be like 'A' < 'a' < 'B' < 'b' ... 'Z' < 'z'
We need to support the following operations on these ranges:
Add range - this should merge the ranges if applicable
Delete range - this deletes a range from the tracked ranges and recomputes the ranges
Query range - given a character, the function should return whether it is part of any of the tracked ranges or not.
Note that the tracked ranges can be discontinuous.
My solutions:
I came up with two approaches.
Store the ranges as a doubly linked list, or
Store the ranges in some sort of balanced tree, with the leaf nodes holding the actual data and interconnected as a linked list.
Do you think these solutions are good enough, or can you think of a better way of doing this so that those three APIs give the best performance?
You are probably looking for an interval tree.
Use that data structure with your custom comparator to answer "what's in range", and you will be able to do the required operations efficiently.
Note that an interval tree is actually an efficient way to implement your second idea (storing ranges in some sort of balanced tree).
I'm not clear on what the "delete range" operation is supposed to do. Does it:
Delete a previously inserted range, and recompute the merge of the remaining ranges?
Stop tracking the deleted range, regardless of how many times parts of it have been added?
That doesn't make a huge difference algorithmically; it's just bookkeeping. But it's important to clarify. Also, are the ranges closed or half-open? (Another detail which doesn't affect the algorithm but does affect the implementation.)
The basic approach to this problem is to merge the tracked set into a sorted list of disjoint (non-overlapping) ranges; either as a vector or a binary search tree, or basically any structure which supports O(log n) searching.
One approach is to put both endpoints of every disjoint range into the data structure. To find out whether a target value is in a range, find the index of the smallest endpoint greater than the target. If the index is odd, the target is inside some range; if it is even, the target is outside.
Alternatively, index all the disjoint ranges by their start points; find the target by searching for the largest start point not greater than the target, and then compare the target with the associated end point.
I usually use the first approach with sorted vectors, which works well if (a) space utilization is important and (b) inserts and merges are relatively rare. With binary search trees, I go for the second approach. But they differ only in details and constants.
Merging and deleting are not difficult, but there are an annoying number of cases. You start by finding the ranges corresponding to the endpoints of the range to be inserted/deleted (using the standard find operation), remove all the ranges in between the two, and fiddle with the endpoints to correct the partially overlapping ranges. While the find operation is always O(log n), the tree/vector manipulation is O(n) (if the inserted/deleted range is large, anyway).
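A rough Python sketch of the first approach (a sorted vector of endpoints of disjoint half-open ranges, with the odd/even index test); it uses ordinary string comparison and leaves out the custom 'A' < 'a' < 'B' ordering from the question:

from bisect import bisect_left, bisect_right

endpoints = []      # flattened sorted endpoints: [start0, end0, start1, end1, ...]

def query(x):
    # x is inside some tracked range iff the number of endpoints <= x is odd.
    return bisect_right(endpoints, x) % 2 == 1

def add_range(lo, hi):
    # Insert [lo, hi) and merge it with any overlapping or touching ranges.
    i = bisect_left(endpoints, lo)
    j = bisect_right(endpoints, hi)
    new = ([lo] if i % 2 == 0 else []) + ([hi] if j % 2 == 0 else [])
    endpoints[i:j] = new

def delete_range(lo, hi):
    # Remove [lo, hi); a cut range keeps its part outside the deleted span.
    i = bisect_left(endpoints, lo)
    j = bisect_right(endpoints, hi)
    new = ([lo] if i % 2 == 1 else []) + ([hi] if j % 2 == 1 else [])
    endpoints[i:j] = new

# add_range("apple", "cherry"); query("banana")  ->  True; query("dog")  ->  False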
Most languages, including Java and C++, have some sort of ordered map or ordered set in which you can both look up a value and find the next value after, or the first value before, a given value. You could use this as a building block: if it contains a set of disjoint ranges, then it will hold the least element of a range followed by the greatest element of that range, followed by the least element of the next range, followed by its greatest element, and so on. When you add a range, you check whether you have preserved this property; if not, you need to merge ranges. Similarly, you want to preserve the property when you delete. Then you can answer a query by just looking to see whether there is a least element just before your query point and a greatest element just after it.
If you want to create your own data structure from scratch, I would think about some sort of radix trie structure, because this avoids doing lots of repeated string comparisons.
I think you would go for a B+ tree; it's the same as what you mentioned as your second approach.
Here are some properties of a B+ tree:
All data is stored in the leaf nodes.
Every leaf is at the same level.
All leaf nodes have links to other leaf nodes.
Here are a few applications of B+ trees:
It reduces the number of I/O operations required to find an element in the tree.
Often used in the implementation of database indexes.
The primary value of a B+ tree is in storing data for efficient retrieval in a block-oriented storage context — in particular, file systems.
NTFS uses B+ trees for directory indexing.
Basically, it helps with range-query lookups and minimizes tree traversal.

Resources