Which data structure to use for storing a paragraph? - data-structures

Recently an interviewer asked me:
1. Which data structure should be used if you need to store a paragraph, traverse it later, and find a word?
2. Which data structure should be used if you can also add, edit, or delete words in that paragraph?
Can someone help me with the answer?
And if possible, can someone also post similar questions with logical answers on data structures, as I am preparing for interviews.

I think what you are looking for is a Trie. A trie is a tree whose nodes store unique combinations of letters (prefixes), and whose edges point to combinations of letters following those prefixes (suffixes). Tries can be built from text documents to give O(L) search, insertion, and deletion time (where L is the length of the word you are searching for, adding, or deleting). Tries are commonly used in autocomplete and document search algorithms.
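To make the O(L) claim concrete, here is a minimal trie sketch in Python (my own illustration, not from the original answer); each node maps a character to a child node, and a flag marks the end of a word:

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # char -> TrieNode
        self.is_word = False

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        """O(L): walk/create one node per character, mark the last."""
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def search(self, word):
        """O(L): follow the path; the word exists only if the terminal flag is set."""
        node = self._walk(word)
        return node is not None and node.is_word

    def delete(self, word):
        """O(L): lazily delete by unmarking the terminal node."""
        node = self._walk(word)
        if node is not None:
            node.is_word = False

    def _walk(self, s):
        node = self.root
        for ch in s:
            node = node.children.get(ch)
            if node is None:
                return None
        return node
```

A production version would also prune now-unused nodes on delete; the lazy unmarking above keeps the sketch short while preserving correctness.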

Related

What would be a good data structure to store a dictionary (of words) to optimize the search time?

Provided a list of valid words and a search word, I want to find whether the search word is a valid word or not, ALLOWING 2 typo characters.
What would be a good data structure to store a dictionary of words (assume it contains a million words), and an algorithm to find whether a word exists in the dictionary (allowing 2 typo characters)?
If no typo characters were allowed, then a trie would be a good way to store the words, but I am not sure it stays the best way to store the dictionary when typos are allowed. I am also not sure what the complexity of a backtracking algorithm (to search for a word in a trie allowing 2 typos) would be. Any idea about it?
You might want to check out the Directed Acyclic Word Graph, or DAWG. It has more of an automaton structure than a tree or graph structure. Multiple possible transitions from one state may provide you with your solution.
If there is no need to also store all mistyped words, I would consider a two-step approach for this problem.
1. Build a sorted set containing hashes of all valid words (not including typos). We are probably talking about some 10,000 entries here, which should still allow quite fast lookups with a binary search. If the hash of a word is found in the set, it is typed correctly.
2. If a word's hash is not found in the set, the word is probably mistyped. So calculate the Damerau-Levenshtein distance between the word and all known words to figure out what the user might have meant. To gain some performance here, modify the DL algorithm to abort the calculation once the distance exceeds your allowed threshold of 2 typos.
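A sketch of this two-step approach in Python (the function names are my own; the distance function is the optimal-string-alignment variant of Damerau-Levenshtein, with the suggested early abort once every entry in a row exceeds the cap):

```python
def osa_distance(a, b, cap=2):
    """Optimal-string-alignment Damerau-Levenshtein distance, capped at `cap`.

    Returns cap + 1 as soon as the distance is known to exceed the cap.
    """
    if abs(len(a) - len(b)) > cap:
        return cap + 1
    prev2, prev = None, list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i] + [0] * len(b)
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            cur[j] = min(prev[j] + 1,        # deletion
                         cur[j - 1] + 1,     # insertion
                         prev[j - 1] + cost) # substitution / match
            # transposition of adjacent characters
            if prev2 is not None and j > 1 and ca == b[j - 2] and a[i - 2] == cb:
                cur[j] = min(cur[j], prev2[j - 2] + 1)
        if min(cur) > cap:   # early abort: no alignment can recover
            return cap + 1
        prev2, prev = prev, cur
    return prev[-1]

def lookup(word, dictionary, cap=2):
    """Step 1: exact set membership. Step 2: capped edit-distance scan."""
    if word in dictionary:
        return [word]
    return [w for w in dictionary if osa_distance(word, w, cap) <= cap]
```

Note the length check up front: if two words differ in length by more than the cap, no computation is needed at all.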

Hash-maps or search tree?

The problem is as follows: Given is a list of cities with their countries, populations, and geo-coordinates. You should read this data, save it, and answer queries of the following type in an endless loop:
Request: a prefix (e.g., free).
Answer: all cities beginning with this prefix (case-insensitive)
and their associated data (country + population + geo-coordinates).
The cities should be sorted by population (highest population first).
Which data structure is the most suitable for the described problem?
First part: my thoughts are torn between a trie and a hashmap, although I lean towards the trie because I'm dealing with prefix requests, and according to Wikipedia a trie is basically:
"a trie, also called digital tree and sometimes radix tree or prefix tree (as they can be searched by prefixes), is a kind of search tree—an ordered tree data structure that is used to store a dynamic set or associative array where the keys are usually strings".
In addition to that, in terms of storage and reading data, the trie has the advantage over hash-maps.
Second part: returning the cities sorted by population would be a little bit challenging when we speak about time complexity. If I'm thinking in the right direction, I should save the values of the keys as lists, so it will be easier to sort just the returned list; that way I don't have to keep the data sorted, which saves some time.
Please share your thoughts and correct me if I'm wrong.
There are pros and cons to picking vanilla tries and vanilla hashmaps. In general, for autocomplete systems, the structure of a trie is extremely useful because you're usually searching for prefixes, and the user would like to see the words that begin with the string they have just entered.
However, there is a way to get the best of both of these data structures; it is called a hash trie (implementation: http://www.sanfoundry.com/java-program-implement-hash-trie/). The way you would implement this is by using the structure of the trie, but the final node is the actual string it refers to. In Python, this is done by using dictionaries instead of lists while implementing the trie.
For the second half of the question, a list would be your best bet: in essence, a list of tuples (population, city); sort by the population and return the cities. Regarding it being "easier" to sort, I'm not sure I agree: easy is a relative term, and there's really no way of saying it's easier than, say, storing the entries in a tree and returning the in-order traversal of that tree. Essentially, if you're using a comparison-based sort, it won't get better than O(n log n).
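To make the combination concrete, here is a rough Python sketch (the class names and tuple layout are my own assumptions) of a dictionary-based trie whose query collects every city below the prefix node and sorts the result by population on the way out:

```python
class Node:
    def __init__(self):
        self.children = {}   # char -> Node
        self.cities = []     # (name, country, population, (lat, lon)) tuples

class CityTrie:
    def __init__(self):
        self.root = Node()

    def add(self, name, country, population, coords):
        """Key the trie on the lowercased name for case-insensitive prefixes."""
        node = self.root
        for ch in name.lower():
            node = node.children.setdefault(ch, Node())
        node.cities.append((name, country, population, coords))

    def query(self, prefix):
        """Walk to the prefix node, gather the subtree, sort once per query."""
        node = self.root
        for ch in prefix.lower():
            node = node.children.get(ch)
            if node is None:
                return []
        found, stack = [], [node]
        while stack:
            n = stack.pop()
            found.extend(n.cities)
            stack.extend(n.children.values())
        return sorted(found, key=lambda c: -c[2])   # highest population first
```

This follows the asker's instinct: the data is stored unsorted, and only the (usually small) result set of each prefix query pays the O(k log k) sorting cost.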

Given a prefix, how to find the most frequent words efficiently

This is an interview question extended from this one: http://www.geeksforgeeks.org/find-the-k-most-frequent-words-from-a-file/
But this question requires one more thing: a GIVEN PREFIX.
For example, given "Bl", return the most frequent words, such as "bloom, blame, bloomberg", etc.
So using a TRIE is a must. But then, how to efficiently construct the heap? It's not right or practical to construct a heap for each prefix at run time. What could be a good solution?
[suppose this TRIE or data structure is static, pre-built]
Thanks!
Keep a trie of all the words appearing in the file along with their counts. Now, if asked to return all the words with the prefix "Bl", you can do it efficiently using that trie. And since you are interested in the most frequently occurring words with the prefix "Bl", you can use a min-heap to give the answer efficiently, just like in the post you linked to.
Also note that since the space usage of a trie can grow very large, you could suggest to the interviewer using a Ternary search tree instead of a trie.
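A rough sketch of this idea in Python (the nested-dict trie and the `'$'` terminal marker are my own conventions), using `heapq.nlargest` in place of an explicit min-heap of size k:

```python
import heapq
from collections import Counter

def build_trie(counts):
    """Nested-dict trie; '$' at a node holds (word, count) for a complete word."""
    root = {}
    for word, c in counts.items():
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node['$'] = (word, c)
    return root

def top_k_with_prefix(root, prefix, k):
    """Walk to the prefix node, then heap-select the k highest counts below it."""
    node = root
    for ch in prefix:
        node = node.get(ch)
        if node is None:
            return []
    items, stack = [], [node]
    while stack:
        n = stack.pop()
        for key, child in n.items():
            if key == '$':
                word, c = child
                items.append((c, word))
            else:
                stack.append(child)
    return [w for c, w in heapq.nlargest(k, items)]
```

For a static, pre-built structure, one could go further and cache the top-k list at every trie node at build time, trading space for O(prefix + k) queries; the sketch above only does the per-query heap selection the answer describes.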

Data structure for dictionary with efficient querying of arbitrary positions

Can anyone suggest an appropriate data structure to hold a dictionary that will allow me to query the presence of words (items) that have particular letters at particular positions? For example, determine which words (if any) have letters a,b,c at positions x,y,z. Insertions do not have to be particularly efficient.
This is basically the scrabble problem (I have scores associated with the letters too, but that need not concern us). I suspect bioinformaticians have studied the same problem under the guise of sequence alignment. What's the state of the art in terms of speed?
If you are trying to build a very fast Scrabble player, you might want to look into the GADDAG data structure, which was specifically designed for the purpose. Essentially, the GADDAG is a compressed trie structure (specifically, it's a modified DAWG) that lets you explore outward and find all words that can be made with a certain set of letters subject to constraints about which letters of the words must be in what positions, as well as the overall lengths of the strings found.
The Wikipedia article on GADDAGs goes into more depth on the structure and links to the original paper on the subject. You might also want to look at DAWGs as a starting point.
Hope this helps!
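Short of a full GADDAG, a simple positional inverted index (my own sketch, not from the answer above) already handles the stated query "which words have letters a,b,c at positions x,y,z" via set intersection:

```python
from collections import defaultdict

def build_index(words):
    """Map (position, letter) -> set of words having that letter there."""
    index = defaultdict(set)
    for w in words:
        for i, ch in enumerate(w):
            index[(i, ch)].add(w)
    return index

def match(index, constraints):
    """Intersect the candidate sets for a list of (position, letter) pairs."""
    sets = [index.get(c, set()) for c in constraints]
    if not sets:
        return set()
    return set.intersection(*sets)
```

Insertion is cheap (as the question allows), each query costs roughly the size of the smallest candidate set, and unlike a GADDAG this does nothing clever about anchoring or rack letters; it is a baseline to compare against.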

Question regarding Algorithm Design Manual - Data Structure for the Dictionary

I started reading the Algorithm Design Manual, and while reading it I came across one line that I am not getting. Can someone please clarify what the author means here? The line is:
Sorted linked lists or arrays – Maintaining a sorted linked list is usually
not worth the effort unless you are trying to eliminate duplicates, since we
cannot perform binary searches in such a data structure. A sorted array will
be appropriate if and only if there are not many insertions or deletions.
This line is in context with choosing data structure for dictionary.
The point that I am not getting is why the author says, "Maintaining a sorted linked list is usually not worth the effort unless you are trying to eliminate duplicates, since we cannot perform binary searches in such a data structure."
From what I understood, I googled to see whether we can binary search sorted arrays, and based on what I found it looks like we can. So I am not sure.
Can someone please help me understand this?
Thanks so much.
You cannot perform binary search on a linked list efficiently because you cannot seek to an arbitrary position in constant time. To find the midpoint you have to do n/2 steps (traverse the list). This adds great overhead and makes linked lists unsuitable for binary search.
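To see the point, here is a toy singly linked list in Python (my own sketch): even with sorted data, reaching any position costs a node-by-node walk, so a linear scan with early exit is as good as any midpoint-based scheme:

```python
class Node:
    def __init__(self, value, nxt=None):
        self.value, self.next = value, nxt

def from_sorted(values):
    """Build a sorted singly linked list from a sorted sequence."""
    head = None
    for v in reversed(values):
        head = Node(v, head)
    return head

def contains(head, target):
    """Linear scan, O(n); sortedness only lets us stop early.

    Returns (found, steps) so the traversal cost is visible.
    A 'binary search' would pay this same walk just to reach each midpoint.
    """
    steps, node = 0, head
    while node is not None:
        steps += 1
        if node.value == target:
            return True, steps
        if node.value > target:
            return False, steps
        node = node.next
    return False, steps
```

Contrast this with a sorted array, where index arithmetic reaches the midpoint in O(1), making the O(log n) of binary search actually attainable.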
