What's the best data structure to store a string?

Recently I had an interview and I was asked this question.
Given a string that must support insert, delete, and substring operations.
The substring function returns the string from a start index to an end index, both given as parameters.
All three operations arrive in random order. What is the most efficient data structure to use?

I'm assuming insert and delete operations here can be carried out in the middle of the string, not just at the end. Otherwise anything like a C++ vector or a Python list is good enough.
Under that assumption, the rope data structure is a very good candidate. A balanced rope supports all of those operations in O(log N), which I think is the best anyone could hope for (copying a substring out additionally costs time proportional to its length). It's a good choice for editors, or for manipulating huge strings, genome data for example.
Another related, and more common, choice for editors is the gap buffer.
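To make the rope idea concrete, here is a minimal sketch in Python. It is not a production rope: the names (Rope, concat, split, insert, delete, substring, to_str) are my own, the nodes are immutable, and there is no rebalancing, so the O(log N) bound only holds while the tree happens to stay balanced. All three operations are built from just split and concat.

```python
class Rope:
    """Immutable rope node: a leaf holding text, or an internal node with two children."""

    def __init__(self, text="", left=None, right=None):
        self.text = text
        self.left = left
        self.right = right
        # Cache the total length so split() can decide which side to descend into.
        self.length = len(text) if left is None else left.length + right.length

    def is_leaf(self):
        return self.left is None


def concat(a, b):
    """Join two ropes, skipping empty ones to avoid useless nodes."""
    if a.length == 0:
        return b
    if b.length == 0:
        return a
    return Rope(left=a, right=b)


def split(rope, i):
    """Return two ropes: the first i characters and the rest."""
    if rope.is_leaf():
        return Rope(rope.text[:i]), Rope(rope.text[i:])
    if i <= rope.left.length:
        l, r = split(rope.left, i)
        return l, concat(r, rope.right)
    l, r = split(rope.right, i - rope.left.length)
    return concat(rope.left, l), r


def to_str(rope):
    """Flatten a rope back into an ordinary string."""
    if rope.is_leaf():
        return rope.text
    return to_str(rope.left) + to_str(rope.right)


def insert(rope, i, s):
    """Insert string s before index i."""
    l, r = split(rope, i)
    return concat(concat(l, Rope(s)), r)


def delete(rope, start, end):
    """Remove the characters in [start, end)."""
    l, rest = split(rope, start)
    _, r = split(rest, end - start)
    return concat(l, r)


def substring(rope, start, end):
    """Return the characters in [start, end) as a string."""
    _, rest = split(rope, start)
    mid, _ = split(rest, end - start)
    return to_str(mid)
```

For example, insert(Rope("hello world"), 5, ",") yields a rope whose text is "hello, world". A real implementation would add rebalancing (or use a balanced tree such as a treap) to keep the depth logarithmic.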

Related

What algorithm to use to match beginning of strings

I have a lot of strings that I would like to match against a search term.
Example:
folks
fort
garage
grabbed
grandmother
habit
happily
harry
heading
hunter
I'd like to search for the string "ha" and have the algorithm return the start of the range in the list where strings begin with "ha", in this case "habit".
Of course I don't want to go one by one since the list is huge. I can do some preprocessing to sort the list or put it into a structure that makes this sort of search fast.
Any suggestions?
Well, you want a sorted structure of some type. You could get away with a TreeMap or a Radix Tree (Radix will save you some space). The overhead of this will be the sort operation or the cost of inserting into a sorted data structure. However, once the data is sorted, a binary search gives you O(log N) worst-case lookup performance.
Of note Lucene uses Radix Trees afaik
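As a rough illustration of the sorted-list-plus-binary-search idea, here is a small Python sketch using the standard bisect module. The helper name prefix_range is made up for this example, and it assumes a non-empty prefix whose last character is not the maximum code point.

```python
from bisect import bisect_left


def prefix_range(sorted_words, prefix):
    """Return (lo, hi) so that sorted_words[lo:hi] are the words starting with prefix."""
    # First position whose word is >= prefix.
    lo = bisect_left(sorted_words, prefix)
    # Smallest string that sorts after every word with this prefix:
    # bump the last character by one ("ha" -> "hb").
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    hi = bisect_left(sorted_words, upper)
    return lo, hi


words = ["folks", "fort", "garage", "grabbed", "grandmother",
         "habit", "happily", "harry", "heading", "hunter"]
lo, hi = prefix_range(words, "ha")
print(lo, words[lo:hi])  # 5 ['habit', 'happily', 'harry']
```

Both lookups are binary searches, so the cost is O(log N) string comparisons once the list is sorted.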
You can always look at Patricia Trees. They are almost perfectly suited for this kind of thing.
A Trie is what you are looking for.
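For reference, a bare-bones trie might look like the Python sketch below. The class and method names (TrieNode, Trie, with_prefix) are my own choices, and it uses a plain dict per node rather than anything space-optimised like a radix or Patricia tree.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # maps a character to the child node
        self.is_word = False  # True if a stored word ends at this node


class Trie:
    def __init__(self, words=()):
        self.root = TrieNode()
        for word in words:
            self.insert(word)

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def with_prefix(self, prefix):
        """Return all stored words that start with prefix, sorted."""
        # Walk down to the node that represents the prefix.
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        # Collect every word in the subtree below that node.
        results = []
        stack = [(node, prefix)]
        while stack:
            current, text = stack.pop()
            if current.is_word:
                results.append(text)
            for ch, child in current.children.items():
                stack.append((child, text + ch))
        return sorted(results)


trie = Trie(["folks", "fort", "garage", "grabbed", "grandmother",
             "habit", "happily", "harry", "heading", "hunter"])
print(trie.with_prefix("ha"))  # ['habit', 'happily', 'harry']
```

Reaching the prefix node costs time proportional to the prefix length, regardless of how many words are stored.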
Your post leaves too many questions unanswered. My interpretation is that you want to create a dictionary from an unordered list of words. But then when you search for ha, what is it that you really want?
Do you want:
1. the first word that starts with ha?
2. the index of the first word that starts with ha?
3. to have easy access to all the words that start with ha?
If you want 1 and/or 3, then the person who says trie is correct. (The link I give you has an easy to read implementation).
If 2 is what you want, then can you talk about a use-case? If not, then you are looking at using a string search algorithm. Without more details, it's difficult to give more precise advice.
Your question has many fuzzy areas. Depending on exactly what your requirements are you might find that the Rabin-Karp string searching method is of use to you.

top 3 word count in a text editor

I know one way to solve this question is to Hash the words and its corresponding word count. Then traverse the Hash map and figure out the top 3.
Is there any better way to solve this ? Will it be better if I use a BST instead of a HashMap ?
A Trie is a good data structure for this. No need for hash calculations, and the cost of an insert or update depends only on the length of the word, so it is O(1) with respect to the size of the dictionary.
Basically a histogram is the standard way of doing so, have your pick of which implementation you want to use for the histogram interface, the difference between them is actually instance specific - each has its advantages and disadvantages.
You might also want to consider a map-reduce design to get the words count:
map(doc):
    for each word:
        emitIntermediate(word, "1")
reduce(word, list<string>):
    emit(word, size(list))
This approach scales very well if you have a lot of documents and can use a map-reduce framework, and it is also an elegant solution if you like functional programming.
Note: this approach is basically same as the hash solution, since the mapper is passing the (key,values) tuple using hashing.
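To see the shape of the map-reduce pseudocode without a framework, here is a single-process Python sketch. The function names (map_doc, reduce_word, word_count) are invented for the example, and the "shuffle" step that a real framework performs is simulated with a dictionary of lists.

```python
from collections import defaultdict


def map_doc(doc):
    """Map step: emit (word, 1) for every word in the document."""
    for word in doc.split():
        yield word, 1


def reduce_word(word, values):
    """Reduce step: the count is the number of emitted values."""
    return word, len(values)


def word_count(docs):
    # Shuffle step: group the intermediate values by key (the word).
    groups = defaultdict(list)
    for doc in docs:
        for word, one in map_doc(doc):
            groups[word].append(one)
    return dict(reduce_word(word, values) for word, values in groups.items())


print(word_count(["a b a", "b c"]))  # {'a': 2, 'b': 2, 'c': 1}
```

The grouping dictionary here is exactly the hash map from the original solution, which is why the two approaches are equivalent on a single machine.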
Either a HashMap or a BST is a reasonable choice. The performance of each will vary depending on the number of words you need to count. A profiler is your friend in these instances (VisualVM is a reasonable choice to start with).
I would wager that a hash table would have better performance in this case since there are likely many different words. Lookups take O(1) on average, versus O(log N) for a balanced tree.
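The hash-based approach is only a few lines in Python. This sketch uses collections.Counter (a dict subclass, so a hash table underneath); the function name top_words and the lowercase-and-split tokenisation are simplifying assumptions, since a real editor would need proper word splitting.

```python
from collections import Counter


def top_words(text, k=3):
    """Return the k most frequent words as (word, count) pairs."""
    counts = Counter(text.lower().split())
    # most_common(k) selects the k largest counts without sorting every entry.
    return counts.most_common(k)


print(top_words("c a c b a c a b c d"))  # [('c', 4), ('a', 3), ('b', 2)]
```

Counting is O(total words) on average, and picking the top 3 is a single pass over the distinct words, so there is little a BST could improve on here.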

Decoding Permutated English Strings

A coworker was recently asked this when trying to land a (different) research job:
Given 10 128-character strings which have been permutated in exactly the same way, decode the strings. The original strings are English text with spaces, numbers, punctuation and other non-alpha characters removed.
He was given a few days to think about it before an answer was expected. How would you do this? You can use any computer resource, including character/word level language models.
This is a basic transposition cipher. My question above was simply to determine if it was a transposition cipher or a substitution cipher. Cryptanalysis of such systems is fairly straightforward. Others have already alluded to basic methods. Optimal approaches will attempt to place the hardest and rarest letters first, as these will tend to uniquely identify the letters around them, which greatly reduces the subsequent search space. Simply finding a place to place an "a" (no pun intended) is not hard, but finding a location for a "q", "z", or "x" is a bit more work.
The overarching goal for an algorithm's quality isn't to decipher the text, as it can be done by better than brute force methods, nor is it simply to be fast, but it should eliminate possibilities absolutely as fast as possible.
Since you can use multiple strings simultaneously, attempting to create words from the rarest characters is going to allow you to test dictionary attacks in parallel. Finding the correct placement of the rarest terms in each string as quickly as possible will decipher that ciphertext PLUS all of the others at the same time.
If you search for cryptanalysis of transposition ciphers, you'll find a bunch of results using genetic algorithms. These are meant to advance the research cred of people working in GA, as these are not really optimal in practice. Instead, you should look at some basic optimization methods, such as branch and bound, A*, and a variety of statistical methods. (How deep you should go depends on your level of expertise in algorithms and statistics. :) I would switch between deterministic methods and statistical optimization methods several times.)
In any case, the calculations should be dirt cheap and fast, because the scale of initial guesses could be quite large. It's best to have a cheap way to filter out a LOT of possible placements first, then spend more CPU time on sifting through the better candidates. To that end, it's good to have a way of describing the stages of processing and the computational effort for each stage. (At least that's what I would expect if I gave this as an interview question.)
You can even buy a fairly credible reference book on deciphering double transposition ciphers.
Update 1: Take a look at these slides for more ideas on iterative improvements. It's not a great reference set of slides, but it's readily accessible. What's more, although the slides are about GA and simulated annealing (methods that come up a lot in search results for transposition cipher cryptanalysis), the author advocates against such methods when you can use A* or other methods. :)
first, you'd need a test for the correct ordering. something fairly simple like being able to break the majority of texts into words using a dictionary ordered by frequency of use without backtracking.
once you have that, you can play with various approaches. two i would try are:
using a genetic algorithm, with scoring based on 2 and 3-letter tuples (which you can either get from somewhere or generate yourself). the hard part of genetic algorithms is finding a good description of the process that can be fragmented and recomposed. i would guess that something like "move fragment x to after fragment y" would be a good approach, where the indices are positions in the original text (and so change as the "dna" is read). also, you might need to extend the scoring with something that gets you closer to "real" text near the end - something like the length over which the verification algorithm runs, or complete words found.
using a graph approach. you would need to find a consistent path through the graph of letter positions, perhaps with a beam-width search, using the weights obtained from the pair frequencies. i'm not sure how you'd handle reaching the end of the string and restarting, though. perhaps 10 sentences is sufficient to identify with strong probability good starting candidates (from letter frequency) - wouldn't surprise me.
this is a nice problem :o) i suspect 10 sentences is a strong constraint (for every step you have a good chance of common letter pairs in several strings - you probably want to combine probabilities by discarding the most unlikely, unless you include word start/end pairs) so i think the graph approach would be most efficient.
Frequency analysis would drastically prune the search space. The most-common letters in English prose are well-known.
Count the letters in your encrypted input, and put them in most-common order. Matching most-counted to most-counted, translate the ciphertext back into an attempted plaintext. It will be close to right, but likely not exactly. By hand, iteratively tune your permutation until plaintext emerges (typically few iterations are needed).
If you find checking by hand odious, run attempted plain texts through a spell checker and minimize violation counts.
First you need a scoring function that increases as the likelihood of a correct permutation increases. One approach is to precalculate the frequencies of triplets in standard English (get some data from Project Gutenberg) and add up the frequencies of all the triplets in all ten strings. You may find that quadruplets give a better outcome than triplets.
Second you need a way to produce permutations. One approach, known as hill-climbing, takes the ten strings and enters a loop. Pick two random integers from 1 to 128 and swap the associated letters in all ten strings. Compute the score of the new permutation and compare it to the old permutation. If the new permutation is an improvement, keep it and loop, otherwise keep the old permutation and loop. Stop when the number of improvements slows below some predetermined threshold. Present the outcome to the user, who may accept it as given, accept it and make changes manually, or reject it, in which case you start again from the original set of strings at a different point in the random number generator.
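A rough Python sketch of that scoring-plus-hill-climbing loop follows. Several things here are assumptions of mine: the names (apply_perm, ngram_score, hill_climb), the freq table (a dict mapping triplets to weights, which you would have to build from a corpus yourself), and the fixed step count used in place of the "improvements slow down" stopping rule.

```python
import random


def apply_perm(text, perm):
    """Reorder text so that output position k takes the character at perm[k]."""
    return "".join(text[i] for i in perm)


def ngram_score(texts, freq, n=3):
    """Sum the table weights of every n-gram in every string (unknown n-grams score 0)."""
    return sum(freq.get(t[i:i + n], 0.0)
               for t in texts
               for i in range(len(t) - n + 1))


def hill_climb(ciphertexts, freq, steps=20000, seed=0):
    """Swap two positions at random; keep the swap only if the score improves."""
    rng = random.Random(seed)  # change the seed to restart from a different point
    length = len(ciphertexts[0])
    perm = list(range(length))
    best = ngram_score([apply_perm(c, perm) for c in ciphertexts], freq)
    for _ in range(steps):
        i, j = rng.randrange(length), rng.randrange(length)
        perm[i], perm[j] = perm[j], perm[i]
        score = ngram_score([apply_perm(c, perm) for c in ciphertexts], freq)
        if score > best:
            best = score
        else:
            perm[i], perm[j] = perm[j], perm[i]  # undo the swap
    return perm, best
```

The same swap is applied to all ten strings at once because they share one permutation, which is what makes the combined score informative. Recomputing the full score on every step is wasteful; only the triplets touching the two swapped positions actually change.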
Instead of hill-climbing, you might try simulated annealing. I'll refer you to Google for details, but the idea is that instead of always keeping the better of the two permutations, sometimes you keep the lesser of the two permutations, in the hope that it leads to a better overall outcome. This is done to defeat the tendency of hill-climbing to get stuck at a local maximum in the search space.
By the way, it's "permuted" rather than "permutated."

find repeated word in infinite stream of words

You are given an infinite supply of words, which arrive one by one. The length of a word can be huge and is not known in advance. How will you find out whether a new word is a repeat, and what data structure will you use to store the words? This question was asked in an interview. Please help me verify my answer.
Normally use a hash-table to keep track of the count of each word. Since you only have to answer whether the words are duplicated, you can reduce the word count to a bitmask, so that you only store a single bit for each hash index.
If the question is related to big data, like how to write a search engine for Google, your answer may need to relate to MapReduce or similar distributed techniques (which takes root somewhat in same hash table techniques as described above)
As with most sequential data, a trie would be a good choice here. Using a trie you can store new words very cost efficiently and still be sure to find new words. Tries can actually be seen as a form of multiple hashing of the words. If this still leads to problems because the words are too big, you can make it more efficient by producing a directed acyclic word graph (DAWG) from the words in order to share common suffixes as well as prefixes.
If all you need to do is efficiently detect if each word is one you've seen before, a Bloom filter is one nice option. It's kind of like a set and a hash table combined in one: it stores only bits chosen by hashing each word, not the words themselves, and therefore can result in false positives (though never false negatives) -- for this reason they are sometimes adapted to use additional techniques to reduce that risk. The advantage of Bloom filters is that they are very space efficient (important if you really don't know how large the list will be). They are also fast. On the downside, you can't get the words out again, you can only tell whether you've seen them or not.
There's a nice description at: http://en.wikipedia.org/wiki/Bloom_filter.
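Here is a small Bloom filter sketch in Python to show the mechanics. The class name, the default sizes (2^20 bits, 5 hash functions), and the trick of deriving several hash functions by prefixing a seed byte before SHA-256 are all illustrative choices of mine, not tuned recommendations.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1 << 20, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes  # must stay below 256 (one seed byte each)
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, word):
        """Yield one bit position per hash function."""
        data = word.encode("utf-8")
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(bytes([seed]) + data).digest()
            yield int.from_bytes(digest[:8], "little") % self.num_bits

    def add(self, word):
        for p in self._positions(word):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, word):
        """False means definitely new; True means probably seen before."""
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(word))


def seen_before(bloom, word):
    """Report whether word was (probably) seen already, then record it."""
    was_seen = bloom.might_contain(word)
    bloom.add(word)
    return was_seen
```

Memory use is fixed by num_bits no matter how long the words are, which suits the "huge words" part of the question. The false-positive rate rises as more words are added, so the bit array has to be sized for the expected volume.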

Which data structure to add/look up/keep count of strings?

I'm trying to figure out what data structure to quickly support the following operations:
Add a string (if it's not there, add it, if it is there, increment a counter for the word)
Count a given string (look up by string and then read the counter)
I'm debating between a hash table or a trie. From my understanding a hash table is fast to look up and add as long as you avoid collisions. If I don't know my inputs ahead of time would a trie be a better way to go?
It really depends on the types of strings you're going to be using as "keys". If you're using highly variable strings, plus you do not have a good hash algorithm for your strings, then a trie can outperform a hash.
However, given a good hash, the lookup will be faster than in a trie. (Given a very bad hash, the opposite is true, though.) If you don't know your inputs, but do have a decent hashing algorithm, I personally prefer using a hash.
Also, most modern languages/frameworks have very good hashing algorithms, so chances are, you'll be able to implement a good lookup using a hash with very little work, that will perform quite well.
A trie won't buy you much; they're only interesting when prefixes are important. Hash tables are simpler, and usually part of your language's standard library, if not directly part of the language itself (Ruby, Python, etc). Here's a dead-simple way to do this in Ruby:
# Split a sample sentence into an array of words
strings = %w(some words that may be repeated repeated)
# Hash with a default value of 0, so unseen words start at zero
counts = Hash.new(0)
strings.each { |s| counts[s] += 1 }
# counts["repeated"] => 2, every other word => 1
Addenda:
For C++, you can probably use Boost's hash implementation.
Either one is reasonably fast.
It isn't necessary to completely avoid collisions.
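To illustrate why collisions don't have to be avoided entirely, here is a toy hash table with separate chaining in Python. The class name ChainedCounter and its tiny default bucket count are made up for the demonstration; in practice you would just use the built-in dict.

```python
class ChainedCounter:
    """String counter backed by a hash table that resolves collisions by chaining."""

    def __init__(self, num_buckets=8):
        # Each bucket is a list of [word, count] entries.
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket(self, word):
        return self.buckets[hash(word) % len(self.buckets)]

    def add(self, word):
        """Increment the counter for word, inserting it if it is new."""
        bucket = self._bucket(word)
        for entry in bucket:
            if entry[0] == word:
                entry[1] += 1
                return
        bucket.append([word, 1])

    def count(self, word):
        """Return how many times word has been added (0 if never)."""
        for key, n in self._bucket(word):
            if key == word:
                return n
        return 0


counter = ChainedCounter(num_buckets=2)  # far too few buckets, so collisions are certain
for s in "some words that may be repeated repeated".split():
    counter.add(s)
print(counter.count("repeated"))  # 2
```

With six distinct words in two buckets, several words must share a bucket, yet every count is still correct. Collisions only cost extra comparisons within a chain, which is why a reasonable hash function and load factor are enough.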
Looking at performance a little more closely, usually, hash tables are faster than trees, but I doubt if a real life program ever ran too slow simply because it used a tree instead of a HT, and some trees are faster than some hash tables.
Beyond that, hash tables are more common than trees.
One advantage of the complex trees is that they have predictable access times. With hash tables and simple binary trees, the performance you see depends on the data and with an HT performance depends strongly on the quality of the implementation and its configuration with respect to the data set size.
