You are given an infinite stream of words, arriving one by one; the number (and length) of the words can be huge and is unknown in advance. How will you detect whether a new word has already appeared, and what data structure would you use to store the words? This was a question asked to me in an interview; please help me verify my answer.
Normally you would use a hash table to keep track of the count of each word. Since you only have to answer whether a word is a duplicate, you can reduce the word count to a bitmask, so that you only store a single bit for each hash index.
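For illustration, a minimal Python sketch of the hash-based idea using the built-in set (which is a hash table under the hood); the bitmask refinement is left out:

    # Minimal sketch: detect repeats in a word stream with a hash-based set.
    # Assumes the set of distinct words fits in memory.

    def duplicate_flags(word_stream):
        """Yield (word, is_repeat) for each incoming word."""
        seen = set()
        for word in word_stream:
            yield word, word in seen
            seen.add(word)

    for word, repeated in duplicate_flags(["cat", "dog", "cat", "bird", "dog"]):
        print(word, "repeat" if repeated else "new")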
If the question is related to big data, like how to write a search engine for Google, your answer may need to bring in MapReduce or similar distributed techniques (which are rooted in much the same hash-table techniques as described above).
As with most sequential data, a trie would be a good choice here. Using a trie you can store new words very cost-efficiently and still reliably detect whether a word has been seen before. Tries can actually be seen as a form of multiple hashing of the words. If this still leads to problems because the words are too long, you can make it more efficient by building a directed acyclic word graph (DAWG) from the words in order to share common suffixes as well as prefixes.
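A rough trie sketch, assuming plain lowercase words; insert() reports whether the word was already present, so storage and duplicate detection happen in one pass:

    class TrieNode:
        __slots__ = ("children", "is_word")
        def __init__(self):
            self.children = {}   # one edge per character
            self.is_word = False

    class Trie:
        def __init__(self):
            self.root = TrieNode()

        def insert(self, word):
            """Store word and return True if it had been seen before."""
            node = self.root
            for ch in word:
                node = node.children.setdefault(ch, TrieNode())
            seen_before = node.is_word
            node.is_word = True
            return seen_before

    trie = Trie()
    print(trie.insert("water"))  # False: first occurrence
    print(trie.insert("water"))  # True: repeated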
If all you need to do is efficiently detect whether each word is one you've seen before, a Bloom filter is one nice option. It's essentially a set backed by a fixed-size bit array and several hash functions, so different words can map to overlapping bits and produce false positives; for this reason Bloom filters are sometimes combined with additional techniques to reduce that risk. The advantage is that they are very space-efficient (important if you really don't know how large the list will be). They are also fast. On the downside, you can't get the words out again; you can only tell whether you've (probably) seen them or not.
There's a nice description at: http://en.wikipedia.org/wiki/Bloom_filter.
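A toy Bloom filter sketch in Python; the bit-array size, the number of hash positions, and the double-hashing trick below are arbitrary illustrative choices, not tuned values:

    import hashlib

    class BloomFilter:
        """Toy Bloom filter: k hash positions derived from one SHA-256 digest."""
        def __init__(self, num_bits=1 << 20, num_hashes=5):
            self.num_bits = num_bits
            self.num_hashes = num_hashes
            self.bits = bytearray(num_bits // 8)

        def _positions(self, word):
            digest = hashlib.sha256(word.encode("utf-8")).digest()
            h1 = int.from_bytes(digest[:8], "big")
            h2 = int.from_bytes(digest[8:16], "big")
            return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

        def add(self, word):
            for pos in self._positions(word):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def might_contain(self, word):
            return all(self.bits[pos // 8] & (1 << (pos % 8))
                       for pos in self._positions(word))

    bf = BloomFilter()
    bf.add("hello")
    print(bf.might_contain("hello"))  # True
    print(bf.might_contain("world"))  # almost certainly False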
Related
Recently I had an interview and I was asked this question.
Given a string that must support insert, delete, and substring operations.
The substring operation returns the part of the string from a start index to an end index, both given as parameters.
The three operations arrive in arbitrary order; what is an efficient data structure to use?
I'm assuming the insert and delete operations here can be carried out in the middle of the string, not just at the end. Otherwise anything like a C++ vector or a Python list is good enough.
Otherwise, the rope data structure is a very good candidate. It supports all of those operations in O(log N), which I think is the best anyone could hope for. It's a good choice for editors, or for manipulating huge strings, genome data for example.
Another related, and more common, choice for editors is the gap buffer.
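Of the two, the gap buffer is the easier one to sketch briefly; here is a bare-bones Python version, assuming edits cluster near a cursor (a rope would need tree rebalancing to get the O(log N) bounds mentioned above):

    class GapBuffer:
        """Minimal gap buffer: cheap insert/delete near the current cursor."""
        def __init__(self, text="", gap_size=16):
            self.buf = list(text) + [None] * gap_size
            self.gap_start = len(text)
            self.gap_end = len(self.buf)

        def _move_gap(self, pos):
            while self.gap_start > pos:          # shift gap left
                self.gap_start -= 1
                self.gap_end -= 1
                self.buf[self.gap_end] = self.buf[self.gap_start]
            while self.gap_start < pos:          # shift gap right
                self.buf[self.gap_start] = self.buf[self.gap_end]
                self.gap_start += 1
                self.gap_end += 1

        def insert(self, pos, ch):
            if self.gap_start == self.gap_end:   # grow the gap when it runs out
                self.buf[self.gap_end:self.gap_end] = [None] * 16
                self.gap_end += 16
            self._move_gap(pos)
            self.buf[self.gap_start] = ch
            self.gap_start += 1

        def delete(self, pos):
            self._move_gap(pos)
            self.gap_end += 1                    # absorb the character into the gap

        def substring(self, start, end):
            text = self.buf[:self.gap_start] + self.buf[self.gap_end:]
            return "".join(text[start:end])

    gb = GapBuffer("helo")
    gb.insert(3, "l")
    print(gb.substring(0, 5))  # hello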
So I've read the Wikipedia page on hash functions, as I'm currently playing with some.
Both that page and other sources I've read mention that the distribution of the data affects the hash function.
Despite some explanations, it is still unclear to me what exactly those effects are and why. So my questions:
Just to make sure I've got it right: when they mention distribution, is this the frequency of each word in the input data set?
What effect does the distribution of the input data have on hash functions? Of particular interest is the performance of the hash function, in terms of both speed and the uniformity of the output produced by the hash algorithm.
EDIT 1:
I'm thinking specifically of the Wikipedia English corpus vs data from a more dynamic source, Twitter's tweets for example.
Usually you do not have as many input data sets as you have possible inputs. The distribution is therefore more of a probability that a certain input with certain features will be picked (essentially the same as you said, but with p < 1 for every word instead of some count n > 1). For example, if you know that the first bit of the input will always be 1, then the data is not uniformly distributed.
If your hash were very simple, e.g. taking only the first byte as the 'hash', then this non-uniform distribution would lead to more collisions than anticipated (only 128 values would be possible even though you expected 256 different values).
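A toy demonstration of that effect, using the naive "first byte as hash" scheme on inputs restricted to ASCII letters:

    import random, string
    from collections import Counter

    def first_byte_hash(data: bytes) -> int:
        return data[0]  # naive "hash": at most 256 distinct values

    # Inputs made only of ASCII letters can hit at most 52 of the 256 buckets,
    # so collisions occur far more often than a uniform hash would produce.
    random.seed(0)
    inputs = ["".join(random.choices(string.ascii_letters, k=8)).encode()
              for _ in range(10000)]
    buckets = Counter(first_byte_hash(x) for x in inputs)
    print(f"{len(buckets)} distinct hash values for {len(inputs)} inputs")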
Most (cryptographic) hash functions that you might know by name are good enough that you do not have to care about this. For cryptography it is even an explicit requirement: you must not be able to tell how many bits of the input changed just by looking at the difference between the hashes. That does not mean it is impossible, though. I vaguely remember a paper reporting an increased collision rate for MD5 when only ASCII letters and digits were hashed. I cannot find it right now, so treat that piece of information with care; but even if I have mixed something up, such a scenario is easily possible. And no matter whether it is MD5 or some other algorithm, if you actually have such a relation, then the distribution of your input data certainly becomes relevant again.
A coworker was recently asked this when trying to land a (different) research job:
Given 10 128-character strings which have been permutated in exactly the same way, decode the strings. The original strings are English text with spaces, numbers, punctuation and other non-alpha characters removed.
He was given a few days to think about it before an answer was expected. How would you do this? You can use any computer resource, including character/word level language models.
This is a basic transposition cipher. My question above was simply to determine whether it was a transposition cipher or a substitution cipher. Cryptanalysis of such systems is fairly straightforward. Others have already alluded to basic methods. Optimal approaches will attempt to place the hardest and rarest letters first, as these tend to uniquely identify the letters around them, which greatly reduces the subsequent search space. Simply finding a place to put an "a" (no pun intended) is not hard, but finding a location for a "q", "z", or "x" is a bit more work.
The overarching measure of an algorithm's quality isn't merely that it deciphers the text, since that can be done with better-than-brute-force methods, nor that it is simply fast; it is that it eliminates impossible candidates as quickly as possible.
Since you can use multiple strings simultaneously, attempting to build words from the rarest characters lets you run dictionary attacks in parallel. Finding the correct placement of the rarest letters in each string as quickly as possible will decipher that ciphertext PLUS all of the others at the same time.
If you search for cryptanalysis of transposition ciphers, you'll find a bunch of papers using genetic algorithms. These are meant to advance the research cred of people working in GAs, as they are not really optimal in practice. Instead, you should look at some basic optimization methods, such as branch and bound, A*, and a variety of statistical methods. (How deep you should go depends on your level of expertise in algorithms and statistics. :) I would switch between deterministic methods and statistical optimization methods several times.)
In any case, the calculations should be dirt cheap and fast, because the scale of initial guesses could be quite large. It's best to have a cheap way to filter out a LOT of possible placements first, then spend more CPU time on sifting through the better candidates. To that end, it's good to have a way of describing the stages of processing and the computational effort for each stage. (At least that's what I would expect if I gave this as an interview question.)
You can even buy a fairly credible reference book on deciphering double transposition ciphers.
Update 1: Take a look at these slides for more ideas on iterative improvements. It's not a great reference set of slides, but it's readily accessible. What's more, although the slides are about GA and simulated annealing (methods that come up a lot in search results for transposition cipher cryptanalysis), the author advocates against such methods when you can use A* or other methods. :)
First, you'd need a test for the correct ordering. Something fairly simple, like being able to break the majority of texts into words using a dictionary ordered by frequency of use, without backtracking.
Once you have that, you can play with various approaches. Two I would try are:
Using a genetic algorithm, with scoring based on 2- and 3-letter tuples (which you can either get from somewhere or generate yourself). The hard part of genetic algorithms is finding a good description of the process that can be fragmented and recombined. I would guess that something like "move fragment x to after fragment y" would be a good approach, where the indices are positions in the original text (and so change as the "DNA" is read). Also, you might need to extend the scoring with something that gets you closer to "real" text near the end, such as the length over which the verification algorithm runs, or the number of complete words found.
Using a graph approach. You would need to find a consistent path through the graph of letter positions, perhaps with a beam-width search, using the weights obtained from the pair frequencies. I'm not sure how you'd handle reaching the end of the string and restarting, though. Perhaps 10 strings are sufficient to identify good starting candidates (from letter frequency) with strong probability; that wouldn't surprise me.
This is a nice problem :o) I suspect 10 strings is a strong constraint (at every step you have a good chance of common letter pairs appearing in several strings; you probably want to combine probabilities by discarding the most unlikely, unless you include word start/end pairs), so I think the graph approach would be the most efficient.
Frequency analysis would drastically prune the search space. The most-common letters in English prose are well-known.
Count the letters in your encrypted input and put them in most-common order. Matching most-counted to most-counted, translate the ciphertext back into an attempted plain text. It will be close to right, but likely not exact. By hand, iteratively tune your permutation until plain text emerges (typically only a few iterations are needed).
If you find checking by hand odious, run attempted plain texts through a spell checker and minimize violation counts.
First, you need a scoring function that increases as the likelihood of a correct permutation increases. One approach is to precalculate the frequencies of triplets in standard English (get some data from Project Gutenberg) and add up the frequencies of all the triplets in all ten strings. You may find that quadruplets give a better outcome than triplets.
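A rough sketch of that scoring idea; the corpus text and the penalty for unseen triplets are placeholders you would tune:

    import math
    from collections import Counter

    def build_trigram_model(corpus_text):
        """Precompute log-frequencies of letter triplets from a reference corpus."""
        letters = [c for c in corpus_text.lower() if c.isalpha()]
        counts = Counter("".join(letters[i:i + 3]) for i in range(len(letters) - 2))
        total = sum(counts.values())
        return {tri: math.log(n / total) for tri, n in counts.items()}

    def score(strings, model, unseen_penalty=-15.0):
        """Higher is more English-like; sums log-frequencies over all ten strings."""
        total = 0.0
        for s in strings:
            for i in range(len(s) - 2):
                total += model.get(s[i:i + 3].lower(), unseen_penalty)
        return total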
Second, you need a way to produce permutations. One approach, known as hill climbing, takes the ten strings and enters a loop. Pick two random integers from 1 to 128 and swap the associated letters in all ten strings. Compute the score of the new permutation and compare it to the old permutation. If the new permutation is an improvement, keep it and loop; otherwise keep the old permutation and loop. Stop when the rate of improvement falls below some predetermined threshold. Present the outcome to the user, who may accept it as given, accept it and make changes manually, or reject it, in which case you start again from the original set of strings at a different point in the random number generator.
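A compact version of that loop, assuming a score function like the sketch above (with 0-based column indices instead of 1 to 128):

    import random

    def apply_perm(strings, perm):
        """Reorder every string's characters by the same column permutation."""
        return ["".join(s[i] for i in perm) for s in strings]

    def hill_climb(strings, score_fn, max_stale=5000):
        n = len(strings[0])
        perm = list(range(n))                    # start from the identity permutation
        best = score_fn(apply_perm(strings, perm))
        stale = 0
        while stale < max_stale:                 # stop once improvements dry up
            i, j = random.sample(range(n), 2)    # swap two random column positions
            perm[i], perm[j] = perm[j], perm[i]
            candidate = score_fn(apply_perm(strings, perm))
            if candidate > best:
                best, stale = candidate, 0       # keep the improvement
            else:
                perm[i], perm[j] = perm[j], perm[i]  # undo the swap
                stale += 1
        return perm, best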
Instead of hill-climbing, you might try simulated annealing. I'll refer you to Google for details, but the idea is that instead of always keeping the better of the two permutations, sometimes you keep the lesser of the two permutations, in the hope that it leads to a better overall outcome. This is done to defeat the tendency of hill-climbing to get stuck at a local maximum in the search space.
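In code, the only change from the hill-climbing sketch is the acceptance rule: a worse permutation is sometimes kept, with a probability that shrinks as a "temperature" parameter cools:

    import math, random

    def accept(old_score, new_score, temperature):
        """Always accept improvements; accept worse moves with decaying probability."""
        if new_score >= old_score:
            return True
        return random.random() < math.exp((new_score - old_score) / temperature)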
By the way, it's "permuted" rather than "permutated."
I need to code a solution for a certain requirement, and I wanted to know if anyone is either familiar with an off-the-shelf library that can achieve it, or can direct me to the best practice. Description:
The user inputs a word that is supposed to be one of several fixed options (I hold the options in a list). I know the input should be a member of the list, but since it is user input, the user may have made a mistake. I'm looking for an algorithm that will tell me the most probable word the user meant. I don't have any context, and I can't force the user to choose from a list (i.e. the user must be able to input the word freely and manually).
For example, say the list contains the words "water", “quarter”, "beer", “beet”, “hell”, “hello” and "aardvark".
The solution must account for different types of "normal" errors:
Speed typos (e.g. doubling characters, dropping characters etc)
Keyboard adjacent-character typos (e.g. "qater" for “water”)
Non-native English typos (e.g. "quater" for “quarter”)
And so on...
The obvious solution is to compare letter-by-letter and give "penalty weights" to each different letter, extra letter and missing letter. But this solution ignores thousands of "standard" errors I'm sure are listed somewhere. I'm sure there are heuristics out there that deal with all the cases, both specific and general, probably using a large database of standard mismatches (I’m open to data-heavy solutions).
I'm coding in Python but I consider this question language-agnostic.
Any recommendations/thoughts?
You want to read how Google does this: http://norvig.com/spell-correct.html
Edit: Some people have mentioned algorithms that define a metric between a user-given word and a candidate word (Levenshtein, Soundex). This is, however, not a complete solution to the problem, since one would also need a data structure to efficiently perform a non-Euclidean nearest-neighbour search. This can be done e.g. with the cover tree: http://hunch.net/~jl/projects/cover_tree/cover_tree.html
A common solution is to calculate the Levenshtein distance between the input and your fixed texts. The Levenshtein distance of two strings is just the number of simple operations - insertions, deletions, and substitutions of a single character - required to turn one of the strings into the other.
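For reference, a standard dynamic-programming implementation in Python:

    def levenshtein(a: str, b: str) -> int:
        """Single-character insertions, deletions and substitutions to turn a into b."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                 # deletion
                                curr[j - 1] + 1,             # insertion
                                prev[j - 1] + (ca != cb)))   # substitution
            prev = curr
        return prev[-1]

    print(levenshtein("qater", "water"))     # 1
    print(levenshtein("quater", "quarter"))  # 1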
Have you considered algorithms that compare by phonetic sounds, such as soundex? It shouldn't be too hard to produce soundex representations of your list of words, store them, and then get a soundex of the user input and find the closest match there.
Look at the Bitap algorithm. It qualifies well for what you want to do, and even comes with a source code example on Wikipedia.
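For a flavour of how Bitap works, here is the exact-match core (the shift-and variant) in Python; the approximate-matching extension adds one bit array per allowed error:

    def bitap_search(text: str, pattern: str) -> int:
        """Return the index of the first exact occurrence of pattern, or -1."""
        m = len(pattern)
        if m == 0:
            return 0
        # One bitmask per pattern character: bit i is set if pattern[i] == ch.
        masks = {}
        for i, ch in enumerate(pattern):
            masks[ch] = masks.get(ch, 0) | (1 << i)
        state = 0
        for j, ch in enumerate(text):
            state = ((state << 1) | 1) & masks.get(ch, 0)
            if state & (1 << (m - 1)):        # full pattern matched ending here
                return j - m + 1
        return -1

    print(bitap_search("a pint of water please", "water"))  # 10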
If your data set is really small, simply comparing the Levenshtein distance on all items independently ought to suffice. If it's larger, though, you'll need to use a BK-Tree or similar indexing system. The article I linked to describes how to find matches within a given Levenshtein distance, but it's fairly straightforward to adapt to do nearest-neighbor searches (and left as an exercise to the reader ;).
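A small BK-tree sketch, reusing the levenshtein function from the snippet above as the metric; nearest-neighbour search can be built on top of the radius query shown here:

    class BKTree:
        """BK-tree over a string metric, supporting radius-limited lookups."""
        def __init__(self, distance_fn, words=()):
            self.distance = distance_fn
            self.root = None                      # node = (word, {distance: child})
            for w in words:
                self.add(w)

        def add(self, word):
            if self.root is None:
                self.root = (word, {})
                return
            node = self.root
            while True:
                d = self.distance(word, node[0])
                if d in node[1]:
                    node = node[1][d]
                else:
                    node[1][d] = (word, {})
                    return

        def query(self, word, max_dist):
            """All stored words within max_dist of word, as (distance, word) pairs."""
            results, stack = [], [self.root] if self.root else []
            while stack:
                node_word, children = stack.pop()
                d = self.distance(word, node_word)
                if d <= max_dist:
                    results.append((d, node_word))
                # Triangle inequality: only children in this distance band can match.
                for child_d, child in children.items():
                    if d - max_dist <= child_d <= d + max_dist:
                        stack.append(child)
            return sorted(results)

    # Assumes levenshtein() from the earlier snippet is in scope.
    tree = BKTree(levenshtein, ["water", "quarter", "beer", "beet", "hello"])
    print(tree.query("qater", 1))  # [(1, 'water')]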
Though it may not solve the entire problem, you may want to consider using the soundex algorithm as part of the solution. A quick google search of "soundex" and "python" showed some python implementations of the algorithm.
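In case it helps, a minimal Soundex sketch (the basic American Soundex rules, without the edge cases some variants add):

    def soundex(word: str) -> str:
        """Encode a word as its first letter plus three digits."""
        codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
                 **dict.fromkeys("dt", "3"), "l": "4",
                 **dict.fromkeys("mn", "5"), "r": "6"}
        word = word.lower()
        encoded = [word[0].upper()]
        prev = codes.get(word[0], "")
        for ch in word[1:]:
            code = codes.get(ch, "")
            if code and code != prev:
                encoded.append(code)
            if ch not in "hw":      # h and w do not break a run of identical codes
                prev = code
        return ("".join(encoded) + "000")[:4]

    print(soundex("water"), soundex("watter"))    # W360 W360 -> match
    print(soundex("quarter"), soundex("quater"))  # Q636 Q360 -> no match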
Try searching for "Levenshtein distance" or "edit distance". It counts the number of edit operations (delete, insert, change letter) you need to transform one word into another. It's a common algorithm, but depending on the problem you might need something special with different weights for the different types of typos.
I am hoping I am wording this correctly to get across what I am looking for.
I need to compare two pieces of text. If the two strings are alike, I would like to get scores that are very alike; if the strings are very different, I need scores that are very different.
If I take an MD5 hash of an email and change one character, the hash changes dramatically; I want something that does not change that much. I need to compare how alike two pieces of content are without storing the strings themselves.
Update: I am now looking at combining some ideas from the various links people have provided. Ideally I would have liked a single-input function to create my score, so I am looking at using a reference string to always compare my input against. I am also looking at taking the ASCII values of the characters and summing them up. Still reading all the links provided.
What you're looking for is an LCS algorithm (see also Levenshtein distance). You may also try Soundex or some other phonetic algorithm.
Reading your comments, it sounds like you are actually trying to compare entire documents, each containing many words.
This is done successfully in information retrieval systems by treating documents as N-dimensional points in space. Each word in the language is an axis. The distance along the axis is determined by the number of times that word appears in the document. Similar documents are then "near" each other in space.
This way, the whole document doesn't need to be stored, just its word counts. And usually the most common words in the language are not counted at all.
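A small Python sketch of that vector-space idea, using raw word counts and cosine similarity; the stop-word list is just an illustrative stand-in for "the most common words":

    import math
    import re
    from collections import Counter

    STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it", "for"}

    def vectorize(text):
        """Bag-of-words vector (word -> count), skipping very common words."""
        words = re.findall(r"[a-z']+", text.lower())
        return Counter(w for w in words if w not in STOP_WORDS)

    def cosine_similarity(v1, v2):
        """1.0 for identical word distributions, 0.0 for no shared words."""
        dot = sum(v1[w] * v2[w] for w in v1 if w in v2)
        norm1 = math.sqrt(sum(c * c for c in v1.values()))
        norm2 = math.sqrt(sum(c * c for c in v2.values()))
        return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

    a = vectorize("please review the attached quarterly report")
    b = vectorize("the quarterly report is attached here for your review")
    c = vectorize("win a free vacation now")
    print(cosine_similarity(a, b))  # high: near-duplicate content
    print(cosine_similarity(a, c))  # 0.0: unrelated content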
Check the Levenshtein distance between them.
In PHP you even have the levenshtein() function that does exactly that.
I need to compare two pieces of text. If the two strings are alike I would like to get scores that are very alike; if the strings are very different I need scores that are very different.
It really depends on what you mean by "same" or "different". For example, if someone replaces "United States of America" with "USA" in your string, is that mostly the same string (because USA is just an abbreviation for something longer), or is it very different (because a lot of characters changed)?
You essentially need to either devise a function that describes how to compute "sameness" or use a pre-existing definition thereof. For example, the aforementioned Levenshtein distance measures total difference based on the number of changes you have to make to get to the original string.
Since the Levenshtein distance needs both input strings to produce a value, you would have to store all strings.
You could, however, use a small number of strings as markers and only store these as strings.
You would then calculate the Levenshtein distance from a new string to each of these marker strings and store these values. You could then guess that two strings that have a similar Levenshtein distance to all markers are also similar to each other. It would likely be sensible to "engineer" these markers in such a way that their mutual Levenshtein distance is as large as possible. I don't know whether there has been some research in this direction.
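A rough sketch of that marker idea; the marker strings and the gap measure below are arbitrary illustrative choices, and the small levenshtein helper is included only to keep the example self-contained:

    def levenshtein(a, b):
        """Standard edit distance, used here as the underlying metric."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                                prev[j - 1] + (ca != cb)))
            prev = curr
        return prev[-1]

    def marker_signature(text, markers):
        """Fixed-size signature: the distance from the text to each marker string."""
        return tuple(levenshtein(text, m) for m in markers)

    def signature_gap(sig_a, sig_b):
        """Crude dissimilarity estimate between two stored signatures."""
        return max(abs(x - y) for x, y in zip(sig_a, sig_b))

    markers = ["the quick brown fox jumps", "lorem ipsum dolor sit amet"]
    s1 = marker_signature("the quick brown fox jumped", markers)
    s2 = marker_signature("the quick brown fox jumps!", markers)
    s3 = marker_signature("completely unrelated text here", markers)
    print(signature_gap(s1, s2), "vs", signature_gap(s1, s3))  # small vs larger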
Many people have suggested looking at distance/metric-like approaches, and I think the wording of the question leads that way. (By the way, a hash like MD5 is trying to do pretty much the opposite of what a metric does, so it's hardly surprising that it wouldn't work for you. There are similar ideas that don't change much under small deltas, but I suspect they don't encode enough information for what you want to do.)
Particularly given your update in the comments though, I think this type of approach is not very helpful.
What you are looking for is more of a clustering problem, where you want to generate a signature (i.e. a feature vector) from each email and later compare it to new inputs. So essentially what you have is a machine learning problem. Deciding what "close" means may be a bit of a challenge. To get started, though, assuming it actually is emails you're looking at, you may do well to look at the sorts of feature generation done by many spam filters; this will give you a space (probably Euclidean, at least to start) in which to measure distances based on a signature (feature vector).
Without knowing more about your problem it's hard to be more specific.