Sort array of strings based on matching words with input string - algorithm

I was not able to find a solution to the problem below, which comes from a coding contest.
Problem:
We have an input string of "good words" separated by underscores and a list of user reviews (basically an array of strings where each element consists of words separated by underscores).
We have to sort the list of user reviews so that the elements containing more good words come first.
Example:
input:
good words: "pool_clean_food".
user review array:["food_bedroom_environment","view_sea_desert","clean_pool_table"].
output: [2,0,1]
Explanation:
Array[2]="clean_pool_table" having 2 good words i.e. pool and clean
Array[0]="food_bedroom_environment" having 1 good word i.e. food
Array[1]="view_sea_desert" having 0 good word i.e. nil
How can I approach the problem, which data structure shall I use so that my code can handle large inputs?

Split the good-words input string on underscores and store the words in a hash set.
Now, for each review, start with a score of 0. Split the review on underscores as well and check each word against the hash set. If the word is present, add 1 to the score of that review.
Now treat every review as a <review, score> pair and sort the reviews by score in descending order, so that reviews with more good words come first. You can use any standard O(n log n) sorting algorithm for this.
Instead of a hash set, you can use a trie, which may speed up the algorithm if the words are very long.
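A minimal Python sketch of the hash-set approach, with the highest score first so that it reproduces the expected output [2,0,1] from the question. The function name rank_reviews is made up for illustration. It counts every matching word occurrence; if a repeated good word should count only once per review, score with the size of a set intersection instead.

```python
def rank_reviews(good_words, reviews):
    """Return review indices ordered by number of good words, highest first."""
    good = set(good_words.split("_"))
    # One pass per review: count the words that appear in the good-word set.
    scores = [sum(1 for w in review.split("_") if w in good) for review in reviews]
    # sorted() is stable, so reviews with equal scores keep their original order.
    return sorted(range(len(reviews)), key=lambda i: -scores[i])


reviews = ["food_bedroom_environment", "view_sea_desert", "clean_pool_table"]
print(rank_reviews("pool_clean_food", reviews))  # [2, 0, 1]
```

Scoring is linear in the total number of words, so the sort dominates for large inputs.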

Related

Look for a data structure to match words by letters

Given a list of random lowercase words, all of the same length, and many patterns in which the letters at some positions are specified while the others are unknown, find all the words that match each pattern.
For example, words list is:
["ixlwnb","ivknmt","vvqnbl","qvhntl"]
And patterns are:
i-----
-v---l
-v-n-l
With a naive algorithm, one can do an O(NL) scan for each pattern, where N is the word count and L is the word length.
But since many patterns may be run against the same word list, is there a good data structure to preprocess and store the word list so that all patterns can be matched efficiently?
One simple idea is to use an inverted index. First, number your words -- you'll refer to them using these indices rather than the words themselves for speed and space efficiency. Probably the index fits in a 32-bit int.
Now your inverted index: for each letter in each position, construct a sorted list of IDs for words that have that letter in that location.
To do your search, you take the lists of IDs for each of the letters in the positions you're given, and take the intersection of the lists, using an algorithm like the "merge" step in merge sort. All IDs in the intersection match the input.
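The inverted index could be sketched in Python roughly as follows. The names build_index, intersect_sorted and match are hypothetical, and '-' is assumed to mark an unknown position as in the question.

```python
from collections import defaultdict


def build_index(words):
    """Map (position, letter) -> sorted list of word IDs."""
    index = defaultdict(list)
    for word_id, word in enumerate(words):
        for pos, ch in enumerate(word):
            index[(pos, ch)].append(word_id)  # IDs are appended in increasing order
    return index


def intersect_sorted(a, b):
    """Merge-style intersection of two sorted ID lists."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def match(pattern, words, index):
    """Return the words matching pattern, where '-' means 'any letter'."""
    result = None
    for pos, ch in enumerate(pattern):
        if ch == "-":
            continue
        ids = index.get((pos, ch), [])
        result = ids if result is None else intersect_sorted(result, ids)
        if not result:
            return []
    if result is None:  # the pattern had no fixed letters at all
        return list(words)
    return [words[i] for i in result]


words = ["ixlwnb", "ivknmt", "vvqnbl", "qvhntl"]
index = build_index(words)
print(match("-v-n-l", words, index))  # ['vvqnbl', 'qvhntl']
```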
Alternatively, if your words are short enough (12 characters or fewer), you could compress them into 64-bit words (using 5 bits per letter, with letters numbered 1-26). Construct a bit-mask with binary 11111 in places where you have a letter, and 00000 in places where you have a blank. Also construct a bit-test value from your input with the 5-bit code for each letter in its place, using 00000 where you have blanks. For example, if your input is a-c then your bit-mask will be binary 111110000011111 and your bit-test binary 000010000000011. Go through your word list, take the bitwise AND of each word with the bit-mask, and test whether it equals the bit-test value. This is cache friendly and the inner loop is tight, so it may be competitive with algorithms that look like they should be faster on paper.
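As an illustration of the bit-packing idea, here is a small Python sketch. The helper names pack, pattern_bits and match_packed are assumptions, and all words are assumed to be lowercase a-z and the same length as the pattern. Python integers are unbounded, so the 12-character limit only matters when porting this to a language with 64-bit words.

```python
def pack(word):
    """Pack a lowercase word into an int, 5 bits per letter (a=1 .. z=26)."""
    value = 0
    for ch in word:
        value = (value << 5) | (ord(ch) - ord("a") + 1)
    return value


def pattern_bits(pattern):
    """Return (mask, test) for a pattern where '-' marks an unknown letter."""
    mask = test = 0
    for ch in pattern:
        mask <<= 5
        test <<= 5
        if ch != "-":
            mask |= 0b11111
            test |= ord(ch) - ord("a") + 1
    return mask, test


def match_packed(pattern, words, packed):
    """Linear scan: keep the words whose masked bits equal the test value."""
    mask, test = pattern_bits(pattern)
    return [w for w, p in zip(words, packed) if (p & mask) == test]


words = ["ixlwnb", "ivknmt", "vvqnbl", "qvhntl"]
packed = [pack(w) for w in words]  # done once, up front
print(match_packed("-v---l", words, packed))  # ['vvqnbl', 'qvhntl']
```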
I'll preface this by saying it's more of a comment and less of an answer (I don't have enough reputation to comment though). I can't think of any data structure that will satisfy the requirements out of the box. It was interesting to think about, and I figured I'd share one potential solution that popped into my head.
I keyed in on the "same length" part, and figured I could come up with something based on that.
In theory we could have N maps of char -> set (N being the word length).
When a string is added, we go through each character and add the string to the corresponding set. Pseudocode:
firstCharMap[s[0]].insert(s);
secondCharMap[s[1]].insert(s);
thirdCharMap[s[2]].insert(s);
fourthCharMap[s[3]].insert(s);
fifthCharMap[s[4]].insert(s);
sixthCharMap[s[5]].insert(s);
Then, to determine which strings match the pattern, we just take the intersection of the sets. For example, "-v-n-l" would be:
intersection of sets: secondCharMap[v], fourthCharMap[n], sixthCharMap[l]
One edge case that jumps out is if I wanted to just get all of the strings, so if that's a requirement--we may also need an additional set of all of the strings.
This solution feels clunky, but I think it could work. Depending on the language, number of strings, etc--I wouldn't be surprised if it performed worse than just iterating over all strings and checking a predicate.

Best data structure to count letter frequencies?

Task:
What is the most common first letter found in all the words in this document?
-unweighted (count a word once regardless of how many times it shows up)
-weighted (count a word separately for each time it shows up)
What is the most common word of a given length in this document?
I'm thinking of using a hashmap to count the most common first letter. But should I use a hashmap for both the unweighted and weighted?
And for most common word of a given length(ex. 5) could I use something more simple like an array list?
For the unweighted, you need a hash table (a set) to keep track of the words you've already seen, as well as a hash map to count the occurrences of the first letter. That is, you need to write:
if words_seen does not contain word
    add word to words_seen
    update hash map with first letter of word
end-if
For the weighted, you don't need that hash table, because every occurrence of a word counts regardless of whether you've seen it before. So you can just write:
update hash map with first letter of word
For the most common words, you need a hash map to keep track of all the unique words you see, and the number of times you see the word. After you've scanned the entire document, make a pass through that hash map to determine the most frequent one with the desired length.
You probably don't want to use an array list for the last task, because you want to count occurrences. If you used an array list then after scanning the entire document you'd have to sort that list and count frequencies. That would take more memory and more time than just using the hash map.
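In Python, the three counts could be sketched with collections.Counter, which is a hash map from key to count. The function names are hypothetical, and the document is assumed to be already split into non-empty, normalized words.

```python
from collections import Counter


def first_letter_counts(words):
    """Return (unweighted, weighted) counters of first letters."""
    # Weighted: every occurrence of a word contributes.
    weighted = Counter(w[0] for w in words)
    # Unweighted: each distinct word contributes once, so deduplicate first.
    unweighted = Counter(w[0] for w in set(words))
    return unweighted, weighted


def most_common_word_of_length(words, length):
    """Return the most frequent word with the given length, or None."""
    counts = Counter(w for w in words if len(w) == length)
    return counts.most_common(1)[0][0] if counts else None


words = "the cat the tree the apple ant arc".split()
unweighted, weighted = first_letter_counts(words)
print(unweighted.most_common(1))  # [('a', 3)]
print(weighted.most_common(1))    # [('t', 4)]
print(most_common_word_of_length(words, 3))  # the
```

The example shows that the two variants can disagree: "the" is repeated often enough to make 't' win the weighted count, while 'a' starts the most distinct words.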

Efficient algorithm to find most common phrases in a large volume of text

I am thinking about writing a program to collect the most common phrases in a large volume of text. Had the problem been reduced to just finding words, then that would be as simple as storing each new word in a hashmap and then increasing the count on each occurrence. But with phrases, storing each permutation of a sentence as a key seems infeasible.
Basically the problem is narrowed down to figuring out how to extract every possible phrase from a large enough text. Counting the phrases and then sorting by the number of occurrences becomes trivial.
I assume that you are searching for common patterns of consecutive words appearing in the same order (e.g. "top of the world" would not be counted as the same phrase as "top of a world" or "the world of top").
If so then I would recommend the following linear-time approach:
Split your text into words and remove things you don't consider significant (i.e. remove capitalisation, punctuation, word breaks, etc.)
Convert your text into an array of integers (one integer per unique word) (e.g. every instance of "cat" becomes 1, every "dog" becomes 2) This can be done in linear time by using a hash-based dictionary to store the conversions from words to numbers. If the word is not in the dictionary then assign a new id.
Construct a suffix array for the array of integers (this is a sorted list of all the suffixes of your array and can be constructed in linear time - e.g. using the algorithm and C code here)
Construct the longest common prefix array for your suffix array. (This can also be done in linear-time, for example using this C code) This LCP array gives the number of common words at the start of each suffix between consecutive pairs in the suffix array.
You are now in a position to collect your common phrases.
It is not quite clear how you wish to determine the end of a phrase. One possibility is to simply collect all sequences of 4 words that repeat.
This can be done in linear time by working through your suffix array looking at places where the longest common prefix array is >= 4. Each run of indices x in the range [start+1...start+len] where the LCP[x] >= 4 (for all except the last value of x) corresponds to a phrase that is repeated len times. The phrase itself is given by the first 4 words of, for example, suffix start+1.
Note that this approach will potentially spot phrases that cross sentence ends. You may prefer to convert some punctuation such as full stops into unique integers to prevent this.
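To make the steps concrete, here is a deliberately simplified Python sketch. It is NOT linear time: the suffix array is built by sorting list slices and the common-prefix lengths are computed by direct comparison, which is only reasonable for small inputs. The function name repeated_phrases is made up, normalization is reduced to lowercasing and whitespace splitting, and phrases are fixed at k words as suggested above.

```python
def repeated_phrases(text, k=4):
    """Return {phrase: count} for every k-word phrase occurring at least twice."""
    words = text.lower().split()
    ids = {}
    # One integer per unique word, assigned in order of first appearance.
    seq = [ids.setdefault(w, len(ids)) for w in words]
    n = len(seq)

    # Naive suffix array: start indices sorted by the suffix they begin.
    sa = sorted(range(n), key=lambda i: seq[i:])

    def common(i, j):
        """Length of the common prefix of suffixes i and j, capped at k."""
        length = 0
        while (length < k and i + length < n and j + length < n
               and seq[i + length] == seq[j + length]):
            length += 1
        return length

    counts = {}
    x = 1
    while x < n:
        if common(sa[x - 1], sa[x]) >= k:
            start = x - 1
            # Extend the run while neighbouring suffixes share k words.
            while x < n and common(sa[x - 1], sa[x]) >= k:
                x += 1
            phrase = " ".join(words[sa[start]:sa[start] + k])
            counts[phrase] = x - start
        else:
            x += 1
    return counts


text = "top of the world is fine but top of the world is far and top of the world ends"
print(repeated_phrases(text))
```

For real volumes of text, the sorting step would be replaced by a proper linear-time suffix array and LCP construction as described in the answer.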

Checking if a word is made up of one or more concatenated dictionary words

Here's the scenario:
I have an array of millions of random strings of letters of length 3-32, and an array of words (the dictionary).
I need to test if a random string can be made up by concatenating 1, 2, or 3 different dictionary words or not.
As the dictionary words would be somewhat fixed, I can do any kind of pre-processing on them.
Ideally, I'd like something that optimizes lookup speeds by doing some kind of pre-processing on the dictionary.
What kind of data structures / algorithms should I be looking at to implement this?
First, build a B-tree-like trie structure from your dictionary. Each child of the root would map to a letter. Each second-level subtree would then hold all of the words that could be made with the first two letters, and so on.
Then take your word, start with the first letter, and walk down the trie until you find a match, then recursively apply this algorithm to the rest of the word. If you don't find a match at any point, you know you can't form the word via concatenation.
Store the dictionary strings in a hashed set data structure. Iterate through all possible splits of the string you want to check in 1, 2 or 3 parts, and for each such split look up all parts in the hash set.
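The hash-set approach might look like this in Python. The name can_form and the max_parts parameter are assumptions for illustration. This sketch does not enforce that the parts are different words; add that check if it is required.

```python
def can_form(s, dictionary, max_parts=3):
    """True if s is a concatenation of 1..max_parts words from dictionary (a set)."""
    if max_parts < 1:
        return False
    if s in dictionary:
        return True
    if max_parts == 1:
        return False
    # Try every split point: the first part must be a word, and the
    # remainder must be formable from at most max_parts - 1 words.
    return any(
        s[:i] in dictionary and can_form(s[i:], dictionary, max_parts - 1)
        for i in range(1, len(s))
    )


dictionary = {"cat", "dog", "fish"}
print(can_form("catdogfish", dictionary))     # True
print(can_form("catdogfishcat", dictionary))  # False (needs four words)
```

With strings of length at most 32 and at most three parts, there are only a few hundred splits per string, each costing a handful of hash lookups.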
Make a regex matching every word in your dictionary.
Put parentheses around it.
Put a + on the end.
Compile it with any correct (DFA-based) regex engine.

Efficient data structure for word lookup with wildcards

I need to match a series of user-inputted words against a large dictionary of words (to ensure the entered value exists).
So if the user entered:
"orange" it should match an entry "orange" in the dictionary.
Now the catch is that the user can also enter a wildcard or series of wildcard characters like say
"or__ge" which would also match "orange"
The key requirements are:
* this should be as fast as possible.
* use the smallest amount of memory to achieve it.
If the word list were small I could use a string containing all the words and use regular expressions.
However, given that the word list could potentially contain hundreds of thousands of entries, I'm assuming this wouldn't work.
So would some sort of 'tree' be the way to go for this...?
Any thoughts or suggestions on this would be totally appreciated!
Thanks in advance,
Matt
Put your word list in a DAWG (directed acyclic word graph) as described in Appel and Jacobson's paper on the World's Fastest Scrabble Program (free copy at Columbia). For your search you will traverse this graph maintaining a set of pointers: on a letter, you make a deterministic transition to children with that letter; on a wildcard, you add all children to the set.
The efficiency will be roughly the same as Thompson's NFA interpretation for grep (they are the same algorithm). The DAWG structure is extremely space-efficient—far more so than just storing the words themselves. And it is easy to implement.
Worst-case cost will be the size of the alphabet (26?) raised to the power of the number of wildcards. But unless your query begins with N wildcards, a simple left-to-right search will work well in practice. I'd suggest forbidding a query to begin with too many wildcards, or else create multiple dawgs, e.g., dawg for mirror image, dawg for rotated left three characters, and so on.
Matching an arbitrary sequence of wildcards, e.g., ______ is always going to be expensive because there are combinatorially many solutions. The dawg will enumerate all solutions very quickly.
I would first test the regex solution and see whether it is fast enough - you might be surprised! :-)
However if that wasn't good enough I would probably use a prefix tree for this.
The basic structure is a tree where:
The nodes at the top level are all the possible first letters (i.e. probably 26 nodes from a-z assuming you are using a full dictionary...).
The next level down contains all the possible second letters for each given first letter
And so on until you reach an "end of word" marker for each word
Testing whether a given string with wildcards is contained in your dictionary is then just a simple recursive algorithm where you either have a direct match for each character position, or in the case of the wildcard you check each of the possible branches.
In the worst case (all wildcards but only one word with the right number of letters right at the end of the dictionary), you would traverse the entire tree but this is still only O(n) in the size of the dictionary so no worse than a full regex scan. In most cases it would take very few operations to either find a match or confirm that no such match exists since large branches of the search tree are "pruned" with each successive letter.
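A minimal Python sketch of the prefix tree with wildcard lookup, using nested dicts. The END marker, the function names and the choice of '_' as the wildcard (as in the question) are assumptions, and dictionary words are assumed not to contain either character.

```python
END = "$"  # hypothetical end-of-word marker key


def build_trie(words):
    """Build a nested-dict trie; a node containing END terminates a word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def matches(node, pattern, wildcard="_"):
    """True if some word in the trie matches pattern exactly in length and letters."""
    if not pattern:
        return END in node
    ch, rest = pattern[0], pattern[1:]
    if ch == wildcard:
        # A wildcard has to try every child branch.
        return any(
            matches(child, rest, wildcard)
            for key, child in node.items()
            if key != END
        )
    child = node.get(ch)
    return child is not None and matches(child, rest, wildcard)


trie = build_trie(["orange", "orchid", "apple"])
print(matches(trie, "or__ge"))  # True
print(matches(trie, "or__gx"))  # False
```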
No matter which algorithm you choose, you have a tradeoff between speed and memory consumption.
If you can afford ~ O(N*L) memory (where N is the size of your dictionary and L is the average length of a word), you can try this very fast algorithm. For simplicity, I will assume the Latin alphabet with 26 letters and MAX_LEN as the maximum word length.
Create a 2D array of sets of integers, set<int> table[26][MAX_LEN].
For each word in your dictionary, add the word index to the sets in the positions corresponding to each of the letters of the word. For example, if "orange" is the 12345-th word in the dictionary, you add 12345 to the sets corresponding to [o][0], [r][1], [a][2], [n][3], [g][4], [e][5].
Then, to retrieve words corresponding to "or..ge", you find the intersection of the sets at [o][0], [r][1], [g][4], [e][5].
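A rough Python sketch of this table, assuming lowercase a-z words no longer than MAX_LEN and '.' as the wildcard. The value of MAX_LEN and the function names are made up. The final length filter is an addition not spelled out above: without it, a longer word such as "orangery" would also satisfy "or..ge".

```python
MAX_LEN = 16  # assumed maximum word length


def build_table(words):
    """table[letter][position] -> set of indices of words with that letter there."""
    table = [[set() for _ in range(MAX_LEN)] for _ in range(26)]
    for idx, word in enumerate(words):
        for pos, ch in enumerate(word):
            table[ord(ch) - ord("a")][pos].add(idx)
    return table


def lookup(pattern, words, table, wildcard="."):
    """Return the sorted words matching pattern, where wildcard matches any letter."""
    fixed = [
        table[ord(ch) - ord("a")][pos]
        for pos, ch in enumerate(pattern)
        if ch != wildcard
    ]
    if fixed:
        candidates = set.intersection(*fixed)
    else:
        candidates = set(range(len(words)))  # all wildcards: every word is a candidate
    # Keep only words of exactly the pattern's length.
    return sorted(words[i] for i in candidates if len(words[i]) == len(pattern))


words = ["orange", "orangery", "apple", "oblige"]
table = build_table(words)
print(lookup("or..ge", words, table))  # ['orange']
```

Storing one table per word length would remove the need for the final filter at the cost of more memory.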
You can try a string-matrix:
0,1: A
1,5: APPLE
2,5: AXELS
3,5: EAGLE
4,5: HELLO
5,5: WORLD
6,6: ORANGE
7,8: LONGWORD
8,13: SUPERLONGWORD
Let's call this a ragged index-matrix, to spare some memory. Order it on length, and then on alphabetical order. To address a character I use the notation x,y:z: x is the index, y is the length of the entry, z is the position. The length of your string is f and g is the number of entries in the dictionary.
Create list m, which contains potential match indexes x.
Iterate on z from 0 to f.
    Is it a wildcard and not the last character of the search string?
        Continue loop (all match).
    Is m empty?
        Search through all x from 0 to g for y that matches length. !!A!!
            Does the z character match the search string at that z? Save x in m.
        Is m empty? Break loop (no match).
    Is m not empty?
        Search through all elements of m. !!B!!
            Does not match with search? Remove from m.
        Is m empty? Break loop (no match).
A wildcard will always pass the "Match with search string?". And m is equally ordered as the matrix.
!!A!!: Binary search on length of the search string. O(log n)
!!B!!: Binary search on alphabetical ordering. O(log n)
The reason for using a string-matrix is that you already store the length of each string (because it makes searching faster), and that also gives you the size of each entry (assuming the other fields have constant size), so you can easily find the next entry in the matrix for fast iteration. Ordering the matrix isn't a problem, since it only has to be done when the dictionary is updated, not at search time.
If you are allowed to ignore case, which I assume, then make all the words in your dictionary and all the search terms the same case before anything else. Upper or lower case makes no difference. If you have some words that are case sensitive and others that are not, break the words into two groups and search each separately.
You are only matching words, so you can break the dictionary into an array of strings. Since you are only doing an exact match against a known length, break the word array into a separate array for each word length. So byLength[3] is the array of all words with length 3. Each word array should be sorted.
Now you have an array of words and a word with potential wildcards to find. Depending on whether and where the wildcards are, there are a few approaches.
If the search term has no wild cards, then do a binary search in your sorted array. You could do a hash at this point, which would be faster but not much. If the vast majority of your search terms have no wildcards, then consider a hash table or an associative array keyed by hash.
If the search term has wildcards after some literal characters, then do a binary search in the sorted array to find an upper and lower bound, then do a linear search in that bound. If the wildcards are all trailing then finding a non empty range is sufficient.
If the search term starts with wild cards, then the sorted array is no help and you would need to do a linear search unless you keep a copy of the array sorted by backwards strings. If you make such an array, then choose it any time there are more trailing than leading literals. If you do not allow leading wildcards then there is no need.
If the search term both starts and ends with wildcards, then you are stuck with a linear search within the words with equal length.
So an array of arrays of strings. Each array of strings is sorted, and contains strings of equal length. Optionally duplicate the whole structure with the sorting based on backwards strings for the case of leading wildcards.
The overall space is one or two pointers per word, plus the words. You should be able to store all the words in a single buffer if your language permits. Of course, if your language does not permit, grep is probably faster anyway. For a million words, that is 4-16MB for the arrays and similar for the actual words.
For a search term with no wildcards, performance would be very good. With wildcards, there will occasionally be linear searches across large groups of words. With the breakdown by length and a single leading character, you should never need to search more than a few percent of the total dictionary even in the worst case. Comparing only whole words of known length will always be faster than generic string matching.
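The length-bucketed sorted arrays could be sketched in Python with the bisect module. The function names and the '_' wildcard are assumptions. Instead of computing an explicit upper bound, this version binary-searches for the lower bound of the literal prefix and stops scanning as soon as the prefix no longer matches. The reversed copy for leading wildcards is left out.

```python
from bisect import bisect_left
from collections import defaultdict


def build_by_length(words):
    """Map word length -> sorted list of distinct lowercase words of that length."""
    by_length = defaultdict(list)
    for w in {w.lower() for w in words}:
        by_length[len(w)].append(w)
    for bucket in by_length.values():
        bucket.sort()
    return by_length


def search(term, by_length, wildcard="_"):
    """Return all words matching term, where wildcard matches any one character."""
    term = term.lower()
    bucket = by_length.get(len(term), [])
    prefix = term.split(wildcard, 1)[0]  # literal characters before the first wildcard
    out = []
    # With an empty prefix (leading wildcard) this scans the whole bucket.
    for i in range(bisect_left(bucket, prefix), len(bucket)):
        word = bucket[i]
        if not word.startswith(prefix):
            break  # left the range of words sharing the prefix
        if all(t == wildcard or t == c for t, c in zip(term, word)):
            out.append(word)
    return out


by_length = build_by_length(["orange", "orchid", "oblige", "apple", "Orange"])
print(search("or____", by_length))  # ['orange', 'orchid']
```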
Try building a generalized suffix tree if the dictionary will be matched against a sequence of queries. There is a linear-time algorithm for building such a tree (Ukkonen's suffix tree construction).
You can then match each query by traversing from the root node, letting the wildcard character match any character, as in typical pattern finding in a suffix tree. Without wildcards this is O(k), where k is the size of the query.

Resources