Radix sort sorts numbers starting from the least significant digit to the most significant digit.
I have the following scenario:
My alphabet is the English alphabet, and therefore my "numbers" are English-language strings. The characters of these strings are revealed one at a time, left to right; that is, the most significant digit is revealed first for all strings, and so on. At any stage I have a sorted set of k-character strings. At that point one more character is revealed for every string, and I want to sort the new set of strings. How do I do this efficiently without starting from scratch?
For example, if I had the following sorted set: { for, for, sta, sto, sto }
and after one more character of each is revealed, the set is { form, fore, star, stop, stoc },
then the new sorted set should be { fore, form, star, stoc, stop }.
I'm hoping for O(n) complexity after each new character is revealed, where n is the size of the set.
If you want to do this in O(n) you have to somehow keep track of "groups":
for, for | sta | sto, sto
Within these groups, you can sort the strings according to their last character while keeping the overall set sorted.
Storing groups can be done in various ways. At first sight, I would recommend remembering the offsets of group beginnings/endings. However, this consumes extra memory.
Another possibility might be storing the strings in some kind of prefix tree (trie), which corresponds quite naturally to "adding one character after another", but I don't know whether that is suitable for your application.
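For illustration, here is a minimal JavaScript sketch of the offset-based group idea (the names and data layout are my own, not from the question). Each round counting-sorts every group by the newly revealed character and refines the group boundaries, which costs O(n + g*26) per round, i.e. O(n) for a fixed alphabet:

// Sketch: keep the strings sorted incrementally by counting-sorting
// within each group of equal k-character prefixes.
const ALPHABET = 26;
const code = (ch) => ch.charCodeAt(0) - 97; // 'a' -> 0

// `strings` is sorted by its first k characters; `groups` is a list of
// [start, end) ranges whose strings share the same k-character prefix.
function revealOneChar(strings, groups) {
  const newGroups = [];
  for (const [start, end] of groups) {
    // Stable counting sort of strings[start..end) by the newly revealed
    // (last) character: O(groupSize + ALPHABET).
    const buckets = Array.from({ length: ALPHABET }, () => []);
    for (let i = start; i < end; i++) {
      buckets[code(strings[i][strings[i].length - 1])].push(strings[i]);
    }
    let pos = start;
    for (const bucket of buckets) {
      if (bucket.length === 0) continue;
      newGroups.push([pos, pos + bucket.length]); // refined subgroup
      for (const s of bucket) strings[pos++] = s;
    }
  }
  return newGroups; // the groups for the (k+1)-character prefixes
}

// The question's example, after the 4th character is revealed:
const words = ["form", "fore", "star", "stop", "stoc"];
revealOneChar(words, [[0, 2], [2, 3], [3, 5]]); // for,for | sta | sto,sto
console.log(words); // ["fore", "form", "star", "stoc", "stop"]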
In agglutinative languages, "words" is a fuzzy concept. Turkish, Inuktitut, and many Native American languages (amongst others) are agglutinative. In them, "words" are often or usually composed of a "base" and multiple prefixes/suffixes. So you might have ama-ebi-na-mo-kay-i-mang-na (I just made that up), where ebi is the base and the rest are affixes. Let's say this means "walking early in the morning when the birds start singing": ama/early ebi/walk na/-ing mo/during kay/bird i/plural mang/sing na/-ing. These words can get quite long, like 30+ "letters".
So I was playing around with creating a "dictionary" for a language like this, but it's not realistic to write definitions or "dictionary entries" as your typical English "words", because there is a potentially infinite number of words (all combinations of prefixes/bases/suffixes)! So instead, I was thinking maybe you could have just these "word parts" in the database (prefixes/suffixes/bases, which can't actually stand by themselves in the real spoken language, but are clearly distinct units of meaning). Having a database of word parts, you would then (in theory) query by passing as input a long, say 20-character, "word", and it would figure out how to break this word down into word parts using the database (somehow).
That is, it would take amaebinamokayimangna as input, and know that it can be broken down into ama-ebi-na-mo-kay-i-mang-na, and then it simply queries the database for those parts to return whatever metadata is associated with those parts.
What would you need to do to accomplish this, basically, at a high level? Assuming you had a database (SQL or just a text file) containing these affixes and bases, how could you take the input and know that it breaks down into these parts organized in this way? Maybe it turns out there are other parts in the DB which can be arranged like a-ma-e-bina-mo-kay-im-ang-na, which is spelled the exact same way (if you remove the hyphens), so it would likely find that as a result too and return it as another possible match.
The only way (naive way) I can think of solving this currently, is to break the input string into ngrams like this:
// Generate every substring ("n-gram") of str with a length between min and max.
function getNgrams(str, { min = 1, max = 8 } = {}) {
  const ngrams = []
  const points = Array.from(str)
  const n = points.length
  let minSize = min
  while (minSize <= max) {
    // All substrings of length minSize, left to right.
    for (let i = 0; i < (n - minSize + 1); i++) {
      const ngram = points.slice(i, i + minSize)
      ngrams.push(ngram.join(''))
    }
    minSize++
  }
  return ngrams
}
It would then check the database to see whether any of those n-grams exist, perhaps also passing in whether each one is a prefix (start of word), infix, or suffix (end of word) part. The parts table in the database would have { id, text, is_start, is_end }, that sort of thing. But this would be horribly inefficient and probably wouldn't work. It seems really complex how you might go about solving this.
So I'm wondering: how would you solve this? At a high level, what is the main vision of how you would tackle it, either in a SQL database or with some other approach?
The goal is to save the word parts, and how they combine (whether each is a prefix/infix/suffix), to some persisted store, then take as input a string which could be generated from those parts, figure out from the persisted data what the parts are, and return those parts in the correct order.
First consider the simplified problem where we have a combination of prefixes only. To be able to split this into prefixes, we would do:
Store all the prefixes in a trie.
Let's say the input has n characters. Create an array of length n (of numbers if you need just one possible split, or of sets of numbers if you need all possible splits). For each index, we will store in this array the positions of the input string from which this index can be reached by adding a prefix from the dictionary.
For each substring starting with the 1st character of the input, if it belongs to the trie, mark its end index as reachable from the 0th position (i.e. there is a path from the 0th position to the kth position, where k is the substring's length). The trie allows us to do this in O(n).
For all i = 2..n, if the ith character can be reached from the beginning, repeat the previous step for the substrings starting at i, marking their end positions as "can be reached from the (i-1)th position" as appropriate (i.e. there is a path from the (i-1)th position to the ((i-1)+k)th position).
At the end, we can traverse these indices backwards, starting at the end of the array. Each time we jump to an index stored in the array, we skip one prefix from the dictionary. Each path from the last position to the first position gives us a possible split. Since we repeated the previous step only for positions that can be reached from the 0th position, all paths are guaranteed to end up at the 0th position.
Building the array takes O(n^2) time (assuming the trie is already built). Traversing the array to find all possible splits takes O(n*s), where s is the number of possible splits. In any case, we can tell whether a split exists as soon as the array has been built.
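Here is a rough JavaScript sketch of the prefix-only version described above; the trie representation and all names are illustrative, not a definitive implementation:

// Build a trie from the dictionary of parts.
function buildTrie(parts) {
  const root = {};
  for (const p of parts) {
    let node = root;
    for (const ch of p) node = node[ch] ??= {};
    node.end = true; // marks the end of a dictionary part
  }
  return root;
}

function allSplits(input, trie) {
  const n = input.length;
  // from[i] = positions j < i such that j is reachable from 0 and
  // input[j..i) is in the dictionary (the array described above).
  const from = Array.from({ length: n + 1 }, () => []);
  const reachable = Array(n + 1).fill(false);
  reachable[0] = true;
  for (let j = 0; j < n; j++) {
    if (!reachable[j]) continue; // only extend reachable positions
    let node = trie;
    for (let i = j; i < n; i++) { // walk the trie along input[j..]
      node = node[input[i]];
      if (!node) break;
      if (node.end) { from[i + 1].push(j); reachable[i + 1] = true; }
    }
  }
  // Traverse backwards: every path from n down to 0 is one split.
  const results = [];
  (function walk(i, acc) {
    if (i === 0) { results.push(acc.join("-")); return; }
    for (const j of from[i]) walk(j, [input.slice(j, i), ...acc]);
  })(n, []);
  return results;
}

// With the made-up parts from the question:
const trie = buildTrie(["ama", "ebi", "na", "mo", "kay", "i", "mang"]);
console.log(allSplits("amaebinamokayimangna", trie));
// -> ["ama-ebi-na-mo-kay-i-mang-na"]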
The problem with prefixes, suffixes and base words is a slight modification of the above:
Build the "previous" indices for prefixes and "next" for suffixes (possibly starting from the end of the input and tracking the suffixes backwards).
For each base word in the string (all of which we can also find efficiently -O(n^2)- using a trie) see if the starting position can be reached from the left using prefixes, and end position can be reached from right using suffixes. If yes, you have a split.
As you can see, the keywords are trie and dynamic programming. Finding just a single split requires O(n^2) time after the tries are built. The tries themselves can be built in O(m) time, where m is the total length of the added strings.
What's the best approach to searching for one character in a million-character string? I'm asking from an algorithmic point of view rather than about how to do it in a particular programming language.
Is binary search a good approach?
Without preprocessing, scan the string until you meet the target character. If you only need to check presence or the location of the first instance, you are done. Otherwise, you need to scan to the end.
With preprocessing
if you need to report presence or count, build a histogram (the count of instances of every possible value); this can be done in a single pass (with possible early termination if the count is not required). A query is then done in constant time.
if you need to report the first instance (or some), fill a table of first-occurrence-indexes for each character value; this can be done in a single pass (with possible early termination). Then a query is done in constant time.
if you need to report all instances, you can prefill linked lists of all instances of every character; this can be done in a single pass, but the storage cost is heavy (one link per character). Then a query is done in time proportional to the number of occurrences.
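As a concrete illustration, one pass can fill both the histogram and the first-occurrence table at once; this JavaScript sketch uses names of my own choosing:

// Single-pass preprocessing; every query afterwards is O(1).
function preprocess(str) {
  const count = new Map(); // char -> number of occurrences
  const first = new Map(); // char -> index of first occurrence
  for (let i = 0; i < str.length; i++) {
    const ch = str[i];
    count.set(ch, (count.get(ch) ?? 0) + 1);
    if (!first.has(ch)) first.set(ch, i);
  }
  return {
    contains: (ch) => count.has(ch),
    countOf: (ch) => count.get(ch) ?? 0,
    firstIndexOf: (ch) => (first.has(ch) ? first.get(ch) : -1),
  };
}

const index = preprocess("some million-character string ...");
console.log(index.contains("m"), index.countOf("l"), index.firstIndexOf("r"));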
Note that sorting with a general sort and then answering the queries by binary search is probably the worst thing you can do. General sorting is more costly than needed (N Log(N) instead of N), and the queries are expensive (Log(N) instead of 1). Not to mention that if you need location information, you'll have to augment each character with its index before sorting.
If the characters in the string are known to be in sorted order (a pretty unlikely situation!), the answer is different:
if you need to query just once, use a dichotomic (binary) search (two searches if you are asked for the count or the range where the character is found).
if you need to perform more queries (at least S Log(S), where S is the size of the alphabet), then you can delimit the ranges of equal characters by a series of dichotomic searches, as sketched below.
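A sketch of that range delimitation for a single character (assuming, per the above, that the string is sorted): two dichotomic searches give the half-open range of the character in O(log L).

// First index i with s[i] >= ch (classic lower bound).
function lowerBound(s, ch) {
  let lo = 0, hi = s.length;
  while (lo < hi) {
    const mid = (lo + hi) >> 1;
    if (s[mid] < ch) lo = mid + 1; else hi = mid;
  }
  return lo;
}

// [start, end) of ch in the sorted string s; empty if ch is absent.
function rangeOf(s, ch) {
  const start = lowerBound(s, ch);
  const end = lowerBound(s, String.fromCharCode(ch.charCodeAt(0) + 1));
  return [start, end];
}

console.log(rangeOf("aabcccd", "c")); // [3, 6] -> count 3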
Let L be the string length and S the alphabet size.
Without preprocessing, you need a sequential search. It takes a number of comparisons equal to the position of the first occurrence of the target character (or L if it is absent). Best case 1, worst case L, average case about L/(K+1) (for a uniform distribution with K occurrences of the target character).
With preprocessing, you fill a presence table by a sequential scan of the string. The number of characters read equals the position of the "latest" first occurrence among all characters (or L if some character is absent). Best case S, worst case L. Extra storage of S bits is required. Subsequent queries are done in constant time.
Given a list of lowercase random words, each of the same length, and many patterns, each with letters specified at some positions while the other letters are unknown, find all words that match each pattern.
For example, words list is:
["ixlwnb","ivknmt","vvqnbl","qvhntl"]
And patterns are:
i-----
-v---l
-v-n-l
With a naive algorithm, one can do an O(NL) scan for each pattern, where N is the word count and L is the word length.
But since many patterns may run over the same word list, is there a good data structure in which to preprocess and store the word list that supports efficient matching for all patterns?
One simple idea is to use an inverted index. First, number your words -- you'll refer to them by these indices rather than by the words themselves, for speed and space efficiency. The index probably fits in a 32-bit int.
Now your inverted index: for each letter in each position, construct a sorted list of IDs for words that have that letter in that location.
To do your search, take the lists of IDs for the letters in the positions you're given, and compute the intersection of those lists using an algorithm like the "merge" in merge-sort. All IDs in the intersection match the input.
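A small JavaScript sketch of this, using the question's word list (all names are illustrative):

// index[pos] maps a letter to the sorted list of word IDs having that
// letter at position pos. IDs are assigned in increasing order, so the
// lists come out sorted for free.
function buildIndex(words) {
  const L = words[0].length;
  const index = Array.from({ length: L }, () => new Map());
  words.forEach((w, id) => {
    for (let pos = 0; pos < L; pos++) {
      const list = index[pos].get(w[pos]) ?? [];
      list.push(id);
      index[pos].set(w[pos], list);
    }
  });
  return index;
}

// Merge-style intersection of two sorted ID lists, O(|a| + |b|).
function intersect(a, b) {
  const out = [];
  for (let i = 0, j = 0; i < a.length && j < b.length; ) {
    if (a[i] === b[j]) { out.push(a[i]); i++; j++; }
    else if (a[i] < b[j]) i++;
    else j++;
  }
  return out;
}

function match(index, pattern, wordCount) {
  let ids = null; // null = unconstrained so far
  for (let pos = 0; pos < pattern.length; pos++) {
    if (pattern[pos] === "-") continue;
    const list = index[pos].get(pattern[pos]) ?? [];
    ids = ids === null ? list : intersect(ids, list);
  }
  return ids ?? [...Array(wordCount).keys()]; // all-blank pattern
}

const words = ["ixlwnb", "ivknmt", "vvqnbl", "qvhntl"];
const idx = buildIndex(words);
console.log(match(idx, "-v-n-l", words.length).map((id) => words[id]));
// -> ["vvqnbl", "qvhntl"]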
Alternatively, if your words are short enough (12 characters or fewer), you could compress them into 64-bit words (using 5 bits per letter, with the letters numbered 1-26). Construct a bit-mask with binary 11111 in places where you have a letter and 00000 in places where you have a blank, and a bit-test from your input with the 5-bit code for each letter in each place, using 00000 where you have blanks. For example, if your input is a-c then your bit-mask will be binary 111110000011111 and your bit-test binary 000010000000011. Go through your word list, take the bitwise AND of each word with the bit-mask, and test whether the result equals the bit-test value. This is cache-friendly and the inner loop is tight, so it may be competitive with algorithms that look faster on paper.
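A sketch of this bit-packing variant; JavaScript needs BigInt here because 12 letters x 5 bits exceeds the 53-bit safe-integer range (the encoding is the one described above: a = 1 ... z = 26, blank = 0):

const enc = (ch) => BigInt(ch === "-" ? 0 : ch.charCodeAt(0) - 96);

// Pack up to 12 letters into one integer, 5 bits per letter.
function pack(word) {
  let v = 0n;
  for (const ch of word) v = (v << 5n) | enc(ch);
  return v;
}

// 11111 where the pattern has a letter, 00000 where it has a blank;
// the test value carries the letter codes themselves.
function maskAndTest(pattern) {
  let mask = 0n, test = 0n;
  for (const ch of pattern) {
    mask = (mask << 5n) | (ch === "-" ? 0n : 0b11111n);
    test = (test << 5n) | enc(ch);
  }
  return [mask, test];
}

const words = ["ixlwnb", "ivknmt", "vvqnbl", "qvhntl"];
const packed = words.map(pack);
const [mask, test] = maskAndTest("-v-n-l");
console.log(words.filter((_, i) => (packed[i] & mask) === test));
// -> ["vvqnbl", "qvhntl"]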
I'll preface this by saying it's more of a comment and less of an answer (I don't have enough reputation to comment, though). I can't think of any data structure that satisfies the requirements out of the box. It was interesting to think about, though, and I figured I'd share one potential solution that popped into my head.
I keyed in on the "same length" part, and figured I could come up with something based on that.
In theory we could have N (N being the word length) maps of char -> set.
When a string is added, we go through each character and add the string to the corresponding set. Pseudocode:
firstCharMap[s[0]].insert(s);
secondCharMap[s[1]].insert(s);
thirdCharMap[s[2]].insert(s);
fourthCharMap[s[3]].insert(s);
fifthCharMap[s[4]].insert(s);
sixthCharMap[s[5]].insert(s);
Then, to determine which strings match a pattern, we just take the intersection of the sets. For example, "-v-n-l" would be:
intersection of sets: secondCharMap[v], fourthCharMap[n], sixthCharMap[l]
One edge case that jumps out is wanting to get all of the strings (an all-blank pattern); if that's a requirement, we may also need an additional set holding all of the strings.
This solution feels clunky, but I think it could work. Depending on the language, the number of strings, etc., I wouldn't be surprised if it performed worse than just iterating over all the strings and checking a predicate.
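For what it's worth, a compact JavaScript sketch of the map-of-sets idea (illustrative names; a real implementation would likely store IDs rather than the strings themselves):

// maps[i] : character at position i -> Set of words with that character.
function buildMaps(words) {
  const L = words[0].length;
  const maps = Array.from({ length: L }, () => new Map());
  for (const w of words) {
    for (let i = 0; i < L; i++) {
      if (!maps[i].has(w[i])) maps[i].set(w[i], new Set());
      maps[i].get(w[i]).add(w);
    }
  }
  return maps;
}

function matchPattern(maps, pattern, allWords) {
  let result = null; // null = unconstrained so far
  for (let i = 0; i < pattern.length; i++) {
    if (pattern[i] === "-") continue;
    const set = maps[i].get(pattern[i]) ?? new Set();
    result = result === null
      ? new Set(set)
      : new Set([...result].filter((w) => set.has(w)));
  }
  return result ?? new Set(allWords); // all-blank pattern edge case
}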
I am trying to write a program that groups all the anagrams in a list together, with the output sorted alphabetically. I already have a program that sorts the input alphabetically in O(n log n) time using heapsort. My program also groups the anagrams; however, it is too slow. I believe hashing would give an efficient algorithm, but I'm not quite sure how to implement it. Does anyone have any suggestions for an efficient algorithm to complete this task?
eg.
Input:
eat tea tan ate nat bat
Output:
ate eat tea
bat
nat tan
It seems you are looking at it the wrong way. From what I understand, you first sort the strings in alphabetical order and then try to separate them into groups.
Try doing it the opposite way: first group the strings into anagrams, and only then sort each group.
Grouping the anagram can be done in various ways, here is one of them:
Sort the characters of each string. This means each anagram key is itself sorted. For example, eat, tea, and ate will all be sorted to the string "aet". (Remember the original form of each word for later usage.)
Once each word is "anagram sorted", you can simply use a hash table to group all of them, using a Map<String,List<String>> where the key is the "sorted anagram" and the value is a list containing all the original words.
Once you have this map, sort each list that appears as a value in the map, and this is your final output.
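A short JavaScript sketch of this approach (the names are mine):

function groupAnagrams(words) {
  const groups = new Map(); // "sorted anagram" key -> original words
  for (const w of words) {
    const key = [...w].sort().join(""); // e.g. "eat" -> "aet"
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(w);
  }
  // Sort each group, then the groups themselves, for alphabetical output.
  return [...groups.values()]
    .map((g) => g.sort())
    .sort((a, b) => a[0].localeCompare(b[0]));
}

console.log(groupAnagrams(["eat", "tea", "tan", "ate", "nat", "bat"]));
// -> [["ate", "eat", "tea"], ["bat"], ["nat", "tan"]]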
Yeah, hashing it is.
You can use the following hashing technique (assuming your strings contain no spaces and only lower-case characters; if they contain upper case, those are treated differently -- cat and Act are not anagrams then):
The hash value of a character will be the square of its ASCII value, i.e.
a = 97*97, b = 98*98, etc.
Add up the character values in every word; that will be its hash value.
Now, group together words with same (equal) hash value.
PS: if cat and Act are anagrams, convert A to a before computation.
PPS: In response to @amit's comments: I squared the ASCII value of each character to reduce collisions, but this won't be absolutely collision-free. You can use the square of the nth Fibonacci number as the hash value and then add them up; this reduces collisions even further.
So, hash values will be like:
a = 98^2, b = 99^2, c = (98+99)^2, d = (b+c)^2 and so on...
I had a telephone interview recently for an SE role and was asked how I'd determine whether two words were anagrams. I gave a reply along the lines of: take a character, iterate over the other word, exit the loop if it exists, and so on. I think it was an N^2 solution, as there is one loop per word with an inner loop doing the comparing.
After the call I did some digging and wrote a new solution, one that I plan on handing over tomorrow at the next-stage interview. It uses a hash map with a unique prime number representing each character of the alphabet.
I then loop through the list of words, calculating the value of each word and checking whether it matches the value of the word I'm checking. If the values match, we have a winner (the whole unique-prime-factorization business).
It means one loop instead of two, which is much better, but I've started to doubt myself and am wondering whether the additional operations of the hash map and the multiplications are more expensive than the original suggestion.
I'm 99% certain the hash map is going to be faster but...
Can anyone confirm or deny my suspicions? Thank you.
Edit: I forgot to mention that I check the size of the words first before even considering doing anything.
An anagram contains all the letters of the original word, in a different order. You are on the right track to use a HashMap to process a word in linear time, but your prime number idea is an unnecessary complication.
Your data structure is a HashMap that maintains the counts of various letters. You can add letters from the first word in O(n) time. The key is the character, and the value is the frequency. If the letter isn't in the HashMap yet, put it with a value of 1. If it is, replace it with value + 1.
When iterating over the letters of the second word, subtract one from your count instead, removing a letter when it reaches 0. If you attempt to remove a letter that doesn't exist, then you can immediately state that it's not an anagram. If you reach the end and the HashMap isn't empty, it's not an anagram. Else, it's an anagram.
Alternatively, you can replace the HashMap with an array. The index of the array corresponds to the character, and the value is the same as before. It's not an anagram if a value drops to -1, and it's not an anagram at the end if any of the values aren't 0.
You can always compare the lengths of the original strings, and if they aren't the same, then they can't possibly be anagrams. Including this check at the beginning means that you don't have to check if all the values are 0 at the end. If the strings are the same length, then either something will produce a -1 or there will be all 0s at the end.
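A minimal JavaScript sketch of this counting check, assuming lowercase a-z input:

function isAnagram(a, b) {
  if (a.length !== b.length) return false; // the cheap early exit
  const counts = new Array(26).fill(0);
  for (const ch of a) counts[ch.charCodeAt(0) - 97]++; // add first word
  for (const ch of b) {
    // Subtract the second word; going negative means b uses a letter
    // more often than a does, so it can't be an anagram.
    if (--counts[ch.charCodeAt(0) - 97] < 0) return false;
  }
  return true; // equal lengths and nothing negative => all counts are 0
}

console.log(isAnagram("listen", "silent")); // true
console.log(isAnagram("rat", "car")); // false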
The problem with multiplying is that the numbers can get big. For example, if the letter 'c' mapped to 11, then a word with ten c's would overflow a 32-bit integer.
You could reduce the result modulo some other number, but then you risk having false positives.
If you use big integers, then it will go slowly for long words.
Alternative solutions are to sort the two words and then compare for equality, or to use a histogram of letter counts as suggested by chrylis in the comments.
The idea is to have an array initialized to zero containing the number of times each letter appears.
Go through the letters in the first word, incrementing the count for each letter. Then go through the letters in the second word, decrementing the count.
If all the counts are zero at the end of this process, then the words are anagrams.
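For comparison, the sort-and-compare alternative mentioned above is nearly a one-liner in JavaScript, at O(n log n) instead of linear:

const isAnagramSorted = (a, b) =>
  a.length === b.length &&
  [...a].sort().join("") === [...b].sort().join("");

console.log(isAnagramSorted("listen", "silent")); // true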