Marking multiplicity in a bag of strings - algorithm

I have a bag of strings that I want to map to another bag, such that each duplicate is suffixed with its multiplicity as it is discovered, while preserving ordering. For example, given:
["a", "b", "a", "c", "b", "a"]
I want:
["a", "b", "a #1", "c", "b #1", "a #2"]
(As this is a partially-ordered bag, ["a", "a #1", "a #2", "b", "b #1", "c"] is not a valid result.)
An obvious solution builds the multiplicity set (for the example above, {a:3, b:2, c:1}) and is O(n) in time and O(n) in space:
function mark(names) {
  var seen = {};
  for (var i = 0; i < names.length; i++) {
    var name = names[i];
    if (name in seen) {
      names[i] = name + ' #' + seen[name];
      seen[name]++;
    } else {
      seen[name] = 1;
    }
  }
  return names;
}
My question: is there a non-obvious solution that has better overall complexity? Or said differently, what other ways are there to implement this algorithm that better handle the worst case when the bag is actually a set (of very large size)?
Are there other approaches if the poset requirement is removed?

Which complexity are you trying to lower? Time can't go below O(n): you have to emit a list that's n long, which already means at least n operations. Total space can't go below O(n) either, because your output is n elements long.
The working space for "seen" would be O(m), where m is the number of unique entries in the array, if you use a hashmap or something similar. Since m <= n, you still can't get below O(n).
If you are looking for space savings and the input comes in sorted, you could do it with O(1) working space, just counting up until you see a new string and resetting your counter (see the sketch below). Again, it doesn't get you below O(n) once you include the output list, but that's impossible anyway.
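For illustration, a minimal Python sketch of that sorted-input variant (Python rather than the question's JavaScript, and it assumes the bag has already been sorted, which gives up the original ordering):

def mark_sorted(names):
    # Suffix repeated names with their occurrence count; 'names' must be sorted.
    result = []
    prev = None
    count = 0
    for name in names:
        if name == prev:
            result.append(name + ' #' + str(count))
            count += 1
        else:
            result.append(name)
            prev = name
            count = 1
    return result

print(mark_sorted(["a", "a", "a", "b", "b", "c"]))
# ['a', 'a #1', 'a #2', 'b', 'b #1', 'c']

Only prev and count are working state, so the extra space beyond the output itself is O(1).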

Related

What is the time and space complexity of the following solution?

Problem statement:
Given a non-empty string s and a dictionary wordDict containing a list of non-empty words, add spaces in s to construct a sentence where each word is a valid dictionary word. Return all such possible sentences.
Note:
The same word in the dictionary may be reused multiple times in the segmentation.
You may assume the dictionary does not contain duplicate words.
Sample test case:
Input:
s = "catsanddog"
wordDict = ["cat", "cats", "and", "sand", "dog"]
Output:
[
"cats and dog",
"cat sand dog"
]
My Solution:
class Solution {
    unordered_set<string> words;
    unordered_map<string, vector<string>> memo;
public:
    vector<string> getAllSentences(string s) {
        if (s.size() == 0) {
            return {""};
        }
        if (memo.count(s)) {
            return memo[s];
        }
        string curWord = "";
        vector<string> result;
        for (int i = 0; i < s.size(); i++) {
            curWord += s[i];
            if (words.count(curWord)) {
                auto sentences = getAllSentences(s.substr(i + 1));
                for (string rest : sentences) {
                    string sentence = curWord + (!rest.empty() ? (" " + rest) : "");
                    result.push_back(sentence);
                }
            }
        }
        return memo[s] = result;
    }
    vector<string> wordBreak(string s, vector<string>& wordDict) {
        for (auto word : wordDict) {
            words.insert(word);
        }
        return getAllSentences(s);
    }
};
I am not sure about the time and space complexity. I think it should be O(2^n), where n is the length of the given string s. Can anyone please help me prove the time and space complexity?
I also have the following questions:
If I don't use memo in the getAllSentences function, what will the time complexity be in that case?
Is there any better solution than this?
Let's try to go through the algorithm step by step, but for a specific wordDict to simplify things.
So let wordDict be all the characters from a to z,
wordDict = ["a",..., "z"]
In this case, if(words.count(curWord)) would be true every time i = 0 and false otherwise.
Also, let's skip using the memo cache (we'll add it later).
In the case above, we just go through string s recursively until we reach the end, without any additional memory except the result vector, which gives the following:
time complexity is O(n!)
space complexity is O(1) - just 1 solution exists
where n is the length of s
Now let's examine how using the memo cache changes the situation in our case. The cache would contain n items (the size of our string s), which changes the space complexity to O(n). Our time is the same, since there will be no cache hits.
This is the basis for us to move forward.
Now let's see how things change if wordDict contains all the pairs of letters (and the length of s is 2*something, so we can reach the end).
So, wordDict = ['aa','ab',...,'zz']
In this case we move forward by 2 letters instead of 1 and everything else is the same, which gives us the following complexity without using the memo cache:
time complexity is O((n/2)!)
space complexity is O(1) - just 1 solution exists
The memo cache would contain n/2 items, which again changes the space complexity to O(n), but all the keys there are of different lengths.
Let's now imagine that wordDict contains both dictionaries we mentioned before ('a'...'z','aa'...'zz').
In this case we have the following complexity without using the memo cache:
time complexity is O(n!), as we need to check both the i=0 and i=1 cases, which roughly doubles the number of checks we need to do at each step, but on the other hand it reduces the number of checks we have to do later, since we sometimes move forward by 2 letters instead of one (this is the trickiest part for me).
Space complexity is ~O(2^n) since every additional char doubles the number of results.
Now let's think about the memo cache we have. It would be useful for every 3 letters, because, for example, '...ab c...' gives the same suffix as '...a bc...', so it reduces the number of calculations by 2 at every step, which gives the following complexity:
time complexity is roughly O((n/2)!), and we need O(2*n) = O(n) memory to store the memo. Also remember that the 2 in the n/2 expression reflects the cache effectiveness.
space complexity is O(2^n) - the 2 here is a characteristic of the wordDict we've constructed.
These were 3 cases to help us understand how the complexity changes depending on the circumstances. Now let's try to generalize to the generic case:
time complexity is O((n/(l*e))!), where l is the minimum length of the words in wordDict and e is the cache effectiveness (I would assume 1 in the general case, but there may be situations where it's different, as we saw in the case above).
space complexity is O(a^n), where a is a measure of the similarity of the words in wordDict; it could be very roughly estimated as P(h/l) = (h/l)!, where h is the maximum word length in the dictionary and l is the minimum word length (for example, if wordDict contains all combinations of up to 3 letters, this gives us 3! combinations for every 6 letters).
This is how I see your approach and its complexity.
As for improving the solution itself, I don't see any simple way to do it. There might be an alternative that divides the string into 3 parts and then processes each part separately, but that really only pays off if we can get rid of collecting the result sentences and just count the number of results without producing them (a sketch of that counting idea follows below).
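For illustration, here's a minimal Python sketch (not the original C++ solution) of that counting variant: memoize on the suffix start position and return only the number of segmentations, assuming the counts alone are what's needed:

from functools import lru_cache

def count_sentences(s, word_dict):
    # Count the word-break segmentations of s, memoized on the suffix start index.
    words = set(word_dict)

    @lru_cache(maxsize=None)
    def count_from(i):
        # Number of ways to segment the suffix s[i:]
        if i == len(s):
            return 1
        total = 0
        for j in range(i + 1, len(s) + 1):
            if s[i:j] in words:
                total += count_from(j)
        return total

    return count_from(0)

print(count_sentences("catsanddog", ["cat", "cats", "and", "sand", "dog"]))  # 2

With memoization each suffix start is solved only once, so the counting version avoids materializing an exponential number of sentences.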
I hope it helps.

Efficient way to implement search operation in JSON file

I have a huge JSON file which is an array of objects containing city crime information. The number of crimes per city is listed as a key/value. I'm parsing it to a hash using yajl/json_gem.
What is an efficient way to find the top 10 cities with the most crimes / least crimes?
Generally, an efficient way of traversing through a list to find the k min or max elements is with a min or max heap. A heap is a tree-like data structure that always has the smallest or largest element at the top of the tree, and inserting a new element or deleting an element is O(log n).
Let's say you have n elements in your table and want to keep track of the k max elements (the process is identical for min; you just use a different heap). Per this StackOverflow post, storing the data in a min-heap of size k (and dropping values that are smaller than the minimum value in the heap) is an efficient solution to this problem.
The space complexity is O(k) (the heap never holds more than k elements), and the time complexity is O(n log k) (in the worst case you insert n elements, and each insertion takes log k time).
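For illustration (the question is about Ruby, but as a language-neutral sketch of the size-k heap technique), here it is in Python, assuming the parsed data is a list of records with "name" and "crime_rate" keys:

import heapq

def top_k_by_crime(cities, k):
    # Keep the k largest crime_rate entries in a size-k min-heap.
    heap = []  # (crime_rate, name) tuples; the smallest rate sits at heap[0]
    for city in cities:
        item = (int(city["crime_rate"]), city["name"])
        if len(heap) < k:
            heapq.heappush(heap, item)
        elif item > heap[0]:
            heapq.heapreplace(heap, item)  # drop the current minimum
    return [name for _, name in sorted(heap, reverse=True)]

cities = [
    {"name": "Paris", "crime_rate": "750"},
    {"name": "Rome", "crime_rate": "800"},
    {"name": "London", "crime_rate": "600"},
]
print(top_k_by_crime(cities, 2))  # ['Rome', 'Paris']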
Now, on to the implementation: Ruby doesn't have a built-in heap data structure, but the algorithms gem has a heap implemented in C.
I don't want to write the code for you, but I think that from this theory, you should be able to implement an efficient solution.
I do not expect this to be a complete answer, as the question is not clear, but this may provide the beginnings of a solution.
Suppose
h = { "info":[
{"name": "Paris", "crime_rate": "750"},
{"name": "Rome", "crime_rate": "800"},
{"name": "London", "crime_rate": "600"},
{"name": "Berlin", "crime_rate": "400"},
{"name": "Amsterdam", "crime_rate": "700"}
]
}
and the cities with the top two and bottom two crime rates are desired.
def top_so_many(h, meth, nbr)
  h[:info].public_send(meth, nbr) { |g| g[:crime_rate] }.map { |g| g[:name] }
end

top_so_many(h, :max_by, 2)
  #=> ["Rome", "Paris"]
top_so_many(h, :min_by, 2)
  #=> ["Berlin", "London"]
I would try something like this:
Store your JSON in a variable:
json = '{"info":[ {"name": "xyz", "crime_rate": "750"}, {"name": "ABC", "crime_rate": "900"}, ... ]}'
Parse the JSON:
h = JSON.parse(json)
Use find or select to pick the records you need, sort, and take the first 10 objects:
h["info"].select { |el| el["crime_rate"].to_i > 500 }.first(10) # or any other condition

Algorithm to solve nPr (permutations). For dummies

Yes, I've RTFM. Or, in this case, RTFSO. If it showed up in the search results for "npr" or "permutation", I read it. And while I have implemented Heap's algorithm, I can't make the leap from there (all permutations), to nPr (all permutations of length r, out of a larger set n).
An actual algorithm (pseudo-code is fine) is preferred to a long-winded explanation that doesn't include actual code. If you want to school me on the theory, fine, I'll be happy to learn from it, but I'd also like the accompanying code. If you can put in terms of Heap's, great; otherwise, I'll muddle through.
I don't have any code to show you (unless you want to see Heap's implemented in VBScript (which is all I have to work with at work)) because, as I said, I don't know where to go from there to get every r-length subset of set n.
In case my description of nPr is lacking, here is a very simple example of what I'm looking to do:
Given the set...
A, B, C
...I want to find every two-character permutation, like so:
A B
A C
B C
That example is overly simplistic, as what I am really trying to derive is a generalized solution that takes a set (array), and the number of items that should be in each permutation, as calling parameters.
Hmmm...now that I've written all this out, it seems to me that I only really need to know how to derive all subsets of length r from set n, since I can then find the permutations of those subsets using Heap's.
FYI: I'll be 50 this year; this isn't homework.
Relatively straightforward with recursion:
For each element in the set, either use it or not.
Recurse with the rest of the set for both variants.
Stop when the result is complete or the remaining set is empty.
For performance, avoid actual set operations by using start/position indices.
In JavaScript:
function nPr(set, n) {
  nPrImpl(set, 0, new Array(n), 0);
}

function nPrImpl(set, pos, result, resultPos) {
  // Result complete
  if (resultPos == result.length) {
    window.console.log(result);
    return;
  }
  // No more characters available
  if (pos >= set.length) {
    return;
  }
  // With set[pos]
  result[resultPos] = set[pos];
  nPrImpl(set, pos + 1, result, resultPos + 1);
  // Without set[pos]
  nPrImpl(set, pos + 1, result, resultPos);
}

// Test:
nPr(['A', 'B', 'C'], 2);
Output:
["A", "B"]
["A", "C"]
["B", "C"]
Demo: https://tidejnet.appspot.com/v3/#id=8ht8adf3rlyi

Finding anagrams for a given word

Two words are anagrams if one of them has exactly the same characters as the other word.
Example: Anagram & Nagaram are anagrams (case-insensitive).
There are many questions similar to this one. A couple of approaches to find whether two strings are anagrams are:
1) Sort the strings and compare them.
2) Create a frequency map for these strings and check if they are the same or not.
But in this case, we are given a word (for the sake of simplicity, let us assume a single word that has single-word anagrams only) and we need to find the anagrams of that word.
The solution I have in mind is to generate all permutations of the word and check which of these exist in the dictionary. But clearly, this is highly inefficient. Yes, the dictionary is available too.
So what alternatives do we have here?
I also read in a similar thread that something can be done using tries, but the person didn't explain what the algorithm was or why a trie was used in the first place; just an implementation was provided, and that too in Python or Ruby. So that wasn't really helpful, which is why I have created this new thread. If someone wants to share an implementation (in something other than C, C++, or Java), then kindly explain it too.
Example algorithm:
Open dictionary
Create empty hashmap H
For each word in dictionary:
    Create a key that is the word's letters sorted alphabetically (and forced to one case)
    Add the word to the list of words accessed by the hash key in H
To check for all anagrams of a given word:
    Create a key that is the letters of the word, sorted (and forced to one case)
    Look up that key in H
    You now have a list of all anagrams
Relatively fast to build, blazingly fast on look-up.
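A minimal Python sketch of that index (illustrative only; the function names are mine):

from collections import defaultdict

def build_anagram_index(dictionary_words):
    # Key: the word's letters lower-cased and sorted; value: all words with that key.
    index = defaultdict(list)
    for word in dictionary_words:
        index["".join(sorted(word.lower()))].append(word)
    return index

def lookup_anagrams(word, index):
    return index.get("".join(sorted(word.lower())), [])

index = build_anagram_index(["dog", "god", "tool", "loot", "rose", "sore"])
print(lookup_anagrams("ogd", index))  # ['dog', 'god']

Building the index sorts each dictionary word once; each lookup is a single sort of the query word plus a hash lookup.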
I came up with a new solution, I guess. It uses the Fundamental Theorem of Arithmetic. The idea is to use an array of the first 26 prime numbers. Then for each letter in the input word we take the corresponding prime number (A = 2, B = 3, C = 5, D = 7 ...) and calculate the product for our input word. Next we do this for each word in the dictionary, and if a word's product matches the input word's product, we add it to the resulting list. All anagrams have the same signature because
Any integer greater than 1 is either a prime number, or can be written
as a unique product of prime numbers (ignoring the order).
Here's the code. I convert the word to UPPERCASE, and 65 is the ASCII code of 'A', which corresponds to my first prime number:
private int[] PRIMES = new int[] { 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31,
37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101, 103,
107, 109, 113 };
This is the method:
private long calculateProduct(char[] letters) {
    long result = 1L;
    for (char c : letters) {
        if (c < 65) {
            return -1;
        }
        int pos = c - 65;
        result *= PRIMES[pos];
    }
    return result;
}
We know that if two words don't have the same length, they are not anagrams. So you can partition your dictionary in groups of words of the same length.
Now we focus on only one of these groups and basically all words have exactly the same length in this smaller universe.
Treat each letter position as a dimension, with the value in that dimension based on the letter (say, its ASCII code). Then you can calculate the length of the word vector.
For example, say 'A' = 65 and 'B' = 66; then length("AB") = sqrt(65*65 + 66*66). Obviously, length("AB") = length("BA").
Clearly, if two words are anagrams, then their vectors have the same length. The next question is: if two word vectors (with the same number of letters) have the same length, are they anagrams? Intuitively, I'd say no, since all vectors with that length form a sphere, and there are many of them. I'm not sure how many there actually are, given that we're in integer space in this case.
But at the very least it allows you to partition your dictionary even further. For each word in your dictionary, calculate the vector's distance:
for(each letter c) { distance += c*c }; distance = sqrt(distance);
Then, for all words of length n, create a map keyed by this distance, where the value is the list of words of length n that yield that particular distance.
You'll create one such map for each word length.
Then your lookup becomes the following algorithm:
Use the correct dictionary map based on the length of the word
Compute the length of your word's vector
Look up the list of words that match that length
Go through the list and pick the anagrams using a naive algorithm, since the list of candidates is now greatly reduced (see the sketch below)
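A rough Python sketch of this partitioning (it uses the squared length as the key to stay in integers, and the helper names are mine):

from collections import defaultdict

def build_length_index(dictionary_words):
    # Group words by word length, then by the squared length of their letter vector.
    index = defaultdict(lambda: defaultdict(list))
    for word in dictionary_words:
        key = sum(ord(c) * ord(c) for c in word)
        index[len(word)][key].append(word)
    return index

def candidate_anagrams(word, index):
    key = sum(ord(c) * ord(c) for c in word)
    candidates = index[len(word)].get(key, [])
    # Naive final check over the (hopefully short) candidate list
    return [w for w in candidates if sorted(w) == sorted(word)]

index = build_length_index(["dog", "god", "tool", "loot", "rose", "sore"])
print(candidate_anagrams("odg", index))  # ['dog', 'god']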
Reduce the words to - say - lower case (clojure.string/lower-case).
Classify them (group-by) by letter frequency-map (frequencies).
Drop the frequency maps,
... leaving the collections of anagrams.
The names in parentheses are the corresponding functions in the Lisp dialect Clojure.
The whole function can be expressed so:
(defn anagrams [dict]
  (->> dict
       (map clojure.string/lower-case)
       (group-by frequencies)
       vals))
For example,
(anagrams ["Salt" "last" "one" "eon" "plod"])
;(["salt" "last"] ["one" "eon"] ["plod"])
An indexing function that maps each thing to its collection is
(defn index [xss]
  (into {} (for [xs xss, x xs] [x xs])))
So that, for example,
((comp index anagrams) ["Salt" "last" "one" "eon" "plod"])
;{"salt" ["salt" "last"], "last" ["salt" "last"], "one" ["one" "eon"], "eon" ["one" "eon"], "plod" ["plod"]}
... where comp is the functional composition operator.
Well, tries would make it easier to check whether the word exists.
So if you put the whole dictionary in a trie:
http://en.wikipedia.org/wiki/Trie
then you can afterwards take your word and do simple backtracking, taking a char and recursively checking whether we can "walk" down the trie with any combination of the rest of the chars (adding one char at a time). When all chars are used in a recursion branch and there was a valid path in the trie, then the word exists.
The trie helps because it gives a nice stopping condition:
We can check whether a part of a string, e.g. "Anag", is a valid path in the trie; if not, we can cut that particular recursion branch. This means we don't have to check every single permutation of the characters.
In pseudo-code:
checkAllChars(currentPositionInTrie, currentlyUsedChars, restOfWord)
    if (restOfWord == 0)
    {
        AddWord(currentlyUsedChars)
    }
    else
    {
        foreach (char in restOfWord)
        {
            nextPositionInTrie = Trie.Walk(currentPositionInTrie, char)
            if (nextPositionInTrie != Positions.NOT_POSSIBLE)
            {
                checkAllChars(nextPositionInTrie, currentlyUsedChars.With(char), restOfWord.Without(char))
            }
        }
    }
Obviously you need a nice trie data structure which allows you to progressively "walk" down the tree and check at each node whether there is a path with the given char to any next node...
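A minimal Python sketch of this trie walk (illustrative only, with the trie stored as nested dicts and "$" marking the end of a word):

from collections import Counter

def build_trie(words):
    root = {}
    for word in words:
        node = root
        for ch in word.lower():
            node = node.setdefault(ch, {})
        node["$"] = True  # end-of-word marker
    return root

def anagrams_via_trie(word, root):
    # Backtrack over the remaining letter counts, pruning prefixes not in the trie.
    results = set()

    def walk(node, remaining, prefix):
        if not remaining and "$" in node:
            results.add(prefix)
        for ch in list(remaining):
            if ch in node:
                remaining[ch] -= 1
                if remaining[ch] == 0:
                    del remaining[ch]
                walk(node[ch], remaining, prefix + ch)
                remaining[ch] += 1  # restore the count when backtracking

    walk(root, Counter(word.lower()), "")
    return sorted(results)

trie = build_trie(["dog", "god", "tool", "loot", "rose"])
print(anagrams_via_trie("ogd", trie))  # ['dog', 'god']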
static void Main(string[] args)
{
    string str1 = "Tom Marvolo Riddle";
    string str2 = "I am Lord Voldemort";
    str2 = str2.Replace(" ", string.Empty);
    str1 = str1.Replace(" ", string.Empty);
    if (str1.Length != str2.Length)
        Console.WriteLine("Strings are not anagram");
    else
    {
        str1 = str1.ToUpper();
        str2 = str2.ToUpper();
        int countStr1 = 0;
        int countStr2 = 0;
        for (int i = 0; i < str1.Length; i++)
        {
            countStr1 += str1[i];
            countStr2 += str2[i];
        }
        if (countStr2 != countStr1)
            Console.WriteLine("Strings are not anagram");
        else Console.WriteLine("Strings are anagram");
    }
    Console.Read();
}
Generating all permutations is easy, I guess you are worried that checking their existence in the dictionary is the "highly inefficient" part. But that actually depends on what data structure you use for the dictionary: of course, a list of words would be inefficient for your use case. Speaking of Tries, they would probably be an ideal representation, and quite efficient, too.
Another possibility would be to do some pre-processing on your dictionary, e.g. build a hashtable where the keys are the word's letters sorted, and the values are lists of words. You can even serialize this hashtable so you can write it to a file and reload quickly later. Then to look up anagrams, you simply sort your given word and look up the corresponding entry in the hashtable.
That depends on how you store your dictionary. If it is a simple array of words, no algorithm will be faster than linear.
If it is sorted, then here's an approach that may work. I've just invented it, but I guess it's faster than the linear approach.
Denote your dictionary as D and the current prefix as S. Initially S is empty.
Create a frequency map for your word. Let's denote it by F.
Using binary search, find pointers to the start of each letter in the dictionary. Let's denote this array of pointers by P.
For each char c from A to Z: if F[c] == 0, skip it, else
    S += c;
    F[c]--;
    P <- for every character i, P[i] = pointer to the first word beginning with S+i.
    Recursively call step 4 till you find a match for your word or till you find that no such match exists.
This is how I would do it, anyway. There should be a more conventional approach, but this is faster than linear.
I tried to implement the hashmap solution:
public class Dictionary {
    public static void main(String[] args) {
        String[] dictionary = new String[]{"dog", "god", "tool", "loot", "rose", "sore"};
        HashMap<String, String> h = new HashMap<String, String>();
        QuickSort q = new QuickSort();
        for (int i = 0; i < dictionary.length; i++) {
            String temp = q.quickSort(dictionary[i]); // sorted word, e.g. "dgo" for "dog"
            if (!h.containsKey(temp)) {
                h.put(temp, dictionary[i]);
            } else {
                String s = h.get(temp);
                h.put(temp, s + " , " + dictionary[i]);
            }
        }
        String word = "tolo";
        String sortedword = q.quickSort(word);
        if (h.containsKey(sortedword.toLowerCase())) { // lowercase to make the lookup case-insensitive
            System.out.println("anagrams from Dictionary : " + h.get(sortedword.toLowerCase()));
        }
    }
}
Compute the frequency count vector for each word in the dictionary; the vector's length is the size of the alphabet.
Generate a random Gaussian vector of the same length as the alphabet.
Project each dictionary word's count vector onto this random direction and store the value (inserting so that the array of values stays sorted).
Given a new test word, project it in the same random direction used for the dictionary words.
Do a binary search to find the list of words that map to the same value.
Verify if each word obtained as above is indeed a true anagram. If not, remove it from the list.
Return the remaining elements of the list.
PS: The above procedure is a generalization of the prime number procedure which may potentially lead to large numbers (and hence computational precision issues)
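A small Python sketch of this projection index (standard library only; the exact-verification step and the names are mine):

import bisect
import random
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz"
random.seed(0)
DIRECTION = [random.gauss(0, 1) for _ in ALPHABET]  # one fixed random Gaussian vector

def project(word):
    # Dot product of the word's letter-count vector with the random direction.
    counts = Counter(word.lower())
    return sum(counts[ch] * DIRECTION[i] for i, ch in enumerate(ALPHABET))

def build_projection_index(words):
    return sorted((project(w), w) for w in words)

def find_anagrams(word, index):
    value = project(word)
    lo = bisect.bisect_left(index, (value - 1e-9,))
    hi = bisect.bisect_right(index, (value + 1e-9,))
    # Exact check weeds out accidental collisions on the projected value
    return [w for _, w in index[lo:hi] if sorted(w.lower()) == sorted(word.lower())]

index = build_projection_index(["dog", "god", "tool", "loot", "rose", "sore"])
print(find_anagrams("odg", index))  # ['dog', 'god']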
# list of words
words = ["ROOPA","TABU","OOPAR","BUTA","BUAT" , "PAROO","Soudipta",
"Kheyali Park", "Tollygaunge", "AROOP","Love","AOORP",
"Protijayi","Paikpara","dipSouta","Shyambazaar",
"jayiProti", "North Calcutta", "Sovabazaar"]
#Method 1
A = [''.join(sorted(word)) for word in words]
groups = {}
for indexofsamewords, samewords in enumerate(A):
    groups.setdefault(samewords, []).append(indexofsamewords)
print(groups)
#{'AOOPR': [0, 2, 5, 9, 11], 'ABTU': [1, 3, 4], 'Sadioptu': [6, 14], ' KPaaehiklry': [7], 'Taeggllnouy': [8], 'Leov': [10], 'Paiijorty': [12, 16], 'Paaaikpr': [13], 'Saaaabhmryz': [15], ' CNaachlortttu': [17], 'Saaaaborvz': [18]}
for index in groups.values():
    print([words[i] for i in index])
The output:
['ROOPA', 'OOPAR', 'PAROO', 'AROOP', 'AOORP']
['TABU', 'BUTA', 'BUAT']
['Soudipta', 'dipSouta']
['Kheyali Park']
['Tollygaunge']
['Love']
['Protijayi', 'jayiProti']
['Paikpara']
['Shyambazaar']
['North Calcutta']
['Sovabazaar']
One solution is -
Map prime numbers to alphabet characters and multiply the primes for each word.
For ex -
a -> 2
b -> 3
......
.......
......
z -> 101
So
'ab' -> 6
'ba' -> 6
'bab' -> 18
'abba' -> 36
'baba' -> 36
Compute the MUL_number for the given word, then return all the words from the dictionary that have the same MUL_number as the given word.
First check if the lengths of the strings are the same.
Then check if the sums of the characters in both strings are the same (i.e. the ASCII code sums).
If so, the words are anagrams,
else they are not anagrams.

Find ranges in array

I've been trying to find the optimal solution to the following (interesting?) problem that came up at work. Eventually I settled for a good-enough solution, but I'd like to know if there's a better one.
Let a1...an be an array of strings.
Let s1...sk be an unordered list of strings, all of them also members of the array.
The task is to find the minimum set of index ranges that the elements of s cover in a.
So for example if a = [ "x", "y", "a", "f", "c" ] and s = { "c","y","f" }, the answer would be (1;1), (3;4), assuming that the array is indexed from zero.
a is typically fairly large (hundreds of thousands of elements), while s is relatively small, typically length(s) < log(length(a)).
So the question is: can you find a time-efficient algorithm for this problem? (Space efficiency is not a concern within reasonable limits.)
Just a quick but important update: I need to perform this operation with different s values but the same a a lot. So precomputing stuff based on a is allowed, indeed it is the only way.
Build a hash table H(a) mapping each element to its index (ax -> x) in O(n) time and space. Then look up each sy in H(a) (O(1) time on average, O(k) total for s) and keep track of the ranges. For that you can use an array of pairs (min_index, max_index) sorted by min_index and do a binary search to either locate the containing range or find where you should insert a new one-element range (a sketch follows below).
So overall, the solution above would take O(n + k + k * log(nb_ranges)) time and O(n + nb_ranges) space.
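A rough Python sketch of that bookkeeping (names are mine; since each element of s maps to a single index, each insertion either extends an adjacent range, merges two, or adds a new one-element range):

import bisect

def precompute_index(a):
    # One-time preprocessing of a: map each element to its index.
    return {item: i for i, item in enumerate(a)}

def ranges_for(s, index_of):
    # Sorted list of (min_index, max_index) ranges, maintained with binary search.
    ranges = []
    for item in s:
        x = index_of[item]
        pos = bisect.bisect_left(ranges, (x, x))
        if pos > 0 and ranges[pos - 1][1] == x - 1:
            # Extend the range that ends just before x
            ranges[pos - 1] = (ranges[pos - 1][0], x)
            if pos < len(ranges) and ranges[pos][0] == x + 1:
                # The extension made two ranges touch; merge them
                ranges[pos - 1] = (ranges[pos - 1][0], ranges[pos][1])
                del ranges[pos]
        elif pos < len(ranges) and ranges[pos][0] == x + 1:
            # Extend the range that starts just after x
            ranges[pos] = (x, ranges[pos][1])
        else:
            ranges.insert(pos, (x, x))
    return ranges

index_of = precompute_index(["x", "y", "a", "f", "c"])
print(ranges_for({"c", "y", "f"}, index_of))  # [(1, 1), (3, 4)]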
This is what you want, written in python:
def flattened(indexes):
    s, rest = indexes[0], indexes[1:]
    result = (s, s)
    for e in rest:
        if e == result[1] + 1:
            result = (result[0], e)
        else:
            yield result
            result = (e, e)
    yield result

a = ["x", "y", "a", "f", "c"]
s = ["c", "y", "f"]

# Create lookup table of ai to index in a
src_indexes = dict((key, i) for i, key in enumerate(a))
# Create sorted list of all indexes into a
raw_dst_indexes = sorted(src_indexes[key] for key in s)
# Convert sorted list of indexes into an array of ranges
dst_indexes = [r for r in flattened(raw_dst_indexes)]
print(dst_indexes)
I think you can throw the elements of S into a set or hash table, anything with near-O(1) membership checks. Then just do a linear scan over A, with a flag to track whether you are currently covering elements of S, and the start position of that cover. Should be O(n + k).
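A minimal Python sketch of that linear scan (names are mine):

def covered_ranges(a, s):
    # Minimal list of (start, end) index ranges in a covered by members of s.
    members = set(s)          # near-O(1) membership checks
    ranges = []
    start = None              # start of the range currently being covered, if any
    for i, item in enumerate(a):
        if item in members:
            if start is None:
                start = i
        elif start is not None:
            ranges.append((start, i - 1))
            start = None
    if start is not None:
        ranges.append((start, len(a) - 1))
    return ranges

print(covered_ranges(["x", "y", "a", "f", "c"], {"c", "y", "f"}))  # [(1, 1), (3, 4)]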

Resources