Algorithm exercise

I'm working on this algorithm exercise, but I don't completely understand the formulation. Here is the exercise:
Given a string str and array of pairs that indicates which indices in
the string can be swapped, return the lexicographically largest string
that results from doing the allowed swaps. You can swap indices any
number of times.
Example
For str = "abdc" and pairs = [[1, 4], [3, 4]], the output should be
swapLexOrder(str, pairs) = "dbca".
By swapping the given indices, you get the strings: "cbda", "cbad",
"dbac", "dbca". The lexicographically largest string in this list is
"dbca".
Input/Output
[execution time limit] 4 seconds (js)
[input] string str
A string consisting only of lowercase English letters.
Guaranteed constraints: 1 ≤ str.length ≤ 10^4.
[input] array.array.integer pairs
An array containing pairs of indices that can be swapped in str
(1-based). This means that for each pairs[i], you can swap elements in
str that have the indices pairs[i][0] and pairs[i][1].
Guaranteed constraints: 0 ≤ pairs.length ≤ 5000, pairs[i].length = 2.
[output] string
My question is: why is "abcd" not a possible answer (just swapping indices 3 and 4 in the original string "abdc")? The example says
By swapping the given indices, you get the strings: "cbda", "cbad",
"dbac", "dbca". The lexicographically largest string in this list is
"dbca"
I understand that even if "abcd" is a possible answer, "dbca" is lexicographically larger, so the final answer is the same. But if I don't understand why "abcd" is not listed as a possible answer, I think I'm misunderstanding the task.

You are reading the question correctly, and their description is broken. Both "abcd" and "abdc" are on the list of possible strings that you can produce, and yet are not in their list.
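Read as intended (indices connected through pairs can be permuted arbitrarily, since repeated swaps generate every permutation within each connected component), the task is commonly solved with a disjoint-set union. A sketch under that assumption, not code from the exercise:

```python
from collections import defaultdict

def swap_lex_order(s, pairs):
    # Union-find over string positions; pairs are 1-based per the problem.
    parent = list(range(len(s)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        ra, rb = find(a - 1), find(b - 1)
        if ra != rb:
            parent[ra] = rb

    # Within each component, greedily place the largest letters at the
    # smallest indices.
    groups = defaultdict(list)
    for i in range(len(s)):
        groups[find(i)].append(i)
    result = list(s)
    for idxs in groups.values():
        letters = sorted((s[i] for i in idxs), reverse=True)
        for i, ch in zip(sorted(idxs), letters):
            result[i] = ch
    return ''.join(result)
```

For the example above, swap_lex_order("abdc", [[1, 4], [3, 4]]) returns "dbca", and "abcd" is indeed among the reachable strings.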

Related

Find longest word in dictionary that is a subsequence of a given string (Google TechDevGuide)

This is a Google interview question that is now a problem in Google's Foundations of Programming course. I was trying to understand the solution to the following problem:
https://techdevguide.withgoogle.com/paths/foundational/find-longest-word-in-dictionary-that-subsequence-of-given-string#code-challenge
"Given a string S and a set of words D, find the longest word in D that is a subsequence of S.
Word W is a subsequence of S if some number of characters, possibly zero, can be deleted from S to form W, without reordering the remaining characters.
Note: D can appear in any format (list, hash table, prefix tree, etc.)
For example, given the input of S = "abppplee" and D = {"able", "ale", "apple", "bale", "kangaroo"}, the correct output would be "apple"."
So the main doubt I have is in the section that describes "An Optimal method O(N+L) for small alphabet". In the previous approach, i.e. the O(N + L log N) one, the author builds a hash table mapping characters -> sorted indexes in the string.
So for "abppplee", the hash table is:
a -> [0]
b -> [1]
p -> [2,3,4]
l -> [5]
e -> [6,7]
In the optimal approach for a small alphabet, it is suggested to use a dense vector representation instead of the sparse representation above, e.g.:
p -> [2, 2, 3, 4, -1, -1, -1, -1]
What are they trying to represent? The string S = "abppplee" does have 8 characters but I don't get what the dense vector is representing? Is it an error or am I missing something?
Thank you.
As far as I understand it, this vector should address the question (with Y set to character p)
Given my last matched character was at index X and my next character
to match is Y, where does this match occur?
In other words, if you are at index i in the original string, the element p[i] of the vector tells you where the next p occurs in the string.
To be more specific, the original string is s=abppplee and we want to verify:
p -> [2, 2, 3, 4, -1, -1, -1, -1]
This means that if we are at index 0, the closest p is at index 2. For index 1, the closest p is also at 2. If we are at index 2, the closest next p is at position 3, etc. The values -1 signify that there is no p in the rest of the string (for example at position 4, the remaining characters are indeed just lee, thus there is no p).
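The dense table can be built in a single right-to-left pass. This is my own sketch of the construction, not code from the guide:

```python
import string

def build_next_occurrence(s, alphabet=string.ascii_lowercase):
    # table[c][i] = smallest index j > i with s[j] == c, or -1 if none.
    n = len(s)
    table = {c: [-1] * n for c in alphabet}
    nxt = {c: -1 for c in alphabet}  # next occurrence seen so far
    for i in range(n - 1, -1, -1):
        for c in alphabet:
            table[c][i] = nxt[c]
        nxt[s[i]] = i
    return table
```

For s = "abppplee" this reproduces the row above: table['p'] == [2, 2, 3, 4, -1, -1, -1, -1].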

Pair up strings to form palindromes

Given N strings, each of length at most 1000. We can concatenate a pair of strings end to end. For example, if one is "abc" and the other is "cba", then we can get "abccba" as well as "cbaabc". Some strings may be left without being concatenated to any other string. Also, no string can be concatenated to itself.
We can only concatenate two strings if the result forms a palindrome. I need to find the minimum number of strings left after making such pairs.
Example: let's say we have 9 strings:
aabbaabb
bbaabbaa
aa
bb
a
bbaa
bba
bab
ab
Then the answer here is 5.
Explanation: here are the 5 strings:
"aabbaabb" + "bbaabbaa" = "aabbaabbbbaabbaa"
"aa" + "a" = "aaa"
"bba" + "bb" = "bbabb"
"bab" + "ab" = "babab"
"bbaa"
Also, there can be up to 1000 such strings in total.
1) Make a graph where we have one node for each word.
2) Go through all pairs of words and check whether either order of concatenation forms a palindrome. If it does, connect the corresponding nodes in the graph with an edge.
3) Now use matching algorithm to find maximum number of edges you can match: http://en.wikipedia.org/wiki/Blossom_algorithm
Time complexity: O(n) for step 1, O(n^2 * 1000) for step 2, and O(n^4) for step 3, yielding a total complexity of O(n^4).
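Steps 1 and 2 can be sketched as follows (my sketch; step 3 would still need a general maximum-matching routine, e.g. NetworkX's max_weight_matching with maxcardinality=True):

```python
def is_palindrome(s):
    return s == s[::-1]

def build_pair_edges(words):
    # One node per word; an edge whenever either concatenation order
    # yields a palindrome. O(n^2 * L) for n words of length <= L.
    edges = set()
    for i in range(len(words)):
        for j in range(i + 1, len(words)):
            if is_palindrome(words[i] + words[j]) or is_palindrome(words[j] + words[i]):
                edges.add((i, j))
    return edges
```

The answer is then n minus the size of a maximum matching in this graph.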

Find all substrings that don't contain the entire set of characters

This was asked to me in an interview.
I'm given a string whose characters come from the set {a,b,c} only. Find all substrings that don't contain all the characters from the set. For example, substrings that contain only a's, only b's, only c's, or only a's and b's, or only b's and c's, or only c's and a's. I gave him the naive O(n^2) solution by generating all substrings and testing them.
The interviewer wanted an O(n) solution.
Edit: My attempt was to have the last indexes of a,b,c and run a pointer from left to right, and anytime all 3 were counted, change the start of the substring to exclude the earliest one and start counting again. It doesn't seem exhaustive
So for e.g, if the string is abbcabccaa,
let i be the pointer that traverses the string. Let start be start of the substring.
1) i = 0, start = 0
2) i = 1, start = 0, last_index(a) = 0 --> 1 substring - a
3) i = 2, start = 0, last_index(a) = 0, last_index(b) = 1 -- > 1 substring ab
4) i = 3, start = 0, last_index(a) = 0, last_index(b) = 2 --> 1 substring abb
5) i = 4, start = 1, last_index(b) = 2, last_index(c) = 3 --> 1 substring bbc(removed a from the substring)
6) i = 5, start = 3, last_index(c) = 3, last_index(a) = 4 --> 1 substring ca(removed b from the substring)
but this isn't exhaustive
Given that the problem in its original definition can't be solved in less than O(N^2) time, as some comments point out, I suggest a linear algorithm for counting the number of substrings (not necessarily unique in their values, but unique in their positions within the original string).
The algorithm
count = 0
For every char C in {'a','b','c'} scan the input S and break it into longest sequences not including C. For each such section A, add |A|*(|A|+1)/2 to count. This addition stands for the number of legal sub-strings inside A.
Now we have the total number of legal strings including only {'a','b'}, only {'a','c'} and only {'b','c'}. The problem is that we counted substrings with a single repeated character twice. To fix this we iterate over S again, this time subtracting |A|*(|A|+1)/2 for every largest sequence A of a single character that we encounter.
Return count
Example
S='aacb'
breaking it using 'a' gives us only 'cb', so count = 3. For C='b' we have 'aac', which makes count = 3 + 6 = 9. With C='c' we get 'aa' and 'b', so count = 9 + 3 + 1 = 13. Now we have to do the subtraction: 'aa': -3, 'c': -1, 'b': -1. So we have count=8.
The 8 substrings are:
'a'
'a' (the second char this time)
'aa'
'ac'
'aac'
'cb'
'c'
'b'
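The counting procedure can be sketched in Python (my own translation of the steps above):

```python
def count_substrings_missing_a_char(s, alphabet="abc"):
    # Sum |A|*(|A|+1)/2 over maximal runs A that avoid the excluded char.
    def runs_total(excluded):
        total = run = 0
        for ch in s:
            if ch != excluded:
                run += 1
            else:
                total += run * (run + 1) // 2
                run = 0
        return total + run * (run + 1) // 2

    count = sum(runs_total(c) for c in alphabet)

    # Substrings of a single repeated character were counted twice;
    # subtract them once, per maximal run of one character.
    total_single = 0
    run = 0
    prev = None
    for ch in s:
        if ch == prev:
            run += 1
        else:
            total_single += run * (run + 1) // 2
            run = 1
            prev = ch
    total_single += run * (run + 1) // 2
    return count - total_single
```

For S = 'aacb' this returns 8, matching the worked example.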
To get something better than O(n^2) we may need additional assumptions (maybe reporting only the longest substrings with this property).
Consider a string of the form aaaaaaaaaabbbbbbbbbb of length n. There are on the order of n^2 qualifying substrings, so if we want to list them all we need O(n^2) time.
I came up with a linear solution for the longest substrings.
Take a set S of all substrings separated by a, all substrings separated by b and finally all substrings separated by c. Each of those steps can be done in O(n), so we have O(3n), thus O(n).
Example:
Take aaabcaaccbaa.
In this case set S contains:
substrings separated by a: bc, ccb
substrings separated by b: aaa, caacc
substrings separated by c: aaab, aa, baa.
By a set I mean a data structure with O(1) insertion and O(1) lookup of an element with a given key.

Finding anagrams for a given word

Two words are anagrams if one of them has exactly the same characters as the other word.
Example: Anagram & Nagaram are anagrams (case-insensitive).
Now there are many questions similar to this. A couple of approaches to find whether two strings are anagrams are:
1) Sort the strings and compare them.
2) Create a frequency map for these strings and check if they are the same or not.
But in this case, we are given a word (for the sake of simplicity let us assume a single word, which will have single-word anagrams only) and we need to find anagrams for it.
The solution I have in mind is to generate all permutations of the word and check which of them exist in the dictionary. But clearly, this is highly inefficient. Yes, the dictionary is available too.
So what alternatives do we have here?
I also read in a similar thread that something can be done using Tries, but the person didn't explain what the algorithm was or why a Trie was used in the first place; just an implementation was provided, and that too in Python or Ruby. So that wasn't really helpful, which is why I have created this new thread. If someone wants to share their implementation (other than C, C++ or Java) then kindly explain it too.
Example algorithm:
Open dictionary
Create empty hashmap H
For each word in dictionary:
Create a key that is the word's letters sorted alphabetically (and forced to one case)
Add the word to the list of words accessed by the hash key in H
To check for all anagrams of a given word:
Create a key that is the letters of the word, sorted (and forced to one case)
Look up that key in H
You now have a list of all anagrams
Relatively fast to build, blazingly fast on look-up.
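A sketch of that build/look-up in Python (function names are mine):

```python
from collections import defaultdict

def build_anagram_index(dictionary):
    # Key: the word's letters sorted alphabetically, forced to one case.
    index = defaultdict(list)
    for word in dictionary:
        key = ''.join(sorted(word.lower()))
        index[key].append(word)
    return index

def anagrams_of(word, index):
    # Same key construction, then a single hash look-up.
    return index.get(''.join(sorted(word.lower())), [])
```

Building the index is O(total letters * log(word length)); each look-up is essentially one sort plus one hash access.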
I came up with a new solution I guess. It uses the Fundamental Theorem of Arithmetic. So the idea is to use an array of the first 26 prime numbers. Then for each letter in the input word we get the corresponding prime number A = 2, B = 3, C = 5, D = 7 … and then we calculate the product of our input word. Next we do this for each word in the dictionary and if a word matches our input word, then we add it to the resulting list. All anagrams will have the same signature because
Any integer greater than 1 is either a prime number, or can be written
as a unique product of prime numbers (ignoring the order).
Here's the code. I convert the word to UPPERCASE and 65 is the position of A which corresponds to my first prime number:
private int[] PRIMES = new int[] { 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31,
        37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101, 103,
        107, 109, 113 };
This is the method:
private long calculateProduct(char[] letters) {
    long result = 1L;
    for (char c : letters) {
        if (c < 65) {
            return -1;
        }
        int pos = c - 65;
        result *= PRIMES[pos];
    }
    return result;
}
We know that if two words don't have the same length, they are not anagrams. So you can partition your dictionary in groups of words of the same length.
Now we focus on only one of these groups and basically all words have exactly the same length in this smaller universe.
If each letter position is a dimension, and the value in that dimension is based on the letter (say the ASCII code), then you can calculate the length of the word vector.
For example, say 'A'=65, 'B'=66, then length("AB") = sqrt(65*65 + 66*66). Obviously, length("AB") = length("BA").
Clearly, if two words are anagrams, then their vectors have the same length. The next question is: if two word vectors (with the same number of letters) have the same length, are they anagrams? Intuitively, I'd say no, since all vectors with that length form a sphere, and there are many of them. Not sure, since we're in the integer space in this case, how many there actually are.
But at the very least it allows you to partition your dictionary even further. For each word in your dictionary, calculate the vector's distance:
for(each letter c) { distance += c*c }; distance = sqrt(distance);
Then create a map for all words of length n, and key it with the distance and the value is a list of words of length n that yield that particular distance.
You'll create a map for each distance.
Then your lookup becomes the following algorithm:
Use the correct dictionary map based on the length of the word
Compute the length of your word's vector
Lookup the list of words that match that length
Go through the list and pick the anagrams using a naive algorithm, as the list of candidates is now greatly reduced
Reduce the words to - say - lower case (clojure.string/lower-case).
Classify them (group-by) by letter frequency-map (frequencies).
Drop the frequency maps,
... leaving the collections of anagrams.
(These) are the corresponding functions in the Lisp dialect Clojure.
The whole function can be expressed so:
(defn anagrams [dict]
  (->> dict
       (map clojure.string/lower-case)
       (group-by frequencies)
       vals))
For example,
(anagrams ["Salt" "last" "one" "eon" "plod"])
;(["salt" "last"] ["one" "eon"] ["plod"])
An indexing function that maps each thing to its collection is
(defn index [xss]
  (into {} (for [xs xss, x xs] [x xs])))
So that, for example,
((comp index anagrams) ["Salt" "last" "one" "eon" "plod"])
;{"salt" ["salt" "last"], "last" ["salt" "last"], "one" ["one" "eon"], "eon" ["one" "eon"], "plod" ["plod"]}
... where comp is the functional composition operator.
Well, Tries would make it easier to check whether the word exists.
So if you put the whole dictionary in a trie:
http://en.wikipedia.org/wiki/Trie
then you can afterwards take your word and do simple backtracking, taking a char and recursively checking whether we can "walk" down the Trie with any combination of the rest of the chars (adding one char at a time). When all chars are used in a recursion branch and the path ends at a complete word in the Trie, then the word exists.
The Trie helps because it gives a nice stopping condition:
We can check whether part of a string, e.g. "Anag", is a valid path in the trie; if not, we can break that particular recursion branch. This means we don't have to check every single permutation of the characters.
In pseudo-code
checkAllChars(currentPositionInTrie, currentlyUsedChars, restOfWord)
    if (restOfWord is empty)
    {
        if (currentPositionInTrie marks a complete word)
            AddWord(currentlyUsedChars)
    }
    else
    {
        foreach (char in restOfWord)
        {
            nextPositionInTrie = Trie.Walk(currentPositionInTrie, char)
            if (nextPositionInTrie != Positions.NOT_POSSIBLE)
            {
                checkAllChars(nextPositionInTrie, currentlyUsedChars.With(char), restOfWord.Without(char))
            }
        }
    }
Obviously you need a nice Trie datastructure which allows you to progressively "walk" down the tree and check at each node if there is a path with the given char to any next node...
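A sketch of this backtracking over a plain dict-of-dicts trie (all names are mine, and '$' is an assumed end-of-word marker):

```python
from collections import Counter

def build_trie(words):
    root = {}
    for word in words:
        node = root
        for ch in word.lower():
            node = node.setdefault(ch, {})
        node['$'] = True  # marks the end of a complete word
    return root

def anagrams_in_trie(word, trie):
    results = set()

    def walk(node, remaining, path):
        # A word is found when all chars are consumed at a word-end node.
        if '$' in node and not remaining:
            results.add(path)
        # Only descend along letters we still have; dead branches are pruned
        # automatically because absent keys are never followed.
        for ch in list(remaining):
            if ch in node:
                walk(node[ch], remaining - Counter({ch: 1}), path + ch)

    walk(trie, Counter(word.lower()), "")
    return results
```

Counter subtraction drops zero counts, so `not remaining` is true exactly when every letter has been used.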
static void Main(string[] args)
{
    string str1 = "Tom Marvolo Riddle";
    string str2 = "I am Lord Voldemort";
    str2 = str2.Replace(" ", string.Empty);
    str1 = str1.Replace(" ", string.Empty);
    if (str1.Length != str2.Length)
        Console.WriteLine("Strings are not anagram");
    else
    {
        str1 = str1.ToUpper();
        str2 = str2.ToUpper();
        // Compare character frequencies; comparing only the sums of the
        // character codes is not sufficient (e.g. "AD" and "BC" have equal sums).
        int[] counts = new int[256];
        for (int i = 0; i < str1.Length; i++)
        {
            counts[str1[i]]++;
            counts[str2[i]]--;
        }
        bool anagram = true;
        foreach (int c in counts)
            if (c != 0) { anagram = false; break; }
        if (!anagram)
            Console.WriteLine("Strings are not anagram");
        else Console.WriteLine("Strings are anagram");
    }
    Console.Read();
}
Generating all permutations is easy, I guess you are worried that checking their existence in the dictionary is the "highly inefficient" part. But that actually depends on what data structure you use for the dictionary: of course, a list of words would be inefficient for your use case. Speaking of Tries, they would probably be an ideal representation, and quite efficient, too.
Another possibility would be to do some pre-processing on your dictionary, e.g. build a hashtable where the keys are the word's letters sorted, and the values are lists of words. You can even serialize this hashtable so you can write it to a file and reload quickly later. Then to look up anagrams, you simply sort your given word and look up the corresponding entry in the hashtable.
That depends on how you store your dictionary. If it is a simple array of words, no algorithm will be faster than linear.
If it is sorted, then here's an approach that may work. I've invented it just now, but I guess it's faster than the linear approach.
Denote your dictionary as D and the current prefix as S; initially S is empty.
Create a frequency map for your word; let's denote it by F.
Using binary search, find pointers to the start of each letter in the dictionary; denote this array of pointers by P.
For each char c from A to Z, if F[c] == 0, skip it, else:
S += c;
F[c]--;
P <- for every character i, P[i] = pointer to the first word beginning with S+i.
Recursively repeat the previous step until you find a match for your word or find that no such match exists.
This is how I would do it, anyway. There should be a more conventional approach, but this is faster than linear.
I tried to implement the hashmap solution:
import java.util.HashMap;

public class Dictionary {
    public static void main(String[] args) {
        String[] dictionary = new String[]{"dog", "god", "tool", "loot", "rose", "sore"};
        HashMap<String, String> h = new HashMap<String, String>();
        QuickSort q = new QuickSort(); // assumed helper that sorts the characters of a word
        for (int i = 0; i < dictionary.length; i++) {
            String temp = q.quickSort(dictionary[i]); // sorted word, e.g. "dgo" for "dog"
            if (!h.containsKey(temp)) {
                h.put(temp, dictionary[i]);
            } else {
                String s = h.get(temp);
                h.put(temp, s + " , " + dictionary[i]);
            }
        }
        String word = "tolo";
        String sortedword = q.quickSort(word);
        if (h.containsKey(sortedword.toLowerCase())) { // lower-cased so the lookup is case-insensitive
            System.out.println("anagrams from Dictionary : " + h.get(sortedword.toLowerCase()));
        }
    }
}
Compute the frequency-count vector for each word in the dictionary (a vector whose length is the size of the alphabet).
Generate a random Gaussian vector of the same length.
Project each dictionary word's count vector onto this random direction and store the value (inserting so that the array of values stays sorted).
Given a new test word, project it onto the same random direction used for the dictionary words.
Do a binary search to find the list of words that map to the same value.
Verify that each word obtained as above is indeed a true anagram. If not, remove it from the list.
Return the remaining elements of the list.
PS: The above procedure is a generalization of the prime number procedure which may potentially lead to large numbers (and hence computational precision issues)
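A pure-Python sketch of this procedure (my own; it uses a hash map keyed by the projected value instead of a sorted array with binary search — identical count vectors always produce the identical float, and the final verification step guards against accidental collisions):

```python
import random
from collections import defaultdict, Counter

def _count_vec(word):
    # Frequency-count vector over a 26-letter alphabet.
    v = [0] * 26
    for ch in word.lower():
        if 'a' <= ch <= 'z':
            v[ord(ch) - ord('a')] += 1
    return v

def build_projection_index(words, seed=0):
    rng = random.Random(seed)
    direction = [rng.gauss(0, 1) for _ in range(26)]  # random Gaussian direction
    index = defaultdict(list)
    for w in words:
        key = sum(c * d for c, d in zip(_count_vec(w), direction))
        index[key].append(w)
    return index, direction

def project_lookup(word, index, direction):
    key = sum(c * d for c, d in zip(_count_vec(word), direction))
    # Equal projections are necessary but not sufficient: verify candidates.
    return [w for w in index.get(key, [])
            if Counter(w.lower()) == Counter(word.lower())]
```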
# list of words
words = ["ROOPA","TABU","OOPAR","BUTA","BUAT" , "PAROO","Soudipta",
"Kheyali Park", "Tollygaunge", "AROOP","Love","AOORP",
"Protijayi","Paikpara","dipSouta","Shyambazaar",
"jayiProti", "North Calcutta", "Sovabazaar"]
# Method 1
A = [''.join(sorted(word)) for word in words]
groups = {}
for index, sortedword in enumerate(A):
    groups.setdefault(sortedword, []).append(index)
print(groups)
# {'AOOPR': [0, 2, 5, 9, 11], 'ABTU': [1, 3, 4], 'Sadioptu': [6, 14], ' KPaaehiklry': [7], 'Taeggllnouy': [8], 'Leov': [10], 'Paiijorty': [12, 16], 'Paaaikpr': [13], 'Saaaabhmryz': [15], ' CNaachlortttu': [17], 'Saaaaborvz': [18]}
for indices in groups.values():
    print([words[i] for i in indices])
The Output :
['ROOPA', 'OOPAR', 'PAROO', 'AROOP', 'AOORP']
['TABU', 'BUTA', 'BUAT']
['Soudipta', 'dipSouta']
['Kheyali Park']
['Tollygaunge']
['Love']
['Protijayi', 'jayiProti']
['Paikpara']
['Shyambazaar']
['North Calcutta']
['Sovabazaar']
One solution is -
Map prime numbers to alphabet characters and multiply prime number
For ex -
a -> 2
b -> 3
......
.......
......
z -> 101
So
'ab' -> 6
'ba' -> 6
'bab' -> 18
'abba' -> 36
'baba' -> 36
Compute the MUL_number for the given word, then return all the words from the dictionary that have the same MUL_number as the given word.
First check if the lengths of the strings are the same.
Then check if the sums of the character codes (i.e. the ASCII sums) in both strings are the same. Note that equal sums are a necessary condition but not a sufficient one: "ad" and "bc" have the same length and the same ASCII sum without being anagrams, so to be certain you still need to compare character frequencies.
If the frequencies match, the words are anagrams; else they are not.

Randomly sampling unique subsets of an array

If I have an array:
a = [1,2,3]
How do I randomly select subsets of the array, such that the elements of each subset are unique? That is, for a the possible subsets would be:
[]
[1]
[2]
[3]
[1,2]
[2,3]
[1,2,3]
I can't generate all of the possible subsets as the real size of a is very big, so there are many, many subsets. At the moment, I am using a 'random walk' idea - for each element of a, I 'flip a coin' and include it if the coin comes up heads - but I am not sure if this actually uniformly samples the space. It feels like it biases towards the middle, but this might just be my mind doing pattern-matching, as there will be more middle-sized possibilities.
Am I using the right approach, or how should I be randomly sampling?
(I am aware that this is more of a language agnostic and 'mathsy' question, but I felt it wasn't really Mathoverflow material - I just need a practical answer.)
Just go ahead with your original "coin flipping" idea. It uniformly samples the space of possibilities.
It feels to you like it's biased towards the "middle", but that's because the number of possibilities is largest in the "middle". Think about it: there is only 1 possibility with no elements, and only 1 with all elements. There are N possibilities with 1 element, and N possibilities with (N-1) elements. As the number of elements chosen gets closer to (N/2), the number of possibilities grows very quickly.
You could generate random numbers, convert them to binary and choose the elements from your original array where the bits were 1. Here is an implementation of this as a monkey-patch for the Array class:
class Array
def random_subset(n=1)
raise ArgumentError, "negative argument" if n < 0
(1..n).map do
r = rand(2**self.size)
self.select.with_index { |el, i| r[i] == 1 }
end
end
end
Usage:
a.random_subset(3)
#=> [[3, 6, 9], [4, 5, 7, 8, 10], [1, 2, 3, 4, 6, 9]]
Generally this doesn't perform too badly: it's O(n*m), where n is the number of subsets you want and m is the length of the array.
I think the coin flipping is fine.
ar = ('a'..'j').to_a
p ar.select{ rand(2) == 0 }
An array with 10 elements has 2**10 possible combinations (including [ ] and all 10 elements), which is nothing more than 10 independent (1 or 0) choices. It does output more arrays of four, five and six elements, because there are a lot more of those in the power set.
A way to select a random element from the power set is the following:
my_array = ('a'..'z').to_a
power_set_size = 2 ** my_array.length
random_subset = rand(power_set_size)
subset = []
random_subset.to_s(2).rjust(my_array.length, "0").chars.each_with_index do |bit, corresponding_element|
subset << my_array[corresponding_element] if bit == "1"
end
This uses string functions instead of real "bits" and bitwise operations, just for my convenience. You can turn it into a faster (I guess) algorithm by using real bits.
What it does is encode each subset of array as an integer between 0 and 2 ** array.length - 1, pick one of those integers uniformly at random, and then decode the integer back into a particular subset of array using a bitmask (1 = the element is in the subset, 0 = it is not).
In this way you have a uniform distribution over the power set of your array.
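The same integer-bitmask idea can be sketched in Python for comparison (my sketch; getrandbits draws every mask with equal probability, so each subset is equally likely):

```python
import random

def random_subset(arr, rng=random):
    # Draw an integer in [0, 2**len(arr)) and use its bits as an inclusion mask.
    mask = rng.getrandbits(len(arr)) if arr else 0
    return [x for i, x in enumerate(arr) if (mask >> i) & 1]
```

Each of the 2**len(arr) masks corresponds to exactly one subset, so the sampling is uniform over the power set.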
a.select {|element| rand(2) == 0 }
For each element, a coin is flipped. If heads ( == 0), then it is selected.
