Create given string from dictionary entries - algorithm

During a recent job interview, I was asked to give a solution to the following problem:
Given a string s (without spaces) and a dictionary, return the words in the dictionary that compose the string.
For example, s= peachpie, dic= {peach, pie}, result={peach, pie}.
I will consider the decision variation of this problem: if s can be composed of words in the dictionary, return yes; otherwise return no.
My solution used backtracking (written in Java):
public static boolean words(String s, Set<String> dictionary)
{
    if ("".equals(s))
        return true;
    for (int i = 0; i <= s.length(); i++)
    {
        String pre = prefix(s, i); // returns s[0..i-1]
        String suf = suffix(s, i); // returns s[i..s.len]
        if (dictionary.contains(pre) && words(suf, dictionary))
            return true;
    }
    return false;
}

public static void main(String[] args) {
    Set<String> dic = new HashSet<String>();
    dic.add("peach");
    dic.add("pie");
    dic.add("1");
    System.out.println(words("peachpie1", dic)); // true
    System.out.println(words("peachpie2", dic)); // false
}
What is the time complexity of this solution?
I'm calling recursively in the for loop, but only for the prefixes that are in the dictionary.
Any ideas?

You can easily create a case where the program takes at least exponential time to complete. Take the word aaa...aaab, where a is repeated n times, and a dictionary containing only two words, a and aa.
The b at the end ensures that the function never finds a match and thus never exits prematurely.
On each execution of words, two recursive calls are spawned: with suffix(s, 1) and suffix(s, 2). Execution time therefore grows like the Fibonacci numbers: t(n) = t(n - 1) + t(n - 2). (You can verify it by inserting a counter.) So the complexity is certainly not polynomial, and this is not even the worst possible input.
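For example, here is a small self-contained sketch (the same function with the prefix/suffix helpers inlined as substring calls, plus a static call counter) that lets you watch the call count grow roughly like the Fibonacci numbers:

import java.util.*;

public class CountCalls {
    static long calls = 0;

    static boolean words(String s, Set<String> dictionary) {
        calls++;
        if ("".equals(s))
            return true;
        for (int i = 0; i <= s.length(); i++) {
            String pre = s.substring(0, i); // s[0..i-1]
            String suf = s.substring(i);    // s[i..s.len]
            if (dictionary.contains(pre) && words(suf, dictionary))
                return true;
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> dic = new HashSet<String>(Arrays.asList("a", "aa"));
        for (int n = 5; n <= 25; n += 5) {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < n; i++) sb.append('a');
            calls = 0;
            words(sb.append('b').toString(), dic);
            System.out.println("n = " + n + " -> " + calls + " calls");
        }
    }
}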
But you can easily improve your solution with memoization. Notice that the output of words depends on one thing only: at which position in the original string we're starting. E.g., if we have the string abcdefg and words(5) is called, it doesn't matter how exactly abcde was composed (as ab+c+de or a+b+c+d+e or something else). Thus, we don't have to recalculate words("fg") each time.
In the primitive version, this can be done like this (assuming a field such as static Set<String> processed = new HashSet<String>(); is declared somewhere):

public static boolean words(String s, Set<String> dictionary) {
    if (processed.contains(s)) {
        // we've already processed string 's' with no luck
        return false;
    }
    // your normal computations
    // ...
    // if no match found, add 's' to the list of checked inputs
    processed.add(s);
    return false;
}
PS Still, I do encourage you to change words(String) to words(int). This way you'll be able to store results in an array and even transform the whole algorithm to DP (which would make it much simpler).
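For example, the words(int) variant with an array of memoized results might look roughly like this (a sketch of that suggestion, not code from the original answer):

// memo[i] caches whether s[i..n-1] can be composed from dictionary words:
// 0 = unknown, 1 = yes, 2 = no.  Call as: words(s, 0, dictionary, new int[s.length() + 1])
public static boolean words(String s, int start, Set<String> dictionary, int[] memo) {
    if (start == s.length())
        return true;
    if (memo[start] != 0)
        return memo[start] == 1;
    for (int end = start + 1; end <= s.length(); end++) {
        if (dictionary.contains(s.substring(start, end)) && words(s, end, dictionary, memo)) {
            memo[start] = 1;
            return true;
        }
    }
    memo[start] = 2;
    return false;
}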
edit 2
Since I have not much to do besides work, here's the DP (dynamic programming) solution. Same idea as above.
String s = "peachpie1";
int n = s.length();
boolean[] a = new boolean[n + 1];
// a[i] tells whether s[i..n-1] can be composed from words in the dictionary
a[n] = true; // always can compose empty string
for (int start = n - 1; start >= 0; --start) {
    for (String word : dictionary) {
        if (start + word.length() <= n && a[start + word.length()]) {
            // check if 'word' is a prefix of s[start..n-1]
            String test = s.substring(start, start + word.length());
            if (test.equals(word)) {
                a[start] = true;
                break;
            }
        }
    }
}
System.out.println(a[0]);

Here's a dynamic programming solution that counts the total number of ways to decompose the string into words. It solves your original problem, since the string is decomposable if the number of decompositions is positive.
def count_decompositions(dictionary, word):
    n = len(word)
    results = [1] + [0] * n
    for i in xrange(1, n + 1):
        for j in xrange(i):
            if word[n - i:n - j] in dictionary:
                results[i] += results[j]
    return results[n]
Storage O(n), and running time O(n^2).

The loop over the whole string will take n. Finding all the suffixes and prefixes will take n + (n - 1) + (n - 2) + ... + 1 (n for the first call of words, (n - 1) for the second, and so on), which is
n + (n - 1) + ... + 1 = (n^2 + n) / 2
which in complexity theory is equivalent to n^2.
Checking for existence in a HashSet is Theta(1) in the normal case, but O(n) in the worst case.
So the normal-case complexity of your algorithm is Theta(n^2), and the worst case is O(n^3).
EDIT: I confused the order of recursion and iteration, so this answer is wrong. Actually, the time depends on n exponentially (compare with the computation of Fibonacci numbers, for example).
A more interesting question is how to improve your algorithm. Traditionally, a suffix tree is used for string operations. You can build a suffix tree from your string and mark all the nodes as "untracked" at the start of the algorithm. Then go through the strings in the set and, each time some node is used, mark it as "tracked". If all the strings in the set are found in the tree, it means the original string contains all the substrings from the set. And if all the nodes are marked as tracked, it means the string consists only of substrings from the set.
The actual complexity of this approach depends on many factors, such as the tree-building algorithm, but at least it lets you divide the problem into several independent subtasks and measure the final complexity by the complexity of the most expensive subtask.

Related

Worst-case time complexity for adding two binary numbers using bit manipulation

I'm looking over the solution for adding 2 binary numbers using the bit-manipulation approach. The inputs are 2 binary-number strings, and the expected output is the string for the binary addition result.
Here is Java code similar to the solution:
class BinaryAddition {
    public String addBinary(String a, String b) {
        BigInteger x = new BigInteger(a, 2);
        BigInteger y = new BigInteger(b, 2);
        BigInteger zero = new BigInteger("0", 2);
        BigInteger carry, answer;
        while (y.compareTo(zero) != 0) {
            answer = x.xor(y);
            carry = x.and(y).shiftLeft(1);
            x = answer;
            y = carry;
        }
        return x.toString(2);
    }
}
While the algorithm makes a lot of sense, I'm somehow arriving at a different worst-case time complexity than the O(M + N) stated in the Leetcode solution, where M and N refer to the lengths of the input strings.
In my opinion, it should be O(max(M, N)^2), since the while loop can run up to max(M, N) times, and each iteration can take max(M, N) operations. For example, adding 1111 and 1001 would take 4 iterations of the while loop.
Appreciate your advice on this or where I might have gone wrong, thanks.
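For comparison, the classic digit-by-digit addition achieves O(M + N) by doing constant work per character; a rough sketch of that approach (my own, not the Leetcode code) looks like this:

// Add two binary strings right to left, one digit per step.
// Each of the max(M, N) + 1 steps does O(1) work, so the total is O(M + N).
public String addBinary(String a, String b) {
    StringBuilder sb = new StringBuilder();
    int i = a.length() - 1, j = b.length() - 1, carry = 0;
    while (i >= 0 || j >= 0 || carry != 0) {
        int sum = carry;
        if (i >= 0) sum += a.charAt(i--) - '0';
        if (j >= 0) sum += b.charAt(j--) - '0';
        sb.append(sum % 2);
        carry = sum / 2;
    }
    return sb.reverse().toString();
}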

Time Complexity of this Word Break DFS + Memoization solution

When dealing with the wordBreak problem, I found this solution really concise, but I'm not sure about the time complexity. Can anyone help?
My understanding is that the worst case is O(n*k), where n is the size of the wordDict and k is the length of the String.
class Solution {
    public boolean wordBreak(String s, List<String> wordDict) {
        return wordBreak(s, wordDict, new HashMap<String, Boolean>());
    }

    private boolean wordBreak(String s, List<String> wordDict, Map<String, Boolean> memo) {
        if (s == null) return false;
        if (s.isEmpty()) return true;
        if (memo.containsKey(s)) return memo.get(s);
        for (String dict : wordDict) { // number of words O(n)
            // startsWith is bounded by the length of dict word, avg is O(m), can be ignored
            // substring is bounded by the length of dict word, avg is O(k), k is the length of s
            // wordBreak will be executed k/m times, k is the length of s, worse case k times... when a single letter is in the dict
            if (s.startsWith(dict) && wordBreak(s.substring(dict.length()), wordDict, memo)) {
                memo.put(s, true);
                return true;
            }
        }
        memo.put(s, false);
        return false;
    }
}
It's worse than O(nk) for several reasons (note: in what follows, n denotes the length of the string and k the number of words in the dictionary, the reverse of the question's naming):
You ignore "m", but m is Omega(log k). (Because k < |A|^(m+1), where |A| is the size of your alphabet.)
s.substring is probably O(n). Your code looks like Java, and it's O(n) in Java.
Even if s.substring were O(1), your Map requires the string to be hashed, so your map operations are O(n) (where, note carefully, n is the size of the string rather than the size of the hashtable, as it would normally be).
Probably this means you have complexity O(n^2 * k * log k).
You can fix 3 easily -- you can use s.length rather than s as the key to your hashtable.
Problem 2 is easy but slightly annoying to fix -- rather than slicing your string, you can use a variable that indexes into the string. You probably have to re-write startsWith yourself to use this index (or use a trie -- see below). If your programming language has an O(1) slice operation (for example, string_view in C++) then you could use that instead.
Problem 1 is only theoretical, since for real word lists, m is really small compared to either the length of the dictionary or the potential length of input strings.
Note that using a trie for the dictionary rather than a word list is likely to result in a huge time improvement, with realistic examples being linear excluding dictionary construction (although worst-case examples where the dictionary and input strings are chosen maliciously will be O(nk)).
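For illustration, a rough trie-based sketch (my own, assuming lowercase a-z words) that also walks the string by index instead of slicing it:

// A trie over the dictionary plus index-based memoization.  From position
// 'start' we walk down the trie; every word-end node reached gives a split point.
class TrieNode {
    TrieNode[] next = new TrieNode[26];
    boolean isWord;
}

class TrieWordBreak {
    private final TrieNode root = new TrieNode();

    void addWord(String w) {
        TrieNode node = root;
        for (char c : w.toCharArray()) {
            int k = c - 'a';
            if (node.next[k] == null) node.next[k] = new TrieNode();
            node = node.next[k];
        }
        node.isWord = true;
    }

    boolean wordBreak(String s) {
        return canBreak(s, 0, new int[s.length() + 1]); // memo: 0 unknown, 1 yes, 2 no
    }

    private boolean canBreak(String s, int start, int[] memo) {
        if (start == s.length()) return true;
        if (memo[start] != 0) return memo[start] == 1;
        TrieNode node = root;
        for (int i = start; i < s.length(); i++) {
            node = node.next[s.charAt(i) - 'a'];
            if (node == null) break;                       // no dictionary word extends this prefix
            if (node.isWord && canBreak(s, i + 1, memo)) { // a word ends at position i
                memo[start] = 1;
                return true;
            }
        }
        memo[start] = 2;
        return false;
    }
}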

What's the space complexity of this permutations algorithm?

The time complexity of this algorithm for computing permutations recursively should be O(n!*n), but I am not 100% sure about the space complexity.
There are n levels of recursion, and the largest space required at a single level is n (the length of each permutation) * n! (the number of permutations). Is the space complexity of the algorithm O(n!*n^2)?
static List<String> permutations(String word) {
    if (word.length() == 1)
        return Arrays.asList(word);
    String firstCharacter = word.substring(0, 1);
    String rest = word.substring(1);
    List<String> permutationsOfRest = permutations(rest);
    List<String> permutations = new ArrayList<String>(); // or a HashSet if I don't want duplicates
    for (String permutationOfRest : permutationsOfRest) {
        for (int i = 0; i <= permutationOfRest.length(); i++) {
            permutations.add(permutationOfRest.substring(0, i) + firstCharacter + permutationOfRest.substring(i));
        }
    }
    return permutations;
}
No, the space complexity is "just" O(n! × n), since you don't simultaneously hold onto all recursive calls' permutationsOfRest / permutations. (You do have two at a time, but that's just a constant factor, so isn't relevant to the asymptotic complexity.)
Note that if you don't actually need a List<String>, it might be better to wrap things up as a custom Iterator<String> implementation, so that you don't need to keep all permutations in memory at once, and don't need to pre-calculate all permutations before you start doing anything with any of them. (Of course, that's a bit trickier to implement, so it's not worth it if the major use of the Iterator<String> will just be to pre-populate a List<String> anyway.)
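As a rough illustration of that idea (my own sketch, assuming the word has at most 20 characters so n! fits in a long, and making no attempt to skip duplicates):

import java.util.*;

// Produces permutations lazily: each call to next() decodes permutation number
// 'index' via the factorial number system, so only O(n) extra memory is used.
class LazyPermutations implements Iterator<String> {
    private final String word;
    private final long total;  // n! permutations in all
    private long index = 0;    // index of the next permutation to produce

    LazyPermutations(String word) {
        this.word = word;
        long t = 1;
        for (int i = 2; i <= word.length(); i++) t *= i;
        this.total = t;
    }

    @Override public boolean hasNext() { return index < total; }

    @Override public String next() {
        List<Character> pool = new ArrayList<Character>();
        for (char c : word.toCharArray()) pool.add(c);
        StringBuilder sb = new StringBuilder();
        long k = index++;
        for (int n = pool.size(); n > 0; n--) {
            long f = factorial(n - 1);
            int pos = (int) (k / f); // which of the remaining characters comes next
            k %= f;
            char c = pool.remove(pos);
            sb.append(c);
        }
        return sb.toString();
    }

    private static long factorial(int n) {
        long r = 1;
        for (int i = 2; i <= n; i++) r *= i;
        return r;
    }
}

Usage would simply be new LazyPermutations("abc") and a hasNext()/next() loop, consuming each permutation as it is produced.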

Example of Big O of 2^n

So I can picture what an algorithm with a complexity of n^c looks like: it's just c nested for loops.
for (var i = 0; i < dataset.len; i++) {
    for (var j = 0; j < dataset.len; j++) {
        // do stuff with i and j
    }
}
Log is something that splits the data set in half every time; binary search does this (I'm not entirely sure what the code for this looks like).
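For reference, that halving looks roughly like this in code (a minimal sketch, assuming a sorted int array):

// Each iteration discards half of the remaining range, so at most about
// log2(n) iterations run -- that's where the O(log n) comes from.
static int binarySearch(int[] sorted, int x) {
    int lo = 0, hi = sorted.length - 1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        if (sorted[mid] == x) return mid;
        if (sorted[mid] < x) lo = mid + 1; // keep the upper half
        else hi = mid - 1;                 // keep the lower half
    }
    return -1; // not found
}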
But what is a simple example of an algorithm that is c^n, or more specifically 2^n? Is O(2^n) based on loops through data? Or on how the data is split? Or something else entirely?
Algorithms with running time O(2^N) are often recursive algorithms that solve a problem of size N by recursively solving two smaller problems of size N-1.
This program, for instance, prints out in pseudo-code all the moves necessary to solve the famous "Towers of Hanoi" problem for N disks:
void solve_hanoi(int N, string from_peg, string to_peg, string spare_peg)
{
    if (N < 1) {
        return;
    }
    if (N > 1) {
        solve_hanoi(N - 1, from_peg, spare_peg, to_peg);
    }
    print "move from " + from_peg + " to " + to_peg;
    if (N > 1) {
        solve_hanoi(N - 1, spare_peg, to_peg, from_peg);
    }
}
Let T(N) be the time it takes for N disks.
We have:
T(1) = O(1)
and
T(N) = O(1) + 2*T(N-1) when N>1
If you repeatedly expand the last term, you get:
T(N) = 3*O(1) + 4*T(N-2)
T(N) = 7*O(1) + 8*T(N-3)
...
T(N) = (2^(N-1)-1)*O(1) + (2^(N-1))*T(1)
T(N) = (2^N - 1)*O(1)
T(N) = O(2^N)
To actually figure this out, you just have to know that certain patterns in the recurrence relation lead to exponential results. Generally, T(N) = ... + C*T(N-1) with C > 1 means O(C^N). See:
https://en.wikipedia.org/wiki/Recurrence_relation
Think, for example, about iterating over all possible subsets of a set. This kind of algorithm is used, for instance, for the generalized knapsack problem.
If you find it hard to understand how iterating over subsets translates to O(2^n), imagine a set of n switches, each of them corresponding to one element of the set. Now, each of the switches can be turned on or off. Think of "on" as the element being in the subset. Note how many combinations are possible: 2^n.
If you want to see an example in code, it's usually easier to think about recursion here, but I can't think of any other nice and understandable example right now.
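As a minimal illustration of the switches idea (my own sketch, not from the original answer), each subset corresponds to one bit pattern of a counter:

// Enumerate every subset of an n-element array (n assumed < 31) by treating
// the bits of 'mask' as on/off switches -- there are exactly 2^n masks.
static void printAllSubsets(int[] set) {
    int n = set.length;
    for (int mask = 0; mask < (1 << n); mask++) {
        StringBuilder subset = new StringBuilder("{ ");
        for (int i = 0; i < n; i++) {
            if ((mask & (1 << i)) != 0) {  // switch i is "on": element i is in the subset
                subset.append(set[i]).append(' ');
            }
        }
        System.out.println(subset.append('}'));
    }
}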
Consider that you want to guess the PIN of a smartphone; the PIN is a 4-digit integer. You know that the maximum number of bits needed to hold a 4-digit number is 14. So you would have to guess the value of this PIN (the correct 14-bit combination, let's say) out of 2^14 = 16384 possible values!
The only way is brute force. So, for simplicity, consider a simple 2-bit word that you want to guess right; each bit has 2 possible values, 0 or 1. So all the possibilities are:
00
01
10
11
We know that an n-bit word has 2^n possible combinations. So 2^2 is 4 possible combinations, as we saw earlier.
The same applies to the 14-bit PIN, so guessing the PIN would require you to solve a puzzle with 2^14 possible outcomes, hence an algorithm of time complexity O(2^n).
So those types of problems, where you have to try all possible combinations of the elements of a set S, have this O(2^n) time complexity. But the base of the exponent does not have to be 2. In the example above it's base 2 because each element (each bit) has two possible values, which will not be the case in other problems.
Another good example of O(2^n) algorithms is the recursive knapsack, where you have to try different combinations to maximize the value, and each element in the set has two possible states: we either take it or we don't.
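A minimal sketch of that take-it-or-leave-it recursion (my own example; weight[] and value[] are hypothetical input arrays):

// Recursive 0/1 knapsack: at each item we branch twice (skip it or take it),
// so for n items the call tree has O(2^n) nodes.
static int knapsack(int[] weight, int[] value, int i, int capacity) {
    if (i == weight.length || capacity == 0) return 0;
    int skip = knapsack(weight, value, i + 1, capacity);
    if (weight[i] > capacity) return skip; // item doesn't fit, we can only skip it
    int take = value[i] + knapsack(weight, value, i + 1, capacity - weight[i]);
    return Math.max(skip, take);
}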
The Edit Distance problem has O(3^n) time complexity, since you have 3 decisions to choose from for each of the n characters of the string: deletion, insertion, or replacement.
int Fibonacci(int number)
{
    if (number <= 1) return number;
    return Fibonacci(number - 2) + Fibonacci(number - 1);
}
Growth doubles with each addition to the input data set. The growth curve of an O(2^N) function is exponential: it starts off very shallow, then rises meteorically.
My example of big O(2^n), but much better, is this:
public void solve(int n, String start, String auxiliary, String end) {
    if (n == 1) {
        System.out.println(start + " -> " + end);
    } else {
        solve(n - 1, start, end, auxiliary);
        System.out.println(start + " -> " + end);
        solve(n - 1, auxiliary, start, end);
    }
}
In this method the program prints all the moves needed to solve the "Tower of Hanoi" problem.
Both examples use recursion to solve the problem and have O(2^n) running time.
c^N = all combinations of N elements from a c-sized alphabet.
More specifically 2^N is all numbers representable with N bits.
The common cases are implemented recursively, something like:
vector<int> bits;
int N;

void find_solution(int pos) {
    if (pos == N) {
        check_solution();
        return;
    }
    bits[pos] = 0;
    find_solution(pos + 1);
    bits[pos] = 1;
    find_solution(pos + 1);
}
Here is a code clip that computes the value sum of every combination of values in a goods array (values is a global array variable):
fun boom(idx: Int, pre: Int, include: Boolean) {
    if (idx < 0) return
    boom(idx - 1, pre + if (include) values[idx] else 0, true)
    boom(idx - 1, pre + if (include) values[idx] else 0, false)
    println(pre + if (include) values[idx] else 0)
}
As you can see, it's recursive. We can nest loops to get polynomial complexity, and use recursion to get exponential complexity.
Here are two simple examples in Python with Big O/Landau notation of O(2^N):
# fibonacci
def fib(num):
    if num == 0 or num == 1:
        return num
    else:
        return fib(num - 1) + fib(num - 2)

num = 10
for i in range(0, num):
    print(fib(i))
# tower of Hanoi
def move(disk, from_rod, to_rod, aux_rod):
    if disk >= 1:
        # first move the smaller disks onto the auxiliary rod
        move(disk - 1, from_rod, aux_rod, to_rod)
        print("Move disk", disk, "from rod", from_rod, "to rod", to_rod)
        move(disk - 1, aux_rod, to_rod, from_rod)

n = 3
move(n, 'A', 'B', 'C')
Assuming that a set is a subset of itself, there are 2ⁿ possible subsets of a set with n elements.
Think of it this way: to make a subset, take one element. This element has two possibilities in the subset you're creating: present or absent. The same applies to all the other elements of the set. Multiplying all these possibilities, you arrive at 2ⁿ.

Amortized worst case complexity of binary search

For a binary search of a sorted array of 2^n-1 elements in which the element we are looking for appears, what is the amortized worst-case time complexity?
Found this on my review sheet for my final exam. I can't even figure out why we would want amortized time complexity for binary search, because its worst case is O(log n). According to my notes, the amortized cost calculates the upper bound of an algorithm and then divides it by the number of items, so wouldn't that be as simple as the worst-case time complexity divided by n, meaning O(log n)/(2^n - 1)?
For reference, here is the binary search I've been using:
public static boolean binarySearch(int x, int[] sorted) {
    int s = 0;                 // start
    int e = sorted.length - 1; // end
    while (s <= e) {
        int mid = s + (e - s) / 2;
        if (sorted[mid] == x)
            return true;
        else if (sorted[mid] < x)
            s = mid + 1;
        else
            e = mid - 1;
    }
    return false;
}
I'm honestly not sure what this means - I don't see how amortization interacts with binary search.
Perhaps the question is asking what the average cost of a successful binary search would be. You could imagine binary searching for all n elements of the array and looking at the average cost of such an operation. In that case, there's one element for which the search makes one probe, two for which the search makes two probes, four for which it makes three probes, etc. This averages out to O(log n).
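As a quick sanity check of that average (my own arithmetic): for an array of 2^n - 1 elements, 2^(k-1) of them are found on probe k, so the average number of probes over all successful searches is (1*1 + 2*2 + 3*4 + ... + n*2^(n-1)) / (2^n - 1) = ((n - 1)*2^n + 1) / (2^n - 1), which is approximately n - 1, i.e. Theta(log n).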
Hope this helps!
Amortized cost is the total cost over all possible queries divided by the number of possible queries. You will get slightly different results depending on how you count queries that fail to find the item. (Either don't count them at all, or count one for each gap where a missing item could be.)
So for a search of 2^n - 1 items (just as an example to keep the math simple), there is one item you would find on your first probe, 2 items would be found on the second probe, 4 on the third probe, ... 2^(n-1) on the nth probe. There are 2^n "gaps" for missing items (remembering to count both ends as gaps).
With your algorithm, finding an item on probe k costs 2k-1 comparisons. (That's 2 compares for each of the k-1 probes before the kth, plus one where the test for == returns true.) Searching for an item not in the table costs 2n comparisons.
I'll leave it to you to do the math, but I can't leave the topic without expressing how irked I am when I see binary search coded this way. Consider:
public static boolean binarySearch(int x, int[] sorted) {
    int s = 0;             // start
    int e = sorted.length; // end
    // Loop invariant: if x is at sorted[k] then s <= k < e
    int mid = (s + e) / 2;
    while (mid != s) {
        if (sorted[mid] > x) e = mid; else s = mid;
        mid = (s + e) / 2;
    }
    return (mid < e) && (sorted[mid] == x); // mid == e means the array was empty
}
You don't short-circuit the loop when you hit the item you're looking for, which seems like a defect, but on the other hand you do only one comparison on every item you look at, instead of two comparisons on each item that doesn't match. Since half of all items are found at leaves of the search tree, what seems like a defect turns out to be a major gain. Indeed, the number of elements where short-circuiting the loop is beneficial is only about the square root of the number of elements in the array.
Grind through the arithmetic, computing amortized search cost (counting "cost" as the number of comparisons to sorted[mid]), and you'll see that this version is approximately twice as fast. It also has constant cost (within ±1 comparison), depending only on the number of items in the array and not on where or even if the item is found. Not that that's important.
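If you would rather check that claim empirically than grind through the sums, here is a quick sketch (my own, counting only comparisons against sorted[mid]) that runs both versions over every successful search in an array of 2^n - 1 elements:

public class CompareCounts {
    static long count; // comparisons against sorted[mid]

    // Original version: up to two comparisons per probe, exits early on a match.
    static boolean searchEarlyExit(int x, int[] sorted) {
        int s = 0, e = sorted.length - 1;
        while (s <= e) {
            int mid = s + (e - s) / 2;
            count++;
            if (sorted[mid] == x) return true;
            count++;
            if (sorted[mid] < x) s = mid + 1; else e = mid - 1;
        }
        return false;
    }

    // Invariant version: one comparison per probe plus one equality test at the end.
    static boolean searchInvariant(int x, int[] sorted) {
        int s = 0, e = sorted.length;
        int mid = (s + e) / 2;
        while (mid != s) {
            count++;
            if (sorted[mid] > x) e = mid; else s = mid;
            mid = (s + e) / 2;
        }
        count++;
        return (mid < e) && (sorted[mid] == x);
    }

    public static void main(String[] args) {
        int n = 10;
        int[] sorted = new int[(1 << n) - 1]; // 2^n - 1 elements
        for (int i = 0; i < sorted.length; i++) sorted[i] = i;

        count = 0;
        for (int x : sorted) searchEarlyExit(x, sorted);
        System.out.println("early-exit average: " + (double) count / sorted.length);

        count = 0;
        for (int x : sorted) searchInvariant(x, sorted);
        System.out.println("invariant average:  " + (double) count / sorted.length);
    }
}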
