Optimizing construction of a trie over all substrings - algorithm

I am solving a trie-related problem. There is a set of strings S. I have to create a trie over all substrings of each string in S. I am using the following routine:
String strings[] = { ... }; // array containing all strings
for (int i = 0; i < strings.length; i++) {
    String w = strings[i];
    for (int j = 0; j < w.length(); j++) {
        for (int k = j + 1; k <= w.length(); k++) {
            trie.insert(w.substring(j, k));
        }
    }
}
I am using the trie implementation provided here. However, I am wondering whether there are optimizations that can be made to reduce the complexity of building the trie over all substrings?
Why do I need this? Because I am trying to solve this problem.

If we have N words, each with maximum length L, your algorithm will take O(N*L^3) (assuming that inserting into the trie is linear in the length of the inserted word). However, the size of the resulting trie (number of nodes) is at most O(N*L^2), so it seems you are wasting time and could do better.
And indeed you can, but you have to pull a few tricks from your sleeve. Also, you will no longer need the trie.
1. .substring() in constant time
In older versions of Java (before 7u6), each String had a backing char[] array as well as a starting position and length. This allowed the .substring() method to run in constant time: since String is an immutable class, a new String object was created with the same backing char[] array, only with a different start position and length.
You will need to extend this a bit, to support appending at the end of the string by increasing the length. Always create a new string object, but leave the backing array the same.
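A minimal sketch of such an extended string view (the class and method names are mine, purely illustrative): it shares the backing char[] and grows by one character in O(1).

final class SubstringView {
    final char[] data;    // shared backing array, never copied
    final int start;      // start offset within data
    final int length;     // number of characters in this view

    SubstringView(char[] data, int start, int length) {
        this.data = data;
        this.start = start;
        this.length = length;
    }

    SubstringView extendByOne() {          // O(1): same array, one character longer
        return new SubstringView(data, start, length + 1);
    }

    char lastChar() {
        return data[start + length - 1];
    }
}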
2. Recompute the hash in constant time after appending a single character
Again, let me use Java's hashCode() function for String:
int hash = 0;
for (int i = 0; i < data.length; i++) { // data is the backing array
    hash = 31 * hash + data[i];
}
Now, how will the hash change after appending a single character at the end of the word? Easy: with the hash above, the new hash is simply 31 * oldHash + c, where c is the appended character's value. (If you instead define the hash with the lowest power of 31 at the front, you would add c multiplied by 31^length; in that case keep the powers of 31 in a precomputed table.) Other primes can be used as well.
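A minimal sketch of that update (the helper name is my own choosing); it produces the same value String.hashCode() would give for the extended string:

static int appendToHash(int oldHash, char c) {
    return 31 * oldHash + c;   // O(1) per appended character
}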
3. Store all substrings in a single HashMap
Using tricks 1 and 2, you can generate all substrings in time O(N*L^2), which is the total number of substrings. Just always start with a string of length one and add one character at a time. Put all your substrings into a single HashMap to eliminate duplicates.
(You can skip 2 and 3 and discard duplicates during or after sorting; perhaps it will be even faster.)
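If you do use tricks 2 and 3, the generation step might look like this sketch (using the appendToHash helper above). Note it deduplicates by hash value only, so a complete solution would still have to deal with the rare hash collisions:

java.util.HashSet<Integer> seen = new java.util.HashSet<>(); // hashes of distinct substrings
for (String w : strings) {
    for (int j = 0; j < w.length(); j++) {
        int h = 0;
        for (int k = j; k < w.length(); k++) {
            h = appendToHash(h, w.charAt(k)); // hash of w.substring(j, k + 1) in O(1)
            seen.add(h);
        }
    }
}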
4. Sort your substrings and you are good to go.
Well, when I got to point 4, I realized my plan wouldn't work, because sorting requires comparing strings, and that can take O(L) time. I came up with several attempts to solve it, among them bucket sorting, but none would be faster than the original O(N*L^3).
I will just leave this answer here in case it inspires someone.
In case you don't know the Aho-Corasick algorithm, take a look at it; it could be of some use for your problem.

What you need may be a suffix automaton. It can be built in O(n) time and recognizes all substrings.
A suffix array can also solve this problem.
These two structures can solve most string problems, though they are quite hard to learn. Once you learn them, you will be able to solve this.

You may consider the following optimization:
Maintain a set of processed substrings. While inserting a substring, check whether the processed set already contains that particular substring; if it does, skip inserting it into the trie.
However, the worst-case complexity for inserting all substrings into the trie will still be of the order of n^2, where n is the total length of the strings. From the problem page, this works out to be of the order of 10^8 insertion operations in the trie. Therefore, even if each insertion takes 10 operations on average, you will have 10^9 operations in total, which sets you up to exceed the time limit.
The problem page refers to the LCP array as a related topic for the problem, so you should consider a change in approach.

First, notice that it is enough to add only suffixes to the trie, and nodes for every substring will be added along the way.
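A minimal sketch of that observation (assuming the same trie.insert(String) interface as in the question): every substring is a prefix of some suffix, so inserting only the suffixes creates nodes for all substrings.

for (String w : strings) {
    for (int j = 0; j < w.length(); j++) {
        trie.insert(w.substring(j)); // insert the suffix starting at index j
    }
}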
Second, you have to compress the trie, otherwise it will not fit into the memory limit imposed by HackerRank. This will also make your solution faster.
I just submitted my solution implementing these suggestions, and it was accepted (the max execution time was 0.08 seconds).
But you can make your solution even faster by implementing a linear-time algorithm to construct the suffix tree. You can read about linear-time suffix tree construction algorithms here and here. There is also an explanation of Ukkonen's algorithm on Stack Overflow here.

Related

Is This Big O Evaluation Correct?

I'm having trouble understanding something. I'm not even sure if it's correct.
In Cracking the Code Interview, there is a section that asks you to determine the Big O for a number of functions. For the most part, they're predictable.
However, one of them throws me for a loop.
Apparently, this evaluates to O(ab):
void printUnorderedPairs(int[] arrayA, int[] arrayB) {
    for (int i = 0; i < arrayA.length; i++) {
        for (int j = 0; j < arrayB.length; j++) {
            for (int k = 0; k < 100000; k++) {
                System.out.println(arrayA[i] + "," + arrayB[j]);
            }
        }
    }
}
With the rationale that:
"100,000 units of work is still constant, so the runtime is O(ab)."
I'm trying to see why this could make sense, but I just can't yet; naturally, I expected O(abc).
Yes, 100,000 is a constant and arrayA and arrayB are arrays, but we're taking the lengths of the arrays. At the time these for loops run, won't array[x].length also be a constant (assuming the sizes of the arrays don't change during execution)?
So, is the book right? If so, I would really appreciate insight and intuition so I don't fall into the same trap in the future.
Thanks!
Time complexity is generally expressed as the number of required elementary operations on an input of size n, where elementary operations are assumed to take a constant amount of time on a given computer and change only by a constant factor when run on a different computer.
O(ab) is the complexity in the above case, as arrayA and arrayB are of variable length and fully dependent on the calling function, while 100000 is a constant that won't change due to any external factor.
Complexity is a measure of the unknown.
The arrays A and B have unspecified lengths, and all you can do is give an indication of the complexity as a function of these two lengths. Nothing else is variable in the given code.
What the authors mean by constant is a value that is fixed, regardless of the input size, unlike the lengths of the input arrays, which might change. For instance, printUnorderedPairs might be called with different arrays as parameters, and those arrays might have different sizes.
The point of Big-O is to examine how the calculation grows as the inputs grow. It's clear that it would double if A doubled, and likewise if B doubled. So linear in those two.
What might be confusing you is that you could easily imagine the 100k being replaced by c, yet another variable input. But here it isn't a variable: the 100k is a constant.
A similar thing in Big-O problems is where you step through an array a fixed number of times. That doesn't change the Big-O. For example if you step through an array to find the max, that's O(n). Stepping through it twice to find the min and the max is... also O(n). And in fact it's the same as stepping through it once to find the min and max in a single sweep.
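As a small illustration of that last point (a hypothetical helper, not from the book): one pass over the array finds both the min and the max, and since the per-element work is constant, the whole sweep is O(n).

static int[] minAndMax(int[] a) {
    int min = a[0], max = a[0];
    for (int x : a) {              // a single O(n) sweep
        if (x < min) min = x;
        if (x > max) max = x;
    }
    return new int[] { min, max }; // {min, max}
}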

Find First Unique Element

I had this question in an interview and I couldn't answer it.
You have to find the first unique element (integer) in the array.
For example:
3,2,1,4,4,5,6,6,7,3,2,3
Then the unique elements are 1, 5, 7, and the first unique element is 1.
The solution requires:
O(n) Time Complexity.
O(1) Space Complexity.
I tried saying:
Using hash maps, bit vectors... but none of them has O(1) space complexity.
Can anyone tell me solution with space O(1)?
Here's a non-rigorous proof that it isn't possible:
It is well known that duplicate detection cannot be done better than O(n * log n) when you use only O(1) space. Suppose the current problem were solvable in O(n) time and O(1) memory. If the index k of the first non-repeating number is anything other than 0, we know that the element at index k-1 is repeated, and hence with one more sweep through the array we can find its duplicate, making duplicate detection an O(n) exercise.
Again, this is not rigorous, and we can get into a worst-case analysis where k is always 0. But it helps you think and convince the interviewer that it isn't likely to be possible.
http://en.wikipedia.org/wiki/Element_distinctness_problem says:
Elements that occur more than n/k times in a multiset of size n may be found in time O(n log k). Here k = n since we want elements that appear more than once.
I think that this is impossible. This isn't a proof, but evidence for a conjecture. My reasoning is as follows...
First, you said that there is no bound on value of the elements (that they can be negative, 0, or positive). Second, there is only O(1) space, so we can't store more than a fixed number of values. Hence, this implies that we would have to solve this using only comparisons. Moreover, we can't sort or otherwise swap values in the array because we would lose the original ordering of unique values (and we can't store the original ordering).
Consider an array where all the integers are unique:
1, 2, 3, 4, 5, 6, 7, 8, 9, 10
In order to return the correct output 1 on this array, without reordering the array, we would need to compare each element to all the other elements, to ensure that it is unique, and do this in reverse order, so we can check the first unique element last. This would require O(n^2) comparisons with O(1) space.
I'll delete this answer if anyone finds a solution, and I welcome any pointers on making this into a more rigorous proof.
Note: This can't work in the general case. See the reasoning below.
Original idea
Perhaps there is a solution in O(n) time and O(1) extra space.
It is possible to build a heap in O(n) time. See Building a Heap.
So you build the heap backwards, starting at the last element in the array and making that last position the root. While building the heap, keep track of the most recent item that was not a duplicate.
This assumes that when inserting an item into the heap, you will encounter any identical item that already exists in the heap. I don't know if I can prove that...
Assuming the above is true, then when you're done building the heap, you know which item was the first non-duplicated item.
Why it won't work
The algorithm to build a heap in place starts at the midpoint of the array and assumes that all of the nodes beyond that point are leaf nodes. It then works backward (towards item 0), sifting items into the heap. The algorithm doesn't examine the last n/2 items in any particular order, and the order changes as items are sifted into the heap.
As a result, the best we could do (and even then I'm not sure we could do it reliably) is find the first non-duplicated item only if it occurs in the first half of the array.
The OP's original question doesn't mention any limit on the numbers (although it was later added that they can be negative/positive/zero). Here I assume one more condition:
The numbers in the array are all non-negative and smaller than the array length.
Then an O(n) time, O(1) space solution is possible; this looks like an interview question, and the test case the OP gives in the question complies with the above assumption.
Solution:
static int firstUnique(int[] nums) {
    // Pass 1: treat the array itself as buckets; move each value v toward index v,
    // marking a bucket with -1 when a duplicate of its value is detected.
    for (int i = 0; i < nums.length; i++) {
        if (nums[i] != i) {
            if (nums[i] == -1) continue;      // slot already marked as duplicate
            if (nums[nums[i]] == nums[i]) {
                nums[nums[i]] = -1;           // value seen again: mark its bucket
            } else {
                swap(nums, nums[i], i);       // move value to its bucket, retry index i
                i--;
            }
        }
    }
    // Pass 2: report the first index whose bucket still holds its own value.
    for (int i = 0; i < nums.length; i++) {
        if (nums[i] == i) {
            return i;
        }
    }
    return -1; // no unique element found
}

static void swap(int[] nums, int a, int b) {
    int t = nums[a];
    nums[a] = nums[b];
    nums[b] = t;
}
The algorithm treats the original array as the buckets of a bucket sort. Each number is placed into its own bucket; if a number is found more than once, its bucket is marked with -1. A second loop then finds the first index with nums[i] == i.

finding longest similar subsequence in a string

Suppose I want to find the longest subsequence such that the first half of the subsequence is the same as its second half.
For example: in the string abkcjadfbck, the result is abcabc, as abc is repeated in the first and the second half of it. In the string aaa, the result is aa.
This task may be treated as a combination of two well-known problems.
If you know in advance some point between the two halves of the subsequence, you just need to find the best match between two strings. This is the pairwise alignment problem. Various dynamic programming methods solve it in O(N²) time.
To find a point where the string should be split optimally, you can use Golden section search or Fibonacci search. These algorithms have O(log N) time complexity.
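For the first of these two steps, here is a minimal sketch under an extra assumption of mine: the split point p is fixed, and the quality of a split is measured by the longest common subsequence of the two halves (a simple special case of pairwise alignment). The answer for that split is then twice the returned length.

// Longest common subsequence of s[0..p) and s[p..n), standard O(N^2) DP.
static int lcsOfHalves(String s, int p) {
    String left = s.substring(0, p), right = s.substring(p);
    int[][] dp = new int[left.length() + 1][right.length() + 1];
    for (int i = 1; i <= left.length(); i++) {
        for (int j = 1; j <= right.length(); j++) {
            dp[i][j] = (left.charAt(i - 1) == right.charAt(j - 1))
                    ? dp[i - 1][j - 1] + 1
                    : Math.max(dp[i - 1][j], dp[i][j - 1]);
        }
    }
    return dp[left.length()][right.length()]; // best "half" length for this split
}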
In a first pass over inputString, we can count how often each character occurs and remove those that occur only once.
input: inputString
data structures:
    Set<Triple<char[], Integer, Integer>> potentialSecondWords;
    Map<Char, List<Integer>> lettersList;
for each character c with increasing index h in inputString do
    if (!lettersList.get(c).isEmpty()) {
        for ((secondWord, currentIndex, maxIndex) in potentialSecondWords) {
            if (there exists a j in lettersList.get(c) between currentIndex and maxIndex) {
                update (secondWord, currentIndex, maxIndex) by adding c to secondWord and replacing currentIndex with j;
            }
        }
        if potentialSecondWords contains a triple whose char[] is equal to c, remove it;
        put a new Triple with value (c, lettersList.get(c).get(0), h-1) into potentialSecondWords;
    }
    lettersList.get(c).add(h);
}
find the largest secondWord in potentialSecondWords and output secondWord twice;
So this algorithm passes once over the string, creating for each index, where it makes sense, a Triple representing the potential second word starting at the current index, and updating all potential second words.
With a suitable list implementation and n being the size of inputString, this algorithm has worst case runtime O(n²), e.g. for a^n.

Computing the mode (most frequent element) of a set in linear time?

In the book "The Algorithm Design Manual" by Skiena, computing the mode (most frequent element) of a set, is said to have a Ω(n log n) lower bound (this puzzles me), but also (correctly i guess) that no faster worst-case algorithm exists for computing the mode. I'm only puzzled by the lower bound being Ω(n log n).
See the page of the book on Google Books
But surely this could in some cases be computed in linear time (best case), e.g. by Java code like below (which finds the most frequent character in a string), the "trick" being to count occurrences using a hashtable. This seems obvious.
So, what am I missing in my understanding of the problem?
EDIT: (Mystery solved) As StriplingWarrior points out, the lower bound holds if only comparisons are used, i.e. no indexing of memory, see also: http://en.wikipedia.org/wiki/Element_distinctness_problem
// Linear time
char computeMode(String input) {
    // initialize currentMode to first char
    char[] chars = input.toCharArray();
    char currentMode = chars[0];
    int currentModeCount = 0;
    HashMap<Character, Integer> counts = new HashMap<Character, Integer>();
    for (char character : chars) {
        int count = putget(counts, character); // occurrences so far
        // test whether character should be the new currentMode
        if (count > currentModeCount) {
            currentMode = character;
            currentModeCount = count; // also save the count
        }
    }
    return currentMode;
}

// Constant time
int putget(HashMap<Character, Integer> map, char character) {
    if (!map.containsKey(character)) {
        // if character not seen before, initialize to zero
        map.put(character, 0);
    }
    // increment
    int newValue = map.get(character) + 1;
    map.put(character, newValue);
    return newValue;
}
The author seems to be basing his logic on the assumption that comparison is the only operation available to you. Using a Hash-based data structure sort of gets around this by reducing the likelihood of needing to do comparisons in most cases to the point where you can basically do this in constant time.
However, if the numbers were hand-picked to always produce hash collisions, you would end up effectively turning your hash set into a list, which would make your algorithm into O(n²). As the author points out, simply sorting the values into a list first provides the best guaranteed algorithm, even though in most cases a hash set would be preferable.
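For comparison, a minimal sketch of that guaranteed approach (my own illustration, not the book's code): sort a copy of the values, then scan for the longest run of equal elements, which is O(n log n) in the worst case regardless of the input.

static int modeBySorting(int[] values) {
    int[] a = values.clone();
    java.util.Arrays.sort(a);                   // O(n log n) worst case
    int mode = a[0], bestRun = 1, run = 1;
    for (int i = 1; i < a.length; i++) {
        run = (a[i] == a[i - 1]) ? run + 1 : 1; // length of the current run of equal values
        if (run > bestRun) {
            bestRun = run;
            mode = a[i];
        }
    }
    return mode;
}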
So, what am I missing in my understanding of the problem?
In many particular cases, an array or hash table suffices. In "the general case" it does not, because hash table access is not always constant time.
In order to guarantee constant time access, you must be able to guarantee that the number of keys that can possibly end up in each bin is bounded by some constant. For characters this is fairly easy, but if the set elements were, say, doubles or strings, it would not be (except in the purely academic sense that there are, e.g., a finite number of double values).
Hash table lookups are amortized constant time, i.e., in general, the overall cost of looking up n random keys is O(n). In the worst case, they can be linear. Therefore, while in general they could reduce the order of mode calculation to O(n), in the worst case it would increase the order of mode calculation to O(n^2).

How to find sum of elements from given index interval (i, j) in constant time?

Given an array, how can we find the sum of the elements in the index interval (i, j) in constant time? You are allowed to use extra space.
Example:
A: 3 2 4 7 1 -2 8 0 -4 2 1 5 6 -1
length = 14
int getsum(int* arr, int i, int j, int len);
// suppose int array "arr" is initialized here
int sum = getsum(arr, 2, 5, 14);
sum should be 10 in constant time.
If you can spend O(n) time to "prepare" auxiliary information, based on which you would be able to calculate sums in O(1), you could easily do it.
Preparation (O(n)):
aux[0] = 0;
foreach i in (1..LENGTH) {
    aux[i] = aux[i-1] + arr[i];
}
Query (O(1)), with arr numbered from 1 to LENGTH:
sum(i,j) = aux[j] - aux[i-1];
I think that was the intent, because otherwise it's impossible: to calculate sum(0, length-1) you must have scanned the whole array, and that takes at least linear time.
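A minimal Java sketch of the same idea, shifted to the 0-based indexing used in the question (so a query over indices 2..5 of the example returns 10); the helper names are mine, not part of the question:

static int[] buildAux(int[] arr) {
    int[] aux = new int[arr.length + 1];   // aux[k] = sum of arr[0..k)
    for (int k = 0; k < arr.length; k++) {
        aux[k + 1] = aux[k] + arr[k];      // O(n) preparation
    }
    return aux;
}

static int getsum(int[] aux, int i, int j) {
    return aux[j + 1] - aux[i];            // O(1) query: sum of arr[i..j] inclusive
}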
It cannot be done in constant time unless you store the information.
You would have to do something like specially modifying the array to store, for each index, the sum of all values between the start of the array and that index, then use subtraction over the range to get the difference in sums.
However, nothing in your code sample seems to allow this. The array is created by the user (and can change at any time) and you have no control over it.
Any algorithm that needs to scan a group of elements in a sequential unsorted list will be O(n).
The previous answers are absolutely fine for the question asked. I am just adding a point, in case the question is changed a bit, like:
Find the sum of the interval if the array gets changed dynamically.
If array elements get changed, then we have to recompute whatever sums we have stored in the auxiliary array, as mentioned in Pavel Shved's approach.
Recomputing is an O(n) operation, so we can instead bring the cost down to O(log n) per update or query (O(n log n) over n operations) by making use of a segment tree.
http://www.geeksforgeeks.org/segment-tree-set-1-sum-of-given-range/
There are three well-known approaches for range-based queries over an interval [l, r]:
1. Segment tree: total query time O(N log N)
2. Fenwick tree: total query time O(N log N)
3. Mo's algorithm (square root decomposition)
The first two can handle modifications of the list/array given to you. The third, Mo's algorithm, is an offline algorithm, meaning all the queries need to be given to you in advance; modifications of the list/array are not allowed. For the implementation, runtime and further reading of this algorithm you can check out this Medium blog, which explains it with code. Very few people actually know about this method.
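To illustrate the first two options, here is a minimal sketch of a Fenwick (binary indexed) tree supporting point updates and range-sum queries in O(log n) each; it is written from scratch for illustration, not taken from any library:

final class Fenwick {
    private final long[] tree;             // 1-based internal array

    Fenwick(int n) {
        tree = new long[n + 1];
    }

    void update(int i, long delta) {       // add delta at position i (1-based), O(log n)
        for (; i < tree.length; i += i & -i) tree[i] += delta;
    }

    long prefixSum(int i) {                // sum of positions 1..i, O(log n)
        long s = 0;
        for (; i > 0; i -= i & -i) s += tree[i];
        return s;
    }

    long rangeSum(int l, int r) {          // sum of positions l..r inclusive
        return prefixSum(r) - prefixSum(l - 1);
    }
}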
This question can be solved in O(n^2) time with O(n) space, or in O(n) time with O(n) space.
The best solution in this case is O(n) time with O(n) space.
Suppose a[] = {1, 3, 5, 2, 6, 4, 9} is given.
Create an array sum[] in which sum[i] holds the sum of the elements from index 0 up to index i. For a[], the sum array will be sum[] = {1, 4, 9, 11, 17, 21, 30}, i.e. {1, 3+1, 3+1+5, ...}. Building it takes O(n) time and O(n) space.
When we are given the indices, the answer is fetched directly from the sum array: sum(i,j) = sum[j] - sum[i-1], which takes O(1) time and O(1) extra space.
int sum[] = new int[l];
sum[0] = a[0];
System.out.print(sum[0] + " ");
for (int i = 1; i < l; i++) {
    sum[i] = sum[i-1] + a[i];
    System.out.print(sum[i] + " ");
}
/* this prints 1 4 9 11 17 21 30 and takes O(n) time and O(n) space */
sum(i,j) = sum[j] - sum[i-1]; /* this gives the sum of the elements from index i to j and takes O(1) time and O(1) extra space */
So this program takes O(n) time and O(n) space overall.
