Computing the mode (most frequent element) of a set in linear time? - algorithm

In the book "The Algorithm Design Manual" by Skiena, computing the mode (most frequent element) of a set is said to have an Ω(n log n) lower bound (this puzzles me), and it is also stated (correctly, I guess) that no faster worst-case algorithm exists for computing the mode. I'm only puzzled by the lower bound being Ω(n log n).
See the page of the book on Google Books
But surely this could in some cases be computed in linear time (best case), e.g. by Java code like the below (which finds the most frequent character in a string), the "trick" being to count occurrences using a hash table. This seems obvious.
So, what am I missing in my understanding of the problem?
EDIT: (Mystery solved) As StriplingWarrior points out, the lower bound holds if only comparisons are used, i.e. no indexing of memory, see also: http://en.wikipedia.org/wiki/Element_distinctness_problem
// Linear time
char computeMode(String input) {
    // initialize currentMode to first char
    char[] chars = input.toCharArray();
    char currentMode = chars[0];
    int currentModeCount = 0;
    HashMap<Character, Integer> counts = new HashMap<Character, Integer>();
    for (char character : chars) {
        int count = putget(counts, character); // occurrences so far
        // test whether character should be the new currentMode
        if (count > currentModeCount) {
            currentMode = character;
            currentModeCount = count; // also save the count
        }
    }
    return currentMode;
}

// Constant time
int putget(HashMap<Character, Integer> map, char character) {
    if (!map.containsKey(character)) {
        // if character not seen before, initialize to zero
        map.put(character, 0);
    }
    // increment
    int newValue = map.get(character) + 1;
    map.put(character, newValue);
    return newValue;
}

The author seems to be basing his logic on the assumption that comparisons are the only operation available to you. Using a hash-based data structure sort of gets around this by reducing the likelihood of needing to do comparisons in most cases, to the point where each lookup is basically constant time.
However, if the numbers were hand-picked to always produce hash collisions, you would end up effectively turning your hash set into a list, which would make your algorithm into O(n²). As the author points out, simply sorting the values into a list first provides the best guaranteed algorithm, even though in most cases a hash set would be preferable.
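For completeness, here is a minimal sketch (my own, not the book's code) of that sort-first approach: after sorting, equal values are adjacent, so a single linear scan finds the longest run, for a guaranteed O(n log n) overall.

import java.util.Arrays;

// Sketch: sort first, then find the longest run of equal characters. O(n log n),
// with no reliance on hashing.
static char computeModeBySorting(String input) {
    char[] chars = input.toCharArray();
    Arrays.sort(chars); // O(n log n)
    char mode = chars[0];
    int bestRun = 0;
    int run = 0;
    for (int i = 0; i < chars.length; i++) {
        run = (i > 0 && chars[i] == chars[i - 1]) ? run + 1 : 1;
        if (run > bestRun) {
            bestRun = run;
            mode = chars[i];
        }
    }
    return mode;
}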

So, what am I missing in my understanding of the problem?
In many particular cases, an array or hash table suffices. In "the general case" it does not, because hash table access is not always constant time.
In order to guarantee constant time access, you must be able to guarantee that the number of keys that can possibly end up in each bin is bounded by some constant. For characters this is fairly easy, but if the set elements were, say, doubles or strings, it would not be (except in the purely academic sense that there are, e.g., a finite number of double values).
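For the character case, here is a minimal sketch (not part of the original answer) of why it is easy: char values are bounded, so a plain count array indexed by character gives worst-case O(n), with no hashing at all.

// Sketch: char values are bounded by Character.MAX_VALUE, so a fixed-size int
// array gives genuinely worst-case O(n) counting, no hash table involved.
static char computeModeByCounting(String input) {
    int[] counts = new int[Character.MAX_VALUE + 1];
    char mode = input.charAt(0);
    int best = 0;
    for (char c : input.toCharArray()) {
        counts[c]++;
        if (counts[c] > best) {
            best = counts[c];
            mode = c;
        }
    }
    return mode;
}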

Hash table lookups run in expected constant time, i.e., in general, the overall cost of looking up n random keys is O(n). In the worst case a single lookup can be linear in the number of stored keys. Therefore, while in general hash tables reduce the order of mode calculation to O(n), in the worst case they would increase it to O(n^2).

Related

Search a Sorted Array for First Occurrence of K

I'm trying to solve question 11.1 in Elements of Programming Interviews (EPI) in Java: Search a Sorted Array for First Occurrence of K.
The problem description from the book:
Write a method that takes a sorted array and a key and returns the index of the first occurrence of that key in the array.
The solution they provide in the book is a modified binary search algorithm that runs in O(log n) time. I wrote my own algorithm, also based on a modified binary search, with a slight difference: it uses recursion. The problem is that I don't know how to determine the time complexity of my algorithm. My best guess is that it will run in O(log n) time, because each time the function is called it reduces the size of the candidate values by half. I've tested my algorithm against the 314 EPI test cases that are provided by the EPI Judge, so I know it works; I just don't know the time complexity. Here is the code:
public static int searchFirstOfKUtility(List<Integer> A, int k, int Lower, int Upper, Integer Index) {
    while (Lower <= Upper) {
        int M = Lower + (Upper - Lower) / 2;
        if (A.get(M) < k) {
            Lower = M + 1;
        } else if (A.get(M) == k) {
            Index = M;
            if (Lower != Upper) {
                Index = searchFirstOfKUtility(A, k, Lower, M - 1, Index);
            }
            return Index;
        } else {
            Upper = M - 1;
        }
    }
    return Index;
}
Here is the code that the tests cases call to exercise my function:
public static int searchFirstOfK(List<Integer> A, int k) {
    Integer foundKey = -1;
    return searchFirstOfKUtility(A, k, 0, A.size() - 1, foundKey);
}
So, can anyone tell me what the time complexity of my algorithm would be?
Assuming that passing arguments is O(1) instead of O(n), performance is O(log(n)).
The usual theoretical approach for analyzing recursion is to apply the Master Theorem. It says that if the performance of a recursive algorithm follows a relation of the form
T(n) = a T(n/b) + f(n)
then there are 3 cases. In plain English they correspond to:
Performance is dominated by all the calls at the bottom of the recursion, so is proportional to how many of those there are.
Performance is equal between each level of recursion, and so is proportional to how many levels of recursion there are, times the cost of any layer of recursion.
Performance is dominated by the work done in the very first call, and so is proportional to f(n).
You are in case 2. Each recursive call costs the same, so performance is dominated by the O(log(n)) levels of recursion times the cost of each level. Assuming that passing a fixed number of arguments is O(1), that is indeed O(log(n)).
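Concretely, each halving step here does O(1) work (a couple of comparisons and one List.get), so the relation is T(n) = T(n/2) + O(1), i.e. a = 1, b = 2, f(n) = O(1). Since f(n) matches n^(log_b a) = n^0 = 1, this is case 2, and summing O(1) work over the O(log n) levels gives O(log n).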
Note that this assumption is true for Java because you don't make a complete copy of the array before passing it. But it is important to be aware that it is not true in all languages. For example I recently did a bunch of work in PL/pgSQL, and there arrays are passed by value. Meaning that your algorithm would have been O(n log(n)).
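If the argument-passing caveat matters, an iterative variant (a sketch of my own, not the book's solution) gives the same O(log n) bound with no recursion at all:

// Sketch: iterative first-occurrence binary search, O(log n), no recursion.
public static int searchFirstOfKIterative(List<Integer> A, int k) {
    int lower = 0;
    int upper = A.size() - 1;
    int result = -1;
    while (lower <= upper) {
        int mid = lower + (upper - lower) / 2;
        if (A.get(mid) < k) {
            lower = mid + 1;
        } else if (A.get(mid) == k) {
            result = mid;    // record the match...
            upper = mid - 1; // ...but keep looking to the left for an earlier one
        } else {
            upper = mid - 1;
        }
    }
    return result;
}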

Time Complexity of this Word Break DFS + Memoization solution

When dealing with the Word Break problem, I found this solution, which is really concise, but I am not sure about its time complexity. Can anyone help?
My understanding is that the worst case is O(n*k), where n is the size of wordDict and k is the length of the string.
class Solution {
    public boolean wordBreak(String s, List<String> wordDict) {
        return wordBreak(s, wordDict, new HashMap<String, Boolean>());
    }

    private boolean wordBreak(String s, List<String> wordDict, Map<String, Boolean> memo) {
        if (s == null) return false;
        if (s.isEmpty()) return true;
        if (memo.containsKey(s)) return memo.get(s);
        for (String dict : wordDict) { // number of words O(n)
            // startsWith is bounded by the length of the dict word, avg is O(m), can be ignored
            // substring is bounded by the length of the dict word, avg is O(k), k is the length of s
            // wordBreak will be executed k/m times, k is the length of s, worst case k times... when a single letter is in the dict
            if (s.startsWith(dict) && wordBreak(s.substring(dict.length()), wordDict, memo)) {
                memo.put(s, true);
                return true;
            }
        }
        memo.put(s, false);
        return false;
    }
}
It's worse than O(nk) for several reasons (note that this answer uses n for the length of the string and k for the number of dictionary words, the reverse of the question's convention):
You ignore "m", but m is Omega(log k). (Because k < |A|^(m+1), where |A| is the size of your alphabet: a dictionary of k distinct words must contain words of length roughly log k or more.)
s.substring is probably O(n). Your code looks like Java, and it's O(n) in Java (since Java 7u6, substring copies the characters).
Even if s.substring were O(1), your Map requires the string to be hashed, so your map operations are O(n) (note carefully: here n is the size of the string rather than the size of the hash table, as it would normally be).
Probably this means you have complexity O(n^2 * k * log k).
You can fix 3 easily -- you can use s.length rather than s as the key to your hashtable.
Problem 2 is easy but slightly annoying to fix -- rather than slicing your string, you can use a variable that indexes into the string. You may have to re-write startsWith yourself to use this index (in Java, String.startsWith(prefix, offset) already supports an offset), or use a trie -- see below. If your programming language has an O(1) slice operation (for example, string_view in C++) then you could use that instead.
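With those two fixes applied, a minimal sketch (my own code, not the answer's; the name canBreak and the Boolean[] memo keyed by index are illustrative) might look like this:

class Solution {
    public boolean wordBreak(String s, List<String> wordDict) {
        // memo[i] caches whether the suffix s[i..] can be segmented; null = unknown
        return canBreak(s, 0, wordDict, new Boolean[s.length() + 1]);
    }

    private boolean canBreak(String s, int start, List<String> wordDict, Boolean[] memo) {
        if (start == s.length()) return true;        // empty suffix is always breakable
        if (memo[start] != null) return memo[start]; // O(1) lookup, no string hashing
        for (String word : wordDict) {
            // compare in place instead of allocating a substring
            if (s.startsWith(word, start) && canBreak(s, start + word.length(), wordDict, memo)) {
                return memo[start] = true;
            }
        }
        return memo[start] = false;
    }
}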
Problem 1 is only theoretical, since for real word lists, m is really small compared to either the length of the dictionary or the potential length of input strings.
Note that using a trie for the dictionary rather than a word list is likely to result in a huge time improvement, with realistic examples being linear excluding dictionary construction (although worst-case examples where the dictionary and input strings are chosen maliciously will be O(nk)).
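And a sketch (again mine, not the answer's code) of the trie-based variant: build a trie of the dictionary once, then from each start position walk the trie, so every dictionary word beginning there is matched in O(characters examined) rather than with n separate startsWith calls.

class TrieWordBreak {
    private static class Node {
        Map<Character, Node> children = new HashMap<>();
        boolean isWord;
    }

    private final Node root = new Node();

    TrieWordBreak(List<String> wordDict) {
        // build the dictionary trie once
        for (String word : wordDict) {
            Node cur = root;
            for (char c : word.toCharArray()) {
                cur = cur.children.computeIfAbsent(c, ch -> new Node());
            }
            cur.isWord = true;
        }
    }

    public boolean wordBreak(String s) {
        return canBreak(s, 0, new Boolean[s.length() + 1]);
    }

    private boolean canBreak(String s, int start, Boolean[] memo) {
        if (start == s.length()) return true;
        if (memo[start] != null) return memo[start];
        Node cur = root;
        for (int i = start; i < s.length(); i++) {
            cur = cur.children.get(s.charAt(i));
            if (cur == null) break; // no dictionary word continues with this character
            if (cur.isWord && canBreak(s, i + 1, memo)) {
                return memo[start] = true;
            }
        }
        return memo[start] = false;
    }
}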

Does a HashMap with string keys really have a lower time complexity than a Trie?

Say I want to store a dictionary of strings and I want to know if some string exists or not. I can use a Trie or a HashMap. The HashMap has a time complexity of O(1) with a high probability while the Trie in that case would have a time complexity of O(k) where k is the length of the string.
Now my question is: Doesn't calculating the hash value of the string have a time complexity of O(k) thus making the complexity of the HashMap the same? If not, why?
The way I see it is that a Trie here would have lower time complexity than a HashMap for looking up a string since the HashMap -in addition to calculating the hash value- might hit collisions. Am I missing something?
Update:
Which data structure would you use to optimize for speed when constructing a dictionary?
Apart from the implementation complexity of a trie, certain optimizations are done in the implementation of the hashCode method that determines the bucket in a hash table. For java.lang.String, an immutable class, here is what JDK 8 does:
public int hashCode() {
    int h = hash;
    if (h == 0 && value.length > 0) {
        char val[] = value;
        for (int i = 0; i < value.length; i++) {
            h = 31 * h + val[i];
        }
        hash = h;
    }
    return h;
}
Thus, the hash code is cached (and the caching is safe because String is immutable). Once calculated, the hash code of a string need not be recalculated, so repeated lookups with the same String object avoid spending the O(k) hashing time again in a hash table (or hash set, hash map).
When implementing dictionaries, I think tries shine where you are more interested in possible partial matches (prefixes) rather than exact matches. Generally speaking, hash-based solutions work best for exact matches.
The time complexity of performing operations on a hash table is typically measured in the number of hashes and compares that have to be performed. In expectation the cost, when measured this way, is O(1), because in expectation only a constant number of hashes and compares are needed.
To determine the cost of using a hash table for strings, you do indeed need to factor in the cost of these operations, which will be O(k) each for a string of length k. Therefore, the cost of a hash table operation on a string is O(1) · O(k) = O(k), matching the trie cost, though only on expectation and with a different constant factor.
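For comparison, here is a minimal trie sketch (illustrative only, not a production implementation): both insert and contains walk the k characters of the key exactly once, i.e. O(k) per operation, the same asymptotic cost as computing the key's hash once for a HashMap lookup.

import java.util.HashMap;
import java.util.Map;

class Trie {
    private static class Node {
        Map<Character, Node> children = new HashMap<>();
        boolean isWord;
    }

    private final Node root = new Node();

    void insert(String word) { // O(k) for a word of length k
        Node cur = root;
        for (char c : word.toCharArray()) {
            cur = cur.children.computeIfAbsent(c, ch -> new Node());
        }
        cur.isWord = true;
    }

    boolean contains(String word) { // O(k), one map hop per character
        Node cur = root;
        for (char c : word.toCharArray()) {
            cur = cur.children.get(c);
            if (cur == null) return false;
        }
        return cur.isWord;
    }
}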

Optimizing construction of a trie over all substrings

I am solving a trie related problem. There is a set of strings S. I have to create a trie over all substrings for each string in S. I am using the following routine:
String strings[] = { ... }; // array containing all strings
for (int i = 0; i < strings.length; i++) {
    String w = strings[i];
    for (int j = 0; j < w.length(); j++) {
        for (int k = j + 1; k <= w.length(); k++) {
            trie.insert(w.substring(j, k));
        }
    }
}
I am using the trie implementation provided here. However, I am wondering if there are certain optimizations which can be done in order to reduce the complexity of creating trie over all substrings?
Why do I need this? Because I am trying to solve this problem.
If we have N words, each with maximum length L, your algorithm will take O(N*L^3) (supposing that adding a word to the trie is linear in its length). However, the size of the resulting trie (number of nodes) is at most O(N*L^2), so it seems you are wasting time and could do better.
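To see where the L^3 comes from: a single word of length L has L(L+1)/2 = O(L^2) substrings, and inserting a substring of length l costs O(l), so the insertion work for one word is the sum of the lengths of all its substrings, which is Θ(L^3).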
And indeed you can, but you have to pull a few tricks from your sleeve. Also, you will no longer need the trie.
1. .substring() in constant time
Before Java 7 update 6, each String had a backing char[] array as well as a starting position and length. This allowed the .substring() method to run in constant time: since String is an immutable class, a new String object sharing the same backing char[] array was created, only with a different start position and length.
You will need to extend this a bit, to support adding at the end of the string by increasing the length. Always create a new string object, but leave the backing array the same.
2. Recompute hash in constant time after appending a single character
Again, let me use Java's hashCode() function for String:
int hash = 0;
for (int i = 0; i < data.length; i++) {
    hash = 31 * hash + data[i];
} // data is the backing array
Now, how will the hash change after adding a single character at the end of the word? Easy: multiply the old hash by 31 and add the new character's value (its character code), i.e. hash' = 31 * hash + c. (Other primes can be used as well; a table of powers of 31 is only needed if you also want to prepend characters.)
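As a small sketch (mine, using explicit (start, end, hash) bookkeeping rather than relying on String internals), the constant-time update lets you compute the Java-style hash of every substring of a word in O(L^2) total:

// Compute the hash of every substring of w in O(L^2) total, using h' = 31 * h + c.
static void hashAllSubstrings(String w) {
    for (int start = 0; start < w.length(); start++) {
        int h = 0;
        for (int end = start; end < w.length(); end++) {
            h = 31 * h + w.charAt(end); // extend the current substring by one character
            // the substring w.substring(start, end + 1) now has hash h;
            // record the triple (start, end + 1, h) however you like
        }
    }
}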
3. Store all substrings in a single HashMap
Using tricks 1 and 2, you can generate all substrings in time O(N*L^2), which is the total number of substrings. Just always start with a string of length one and add one character at a time. Put all your substrings into a single HashMap to remove duplicates.
(You can skip 2 and 3 and discard duplicates when/after sorting; perhaps it will even be faster.)
4. Sort your substrings and you are good to go.
Well, when I got to point 4, I realized my plan wouldn't work, because in sorting you need to compare strings, and that can take O(L) time. I came up with several attempts to solve it, among them bucket sorting, but none would be faster than the original O(N*L^3).
I will just leave this answer here in case it inspires someone.
In case you don't know the Aho-Corasick algorithm, take a look at it; it could be of some use for your problem.
What you need may be a suffix automaton. It costs only O(n) time to build and can recognize all substrings.
A suffix array can also solve this problem.
These two structures can solve most string problems, though they are not easy to learn. Once you learn them, you will be able to solve this one.
You may consider the following optimization:
Maintain a set of processed substrings. While inserting a substring, check whether the set already contains that particular substring and, if so, skip inserting it into the trie.
However, the worst-case complexity for inserting all substrings into the trie will still be on the order of n^2, where n is the total length of the strings. From the problem page, this works out to be on the order of 10^8 insertion operations in the trie. Therefore, even if each insertion takes 10 operations on average, you will have 10^9 operations in total, which sets you up to exceed the time limit.
The problem page refers to the LCP array as a related topic for the problem. You should consider a change in approach.
First, notice that it is enough to add only suffixes to the trie; nodes for every substring will be created along the way (a sketch follows at the end of this answer).
Second, you have to compress the trie, otherwise it will not fit into the memory limit imposed by HackerRank. This will also make your solution faster.
I just submitted my solution implementing these suggestions, and it was accepted (the max execution time was 0.08 seconds).
But you can make your solution even faster by implementing a linear-time algorithm to construct the suffix tree. You can read about linear-time suffix tree construction algorithms here and here. There is also an explanation of Ukkonen's algorithm on Stack Overflow here.
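As a sketch of the suffix-only idea from the first point (reusing the question's trie and strings; the trie still has to be compressed, as noted, to fit the memory limit):

// Insert only the O(L) suffixes of each word instead of all O(L^2) substrings.
for (int i = 0; i < strings.length; i++) {
    String w = strings[i];
    for (int j = 0; j < w.length(); j++) {
        trie.insert(w.substring(j)); // one suffix per start position
    }
}

Every substring is a prefix of some suffix, so a node for it is still created along the way.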

Average case algorithm analysis using Kolmogorov Incompressibility Method

The Incompressibility Method is said to simplify the analysis of algorithms for the average case. From what I understand, this is because there is no need to compute all of the possible combinations of input for that algorithm and then derive an average complexity. Instead, a single incompressible string is taken as the input. As an incompressible string is typical, we can assume that this input can act as an accurate approximation of the average case.
I am lost in regard to actually applying the Incompressibility Method to an algorithm. As an aside, I am not a mathematician, but think that this theory has practical applications in everyday programming.
Ultimately, I would like to learn how I can deduce the average case of any given algorithm, be it trivial or complex. Could somebody please demonstrate to me how the method can be applied to a simple algorithm? For instance, given an input string S, store all of the unique characters in S, then print each one individually:
void uniqueChars(String s) {
    char[] chars = new char[s.length()];
    int free_idx = 0;
    for (int i = 0; i < s.length(); i++) {
        if (!contains(chars, free_idx, s.charAt(i))) { // assume a linear search over the first free_idx entries
            chars[free_idx] = s.charAt(i);
            free_idx++;
        }
    }
    for (int i = 0; i < free_idx; i++) {
        print(chars[i]);
    }
}
Only for the sake of argument. I think pseudo-code is sufficient. Assume a linear search for checking whether the array contains an element.
Better algorithms by which the theory can be demonstrated are acceptable, of course.
This question maybe nonsensical and impractical, but I would rather ask than hold misconceptions.
Reproducing my answer from the CS.SE question, for cross-reference purposes:
1. Kolmogorov Complexity (or Algorithmic Complexity) deals with optimal descriptions of "strings" (in the general sense of strings as sequences of symbols).
2. A string is (sufficiently) incompressible or (sufficiently) algorithmically random if its (algorithmic) description (Kolmogorov complexity K) is not less than its (literal) size. In other words, the optimal description of the string is the string itself.
3. A major result of the theory is that most strings are (algorithmically) random, or typical (which is also related to other areas like Gödel's theorems, through Chaitin's work).
4. Kolmogorov Complexity is related to probabilistic (or Shannon) entropy; in fact, entropy is an upper bound on KC. This relates analysis based on descriptive complexity to probability-based analysis, and the two can be interchangeable.
5. Sometimes it might be easier to use probabilistic analysis, other times descriptive complexity (views of the same thing, let's say).
So in light of the above, assuming an algorithmically random input to an algorithm, one assumes the following:
The input is typical, so the analysis describes the average-case scenario (point 3 above).
The input size is related in a certain way to its probability (point 2 above).
One can pass from the algorithmic view to the probabilistic view (point 4 above).
