I'm trying to rate the efficiency of a function whose input is an array of strings. The algorithm always iterates through every item in this array. The strings contained in the array are of variable length. In the initial for loop, a character-replace function is called on each string. I believe the replace function on its own would be O(n), where n is the length of the string.
So I'm confused about how to evaluate big-O efficiency here. If n is the size of the array, I know it will be at least O(n). But with variable string lengths, how would you rate the overall efficiency with the string replacement? Would you say n is the size of the array and use other variables to represent the different sizes of each string?
Personally, I would express efficiency in terms of the size of the input data (as opposed to the length of the array). So, if the input is t bytes, the running time will be O(t). Here t is also the combined length of all the strings.
I see two ways to say this (out of many possible).
The first is to say it is O(N) where N is the total number of characters.
The other is to say it is O(N*M), where N is the number of strings and M is the average number of characters per string. Note that this is actually the same as the answer above: since M = k/N, you get O(N*k/N) = O(k), where k is the total number of characters in all the strings.
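To make the counting concrete, here is a minimal Python sketch (the function name and the use of str.replace are illustrative, not from the question):

```python
def replace_all(strings, old, new):
    # str.replace scans each whole string, so the total work here is
    # proportional to the combined number of characters across all strings
    # (k in the notation above), not merely to len(strings).
    return [s.replace(old, new) for s in strings]
```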
Describe the MOST efficient algorithm (in worst-case big-O), and the required data structures, to determine the frequencies of characters in an English text document that can contain any character on the keyboard, with upper- or lowercase letters, and print the (character, frequency) pairs at the end. Which operations would you count for the worst case, and what is the resulting big-O time?
It is better to use an array of size 256 (the total number of ASCII characters).
Initially, all values in the array are 0. While reading characters from the text document, we simply increment the value at the index equal to the ASCII value of the given character. Hence, these operations can be done in O(1) time without any overhead (if we used a hashMap, we would have collision overhead in the worst case).
As we have to loop over all characters in the given text document, the overall time complexity of the proposed method is O(n), where n is the length of the text document.
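A minimal sketch of the counting-array approach (identifier names are illustrative):

```python
def ascii_frequencies(text):
    counts = [0] * 256  # one slot per 8-bit character code, all initially 0
    for ch in text:
        counts[ord(ch)] += 1  # O(1) per character, no collision handling
    # Emit only the characters that actually occur, in code order.
    return [(chr(i), c) for i, c in enumerate(counts) if c > 0]
```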
A hashmap where key=character and value=frequency would work best for any kind of characters and encodings.
If you only need whatever a keyboard can produce, you can also use a frequency array where F[character ASCII code]=frequency.
Both solutions have a constant O(1) runtime per operation.
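For arbitrary characters and encodings, the hashmap variant can be sketched in Python with a dictionary; collections.Counter is just a convenience wrapper around the same idea:

```python
from collections import Counter

def char_frequencies(text):
    # One average-O(1) dictionary update per character: O(n) total.
    return Counter(text)
```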
I was reading a text about hashing, and I found that a naive hash code for a character string can be implemented as a polynomial hash function:
h(S0, S1, S2, ..., SN-1) = S0*A^(N-1) + S1*A^(N-2) + S2*A^(N-3) + ... + SN-1*A^0, where Si is the character at index i and A is some integer.
But can't we straightaway use a weighted sum instead:
h(S0, S1, S2, ..., SN-1) = S0*N + S1*(N-1) + S2*(N-2) + ... + SN-1*1.
This function also looks good to me, since two reversed values are not hashed to the same result, e.g. 2*S0 + S1 != 2*S1 + S0 (when S0 != S1). But nowhere do I find this type of hash function used.
Suppose we work with strings of 30 characters. That's not long, but it's not so short that problems with the hash should arise purely because the strings are too short.
The sum of the weights is 465 (1 + 2 + ... + 30); with printable ASCII characters, that makes the maximum hash 465 * 126 = 58590, attained by "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~". There are far more possible printable ASCII strings of 30 characters than that (95^30 ≈ 2E59), but they all hash into the range 0 to 58590. Naturally you cannot actually have that many strings at the same time, but you could have many more than 58590, and that would guarantee collisions just based on counting (in practice they are very likely to happen much sooner).
The maximum hash grows only slowly; you'd need strings of 34 million characters before the entire range of a 32-bit integer is used.
The other way, multiplying by powers of A, does not have this problem. The hash can be evaluated with Horner's scheme, so no powers need to be calculated explicitly; it still costs only an addition and a multiplication per character (though the naive way is not the fastest way to compute it). The powers of A quickly get big (and start wrapping around, which is fine as long as A is odd), so strings of 30 characters stand a good chance of covering the entire range of whatever integer type you're using.
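A sketch of the polynomial hash evaluated with Horner's scheme (the choices A = 31 and a Mersenne modulus are common but arbitrary, not mandated by the answer above):

```python
def poly_hash(s, a=31, mod=(1 << 61) - 1):
    # Horner's scheme: h = ((s0*a + s1)*a + s2)*a + ...
    # One multiply and one add per character; no explicit powers of a.
    h = 0
    for ch in s:
        h = (h * a + ord(ch)) % mod
    return h
```

Because the weights are powers of A, reversed strings such as "ab" and "ba" hash differently, and for long strings the values spread over the whole modulus range.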
The problem with a linear hash function is that it's much easier to generate collisions.
Consider a string with 3 chars: S0, S1, S2.
The proposed hash code would be 3 * S0 + 2 * S1 + S2.
Every time we decrease char S2 by two (e.g. e --> c), and increase char S1 by one (e.g. m --> n), we obtain the same hash code.
Even the mere fact that it is so easy to describe a hash-preserving operation would be an alarm (because some algorithm might process strings in exactly that manner). As a more extreme case, consider just summing the characters: then all anagrams of the original string would generate the same hash code (so this hash would be useless in an application processing anagrams).
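The collision recipe from the 3-character example is easy to demonstrate (the function name is illustrative):

```python
def linear_hash(s):
    # The weighted sum proposed in the question:
    # h = n*s0 + (n-1)*s1 + ... + 1*s(n-1)
    n = len(s)
    return sum((n - i) * ord(ch) for i, ch in enumerate(s))

# "ame" -> "anc": S1 goes up by one (m -> n), S2 goes down by two (e -> c),
# so the hash changes by 2*(+1) + 1*(-2) = 0 and a collision results.
```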
What I know:
1. The hash table size depends on the load factor.
2. The size should be a prime number, and that prime is used as the modulo value in the hash function.
3. The prime should not be too close to a power of 2 or a power of 10.
My doubt:
Does the size of the hash table depend on the length of the key?
The following paragraph is from the book Introduction to Algorithms by Cormen.
Does n = 2000 mean the length of each string or the number of elements that will be stored in the hash table?
Good values for m are primes not too close to exact powers of 2. For
example, suppose we wish to allocate a hash table, with collisions
resolved by chaining, to hold roughly n = 2000 character strings,
where a character has 8 bits. We don't mind examining an average of 3
elements in an unsuccessful search, so we allocate a hash table of
size m = 701. The number 701 is chosen because it is a prime near
2000/3 but not near any power of 2. Treating each key k as an integer,
our hash function would be
h(k) = k mod 701 .
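The quoted hash can be sketched directly: treat the 8-bit character string as one big integer in base 256 and reduce it mod 701 (a minimal illustration, not code from the book):

```python
def h(key, m=701):
    # Interpret the byte string as an integer in base 256, then reduce mod m.
    # Reducing as we go keeps intermediate values small; the result is the
    # same as building the full integer first and taking it mod m at the end.
    k = 0
    for b in key.encode("ascii"):
        k = (k * 256 + b) % m
    return k
```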
Can somebody explain it?
Here's a general overview of the tradeoff with hash tables.
Suppose you have a hash table with m buckets with chains storing a total of n objects.
If you store only references to objects, the total memory consumed is O(m + n).
Now suppose that, for an average object, its size is s, it takes O(s) time to compute its hash once, and O(s) time to compare two such objects.
Consider an operation checking whether an object is present in the hash table.
The bucket will have n/m elements on average, so the operation will take O(s * n/m) time.
So, the tradeoff is this: when you increase the number of buckets m, you increase memory consumption but decrease average time for a single operation.
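The tradeoff can be made concrete with a toy cost model (the formulas mirror the answer above; the function and parameter names are mine):

```python
def hash_table_costs(n, m, s=1.0):
    # Memory: m buckets plus n stored references.
    # Unsuccessful lookup: hash the query once (O(s)), then compare against
    # roughly n/m chain entries, each comparison costing O(s).
    memory = m + n
    lookup_time = s * (1 + n / m)
    return memory, lookup_time
```

With the numbers from the CLRS quote, n = 2000 and m = 701, the model predicts about 2000/701 ≈ 2.9 chain comparisons per unsuccessful search, matching the "average of 3 elements" in the quote.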
For the original question (does the size of the hash table depend on the length of the key?): no, it should not, at least not directly.
The paragraph you cite only mentions the strings as an example of an object to store in a hash table.
One mentioned property is that they are 8-bit character strings.
The other is that "We don't mind examining an average of 3 elements in an unsuccessful search".
And that wraps the properties of the stored objects into the question: how many elements, on average, do we want to place in a single bucket?
The length of strings themselves is not mentioned anywhere.
(2) and (3) are false. It is common for a hash table to have 2^n buckets (ref), as long as you use the right hash function. On (1), the memory a hash table takes equals the number of buckets times the size of a key. Note that for string keys, we usually store pointers to the strings, not the strings themselves, so the key size is the size of a pointer, which is 8 bytes on 64-bit machines.
Algorithm-wise, no!
The length of the key is irrelevant here.
Moreover, the key itself is not important; what's important is the number of different keys you predict you'll have.
Implementation-wise, yes! Since you must store the key itself in your hash table, it is reflected in the table's size.
For your second question, 'n' means the number of different keys to hold.
I was going through this paper about counting the number of distinct common subsequences between two strings, which describes a DP approach for doing so. When there are more than two strings whose number of distinct common subsequences must be found, it might require a different approach. What I want to know is whether this task is achievable in less than exponential time, and if so, how it can be done.
If you have an alphabet of size k, and m strings of size at most n, then (assuming that all individual math operations are O(1)) this problem is solvable with dynamic programming in time at most O(k * n^(m+1)) and memory O(k * n^m). Those are not tight bounds, and in practice performance and memory usage should be significantly better than that. But in practice, with long strings you will wind up needing big-integer arithmetic, which makes math operations not O(1). Still, it is polynomial.
Here is the trick, in an unfortunately confusing sentence: we want to build up a series of tables listing, for each possible subsequence length and each way of picking one position in each string, the number of distinct subsequences whose minimal expression ends at the chosen position in each string. If we do that, then the sum of all of those values is our final answer.
Here is an outline of how to do it (which you can do without understanding the above description).
For each string, build a transition table mapping (position in string, character) to the position of the next occurrence of that character. The tables should start with position 0 being before the first character. You can use -1 for running off of the end of the string.
Create a data structure that maps a list of integers, one per string, to another integer. This will be the count of subsequences of a fixed length whose shortest representation in each string ends at that set of positions.
Insert as the sole value (0, 0, ..., 0) -> 1 to represent the fact that there is 1 subsequence of length 0 and its shortest representation in each string ends at the start.
Set the total count of common subsequences to 0.
While that map is not empty:
Add the sum of values in that map to the total count of common subsequences.
Create a second map of the same type, with no data.
For each key/value pair in the first map:
For each possible character in your alphabet:
Construct a new vector of integers to be a new key by taking each string, looking at the position, then taking the next position of that character. Of course if you run off of the end of the string, break out of the loop.
If that key is not in your second map, insert it with value 0.
Increase the value for that key in the second map by your current value in the current map. (Basically add the number of subsequences that just had this minimal character transition.)
Copy the second data structure to the first.
The total count of distinct subsequences in common across all of the strings should now be correct.
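The steps above can be sketched in Python as follows (identifier names are mine; map keys are tuples of 1-based end positions, with 0 meaning "nothing matched yet"; the returned count includes the empty subsequence):

```python
def count_common_subsequences(strings, alphabet=None):
    """Count subsequences common to all strings (includes the empty one)."""
    if alphabet is None:
        alphabet = sorted(set().union(*strings))
    # Per string: table[p][c] = 1-based position of the next occurrence of c
    # strictly after position p, or -1 when c does not occur again.
    nxt = []
    for s in strings:
        n = len(s)
        last = {c: -1 for c in alphabet}
        table = [None] * (n + 1)
        for p in range(n, -1, -1):
            if p < n:
                last[s[p]] = p + 1
            table[p] = dict(last)
        nxt.append(table)
    # cur maps a tuple of end positions (one per string) to the number of
    # distinct subsequences whose minimal representation ends exactly there.
    cur = {(0,) * len(strings): 1}
    total = 0
    while cur:
        total += sum(cur.values())
        step = {}
        for key, cnt in cur.items():
            for c in alphabet:
                newkey = []
                for si, p in enumerate(key):
                    q = nxt[si][p][c]
                    if q == -1:
                        break  # c never occurs again in string si
                    newkey.append(q)
                else:
                    t = tuple(newkey)
                    step[t] = step.get(t, 0) + cnt
        cur = step
    return total
```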
According to the Wikipedia entry on the Rabin-Karp string-matching algorithm, it can be used to look for several different patterns in a string at the same time while still maintaining linear complexity. It is clear that this is easily done when all the patterns are of the same length, but I still don't get how we can preserve O(n) complexity when searching for patterns with differing lengths simultaneously. Can someone please shed some light on this?
Edit (December 2011):
The Wikipedia article has since been updated and no longer claims to match multiple patterns of differing length in O(n).
I'm not sure if this is the correct answer, but anyway:
While constructing the current hash value, we can check each intermediate hash for a match in the set of pattern hashes. The hash function is usually implemented as a loop, and inside that loop we can insert our quick lookup.
Of course, we must pick m to be the maximum string length from the set of patterns.
Update: From Wikipedia,
[...]
for i from 1 to n-m+1
if hs ∈ hsubs
if s[i..i+m-1] = a substring with hash hs
return i
hs := hash(s[i+1..i+m]) // <---- calculating current hash
[...]
We calculate the current hash in m steps. At each step there is a temporary hash value that we can look up (O(1) complexity) in the set of hashes. All hashes have the same size, i.e. 32 bits.
Update 2: an amortized (average) O(n) time complexity?
Above I said that m must be the maximum string length. It turns out that we can exploit the opposite.
With a rolling hash for shifting-substring search and a fixed window size m, we can achieve O(n) complexity.
If we have variable-length strings, we can set m to the minimum string length. Additionally, in the set of hashes we don't associate a hash with the whole string but with its first m characters.
Now, while searching the text, we check whether the current hash is in the hash set, and if so, we examine the associated strings for a full match.
This technique increases false alarms, but on average it has O(n) time complexity.
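A sketch of this minimum-length-prefix variant (the base and modulus choices are illustrative, as are the identifier names):

```python
def multi_pattern_search(text, patterns, base=256, mod=(1 << 31) - 1):
    """Rabin-Karp over several patterns: hash only the first m characters
    of each pattern, where m is the shortest pattern length, then verify
    each candidate in full (this is the 'false alarm' step)."""
    m = min(len(p) for p in patterns)
    if len(text) < m:
        return []
    high = pow(base, m - 1, mod)  # weight of the window's leading character

    def prefix_hash(s):
        h = 0
        for ch in s[:m]:
            h = (h * base + ord(ch)) % mod
        return h

    # Map each m-character prefix hash to the patterns sharing it.
    table = {}
    for p in patterns:
        table.setdefault(prefix_hash(p), []).append(p)

    hits = []
    th = prefix_hash(text)
    for i in range(len(text) - m + 1):
        for p in table.get(th, ()):
            if text.startswith(p, i):  # full verification of the candidate
                hits.append((i, p))
        if i < len(text) - m:
            # Roll the window: drop text[i], append text[i+m].
            th = ((th - ord(text[i]) * high) * base + ord(text[i + m])) % mod
    return hits
```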
It's because the hash values of the substrings are related mathematically. Computing the hash H(S,j) (the hash of the characters starting from the jth position of string S) takes O(m) time on a string of length m. But once you have that, computing H(S, j+1) can be done in constant time, because H(S, j+1) can be expressed as a function of H(S, j).
O(m) for the initial hash plus O(1) per shift over the whole string gives O(n) total, i.e. linear time.
Here's a link where this is described in more detail (see e.g. the section "What makes Rabin-Karp fast?")
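The rolling-hash recurrence described above can be sketched for a single pattern (base and modulus are arbitrary illustrative choices):

```python
def rabin_karp_search(text, pattern, base=256, mod=(1 << 31) - 1):
    """Return all start indices of pattern in text using a rolling hash."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    high = pow(base, m - 1, mod)  # weight of the window's leading character
    ph = th = 0
    for i in range(m):  # O(m): hash the pattern and the first window
        ph = (ph * base + ord(pattern[i])) % mod
        th = (th * base + ord(text[i])) % mod
    hits = []
    for i in range(n - m + 1):
        # Verify on hash match to rule out collisions.
        if th == ph and text[i:i + m] == pattern:
            hits.append(i)
        if i < n - m:
            # H(S, i+1) from H(S, i): drop text[i], append text[i+m] -- O(1).
            th = ((th - ord(text[i]) * high) * base + ord(text[i + m])) % mod
    return hits
```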