Fastest algorithm to find a string in an array of strings?

This question is purely about the algorithm.
In pseudocode it looks like this:
A = Array of strings; // let's say count(A) = N
S = String to find;   // let's say length(S) = M
for (Index = 0; Index < count(A); Index++) {
    if (A[Index] == S) {
        print "First occurrence at index " + Index;
        break;
    }
}
This for loop requires N string comparisons (or N*M byte comparisons, i.e. O(N*M)). This is bad when array A has many items, or when string S is very long.
Is there a better method to find the first occurrence? An algorithm at O(K*logK) is acceptable, but preferably O(K), or best of all O(logK), where K is either N or M.
I don't mind adding in some other structures or doing some data processing before the comparison loop.

You could convert the whole array of strings into a finite state machine, where the transitions are the characters of the strings, and store in each accepting state the smallest index of the strings that reach it. Building this takes a fair amount of time up front and can be considered a form of indexing.
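For illustration, a rough sketch of that idea as a character trie (my own code; names are illustrative):

import java.util.HashMap;
import java.util.Map;

class TrieIndex {
    private static class Node {
        Map<Character, Node> next = new HashMap<>();
        int firstIndex = -1;                 // smallest array index of a string ending here
    }

    private final Node root = new Node();

    void add(String s, int index) {
        Node cur = root;
        for (char c : s.toCharArray())
            cur = cur.next.computeIfAbsent(c, k -> new Node());
        if (cur.firstIndex == -1) cur.firstIndex = index;   // keep the smallest index
    }

    int find(String s) {                     // O(M); returns -1 if s is not in the array
        Node cur = root;
        for (char c : s.toCharArray()) {
            cur = cur.next.get(c);
            if (cur == null) return -1;
        }
        return cur.firstIndex;
    }
}

Build it once by calling add(A[i], i) for i = 0..N-1; find(S) then answers each query in O(M), independent of N.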

Put the strings into a hash-based set; testing whether a given string is contained in the set should then give you more or less constant-time performance once the set is built.
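A minimal sketch of this approach, reusing A and S from the question (in practice the map would be built once and reused for many queries):

import java.util.HashMap;
import java.util.Map;

class FirstIndexLookup {
    static int firstOccurrence(String[] A, String S) {
        Map<String, Integer> firstIndex = new HashMap<>();
        for (int i = 0; i < A.length; i++)
            firstIndex.putIfAbsent(A[i], i);     // keep only the first occurrence of each string
        return firstIndex.getOrDefault(S, -1);   // -1 if S is not present
    }
}

Building the map costs O(N*M); each subsequent lookup costs O(M) to hash S plus expected O(1) probing.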

You can first sort the array of strings, which takes O(m*n*logn) time. Once A is sorted, you can do a binary search instead of a linear search, which reduces the total lookup time to O(m*logn).
The advantage of this method is that it's quite easy to implement. For example, in Java you can do it with just two lines of code:
Arrays.sort(A);
int index = Arrays.binarySearch(A, S);

You could use a self-balancing binary search tree. Most implementations have O(log(n)) insert and O(log(n)) search.
If your set is not very big and you have a good hash function for your values, a hash-based set is the better solution, because in that case you get O(1) insert and O(1) search. But if your hash function is bad or your set is very big, both can degrade to O(n).
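For instance (a tiny sketch, not from the answer; Java's TreeSet is one such self-balancing tree):

import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

class SetLookupSketch {
    public static void main(String[] args) {
        Set<String> tree = new TreeSet<>();   // red-black tree: O(log n) add/contains
        Set<String> hash = new HashSet<>();   // hash table: expected O(1) add/contains
        for (String s : new String[] {"bar", "foo", "top", "zebra"}) {
            tree.add(s);
            hash.add(s);
        }
        System.out.println(tree.contains("foo"));   // true, found in O(log n)
        System.out.println(hash.contains("foo"));   // true, found in expected O(1)
    }
}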

The best way to search as fast as possible is to have the array sorted.
As you describe it, there seems to be no a priori information that would allow for heuristics or constraints in the search.
So sort the array first (quicksort, for example: O(NlogN)),
and do a binary search next: O(log(N)).

Related

Given O(n) sets, what is the complexity of figuring out the distinct ones amongst them?

I have an application where I have a list of O(n) sets.
Each set Set(i) is an n-vector. Suppose n=4, for instance,
Set(1) could be [0|1|1|0]
Set(2) could be [1|1|1|0]
Set(3) could be [1|1|0|0]
Set(4) could be [1|1|1|0]
I'd like to process these sets so that as output, I only get the unique ones amongst them. So, in the example above, I would get as output:
Set(1), Set(2), Set(3). Note that Set(4) is discarded since it is the same as Set(2).
A rather brute force way of figuring this gives me a worst-case bound of O(n^3):
Given: Input List of size O(n)
Output List L = Set(1)
for (j = 2 to Length of Input List) {       // Loop Outer: check if Set(j) should be added to L
    for (i = 1 to Length of L currently) {  // Loop Inner
        check if Set(i) is same as Set(j)   // This step is O(n) since Set() has O(n) elements
        if (they are same) exit inner loop
        else if (i is length of L currently)    // so Set(j) is unique thus far
            Append Set(j) to L
    }
}
There is no a priori bound on n: it can be arbitrarily large. This seems to preclude the use of a simple hash function which maps the binary set to a decimal number. I could be wrong.
Is there any way to do this with a better worst-case running time than O(n^3)?
O(n) sequences of length n make an input of size O(n^2). You won't get complexity better than that, since you may be required to read all the input. All sequences might be the same, for example, but you'd have to read them all to know that.
A binary sequence of length n can be inserted into a trie or radix tree, while checking whether or not it already exists, in O(n) time. That's O(n^2) for all the sequences together, so simply using a trie or radix tree to find duplicates is optimal.
See: https://en.wikipedia.org/wiki/Trie
and: https://en.wikipedia.org/wiki/Radix_tree
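A hedged sketch of such a trie for 0/1 strings (class and method names are mine):

class BitTrie {
    private static class Node {
        Node zero, one;
        boolean terminal;                    // true if a stored sequence ends here
    }

    private final Node root = new Node();

    // Inserts a 0/1 string; returns false if it was already present (a duplicate).
    boolean insert(String bits) {
        Node cur = root;
        for (char c : bits.toCharArray()) {
            if (c == '0') {
                if (cur.zero == null) cur.zero = new Node();
                cur = cur.zero;
            } else {
                if (cur.one == null) cur.one = new Node();
                cur = cur.one;
            }
        }
        if (cur.terminal) return false;      // seen before
        cur.terminal = true;
        return true;
    }
}

Feeding each sequence through insert() and keeping only those for which it returns true yields the unique sequences in O(n) per sequence, i.e. O(n^2) for all of them.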
You may consider implementing your set using a balanced binary tree. The cost of inserting a new node into such a tree is O(lg m), where m is the number of elements in the tree. Duplicates are implicitly weeded out, because if we detect that such a node already exists, it is simply not added.
In your example, the total number of lookup/insertion operations would be n*n, since there are n sets and each set has n values. So the overall time scales as O(n^2 * lg(n^2)), which outperforms O(n^3) by some amount.
First of all, these are not sets but bitstrings.
Next, convert every bitstring to a number and put that number in a hash set (or simply store the original bitstrings; most hash-set implementations can do that). Afterwards, your hash set contains all the unique items: O(N) time, O(N) space. If you need to maintain the original order of the strings, then in a single loop check for each string whether it is already in the hash set; if not, output it and insert it into the hash set.
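A short sketch of that order-preserving pass (my own code, storing the bitstrings directly):

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class UniqueBitstrings {
    // One pass: a bitstring is output only the first time it is seen, so order is preserved.
    static List<String> unique(List<String> bitstrings) {
        Set<String> seen = new HashSet<>();
        List<String> out = new ArrayList<>();
        for (String s : bitstrings)
            if (seen.add(s))        // add() returns false if s was already in the set
                out.add(s);
        return out;
    }
}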
If you can use O(n) extra space, you can try this:
First of all, let's treat the vectors as binary numbers, so 0110 becomes 6.
(This assumes the entries are 0/1; otherwise you can multiply by 10 instead of 2.)
Converting all vectors into decimals takes O(4n).
For each converted number we map the vector by its decimal value. To implement this, we use an n-sized hash map.
HM <- n-sized hash-map
for each vector v:
    num <- decimal number converted from v
    map v into HM by num
loop over HM and take only one vector for each key
Runtime by steps:
O(n)
O(n*(4+1)), where 1 is the time for mapping and 4 is the vector length
O(n)
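A small sketch of this mapping (my own code; since n is unbounded I hold the converted value as a BigInteger, and I assume each vector is given as a 0/1 string such as "0110"):

import java.math.BigInteger;
import java.util.LinkedHashMap;
import java.util.Map;

class DecimalKeyDedup {
    // Keeps one vector per converted value, in first-seen order.
    static Map<BigInteger, String> dedupe(String[] vectors) {
        Map<BigInteger, String> byValue = new LinkedHashMap<>();
        for (String v : vectors) {
            BigInteger num = new BigInteger(v, 2);   // "0110" -> 6
            byValue.putIfAbsent(num, v);
        }
        return byValue;
    }
}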

What is the run-time of inserting the words in a string into a hash table?

More info:
n is the number of characters in the string
the hash table should keep track of each word's frequency; i.e., the hash table should store key-value pairs, where the key is a word in the input string, and the value is the number of times that word occurs in the input string
We've had some heated debates about this question at work, and I'd like to see what you guys think the answer is.
An important thing to consider when implementing the insert function is how we handle collisions and which resolution technique we use. This has a large influence on both put() and get() operations.
Collision resolution techniques are implemented differently in each library. The core idea is to keep all colliding keys in the same bucket, and during retrieval to traverse the colliding keys and apply an equality check to retrieve the given key. Note that we need to keep both the keys and the values in the bucket to facilitate that equality check.
So the keys (words) are stored in the hash table along with their counts.
Another thing to consider: during an insertion, a hash code is generated for the given key. We can consider this constant, O(1), for every key.
Now, answering the question.
Given a string of length n, inserting all the words and their frequencies involves the following steps:
1. Split the given string into words, with the given delimiter - O(n)
2. For word in words - O(n)
   # Considering the copy of a word of length k as constant and very small compared to n,
   # and the collision resolution implementation amortized across all inserts:
   if MAP.exists(word) - O(1)
       MAP.set(word, MAP.get(word) + 1) - amortized O(1)
   else
       MAP.set(word, 1) - O(1)
Overall, this is an O(n) run-time for inserting the words of a string into a hash table, because the for loop runs n/k times and k is constant and small compared to n.
If H is your hashtable mapping words to counts, then H[s] and H[s] = <new value> are both O(len(s)). That's because computing the hash code for s requires you to read every character of s, and once you've found the relevant bucket in the hashtable, you need to compare s to whatever is stored there. Of course, the usual hashtable complexities apply too: there are O(1) such comparisons performed.
With respect to your original problem, you can break your string of length n into words in O(n) time. Then for each word, you need an O(len(word)) operation to update the hashtable. For all the words, O(len(word1) + len(word2) + ... + len(word_n)) = O(n) overall, since the sum of the lengths of the words is at most n, the length of the original string.
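As a concrete illustration of the above (a minimal Java sketch, not from the discussion):

import java.util.HashMap;
import java.util.Map;

class WordCounts {
    // Splitting is O(n); each insert hashes and compares one word, so the total
    // work is proportional to the sum of the word lengths, i.e. O(n) overall.
    static Map<String, Integer> wordCounts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : text.split("\\s+"))         // split on whitespace
            if (!word.isEmpty())
                counts.merge(word, 1, Integer::sum);   // O(len(word)) to hash and compare
        return counts;
    }
}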

Complexity of binary search on a string

I have a sorted array of strings, e.g. ["bar", "foo", "top", "zebra"], and I want to check whether an input word is present in the array or not.
eg:
search(String[] str, String word) {
    // binary search implemented + string comparison
}
Now binary search accounts for O(logn) complexity, where n is the length of the array. So far so good.
But at some point we need to do a string compare, which can be done in linear time.
Now, the input array can contain words of different sizes. So when I calculate the final complexity, will the answer be O(m*logn), where m is the size of the word we want to search for in the array, in our case "zebra"?
Yes, your thinking and your proposed solution are both correct. You need to consider the length of the longest string too in the overall complexity of the search.
A trivial string compare is an O(m) operation, where m is the length of the larger of the two strings.
But we can improve on this a lot, given that the array is sorted. As user "doynax" suggests:
Complexity can be improved by keeping track of how many characters got matched during
the string comparisons, and store the present count for the lower and
upper bounds during the search. Since the array is sorted we know that
the prefix of the middle entry to be tested next must match up to at
least the minimum of the two depths, and therefore we can skip
comparing that prefix. In effect we're always either making progress
or stopping the incremental comparisons immediately on a mismatch, and
thereby never needing to keep going over old ground.
So overall, at most m character comparisons have to be done across the whole search if the string is found, and possibly fewer (if the search fails at an early stage).
So the overall complexity would be O(m + log n).
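For the record, here is a minimal sketch of that prefix-tracking idea (my own code, not from the thread; it returns an index of the match, or -1):

class PrefixBinarySearch {
    static int prefixSearch(String[] arr, String target) {   // arr must be sorted
        int lo = 0, hi = arr.length - 1;
        int loMatch = 0, hiMatch = 0;   // chars of target known to match the lower/upper fence
        while (lo <= hi) {
            int mid = lo + (hi - lo) / 2;
            String cand = arr[mid];
            // Both fences agree with target on their first min(loMatch, hiMatch)
            // characters, so every entry between them (including arr[mid]) does too.
            int k = Math.min(loMatch, hiMatch);
            while (k < cand.length() && k < target.length()
                    && cand.charAt(k) == target.charAt(k)) {
                k++;
            }
            int cmp;
            if (k == cand.length() && k == target.length()) cmp = 0;
            else if (k == cand.length()) cmp = -1;    // cand is a proper prefix of target
            else if (k == target.length()) cmp = 1;   // target is a proper prefix of cand
            else cmp = Character.compare(cand.charAt(k), target.charAt(k));

            if (cmp == 0) return mid;
            if (cmp < 0) { lo = mid + 1; loMatch = k; }   // arr[mid] becomes the lower fence
            else         { hi = mid - 1; hiMatch = k; }   // arr[mid] becomes the upper fence
        }
        return -1;
    }
}

The skipped prefix is safe to skip because both fences agree with the target on it, so the three-way comparison result is unchanged.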
I was under the impression that what the original poster said was correct, i.e. that the time complexity is O(m*logn).
If you use the suggested enhancement to improve the time complexity (to get O(m + logn)) by tracking previously matched letters, I believe the inputs below would break it:
arr = ["abc", "def", "ghi", "nlj", "pfypfy", "xyz"]
target = "nljpfy"
I expect this would incorrectly match on "pfypfy". Perhaps one of the original posters can weigh in on this. I'm definitely curious to better understand what was proposed; it sounds like the number of already-matched letters is skipped in the next comparison.

Find a common element within N arrays

If I have N arrays, what is the best (in terms of time complexity; space is not important) way to find the common elements? You could just find one element and stop.
Edit: The elements are all numbers.
Edit: These are unsorted. Please do not sort and scan.
This is not a homework problem. Somebody asked me this question a long time ago. He was using a hash to solve the problem and asked me if I had a better way.
Create a hash index with elements as keys and counts as values. Loop through all values and update the counts in the index. Afterwards, run through the index and check which elements have count = N. Looking up an element in the index should be O(1); combined with looping through all M elements, that gives O(M).
If you want to keep an order specific to a certain input array, loop over that array and test the element counts in the index in that order.
Some special cases:
If you know that the elements are (positive) integers with a maximum value that is not too high, you can use a plain array as the "hash" index to keep the counts, where the numbers themselves are the array indices.
I've assumed that in each array each number occurs only once. Adapting this for multiple occurrences should be easy (set the i-th bit in the count for the i-th array, or only update if the current element count == i-1).
EDIT: when I answered this, the question did not yet include the part about "a better way" than hashing.
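A rough sketch of this counting approach (my own code, assuming, as stated above, that a number occurs at most once within any single array):

import java.util.HashMap;
import java.util.Map;

class CommonElement {
    // Returns one element that occurs in all N arrays, or null if there is none.
    static Integer findCommon(int[][] arrays) {
        Map<Integer, Integer> count = new HashMap<>();
        for (int[] arr : arrays)
            for (int x : arr)
                count.merge(x, 1, Integer::sum);       // expected O(1) per element
        for (Map.Entry<Integer, Integer> e : count.entrySet())
            if (e.getValue() == arrays.length)
                return e.getKey();                     // count == N: common to every array
        return null;
    }
}

To report a common element in the order of a particular input array, iterate over that array at the end and return the first element whose count equals N.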
The most direct method is to intersect the first two arrays and then intersect that result with each of the remaining N-2 arrays.
If 'intersection' is not defined in the language you're working in, or you require a more specific answer (i.e. you need the answer to 'how do you do the intersection'), then modify your question accordingly.
Without sorting, there isn't a more optimized way to do this based on the information given (i.e. sorting and positioning all elements relative to each other, then iterating over the arrays checking for elements present in all of them at once).
The question asks whether there is a better way than hashing. There is no better way (i.e. better time complexity), since the time to hash each element is typically constant. Empirical performance is also favorable, particularly if the range of values can be mapped one-to-one to an array maintaining counts; the time is then proportional to the number of elements across all the arrays. Sorting will not give better complexity, since it still needs to visit each element at least once, and then there is the log N factor for sorting each array.
Back to hashing: from a performance standpoint, you will get the best empirical performance by not processing each array fully, but processing only a block of elements from each array before proceeding to the next array. This takes advantage of the CPU cache. It also results in fewer elements being hashed in favorable cases where common elements appear in the same regions of the arrays (e.g. common elements at the start of all arrays). Worst-case behaviour is no worse than hashing each array in full; it merely means that all elements get hashed.
I don't think the approach suggested by catchmeifyoutry will work.
Let us say you have two arrays:
1: {1,1,2,3,4,5}
2: {1,3,6,7}
Then the answer should be 1 and 3. But if we use the hashtable approach, 1 will have count 3 and we will never find 1 in this situation.
The problem also becomes more complex if we have input something like this:
1: {1,1,1,2,3,4}
2: {1,1,5,6}
Here I think we should give the output as 1,1. The suggested approach fails in both cases.
Solution:
Read the first array and put it into a hashtable. If we find the same key again, don't increment the counter. Read the second array in the same manner. Now the hashtable contains the common elements, which have a count of 2.
But again, this approach will fail on the second input set I gave earlier.
I'd first start with the degenerate case, finding the common elements between two arrays (more on this later). From there I'd have a collection of common values which I would use as an array itself and compare against the next array. This check would be performed N-1 times, or until the "carry" array of common elements drops to size 0; a sketch of this carry approach follows below.
One could speed this up, I'd imagine, by divide-and-conquer: split the N arrays into the leaf nodes of a tree. The next level up the tree is N/2 common-element arrays, and so forth, until you have a single array at the top that is either filled or not. Either way, you'd have your answer.
Without sorting and scanning, the best you'll get for comparing two arrays for common elements is O(N^2).
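A brief sketch of the carry approach mentioned above (my own code; retainAll performs the per-array intersection):

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.stream.Collectors;

class CarryIntersection {
    // Keeps a running "carry" set of common elements and stops early if it empties.
    static Set<Integer> commonElements(int[][] arrays) {
        Set<Integer> carry = Arrays.stream(arrays[0]).boxed()
                                   .collect(Collectors.toCollection(HashSet::new));
        for (int i = 1; i < arrays.length && !carry.isEmpty(); i++) {
            Set<Integer> current = Arrays.stream(arrays[i]).boxed()
                                         .collect(Collectors.toCollection(HashSet::new));
            carry.retainAll(current);    // keep only elements also present in arrays[i]
        }
        return carry;
    }
}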

Using Rabin-Karp to search for multiple patterns in a string

According to the wikipedia entry on Rabin-Karp string matching algorithm, it can be used to look for several different patterns in a string at the same time while still maintaining linear complexity. It is clear that this is easily done when all the patterns are of the same length, but I still don't get how we can preserve O(n) complexity when searching for patterns with differing length simultaneously. Can someone please shed some light on this?
Edit (December 2011):
The wikipedia article has since been updated and no longer claims to match multiple patterns of differing length in O(n).
I'm not sure if this is the correct answer, but anyway:
While constructing the hash value, we can check the current hash value for a match against the set of string hashes. The hash function is usually implemented as a loop, and inside that loop we can insert our quick lookup.
Of course, we must pick m to be the maximum string length in the set of strings.
Update: From Wikipedia,
[...]
for i from 1 to n-m+1
    if hs ∈ hsubs
        if s[i..i+m-1] = a substring with hash hs
            return i
    hs := hash(s[i+1..i+m])   // <---- calculating the current hash
[...]
We calculate the current hash in m steps. At each step there is a temporary hash value that we can look up (O(1)) in the set of hashes. All hashes have the same size, e.g. 32 bits.
Update 2: an amortized (average) O(n) time complexity?
Above I said that m must be the maximum string length. It turns out that we can exploit the opposite.
With hashing for a shifting substring search and a fixed window size m we can achieve O(n) complexity.
If we have variable-length strings, we can set m to the minimum string length. Additionally, in the set of hashes we don't associate a hash with the whole string but with its first m characters.
Now, while searching the text, we check whether the current hash is in the hash set, and if so we examine the associated strings for a match.
This technique increases the false alarms, but on average it has O(n) time complexity.
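A hedged sketch of that minimum-window variant (my own code with illustrative names, not from the thread; every hash hit is verified against the full candidate patterns to filter out false alarms):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class MultiRabinKarpSketch {
    static final long BASE = 256, MOD = 1_000_000_007L;

    // Hash of the first m characters of s.
    static long hashPrefix(String s, int m) {
        long h = 0;
        for (int i = 0; i < m; i++) h = (h * BASE + s.charAt(i)) % MOD;
        return h;
    }

    // Returns the first text position where any pattern matches, or -1.
    static int searchAny(String text, List<String> patterns) {
        int m = Integer.MAX_VALUE;
        for (String p : patterns) m = Math.min(m, p.length());
        int n = text.length();
        if (patterns.isEmpty() || m > n) return -1;

        // Group patterns by the hash of their first m characters.
        Map<Long, List<String>> byPrefixHash = new HashMap<>();
        for (String p : patterns)
            byPrefixHash.computeIfAbsent(hashPrefix(p, m), k -> new ArrayList<>()).add(p);

        long pow = 1;                                   // BASE^(m-1) mod MOD
        for (int i = 1; i < m; i++) pow = pow * BASE % MOD;
        long hs = hashPrefix(text, m);                  // hash of text[0..m)

        for (int i = 0; ; i++) {
            List<String> candidates = byPrefixHash.get(hs);
            if (candidates != null)
                for (String p : candidates)
                    if (text.regionMatches(i, p, 0, p.length())) return i;
            if (i + m >= n) return -1;
            // O(1) roll: drop text[i], append text[i + m].
            hs = (hs - text.charAt(i) * pow % MOD + MOD) % MOD;
            hs = (hs * BASE + text.charAt(i + m)) % MOD;
        }
    }
}

Grouping by only the first m characters admits more hash hits, as noted above, but each hit costs only the verification of the few patterns in that bucket.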
It's because the hash values of successive substrings are related mathematically. Computing the hash H(S, j) (the hash of the m characters starting at position j of string S) takes O(m) time. But once you have that, computing H(S, j+1) can be done in constant time, because H(S, j+1) can be expressed as a function of H(S, j).
O(m) for the first window, then O(1) per shift: linear time.
Here's a link where this is described in more detail (see e.g. the section "What makes Rabin-Karp fast?")
