Time complexity of unordered_set<int> find method - c++11

What is the time complexity of the find method in unordered_set<int>?
And also, is it possible to change the hash function?

What is the time complexity of the find method in unordered_set?
...it's right there on the page you linked:
Complexity:
Average case: constant.
Worst case: linear in container size.
And also, is it possible to change the hash function?
Yes. Again, look at the documentation!
std::unordered_set takes a Hash template parameter. It's a customization point where you can inject your own hashing logic. A custom Hash must satisfy the Hash concept.
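For illustration only, here is a minimal sketch of plugging in a custom hasher; the struct name and the mixing scheme are my own, not anything mandated by the question:

#include <cstddef>
#include <cstdint>
#include <iostream>
#include <unordered_set>

// Hypothetical hasher: mixes the bits of the int instead of the
// identity hash most standard libraries use for integers.
struct MixHash {
    std::size_t operator()(int x) const noexcept {
        std::uint64_t h = static_cast<std::uint64_t>(static_cast<std::uint32_t>(x));
        h ^= h >> 33;
        h *= 0x9E3779B97F4A7C15ULL;   // a common 64-bit mixing constant
        h ^= h >> 29;
        return static_cast<std::size_t>(h);
    }
};

int main() {
    // The second template parameter of unordered_set is the Hash customization point.
    std::unordered_set<int, MixHash> s{1, 12, 23, 34, 45};
    std::cout << (s.find(23) != s.end()) << '\n';   // prints 1
}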

I guess you are getting confused by the default max_load_factor being 1. When you insert an int x into the unordered_set, it goes to bucket i, where i = x % bucket_count. So even though the default hash function has no collisions for ints (it maps each int to itself), the modulo operation can still produce "collisions" in some cases. For example, with 5 buckets, 1 and 6 both map to bucket 1 (1 % 5 == 6 % 5 == 1), and find has to walk through that bucket to locate them. The number of buckets is only increased when the load factor reaches the max load factor. The load factor is the average number of elements per bucket (size divided by bucket count). So you can actually have more than one element in a bucket, and you can even have all elements in the same bucket. In that case, finding an element that's inside the set needs a plain sequential search (O(n)) inside the bucket. Here you have an example:
unordered_set<int> n;
n.insert(1);
n.insert(12);
n.insert(23);
n.insert(34);
n.insert(45);
In that case, every int is in bucket 1, so when you look for 56 (56 % 11 == 1) you need to go through the whole bucket (size n, so O(n)). The load factor is 0.4545 (5 elements / 11 buckets), so no buckets are added. You can reduce the max_load_factor (some languages use a load factor of 0.75), but that would increase the number of rehashes, as you would need to reserve buckets more frequently (reserving is amortized constant, using the same growth strategy as std::vector, which is why the example ends up with 11 buckets).
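If you want to see this for yourself, the bucket interface of unordered_set exposes all of these numbers (note that the exact bucket count is implementation-defined, so the 11 assumed above is not guaranteed by every standard library):

#include <iostream>
#include <unordered_set>

int main() {
    std::unordered_set<int> n{1, 12, 23, 34, 45};

    std::cout << "buckets: " << n.bucket_count()         // implementation-defined
              << ", load factor: " << n.load_factor()    // size() / bucket_count()
              << ", max load factor: " << n.max_load_factor() << '\n';  // 1.0 by default

    for (int x : n)
        std::cout << x << " -> bucket " << n.bucket(x) << '\n';

    // bucket_size() shows how long the chain is that find() may have to walk.
    std::cout << "elements sharing bucket " << n.bucket(1) << ": "
              << n.bucket_size(n.bucket(1)) << '\n';
}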

Related

Can I predict how many swaps an algorithm will do, knowing how an array was shuffled?

I need to take a snapshot of the array every N swaps performed by a user-defined sorting algorithm. This N depends on the total number of swaps M that the algorithm will have performed once the array is sorted.
The size of the array can get up to millions of elements, so I realized that running the algorithm twice (once for counting M, and once for taking the snapshots) takes too long with slow algorithms like bubble sort.
Since I am the one who shuffles the array, I was wondering: is there a way to know how many swaps (or at least an upper bound on them) a given sorting algorithm will perform?
N is defined like:
Is it possible for you to modify the class of the objects you are working with? You could try to pass a user-defined array class which owns a counter. Using operator overloading you could modify the assignment operator and increment the counter every time
myarray[i] = newvalue
is called.
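A minimal sketch of that idea (the type and counter names are mine, not from the question): wrap the element type so that every assignment bumps a shared counter, then run the sort on the wrapped type.

#include <algorithm>
#include <cstddef>
#include <iostream>
#include <vector>

// Hypothetical wrapper: behaves like an int but counts how many times
// operator= is invoked, which bounds the data movement a sort performs.
struct Counted {
    int value = 0;
    static std::size_t assignments;   // shared counter

    Counted() = default;
    Counted(int v) : value(v) {}
    Counted(const Counted&) = default;
    Counted& operator=(const Counted& other) {
        value = other.value;
        ++assignments;
        return *this;
    }
    bool operator<(const Counted& other) const { return value < other.value; }
};
std::size_t Counted::assignments = 0;

int main() {
    std::vector<Counted> a{5, 3, 1, 4, 2};
    std::sort(a.begin(), a.end());
    std::cout << "assignments: " << Counted::assignments << '\n';
}

Note that with this wrapper a std::swap translates into one copy construction (not counted) plus two counted assignments, so the counter is a proxy for the amount of data movement rather than an exact swap count.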

Counting Sort different approach

In a counting sort algorithm, we initialize a count array with a size equal to the maximum value in the given array. Runtime of this method is O(n + Max value). However, with an extra loop we can find the minimum and maximum values of the given array:
for i = 0 to length(given_array) - 1
    if given_array[i] > max
        max = given_array[i]
    if given_array[i] < min
        min = given_array[i]
Then use that data to create the count array, let's say covering 95 to 100. In some cases we could decrease the runtime tremendously. However, I haven't seen an approach like this. Would it still be a counting sort algorithm, or does it have another name that I don't know?
Counting sort is typically used when we know upfront that values will be restricted to a certain range.
This range doesn't need to start at zero; it's absolutely fine to use an array of length six whose elements represent the counts of values 95 through 100 (or, for that matter, the counts of values from −2 to 3). So, yes, your approach is still "counting sort".
But if you don't know this restriction upfront, you're not likely to get faster results by doing a complete pass over the data to check.
For example: suppose you have 1,000,000 elements, and you know they're all somewhere in the range 0–200, but you think they're probably all in a much narrower range. Well, the cost of prescanning the entire input array is going to be greater than the cost of working with a 201-element working array, which means it costs more than it can possibly save compared to just doing a counting sort with the range 0–200.
Runtime of this method is O(n + Max value).
The runtime is O(max(num_elements, range_size)), which — due to the magic of Landau (big-O) notation — is the same as O(num_elements + range_size). Your approach only affects the asymptotic complexity if max_value is asymptotically greater than both num_elements and range_size.
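For what it's worth, here is a rough sketch of the variant being discussed (the function name is mine): one preliminary scan finds min and max, and the count array only spans that range.

#include <algorithm>
#include <cstddef>
#include <iostream>
#include <vector>

// Counting sort over [min, max] found by a preliminary scan.
// Runtime is O(n + range_size), where range_size = max - min + 1.
void counting_sort(std::vector<int>& a) {
    if (a.empty()) return;

    auto mm = std::minmax_element(a.begin(), a.end());
    int mn = *mm.first, mx = *mm.second;

    std::vector<int> count(static_cast<std::size_t>(mx - mn) + 1, 0);
    for (int x : a) ++count[x - mn];

    std::size_t out = 0;
    for (int v = mn; v <= mx; ++v)
        for (int c = 0; c < count[v - mn]; ++c)
            a[out++] = v;
}

int main() {
    std::vector<int> a{97, 95, 100, 96, 95, 99};
    counting_sort(a);                        // count array needs only 6 slots (95..100)
    for (int x : a) std::cout << x << ' ';   // 95 95 96 97 99 100
    std::cout << '\n';
}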

hash table about the load factor

I'm studying hash tables for my algorithms class and I became confused about the load factor.
Why is the load factor, n/m, significant with 'n' being the number of elements and 'm' being the number of table slots?
Also, why does this load factor equal the expected length of n_j, the linked list at slot j of the hash table, when all of the elements that hash to a slot are stored in a single chain?
The crucial property of a hash table is the expected constant time it takes to look up an element.*
In order to achieve this, the implementer of the hash table has to make sure that every query to the hash table completes within some fixed number of steps.
If you have a hash table with m buckets and you keep adding elements indefinitely (i.e. n >> m), then the lists will also grow, and you can no longer guarantee expected constant time for lookups; instead you get linear time (since the time needed to traverse the ever-growing linked lists outweighs the lookup of the bucket).
So, how can we keep the lists from growing? Well, we have to make sure that the length of each list is bounded by some fixed constant - how do we do that? Well, we have to add additional buckets.
If the hash table is well implemented, then the hash function used to map elements to buckets should distribute the elements evenly across the buckets. If the hash function does this, the lists will all have roughly the same length.
How long is one of the lists if the elements are distributed evenly? Clearly, it's the total number of elements divided by the number of buckets, i.e. the load factor n/m (number of elements per bucket = expected/average length of each list).
Hence, to ensure constant-time lookup, what we have to do is keep track of the load factor (again: the expected length of the lists) so that, when it goes above a fixed constant, we can add additional buckets.
Of course, there are more problems that come up, such as how to redistribute the elements you have already stored, or how many buckets to add.
The important message to take away is that the load factor is what tells us when to add additional buckets to the hash table - that's why it is not only 'important' but crucial.
Of course, if you map all the elements to the same bucket, the average length of each list won't tell you much. All of this only makes sense if the elements are distributed evenly across the buckets.
*Note the expected - I can't emphasize this enough. It's typical to hear "hash tables have constant lookup time". They do not! The worst case is always O(n) and you can't make that go away.
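For std::unordered_set specifically, this bookkeeping is built in: roughly speaking, an insertion that would push the load factor above max_load_factor() triggers a rehash into a larger bucket array. A small sketch that makes the growth visible (the exact bucket counts printed are implementation-defined):

#include <cstddef>
#include <iostream>
#include <unordered_set>

int main() {
    std::unordered_set<int> s;
    std::size_t buckets = s.bucket_count();

    for (int i = 0; i < 100; ++i) {
        s.insert(i);
        if (s.bucket_count() != buckets) {        // a rehash happened
            buckets = s.bucket_count();
            std::cout << "size " << s.size()
                      << " -> " << buckets << " buckets"
                      << " (load factor " << s.load_factor() << ")\n";
        }
    }
}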
Adding to the existing answers, let me just put in a quick derivation.
Consider an arbitrarily chosen bucket in the table. Let X_i be the indicator random variable that equals 1 if the i-th element is inserted into this bucket and 0 otherwise.
We want to find E[X_1 + X_2 + ... + X_n].
By linearity of expectation, this equals E[X_1] + E[X_2] + ... + E[X_n].
Now we need to find the value of E[X_i]. By the definition of expected value, this is simply (1/m) * 1 + (1 - 1/m) * 0 = 1/m. So summing up the values for all i, we get 1/m + 1/m + ... + 1/m, n times, which equals n/m. We have just found the expected number of elements inserted into an arbitrarily chosen bucket, and this is the load factor.
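As a quick empirical sanity check of that derivation (a toy simulation, not a real hash table; the parameters are arbitrary): throw n elements into m buckets uniformly at random and average the length of one fixed bucket over many trials.

#include <cstddef>
#include <iostream>
#include <random>
#include <vector>

int main() {
    const std::size_t m = 1000;    // buckets
    const std::size_t n = 5000;    // elements, so n/m = 5
    const int trials = 200;

    std::mt19937 gen(42);
    std::uniform_int_distribution<std::size_t> bucket_of(0, m - 1);

    double total_len = 0;
    for (int t = 0; t < trials; ++t) {
        std::vector<int> chain_len(m, 0);
        for (std::size_t i = 0; i < n; ++i)
            ++chain_len[bucket_of(gen)];   // each element lands in a uniformly random bucket
        total_len += chain_len[0];         // observe one fixed, arbitrarily chosen bucket
    }
    std::cout << "observed E[chain length] ~ " << total_len / trials
              << ", predicted n/m = " << double(n) / m << '\n';
}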

Updating & Querying all elements in array >= X where X is variable fast

Formally, we are given an array with some initial values. Then we have 3 types of queries:
Point update: increment by 1 at a given position.
Range query: count the number of elements >= x, where x is given as input.
Range update: decrement by 1 all elements >= x, where x is given as input.
N = 10^5, Q = 10^5 (number of elements in the array and number of queries, respectively)
I tried doing this with a segment tree, but operations 2 and 3 can be as bad as O(n), since we don't know exactly which range is to be updated, so we may end up traversing the whole segment tree.
NOTE: To be clear, we need to do all 3 operations in O(log n) worst case, because only then can we do this fast; a linear approach doesn't work since Q = 10^5 and N = 10^5, so the worst case could be O(n^2), i.e. 10^10 operations, which is clearly not feasible.
Given that you're talking about 10^5 items, and don't mention needing to add or remove items, it seems to me that the obvious data structure would be a simple sorted vector.
Operation complexities:
Point update: O(1) + O(m) (where m is the number of subsequent elements equal to the value before the update).
Range query: O(log n) + O(m) (where n is the number of elements in the collection and m is the number of elements in the range).
Range update: same as a range query.
It's a little difficult to be sure what "fast" means to you, but the fastest theoretically possible for 1 is O(1), so we're already within some constant factor of optimal.
For 2 and 3, even if we could do the find with constant complexity, we're pretty much stuck with O(m) for the update. Since log2(100,000) ≈ 16.6, most of the time the O(m) term is going to dominate (i.e., the update part will involve at least as many operations as the search unless the given x is one of the last 17 items in the collection).
I doubt there's any point for this small a collection, but if you might have to deal with a substantially larger collection and the items are reasonably predictably distributed, it might be worth considering an interpolation search instead of a binary search. With a predictable distribution this reduces the expected number of comparisons to approximately O(log log n) - in this case, roughly 4 (but normally with a higher constant factor). This might be a win for 10^5 items, but then again it might not. If you might have to deal with a collection of (say) 10^8 items or more, it would be much more likely to be a substantial win.
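To make the sorted-vector idea concrete, here is a rough sketch (the names are mine, and it tracks only the multiset of values, which is all the two range operations need; mapping an original array position to its current value would need an extra index-to-value table, omitted here):

#include <algorithm>
#include <cstddef>
#include <iostream>
#include <vector>

struct SortedValues {
    std::vector<int> v;   // kept in non-decreasing order

    // Point update: bump one occurrence of old_value by 1, then slide it past
    // any equal elements to keep the vector sorted (the O(m) part).
    void increment_value(int old_value) {
        auto it = std::lower_bound(v.begin(), v.end(), old_value);
        if (it == v.end() || *it != old_value) return;   // value not present
        auto end_eq = std::upper_bound(it, v.end(), old_value);
        *it += 1;
        std::rotate(it, it + 1, end_eq);   // move the bumped element to the end of its run
    }

    // Range query: how many elements are >= x (binary search, O(log n)).
    std::size_t count_at_least(int x) const {
        return v.end() - std::lower_bound(v.begin(), v.end(), x);
    }

    // Range update: decrement every element >= x; the suffix stays sorted.
    void decrement_at_least(int x) {
        for (auto it = std::lower_bound(v.begin(), v.end(), x); it != v.end(); ++it)
            *it -= 1;
    }
};

int main() {
    SortedValues s{{1, 3, 3, 3, 7, 9}};
    s.increment_value(3);                          // one 3 becomes 4
    std::cout << s.count_at_least(4) << '\n';      // 3  (4, 7, 9)
    s.decrement_at_least(7);                       // 7 -> 6, 9 -> 8
    std::cout << s.count_at_least(7) << '\n';      // 1  (8)
}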
The following may not be optimal, but is the best I could think of tonight.
Let's start by trying to turn the problem sideways. Instead of a map from indices to values, let's consider a map from values to sets of indices. A point update now involves removing an index from one set and adding it to another. A range update involves either simply moving an index set from one value to another or taking the union of two index sets. A range query involves folding over the sets corresponding to the values in range. A quick peek at Wikipedia suggests a traditional disjoint-set data structure is really great for set unions. Unfortunately, it's no good at all for removing an element from a set.
Fortunately, there is a newer data structure supporting union-find with constant time deletion! That takes care of both point updates and range updates quite naturally. Range queries, unfortunately, will require checking all array elements, even if very few elements are in range.

Finding the repeated element

In an array with integers between 1 and 1,000,000 (or some even larger value), a single value occurs twice. How do you determine which one?
I think we can use a bitmap to mark the elements, and then traverse it again to find the repeated element. But I think that is a process with high complexity. Is there any better way?
This sounds like homework or an interview question ... so rather than giving away the answer, here's a hint.
What calculations can you do on a range of integers whose answer you can determine ahead of time?
Once you realize the answer to this, you should be able to figure it out .... if you still can't figure it out ... (and it's not homework) I'll post the solution :)
EDIT: Ok. So here's the elegant solution ... if the list contains ALL of the integers within the range.
We know that all of the values between 1 and N must exist in the list. Using Gauss's formula we can quickly compute the expected sum of a range of integers:
Sum(1..N) = 1/2 * (1 + N) * Count(1..N).
Since we know the expected sum, all we have to do is loop through all the values and add them up. The difference between this sum and the expected sum is the duplicate value.
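A minimal sketch of that calculation (the function name is mine), assuming the array holds every value 1..N plus exactly one duplicate:

#include <cstdint>
#include <iostream>
#include <vector>

// Assumes the array holds every integer 1..N plus one duplicate (N+1 elements).
int find_duplicate_by_sum(const std::vector<int>& a) {
    std::int64_t n = static_cast<std::int64_t>(a.size()) - 1;   // largest expected value
    std::int64_t expected = n * (n + 1) / 2;                    // Gauss: sum of 1..N
    std::int64_t actual = 0;
    for (int x : a) actual += x;
    return static_cast<int>(actual - expected);                 // the surplus is the duplicate
}

int main() {
    std::vector<int> a{1, 2, 3, 4, 3, 5};           // 3 appears twice
    std::cout << find_duplicate_by_sum(a) << '\n';  // prints 3
}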
EDIT: As others have commented, the question doesn't state that the range contains all of the integers ... in this case, you have to decide whether you want to optimize for memory or for time.
If you want to perform the operation using O(1) storage, you can perform an in-place sort of the list. As you're sorting, check adjacent elements. Once you see a duplicate, you know you can stop. Optimal sorting is an O(n log n) operation on average - which establishes an upper bound for finding the duplicate in this manner.
If you want to optimize for speed, you can use an additional O(n) of storage. Using a HashSet (or similar structure), insert values from your list until you find you are inserting a duplicate into the HashSet. Inserting n items into a HashSet is an O(n) operation on average, which establishes that as an upper bound for this method.
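And a sketch of the hashing variant, using std::unordered_set as the C++ counterpart of the HashSet mentioned above (the function name is mine):

#include <iostream>
#include <unordered_set>
#include <vector>

// Returns the first value seen twice, or -1 if there is no duplicate.
int find_duplicate_hashing(const std::vector<int>& a) {
    std::unordered_set<int> seen;
    for (int x : a) {
        if (!seen.insert(x).second)   // insert() reports whether x was already present
            return x;
    }
    return -1;
}

int main() {
    std::vector<int> a{8, 42, 17, 42, 5};
    std::cout << find_duplicate_hashing(a) << '\n';  // prints 42
}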
You may try to use bits as a hash map:
a 1 at position k means that number k has occurred before
a 0 at position k means that number k has not occurred before
pseudocode:
assume that your array is A
initialize a bit array of length 1,000,001 (indexable by the values 1..1,000,000; there is a nice class for this in C#), filled with zeros
for each num in A:
    if bitarray[num]:
        return num
    else:
        bitarray[num] = 1
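A C++ rendering of that pseudocode, using std::bitset in place of the C# BitArray; the size assumes the values lie between 1 and 1,000,000, as in the question:

#include <bitset>
#include <iostream>
#include <vector>

// Assumes all values lie in [1, 1000000], as in the question.
int find_duplicate_bitmap(const std::vector<int>& a) {
    static std::bitset<1000001> seen;   // bit k set <=> value k has occurred before
    seen.reset();
    for (int num : a) {
        if (seen[num]) return num;
        seen[num] = true;
    }
    return -1;   // no duplicate
}

int main() {
    std::vector<int> a{7, 123456, 99, 123456, 3};
    std::cout << find_duplicate_bitmap(a) << '\n';  // prints 123456
}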
The time complexity of the bitmap solution is O(n), and it doesn't seem like you could do better than that. However, it will take up a lot of memory for a generic list of numbers. Sorting the numbers is an obvious way to detect duplicates and doesn't require extra space if you don't mind the current order changing.
Assuming the array is of length n < N (i.e. not ALL integers are present -- in this case LBushkin's trick is the answer to this homework problem), there is no way to solve this problem using less than O(n) memory using an algorithm that just takes a single pass through the array. This is by reduction to the set disjointness problem.
Suppose I made the problem easier, and I promised you that the duplicate elements were placed in the array such that the first one was among the first n/2 elements and the second one was among the last n/2 elements. Now we can think of playing a game in which two people each hold a string of n/2 elements and want to know how many messages they have to send to be sure that none of their elements are the same. Since the first player can simulate the run of any algorithm that takes one pass through the array, and then send the contents of its memory to the second player, a lower bound on the number of messages they need to send implies a lower bound on the memory requirements of any such algorithm.
But it's easy to see that in this simple game they need to send n/2 messages to be sure that they don't hold any of the same elements, which yields the lower bound.
Edit: This generalizes to show that for algorithms that make k passes through the array and use memory m, we must have m*k = Omega(n). And it is easy to see that you can in fact trade off memory for time in this way.
Of course, if you are willing to use algorithms that don't simply take passes through the array, you can do better, as suggested already: sort the array, then take one pass through it. This takes time O(n log n) and space O(1). But note, curiously, that this proves that any sorting algorithm that only makes passes through the array must take time Omega(n^2)! Sorting algorithms that break the n^2 bound must make random accesses.
