Is partitioning easier than sorting?

This is a question that's been lingering in my mind for some time ...
Suppose I have a list of items and an equivalence relation on them, and comparing two items takes constant time.
I want to return a partition of the items, e.g. a list of linked lists, each containing all equivalent items.
One way of doing this is to extend the equivalence to an ordering on the items and order them (with a sorting algorithm); then all equivalent items will be adjacent.
But can it be done more efficiently than with sorting? Is the time complexity of this problem lower than that of sorting? If not, why not?

You seem to be asking two different questions at one go here.
1) If only equality checks are allowed, is partitioning easier than it would be with an ordering? The answer is no: you require Ω(n^2) comparisons to determine the partitioning in the worst case (for instance, when all items are distinct).
2) If an ordering is allowed, is partitioning easier than sorting? The answer is again no, because of the Element Distinctness Problem: even determining whether all objects are distinct requires Ω(n log n) comparisons. Since sorting can be done in O(n log n) time (and also has an Ω(n log n) lower bound) and solves the partition problem, the two problems are asymptotically equally hard.
If you pick an arbitrary hash function, equal objects need not have the same hash, in which case you haven't done any useful work by putting them in a hashtable.
Even if you do come up with such a hash (equal objects guaranteed to have the same hash), the expected time complexity is O(n) for a good hash, and the worst case is Ω(n^2).
Whether to use hashing or sorting completely depends on other constraints not available in the question.
The other answers also seem to be forgetting that your question is (mainly) about comparing partitioning and sorting!

If you can define a hash function for the items as well as an equivalence relation, then you should be able to do the partition in linear time -- assuming computing the hash is constant time. The hash function must map equivalent items to the same hash value.
Without a hash function, you would have to compare every new item to be inserted into the partitioned lists against the head of each existing list. The efficiency of that strategy depends on how many partitions there will eventually be.
Let's say you have 100 items, and they will eventually be partitioned into 3 lists. Then each item would have to be compared against at most 3 other items before inserting it into one of the lists.
However, if those 100 items would eventually be partitioned into 90 lists (i.e., very few equivalent items), it's a different story. Now your runtime is closer to quadratic than linear.
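To make this concrete, here is a minimal C++ sketch of the hash-based approach, assuming the supplied hash maps equivalent items to the same value (the names partition_by_hash, hash, and equiv are illustrative, not anything from the question). The inner loop only matters when non-equivalent items happen to share a hash value; with a good hash the whole pass runs in expected O(n).

    #include <cstddef>
    #include <list>
    #include <unordered_map>
    #include <vector>

    // Sketch: partition `items` into equivalence classes using an assumed hash
    // that gives equivalent items the same value, plus an equivalence predicate.
    template <typename T, typename Hash, typename Equiv>
    std::vector<std::list<T>> partition_by_hash(const std::vector<T>& items,
                                                Hash hash, Equiv equiv) {
        // Each hash value owns a group of classes (usually just one, unless
        // non-equivalent items happen to collide on the same hash).
        std::unordered_map<std::size_t, std::vector<std::list<T>>> buckets;
        for (const T& x : items) {
            auto& classes = buckets[hash(x)];
            bool placed = false;
            for (auto& cls : classes) {
                if (equiv(cls.front(), x)) { cls.push_back(x); placed = true; break; }
            }
            if (!placed) classes.push_back(std::list<T>{x});
        }
        std::vector<std::list<T>> result;
        for (auto& bucket : buckets)
            for (auto& cls : bucket.second)
                result.push_back(std::move(cls));
        return result;
    }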

If you don't care about the final ordering of the equivalence sets, then partitioning into equivalence sets could be quicker. However, it depends on the algorithm and the numbers of elements in each set.
If there are very few items in each set, then you might as well just sort the elements and then find the adjacent equal elements. A good sorting algorithm is O(n log n) for n elements.
If there are a few sets with lots of elements in each, then you can take each element and compare it to the existing sets. If it belongs to one of them, add it; otherwise create a new set. This will be O(n*m), where n is the number of elements and m is the number of equivalence sets, which is less than O(n log n) for large n and small m, but worse as m tends to n.
A combined sorting/partitioning algorithm may be quicker.
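A sketch of the "compare against one representative per set" strategy described above, assuming only an equivalence predicate is available (the names are illustrative). It is O(n*m) for n items and m equivalence sets, so it beats sorting only when m is small.

    #include <list>
    #include <vector>

    // Compare each item to the head of every existing set; start a new set if
    // no existing set matches. O(n * m) for n items and m sets.
    template <typename T, typename Equiv>
    std::vector<std::list<T>> partition_by_comparison(const std::vector<T>& items,
                                                      Equiv equiv) {
        std::vector<std::list<T>> sets;
        for (const T& x : items) {
            bool placed = false;
            for (auto& s : sets) {
                if (equiv(s.front(), x)) { s.push_back(x); placed = true; break; }
            }
            if (!placed) sets.push_back(std::list<T>{x});
        }
        return sets;
    }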

If a comparator must be used, then the lower bound is Ω(n log n) comparisons for sorting or partitioning. The reason is that all n elements must be inspected (Ω(n)), and a comparator must perform log n comparisons for each element to uniquely identify or place that element in relation to the others (each comparison divides the space in two, so for a space of size n, log n comparisons are needed).
If each element can be associated with a unique key which is derived in constant time, then the lower bound is Ω(n) for both sorting and partitioning (cf. radix sort).

Comparison-based sorting generally has a lower bound of Ω(n log n).
Assume you iterate over your set of items and put them into buckets of items with the same comparative value, for example into a set of lists (say, using a hash set). This operation is clearly O(n), even after retrieving the list of lists from the set.
--- EDIT: ---
This of course requires two assumptions:
There exists a constant time hash-algorithm for each element to be partitioned.
The number of buckets does not depend on the amount of input.
Thus, under these assumptions, partitioning can be done in O(n).

Partitioning is faster than sorting, in general, because you don't have to compare each element to each potentially-equivalent already-sorted element; you only have to compare it to the already-established keys of your partitioning. Take a close look at radix sort. The first step of radix sort is to partition the input based on some part of the key. Radix sort is O(kN). If your data set has keys bounded by a given length k, you can radix sort it in O(n). If your data are comparable but don't have a bounded key, yet you choose a bounded key with which to partition the set, sorting the set would be O(n log n) while the partitioning would be O(n).
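For illustration, here is a rough sketch of the "partition on a bounded key" idea: a single counting-sort style pass over items whose chosen key fits in one byte (the key_of extractor and the one-byte key are assumptions for the example). One such pass is O(n).

    #include <cstdint>
    #include <vector>

    // One radix-style pass: drop each item into the bucket for its one-byte key.
    template <typename T, typename KeyOf>
    std::vector<std::vector<T>> partition_by_byte_key(const std::vector<T>& items,
                                                      KeyOf key_of) {
        std::vector<std::vector<T>> buckets(256);   // one bucket per key value
        for (const T& x : items)
            buckets[static_cast<std::uint8_t>(key_of(x))].push_back(x);
        return buckets;                             // empty buckets can be ignored
    }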

This is a classic problem in data structures, and yes, it is easier than sorting. If you want to also quickly be able to look up which set each element belongs to, what you want is the disjoint set data structure, together with the union-find operation. See here: http://en.wikipedia.org/wiki/Disjoint-set_data_structure
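For reference, a minimal disjoint-set (union-find) sketch is below. One caveat: it assumes you already know which pairs of elements to merge; it does not by itself discover the equivalences from a pairwise comparison.

    #include <numeric>
    #include <utility>
    #include <vector>

    // Disjoint-set forest with path compression and union by size.
    struct DisjointSet {
        std::vector<int> parent, size;
        explicit DisjointSet(int n) : parent(n), size(n, 1) {
            std::iota(parent.begin(), parent.end(), 0);   // each element is its own set
        }
        int find(int x) {
            while (parent[x] != x) { parent[x] = parent[parent[x]]; x = parent[x]; }
            return x;
        }
        void unite(int a, int b) {
            a = find(a); b = find(b);
            if (a == b) return;
            if (size[a] < size[b]) std::swap(a, b);
            parent[b] = a;                                // attach smaller tree to larger
            size[a] += size[b];
        }
    };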

The time required to perform a possibly-imperfect partition using a hash function will be O(n+bucketcount) [not O(n*bucketcount)]. Making the bucket count large enough to avoid all collisions will be expensive, but if the hash function works at all well there should be a small number of distinct values in each bucket. If one can easily generate multiple statistically-independent hash functions, one could take each bucket whose keys don't all match the first one and use another hash function to partition the contents of that bucket.
Assuming a constant number of buckets on each step, the time is going to be O(N lg N), but if one sets the number of buckets to something like sqrt(N), the expected number of passes is O(1) and the work in each pass is O(N).

Related

Time Complexity of Hash Map Traversal

What are the best, average, and worst case time complexities for traversing a hash map, under the assumption that the hash map uses chaining with linked lists?
I've read multiple times that the time complexity is O(m+n) for traversal for all three cases (m=number of buckets, n=number of elements). However, this differs from my time complexity analysis: In the worst case all elements are linearly chained in the last bucket which leads to a time complexity of O(m+n). In the best case no hash collisions happen and therefore time complexity should be O(m). In the average case I assume that the elements are uniformly distributed, i.e. each bucket on average has n/m elements. This leads to a time complexity of O(m * n/m) = O(n). Is my analysis wrong?
In practice, a good implementation can always achieve O(n). GCC's C++ Standard Library implementation for the hash table containers unordered_map and unordered_set, for example, maintains a forward/singly linked list between the elements inserted into the hash table, wherein elements that currently hash to the same bucket are grouped together in the list. Hash table buckets contain iterators into the singly-linked list for the point where the element before that bucket's colliding elements start (so if erasing an element, the previous link can be rewired to skip over it).
During traversal, only the singly-linked list need be consulted - the hash table buckets are not visited. This becomes especially important when the load factor is very low (many elements were inserted, then many were erased, but in C++ the table never reduces its size, so you can end up with a very low load factor).
If instead you have a hash table implementation where each bucket literally maintains a head pointer for its own linked list, then the kind of analysis you attempted comes into play.
You're right about worst case complexity.
In the best case no hash collisions happen and therefore time complexity should be O(m).
It depends. In C++ for example, values/elements are never stored in the hash table buckets (which would waste a huge amount of memory if the values were large in size and many buckets were empty). If instead the buckets contain the "head" pointer/iterator for the list of colliding elements, then even if there's no collision at a bucket, you still have to follow the pointer to a distinct memory area - that's just as bothersome as following a pointer between nodes on the same linked list, and is therefore normally included in the complexity calculation, so it's still O(m + n).
In the average case I assume that the elements are uniformly
distributed, i.e. each bucket on average has n/m elements.
No... elements being uniformly distributed across buckets is the best case for a hash table: see above. An "average" or typical case is where there's more variation in the number of elements hashing to any given bucket. For example, if you have 1 million buckets and 1 million values and a cryptographic-strength hash function, you can statistically expect 1/e (~36.8%) of the buckets to be empty, 1/1!e (also ~36.8%) to have exactly 1 element, 1/2!e (~18.4%) to have 2 colliding elements, 1/3!e (~6.1%) to have 3 colliding elements, and so on (the "!" is for factorial...).
Anyway, the key point is that a naive bucket-visiting hash table traversal (as distinct from actually being able to traverse a list of elements without bucket-visiting), always has to visit all the buckets, then if you imagine each element being tacked onto a bucket somewhere, there's always one extra link to traverse to reach it. Hence O(m+n).
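A small C++ illustration of the two traversal styles discussed here, using std::unordered_map's standard bucket interface (bucket_count and the local begin(b)/end(b) iterators); the values are arbitrary. The second loop is the naive bucket-visiting traversal that costs O(m + n).

    #include <cstddef>
    #include <iostream>
    #include <unordered_map>

    int main() {
        std::unordered_map<int, int> table = {{1, 10}, {2, 20}, {3, 30}};

        // Ordinary element traversal: follows the container's iterators.
        for (const auto& kv : table)
            std::cout << kv.first << " -> " << kv.second << '\n';

        // Naive bucket-visiting traversal: touches every bucket, then every
        // element chained in it, hence O(m + n).
        for (std::size_t b = 0; b < table.bucket_count(); ++b)
            for (auto it = table.begin(b); it != table.end(b); ++it)
                std::cout << "bucket " << b << ": " << it->first << '\n';
    }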

Data structure with fast sorted insertion, sorted deletion and lookup

I'm looking for a very specific data structure. Suppose the maximum number of elements is known. All elements are integers. Duplicates are allowed.
The operations are:
Look-up. If I inserted n elements, a[0] is the smallest element, a[a.length - 1] is the largest, and a[k] is the k-th smallest element. Required runtime: O(1)
Insertion. Does a sorted insertion, insert(b) where b is an integer. Required runtime: O(log n)
Deletion. delete(i) deletes the ith element. Required runtime: O(log n)
What kind of data structure is this? My question is language-independent, but I'm coding in C++.
I believe such a data structure does not exist. Constant-time lookup for any element (e.g. indexing) requires contiguous memory, which makes insertion impossible to do in less than O(n) if you want to keep the range sorted.
There are arguments for hash tables and wrappers around hash tables, but there are two things to keep in mind when mentioning them:
Hash tables have average access (insertion, deletion, find) in O(1), but that assumes very few hash collisions. If you wish to meet the requirements in regard to pessimistic complexities, hash tables are out of the question since their pessimistic access time is O(n).
Hash tables are, by their nature, unordered. They, most of the time, do have internal arrays for storing (for example) buckets of data, but the elements themselves are neither all in contiguous memory nor ordered by some property (other than maybe modulo of their hash, which itself should produce very different values for similar objects).
To not leave you empty-handed - if you wish to get as close to your requirements as possible, you need to specify which complexities you would sacrifice to achieve the others. I'd like to propose either std::priority_queue or std::multiset.
std::priority_queue provides access only to the top element - it is guaranteed to be either the smallest or the greatest (depending on the comparator you use to specify the relation) element in the collection. Insertion and deletion are both achieved in O(log n) time.
std::multiset* provides access to every element inside it, but at a higher cost - O(log n). It also achieves O(log n) insertion and deletion.
*Careful - do not confuse with std::set, which does not allow for duplicate elements.
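A short usage sketch of both suggestions (the values are arbitrary). Neither container offers O(1) access to the k-th smallest element; that is the requirement being traded away.

    #include <functional>
    #include <iostream>
    #include <queue>
    #include <set>
    #include <vector>

    int main() {
        // Min-heap: only the smallest element is accessible.
        std::priority_queue<int, std::vector<int>, std::greater<int>> pq;
        pq.push(5); pq.push(1); pq.push(3);   // O(log n) each
        std::cout << pq.top() << '\n';        // smallest element, O(1)
        pq.pop();                             // O(log n)

        // Multiset: every element accessible in O(log n), duplicates allowed.
        std::multiset<int> ms = {5, 1, 3, 3};
        ms.insert(4);                         // O(log n)
        std::cout << *ms.begin() << ' '       // smallest
                  << *ms.rbegin() << '\n';    // largest
        ms.erase(ms.find(3));                 // erase one copy of 3, O(log n)
    }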

Does a multiset or bag offer better time complexity than a simple hash that keeps the counts of keys?

Java has multiset, and SmallTalk has a Bag class, and they are said to be the same function: keep a set of values but allow multiple values (with a count for each).
But it seems like a multiset can be implemented by a hash table that keeps counts on the keys (or maybe there are other possible implementations).
Do some implementations of multisets or bags offer better time complexity than the hash approach above? The required operations are
insert item
delete item
get the minimum value in the set
get the maximum value in the set
In particular, I am hoping that each of the 4 operations above can be done in better than O(n), possibly all in O(log n) or better. (Operations 3 and 4 above can only be expected to be O(log n) if the multiset is maintained in some sorted way whenever a value is added or removed.)
It depends on whether the multiset is implemented with a hashtable or a sorted set.
For a hashtable, the insert/update operations are O(1) (min and max are O(n)), but the downside is that a hashtable doesn't keep the elements in order. Furthermore, this requires that for each element to be stored, it is possible to compute a consistent hash value.
For a sorted set, all operations are O(log n), but with the upside that the elements are kept in sorted order (hence, it is efficient in reporting ranges of elements). This requires that each element to be stored is comparable (i.e. an ordering exists over the elements).
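To illustrate the trade-off, here is a minimal sketch of both count-based backends in C++ (the variable names are arbitrary): a hash map gives expected O(1) insert/delete but O(n) min/max, while an ordered map gives O(log n) for everything, with min/max at its ends.

    #include <iostream>
    #include <map>
    #include <unordered_map>

    int main() {
        // Hash-backed bag: value -> count. Insert/delete expected O(1),
        // but finding min/max means scanning every entry: O(n).
        std::unordered_map<int, int> hashBag;
        ++hashBag[7]; ++hashBag[7]; ++hashBag[2];

        // Tree-backed bag: value -> count, kept in sorted order.
        std::map<int, int> sortedBag;
        ++sortedBag[7]; ++sortedBag[7]; ++sortedBag[2];          // O(log n) each
        std::cout << "min " << sortedBag.begin()->first
                  << ", max " << sortedBag.rbegin()->first << '\n';
        if (--sortedBag[7] == 0) sortedBag.erase(7);             // delete one occurrence
    }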

Data structure to build and lookup set of integer ranges

I have a set of uint32 integers; there may be millions of items in the set. 50-70% of them are consecutive, but in the input stream they appear in unpredictable order.
I need to:
Compress this set into ranges to achieve a space-efficient representation. I have already implemented this using a trivial algorithm; since the ranges are computed only once, speed is not important here. After this transformation the number of resulting ranges is typically within 5 000-10 000, and many of them are single-item, of course.
Test membership of some integer; information about the specific range in the set is not required. This one must be very fast -- O(1). I was thinking about minimal perfect hash functions, but they do not play well with ranges. Bitsets are very space-inefficient. Other structures, like binary trees, have O(log n) complexity; the worst thing about them is that the implementation makes many conditional jumps, which the processor cannot predict well, giving poor performance.
Is there any data structure or algorithm specialized in integer ranges to solve this task?
Regarding the second issue:
You could look into Bloom Filters. Bloom Filters are specifically designed to answer the membership question in O(1), though the response is either no or maybe (which is not as clear cut as a yes/no :p).
In the maybe case, of course, you need further processing to actually answer the question (unless a probabilistic answer is sufficient in your case), but even so the Bloom Filter may act as a gate keeper, and reject most of the queries outright.
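As a rough idea of what such a gate keeper looks like, here is a toy Bloom filter sketch for uint32 keys; the sizes and the double-hashing scheme are illustrative choices, not tuned values.

    #include <cstddef>
    #include <cstdint>
    #include <functional>
    #include <vector>

    // maybe_contains can return false positives but never false negatives.
    class BloomFilter {
    public:
        BloomFilter(std::size_t bits, int hashes) : bits_(bits), k_(hashes) {}
        void insert(std::uint32_t x) {
            for (int i = 0; i < k_; ++i) bits_[index(x, i)] = true;
        }
        bool maybe_contains(std::uint32_t x) const {
            for (int i = 0; i < k_; ++i)
                if (!bits_[index(x, i)]) return false;   // definitely absent
            return true;                                 // present, or a false positive
        }
    private:
        std::size_t index(std::uint32_t x, int i) const {
            // Double hashing (h1 + i*h2) to derive k hash functions from two.
            std::size_t h1 = std::hash<std::uint32_t>{}(x);
            std::size_t h2 = std::hash<std::uint64_t>{}(
                static_cast<std::uint64_t>(x) * 0x9E3779B97F4A7C15ULL);
            return (h1 + static_cast<std::size_t>(i) * (h2 | 1)) % bits_.size();
        }
        std::vector<bool> bits_;
        int k_;
    };
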
Also, you might want to keep actual ranges and degenerate ranges (single elements) in different structures.
single elements may be best stored in a hash-table
actual ranges can be stored in a sorted array
This diminishes the number of elements stored in the sorted array, and thus the complexity of the binary search performed there. Since you state that many ranges are degenerate, I take it that you only have some 500-1000 real ranges (i.e., an order of magnitude fewer), and log(1000) ~ 10
I would therefore suggest the following steps:
Bloom Filter: if no, stop
Sorted Array of real ranges: if yes, stop
Hash Table of single elements
The Sorted Array test is performed first because, from the numbers you give (millions of numbers coalesced into a few thousand ranges), if a number is contained, chances are it'll be in a range rather than being single :)
One last note: beware of O(1). While it may seem appealing, you are not in an asymptotic case here. Barely 5000-10000 ranges is few, as log(10000) is something like 13. So don't pessimize your implementation by picking an O(1) solution with such a high constant factor that it actually runs slower than an O(log N) solution :)
If you know in advance what the ranges are, then you can check whether a given integer is present in one of the ranges in O(lg n) using the strategy outlined below. It's not O(1), but it's still quite fast in practice.
The idea behind this approach is that if you've merged all of the ranges together, you have a collection of disjoint ranges on the number line. From there, you can define an ordering on those intervals by saying that the interval [a, b] ≤ [c, d] iff b ≤ c. This is a total ordering because all of the ranges are disjoint. You can thus put all of the intervals together into a static array and then sort them by this ordering. This means that the leftmost interval is in the first slot of the array, and the rightmost interval is in the rightmost slot. This construction takes O(n lg n) time.
To check whether some interval contains a given integer, you can do a binary search on this array. Starting at the middle interval, check if the integer is contained in that interval. If so, you're done. Otherwise, if the value is less than the smallest value in the range, continue the search on the left, and if the value is greater than the largest value in the range, continue the search on the right. This is essentially a standard binary search, and it should run in O(lg n) time.
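A possible C++ rendering of that binary search over the sorted, disjoint ranges (Range and contains are illustrative names):

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    struct Range { std::uint32_t lo, hi; };   // inclusive bounds, ranges disjoint

    // Find the last range whose lower bound is <= x, then check its upper bound.
    bool contains(const std::vector<Range>& ranges, std::uint32_t x) {
        auto it = std::upper_bound(ranges.begin(), ranges.end(), x,
            [](std::uint32_t v, const Range& r) { return v < r.lo; });
        if (it == ranges.begin()) return false;   // x is below every range
        --it;
        return x >= it->lo && x <= it->hi;        // O(lg n) per query
    }
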
Hope this helps!
AFAIK there is no such algorithm that searches over an integer list in O(1).
One can only do O(1) search with a vast amount of memory.
So it is not very promising to try to find an O(1) search algorithm over a list of integer ranges.
On the other hand, you could try a time/memory trade-off approach by carefully examining your data set (eventually building a kind of hash table).
You can use y-fast trees or van Emde Boas trees to achieve O(lg w) time queries, where w is the number of bits in a word, and you can use fusion trees to achieve O(lg_w n) time queries. The optimal tradeoff in terms of n is O(sqrt(lg(n))).
The easiest of these to implement is probably y-fast trees. They are probably faster than doing binary search, though they require roughly lg w = lg 32 = 5 hash table queries, while binary search requires roughly lg n = lg 10000 ≈ 13 comparisons, so binary search may be faster.
Rather than 'comparison'-based storage/retrieval (which will always be O(log n)), you need to work on 'radix'-based storage/retrieval.
In other words, extract nibbles from the uint32 and build a trie.
Keep your ranges into a sorted array and use binary search for lookups.
It's easy to implement, O(log N), uses less memory, and needs fewer memory accesses than any tree-based approach, so it will probably also be much faster.
From the description of your problem it sounds like the following might be a good compromise. I've described it using an object-oriented language, but it is easily convertible to C using a union type or a structure with a type member and a pointer.
Use the first 16 bits to index an array of objects (of size 65536). In that array there are 5 possible kinds of object:
a NONE object means no elements beginning with those 16 bits are in the set
an ALL object means all elements beginning with those 16 bits are in the set
a RANGE object means all items with the final 16 bits between an upper and lower bound are in the set
a SINGLE object means just one element beginning with those 16 bits is in the set
a BITSET object handles all remaining cases with a 65536-bit bitset
Of course, you don't need to split at 16 bits; you can adjust it to reflect the statistics of your set. In fact you don't need to use consecutive bits, but consecutive bits speed up the bit twiddling, and if many of your elements are consecutive, as you claim, they will give good properties.
Hopefully this makes sense; please comment if I need to explain more fully. Effectively you've combined a depth-2 tree with ranges and a bitset for a space/time trade-off. If you need to save memory then make the tree deeper, with a corresponding slight increase in lookup time.
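A rough C++ sketch of the lookup side of this structure, indexed by the top 16 bits (the node kinds and names are illustrative, and building the table from the input ranges is omitted; the bitset case is held by pointer so the common cases stay small):

    #include <array>
    #include <bitset>
    #include <cstdint>
    #include <memory>
    #include <variant>

    struct None {};                                  // no elements with this prefix
    struct All {};                                   // every element with this prefix
    struct Single { std::uint16_t value; };          // exactly one element
    struct Span { std::uint16_t lo, hi; };           // one contiguous range (inclusive)
    using BitsPtr = std::unique_ptr<std::bitset<65536>>;   // everything else

    using Node = std::variant<None, All, Single, Span, BitsPtr>;

    bool contains(const std::array<Node, 65536>& table, std::uint32_t x) {
        const Node& node = table[x >> 16];
        const std::uint16_t low = static_cast<std::uint16_t>(x & 0xFFFF);
        if (std::holds_alternative<None>(node)) return false;
        if (std::holds_alternative<All>(node))  return true;
        if (auto* s = std::get_if<Single>(&node)) return low == s->value;
        if (auto* r = std::get_if<Span>(&node))   return low >= r->lo && low <= r->hi;
        return (*std::get<BitsPtr>(node))[low];
    }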

Number of different elements in an array

Is it possible to compute the number of different elements in an array in linear time and constant space? Let us say it's an array of long integers, and you can not allocate an array of length sizeof(long).
P.S. Not homework, just curious. I've got a book that sort of implies that it is possible.
This is the Element Uniqueness problem, for which the lower bound is Ω(n log n) in comparison-based models. The obvious hashing or bucket-sorting solutions all require linear space too, so I'm not sure this is possible.
You can't use constant space. You can use O(number of different elements) space; that's what a HashSet does.
You can use any sorting algorithm and count the number of different adjacent elements in the array.
I do not think this can be done in linear time. One algorithm to solve in O(n log n) requires first sorting the array (then the comparisons become trivial).
If you are guaranteed that the numbers in the array are bounded above and below, by say a and b, then you could allocate an array of size b - a, and use it to keep track of which numbers have been seen.
i.e., you would move through your input array, take each number, and mark true in your target array at that spot. You would increment a counter of distinct numbers only when you encounter a number whose position in your storage array is still false.
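A small sketch of this bounded-range approach, where lo and hi are the assumed known bounds on the input values; it uses O(hi - lo) extra space and O(n) time.

    #include <cstddef>
    #include <vector>

    // Count distinct values in `input`, all assumed to lie in [lo, hi].
    std::size_t count_distinct(const std::vector<long>& input, long lo, long hi) {
        std::vector<bool> seen(static_cast<std::size_t>(hi - lo + 1), false);
        std::size_t distinct = 0;
        for (long v : input) {
            std::size_t slot = static_cast<std::size_t>(v - lo);
            if (!seen[slot]) { seen[slot] = true; ++distinct; }
        }
        return distinct;
    }
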
Assuming we can partially destroy the input, here's an algorithm for n words of O(log n) bits.
Find the element of order sqrt(n) via linear-time selection. Partition the array using this element as a pivot (O(n)). Using brute force, count the number of different elements in the partition of length sqrt(n). (This is O(sqrt(n)^2) = O(n).) Now use an in-place radix sort on the rest, where each "digit" is log(sqrt(n)) = log(n)/2 bits and we use the first partition to store the digit counts.
If you consider streaming algorithms only ( http://en.wikipedia.org/wiki/Streaming_algorithm ), then it's impossible to get an exact answer with o(n) bits of storage via a communication complexity lower bound ( http://en.wikipedia.org/wiki/Communication_complexity ), but possible to approximate the answer using randomness and little space (Alon, Matias, and Szegedy).
This can be done with a bucket approach if you assume that there are only a constant number of different values. Make a flag for each value (still constant space). Traverse the list and flag the occurred values. If you happen to flag an already-flagged value, you've found a duplicate. You have to check the buckets for each element in the list, but that's still linear time.
