Is this implementation of Bucket-Sort considered "in-place"? - algorithm

Consider the following implementation of bucket-sort:
Algorithm BucketSort(S)
input: Sequence S of items with integer keys in range [0,N-1]
output: Sequence S sorted in nondecreasing order of keys.
let B be an array of N sequences, each of which is initially empty
for each item x in S do
let k be the key of x
remove x from S and insert it at the end of bucket (sequence) B[k].
for i←0 to N-1 do
for each item x in sequence B[i] do
remove x from B[i] and insert it at the end of S.
Is this implementation considered "In-Place"?
My textbook gives the following definition for "In-Place":
Remember that a sorting algorithm is
in-place if it uses only a constant
amount of memory in addition to that
needed for the objects being sorted.
Now, I know the above algorithm uses O(n+N) memory, where N is the upper bound on the range. However, and I may be wrong, I think N would be a constant, even if it is a large one. So I'm guessing it is "in-place" per this definition but I'm unsure.
So given the above algorithm and definition of "in-place", is this implementation considered in-place?

The algorithm you have listed is decidedly not in-place.
You have another pointer (B) which must GROW to the same size as S, but is not in-place with S for any reason. Because of this you must have at least O(S) extra space. Just because you are removing the values from S does not keep you from still needing the same amount of space in another variable B.
Also just because the number of buckets might be constant doesn't mean you can forget that all elements in S must end up in a different place (B). Note the case where len(S) > N.
If you want to do an in-place sort, you need to keep all your elements in S and shuffle them around such that the constant extra space is a temporary holder for a swap routine and possibly some stack memory if you are using a recursive solution.

Bucketsort is definitely not an "in-place" sorting algorithm.
The whole idea is that elements sort themselves as they are moved to the buckets. In the worst of the good cases (sequential values, but no repetition) the additional space needed is as big as the original array.

The number of buckets N is only bounded (after mapping the keys) by the length n of the input sequence S because every item in S might have a different key. Hence the algorithm requires linear additional space and is in consequence not in-place. If the keys are not remapped, the additional space is unbound by n.

Related

data structure similar to array but supporting deletion

I am thinking of the following data structure question:
given integers between 1 and n in sorted order, every operation queries and then removes (in a single call) kth smallest number. How to make the query and removal both constant time operations?
It is similar to an array structure but requiring constant removing. Though an order balanced binary tree can do this, but it is O(lg n) complexity.
Can one take the advantage of the range property (numbers only between 1 and n) to make it work?
LinkedHashSet is what you are looking for . If you want index as in arrays then use this LinkedHashMap. But you need to insert them in order from 1 ton
What is the maximal value of N? You mentioned that you are going to work with positive numbers - Van Emde Boas tree probably the best choice for you.
Short description:
- allows to store only positive numbers from [0,2^k), where k is is a number of bits required to store maximal number N. - all operations (insert,delete,lookup,find_next,find_prev) works in log(K).Not log(N). So, for integer 32-bit numbers complexity is log(32)=5
- disadvantage is memory consumption. requires 2^k ~ O(N) memory, so for storing integers you need ~1GB RAM. Remember, that usually O(N) memory means O(number of elements) but here it means O(maximal stored value).
Note: I'm not sure about supporting k-th element query but description looks nice:
FindNext: find the key/value pair with the smallest key at least a
given k
FindPrevious: find the key/value pair with the largest key at most a
given k
UPDATE
As Dukeling mentioned below, K-th element query is not supported. I see the only way to implement it.
int x = getMin();
for(int i=0;i<k-1;i++) x = getNext(x);
after this loop x will store k-th element. But complexity is O(K*log(bits)). Too bad for large values of K(

Looking for a data container with O(1) indexing and O(log(n)) insertion and deletion

I'm not sure if it's possible but it seems a little bit reasonable to me, I'm looking for a data structure which allows me to do these operations:
insert an item with O(log n)
remove an item with O(log n)
find/edit the k'th-smallest element in O(1), for arbitrary k (O(1) indexing)
of course editing won't result in any change in the order of elements. and what makes it somehow possible is I'm going to insert elements one by one in increasing order. So if for example I try inserting for the fifth time, I'm sure all four elements before this one are smaller than it and all the elements after this this are going to be larger.
I don't know if the requested time complexities are possible for such a data container. But here is a couple of approaches, which almost achieve these complexities.
First one is tiered vector with O(1) insertion and indexing, but O(sqrt N) deletion. Since you expect only about 10000 elements in this container and sqrt(10000)/log(10000) = 7, you get almost the required performance here. Tiered vector is implemented as an array of ring-buffers, so deleting an element requires moving all elements, following it in the ring-buffer, and moving one element from each of the following ring-buffers to the one, preceding it; indexing in this container means indexing in the array of ring-buffers and then indexing inside the ring-buffer.
It is possible to create a different container, very similar to tiered vector, having exactly the same complexities, but working a little bit faster because it is more cache-friendly. Allocate a N-element array to store the values. And allocate a sqrt(N)-element array to store index corrections (initialized with zeros). I'll show how it works on the example of 100-element container. To delete element with index 56, move elements 57..60 to positions 56..59, then in the array of index corrections add 1 to elements 6..9. To find 84-th element, look up eighth element in the array of index corrections (its value is 1), then add its value to the index (84+1=85), then take 85-th element from the main array. After about half of elements in main array are deleted, it is necessary to compact the whole container to attain contiguous storage. This gets only O(1) cumulative complexity. For real-time applications this operation may be performed in several smaller steps.
This approach may be extended to a Trie of depth M, taking O(M) time for indexing, O(M*N1/M) time for deletion and O(1) time for insertion. Just allocate a N-element array to store the values, N(M-1)/M, N(M-2)/M, ..., N1/M-element arrays to store index corrections. To delete element 2345, move 4 elements in main array, increase 5 elements in the largest "corrections" array, increase 6 elements in the next one and 7 elements in the last one. To get element 5678 from this container, add to 5678 all corrections in elements 5, 56, 567 and use the result to index the main array. Choosing different values for 'M', you can balance the complexity between indexing and deletion operations. For example, for N=65000 you can choose M=4; so indexing requires only 4 memory accesses and deletion updates 4*16=64 memory locations.
I wanted to point out first that if k is really a random number, then it might be worth considering that the problem might be completely different: asking for the k-th smallest element, with k uniformly at random in the range of the available elements is basically... picking an element at random. And it can be done much differently.
Here I'm assuming you actually need to select for some specific, if arbitrary, k.
Given your strong pre-condition that your elements are inserted in order, there is a simple solution:
Since your elements are given in order, just add them one by one to an array; that is you have some (infinite) table T, and a cursor c, initially c := 1, when adding an element, do T[c] := x and c := c+1.
When you want to access the k-th smallest element, just look at T[k].
The problem, of course, is that as you delete elements, you create gaps in the table, such that element T[k] might not be the k-th smallest, but the j-th smallest with j <= k, because some cells before k are empty.
It then is enough to keep track of the elements which you have deleted, to know how many have been deleted that are smaller than k. How do you do this in time at most O(log n)? By using a range tree or a similar type of data structure. A range tree is a structure that lets you add integers and then query for all integers in between X and Y. So, whenever you delete an item, simply add it to the range tree; and when you are looking for the k-th smallest element, make a query for all integers between 0 and k that have been deleted; say that delta have been deleted, then the k-th element would be in T[k+delta].
There are two slight catches, which require some fixing:
The range tree returns the range in time O(log n), but to count the number of elements in the range, you must walk through each element in the range and so this adds a time O(D) where D is the number of deleted items in the range; to get rid of this, you must modify the range tree structure so as to keep track, at each node, of the number of distinct elements in the subtree. Maintaining this count will only cost O(log n) which doesn't impact the overall complexity, and it's a fairly trivial modification to do.
In truth, making just one query will not work. Indeed, if you get delta deleted elements in range 1 to k, then you need to make sure that there are no elements deleted in range k+1 to k+delta, and so on. The full algorithm would be something along the line of what is below.
KthSmallest(T,k) := {
a = 1; b = k; delta
do {
delta = deletedInRange(a, b)
a = b + 1
b = b + delta
while( delta > 0 )
return T[b]
}
The exact complexity of this operation depends on how exactly you make your deletions, but if your elements are deleted uniformly at random, then the number of iterations should be fairly small.
There is a Treelist (implementation for Java, with source code), which is O(lg n) for all three ops (insert, delete, index).
Actually, the accepted name for this data structure seems to be "order statistic tree". (Apart from indexing, it's also defined to support indexof(element) in O(lg n).)
By the way, O(1) is not considered much of an advantage over O(lg n). Such differences tend to be overwhelmed by the constant factor in practice. (Are you going to have 1e18 items in the data structure? If we set that as an upper bound, that's just equivalent to a constant factor of 60 or so.)
Look into heaps. Insert and removal should be O(log n) and peeking of the smallest element is O(1). Peeking or retrieval of the K'th element, however, will be O(log n) again.
EDITED: as amit stated, retrieval is more expensive than just peeking
This is probably not possible.
However, you can make certain changes in balanced binary trees to get kth element in O(log n).
Read more about it here : Wikipedia.
Indexible Skip lists might be able to do (close) what you want:
http://en.wikipedia.org/wiki/Skip_lists#Indexable_skiplist
However, there's a few caveats:
It's a probabilistic data structure. That means it's not necessarily going to be O(log N) for all operations
It's not going to be O(1) for indexing, just O(log N)
Depending on the speed of your RNG and also depending on how slow traversing pointers are, you'll likely get worse performance from this than just sticking with an array and dealing with the higher cost of removals.
Most likely, something along the lines of this is going to be the "best" you can do to achieve your goals.

Finding number of pairs of integers differing by a value

If we have an array of integers, then is there any efficient way other than O(n^2) by which one can find the number of pairs of integers which differ by a given value?
E.g for the array 4,2,6,7 the number of pairs of integers differing by 2 is 2 {(2,4),(4,6)}.
Thanks.
Create a set from your list. Create another set which has all the elements incremented by the delta. Intersect the two sets. These are the upper values of your pairs.
In Python:
>>> s = [4,2,6,7]
>>> d = 2
>>> s0 = set(s)
>>> sd = set(x+d for x in s0)
>>> set((x-d, x) for x in (s0 & sd))
set([(2, 4), (4, 6)])
Creating the sets is O(n). Intersecting the sets is also O(n), so this is a linear-time algorithm.
Store the elements in a multiset, implemented by a hash table. Then for each element n, check the number of occurences of n-2 in the multiset and sum them up. There is no need to check n+2 because that would cause you to count each pair twice.
The time efficiency is O(n) in the average case, and O(n*logn) or O(n^2) in the worst case (depending on the hash table implementation). It will be O(n*logn) if the multiset is implemented by a balanced tree.
Sort the array, then scan through with two pointers. Supposing the first one points to a, then step the second one forward until you've found where a+2 would be if it was present. Increment the total if it's there. Then increment the first pointer and repeat. At each step, the second pointer starts from the place it ended up on the previous step.
If duplicates are allowed in the array, then you need to remember how many duplicates the second one stepped over, so that you can add this number to the total if incrementing the first pointer yields the same integer again.
This is O(n log n) worst case (for the sort), since the scan is linear time.
It's O(n) worst case on the same basis that hashtable-based solutions for fixed-width integers can say that they're expected O(n) time, since sorting fixed-width integers can be done using radix sort in O(n). Which is actually faster is another matter -- hashtables are fast but might involve a lot of memory allocation (for nodes) and/or badly-localized memory access, depending on implementation.
Note that if the desired difference is 0 and all the elements in the array are identical, then the size of the output is O(n²), so the worst-case of any algorithm is necessarily O(n²). (On the other hand, average-case or expected-case behavior can be significantly better, as others have noted.)
Just hash the numbers in an array as you do in counting sort.Then take two variables, first pointing to index 0 and the other pointing to index 2(or index d in general case) initially.
Now check whether value at both indices are non-zero, if yes then increment the counter with larger of the two values else leave the counter unchanged as the pair does not exist. Now increment both the indices and continue until the second index reaches the end of the array.The total value of counter is the number of pairs with difference d.
Time complexity: O(n)
Space complexity: O(n)

3rd way to find a duplicate

Two common ways to detect duplicates in an array:
1) sort first, time complexity O (n log n), space complexity O (1)
2) hash set, time complexity O (n), space complexity O (n)
Is there a 3rd way to detect a duplicate?
Please do not answer brute force.
Yet another option is the Bloom filter. Complexity O(n), space varies (almost always less than a hash), but there is a possibility of false positives. There is a complex relationship between the size of the data set, the size of the filter, and the number of false positives that you can expect.
Bloom filters are often used as quick "sanity checks" before doing a more expensive duplicate check.
Depends on the information.
If you know the range of the numbers, like 1-1000, you can use a bit array.
Lets say the range is a...b
Make a bit array with (b-a) bits. initialize them to 0.
Loop through the array, when you get to the number x , change the bit in place x-a to 1.
If there's a 1 there already, you have a duplicate.
time complexity: O(n)
space complexity: (b-a) bits
Another method to find duplicates provided an n element array contains elements from 0 to n-1. <Find Duplicates>

Number of different elements in an array

Is it possible to compute the number of different elements in an array in linear time and constant space? Let us say it's an array of long integers, and you can not allocate an array of length sizeof(long).
P.S. Not homework, just curious. I've got a book that sort of implies that it is possible.
This is the Element uniqueness problem, for which the lower bound is Ω( n log n ), for comparison-based models. The obvious hashing or bucket sorting solution all requires linear space too, so I'm not sure this is possible.
You can't use constant space. You can use O(number of different elements) space; that's what a HashSet does.
You can use any sorting algorithm and count the number of different adjacent elements in the array.
I do not think this can be done in linear time. One algorithm to solve in O(n log n) requires first sorting the array (then the comparisons become trivial).
If you are guaranteed that the numbers in the array are bounded above and below, by say a and b, then you could allocate an array of size b - a, and use it to keep track of which numbers have been seen.
i.e., you would move through your input array take each number, and mark a true in your target array at that spot. You would increment a counter of distinct numbers only when you encounter a number whose position in your storage array is false.
Assuming we can partially destroy the input, here's an algorithm for n words of O(log n) bits.
Find the element of order sqrt(n) via linear-time selection. Partition the array using this element as a pivot (O(n)). Using brute force, count the number of different elements in the partition of length sqrt(n). (This is O(sqrt(n)^2) = O(n).) Now use an in-place radix sort on the rest, where each "digit" is log(sqrt(n)) = log(n)/2 bits and we use the first partition to store the digit counts.
If you consider streaming algorithms only ( http://en.wikipedia.org/wiki/Streaming_algorithm ), then it's impossible to get an exact answer with o(n) bits of storage via a communication complexity lower bound ( http://en.wikipedia.org/wiki/Communication_complexity ), but possible to approximate the answer using randomness and little space (Alon, Matias, and Szegedy).
This can be done with a bucket approach when assuming that there are only a constant number of different values. Make a flag for each value (still constant space). Traverse the list and flag the occured values. If you happen to flag an already flagged value, you've found a duplicate. You have to traverse the buckets for each element in the list. But that's still linear time.

Resources