Batch updating node priorities in a binary heap? - data-structures

I posted quite confusing question, so I rewrote it from scratch...
This is actually purely theoretical question.
Say, we have binary heap. Let the heap be a MaxHeap, so root node has the biggest value and every node has bigger value than it's children. We can do some common low-level operations on this heap: "Swap two nodes", "compare two nodes".
Using those low-level operation, we can implement usual higher level recursive operations: "sift-up", "sift-down".
Using those sift-up and sift-downs, we can implement "insert", "repair" and "update". I am interested in the "update" function. Let's assume that I already have the position of the node to be changed. Therefore, update function is very simple:
function update (node_position, new_value){
heap[node_position] = new_value;
sift_up(node_position);
sift_down(node_position);
}
My question is: Is it (mathematicaly) possible, to make more advanced "update" function, that could update more nodes at once, in a way, that all nodes change their values to new_values, and after that, their position is corrected? Something like this:
function double_update (node1_pos, node2_pos, node1_newVal, node2_newVal){
heap[node1_pos] = node1_newVal;
heap[node2_pos] = node2_newVal;
sift_up(node1_position);
sift_down(node1_position);
sift_up(node2_position);
sift_down(node2_position);
}
I did some tests this with this "double_update" and it worked, although it doesn't prove anything.
What about "triple updates", and so on...
I did some other tests with "multi updates", where I changed values of all nodes and then called { sift-up(); sift-down(); } once for each of them in random order. This didn't work, but the result wasn't far from correct.
I know this doesn't sound useful, but I am interested in the theory behind it. And if I make it work, I actually do have one use for it.

It's definitely possible to do this, but if you're planning on changing a large number of keys in a binary heap, you might want to look at other heap structures like the Fibonacci heap or the pairing heap which can do this much faster than the binary heap. Changing k keys in a binary heap with n nodes takes O(k log n) time, while in a Fibonacci heap it takes time O(k). This is asymptotically optimal, since you can't even touch k nodes without doing at least Ω(k) work.
Another thing to consider is that if you change more than Ω(n / log n) keys at once, you are going to do at least Ω(n) work. In that case, it's probably faster to implement updates by just rebuilding the heap from scratch in Θ(n) time using the standard heapify algorithm.
Hope this helps!

Here's a trick and possibly funky algorithm, for some definition of funky:
(Lots of stuff left out, just to give the idea):
template<typename T> class pseudoHeap {
private:
using iterator = typename vector<T>::iterator;
iterator max_node;
vector<T> heap;
bool heapified;
void find_max() {
max_node = std::max_element(heap.begin(), heap.end());
}
public:
void update(iterator node, T new_val) {
if (node == max_node) {
if (new_val < *max_node) {
heapified = false;
*max_node = new_val;
find_max();
} else {
*max_node = new_val;
}
} else {
if (new_val > *max_node) max_node = new_val;
*node = new_val;
heapified = false;
}
T& front() { return &*max_node; }
void pop_front() {
if (!heapified) {
std::iter_swap(vector.end() - 1, max_node);
std::make_heap(vector.begin(), vector.end() - 1);
heapified = true;
} else {
std::pop_heap(vector.begin(), vector.end());
}
}
};
Keeping a heap is expensive. If you do n updates before you start popping the heap, you've done the same amount of work as just sorting the vector when you need it to be sorted (O(n log n)). If it's useful to know what the maximum value is all the time, then there is some reason to keep a heap, but if the maximum value is no more likely to be modified than any other value, then you can keep the maximum value always handy at amortized cost O(1) (that is, 1/n times it costs O(n) and the rest of the time it's O(1). That's what the above code does, but it might be even better to be lazy about computing the max as well, making front() amortized O(1) instead of constant O(1). Depends on your requirements.
As yet another alternative, if the modifications normally don't cause the values to move very far, just do a simple "find the new home and rotate the subvector" loop, which although it's O(n) instead of O(log n), is still faster on short moves because the constant is smaller.
In other words, don't use priority heaps unless you're constantly required to find the top k values. When there are lots of modifications between reads, there is usually a better approach.

Related

Design of a data structure that can search over objects with 2 attributes

I'm trying to think of a way to desing a data structure that I can efficiently insert to, remove from and search in it.
The catch is that the search function is getting a similar object as input, with 2 attributes, and I need to find an object in my dataset, such that both the 1st and 2nd of the object in my dataset are equal to or bigger than the one in search function's input.
So for example, if I send as input, the following object:
object[a] = 9; object[b] = 14
Then a valid found object could be:
object[a] = 9; object[b] = 79
but not:
object[a] = 8; object[b] = 28
Is there anyway to store the data such that the search complexity is better than linear?
EDIT:
I forgot to include in my original question. The search has to return the smallest possible object in the dataset, by multipication of the 2 attributes.
Meaning that the value of object[a]*object[b] of an object that fits the original condition, is smaller than any other object in the dataset that also fits.
You may want to use k-d tree data structure, which is typically use to index k dimensional points. The search operation, like what you perform, requires O(log n) in average.
This post may help when attributes are hierarchically linked like name, forename. For point in a 2D space k-d tree is more adapted as explain by fajarkoe.
class Person {
string name;
string forename;
... other non key attributes
}
You have to write a comparator function which take two objects of class X as input and returns -1, 0 or +1 for <, = and > cases.
Libraries like glibc(), with qsort() and bsearch or more higher languages like Java and its java.util.Comparator class and java.util.SortedMap (implementation java.util.TreeMap) as containers use comparators.
Other languages use equivalent concept.
The comparator method may be wrote followin your spec like:
int compare( Person left, Person right ) {
if( left.name < right.name ) {
return -1;
}
if( left.name > right.name ) {
return +1;
}
if( left.forename < right.forename ) {
return -1;
}
if( left.forename > right.forename ) {
return +1;
}
return 0;
}
Complexity of qsort()
Quicksort, or partition-exchange sort, is a sorting algorithm
developed by Tony Hoare that, on average, makes O(n log n) comparisons
to sort n items. In the worst case, it makes O(n2) comparisons, though
this behavior is rare. Quicksort is often faster in practice than
other O(n log n) algorithms.1 Additionally, quicksort's sequential
and localized memory references work well with a cache. Quicksort is a
comparison sort and, in efficient implementations, is not a stable
sort. Quicksort can be implemented with an in-place partitioning
algorithm, so the entire sort can be done with only O(log n)
additional space used by the stack during the recursion.2
Complexity of bsearch()
If the list to be searched contains more than a few items (a dozen,
say) a binary search will require far fewer comparisons than a linear
search, but it imposes the requirement that the list be sorted.
Similarly, a hash search can be faster than a binary search but
imposes still greater requirements. If the contents of the array are
modified between searches, maintaining these requirements may even
take more time than the searches. And if it is known that some items
will be searched for much more often than others, and it can be
arranged so that these items are at the start of the list, then a
linear search may be the best.

Why is the following two duplicate finder algorithms have different time complexity?

I was reading this question. The selected answer contains the following two algorithms. I couldn't understand why the first one's time complexity is O(ln(n)). At the worst case, if the array don't contain any duplicates it will loop n times so does the second one. Am I wrong or am I missing something? Thank you
1) A faster (in the limit) way
Here's a hash based approach. You gotta pay for the autoboxing, but it's O(ln(n)) instead of O(n2). An enterprising soul would go find a primitive int-based hash set (Apache or Google Collections has such a thing, methinks.)
boolean duplicates(final int[] zipcodelist)
{
Set<Integer> lump = new HashSet<Integer>();
for (int i : zipcodelist)
{
if (lump.contains(i)) return true;
lump.add(i);
}
return false;
}
2)Bow to HuyLe
See HuyLe's answer for a more or less O(n) solution, which I think needs a couple of add'l steps:
static boolean duplicates(final int[] zipcodelist) {
final int MAXZIP = 99999;
boolean[] bitmap = new boolean[MAXZIP+1];
java.util.Arrays.fill(bitmap, false);
for (int item : zipcodeList)
if (!bitmap[item]) bitmap[item] = true;
else return true;
}
return false;
}
The first solution should have expected complexity of O(n), since the whole zip code list must be traversed, and processing each zip code is O(1) expected time complexity.
Even taking into consideration that insertion into HashMap may trigger a re-hash, the complexity is still O(1). This is a bit of non sequitur, since there may be no relation between Java HashMap and the assumption in the link, but it is there to show that it is possible.
From HashSet documentation:
This class offers constant time performance for the basic operations (add, remove, contains and size), assuming the hash function disperses the elements properly among the buckets.
It's the same for the second solution, which is correctly analyzed: O(n).
(Just an off-topic note, BitSet is faster than array, as seen in the original post, since 8 booleans are packed into 1 byte, which uses less memory).

Efficient datastructure for pooling integers

I'm looking for a data structure to help me manage a pool of integers. It's a pool in that I remove integers from the pool for a short while then put them back with the expectation that they will be used again. It has some other odd constraints however, so a regular pool doesn't work well.
Hard requirements:
constant time access to what the largest in use integer is.
the sparseness of the integers needs to be bounded (even if only in principal).
I want the integers to be close to each other so I can quickly iterate over them with minimal unused integers in the range.
Use these if they help with selecting a data structure, otherwise ignore them:
Integers in the pool are 0 based and contiguous.
The pool can be constant sized.
Integers from the pool are only used for short periods with a high churn rate.
I have a working solution but it feels inelegant.
My (sub-optimal) Solution
Constant sized pool.
Put all available integers into a sorted set (free_set).
When a new integer is requested retrieve the smallest from the free_set.
Put all in-use integers into another sorted set (used_set).
When the largest is requested, retrieve the largest from the used_set.
There are a few optimization that may help with my particular solution (priority queue, memoization, etc). But my whole approach seems wasteful.
I'm hoping there some esoteric data structure that fits my problem perfectly. Or at least a better pooling algorithm.
pseudo class:
class IntegerPool {
int size = 0;
Set<int> free_set = new Set<int>();
public int Acquire() {
if(!free_set.IsEmpty()) {
return free_set.RemoveSmallest();
} else {
return size++;
}
}
public void Release(int i) {
if(i == size - 1) {
size--;
} else {
free_set.Add(i);
}
}
public int GetLargestUsedInteger() {
return size;
}
}
Edit
RemoveSmallest isn't useful as all. RemoveWhatever is good enough. So Set<int> can be replaced by LinkedList<int> as a faster alternative (or even Stack<int>).
Why not use a balanced binary search tree? You can store a pointer/iterator to the min element and access it for free, and updating it after an insert/delete is an O(1) operation. If you use a self balancing tree, insert/delete is O(log(n)). To elaborate:
insert : Just compare new element to previous min; if it is better make the iterator point to the new min.
delete : If min was deleted, then before removing find the successor (which you can do by just walking the iterator forward 1 step), and then take that guy to be the new min.
While it is theoretically possible to do slightly better using some kind of sophisticated uber-heap data structure (ie Fibonacci heaps), in practice I don't think you would want to deal with implementing something like that just to save a small log factor. Also, as a bonus you get fast in-order traversal for free -- not to mention that most programming languages these days^ come with fast implementations of self-balancing binary search trees out of the box (like red-black trees/avl etc.).
^ with the exception of javascript :P
EDIT: Thought of an even better answer.

O(1) Make, Find, Union in Disjoint Sets Data Structure

Today, I had discussion with someone about Kruskal Minimum Spanning Tree algorithm because of page 13 of this slide.
The author of the presentation said that if we implement disjoint sets using (doubly) linked list, the performance for Make and Find will be O(1) and O(1) respectively. The time for operation Union(u,v) is min(nu,nv), where nu and nv are the sizes of the sets storing u and v.
I said that we can improve the time for the Union(u,v) to be O(1) by making the representation pointer of each member pointing a locator that contains the pointer to the real representation of the set.
In Java, the data structure would look like this :
class DisjointSet {
LinkedList<Vertex> list = new LinkedList<Vertex>(); // for holding the members, we might need it for print
static Member makeSet(Vertex v) {
Member m = new Member();
DisjointSet set = new DisjointSet();
m.set = set;
set.list.add(m);
m.vertex = v;
Locator loc = new Locator();
loc.representation = m;
m.locator = loc;
return m;
}
}
class Member {
DisjointSet set;
Locator locator;
Vertex vertex;
Member find() {
return locator.representation;
}
void union(Member u, Member v) { // assume nv is less than nu
u.set.list.append(v.set.list); // hypothetical method, append a list in O(1)
v.set = u.set;
v.locator.representation = u.locator.representation;
}
}
class Locator {
Member representation;
}
Sorry for the minimalistic code. If it can be made this way, than running time for every disjoint set operation (Make,Find,Union) will be O(1). But the one whom I had discussion with can't see the improvement. I would like to know your opinion on this.
And also what is the fastest performance of Find/Union in various implementations? I'm not an expert in data structure, but by quick browsing on the internet I found out there is no constant time data structure or algorithm to do this.
My intuition agrees with your colleague. You say:
u.set.list.append(v.set.list); // hypothetical method, append a list in O(1)
It looks like your intent is that the union is done via the append. But, to implement Union, you would have to remove duplicates for the result to be a set. So I can see an O(1) algorithm for a fixed set size, for example...
Int32 set1;
Int32 set2;
Int32 unionSets1And2 = set1 | set2;
But that strikes me as cheating. If you're doing this for general cases of N, I don't see how you avoid some form of iterating (or hash lookup). And that would make it O(n) (or at best O(log n)).
FYI: I had a hard time following your code. In makeSet, you construct a local Locator that never escapes the function. It doesn't look like it does anything. And it's not clear what your intent is in the append. Might want to edit and elaborate on your approach.
Using Tarjan's version of the Union-Find structure (with path compression and rank-weighed union), a sequence of m Finds and (n-1) intermixed Unions would be in O(m.α(m,n)), where α(m,n) is the inverse of Ackermann function which for all practical values of m and n has value 4. So this basically means that Union-Find has worst case amortized constant operations, but not quite.
To my knowledge, it is impossible to obtain a better theoretical complexity, though improvements have led to better practical efficiency.
For special cases of disjoint-sets such as those used in language theory, it has been shown that linear (i.e., everything in O(1)) adaptations are possible---essentially by grouping nodes together---but these improvements cannot be translated to the general problem. On the other hand of the spectrum, a somewhat similar core idea has been used with great success and ingenuity to make an O(n) algorithm for minimum spanning tree (Chazelle's algorithm).
So your code cannot be correct. The error is what Moron pointed out: when you make the union of two sets, you only update the "representation" of the lead of each list, but not of all other elements---while simultaneously assuming in the find function that every element directly knows its representation.

Algorithm to tell if two arrays have identical members

What's the best algorithm for comparing two arrays to see if they have the same members?
Assume there are no duplicates, the members can be in any order, and that neither is sorted.
compare(
[a, b, c, d],
[b, a, d, c]
) ==> true
compare(
[a, b, e],
[a, b, c]
) ==> false
compare(
[a, b, c],
[a, b]
) ==> false
Obvious answers would be:
Sort both lists, then check each
element to see if they're identical
Add the items from one array to a
hashtable, then iterate through the
other array, checking that each item
is in the hash
nickf's iterative search algorithm
Which one you'd use would depend on whether you can sort the lists first, and whether you have a good hash algorithm handy.
You could load one into a hash table, keeping track of how many elements it has. Then, loop over the second one checking to see if every one of its elements is in the hash table, and counting how many elements it has. If every element in the second array is in the hash table, and the two lengths match, they are the same, otherwise they are not. This should be O(N).
To make this work in the presence of duplicates, track how many of each element has been seen. Increment while looping over the first array, and decrement while looping over the second array. During the loop over the second array, if you can't find something in the hash table, or if the counter is already at zero, they are unequal. Also compare total counts.
Another method that would work in the presence of duplicates is to sort both arrays and do a linear compare. This should be O(N*log(N)).
Assuming you don't want to disturb the original arrays and space is a consideration, another O(n.log(n)) solution that uses less space than sorting both arrays is:
Return FALSE if arrays differ in size
Sort the first array -- O(n.log(n)) time, extra space required is the size of one array
For each element in the 2nd array, check if it's in the sorted copy of
the first array using a binary search -- O(n.log(n)) time
If you use this approach, please use a library routine to do the binary search. Binary search is surprisingly error-prone to hand-code.
[Added after reviewing solutions suggesting dictionary/set/hash lookups:]
In practice I'd use a hash. Several people have asserted O(1) behaviour for hashes, leading them to conclude a hash-based solution is O(N). Typical inserts/lookups may be close to O(1), and some hashing schemes guarantee worst-case O(1) lookup, but worst-case insertion -- in constructing the hash -- isn't O(1). Given any particular hashing data structure, there would be some set of inputs which would produce pathological behaviour. I suspect there exist hashing data structures with the combined worst-case to [insert-N-elements then lookup-N-elements] of O(N.log(N)) time and O(N) space.
You can use a signature (a commutative operation over the array members) to further optimize this in the case where the array are usually different, saving the o(n log n) or the memory allocation.
A signature can be of the form of a bloom filter(s), or even a simple commutative operation like addition or xor.
A simple example (assuming a long as the signature side and gethashcode as a good object identifier; if the objects are, say, ints, then their value is a better identifier; and some signatures will be larger than long)
public bool MatchArrays(object[] array1, object[] array2)
{
if (array1.length != array2.length)
return false;
long signature1 = 0;
long signature2 = 0;
for (i=0;i<array1.length;i++) {
signature1=CommutativeOperation(signature1,array1[i].getHashCode());
signature2=CommutativeOperation(signature2,array2[i].getHashCode());
}
if (signature1 != signature2)
return false;
return MatchArraysTheLongWay(array1, array2);
}
where (using an addition operation; use a different commutative operation if desired, e.g. bloom filters)
public long CommutativeOperation(long oldValue, long newElement) {
return oldValue + newElement;
}
This can be done in different ways:
1 - Brute force: for each element in array1 check that element exists in array2. Note this would require to note the position/index so that duplicates can be handled properly. This requires O(n^2) with much complicated code, don't even think of it at all...
2 - Sort both lists, then check each element to see if they're identical. O(n log n) for sorting and O(n) to check so basically O(n log n), sort can be done in-place if messing up the arrays is not a problem, if not you need to have 2n size memory to copy the sorted list.
3 - Add the items and count from one array to a hashtable, then iterate through the other array, checking that each item is in the hashtable and in that case decrement count if it is not zero otherwise remove it from hashtable. O(n) to create a hashtable, and O(n) to check the other array items in the hashtable, so O(n). This introduces a hashtable with memory at most for n elements.
4 - Best of Best (Among the above): Subtract or take difference of each element in the same index of the two arrays and finally sum up the subtacted values. For eg A1={1,2,3}, A2={3,1,2} the Diff={-2,1,1} now sum-up the Diff = 0 that means they have same set of integers. This approach requires an O(n) with no extra memory. A c# code would look like as follows:
public static bool ArrayEqual(int[] list1, int[] list2)
{
if (list1 == null || list2 == null)
{
throw new Exception("Invalid input");
}
if (list1.Length != list2.Length)
{
return false;
}
int diff = 0;
for (int i = 0; i < list1.Length; i++)
{
diff += list1[i] - list2[i];
}
return (diff == 0);
}
4 doesn't work at all, it is the worst
If the elements of an array are given as distinct, then XOR ( bitwise XOR ) all the elements of both the arrays, if the answer is zero, then both the arrays have the same set of numbers. The time complexity is O(n)
I would suggest using a sort first and sort both first. Then you will compare the first element of each array then the second and so on.
If you find a mismatch you can stop.
If you sort both arrays first, you'd get O(N log(N)).
What is the "best" solution obviously depends on what constraints you have. If it's a small data set, the sorting, hashing, or brute force comparison (like nickf posted) will all be pretty similar. Because you know that you're dealing with integer values, you can get O(n) sort times (e.g. radix sort), and the hash table will also use O(n) time. As always, there are drawbacks to each approach: sorting will either require you to duplicate the data or destructively sort your array (losing the current ordering) if you want to save space. A hash table will obviously have memory overhead to for creating the hash table. If you use nickf's method, you can do it with little-to-no memory overhead, but you have to deal with the O(n2) runtime. You can choose which is best for your purposes.
Going on deep waters here, but:
Sorted lists
sorting can be O(nlogn) as pointed out. just to clarify, it doesn't matter that there is two lists, because: O(2*nlogn) == O(nlogn), then comparing each elements is another O(n), so sorting both then comparing each element is O(n)+O(nlogn) which is: O(nlogn)
Hash-tables:
Converting the first list to a hash table is O(n) for reading + the cost of storing in the hash table, which i guess can be estimated as O(n), gives O(n). Then you'll have to check the existence of each element in the other list in the produced hash table, which is (at least?) O(n) (assuming that checking existance of an element the hash-table is constant). All-in-all, we end up with O(n) for the check.
The Java List interface defines equals as each corresponding element being equal.
Interestingly, the Java Collection interface definition almost discourages implementing the equals() function.
Finally, the Java Set interface per documentation implements this very behaviour. The implementation is should be very efficient, but the documentation makes no mention of performance. (Couldn't find a link to the source, it's probably to strictly licensed. Download and look at it yourself. It comes with the JDK) Looking at the source, the HashSet (which is a commonly used implementation of Set) delegates the equals() implementation to the AbstractSet, which uses the containsAll() function of AbstractCollection using the contains() function again from hashSet. So HashSet.equals() runs in O(n) as expected. (looping through all elements and looking them up in constant time in the hash-table.)
Please edit if you know better to spare me the embarrasment.
Pseudocode :
A:array
B:array
C:hashtable
if A.length != B.length then return false;
foreach objA in A
{
H = objA;
if H is not found in C.Keys then
C.add(H as key,1 as initial value);
else
C.Val[H as key]++;
}
foreach objB in B
{
H = objB;
if H is not found in C.Keys then
return false;
else
C.Val[H as key]--;
}
if(C contains non-zero value)
return false;
else
return true;
The best way is probably to use hashmaps. Since insertion into a hashmap is O(1), building a hashmap from one array should take O(n). You then have n lookups, which each take O(1), so another O(n) operation. All in all, it's O(n).
In python:
def comparray(a, b):
sa = set(a)
return len(sa)==len(b) and all(el in sa for el in b)
Ignoring the built in ways to do this in C#, you could do something like this:
Its O(1) in the best case, O(N) (per list) in worst case.
public bool MatchArrays(object[] array1, object[] array2)
{
if (array1.length != array2.length)
return false;
bool retValue = true;
HashTable ht = new HashTable();
for (int i = 0; i < array1.length; i++)
{
ht.Add(array1[i]);
}
for (int i = 0; i < array2.length; i++)
{
if (ht.Contains(array2[i])
{
retValue = false;
break;
}
}
return retValue;
}
Upon collisions a hashmap is O(n) in most cases because it uses a linked list to store the collisions. However, there are better approaches and you should hardly have collisions anyway because if you did the hashmap would be useless. In all regular cases it's simply O(1). Besides that, it's not likely to have more than a small n of collisions in a single hashmap so performance wouldn't suck that bad; you can safely say that it's O(1) or almost O(1) because the n is so small it's can be ignored.
Here is another option, let me know what you guys think.It should be T(n)=2n*log2n ->O(nLogn) in the worst case.
private boolean compare(List listA, List listB){
if (listA.size()==0||listA.size()==0) return true;
List runner = new ArrayList();
List maxList = listA.size()>listB.size()?listA:listB;
List minList = listA.size()>listB.size()?listB:listA;
int macthes = 0;
List nextList = null;;
int maxLength = maxList.size();
for(int i=0;i<maxLength;i++){
for (int j=0;j<2;j++) {
nextList = (nextList==null)?maxList:(maxList==nextList)?minList:maList;
if (i<= nextList.size()) {
MatchingItem nextItem =new MatchingItem(nextList.get(i),nextList)
int position = runner.indexOf(nextItem);
if (position <0){
runner.add(nextItem);
}else{
MatchingItem itemInBag = runner.get(position);
if (itemInBag.getList != nextList) matches++;
runner.remove(position);
}
}
}
}
return maxLength==macthes;
}
public Class MatchingItem{
private Object item;
private List itemList;
public MatchingItem(Object item,List itemList){
this.item=item
this.itemList = itemList
}
public boolean equals(object other){
MatchingItem otheritem = (MatchingItem)other;
return otheritem.item.equals(this.item) and otheritem.itemlist!=this.itemlist
}
public Object getItem(){ return this.item}
public Object getList(){ return this.itemList}
}
The best I can think of is O(n^2), I guess.
function compare($foo, $bar) {
if (count($foo) != count($bar)) return false;
foreach ($foo as $f) {
foreach ($bar as $b) {
if ($f == $b) {
// $f exists in $bar, skip to the next $foo
continue 2;
}
}
return false;
}
return true;
}

Resources