Data structure that allows accessing elements by index and deleting them in O(1) - algorithm

I have the following task (as part of a bigger task):
I need to take the k-th element from an array-like data structure and delete it (k can be any valid index). An array takes O(n) to delete an element, and a linked list takes O(n) to find one. I would like to do both operations in O(1) time.
Which data structure should I use to meet this requirement?
Clarification:
Deleting the element at index 5 moves the element at index 6 to index 5.
This particular task is the TopCoder SRM 300 Div2 500-point problem. It does not require such a sophisticated data structure (simple Java methods will do the job, since the maximum input is really small), but I am curious how to deal with a much bigger problem using C-like thinking about data.
So maybe I am stuck too much on arrays for this problem? I will analyze it and edit the question later, after work (if you are really curious, you can see the task on TopCoder).

I believe what you're asking for is impossible.
However, if you can relax your requirement for indexing to O(log n), then ropes may be able to satisfy it, although I'm not sure if they have a probabilistic or deterministic guarantee (I think it's probabilistic).

The "dating" problem as given involves continuously choosing and removing the "best" member of a set -- a classic priority queue. In fact, you'll need to build two of those (for men and women). You'll either have to build them in O(N log N) time (a sorted list) for constant O(1) removal, or else build them in linear time (a heap) for O(log N) removal. Overall you get O(N log N) either way, since you'll be removing all of one queue and most of the other.
So then the question is what structure supports the other part of the task: choosing the "chooser" from the circle and removing him and his choice. Since this too must be done N times, any method that accomplishes the removal in O(log N) time won't increase the overall complexity of your algorithm. You can't get O(1) indexed access with fast deletions given the re-indexing requirement. But you can in fact get O(log N) for both indexed access and deletion with a tree (something like the rope mentioned above). This will give you O(N log N) overall, which is the best you can do anyway.
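If O(log N) is acceptable, you don't necessarily need a full rope. A hedged sketch of one alternative: leave the backing array untouched and keep a Fenwick (binary indexed) tree of 0/1 "alive" counts, so finding the k-th surviving element is a binary search over prefix counts and deleting it just clears a flag. The class below is illustrative only (names are mine, not from any library); access is O(log^2 N) as written and deletion is O(log N).

// Sketch: indexed access over a fixed backing array with logical left-shift on
// deletion, via a Fenwick tree of "alive" counts. All names are illustrative.
class IndexedDeleter {
    private final int[] data;
    private final int[] fenwick;   // 1-based Fenwick tree over alive flags
    private int alive;

    IndexedDeleter(int[] data) {
        this.data = data;
        this.fenwick = new int[data.length + 1];
        this.alive = data.length;
        for (int i = 1; i <= data.length; i++) add(i, 1);
    }

    private void add(int pos, int delta) {
        for (; pos < fenwick.length; pos += pos & -pos) fenwick[pos] += delta;
    }

    private int prefix(int pos) {              // alive elements among data[0..pos-1]
        int s = 0;
        for (; pos > 0; pos -= pos & -pos) s += fenwick[pos];
        return s;
    }

    private int locate(int k) {                // physical slot of the k-th (0-based) alive element
        int lo = 1, hi = data.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (prefix(mid) >= k + 1) hi = mid; else lo = mid + 1;
        }
        return lo - 1;
    }

    int get(int k) { return data[locate(k)]; }

    int removeAt(int k) {                      // later elements shift left logically
        int pos = locate(k);
        add(pos + 1, -1);
        alive--;
        return data[pos];
    }

    int size() { return alive; }
}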

There is a solution that may be satisfactory in some cases. Use an array plus a vector that records deletions. Every time you delete an element, you put its index into the vector. Every time you read an element at some index, you recalculate the index based on previous deletions.
Say, you have an array of:
A = [3, 7, 6, 4, 3]
You delete the 3rd element:
A = [3, 7, 6, 4, 3] (no actual deletion)
d = [3]
And then read the 4th:
i = 4
the deleted index 3 < 4, so i += 1, giving i = 5
A[5] = 3
This is not exactly O(1), but it does not depend on the array length, only on the number of deleted elements.
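A minimal Java sketch of that idea (0-indexed, with names of my own choosing): each read walks the sorted list of deleted physical indices, so a lookup costs O(d) where d is the number of deletions so far.

import java.util.ArrayList;
import java.util.List;

// Lazy deletion: the array is never compacted; deleted physical indices are
// recorded, and each logical index is shifted past them on read.
class LazyDeleteArray {
    private final int[] a;
    private final List<Integer> deleted = new ArrayList<>(); // kept sorted, ascending

    LazyDeleteArray(int[] a) { this.a = a; }

    /** Translate a logical index to a physical one by skipping deleted slots. */
    private int physical(int logical) {
        int i = logical;
        for (int d : deleted) {
            if (d <= i) i++;      // every deletion at or before us shifts us right
            else break;
        }
        return i;
    }

    int get(int logical) { return a[physical(logical)]; }

    void delete(int logical) {
        int p = physical(logical);
        int pos = 0;
        while (pos < deleted.size() && deleted.get(pos) < p) pos++;
        deleted.add(pos, p);      // keep the deletion list sorted
    }
}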

The only data structure with a small overhead for adding and removing elements is a hashtable. The only overhead is the cost of the hash function (which is considered O(1) if you take a purely theoretical approach).
But if you want it to be extremely efficient, you will need to:
Have an estimate of the number of elements your data structure will have to hold (and allocate for that number once and for all at the beginning).
Choose a hash function that avoids collisions given the way your keys are distributed (collisions destroy the efficiency of hashtables).
If you manage to get everything right, then you should be optimal.
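As a small illustration (the element count here is made up), pre-sizing a Java HashMap by the expected count divided by its load factor avoids any rehashing while it fills:

import java.util.HashMap;
import java.util.Map;

public class PreSizedTable {
    public static void main(String[] args) {
        // Hypothetical expected element count; dividing by the default load
        // factor (0.75) leaves enough buckets that the map never rehashes.
        int expected = 1_000_000;
        Map<Long, String> table = new HashMap<>((int) (expected / 0.75f) + 1);

        table.put(42L, "payload");   // average O(1) insert
        table.remove(42L);           // average O(1) delete
    }
}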

Related

Data structure / algorithms for getting best and worst-scoring object from set

My algorithm runs a loop where a set of objects is maintained. In each iteration objects are added to and removed from the set. Also, there are some "measures" (integer values, possibly several of them) for each object, which can change at any time. From those measures and the iteration number, a score can be calculated for each object.
Whenever the number of objects passes a certain threshold, I want to identify and remove the lowest-scoring objects until the number of objects is again below that threshold. That is: if there are n objects with threshold t, if n>t then remove the n-t lowest-scoring objects.
But I also periodically want to get the highest-scoring objects.
I'm really at a loss as to what data structure I should use here to do this efficiently. A priority queue doesn't really work, as measures are changed all the time, and anyway the "score" I want to use can be an arbitrarily complex function of those measures and the current iteration number. The obvious approach is probably a hash table storing associations object -> measures, with amortized O(1) add/remove/update operations, but then finding the lowest- or highest-scoring objects would be O(n) in the number of elements. n can easily be in the millions after a short while, so this isn't ideal. Is this the best I can do?
I realise this probably isn't very trivial but I'd like to know if anyone has any suggestions as to how this could be best implemented.
PS: The language is OCaml but it really doesn't matter.
For this level of generality, the best approach would be something that gives quick access to the measures (storing them in the object or via a pointer would be best, but a hash table would also work), plus an additional data structure that keeps an ordered view of your objects.
Every time you update the measures you would want to refresh the score and update the ordered data structure. Something like a balanced BST (RB-tree, AVL) would work well and would guarantee O(log N) update complexity.
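A hedged sketch of that combination in Java (the question is in OCaml, but the shape is the same; the ids, the {score, id} entry layout and the eviction helper are my own): a HashMap gives O(1) access to an object's current entry, and a TreeSet ordered by (score, id) plays the role of the balanced BST.

import java.util.HashMap;
import java.util.Map;
import java.util.TreeSet;

// Sketch: O(1) access to an object's entry plus an ordered view by score.
class ScoreIndex {
    private final Map<Integer, double[]> entries = new HashMap<>();   // id -> {score, id}
    private final TreeSet<double[]> byScore = new TreeSet<>((a, b) -> {
        int c = Double.compare(a[0], b[0]);                           // order by score...
        return c != 0 ? c : Double.compare(a[1], b[1]);               // ...tie-break by id
    });

    /** Add an object, or refresh its score after a measure changed: O(log N). */
    void update(int id, double newScore) {
        remove(id);
        double[] e = {newScore, id};
        entries.put(id, e);
        byScore.add(e);
    }

    void remove(int id) {
        double[] old = entries.remove(id);
        if (old != null) byScore.remove(old);
    }

    int highestId() { return (int) byScore.last()[1]; }

    /** Drop the lowest-scoring objects until at most threshold remain. */
    void evictDownTo(int threshold) {
        while (byScore.size() > threshold) {
            double[] lowest = byScore.pollFirst();
            entries.remove((int) lowest[1]);
        }
    }
}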
You can also keep a min-max heap instead of the BST. This has the advantage of using fewer pointers, which should lower the overhead of the solution. Complexity remains O(log N) per update.
You've mentioned that the score depends on the iteration number. This is bad for performance, because it requires all entries to be updated every iteration. However, if you can isolate the impact (say the score is g(all_metrics) - f(iteration_number)) so that all elements are affected equally, then the relative order remains consistent and you can skip updating the scores every iteration.
If it's not constant but still isolated (something like f(iteration_number, important_time)), you can use the balanced BST, calculate when the iteration count will swap each element with one of its neighbours, keep those swap times in a heap, and only update the elements that would actually swap.
If it's not isolated at all then you would need at each iteration to update all the elements, so you might as well keep track of the highest value and the lowest ones when you go through them to recompute the scores. This at least will have a complexity of O(NlogK) where K is the number of lowest values you want to remove (hopefully it's very small so it should behave almost like O(N)).

Algorithm to find all values repeating more than floor(n/k) times in O(n log k) time [duplicate]

This problem is 4-11 of Skiena. The solution for finding the majority element (one repeated more than half the time) is the majority algorithm. Can we use this to find all numbers repeated more than n/4 times?
Misra and Gries describe a couple of approaches. I don't entirely understand their paper, but a key idea is to use a bag.
Boyer and Moore's original majority algorithm paper has a lot of incomprehensible proofs and discussion of formal verification of FORTRAN code, but it has a very good start of an explanation of how the majority algorithm works. The key concept starts with the idea that if the majority of the elements are A and you remove, one at a time, a copy of A and a copy of something else, then in the end you will have only copies of A. Next, it should be clear that removing two different items, neither of which is A, can only increase the majority that A holds. Therefore it's safe to remove any pair of items, as long as they're different. This idea can then be made concrete. Take the first item out of the list and stick it in a box. Take the next item out and stick it in the box. If they're the same, let them both sit there. If the new one is different, throw it away, along with an item from the box. Repeat until all items are either in the box or in the trash. Since the box is only allowed to have one kind of item at a time, it can be represented very efficiently as a pair (item type, count).
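That pairing argument turns directly into the familiar two-variable implementation. A minimal Java sketch (the returned candidate must still be verified with a second pass over the data):

// Boyer–Moore majority vote: the "box" is just (candidate, count).
// If no element occurs more than n/2 times the result is meaningless,
// so verify the candidate with a second counting pass.
static int majorityCandidate(int[] a) {
    int candidate = 0, count = 0;
    for (int x : a) {
        if (count == 0) { candidate = x; count = 1; }
        else if (x == candidate) count++;
        else count--;               // throw away a pair of different items
    }
    return candidate;
}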
The generalization to find all items that may occur more than n/k times is simple, but explaining why it works is a little harder. The basic idea is that we can find and destroy groups of k distinct elements without changing anything. Why? If w > n/k then w-1 > (n-k)/k. That is, if we take away one of the popular elements, and we also take away k-1 other elements, then the popular element remains popular!
Implementation: instead of only allowing one kind of item in the box, allow k-1 of them. Whenever you see a group of k different items show up (that is, there are k-1 types in the box, and the one arriving doesn't match any of them), you throw one of each type in the trash, including the one that just arrived. What data structure should we use for this "box"? Well, a bag, of course! As Misra and Gries explain, if the elements can be ordered, a tree-based bag with O(log k) basic operations will give the whole algorithm a complexity of O(n log k). One point to note is that the operation of removing one of each element is a bit expensive (O(k) for a typical implementation), but that cost is amortized over the arrivals of those elements, so it's no big deal. Of course, if your elements are hashable rather than orderable, you can use a hash-based bag instead, which under certain common assumptions will give even better asymptotic performance (but it's not guaranteed). If your elements are drawn from a small finite set, you can guarantee that. If they can only be compared for equality, then your bag gets much more expensive and I'm pretty sure you end up with something like O(nk) instead.
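A minimal Java sketch of that generalization, using a hash-based bag as the box (so each arrival costs expected O(1) rather than the O(log k) tree bound discussed above); the survivors still need a verification pass:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Misra–Gries: a hash-based "box" holding at most k-1 candidate types.
// Only elements occurring more than n/k times are guaranteed to survive.
static List<Integer> heavyHitterCandidates(int[] a, int k) {
    Map<Integer, Integer> box = new HashMap<>();
    for (int x : a) {
        Integer c = box.get(x);
        if (c != null) {
            box.put(x, c + 1);                 // same type already in the box
        } else if (box.size() < k - 1) {
            box.put(x, 1);                     // room for a new type
        } else {
            // k distinct items seen (k-1 in the box plus x): trash one of each.
            Iterator<Map.Entry<Integer, Integer>> it = box.entrySet().iterator();
            while (it.hasNext()) {
                Map.Entry<Integer, Integer> e = it.next();
                if (e.getValue() == 1) it.remove();
                else e.setValue(e.getValue() - 1);
            }
        }
    }
    return new ArrayList<>(box.keySet());
}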
Find the majority element that appears more than n/2 times by Moore's Voting Algorithm
See method 3 of the given link for Moore's Voting Algo (http://www.geeksforgeeks.org/majority-element/).
Time: O(n)
Now after finding the majority element, scan the array again and remove the majority element (or mark it as -1).
Time: O(n)
Now apply Moore's Voting Algorithm on the remaining elements of the array (ignoring the -1s, as they have already been counted). The new majority element appears more than n/4 times.
Time: O(n)
Total time: O(n)
Extra space: O(1)
You can do the same for elements appearing more than n/8, n/16, ... times.
EDIT:
There may exist a case when there is no majority element in the array:
For example, if the input array is {3, 1, 2, 2, 1, 2, 3, 3}, then the output should be [2, 3].
Given an array of size n and a number k, find all elements that appear more than n/k times
See this link for the answer:
https://stackoverflow.com/a/24642388/3714537
References:
http://www.cs.utexas.edu/~moore/best-ideas/mjrty/
See this paper for a solution that uses constant memory and runs in linear time; it will find 3 candidates for elements that occur more than n/4 times. Note that if you assume your data is given as a stream that you can only go through once, this is the best you can do -- you have to go through the stream one more time to test each of the 3 candidates to see whether it occurs more than n/4 times. However, if you assume a priori that there are 3 elements that occur more than n/4 times, then you only need to go through the stream once, so you get a linear-time online algorithm that only requires constant storage.
As you didn't mention space complexity, one possible solution is to use a hashtable that maps each element to its count; then you just increment the count whenever the element is seen again.
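For completeness, a minimal Java sketch of that counting approach (O(n) time, O(n) extra space; the method name is mine):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain counting: one pass to count, one pass over the distinct values to filter.
static List<Integer> moreThanNOverK(int[] a, int k) {
    Map<Integer, Integer> count = new HashMap<>();
    for (int x : a) count.merge(x, 1, Integer::sum);
    List<Integer> result = new ArrayList<>();
    for (Map.Entry<Integer, Integer> e : count.entrySet())
        if ((long) e.getValue() * k > a.length)      // strictly more than n/k occurrences
            result.add(e.getKey());
    return result;
}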

Data structure for non overlapping ranges of integers?

I remember learning a data structure that stored a set of integers as ranges in a tree, but it's been 10 years and I can't remember the name of the data structure, and I'm a bit fuzzy on the details. If it helps, it's a functional data structure that was taught at CMU, I believe in 15-212 (Principles of Programming) in 2002.
Basically, I want to store a set of integers, most of which are consecutive. I want to be able to query for set membership efficiently, add a range of integers efficiently, and remove a range of integers efficiently. In particular, I don't care to preserve what the original ranges are. It's better if adjacent ranges are coalesced into a single larger range.
A naive implementation would be to simply use a generic set data structure such as a HashSet or TreeSet, and add all integers in a range when adding a range, or remove all integers in a range when removing a range. But of course, that would waste a lot of memory in addition to making add and remove slow.
I'm thinking of a purely functional data structure, but for my current use I don't need it to be. IIRC, lookup, insertion, and deletion were all O(log N), where N was the number of ranges in the set.
So, can you tell me the name of the data structure I'm trying to remember, or a suitable alternative?
I found the old homework, and the data structure I had in mind was the Discrete Interval Encoding Tree, or diet for short. It is described in detail in "Diets for Fat Sets", Martin Erwig, Journal of Functional Programming, Vol. 8, No. 6, 627-632, 1998. It is basically a tree of intervals with the invariant that all of the intervals are non-overlapping and non-touching. There is a Haskell implementation on Hackage. I was hoping there would be an existing implementation for Scala, but I'm not seeing any.
The homework also included another data structure they called a Recursive Interval-Occluding Tree (RIOT), which rather than keeping only an interval at each node keeps an interval and another (possibly empty) RIOT of things removed from the interval. The assignment included benchmarks showing it did better than diets for random insertions and deletions. AFAICT it is simply something the TAs made up and never published as it no longer seems to exist anywhere on the Internets, at least not under that name.
You probably are looking for segment trees. This might be helpful: http://www.topcoder.com/tc?d1=tutorials&d2=lowestCommonAncestor&module=Static
You can also use a binary search tree for this, where each node has two data fields: min_val and max_val.
During insertion, you just need an additional merge step that checks whether the left child, parent, and right child form a contiguous sequence, so they can be combined into a single node. This takes O(log n) time.
Other operations like deletion and lookup take O(log n) time as usual, but deletion needs special care.
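A hedged Java sketch of that min_val/max_val idea, using a TreeMap keyed by range start (purely an illustration, not the course's data structure; removing a range would split entries in a similar fashion):

import java.util.Map;
import java.util.TreeMap;

// Interval set over ints: keys are range starts, values are range ends (inclusive).
// Invariant: stored ranges are disjoint and non-touching, as in a diet.
// (Edge cases around Integer.MIN_VALUE/MAX_VALUE are ignored in this sketch.)
class RangeSet {
    private final TreeMap<Integer, Integer> ranges = new TreeMap<>();

    boolean contains(int x) {
        Map.Entry<Integer, Integer> e = ranges.floorEntry(x);
        return e != null && x <= e.getValue();
    }

    /** Add [lo, hi], merging with any ranges it overlaps or touches. */
    void add(int lo, int hi) {
        // Extend left if a range starting at or before lo reaches lo-1 or further.
        Map.Entry<Integer, Integer> left = ranges.floorEntry(lo);
        if (left != null && left.getValue() >= lo - 1) {
            lo = left.getKey();
            hi = Math.max(hi, left.getValue());
        }
        // Swallow every range that starts within [lo, hi+1].
        Map.Entry<Integer, Integer> next;
        while ((next = ranges.ceilingEntry(lo)) != null && next.getKey() <= hi + 1) {
            hi = Math.max(hi, next.getValue());
            ranges.remove(next.getKey());
        }
        ranges.put(lo, hi);
    }
}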

Binary Search Tree for specific intent

We all know there are plenty of self-balancing binary search trees (BSTs), the most famous being the red-black tree and the AVL tree. It might be useful to take a look at AA trees and scapegoat trees too.
I want to do deletions, insertions and searches, like any other BST. However, it will be common to delete all values in a given range, or to delete whole subtrees. So:
I want to insert, search, remove values in O(log n) (balanced tree).
I would like to delete a subtree, keeping the whole tree balanced, in O(log n) (worst-case or amortized)
It might be useful to delete several values in a row, before balancing the tree
I will most often insert 2 values at once, however this is not a rule (just a tip in case there is a tree data structure that takes this into account)
Is there a variant of AVL or RB trees that helps me with this? Scapegoat trees look closest, but they would also need some changes; can anyone with experience of them share some thoughts?
More precisely, which balancing procedure and/or removal procedure would help keep these actions time-efficient?
It is possible to delete a range of values from a BST in O(log n + number of deleted objects).
The easiest way I know is to work with the Deterministic Skip List data structure (you might want to read a bit about this data structure before you go on).
In a deterministic skip list all of the real values are stored in the bottom level, and the upper levels hold pointers to them. Insert, search and remove are done in O(log n).
The range deletion operation can be done according to the following algorithm:
Find the first element in the range - O(log n)
Go forward in the linked list and remove all elements that are still in the range. If an element has pointers on upper levels, remove those too, all the way up to the topmost level (removal from a linked list) - O(number of deleted objects)
Fix the pointers to restore the deterministic skip list invariant (2-3 elements between consecutive upward pointers)
The total complexity of the range delete is O(log n + number of objects in the range).
Notice that if you choose to work with a randomized skip list, you get the same complexity, but on average rather than worst case. The plus is that you don't have to fix the upper-level pointers to meet the 2-3 requirement.
A deterministic skip list has a 1-1 mapping to a 2-3 tree, so with some more work, the procedure described above could work for a 2-3 tree as well.
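For comparison, this is what the range-delete interface looks like on Java's TreeSet (a red-black tree): the subSet view is backed by the tree, so clearing it removes the whole range, although each removal costs O(log n), i.e. O(k log n) in total rather than the O(log n + k) bound above.

import java.util.TreeSet;

public class RangeDeleteDemo {
    public static void main(String[] args) {
        TreeSet<Integer> set = new TreeSet<>();
        for (int i = 0; i < 100; i += 3) set.add(i);

        // Find the first element in [20, 50]...
        Integer first = set.ceiling(20);
        System.out.println("first in range: " + first);

        // ...then delete the whole range in one call. The subSet view is backed
        // by the tree, so clearing it removes the elements from the set itself.
        set.subSet(20, true, 50, true).clear();

        System.out.println(set.headSet(60));
    }
}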
Long ago in the pre-STL days I wrote my own B-Tree (BST) algorithm because I had a rather large data set at the time (roughly 700K items in 2 trees that were interdependent). I found that rebalancing after every 100-200 insertions/deletions was the peak performance I could get at the time based on experimentation on 486 and SGI hardware. This number may be different now, or maybe not since it does appear to be an algorithmic optimization limit unless you convert to a parallel model.
In short, you could apply a modification trigger for the rebalancing, and allow for forced rebalancing when you've completed all your modifications.
The improvement was remarkable. The initial straight load was not complete after 25 minutes (I killed the process). Rebalancing as we went was also killed after 15 minutes. The restricted modification load, with a rebalance every 100 mods, loaded and ran in less than 3 minutes. Note that during the "run" portion, there were 0-8 modifications to the tree per initial entry. You really need to consider whether you always need to be in balance when the tree will be modified again in the near term.
Hmm, what about B-trees? They are also balanced, and if you choose a high-order one (depending on how many items you have), you will save a bunch of object creation/destruction time.
Regarding 2: if you have a B-tree of order 100, you can remove up to 100 items with one function call.
Regarding 3: this can be applied to almost any tree; just implement a RemoveSome() function that removes N items and then rebalances. For B-trees it's a bit trickier, but it can be done.
Note: I assume you're a programmer. If you need a complete, tested, off-the-shelf solution, you need another answer.
It should be easy to implement deleting a node and its subtree in an AVL tree if every node stores its height instead of a balance factor. After deleting the node, keep rotating until the heights of the two child nodes differ by no more than one, then move up the tree and repeat. The only real difference from a normal deletion is using a while instead of an if when testing the heights.
The Set implementation in the OCaml standard library is a purely functional AVL tree that satisfies all of your requirements and, in particular, has very efficient implementations of set theoretic operations (union, intersection, difference). Insertion and deletion are O(log n). You can remove subtrees and runs of elements by representing them as a set and using set difference. You can insert two elements simultaneously by creating a 2-element set and applying set union.

What sort of sorted datastructure is optimized for finding items within a range?

Say I have a bunch of objects with dates and I regularly want to find all the objects that fall between two arbitrary dates. What sort of datastructure would be good for this?
A binary search tree sounds like what you're looking for.
You can use it to find all the objects in O(log(N) + K), where N is the total number of objects and K is the number of objects that are actually in that range. (provided that it's balanced). Insertion/removal is O(log(N)).
Most languages have a built-in implementation of this.
C++:
http://www.cplusplus.com/reference/stl/set/
Java:
http://java.sun.com/j2se/1.4.2/docs/api/java/util/TreeSet.html
You can find the lower bound of the range (in log(n)) and then iterate from there until you reach the upper bound.
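For instance, with the Java TreeSet linked above (using LocalDate purely for illustration), the range query is a single subSet call that returns a view you can iterate in O(log N + K):

import java.time.LocalDate;
import java.util.NavigableSet;
import java.util.TreeSet;

public class DateRangeQuery {
    public static void main(String[] args) {
        NavigableSet<LocalDate> dates = new TreeSet<>();
        dates.add(LocalDate.of(2024, 1, 5));
        dates.add(LocalDate.of(2024, 2, 10));
        dates.add(LocalDate.of(2024, 3, 15));

        // All dates in [2024-01-20, 2024-03-01], inclusive on both ends.
        NavigableSet<LocalDate> inRange = dates.subSet(
                LocalDate.of(2024, 1, 20), true,
                LocalDate.of(2024, 3, 1), true);

        inRange.forEach(System.out::println);   // prints 2024-02-10
    }
}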
Assuming you mean sorted by date, an array will do it.
Do a binary search to find the index of the first item >= the start date. You can then either do another search to find the index of the last item <= the end date, leaving you with an offset and count of items, or, if you're going to process them anyway, just iterate through the list until you exceed the end date.
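A minimal Java sketch of that array approach (dates represented as plain longs, e.g. epoch days, purely for illustration): binary-search the lower bound, then scan forward until the end date is exceeded.

// Sorted-array range scan: O(log N) to find the start, O(K) to collect.
// Lower bound: first index whose value is >= key (handles duplicates correctly).
static int lowerBound(long[] a, long key) {
    int lo = 0, hi = a.length;
    while (lo < hi) {
        int mid = (lo + hi) >>> 1;
        if (a[mid] < key) lo = mid + 1; else hi = mid;
    }
    return lo;
}

static void printRange(long[] sortedDates, long start, long end) {
    for (int i = lowerBound(sortedDates, start);
         i < sortedDates.length && sortedDates[i] <= end; i++) {
        System.out.println(sortedDates[i]);
    }
}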
It's hard to give a good answer without a little more detail.
What kind of performance do you need?
If linear is fine then I would just use a list of dates and iterate through the list collecting all dates that fall within the range. As Andrew Grant suggested.
Do you have duplicates in the list?
If you need to have repeated dates in your collection then most implementations of a binary tree would probably be out. Something like Java's TreeSet is a set implementation and doesn't allow repeated elements.
What are the access characteristics? Lots of lookups with few updates, vice-versa, or fairly even?
Most datastructures have trade-offs between lookups and updates. If you're doing lots of updates then some datastructure that are optimized for lookups won't be so great.
So what are the access characteristics of the data structure, what kind of performance do you need, and what are structural characteristics that it must support (e.g. must allow repeated elements)?
If you need to make random-access modifications: a tree, as in v3's answer. Find the bottom of the range by lookup, then count upwards. Inserting or deleting a node is O(log N). stbuton makes a good point that if you want to allow duplicates (as seems plausible for datestamped events), then you don't want a tree-based set.
If you do not need to make random-access modifications: a sorted array (or vector or whatever). Find the location of the start of the range by binary chop, then count upwards. Inserting or deleting is O(N) in the middle. Duplicates are easy.
Algorithmic performance of lookups is the same in both cases, O(M + log N), where M is the size of the range. But the array uses less memory per entry, and might be faster to count through the range, because after the binary chop it's just forward sequential memory access rather than following pointers.
In both cases you can arrange for insertion at the end to be (amortised) O(1). For the tree, keep a record of the end element at the head, and you get an O(1) bound. For the array, grow it exponentially and you get amortised O(1). This is useful if the changes you make are always or almost-always "add a new event with the current time", since time is (you'd hope) a non-decreasing quantity. If you're using system time then of course you'd have to check, to avoid accidents when the clock resets backwards.
Alternative answer: an SQL table, and let the database optimise how it wants. And Google's BigTable structure is specifically designed to make queries fast, by ensuring that the result of any query is always a consecutive sequence from a pre-prepared index :-)
You want a structure that keeps your objects sorted by date, whenever you insert or remove a new one, and where finding the boundary for the segment of all objects later than or earlier than a given date is easy.
A heap seems the perfect candidate. In practical applications, heaps are simply represented by an array, where all the objects are stored in order. Seeing that sorted array as a heap is simply a way to make insertions of new objects and deletions happen in the right place, and in O(log(n)).
When you have to find all the objects between date A (excluded) and B (included), find the position of A (or its insert position, that is, the position of the earliest element later than A) and the position of B (or the insert position of B), and return all the objects between those positions (which is simply the contiguous section between those positions in the array/heap).
