What is the best data structure for this case? Given N resources with IDs from 0 to N-1, you can get a resource or free a resource.
We also need to consider the time and space complexity of the get and free operations.
interface ResourcePool {
    int get();          // return an available ID
    void free(int id);  // mark ID as available
}
Follow-up: what if N is a super large number, say 1 billion or 1 trillion?
Generally, you need 2 things:
A variable like int nextUnused that contains the smallest ID that's never been allocated
A list of free IDs less than nextUnused.
Allocating an ID will take it from the free list if it's non-empty. Otherwise it will increment nextUnused.
Freeing an ID will just add it to the free list.
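Here is a minimal Java sketch of this scheme (the class and field names are mine, and it trusts the caller not to double-free):

import java.util.ArrayDeque;
import java.util.Deque;

class SimplePool {
    private int nextUnused = 0;                                  // smallest ID never allocated
    private final Deque<Integer> freeList = new ArrayDeque<>();  // freed IDs below nextUnused

    int get() {
        // Reuse a freed ID if one exists; otherwise hand out a fresh one.
        return freeList.isEmpty() ? nextUnused++ : freeList.pop();
    }

    void free(int id) {
        freeList.push(id);  // O(1), as is get()
    }
}

Here the free list is stored separately (an ArrayDeque); the intrusive variant described next reuses the freed resources' own memory instead.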
There are lots of different representations for the free list, but if you need to reserve memory for allocated resources, then it's common to reuse the memory of the free ones as linked list nodes in the free list, so the free list itself doesn't consume any space. This kind of data structure is called... a "free list": https://en.wikipedia.org/wiki/Free_list
Alternatively, you can store the free list separately. Since IDs can be freed in any order, and you need to remember which ones are free, there is no choice but to store the whole list somehow.
If your ID space is really big, it's conceivable that you could adopt strategies for keeping this representation as small as possible, but I've never seen much effort put into that in practice. The other possibility is to move parts of the free list into disk storage when it gets too big.
If N is very large, you can represent your resource pool using a balanced binary search tree. Each node in the tree is a range of free IDs, represented by an upper and lower bound. get() removes an arbitrary node from the tree, increments the lower bound, then re-inserts the node if the range it represents is still non-empty. free(i) inserts a new node (i,i), then coalesces that node with its two neighbors where possible. For instance, if the tree contains (7,9) and (11,17), then free(10) results in a tree with fewer nodes: (7,9), (10,10), and (11,17) are all removed, and (7,17) takes their place. On the other hand, if the two neighbors of (10,10) are (7,9) and (12,17), then the result is (7,10) and (12,17), while if the two neighbors are (7,8) and (12,17), then no coalescing is possible and all three nodes, (7,8), (10,10), and (12,17), remain in the tree.
Both operations, get() and free(), take O(log P) time, where P is the number of reserved elements at the moment the operation begins. This is slower than a free list, but the advantage over a plain free list is that the size of the structure will be no more than P, so as long as P is much smaller than N, the space usage is low.
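A Java sketch of that tree, with TreeMap standing in for the balanced BST (the names are mine; it assumes the pool is never exhausted and IDs are not double-freed):

import java.util.Map;
import java.util.TreeMap;

class RangePool {
    // Disjoint free ranges, keyed by lower bound: low -> high (inclusive).
    private final TreeMap<Integer, Integer> free = new TreeMap<>();

    RangePool(int n) {
        free.put(0, n - 1); // initially the whole ID space is one free range
    }

    int get() {
        Map.Entry<Integer, Integer> e = free.pollFirstEntry(); // any node will do; take the first
        int id = e.getKey();
        if (id < e.getValue())
            free.put(id + 1, e.getValue()); // increment the lower bound and re-insert
        return id;
    }

    void free(int id) {
        int low = id, high = id;
        Map.Entry<Integer, Integer> left = free.floorEntry(id - 1);
        if (left != null && left.getValue() == id - 1) {  // left neighbor ends at id-1: merge
            low = left.getKey();
            free.remove(left.getKey());
        }
        Integer rightHigh = free.remove(id + 1);          // right neighbor starts at id+1: merge
        if (rightHigh != null)
            high = rightHigh;
        free.put(low, high);
    }
}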
Related
I have N objects, and M sets of those objects. Sets are non-empty, distinct, and may intersect. Typically M and N are of the same order of magnitude, usually M > N.
Historically my sets were encoded as-is, each just containing a table (array) of its objects, but I'd like to create a more optimized encoding. Typically some objects are present in most of the sets, and I want to exploit this.
My idea is to represent sets as stacks (i.e. singly linked lists), where their bottom parts can be shared across different sets. It can also be seen as a tree, where each node/leaf has a pointer to its parent, but not to its children.
Such a data structure will allow the most common subsets of objects to be used as roots, which all the appropriate sets can "inherit".
The most efficient encoding is computed by the following algorithm. I'll write it as a recursive pseudo-code.
BuildAllChains()
{
    BuildSubChains(allSets, NULL);
}

BuildSubChains(sets, pParent)
{
    if (sets is empty)
        return;

    trgObj = the most frequent object from sets;

    pNode = new Node;
    pNode->Object = trgObj;
    pNode->pParent = pParent;

    newSets = empty;
    for (each set in sets that contains the trgObj)
    {
        remove trgObj from set;
        remove set from sets;
        if (set is empty)
            set->pHead = pNode;
        else
            newSets.Insert(set);
    }

    BuildSubChains(sets, pParent);   // the sets that did not contain trgObj
    BuildSubChains(newSets, pNode);  // the sets that did, now rooted at pNode
}
Note: the pseudo-code is written in a recursive manner, but technically naive recursion should not be used, because at each point the splitting is not balanced, and in a degenerate case (which is likely, since the source data isn't random) the recursion depth would be O(N).
In practice I use a combination of loop + recursion, where the recursion is always invoked on the smaller part.
So, the idea is to select the most common object each time, create a "subset" which inherits from its parent subset, and base all the sets that include it (together with all the predecessors selected so far) on this subset.
Now, I'm trying to figure out an effective way to select the most frequent object from the sets. Initially my idea was to compute the histogram of all the objects and sort it once. Then, during the recursion, whenever we remove an object and select only the sets that contain/don't contain it, deduce the sorted histogram of the remaining sets. But then I realized that this is not trivial, because we remove many sets, each containing many objects.
Of course we can select the most frequent object directly each time, i.e. O(N*M). But this also looks inferior: in a degenerate case, where an object exists in almost all or almost no sets, we may need to repeat this O(N) times. OTOH for those specific cases in-place adjustment of the sorted histogram may be the preferred way to go.
So far I couldn't come up with a good enough solution. Any ideas would be appreciated. Thanks in advance.
Update:
@Ivan: first, thanks a lot for the answer and the detailed analysis.
I do store the list of elements within the histogram rather than the count only. Actually I use pretty sophisticated data structures (not related to STL) with intrusive containers, cross-linked pointers, etc. I planned this from the beginning, because back then it seemed to me that the histogram adjustment after removing elements would be trivial.
I think the main point of your suggestion, which I didn't figure out myself, is that at each step the histograms should only contain elements that are still present in the family, i.e. they must not contain zeroes. I thought that in cases where the splitting is very uneven, creating a new histogram for the smaller part would be too expensive. But restricting it to only existing elements is a really good idea.
So we remove sets of the smaller family, adjust the "big" histogram and build the "small" one. Now, I need some clarifications about how to keep the big histogram sorted.
One idea, which I thought about first, was an immediate fix of the histogram after every single element removal. I.e. for every set we remove, for every object in the set, remove it from the histogram, and if the sort is broken, swap the histogram element with its neighbor until the sort is restored.
This seems good if we remove a small number of objects: we don't need to traverse the whole histogram, we do a "micro-bubble" sort.
However, when removing a large number of objects, it seems better to just remove all the objects and then re-sort the array via quick-sort.
So, do you have a better idea regarding this?
Update2:
I'm thinking about the following: the histogram should be a binary search tree (auto-balanced, of course), where each element of the tree contains the appropriate object ID and the list of the sets it belongs to (so far). The comparison criterion is the size of this list.
Each set should contain the list of objects it currently contains, where each "object" has a direct pointer to its histogram element. In addition, each set should contain the number of objects matched so far, set to 0 at the beginning.
Technically we need a cross-linked list node, i.e. a structure that exists in 2 linked lists simultaneously: in the list of a histogram element, and in the list of the set. This node should also contain pointers to both the histogram item and the set. I call it a "cross-link".
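A rough Java sketch of that node layout (all names are mine, and the two owner classes are reduced to stubs):

// Hypothetical skeletons of the two owners; the real versions would carry
// the tree links, list heads and counters described above.
class HistogramItem { int objectId; CrossLink firstOccurrence; }
class SetRecord     { CrossLink firstObject; int matchedSoFar; }

// The "cross-link": one node that sits on two doubly linked lists at once,
// with direct pointers back to both the histogram item and the set.
class CrossLink {
    HistogramItem item;
    SetRecord set;
    CrossLink nextInItem, prevInItem; // neighbors in the item's list of occurrences
    CrossLink nextInSet,  prevInSet;  // neighbors in the set's list of objects
}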
Picking the most frequent object is just finding the maximum in the tree.
Adjusting such a histogram is O(M log(N)), where M is the number of elements currently affected, which is smaller than N if only a small number are affected.
And I'll also use your idea to build the smaller histogram and adjust the bigger.
Sounds right?
I denote the total size of sets with T. The solution I present works in time O(T log T log N).
For clarity, I denote by set the initial sets and by family the set of these sets.
Indeed, let's store a histogram. In the BuildSubChains function we maintain a histogram of all elements which are present in the sets at the moment, sorted by frequency. It may be something like std::set of pairs (frequency, value), maybe with cross-references so you can find an element by value. Now taking the most frequent element is straightforward: it is the first element in the histogram. However, maintaining it is trickier.
You split your family of sets into two subfamilies, one containing the most frequent element, one not. Let their total sizes be T' and T''. Take the family with the smaller total size and remove all elements of its sets from the histogram, building the new histogram as you go. Now you have a histogram for both families, and it is built in time O(min(T', T'') log N), where the log N comes from operations on std::set.
At first glance it seems that this works in quadratic time. However, it is faster. Take a look at any single element. Every time we explicitly remove this element from the histogram, the size of its family at least halves, so each element will directly participate in no more than log T removals. So there will be O(T log T) operations on histograms in total.
There might be a better solution if I knew the total size of sets. However, no solution can be faster than O(T), and this is only logarithmically slower.
There may be one more improvement: if you store in the histogram not only elements and frequencies, but also the sets that contain the element (simply another std::set for each element), you'll be able to efficiently select all sets that contain the most frequent element.
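For concreteness, here is a Java sketch of such a histogram, with TreeSet playing the role of the std::set of (frequency, value) pairs and a HashMap as the cross-reference by value (names are mine; the per-element set of containing sets from the last paragraph is omitted):

import java.util.HashMap;
import java.util.Map;
import java.util.TreeSet;

class Histogram {
    record Entry(int freq, int value) implements Comparable<Entry> {
        public int compareTo(Entry o) {
            return freq != o.freq ? Integer.compare(freq, o.freq)
                                  : Integer.compare(value, o.value);
        }
    }

    private final TreeSet<Entry> byFreq = new TreeSet<>();        // ordered (frequency, value) pairs
    private final Map<Integer, Integer> freqOf = new HashMap<>(); // value -> current frequency

    // Change the frequency of `value` by `delta` (negative when sets are removed):
    // two O(log n) TreeSet operations. Zero-frequency values vanish entirely.
    void adjust(int value, int delta) {
        int old = freqOf.getOrDefault(value, 0);
        if (old > 0) byFreq.remove(new Entry(old, value));
        int now = old + delta;
        if (now > 0) {
            byFreq.add(new Entry(now, value));
            freqOf.put(value, now);
        } else {
            freqOf.remove(value);
        }
    }

    int mostFrequent() {
        return byFreq.last().value(); // the pair with the highest frequency
    }
}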
Disclaimer: I really believe that this is not a duplicate of similar questions. I've read those, and they (mostly) recommend using a heap or a priority queue. My question is more of the "I don't understand how those would work in this case" kind.
In short:
I'm referring to the typical A* (A-star) pathfinding algorithm, as described (for example) on Wikipedia:
https://en.wikipedia.org/wiki/A*_search_algorithm
More specifically, I'm wondering about what's the best data structure (which can be a single well known data structure, or a combination of those) to use so that you never have O(n) performance on any of the operations that the algorithm requires to do on the open list.
As far as I understand (mostly from the Wikipedia article), the operations needed to be done on the open list are as follows:
The elements in this list need to be Node instances with the following properties:
position (or coordinates). For the sake of argument, let's say this is a positive integer ranging in value from 0 to 64516 (I'm limiting my A* area size to 254x254, which means that any set of coordinates can be bit-encoded on 16 bits)
F score. This is a positive floating-point value.
Given these, the operations are:
Add a node to the open list: if a node with the same position (coordinates) exists (but, potentially, with a different F score), replace it.
Retrieve (and remove) from the open list the node with the lowest F score
(Check if exists and) retrieve from the list a node for a given position (coordinates)
As far as I can see, the problems with using a heap or priority queue for the open list are:
These data structures use the F score as the sorting criterion
As such, adding a node to this kind of data structure is problematic: how do you optimally check that a node with the same set of coordinates (but a different F score) doesn't already exist? Furthermore, even if you somehow manage to do this check and actually find such a node, but it is not at the top of the heap/queue, how do you optimally remove it so that the heap/queue keeps its correct order?
Also, checking for existence and removing a node based on its position is not optimal or even possible: if we use a priority queue, we have to check every node in it, and remove the corresponding one if found. For a heap, if such a removal is necessary, I imagine that all remaining elements need to be extracted and re-inserted, so that the heap still remains a heap.
The only remaining operation where such a data structure would be good is removing the node with the lowest F score. In this case the operation would be O(log(n)).
Also, if we make a custom data structure, such as one that uses a hashtable (with position as key) plus a priority queue, we would still have operations that require suboptimal processing on one of them: in order to keep them in sync (both should contain the same nodes), any given operation will always be suboptimal on one of the two structures. Adding or removing a node by position would be fast on the hashtable but slow on the priority queue; removing the node with the lowest F score would be fast on the priority queue but slow on the hashtable.
What I've done is make a custom hashtable for the nodes that uses their position as key, and that also keeps track of the node with the lowest F score. When adding a new node, it checks whether its F score is lower than that of the currently stored lowest-F-score node, and if so, replaces it. The problem with this data structure comes when you want to remove a node (whether by position or the one with the lowest F score). In that case, in order to update the field holding the current lowest-F-score node, I need to iterate through all the remaining nodes to find which one has the lowest F score now.
So my question is: is there a better way to store these ?
You can combine the hash table and the heap without slow operations showing up.
Have the hash table map position to index in the heap instead of node.
Any update to the heap can sync itself to the hash table (which requires the heap to know about the hash table, so this is invasive, not just a wrapper around two off-the-shelf implementations) with as many updates (each O(1), obviously) as the number of items that move in the heap; of course only log n items can move for an insertion, remove-min, or update-key. The hash table finds the node (in the heap) whose key needs updating for the parent-updating/G-changing step of A*, so that's fast too.
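A sketch of that invasive combination in Java (names are mine; positions are the question's 16-bit encoded coordinates):

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

class OpenList {
    static class Node {
        final int pos;
        double f;
        Node(int pos, double f) { this.pos = pos; this.f = f; }
    }

    private Node[] heap = new Node[16];                            // binary min-heap on f
    private int size = 0;
    private final Map<Integer, Integer> indexOf = new HashMap<>(); // position -> heap slot

    // Insert, or replace the F score of the node already at this position.
    void addOrUpdate(int pos, double f) {
        Integer i = indexOf.get(pos);
        if (i == null) {
            if (size == heap.length) heap = Arrays.copyOf(heap, size * 2);
            heap[size] = new Node(pos, f);
            indexOf.put(pos, size);
            siftUp(size++);
        } else {
            double old = heap[i].f;
            heap[i].f = f;
            if (f < old) siftUp(i); else siftDown(i); // re-sift from where it sits
        }
    }

    Node popMin() {                                   // assumes the list is non-empty
        Node min = heap[0];
        indexOf.remove(min.pos);
        heap[0] = heap[--size];
        if (size > 0) { indexOf.put(heap[0].pos, 0); siftDown(0); }
        return min;
    }

    boolean contains(int pos) { return indexOf.containsKey(pos); }

    private void siftUp(int i) {
        while (i > 0) {
            int p = (i - 1) / 2;
            if (heap[p].f <= heap[i].f) break;
            swap(i, p); i = p;
        }
    }

    private void siftDown(int i) {
        for (;;) {
            int l = 2 * i + 1, r = l + 1, m = i;
            if (l < size && heap[l].f < heap[m].f) m = l;
            if (r < size && heap[r].f < heap[m].f) m = r;
            if (m == i) break;
            swap(i, m); i = m;
        }
    }

    private void swap(int a, int b) { // every move re-syncs the map: O(1) per moved item
        Node t = heap[a]; heap[a] = heap[b]; heap[b] = t;
        indexOf.put(heap[a].pos, a);
        indexOf.put(heap[b].pos, b);
    }
}

All three operations from the question - add-or-replace by position, pop-min, and lookup by position - are then O(log n) or better.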
Why is there no information on Google / Wikipedia about an unrolled skip list, i.e. a combination of an unrolled linked list and a skip list?
Probably because it wouldn't typically give you much of a performance improvement, if any, and it would be somewhat involved to code correctly.
First, the unrolled linked list typically uses a pretty small node size. As the Wikipedia article says: "just large enough so that the node fills a single cache line or a small multiple thereof." On modern Intel processors, a cache line is 64 bytes. Skip list nodes have, on average, two pointers per node, which means an average of 16 bytes per node for the forward pointers. Plus whatever the data for the node is: 4 or 8 bytes for a scalar value, or 8 bytes for a reference (I'm assuming a 64-bit machine here).
So figure 24 bytes, total, for an "element." Except that the elements aren't fixed size. They have a varying number of forward pointers. So you either need to make each element a fixed size by allocating an array for the maximum number of forward pointers for each element (which for a skip list with 32 levels would require 256 bytes), or use a dynamically allocated array that's the correct size. So your element becomes, in essence:
struct UnrolledSkipListElement
{
    void* data;                                  // 64-bit pointer to the data item
    UnrolledSkipListElement** forward_pointers;  // dynamically allocated array of node pointers
};
That would reduce your element size to just 16 bytes. But then you lose much of the cache-friendly behavior that you got from unrolling. To find out where you go next, you have to dereference the forward_pointers array, which is going to incur a cache miss, and therefore eliminate the savings you got by doing the unrolling. In addition, that dynamically allocated array of pointers isn't free: there's some (small) overhead involved in allocating that memory.
If you can find some way around that problem, you're still not going to gain much. A big reason for unrolling a linked list is that you must visit every node (up to the node you find) when you're searching it. So any time you can save with each link traversal adds up to very big savings. But with a skip list you make large jumps. In a perfectly organized skip list, for example, you could skip half the nodes on the first jump (if the node you're looking for is in the second half of the list). If your nodes in the unrolled skip list only contain four elements, then the only savings you gain will be at levels 0, 1, and 2. At higher levels you're skipping more than three nodes ahead and as a result you will incur a cache miss.
So the skip list isn't unrolled because it would be somewhat involved to implement and it wouldn't give you much of a performance boost, if any. And it might very well cause the list to be slower.
Linked list complexity is O(N).
Skip list complexity is O(log N).
Unrolled linked list complexity can be calculated as follows:
O(N / (M/2) + log M) = O(2N/M + log M)
where M is the number of elements in a single node.
Because the log M term is not significant, unrolled linked list complexity is O(N/M).
If we combine the skip list with the unrolled linked list, the new complexity will be
O(log N + "something from the unrolled linked list, such as N1/M")
This means the "new" complexity will not be as good as one might first think. The new complexity might even be worse than the original O(log N). The implementation will be more complex as well, so the gain is questionable and rather dubious.
Also, since a single node will hold lots of data but only a single "forward" array, the "tree" will not be as well balanced either, and this will ruin the O(log N) part of the equation.
I have been working on a problem from Glassdoor that was asked in an interview at a firm I'm hoping to join. The problem goes as follows:
If you have all the companies that are traded, and live inputs are coming in saying which company is being traded and with what volume, how do you maintain the data so that you can most efficiently give the top 10 most traded companies by share volume?
I thought of the following solution, though I am not sure whether it is an efficient one: what if you maintain a binary search tree? With every insert, you insert the company name and the volume of shares traded for it.
My basic node for the tree would then be:
class Node
{
    String key;   // company name
    int volume;   // volume of shares traded
    Node leftNode;
    Node rightNode;
}
So at every new trade I will keep on inserting into the tree. And at the time of final retrieval, I can run the following code until a global count reaches 10.
int count = 0; // the global count mentioned above

void traversal(Node a)
{
    if (a == null || count >= 10)
        return;
    traversal(a.getRightNode()); // reverse in-order: right subtree first
    if (count < 10)
    {
        System.out.println(a.getKey() + " " + a.getVolume());
        count++;
    }
    traversal(a.getLeftNode());
}
What are your views on this solution?
This question is very similar to another question but with a little twist. First of all, if somebody asked me this question I would ask a lot of questions in return. Do I know the names of the companies in advance? What is the number of companies? Is there an upper bound on their number? Do you mean time efficiency, memory-consumption efficiency, or a mix of both? What is the ratio of trades to top-companies requests? It is not specified, but I will assume a high volume of trades and displaying the Top 10 on demand or at some time interval. In the case of requesting the Top 10 after every trade arrival, a heap would be useless even for N bigger than 10, and the whole algorithm could be simpler. I also assume time efficiency. Memory efficiency is then constrained by CPU cache behaviour, so we should not waste it anyway.
So we will store the top N in some structure which gives me the least member fast. For big N this is obviously a heap. I can use any heap implementation, even one with bad IncKey and Merge operations, or without them at all. I will need only the Insert, Peek and Remove operations. The number 10 is a pretty small one, and I would not even need a heap for it, especially in compiled languages with a good compiler. I can use an ordered array or list, or even an unordered one. So in every place where I mention a heap below, you can use an ordered or unordered array or list. A heap is necessary only for bigger N in Top N.
So this is it: we will store the top N company names, each with its volume at the time it was inserted, in the heap.
Then we need to track each company's trade volume in some K/V storage. The key is the name. K/V storage for this purpose can be a hashmap, a trie or Judy. It will help if we know the company names in advance: that allows us to compute a perfect hash for the hashmap or construct an optimal trie. Otherwise it is nice to know an upper bound on the number of companies, to choose a good hash length and number of buckets; failing that, we will have to make a resizable hash or use Judy. There is no known trie implementation for dynamic K/V better than a hashmap or Judy. All of these K/V storages have O(k) access complexity, where k is the length of the key, which is the name in this case. In every place where I mention a hashmap below, you can use Judy or a trie. You can use a trie only when all of the company names are known in advance and you can tailor super fast optimized code.
So we will store the company name as key, and the trade volume so far plus a flag indicating heap membership as value, in the hashmap.
So here is the algorithm. We will have a state which contains the heap, the number of companies in the heap, and the hashmap. For each arriving company name and volume, we increase the volume in the hashmap. Then, if the number of companies in the heap is less than N (10), we add this company name and volume from the hashmap to the heap if it is not there yet (according to the flag, and we set this flag in the hashmap). Otherwise, if the heap is full and the current company is not in the heap, we peek into the heap, and if the current company has less volume traded so far (in the hashmap) than the company at the top of the heap, we can finish this trade and go on to the next one. Otherwise we have to update the companies in the heap first. While the company at the top of the heap (the one with the least volume) has a volume in the heap that is less than the current one's and also different from its volume in the hashmap, we update this volume. It can be done by removing it from the heap and inserting the right value. Then we check the top of the heap again, and so on. Note that we don't need to update all companies in the heap, and not even all out-of-date companies near the top: it's pretty lazy. If the current company still has a bigger volume than the top of the heap, we remove that company from the heap, insert the current one, and update the flags in the hashmap. That's all.
Brief recapitulation:
min-heap storing top N companies ordered by volume and containing company name or direct index into hashmap
volume in heap can be out of date
hashmap with company name as key and up-to-date volume and flag indicating heap member as value
first update the current company's volume in the hashmap and remember it
repeatedly update the heap top while it is out of date and less than the current traded company
remove the heap top if it is still less than the current company, and add the current one to the heap
This algorithm gains an advantage from the fact that trade volume can only grow, so a volume stored in the heap can only be less than or equal to the right value. So if the top of the heap has the least value in the heap, is still less than the right value, and is still bigger than any company in the hashmap, everything is fine. Otherwise we would have to store all companies in the heap, use a max-heap instead of a min-heap, implement IncKey, perform that operation for all trades, and keep back-references to the heap in the hashmap, and everything would be far more complicated.
The time complexity of processing a new trade is a nice O(1). O(1) is the hashmap lookup, O(1) is Peek in the heap. Insert and Delete in the heap are amortized O(1) or O(log N), where N is a constant, so still O(1). The number of updates in the heap is O(N), so O(1). You can also compute an upper bound on processing time when there is an upper bound on the number of companies (the hashmap-size problem mentioned at the beginning), so with a good implementation you can consider it real-time. Keep in mind that a simpler solution (an ordered or unordered list, updating all Top members, and so on) can bring better performance in compiled code for small N like 10, especially on modern HW.
This algorithm can be nicely implemented even in a functional language, except that there is no pure functional hashtable; a trie should have O(1)-ish behavior, or there will be some impure module for it. For example, here is an Erlang implementation using an ordered list as the heap and a dictionary as the hashmap. (My favorite functional heap is the pairing heap, but for 10 it is overkill.)
-module(top10trade).

% a local apply/2 clashes with the auto-imported BIF, so suppress the auto-import
-compile({no_auto_import, [apply/2]}).

-record(top10, {
    n = 0,
    heap = [],
    map = dict:new()
}).

-define(N, 10).

-export([new/0, trade/2, top/1, apply_list/2]).

new() ->
    #top10{}.

% heap is not full
trade({Name, Volume}, #top10{n = N, map = Map} = State) when N < ?N ->
    case dict:find(Name, Map) of
        % it's already in the heap, so update the hashmap only
        {ok, {V, true}} ->
            State#top10{map = dict:store(Name, {V + Volume, true}, Map)};
        % otherwise insert into the heap
        error ->
            State#top10{
                n = N + 1,
                heap = insert({Volume, Name}, State#top10.heap),
                map = dict:store(Name, {Volume, true}, Map)
            }
    end;
% heap is full
trade({Name, Volume}, #top10{n = ?N, map = Map} = State) ->
    % look up in the hashmap
    {NewVolume, InHeap} = NewVal = case dict:find(Name, Map) of
        {ok, {V, In}} -> {V + Volume, In};
        error -> {Volume, false}
    end,
    if
        InHeap ->
            State#top10{map = dict:store(Name, NewVal, Map)};
        true -> % current company is not in the heap, so peek into the heap and try to update
            update(NewVolume, Name, peek(State#top10.heap), State)
    end.

% current volume is smaller than the heap top, so store it in the hashmap only
update(Volume, Name, {TopVal, _}, #top10{map = Map} = State) when Volume < TopVal ->
    State#top10{map = dict:store(Name, {Volume, false}, Map)};
update(Volume, Name, {TopVal, TopName}, #top10{heap = Heap, map = Map} = State) ->
    case dict:fetch(TopName, Map) of
        % heap top is up to date and still less than current
        {TopVal, true} ->
            State#top10{
                % store current to heap
                heap = insert({Volume, Name}, delete(Heap)),
                map = dict:store( % update current and former heap-top records in hashmap
                    Name, {Volume, true},
                    dict:store(TopName, {TopVal, false}, Map)
                )
            };
        % heap needs update
        {NewVal, true} ->
            NewHeap = insert({NewVal, TopName}, delete(Heap)),
            update(Volume, Name, peek(NewHeap), State#top10{heap = NewHeap})
    end.

top(#top10{heap = Heap, map = Map}) ->
    % fetch up-to-date volumes from the hashmap
    % (in an impure language, updating the heap in place would be nice)
    [{Name, element(1, dict:fetch(Name, Map))}
     || {_, Name} <- lists:reverse(to_list(Heap))].

apply_list(L, State) ->
    lists:foldl(fun apply/2, State, L).

apply(top, State) ->
    io:format("Top 10: ~p~n", [top(State)]),
    State;
apply({_, _} = T, State) ->
    trade(T, State).

%%%% Heap as an ordered list
insert(X, []) -> [X];
insert(X, [H | _] = L) when X < H -> [X | L];
insert(X, [H | T]) -> [H | insert(X, T)].

-compile({inline, [delete/1, peek/1, to_list/1]}).
delete(L) -> tl(L).
peek(L) -> hd(L).
to_list(L) -> L.
It performs a nice 600k trades per second. I would expect a few million per second from a C implementation, depending on the number of companies. More companies means slower K/V lookup and update.
You can do it using a min binary heap, where you maintain a heap of size 10: every time a new company has a greater volume than the top element, you delete the top and insert the new company into the heap. All the elements currently in the heap are the current top 10 companies.
Note: add the first 10 companies at the start.
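A minimal Java sketch of this answer (it assumes each company arrives once with its final volume; a live stream of incremental trades needs the hashmap bookkeeping from the previous answer):

import java.util.PriorityQueue;

class Top10 {
    record Company(String name, long volume) {}

    // Min-heap on volume, capped at 10 elements.
    private final PriorityQueue<Company> heap =
        new PriorityQueue<>((a, b) -> Long.compare(a.volume(), b.volume()));

    void offer(String name, long volume) {
        if (heap.size() < 10) {
            heap.add(new Company(name, volume));   // the first 10 go straight in
        } else if (volume > heap.peek().volume()) {
            heap.poll();                           // evict the current smallest
            heap.add(new Company(name, volume));
        }
    }
}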
Well, there are trade-offs here. You are going to need to choose what you prefer - an efficient look-up (get top K) or an efficient insertion. As it seems, you cannot get both.
You can get O(logN) insertion and lookup by using two Data-structures:
Map<String,Node> - maps from the company name to a node in the second data structure. This will be a trie or a self-balancing tree.
Map<Integer,String> - maps from volumes to the company's name. This can be a map (hash/tree based) or it can also be a heap; since we have a link to the direct node, we can actually delete a node efficiently when needed.
Getting the top 10 can be done on the 2nd data structure in O(logN), and inserting each element requires looking up by string - O(|S| * logN) (you can use a trie to get O(|S|) here) - and then modifying the second tree - which is O(logN).
Using a trie totals in O(|S|+logN) complexity for both get top K and insertions.
If the number of insertions is exponential in the number of getTopK() ops - it will be better to just keep a HashMap<String,Integer> and modify it as new data arrives, and when you get a getTopK() - do it in O(N) as described in this thread - using a selection algorithm or a heap.
This results in O(|S|) insertion (on average) and O(N + |S|) get top K.
|S| is the length of the input/result string where it appears.
This answer assumes each company can appear more than once in the input stream.
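A Java sketch of this two-structure design, using TreeMap as the balanced tree; since two companies can share a volume, each volume maps to a set of names rather than a single name (all names here are mine):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

class VolumeIndex {
    private final Map<String, Long> volumeOf = new HashMap<>();              // name -> total volume
    private final TreeMap<Long, TreeSet<String>> byVolume = new TreeMap<>(); // volume -> names

    void trade(String name, long shares) {
        long old = volumeOf.getOrDefault(name, 0L);
        if (old > 0) {                                  // unlink from the old volume bucket
            TreeSet<String> bucket = byVolume.get(old);
            bucket.remove(name);
            if (bucket.isEmpty()) byVolume.remove(old);
        }
        long now = old + shares;
        volumeOf.put(name, now);
        byVolume.computeIfAbsent(now, v -> new TreeSet<>()).add(name); // O(log N)
    }

    List<String> topK(int k) {                          // walk buckets from the largest volume down
        List<String> top = new ArrayList<>();
        for (Long v : byVolume.descendingKeySet())
            for (String name : byVolume.get(v)) {
                if (top.size() == k) return top;
                top.add(name);
            }
        return top;
    }
}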
So, suppose you have a collection of items. Each item has an identifier which can be represented using a bitfield. As a simple example, suppose your collection is:
0110, 0111, 1001, 1011, 1110, 1111
So, you then want to implement a function, Remove(bool bitval, int position). For example, a call to Remove(0, 2) would remove all items where index 2(i.e. 3rd bit) was 0. In this case, that would be 1001, only. Remove(1,1) would remove 1110, 1111, 0111, and 0110. It is trivial to come up with an O(n) collection where this is possible (just use a linked list), with n being the number of items in the collection. In general the number of items to be removed is going to be O(n) (assuming a given bit has a ≥ c% chance of being 1 and a ≥ c% chance of being 0, where c is some constant > 0), so "better" algorithms which somehow are O(l), with l being the number of items being removed, are unexciting.
Is it possible to define a data structure where the average (or better yet, worst case) removal time is better than O(n)? A binary tree can do pretty well (just remove all left/right branches at the height m, where m is the index being tested), but I'm wondering if there is any way to do better (and quite honestly, I'm not sure how to remove all left or right branches at a particular height in an efficient manner). Alternatively, is there a proof that doing better is not possible?
Edit: I'm not sure exactly what I'm expecting in terms of efficiency (sorry Arno), but a basic explanation of its possible application is thus: Suppose we are working with a binary decision tree. Such a tree could be used for a game tree or a puzzle solver or whatever. Further suppose the tree is small enough that we can fit all of the leaf nodes into memory. Each such node is basically just a bitfield listing all of the decisions. Now, if we want to prune arbitrary decisions from this tree, one method would be to just jump to the height where a particular decision is made and prune the left or right side of every node (left meaning one decision, right meaning the other). Normally in a decision tree you only want to prune one subtree at a time (since the parent of that subtree is different from the parent of other subtrees, and thus the decision which should be pruned in one subtree should not be pruned from others), but in some types of situations this may not be the case. Further, you normally only want to prune everything below a particular node, but in this case you'll be leaving some stuff below the node but also pruning below other nodes in the tree.
Anyhow, this is somewhat of a question based on curiosity; I'm not sure it's practical to use any results, but am interested in what people have to say.
Edit:
Thinking about it further, I think the tree method is actually O(n / log n), assuming it's reasonably dense. Proof:
Suppose you have a binary tree with n items. Its height is log(n). Removing half the bottom row requires n/2 removals, removing half the row above requires n/4, and so on. The total over all rows is n/2 + n/4 + ... = n-1 operations, spread across log(n) possible removal heights, so the average number of removals per operation is (n-1)/log(n).
Provided the length of your bitfields is limited, the following may work:
First, represent the bitfields that are in the set as an array of booleans, so in your case (4 bit bitfields), new bool[16];
Transform this array of booleans into a bitfield itself, so a 16-bit bitfield in this case, where each bit represents whether the bitfield corresponding to its index is included
Then operations become:
Remove(0, 0) = and with bitmask 1010101010101010
Remove(1, 0) = and with bitmask 0101010101010101
Remove(0, 2) = and with bitmask 1111000011110000
Note that more complicated 'add/remove' operations could then also be added as O(1) bit-logic.
The only down-side is that extra work is needed to interpret the resulting 16-bit bitfield back into a set of values, but with lookup arrays that might not turn out too bad either.
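A Java sketch of this for the 4-bit case (names are mine; bit positions here count from the least significant bit, so flip the indexing if you count from the left as the question's example does):

class BitfieldSet {
    private int members; // bit v is set <=> the 4-bit value v is in the collection

    BitfieldSet(int... values) {
        for (int v : values) members |= 1 << v;
    }

    // Remove every item whose bit at `position` equals `bitval`: a single AND.
    void remove(int bitval, int position) {
        int keep = 0;
        for (int v = 0; v < 16; v++)  // only 2*4 possible masks; precompute them in practice
            if (((v >> position) & 1) != bitval)
                keep |= 1 << v;
        members &= keep;
    }

    boolean contains(int v) {
        return (members & (1 << v)) != 0;
    }
}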
Addendum:
Additional down-sides:
Once the size of an integer is exceeded, every added bit in the original bitfields will double the storage space. However, this is not much worse than a typical scenario using another collection where you have to store on average half the possible bitmask values (provided the typical scenario doesn't store far fewer remaining values).
Once the size of an integer is exceeded, every added bit also doubles the number of 'and' operations needed to implement the logic.
So basically, I'd say if your original bitfields are not much larger than a byte, you are likely better off with this encoding, beyond that you're probably better off with the original strategy.
Further addendum:
If you only ever execute Remove operations, which over time thins out the set state-space further and further, you may be able to stretch this approach a bit further (no pun intended) by making a more clever abstraction that somehow only keeps track of the int values that are non-zero. Detecting zero values may not be as expensive as it sounds either if the JIT knows what it's doing, because a CPU 'and' operation typically sets the 'zero' flag if the result is zero.
As with all performance optimizations, this one would need some measurement to determine if it is worthwhile.
If each decision bit and position are listed as objects, {bit value, k-th position}, you would end up with an array of length 2*k. If you link to each of these array positions from your item, represented as a linked list (each of length k), using a pointer to the {bit, position} object as the node value, you can "invalidate" a bunch of items by simply deleting the {bit, position} object. This would require you, upon searching the list of items, to find "complete" items (it makes search REALLY slow?).
So something like:
[{0,0}, {1,0}, {0,1}, {1, 1}, {0,2}, {1, 2}, {0,3}, {1,3}]
and linked from "0100", represented as: {0->3->4->6}
You wouldn't know which items were invalid until you tried to find them (so it doesn't really limit your search space, which is what you're after).
Oh well, I tried.
Sure, it is possible (even if this is "cheating"). Just keep a stack of Remove objects:
struct Remove {
    bool set;
    int index;
};
The remove function just pushes an object onto the stack. Voilà, O(1).
If you wanted to get fancy, your stack couldn't exceed (number of bits) without containing duplicate or impossible scenarios.
The rest of the collection has to apply the logic whenever things are withdrawn or iterated over.
Two ways to do insert into the collection:
Apply the Remove rules upon insert, to clear out the stack, making insert O(n). Gotta pay somewhere.
Each bitfield has to store its index into the remove stack, to know which rules apply to it. Then the stack size limit above wouldn't matter.
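A Java sketch of the stack idea (names are mine; bit positions count from the LSB):

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.IntConsumer;

class LazyBitfieldSet {
    record Rule(int bitval, int position) {}

    private final int[] items;                               // the raw bitfield values
    private final Deque<Rule> removes = new ArrayDeque<>();

    LazyBitfieldSet(int... items) {
        this.items = items;
    }

    void remove(int bitval, int position) {
        removes.push(new Rule(bitval, position));            // O(1), as promised
    }

    private boolean alive(int v) {                           // the deferred cost, paid per read
        for (Rule r : removes)
            if (((v >> r.position()) & 1) == r.bitval())
                return false;
        return true;
    }

    void forEachAlive(IntConsumer f) {                       // iteration applies the logic
        for (int v : items)
            if (alive(v)) f.accept(v);
    }
}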
If you use an array to store your binary tree, you can quickly index any element (the children of the node at index n are at indices (n+1)*2-1 and (n+1)*2). All the nodes at a given level are stored sequentially. The first node at level x is at index 2^x - 1, and there are 2^x elements at that level.
Unfortunately, I don't think this really gets you much of anywhere from a complexity standpoint. Removing all the left nodes at a level is O(n/2) worst case, which is of course O(n). Of course the actual work depends on which bit you are checking, so the average may be somewhat better. This also requires O(2^n) memory which is much worse than the linked list and not practical at all.
I think what this problem is really asking is for a way to efficiently partition a set of sets into two sets. Using a bitset to describe the set gives you a fast check for membership, but doesn't seem to lend itself to making the problem any easier.