Keep track of the result of accumulation for operation without inverse - algorithm

I have an operation A * A -> A, which is commutative and associative. This means the order I apply it in doesn't matter, as long as I use the same elements. Nice.
I have to apply it to a list of values. To be more precise, I have to use it as the operation to accumulate the values of the list. So far, so good.
I then have a series of requests to add an element to the list, or erase it from the list. After each insertion or deletion, I have to return the new accumulated value for the new list. Simple, right?
The problem is I don't have an inverse; that is no operation '/' able to remove b if I only know a * b and tell me the other operand must have been a. (in fact, there isn't even an identity element)
So, my only obvious option is to accumulate again at every deletion -in linear time.
Can I do better? I've thought a lot about it.
And the answer is, of course I can... if I really want: I need to implement a custom binary tree, maybe a red/black one to have good worst case guarantees. Have next to the value an additional cache storing the result of the whole subtree.
cache = value * left.cache * right.cache
Maintain this invariant after every operation; then the root cache is the result.
However, "implement a custom R/B tree while maintaining an additional invariant" isn't something I'm particularly comfortable at doing. Well I would do it, but not swear by its correctness. Plus, the constant before the log would probably be significant. It seems pretty unwieldy, to do a simple thing like keeping track of an accumulation.
Does anyone see a better solution?
For completeness: the operation is a union of filters. A filter is a couple (code, mask), and a value "passes the filter" if (C bitwise operators) (value ^ code) & mask == 0; that is, if its bit corresponding to bits set in mask are equal to the corresponding bits in code. The union therefore sets to 0 (ignored) the bits where masks or codes differ, and keeps the ones which are the same.
Bonus appreciation to anyone finding a way to exploit the specific properties of the operation to get a solution more efficient than it is possible for the general problem I abstracted! ;-)

For your specific problem you could keep track for each bit x:
The total number of times that bit x is set to 1 in a mask
The total number of times that bit x is set to 1 in a mask and bit x of code is equal to 0
The total number of times that bit x is set to 1 in a mask and bit x of code is equal to 1
With these 3 counts (for each bit) it is straightforward to compute the union of all the filters.
The complexity is O(R) (where R is the number of bits in mask) to add or remove a filter.

Related

Adding, Removing and First missing positive integer

Given an empty list. There are three types of queries 1, 2, 3.
Query 1 x where x is a positive integer indicates adding the number x into the list.
Query 2 x indicates removing x from the list.
Query 3 indicates printing the smallest positive integer not present in the array.
Here x can be from 1 upto 10^9 and number of queries upto 10^5. For the large range of x I can't keep a boolean array marking visited integers. How should I approach?
There are too many unknowns about your data to give a definitive answer here. The approach differs a lot between at least these different cases:
Few values.
A lot of values but with large gaps.
A lot of values with only small gaps.
Almost all values.
It also depends on which ones of the mentioned operations that you will do the most.
It is less than 1 GB of data so it is possible to keep it as a bit array in memory on most machines. But if the data set is sparse (case 1 and 2 above) you might want to consider sparse arrays instead, or for very sparse sets (case 1) perhaps a binary search tree or a min-heap. The heap is probably not a good idea if you are going to use operation 2 a lot.
For case 1, 2 and 4 you might consider a range tree. The upside to this solution is that you can do operation 3 in logarithmic time just by going leftwards down the tree and look at the first range.
It might also be possible to page out your datastructure to disk if you are not going to do a lot of random insertions.
You might also consider speeding up the search with a Bloom filter, depending on what type of datastructure you choose in the end.

Are there sorting algorithms that respect final position restrictions and run in O(n log n) time?

I'm looking for a sorting algorithm that honors a min and max range for each element1. The problem domain is a recommendations engine that combines a set of business rules (the restrictions) with a recommendation score (the value). If we have a recommendation we want to promote (e.g. a special product or deal) or an announcement we want to appear near the top of the list (e.g. "This is super important, remember to verify your email address to participate in an upcoming promotion!") or near the bottom of the list (e.g. "If you liked these recommendations, click here for more..."), they will be curated with certain position restriction in place. For example, this should always be the top position, these should be in the top 10, or middle 5 etc. This curation step is done ahead of time and remains fixed for a given time period and for business reasons must remain very flexible.
Please don't question the business purpose, UI or input validation. I'm just trying to implement the algorithm in the constraints I've been given. Please treat this as an academic question. I will endeavor to provide a rigorous problem statement, and feedback on all other aspects of the problem is very welcome.
So if we were sorting chars, our data would have a structure of
struct {
char value;
Integer minPosition;
Integer maxPosition;
}
Where minPosition and maxPosition may be null (unrestricted). If this were called on an algorithm where all positions restrictions were null, or all minPositions were 0 or less and all maxPositions were equal to or greater than the size of the list, then the output would just be chars in ascending order.
This algorithm would only reorder two elements if the minPosition and maxPosition of both elements would not be violated by their new positions. An insertion-based algorithm which promotes items to the top of the list and reorders the rest has obvious problems in that every later element would have to be revalidated after each iteration; in my head, that rules out such algorithms for having O(n3) complexity, but I won't rule out such algorithms without considering evidence to the contrary, if presented.
In the output list, certain elements will be out of order with regard to their value, if and only if the set of position constraints dictates it. These outputs are still valid.
A valid list is any list where all elements are in a position that does not conflict with their constraints.
An optimal list is a list which cannot be reordered to more closely match the natural order without violating one or more position constraint. An invalid list is never optimal. I don't have a strict definition I can spell out for 'more closely matching' between one ordering or another. However, I think it's fairly easy to let intuition guide you, or choose something similar to a distance metric.
Multiple optimal orderings may exist if multiple inputs have the same value. You could make an argument that the above paragraph is therefore incorrect, because either one can be reordered to the other without violating constraints and therefore neither can be optimal. However, any rigorous distance function would treat these lists as identical, with the same distance from the natural order and therefore reordering the identical elements is allowed (because it's a no-op).
I would call such outputs the correct, sorted order which respects the position constraints, but several commentators pointed out that we're not really returning a sorted list, so let's stick with 'optimal'.
For example, the following are a input lists (in the form of <char>(<minPosition>:<maxPosition>), where Z(1:1) indicates a Z that must be at the front of the list and M(-:-) indicates an M that may be in any position in the final list and the natural order (sorted by value only) is A...M...Z) and their optimal orders.
Input order
A(1:1) D(-:-) C(-:-) E(-:-) B(-:-)
Optimal order
A B C D E
This is a trivial example to show that the natural order prevails in a list with no constraints.
Input order
E(1:1) D(2:2) C(3:3) B(4:4) A(5:5)
Optimal order
E D C B A
This example is to show that a fully constrained list is output in the same order it is given. The input is already a valid and optimal list. The algorithm should still run in O(n log n) time for such inputs. (Our initial solution is able to short-circuit any fully constrained list to run in linear time; I added the example both to drive home the definitions of optimal and valid and because some swap-based algorithms I considered handled this as the worse case.)
Input order
E(1:1) C(-:-) B(1:5) A(4:4) D(2:3)
Optimal Order
E B D A C
E is constrained to 1:1, so it is first in the list even though it has the lowest value. A is similarly constrained to 4:4, so it is also out of natural order. B has essentially identical constraints to C and may appear anywhere in the final list, but B will be before C because of value. D may be in positions 2 or 3, so it appears after B because of natural ordering but before C because of its constraints.
Note that the final order is correct despite being wildly different from the natural order (which is still A,B,C,D,E). As explained in the previous paragraph, nothing in this list can be reordered without violating the constraints of one or more items.
Input order
B(-:-) C(2:2) A(-:-) A(-:-)
Optimal order
A(-:-) C(2:2) A(-:-) B(-:-)
C remains unmoved because it already in its only valid position. B is reordered to the end because its value is less than both A's. In reality, there will be additional fields that differentiate the two A's, but from the standpoint of the algorithm, they are identical and preserving OR reversing their input ordering is an optimal solution.
Input order
A(1:1) B(1:1) C(3:4) D(3:4) E(3:4)
Undefined output
This input is invalid for two reasons: 1) A and B are both constrained to position 1 and 2) C, D, and E are constrained to a range than can only hold 2 elements. In other words, the ranges 1:1 and 3:4 are over-constrained. However, the consistency and legality of the constraints are enforced by UI validation, so it's officially not the algorithms problem if they are incorrect, and the algorithm can return a best-effort ordering OR the original ordering in that case. Passing an input like this to the algorithm may be considered undefined behavior; anything can happen. So, for the rest of the question...
All input lists will have elements that are initially in valid positions.
The sorting algorithm itself can assume the constraints are valid and an optimal order exists.2
We've currently settled on a customized selection sort (with runtime complexity of O(n2)) and reasonably proved that it works for all inputs whose position restrictions are valid and consistent (e.g. not overbooked for a given position or range of positions).
Is there a sorting algorithm that is guaranteed to return the optimal final order and run in better than O(n2) time complexity?3
I feel that a library standard sorting algorithm could be modified to handle these constrains by providing a custom comparator that accepts the candidate destination position for each element. This would be equivalent to the current position of each element, so maybe modifying the value holding class to include the current position of the element and do the extra accounting in the comparison (.equals()) and swap methods would be sufficient.
However, the more I think about it, an algorithm that runs in O(n log n) time could not work correctly with these restrictions. Intuitively, such algorithms are based on running n comparisons log n times. The log n is achieved by leveraging a divide and conquer mechanism, which only compares certain candidates for certain positions.
In other words, input lists with valid position constraints (i.e. counterexamples) exist for any O(n log n) sorting algorithm where a candidate element would be compared with an element (or range in the case of Quicksort and variants) with/to which it could not be swapped, and therefore would never move to the correct final position. If that's too vague, I can come up with a counter example for mergesort and quicksort.
In contrast, an O(n2) sorting algorithm makes exhaustive comparisons and can always move an element to its correct final position.
To ask an actual question: Is my intuition correct when I reason that an O(n log n) sort is not guaranteed to find a valid order? If so, can you provide more concrete proof? If not, why not? Is there other existing research on this class of problem?
1: I've not been able to find a set of search terms that points me in the direction of any concrete classification of such sorting algorithm or constraints; that's why I'm asking some basic questions about the complexity. If there is a term for this type of problem, please post it up.
2: Validation is a separate problem, worthy of its own investigation and algorithm. I'm pretty sure that the existence of a valid order can be proven in linear time:
Allocate array of tuples of length equal to your list. Each tuple is an integer counter k and a double value v for the relative assignment weight.
Walk the list, adding the fractional value of each elements position constraint to the corresponding range and incrementing its counter by 1 (e.g. range 2:5 on a list of 10 adds 0.4 to each of 2,3,4, and 5 on our tuple list, incrementing the counter of each as well)
Walk the tuple list and
If no entry has value v greater than the sum of the series from 1 to k of 1/k, a valid order exists.
If there is such a tuple, the position it is in is over-constrained; throw an exception, log an error, use the doubles array to correct the problem elements etc.
Edit: This validation algorithm itself is actually O(n2). Worst case, every element has the constraints 1:n, you end up walking your list of n tuples n times. This is still irrelevant to the scope of the question, because in the real problem domain, the constraints are enforced once and don't change.
Determining that a given list is in valid order is even easier. Just check each elements current position against its constraints.
3: This is admittedly a little bit premature optimization. Our initial use for this is for fairly small lists, but we're eyeing expansion to longer lists, so if we can optimize now we'd get small performance gains now and large performance gains later. And besides, my curiosity is piqued and if there is research out there on this topic, I would like to see it and (hopefully) learn from it.
On the existence of a solution: You can view this as a bipartite digraph with one set of vertices (U) being the k values, and the other set (V) the k ranks (1 to k), and an arc from each vertex in U to its valid ranks in V. Then the existence of a solution is equivalent to the maximum matching being a bijection. One way to check for this is to add a source vertex with an arc to each vertex in U, and a sink vertex with an arc from each vertex in V. Assign each edge a capacity of 1, then find the max flow. If it's k then there's a solution, otherwise not.
http://en.wikipedia.org/wiki/Maximum_flow_problem
--edit-- O(k^3) solution: First sort to find the sorted rank of each vertex (1-k). Next, consider your values and ranks as 2 sets of k vertices, U and V, with weighted edges from each vertex in U to all of its legal ranks in V. The weight to assign each edge is the distance from the vertices rank in sorted order. E.g., if U is 10 to 20, then the natural rank of 10 is 1. An edge from value 10 to rank 1 would have a weight of zero, to rank 3 would have a weight of 2. Next, assume all missing edges exist and assign them infinite weight. Lastly, find the "MINIMUM WEIGHT PERFECT MATCHING" in O(k^3).
http://www-math.mit.edu/~goemans/18433S09/matching-notes.pdf
This does not take advantage of the fact that the legal ranks for each element in U are contiguous, which may help get the running time down to O(k^2).
Here is what a coworker and I have come up with. I think it's an O(n2) solution that returns a valid, optimal order if one exists, and a closest-possible effort if the initial ranges were over-constrained. I just tweaked a few things about the implementation and we're still writing tests, so there's a chance it doesn't work as advertised. This over-constrained condition is detected fairly easily when it occurs.
To start, things are simplified if you normalize your inputs to have all non-null constraints. In linear time, that is:
for each item in input
if an item doesn't have a minimum position, set it to 1
if an item doesn't have a maximum position, set it to the length of your list
The next goal is to construct a list of ranges, each containing all of the candidate elements that have that range and ordered by the remaining capacity of the range, ascending so ranges with the fewest remaining spots are on first, then by start position of the range, then by end position of the range. This can be done by creating a set of such ranges, then sorting them in O(n log n) time with a simple comparator.
For the rest of this answer, a range will be a simple object like so
class Range<T> implements Collection<T> {
int startPosition;
int endPosition;
Collection<T> items;
public int remainingCapacity() {
return endPosition - startPosition + 1 - items.size();
}
// implement Collection<T> methods, passing through to the items collection
public void add(T item) {
// Validity checking here exposes some simple cases of over-constraining
// We'll catch these cases with the tricky stuff later anyways, so don't choke
items.add(item);
}
}
If an element A has range 1:5, construct a range(1,5) object and add A to its elements. This range has remaining capacity of 5 - 1 + 1 - 1 (max - min + 1 - size) = 4. If an element B has range 1:5, add it to your existing range, which now has capacity 3.
Then it's a relatively simple matter of picking the best element that fits each position 1 => k in turn. Iterate your ranges in their sorted order, keeping track of the best eligible element, with the twist that you stop looking if you've reached a range that has a remaining size that can't fit into its remaining positions. This is equivalent to the simple calculation range.max - current position + 1 > range.size (which can probably be simplified, but I think it's most understandable in this form). Remove each element from its range as it is selected. Remove each range from your list as it is emptied (optional; iterating an empty range will yield no candidates. That's a poor explanation, so lets do one of our examples from the question. Note that C(-:-) has been updated to the sanitized C(1:5) as described in above.
Input order
E(1:1) C(1:5) B(1:5) A(4:4) D(2:3)
Built ranges (min:max) <remaining capacity> [elements]
(1:1)0[E] (4:4)0[A] (2:3)1[D] (1:5)3[C,B]
Find best for 1
Consider (1:1), best element from its list is E
Consider further ranges?
range.max - current position + 1 > range.size ?
range.max = 1; current position = 1; range.size = 1;
1 - 1 + 1 > 1 = false; do not consider subsequent ranges
Remove E from range, add to output list
Find best for 2; current range list is:
(4:4)0[A] (2:3)1[D] (1:5)3[C,B]
Consider (4:4); skip it because it is not eligible for position 2
Consider (2:3); best element is D
Consider further ranges?
3 - 2 + 1 > 1 = true; check next range
Consider (2:5); best element is B
End of range list; remove B from range, add to output list
An added simplifying factor is that the capacities do not need to be updated or the ranges reordered. An item is only removed if the rest of the higher-sorted ranges would not be disturbed by doing so. The remaining capacity is never checked after the initial sort.
Find best for 3; output is now E, B; current range list is:
(4:4)0[A] (2:3)1[D] (1:5)3[C]
Consider (4:4); skip it because it is not eligible for position 3
Consider (2:3); best element is D
Consider further ranges?
same as previous check, but current position is now 3
3 - 3 + 1 > 1 = false; don't check next range
Remove D from range, add to output list
Find best for 4; output is now E, B, D; current range list is:
(4:4)0[A] (1:5)3[C]
Consider (4:4); best element is A
Consider further ranges?
4 - 4 + 1 > 1 = false; don't check next range
Remove A from range, add to output list
Output is now E, B, D, A and there is one element left to be checked, so it gets appended to the end. This is the output list we desired to have.
This build process is the longest part. At its core, it's a straightforward n2 selection sorting algorithm. The range constraints only work to shorten the inner loop and there is no loopback or recursion; but the worst case (I think) is still sumi = 0 n(n - i), which is n2/2 - n/2.
The detection step comes into play by not excluding a candidate range if the current position is beyond the end of that ranges max position. You have to track the range your best candidate came from in order to remove it, so when you do the removal, just check if the position you're extracting the candidate for is greater than that ranges endPosition.
I have several other counter-examples that foiled my earlier algorithms, including a nice example that shows several over-constraint detections on the same input list and also how the final output is closest to the optimal as the constraints will allow. In the mean time, please post any optimizations you can see and especially any counter examples where this algorithm makes an objectively incorrect choice (i.e. arrives at an invalid or suboptimal output when one exists).
I'm not going to accept this answer, because I specifically asked if it could be done in better than O(n2). I haven't wrapped my head around the constraints satisfaction approach in #DaveGalvin's answer yet and I've never done a maximum flow problem, but I thought this might be helpful for others to look at.
Also, I discovered the best way to come up with valid test data is to start with a valid list and randomize it: for 0 -> i, create a random value and constraints such that min < i < max. (Again, posting it because it took me longer than it should have to come up with and others might find it helpful.)
Not likely*. I assume you mean average run time of O(n log n) in-place, non-stable, off-line. Most Sorting algorithms that improve on bubble sort average run time of O(n^2) like tim sort rely on the assumption that comparing 2 elements in a sub set will produce the same result in the super set. A slower variant of Quicksort would be a good approach for your range constraints. The worst case won't change but the average case will likely decrease and the algorithm will have the extra constraint of a valid sort existing.
Is ... O(n log n) sort is not guaranteed to find a valid order?
All popular sort algorithms I am aware of are guaranteed to find an order so long as there constraints are met. Formal analysis (concrete proof) is on each sort algorithems wikepedia page.
Is there other existing research on this class of problem?
Yes; there are many journals like IJCSEA with sorting research.
*but that depends on your average data set.

Best data structure to store lots one bit data

I want to store lots of data so that
they can be accessed by an index,
each data is just yes and no (so probably one bit is enough for each)
I am looking for the data structure which has the highest performance and occupy least space.
probably storing data in a flat memory, one bit per data is not a good choice on the other hand using different type of tree structures still use lots of memory (e.g. pointers in each node are required to make these tree even though each node has just one bit of data).
Does anyone have any Idea?
What's wrong with using a single block of memory and either storing 1 bit per byte (easy indexing, but wastes 7 bits per byte) or packing the data (slightly trickier indexing, but more memory efficient) ?
Well in Java the BitSet might be a good choice http://download.oracle.com/javase/6/docs/api/java/util/BitSet.html
If I understand your question correctly you should store them in an unsigned integer where you assign each value to a bit of the integer (flag).
Say you represent 3 values and they can be on or off. Then you assign the first to 1, the second to 2 and the third to 4. Your unsigned int can then be 0,1,2,3,4,5,6 or 7 depending on which values are on or off and you check the values using bitwise comparison.
Depends on the language and how you define 'index'. If you mean that the index operator must work, then your language will need to be able to overload the index operator. If you don't mind using an index macro or function, you can access the nth element by dividing the given index by the number of bits in your type (say 8 for char, 32 for uint32_t and variants), then return the result of arr[n / n_bits] & (1 << (n % n_bits))
Have a look at a Bloom Filter: http://en.wikipedia.org/wiki/Bloom_filter
It performs very well and is space-efficient. But make sure you read the fine print below ;-): Quote from the above wiki page.
An empty Bloom filter is a bit array
of m bits, all set to 0. There must
also be k different hash functions
defined, each of which maps or hashes
some set element to one of the m array
positions with a uniform random
distribution. To add an element, feed
it to each of the k hash functions to
get k array positions. Set the bits at
all these positions to 1. To query for
an element (test whether it is in the
set), feed it to each of the k hash
functions to get k array positions. If
any of the bits at these positions are
0, the element is not in the set – if
it were, then all the bits would have
been set to 1 when it was inserted. If
all are 1, then either the element is
in the set, or the bits have been set
to 1 during the insertion of other
elements. The requirement of designing
k different independent hash functions
can be prohibitive for large k. For a
good hash function with a wide output,
there should be little if any
correlation between different
bit-fields of such a hash, so this
type of hash can be used to generate
multiple "different" hash functions by
slicing its output into multiple bit
fields. Alternatively, one can pass k
different initial values (such as 0,
1, ..., k − 1) to a hash function that
takes an initial value; or add (or
append) these values to the key. For
larger m and/or k, independence among
the hash functions can be relaxed with
negligible increase in false positive
rate (Dillinger & Manolios (2004a),
Kirsch & Mitzenmacher (2006)).
Specifically, Dillinger & Manolios
(2004b) show the effectiveness of
using enhanced double hashing or
triple hashing, variants of double
hashing, to derive the k indices using
simple arithmetic on two or three
indices computed with independent hash
functions. Removing an element from
this simple Bloom filter is
impossible. The element maps to k
bits, and although setting any one of
these k bits to zero suffices to
remove it, this has the side effect of
removing any other elements that map
onto that bit, and we have no way of
determining whether any such elements
have been added. Such removal would
introduce a possibility for false
negatives, which are not allowed.
One-time removal of an element from a
Bloom filter can be simulated by
having a second Bloom filter that
contains items that have been removed.
However, false positives in the second
filter become false negatives in the
composite filter, which are not
permitted. In this approach re-adding
a previously removed item is not
possible, as one would have to remove
it from the "removed" filter. However,
it is often the case that all the keys
are available but are expensive to
enumerate (for example, requiring many
disk reads). When the false positive
rate gets too high, the filter can be
regenerated; this should be a
relatively rare event.

Algorithm for non-contiguous netmask match

I have to write a really really fast algorithm to match an IP address to a list of groups, where each group is defined using a notation like 192.168.0.0/252.255.0.255. As you can see, the bitmask can contain zeros even in the middle, so the traditional "longest prefix match" algorithms won't work. If an IP matches two groups, it will be assigned to the group containing most 1's in the netmask.
I'm not working with many entries (let's say < 1000) and I don't want to use a data structure requiring a large memory footprint (let's say > 1-2 MB), but it really has to be fast (of course I can't afford a linear search).
Do you have any suggestion? Thanks guys.
UPDATE: I found something quite interesting at http://www.cse.usf.edu/~ligatti/papers/grouper-conf.pdf, but it's still too memory-hungry for my utopic use case
If you know how many IP addresses you'll be dealing with initially, I'd say use a Hash Map structure. For the keys of this map, convert the IP into an integer-type structure. Hash Maps, assuming a good hash function (with no collision), will give you O(1) insertion and O(1) lookup.
If you don't know how many IPs you'll have, look into using a Fibonacci Heap (which I think has the best time complexity out of all tree structures for insert/delete/lookup).
Another type of structure you could use is a Radix Sort.
Do you have any specific requirements on how long the algorithm must take? "Really, really, really fast" is kinda vague.
You build a binary tree that checks the bits individually. You order the bit-checks in a form that gives you the "bushiest tree". You have a post order traversal, so that it checks full depth before exiting, thus returning the longest hit.
pseudocode
nodeCheck(bitVector, index){// bitvector is ordering of IP address bits for bushy tree
if myVal=-2 (return -1); //mismatched bit encountered No point continuing.
lVal,Rval=-1;
if (Left !=NULL && bitvector[index]==0) lVal=Left.nodeCheck(bitvector, index+1);
if (Right !=NULL && bitvector[index]==1) rVal=Right.nodeCheck(bitvector, index+1);
if (lVal>rVal) return lVal; // higher numbers have >= number of 1's in netmask.
if (rVal >-1) return rVal;
return myVal; //the group that getting this far would place you in, -1 if none.
}
Sure for speed you want to skip the OO factor, but the concept is the same..
The logic is a bit wonky, but the idea is sound.
But given that you have the radixTree down I didn't want to bog too deep into it.
the post order traversal simply lets you grab the longest matching without getting too weird.
The simple answer is order your bits to fit your tree. Make your tree as bushy (actually short) as possible.
more thorough answer:
Since these have the same length their order shouldn't matter
Lets Call
0.0.0.0/255.255.0.255 A
0.0.0.0/255.255.255.0 B
incoming 0.0.111.0
octet 1 2 3 4 just so we have right ordering
And I'm going to do them by octets because I'm lazy.
To make the bushiest tree you need to check octet 3 or 4 as your first test 3 being the lower will take arbitrary precedence.
So this looks at the value, and checks the right hand branch. The Right hand branch is another node,it checks octet one, and moves down the left hand branch, to the next node which checks octet 2, this checks the left (octet 4) and gets -1 (via NULL), the right and gets -1(via NULL), so it returns A (we'll call it an enumerated type).
So the octet ordering becomes 3 1 2 4.
Generally you want to order the bit checks so that early levels are doing some kind of check. In this case we push the 4 to the end because if the three hits(was a zero) the check on octet 4 is a waste and doesn't need to be done. But the 1 and 2 need to be done no matter the outcome of the first check.
on a larger problem there will be some nodes that have no check, sending them to identical left and right branches regardless of the value of the bit contained.
A poorly built valid tree could take an ordering of 3 4 1 2 so if the first check passes(0 instead of 111), the second check is a waste because we already belong in group B, no matter the value of octet 4.
Good luck.

Efficient way to handle adding and removing items by bitwise And

So, suppose you have a collection of items. Each item has an identifier which can be represented using a bitfield. As a simple example, suppose your collection is:
0110, 0111, 1001, 1011, 1110, 1111
So, you then want to implement a function, Remove(bool bitval, int position). For example, a call to Remove(0, 2) would remove all items where index 2(i.e. 3rd bit) was 0. In this case, that would be 1001, only. Remove(1,1) would remove 1110, 1111, 0111, and 0110. It is trivial to come up with an O(n) collection where this is possible (just use a linked list), with n being the number of items in the collection. In general the number of items to be removed is going to be O(n) (assuming a given bit has a ≥ c% chance of being 1 and a ≥ c% chance of being 0, where c is some constant > 0), so "better" algorithms which somehow are O(l), with l being the number of items being removed, are unexciting.
Is it possible to define a data structure where the average (or better yet, worst case) removal time is better than O(n)? A binary tree can do pretty well (just remove all left/right branches at the height m, where m is the index being tested), but I'm wondering if there is any way to do better (and quite honestly, I'm not sure how to removing all left or right branches at a particular height in an efficient manner). Alternatively, is there a proof that doing better is not possible?
Edit: I'm not sure exactly what I'm expecting in terms of efficiency (sorry Arno), but a basic explanation of it's possible application is thus: Suppose we are working with a binary decision tree. Such a tree could be used for a game tree or a puzzle solver or whatever. Further suppose the tree is small enough that we can fit all of the leaf nodes into memory. Each such node is basically just a bitfield listing all of the decisions. Now, if we want to prune arbitrary decisions from this tree, one method would be to just jump to the height where a particular decision is made and prune the left or right side of every node (left meaning one decision, right meaning the other). Normally in a decision tree you only want to prune subtree at a time (since the parent of that subtree is different from the parent of other subtrees and thus the decision which should be pruned in one subtree should not be pruned from others), but in some types of situations this may not be the case. Further, you normally only want to prune everything below a particular node, but in this case you'll be leaving some stuff below the node but also pruning below other nodes in the tree.
Anyhow, this is somewhat of a question based on curiousity; I'm not sure it's practical to use any results, but am interested in what people have to say.
Edit:
Thinking about it further, I think the tree method is actually O(n / logn), assuming it's reasonably dense. Proof:
Suppose you have a binary tree with n items. It's height is log(n). Removing half the bottom will require n/2 removals. Removing the half the row above will require n/4. The sum of operations for each row is n-1. So the average number of removals is n-1 / log(n).
Provided the length of your bitfields is limited, the following may work:
First, represent the bitfields that are in the set as an array of booleans, so in your case (4 bit bitfields), new bool[16];
Transform this array of booleans into a bitfield itself, so a 16-bit bitfield in this case, where each bit represents whether the bitfield corresponding to its index is included
Then operations become:
Remove(0, 0) = and with bitmask 1010101010101010
Remove(1, 0) = and with bitmask 0101010101010101
Remove(0, 2) = and with bitmask 1111000011110000
Note that more complicated 'add/remove' operations could then also be added as O(1) bit-logic.
The only down-side is that extra work is needed to interpret the resulting 16-bit bitfield back into a set of values, but with lookup arrays that might not turn out too bad either.
Addendum:
Additional down-sides:
Once the size of an integer is exceeded, every added bit to the original bit-fields will double the storage space. However, this is not much worse than a typical scenario using another collection where you have to store on average half the possible bitmask values (provided the typical scenario doesn't store far less remaining values).
Once the size of an integer is exceeded, every added bit also doubles the number of 'and' operations needed to implement the logic.
So basically, I'd say if your original bitfields are not much larger than a byte, you are likely better off with this encoding, beyond that you're probably better off with the original strategy.
Further addendum:
If you only ever execute Remove operations, which over time thins out the set state-space further and further, you may be able to stretch this approach a bit further (no pun intended) by making a more clever abstraction that somehow only keeps track of the int values that are non-zero. Detecting zero values may not be as expensive as it sounds either if the JIT knows what it's doing, because a CPU 'and' operation typically sets the 'zero' flag if the result is zero.
As with all performance optimizations, this one'd need some measurement to determine if it is worthwile.
If each decision bit and position are listed as objects, {bit value, k-th position}, you would end up with an array of length 2*k. If you link to each of these array positions from your item, represented as a linked list (which are of length k), using a pointer to the {bit, position} object as the node value, you can "invalidate" a bunch of items by simply deleting the {bit, position} object. This would require you, upon searching the list of items, to find "complete" items (it makes search REALLY slow?).
So something like:
[{0,0}, {1,0}, {0,1}, {1, 1}, {0,2}, {1, 2}, {0,3}, {1,3}]
and linked from "0100", represented as: {0->3->4->6}
You wouldn't know which items were invalid until you tried to find them (so it doesn't really limit your search space, which is what you're after).
Oh well, I tried.
Sure, it is possible (even if this is "cheating"). Just keep a stack of Remove objects:
struct Remove {
bool set;
int index;
}
The remove function just pushes an object on the stack. Viola, O(1).
If you wanted to get fancy, your stack couldn't exceed (number of bits) without containing duplicate or impossible scenarios.
The rest of the collection has to apply the logic whenever things are withdrawn or iterated over.
Two ways to do insert into the collection:
Apply the Remove rules upon insert, to clear out the stack, making in O(n). Gotta pay somewhere.
Each bitfield has to store it's index in the remove stack, to know what rules apply to it. Then, the stack size limit above wouldn't matter
If you use an array to store your binary tree, you can quickly index any element (the children of the node at index n are at index (n+1)*2 and (n+1)*2-1. All the nodes at a given level are stored sequentially. The first node at at level x is 2^x-1 and there are 2^x elements at that level.
Unfortunately, I don't think this really gets you much of anywhere from a complexity standpoint. Removing all the left nodes at a level is O(n/2) worst case, which is of course O(n). Of course the actual work depends on which bit you are checking, so the average may be somewhat better. This also requires O(2^n) memory which is much worse than the linked list and not practical at all.
I think what this problem is really asking is for a way to efficiently partition a set of sets into two sets. Using a bitset to describe the set gives you a fast check for membership, but doesn't seem to lend itself to making the problem any easier.

Resources