Random variable-length encoded numbers with uniform distribution - algorithm

Suppose I have data presented with a variable-length encoding, where I retrieve an item by parsing some virtual b-tree and stopping when I reach the item (similar to Huffman encoding). There is an unknown number of items (in the best case only an upper limit is known). Is there an algorithm to generate uniformly distributed numbers? The problem is that a coin-based algorithm gives a non-uniform result in this case: for example, if one number is encoded as 101 and another as 10010101, the latter will appear very rarely compared to the former.
UPDATE: In other words, I have a set of at most N elements (but maybe fewer), where every element can be addressed with an arbitrary number of bits (and, in accordance with information theory, if one element is encoded as 101 then no other element's encoding can start with that prefix). So it's more like a B-tree where I go left or right depending on a bit and at some moment I reach the data item. I want to get a sequence of random numbers addressed with this technique, but their distribution should be uniform (the example of why choosing left/right randomly won't work is above: the numbers 101 and 10010101).
Thanks
Max

I can think of three basic methods: one involves keeping extra information in the tree, one involves frequent reguessing, and one involves flattening the leaves into a vector. I think that doing one of these things is unavoidable. I'm going to begin with the extra-information one:
In each node, store a number count which represents the number of leaves in its subtree (so a leaf's count is 1). For every node, you'll need a number between 1 and count to tell you whether to go left or right, by comparing it to the left child's count. Here's the algorithm:
n := random integer between 1 and root.count
node := root
while node.count != 1
    if n <= node.left.count
        node := node.left
    else
        n := n - node.left.count
        node := node.right
So, essentially, we're imposing a left-to-right ordering on all leaves and selecting the nth one from the left. This is fairly quick, running in O(depth of tree), which is likely the best we can do without doing something like also building a vector that contains all the node labels. It also adds an overhead of O(depth of tree) to any change to the tree, since the counts must be corrected. If you're going the other way, never changing the tree at all but selecting random nodes a lot, just bite the bullet and put all of the node labels in a vector. That way you can select a random one in O(1) after O(N) initial set-up time.
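Here's the count-based method as a runnable Python sketch (the Node class and its cached count field are my own framing; the tree is assumed to be a full binary tree, as prefix-code trees are):

import random

class Node:
    # a full-binary-tree node; count caches the number of leaves below it
    def __init__(self, left=None, right=None):
        self.left = left
        self.right = right
        self.count = 1 if left is None else left.count + right.count

def random_leaf(root):
    # select a leaf uniformly at random by picking the n-th leaf from the left
    n = random.randint(1, root.count)   # inclusive on both ends
    node = root
    while node.count != 1:
        if n <= node.left.count:
            node = node.left
        else:
            n -= node.left.count        # skip past the left subtree
            node = node.right
    return node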
If, however, you don't want to use up any storage space, here's an alternative with a lot of reguessing. First find a bound (which I'll label B) on the depth of the tree (we can use N-1 if needed, but obviously that's a very loose bound; the tighter the bound, the faster the algorithm runs). Next we're going to generate a possible node label in a random but uniform way. There are 2^(B+1)-1 possibilities. It's not just 2^B because, for example, the strings "0011" and "11" are completely different strings. As a result, we need to count all possible binary strings of length between 0 and B. Obviously, we have 2^i strings of length i, so for strings of length B or less we have sum(i=0 to B){2^i} = 2^(B+1)-1. So we can just choose a number between 0 and 2^(B+1)-2 and then find the corresponding node label. Of course, the mapping from numbers to node labels isn't trivial, so I'll provide it here.
We convert the number we have chosen into a string of bits in the ordinary way. Then, reading from the left, if the first digit is a 0, then the node label is the remaining string to the right (possibly the empty string, which is a valid node label although not likely to be in use). If the first digit is a 1, then we throw it away and repeat this process. Thus, if B=4, then the node label "0001" would come from the number "00001". The node label "001" would come from the number "10001". The node label "01" would come from the number "11001". The node label "1" would come from the number "11101". And the node label "" would come from the number "11110". We did not include the number 2^(B+1)-1 ("11111" in this case) which has no valid interpretation under this scheme. I'll leave it as an exercise to the reader to prove to themselves that every string from length 0 to B can be represented under this scheme. Rather than trying to prove it, I'll just assert that it will work.
So now we have a node label. The next step is to see if that label exists by traversing the tree. If it does, we're done. If it doesn't, then choose a new number and start over (that's the reguessing part). You're likely to have to reguess a lot, since only a small fraction of legal node labels will be in use, but this won't skew the fairness, just increase the time.
Here's a pseudo-code version of this process in four functions:
function num_to_binary_list(n, digits) =
    // builds the bit list least-significant bit first
    if digits == 0 return ()
    if n mod 2 == 0 return 0 :: num_to_binary_list(n/2, digits-1)
    else return 1 :: num_to_binary_list((n-1)/2, digits-1)
function binary_list_to_node_label_list(l) =
    if l.head() == 0 return l.tail()
    else return binary_list_to_node_label_list(l.tail())
function check_node_label_list_against_tree(str, node) =
    if node == null return false, null
    if str.isEmpty()
        if node.isLeaf() return true, node
        else return false, null
    if str.head() == 0 return check_node_label_list_against_tree(str.tail(), node.left)
    else return check_node_label_list_against_tree(str.tail(), node.right)
function generate_random_node(tree, b) =
    found := false
    while (not found)
        x := random(0, 2**(b+1)-2) // We're assuming that this random selects inclusively
        // reverse: num_to_binary_list emits the low-order bit first
        node_label := binary_list_to_node_label_list(reverse(num_to_binary_list(x, b+1)))
        found, node := check_node_label_list_against_tree(node_label, tree)
    return node
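Here's the same process as a runnable Python sketch (the tree interface, namely is_leaf(), left and right, is an assumption; num_to_bits builds the bit list most-significant bit first, matching the mapping described above):

import random

def num_to_bits(n, digits):
    # binary representation of n, most significant bit first
    return [(n >> i) & 1 for i in range(digits - 1, -1, -1)]

def bits_to_label(bits):
    # strip the leading 1s and the first 0; the rest is the node label
    i = bits.index(0)  # safe: the all-ones number is never chosen
    return bits[i + 1:]

def find_leaf(label, node):
    # follow the label down the tree; return the leaf, or None if it doesn't exist
    for bit in label:
        if node is None:
            return None
        node = node.left if bit == 0 else node.right
    return node if node is not None and node.is_leaf() else None

def generate_random_leaf(tree, b):
    while True:
        x = random.randint(0, 2 ** (b + 1) - 2)  # inclusive on both ends
        leaf = find_leaf(bits_to_label(num_to_bits(x, b + 1)), tree)
        if leaf is not None:
            return leaf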
The timing analysis for this, of course, is pretty horrendous. Basically, the while loop will run an average of (2^(B+1)-1)/N times. So in the worst case (B = N-1) it's O((2^N)/N), which is terrible. In the best case, B would be on the order of log(N), so it would be roughly O(1), but that requires the tree to be fairly balanced, which it may not be. Still, if you really want no extra space, this method does that.
I don't really think that you can do better than this last method without storing some information. It sounds appealing to be able to traverse the tree, making random decisions as you go, but without storing additional information about the structure, you're just not going to be able to do that. Every time you make a branching decision, you could have just one node on the left side and a million nodes on the right side or it could have a million nodes on the left side and just one on the right side. Because those are both possible and you don't know which is the case, there's simply no way to make an even random decision between the two sides. Obviously 50-50 doesn't work and any other choice is going to be similarly problematic.
So, if you don't want extra space, the second method will work, but be slow. If you don't mind adding some extra space, the first method will work and be fast. And, as I said earlier, if you're not going to be changing the tree and you'll be selecting a lot of random nodes, then bite the bullet and just traverse the tree and stick all leaf nodes in a self-growing array or vector and then pick from that.

Related

Are there sorting algorithms that respect final position restrictions and run in O(n log n) time?

I'm looking for a sorting algorithm that honors a min and max range for each element [1]. The problem domain is a recommendations engine that combines a set of business rules (the restrictions) with a recommendation score (the value). If we have a recommendation we want to promote (e.g. a special product or deal) or an announcement we want to appear near the top of the list (e.g. "This is super important, remember to verify your email address to participate in an upcoming promotion!") or near the bottom of the list (e.g. "If you liked these recommendations, click here for more..."), they will be curated with certain position restrictions in place. For example, this should always be the top position, these should be in the top 10, or middle 5, etc. This curation step is done ahead of time, remains fixed for a given time period, and for business reasons must remain very flexible.
Please don't question the business purpose, UI or input validation. I'm just trying to implement the algorithm in the constraints I've been given. Please treat this as an academic question. I will endeavor to provide a rigorous problem statement, and feedback on all other aspects of the problem is very welcome.
So if we were sorting chars, our data would have a structure of
struct {
    char value;
    Integer minPosition;
    Integer maxPosition;
}
Where minPosition and maxPosition may be null (unrestricted). If this were called on an input where all position restrictions were null, or all minPositions were 0 or less and all maxPositions were equal to or greater than the size of the list, then the output would just be the chars in ascending order.
This algorithm would only reorder two elements if the minPosition and maxPosition of both elements would not be violated by their new positions. An insertion-based algorithm which promotes items to the top of the list and reorders the rest has obvious problems in that every later element would have to be revalidated after each iteration; in my head, that rules out such algorithms for having O(n^3) complexity, but I won't rule them out without considering evidence to the contrary, if presented.
In the output list, certain elements will be out of order with regard to their value, if and only if the set of position constraints dictates it. These outputs are still valid.
A valid list is any list where all elements are in a position that does not conflict with their constraints.
An optimal list is a list which cannot be reordered to more closely match the natural order without violating one or more position constraints. An invalid list is never optimal. I don't have a strict definition I can spell out for 'more closely matching' one ordering over another. However, I think it's fairly easy to let intuition guide you, or to choose something similar to a distance metric.
Multiple optimal orderings may exist if multiple inputs have the same value. You could make an argument that the above paragraph is therefore incorrect, because either one can be reordered to the other without violating constraints and therefore neither can be optimal. However, any rigorous distance function would treat these lists as identical, with the same distance from the natural order and therefore reordering the identical elements is allowed (because it's a no-op).
I would call such outputs the correct, sorted order which respects the position constraints, but several commentators pointed out that we're not really returning a sorted list, so let's stick with 'optimal'.
For example, the following are input lists (in the form <char>(<minPosition>:<maxPosition>), where Z(1:1) indicates a Z that must be at the front of the list, M(-:-) indicates an M that may be in any position in the final list, and the natural order (sorted by value only) is A...M...Z), along with their optimal orders.
Input order
A(1:1) D(-:-) C(-:-) E(-:-) B(-:-)
Optimal order
A B C D E
This is a trivial example to show that the natural order prevails in a list with no constraints.
Input order
E(1:1) D(2:2) C(3:3) B(4:4) A(5:5)
Optimal order
E D C B A
This example is to show that a fully constrained list is output in the same order it is given. The input is already a valid and optimal list. The algorithm should still run in O(n log n) time for such inputs. (Our initial solution is able to short-circuit any fully constrained list to run in linear time; I added the example both to drive home the definitions of optimal and valid and because some swap-based algorithms I considered handled this as the worst case.)
Input order
E(1:1) C(-:-) B(1:5) A(4:4) D(2:3)
Optimal Order
E B D A C
E is constrained to 1:1, so it is first in the list even though it has the lowest value. A is similarly constrained to 4:4, so it is also out of natural order. B has essentially identical constraints to C and may appear anywhere in the final list, but B will be before C because of value. D may be in positions 2 or 3, so it appears after B because of natural ordering but before C because of its constraints.
Note that the final order is correct despite being wildly different from the natural order (which is still A,B,C,D,E). As explained in the previous paragraph, nothing in this list can be reordered without violating the constraints of one or more items.
Input order
B(-:-) C(2:2) A(-:-) A(-:-)
Optimal order
A(-:-) C(2:2) A(-:-) B(-:-)
C remains unmoved because it is already in its only valid position. B is reordered to the end because its value is greater than both A's. In reality, there will be additional fields that differentiate the two A's, but from the standpoint of the algorithm, they are identical and preserving OR reversing their input ordering is an optimal solution.
Input order
A(1:1) B(1:1) C(3:4) D(3:4) E(3:4)
Undefined output
This input is invalid for two reasons: 1) A and B are both constrained to position 1 and 2) C, D, and E are constrained to a range that can only hold 2 elements. In other words, the ranges 1:1 and 3:4 are over-constrained. However, the consistency and legality of the constraints are enforced by UI validation, so it's officially not the algorithm's problem if they are incorrect, and the algorithm can return a best-effort ordering OR the original ordering in that case. Passing an input like this to the algorithm may be considered undefined behavior; anything can happen. So, for the rest of the question...
All input lists will have elements that are initially in valid positions.
The sorting algorithm itself can assume the constraints are valid and an optimal order exists. [2]
We've currently settled on a customized selection sort (with runtime complexity of O(n^2)) and reasonably proved that it works for all inputs whose position restrictions are valid and consistent (e.g. not overbooked for a given position or range of positions).
Is there a sorting algorithm that is guaranteed to return the optimal final order and run in better than O(n^2) time complexity? [3]
I feel that a library standard sorting algorithm could be modified to handle these constraints by providing a custom comparator that accepts the candidate destination position for each element. This would be equivalent to the current position of each element, so maybe modifying the value-holding class to include the current position of the element and doing the extra accounting in the comparison (.equals()) and swap methods would be sufficient.
However, the more I think about it, an algorithm that runs in O(n log n) time could not work correctly with these restrictions. Intuitively, such algorithms are based on running n comparisons log n times. The log n is achieved by leveraging a divide and conquer mechanism, which only compares certain candidates for certain positions.
In other words, for any O(n log n) sorting algorithm there exist input lists with valid position constraints (i.e. counterexamples) where a candidate element would be compared with an element (or a range, in the case of Quicksort and variants) with which it could not be swapped, and would therefore never move to the correct final position. If that's too vague, I can come up with a counterexample for mergesort and quicksort.
In contrast, an O(n^2) sorting algorithm makes exhaustive comparisons and can always move an element to its correct final position.
To ask an actual question: Is my intuition correct when I reason that an O(n log n) sort is not guaranteed to find a valid order? If so, can you provide more concrete proof? If not, why not? Is there other existing research on this class of problem?
[1]: I've not been able to find a set of search terms that points me in the direction of any concrete classification of such sorting algorithms or constraints; that's why I'm asking some basic questions about the complexity. If there is a term for this type of problem, please post it up.
[2]: Validation is a separate problem, worthy of its own investigation and algorithm. I'm pretty sure that the existence of a valid order can be proven in linear time:
Allocate an array of tuples with length equal to your list. Each tuple is an integer counter k and a double value v for the relative assignment weight.
Walk the list, adding the fractional value of each element's position constraint to the corresponding range and incrementing its counter by 1 (e.g. range 2:5 on a list of 10 adds 0.4 to each of 2, 3, 4, and 5 on our tuple list, incrementing the counter of each as well)
Walk the tuple list and
If no entry has value v greater than the sum of the series from 1 to k of 1/k, a valid order exists.
If there is such a tuple, the position it is in is over-constrained; throw an exception, log an error, use the doubles array to correct the problem elements etc.
Edit: This validation algorithm itself is actually O(n^2). Worst case, every element has the constraint 1:n, and you end up walking your list of n tuples n times. This is still irrelevant to the scope of the question, because in the real problem domain the constraints are enforced once and don't change.
Determining that a given list is in valid order is even easier. Just check each element's current position against its constraints.
[3]: This is admittedly a bit of premature optimization. Our initial use for this is for fairly small lists, but we're eyeing expansion to longer lists, so if we can optimize now we'd get small performance gains now and large performance gains later. And besides, my curiosity is piqued and if there is research out there on this topic, I would like to see it and (hopefully) learn from it.
On the existence of a solution: You can view this as a bipartite digraph with one set of vertices (U) being the k values, and the other set (V) the k ranks (1 to k), and an arc from each vertex in U to its valid ranks in V. Then the existence of a solution is equivalent to the maximum matching being a bijection. One way to check for this is to add a source vertex with an arc to each vertex in U, and a sink vertex with an arc from each vertex in V. Assign each edge a capacity of 1, then find the max flow. If it's k then there's a solution, otherwise not.
http://en.wikipedia.org/wiki/Maximum_flow_problem
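At this scale you don't even need a general max-flow solver; here's a sketch of the same feasibility test using Kuhn's augmenting-path matching in Python (the allowed sets and 0-based ranks are my own encoding of the constraints):

def valid_order_exists(allowed, k):
    # allowed[u] is the set of ranks (0..k-1) that element u may occupy.
    # True iff the maximum bipartite matching is perfect, i.e. some
    # assignment places every element in one of its legal ranks.
    match = [-1] * k  # match[r] = element occupying rank r, or -1

    def augment(u, seen):
        for r in allowed[u]:
            if r not in seen:
                seen.add(r)
                # take rank r if it is free, or if its occupant can move elsewhere
                if match[r] == -1 or augment(match[r], seen):
                    match[r] = u
                    return True
        return False

    return all(augment(u, set()) for u in range(k))

For the over-constrained example from the question, valid_order_exists([{0}, {0}, {2, 3}, {2, 3}, {2, 3}], 5) returns False.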
--edit-- O(k^3) solution: First sort to find the sorted rank of each vertex (1 to k). Next, consider your values and ranks as 2 sets of k vertices, U and V, with weighted edges from each vertex in U to all of its legal ranks in V. The weight to assign each edge is the distance from the vertex's rank in sorted order. E.g., if U is 10 to 20, then the natural rank of 10 is 1. An edge from value 10 to rank 1 would have a weight of zero, and to rank 3 a weight of 2. Next, assume all missing edges exist and assign them infinite weight. Lastly, find the "MINIMUM WEIGHT PERFECT MATCHING" in O(k^3).
http://www-math.mit.edu/~goemans/18433S09/matching-notes.pdf
This does not take advantage of the fact that the legal ranks for each element in U are contiguous, which may help get the running time down to O(k^2).
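If you have Python handy, the assignment formulation can be tried directly with scipy's linear_sum_assignment (a Hungarian-style O(k^3) solver); this sketch builds the cost matrix exactly as described above, with a large constant standing in for the infinite weights:

import numpy as np
from scipy.optimize import linear_sum_assignment

def constrained_order(values, ranges):
    # values[i] is an element; ranges[i] = (lo, hi), 1-based inclusive,
    # already normalized so neither bound is null.
    k = len(values)
    natural = np.argsort(np.argsort(values))  # natural (sorted) rank, 0-based
    cost = np.full((k, k), 10 ** 9)           # "infinite" weight for illegal ranks
    for i, (lo, hi) in enumerate(ranges):
        for r in range(lo - 1, hi):           # legal ranks, 0-based
            cost[i, r] = abs(r - natural[i])  # distance from natural rank
    rows, cols = linear_sum_assignment(cost)  # minimum weight perfect matching
    out = [None] * k
    for i, r in zip(rows, cols):
        out[r] = values[i]
    return out

On the question's third example, constrained_order(['E', 'C', 'B', 'A', 'D'], [(1, 1), (1, 5), (1, 5), (4, 4), (2, 3)]) yields ['E', 'B', 'D', 'A', 'C'].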
Here is what a coworker and I have come up with. I think it's an O(n^2) solution that returns a valid, optimal order if one exists, and a closest-possible effort if the initial ranges were over-constrained. I've just tweaked a few things about the implementation and we're still writing tests, so there's a chance it doesn't work as advertised. The over-constrained condition is detected fairly easily when it occurs.
To start, things are simplified if you normalize your inputs to have all non-null constraints. In linear time, that is:
for each item in input
    if an item doesn't have a minimum position, set it to 1
    if an item doesn't have a maximum position, set it to the length of your list
The next goal is to construct a list of ranges, each containing all of the candidate elements that have that range, ordered by the remaining capacity of the range ascending (so ranges with the fewest remaining spots come first), then by start position of the range, then by end position of the range. This can be done by creating a set of such ranges, then sorting them in O(n log n) time with a simple comparator.
For the rest of this answer, a range will be a simple object like so
class Range<T> implements Collection<T> {
    int startPosition;
    int endPosition;
    Collection<T> items;

    public int remainingCapacity() {
        return endPosition - startPosition + 1 - items.size();
    }

    // implement Collection<T> methods, passing through to the items collection
    public void add(T item) {
        // Validity checking here exposes some simple cases of over-constraining
        // We'll catch these cases with the tricky stuff later anyways, so don't choke
        items.add(item);
    }
}
If an element A has range 1:5, construct a range(1,5) object and add A to its elements. This range has remaining capacity of 5 - 1 + 1 - 1 (max - min + 1 - size) = 4. If an element B has range 1:5, add it to your existing range, which now has capacity 3.
Then it's a relatively simple matter of picking the best element that fits each position 1 => k in turn. Iterate your ranges in their sorted order, keeping track of the best eligible element, with the twist that you stop looking once you've reached a range whose remaining size can't fit into its remaining positions. This is equivalent to the simple check range.max - current position + 1 > range.size (which can probably be simplified, but I think it's most understandable in this form). Remove each element from its range as it is selected. Remove each range from your list as it is emptied (optional; iterating an empty range will yield no candidates). That's a poor explanation, so let's do one of our examples from the question. Note that C(-:-) has been updated to the sanitized C(1:5) as described above.
Input order
E(1:1) C(1:5) B(1:5) A(4:4) D(2:3)
Built ranges (min:max) <remaining capacity> [elements]
(1:1)0[E] (4:4)0[A] (2:3)1[D] (1:5)3[C,B]
Find best for 1
Consider (1:1), best element from its list is E
Consider further ranges?
range.max - current position + 1 > range.size ?
range.max = 1; current position = 1; range.size = 1;
1 - 1 + 1 > 1 = false; do not consider subsequent ranges
Remove E from range, add to output list
Find best for 2; current range list is:
(4:4)0[A] (2:3)1[D] (1:5)3[C,B]
Consider (4:4); skip it because it is not eligible for position 2
Consider (2:3); best element is D
Consider further ranges?
3 - 2 + 1 > 1 = true; check next range
Consider (1:5); best element is B
End of range list; remove B from range, add to output list
An added simplifying factor is that the capacities do not need to be updated or the ranges reordered. An item is only removed if the rest of the higher-sorted ranges would not be disturbed by doing so. The remaining capacity is never checked after the initial sort.
Find best for 3; output is now E, B; current range list is:
(4:4)0[A] (2:3)1[D] (1:5)3[C]
Consider (4:4); skip it because it is not eligible for position 3
Consider (2:3); best element is D
Consider further ranges?
same as previous check, but current position is now 3
3 - 3 + 1 > 1 = false; don't check next range
Remove D from range, add to output list
Find best for 4; output is now E, B, D; current range list is:
(4:4)0[A] (1:5)3[C]
Consider (4:4); best element is A
Consider further ranges?
4 - 4 + 1 > 1 = false; don't check next range
Remove A from range, add to output list
Output is now E, B, D, A and there is one element left to be checked, so it gets appended to the end. This is the output list we desired to have.
This build process is the longest part. At its core, it's a straightforward O(n^2) selection-sort-style algorithm. The range constraints only serve to shorten the inner loop, and there is no loopback or recursion; but the worst case (I think) is still sum(i=1 to n){n - i}, which is n^2/2 - n/2.
The detection step comes into play by not excluding a candidate range when the current position is beyond that range's max position. You have to track the range your best candidate came from in order to remove it, so when you do the removal, just check whether the position you're extracting the candidate for is greater than that range's endPosition.
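Here's a compact Python sketch of the whole procedure as described (ranges are plain (lo, hi) keys mapping to their member lists rather than the Range class, "best" means smallest value, and the over-constraint detection is left out):

def constrained_sort(items):
    # items: list of (value, lo, hi), positions 1-based inclusive,
    # already normalized so no bound is null.
    n = len(items)
    ranges = {}
    for value, lo, hi in items:
        ranges.setdefault((lo, hi), []).append(value)
    # sort by remaining capacity, then start, then end, once up front
    order = sorted(ranges, key=lambda r: (r[1] - r[0] + 1 - len(ranges[r]), r[0], r[1]))
    out = []
    for pos in range(1, n + 1):
        best = None
        for lo, hi in order:
            members = ranges[(lo, hi)]
            if not members or not lo <= pos <= hi:
                continue  # emptied, or not eligible for this position
            if best is None or min(members) < min(ranges[best]):
                best = (lo, hi)
            if hi - pos + 1 <= len(members):
                break     # this range must fill all of its remaining positions
        out.append(min(ranges[best]))
        ranges[best].remove(out[-1])
    return out

On the example above, constrained_sort([('E', 1, 1), ('C', 1, 5), ('B', 1, 5), ('A', 4, 4), ('D', 2, 3)]) returns ['E', 'B', 'D', 'A', 'C'].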
I have several other counter-examples that foiled my earlier algorithms, including a nice example that shows several over-constraint detections on the same input list and also how the final output is as close to optimal as the constraints will allow. In the meantime, please post any optimizations you can see and especially any counter-examples where this algorithm makes an objectively incorrect choice (i.e. arrives at an invalid or suboptimal output when one exists).
I'm not going to accept this answer, because I specifically asked if it could be done in better than O(n2). I haven't wrapped my head around the constraints satisfaction approach in #DaveGalvin's answer yet and I've never done a maximum flow problem, but I thought this might be helpful for others to look at.
Also, I discovered the best way to come up with valid test data is to start with a valid list and randomize it: for i = 0 -> n, create a random value and constraints such that min < i < max. (Again, posting it because it took me longer than it should have to come up with, and others might find it helpful.)
Not likely*. I assume you mean an average run time of O(n log n), in-place, non-stable, off-line. Most sorting algorithms that improve on bubble sort's average run time of O(n^2), like Timsort, rely on the assumption that comparing two elements in a subset will produce the same result in the superset. A slower variant of Quicksort would be a good approach for your range constraints. The worst case won't change, but the average case will likely decrease, and the algorithm will have the extra constraint of a valid sort existing.
Is ... O(n log n) sort is not guaranteed to find a valid order?
All popular sort algorithms I am aware of are guaranteed to find an order so long as their constraints are met. Formal analysis (concrete proof) is on each sort algorithm's Wikipedia page.
Is there other existing research on this class of problem?
Yes; there are many journals like IJCSEA with sorting research.
*but that depends on your average data set.

How to perform minimum splits to satisfy special set ordering?

I'm trying to create an algorithm to solve the following problem:
Input is an unsorted list of sets containing pairs (key, value) of ints. The first of each pair is positive and unique within the set.
I want to find an algorithm to split the input sets so that the resulting sets can be ordered such that, for each key, the values are nondecreasing in the set order.
There is a trivial solution, which is to split the sets into individual pairs and sort them; I'd like something more efficient in terms of the number of sets which are split.
Are there any similar problems you have encountered and/or techniques you can suggest?
Does the optimal (minimum number of splits) solution sound like it is possible in polynomial time?
Edit: In the example, the "<=" operator indicates a constraint on the sets as a whole, whereby for each key (100, 101, 102) the corresponding values are equal to or greater than the values in previous sets (or omitted from the set). I.e. extracting the values for each key using the order from the output sets gives:
Key 100 {0, 1}
Key 101 {2, 3}
Key 102 {10, 15}
A*
I propose using A* to find an optimal solution. Build the order of split sets incrementally from left to right, minimizing the number of sets required to achieve this.
A* visits states based on some heuristic estimate of the total cost. I propose that a state is described by the totality of all the pairs already included in the order as we have it so far. If all values for every key are different, then you can represent this information rather concisely by simply storing the last value for each key. Otherwise you'll have to somehow take care of equal values, so you know which ones were already included and which ones were not. For every state you maintain some representation of the best order leading to it, but that may get updated along the way while the state remains the same.
The heuristic should be an estimate of the total cost of the path from the beginning through the current state to the goal. It may be too low, but must never be too high. In our case, the heuristic should count the number of (possibly split) sets included in the order so far, and add to that the number of (unsplit) sets still waiting for insertion. As the remaining sets may need splitting, this might be too low, but as you can never have fewer sets than those still waiting for insertion, it is a suitable heuristic.
Now you have some priority queue of states, ordered by the value of this heuristic. You extract minimal items from it, and know that the moment you extract a state from the queue, the cost up to that state can not decrease any more, so the path up to that state is optimal. Now you examine what other states can be reached from this: which other pairs can be next in the order of split sets? For each remaining set which has pairs that are ready to be included, you create a new subsequent state, taking all the pairs from the set which are ready. The cost so far increases by one. If you manage to take a whole set, without splitting, then the estimate for the remaining cost decreases by one.
For this new state, you check whether it is already present in your priority queue. If it is, and its previous cost was higher than the one just computed, then you update its cost and the optimal path leading to it. Make sure the priority key changes its position accordingly ("decrease key"). If the state wasn't present in the queue before, then add it to the queue.
Dijkstra
Come to think of it, this is the same as running Dijkstra's algorithm with the number of splits as cost. And as each edge has either cost zero or cost one, you can implement this even easier, without any priority queue at all. Instead, you can use two sets, called S₀ and S₁, where all elements from S₀ require the same number of splits, and all elements from S₁ require one more split. Roughly sketched in pseudocode:
S₀ = ∅ (empty set)
S₁ = ∅
add initial state (no pairs added yet, all sets remain to be added) to S₀
while True
    while (S₀ ≠ ∅)
        x = take and remove any element from S₀
        if x is the target state (all pairs included in the order) then
            return the path information associated with it
        for (r: those sets which remain to be added in state x)
            if we can take r as a whole then
                let y be the state obtained by taking r as the next set in the order
                if y is in S₁, remove it
                add y to S₀
            else if we can add only some elements from r then
                let y be the state obtained by taking as many elements from r as possible
                if y is not in S₀, add it to S₁
    S₀ = S₁
    S₁ = ∅
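The two-set scheme is the classic 0-1 BFS, which in Python is usually written with a deque in place of S₀/S₁; here's a generic sketch (is_goal and neighbors, which yields (next_state, cost) pairs with cost 0 for taking a whole set and 1 for a split, stand in for the state machinery above):

from collections import deque

def zero_one_bfs(start, is_goal, neighbors):
    # shortest path where every edge costs 0 or 1 (here: splits).
    # zero-cost edges go to the front of the deque, unit-cost edges to the
    # back, so states come out in nondecreasing cost, exactly as S₀/S₁ do.
    dist = {start: 0}
    queue = deque([start])
    done = set()
    while queue:
        state = queue.popleft()
        if state in done:
            continue                        # stale duplicate entry
        done.add(state)
        if is_goal(state):
            return dist[state]
        for nxt, cost in neighbors(state):  # cost: 0 = whole set, 1 = split
            d = dist[state] + cost
            if nxt not in dist or d < dist[nxt]:
                dist[nxt] = d
                if cost == 0:
                    queue.appendleft(nxt)
                else:
                    queue.append(nxt)
    return None                             # goal unreachable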

Algorithm for non-contiguous netmask match

I have to write a really really fast algorithm to match an IP address to a list of groups, where each group is defined using a notation like 192.168.0.0/252.255.0.255. As you can see, the bitmask can contain zeros even in the middle, so the traditional "longest prefix match" algorithms won't work. If an IP matches two groups, it will be assigned to the group containing most 1's in the netmask.
I'm not working with many entries (let's say < 1000) and I don't want to use a data structure requiring a large memory footprint (let's say > 1-2 MB), but it really has to be fast (of course I can't afford a linear search).
Do you have any suggestion? Thanks guys.
UPDATE: I found something quite interesting at http://www.cse.usf.edu/~ligatti/papers/grouper-conf.pdf, but it's still too memory-hungry for my utopic use case
If you know how many IP addresses you'll be dealing with initially, I'd say use a Hash Map structure. For the keys of this map, convert the IP into an integer-type structure. Hash Maps, assuming a good hash function (with no collision), will give you O(1) insertion and O(1) lookup.
If you don't know how many IPs you'll have, look into using a Fibonacci Heap (which I think has the best time complexity out of all tree structures for insert/delete/lookup).
Another type of structure you could use is a Radix Sort.
Do you have any specific requirements on how long the algorithm must take? "Really, really, really fast" is kinda vague.
You build a binary tree that checks the bits individually. You order the bit-checks in a form that gives you the "bushiest" tree. You use a post-order traversal, so that it checks to full depth before exiting, thus returning the longest hit.
pseudocode
nodeCheck(bitVector, index) { // bitVector is the ordering of IP address bits for the bushy tree
    if (myVal == -2) return -1;  // mismatched bit encountered; no point continuing
    lVal = rVal = -1;
    if (Left != NULL && bitVector[index] == 0) lVal = Left.nodeCheck(bitVector, index+1);
    if (Right != NULL && bitVector[index] == 1) rVal = Right.nodeCheck(bitVector, index+1);
    if (lVal > rVal) return lVal; // higher numbers have >= number of 1's in the netmask
    if (rVal > -1) return rVal;
    return myVal; // the group that getting this far would place you in, -1 if none
}
Sure, for speed you'd want to skip the OO factor, but the concept is the same.
The logic is a bit wonky, but the idea is sound.
But given that you have the radix tree down, I didn't want to bog down too deep into it.
The post-order traversal simply lets you grab the longest match without getting too weird.
The simple answer is: order your bits to fit your tree. Make your tree as bushy (that is, as short) as possible.
more thorough answer:
Since these have the same length their order shouldn't matter
Lets Call
0.0.0.0/255.255.0.255 A
0.0.0.0/255.255.255.0 B
incoming 0.0.111.0
The octets are numbered 1 2 3 4, just so we have the right ordering.
And I'm going to do them by octets because I'm lazy.
To make the bushiest tree you need to check octet 3 or 4 as your first test; 3, being the lower, will take arbitrary precedence.
So this looks at the value and checks the right-hand branch. The right-hand branch is another node; it checks octet one and moves down the left-hand branch to the next node, which checks octet 2. This checks the left (octet 4) and gets -1 (via NULL), then the right and gets -1 (via NULL), so it returns A (we'll call it an enumerated type).
So the octet ordering becomes 3 1 2 4.
Generally you want to order the bit checks so that early levels are doing some kind of useful check. In this case we push the 4 to the end because if the check on octet 3 hits (was a zero), the check on octet 4 is a waste and doesn't need to be done. But octets 1 and 2 need to be checked no matter the outcome of the first check.
On a larger problem there will be some nodes that have no check, sending them to identical left and right branches regardless of the value of the bit contained.
A poorly built but valid tree could take an ordering of 3 4 1 2, so if the first check passes (0 instead of 111), the second check is a waste because we already belong in group B, no matter the value of octet 4.
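For reference, the membership test this tree is optimizing is just a masked compare; a brute-force Python sketch (representing groups as (pattern, mask) integer pairs is my framing) pins down the matching and tie-break rules, and is also the linear scan the question wants to beat:

def best_group(ip, groups):
    # groups: list of (pattern, mask) 32-bit ints; returns the index of
    # the matching group whose mask has the most 1 bits, or None.
    best, best_ones = None, -1
    for i, (pattern, mask) in enumerate(groups):
        if ip & mask == pattern & mask:   # non-contiguous mask match
            ones = bin(mask).count("1")
            if ones > best_ones:
                best, best_ones = i, ones
    return best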
Good luck.

Efficient way to handle adding and removing items by bitwise And

So, suppose you have a collection of items. Each item has an identifier which can be represented using a bitfield. As a simple example, suppose your collection is:
0110, 0111, 1001, 1011, 1110, 1111
So, you then want to implement a function, Remove(bool bitval, int position). For example, a call to Remove(0, 2) would remove all items where index 2(i.e. 3rd bit) was 0. In this case, that would be 1001, only. Remove(1,1) would remove 1110, 1111, 0111, and 0110. It is trivial to come up with an O(n) collection where this is possible (just use a linked list), with n being the number of items in the collection. In general the number of items to be removed is going to be O(n) (assuming a given bit has a ≥ c% chance of being 1 and a ≥ c% chance of being 0, where c is some constant > 0), so "better" algorithms which somehow are O(l), with l being the number of items being removed, are unexciting.
Is it possible to define a data structure where the average (or better yet, worst case) removal time is better than O(n)? A binary tree can do pretty well (just remove all left/right branches at the height m, where m is the index being tested), but I'm wondering if there is any way to do better (and quite honestly, I'm not sure how to removing all left or right branches at a particular height in an efficient manner). Alternatively, is there a proof that doing better is not possible?
Edit: I'm not sure exactly what I'm expecting in terms of efficiency (sorry Arno), but a basic explanation of its possible application is this: Suppose we are working with a binary decision tree. Such a tree could be used for a game tree or a puzzle solver or whatever. Further suppose the tree is small enough that we can fit all of the leaf nodes into memory. Each such node is basically just a bitfield listing all of the decisions. Now, if we want to prune arbitrary decisions from this tree, one method would be to just jump to the height where a particular decision is made and prune the left or right side of every node (left meaning one decision, right meaning the other). Normally in a decision tree you only want to prune one subtree at a time (since the parent of that subtree is different from the parent of other subtrees and thus the decision which should be pruned in one subtree should not be pruned from others), but in some types of situations this may not be the case. Further, you normally only want to prune everything below a particular node, but in this case you'll be leaving some stuff below the node but also pruning below other nodes in the tree.
Anyhow, this is somewhat of a question based on curiosity; I'm not sure it's practical to use any results, but am interested in what people have to say.
Edit:
Thinking about it further, I think the tree method is actually O(n / log n), assuming it's reasonably dense. Proof:
Suppose you have a binary tree with n items. Its height is log(n). Removing half the bottom row will require n/2 removals. Removing half the row above will require n/4. The sum of operations over all the rows is n-1. So the average number of removals per row is (n-1)/log(n).
Provided the length of your bitfields is limited, the following may work:
First, represent the bitfields that are in the set as an array of booleans, so in your case (4 bit bitfields), new bool[16];
Transform this array of booleans into a bitfield itself, so a 16-bit bitfield in this case, where each bit represents whether the bitfield corresponding to its index is included
Then operations become:
Remove(0, 0) = and with bitmask 1010101010101010
Remove(1, 0) = and with bitmask 0101010101010101
Remove(0, 2) = and with bitmask 1111000011110000
Note that more complicated 'add/remove' operations could then also be added as O(1) bit-logic.
The only down-side is that extra work is needed to interpret the resulting 16-bit bitfield back into a set of values, but with lookup arrays that might not turn out too bad either.
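Here's this encoding sketched in Python (an arbitrary-precision int stands in for the 16-bit bitfield; note that bit positions count from the least significant end here, matching the masks above rather than the question's left-to-right indexing):

def keep_mask(keep_bit, position, width):
    # mask over all 2**width possible items whose bit at 'position'
    # (0 = least significant) equals keep_bit
    mask = 0
    for item in range(2 ** width):
        if (item >> position) & 1 == keep_bit:
            mask |= 1 << item
    return mask

WIDTH = 4
collection = 0
for item in [0b0110, 0b0111, 0b1001, 0b1011, 0b1110, 0b1111]:
    collection |= 1 << item               # one membership bit per possible value

def remove(collection, bitval, position):
    # Remove(bitval, position) = AND with the mask keeping the complement
    return collection & keep_mask(1 - bitval, position, WIDTH)

collection = remove(collection, 0, 2)     # drops 0b1001 and 0b1011 (LSB indexing)
members = [i for i in range(2 ** WIDTH) if (collection >> i) & 1]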
Addendum:
Additional down-sides:
Once the size of an integer is exceeded, every added bit to the original bit-fields will double the storage space. However, this is not much worse than a typical scenario using another collection where you have to store on average half the possible bitmask values (provided the typical scenario doesn't store far fewer remaining values).
Once the size of an integer is exceeded, every added bit also doubles the number of 'and' operations needed to implement the logic.
So basically, I'd say if your original bitfields are not much larger than a byte, you are likely better off with this encoding, beyond that you're probably better off with the original strategy.
Further addendum:
If you only ever execute Remove operations, which over time thins out the set state-space further and further, you may be able to stretch this approach a bit further (no pun intended) by making a more clever abstraction that somehow only keeps track of the int values that are non-zero. Detecting zero values may not be as expensive as it sounds either if the JIT knows what it's doing, because a CPU 'and' operation typically sets the 'zero' flag if the result is zero.
As with all performance optimizations, this one would need some measurement to determine if it is worthwhile.
If each decision bit and position are listed as objects, {bit value, k-th position}, you would end up with an array of length 2*k. If you link to each of these array positions from your item, represented as a linked list (which are of length k), using a pointer to the {bit, position} object as the node value, you can "invalidate" a bunch of items by simply deleting the {bit, position} object. This would require you, upon searching the list of items, to find "complete" items (it makes search REALLY slow?).
So something like:
[{0,0}, {1,0}, {0,1}, {1, 1}, {0,2}, {1, 2}, {0,3}, {1,3}]
and linked from "0100", represented as: {0->3->4->6}
You wouldn't know which items were invalid until you tried to find them (so it doesn't really limit your search space, which is what you're after).
Oh well, I tried.
Sure, it is possible (even if this is "cheating"). Just keep a stack of Remove objects:
struct Remove {
    bool set;
    int index;
}
The remove function just pushes an object onto the stack. Voila, O(1).
If you wanted to get fancy, your stack couldn't exceed (number of bits) without containing duplicate or impossible scenarios.
The rest of the collection has to apply the logic whenever things are withdrawn or iterated over.
Two ways to do insert into the collection:
Apply the Remove rules upon insert, to clear out the stack, making insert O(n). Gotta pay somewhere.
Each bitfield has to store its index in the remove stack, to know which rules apply to it. Then the stack size limit above wouldn't matter.
If you use an array to store your binary tree, you can quickly index any element (the children of the node at index n are at indices (n+1)*2 and (n+1)*2-1). All the nodes at a given level are stored sequentially. The first node at level x is at index 2^x-1, and there are 2^x elements at that level.
Unfortunately, I don't think this really gets you much of anywhere from a complexity standpoint. Removing all the left nodes at a level is O(n/2) worst case, which is of course O(n). Of course the actual work depends on which bit you are checking, so the average may be somewhat better. This also requires O(2^n) memory which is much worse than the linked list and not practical at all.
I think what this problem is really asking is for a way to efficiently partition a set of sets into two sets. Using a bitset to describe the set gives you a fast check for membership, but doesn't seem to lend itself to making the problem any easier.

Generating random fixed length permutations of a string

Let's say my alphabet contains X letters and my language supports only Y-letter words (Y < X, of course). I need to generate all the possible words in random order.
E.g.
Alphabet=a,b,c,d,e,f,g
Y=3
So the words would be:
aaa
aab
aac
aba
..
bbb
ccc
..
(the above should be generated in random order)
The trivial way to do it would be to generate the words and then randomize the list. I DON'T want to do that. I want to generate the words in random order.
random(n) = letter[x].random(n-1) will not work, because then you'll have a list of words starting with letter[x], which will make the list not so random.
Any code/pseudocode appreciated.
As other answers have implied, there are two main approaches: 1) track what you've already generated (the proposed solutions in this category suffer from possibly never terminating), or 2) track which permutations have yet to be produced (which implies that the permutations must be pre-generated, which was specifically disallowed in the requirements). Here's another solution that is guaranteed to terminate and does not require pre-generation, but may not meet your randomization requirements (which are vague at this point).
General overview: generate a tree to track what's been generated or what's remaining. "Select" new permutations by traversing random links in the tree, pruning the tree at the leaves after generating a permutation to prevent it from being generated again.
Without a whiteboard to diagram this, I hope this description is good enough to describe what I mean: Create a "node" that has links to other nodes for every letter in the alphabet. This could be implemented using a generic map of alphabet letters to nodes or if your alphabet is fixed, you could create specific references. The node represents the available letters in the alphabet that can be "produced" next for generating a permutation. Start generating permutations by visiting the root node, selecting a random letter from the available letters in that node, then traversing that reference to the next node. With each traversal, a letter is produced for the permutation. When a leaf is reached (i.e. a permutation is fully constructed), you'd backtrack up the tree to see if the parent nodes have any available permutations remaining; if not, the parent node can be pruned.
As an implementation detail, each node could store either the set of letters that are no longer available at that point or the set of letters that are still available. To reduce storage requirements, you could let the node store whichever set is smaller, with a flag indicating which it's doing: store the letters produced so far while more than half the alphabet is still available, and switch to storing the letters remaining once less than half the alphabet is available.
Using such a tree structure limits what can be produced without having to pre-generate all combinations since you don't need to pre-construct the entire tree (it can be constructed as the permutations are generated) and you're guaranteed to complete because of the purging of the nodes (i.e. you're only traversing links to nodes when that's an allowed combination for an unproduced permutation).
I believe the randomization of the technique is a little odd, however, and I don't think each combination is equally likely to be generated at any given time, though I haven't really thought through this. It's also probably worth noting that even though the full tree isn't necessarily generated up front, the overhead involved will likely be enough such that you may be better off pre-generating all permutations.
I think you can do something pretty straightforward by generating a random array of characters based on the alphabet you have (in c#):
char[] alphabet = {'a', 'b', 'c', 'd'};
int wordLength = 3;
Random rand = new Random();
for (int i = 0; i < 5; i++)
{
    char[] word = new char[wordLength];
    for (int j = 0; j < wordLength; j++)
    {
        word[j] = alphabet[rand.Next(alphabet.Length)];
    }
    Console.WriteLine(new string(word));
}
Obviously this might generate duplicates, but you could store the results in a hash set or something to check for duplicates if you need to.
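For instance, a Python sketch of that duplicate check with a set (fine while the number of words needed is small relative to X^Y, but it can never finish if you ask for more distinct words than exist):

import random

def random_unique_words(alphabet, y, count):
    # generate 'count' distinct length-y words by rejection sampling
    seen = set()
    while len(seen) < count:
        seen.add("".join(random.choice(alphabet) for _ in range(y)))
    return list(seen)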
So I take it what you want is to produce a permutation of the set using as little memory as possible.
First off, it can't be done using no memory. For your first string, you want a function that could produce any of the strings with equal likelihood. Say that function is called nextString(). If you call nextString() again without changing anything in the state, of course it will once again be able to produce any of the strings.
So you need to store something. The question is, what do you need to store, and how much space will it take?
The strings can be seen as the numbers 0 to X^Y - 1 (aaa=0, aab=1, aac=2, ..., aba=X, ...). So to store a single string as efficiently as possible, you'd need lg(X^Y) bits. Let's say X = 16 and Y = 2. Then you'd need 1 byte of storage to uniquely specify a string.
Of course the most naive algorithm is to mark each string as it is produced, which takes X^Y bits, which in my example is 256 bits (32 bytes). This is what you've said you don't want to do. You can use a shuffle algorithm as discussed in this question: Creating a random ordered list from an ordered list (you won't need to store the strings as you produce them through the shuffle algorithm, but you still need to mark them).
Ok, now the question is, can we do better than that? How much do we need to store, total?
Well, on the first call, we don't need any storage. On the second call, we need to know which one was produced before. On the last call, we only need to know which one is the last one left. So the worst case is when we're halfway through. When we're halfway through, there have been 128 strings produced, and there are 128 to go. We need to know which are left to produce. Assuming the process is truly random, any split is possible. There are (256 choose 128) possibilities. In order to potentially be able to store any of these, we need lg(256 choose 128) bits, which according to google calculator is 251.67. So if you were really clever you could squeeze the information into 4 fewer bits than the naive algorithm. Probably not worth it.
If you just want it to look randomish with very little storage, see this question: Looking for an algorithm to spit out a sequence of numbers in a (pseudo) random order
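The trick from that linked question also fits here: walk the indices 0..X^Y-1 with a random stride that is coprime to X^Y (which guarantees a full period, so every index is visited exactly once) and decode each index into a word. That emits every word exactly once with O(1) state, though the order is only weakly random; a sketch:

import math
import random

def scrambled_words(alphabet, y):
    # yield every length-y word over alphabet exactly once, in a
    # scrambled order, keeping only O(1) state (index, stride, offset)
    x = len(alphabet)
    n = x ** y
    stride = random.randrange(1, n)
    while math.gcd(stride, n) != 1:   # full period needs gcd(stride, n) == 1
        stride = random.randrange(1, n)
    offset = random.randrange(n)
    for i in range(n):
        idx = (offset + i * stride) % n
        word = []
        for _ in range(y):            # decode idx as a base-x numeral
            word.append(alphabet[idx % x])
            idx //= x
        yield "".join(reversed(word))

For example, list(scrambled_words("ab", 3)) contains each of the 8 three-letter words exactly once.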
