Algorithm for non-contiguous netmask match - data-structures

I have to write a really, really fast algorithm to match an IP address against a list of groups, where each group is defined using a notation like 192.168.0.0/252.255.0.255. As you can see, the bitmask can contain zeros even in the middle, so the traditional "longest prefix match" algorithms won't work. If an IP matches two groups, it is assigned to the group whose netmask contains the most 1 bits.
I'm not working with many entries (let's say < 1000) and I don't want to use a data structure requiring a large memory footprint (let's say > 1-2 MB), but it really has to be fast (of course I can't afford a linear search).
Do you have any suggestion? Thanks guys.
UPDATE: I found something quite interesting at http://www.cse.usf.edu/~ligatti/papers/grouper-conf.pdf, but it's still too memory-hungry for my utopic use case

If you know how many IP addresses you'll be dealing with initially, I'd say use a Hash Map structure. For the keys of this map, convert the IP into an integer-type structure. Hash Maps, assuming a good hash function (with no collision), will give you O(1) insertion and O(1) lookup.
If you don't know how many IPs you'll have, look into using a Fibonacci Heap (which I think has the best time complexity out of all tree structures for insert/delete/lookup).
Another type of structure you could use is a radix tree.
Do you have any specific requirements on how long the algorithm must take? "Really, really, really fast" is kinda vague.
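A plain hash map keyed on the address does not handle the masks on its own, but one way to adapt the idea is to keep one dictionary per distinct netmask and probe them from the mask with the most 1 bits to the one with the fewest. The following is only a minimal Python sketch of that variant (sometimes called tuple-space search), not the method proposed in any answer here; build_tables, lookup and the sample groups are made-up names for illustration. Lookup cost grows with the number of distinct masks, not with the number of entries.

```python
import ipaddress
from collections import defaultdict


def to_int(dotted):
    """Convert a dotted quad (address or mask) to a 32-bit integer."""
    return int(ipaddress.IPv4Address(dotted))


def build_tables(groups):
    """groups: iterable of (name, "network/mask") pairs (hypothetical input format).

    Returns a list of (mask, {masked_network: name}) sorted so that masks
    with the most 1 bits are probed first.
    """
    tables = defaultdict(dict)
    for name, spec in groups:
        net, mask = (to_int(part) for part in spec.split("/"))
        tables[mask][net & mask] = name
    return sorted(tables.items(),
                  key=lambda kv: bin(kv[0]).count("1"),
                  reverse=True)


def lookup(tables, ip):
    """Return the matching group whose mask has the most 1 bits, or None."""
    addr = to_int(ip)
    for mask, table in tables:
        name = table.get(addr & mask)
        if name is not None:
            return name
    return None


tables = build_tables([
    ("A", "192.168.0.0/252.255.0.255"),
    ("B", "192.168.0.0/255.255.0.0"),
])
print(lookup(tables, "192.168.5.0"))  # matches both groups; A has more 1 bits
```

With fewer than 1000 entries the number of distinct masks is often small, but that is an assumption about the data; if almost every entry has its own mask this degrades to a linear scan.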

You build a binary tree that checks the bits individually. You order the bit-checks in a form that gives you the "bushiest tree". You have a post order traversal, so that it checks full depth before exiting, thus returning the longest hit.
Pseudocode:
nodeCheck(bitVector, index) { // bitVector is the IP address bits, reordered for the bushy tree
    if (myVal == -2) return -1; // mismatched bit encountered, no point continuing
    lVal = -1; rVal = -1;
    if (Left != NULL && bitVector[index] == 0) lVal = Left.nodeCheck(bitVector, index + 1);
    if (Right != NULL && bitVector[index] == 1) rVal = Right.nodeCheck(bitVector, index + 1);
    if (lVal > rVal) return lVal; // higher numbers have >= number of 1's in netmask
    if (rVal > -1) return rVal;
    return myVal; // the group that getting this far would place you in, -1 if none
}
Sure, for speed you'd want to drop the OO overhead, but the concept is the same.
The logic is a bit wonky, but the idea is sound.
Given that you already have the radix tree down, I didn't want to go too deep into it.
The post-order traversal simply lets you grab the longest match without getting too weird.

The simple answer is: order your bits to fit your tree. Make your tree as bushy (that is, as short) as possible.
A more thorough answer:
Since these two masks have the same number of 1 bits, their order shouldn't matter.
Let's call them:
0.0.0.0/255.255.0.255 A
0.0.0.0/255.255.255.0 B
incoming 0.0.111.0
octet 1 2 3 4 just so we have right ordering
And I'm going to do them by octets because I'm lazy.
To make the bushiest tree you need to check octet 3 or 4 as your first test; 3, being the lower, arbitrarily takes precedence.
So this looks at the value and checks the right-hand branch. The right-hand branch is another node; it checks octet 1 and moves down the left-hand branch to the next node, which checks octet 2. That node checks the left (octet 4) and gets -1 (via NULL), then the right and gets -1 (via NULL), so it returns A (we'll call it an enumerated type).
So the octet ordering becomes 3 1 2 4.
Generally you want to order the bit checks so that the early levels are doing some kind of check. In this case we push octet 4 to the end because if the check on octet 3 hits (it was a zero), the check on octet 4 is a waste and doesn't need to be done. But octets 1 and 2 need to be checked no matter the outcome of the first check.
On a larger problem there will be some nodes that have no check, sending them to identical left and right branches regardless of the value of the bit.
A poorly built but valid tree could take an ordering of 3 4 1 2, so if the first check passes (0 instead of 111), the second check is a waste because we already belong in group B, no matter the value of octet 4.
Good luck.

Related

Keep track of the result of accumulation for operation without inverse

I have an operation A * A -> A, which is commutative and associative. This means the order I apply it in doesn't matter, as long as I use the same elements. Nice.
I have to apply it to a list of values. To be more precise, I have to use it as the operation to accumulate the values of the list. So far, so good.
I then have a series of requests to add an element to the list, or erase it from the list. After each insertion or deletion, I have to return the new accumulated value for the new list. Simple, right?
The problem is I don't have an inverse; that is no operation '/' able to remove b if I only know a * b and tell me the other operand must have been a. (in fact, there isn't even an identity element)
So, my only obvious option is to accumulate again at every deletion -in linear time.
Can I do better? I've thought a lot about it.
And the answer is, of course I can... if I really want: I need to implement a custom binary tree, maybe a red/black one to have good worst case guarantees. Have next to the value an additional cache storing the result of the whole subtree.
cache = value * left.cache * right.cache
Maintain this invariant after every operation; then the root cache is the result.
However, "implement a custom R/B tree while maintaining an additional invariant" isn't something I'm particularly comfortable doing. Well, I would do it, but I wouldn't swear by its correctness. Plus, the constant in front of the log would probably be significant. It seems pretty unwieldy for a simple thing like keeping track of an accumulation.
Does anyone see a better solution?
For completeness: the operation is a union of filters. A filter is a pair (code, mask), and a value "passes the filter" if (in C bitwise operators) ((value ^ code) & mask) == 0; that is, if its bits at the positions set in mask are equal to the corresponding bits in code. The union therefore sets to 0 (ignored) the mask bits where the masks or codes differ, and keeps the ones which are the same.
Bonus appreciation to anyone finding a way to exploit the specific properties of the operation to get a solution more efficient than it is possible for the general problem I abstracted! ;-)
For your specific problem you could keep track for each bit x:
The total number of times that bit x is set to 1 in a mask
The total number of times that bit x is set to 1 in a mask and bit x of code is equal to 0
The total number of times that bit x is set to 1 in a mask and bit x of code is equal to 1
With these 3 counts (for each bit) it is straightforward to compute the union of all the filters.
The complexity is O(R) (where R is the number of bits in mask) to add or remove a filter.
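As a rough sketch of the counting idea in Python: a mask bit survives the union only if every filter has that bit set in its mask and all codes agree on it, so it is enough to compare the per-bit counts with the number of filters currently stored. The class name FilterUnion and its methods are hypothetical, and the sketch assumes remove is only ever called with a (code, mask) pair that was previously added. The first count from the list above is simply the sum of the other two, so it is not stored separately here.

```python
class FilterUnion:
    """Maintains the union of (code, mask) filters under add/remove."""

    def __init__(self, bits=32):
        self.bits = bits
        self.n = 0                # number of filters currently stored
        self.zeros = [0] * bits   # mask bit set and code bit == 0
        self.ones = [0] * bits    # mask bit set and code bit == 1

    def _update(self, code, mask, delta):
        self.n += delta
        for x in range(self.bits):
            if (mask >> x) & 1:
                if (code >> x) & 1:
                    self.ones[x] += delta
                else:
                    self.zeros[x] += delta

    def add(self, code, mask):
        self._update(code, mask, +1)

    def remove(self, code, mask):
        # Assumes this exact filter was added earlier.
        self._update(code, mask, -1)

    def union(self):
        """Return (code, mask) of the union, or None when no filter is stored."""
        if self.n == 0:
            return None
        code = mask = 0
        for x in range(self.bits):
            if self.ones[x] == self.n:      # everyone checks bit x and wants 1
                mask |= 1 << x
                code |= 1 << x
            elif self.zeros[x] == self.n:   # everyone checks bit x and wants 0
                mask |= 1 << x
        return code, mask


u = FilterUnion(bits=4)
u.add(0b1010, 0b1111)
u.add(0b1000, 0b1110)
print(u.union())
u.remove(0b1000, 0b1110)
print(u.union())
```

Both add and remove touch each bit once, which matches the O(R) cost mentioned above; there is no identity element, hence the None for an empty set.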

Algorithm for expressing reordering, as minimum number of object moves

This problem arises in synchronization of arrays (ordered sets) of objects.
Specifically, consider an array of items, synchronized to another computer. The user moves one or more objects, thus reordering the array, behind my back. When my program wakes up, I see the new order, and I know the old order. I must transmit the changes to the other computer, reproducing the new order there. Here's an example:
index 0 1 2
old order A B C
new order C A B
Define a move as moving a given object to a given new index. The problem is to express the reordering by transmitting a minimum number of moves across a communication link, such that the other end can infer the remaining moves by taking the unmoved objects in the old order and moving them into as-yet unused indexes in the new order, starting with the lowest index and going up. This method of transmission would be very efficient in cases where a small number of objects are moved within a large array, displacing a large number of objects.
Hang on. Let's continue the example. We have
CANDIDATE 1
Move A to index 1
Move B to index 2
Infer moving C to index 0 (the only place it can go)
Note that the first two moves are required to be transmitted. If we don't transmit Move B to index 2, B will be inferred to index 0, and we'll end up with B A C, which is wrong. We need to transmit two moves. Let's see if we can do better…
CANDIDATE 2
Move C to index 0
Infer moving A to index 1 (the first available index)
Infer moving B to index 2 (the next available index)
In this case, we get the correct answer, C A B, transmitting only one move, Move C to index 0. Candidate 2 is therefore better than Candidate 1. There are four more candidates, but since it's obvious that at least one move is needed to do anything, we can stop now and declare Candidate 2 to be the winner.
I think I can do this by brute forcibly trying all possible candidates, but for an array of N items there are N! (N factorial) possible candidates, and even if I am smart enough to truncate unnecessary searches as in the example, things might still get pretty costly in a typical array which may contain hundreds of objects.
The solution of just transmitting the whole order is not acceptable, because, for compatibility, I need to emulate the transmissions of another program.
If someone could just write down the answer that would be great, but advice to go read Chapter N of computer science textbook XXX would be quite acceptable. I don't know those books because, I'm, hey, only an electrical engineer.
Thanks!
Jerry Krinock
I think the problem is reducible to the longest common subsequence problem: find the longest common subsequence of the old and new orders, and transmit moves only for the objects that do not belong to it. I have no proof of optimality, just my intuition, so I might be wrong. Even if I'm wrong, it may be a good starting point for some fancier algorithm.
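To make the LCS suggestion concrete, here is a minimal Python sketch. Because both orders contain the same distinct objects, the longest common subsequence can be found as a longest increasing subsequence of new indexes, which is what the code does. The names minimal_moves and apply_moves are hypothetical; apply_moves mimics the inference rule from the question (unmoved objects fill the unused indexes in old order). It assumes the items are distinct, hashable, and never None.

```python
from bisect import bisect_left


def minimal_moves(old, new):
    """Return [(item, new_index), ...] for items outside a longest common subsequence."""
    pos_in_new = {item: i for i, item in enumerate(new)}
    seq = [pos_in_new[item] for item in old]   # new index of each item, in old order

    tails = []       # tails[k]: index into seq ending the best subsequence of length k+1
    tail_vals = []   # seq value at tails[k], kept sorted for bisect
    prev = [-1] * len(seq)
    for i, v in enumerate(seq):
        k = bisect_left(tail_vals, v)
        if k == len(tails):
            tails.append(i)
            tail_vals.append(v)
        else:
            tails[k] = i
            tail_vals[k] = v
        prev[i] = tails[k - 1] if k > 0 else -1

    keep = set()     # items that stay in relative order and need no move
    i = tails[-1] if tails else -1
    while i != -1:
        keep.add(old[i])
        i = prev[i]
    return [(item, pos_in_new[item]) for item in new if item not in keep]


def apply_moves(old, moves):
    """Receiver side: place moved items, then fill the gaps in old order."""
    result = [None] * len(old)
    moved = set()
    for item, index in moves:
        result[index] = item
        moved.add(item)
    rest = (item for item in old if item not in moved)
    return [slot if slot is not None else next(rest) for slot in result]


moves = minimal_moves(list("ABC"), list("CAB"))
print(moves)
print(apply_moves(list("ABC"), moves))
```

For the example in the question this transmits the single move of C to index 0, the same as Candidate 2.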
Information theory based approach
First, have a bit series in which 0 corresponds to 'regular order' and 1 corresponds to 'irregular entry'. Whenever there is an irregular entry, also add the original location of the entry that comes next.
Eg. Assume original order of ABCDE for the following cases
ABDEC: 001 3 01 2
BCDEA: 1 1 0001 0
Now, if the probability of making a 'move' is p, this method requires roughly n + n*p*log(n) bits.
Note that if p is small the number of 0s is going to be high. You can further compress the result to:
n*(p*log(1/p) + (1-p)*log(1/(1-p))) + n*p*log(n) bits

Given a flat file of IP Ranges and mappings, find a city given an IP

This is the question:
Given a flat text file that contains a range of IP addresses that map to a location (e.g. 192.168.0.0-192.168.0.255 = Boston, MA), come up with an algorithm that will find a city for a specific IP address if a mapping exists.
My only idea is parse the file, and turn the IP ranges into just ints (multiplying by 10/100 if it's missing digits) and place them in a list, while also putting the lower of the ranges into a hash as the key with the location as a value. Sort the list and perform a slightly modified binary search. If the index is odd, -1 and look in the hash. If it's even, just look in the hash.
Any faults in my plans, or better solutions?
Your approach seems perfectly reasonable.
If you are interested in doing a bit of research / extra coding, there are algorithms that will asymptotically outperform the standard binary search technique by relying on the fact that your IP addresses can be interpreted as integers in the range from 0 to 2^32 - 1. For example, the van Emde Boas tree and y-fast trie data structures can implement the predecessor search operation that you're looking at in time O(log log U), where U is the maximum possible IP address, as opposed to the O(log N) that binary search takes. The constant factors are higher, though, which means that there is no guarantee that this approach will be any faster. However, it might be worth exploring as another approach that could potentially be even faster.
Hope this helps!
The problem smells of ranges, and one of the good data structures for this problem would be a Segment Tree.
The root of the segment tree can represent the addresses (0.0.0.0 - 255.255.255.255). The left sub-tree would represent the addresses (0.0.0.0 - 127.255.255.255) and the right sub-tree would represent the range (128.0.0.0 - 255.255.255.255), and so on. This will go on till we reach ranges which cannot be further sub-divided. Say, if we have the range 32.0.0.0 - 63.255.255.255, mapped to some arbitrary city, it will be a leaf node, we will not further subdivide that range when we arrive there, and tag it to the specific city.
To search for a specific mapping, we follow the tree, just as we do in a Binary Search Tree. If your IP lies in the range of the left sub-tree, move to the left sub-tree, else move to the right sub-tree.
The good parts:
You need not have all sub-trees, only add the sub-trees which are required. For example, if in your data, there is no city mapped for the range (0.0.0.0 - 127.255.255.255), we will not construct that sub-tree.
We are space efficient. If the entire range is mapped to one city, we will create only the root node!
This is a dynamic data-structure. You can add more cities, split-up ranges later on, etc.
You will be making a constant number of operations, since the maximum depth of the tree would be 4 x log2(256) = 32. For this particular problem it turns out that Segment Trees would be as fast as van Emde Boas trees, and require less space (O(N)).
This is a simple, but non-trivial data-structure, which is better than sorting, because it is dynamic, and easier to explain to your interviewer than van-Emde Boas trees.
This is one of the easiest non-trivial data-structures to code :)
Please note that in some Segment Tree tutorials, they use arrays to represent the tree. This is probably not what you want, since we would not be populating the entire tree, so dynamically allocating nodes, just like we do in a standard Binary Tree is the best.
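A minimal Python sketch of such a dynamically allocated tree, assuming addresses and range bounds have already been converted to 32-bit integers. Node, insert and find are hypothetical names, and the halving of the address range at each level is one straightforward reading of the description above, not a reference implementation.

```python
FULL_LO, FULL_HI = 0, 2**32 - 1


class Node:
    __slots__ = ("left", "right", "city")

    def __init__(self):
        self.left = None
        self.right = None
        self.city = None   # set when this node's whole range maps to one city


def insert(node, lo, hi, city, node_lo=FULL_LO, node_hi=FULL_HI):
    """Tag the nodes that together cover [lo, hi], creating children on demand."""
    if lo <= node_lo and node_hi <= hi:
        node.city = city           # fully covered: stop subdividing here
        return
    mid = (node_lo + node_hi) // 2
    if lo <= mid:
        if node.left is None:
            node.left = Node()
        insert(node.left, lo, hi, city, node_lo, mid)
    if hi > mid:
        if node.right is None:
            node.right = Node()
        insert(node.right, lo, hi, city, mid + 1, node_hi)


def find(root, addr):
    """Walk down at most 32 levels; the deepest tagged node wins."""
    node, node_lo, node_hi = root, FULL_LO, FULL_HI
    city = None
    while node is not None:
        if node.city is not None:
            city = node.city
        mid = (node_lo + node_hi) // 2
        if addr <= mid:
            node, node_hi = node.left, mid
        else:
            node, node_lo = node.right, mid + 1
    return city


root = Node()
insert(root, 0x20000000, 0x3FFFFFFF, "Some City")    # 32.0.0.0 - 63.255.255.255
insert(root, 3232235520, 3232235775, "Boston, MA")   # 192.168.0.0 - 192.168.0.255
print(find(root, 3232235600))
```

The 32.0.0.0 - 63.255.255.255 range ends up as a single tagged node three levels down, as in the example above; ranges that are not aligned to a power of two are split across several nodes.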
"My only idea is parse the file, and turn the IP ranges into just ints (multiplying by 10/100 if it's missing digits)..."
If following this approach, you would probably want to multiply by 256^3, 256^2, 256 and 1 respectively for A, B, C and D in an address A.B.C.D. That effectively recreates the IP address as a 32-bit unsigned number.
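A small Python sketch of that conversion; ip_to_int is a hypothetical helper name and no validation of the input is attempted.

```python
def ip_to_int(dotted):
    """Turn "A.B.C.D" into A*256^3 + B*256^2 + C*256 + D."""
    a, b, c, d = (int(part) for part in dotted.split("."))
    return a * 256**3 + b * 256**2 + c * 256 + d


print(ip_to_int("192.168.0.0"))
print(ip_to_int("192.168.0.255"))
```

Because the result preserves the numeric order of addresses, range checks become plain integer comparisons.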
"... and place them in a list, while also putting the lower of the ranges into a hash as the key with the location as a value. Sort the list and perform a slightly modified binary search. If the index is odd, -1 and look in the hash. If it's even, just look in the hash."
I would suggest creating a contiguous array (a std::vector) containing structs with the lower and upper ranges (and location name - discussed below). Then as you say you can binary search for a range including a specific value, without any odd/even hassles.
Using the lower end of the range as a key in a hash is one way to avoid reserving space for the location names in the array. But given the average number of characters in a city name, the likely size of pointers, and the choice between a sparsely populated hash table, lengthy displacement lists to search in successive alternative buckets, or further indirection to arbitrary-length containers, you'd need to be pretty desperate to bother trying. In the first instance, storing the location in the struct alongside the IP value range seems good.
Alternatively, you could create a tree based on e.g. the individual 0-255 IP values: each level in the tree could be either an array of 256 values for direct indexing, or a sorted array of populated values. That can reduce the number of IP value comparisons you're likely to need to make (from O(log2 N) to O(1)).
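The sorted-array approach might look like the following in Python, using the standard bisect and ipaddress modules. parse_line, build_index and find_city are hypothetical names, the line format is taken from the example in the question, and the sketch assumes the ranges in the file do not overlap.

```python
import bisect
import ipaddress


def parse_line(line):
    """Parse "lo-hi = City" into (lo_int, hi_int, city)."""
    rng, city = line.split("=", 1)
    lo, hi = rng.strip().split("-")
    return (int(ipaddress.IPv4Address(lo.strip())),
            int(ipaddress.IPv4Address(hi.strip())),
            city.strip())


def build_index(lines):
    """Sort the ranges by lower bound; keep the lower bounds in a parallel list."""
    ranges = sorted(parse_line(line) for line in lines if line.strip())
    starts = [r[0] for r in ranges]
    return starts, ranges


def find_city(index, ip):
    """Binary-search for the last range starting at or below ip, then check its upper bound."""
    starts, ranges = index
    addr = int(ipaddress.IPv4Address(ip))
    i = bisect.bisect_right(starts, addr) - 1
    if i >= 0 and addr <= ranges[i][1]:
        return ranges[i][2]
    return None


index = build_index([
    "192.168.0.0-192.168.0.255 = Boston, MA",
    "10.0.0.0-10.0.0.255 = Austin, TX",
])
print(find_city(index, "192.168.0.17"))
```

Keeping the city in the same tuple as the bounds avoids the separate hash lookup and the odd/even index bookkeeping.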
In your example, 192.168.0.0-192.168.0.255 = Boston, MA.
Will the first three octets (192.168.0) be the same for both IP addresses in the entry?
Also, will the first three octets be unique for a city?
If so, then this problem can be solved more easily.

Random variable-length encoded numbers with uniform distribution

Suppose I have data presented with variable-length encoding when I can retrieve the data parsing some virtual b-tree and stopping when I reach the item (similar to Huffman encoding). There is unknown number of items (in the best case only the upper limit is known). Is there an algorithm to generate uniformly distributed numbers? The problem is that a coin-based algorithm will give non-uniform result in this case, for example if there's a number encoded as 101 and there's a number encoded 10010101, the latter will appear very rarely comparing to the former.
UPDATE: In other words, I have a set of at most N elements (but maybe fewer) where every element can be addressed with an arbitrary number of bits (in accordance with information theory, so if one element is encoded 101 then no other element can be encoded with the same prefix). So it's more like a binary tree where I go left or right depending on a bit and at some point I get to the data item. I want to get a sequence of random numbers addressed with this technique, but their distribution should be uniform (the example of why choosing left or right at random won't work is above, with the numbers 101 and 10010101).
Thanks
Max
I can think of three basic methods, one of which involves frequent reguessing and one of which involves keeping extra information. I think that doing one or the other of these things is unavoidable. I'm going to begin with the extra information one:
In each node, store a number count which represents the number of leaves (data items) in its subtree, so a leaf has count 1. At every node, you'll compare a number between 1 and that node's count with the left child's count to decide whether to go left or right. Here's the algorithm:
n := random integer between 1 and root.count
node := root
while node.count != 1
    if n <= node.left.count
        node = node.left
    else
        n = n - node.left.count
        node = node.right
So, essentially, we're imposing a left-to-right ordering on all leaves and selecting the nth one from the left. This is fairly quick, taking only O(depth of tree) time, which is likely the best we can do without doing something like also building a vector which contains all the node labels. This also adds an overhead of O(depth of tree) to any changes to the tree since counts must be corrected. If you're going the other way and never changing the tree at all but going to be selecting random nodes a lot, just bite the bullet and put all of the node labels in a vector. That way you can select a random one in O(1) after O(N) initial set-up time.
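Here is a minimal Python sketch of the count-based selection, assuming count is the number of leaves under a node. The Node class and pick_uniform are hypothetical names; the rng parameter exists only so that a different source of randomness can be plugged in. Note that n is reduced by the left child's count before stepping to the right child.

```python
import random


class Node:
    def __init__(self, left=None, right=None, item=None):
        self.left, self.right, self.item = left, right, item
        if left is None and right is None:
            self.count = 1   # a leaf counts itself
        else:
            self.count = sum(c.count for c in (left, right) if c is not None)


def pick_uniform(root, rng=random):
    """Select the n-th leaf from the left for a uniformly random n."""
    n = rng.randint(1, root.count)   # inclusive on both ends
    node = root
    while node.left is not None or node.right is not None:
        left_count = node.left.count if node.left is not None else 0
        if n <= left_count:
            node = node.left
        else:
            n -= left_count          # skip the leaves under the left child
            node = node.right
    return node.item


# Codes: a = 0, b = 10, c = 11
tree = Node(Node(item="a"), Node(Node(item="b"), Node(item="c")))
print(pick_uniform(tree))
```

Each leaf corresponds to exactly one value of n, so every item is returned with probability 1/root.count regardless of how deep it sits.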
If, however, you don't want to use up any storage space, here's an alternative with a lot of reguessing. First find a bound (which I'll label B) for the depth of the tree (we can use N-1 if needed, but obviously, that's a very loose bound. The tighter the bound which can be found, the faster the algorithm runs). Next we're going to generate a possible node label in a random, but even way. There are 2^(B+1)-1 possibilities. It's not just 2^B because, for example, the strings "0011" and "11" are completely different strings. As a result, we need to count all possible binary strings of length between 0 and B. Obviously, we have 2^i strings of length i. So for strings of length B or less, we have sum(i=0 to B){2^i} = 2^(B+1)-1. So, we can just choose a number between 0 and 2^(B+1)-2 and then find the corresponding node label. Of course, the mapping from numbers to node labels isn't trivial, so I'll provide it here.
We convert the number we have chosen into a string of bits in the ordinary way. Then, reading from the left, if the first digit is a 0, then the node label is the remaining string to the right (possibly the empty string, which is a valid node label although not likely to be in use). If the first digit is a 1, then we throw it away and repeat this process. Thus, if B=4, then the node label "0001" would come from the number "00001". The node label "001" would come from the number "10001". The node label "01" would come from the number "11001". The node label "1" would come from the number "11101". And the node label "" would come from the number "11110". We did not include the number 2^(B+1)-1 ("11111" in this case) which has no valid interpretation under this scheme. I'll leave it as an exercise to the reader to prove to themselves that every string from length 0 to B can be represented under this scheme. Rather than trying to prove it, I'll just assert that it will work.
So now we have a node label. The next step is to see if that label exists by traversing the tree. If it does, we're done. If it doesn't, then choose a new number and start over (that's the reguessing part). It's likely to have to reguess a lot, since only a small fraction of legal node labels will be in use, but this won't skew the fairness, just increase the time.
Here's a pseudo-code version of this process in four functions:
function num_to_binary_list(n, digits) =
    if digits == 0 return ()
    if n mod 2 == 0 return 0 :: num_to_binary_list(n/2, digits-1)
    else return 1 :: num_to_binary_list((n-1)/2, digits-1)
function binary_list_to_node_label_list(l) =
    if l.head() == 0 return l.tail()
    else return binary_list_to_node_label_list(l.tail())
function check_node_label_list_against_tree(str, node) =
    if node == null return false, null
    if str.isEmpty()
        if node.isLeaf() return true, node
        else return false, null
    if str.head() == 0 return check_node_label_list_against_tree(str.tail(), node.left)
    else return check_node_label_list_against_tree(str.tail(), node.right)
function generate_random_node(tree, b) =
    found := false
    while (not found)
        x := random(0, 2**(b+1)-2) // We're assuming that this random selects inclusively
        node_label := binary_list_to_node_label_list(num_to_binary_list(x, b+1))
        found, node := check_node_label_list_against_tree(node_label, tree)
    return node
The timing analysis for this, of course, is pretty horrendous. Basically, the while loop will run an average of (2^(B+1)-1)/N times. So, in the worst case, it's O((2^N)/N) which is terrible. In the best case, B would be on the order of log(N), so it would be roughly O(1), but that requires that the tree be fairly balanced which it may not be. Still, if you really want no extra space, this method does that.
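For comparison, a compact Python sketch of the same reguessing scheme. It assumes the tree is given as nested 2-tuples (left, right), with any non-tuple value acting as a leaf and None marking a missing child; that representation, and the names label_from_number and pick_by_rejection, are made up for this illustration. The label is derived exactly as described in the prose: write the number with B+1 bits, drop the leading 1s and the first 0, and keep the rest.

```python
import random


def label_from_number(x, b):
    """Map x in [0, 2**(b+1) - 2] to a distinct bit string of length 0..b."""
    bits = format(x, "0{}b".format(b + 1))
    return bits[bits.index("0") + 1:]


def pick_by_rejection(tree, b, rng=random):
    """Draw labels until one names an existing leaf; b bounds the tree depth."""
    while True:
        x = rng.randint(0, 2 ** (b + 1) - 2)   # inclusive on both ends
        node = tree
        for bit in label_from_number(x, b):
            if not isinstance(node, tuple):
                node = None                    # walked past a leaf or a gap
                break
            node = node[int(bit)]
        if node is not None and not isinstance(node, tuple):
            return node                        # the label ends exactly on a leaf


# Codes: a = 0, b = 10, c = 11, so the depth bound is 2
tree = ("a", ("b", "c"))
print(label_from_number(0b10001, 4))
print(pick_by_rejection(tree, 2))
```

No extra storage is kept in the tree, at the price of the retry loop analysed above.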
I don't really think that you can do better than this last method without storing some information. It sounds appealing to be able to traverse the tree, making random decisions as you go, but without storing additional information about the structure, you're just not going to be able to do that. Every time you make a branching decision, you could have just one node on the left side and a million nodes on the right side or it could have a million nodes on the left side and just one on the right side. Because those are both possible and you don't know which is the case, there's simply no way to make an even random decision between the two sides. Obviously 50-50 doesn't work and any other choice is going to be similarly problematic.
So, if you don't want extra space, the second method will work, but be slow. If you don't mind adding some extra space, the first method will work and be fast. And, as I said earlier, if you're not going to be changing the tree and you'll be selecting a lot of random nodes, then bite the bullet and just traverse the tree and stick all leaf nodes in a self-growing array or vector and then pick from that.

Efficient way to handle adding and removing items by bitwise And

So, suppose you have a collection of items. Each item has an identifier which can be represented using a bitfield. As a simple example, suppose your collection is:
0110, 0111, 1001, 1011, 1110, 1111
So, you then want to implement a function, Remove(bool bitval, int position). For example, a call to Remove(0, 2) would remove all items where index 2(i.e. 3rd bit) was 0. In this case, that would be 1001, only. Remove(1,1) would remove 1110, 1111, 0111, and 0110. It is trivial to come up with an O(n) collection where this is possible (just use a linked list), with n being the number of items in the collection. In general the number of items to be removed is going to be O(n) (assuming a given bit has a ≥ c% chance of being 1 and a ≥ c% chance of being 0, where c is some constant > 0), so "better" algorithms which somehow are O(l), with l being the number of items being removed, are unexciting.
Is it possible to define a data structure where the average (or better yet, worst case) removal time is better than O(n)? A binary tree can do pretty well (just remove all left/right branches at the height m, where m is the index being tested), but I'm wondering if there is any way to do better (and quite honestly, I'm not sure how to remove all left or right branches at a particular height in an efficient manner). Alternatively, is there a proof that doing better is not possible?
Edit: I'm not sure exactly what I'm expecting in terms of efficiency (sorry Arno), but a basic explanation of its possible application is thus: Suppose we are working with a binary decision tree. Such a tree could be used for a game tree or a puzzle solver or whatever. Further suppose the tree is small enough that we can fit all of the leaf nodes into memory. Each such node is basically just a bitfield listing all of the decisions. Now, if we want to prune arbitrary decisions from this tree, one method would be to just jump to the height where a particular decision is made and prune the left or right side of every node (left meaning one decision, right meaning the other). Normally in a decision tree you only want to prune one subtree at a time (since the parent of that subtree is different from the parent of other subtrees and thus the decision which should be pruned in one subtree should not be pruned from others), but in some types of situations this may not be the case. Further, you normally only want to prune everything below a particular node, but in this case you'll be leaving some stuff below the node but also pruning below other nodes in the tree.
Anyhow, this is somewhat of a question based on curiosity; I'm not sure it's practical to use any results, but I'm interested in what people have to say.
Edit:
Thinking about it further, I think the tree method is actually O(n / logn), assuming it's reasonably dense. Proof:
Suppose you have a binary tree with n items. Its height is log(n). Removing half of the bottom row will require n/2 removals. Removing half of the row above will require n/4. The sum of operations over all rows is n-1. So the average number of removals is (n-1) / log(n).
Provided the length of your bitfields is limited, the following may work:
First, represent the bitfields that are in the set as an array of booleans, so in your case (4 bit bitfields), new bool[16];
Transform this array of booleans into a bitfield itself, so a 16-bit bitfield in this case, where each bit represents whether the bitfield corresponding to its index is included
Then operations become:
Remove(0, 0) = and with bitmask 1010101010101010
Remove(1, 0) = and with bitmask 0101010101010101
Remove(0, 2) = and with bitmask 1111000011110000
Note that more complicated 'add/remove' operations could then also be added as O(1) bit-logic.
The only down-side is that extra work is needed to interpret the resulting 16-bit bitfield back into a set of values, but with lookup arrays that might not turn out too bad either.
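A short Python sketch of this encoding, using an arbitrary-precision int as the membership bitfield. keep_mask and BitfieldSet are hypothetical names. Positions here count from the least significant bit, as the masks listed above imply, whereas the examples in the question count from the left; for 4-bit items, index 2 in the question is position 1 here.

```python
def keep_mask(bitval, position, width):
    """Membership mask that keeps exactly the values whose bit `position` differs from bitval."""
    mask = 0
    for value in range(1 << width):
        if ((value >> position) & 1) != bitval:
            mask |= 1 << value
    return mask


class BitfieldSet:
    def __init__(self, width, items=()):
        self.width = width
        self.members = 0                      # bit v is set when value v is in the set
        for value in items:
            self.members |= 1 << value
        # Precompute one mask per (bitval, position) so remove() is a single AND.
        self._masks = {(b, p): keep_mask(b, p, width)
                       for b in (0, 1) for p in range(width)}

    def remove(self, bitval, position):
        self.members &= self._masks[(bitval, position)]

    def items(self):
        return [v for v in range(1 << self.width) if (self.members >> v) & 1]


s = BitfieldSet(4, [0b0110, 0b0111, 0b1001, 0b1011, 0b1110, 0b1111])
s.remove(0, 1)   # drops 1001, the only item whose position-1 bit is 0
print([format(v, "04b") for v in s.items()])
```

The AND is a single machine operation only while 2^width fits in a word; Python hides that limit, but the cost still grows with the size of the integer.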
Addendum:
Additional down-sides:
Once the size of an integer is exceeded, every added bit to the original bit-fields will double the storage space. However, this is not much worse than a typical scenario using another collection where you have to store on average half the possible bitmask values (provided the typical scenario doesn't store far less remaining values).
Once the size of an integer is exceeded, every added bit also doubles the number of 'and' operations needed to implement the logic.
So basically, I'd say if your original bitfields are not much larger than a byte, you are likely better off with this encoding, beyond that you're probably better off with the original strategy.
Further addendum:
If you only ever execute Remove operations, which over time thins out the set state-space further and further, you may be able to stretch this approach a bit further (no pun intended) by making a more clever abstraction that somehow only keeps track of the int values that are non-zero. Detecting zero values may not be as expensive as it sounds either if the JIT knows what it's doing, because a CPU 'and' operation typically sets the 'zero' flag if the result is zero.
As with all performance optimizations, this one would need some measurement to determine whether it is worthwhile.
If each decision bit and position are listed as objects, {bit value, k-th position}, you would end up with an array of length 2*k. If you link to each of these array positions from your item, represented as a linked list (which are of length k), using a pointer to the {bit, position} object as the node value, you can "invalidate" a bunch of items by simply deleting the {bit, position} object. This would require you, upon searching the list of items, to find "complete" items (it makes search REALLY slow?).
So something like:
[{0,0}, {1,0}, {0,1}, {1, 1}, {0,2}, {1, 2}, {0,3}, {1,3}]
and linked from "0100", represented as: {0->3->4->6}
You wouldn't know which items were invalid until you tried to find them (so it doesn't really limit your search space, which is what you're after).
Oh well, I tried.
Sure, it is possible (even if this is "cheating"). Just keep a stack of Remove objects:
struct Remove {
    bool set;
    int index;
};
The remove function just pushes an object on the stack. Voilà, O(1).
If you wanted to get fancy, your stack couldn't exceed (number of bits) without containing duplicate or impossible scenarios.
The rest of the collection has to apply the logic whenever things are withdrawn or iterated over.
Two ways to do insert into the collection:
Apply the Remove rules upon insert, to clear out the stack, making it O(n). Gotta pay somewhere.
Each bitfield has to store its index in the remove stack, to know which rules apply to it. Then, the stack size limit above wouldn't matter.
If you use an array to store your binary tree, you can quickly index any element (the children of the node at index n are at indexes (n+1)*2 and (n+1)*2-1). All the nodes at a given level are stored sequentially. The first node at level x is at index 2^x-1 and there are 2^x elements at that level.
Unfortunately, I don't think this really gets you much of anywhere from a complexity standpoint. Removing all the left nodes at a level is O(n/2) worst case, which is of course O(n). Of course the actual work depends on which bit you are checking, so the average may be somewhat better. This also requires O(2^n) memory which is much worse than the linked list and not practical at all.
I think what this problem is really asking is for a way to efficiently partition a set of sets into two sets. Using a bitset to describe the set gives you a fast check for membership, but doesn't seem to lend itself to making the problem any easier.

Resources