IP lookup tables and data structure? - data-structures

One technique to reduce the size of IP lookup tables is to remove redundancy. When a prefix P is
longer than a prefix P' (i.e., their first |P'| bits are the same) and they both have the same next hop, which
prefix can be removed from the table? How would you implement such a compression process, assuming
the original table is implemented as a trie?

This is called summary routing (route aggregation). When a prefix falls entirely within another route entry and uses the same gateway, you can simply drop the more specific (longer) entry.
Depending on how exactly your tree is organized, you might have to check each entry against all the others. With an appropriately sorted tree you would only have to check one branch, and with a really cleverly sorted tree you would just check the next shorter (covering) entry.
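Assuming the trie stores one bit per level with an optional next hop at each node (that layout and these names are my assumptions, not given in the question), a minimal compression pass might look like this: a depth-first walk carries down the next hop of the nearest enclosing prefix, clears any longer prefix that has the same hop, and prunes branches that end up empty.

class TrieNode:
    def __init__(self):
        self.children = {}    # bit ("0" or "1") -> TrieNode
        self.next_hop = None  # set if a prefix ends at this node

def compress(node, inherited_hop=None):
    # Drop the longer prefix P when the nearest enclosing prefix P' already
    # routes to the same next hop, then prune subtrees left with nothing.
    if node.next_hop is not None and node.next_hop == inherited_hop:
        node.next_hop = None
    covering_hop = node.next_hop if node.next_hop is not None else inherited_hop
    for bit, child in list(node.children.items()):
        compress(child, covering_hop)
        if child.next_hop is None and not child.children:
            del node.children[bit]

# usage: compress(root_of_trie)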

Related

Can two items in a hashmap be in different locations but have the same hashcode?

I'm new to hashing, and I've recently learned about hashmaps. I was wondering whether two objects with the same hashcode can possibly go to different locations in a hashmap.
I'm not completely sure and would appreciate any help.
As @Dai pointed out in the comments, this will depend on what kind of hash table you're using. (It turns out there are a bunch of different ways to build a hash table, and no one data structure is "the" way that hash tables work!)
One of the more common kinds of hash table uses a strategy called closed addressing (also known as separate chaining). In closed addressing, every item is mapped to a slot based on its hash code and stored with all the other items that also end up in that slot. Lookups are then done by finding which bucket to look in, then inspecting all the items in that bucket. In that case, any two items with the same hash code will end up in the same bucket. (They can't literally occupy the same spot within that bucket, though.)
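Here's a minimal sketch of a closed-addressing (chained) table in Python, just to make the idea concrete; the class and method names are my own, not from any particular library:

class ChainedHashMap:
    def __init__(self, num_buckets=16):
        self.buckets = [[] for _ in range(num_buckets)]

    def put(self, key, value):
        # every key with the same hash (mod bucket count) lands in the same bucket
        bucket = self.buckets[hash(key) % len(self.buckets)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)   # replace an existing entry
                return
        bucket.append((key, value))

    def get(self, key):
        bucket = self.buckets[hash(key) % len(self.buckets)]
        for k, v in bucket:
            if k == key:
                return v
        return None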
Another strategy for building hash tables uses an approach called open addressing. This is a family of different methods that are all based on the following idea. We require that each slot in the table store at most one element. As before, to do an insertion, we use the element's hash code to figure out which slot to put it in. If the slot is empty, great! We put the element there. If that slot is full, we can't put the item there. Instead, using some predictable strategy, we start looking at other slots until we find a free one, then put the item there. (The simplest way of doing this, linear probing, works by trying the next slot after the desired slot, then the next one, etc., wrapping around if need be.) In this system, since we can't store multiple items in the same spot, no, two elements with the same hash code don't have to (and in fact, can't!) occupy the same spot.
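And a correspondingly minimal open-addressing sketch using linear probing (again, the names are my own; resizing is left out, so it assumes the table never fills up):

class LinearProbingMap:
    def __init__(self, capacity=16):
        self.slots = [None] * capacity   # each slot holds at most one (key, value)

    def _find_slot(self, key):
        i = hash(key) % len(self.slots)
        # probe the next slot, wrapping around, until we hit the key or a free slot
        while self.slots[i] is not None and self.slots[i][0] != key:
            i = (i + 1) % len(self.slots)
        return i

    def put(self, key, value):
        # sketch only: assumes the table is never completely full (no resizing shown)
        self.slots[self._find_slot(key)] = (key, value)

    def get(self, key):
        slot = self.slots[self._find_slot(key)]
        return slot[1] if slot else None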
A more recent hashing strategy that's becoming more popular is cuckoo hashing. In cuckoo hashing, we maintain some small number of separate hash tables (typically, two or three), where each slot can only hold one item. To insert an element, we try placing it in the first table at a spot determined by its hash code. If that spot is free, great! We put the item there. If not, we kick out the item there and try putting that item in the next table. This process repeats until eventually everything comes to rest or we get caught in a loop. Like open addressing, this system prevents multiple items from being stored in the same slot, so two elements with the same hash code might go to different places. (There are variations on cuckoo hashing in which each table slot can store a fixed but small number of items, in which case you could have two items with the same hash code in the same spot. But it's not guaranteed.)
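A bare-bones cuckoo hashing sketch along the same lines (two tables, one slot per position; the two hash functions here are simple stand-ins, not what a real implementation would use):

class CuckooHashMap:
    def __init__(self, capacity=16, max_kicks=32):
        self.tables = [[None] * capacity, [None] * capacity]
        self.max_kicks = max_kicks

    def _index(self, t, key):
        # one (stand-in) hash function per table
        return hash((t, key)) % len(self.tables[t])

    def get(self, key):
        for t in (0, 1):
            slot = self.tables[t][self._index(t, key)]
            if slot is not None and slot[0] == key:
                return slot[1]
        return None

    def put(self, key, value):
        # update in place if the key is already present in either table
        for t in (0, 1):
            i = self._index(t, key)
            slot = self.tables[t][i]
            if slot is not None and slot[0] == key:
                self.tables[t][i] = (key, value)
                return
        t = 0
        for _ in range(self.max_kicks):
            i = self._index(t, key)
            if self.tables[t][i] is None:
                self.tables[t][i] = (key, value)
                return
            # evict the current occupant and try to place it in the other table
            (key, value), self.tables[t][i] = self.tables[t][i], (key, value)
            t = 1 - t
        raise RuntimeError("insertion loop detected; a real table would rebuild here")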
There are some other hashing schemes I didn't describe here. FKS perfect hashing works by using two layers of hash tables, along the lines of closed addressing, but periodically rebuilds the whole table to ensure that no one bucket is too overloaded. Extendible hashing uses a trie-like structure to split overflowing buckets once they become too full. Hopscotch hashing is a hybrid between linear probing and chained hashing that plays well with concurrency. But hopefully this gives you a sense of how the type of hash table you use influences the answer to your question!

How to implement dynamic indexes?

I know, maybe the title is a little confusing; however, my actual question is basic, I think.
I'm working on a brand new LRU implementation, for which I use an Index Table that maps the name of an incoming packet to the index where the packet's content is stored in the CS (content store).
Each incoming packet is stored in the CS and can be addressed through the Index Table.
Now suppose a new packet arrives. Under LRU, its index must be set to the top of the CS (zero), which means all the other indexes have to be updated; they each need to be incremented as a result.
One obvious solution is to loop over all entries in the Index Table and increment them.
Is there any solution or structure that is used for such a problem?
I don't see how you are establishing the order of your cache from the description. But to answer your question, it's possible to reduce the LRU store operation to O(1) time complexity.
The classical way to do it is to have these two data structures:
Doubly linked list: maintains the cache order. Each node stores a data element (it plays the role of your content store).
Hash map: associates each key with a pointer to the corresponding node in the linked list (it plays the role of your index table).
So when you access data that is already stored in your cache, it must move to the top of the list: you delete the corresponding node from the linked list (in O(1) time, because you have access to its previous and next nodes) and re-insert it at the head.
For new data it is even simpler: just store it at the head of the list and put the (key, node pointer) pair in the hash map.
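Putting the two structures together, a minimal LRU cache sketch might look like this (the class and field names are my assumptions, not from your code; the hash map stores pointers to list nodes, so both get and put are O(1)):

class Node:
    __slots__ = ("key", "value", "prev", "next")
    def __init__(self, key=None, value=None):
        self.key, self.value = key, value
        self.prev = self.next = None

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.index = {}                        # key -> node (the "index table")
        self.head, self.tail = Node(), Node()  # sentinels; head side = most recent
        self.head.next, self.tail.prev = self.tail, self.head

    def _unlink(self, node):
        node.prev.next, node.next.prev = node.next, node.prev

    def _push_front(self, node):
        node.prev, node.next = self.head, self.head.next
        self.head.next.prev = node
        self.head.next = node

    def get(self, key):
        node = self.index.get(key)
        if node is None:
            return None
        self._unlink(node)       # O(1): we hold direct pointers to its neighbours
        self._push_front(node)
        return node.value

    def put(self, key, value):
        node = self.index.get(key)
        if node is not None:
            node.value = value
            self._unlink(node)
        else:
            if len(self.index) >= self.capacity:
                lru = self.tail.prev           # least recently used entry
                self._unlink(lru)
                del self.index[lru.key]
            node = Node(key, value)
            self.index[key] = node
        self._push_front(node)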

What is the most efficient way to match the IP addresses to huge route entries?

Imagine there is a firewall, and the system administrator has blocked many subnets, perhaps all the subnets of a specific country.
For example:
192.168.2.0 / 255.255.255.0
223.201.0.0 / 255.255.0.0
223.202.0.0 / 255.254.0.0
223.208.0.0 / 255.252.0.0
....
To determine whether an IP address has been blocked, the firewall might use the algorithm below.
def blocked(ip):
    for subnet in blocked_subnets:
        if in_subnet(subnet, ip):
            return True
    return False
But this algorithm takes too much time to run; its time complexity is O(n). If the route table contains too many entries, the network becomes almost unusable.
Is there a more efficient way to match IP addresses against a huge set of route entries? I guess it is based on some kind of tree or graph (a trie?). I have read something about longest prefix match and tries but didn't get the point.
All you really need is a trie with four levels. Each non-leaf node contains an array of up to 256 child nodes. Each node also contains a subnet mask. So, given your example:
192.168.2.0 / 255.255.255.0
223.201.0.0 / 255.255.0.0
223.202.0.0 / 255.254.0.0
223.208.0.0 / 255.252.0.0
Your tree would look something like the one below. The two numbers at each node are the IP segment followed by the corresponding octet of the subnet mask.

                 root
                /    \
         192,255      223,255
            |        /   |    \
         168,255 201,255 202,254 208,252
            |
          2,255
When you get an IP address, you break it into segments. You search for the first segment at the root level. For speed, you'll probably want to use an array at the root level so that you can do a direct lookup.
Say the first segment of the IP address is 223. You'd grab the node from root[223], and now you're working with just that one subtree. You probably don't want a full array at the other levels, unless your data is really dense. A dictionary of some kind for the subsequent levels is probably what you'll want. If the next segment is 201, you look up 201 in the dictionary for the 223 node, and now your possible list of candidates is just 64K items (i.e. all IP addresses that are 223.201.x.x). You can do the same thing with the other two levels. The result is that you can resolve an IP address in just four lookups: one lookup in an array, and three dictionary lookups.
This structure is also very easy to maintain. Inserting a new address or range requires at most four lookups and adds. Same with deleting. Updates can be done in-place, without having to rebuild the entire tree. You just have to make sure that you're not trying to read while you're updating, and you're not trying to do concurrent updates. But any number of readers can be accessing the thing concurrently.
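A minimal sketch of that octet-per-level trie in Python (the node layout and function names are assumptions; rather than storing a mask at each node, this version expands a prefix that doesn't end on an octet boundary into the range of octet values it covers, so every step stays a plain dictionary lookup):

import ipaddress

root = {}

def add_subnet(cidr):
    net = ipaddress.ip_network(cidr)
    octets = net.network_address.packed
    full, rem = divmod(net.prefixlen, 8)
    nodes = [root]
    for octet in octets[:full]:
        nodes = [n.setdefault(octet, {}) for n in nodes]
    if rem:
        # partial octet: mark every child value the remaining mask bits allow
        base = octets[full] & (0xFF << (8 - rem))
        span = 1 << (8 - rem)
        nodes = [n.setdefault(v, {}) for n in nodes for v in range(base, base + span)]
    for n in nodes:
        n["blocked"] = True

def is_blocked(ip):
    node = root
    for octet in ipaddress.ip_address(ip).packed:
        if node.get("blocked"):
            return True
        node = node.get(octet)
        if node is None:
            return False
    return node.get("blocked", False)

add_subnet("223.202.0.0/15")
print(is_blocked("223.203.4.5"))   # True: covered by the /15
print(is_blocked("223.204.0.1"))   # False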
Using a hash map, or a trie keyed on whole octets, makes it hard to deal with arbitrary CIDR ranges (i.e. the mask does not have to fall on an octet boundary, e.g. 192.168.1.0/28).
An efficient way of doing this is binary search, given that all these IP ranges don't overlap:
Convert the range A.B.C.D/X into a 32-bit integer representing the starting IP address, plus an integer giving how many IPs are in the range. For example, 192.168.1.0/24 converts to (3232235776, 256).
Add these ranges to a list or array and sort them by the starting IP address.
To match an incoming IP address against the list, do a binary search, as sketched below.
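A minimal sketch of that range-based binary search, assuming the ranges really are non-overlapping (the names are my own):

import bisect
import ipaddress

# (start, size) pairs for non-overlapping ranges, sorted by starting address
ranges = []
for cidr in ["192.168.2.0/24", "223.201.0.0/16", "223.202.0.0/15", "223.208.0.0/14"]:
    net = ipaddress.ip_network(cidr)
    ranges.append((int(net.network_address), net.num_addresses))
ranges.sort()
starts = [start for start, _ in ranges]

def is_blocked(ip):
    addr = int(ipaddress.ip_address(ip))
    # find the last range that starts at or before addr
    i = bisect.bisect_right(starts, addr) - 1
    if i < 0:
        return False
    start, size = ranges[i]
    return start <= addr < start + size

print(is_blocked("223.203.4.5"))   # True: inside 223.202.0.0/15
print(is_blocked("8.8.8.8"))       # False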
Use red-black or AVL trees to store the blocked IPs of each subnet. Since an IP is basically a tuple of 4 numbers, you can use a customized comparator in your programming language of choice and store the addresses in a red-black or AVL tree.
Comparator :-
Compare the 4 parts of an IPv4 address (or the corresponding parts of an IPv6 address) and decide which address is greater or less by the first part that differs.
Example:
10.0.1.1 and 10.0.0.1
Here ip1 > ip2 because the first part that differs (the 3rd) is greater in ip1.
Time Complexity :-
As a red-black tree is a balanced BST, insertion, deletion and search each take O(log n). Doing a search in each of the k subnets' trees gives O(k·log n) overall for looking up an IP.
Optimization: if the number of subnets is large, use a single red-black tree with a composite key but the same kind of comparison as above.
Key = (subnet_no, ip)
You can compare these keys the same way as above and get O(log S), where S is the total number of IP entries across all subnets.
This may be a simple one, but as no one said anything about memory constraints, you may use a lookup table (LUT). Having a 2^32-entry LUT is not impossible even in practice, and then the problem reduces to a single table lookup regardless of the number of rules. (The same approach can be used for routing as well.) If you want it fast, it takes 2^32 octets (4 GiB); if you can spend a bit more time, a bitwise table takes 2^32 bits, i.e. 512 MiB. Even in that case it can be made fast, but then high-level programming languages may produce suboptimal results.
Of course, the question of "fast" is always a bit tricky. Do you want it fast in practice or in theory? If in practice, on which platform? Even the LUT method may be slow if your system swaps the table out to disk, and depending on the cache architecture the more complicated methods may beat a RAM-based LUT because they fit into the processor cache. A cache miss may cost several hundred CPU cycles, and rather complicated operations can be done in that many cycles.
The problem with the LUT approach (in addition to the memory use) is the cost of rule deletions. Since the table is effectively the bitwise OR of all rules, there is no simple way to remove a rule. You have to determine where no other rules overlap the rule being deleted and then zero out those areas. This is probably best done bit by bit with the structures outlined in the other answers.
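For completeness, a sketch of the bitwise LUT variant; building the table in pure Python is slow, so treat this as an illustration of the shape of the approach rather than something you'd deploy:

import ipaddress

# one bit per IPv4 address: 2^32 bits = 512 MiB
lut = bytearray(2**32 // 8)

def block(cidr):
    net = ipaddress.ip_network(cidr)
    start = int(net.network_address)
    for addr in range(start, start + net.num_addresses):
        lut[addr >> 3] |= 1 << (addr & 7)

def is_blocked(ip):
    addr = int(ipaddress.ip_address(ip))
    return bool(lut[addr >> 3] & (1 << (addr & 7)))

block("192.168.2.0/24")
print(is_blocked("192.168.2.77"))   # True
print(is_blocked("192.168.3.77"))   # False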
Recall that an IP address is basically a 32-bit number.
You can canonicalize each subnet to its normal form and store all the normal forms in a hash table.
At run time, canonicalize the given address (easy to do) and check whether the hash table contains that entry. If it does, block; otherwise, permit.
Example: say you want to block the subnet 5.*.*.*. This is the network with the leading bits 00000101, so add the address 5.0.0.0, i.e. 00000101-00000000-00000000-00000000, to your hash table.
Once a specific address arrives, for example 5.1.2.3, canonicalize it back to 5.0.0.0 and check whether it is in the table.
The query time is O(1) on average using a hash table.
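A minimal sketch of this canonicalization approach, under the assumption that every blocked subnet uses the same prefix length (here /8, matching the example above):

import ipaddress

PREFIX_LEN = 8   # assumption: all blocked subnets share this one prefix length

# store the canonical (network) address of each blocked subnet
blocked = {int(ipaddress.ip_network("5.0.0.0/8").network_address)}

def is_blocked(ip):
    addr = int(ipaddress.ip_address(ip))
    # canonicalize: keep the first PREFIX_LEN bits, zero out the host bits
    network = addr & ((0xFFFFFFFF << (32 - PREFIX_LEN)) & 0xFFFFFFFF)
    return network in blocked

print(is_blocked("5.1.2.3"))   # True
print(is_blocked("6.1.2.3"))   # False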

Data Structure, independent of volume of data in it

Is there any data structure in which locating an item is independent of the volume of data in it?
"locating a data is independent of volume of data in it" - I assume this means O(1) for get operations. That would be a hash map.
This presumes that you fetch the object based on the hash.
If you have to check each element to see whether an attribute matches a particular value, like your rson or ern or any other part of the record, then you have to make that value the key up front.
If there are several values you need to search on, you can create several maps, one for each value; that lets you look up by more than one attribute. But each key has to be unique, immutable, and known up front.
If you don't establish the key up front, it's O(N), which means you have to check every element in turn until you find what you want. On average, that time grows with the size of the collection; that's what O(N) means.
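A small illustration of the "one map per lookup attribute" idea; the records and the field names rson and ern are hypothetical, just echoing the attributes mentioned above:

# Hypothetical records, keyed two different ways.
records = [
    {"rson": "A-17", "ern": 1001, "name": "first"},
    {"rson": "B-42", "ern": 1002, "name": "second"},
]

# Build one map per attribute you need to look up by; each key must be unique.
by_rson = {r["rson"]: r for r in records}
by_ern = {r["ern"]: r for r in records}

print(by_rson["B-42"]["name"])   # O(1) average lookup: "second"
print(by_ern[1001]["name"])      # "first"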

Suitable data structure for finding a person's phone number, given their name?

Suppose you want to write a program that implements a simple phone book. Given a particular name, you want to be able to retrieve that person's phone number as quickly as possible. What data structure would you use to store the phone book, and why?
The text below answers your question.
In computer science, a hash table or hash map is a data structure that
uses a hash function to map identifying values, known as keys (e.g., a
person's name), to their associated values (e.g., their telephone
number). Thus, a hash table implements an associative array. The hash
function is used to transform the key into the index (the hash) of an
array element (the slot or bucket) where the corresponding value is to
be sought.
The text is from the Wikipedia article on hash tables.
There are further discussions there of collisions, hash functions, and so on; check the wiki page for details.
I respect & love hashtables :) but even a balanced binary tree would be fine for your phone book application, giving you logarithmic complexity in the worst case and sparing you from worrying about good hash functions, collisions, etc., which matter more for huge amounts of data.
When I talk about huge data, what I mean relates to storage. Every time you fill all of the buckets in a hash table, you need to allocate new storage and re-hash everything. This can be avoided if you know the size of the data ahead of time. Balanced trees don't run into these problems. The domain needs to be considered when designing data structures, too; for small devices, for example, storage matters a lot.
I was wondering why tries didn't come up in any of the answers.
A trie is well suited to phone-book-style data.
It also saves space compared to a hash table at (almost) the same retrieval cost, assuming a constant-size alphabet and constant-length names.
Tries also support the prefix matches that are sometimes required while searching.
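A minimal trie-based phone book sketch, including prefix search (the class and method names are my own):

class TrieNode:
    def __init__(self):
        self.children = {}
        self.number = None   # set when a complete name ends at this node

class PhoneTrie:
    def __init__(self):
        self.root = TrieNode()

    def add(self, name, number):
        node = self.root
        for ch in name:
            node = node.children.setdefault(ch, TrieNode())
        node.number = number

    def lookup(self, name):
        node = self.root
        for ch in name:
            node = node.children.get(ch)
            if node is None:
                return None
        return node.number

    def names_with_prefix(self, prefix):
        # walk down to the prefix node, then enumerate every name below it
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return
        stack = [(node, prefix)]
        while stack:
            node, name = stack.pop()
            if node.number is not None:
                yield name, node.number
            for ch, child in node.children.items():
                stack.append((child, name + ch))

book = PhoneTrie()
book.add("alice", "555-0100")
book.add("alina", "555-0101")
print(book.lookup("alice"))                 # 555-0100
print(list(book.names_with_prefix("ali")))  # both entries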
A dictionary is both dynamic and fast.
You want a dictionary where you use the name as the key and the number as the stored value. Check this out: http://en.wikipedia.org/wiki/Dictionary_%28data_structure%29
Why not use a singly linked list? Each node holds the name, the number, and the link information.
One drawback is that a search might take some time, since you may have to traverse the entire list link by link. You could keep the list ordered at insertion time!
PS: To make the search a tad faster, maintain a link to the middle of the list. The search can then continue to the left or right of the list based on the value of the "name" field at that node. Note that this requires a doubly linked list.
