Module for counting unique visitors - algorithm

I got this question in a job interview:
Let's assume you are given the following task: write a module that receives an infinite stream of IP addresses of site visitors.
At any moment in time, the module should be able to answer quickly how many unique visitors have been seen so far (uniqueness is determined by IP address). How would you solve this (in detail) under each of these conditions:
a) an exact count of unique visitors is required
b) an approximate value with a small error of no more than 3-4% is acceptable
What solutions do you see here? I've found several whitepapers about streaming algorithms, but I don't know whether they are applicable in this case or not:
http://www.cs.berkeley.edu/~satishr/cs270/sp11/rough-notes/Streaming.pdf
http://en.wikipedia.org/wiki/Count-distinct_problem

If you only had to deal with 32-bit IPv4 addresses, you could use the simple solution (proposed by #Stephen C) of a bit vector of 2^32 bits (half a gigabyte). With that, you can maintain a precise count of unique addresses.
But these days it is necessary to consider 128-bit IPv6 addresses, which is far too large an address space to cover with a bit vector. If you only need an approximate count, though, you can use a Bloom filter, which requires k bits per entry, for some small value of k related to the rate of false positives you are prepared to accept. A false positive causes a unique IP address to go uncounted, so the proportion of false positives is roughly the expected inaccuracy of the count.
As the linked Wikipedia page mentions, using 10 bits per entry is expected to keep the false positive rate below one percent; with 8 GB of memory, you could maintain a Bloom filter with about 6.8 billion entries.
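To make the Bloom-filter idea concrete, here is a rough Python sketch (not from the answer above; the bits-per-entry figure, the MD5-based double hashing and the class name are just illustrative choices):

import hashlib

class ApproxUniqueCounter:
    """Approximate distinct-IP counter backed by a Bloom filter (sketch)."""
    def __init__(self, expected_entries, bits_per_entry=10, num_hashes=7):
        self.num_bits = expected_entries * bits_per_entry
        self.num_hashes = num_hashes
        self.bits = bytearray(self.num_bits // 8 + 1)
        self.count = 0

    def _positions(self, ip):
        # Derive k bit positions from two halves of a single MD5 digest
        # (Kirsch-Mitzenmacher double hashing).
        digest = hashlib.md5(ip.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:], "big")
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, ip):
        new_bit_set = False
        for pos in self._positions(ip):
            byte, bit = divmod(pos, 8)
            if not self.bits[byte] & (1 << bit):
                self.bits[byte] |= 1 << bit
                new_bit_set = True
        # If every bit was already set, the IP is treated as already seen;
        # that is exactly where the small undercount comes from.
        if new_bit_set:
            self.count += 1
        return self.count

counter = ApproxUniqueCounter(expected_entries=1_000_000)
for ip in ["10.0.0.1", "10.0.0.2", "10.0.0.1", "2001:db8::1"]:
    counter.add(ip)
print(counter.count)                 # -> 3 (up to the false-positive rate)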

The solutions you found are definitely applicable.
For (a) I would keep a counter of total unique IPs and create a hash set in which the key is the IP address; you need to store every single IP address you have seen.
That way, whenever you receive an IP, you check whether it is already in the hash set, and if it is not, you store it there and increase the counter by one.
For (b), on the other hand, I would apply a hash function to the IPs themselves to compact them further and then insert them into a smaller or more efficient hash table. This way a probability of collision exists, but you also gain some performance.
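For completeness, a minimal Python sketch of the exact approach described in (a); the set doubles as the membership check, so the counter is just its size (names are illustrative):

class ExactUniqueCounter:
    """Exact distinct-IP counter: memory grows with the number of unique IPs."""
    def __init__(self):
        self.seen = set()

    def add(self, ip):
        self.seen.add(ip)            # no-op if the IP was already present
        return len(self.seen)        # current number of unique visitors

counter = ExactUniqueCounter()
for ip in ["10.0.0.1", "10.0.0.2", "10.0.0.1"]:
    counter.add(ip)
print(len(counter.seen))             # -> 2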

There are 2^32 unique IPv4 addresses.
So implement an array of 2^32 booleans whose indexes correspond to the IP addresses. Each time you get a visit:
ip_index = convert_ip_to_32bit_integer(ip)
if not seen[ip_index]:
    seen[ip_index] = True
    nos_unique_visitors += 1
This requires 2^29 bytes of memory (i.e. 0.5 GB), assuming that you pack the booleans 8 per byte.
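To show the 8-booleans-per-byte packing concretely, here is a hedged Python sketch using a bytearray as the bit vector (the helper names are made up for illustration):

import socket
import struct

seen = bytearray(2**29)              # 2^32 bits packed 8 per byte = 512 MB
nos_unique_visitors = 0

def convert_ip_to_32bit_integer(ip):
    # "1.2.3.4" -> 0x01020304 as an unsigned 32-bit integer
    return struct.unpack("!I", socket.inet_aton(ip))[0]

def record_visit(ip):
    global nos_unique_visitors
    index = convert_ip_to_32bit_integer(ip)
    byte, bit = divmod(index, 8)
    if not seen[byte] & (1 << bit):
        seen[byte] |= 1 << bit
        nos_unique_visitors += 1

record_visit("192.168.0.1")
record_visit("192.168.0.1")
print(nos_unique_visitors)           # -> 1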

Assuming there are no IPv6 addresses, an IPv4 address (e.g. 255.255.255.255) is encoded in 4 bytes, which gives us 32 bits.
You could use a binary tree with 32 levels to store the IP addresses; it lets you check whether an IP already exists in the tree and insert new ones quickly and easily.
The number of operations to find an IP will then be roughly 32*2.
You could instead use a trie with 8 levels, each level storing 4 bits; the maximum number of operations to find an IP is then about 8*16.
This is cheaper than allocating memory for a full array, and a trie can also be used for IPv6 at reasonable cost.
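Here is a hedged Python sketch of such a trie, keyed on 4-bit nibbles (8 levels for IPv4); the dict-per-node layout and the ip_to_int helper are illustrative choices, not the only way to do it:

import socket
import struct

def ip_to_int(ip):
    return struct.unpack("!I", socket.inet_aton(ip))[0]

class NibbleTrie:
    """8-level trie over IPv4 addresses, 4 bits (one nibble) per level."""
    def __init__(self):
        self.root = {}
        self.unique = 0

    def insert(self, ip):
        value = ip_to_int(ip)
        node = self.root
        new = False
        for shift in range(28, -1, -4):          # 8 nibbles, most significant first
            nibble = (value >> shift) & 0xF
            if nibble not in node:
                node[nibble] = {}
                new = True
            node = node[nibble]
        if new:                                  # full path existed -> already seen
            self.unique += 1
        return self.unique

trie = NibbleTrie()
for ip in ["10.0.0.1", "10.0.0.2", "10.0.0.1"]:
    trie.insert(ip)
print(trie.unique)                               # -> 2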

Related

What is the most efficient way to match the IP addresses to huge route entries?

Imagine there is a firewall, and the system administrator has blocked many subnets, perhaps all the subnets of a specific country.
For example:
192.168.2.0 / 255.255.255.0
223.201.0.0 / 255.255.0.0
223.202.0.0 / 255.254.0.0
223.208.0.0 / 255.252.0.0
....
To determine whether an IP address has been blocked, the firewall may use the algorithm below.
func blocked(ip)
    foreach subnet in blocked_subnets
        if in_subnet(subnet, ip)
            return true
    return false
But this algorithm takes too much time to run; its time complexity is O(n). If the route table contains too many entries, the network becomes almost unusable.
Is there a more efficient way to match IP addresses against huge route tables? I guess it is based on some kind of tree/graph (a trie?). I have read a bit about longest prefix match and tries but didn't get the point.
All you really need is a trie with four levels. Each non-leaf node contains an array of up to 256 child nodes. Each node also contains a subnet mask. So, given your example:
192.168.2.0 / 255.255.255.0
223.201.0.0 / 255.255.0.0
223.202.0.0 / 255.254.0.0
223.208.0.0 / 255.252.0.0
Your tree would look something like the one below. The two numbers for each node are the IP segment followed by the subnet mask octet.
root
├── 192,255
│   └── 168,255
│       └── 2,255
└── 223,255
    ├── 201,255
    ├── 202,255
    └── 208,255
When you get an IP address, you break it into segments. You search for the first segment at the root level. For speed, you'll probably want to use an array at the root level so that you can do a direct lookup.
Say the first segment of the IP address is 223. You'd grab the node from root[223], and now you're working with just that one subtree. You probably don't want a full array at the other levels, unless your data is really dense. A dictionary of some kind for the subsequent levels is probably what you'll want. If the next segment is 201, you look up 201 in the dictionary for the 223 node, and now your possible list of candidates is just 64K items (i.e. all IP addresses that are 223.201.x.x). You can do the same thing with the other two levels. The result is that you can resolve an IP address in just four lookups: one lookup in an array, and three dictionary lookups.
This structure is also very easy to maintain. Inserting a new address or range requires at most four lookups and adds. Same with deleting. Updates can be done in-place, without having to rebuild the entire tree. You just have to make sure that you're not trying to read while you're updating, and you're not trying to do concurrent updates. But any number of readers can be accessing the thing concurrently.
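A hedged Python sketch of this four-level structure, with a prefix marker on the node where a blocked subnet ends. It only handles prefixes that end on octet boundaries (/8, /16, /24, /32); masks such as 255.254.0.0 would need the per-node mask check described above, or expansion into several octet-aligned entries:

class OctetTrie:
    """Four-level trie over IPv4 octets (sketch)."""

    def __init__(self):
        # The root could be a 256-slot array for a direct lookup;
        # a dict keeps the sketch short.
        self.root = {}

    def add(self, cidr):
        addr, prefix_len = cidr.split("/")
        octets = [int(o) for o in addr.split(".")]
        depth = int(prefix_len) // 8          # octets covered by the prefix
        node = self.root
        for octet in octets[:depth]:
            node = node.setdefault(octet, {})
        node["blocked"] = True                # the subnet ends at this node

    def blocked(self, ip):
        node = self.root
        for octet in (int(o) for o in ip.split(".")):
            if node.get("blocked"):
                return True
            node = node.get(octet)
            if node is None:
                return False
        return bool(node.get("blocked"))

fw = OctetTrie()
fw.add("192.168.2.0/24")
fw.add("223.201.0.0/16")
print(fw.blocked("192.168.2.77"))   # -> True  (at most four lookups)
print(fw.blocked("8.8.8.8"))        # -> False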
A hash map or a plain per-octet trie makes it hard to deal with CIDR ranges, because the mask does not necessarily fall on an 8-bit boundary (e.g. 192.168.1.0/28).
An efficient way of doing this is binary search, given that all these IP ranges don't overlap:
Convert the range A.B.C.D/X into a 32-bit integer representing the starting IP address, as well as an integer of how many IPs in this range. For example, 192.168.1.0/24 converts to 3232235776, 256.
Add these ranges in a list or array, and sort by the starting IP address number.
To match an incoming IP address against the list, do a binary search for the last starting address that is not greater than it, and check whether the address falls inside that range.
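A hedged Python sketch of this binary-search approach using the standard bisect module (non-overlapping ranges are assumed, as stated above; the example rule list is made up):

import bisect
import socket
import struct

def ip_to_int(ip):
    return struct.unpack("!I", socket.inet_aton(ip))[0]

def cidr_to_range(cidr):
    addr, prefix = cidr.split("/")
    start = ip_to_int(addr)
    size = 1 << (32 - int(prefix))
    return start, size                       # e.g. 192.168.1.0/24 -> (3232235776, 256)

# Non-overlapping ranges, sorted by starting address.
ranges = sorted(cidr_to_range(c) for c in
                ["192.168.2.0/24", "223.201.0.0/16", "223.202.0.0/15"])
starts = [start for start, _ in ranges]

def blocked(ip):
    value = ip_to_int(ip)
    i = bisect.bisect_right(starts, value) - 1   # last range starting <= value
    if i < 0:
        return False
    start, size = ranges[i]
    return value < start + size

print(blocked("223.201.7.7"))   # -> True
print(blocked("8.8.8.8"))       # -> False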
Use a red-black or AVL tree to store the blocked IPs of the separate subnets. Since an IP is basically a set of 4 numbers, you can use a customized comparator in your programming language of choice and store the IPs in a red-black or AVL tree.
Comparator:
Compare the parts of the two IPs in order (4 parts for IPv4, more for IPv6) and decide which is greater or less using the first part that differs.
Example:
10.0.1.1 and 10.0.0.1
Here ip1 > ip2 because the third part, the first one that differs, is greater in ip1.
Time complexity:
As a red-black tree is a balanced BST, insertion, deletion and search each take O(log n). With k subnets, searching for an IP across all of them costs O(k * log(n)) in total.
Optimization: if the number of subnets is large, use a single red-black tree with a composite key and the same kind of comparison as above:
Key = (subnet_no, ip)
Comparing these keys in the same way gives O(log S), where S is the total number of IP entries across all subnets.
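A hedged Python illustration of the comparator idea: tuples already compare part by part, so the first differing octet decides the ordering. A sorted list with bisect stands in for the balanced tree here (note that inserting into a plain list is O(n), unlike a real red-black/AVL tree):

import bisect

def ip_key(ip):
    # "10.0.1.1" -> (10, 0, 1, 1); tuples compare part by part,
    # so the first octet that differs decides the ordering.
    return tuple(int(part) for part in ip.split("."))

assert ip_key("10.0.1.1") > ip_key("10.0.0.1")

# Standing in for a red-black/AVL tree: a sorted list with O(log n) search.
blocked = sorted(ip_key(ip) for ip in ["10.0.0.1", "10.0.1.1", "192.168.2.5"])

def is_blocked(ip):
    i = bisect.bisect_left(blocked, ip_key(ip))
    return i < len(blocked) and blocked[i] == ip_key(ip)

print(is_blocked("10.0.1.1"))   # -> True
print(is_blocked("10.0.1.2"))   # -> False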
This may be a simple one, but as no one said anything about memory constraints, you may use a look-up table. Having a 2^32-entry LUT is not impossible even in practice, and then the problem is reduced to a single table lookup regardless of the rules. (The same can be used for routing as well.) If you want it fast, it takes 2^32 octets (4 GiB); if you can spend a bit more time, a bitwise table takes 2^32 bits, i.e. 512 MiB. Even in that case it can be made fast, but then using high-level programming languages may produce suboptimal results.
Of course, the question of "fast" is always a bit tricky. Do you want it fast in practice or in theory? If in practice, on which platform? Even the LUT method may be slow if your system swaps the table out to disk, and depending on the cache structure the more complicated methods may be faster even than RAM-based LUTs, because they fit into the processor cache. A cache miss may cost several hundred CPU cycles, and rather complicated operations can be done in that time.
The problem with the LUT approach (in addition to the memory use) is the cost of rule deletion. Since the table is the bitwise OR of all rules, there is no simple way to remove a rule: you must determine where the rule to be deleted does not overlap any other rule, and then zero out those areas. This is probably best done bit by bit with the structures outlined in the other answers.
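A hedged sketch of the bitwise-LUT variant in Python: each rule's range is OR-ed into a 2^32-bit table once, and every query is a single bit test. Building the table this naively (one bit at a time) is slow for large rules; the helpers and sizes are illustrative:

import socket
import struct

def ip_to_int(ip):
    return struct.unpack("!I", socket.inet_aton(ip))[0]

table = bytearray(2**29)                     # 2^32 bits = 512 MiB, all zero

def add_rule(cidr):
    addr, prefix = cidr.split("/")
    start = ip_to_int(addr)
    size = 1 << (32 - int(prefix))
    for value in range(start, start + size): # OR the whole range into the table
        byte, bit = divmod(value, 8)
        table[byte] |= 1 << bit

def blocked(ip):
    value = ip_to_int(ip)
    byte, bit = divmod(value, 8)
    return bool(table[byte] & (1 << bit))    # one lookup, regardless of rule count

add_rule("192.168.2.0/24")
print(blocked("192.168.2.10"))               # -> True
print(blocked("192.168.3.10"))               # -> False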
Recall that an IP address is basically a 32-bit number.
You can canonicalize each subnet to its normal form and store all the normal forms in a hash table.
At run time, canonicalize the given address (easy to do) and check whether the hash table contains that entry: if it does, block; otherwise, permit.
For example, say you want to block the subnet 5.*.*.*; this is the network with the leading bits 00000101. So add the address 5.0.0.0, i.e. 00000101 - 00000000 - 00000000 - 00000000, to your hash table.
Once a specific address arrives, for example 5.1.2.3, canonicalize it back to 5.0.0.0 and check whether it is in the table.
The query time is O(1) on average using a hash table.
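A minimal Python sketch of this idea, with one added assumption: we also remember which prefix lengths are in use, so an incoming address can be canonicalized (masked) each candidate way, which is the "easy to do" step above:

import socket
import struct

def ip_to_int(ip):
    return struct.unpack("!I", socket.inet_aton(ip))[0]

blocked_networks = set()
prefix_lengths = set()

def add_subnet(cidr):
    addr, prefix = cidr.split("/")
    length = int(prefix)
    mask = (0xFFFFFFFF << (32 - length)) & 0xFFFFFFFF
    blocked_networks.add((ip_to_int(addr) & mask, length))   # canonical form
    prefix_lengths.add(length)

def blocked(ip):
    value = ip_to_int(ip)
    for length in prefix_lengths:                # one O(1) probe per mask in use
        mask = (0xFFFFFFFF << (32 - length)) & 0xFFFFFFFF
        if (value & mask, length) in blocked_networks:
            return True
    return False

add_subnet("5.0.0.0/8")
add_subnet("192.168.2.0/24")
print(blocked("5.1.2.3"))       # -> True
print(blocked("6.1.2.3"))       # -> False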

How wide do random numbers need to be so that it is virtually impossible to ever repeat one?

A certain system is supposed to spawn objects with unique IDs. The system will run on different computers with no connection between them; yet no ID collision may ever happen. The only way to implement this is by generating random numbers. How wide should those numbers be so that you can consider a collision virtually impossible?
This is basically a generalization of the birthday problem.
This probability table can help you figure out how many bits you will need in order to achieve the probability you desire, based on p, the desired collision probability, and the number of elements that are going to be "hashed" (generated).
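As a hedged illustration, the usual birthday approximation p ≈ 1 - exp(-n(n-1)/2^(b+1)) can be inverted numerically to estimate the required width; here is a small Python helper along those lines (the example numbers are arbitrary):

import math

def collision_probability(n_ids, bits):
    """Birthday approximation: chance of at least one collision
    when n_ids values are drawn uniformly from 2**bits possibilities."""
    return 1.0 - math.exp(-n_ids * (n_ids - 1) / 2.0 / 2.0**bits)

def bits_needed(n_ids, max_probability):
    bits = 1
    while collision_probability(n_ids, bits) > max_probability:
        bits += 1
    return bits

# One billion random IDs with at most a one-in-a-million collision chance:
print(bits_needed(10**9, 1e-6))      # -> 79 bits (roughly 80)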
In your question you mentioned:
The only way to implement this is generating random numbers
No, this is NOT the only way to do this. In fact this is one of the ways NOT to do it.
There is already a well-known and widely used method for doing something like this, one you yourself are using right now: adding a prefix (or postfix, it doesn't matter). The prefix is called many things by many systems: Ethernet and WiFi call it the vendor ID. In TCP/IP it's called a subnet (technically, a "network").
The idea is simple. Say for example you want to use a 32 bit number for your global id. Reserve something like 8 bits to identify which system it's on and the rest can simply be sequential numbers within each system.
Stealing syntax from IPv4 for a moment: say system 1 has an ID of 1 and system 2 has an ID of 2. Then IDs from system 1 will be in the range 1.0.0.0 - 1.255.255.255 and IDs from system 2 will be in the range 2.0.0.0 - 2.255.255.255.
That's just an example. Nothing forces you to spend so many bits on the system ID. In fact, IPv4 itself is no longer organized on byte boundaries. You could instead use 4 bits as the system ID and 28 bits for individual IDs. You can also use 64 bits if you need more IDs, or go the IPv6 route and use 128 bits (in which case you can definitely afford to spend a byte or two on the system ID).
Because each system cannot generate an id that's generated by another system no collision will ever occur before the ids overflow.
If you need the IDs to look "random", apply a hashing algorithm on top of this scheme. CRC32 is actually a bijection on 32-bit inputs, so CRC32 values of distinct 32-bit IDs never collide, and CRC64 works the same way for 64-bit IDs. A cryptographic hash such as SHA-1 (160 bits) is not mathematically guaranteed to be collision-free, but no collisions between short, distinct inputs are known, so in practice the SHA-1 of IDs smaller than 160 bits will not collide. The caveat is that you must keep all 160 bits: truncating the hash reintroduces the possibility of collisions.
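A hedged Python sketch of the prefix-plus-counter scheme, with an optional CRC32 pass to make the IDs look random; the bit widths and names are illustrative choices:

import zlib

class IdGenerator:
    """Prefix-plus-counter ID scheme (sketch): the top bits identify the
    system, the rest is a per-system sequential counter."""
    def __init__(self, system_id, system_bits=8, total_bits=32):
        self.system_id = system_id
        self.counter_bits = total_bits - system_bits
        self.counter = 0

    def next_id(self):
        raw = (self.system_id << self.counter_bits) | self.counter
        self.counter += 1                      # caller must handle overflow
        return raw

    def next_random_looking_id(self):
        # CRC32 is a bijection on 32-bit inputs, so distinct raw IDs
        # map to distinct "random-looking" IDs.
        return zlib.crc32(self.next_id().to_bytes(4, "big"))

gen_a, gen_b = IdGenerator(system_id=1), IdGenerator(system_id=2)
print(gen_a.next_id(), gen_b.next_id())   # -> 16777216 33554432 (ranges never overlap)
print(gen_a.next_random_looking_id())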
GUIDs use 128 bits, and the likelihood of collision is negligible.

How to efficiently hash the ip-address

This is an interview question. I thought about solutions like multiway hashing but could not find anything elegant. Please suggest a good method.
Question:
You have 10 million IP addresses. (IPv4 4 byte addresses). Create a hash function for these IP addresses.
Hint: Using the IP's themselves as a key is a bad idea because there will be a lot of wasted space
Interesting that such an interesting question did not get any interesting answer (sorry for the tautology).
If you see it as a theoretical matter, then this link is what you need (there is even a superfast hash function written for you and ready to go):
http://www.kfki.hu/~kadlec/sw/netfilter/ct3/
Practical matters may be different. If your hash table is of reasonable size, you will have to handle collisions anyway (with linked lists). So ask yourself what use case will actually occur. If your code will run within some secluded ecosystem, and the IP address is a-b-c-d, then c and d are the most volatile numbers and d won't be zero (assuming you don't handle networks), so a hash table of 64K buckets with cd as the hash may well be satisfactory.
Another use case is TCP connection tracking, where the client uses an ephemeral port assigned randomly by the kernel (isn't that ideal for hashing?). The problem is the limited range, something like 32768-61000, which makes the least significant byte more random than the most significant byte. So you can XOR the most significant byte of the port with the most volatile byte of the IP address that can be zero (c), and use the result as the hash into your 64K table.
Because your input is random and the size of the table is smaller than the address space, any hash function that you design will have its own pathological data set which makes it look bad. I think the interviewer wants to know your knowledge of existing hash functions that are used as standards.
A few such hash functions are:
MD5
SHA-1, SHA-2
These functions work better than other hash functions because their pathological data sets are difficult to find without brute force. So if you have something as good as these, do not tell your interviewer (you can get a patent on it and get a job at Google).
For hashing IP addresses, apply MD5 or SHA to the address, truncate the result to the size of the table, and you are done.
Note: the table size should be prime to help prevent bad hashing.
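A hedged Python sketch of that last suggestion: hash the packed address and reduce the digest to a bucket index (the table size here is an arbitrary prime, chosen only for illustration):

import hashlib
import socket

TABLE_SIZE = 65521                          # an arbitrary prime near 64K

def ip_bucket(ip):
    packed = socket.inet_aton(ip)           # 4-byte packed IPv4 address
    digest = hashlib.md5(packed).digest()
    # "Truncate" the digest: take part of it as an integer, reduce mod table size.
    return int.from_bytes(digest[:8], "big") % TABLE_SIZE

print(ip_bucket("192.168.2.10"))
print(ip_bucket("10.0.0.1"))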
I was asked the same question before.
To solve this, you should partition your data.
We know IP addresses are contiguous, so for example:
table1: from 0.0.0.0 to 0.0.0.127 (they are all in New York, town 1)
table2: from 0.0.0.128 to 0.0.0.255 (they are all in New York, town 2)
....
Then create a map like this:
0.0.0.0~0.0.0.127 -> address1
0.0.0.128~0.0.0.255 -> address2
......
Then, to get the address for an IP, just look up its range in the map.
Note: all the data is in a database; I don't think it costs a lot of space. To get the address within a second, you trade some space to optimize for speed.

Find the count of a particular number in an infinite stream of numbers at a particular moment

I faced this problem in a recent interview:
You have a stream of incoming numbers in range 0 to 60000 and you have a function which will take a number from that range and return the count of occurrence of that number till that moment. Give a suitable Data structure/algorithm to implement this system.
My solution is:
Make an array of size 60001 whose entries point to bit vectors. These bit vectors hold the counts of the incoming numbers, and each incoming number is used as the index into the array for its counter. A bit vector grows dynamically when its count gets too big to fit.
So, if the numbers come in at a rate of 100 numbers/sec, then in 1 million years the total count will be 100*3600*24*365*1000000 = 3.2*10^15. In the worst case, where every number in the stream is the same, this takes ceil(log2(3.2*10^15)) = 52 bits; if the numbers are uniformly distributed, we get (3.2*10^15) / 60001 = 5.33*10^10 occurrences of each number, which requires 36 bits per counter.
So, assuming 4-byte pointers, we need (60001 * 4)/1024 = 234 KB of memory for the array. In the all-same-number case the single bit vector needs 52/8 = 6.5 bytes, so the total is still around 234 KB; in the uniform case we need (60001 * 36 / 8)/1024 = 263.7 KB for the bit vectors, for a total of about 500 KB. So it is very much feasible to do this on an ordinary PC.
But the interviewer said that, as the stream is infinite, the counters will eventually overflow, and gave hints like: how could we do this if there were many PCs and we could pass messages between them, or if we could use the file system? But I kept thinking that if this solution did not work, the others would not either. Needless to say, I did not get the job.
How to do this problem with less memory? Can you think of an alternative approach (using network of PCs may be)?
A formal model for the problem could be the following.
We want to know whether there exists a constant-space-bounded Turing machine such that, at any given time, it recognizes the language L of all pairs (number, number of occurrences so far). This means that all correct pairs will be accepted and all incorrect pairs rejected.
As a corollary of Theorem 3.13 in Hopcroft-Ullman, we know that every language recognized by a constant-space-bounded machine is regular.
It can be proven, using the pumping lemma for regular languages, that the language described above is not regular. So you can't recognize it with a constant-space-bounded machine.
You can simply use index-based counting with an array such as int arr[60001]. Whenever you get a number, say 5000, access arr[5000] directly and increment it; whenever you want to know how many times a particular number has occurred, just read arr[num] and you have the count for that number. It is the simplest possible implementation, with constant time per operation.
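A hedged Python version of that direct-indexing idea (Python integers don't overflow, which only sidesteps the interviewer's objection on a single machine; names are illustrative):

RANGE_MAX = 60000
counts = [0] * (RANGE_MAX + 1)       # one counter per possible value 0..60000

def observe(number):
    counts[number] += 1              # O(1) per incoming number

def occurrences(number):
    return counts[number]            # O(1) query at any moment

for n in [5000, 17, 5000, 5000]:
    observe(n)
print(occurrences(5000))             # -> 3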
Isn't this External Sorting? Store the infinite stream in a file. Do a seek() (RandomAccessFile.seek() in Java) in the file and get to the appropriate timestamp. This is similar to Binary Search since the data is sorted by timestamps. Once you get to the appropriate timestamp, the problem turns into counting a particular number from an infinite set of numbers. Here, instead of doing a quick sort in memory, Counting sort can be done since the range of numbers is limited.

What makes table lookups so cheap?

A while back, I learned a little bit about big O notation and the efficiency of different algorithms.
For example, looping through each item in an array to do something with it
foreach(item in array)
    doSomethingWith(item)
is an O(n) algorithm, because the number of cycles the program performs is directly proportional to the size of the array.
What amazed me, though, was that table lookup is O(1). That is, looking up a key in a hash table or dictionary
value = hashTable[key]
takes the same number of cycles regardless of whether the table has one key, ten keys, a hundred keys, or a gigabrajillion keys.
This is really cool, and I'm very happy that it's true, but it's unintuitive to me and I don't understand why it's true.
I can understand the first O(n) algorithm, because I can compare it to a real-life example: if I have sheets of paper that I want to stamp, I can go through each paper one-by-one and stamp each one. It makes a lot of sense to me that if I have 2,000 sheets of paper, it will take twice as long to stamp using this method than it would if I had 1,000 sheets of paper.
But I can't understand why table lookup is O(1). I'm thinking that if I have a dictionary, and I want to find the definition of polymorphism, it will take me O(logn) time to find it: I'll open some page in the dictionary and see if it's alphabetically before or after polymorphism. If, say, it was after the P section, I can eliminate all the contents of the dictionary after the page I opened and repeat the process with the remainder of the dictionary until I find the word polymorphism.
This is not an O(1) process: it will usually take me longer to find words in a thousand page dictionary than in a two page dictionary. I'm having a hard time imagining a process that takes the same amount of time regardless of the size of the dictionary.
tl;dr: Can you explain to me how it's possible to do a table lookup with O(1) complexity?
(If you show me how to replicate the amazing O(1) lookup algorithm, I'm definitely going to get a big fat dictionary so I can show off to all of my friends my ninja-dictionary-looking-up skills)
EDIT: Most of the answers seem to be contingent on this assumption:
You have the ability to access any page of a dictionary given its page number in constant time
If this is true, it's easy for me to see. But I don't know why this underlying assumption is true: I would use the same process to look up a page by its number as I would to look up a word.
Same thing with memory addresses, what algorithm is used to load a memory address? What makes it so cheap to find a piece of memory from an address? In other words, why is memory access O(1)?
You should read the Wikipedia article.
But the essence is that you first apply a hash function to your key, which converts it to an integer index (this is O(1)). This is then used to index into an array, which is also O(1). If the hash function has been well designed, there should only be one (or a few) items stored at each location in the array, so the lookup is complete.
So in massively-simplified pseudocode:
ValueType array[ARRAY_SIZE];

void insert(KeyType k, ValueType v)
{
    int index = hash(k);
    array[index] = v;
}

ValueType lookup(KeyType k)
{
    int index = hash(k);
    return array[index];
}
Obviously, this doesn't handle collisions, but you can read the article to learn how that's handled.
Update
To address the edited question, indexing into an array is O(1) because underneath the hood, the CPU is doing this:
ADD index, array_base_address -> pointer
LOAD pointer -> some_cpu_register
where LOAD loads data stored in memory at the specified address.
Update 2
And the reason a load from memory is O(1) is really just that this is an axiom we usually adopt when we talk about computational complexity (see http://en.wikipedia.org/wiki/RAM_model). If we ignore cache hierarchies and data-access patterns, this is a reasonable assumption. As we scale the size of the machine, it may not hold (a machine with 100TB of storage may not take the same amount of time as a machine with 100kB). But usually we assume that the storage capacity of our machine is constant, and much, much bigger than any problem size we're likely to look at. So for all intents and purposes, it's a constant-time operation.
I'll address the question from a different perspective from everyone else. Hopefully this will shed light on why accessing x[45] and accessing x[5454563] take the same amount of time.
A RAM is laid out in a grid (i.e. rows and columns) of capacitors. A RAM can address a particular cell of memory by activating a particular column and row on the grid, so let's say if you have a 16-byte capacity RAM, laid out in a 4x4 grid (insanely small for modern computer, but sufficient for illustrative purpose), and you're trying to access the memory address 13 (1101), you first split the address into rows and column, i.e row 3 (11) column 1 (01).
Let's suppose a 0 means taking the left intersection and a 1 means taking a right intersection. So when you want to activate row 3, you send an army of electrons in the row starting gate, the row-army electrons went right, right to reach row 3 activation gate; next you send another army of electrons on the column starting gate, the column-army electrons went left then right to reach the 1st column activation gate. A memory cell can only be read/written if the row and column are both activated, so this would allow the marked cell to be read/written.
The effect of all this gibberish is that the access time of a memory address depends on the address length, and not the particular memory address itself; if an architecture uses a 32-bit address space (i.e. 32 intersections), then addressing memory address 45 and addressing memory address 5454563 both will still have to pass through all 32 intersections (actually 16 intersections for the row electrons and 16 intersections for the columns electrons).
Note that in reality memory addressing takes very little amount of time compared to charging and discharging the capacitors, therefore even if we start having a 512-bit length address space (enough for ~1.4*10^130 yottabyte of RAM, i.e. enough to keep everything under the sun in your RAM), which mean the electrons would have to go through 512 intersections, it wouldn't really add that much time to the actual memory access time.
Note that this is a gross oversimplification of modern RAM. In modern DRAM, if you want to access subsequent memory addresses you only change the column and do not spend time changing the row, so accessing subsequent memory is much faster than accessing totally random addresses. Also, this description completely ignores the effect of the CPU cache (although the CPU cache also uses a similar grid addressing scheme; however, since it is built from much faster transistor-based cells, the relative cost of a large cache address space becomes very significant). Still, the point holds that if you're jumping around memory, accessing any one address will take the same amount of time.
You're right, it's surprisingly difficult to find a real-world example of this. The idea of course is that you're looking for something by address and not value.
The dictionary example fails because you don't immediately know the location of page say 278. You still have to look that up the same as you would a word because the page locations are not in your memory.
But say I marked a number on each of your fingers and then told you to wiggle the one with 15 written on it. You'd have to look at each of them (assuming they're unsorted), and if it's not 15 you check the next one. O(n).
If I told you to wiggle your right pinky, you don't have to look anything up. You know where it is because I just told you where it is. The value I just passed to you is its address in your "memory."
It's kind of like that with databases, but on a much larger scale than just 10 fingers.
Because work is done up front: the value is put in a bucket that is easily accessible given the hash code of the key. It would be like looking up a word in the dictionary when you had already marked the exact page the word was on.
Imagine you had a dictionary where everything starting with letter A was on page 1, letter B on page 2...etc. So if you wanted to look up "balloon" you would know exactly what page to go to. This is the concept behind O(1) lookups.
Arbitrary data input => maps to a specific memory address
The trade-off of course being you need more memory to allocate for all the potential addresses, many of which may never be used.
If you have an array with 999999999 locations, how long does it take to find a record by social security number?
Assuming you don't have that much memory, allocate about 30% more array locations than the number of records you intend to store, and then write a hash function to look them up instead.
A very simple (and probably bad) hash function would be social % numElementsInArray.
The problem is collisions: you can't guarantee that every location holds only one element. But that's ok; instead of storing the record directly at the array location, you can store a linked list of records there. Then, once you have hashed to the right array location, you scan that list linearly for the element you want.
Worst case this is O(n): everything goes into the same bucket. The average case is O(1), because if you allocate enough buckets and your hash function is good, records generally don't collide very often.
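A hedged Python sketch of that bucket-plus-linked-list (separate chaining) idea, using small Python lists in place of linked lists; the 30% over-allocation and the modulo hash follow the description above and are illustrative, not tuned:

class ChainedHashTable:
    """Separate-chaining hash table: each bucket holds a small list of records."""
    def __init__(self, expected_records):
        self.num_buckets = int(expected_records * 1.3) + 1   # ~30% extra room
        self.buckets = [[] for _ in range(self.num_buckets)]

    def _bucket(self, ssn):
        return self.buckets[ssn % self.num_buckets]          # the simple hash

    def insert(self, ssn, record):
        bucket = self._bucket(ssn)
        for i, (key, _) in enumerate(bucket):
            if key == ssn:
                bucket[i] = (ssn, record)                     # overwrite duplicate
                return
        bucket.append((ssn, record))

    def lookup(self, ssn):
        for key, record in self._bucket(ssn):                 # short linear scan
            if key == ssn:
                return record
        return None

table = ChainedHashTable(expected_records=1000)
table.insert(123456789, "Alice")
table.insert(987654321, "Bob")
print(table.lookup(123456789))   # -> Alice  (O(1) on average)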
Ok, hash-tables in a nutshell:
You take a regular array (O(1) access), and instead of using regular int values to index into it, you use MATH.
What you do is take the key value (let's say a string), calculate a number from it (some function of the characters), and then use a well-known mathematical formula that gives you a relatively good distribution over the array's range.
So in that case you are just doing 4-5 calculations (O(1)) to get an object from that array, using a key which isn't an int.
Now, avoiding collisions and finding the right mathematical formula for a good distribution is the hard part. That is explained pretty well on Wikipedia: en.wikipedia.org/wiki/Hash_table
Lookup tables know exactly how to access the given item in the table beforehand.
That is the complete opposite of, say, finding an item by its value in a sorted array, where you have to access items to check whether each one is the one you want.
In theory, a hashtable is a series of buckets (addresses in memory) and a function that maps objects from a domain into those buckets.
Say your domain is 3-letter words: you'd block out 26^3 = 17,576 addresses for all the possible 3-letter words and create a function that maps every 3-letter word to one of those addresses, e.g., aaa=0, aab=1, etc. Now when you have a word you'd like to look up, say "and", you know immediately from your O(1) function that it is at address number 341.
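A hedged one-function Python illustration of that mapping (0-based, matching aaa=0, aab=1):

def word_index(word):
    """Map a 3-letter lowercase word onto 0..26**3-1 (aaa=0, aab=1, ...)."""
    index = 0
    for ch in word:
        index = index * 26 + (ord(ch) - ord("a"))
    return index

print(word_index("aaa"))   # -> 0
print(word_index("aab"))   # -> 1
print(word_index("and"))   # -> 341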
