Algorithm to find load distribution for NAPTR DNS queries

I have the following problem:
I have 5 servers, which I want to load balance so that the first server receives 60% of the traffic and each of the other four servers receives 10%.
I use NAPTR DNS entries to return these servers.
All 5 servers will have the same ORDER but will have different PREFERENCE values to achieve the load balance weight.
According to RFC2915:
Preference is
A 16-bit unsigned integer that specifies the order in which NAPTR
records with equal "order" values SHOULD be processed, low
numbers being processed before high numbers.
My difficulty is finding out which value the PREFERENCE field should receive for each load-balance percentage.
Does anyone know how to do the maths on this?

You are missing the rest of the quote: This is similar to
the preference field in an MX record
Which means the algorithm is quite simple: the client takes the lowest number and tries to connect based on that record's content. If it succeeds, the algorithm ends; if it fails, the client goes back to the beginning with the next lowest number, until there are no more entries.
So the values themselves are meaningless; they can be configured in any way the administrator likes. What matters is their relative value to each other, so that the client can process the set in the order the zone authority intended.
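A minimal sketch of that client behaviour, assuming the records are (order, preference, target) tuples and try_connect() is a placeholder for the actual connection attempt:

def pick_target(naptr_records):
    # process records in ascending (order, preference); the first success wins
    for order, preference, target in sorted(naptr_records):
        if try_connect(target):
            return target
    return None          # no more entries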

Related

How to generate a 4-digit validation code according to latitude & longitude information?

My application needs this feature:
User A can upload his location information and get an ADD CODE which is
generated on the server.
User B can input the ADD CODE and also has to upload his location
information. Only when user A and user B are close enough and the ADD CODE
matches can they finally become friends.
Before calculating the distance and comparing the ADD CODE, I will check
whether their city number (unique for each city) is the same. In other
words, I have to make sure that within each city an ADD CODE won't
conflict with another at the same time (or within a few minutes).
Of course, a 4-digit number won't satisfy all possibilities, but is there a way to generate this 4-digit number to satisfy this feature as much as possible?
Assuming this sequence happens within a limited period of time (and not that an Add code is valid for all-time):
Your 4-digit number doesn't need to be globally unique, it only needs to be unique within this window of time. So, with that observation, maintain a table of Add codes, when they were issued and for what location. Generate them randomly ensuring they aren't already in the table. Periodically remove any Add codes that have expired.
Provided you never have more than 10,000 users simultaneously trying to connect with each other this will work.
If you need more than that consider allowing duplicates in the table but using the lat/long to ensure that the same Add code is never allocated to any point within 2x the max distance allowed for pairing.
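A minimal sketch of that table-of-codes idea, assuming a fixed expiry window; the names and the 5-minute TTL are illustrative:

import random, time

CODE_TTL_SECONDS = 300
active_codes = {}                 # code -> (issued_at, lat, lon)

def issue_code(lat, lon):
    now = time.time()
    # periodically (here: on every call) drop expired codes
    for code, (issued, _, _) in list(active_codes.items()):
        if now - issued > CODE_TTL_SECONDS:
            del active_codes[code]
    # pick a random 4-digit code not currently in use
    # (assumes fewer than 10,000 codes are active at once)
    while True:
        code = f"{random.randint(0, 9999):04d}"
        if code not in active_codes:
            active_codes[code] = (now, lat, lon)
            return code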
Is there a way to generate this 4-digit number to satisfy this feature as much as possible?
Yes. There are probably millions of possible ways to generate a 4-digit number (where almost all of them are awful and don't satisfy most of the requirements); and if you sort them in order of how much they satisfy, then one of them must satisfy the feature as much as possible.
The real question is, how awful and unsatisfactory would "as much as possible" be?
If you assume 4 decimal digits; then you're limited to 10000 locations or 10000 unique users. That's unlikely to be enough for anything good.
If you assume 4 hexadecimal digits; then you're limited to 65536 locations or 65536 unique users. That's better but still not enough.
So, what if you used "base 1234567"? In this case a 4-digit number has 1234567^4 = 2323050529221952581345121 possible values. The surface of the Earth is about 510.1 million square kilometers, so this would be enough to encode a location extremely precisely (to well under a millimeter).

Module for unique visitors count

I got this on the job interview:
Let's assume that you got the task: to write a module whose input is an
infinite stream of IP addresses of site visitors.
At any moment in time, the module should be able to answer quickly how
many unique visitors have been collected so far (uniqueness is determined by IP
address). How would you describe a method of solving this (in detail),
on the condition that:
a) it needs to report the exact number of unique visitors
b) an approximate value with a small inaccuracy of not more than 3-4% is acceptable
What solutions do you see here? I've found several whitepapers about streaming algorithms, but I don't know whether they are applicable in this case or not:
http://www.cs.berkeley.edu/~satishr/cs270/sp11/rough-notes/Streaming.pdf
http://en.wikipedia.org/wiki/Count-distinct_problem
If you only had to deal with 32-bit IPv4 addresses, you could use the simple solution (proposed by Stephen C below) of a bit vector of 2^32 bits (half a gigabyte). With that, you can maintain a precise count of unique addresses.
But these days, it is necessary to consider 128-bit IPv6 addresses, which is far too large a namespace to be able to use a bit-vector. If you only need an approximate count, though, you can use a Bloom filter, which requires k bits per entry, for some small value of k related to the expected number of false positives you are prepared to accept. A false positive will cause a unique ip address to be uncounted, so the proportion of false positives is roughly the expected inaccuracy of a count.
As the linked Wikipedia page mentions, using 10 bits per entry is expected to keep the false positive percentage to less than one percent; with 8 GB of memory, you could maintain a Bloom filter with about 6.8 billion entries.
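A minimal Python sketch of that Bloom-filter approach for part (b); the filter size, the number of hash functions and the ApproxUniqueCounter name are illustrative assumptions rather than tuned values:

import hashlib

class ApproxUniqueCounter:
    def __init__(self, num_bits=80_000_000, num_hashes=7):
        # num_bits / expected entries roughly corresponds to "bits per entry"
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)
        self.count = 0

    def _positions(self, ip):
        # derive several bit positions from one SHA-256 digest
        digest = hashlib.sha256(ip.encode()).digest()
        for i in range(self.num_hashes):
            yield int.from_bytes(digest[i * 4:(i + 1) * 4], "big") % self.num_bits

    def add(self, ip):
        positions = list(self._positions(ip))
        already_seen = all(self.bits[p // 8] & (1 << (p % 8)) for p in positions)
        if not already_seen:
            self.count += 1          # a false positive here means an undercount
            for p in positions:
                self.bits[p // 8] |= 1 << (p % 8)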
The solutions you found are definitely applicable.
For (a) I would keep a counter of total unique IPs and create a hash table in which the key is the IP address, since you need to store every single IP address.
That way, whenever you receive an IP you check whether it is already in the hash table, and if it's not you store it there and increase the counter by one.
On the other hand, for (b) I would use a hashing function on the IPs themselves to compact them further and then insert them into a smaller or more efficient hash table. This way there is some probability of collisions, but you also gain performance.
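A minimal sketch of the exact variant (a), assuming the set of distinct addresses fits in memory; the names are illustrative:

seen_ips = set()
unique_count = 0

def process(ip):
    # count each distinct IP address exactly once
    global unique_count
    if ip not in seen_ips:
        seen_ips.add(ip)
        unique_count += 1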
There are 2^32 unique IPv4 addresses.
So implement an array of 2^32 booleans whose indexes correspond to the IP addresses. Each time you get a visit:
ip_index = convert_ip_to_32bit_integer(ip)
if !seen[ip_index]:
    seen[ip_index] = true
    nos_unique_visitors++
This requires 2^29 bytes of memory (i.e. 0.5 GB) assuming that you pack the booleans 8 per byte.
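A runnable Python version of that bit-vector idea, packing 2^32 bits into a 512 MB bytearray (the helper names are just for illustration):

import socket, struct

seen = bytearray(2 ** 29)              # 2^32 bits, packed 8 per byte
nos_unique_visitors = 0

def visit(ip_string):
    global nos_unique_visitors
    # convert dotted-quad IPv4 text to a 32-bit unsigned integer index
    ip_index = struct.unpack("!I", socket.inet_aton(ip_string))[0]
    byte_index, bit = ip_index >> 3, 1 << (ip_index & 7)
    if not seen[byte_index] & bit:
        seen[byte_index] |= bit
        nos_unique_visitors += 1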
Assuming there are no IPv6 addresses, an IPv4 address is encoded using 4 bytes (255.255.255.255), which gives us 32 bits.
You could use a binary tree with 32 levels to store the IP addresses, which lets you know whether an IP already exists in the tree and lets you insert it quickly and easily.
The number of operations to find an IP will then be approximately 32*2.
You could instead use a trie with 8 levels, each level storing 4 bits, for a maximum of about 8*16 operations.
This is a cheaper method than allocating memory for a full array, and a trie can also be used for IPv6 at a lower cost.
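A small sketch of that trie idea in Python, using 8 levels of 4-bit nibbles for IPv4; the function name and the list-of-16 node layout are illustrative assumptions, and the same structure extends to 32 nibble levels for IPv6:

def trie_insert(root, ip_as_int, levels=8):
    # returns True if this IP was not in the trie before
    node = root
    is_new = False
    for shift in range((levels - 1) * 4, -1, -4):
        nibble = (ip_as_int >> shift) & 0xF
        if node[nibble] is None:
            node[nibble] = [None] * 16
            is_new = True
        node = node[nibble]
    return is_new

root = [None] * 16
# increment the unique-visitor counter whenever trie_insert(root, ip_as_int) returns True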

Find Top 10 Most Frequent visited URl, data is stored across network

Source: Google Interview Question
Given a large network of computers, each keeping log files of visited urls, find the top ten most visited URLs.
Have many large <string (url) -> int (visits)> maps.
Calculate <string (url) -> int (sum of visits among all distributed maps)>, and get the top ten in the combined map.
Main constraint: The maps are too large to transmit over the network. Also can't use MapReduce directly.
I have now come across quite a few questions of this type, where processing needs to be done over large distributed systems. I can't think of or find a suitable answer.
All I could think of is brute force, which in some or other way, violates the given constraint.
It says you can't use map-reduce directly, which is a hint that the author of the question wants you to think about how map-reduce works, so we will just mimic the actions of map-reduce (a sketch follows the steps below):
Pre-processing: let R be the number of servers in the cluster; give each
server a unique id from 0, 1, 2, ..., R-1.
(map) For each (string, count) entry in a server's local map, send the tuple to the server whose id is hash(string) % R.
(reduce) Once step 2 is done (simple control communication), each server produces the (string, count) pairs of its top 10 strings, counting only the tuples that were sent to it in step 2.
(map) Each server sends its top 10 to one server (let it be server 0). This is fine; there are only 10*R such records.
(reduce) Server 0 will yield the top 10 across the network.
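A compact in-process sketch of those steps in Python; hash partitioning is simulated with local buckets, R is an assumed cluster size, and a real cluster would use a stable hash (e.g. from hashlib) plus actual network sends:

from collections import Counter

R = 4                                     # assumed number of servers

def partition(url_count_pairs):
    # step 2: route each (url, count) to bucket hash(url) % R
    buckets = [Counter() for _ in range(R)]
    for url, count in url_count_pairs:
        buckets[hash(url) % R][url] += count
    return buckets

def local_top10(bucket):
    # step 3: each server reports its own top 10 (url, count) pairs
    return bucket.most_common(10)

def global_top10(per_server_top10s):
    # steps 4-5: server 0 merges the 10*R candidate records
    merged = Counter()
    for pairs in per_server_top10s:
        for url, count in pairs:
            merged[url] += count
    return merged.most_common(10)

# usage: global_top10(local_top10(b) for b in partition(all_pairs))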
Notes:
The problem with this algorithm, like most big-data algorithms that
don't use frameworks, is handling failing servers. MapReduce takes
care of that for you.
The above algorithm can be translated into a two-phase map-reduce algorithm pretty straightforwardly.
In the worst case, any algorithm that does not require transmitting the whole frequency table is going to fail: we can construct a trivial case where the global top 10 URIs all sit at the bottom of every individual machine's list.
If we assume that the frequencies of URIs follow Zipf's law, we can come up with effective solutions. One such solution follows.
Each machine sends its top-K elements, where K depends solely on the available bandwidth. One master machine aggregates the frequencies and finds the 10th largest aggregated frequency value, "V10" (note that this is a lower bound: since the global top 10 may not be in the top-K of every machine, the sums are incomplete).
In the next step, every machine sends a list of URIs whose local frequency is at least V10/M (where M is the number of machines). The union of all such lists is sent back to every machine. Each machine, in turn, sends back its frequencies for this particular list, and the master aggregates them into the top-10 list. A sketch of this two-round protocol follows.
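A rough Python sketch of that two-round protocol; get_top_k, get_local_freqs_at_least and get_freqs_for are hypothetical per-machine RPCs standing in for the real network calls:

def two_round_top10(machines, K):
    # round 1: aggregate each machine's top-K and take the 10th largest
    # aggregated value as the lower bound V10 (assumes at least 10 URIs)
    partial = {}
    for m in machines:
        for uri, freq in get_top_k(m, K):
            partial[uri] = partial.get(uri, 0) + freq
    v10 = sorted(partial.values(), reverse=True)[9]

    # round 2: collect candidates whose local frequency is >= V10 / M,
    # then ask every machine for its exact counts of those candidates
    threshold = v10 / len(machines)
    candidates = set()
    for m in machines:
        candidates.update(uri for uri, _ in get_local_freqs_at_least(m, threshold))
    totals = {}
    for m in machines:
        for uri, freq in get_freqs_for(m, candidates):
            totals[uri] = totals.get(uri, 0) + freq
    return sorted(totals.items(), key=lambda kv: -kv[1])[:10]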

What is the best way to analyse a large dataset with similar records?

Currently I am looking for a way to develop an algorithm which is supposed to analyse a large dataset (about 600M records). The records have the parameters "calling party", "called party" and "call duration", and I would like to create a graph of weighted connections among phone users.
The whole dataset consists of similar records - people mostly talk to their friends and don't dial random numbers, but occasionally a person calls "random" numbers as well. For analysing the records I was thinking about the following logic:
create an array of numbers to indicate which records (row numbers) have already been scanned
start scanning from the first line and, for that line's "calling party", "called party" combination, check for the same combination in the rest of the dataset
sum the call durations and divide the result by the sum of all call durations
add the numbers of summed lines into the array created at the beginning
check the array if the next record number has already been summed
if it has already been summed then skip the record, else perform step 2
I would appreciate if anyone of you suggested any improvement of the logic described above.
p.s. the edges are directed therefore the (calling party, called party) is not equal to (called party, calling party)
Although this fact is not programming related, I would like to emphasize that, due to law and respect for user privacy, all the information that could possibly reveal user identity was hashed before the analysis.
As always with large datasets, the more information you have about the distribution of values in them, the better you can tailor an algorithm. For example, if you knew that there were only, say, 1000 different telephone numbers to consider, you could create a 1000x1000 array into which to write your statistics.
Your first step should be to analyse the distribution(s) of data in your dataset.
In the absence of any further information about your data I'm inclined to suggest that you create a hash table. Read each record in your 600M dataset and calculate a hash address from the concatenation of calling and called numbers. Into the table at that address write the calling and called numbers (you'll need them later, and bear in mind that the hash is probably irreversible), add 1 to the number of calls and add the duration to the total duration. Repeat 600M times.
Now you have a hash table which contains the data you want.
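A minimal sketch of that single pass in Python; the dict stands in for the hash table described above, and the key is kept as a directed pair to match the question:

from collections import defaultdict

# (calling_party, called_party) -> [number_of_calls, total_duration]
edges = defaultdict(lambda: [0, 0.0])

def process_record(calling_party, called_party, call_duration):
    stats = edges[(calling_party, called_party)]   # directed pair
    stats[0] += 1
    stats[1] += call_duration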
Since there are 600M records, the dataset seems large enough to warrant a database (and not so large as to require a distributed database). So, you could simply load it into a DB (MySQL, SQL Server, Oracle, etc.) and run the following query:
select calling_party, called_party, sum(call_duration), avg(call_duration), min(call_duration), max(call_duration), count(*) from call_log group by calling_party, called_party order by 7 desc
That would be a start.
Next, you would want to run some Association analysis (possibly using Weka), or perhaps you would want to analyze this information as cubes (possibly using Mondrian/OLAP). If you tell us more, we can help you more.
Algorithmically, what the DB is doing internally is similar to what you would do yourself programmatically:
Scan each record
Find the record for each (calling_party, called_party) combination, and update its stats.
A good way to store and find records for (calling_party, called_party) would be to use a hash function and find the matching record in its bucket.
Although it may be tempting to create a two-dimensional array indexed by (calling_party, called_party), that would be a very sparse array (very wasteful).
How often will you need to perform this analysis? If this is a large, unique dataset and thus only once or twice - don't worry too much about the performance, just get it done, e.g. as Amrinder Arora says by using simple, existing tooling you happen to know.
You really want more information about the distribution, as High Performance Mark says. For starters, it would be nice to know the count of unique phone numbers, the count of unique phone number pairs, and the mean, variance and maximum of the number of calling/called phone numbers per unique phone number.
You really want more information about the analysis you want to perform on the result. For instance, are you more interested in holistic statistics or in identifying individual clusters? Do you care more about following the links forward (determining whom X frequently called) or following the links backward (determining by whom X was frequently called)? Do you want to project overviews of this graph into low-dimensional spaces, i.e. 2D? Should it be easy to identify indirect links - e.g. X is near {A, B, C}, all of whom are near Y, so X is sort of near Y?
If you want fast and frequently adapted results, then be aware that a dense representation with good memory and temporal locality can easily make a huge difference in performance. In particular, it can easily outweigh a factor of ln N in big-O notation; you may benefit from a dense, sorted representation over a hash table. And databases? Those are really slow. Don't touch them if you can avoid it at all; they are likely to be a factor of 10000 slower - or more, the more complex the queries you want to perform on the result.
Just sort records by "calling party" and then by "called party". That way each unique pair will have all its occurrences in consecutive positions. Hence, you can calculate the weight of each pair (calling party, called party) in one pass with little extra memory.
For sorting, you can sort small chunks separately, and then do a N-way merge sort. That's memory efficient and can be easily parallelized.
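A sketch of that one-pass aggregation in Python, assuming the records have already been sorted by (calling party, called party), e.g. via chunked sorting and an N-way merge:

from itertools import groupby
from operator import itemgetter

def edge_weights(sorted_records):
    # sorted_records: iterable of (calling, called, duration) tuples,
    # already ordered by (calling, called)
    for pair, group in groupby(sorted_records, key=itemgetter(0, 1)):
        durations = [duration for _, _, duration in group]
        yield pair, len(durations), sum(durations)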

A good algorithm for generating an order number

As much as I like using GUIDs as the unique identifiers in my system, they are not very user-friendly for fields like an order number, where a customer may have to read the value out to a customer service representative.
What's a good algorithm to use to generate order number so that it is:
Unique
Not sequential (purely for optics)
Numeric values only (so it can be easily read to a CSR over phone or keyed in)
< 10 digits
Can be generated in the middle tier without doing a round trip to the database.
UPDATE (12/05/2009)
After carefully reviewing each of the answers posted, we decided to generate a random 9-digit number in the middle tier and save it in the DB. In the case of a collision, we'll generate a new number.
If the middle tier cannot check what "order numbers" already exist in the database, the best it can do will be the equivalent of generating a random number. However, if you generate a random number that's constrained to be less than 1 billion, you should start worrying about accidental collisions at around sqrt(1 billion), i.e., after a few tens of thousands of entries generated this way, the risk of collisions is material. What if the order number were sequential but disguised, i.e. the next multiple of some large prime number modulo 1 billion -- would that meet your requirements?
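A tiny sketch of that "disguised sequence" idea in Python: multiplying a sequential id by any constant coprime to 10^9 (any prime other than 2 or 5 qualifies) permutes the range 0..10^9-1, so each id maps to a unique, non-obvious order number. The constant below is an arbitrary illustration:

MODULUS = 10 ** 9
MULTIPLIER = 387_420_489        # odd and not divisible by 5, so coprime to 10^9

def order_number(sequential_id):
    # a bijection on 0..MODULUS-1, hence no collisions until 10^9 orders
    return (sequential_id * MULTIPLIER) % MODULUS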
<Moan>OK, this sounds like a classic case of premature optimisation. You imagine a performance problem ("Oh my god, I have to access the - horror - database to get an order number! My, that might be slow") and end up with a convoluted mess of pseudo-random generators and a ton of duplicate-handling code.</Moan>
One simple practical answer is to run a sequence per customer, the real order number being a composite of the customer number and the per-customer sequence number. You can easily retrieve the last sequence used when retrieving other information about the customer.
One simple option is to use the date and time, e.g. 0912012359, and if two orders are received in the same minute, simply increment the second order's number by a minute (it doesn't matter if the time is off; it's just an order number).
If you don't want the date to be visible, then calculate it as the number of minutes since a fixed point in time, e.g. when you started taking orders or some other arbitrary date. Again, apply the same duplicate check and increment.
Your competitors will glean nothing from this, and it's easy to implement.
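A minimal sketch of the minutes-since-a-fixed-point variant; the epoch below is an arbitrary example date, not one suggested in the answer:

from datetime import datetime

EPOCH = datetime(2009, 1, 1)
_last_issued = 0

def next_order_number(now=None):
    global _last_issued
    minutes = int(((now or datetime.utcnow()) - EPOCH).total_seconds() // 60)
    # if two orders land in the same minute, bump the second one forward
    _last_issued = max(minutes, _last_issued + 1)
    return _last_issued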
Maybe you could try generating some unique text using a Markov chain - see here for an example implementation in Python. Maybe use sequential numbers (rather than random ones) to generate the chain, so that (hopefully) each order number is unique.
Just a warning, though - see here for what can possibly happen if you aren't careful with your settings.
One solution would be to take a hash of some field of the order. This will not guarantee that it is distinct from the order numbers of all the other orders, but the likelihood of a collision is very low. I would imagine that without "doing a round trip to the database" it would be challenging to make sure that the order number is unique.
In case you are not familiar with hash functions, the wikipedia page is pretty good.
You could base64-encode a guid. This will meet all your criteria except the "numeric values only" requirement.
Really, though, the correct thing to do here is let the database generate the order number. That may mean creating an order template record that doesn't actually have an order number until the user saves it, or it might be adding the ability to create empty (but perhaps uncommitted) orders.
Use primitive polynomials as a finite field generator.
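One way to read that suggestion: a maximal-length LFSR built from a primitive polynomial over GF(2) visits every nonzero k-bit state exactly once before repeating, giving unique, non-sequential numbers without a database round trip. A hedged Python sketch follows; the 32-bit tap mask (taps 32, 22, 2, 1) is taken from published maximal-length LFSR tables and should be verified before use, and since 32-bit values can reach 10 digits you would reject values >= 10^9 or use a smaller register:

def lfsr_numbers(seed=0xACE1, mask=0x80200003):
    # Galois LFSR, right-shift form; the state must never be zero
    state = (seed & 0xFFFFFFFF) or 1
    while True:
        lsb = state & 1
        state >>= 1
        if lsb:
            state ^= mask
        yield state        # each value occurs once per full period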
Your 10-digit requirement is a huge limitation. Consider a two-stage approach.
Use a GUID
Prefix the GUID with a 10 digit (or 5 or 4 digit) hash of the GUID.
You will have multiple hits on the hash value. But not that many. The customer service people will very easily be able to figure out which order is in question based on additional information from the customer.
The straightforward answer to most of your bullet points:
Make the first six digits a sequentially-increasing field, and append three digits of hash to the end. Or seven and two, or eight and one, depending on how many orders you envision having to support.
However, you'll still have to call a function on the back-end to reserve a new order number; otherwise, it's impossible to guarantee a non-collision, since there are so few digits.
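A short sketch of that sequential-prefix-plus-hash-suffix layout, assuming six sequential digits and a three-digit suffix derived from a hash of some order field (the payload argument is illustrative):

import hashlib

def order_number(sequence, order_payload):
    # six sequentially increasing digits followed by three digits of hash
    prefix = f"{sequence % 1_000_000:06d}"
    suffix = int(hashlib.sha1(order_payload.encode()).hexdigest(), 16) % 1000
    return f"{prefix}{suffix:03d}"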
We do TTT-CCCCCC-1A-N1.
T = Circuit type (D1E=DS1 EEL, D1U=DS1 UNE, etc.)
C = 6 Digit Customer ID
1 = The customer's first location
A = The first circuit (A=1, B=2, etc) at this location
N = Order type (N=New, X=Disconnect, etc)
1 = The first order of this kind for this circuit
