Redis: Find keys that exist - performance

We have a list of numbers stored in Redis as keys: 300 million keys, each a 10-digit number.
Our users give us a list of about 1 million numbers and expect back the subset of those numbers that do not exist in Redis as keys. The expectation is a sub-second result, and we have been trying to use Redis for this.
Initially it looked like the right way to go (using EXISTS), but now we are wondering whether there is a better way to produce that subset than looping over the numbers one by one.
Could someone please let us know how we could do this efficiently?

Old question I know, but I thought it deserved a fuller answer.
The problem with getting all the keys from Redis and then doing the containment test client-side is that you have to pull the 300m keys from Redis for every check, or keep a "local" copy of those keys, which defeats the point of Redis.
Instead of taking the data to the processing, much better to take the processing to the data.
You can use redis sets and let redis do the set diffing.
Using redis-py here, but the same Redis commands can obviously be issued from any language.
import os, base64, time, redis

r = redis.Redis()

def create_keys(n, size=10):
    # Generate n random keys of the given size from base64-encoded random bytes.
    data = base64.b64encode(os.urandom(n * size))
    return [data[i:i+size] for i in range(0, n * size, size)]

# Build the reference set once.
if not r.exists('ref_keys'):
    for _ in range(3):
        r.sadd('ref_keys', *create_keys(1*10**6))
print('{} keys in reference key set'.format(r.scard('ref_keys')))

# Test case: half the keys to check exist in the reference set, half do not.
existing_keys = r.srandmember('ref_keys', number=50*10**3)
keys_to_check = existing_keys + create_keys(50*10**3)

start = time.time()
try:
    r.sadd('check_ref', *keys_to_check)
    # Let Redis compute which of the checked keys are missing from the reference set.
    missing = r.sdiff('check_ref', 'ref_keys')
finally:
    r.delete('check_ref')
print('number of missing keys: {}, time taken {:0.3f}s'.format(len(missing), time.time() - start))
(Most of the code here (and time) is spent creating the test case.)
Only the checked 1m keys have to be transferred, rather than all 300m.
Note: due to memory constraints my ref_keys set only has 30m keys, and the containment test took 3s. SDIFF has "Time complexity: O(N) where N is the total number of elements in all given sets.", so I suspect you would struggle to get the time below a second on commodity hardware.

Yes, you should avoid looping over the user list and calling EXISTS for each key. Redis commands are relatively slow (because of the client/server round trip) compared to variable manipulation in an ordinary language.
One solution I would suggest requires some coding: get all the existing keys with KEYS (http://redis.io/commands/keys), then sort both the result and the user list.
Then you can implement a fast search to check whether each of the user's keys is among the Redis keys.
Actually, you could use a set in Python, which already implements the difference operation:
http://docs.python.org/2/library/sets.html
(Sets are unordered, but the implementation is a dict, which is a hash table, so membership tests are fast.)
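For illustration, a minimal sketch of that approach with redis-py; the key pattern and the assumption that all keys fit in client memory are assumptions, and note that KEYS blocks the server while it scans.

import redis

r = redis.Redis()

# Pull every key matching the 10-digit pattern; KEYS is O(N) and blocks the
# server while it runs, and the full list must fit in client memory.
existing = set(r.keys('[0-9]*'))

user_numbers = [b'1234567890', b'0987654321']   # hypothetical user input
missing = set(user_numbers) - existing           # numbers not present as keys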

Related

Garbage collection with a very large dictionary

I have a very large immutable set of keys that doesn't fit in memory, and an even larger list of references, which must be scanned just once. How can the mark phase be done in RAM? I do have a possible solution, which I will write as an answer later (don't want to spoil it), but maybe there are other solutions I didn't think about.
I will try to restate the problem to make it more "real":
You work at Facebook, and your task is to find which users didn't ever create a post with an emoji. All you have is the list of active user names (around 2 billion), and the list of posts (user name / text), which you have to scan, but just once. It contains only active users (you don't need to validate them).
Also, you have one computer, with 2 GB of RAM (bonus points for 1 GB). So it has to be done all in RAM (without external sort or reading in sorted order). Within two days.
Can you do it? How? Tips: You might want to use a hash table, with the user name as the key, and one bit as the value. But the list of user names doesn't fit in memory, so that doesn't work. With user ids it might work, but you just have the names. You can scan the list of user names a few times (maybe 40 times, but not more).
Sounds like a problem I tackled 10 years ago.
The first stage: ditch GC. The overhead of GC for small objects (a few bytes) can be in excess of 100%.
The second stage: design a decent compression scheme for user names. English has about 3 bits of entropy per character. Even if you allowed more characters, the average number of bits won't rise fast.
Third stage: create a dictionary of usernames in memory. Use a 16-bit prefix of each username to choose the right sub-dictionary. Read in all usernames, initially sorting them just by this prefix. Then sort each sub-dictionary in turn.
As noted in the question, allocate one extra bit per username for the "used emoji" result.
The problem is now I/O bound, as the computation is embarrassingly parallel. The longest phase will be reading in all the posts (which is going to be many TB).
Note that in this setup, you're not using fancy data types like String. The dictionaries are contiguous memory blocks.
Given a deadline of two days, I would however drop some of this fanciness. The I/O bound for reading the text is severe enough that the creation of the user database may exceed 16 GB. Yes, that will swap to disk. Big deal, for a one-off.
TL;DR: hash the keys, sort the hashes, and store the sorted hashes in compressed form.
The algorithm I propose may be considered an extension of the solution to a similar (simpler) problem.
1. To each key, apply a hash function that maps keys to integers in the range [0..h]. It seems reasonable to start with h = 2 * number_of_keys.
2. Fill all available memory with these hashes.
3. Sort the hashes.
4. If a hash value is unique, write it to the list of unique hashes; otherwise remove all copies of it and write it to the list of duplicates. Keep both lists in compressed form: as differences between adjacent values, compressed with an optimal entropy coder (like an arithmetic coder, range coder, or ANS coder). If the list of unique hashes was not empty, merge it with the sorted hashes; additional duplicates may be found while merging. If the list of duplicates was not empty, merge the new duplicates into it.
5. Repeat steps 1..4 while there are any unprocessed keys.
6. Read the keys several more times while performing steps 1..5, but ignore all keys that are not in the list of duplicates from the previous pass. For each pass use a different hash function (for everything except matching against the list of duplicates from the previous pass, which means we need to sort hashes twice, for the two different hash functions).
7. Read the keys again to convert the remaining list of duplicate hashes into a list of plain keys. Sort it.
8. Allocate an array of 2 billion bits.
9. Use all unoccupied memory to construct an index for each compressed list of hashes. This could be a trie or a sorted list. Each entry of the index should contain a "state" of the entropy decoder, which allows decoding the compressed stream without starting from the very beginning.
10. Process the list of posts and update the array of 2 billion bits.
11. Read the keys once more to convert the hashes back to keys.
While the value h = 2*number_of_keys seems reasonably good, we could try to vary it to optimize space requirements (setting it too high decreases the compression ratio, setting it too low results in too many duplicates).
This approach does not guarantee the result: it is possible to invent 10 bad hash functions so that every key is duplicated on every pass. But with high probability it will succeed and will most likely need about 1 GB of RAM (because most compressed integer values are in the range [1..8], so each key results in about 2..3 bits in the compressed stream).
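For a rough illustration of steps 1-4, here is a minimal Python sketch; it uses SHA-1 modulo h as the hash (an assumption) and plain delta encoding where the real algorithm would apply an entropy coder such as FSE.

import hashlib

def steps_1_to_4(keys, h):
    """Hash byte-string keys into [0, h), sort, and split into unique/duplicate hashes.

    Returns delta-encoded unique hashes plus the raw duplicate hash values; a real
    implementation would entropy-code the deltas (e.g. with an ANS/FSE coder).
    """
    hashes = sorted(int(hashlib.sha1(k).hexdigest(), 16) % h for k in keys)
    uniques, duplicates = [], []
    prev_unique = 0
    i = 0
    while i < len(hashes):
        j = i
        while j < len(hashes) and hashes[j] == hashes[i]:
            j += 1
        if j - i == 1:                                  # hash value seen exactly once
            uniques.append(hashes[i] - prev_unique)     # store the delta to the previous unique
            prev_unique = hashes[i]
        else:                                           # all copies go to the duplicate list
            duplicates.append(hashes[i])
        i = j
    return uniques, duplicates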
To estimate the space requirements precisely we would need either a (complicated?) mathematical proof or a complete implementation of the algorithm (also pretty complicated). But to obtain a rough estimate we can use a partial implementation of steps 1..4 (see it on Ideone). It uses a variant of ANS coder named FSE (taken from https://github.com/Cyan4973/FiniteStateEntropy) and a simple hash function implementation (taken from https://gist.github.com/badboy/6267743). Here are the results:
Key list loads allowed:         10          20
Optimal h/n:                   2.1         1.2
Bits per key:                 2.98        2.62
Compressed MB:             710.851     625.096
Uncompressed MB:            40.474       3.325
Bitmap MB:                 238.419     238.419
MB used:                   989.744     866.839
Index entries:           1'122'520   5'149'840
Indexed fragment size:     1781.71     388.361
With the original OP limitation of 10 key scans, the optimal value for the hash range is only slightly higher (2.1) than my guess (2.0), and this parameter is very convenient because it allows using 32-bit hashes (instead of 64-bit ones). Required memory is slightly less than 1 GB, which allows using pretty large indexes (so step 10 would not be very slow). Here lies a little problem: these results show how much memory is consumed at the end, but in this particular case (10 key scans) we temporarily need more than 1 GB of memory while performing the second pass. This may be fixed if we drop the results (unique hashes) of the first pass and recompute them later, together with step 7.
With the looser limitation of 20 key scans, the optimal value for the hash range is 1.2, which means the algorithm needs much less memory and leaves more space for indexes (so that step 10 would be almost 5 times faster).
Loosening the limitation to 40 key scans does not result in any further improvement.
Minimal perfect hashing
Create a minimal perfect hash function (MPHF). At around 1.8 bits per key (using the RecSplit algorithm), this uses about 429 MB. (Here, 1 MB is 2^20 bytes, 1 GB is 2^30 bytes.)
For each user, allocate one bit as a marker, about 238 MB. So memory usage is around 667 MB.
Then read the posts; for each user, calculate the hash and set the related bit if needed.
Read the user table again, calculate the hash, and check if the bit is set.
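To make the lookup flow concrete, here is a minimal Python sketch; mphf() stands in for the minimal perfect hash function built below, and has_emoji() and the two iterables are assumptions.

def find_users_without_emoji(users, posts, mphf, num_users, has_emoji):
    """users and posts are iterables; mphf maps a user name to [0, num_users)."""
    used = bytearray((num_users + 7) // 8)   # one marker bit per user (~238 MB for 2e9 users)

    # Pass over the posts: set the bit for every user who posted an emoji.
    for user, text in posts:
        if has_emoji(text):
            i = mphf(user)
            used[i >> 3] |= 1 << (i & 7)

    # Pass over the user list: emit the users whose bit was never set.
    for user in users:
        i = mphf(user)
        if not used[i >> 3] & (1 << (i & 7)):
            yield user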
Generation
Generating the MPHF is a bit tricky, not because it is slow (this may take around 30 minutes of CPU time), but due to memory usage. With 1 GB of RAM, it needs to be done in segments.
Let's say we use 32 segments of about the same size, as follows:
Loop segmentId from 0 to 31.
For each user, calculate the hash code, modulo 32 (or bitwise and 31).
If this doesn't match the current segmentId, ignore it.
Calculate a 64-bit hash code (using a second hash function), and add that to the list.
Do this until all users are read.
A segment will contain about 62.5 million keys (2 billion divided by 32), that is 238 MB.
Sort this list by key (in place) to detect duplicates. With 64-bit entries, the probability of duplicates is very low, but if there are any, use a different hash function and try again (you need to store which hash function was used).
Now calculate the MPHF for this segment.
The RecSplit algorithm is the fastest I know.
The CHD algorithm can be used as well, but it needs more space / is slower to generate.
Repeat until all segments are processed.
The above algorithm reads the user list 32 times. This could be reduced to about 10 if more segments are used (for example one million), and as many segments are read per step as fit in memory. With smaller segments, fewer bits per key are needed, due to the reduced probability of duplicates within one segment.
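For illustration, a rough Python sketch of the segmented pass described above; hash1, hash2, and build_mphf (the RecSplit/CHD construction) are stand-ins for whatever implementations are actually used.

def build_segmented_mphf(read_users, hash1, hash2, build_mphf, segments=32):
    """read_users() re-reads the user list on each call; hash1 picks the segment,
    hash2 is the 64-bit hash stored in it. Returns one MPHF per segment."""
    mphfs = []
    for segment_id in range(segments):
        # Keep only the users belonging to this segment, as 64-bit hashes.
        keys = [hash2(u) for u in read_users() if hash1(u) % segments == segment_id]
        keys.sort()                                   # sort in place to detect duplicates
        if any(a == b for a, b in zip(keys, keys[1:])):
            raise ValueError('hash collision; retry the segment with another hash function')
        mphfs.append(build_mphf(keys))                # hypothetical MPHF construction (e.g. RecSplit)
    return mphfs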
The simplest solution I can think of is an old-fashioned batch update program. It takes a few steps, but in concept it's no more complicated than merging two lists that are in memory. This is the kind of thing we did decades ago in bank data processing.
Sort the file of user names by name. You can do this easily enough with the Gnu sort utility, or any other program that will sort files larger than what will fit in memory.
Write a query to return the posts, in order by user name. I would hope that there's a way to get these as a stream.
Now you have two streams, both in alphabetic order by user name. All you have to do is a simple merge:
Here's the general idea:
currentUser = get first user name from users file
currentPost = get first post from database stream
usedEmoji = false
while (not at end of users file and not at end of database stream)
{
    if currentUser == currentPostUser
    {
        if currentPost has emoji
        {
            usedEmoji = true
        }
        currentPost = get next post from database
    }
    else if currentUser > currentPostUser
    {
        // No user for this post. Get next post.
        currentPost = get next post from database
    }
    else
    {
        // Current user is less than post user name.
        // So we have to switch users.
        if (usedEmoji == false)
        {
            // No post by this user contained an emoji
            output currentUser name
        }
        currentUser = get next user name from file
        usedEmoji = false   // reset the flag for the new user
    }
}
// At the end of one of the files.
// Clean up.
// If we reached the end of the posts, but there are still users left,
// then output each remaining user name.
// The usedEmoji test is in there strictly for the first time through,
// because the current user when the above loop ended might have had
// a post with an emoji.
while not at end of user file
{
    if (usedEmoji == false)
    {
        output currentUser name
    }
    currentUser = get next user name from file
    usedEmoji = false
}
// at this point, names of all the users who haven't
// used an emoji in a post have been written to the output.
An alternative implementation, if obtaining the list of posts as described in #2 is overly burdensome, would be to scan the list of posts in their natural order and output the user name from any post that contains an emoji. Then, sort the resulting file and remove duplicates. You can then proceed with a merge similar to the one described above, but you don't have to explicitly check whether a post has an emoji. Basically, if a name appears in both files, then you don't output it.
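A minimal sketch of that alternative in Python, assuming both files have already been sorted on disk (e.g. with GNU sort); the emoji-poster file may still contain duplicates, which the merge tolerates.

def users_without_emoji(users_path, emoji_users_path):
    """Both files are assumed sorted by user name, one name per line."""
    with open(users_path) as users_f, open(emoji_users_path) as emoji_f:
        users = (line.rstrip('\n') for line in users_f)
        emoji = (line.rstrip('\n') for line in emoji_f)
        e = next(emoji, None)
        for user in users:
            # Advance the emoji-poster stream until it catches up with this user.
            while e is not None and e < user:
                e = next(emoji, None)
            if e != user:              # user never appears in the emoji-poster file
                yield user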

Redis: Efficient key sharding for large key set

I have a problem where I am going beyond the amount of RAM in my server. I need to reduce the database size so that I can still use Redis. My application is a massive key / value store where the keys are user given text strings (directory / file paths). The values are very simple pointers to objects that I create. So it is an object store. The problem is that I have a Petabyte of objects, where an object could be 100K bytes. I can actually constrain the average object to be no less than 1M bytes, so 10^15 / 10^6 = 10^9 objects. Being that each object needs a key, that is 10^9, or 1G keys. If each key/value pair is 100 bytes, that is 100GB in RAM. That almost fits in servers with 128GB of RAM but it is not the only thing that is going on in the server. I'd like to reduce the footprint if I can.
The question is what direction to go in? I tried compressing the input key, but that was actually bigger than the original in my testing, because it is such a short string and not a document. I have thought about using a different data store for smaller-sized files, let's say below 1G. That would reduce what I need to put into Redis. I have also thought about using a hash algorithm that intentionally overlaps and bins the keys, and then putting the hash deltas into the merged keys as values. If that is too confusing, here is a made-up example:
Key    Hash
A      15gh2
B      15gh2
C      4Tgnx
I would then store in Redis:
V(15gh2) = A, B, A-Value=A-Object, B-Value=B-Object
V(4Tgnx) = C
There is probably a proper way to algebraically represent this, but I don't know how to do that. "A-Object" is my pointer to the A object. What I'm trying to do is to end up with fewer keys, based on some posts I've read about keys being more expensive than Redis hash values (don't confuse the "Redis hash" with the "hash" algorithm). I have access to http://ieeexplore.ieee.org/ full database to search for papers on this topic. I'm not quite sure what I should be searching for in the query field? I tried things like "hash chain" but that appears to be targeting encryption more than efficient database stores. Any solution ideas or paths for greater research would be appreciated.
Update: As noted in the comments section, the values, or what I call "A-Object", "B-Object" are encoded "pointers" that are paths to objects. These are actual files in an XFS filesystem. They can be encoded as simply as "1:6:2" to point to path "/data/d0001/d0006/d0002". So a very short value "1:6:2" is all that needs to be stored.
The standard approach with this much data is to partition data across multiple servers.
See http://redis.io/topics/partitioning for advice on how to do that.
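As a rough illustration, client-side partitioning with redis-py might look like the sketch below; the host names and the hash choice are assumptions (Redis Cluster or a proxy can do the same job for you).

import zlib
import redis

# Hypothetical pool of Redis instances; each holds one shard of the key space.
shards = [redis.Redis(host=h) for h in ('redis-0', 'redis-1', 'redis-2', 'redis-3')]

def shard_for(key):
    # A stable hash of the key decides which instance owns it.
    return shards[zlib.crc32(key.encode('utf-8')) % len(shards)]

def put(path, pointer):
    # e.g. put('some/user/path', '1:6:2') per the encoding described in the update
    shard_for(path).set(path, pointer)

def get(path):
    return shard_for(path).get(path)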

Optimize returning top results from intersecting Sorted Set and Set in Redis

I am trying to optimize my Redis code and I am currently having some difficulties scaling my solution. Redis is hosted on Redis Cloud, the app on Heroku and I am using Ruby for my language.
My Redis setup:
I have a series of sorted sets which each contains approx 1,000 scored members and a corresponding set (which could be a hash, string, list, set, sorted set or any other structure) for each user in the system.
For example, in the news:sports key I have the following structure. Other key examples are news:entertainment and news:business.
news:sports
    Member       Score
    StoryOID1     1000
    StoryOID2      999
    StoryOID3      998
    StoryOID4      997
    StoryOID5      996
    ...
I also have a user-specific key (a set) for each of the master sorted sets (user1:news:sports) that contains the stories the user has already seen, i.e.
seen:user1:sports    StoryOID2
My challenge:
On each user request, I need to return the top 3 members (with the highest scores, although I don't need to know the scores) from a specific sorted set that the user hasn't already seen. I prefer not to save the result in Redis, as it has no long-term use and I would just have to delete the key.
Given the example above, if User1 requests news:sports, I would return:
StoryOID1
StoryOID3
StoryOID4
As part of my code, I loop through 10 sorted sets (10 genres) returning the top 3 unseen stories from each for a total of 30 OIDS being returned for each request.
For each user request:
Do this 10 times:
ZRANGEBYSCORE top 24 members
loop user genres key using SISMEMBER until I return 3 unseen members
end
Benchmarking at 60 dynos (Heroku), I can only get to 500 simultaneous connections (at 1000 ms response time), with the Redis loop below being the bottleneck. I am aiming to scale a few multiples over my current setup. Anything and everything can be changed to scale this process.
My Current Process (in Ruby):
def newslist(userOID, genres)
  # $redis: the app's Redis connection (name assumed).
  # Pull the top 24 stories for the given news:genre -- 24 could be replaced by 1, 3, 6, 12, etc.
  newsscores = $redis.zrevrangebyscore("news:#{genres}", "+inf", "-inf", :limit => [0, 24], :with_scores => true)
  newsstories = Array.new(3)
  i = 0       # accepted (unseen) stories counter
  loopcnt = 0 # loop counter
  while i < 3
    if loopcnt >= newsscores.count # ran out of fetched stories
      break
    end
    seen = $redis.sismember("seen:#{userOID}:#{genres}", newsscores[loopcnt][0])
    if seen == false
      newsstories[i] = newsscores[loopcnt][0]
      i += 1
    end
    loopcnt += 1
  end
  if i == 3
    return newsstories # return 3 unseen stories
  else
    return 0           # return 0 -- this should cause a repick
  end
end
I know I am paying a large cost for the large number of Redis requests. My current thought is to translate the above into a Lua script I can run server-side, but I can't help feeling there is a more elegant solution that will scale far better.
Is there a better way?
First off, yes: you should 100% be using Lua for this. Check the CPU on your Redis box. I'm willing to bet it's not burning up. Your bottleneck at this point is almost definitely network throughput because you're requiring a back-and-forth call each time you hit SISMEMBER (which is up to 24 times PER USER). That's a lot of unnecessary network activity. It's especially unnecessary when you consider that the logic you're doing on top of SISMEMBER could very easily be done server side, and there's really no reason to send anything back to your client until you finish the loop completely. That logic applies to the initial ZRANGEBYSCORE top 24 members as well. You can literally just translate the entire:
Do this 10 times:
ZRANGEBYSCORE top 24 members
loop user genres key using SISMEMBER until I return 3 unseen members
end
into Lua and go from 250 network hits per user to just 1 per user. That will be a huge, huge win. On top of that, you'll be sending a lot less information to Redis and back the one time you do initiate a Redis call. Here's some Lua pseudo-code that should give you the idea of what you want to be doing:
local genres = {KEYS[1], KEYS[2], KEYS[3], KEYS[4], KEYS[5], KEYS[6], KEYS[7], KEYS[8], KEYS[9], KEYS[10]}
local user_seen_genre_sets = {KEYS[11], KEYS[12], KEYS[13], KEYS[14], KEYS[15], KEYS[16], KEYS[17], KEYS[18], KEYS[19], KEYS[20]}
local user_id = ARGV[1]
local to_return = {{},{},{},{},{},{},{},{},{},{}}
for i = 1, #genres do
    local possible_stories = redis.call('ZREVRANGEBYSCORE', genres[i], 'inf', 0, 'LIMIT', 0, 24)
    -- call SISMEMBER on each story above with the corresponding user_seen_genre_sets key;
    -- add the first 3 unseen results to to_return[i], then stop the inner loop
    for _, story in ipairs(possible_stories) do
        if redis.call('SISMEMBER', user_seen_genre_sets[i], story) == 0 then
            table.insert(to_return[i], story)
            if #to_return[i] == 3 then break end
        end
    end
end
return to_return
Why use Lua instead of a pipeline?
Itamar Haber brought up a good point that you might want to use broken-up pipelines for this instead of a single Lua script, because the Lua script might block your Redis server for too long. Here are a few reasons why you want to use a Lua script instead of broken-up pipelines:
I've never seen a Lua script on Redis (one that doesn't do something like KEYS *) take longer than tens of milliseconds. None of the operations mentioned should be considered more than O(log n) either, so you're pretty future-proofed as well in case you're expecting large data growth. If your Redis server is being blocked for too long, that's more an indication that you need a bigger server, because none of the operations you're running are very intense (log(n) at most, as mentioned).
One of the major benefits of the Lua script is that you send your logic to be run server-side instead of sending a bunch of data back and forth to run the logic client-side (i.e. get all the possible stories and send them to the client, then send them each back to Redis one by one to run SISMEMBER). All of that data being sent over the network will be a much bigger bottleneck than running more operations within Redis and Lua, both of which are very, very fast.
So, all in all, despite the valid question, I firmly stand by the Lua approach. If you care to run benchmarks and share them with us, that would be really awesome, because I'm guessing switching this over will improve things around two orders of magnitude.
There is no need to use Lua scripting here, though depending on your data size, Lua versions of this calculation may be faster (depending on SISMEMBER vs. ZUNIONSTORE vs. SDIFFSTORE + ZINTERSTORE performance). Generally, you can calculate everything you want without multiple round-trips and without Lua scripting, provided the following 3 assumptions.
Your ZSETs all use positive non-zero scores (it is easier if all scores are >1, I will assume this)
Your SETs all contain the same members as your sorted sets
The maximum score in your ZSETs is fixed, or can at least be bounded (something like 1000000 is perfectly reasonable)
The important operation here is ZUNIONSTORE, which can take SETs as input, and whose members behave as though all scores are 1.
To get the top 3 stories from your news ZSETs, excluding those stories already read by a given user, you can make the following calls:
ZUNIONSTORE temp 2 news:sports seen:user1:sports WEIGHTS 1 -1000000
ZREVRANGEBYSCORE temp inf 0 LIMIT 0 3
DEL temp
You can wrap that in a MULTI/EXEC transaction so you don't have any extra data lying around after the query, pipeline it all, etc. The limitation is that as the number of stories a user has read and the number of stories in each category increase, this approach gets slower.
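For example, with redis-py the first variant could be wrapped in a transaction roughly like this (key names follow the examples above; the -1000000 weight assumes scores never exceed 1,000,000):

import redis

r = redis.Redis()

def top_unseen(genre, user_id, count=3):
    temp = 'temp:{}:{}'.format(user_id, genre)       # hypothetical scratch key
    pipe = r.pipeline(transaction=True)              # MULTI/EXEC
    # Seen stories get weight -1000000, so they drop below 0 in the union.
    pipe.zunionstore(temp, {'news:{}'.format(genre): 1,
                            'seen:{}:{}'.format(user_id, genre): -1000000})
    pipe.zrevrangebyscore(temp, '+inf', 0, start=0, num=count)
    pipe.delete(temp)
    results = pipe.execute()
    return results[1]                                # the ZREVRANGEBYSCORE reply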
As an alternative, if you had a secondary SET that had the same members as (for example) your news:sports ZSET, you could instead perform:
SDIFFSTORE temp news:sports:set seen:user1:sports
ZINTERSTORE temp 2 temp news:sports WEIGHTS 0 1
ZREVRANGEBYSCORE temp inf -inf LIMIT 0 3
DEL temp
That would eliminate the score requirements, though it would increase the data size for each of your stories by an entry in a SET. This one also gets slower for every entry in the news SET, ZSET, and the user's seen SET, but the constants are different, so it may be faster depending on your data sizes.

What is the fastest way to store huge amount of unique strings?

I wonder what the best way is to store a huge number of strings and check for duplicates.
We have to think about our priorities:
duplicate check speed
inserting new string time
storage space on hard disk
random access time
What is the best solution when our targets are fast duplicate checking and fast insertion of new strings (random access time and storage space don't matter)?
I am thinking about an SQL database, but which DB is best for this?
If we use an SQL DB like MySQL, which storage engine would be best? (Of course, we have to exclude the MEMORY engine because of the data volume.)
Use a hash function on the input string. The output hash would be the primary key/id of the record.
Then you can check if the DB has this hash/id/primary key:
If it doesn't: this is a new string; add a new record containing the string, with the hash as its id.
If it does: check that the string from the loaded record is the same as the input string.
if the string is the same: it is a duplicate
if the string is different: this is a collision. Use a collision resolution scheme to resolve. (A couple of examples below)
You will have to consider which hash function/scheme/strength to use based on speed and expected number of strings and hash collision requirements/guarantees.
A couple of ways to resolve collisions:
Use a 2nd hash function to come up with a new hash in the same table.
Mark the record (e.g. with NULL) and repeat with a stronger 2nd hash function (with wider domain) on a secondary "collision" table. On query, if the string is marked as collided (e.g. NULL) then do the lookup again in the collision table. You might also want to use dynamic perfect hashing to ensure that this second table does not have further collisions.
Of course, depending on how persistent this needs to be and how much memory you are expecting to take up/number of strings, you could actually do this without a database, directly in memory which would be a lot faster.
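For illustration, a minimal in-memory sketch of that scheme in Python (the hash choice is an assumption; a database table keyed by the hash column would follow the same logic).

import hashlib

# Key each record by a hash of the string; compare the stored string on a hit
# to distinguish real duplicates from hash collisions.
table = {}  # hash digest -> original string

def insert_or_check(s):
    """Return True if s is a duplicate, False if it was newly inserted."""
    h = hashlib.sha1(s.encode('utf-8')).hexdigest()
    stored = table.get(h)
    if stored is None:
        table[h] = s       # new string
        return False
    if stored == s:
        return True        # exact duplicate
    # Hash collision (practically never with SHA-1); a real system would fall
    # back to a second hash function or a separate "collision" table here.
    raise RuntimeError('hash collision between %r and %r' % (s, stored))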
You may want to consider a NoSQL solution:
Redis. Some of the use cases solved using Redis:
http://highscalability.com/blog/2011/7/6/11-common-web-use-cases-solved-in-redis.html
http://dr-josiah.blogspot.com/2011/02/some-redis-use-cases.html
(Josiah L. Carlson is the author of Redis in Action)
http://www.paperplanes.de/2010/2/16/a_collection_of_redis_use_cases.html
memcached. Some comparisons between memcached and Redis:
http://www.quora.com/What-are-the-differences-between-memcached-and-redis
Is memcached a dinosaur in comparison to Redis?
http://coder.cl/2011/06/concurrency-in-redis-and-memcache/
Membase/Couchbase, which counts OMGPOP's Draw Something as one of its success stories. Comparison between Redis and Membase:
What is the major difference between Redis and Membase?
http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis
Some questions:
how large is the set of strings?
will the application be read heavy or write heavy? or both?
how often would you like data to be persisted to disk?
is there a N most recent strings requirement?
Hope this helps.
Generate a suffix tree to store the strings. Ukkonen's algorithm, as presented in http://www.daimi.au.dk/~mailund/slides/Ukkonen-2005.pdf, gives some insight into how to build a suffix tree. There are a number of ways to store the suffix tree, but once it is built, lookup time is very low.

What is the best way to analyse a large dataset with similar records?

Currently I am looking for a way to develop an algorithm to analyse a large dataset (about 600M records). The records have the parameters "calling party", "called party", and "call duration", and I would like to create a graph of weighted connections among phone users.
The whole dataset consists of similar records - people mostly talk to their friends and don't dial random numbers, but occasionally a person calls "random" numbers as well. For analysing the records I was thinking about the following logic:
create an array of numbers to indicate which records (row numbers) have already been scanned.
start scanning from the first line and for the first line combination "calling party", "called party" check for the same combinations in the database
sum the call durations and divide the result by the sum of all call durations
add the numbers of summed lines into the array created at the beginning
check the array if the next record number has already been summed
if it has already been summed then skip the record, else perform step 2
I would appreciate it if any of you could suggest improvements to the logic described above.
p.s. the edges are directed therefore the (calling party, called party) is not equal to (called party, calling party)
Although this is not programming-related, I would like to emphasize that, due to the law and out of respect for user privacy, all information that could possibly reveal a user's identity was hashed before the analysis.
As always with large datasets the more information you have about the distribution of values in them the better you can tailor an algorithm. For example, if you knew that there were only, say, 1000 different telephone numbers to consider you could create a 1000x1000 array into which to write your statistics.
Your first step should be to analyse the distribution(s) of data in your dataset.
In the absence of any further information about your data I'm inclined to suggest that you create a hash table. Read each record in your 600M dataset and calculate a hash address from the concatenation of calling and called numbers. Into the table at that address write the calling and called numbers (you'll need them later, and bear in mind that the hash is probably irreversible), add 1 to the number of calls and add the duration to the total duration. Repeat 600M times.
Now you have a hash table which contains the data you want.
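For illustration, a minimal in-memory sketch of that table in Python (the record source is an assumption; with enough RAM, a plain dictionary keyed by the number pair does the bucketing for you):

from collections import defaultdict

# (calling_party, called_party) -> [number_of_calls, total_duration]
stats = defaultdict(lambda: [0, 0.0])

def aggregate(records):
    """records yields (calling_party, called_party, duration) tuples."""
    for calling, called, duration in records:
        entry = stats[(calling, called)]   # direction matters: (a, b) != (b, a)
        entry[0] += 1
        entry[1] += duration
    return stats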
Since there are 600 M records, it seems to be large enough to leverage a database (and not too large to require a distributed Database). So, you could simply load this into a DB (MySQL, SQLServer, Oracle, etc) and run the following queries:
select calling_party, called_party,
       sum(call_duration), avg(call_duration),
       min(call_duration), max(call_duration),
       count(*)
from call_log
group by calling_party, called_party
order by 7 desc
That would be a start.
Next, you would want to run some Association analysis (possibly using Weka), or perhaps you would want to analyze this information as cubes (possibly using Mondrian/OLAP). If you tell us more, we can help you more.
Algorithmically, what the DB is doing internally is similar to what you would do yourself programmatically:
Scan each record
Find the record for each (calling_party, called_party) combination, and update its stats.
A good way to store and find records for (calling_party, called_party) would be to use a hashfunction and to find the matching record from the bucket.
Although it may be tempting to create a two-dimensional array for (calling_party, called_party), that would be a very sparse array (very wasteful).
How often will you need to perform this analysis? If this is a large, unique dataset and thus only once or twice - don't worry too much about the performance, just get it done, e.g. as Amrinder Arora says by using simple, existing tooling you happen to know.
You really want more information about the distribution, as High Performance Mark says. For starters, it would be nice to know the count of unique phone numbers, the count of unique phone number pairs, and the mean, variance and maximum of the count of calling/called phone numbers per unique phone number.
You really want more information about the analysis you want to perform on the result. For instance, are you more interested in holistic statistics or in identifying individual clusters? Do you care more about following the links forward (determining who X frequently called) or following the links backward (determining who X was frequently called by)? Do you want to project overviews of this graph into low-dimensional spaces, i.e. 2D? Should it be easy to identify indirect links - e.g. X is near {A, B, C}, all of whom are near Y, so X is sort of near Y?
If you want fast and frequently adapted results, then be aware that a dense representation with good memory and temporal locality can easily make a huge difference in performance. In particular, it can easily outweigh a factor of ln N in big-O notation; you may benefit from a dense, sorted representation over a hashtable. And databases? Those are really slow. Don't touch them if you can avoid it at all; they are likely to be a factor of 10000 slower - or more, the more complex the queries you want to perform on the result are.
Just sort records by "calling party" and then by "called party". That way each unique pair will have all its occurrences in consecutive positions. Hence, you can calculate the weight of each pair (calling party, called party) in one pass with little extra memory.
For sorting, you can sort small chunks separately, and then do a N-way merge sort. That's memory efficient and can be easily parallelized.
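A rough sketch of that approach in Python, assuming the records arrive as (calling, called, duration) string tuples from some iterator; chunks are sorted in memory, spilled to temporary files, and merged with heapq.merge, after which each pair is aggregated in a single pass.

import csv
import heapq
import itertools
import tempfile

def sorted_chunks(records, chunk_size=1_000_000):
    """Sort fixed-size chunks in memory and spill each to a temporary CSV file."""
    records = iter(records)
    files = []
    for chunk in iter(lambda: list(itertools.islice(records, chunk_size)), []):
        chunk.sort(key=lambda r: (r[0], r[1]))
        f = tempfile.TemporaryFile('w+', newline='')
        csv.writer(f).writerows(chunk)
        f.seek(0)
        files.append(f)
    return files

def pair_durations(records):
    """N-way merge the sorted chunks, then sum durations per (calling, called) pair."""
    readers = [csv.reader(f) for f in sorted_chunks(records)]
    merged = heapq.merge(*readers, key=lambda r: (r[0], r[1]))
    for (calling, called), rows in itertools.groupby(merged, key=lambda r: (r[0], r[1])):
        yield calling, called, sum(float(r[2]) for r in rows)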
