We are at the beginning of an F# project involving real-time and historical analysis of streaming data. The data is contained in a C# object (see below) and is sent as part of a standard .NET event. In real time, the number of events we receive varies greatly, from less than 1/sec to upwards of around 800 events per second per instrument, so the stream can be very bursty. A typical day might accumulate 5 million rows/elements per instrument.
A generic version of the C# event's data structure looks like this:
public enum MyType { type0 = 0, type1 = 1 }

public class dataObj
{
    public int myInt = 0;
    public double myDouble;
    public string myString;
    public DateTime myDataTime;
    public MyType type;
    public object myObj = null;
}
We plan to use this data structure in F# in two ways:
Historical analysis using supervised and unsupervised machine learning (CRFs, clustering models, etc.)
Real-time classification of data streams using the above models
The data structure needs to be able to grow as we add more events. This rules out Array<'T> because it cannot be resized, though arrays could still be used for the historical analysis. The data structure also needs to provide quick access to recent data, and ideally we need to be able to jump to the data point x positions back. This rules out the F# List<'T> because of its linear lookup time and because there is no random access to elements, just "forward-only" traversal.
According to this post, Set<T> may be a good choice...
> " ...Vanilla Set<'a> does a more than adequate job. I'd prefer a 'Set' over a 'List' so you always have O(lg n) access to the largest and smallest items, allowing you to ordered your set by insert date/time for efficient access to the newest and oldest items..."
EDIT: Yin Zhu's response gave me some additional clarity into exactly what I was asking. I have edited the remainder of the post to reflect this. Also, the previous version of this question was muddied by the introduction of requirements for historical analysis; I have omitted them.
Here is a breakdown of the steps of the real-time process:
A real-time event is received.
This event is placed in a data structure. This is the data structure that we are trying to determine. Should it be a Set<T>, or some other structure?
A subset of the elements is either extracted or somehow iterated over for the purpose of feature generation. This would be either the last n rows/elements of the data structure (e.g. the last 1,000 or 10,000 events) or all the elements in the last x secs/mins (e.g. all the events in the last 10 min). Ideally, we want a structure that allows us to do this efficiently. In particular, a data structure that allows random access to the nth element without iterating through all the other elements is of value.
Features for the model are generated and sent to a model for evaluation.
We may prune the data structure of older data to improve performance.
So the question is: what is the best data structure for storing the real-time streaming events that we will use to generate features?
You should consider FSharpx.Collections.Vector. Vector<'T> will give you array-like features, including indexed O(log32 n) look-up and update, which is within spitting distance of O(1), as well as adding new elements to the end of your sequence. There is another implementation of Vector that can be used from F#: Solid Vector. It is very well documented, and some functions perform up to 4x faster at large scale (element count > 10K). Both implementations perform very well up to and possibly beyond 1M elements.
In his answer, Jack Fox suggests using either the FSharpx.Collections Vector<'T> or the Solid Vector<'t> by Greg Rosenbaum (https://github.com/GregRos/Solid). I thought I might give back a bit to the community by providing instructions on how to get up and running with each of them.
Using the FSharpx.Collections.Vector<'T>
The process is pretty straightforward:
Download the FSharpx.Core NuGet package using either the Package Manager Console or Manage NuGet Packages for Solution. Both are found in Visual Studio under Tools -> Library Package Manager.
If you're using it in an F# script file, add #r "FSharpx.Core.dll". You may need to use a full path.
Usage:
open FSharpx.Collections
let ListOfTuples = [(1,true,3.0);(2,false,1.5)]
let vector = ListOfTuples |> Vector.ofSeq
printfn "Last %A" vector.Last
printfn "Unconj %A" vector.Unconj
printfn "Item(0) %A" (vector.[0])
printfn "Item(1) %A" (vector.[1])
printfn "TryInitial %A" dataAsVector.TryInitial
printfn "TryUnconj %A" dataAsVector.Last
Using the Solid.Vector<'T>
Getting set up to use the Solid Vector<'T> is a bit more involved, but the Solid version has a lot more handy functionality and, as Jack pointed out, a number of performance benefits. It also has a lot of useful documentation.
You will need to download the visual studio solution from https://github.com/GregRos/Solid
Once you have downloaded it, you will need to build it, as there is no ready-to-use pre-built DLL.
If you're like me, you may run into a number of missing dependencies that prevent the solution from being built. In my case, they were all related to the NUnit testing framework (I use a different one). Just work through downloading/adding each of the dependencies until the solution builds.
Once that is done and the solution is built, you will have a shiny new Solid.dll in the Solid/Solid/bin folder. This is where I went wrong. That is the core DLL and is only enough for C# usage. If you only include a reference to Solid.dll, you will be able to create a Vector<'T> in F#, but funky things will happen from then on.
To use this data structure in F# you will need to reference both Solid.dll and Solid.FSharp.dll, which is found in the \Solid\SolidFS\obj\Debug\ folder. You will only need one open statement: open Solid
Here is some code showing usage in a F# script file:
#r "Solid.dll"
#r "Solid.FSharp.dll" // don't forget this reference
open Solid
let ListOfTuples2 = [(1,true,3.0);(2,false,1.5)]
let SolidVector = ListOfTuples2 |> Vector.ofSeq
printfn "%A" SolidVector.Last
printfn "%A" SolidVector.First
printfn "%A" (SolidVector.[0])
printfn "%A" (SolidVector.[1])
printfn "Count %A" SolidVector.Count
let test2 = vector { for i in {0 .. 100} -> i }
If your dataObj contains a unique ID field, then any set data structure would be fine for your job. Immutable data structures are primarily used for functional-style code or persistence. If you don't need these two properties, you can use HashSet<T> or SortedSet<T> from the .NET collection library.
Some stream-specific optimization may be useful, e.g., keeping a fixed-size Queue<T> for the most recent data objects in the stream and storing older objects in the heavier-weight set. I would suggest benchmarking before switching to such a hybrid data structure.
Edit:
After reading your requirements more carefully, I found that what you want is a queue with user-accessible indexing or a backward enumerator. With such a data structure, your feature-extraction operations (e.g. average, sum, etc.) cost O(n). If you want to do some of the operations in O(log n), you can use more advanced data structures, e.g. interval trees or skip lists; however, you will have to implement these data structures yourself, because you need to store meta information in the tree nodes, which is hidden behind the standard collection APIs.
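A minimal sketch of such an indexable queue in F# might be a fixed-capacity ring buffer that overwrites the oldest element when full (the type and member names here are illustrative, not from any library):

// A ring buffer with O(1) append and O(1) access to the k-th most recent element.
// Capacity is fixed; the oldest element is overwritten once the buffer is full.
type RingBuffer<'T>(capacity: int) =
    let buffer = Array.zeroCreate<'T> capacity
    let mutable next = 0      // index where the next element will be written
    let mutable count = 0     // number of elements currently stored

    member x.Count = count

    member x.Add(item: 'T) =
        buffer.[next] <- item
        next <- (next + 1) % capacity
        count <- min (count + 1) capacity

    /// FromEnd 0 is the most recent element, FromEnd 1 the one before it, and so on.
    member x.FromEnd(k: int) =
        if k >= count then invalidArg "k" "not enough elements"
        buffer.[(next - 1 - k + 2 * capacity) % capacity]

// Usage: keep the last 10,000 events and read the 100th most recent one.
let recent = RingBuffer<int>(10000)
for i in 1 .. 12000 do recent.Add i
printfn "%d" (recent.FromEnd 100)   // prints 11900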
> This event is placed in a data structure. This is the data structure that we are trying to determine. Should it be a Set, a Queue, or some other structure?
Difficult to say without more information.
If your data are coming in with timestamps in ascending order (i.e. they are never out of order) then you can just use some kind of queue or extensible array.
If your data can come in out of order and you need them reordered then you want a priority queue or indexed collection instead.
> to upwards of around 800 events per second
Those are extremely tame performance requirements for insertion rate.
> A subset of the elements is either extracted or somehow iterated over for the purpose of feature generation. This would be either the last n rows/elements of the data structure (e.g. the last 1,000 or 10,000 events) or all the elements in the last x secs/mins (e.g. all the events in the last 10 min). Ideally, we want a structure that allows us to do this efficiently. In particular, a data structure that allows random access to the nth element without iterating through all the other elements is of value.
If you only ever want elements near the beginning, why do you want random access? Do you really want random access by index, or do you actually want random access by some other key, like time?
From what you've said I would suggest using an ordinary F# Map keyed on index maintained by a MailboxProcessor that can append a new event and retrieve an object that allows all events to be indexed, i.e. wrap the Map in an object that provides its own Item property and implementation of IEnumerable<_>. On my machine that simple solution takes 50 lines of code and can handle around 500,000 events per second.
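The 50-line solution mentioned above is not shown in the answer, so here is only a rough sketch of the idea, assuming events are appended one at a time and read back by index (the names are illustrative):

type private Msg<'T> =
    | Add of 'T
    | Snapshot of AsyncReplyChannel<Map<int, 'T>>

/// Serializes appends through a MailboxProcessor and hands out immutable
/// Map snapshots that support indexed lookup in O(log n).
type EventStore<'T>() =
    let agent = MailboxProcessor.Start(fun inbox ->
        let rec loop (events: Map<int, 'T>) (nextIndex: int) = async {
            let! msg = inbox.Receive()
            match msg with
            | Add item ->
                return! loop (Map.add nextIndex item events) (nextIndex + 1)
            | Snapshot reply ->
                reply.Reply events
                return! loop events nextIndex }
        loop Map.empty 0)

    member x.Add(item) = agent.Post(Add item)
    member x.Snapshot() = agent.PostAndReply Snapshot

// Usage
let store = EventStore<float>()
[ 1.0; 2.0; 3.0 ] |> List.iter store.Add
let snapshot = store.Snapshot()
printfn "%f" snapshot.[2]   // the third event appended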
Related
I have a stream of chars that I need to keep in a big data structure (can contain billions of chars)
I need to be able to:
store these chars quickly.
get all the chars quickly in order to print them for example
Delete a range of chars without leaving any gaps in the memory.
My first thought was a doubly linked list, but the problem is that it takes too long to get to the middle of the list (the beginning of the range) in order to delete.
To solve that, I was thinking about a skip list, which would make the search for this range faster, but then I'm facing the problem of having to re-index each node after deletion:
[0,1,2,3,4,5,6,7]
=> delete (3,4)
=> [0,1,2,5,6,7]
=> delete (3,4)
=> [0,1,2,7]
(in this example, after the first delete I need to give the numbers 5, 6, 7 new indexes)
What is the best way to do this?
It might be helpful to read about the span<T> data structure.
Related Answer: What is a "span" and when should I use one?
A span<T> is:
A very lightweight abstraction of a contiguous sequence of values of type T somewhere in memory.
Basically a struct { T * ptr; std::size_t length; } with a bunch of convenience methods.
A non-owning type (i.e. a "reference type" rather than a "value type"): it never allocates nor deallocates anything and does not keep smart pointers alive.
I would add that if you are processing a stream of characters, you will probably want to use buffering (or, perhaps more aptly, "chunking"), where each chunk is itself a span<char> of fixed size (each stored in a separate bit of memory) but tracked in a central array (or a more complex data structure like a doubly linked list, to facilitate quick deletion).
It would be an anti-pattern to attempt to actually maintain your entire stream of data in a single piece of contiguous physical memory (which you seem to suggest in part 3 of your request), especially if you plan on deleting chunks of it. There should be other ways to facilitate fast deletion without sacrificing performance elsewhere.
For example, if you wish to delete a range of characters that falls into a given span, you can create two new spans from the start and end of the original span, excluding the deleted characters, and then replace the original span instance in your larger data structure (e.g. if it were a doubly linked list) with the two new smaller spans. None of this requires copying the underlying data itself, just slicing up our lightweight references to the underlying data.
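As a small illustration of that slicing idea in F#: this sketch uses ReadOnlyMemory<char> rather than a byref Span so the pieces can be stored in an ordinary list (available on .NET Core 2.1+ or via the System.Memory package):

open System

// Each chunk of the stream is a ReadOnlyMemory<char>: a lightweight (reference, length)
// view over a backing buffer. Deleting a range inside a chunk just replaces that chunk
// with two smaller views; no characters are copied.
let deleteRange (chunk: ReadOnlyMemory<char>) (start: int) (length: int) =
    let before = chunk.Slice(0, start)
    let after  = chunk.Slice(start + length)
    [ before; after ] |> List.filter (fun m -> not m.IsEmpty)

// Usage: drop the characters at indexes 3 and 4 of "0123456".
let chunk = "0123456".AsMemory()
let pieces = deleteRange chunk 3 2
pieces |> List.iter (fun m -> printf "%s" (m.ToString()))   // prints 01256
printfn ""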
If your language of choice doesn't support span, or a similar structure, check out how span is implemented.
Depending on your language of choice, it may even have built-in support for streaming spans (as .NET Core 2.1+ (2018) does).
Any additional requirements (such as fast indexing to any point in your data stream, net of any deletions) can be satisfied by maintaining separate data structures that carry metadata about your spans (such as the suggested linked list). They will need updating when spans are deleted or added to, but because spans are a thin layer on top of large strings of characters, they reduce the cardinality of the data structures you are maintaining by several orders of magnitude. So while you could get fancy with maintaining a variety of heaps and maps to facilitate O(1) algorithms for every operation, you will probably find that basic structures and O(log n), or even O(N) (where N is actually N/chunk-size), maintenance operations are feasible.
Problem
I have two lists of objects. Each object contains the following:
GUID (allows determining whether objects are the same, from a business point of view)
Timestamp (updated to the current UTC time each time the object changes)
Version (positive integer; incremented each time the object changes)
Deleted (boolean flag; switched to "true" instead of actually deleting the object)
Data (some useful payload)
Any other fields if needed
Next, I need to sync the two lists according to these rules:
If an object with some GUID is present in only one list, it should be copied to the other list
If an object with some GUID is present in both lists, the instance with the lower Version should be replaced with the one having the greater Version (nothing to do if the versions are equal)
Real-world requirements:
Each list has 50k+ objects, each object is about 1 Kb
Lists are placed on different machines connected via the Internet (e.g., a mobile app and a remote server), thus the algorithm shouldn't waste much traffic or CPU
Most of the time (say, 96%) the lists are already in sync before the sync process starts, hence the algorithm should determine this with minimal effort
If there are any differences, most of time they are pretty small (3-5 objects changed/added)
Should proceed OK if one list is empty (and the other still has 50k+ items)
Solution #1 (currently implemented)
Client stores the time-of-last-sync-succeed (say T)
Both lists are asked for all objects having Timestamp > T (i.e. recently modified; in production it's ... > (T - day) for better robustness)
These lists of recently modified objects are synced naively:
items presented only in first list are saved to second list
items presented only in second list are saved to first list
other items have their Versions compared and are saved to the appropriate list (if needed)
Pros:
Works great with small changes
Almost fits the requirements
Cons:
Depends on T, which makes the algorithm fragile: it's easy to sync the latest updates, but hard to make sure the lists are completely synced (using a minimal T like 1970-01-01 just hangs the sync process)
My questions:
Is there any common / best-practice / proved way to sync object lists?
Are there any better solutions [than #1] for my case?
P.S. Already viewed, not duplicates:
Compare Two List Of Objects For Synchronization
Two list synchronization
Summary
All the answers have some worthwhile points. To summarize, here is the compiled answer I was looking for, based on the sync system as finally implemented and working:
In general, use Merkle trees. They are dramatically efficient in comparing large amounts of data.
If you can, rebuild your hash tree from scratch every time you need it.
Check the time required to rebuild the hash tree. Most likely it's pretty fast (e.g., in my case rebuilding the tree for 20k items on a Nexus 4 takes ~2 sec: 1.8 sec for fetching the data from the DB + 0.2 sec for building the tree; the server performs ~20x faster), so you don't need to store the tree in the DB and maintain it when data changes (my first try was rebuilding only the relevant branches; it's not too complicated to implement, but it is very fragile).
Nevertheless, it's OK to cache and reuse the tree if no data modifications were made at all. Once a modification happens, invalidate the whole cache.
Technical details
GUID is 32 chars long without any hyphens/braces, lowercase;
I use a 16-ary tree with a height of 4, where each branch corresponds to a character of the GUID. It may be implemented as an actual tree or as a map:
0000 → (hash of items with GUID 0000*)
0001 → (hash of items with GUID 0001*)
...
ffff → (hash of items with GUID ffff*);
000 → (hash of hashes 000_)
...
00 → (hash of hashes 00_)
...
() → (root hash, i.e. hash of hashes _)
Thus, the tree has 65536 leaves and requires 2 MB of memory; each leaf covers ~N/65536 data items. A binary tree would be 2x more efficient in terms of memory, but it's harder to implement.
I had to implement these methods:
getHash() — returns the root hash; used for the primary check (as mentioned, in 96% of cases that's all we need to test);
getHashChildren(x) — returns the list of child hashes x_ (at most 16); used for efficient, single-request discovery of data differences;
findByIdPrefix(x) — returns the items with GUID x*; x must contain exactly 4 chars; used for requesting leaf items;
count(x) — returns the number of items with GUID x*; when it is reasonably small, we can skip checking the tree branch-by-branch and transfer the whole bunch of items with a single request (a sketch of building these hashes follows).
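A rough sketch of how the leaf and intermediate hashes of such a prefix tree might be computed; the choice of MD5 and the per-item hash input are assumptions here, since the hashing scheme itself isn't specified above:

open System.Security.Cryptography
open System.Text

// Illustrative item shape: a 32-char lowercase GUID plus whatever gets hashed per item.
type Item = { Guid: string; Version: int; DataHash: string }

let hashOf (s: string) =
    use hasher = MD5.Create()
    hasher.ComputeHash(Encoding.UTF8.GetBytes s)
    |> Array.map (sprintf "%02x")
    |> String.concat ""

/// Leaf level: map from a 4-char GUID prefix to the hash of all items under that prefix.
let buildLeaves (items: Item list) =
    items
    |> List.groupBy (fun it -> it.Guid.Substring(0, 4))
    |> List.map (fun (prefix, group) ->
        let combined =
            group
            |> List.sortBy (fun it -> it.Guid)
            |> List.map (fun it -> it.Guid + string it.Version + it.DataHash)
            |> String.concat ""
        prefix, hashOf combined)
    |> Map.ofList

/// One level up: hash of the child hashes that share a shorter prefix
/// (Map.toList yields keys in sorted order, so the combination is deterministic).
let buildLevel (lower: Map<string, string>) (prefixLength: int) =
    lower
    |> Map.toList
    |> List.groupBy (fun (prefix, _) -> prefix.Substring(0, prefixLength))
    |> List.map (fun (prefix, children) ->
        prefix, hashOf (children |> List.map snd |> String.concat ""))
    |> Map.ofList

// getHash() is then the single entry of buildLevel (...) 0, and
// getHashChildren(x) is a lookup of the entries one level below prefix x.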
Since syncing is done per branch, transmitting small amounts of data, it's very responsive (you can check the progress at any time), very robust against unexpected termination (e.g., due to network failure), and easily restarts from the last point if needed.
IMPORTANT: sometimes you will get stuck in a conflicting state: {version_1 = version_2, but hash_1 != hash_2}. In this case you must make some decision (maybe with the user's help, or by comparing timestamps as a last resort) and overwrite one item with the other to resolve the conflict; otherwise you'll end up with unsynced and unsyncable hash trees.
Possible improvements
Implement transmitting (GUID, Version) pairs without the payload, to make requests lighter.
Two suggestions come to mind, the first one is possibly something you're doing already:
1) Don't send entire lists of items with timestamps > T. Instead, send a list of (UUID, Version) tuples for the objects with timestamps > T. Then the other side can figure out which objects it needs to update from that, and send back the UUIDs of those to request the actual objects. This avoids sending full objects that have timestamp > T but are nonetheless already present (or present with the latest Version) on the other side.
2) Don't process the full list at once, but in chunks, i.e. first sync 10%, then the next 10%, etc., to avoid transferring too much data at once for big syncs (and to allow for restart points if a connection should break). This can be done by, e.g., starting with all UUIDs whose checksum is equivalent to 1 modulo 10, then 2 modulo 10, and so on (sketched below).
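A tiny sketch of that partitioning idea; the checksum here is just a character sum over the GUID, purely illustrative:

// Assign each GUID to one of 10 chunks via a simple checksum, so chunks can be
// synced (and retried) independently.
let chunkOf (guid: string) =
    (guid |> Seq.sumBy int) % 10

let chunks (guids: string list) =
    guids |> List.groupBy chunkOf   // sync chunk 0 first, then chunk 1, ...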
Another possibility would be proactive syncing, e.g. asynchronously posting changes, possibly via UDP (unreliable, as opposed to TCP). You would still need to sync when you need current information, but chances are most of it will already be current.
You need to store not the time of the last synchronization, but the state of the objects (e.g. the hash of the object data) at the time of the last synchronization. Then you compare each list with the stored list and find which objects have changed on each side.
This is much more reliable than relying on time, because time requires that both sides have synchronized clocks giving precise time (and this is not the case on most systems). For the same reason, your idea of detecting changes based on time + version can be more error-prone than it initially seems.
Also you don't initially transfer object data but only GUIDs.
BTW we've made a framework (free with source) which addresses exactly your problems. I am not giving the link because some alternatively talented people would complain.
Data for various stocks is coming in continuously from various stock exchanges. Which data structure is suitable for storing this data?
Things to consider are:
a) Effective retrieval and update of data is required, as stock data changes every second or microsecond during trading hours.
I thought of using a heap, as the number of stocks would be more or less constant and the most frequently used operations are retrieval and update, so a heap should perform well for this scenario.
b) Need to show stocks which are currently trending (most active and least active by volume of shares traded, highest profit and loss on a particular day).
I am not sure how to go about this.
c) As storing to a database from any programming language has some latency, considering the number of stocks that will be traded during a particular time, how can you store all the transactional data persistently?
P.S.: This is an interview question from Morgan Stanley.
A heap doesn't support efficient random access (i.e. look-up by index) nor getting the top k elements without removing elements (which is not desired).
My answer would be something like:
A database would be the preferred choice for this, as, with a proper table structure and indexing, all of the required operations can be done efficiently.
So I suppose this is more a theoretical question about understanding of data structures (related to in-memory storage, rather than persistent).
It seems multiple data structures are the way to go:
a) Effective retrieval and update of data is required as stock data changes per second or microsecond during trading time.
A map would make sense for this one. A hash map or tree map allows for fast look-up.
b) How to show stocks which are currently trending (as in volume of shares being sold most active and least active, high profit and loss on a particular day)?
Just about any sorted data structure seems to make sense here (with the above map having pointers to the correct node, or pointing to the same node). One for activity and one for profit.
I'd probably go with a sorted (doubly) linked list. It takes minimal time to get the first or last n items. Since you have a pointer to the element through the map, updating takes as long as the map lookup plus the number of moves of that item required to get it sorted again (if any). If an item often moves many indices at once, a linked list would not be a good option (in which case I'd probably go for a binary search tree).
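A hedged sketch of the lookup side of this idea in F#; for brevity it computes the trending lists by sorting a snapshot on demand instead of maintaining the sorted linked list described above (the Quote shape and names are illustrative):

// Quotes keyed by symbol for fast lookup/update; trending views are computed on
// demand by sorting a snapshot, at O(n log n) per query, instead of keeping a
// continuously sorted structure.
type Quote = { Symbol: string; Price: float; Volume: int64; DayChange: float }

let update (book: Map<string, Quote>) (quote: Quote) =
    Map.add quote.Symbol quote book          // O(log n) update

let mostActive n (book: Map<string, Quote>) =
    book |> Map.toList |> List.map snd
         |> List.sortByDescending (fun q -> q.Volume) |> List.truncate n

let topGainers n (book: Map<string, Quote>) =
    book |> Map.toList |> List.map snd
         |> List.sortByDescending (fun q -> q.DayChange) |> List.truncate n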
c) How can you store all the transactional data persistently?
I understand this question as: if the connection to the database is lost or the database goes down at any point, how do you ensure there is no data corruption? If this is not it, I would ask for a rephrase.
Just about any database course should cover this.
As far as I remember, it has to do with creating another record, updating that record, and only setting the real pointer to it once it has been fully updated. Before this, you might also have to keep a pointer to the old record so you can check whether it has been deleted if something happens after the pointer is switched but before the deletion.
Another option is having an active transaction table which you add to when starting a transaction and remove from when a transaction completes (and which also stores all the details required to roll back or resume the transaction). Thus, whenever everything is okay again, you check this table and roll back or resume any transactions that have not yet completed.
If I had to choose, I would go for a hash table:
Reason: it is synchronized and thread-safe, with O(1) average-case complexity.
Provided:
1. A good hash function to avoid collisions.
2. A high-performance cache.
While this is a language-agnostic question, a few of the requirements jumped out at me. For example:
> Effective retrieval and update of data is required, as stock data changes every second or microsecond during trading hours.
The Java class HashMap uses the hash code of a key value to rapidly access values in its collection. It has an average-case O(1) runtime complexity for look-ups, which is ideal.
> Need to show stocks which are currently trending (most active and least active by volume of shares traded, highest profit and loss on a particular day).
This is an implementation-based issue. Your best bet is to implement a fast sorting algorithm, like quicksort or mergesort.
> As storing to a database from any programming language has some latency, considering the number of stocks that will be traded during a particular time, how can you store all the transactional data persistently?
A database would have been my first choice, but it depends on your resources.
I'm trying to translate an idea I had from OOP concepts to FP concepts, but I'm not quite sure how to best go about it. I want to have multiple collections of records, but have individual records linked across the collections. In C# I would probably use multiple Dictionary objects with an Entity-specific ID as a common key, so that given any set of the dictionaries, a method could extract a particular Entity using its ID/Name.
I guess I could do the same thing in F#, owing to its hybrid nature, but I'd prefer to be more purely functional. What is the best structure to do what I'm talking about here?
I had considered maybe a trie or a patricia trie, but I shouldn't need very deep name searching, and I'm more likely to have one or two of some things and lots of other things. It's a game design idea, so, for example, you'd only have one "Player" but could have tons of "Enemy1", "Enemy2" etc.
Is there a really good data structure for fast keyed lookup in FP, or should I just stick to Dictionary/Hashmaps?
The usual functional data structure for representing dictionaries in F# is Map (as pointed out by larsmans). Under the cover, this is implemented as a balanced binary tree, so the complexity of lookup is O(log N) for a tree containing N elements. This is slower than a hash-based dictionary (which is O(1) with good hash keys), but it allows adding and removing elements without copying the whole collection; only a part of the tree needs to be changed.
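For instance, a couple of lines using the game entities mentioned in the question (just plain F# Map usage):

// Immutable Map: O(log n) keyed lookup, and "updates" return a new map that
// shares most of its structure with the old one.
let world  = Map.ofList [ "Player", 100; "Enemy1", 20; "Enemy2", 35 ]
let world' = world |> Map.add "Enemy3" 50   // world itself is unchanged
let playerHitPoints = world.["Player"]      // keyed lookup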
From your description, I have the impression that you'll be creating the data structure only once and then using it for a long time without modifying it. In this case, you could implement a simple immutable wrapper type that uses Dictionary<_, _> under the cover, but takes all elements as a sequence in the constructor and doesn't allow modifications:
type ImmutableMap<'K, 'V when 'K : equality>(data:seq<'K * 'V>) =
// Store data passed in constructor in hash-based dictionary
let dict = new System.Collections.Generic.Dictionary<_, _>()
do for k, v in data do dict.Add(k, v)
// Provide read-only access
member x.Item with get(k) = dict.[k]
let f = new ImmutableMap<_, _>([1, "Hello"; 2, "Ahoj"])
let str = f.[1]
This should be faster than using F# Map as long as you don't need to modify the collection (or, more precisely, create copies with elements added/removed).
Use the F# module Collections.Map. My bet is that it implements balanced binary search trees, the data structure of choice for this task in functional programming.
Tries are hard to program and mostly useful in specialized applications such as search engine indexing, where they are commonly used as a secondary store on top of an array/database/etc. Don't use them unless you know you need to.
Currently I am looking for a way to develop an algorithm to analyse a large dataset (about 600M records). The records have the parameters "calling party", "called party", and "call duration", and I would like to create a graph of weighted connections among phone users.
The whole dataset consists of similar records: people mostly talk to their friends and don't dial random numbers, but occasionally a person calls "random" numbers as well. For analysing the records I was thinking about the following logic:
create an array of numbers to indicate which records (row numbers) have already been scanned
start scanning from the first line, and for that line's "calling party", "called party" combination, check for the same combination in the database
sum the call durations and divide the result by the sum of all call durations
add the line numbers of the summed records to the array created at the beginning
check the array to see whether the next record number has already been summed
if it has already been summed then skip the record, else perform step 2
I would appreciate it if any of you could suggest improvements to the logic described above.
P.S. The edges are directed; therefore (calling party, called party) is not equal to (called party, calling party).
Although this is not programming related, I would like to emphasize that, due to the law and out of respect for user privacy, all information that could possibly reveal the user's identity was hashed before the analysis.
As always with large datasets the more information you have about the distribution of values in them the better you can tailor an algorithm. For example, if you knew that there were only, say, 1000 different telephone numbers to consider you could create a 1000x1000 array into which to write your statistics.
Your first step should be to analyse the distribution(s) of data in your dataset.
In the absence of any further information about your data I'm inclined to suggest that you create a hash table. Read each record in your 600M dataset and calculate a hash address from the concatenation of calling and called numbers. Into the table at that address write the calling and called numbers (you'll need them later, and bear in mind that the hash is probably irreversible), add 1 to the number of calls and add the duration to the total duration. Repeat 600M times.
Now you have a hash table which contains the data you want.
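A small F# sketch of that aggregation pass, using a hash-based dictionary keyed on the (calling, called) pair; the record shape is an assumption:

open System.Collections.Generic

// One pass over the records, accumulating (call count, total duration) per
// directed (calling, called) pair in a hash table.
type CallRecord = { Calling: string; Called: string; Duration: float }

let aggregate (records: seq<CallRecord>) =
    let stats = Dictionary<string * string, int * float>()
    for r in records do
        let key = r.Calling, r.Called            // directed edge: (A,B) <> (B,A)
        match stats.TryGetValue key with
        | true, (count, total) -> stats.[key] <- (count + 1, total + r.Duration)
        | false, _             -> stats.[key] <- (1, r.Duration)
    stats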
Since there are 600M records, the dataset seems large enough to leverage a database (and not so large as to require a distributed database). So, you could simply load this into a DB (MySQL, SQL Server, Oracle, etc.) and run the following query:
select calling_party, called_party,
       sum(call_duration), avg(call_duration),
       min(call_duration), max(call_duration), count(*)
from call_log
group by calling_party, called_party
order by 7 desc
That would be a start.
Next, you would want to run some Association analysis (possibly using Weka), or perhaps you would want to analyze this information as cubes (possibly using Mondrian/OLAP). If you tell us more, we can help you more.
Algorithmically, what the DB is doing internally is similar to what you would do yourself programmatically:
Scan each record
Find the record for each (calling_party, called_party) combination, and update its stats.
A good way to store and find the record for a (calling_party, called_party) pair would be to use a hash function and find the matching record in its bucket.
Although it may be tempting to create a two-dimensional array indexed by (calling_party, called_party), that would be a very sparse array (very wasteful).
How often will you need to perform this analysis? If this is a large, unique dataset and thus analysed only once or twice, don't worry too much about performance; just get it done, e.g., as Amrinder Arora says, by using simple, existing tooling you happen to know.
You really want more information about the distribution, as High Performance Mark says. For starters, it would be nice to know the count of unique phone numbers, the count of unique phone number pairs, and the mean, variance and maximum of the count of calling/called phone numbers per unique phone number.
You really want more information about the analysis you want to perform on the result. For instance, are you more interested in holistic statistics or in identifying individual clusters? Do you care more about following the links forward (determining whom X frequently called) or following the links backward (determining who frequently called X)? Do you want to project overviews of this graph into low-dimensional spaces, i.e. 2D? Should it be easy to identify indirect links - e.g. X is near {A, B, C}, all of whom are near Y, so X is sort of near Y?
If you want fast and frequently adapted results, then be aware that a dense representation with good memory and temporal locality can easily make a huge difference in performance. In particular, it can easily outweigh a factor of ln N in big-O notation; you may benefit from a dense, sorted representation over a hashtable. And databases? Those are really slow. Don't touch them if you can avoid it at all; they are likely to be a factor of 10000 slower, or more, the more complex the queries you want to perform on the result are.
Just sort records by "calling party" and then by "called party". That way each unique pair will have all its occurrences in consecutive positions. Hence, you can calculate the weight of each pair (calling party, called party) in one pass with little extra memory.
For sorting, you can sort small chunks separately and then do an N-way merge sort. That's memory-efficient and can be easily parallelized.
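As a small in-memory illustration of the single pass over sorted records (for 600M records the sort itself would be the external, chunked N-way merge described above):

// Sort by (calling, called) so each directed pair is contiguous, then accumulate
// the duration of consecutive equal pairs in one pass.
let edgeWeights (records: (string * string * float) list) =
    records
    |> List.sortBy (fun (calling, called, _) -> calling, called)
    |> List.fold (fun acc (calling, called, duration) ->
        match acc with
        | ((c1, c2), total) :: rest when c1 = calling && c2 = called ->
            ((c1, c2), total + duration) :: rest
        | _ -> ((calling, called), duration) :: acc) []
    |> List.rev

// edgeWeights [ "A", "B", 10.0; "A", "B", 5.0; "B", "A", 3.0 ]
//   = [ (("A", "B"), 15.0); (("B", "A"), 3.0) ]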