Create sets of minimal cardinality from a set of pairs - algorithm

I have a set of pairs of IDs like
(123;1765)
(1212;8977)...
I need to separate those pairs into n groups with an individual size (number of pairs) each. Those groups should have minimal cardinality (= there should be as few different IDs as possible in each group).
Are there any existing algorithms which solve this problem? I'm not sure where or how to search for one.
This is necessary because I am currently working on the load balancing of one of my projects, and each node should have to load as few IDs as possible because of limited RAM (each ID is connected to a larger dataset).
Edit:
Some background:
Different nodes in a cluster have to compare datasets identified by IDs. Each comparison is a pair of IDs (compare the dataset of ID1 with that of ID2). Each node gets a bunch of pairs so it knows which IDs it has to compare, and loads the corresponding datasets into RAM. A master node divides a big bunch of pairs into smaller bunches and distributes them to the slave nodes. Because each node can only store a limited number of datasets, those smaller bunches need to contain as few different IDs as possible. And because the nodes have different amounts of RAM, the groups with minimal cardinality should have different sizes.
The comparison is symmetric, so compare(ID1, ID2) is the same as compare(ID2, ID1); hence each pair is unique. Which datasets need to be compared is defined by a client which sends those jobs to the master as a bunch of pairs of IDs.
An example:
A client wants the comparison of datasets (1;2), (7;9), (9;105), (7;105), (2;4), (4;1) (usually there would be far more comparisons, typically millions).
The client sends those pairs to the master, which has two registered slaves. Now the master needs to divide that stack of work into two groups, but the more different IDs a group contains, the more datasets the corresponding slave needs to load (an ID corresponds to a specific dataset, remember?).
So ideally the master would create groups like ((1;2), (2;4), (4;1)) (containing only 3 different IDs, so the slave only has to load 3 datasets) and ((7;9), (9;105), (7;105)) (again just three IDs), instead of:
((1;2), (9;105), ...) and ((2;4), (7;105), ...). Here both slaves need to load 4 or more datasets each, and, for example, both slaves need to load datasets no. 2 and 105.
This needs to be optimized somehow.

My first instinct is to say that perhaps this could be resolved with a special cluster analysis where you customize the aggregation and distance functions.
The cluster members would be pairs.
The cluster aggregate would be the set-theoretical union of all pairs in the cluster (this replaces the average or median of the standard approach).
The distance between a pair and a cluster would be the number of elements in the pair that are not found in the cluster aggregate, i.e. the cardinality of the set difference (this replaces the Euclidean distance of the standard approach).
Some cluster algorithms have you set the number of desired clusters in advance, so here you would set it to two.
And finally, because you need to balance things so that each cluster contains its prescribed number of pairs, some further tweaking is required, but it is still doable.
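To make the idea concrete, here is a minimal greedy sketch in Java (an illustration only, not a tuned implementation; the pairs and the two capacities of three are taken from your example): each pair is assigned to the group where it adds the fewest new IDs, with ties broken toward the emptier group.

    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    public class GreedyGrouper {
        public static void main(String[] args) {
            // Example pairs from the question; capacities = pairs per slave.
            int[][] pairs = {{1, 2}, {7, 9}, {9, 105}, {7, 105}, {2, 4}, {4, 1}};
            int[] capacity = {3, 3};   // assumes capacities sum to >= pairs.length

            List<List<int[]>> groups = new ArrayList<>();
            List<Set<Integer>> aggregates = new ArrayList<>(); // union of IDs per group
            for (int ignored : capacity) {
                groups.add(new ArrayList<>());
                aggregates.add(new HashSet<>());
            }

            for (int[] pair : pairs) {
                int best = -1;
                int bestCost = Integer.MAX_VALUE;
                for (int g = 0; g < groups.size(); g++) {
                    if (groups.get(g).size() >= capacity[g]) continue; // group is full
                    // "Distance": how many IDs of this pair are new to the group.
                    int cost = (aggregates.get(g).contains(pair[0]) ? 0 : 1)
                             + (aggregates.get(g).contains(pair[1]) ? 0 : 1);
                    // Prefer lower cost; on ties, prefer the emptier group.
                    if (cost < bestCost
                            || (cost == bestCost && groups.get(g).size() < groups.get(best).size())) {
                        bestCost = cost;
                        best = g;
                    }
                }
                groups.get(best).add(pair);
                aggregates.get(best).add(pair[0]);
                aggregates.get(best).add(pair[1]);
            }

            for (int g = 0; g < groups.size(); g++) {
                System.out.println("group " + g + " -> distinct IDs: " + aggregates.get(g));
            }
        }
    }

With this input the sketch happens to recover the ideal split ({1, 2, 4} and {7, 9, 105}), but a single greedy pass is order-sensitive and gives no optimality guarantee; it is only meant to illustrate the aggregate/distance idea.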
But you say you will have millions of pairs to compare. The processing required for cluster analysis grows steeply with the size of the input. In this situation, it is worth researching whether your problem is NP-hard. I'm not well versed in that, but I suspect it is, in which case a true optimum will always escape you.
But even if you discover that your problem is in fact NP-hard, you can still optimize; you just won't be able to guarantee arrival at the global optimum in a reasonable amount of time. So, for instance, you can break your set of pairs into subsets and run an algorithm such as the above on each subset. That may still be an improvement.

Related

Distribution of key/value pairs in JBoss Data Grid

I am loading 20 million non-expiring entries into JBoss Data Grid using Hot Rod clients. My Hot Rod clients are running on 5 different machines to load the data. The entries were added successfully. We have configured a replication factor of 2, so there will be a total of 40 million entries in the grid. We found a variation of more than 10% in the number of entries added to each node. For example, one node has 7.8 million entries while another node has 12 million entries.
So I was wondering why the entries are not distributed equally; ideally each node should have about 10 million entries. Our objective with the above test was to check whether the load/requests get distributed equally across all the nodes.
Any pointers on how the key/value pairs are distributed in JDG would be appreciated.
In Infinispan the hash space is divided into segments which then get mapped to the nodes in the cluster.
Entries are hashed by their keys by applying the MurmurHash3 function to them; this determines the segment which owns the key. It is possible that your keys are causing a somewhat uneven distribution. You could try increasing the number of segments in your configuration. With your cluster, use at least 100 segments.
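For illustration, this is roughly how the segment count can be raised programmatically; the declarative XML configuration has an equivalent attribute, and exact API details may vary across Infinispan/JDG versions:

    import org.infinispan.configuration.cache.CacheMode;
    import org.infinispan.configuration.cache.Configuration;
    import org.infinispan.configuration.cache.ConfigurationBuilder;

    // Distributed cache with 2 owners (the replication factor from the question)
    // and a larger number of segments for a finer-grained key distribution.
    Configuration config = new ConfigurationBuilder()
            .clustering()
                .cacheMode(CacheMode.DIST_SYNC)
                .hash()
                    .numOwners(2)       // each entry is stored on 2 nodes
                    .numSegments(100)   // at least 100 segments, as suggested above
            .build();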
Also, I had to look up the meaning of "crore" and "lakh", as I had no idea what they were. You should probably use the 10M and 100K notations instead to make it easier to understand.

How to decide order of a B-tree

B-trees are said to be particularly useful for huge amounts of data that cannot fit in main memory.
My question is then: how do we decide the order of a B-tree, i.e. how many keys to store in a node, or how many children a node should have?
Everywhere I look, people seem to use 4 or 5 keys per node. How does that solve the huge-data and disk-read problem?
Typically, you'd choose the order so that the resulting node is as large as possible while still fitting into the block device page size. If you're trying to build a B-tree for an on-disk database, you'd probably pick the order such that each node fits into a single disk page, thereby minimizing the number of disk reads and writes necessary to perform each operation. If you wanted to build an in-memory B-tree, you'd likely pick either the L2 or L3 cache line sizes as your target and try to fit as many keys as possible into a node without exceeding that size. In either case, you'd have to look up the specs to determine what size to use.
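As a back-of-the-envelope illustration for the on-disk case (all sizes below are assumptions; substitute your real page, key, and pointer widths):

    public class BTreeOrder {
        public static void main(String[] args) {
            final int PAGE_SIZE = 4096;  // bytes per disk page (assumed)
            final int KEY_SIZE = 8;      // e.g. a 64-bit integer key (assumed)
            final int POINTER_SIZE = 8;  // bytes per child pointer (assumed)

            // A node of order m holds m child pointers and m - 1 keys:
            //   m * POINTER_SIZE + (m - 1) * KEY_SIZE <= PAGE_SIZE
            // Solving for the largest integer m:
            int order = (PAGE_SIZE + KEY_SIZE) / (KEY_SIZE + POINTER_SIZE);
            System.out.println("order = " + order); // 256 with the numbers above
        }
    }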
Of course, you could always just experiment and try to determine this empirically as well. :-)
Hope this helps!

Task scheduling with Spark

I am running a fairly large task on my 4-node cluster. I am reading around 4 GB of filtered data from a single table and running Naive Bayes training and prediction. I have an HBase region server running on a single machine which is separate from the Spark cluster running in fair scheduling mode, although HDFS is running on all machines.
While executing, I am seeing strange task distribution in terms of the number of active tasks on the cluster. I observed that only one, or at most two, active tasks are running on one or two machines at any point in time while the others sit idle. My expectation was that the data in the RDD would be divided and processed on all the nodes for operations like count and distinct. Why are all nodes not being used for large tasks of a single job? Does having HBase on a separate machine have anything to do with this?
Some things to check:
Presumably you are reading in your data using hadoopFile() or hadoopRDD(): consider setting the [optional] minPartitions parameter to make sure the number of partitions is equal to the number of nodes you want to use.
As you create other RDDs in your application, check the number of partitions of those RDDs and how evenly the data is distributed across them. (Sometimes an operation can create an RDD with the same number of partitions but make the data within it badly unbalanced.) You can check this by calling the glom() method, printing the number of elements of the resulting RDD (the number of partitions) and then looping through it and printing the number of elements in each of the arrays; see the sketch after this list. (This introduces communication, so don't leave it in your production code.)
Many of the API calls on RDD have optional parameters for setting the number of partitions, and then there are calls like repartition() and coalesce() that can change the partitioning. Use them to fix problems you find using the above technique (but sometimes it will expose the need to rethink your algorithm.)
Check that you're actually using RDDs for all your large data, and haven't accidentally ended up with some big data structure on the master.
All of these assume that you have data skew problems rather than something more sinister. That's not guaranteed to be true, but you need to check your data skew situation before looking for something complicated. It's easy for data skew to creep in, especially given Spark's flexibility, and it can make a real mess.
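For what it's worth, here is a small diagnostic sketch of the glom() check mentioned in the list above, using the Java API; the input path and partition count are placeholders:

    import java.util.List;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class PartitionCheck {
        public static void main(String[] args) {
            JavaSparkContext sc = new JavaSparkContext(
                    new SparkConf().setAppName("partition-check"));
            // Ask for at least 4 partitions (one per node in a 4-node cluster).
            JavaRDD<String> rdd = sc.textFile("hdfs:///path/to/input", 4);
            System.out.println("partitions: " + rdd.getNumPartitions());

            // glom() wraps each partition in a list; mapping each list to its
            // size shows how evenly the data is spread. Debugging only, since
            // it triggers an extra job.
            List<Integer> sizes = rdd.glom().map(List::size).collect();
            for (int i = 0; i < sizes.size(); i++) {
                System.out.println("partition " + i + ": " + sizes.get(i) + " elements");
            }
            sc.close();
        }
    }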

Write output of two different Hadoop jobs to same set of reducers

I have a scenario where I need to run two Hadoop jobs calculating n-gram statistics for two different corpora and make sure that they write each n-gram (and its score) to the same reducer (so that in the future I can read the data locally and compare and contrast the two scores from the two corpora). For example, if job J1 executes one of its reducers on machine M and writes n-gram N locally, I would like job J2 to also write n-gram N to the same machine M.
I know how to compute n-gram statistics for a corpus (for reference, see this publication from Google). I have also defined my custom partitioner (taking a hash based on the first two words of the n-gram). Now how do I make sure that two different runs of the same program (on two different corpora) end up writing corresponding output to the same reducers?
Check out MultipleInputs. By pointing two sibling mappers at sibling datasets, you can avoid running an identity map over the combined set before reducing.
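A sketch of how this might look; the mapper class names and paths are hypothetical, and the essential point is that the partitioner hashes only the n-gram (never any corpus/source tag) and that both corpora flow through one job with the same number of reducers:

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Deterministic partitioner: identical n-grams always map to the same
    // reducer index, provided the number of reducers stays the same.
    public class NgramPartitioner extends Partitioner<Text, Text> {
        @Override
        public int getPartition(Text ngram, Text value, int numReducers) {
            // Hash the first two words of the n-gram (as in the question);
            // never mix in the corpus/source tag.
            String[] words = ngram.toString().split(" ");
            String prefix = words.length >= 2 ? words[0] + " " + words[1]
                                              : ngram.toString();
            return (prefix.hashCode() & Integer.MAX_VALUE) % numReducers;
        }
    }

    // Driver sketch (CorpusAMapper/CorpusBMapper are hypothetical mappers
    // that tag records with their corpus):
    //
    //   Job job = Job.getInstance(conf, "ngram-compare");
    //   MultipleInputs.addInputPath(job, new Path("corpusA"),
    //           TextInputFormat.class, CorpusAMapper.class);
    //   MultipleInputs.addInputPath(job, new Path("corpusB"),
    //           TextInputFormat.class, CorpusBMapper.class);
    //   job.setPartitionerClass(NgramPartitioner.class);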

Comparing huge trees over slow network

I have a tree like data structure:
I have a list of work orders
each work order has several operations
each operation has several roles
each role has several resource nodes
Each work order, operation, role and resource node has a number of attributes.
I have two instances of this data structure: a master and a slave. I wish to periodically update the slave and keep it in sync with the master. My question is: how do I do this really fast?
The problems are:
those two instances are huge
those two instances are on separate networks, connected by a low-throughput link
speed is a critical parameter
[edit] I do not have access to the transaction log on the master, just the state of the master at this point in time (I have only read access to SQL views and that's it). [/edit]
What I was thinking of was creating a Merkle tree on both sides by hashing together the node ID, the node attributes, and the child nodes' hashes (bottom-up, obviously).
And then comparing the trees by:
transmitting the list of top-level hashes over the network
determining the nodes which are not equal
recursively repeating the process for the mismatching nodes
Thus I get a list of nodes which are not in sync and then I update them.
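As a sanity check of the hashing step, here is a minimal bottom-up sketch of what I have in mind (field names are made up; any stable cryptographic hash would do):

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    class TreeNode {
        String id;                          // work order / operation / role / resource ID
        String attributes;                  // attributes serialized in a canonical order
        List<TreeNode> children = new ArrayList<>();
        byte[] hash;                        // filled in by computeHash()

        // Bottom-up: child hashes are computed first and folded into the parent.
        byte[] computeHash() throws Exception {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            md.update(id.getBytes(StandardCharsets.UTF_8));
            md.update(attributes.getBytes(StandardCharsets.UTF_8));
            for (TreeNode child : children) {   // children must be in a stable order!
                md.update(child.computeHash());
            }
            hash = md.digest();
            return hash;
        }

        // Two subtrees are in sync iff their root hashes match; recurse otherwise.
        boolean inSync(TreeNode other) {
            return Arrays.equals(this.hash, other.hash);
        }
    }

The crucial detail is that attributes and children must be serialized in a deterministic order on both master and slave; otherwise identical subtrees hash differently.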
The problem I see here is that I have to recreate both Merkle trees every time I compare instances, which costs time.
So, I was wondering if there is any other algorithm which I can try out?
