Partitioning N arrays into K groups with constraints - algorithm

I have been stuck on this problem and can't find an efficient solution.
I have N arrays (up to 10 million), each with at most 100 elements. These arrays contain numbers from 1 to 10000.
My problem is to partition these arrays into K groups such that I minimize the duplicates across all the arrays. For example, given an array containing 1, 4, 10, 100 and another containing 1, 100, I would like them to go into the same group, because that minimizes duplication. My problem has two constraints:
I don't want the number of unique elements in a group of arrays to exceed 110. So if I have an array of size 100 and another array of size 100 that is a 60% match, I would rather create a new group, because merging them raises the number of unique elements to 140, and it would keep increasing.
The number of vectors in the groups should be uniformly distributed.
What I have tried: grouping the arrays by size in decreasing order, then finding the unique vectors using hashing and applying a greedy algorithm of maximum match subject to the constraints. The greedy approach doesn't work well, because the result depends entirely on which partitions I picked first. I couldn't figure out how DP could be applied, because the number of combinations for this many vectors is just huge. I am not sure which methodology I should use.
One failure case of my algorithm: say there are two vectors that are mutually exclusive, but if I formed a group with them, a third vector would match that group 100%. Instead, the third vector was added to a group it matched only 30%, which filled that group. This increases duplication, because the third vector should have formed a group with the first two.

A simple approach, though intensive in computation and memory, is to compare each array against all 10 million arrays to find the one with the maximum number of matching elements. Store the matched numbers in an array, then match those arrays against each other in the same way, with the criterion that a match should be at least 60%.
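A minimal sketch of that brute-force matching step, assuming the arrays fit in memory as Python sets. The names `best_match` and `min_ratio` are hypothetical, and the 60% ratio is measured here against the smaller of the two arrays, which is only one possible reading of "match".

```python
def best_match(arrays, i, min_ratio=0.6):
    """Return the index of the array sharing the most elements with arrays[i].

    Returns None if no other array reaches min_ratio, measured against the
    smaller of the two arrays (an assumption, not stated in the answer).
    """
    target = set(arrays[i])
    best_index, best_overlap = None, 0
    for j, other in enumerate(arrays):
        if j == i:
            continue
        other_set = set(other)
        overlap = len(target & other_set)
        smaller = min(len(target), len(other_set))
        if smaller and overlap / smaller >= min_ratio and overlap > best_overlap:
            best_index, best_overlap = j, overlap
    return best_index


arrays = [[1, 4, 10, 100], [1, 100], [7, 8, 9]]
match_for_first = best_match(arrays, 0)
```

Running this for every array is quadratic in N, so for N near 10 million it is probably only practical on a sample.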

Related

Is there an algorithm for generating random pairs of a set of numbers?

I have a discontinuous list of numbers N (e.g. { 1, 2, 3, 6, 8, 10 }) and I need to progressively create random pairs of numbers from N and store them in a list that never contains the same pair twice.
For example, a list of 3 different numbers has 6 possible pairs (not counting pairs of a number with itself).
For the list { 4, 8, 9 }, the possible pairs are:
(4,8) (4,9) (8,4) (8,9) (9,4) (9,8)
With a list of 30 numbers, for example, there are 870 possible pairs, and my current method gets less and less efficient the more pairs have already been generated.
For now, my strategy with a list of 30 numbers is:
N = { 3, 8, 10, 15, 16, ... } // size = 30
// Lets say I have already a list with 200 different pairs
my_pairs = { (8,16), (23, 32), (16,10), ... }
// Get two random numbers in the list
rn1 = random(N)
rn2 = random(N)
Loop through my_pairs to see if the pair (rn1,rn2) has already been generated
If there is one, we pick two new numbers rn1 & rn2 at random and retry adding them to my_pairs
If not then we add it to the list
The issue is that the more pairs we have in my_pairs, the less likely it is for a pair to not be in that list. So we have to check multiple random pairs multiple times and go through the list every time.
I could try to generate all possible pairs at the start, shuffle the list and pop one element each time I need to add a random pair to my list.
But storing all possible pairs will take a lot of space as my list of numbers grows (e.g. 9900 possible pairs for 100 different numbers).
And I add numbers to N during my process, so I can't afford to recalculate all possible pairs every time.
Is there an algorithm for generating random unique pairs?
Maybe it would be faster using matrices or storing my pairs in some sort of tree graph?
It depends a lot on what you want to optimize for.
If you want to keep things simple and easy to maintain, having a hash set of all generated pairs sounds reasonable. The assumption here is that both checking membership and adding a new element are O(1) on average.
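A minimal sketch of the hash set variant, assuming the elements of N are distinct and hashable. The function name `add_random_pair` is made up for illustration.

```python
import random


def add_random_pair(numbers, seen):
    """Draw ordered pairs of distinct elements until one is new, then record it.

    Assumes the elements of `numbers` are distinct and that at least one
    pair is still unused; otherwise this loops forever.
    """
    while True:
        a, b = random.sample(numbers, 2)  # two distinct elements, random order
        if (a, b) not in seen:            # O(1) average membership test
            seen.add((a, b))
            return (a, b)


numbers = [4, 8, 9]
seen = set()
pairs = [add_random_pair(numbers, seen) for _ in range(6)]
```

This still retries on collisions, so it only removes the linear scan of my_pairs, not the repeated draws.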
If you worry about space requirements, because you regularly use up to 70% of the possible pairs, then you could optimize for space. To do that, I'd first establish a mapping between each possible pair and a single integer. I'd do so in a way that allows for easy addition of more numbers to N.
      +0      +1
 0   (0,1)   (1,0)
 2   (0,2)   (2,0)
 4   (1,2)   (2,1)
 6   (0,3)   (3,0)
 8   (1,3)   (3,1)
10   (2,3)   (3,2)
Something like this would map a single integer i to a pair (a,b) of indices into your sequence N, which you could then look up in N to turn them into an actual pair of elements. You can come up with formulas for this mapping, although the conversion from i to (a,b) will entail a square root somewhere.
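One possible set of formulas for this mapping, sketched in Python and checked against the table above. The function names are hypothetical, and `math.isqrt` stands in for the square root so that everything stays in integer arithmetic.

```python
import math


def index_to_pair(i):
    """Map integer i to the index pair (a, b) following the table above."""
    k, flipped = divmod(i, 2)              # k numbers the unordered pairs
    b = (1 + math.isqrt(8 * k + 1)) // 2   # largest b with b*(b-1)/2 <= k
    a = k - b * (b - 1) // 2
    return (b, a) if flipped else (a, b)


def pair_to_index(a, b):
    """Inverse of index_to_pair for a != b."""
    lo, hi = min(a, b), max(a, b)
    k = hi * (hi - 1) // 2 + lo
    return 2 * k + (1 if a > b else 0)


first_rows = [index_to_pair(i) for i in range(6)]
```

With this ordering, appending a new number at position m of N only adds indices from m*(m-1) upward, so pairs that were already picked keep their integers.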
When you have this, the task of picking a pair from a set of arbitrary numbers becomes the task of picking an integer from a contiguous range of integers. Now you could use a bitmap to very efficiently store, for each index, whether you have already picked it in the past. For low percentages of picked pairs, that bitmap may consume more memory than a hash map of only the picked values would, but as you approach 70% of all pairs getting picked, it will be far more efficient. I would expect a typical hash map entry to consume at least 3×64=192 bits of storage, so the bitmap will start saving memory once 1/192=0.52% of all values have been picked. Growing the bitmap might still be expensive, so estimating the maximal size of N might help with allocating enough memory up front.
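A rough sketch of such a bitmap on top of a `bytearray`, assuming the maximal number of pairs is known up front. The class and function names are hypothetical, and picking still works by retrying until an unset bit is hit.

```python
import random


class PickedBitmap:
    """One bit per pair index, set once that index has been picked."""

    def __init__(self, size):
        self.size = size
        self.bits = bytearray((size + 7) // 8)

    def is_set(self, i):
        return bool(self.bits[i >> 3] & (1 << (i & 7)))

    def set(self, i):
        self.bits[i >> 3] |= 1 << (i & 7)


def pick_new_index(bitmap, picked_count):
    """Retry random indices until an unpicked one is found, then mark it."""
    if picked_count >= bitmap.size:
        raise ValueError("all indices have already been picked")
    while True:
        i = random.randrange(bitmap.size)
        if not bitmap.is_set(i):
            bitmap.set(i)
            return i


bitmap = PickedBitmap(870)   # 30 numbers -> 30 * 29 = 870 ordered pairs
picked = [pick_new_index(bitmap, count) for count in range(600)]
```

For 870 pairs the bitmap takes 109 bytes regardless of how many pairs have been picked.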
If you have a costly random number generator, or worry about the worst-case time complexity of the whole thing, then you might want to avoid multiple attempts that might land on already picked pairs. To achieve that, you would probably store the set of all picked pairs in some kind of search tree where each node also keeps track of how many leaves its subtree contains. That way you could generate a random number in a range that corresponds to the number of pairs that haven't been picked yet, and then use the information in that tree to add to the chosen value the number of all already picked indices smaller than that. I haven't worked out all the details, but I believe this should make O(log n) worst-case time complexity possible, as opposed to the O(1) average case but O(n) or even O(∞) worst case we had before.
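One way this could look in practice, sketched with a Fenwick (binary indexed) tree instead of the search tree described above. It counts the indices that are still free rather than the picked ones, and it assumes the total number of pair indices is fixed in advance. `FreeIndexPicker` is a made-up name.

```python
import random


class FreeIndexPicker:
    """Fenwick tree over [0, size) counting the indices not yet picked."""

    def __init__(self, size):
        self.size = size
        self.free = size
        # Fenwick tree of an all-ones array: node i covers (i & -i) slots.
        self.tree = [0] + [i & -i for i in range(1, size + 1)]

    def _take_kth_free(self, k):
        """Remove and return the k-th (0-based) unpicked index in O(log size)."""
        pos = 0
        step = 1 << self.size.bit_length()
        while step:
            nxt = pos + step
            if nxt <= self.size and self.tree[nxt] <= k:
                pos = nxt
                k -= self.tree[nxt]
            step >>= 1
        # pos is the 0-based index of the k-th free slot; mark it as picked.
        i = pos + 1
        while i <= self.size:
            self.tree[i] -= 1
            i += i & -i
        self.free -= 1
        return pos

    def pick(self):
        if self.free == 0:
            raise ValueError("all indices have already been picked")
        return self._take_kth_free(random.randrange(self.free))


picker = FreeIndexPicker(870)
first_three = [picker.pick() for _ in range(3)]
```

Each pick uses exactly one random number and never retries, at the cost of one machine integer per index instead of one bit.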

Looking for an algorithm to a unique problem

I have six arrays that are each given a (not necessarily unique) value from one to fifty. I am also given a number of items to split between them. The value of each item is defined by the array it is in. Arrays can hold infinite or zero items, but the sum of items in all arrays must equal the original number of items given.
I want to find the best configuration of items in arrays, where the sums of item values in the individual arrays are as close as possible to each other.
For instance, let's say that I have three arrays with a value of 10 and three arrays with a value of 20. For nine items, one would go in each of the '20' arrays and two would go into each of the '10' arrays so that the sum of each array is 20 and the total number of items is nine.
I can't add a fractional number of items to an array, and the numbers are hardly ever perfectly divisible like that example, but there always exists a solution where the difference between the sums is minimal.
I'm currently using brute force to solve this problem, but performance suffers with larger numbers of items. I feel like there is a mathematical answer to this problem, but I wouldn't even know where to begin.
It is easy to write a greedy algorithm that comes up with an approximate solution. Just always add the next item to the array with the lowest sum of values.
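A short sketch of that greedy rule using a heap, assuming `values` holds the per-item value of each array. The function name is hypothetical, and ties are broken by the lower array index.

```python
import heapq


def greedy_distribute(values, item_count):
    """Place items one by one into the array whose current sum is lowest."""
    counts = [0] * len(values)
    # Heap entries are (current sum, array index); ties go to the lower index.
    heap = [(0, i) for i in range(len(values))]
    heapq.heapify(heap)
    for _ in range(item_count):
        total, i = heapq.heappop(heap)
        counts[i] += 1
        heapq.heappush(heap, (total + values[i], i))
    return counts


counts = greedy_distribute([10, 10, 10, 20, 20, 20], 9)
```

On the example from the question this puts two items in each '10' array and one in each '20' array.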
With this approach, the number of items in the array with the highest value should be within 1 item of being correct.
For each candidate count of items in the array with the highest value, you can repeat the exercise, getting the array with the second highest value to within 1.
Continue through all of them, and with 6 arrays you'll wind up with 3^5 = 243 possible arrangements of items (note that the number of items in the last array is entirely determined by the first 5). Pick the best of these and your combinatorial explosion is contained.
(This approach should work if you're trying to minimize the value difference between the largest and smallest array, and have a fixed number of arrays. )

Best algorithm to find N unique random numbers in VERY large array

I have an array with, for example, 1000000000000 elements (integers). What is the best approach to pick, for example, only 3 random and unique elements from this array? The elements must be unique in the whole array, not just within the list of N (3 in my example) picked elements.
I read about reservoir sampling, but it only provides a method to pick random numbers, which can be non-unique.
If the odds of hitting a non-unique value are low, your best bet will be to select 3 random numbers from the array, then check each against the entire array to ensure it is unique - if not, choose another random sample to replace it and repeat the test.
If the odds of hitting a non-unique value are high, this increases the number of times you'll need to scan the array looking for uniqueness and makes the simple solution non-optimal. In that case you'll want to split the task of ensuring unique numbers from the task of making a random selection.
Sorting the array is the easiest way to find duplicates. Most sorting algorithms are O(n log n), but since your keys are integers Radix sort can potentially be faster.
Another possibility is to use a hash table to find duplicates, but that will require significant space. You can use a smaller hash table or Bloom filter to identify potential duplicates, then use another method to go through that smaller list.
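import random
# One counter per possible value; index 0 corresponds to MININT.
counts = [0] * (MAXINT - MININT + 1)
for value in Elements:
    counts[value - MININT] += 1
# Keep the values that occur exactly once in the whole array.
uniques = [i + MININT for i, c in enumerate(counts) if c == 1]
# Pick 3 of them at random (requires at least 3 unique values).
result = random.sample(uniques, 3)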
I assume that you have a reasonable idea what fraction of the array values are likely to be unique. So you would know, for instance, that if you picked 1000 random array values, the odds are good that one is unique.
Step 1. Pick 3 random hash algorithms. They can all be the same algorithm, except that you add different integers to each as a first step.
Step 2. Scan the array. Hash each integer all three ways, and for each hash algorithm, keep track of the X lowest hash codes you get (you can use a priority queue for this), and keep a hash table of how many times each of those integers occurs.
Step 3. For each hash algorithm, look for a unique element in that bucket. If it is already picked in another bucket, find another. (Should be a rare boundary case.)
That is your set of three random unique elements. Every unique triple should have even odds of being picked.
(Note: For many purposes it would be fine to just use one hash algorithm and find 3 things from its list...)
This algorithm will succeed with high likelihood in one pass through the array. What is better yet is that the intermediate data structure that it uses is fairly small and is amenable to merging. Therefore this can be parallelized across machines for a very large data set.
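A rough single-pass sketch of Steps 2 and 3 for one hash algorithm, not a tuned implementation. The names `salted_hash` and `lowest_hash_unique` are hypothetical, `keep` plays the role of X, and BLAKE2b with a salt is just one convenient way to get several independent hash functions.

```python
import hashlib
import heapq


def salted_hash(value, salt):
    """64-bit hash of an integer; each salt acts as a separate hash algorithm."""
    digest = hashlib.blake2b(str(value).encode(), digest_size=8,
                             salt=salt.to_bytes(8, "big")).digest()
    return int.from_bytes(digest, "big")


def lowest_hash_unique(elements, salt, keep=100):
    """Track the `keep` distinct values with the lowest hashes, with counts.

    Returns the lowest-hash value seen exactly once, or None if every
    tracked value is a duplicate (in which case `keep` was too small).
    """
    heap = []     # max-heap via negated hash: (-hash, value)
    counts = {}   # occurrences of each currently tracked value
    for v in elements:
        if v in counts:
            counts[v] += 1
            continue
        h = salted_hash(v, salt)
        if len(heap) < keep:
            heapq.heappush(heap, (-h, v))
            counts[v] = 1
        elif h < -heap[0][0]:
            # Evicted values never come back: the threshold only decreases.
            _, evicted = heapq.heapreplace(heap, (-h, v))
            del counts[evicted]
            counts[v] = 1
    candidates = sorted((-neg_h, v) for neg_h, v in heap if counts[v] == 1)
    return candidates[0][1] if candidates else None


elements = [5, 3, 5, 8, 9, 3, 7]
picks = {lowest_hash_unique(elements, salt) for salt in (1, 2, 3)}
```

If two salts land on the same element, `picks` ends up with fewer than three values. That is the boundary case from Step 3, and it could be handled by taking the next candidate in that bucket.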

Find medians in multiple sub ranges of an unordered list

E.g. given an unordered list of N elements, find the medians for sub ranges 0..100, 25..200, 400..1000, 10..500, ...
I don't see any better way than going through each sub range and running the standard median-finding algorithms.
A simple example: [5 3 6 2 4]
The median for 0..3 is 5. (Not 4, since we are asking for the median of the first three elements of the original list.)
INTEGER ELEMENTS:
If your elements are integers, the best way is to keep a bucket for each number that lies in any of your sub-ranges, where each bucket counts how many times its associated integer occurs in your input elements (for example, bucket[100] stores how many 100s there are in your input sequence). Basically, you can achieve it in the following steps:
Create buckets for each number that lies in any of your sub-ranges.
Iterate through all elements; for each number n, if we have bucket[n], then bucket[n]++.
Compute the medians based on the aggregated values stored in your buckets.
To put it another way, suppose you have a sub-range [0, 10] and you would like to compute its median. The bucket approach counts how many 0s there are in your inputs, how many 1s, and so on. Suppose there are n numbers that lie in the range [0, 10]; then the median is the (n/2)-th smallest element, which can be identified by finding the i such that bucket[0] + bucket[1] + ... + bucket[i] is greater than or equal to n/2 but bucket[0] + ... + bucket[i - 1] is less than n/2.
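The bucket steps for a single sub-range could be sketched as follows. Like the answer, this treats the sub-range as a range of values rather than positions. The function name is hypothetical, and for an even count it returns the lower of the two middle elements.

```python
def bucket_median(elements, lo, hi):
    """Median of the integers in `elements` whose value lies in [lo, hi].

    Returns the lower median when the count is even, or None when no element
    falls in the range.
    """
    buckets = [0] * (hi - lo + 1)
    for n in elements:
        if lo <= n <= hi:
            buckets[n - lo] += 1
    total = sum(buckets)
    if total == 0:
        return None
    target = (total + 1) // 2   # rank of the (lower) median, 1-based
    running = 0
    for offset, count in enumerate(buckets):
        running += count
        if running >= target:
            return lo + offset


median_all = bucket_median([5, 3, 6, 2, 4], 0, 10)
```

Because only the counts matter, buckets built on different machines can simply be added element-wise before the final scan.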
The nice thing about this is that even if your input elements are stored on multiple machines (i.e., the distributed case), each machine can maintain its own buckets, and only the aggregated values need to pass through the network.
You can also use hierarchical buckets, which involves multiple passes. In each pass, bucket[i] counts the number of input elements that lie in a specific range (for example, [i * 2^K, (i+1) * 2^K)). You then narrow down the problem space by identifying which bucket the median lies in, decrease K by 1 for the next pass, and repeat until you can correctly identify the median.
FLOATING-POINT ELEMENTS
All elements fit into memory:
If all of your elements fit into memory, first sorting the N elements and then finding the median for each sub-range is the best option. The linear-time heap solution also works well in this case if the number of your sub-ranges is less than log N.
The elements do not fit into memory but are stored on a single machine:
An external sort typically requires three disk scans. Therefore, if the number of your sub-ranges is greater than or equal to 3, first sorting the N elements and then finding the median for each sub-range by loading only the necessary elements from disk is the best choice. Otherwise, simply performing one scan per sub-range and picking up the elements in that sub-range is better.
The elements are stored on multiple machines:
Since finding the median is a holistic operator, meaning you cannot derive the final median of the entire input from the medians of several parts of the input, it is a hard problem whose solution cannot be described in a few sentences, but there is research (see this as an example) focused on this problem.
I think that as the number of sub ranges increases, you will very quickly find that it is quicker to sort and then retrieve the element numbers you want.
In practice, this is because there will be highly optimized sort routines you can call.
In theory, and perhaps in practice too, it is because you are dealing with integers, so you need not pay n log n for a sort - see http://en.wikipedia.org/wiki/Integer_sorting.
If your data are in fact floating point and not NaNs then a little bit twiddling will in fact allow you to use integer sort on them - from - http://en.wikipedia.org/wiki/IEEE_754-1985#Comparing_floating-point_numbers - The binary representation has the special property that, excluding NaNs, any two numbers can be compared like sign and magnitude integers (although with modern computer processors this is no longer directly applicable): if the sign bit is different, the negative number precedes the positive number (except that negative zero and positive zero should be considered equal), otherwise, relative order is the same as lexicographical order but inverted for two negative numbers; endianness issues apply.
So you could check for NaNs and other funnies, pretend the floating point numbers are sign + magnitude integers, subtract when negative to correct the ordering for negative numbers, and then treat as normal 2s complement signed integers, sort, and then reverse the process.
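A small sketch of such a bit-twiddling key for 64-bit floats. It is a common variant that produces unsigned keys by flipping bits, rather than the subtract-and-treat-as-two's-complement route described above. The function name is made up, and unlike the quoted rule it orders -0.0 just before +0.0 instead of treating them as equal.

```python
import struct


def float_sort_key(x):
    """Map a non-NaN float to an unsigned 64-bit integer with the same ordering."""
    if x != x:
        raise ValueError("NaN has no place in the ordering")
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    if bits >> 63:                       # negative: invert all bits to reverse the order
        return bits ^ 0xFFFFFFFFFFFFFFFF
    return bits | 0x8000000000000000     # non-negative: move above all negatives


values = [3.5, -1.25, 0.0, -7.0, 2.0, float("inf"), float("-inf")]
by_key = sorted(values, key=float_sort_key)
```

The resulting keys can be fed to any integer sort, such as a radix sort, and mapped back by reversing the two bit operations.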
My idea:
Sort the list into an array (using any appropriate sorting algorithm)
For each range, find the indices of the start and end of the range using binary search
Find the median by simply adding their indices and dividing by 2 (i.e. median of range [x,y] is arr[(x+y)/2])
Preprocessing time: O(n log n) for a generic sorting algorithm (like quick-sort) or the running time of the chosen sorting routine
Time per query: O(log n)
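A minimal sketch of this idea with the standard `bisect` module. Like the answer, it treats each range as a range of values. `range_median` is a hypothetical name, and for an even number of elements it returns the lower middle one.

```python
from bisect import bisect_left, bisect_right


def range_median(sorted_values, low, high):
    """Median of the elements with low <= value <= high in a sorted list."""
    start = bisect_left(sorted_values, low)        # first index with value >= low
    end = bisect_right(sorted_values, high) - 1    # last index with value <= high
    if start > end:
        return None                                # no element in the range
    return sorted_values[(start + end) // 2]       # lower median for even counts


data = sorted([5, 3, 6, 2, 4])   # preprocessing, O(n log n)
median_3_to_5 = range_median(data, 3, 5)
```

Each query costs two binary searches, which matches the O(log n) per query stated above.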
Dynamic list:
The above assumes that the list is static. If elements can freely be added or removed between queries, a modified Binary Search Tree could work, with each node keeping a count of the number of descendants it has. This will allow the same running time as above with a dynamic list.
The answer is ultimately going to be "it depends". There are a variety of approaches, any one of which will probably be suitable in most of the cases you may encounter. The problem is that each is going to perform differently for different inputs. Where one may perform better for one class of inputs, another will perform better for a different class of inputs.
As an example, the approach of sorting and then performing a binary search on the extremes of your ranges and then directly computing the median will be useful when the number of ranges you have to test is greater than log(N). On the other hand, if the number of ranges is smaller than log(N) it may be better to move elements of a given range to the beginning of the array and use a linear time selection algorithm to find the median.
All of this boils down to profiling to avoid premature optimization. If the approach you implement turns out to not be a bottleneck for your system's performance, figuring out how to improve it isn't going to be a useful exercise relative to streamlining those portions of your program which are bottlenecks.

sum property of consecutive numbers

Suppose we have a list of numbers like [6,5,4,7,3]. How can we tell that the array contains consecutive numbers? One way is of course to sort them, or we can find the minimum and maximum. But can we determine this based on the sum of the elements? E.g. in the example above, it is 25. Could anyone help me with this?
The sum of elements by itself is not enough.
Instead you could check for:
All elements being unique.
and either:
Difference between min and max being right
or
Sum of all elements being right.
Approach 1
Sort the list and check the first element and last element.
In general this is O( n log(n) ), but if you have a limited data set you can sort in O( n ) time using counting sort or radix sort.
Approach 2
Pass over the data to get the highest and lowest elements.
As you pass through, add each element into a hash table and see if that element has now been added twice. This is more or less O( n ).
Approach 3
To save storage space (hash table), use an approximate approach.
Pass over the data to get the highest and lowest elements.
As you do, run an algorithm that determines with high (read: user-defined) probability that each element is distinct. Many such algorithms exist and are in use in data mining. Here's a link to a paper describing different approaches.
The numbers in the array are consecutive if the difference between the maximum and the minimum of the array equals n-1, provided the numbers are unique (where n is the size of the array). And of course the minimum and maximum can be calculated in O(n).
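The uniqueness plus min/max check can be sketched in a single pass, using a Python set as the hash table. The function name is hypothetical, and an empty list is treated as not consecutive, which is an arbitrary choice.

```python
def is_consecutive(numbers):
    """True if the numbers are distinct and max - min equals n - 1."""
    if not numbers:
        return False
    seen = set()
    lowest = highest = numbers[0]
    for x in numbers:
        if x in seen:          # a duplicate rules out consecutiveness
            return False
        seen.add(x)
        lowest = min(lowest, x)
        highest = max(highest, x)
    return highest - lowest == len(numbers) - 1


example = is_consecutive([6, 5, 4, 7, 3])
same_sum = is_consecutive([1, 9, 5, 5, 5])   # also sums to 25
```

The second call shows why the sum alone is not enough: both lists sum to 25, but only the first is consecutive.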

Resources