Paired comparisons algorithm design

I have 15 groupings of 5 words. Let's say the first grouping is the 'happy' grouping and is the following: ["happy", "smile", "fun", "joy", "laugh"], and the second is the 'sad' grouping and is the following: ["sad", "frown", "bummer", "cry", "rain-cloud"]. All the other groupings are similar, with five words in an array.
I am designing a paired-comparisons React app, and I need one word from each grouping to be randomly chosen and paired with a randomly chosen word from each other grouping. From the examples above, a pair for grouping 1 and 2 might be ["smile", "cry"]. There should be 105 total pairs (15 choose 2: exactly one pair for each grouping with each other grouping).
I was thinking of using a loop and going through the groupings one by one, then for each of the remaining groupings, taking a random word from the grouping I'm looking at and one from the other and creating a pair.
I feel like this isn't very elegant or efficient, and I'm curious how I might design a better algorithm. I think recursion might be helpful, but I can't think of how I could use it in this scenario.
Any thoughts or ideas? Thanks!
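For concreteness, here is a minimal sketch of the loop-based approach I describe above (function and variable names are illustrative only); for n groupings it produces exactly n*(n-1)/2 pairs, one per pair of groupings:
// Sketch only: pick a fresh random word from each grouping for every pair.
function randomWord(grouping) {
  return grouping[Math.floor(Math.random() * grouping.length)];
}

function buildPairs(groupings) {
  const pairs = [];
  for (let i = 0; i < groupings.length; i++) {
    for (let j = i + 1; j < groupings.length; j++) {
      pairs.push([randomWord(groupings[i]), randomWord(groupings[j])]);
    }
  }
  return pairs; // 15 groupings -> 105 pairs
}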

I had a thought about using recursion too, but unfortunately I couldn't come up with a recursive algorithm.
What I tried here is: while the list of sets still has more than one set in it, iterate through every item in the first set and pair it with a random item from a randomly chosen other set, then remove that chosen item. After the whole first set has been iterated, simply remove the first set from the list. This way there is no need for an extra step that checks for duplicates.
(I used JavaScript for the implementation, and made the sets smaller for the example.)
let myArray = [
  ["happy", "smile", "fun"],
  ["sad", "frown", "bummer"],
  [1, 2, 3],
  [4, 5, 6]
];
let pairs = [];
while (myArray.length != 1) {
  for (var i = 0; i < myArray[0].length; i++) {
    // pick a random set other than the first one, and a random item within it
    var next = myArray[Math.floor(Math.random() * (myArray.length - 1)) + 1];
    var nextIndex = Math.floor(Math.random() * next.length);
    pairs.push([myArray[0][i], next[nextIndex]]);
    next.splice(nextIndex, 1); // remove the chosen item so it is not reused
  }
  myArray.splice(0, 1); // done pairing the first set, so drop it
}
console.log(pairs);
Although, as you said, this code is not elegant (nor the most efficient)... There should be a better solution, so it is worth continuing to think about a better algorithm!

Related

Algorithm for grouping non-transitive pairs of items into maximal (overlapping) subsets

I'm working on an algorithm to combine matching pairs of items into larger groups. The problem is that these pairs are not transitive; 1=2 and 2=3 does not necessarily mean that 1=3. They are, however, commutative, so 1=2 implies 2=1.
Each item can belong to multiple groups, but each group should be as large as possible; for example, if 1=2, 1=3, 1=4, 1=5, 2=3, 3=4, and 4=5, then we'd want to end up with groups of 1-2-3, 1-3-4, and 1-4-5.
The best solution I've come up with so far is to work recursively: for any given item, iterate through every later item, and if they match, recurse and iterate through every item later than that to see if it matches all of the ones collected so far (and then check to make sure there isn't a larger group that already contains that combination; e.g. in the above example I'd be about to output 4-5 but would then go back and find that they were already incorporated in 1-4-5).
The sets involved are not enormous - rarely more than 40 or 50 items - but I might be working with thousands of these little sets in a single operation. So computational-complexity-wise it's totally fine if it's O(n²) or whatever because it's not going to have to scale to enormous sets, but I'd like it to be as fast as possible on those little 50-item sets.
Anyway, while I can probably make do with the above solution, it feels needlessly awkward and slow, so if there's a better approach I'd love to hear about it.
If you want ALL maximal groups, then there is no subexponential algorithm for this problem. As https://cstheory.stackexchange.com/questions/8390/the-number-of-cliques-in-a-graph-the-moon-and-moser-1965-result points out, the number of maximal cliques to find may itself grow exponentially in the size of the graph.
If you want just a set of maximal groups that covers all of the original relationships, then you can solve this in polynomial time (though not with a great bound).
def maximal_groups(pairs):
    # related[x]: everything ever paired with x
    # not_included[x]: neighbours of x not yet covered by some output group
    related = {}
    not_included = {}
    for pair in pairs:
        for i in [0, 1]:
            if pair[i] not in related:
                related[pair[i]] = set()
                not_included[pair[i]] = set()
            if pair[1-i] not in related:
                related[pair[1-i]] = set()
                not_included[pair[1-i]] = set()
        related[pair[0]].add(pair[1])
        related[pair[1]].add(pair[0])
        not_included[pair[0]].add(pair[1])
        not_included[pair[1]].add(pair[0])

    groups = []
    for item in sorted(related.keys()):
        while 0 < len(not_included[item]):
            # start a group from an uncovered edge, then greedily extend it
            other_item = not_included[item].pop()
            not_included[other_item].remove(item)
            group = [item, other_item]
            available = [x for x in sorted(related[item]) if x in related[other_item]]
            while 0 < len(available):
                next_item = available[0]
                for prev_item in group:
                    if prev_item in not_included[next_item]:
                        not_included[next_item].remove(prev_item)
                        not_included[prev_item].remove(next_item)
                group.append(next_item)
                available = [x for x in available if x in related[next_item]]
            groups.append(group)
    return groups
print(maximal_groups([[1,2], [1,3], [1,4], [1,5], [2,3], [3,4], [4,5]]))

Algorithm / Data structure for largest set intersection in a collection of sets with a given set

I have a large collection of several million sets, C. The elements of my sets come from a universe of about 2000 possible elements. I need to know, for a given set, s, which set in C has the largest intersection with s? (Or the k sets in C with the k-largest intersections). I will be making many of these queries, sequentially, for different s.
I know that the obvious way to do this is just to loop over every set in C, compute the intersection, and take the max. Are there any smart data structures / programming tricks that can speed up my search? It would be great if I could do this faster than O(C).
EDIT: approximate answers would be alright too
I don't think there's a clever data structure that will help with asymptotic performance. But this is a perfect map reduce problem. A GPGPU would do nicely. For a universe of 2048 elements, a set as a bitmap is only 256 bytes. 4 million is only a gigabyte. Even a modestly spec'ed Nvidia has that. E.g. programming in CUDA, you'd copy C to graphics card RAM, map a chunk of the gigabyte to each GPU core for searching and then reduce across cores to find the final answer. This ought to take on the order of a very few milliseconds. Not fast enough? Just buy hotter hardware.
If you re-phrase your question along these lines, you'll probably get answers from experts in this kind of programming, which I'm not.
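For illustration, here is a CPU-side JavaScript sketch of the bitmap representation described above (sizes and names are mine): with a 2048-element universe each set is 64 32-bit words, and the intersection size is a bitwise AND plus a popcount per word.
// Sketch only: each set is a Uint32Array of 64 words (2048 bits).
function popcount32(x) {
  x -= (x >>> 1) & 0x55555555;
  x = (x & 0x33333333) + ((x >>> 2) & 0x33333333);
  x = (x + (x >>> 4)) & 0x0f0f0f0f;
  return (x * 0x01010101) >>> 24;
}

function bitmapIntersectionSize(a, b) { // a, b: Uint32Array(64)
  let count = 0;
  for (let w = 0; w < a.length; w++) {
    count += popcount32(a[w] & b[w]);
  }
  return count;
}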
One simple trick is to sort the list of sets C in decreasing order by size, then proceed with brute force intersection tests as usual. As you go along, keep track of the set b with the biggest intersection so far. If you find a set whose intersection with the query set s has size |s| (or equivalently, has intersection equal to s -- use whichever of these tests is faster), you can immediately stop and return it as this is the best possible answer. Otherwise, if the next set from C has fewer than |b| elements, you can immediately stop and return b. This can easily be generalised to finding the top k matches.
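A minimal JavaScript sketch of that pruning strategy (helper names are mine), assuming C is an array of Sets pre-sorted by descending size:
// Returns a set in C with the largest intersection with s, stopping early when possible.
function overlapSize(a, b) {
  let n = 0;
  for (const x of a) if (b.has(x)) n++;
  return n;
}

function bestMatch(C, s) { // C: array of Set sorted by size, descending; s: Set
  let best = null, bestOverlap = -1;
  for (const candidate of C) {
    if (candidate.size < bestOverlap) break;   // no remaining set can beat the best overlap
    const overlap = overlapSize(candidate, s);
    if (overlap > bestOverlap) { best = candidate; bestOverlap = overlap; }
    if (bestOverlap === s.size) break;         // the whole query set is covered: stop early
  }
  return best;
}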
I don't see any way to do this in less than O(C) per query, but I have some ideas on how to maximize efficiency. The idea is basically to build a lookup table for each element. If some elements are rare and some are common, you can have positive and negative lookup tables:
s[i]      // your query, an array of size 2000, true/false
sign[i]   // whether the ith element uses a positive or negative lookup, +/- 1
sets[i]   // a list of all the sets that the ith element belongs (or does not belong) to

query(s):
    overlaps   // an array of size |C|, initialized to 0's
    for i in len(s):
        if s[i]:
            for j in sets[i]:
                overlaps[j] += sign[i]
    return max_index(overlaps)
Especially if many of your elements are of widely differing probabilities (as you said), this approach should save you some time: very rare or very common elements can be dealt with almost instantly.
To further optimize: you can sort the structure so that the elements that are most common/most rare are dealt with first. After you have done the first e.g. 3/4, you can do a quick pass to see if the closest matching set is so far ahead of the next set that it is not necessary to continue, though again whether that is worthwhile depends on the details of your data's distribution.
Yet another refinement: make sets[i] one of two possible structures: if the element is very rare or common, sets[i] is just a list of the sets that the ith element is in/not in. However, suppose the ith element is in half the sets. Then sets[i] is just a list of indices half as long as the number of sets, looping through it and incrementing overlaps is wasteful. Have a third value for sign[i]: if sign[i] == 0, then the ith element is relatively close to 50% commonality (this may just mean between 5% and 95%, or anything else), and instead of a list of sets in which it appears, it will simply be an array of 1's and 0's with length equal to C. Then you would just add the array in its entirety to overlaps which would be faster.
Put all of your elements from the million sets into a Hashtable. The key will be the element, and the value will be a set of indexes pointing to the containing sets.
HashSet<Element>[] AllSets = ...

// preprocess
Hashtable AllElements = new Hashtable(2000);
for (var index = 0; index < AllSets.Length; index++) {
    foreach (var elm in AllSets[index]) {
        if (!AllElements.ContainsKey(elm)) {
            AllElements.Add(elm, new HashSet<int>() { index });
        } else {
            ((HashSet<int>)AllElements[elm]).Add(index);
        }
    }
}

public List<HashSet<Element>> TopIntersect(HashSet<Element> set, int top = 1) {
    // <index, count>
    Dictionary<int, int> counts = new Dictionary<int, int>();
    foreach (var elm in set) {
        var setIndices = AllElements[elm] as HashSet<int>;
        if (setIndices != null) {
            foreach (var index in setIndices) {
                if (!counts.ContainsKey(index)) {
                    counts.Add(index, 1);
                } else {
                    counts[index]++;
                }
            }
        }
    }
    // requires System.Linq for OrderByDescending/Take/Select
    return counts.OrderByDescending(kv => kv.Value)
                 .Take(top)
                 .Select(kv => AllSets[kv.Key]).ToList();
}

Divide a group of people into two disjoint subgroups (of arbitrary size) and find some values

As we know from programming, sometimes a slight change in a problem can significantly alter the form of its solution.
Firstly, I want to create a simple algorithm for solving the following problem and classify it using big-theta notation:
Divide a group of people into two disjoint subgroups (of arbitrary size) such that the difference in the total ages of the members of the two subgroups is as large as possible.
Now I need to change the problem so that the desired difference is as small as possible, and classify my approach to the problem.
Well, first of all I need to create the initial algorithm. For that, should I do some kind of sorting in order to separate the teams, and how am I supposed to continue?
EDIT: for the first problem, we have ruled out the possibility of a subgroup being empty. So all we have to do is a linear search to find the minimum age and put it in set B. Set A then gets all the other ages. That gives the maximum possible difference between the total ages of the two sets.
The way you described the first problem, it is trivial: it only requires you to find the minimum element (assuming each subgroup must contain at least 1 member); otherwise it is already solved.
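As a quick sketch of that first case (names are mine), a single linear pass gives the maximum difference:
// Put the youngest person alone in group B; everyone else goes in group A.
function maxDifferenceSplit(ages) {
  const min = Math.min(...ages);                // O(n) scan for the minimum age
  const sum = ages.reduce((a, b) => a + b, 0);
  return sum - 2 * min;                         // total(A) - total(B)
}

console.log(maxDifferenceSplit([30, 5, 12, 8, 41])); // 86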
The second problem can be solved recursively; the pseudocode would be:
// compute the sum of all elements of the array and store it in sum
min = sum;
globalVec = baseVec;

fun generate(baseVec, generatedVec, position, total)
    if (abs(sum - 2*total) < min) {   // check if this distribution is better
        min = abs(sum - 2*total);
        globalVec = generatedVec;
    }
    if (position >= baseVec.length()) return;
    else {
        // either put the element at position in the first group:
        generate(baseVec, generatedVec.pushback(baseVec[position]), position + 1, total + baseVec[position]);
        // or put the element at position in the second group:
        generate(baseVec, generatedVec, position + 1, total);
    }
And now just start the function with generate(baseVec, "", 0, 0), where "" stands for an empty vector.
The algorithm can be drastically improved by applying it to a sorted array and adding a test condition to stop branching early, but the idea stays the same.
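For reference, a runnable JavaScript version of the same recursive idea (names are mine); it tries both placements for each age and keeps the split that minimises |sum - 2*total|:
// Exhaustive recursion over all 2^n assignments; fine for small n only.
function minDifferenceSplit(ages) {
  const sum = ages.reduce((a, b) => a + b, 0);
  let bestDiff = Infinity;
  let bestPick = null; // bestPick[i] === true means ages[i] goes into group A

  function generate(position, pick, totalA) {
    if (position === ages.length) {
      const diff = Math.abs(sum - 2 * totalA); // |total(A) - total(B)|
      if (diff < bestDiff) { bestDiff = diff; bestPick = pick.slice(); }
      return;
    }
    pick.push(true);                               // put ages[position] in group A
    generate(position + 1, pick, totalA + ages[position]);
    pick[pick.length - 1] = false;                 // or put it in group B
    generate(position + 1, pick, totalA);
    pick.pop();
  }

  generate(0, [], 0);
  return {
    difference: bestDiff,
    groupA: ages.filter((_, i) => bestPick[i]),
    groupB: ages.filter((_, i) => !bestPick[i]),
  };
}

console.log(minDifferenceSplit([30, 5, 12, 8, 41])); // difference: 2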

An algorithm to find a permutation in sequence

I'm asking about a simple problem: how can I find one (and only one) permutation (i.e. a single swap) in a sequence of numbers (with repetitions) with the lowest possible complexity?
Suppose we have the sequence: 1 1 2 3 4. Then we permute 2 and 3, so we have: 1 1 3 2 4. How can I find that 2 and 3 have been permuted? The worst solution would be to generate all possibilities and compare each one with the original permuted sequence, but I need something fast...
Thank you for your answers.
The problem with this is that there will be multiple solutions without some constraint, such as taking swaps in the order they are found.
What I'd do first is test that the sequence still contains the same values; if so, step through element by element until a mismatch is found, then find the first occurrence of the other value and mark that as the swap. Then continue searching for the next modification, and so on...
If you just want to know how much it has changed, I'd look at the Levenshtein algorithm. The basis of this algorithm may even give you what you need for your own custom algorithm, or inspire other approaches.
This is fast but it won't tell you which items have changed.
The only full solution I know of would be to record each change as it happens so you can just look at the history of changes to know the perfect answer.
function findswaps:
    linkedlist old <- store old string in a linked list
    linkedlist new <- store new string in a linked list
    compare elements one by one:
        if same
            next iteration until exhausted
        else
            remember old item
            iterate through future `new` elements one by one:
                if old item is found
                    report its position in the new list
                else
                    error
My humble attempt; please correct me if I'm wrong, so I can help better. I'm guessing the data is unordered, so it can't be any faster than linear?
If there is only 1 swap between the original and derived arrays, you could try something like this at O(n) for array length n:
int count = 0;
int[] mismatches;
foreach index in array {
    if original[index] != derived[index] {
        if count == 2 {
            fail
        }
        mismatches[count++] = index;
    }
}
if count == 2 and
   original[mismatches[0]] == derived[mismatches[1]] and
   original[mismatches[1]] == derived[mismatches[0]] {
    succeed
}
fail
Note that this reports a fail when nothing was swapped between the arrays.
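A compact JavaScript rendering of that check (names are mine), which likewise returns false when nothing was swapped:
// True if `derived` is `original` with exactly one pair of positions swapped.
function isSingleSwap(original, derived) {
  if (original.length !== derived.length) return false;
  const mismatches = [];
  for (let i = 0; i < original.length; i++) {
    if (original[i] !== derived[i]) {
      if (mismatches.length === 2) return false; // more than two differing positions
      mismatches.push(i);
    }
  }
  return mismatches.length === 2 &&
    original[mismatches[0]] === derived[mismatches[1]] &&
    original[mismatches[1]] === derived[mismatches[0]];
}

console.log(isSingleSwap([1, 1, 2, 3, 4], [1, 1, 3, 2, 4])); // true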

Good hash function for permutations?

I have got numbers in a specific range (usually from 0 to about 1000). An algorithm selects some numbers from this range (about 3 to 10 numbers). This selection is done quite often, and I need to check if a permutation of the chosen numbers has already been selected.
E.g. one step selects [1, 10, 3, 18] and another one selects [10, 18, 3, 1]; then the second selection can be discarded because it is a permutation of the first.
I need to do this check very fast. Right now I put all arrays in a hashmap and use a custom hash function: it just sums up all the elements, so 1+10+3+18=32 and 10+18+3+1=32. For equals I use a bitset to quickly check whether the elements are in both sets (I do not need sorting when using the bitset, but it only works when the range of numbers is known and not too big).
This works ok, but can generate lots of collisions, so the equals() method is called quite often. I was wondering if there is a faster way to check for permutations?
Are there any good hash functions for permutations?
UPDATE
I have done a little benchmark: generate all combinations of numbers in the range 0 to 6, with array lengths 1 to 9. There are 3003 possible permutations, and a good hash should generate close to this many different hashes (I use 32-bit numbers for the hash):
41 different hashes for just adding (so there are lots of collisions)
8 different hashes for XOR'ing values together
286 different hashes for multiplying
3003 different hashes for (R + 2e) and multiplying as abc has suggested (using 1779033703 for R)
So abc's hash can be calculated very fast and is a lot better than all the rest. Thanks!
PS: I do not want to sort the values when I do not have to, because this would get too slow.
One potential candidate might be this.
Fix an odd integer R.
For each element e you want to hash, compute the factor (R + 2*e).
Then compute the product of all these factors.
Finally, divide the product by 2 to get the hash.
The factor 2 in (R + 2e) guarantees that all factors are odd, which prevents the product from ever becoming 0. The division by 2 at the end is done because the product will always be odd, so the division just removes a constant bit.
E.g. I choose R = 1779033703. This is an arbitrary choice, doing some experiments should show if a given R is good or bad. Assume your values are [1, 10, 3, 18].
The product (computed using 32-bit ints) is
(R + 2) * (R + 20) * (R + 6) * (R + 36) = 3376724311
Hence the hash would be
3376724311/2 = 1688362155.
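A small JavaScript sketch of this hash (my own rendering), using Math.imul to get the 32-bit wrap-around multiplication:
const R = 1779033703; // the arbitrary odd constant chosen above

// Order-independent: multiplication modulo 2^32 is commutative, so any
// permutation of the same values gives the same product, hence the same hash.
function permutationHash(values) {
  let product = 1;
  for (const e of values) {
    product = Math.imul(product, R + 2 * e); // every factor is odd
  }
  return product >>> 1; // unsigned divide by 2 (drops the always-set low bit)
}

console.log(permutationHash([1, 10, 3, 18]) === permutationHash([10, 18, 3, 1])); // true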
Summing the elements is already one of the simplest things you could do. But I don't think it's a particularly good hash function w.r.t. pseudo randomness.
If you sort your arrays before storing them or computing hashes, every good hash function will do.
If it's about speed: have you measured where the bottleneck is? If your hash function is giving you a lot of collisions and you have to spend most of the time comparing the arrays bit by bit, the hash function is obviously not good at what it's supposed to do. Sorting + a better hash might be the solution.
If I understand your question correctly you want to test equality between sets where the items are not ordered. This is precisely what a Bloom filter will do for you. At the expense of a small number of false positives (in which case you'll need to make a call to a brute-force set comparison) you'll be able to compare such sets by checking whether their Bloom filter hash is equal.
The algebraic reason why this holds is that the OR operation is commutative. This holds for other semirings, too.
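A rough sketch of that idea (the mask size and mixing constants are illustrative, not a tuned Bloom filter): OR a couple of bit positions per element into a fixed-size mask and compare masks.
// Because OR is commutative, any permutation of the same elements yields the
// same mask; different selections can still collide (false positives).
function bloomFingerprint(values, sizeBits = 256) {
  const mask = new Uint8Array(sizeBits / 8);
  for (const v of values) {
    // two cheap, illustrative hash positions per element
    const h1 = ((v + 1) * 2654435761 >>> 0) % sizeBits;
    const h2 = ((v + 1) * 40503 + 2166136261 >>> 0) % sizeBits;
    mask[h1 >> 3] |= 1 << (h1 & 7);
    mask[h2 >> 3] |= 1 << (h2 & 7);
  }
  return mask.join(","); // string form, usable as a hash-map key
}

console.log(bloomFingerprint([1, 10, 3, 18]) === bloomFingerprint([10, 18, 3, 1])); // true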
Depending on whether you have a lot of collisions (i.e. the same hash but not a permutation), you might presort the arrays while hashing them. In that case you can do a more aggressive kind of hashing where you don't only add up the numbers but also mix in some bit magic to get quite different hashes.
This is only beneficial if you get loads of unwanted collisions because the hash you are doing now is too poor. If you hardly get any collisions, the method you are using seems fine.
I would suggest this:
1. Check if the lengths of the permutations are the same (if not, they are not equal).
2. Sort only one array. Instead of sorting the other array, iterate through the elements of the 1st array and search for each of them in the 2nd array (compare only while the elements in the 2nd array are smaller; do not iterate through the whole array).
Note: if you can have repeated numbers in your permutations (e.g. [1,2,2,10]), then you will need to remove elements from the 2nd array as they are matched against members of the 1st one.
pseudo-code:
if length(arr1) <> length(arr2) return false;
sort(arr2);
for i=1 to length(arr1) {
    elem = arr1[i];
    j = 1;
    while (j <= length(arr2) and arr2[j] < elem) j = j + 1;
    if elem <> arr2[j] return false;
}
return true;
The idea is that instead of sorting the other array, we can just try to match all of its elements against the sorted one.
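A runnable JavaScript sketch of the same approach (names are mine), consuming matched elements so the duplicate case from the note is handled:
// Checks whether arr1 and arr2 contain the same multiset of values.
function samePermutation(arr1, arr2) {
  if (arr1.length !== arr2.length) return false;
  const sorted = arr2.slice().sort((a, b) => a - b);
  const used = new Array(sorted.length).fill(false);
  for (const elem of arr1) {
    let j = 0;
    while (j < sorted.length && (used[j] || sorted[j] < elem)) j++; // skip smaller or consumed values
    if (j === sorted.length || sorted[j] !== elem) return false;
    used[j] = true; // consume the match so duplicates are counted correctly
  }
  return true;
}

console.log(samePermutation([1, 10, 3, 18], [10, 18, 3, 1])); // true
console.log(samePermutation([1, 2, 2, 10], [1, 2, 10, 10]));  // false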
You can probably reduce the collisions a lot by using the product as well as the sum of the terms.
1*10*3*18=540 and 10*18*3*1=540
so the sum-product hash would be [32, 540].
You still need to do something about collisions when they do happen, though.
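A tiny sketch of that combined key (illustrative only):
// Combine sum and product into one order-independent key string.
function sumProductKey(values) {
  let sum = 0, product = 1;
  for (const v of values) { sum += v; product *= v; }
  return sum + ":" + product;
}

console.log(sumProductKey([1, 10, 3, 18])); // "32:540"
console.log(sumProductKey([10, 18, 3, 1])); // "32:540"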
I like using a string's default hash code (Java, C#; not sure about other languages): it generates fairly well-distributed hash codes.
So first sort the array, and then generate a string from it using some delimiter.
You can do the following (Java):
int[] arr = selectRandomNumbers();
Arrays.sort(arr);
int hash = (arr[0] + "," + arr[1] + "," + arr[2] + "," + arr[3]).hashCode();
If performance is an issue, you can replace the inefficient string concatenation with StringBuilder or String.format:
String.format("%d,%d,%d,%d", arr[0], arr[1], arr[2], arr[3]);
A String hash code of course doesn't guarantee that two distinct strings have different hashes, but with this formatting, collisions should be extremely rare.
