For the last few days, I've been trying to accomplish the following task regarding the analysis of a set of objects, and the solutions I've come up with either rely heavily on memory (throwing OutOfMemory exceptions in some cases) or take an incredibly long time to process. I now think it is a good idea to post it here, as I'm out of ideas. I will explain the problem in detail and provide the logic I've followed so far.
Scenario:
First, we have an object, which we'll name Individual, that contains the following properties:
A date
A Longitude - Latitude pair
Second, we have another object, which we'll name Group, whose definition is:
A set of Individuals that, together, match the following conditions:
All individuals in the set have dates that are at most 10 days apart. That is, comparing any two Individuals in the set, their dates don't differ by more than 10 days.
The distance between any two objects is less than Y meters.
A group can have N > 1 individuals, as long as every pair of Individuals in it satisfies the conditions.
All individuals are stored in a database.
All groups would also be stored in a database.
The task:
Now, consider a new individual.
The system has to check if the new individual:
Belongs to an existing Group or Groups
Forms one or multiple new Groups with other Individuals.
Notes:
The new individual could be in multiple existing groups, or could create multiple new groups.
SubGroups of Individuals are not allowed; for example, if we have a Group that contains Individuals {A,B,C}, there cannot exist a group that contains {A,B}, {A,C} or {B,C}.
Solution (limited in processing time and Memory)
First, we filter the database for all the Individuals that match the initial conditions relative to the new one. This will output a FilteredIndividuals enumerable, containing all the Individuals that we know will form a Group (of 2) with the new one.
Briefly, a Powerset is a set that contains all the possible subsets of
a particular set. For example, a powerset of {A,B,C} would be:
{[empty], A, B, C, AB, AC, BC, ABC}
Note: A powerset will output a new set with 2^N combinations, where N is the length of the originating set.
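For illustration, the powerset can be generated with a simple bit-mask enumeration along these lines (a C# sketch, not my actual code; the generic names are placeholders):

    using System.Collections.Generic;

    static class PowersetHelper
    {
        // Enumerates every subset of 'items' by treating each bit of a counter as an
        // include/exclude flag. Note the blow-up: 2^N subsets, which is exactly where
        // the memory and time problems come from.
        public static IEnumerable<List<T>> GetPowerset<T>(IList<T> items)
        {
            long subsetCount = 1L << items.Count;
            for (long mask = 0; mask < subsetCount; mask++)
            {
                var subset = new List<T>();
                for (int i = 0; i < items.Count; i++)
                    if ((mask & (1L << i)) != 0)
                        subset.Add(items[i]);
                yield return subset;
            }
        }
    }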
The idea with using powersets is the following:
First, we create a powerset of the FilteredIndividuals list. This will give all possible combinations of Groups within the FilteredIndividuals list. For analysis purposes and by definition, we can omit all the combinations that have less than 2 Individuals in them.
We check whether the Individuals in each combination of the powerset match the conditions with each other.
If they match, that means that all of the Individuals in that combination form a Group with the new Individual. Then, to avoid SubGroups, we can eliminate all of the subsets of the checked combination. I do this by creating a powerset of the checked combination and then removing that new powerset from the original one.
At this point, we have a list of sets that match the conditions to form a Group.
Before formally creating a Group, I compare the new sets against the existing Groups in the DB that contain the same elements:
If I find a match, I eliminate the newly created set, and add the new Individual to the old Group.
If I don't find a match, it means they are new Groups. So I add the new Individual to the sets and finally create the new Groups.
This solution works well when the FilteredIndividuals enumerable has fewer than 52 Individuals. After that, memory exceptions are thrown (I know this is because of the maximum size allowed for the data types, but increasing that size doesn't help with very big sets; for your consideration, the largest number of Individuals matching the conditions that I've found is 345).
Note: I have access to the definition of both entities. If there's a new property that would reduce the processing time, we can add it.
I'm using the .NET framework with C#, but if the language needs changing, we can accept that, as long as we can later convert the results to objects understandable by our main system.
All individuals in the set have dates that are at most 10 days apart. That is, comparing any two Individuals in the set, their dates don't differ by more than 10 days.
The distance between any two objects is less than Y meters.
So your problem becomes how to cluster these points in 3-space: a partitioning where X and Y are your latitude and longitude, Z is the time coordinate, and your metric is an appropriately scaled variant of the Manhattan distance. Specifically, you scale the Z axis so that a span of 10 days corresponds to your maximum distance of Y meters.
One possible shortcut would be to use divide et impera (divide and conquer) and classify your points (Individuals) into buckets, Y meters wide and 10 days high. You do so by dividing their coordinates by Y and by 10 days (you can use Julian dates for that). If an individual is in bucket H { X=5, Y=3, Z=71 }, then no individual in a bucket with X < (5-1) or X > (5+1), Y < (3-1) or Y > (3+1), or Z < (71-1) or Z > (71+1) can be in the same group, because their distance would certainly be above the threshold. This means that you can quickly select a subset of 27 "buckets" and worry only about the individuals in there.
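A minimal sketch of that bucketing in C#, assuming the longitude/latitude have already been projected to planar meters and using .NET's DateTime.UnixEpoch as an arbitrary time origin (any fixed origin works; the names are illustrative):

    using System;

    static class Bucketing
    {
        // Assigns an individual to a grid cell that is maxDistanceMeters wide (in X and Y)
        // and 10 days high (in Z). Two individuals can only belong to the same Group if
        // their buckets differ by at most 1 in every coordinate, so a candidate lookup
        // only has to consider the 3 x 3 x 3 = 27 surrounding buckets.
        public static (int x, int y, int z) GetBucket(double xMeters, double yMeters,
                                                      DateTime date, double maxDistanceMeters)
        {
            double days = (date - DateTime.UnixEpoch).TotalDays;   // any fixed origin works
            return ((int)Math.Floor(xMeters / maxDistanceMeters),
                    (int)Math.Floor(yMeters / maxDistanceMeters),
                    (int)Math.Floor(days / 10.0));
        }
    }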
At this point you can enumerate the possible groups your new individual can be in (with a database back end, this would be SELECT groups.* FROM groups JOIN iig USING (gid) JOIN individuals USING (uid) WHERE individuals.bucketId IN ( #bucketsId )), and compare those with the groups your individual may form with other individuals (SELECT individuals.id FROM individuals WHERE bucketId IN ( #bucketsId ) AND ((x-#newX)*(x-#newX)+(y-#newY)*(y-#newY)) < #YSquared AND ABS(z - #newZ) < 10).
This approach is not very performant (it depends on the database, and you'll want an index on bucketId at a minimum), but it has the advantage of using as little memory as possible.
On some database backends with geographical extensions, you might want to use the native latitude and longitude functions instead of implicitly converting to meters.
Problem:
Given a set of group registrations, each for a varying number of people (1-7),
and a set of seating groups (immutable, at least 2m apart) varying from 1-4 seats,
I'd like to find the optimal assignment of people groups to seating groups:
People groups may be split among several seating groups (though preferably not)
Seating groups may not be shared by different people groups
(optional) the assignment should minimize the number of 'wasted' seats, i.e. maximize the number of seats in empty seating groups
(ideally it should run from within a Google Apps script, so memory and computational complexity should be as small as possible)
First attempt:
I'm interested in the decision problem (is it feasible?) as well as the optimization problem (see optional optimization function). I've modeled it as a SAT problem, but this does not find an optimal solution.
For this reason, I've tried to model it as an optimization problem. I'm thinking along the lines of a (remote) variation of multiple-knapsack, but I haven't been able to name it yet:
items: seating groups (size -> weight)
knapsacks: people groups (size -> container size)
constraint: combined item weight >= container size
optimization: minimize the number of items
As you can see, the constraint and optimization are inverted compared to the standard problem. So my question is: Am I on the right track here or would you go about it another way? If it's correct, does this optimization problem have a name?
You could approach this as an Integer Linear Programming Problem, defined as follows:
let P = the set of people groups, people group i consists of p_i people;
let T = the set of tables, table j has t_j places;
let x_ij be 1 if people from people group i are placed at table j, 0 otherwise
let M be a large penalty factor for empty seats
let N be a large penalty factor for splitting groups
// # of free spaces = # of seats at used tables - # of people seated
// every time a group uses more than one table,
// a penalty of N * (#tables - 1) is incurred
min M * [SUM_j(SUM_i[x_ij] * t_j) - SUM_i(p_i)] + N * SUM_i[(SUM_j(x_ij) - 1)]
// at most one group per table
s.t. SUM_i(x_ij) <= 1 for all j
// every group has enough seats
SUM_j(x_ij * t_j) >= p_i for all i
0 <= x_ij <= 1
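For illustration, here is a rough sketch of this formulation using Google OR-Tools' MIP wrapper (the Google.OrTools package and the SCIP backend are assumptions; any 0/1 ILP solver would do). The constant terms of the objective are dropped since they don't change the optimum:

    using Google.OrTools.LinearSolver;

    // Example data: p[i] = size of people group i, t[j] = seats in seating group j.
    int[] p = { 5, 3, 2 };
    int[] t = { 4, 4, 2, 1 };
    double M = 100.0, N = 10.0;   // penalty weights from the formulation above

    Solver solver = Solver.CreateSolver("SCIP");
    var x = new Variable[p.Length, t.Length];
    for (int i = 0; i < p.Length; i++)
        for (int j = 0; j < t.Length; j++)
            x[i, j] = solver.MakeIntVar(0, 1, $"x_{i}_{j}");

    // At most one people group per table.
    for (int j = 0; j < t.Length; j++)
    {
        Constraint c = solver.MakeConstraint(0, 1, $"table_{j}");
        for (int i = 0; i < p.Length; i++) c.SetCoefficient(x[i, j], 1);
    }

    // Every people group gets at least as many seats as it has people.
    for (int i = 0; i < p.Length; i++)
    {
        Constraint c = solver.MakeConstraint(p[i], double.PositiveInfinity, $"group_{i}");
        for (int j = 0; j < t.Length; j++) c.SetCoefficient(x[i, j], t[j]);
    }

    // Objective: M * (seats at used tables) + N * (tables used);
    // the constants -M*SUM(p_i) and -N*|P| are dropped.
    Objective obj = solver.Objective();
    for (int i = 0; i < p.Length; i++)
        for (int j = 0; j < t.Length; j++)
            obj.SetCoefficient(x[i, j], M * t[j] + N);
    obj.SetMinimization();

    if (solver.Solve() == Solver.ResultStatus.OPTIMAL)
        for (int i = 0; i < p.Length; i++)
            for (int j = 0; j < t.Length; j++)
                if (x[i, j].SolutionValue() > 0.5)   // > 0.5 guards against float noise
                    System.Console.WriteLine($"people group {i} -> seating group {j}");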
Although this minimises the number of empty seats, it does not minimise the number of tables used or maximise the number of groups admitted. If you'd like to do that, you could expand the objective function by adding a penalty for every group turned away.
ILPs are NP-hard, so without the right solvers, it might not be possible to make this run with Google Apps. I have no experience with that, so I'm afraid I can't help you. But there are some methods to reduce your search space.
One would be through something called column generation. Here, the problem is split into two parts. The complex master problem is your main research question, but instead of the entire solution space, it tries to find the optimum from different candidate assignments (or columns).
The goal is then to define a subproblem that recommends these new potential solutions that are then incorporated in the master problem. The power of a good subproblem is that it should be reducible to a simpler model, like Knapsack or Dijkstra.
I have two groups of objects where each group consists of 4 objects. The goal is to compute the degree of similarity between these two groups. The comparison between two objects results in an int number: the lower this number is, the more similar the objects are. The order of the objects within a group doesn't matter to the group equality.
So what I must do is compare each object of group 1 with each object of group 2, and this will give me 16 different comparison results between objects. I store these in a 4x4 int table called costs.
int[][] costs = new int[4][4];
for (int i = 0; i < 4; i++) {
    for (int j = 0; j < 4; j++) {
        costs[i][j] = compare(objectGroup1[i], objectGroup2[j]);
    }
}
Now I have 4 sets of 4 comparison results and I must choose one result from each set, in order to add them up and compute the total distance metric between the groups. This is the point where I got stuck.
I must try all combinations of four and get the minimum sum, but there is the restriction of using each object only once.
Example: if the first of the four values to add is the comparison result between objectGroup1[1] and objectGroup2[1], then I can't use, in this foursome, any other comparison result that involves objectGroup1[1], and the same goes for objectGroup2[1].
valid example: group1[1]-group2[2], group1[2]-group2[1], group1[3]-group2[3],group1[4]-group2[4]---->each object from each group appears only once
What kind of algorithm can I use here?
It sounds like you're trying to find the permutation of group 1's items that make it most similar to group 2's items when pairing the items off.
Eric Lippert has a good series of blog posts on producing permutations. So basically all you have to do is iterate over them, computing the score by pairing items, and return the best score. Basically just Zip-ing and MinBy-ing:
groupSimilarity =
item1.Groups
// (you have to implement Permutations)
.Permutations()
// we want to compute the best score, but we don't know which permutation will win
// so we MinBy a function computing the permutation's score
.MinBy(permutation =>
// pair up the items and combine them, using the Similarity function
permutation.Zip(item2.Groups, SimilarityFunction)
// add up the similarity scores
.Sum()
)
The above code is C#, written in a "Linqy" functional style (sorry if you're not familiar with that). MinBy is a useful function from MoreLinq, Zip is a standard Linq operator.
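For what it's worth, here is a self-contained sketch of the same brute-force idea without the MoreLinq dependency, working directly on the costs matrix from the question (a recursive Permutations helper plus a plain Min; with 4 objects there are only 24 permutations to try):

    using System;
    using System.Collections.Generic;
    using System.Linq;

    static class GroupDistance
    {
        // Yields every permutation of the given items (4 items -> 24 permutations).
        static IEnumerable<int[]> Permutations(int[] items)
        {
            if (items.Length <= 1) { yield return items; yield break; }
            for (int i = 0; i < items.Length; i++)
            {
                int[] rest = items.Where((_, idx) => idx != i).ToArray();
                foreach (int[] tail in Permutations(rest))
                    yield return new[] { items[i] }.Concat(tail).ToArray();
            }
        }

        // costs[i][j] = compare(objectGroup1[i], objectGroup2[j]), as in the question.
        // Returns the minimum total cost over all one-to-one pairings.
        public static int MinTotalCost(int[][] costs)
        {
            int n = costs.Length;
            return Permutations(Enumerable.Range(0, n).ToArray())
                .Min(perm => perm.Select((j, i) => costs[i][j]).Sum());
        }
    }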
I have:
1 million university student names and
3 million bank customer names
I managed to convert the strings into numerical values based on hashing (similar strings have similar hash values). I would like to know how I can determine the correlation between these two sets to see if the values are pairing up at least 60% of the time.
Can I achieve this using ICC? How does ICC 2-way random work?
Please kindly answer ASAP as I need this urgently.
This kind of entity resolution is normally easy, but I am surprised by the hashing approach here. Hashing loses information that is critical to entity resolution. So, if possible, you shouldn't use the hashes; rather, use the original strings.
Assuming using original strings is an option, then you would want to do something like this:
List A (1M), List B (3M)
// First, match the entities that match very well, and REMOVE them.
for a in List A
    for b in List B
        if compare(a,b) >= MATCH_THRESHOLD   // this may be 90%, etc.
            add (a,b) to matchedList
            remove a from List A
            remove b from List B

// Now, match the entities that match well, and run bipartite matching.
// Bipartite matching is required because each entity can match "acceptably well"
// with more than one entity on the other side.
for a in List A
    for b in List B
        compute compare(a,b)
        set edge(a,b) = compare(a,b)
        if compare(a,b) < THRESHOLD          // this seems to be 60%
            set edge(a,b) = 0

// Now, run bipartite matcher and take results
The time complexity of this algorithm is O(n1 * n2), which is not very good. There are ways to avoid this cost, but they depend on your specific entity resolution function. For example, if the last name has to match (to make the 60% cut), then you can simply create sublists in A and B that are partitioned by the first couple of characters of the last name, and just run this algorithm between corresponding lists. But it may very well be that the last name "Nuth" is supposed to match "Knuth", etc. So, some local knowledge of your name comparison function can help you divide and conquer this problem better.
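As a rough illustration of that partitioning (blocking) idea in C# (BlockKey and Compare are assumed placeholders; a real key would be derived from the last name, and a real Compare would be an edit-distance or phonetic similarity):

    using System.Collections.Generic;
    using System.Linq;

    static class NameBlocking
    {
        // Assumed placeholders: a blocking key and a similarity score in [0, 1].
        static string BlockKey(string name) =>
            new string(name.TrimStart().Take(2).ToArray()).ToUpperInvariant();
        static double Compare(string a, string b) => a == b ? 1.0 : 0.0;

        // Only compares names whose blocking keys match, instead of all n1 * n2 pairs.
        public static IEnumerable<(string a, string b, double score)> CandidatePairs(
            IEnumerable<string> listA, IEnumerable<string> listB, double threshold)
        {
            var blocksB = listB.GroupBy(BlockKey).ToDictionary(g => g.Key, g => g.ToList());
            foreach (string a in listA)
            {
                if (!blocksB.TryGetValue(BlockKey(a), out var candidates)) continue;
                foreach (string b in candidates)
                {
                    double s = Compare(a, b);
                    if (s >= threshold) yield return (a, b, s);
                }
            }
        }
    }

The surviving (a, b, score) edges are then what you'd feed into the bipartite matcher.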
This little project / problem came out of left field for me. Hoping someone can help me here. I have some rough ideas but I am sure (or at least I hope) a simple, fairly efficient solution exists.
Thanks in advance.... pseudo code is fine. I generally work in .NET / C# if that sheds any light on your solution.
Given:
A pool of n individuals that will be meeting on a regular basis. I need to form pairs that have not previously met. The pool of individuals will slowly change over time. For the purposes of pairing, (A & B) and (B & A) constitute the same pair. The history of previous pairings is maintained. For the purpose of the problem, assume an even number of individuals. For each meeting (collection of pairs), an individual will only pair up once.
Is there an algorithm that will allow us to form these pairs? Ideally something better than just ordering the pairs in a random order, generating pairings and then checking against the history of previous pairings. In general, randomness within the pairing is ok.
A bit more:
I can figure out a number of ways to create a randomized pool from which to pull pairs of individuals: check those against the history and either throw them back in the pool or remove them and add them to the list of paired individuals. What I can't get my head around is that at some point I will be left with a list of individuals that cannot be paired up. But... some of those individuals could possibly be paired with members that are in the paired list. I could throw one of those partners back into the pool of unpaired members, but this seems to lead to a loop that would be difficult to test and that could run on forever.
Interesting idea for converting a standard search into a probability selection:
Load the history in a structure with O(1) "contains" tests e.g. a HashSet of (A,B) pairs.
Loop through each of the 0.5*n*(n-1) possible pairings
check if this pairing is in history
if not then continue to the next iteration of loop
increase "number found" counter
save pairing as "result" with probability 1/"number found" (i.e. always for the first unused pairing found)
Finally if "result" has an answer then use it, else all possibilities are exhausted
This will run in O(n^2) + O(size of history), and nicely detects the case where all possibilities are exhausted.
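A sketch of that selection in C# (effectively a reservoir sample of size one over the unused pairings, so each unused pairing is equally likely to be returned; pairs are assumed to be stored with the smaller id first):

    using System;
    using System.Collections.Generic;

    static class PairPicker
    {
        // Returns a uniformly random pairing not present in 'history', or null if none is left.
        public static (int a, int b)? PickUnusedPair(int n, HashSet<(int, int)> history, Random rng)
        {
            (int, int)? result = null;
            int found = 0;
            for (int a = 0; a < n; a++)
                for (int b = a + 1; b < n; b++)
                {
                    if (history.Contains((a, b))) continue;
                    found++;
                    if (rng.Next(found) == 0)   // keep with probability 1/found
                        result = (a, b);
                }
            return result;
        }
    }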
Based on your requirements, I think what you really need is quasi-random numbers that ultimately result in uniform coverage of your data (i.e., everyone pairs up with everyone else one time). Quasi-random pairings give you a much less "clumped" result than simple random pairings, with the added benefit that you have much greater control over the resulting data; hence you can enforce the unique-pairings rule without having to detect whether the newly randomized pairings duplicate the historical ones.
Check this wikipedia entry:
http://en.wikipedia.org/wiki/Low-discrepancy_sequence
More good reading:
http://www.johndcook.com/blog/2009/03/16/quasi-random-sequences-in-art-and-integration/
I tried to find a C# library that would help you generate the sort of quasi-random spreads you're looking for, but the only libs I could find were in C/C++. But I still recommend downloading the source, since the full logic of the quasi-random algorithms (look for quasi-Monte Carlo) is there:
http://www.gnu.org/software/gsl/
I see this as a graph problem where individuals are nodes and edges join individuals that have not yet been paired. With this reformulation, creating new pairs is simply finding a set of independent edges (edges without any common node), i.e. a matching.
That is not yet an answer, but there is a chance that this is a common graph problem with well-known solutions.
One thing we can say at that point is that in some cases there may be no solution (you would have to redo some previous pairs).
It may also be simpler to consider the line graph (exchanging the roles of edges and nodes: nodes would be pairs, and edges would join pairs that share an individual).
at startup, build a list of all possible pairings.
add all possible new pairings to this list as individuals are added, and remove any expired pairings as individuals are removed from the pool.
select new pairings randomly from this list, and remove them from the list when the pairing is selected.
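A minimal sketch of maintaining that pool (names illustrative):

    using System.Collections.Generic;
    using System.Linq;

    class PairingPool
    {
        private readonly List<(string a, string b)> pool = new();

        public void AddIndividual(string newcomer, IEnumerable<string> existing)
        {
            // A newcomer can potentially pair with everyone already in the pool.
            pool.AddRange(existing.Select(other => (newcomer, other)));
        }

        public void RemoveIndividual(string leaver)
        {
            // Expire every pairing that involves the individual who left.
            pool.RemoveAll(p => p.a == leaver || p.b == leaver);
        }

        public void MarkUsed((string a, string b) pair)
        {
            // Once a pair has met, it is no longer available.
            pool.Remove(pair);
        }
    }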
Form an upper diagonal matrix with your elements
Individual   A   B   C   D
    A        *
    B        *   *
    C        *   *   *
    D        *   *   *   *
Each blank element will contain True if the pair has been formed and False if not.
Each pairing session consists of looping through the columns of each row until a False is found, forming the pair and setting the matrix element to True.
When deleting an individual, delete row and column.
If performance is an issue, you can keep the last pair formed for each row in a counter, updating it carefully when deleting.
When adding an individual, add a last row & col.
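A sketch of the matrix approach in C# (a full symmetric bool[,] instead of literal upper-diagonal storage, purely for simplicity; met[i, j] is True once i and j have been paired):

    using System.Collections.Generic;

    static class MatrixPairing
    {
        // Builds one meeting's pairs greedily: for each still-unpaired row, scan the
        // columns for an individual it has not met yet. met[i, j] is updated in place.
        public static List<(int, int)> FormPairs(bool[,] met)
        {
            int n = met.GetLength(0);
            var pairedThisMeeting = new bool[n];
            var pairs = new List<(int, int)>();
            for (int i = 0; i < n; i++)
            {
                if (pairedThisMeeting[i]) continue;
                for (int j = i + 1; j < n; j++)
                {
                    if (pairedThisMeeting[j] || met[i, j]) continue;
                    met[i, j] = met[j, i] = true;    // record the new pairing
                    pairedThisMeeting[i] = pairedThisMeeting[j] = true;
                    pairs.Add((i, j));
                    break;
                }
            }
            return pairs;
        }
    }

As other posts in this thread point out, a greedy scan like this can get stuck even when a complete pairing still exists, so a swap or backtracking fallback may still be needed.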
Your best bet is probably:
1. Load the history in a structure with fast access, e.g. a HashSet of (A,B) pairs.
2. Create a completely random set of pairings (e.g. by randomly shuffling the list of individuals and partitioning into adjacent pairs).
3. Check if each pairing is in the history (both (A,B) and (B,A) should be checked).
4. If none of the pairings are found, you have a completely new pairing set as required; else go to step 2.
Note that step 1 can be done once and simply updated when new pairings are created if you need to efficiently create large numbers of new unique pairings.
Also note that you will need to take some extra precautions if there is a chance that all possible pairings will be exhausted (in which case you need to bail out of the loop!)
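Something along these lines, as a sketch (Fisher-Yates shuffle, adjacent pairs, pairs stored with the smaller id first so (A,B) and (B,A) collapse to one key; maxAttempts is the bail-out guard mentioned above):

    using System;
    using System.Collections.Generic;

    static class ShufflePairing
    {
        public static List<(int, int)> TryNewPairing(
            List<int> individuals, HashSet<(int, int)> history, Random rng, int maxAttempts = 1000)
        {
            var people = new List<int>(individuals);
            for (int attempt = 0; attempt < maxAttempts; attempt++)
            {
                // Fisher-Yates shuffle.
                for (int i = people.Count - 1; i > 0; i--)
                {
                    int k = rng.Next(i + 1);
                    (people[i], people[k]) = (people[k], people[i]);
                }

                // Partition into adjacent pairs and check each one against the history.
                var pairs = new List<(int, int)>();
                bool clash = false;
                for (int i = 0; i + 1 < people.Count; i += 2)
                {
                    var pair = (Math.Min(people[i], people[i + 1]), Math.Max(people[i], people[i + 1]));
                    if (history.Contains(pair)) { clash = true; break; }
                    pairs.Add(pair);
                }
                if (!clash) return pairs;   // a completely new pairing set
            }
            return null;                    // likely exhausted; bail out
        }
    }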
Is there any way of ordering two elements? If so, you can save one (or at least half of one) hash probe per iteration by always ordering a pair the same way. So, if you have A, B, C and D, the generated possible pairings would be [AB, CD], [AC, BD] or [AD, BC].
What I'd do then is something like:
pair_everyone(pool, pairs, history):
    if pool is empty:
        all done, update global history, return pairs
    repeat for pool_size/2:
        pick element1 (randomly from pool)
        pick element2 (randomly from pool)
        set pair = pair(e1, e2)
        until pair not in history or all possible pairs tried:
            pick element1 (randomly from pool)
            pick element2 (randomly from pool)
            set pair = pair(e1, e2)
        if pair is not in history:
            result = pair_everyone(pool - e1 - e2, pairs + pair, history + pair)
            if result != failure:
                return result
        else:
            return failure
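A sketch of that recursion in C#, with one simplification: instead of picking both elements at random, it fixes the first remaining individual and tries partners in random order, which avoids retrying equivalent branches (names are illustrative):

    using System;
    using System.Collections.Generic;
    using System.Linq;

    static class BacktrackingPairing
    {
        // Returns a full set of history-free pairs for 'pool', or null if none exists.
        public static List<(int, int)> PairEveryone(
            List<int> pool, HashSet<(int, int)> history, Random rng)
        {
            if (pool.Count == 0) return new List<(int, int)>();   // all done

            // Fix the first remaining individual and try partners in random order.
            int first = pool[0];
            foreach (int partner in pool.Skip(1).OrderBy(_ => rng.Next()))
            {
                var pair = (Math.Min(first, partner), Math.Max(first, partner));
                if (history.Contains(pair)) continue;

                var rest = pool.Where(p => p != first && p != partner).ToList();
                var result = PairEveryone(rest, history, rng);
                if (result != null)
                {
                    result.Insert(0, pair);
                    return result;
                }
            }
            return null;   // failure: no valid partner for 'first' in this branch
        }
    }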
How about:
create a set CI of all current individuals
then:
randomly select one individual A and remove from CI
create a new set of possible partners PP by copying CI and removing all previous partners of A
if PP is empty scan the list of pairs found and swap A for an individual C who is paired with someone not in A's history and who still has possible partners in CI. Recalculate PP for A = C.
if PP is not empty select one individual B from PP to be paired with A
remove B from CI
repeat until no new pair can be found
I'm designing a piece of a game where the AI needs to determine which combination of armor will give the best overall stat bonus to the character. Each character will have about 10 stats, of which only 3-4 are important, and of those important ones, a few will be more important than the others.
Armor will also give a boost to one or more stats. For example, a shirt might give +4 to the character's int and +2 stamina, while at the same time a pair of pants may have +7 strength and nothing else.
So let's say that a character has a healthy choice of armor to use (5 pairs of pants, 5 pairs of gloves, etc.) We've designated that Int and Perception are the most important stats for this character. How could I write an algorithm that would determine which combination of armor and items would result in the highest of any given stat (say in this example Int and Perception)?
Targeting one statistic
This is pretty straightforward. First, a few assumptions:
You didn't mention this, but presumably one can only wear at most one kind of armor for a particular slot. That is, you can't wear two pairs of pants, or two shirts.
Presumably, also, the choice of one piece of gear does not affect or conflict with others (other than the constraint of not having more than one piece of clothing in the same slot). That is, if you wear pants, this in no way precludes you from wearing a shirt. But notice, more subtly, that we're assuming you don't get some sort of synergy effect from wearing two related items.
Suppose that you want to target statistic X. Then the algorithm is as follows:
Group all the items by slot.
Within each group, sort the potential items in that group by how much they boost X, in descending order.
Pick the first item in each group and wear it.
The set of items chosen is the optimal loadout.
Proof: the only way to get a higher X stat would be if there were an unchosen item A which provided more X than the item chosen from its group. But we sorted the items in each group by X in descending order and picked the first, so there can be no such A.
What happens if the assumptions are violated?
If assumption one isn't true -- that is, you can wear multiple items in each slot -- then instead of picking the first item from each group, pick the first Q(s) items from each group, where Q(s) is the number of items that can go in slot s.
If assumption two isn't true -- that is, items do affect each other -- then we don't have enough information to solve the problem. We'd need to know specifically how items can affect each other, or else be forced to try every possible combination of items through brute force and see which ones have the best overall results.
Targeting N statistics
If you want to target multiple stats at once, you need a way to tell "how good" something is. This is called a fitness function. You'll need to decide how important the N statistics are, relative to each other. For example, you might decide that every +1 to Perception is worth 10 points, while every +1 to Intelligence is only worth 6 points. You now have a way to evaluate the "goodness" of items relative to each other.
Once you have that, instead of optimizing for X, you instead optimize for F, the fitness function. The process is then the same as the above for one statistic.
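A compact sketch of that greedy selection with a fitness function (the Item shape and the example weights are made up for illustration):

    using System.Collections.Generic;
    using System.Linq;

    record Item(string Name, string Slot, Dictionary<string, int> StatBoosts);

    static class Loadout
    {
        // Fitness: weighted sum of the stats we care about, e.g. Perception = 10, Int = 6.
        static double Fitness(Item item, Dictionary<string, double> weights) =>
            item.StatBoosts.Sum(kv => weights.TryGetValue(kv.Key, out double w) ? w * kv.Value : 0.0);

        // Group the candidate gear by slot and keep the highest-fitness item in each slot.
        public static List<Item> BestLoadout(IEnumerable<Item> items, Dictionary<string, double> weights) =>
            items.GroupBy(i => i.Slot)
                 .Select(g => g.OrderByDescending(i => Fitness(i, weights)).First())
                 .ToList();
    }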
If there is no restriction on the number of items by category, the following will work for multiple statistics and multiple items.
Data preparation:
Give each statistic (Int, Perception) a weight, according to how important you determine it is
Store this as a 1-D array statImportance
Give each item-statistic combination a value, according to how much said item boosts said statistic for the player
Store this as a 2-D array itemStatBoost
Algorithm:
In pseudocode. Here assume that itemScore is a sortable Map with Item as the key and a numeric value as the value, and values are initialised to 0.
Assume that the sort method is able to sort this Map by values (not keys).
//Score each item and rank them
for each statistic as S
    for each item as I
        score = itemScore.get(I) + (statImportance[S] * itemStatBoost[I,S])
        itemScore.put(I, score)
sort(itemScore)   //by value, highest score first

//Decide which items to use
maxEquippableItems = 10   //use the appropriate value
selectedItems = new array[maxEquippableItems]
for 0 <= idx < maxEquippableItems
    selectedItems[idx] = itemScore.getByIndex(idx)
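In C#, that scoring and ranking could look roughly like this (statImportance and itemStatBoost as in the data preparation above; a sketch, not a full implementation):

    using System.Linq;

    static class ItemRanking
    {
        // statImportance[s]  : weight of statistic s
        // itemStatBoost[i,s] : boost that item i gives to statistic s
        public static int[] SelectItems(double[] statImportance, double[,] itemStatBoost,
                                        int maxEquippableItems)
        {
            int itemCount = itemStatBoost.GetLength(0);
            int statCount = itemStatBoost.GetLength(1);

            return Enumerable.Range(0, itemCount)
                // Score each item: weighted sum over all statistics.
                .Select(i => (item: i,
                              score: Enumerable.Range(0, statCount)
                                               .Sum(s => statImportance[s] * itemStatBoost[i, s])))
                // Rank from best to worst and keep the top ones.
                .OrderByDescending(x => x.score)
                .Take(maxEquippableItems)
                .Select(x => x.item)
                .ToArray();
        }
    }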