java - compare two groups of objects efficiently - algorithm

I have two groups of objects, where each group consists of 4 objects. The goal is to compute the degree of similarity between these two groups. The comparison between two objects results in an int; the lower this number is, the more similar the objects are. The order of the objects within a group doesn't matter to the group equality.
So what I must do is compare each object of group 1 with each object of group 2, which gives me 16 different comparison results between objects. I store these in a 4x4 int table called costs:
int[][] costs = new int[4][4];
for (int i = 0; i < 4; i++) {
    for (int j = 0; j < 4; j++) {
        costs[i][j] = compare(objectGroup1[i], objectGroup2[j]);
    }
}
Now I have 4 sets of 4 comparison results, and I must choose one result from each set in order to add them up and compute the total distance metric between the groups. This is where I got stuck.
I must try all combinations of four and take the minimum sum, but with the restriction that each object may be used only once.
Example: if the first of the four values to add is the comparison result between objectGroup1[1] and objectGroup2[1], then I can't use any other comparison result involving objectGroup1[1] in this foursome, and the same goes for objectGroup2[1].
Valid example: group1[1]-group2[2], group1[2]-group2[1], group1[3]-group2[3], group1[4]-group2[4] -> each object from each group appears only once.
What kind of algorithm can I use here?

It sounds like you're trying to find the permutation of group 1's items that makes it most similar to group 2's items when pairing the items off.
Eric Lippert has a good series of blog posts on producing permutations. So all you have to do is iterate over them, computing the score by pairing items, and return the best score: basically just Zip-ing and MinBy-ing:
groupSimilarity =
    item1.Groups
        // (you have to implement Permutations)
        .Permutations()
        // we want to compute the best score, but we don't know which permutation
        // will win, so we MinBy a function computing the permutation's score
        .MinBy(permutation =>
            // pair up the items and combine them, using the Similarity function
            permutation.Zip(item2.Groups, SimilarityFunction)
                // add up the similarity scores
                .Sum()
        )
The above code is C#, written in a "Linqy" functional style (sorry if you're not familiar with that). MinBy is a useful function from MoreLinq; Zip is a standard Linq operator.
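Since each group here holds only four objects, the permutations can simply be brute-forced: 4! is just 24. Below is a minimal Java sketch of the same idea, working directly on the costs table built in the question; for larger groups this becomes the classic assignment problem, which the Hungarian algorithm solves in O(n^3).

class BestAssignment {

    // Returns the minimum total distance over all one-to-one pairings.
    static int bestTotalDistance(int[][] costs) {
        int n = costs.length;
        int[] perm = new int[n];
        for (int i = 0; i < n; i++) perm[i] = i;
        return permute(costs, perm, 0);
    }

    // Generates permutations in place; perm[0..k-1] is already fixed.
    private static int permute(int[][] costs, int[] perm, int k) {
        if (k == perm.length) {
            int sum = 0;
            for (int i = 0; i < perm.length; i++) sum += costs[i][perm[i]];
            return sum;
        }
        int best = Integer.MAX_VALUE;
        for (int i = k; i < perm.length; i++) {
            swap(perm, k, i);
            best = Math.min(best, permute(costs, perm, k + 1));
            swap(perm, k, i); // backtrack
        }
        return best;
    }

    private static void swap(int[] a, int i, int j) {
        int t = a[i]; a[i] = a[j]; a[j] = t;
    }
}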

Related

Analysis of different Sets and optimizations. Best approach?

For the last few days, I've been trying to accomplish the following task involving the analysis of a set of objects, and the solutions I've come up with either rely heavily on memory (throwing OutOfMemory exceptions in some cases) or take an incredibly long time to process. I now think it's a good idea to post the problem here, as I'm out of ideas. I will explain it in detail and provide the logic I've followed so far.
Scenario:
First, we have an object, which we'll name Individual, that contains the following properties:
A date
A Longitude - Latitude pair
Second, we have another object, which we'll name Group, whose definition is:
A set of Individuals that, together, match the following conditions:
All individuals in the set have dates that, compared with each other, differ by no more than 10 days. That is, any two Individuals in the set are at most 10 days apart.
The distance between each pair of objects is less than Y meters.
A group can have N > 1 individuals, as long as each of the Individuals matches the conditions with every other.
All individuals are stored in a database.
All groups would also be stored in a database.
The task:
Now, consider a new individual.
The system has to check if the new individual:
Belongs to an existing Group or Groups
Forms one or multiple new Groups with other Individuals.
Notes:
The new individual could be in multiple existing groups, or could create multiple new groups.
SubGroups of Individuals are not allowed; for example, if we have a Group that contains Individuals {A,B,C}, there cannot exist a group that contains {A,B}, {A,C} or {B,C}.
Solution (limited in processing time and Memory)
First, we filter the database for all the Individuals that match the initial conditions against the new one. This outputs a FilteredIndividuals enumerable, containing all the Individuals that we know will form a Group (of 2) with the new one.
Briefly, a powerset is a set that contains all the possible subsets of a particular set. For example, the powerset of {A,B,C} would be:
{[empty], A, B, C, AB, AC, BC, ABC}
Note: A powerset will output a new set with 2^N combinations, where N is the length of the originating set.
The idea with using powersets is the following:
First, we create a powerset of the FilteredIndividuals list. This gives all possible combinations of Groups within the FilteredIndividuals list. For analysis purposes, and by definition, we can omit all the combinations that have fewer than 2 Individuals in them.
We check whether the Individuals in each combination of the powerset match the conditions with each other.
If they match, that means all of the Individuals in that combination form a Group with the new Individual. Then, to avoid SubGroups, we eliminate all the subsets of the checked combination. I do this by creating a powerset of the checked combination and then eliminating that powerset from the original one.
At this point, we have a list of sets that match the conditions to form a Group.
Before formally creating a Group, I compare the DB with other existing Groups that contain the same elements as the new sets:
If I find a match, I eliminate the newly created set, and add the new Individual to the old Group.
If I don't find a match, it means they are new Groups. So I add the new Individual to the sets and finally create the new Groups.
This solution works well when the FilteredIndividuals enumerable has fewer than 52 Individuals. After that, memory exceptions are thrown. (I know this is because of the maximum size allowed for the data types involved, but increasing that size doesn't help with very big sets. For reference, the largest number of Individuals matching the conditions that I've found is 345.)
Note: I have access to the definition of both entities. If there's a new property that would reduce the processing time, we can add it.
I'm using the .NET framework with C#, but if the language is something that requires changing, we can accept that, as long as we can later convert the results to objects understandable by our main system.
All individuals in the set have dates that, compared with each other, differ by no more than 10 days. That is, any two Individuals in the set are at most 10 days apart.
The distance between each pair of objects is less than Y meters.
So your problem becomes how to cluster these points in 3-space: a partitioning where X and Y are your latitude and longitude, Z is the time coordinate, and your metric is an appropriately scaled variant of the Manhattan distance. Specifically, you scale Z so that 10 days corresponds to your maximum distance of Y meters.
One possible shortcut would be to divide et impera and classify your points (Individuals) into buckets Y meters wide and 10 days high. You do so by dividing their coordinates by Y and by 10 days (you can use Julian dates for that). If an individual is in bucket H { X=5, Y=3, Z=71 }, then no individual in a bucket with X < 4 or X > 6, Y < 2 or Y > 4, or Z < 70 or Z > 72 can be in its group, because their distance would certainly be above the threshold. This means that you can quickly select a subset of 27 "buckets" and worry only about the individuals in there.
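As an illustration, here is a minimal Java sketch of that bucketing, assuming x and y have already been projected to meters and z is a day number; the BucketIndex class and its layout are illustrative, not part of the original design.

import java.util.*;

class BucketIndex {
    final double yMeters; // the spatial threshold Y
    final Map<String, List<double[]>> buckets = new HashMap<>();

    BucketIndex(double yMeters) { this.yMeters = yMeters; }

    // Bucket id: coordinates divided by Y meters and by 10 days.
    private String key(double x, double y, double z) {
        return (long) Math.floor(x / yMeters) + ":"
             + (long) Math.floor(y / yMeters) + ":"
             + (long) Math.floor(z / 10.0);
    }

    void add(double x, double y, double z) {
        buckets.computeIfAbsent(key(x, y, z), k -> new ArrayList<>())
               .add(new double[] { x, y, z });
    }

    // Collect candidates from the 27 buckets around a new individual;
    // only these can possibly satisfy both thresholds. The exact
    // distance and date checks still have to run on the result.
    List<double[]> candidates(double x, double y, double z) {
        long bx = (long) Math.floor(x / yMeters);
        long by = (long) Math.floor(y / yMeters);
        long bz = (long) Math.floor(z / 10.0);
        List<double[]> out = new ArrayList<>();
        for (long dx = -1; dx <= 1; dx++)
            for (long dy = -1; dy <= 1; dy++)
                for (long dz = -1; dz <= 1; dz++) {
                    List<double[]> b = buckets.get(
                        (bx + dx) + ":" + (by + dy) + ":" + (bz + dz));
                    if (b != null) out.addAll(b);
                }
        return out;
    }
}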
At this point you can enumerate the possible groups your new individual can be in (with a database back end, something like SELECT groups.* FROM groups JOIN iig USING (gid) JOIN individuals USING (uid) WHERE individuals.bucketId IN ( #bucketsId )), and compare those with the group your individual may form from other individuals (SELECT individuals.id FROM individuals WHERE bucketId IN ( #bucketsId ) AND ((x-#newX)*(x-#newX)+(y-#newY)*(y-#newY)) < #YSquared AND ABS(z - #newZ) < 10).
This approach is not very performant (it depends on the database, and you'll want an index on bucketId at a minimum), but it has the advantage of using as little memory as possible.
On some database backends with geographical extensions, you might want to use the native latitude and longitude functions instead of implicitly converting to meters.

Algorithm to assign best value between points based on distance

I am having trouble figuring out an algorithm to best assign values to different points on a diagram based on the distance between the points.
Essentially, I am given a diagram with a block and a dynamic number of points. It should look something like this:
I am then given a list of values to assign to each point. Here are the rules and info:
I know the Lat,Long values for each point and the central block. In other words, I can get the direct distance from every object to another.
The list of values may be shorter than the total number of points. In this case, values can be repeated multiple times.
In the case where values must be repeated, the duplicate values should be as far away as possible from one another.
Here is an example using a value list of {1,2}:
In reality, this is a very simple example. In truth, there may be thousands of points.
Find out how many values you need to repeat. In your example you have 2 values and 5 points, so you need 2 repetitions for the 2 values; then you will have 2x2 = 4 positions [call this pNum] (you have to use different pairs as much as possible so that they are far apart from each other).
Calculate a distance array, then find the max pNum values in that array; in other words, find the greatest 4 values in the array in your example.
Assign the repeated values to the points found farthest apart, and assign the rest of the points based on the distance values in the array.
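One way to read this in Java: compute pairwise distances and greedily pick pNum points that are as spread out as possible, then assign the repeated values to them. The greedy farthest-point selection below is an illustrative assumption, not the only way to spread duplicates.

import java.util.*;

class FarthestPoints {
    // Pick k indices of points that are spread as far apart as possible
    // (greedy farthest-point sampling over Euclidean distance).
    static List<Integer> pickSpread(double[][] pts, int k) {
        List<Integer> chosen = new ArrayList<>();
        chosen.add(0); // any seed point works for a sketch
        while (chosen.size() < k) {
            int best = -1;
            double bestDist = -1;
            for (int i = 0; i < pts.length; i++) {
                if (chosen.contains(i)) continue;
                // distance from point i to its nearest already-chosen point
                double d = Double.MAX_VALUE;
                for (int c : chosen) d = Math.min(d, dist(pts[i], pts[c]));
                if (d > bestDist) { bestDist = d; best = i; }
            }
            chosen.add(best);
        }
        return chosen; // assign the repeated values to these points
    }

    static double dist(double[] a, double[] b) {
        return Math.hypot(a[0] - b[0], a[1] - b[1]);
    }
}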

How to determine correspondence between two lists of names?

I have:
1 million university student names and
3 million bank customer names
I managed to convert the strings into numerical values based on hashing (similar strings have similar hash values). I would like to know how I can determine the correlation between these two sets, to see if values are pairing up at least 60%?
Can I achieve this using ICC? How does ICC 2-way random work?
Please kindly answer ASAP as I need this urgently.
This kind of entity resolution is normally easy, but I am surprised by the hashing approach here. Hashing loses information that is critical to entity resolution. So, if possible, you shouldn't use hashes, but rather the original strings.
Assuming using original strings is an option, then you would want to do something like this:
List A (1M), List B (3M)

// First, match the entities that match very well, and REMOVE them.
for a in List A
    for b in List B
        if compare(a,b) >= MATCH_THRESHOLD  // This may be 90% etc.
            add (a,b) to matchedList
            remove a from List A
            remove b from List B

// Now, match the entities that match well, and run bipartite matching.
// Bipartite matching is required because each entity can match "acceptably well"
// with more than one entity on the other side.
for a in List A
    for b in List B
        compute compare(a,b)
        set edge(a,b) = compare(a,b)
        if compare(a,b) < THRESHOLD  // This seems to be 60%
            set edge(a,b) = 0

// Now, run the bipartite matcher and take the results
The time complexity of this algorithm is O(n1 * n2), which is not very good. There are ways to avoid this cost, but they depend on your specific entity resolution function. For example, if the last name has to match (to make the 60% cut), then you can simply create sublists of A and B partitioned by the first couple of characters of the last name, and run this algorithm only between corresponding sublists. But it may very well be that the last name "Nuth" is supposed to match "Knuth", etc. So some local knowledge of what your name comparison function does can help you divide and conquer this problem better.
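A rough Java sketch of the two passes above; Jaccard similarity over character bigrams stands in for the real compare(a,b), and the thresholds and the final bipartite matcher (e.g. Hopcroft-Karp run on the edges map) are assumptions left abstract here. The lists must be mutable (e.g. ArrayList).

import java.util.*;

class NameMatcher {
    static final double MATCH_THRESHOLD = 0.90; // "matches very well"
    static final double EDGE_THRESHOLD  = 0.60; // "matches acceptably well"

    static Map<String, Map<String, Double>> match(List<String> listA, List<String> listB) {
        List<String[]> matched = new ArrayList<>();

        // Pass 1: pair off near-exact matches and remove them from both lists.
        for (Iterator<String> it = listA.iterator(); it.hasNext(); ) {
            String a = it.next();
            for (Iterator<String> jt = listB.iterator(); jt.hasNext(); ) {
                String b = jt.next();
                if (compare(a, b) >= MATCH_THRESHOLD) {
                    matched.add(new String[] { a, b });
                    it.remove();
                    jt.remove();
                    break;
                }
            }
        }

        // Pass 2: build weighted edges for the remaining names; anything
        // below the 60% cut gets no edge. A bipartite matcher runs on these.
        Map<String, Map<String, Double>> edges = new HashMap<>();
        for (String a : listA)
            for (String b : listB) {
                double w = compare(a, b);
                if (w >= EDGE_THRESHOLD)
                    edges.computeIfAbsent(a, k -> new HashMap<>()).put(b, w);
            }
        return edges;
    }

    // Jaccard similarity over character bigrams; a simple stand-in.
    static double compare(String a, String b) {
        Set<String> ba = bigrams(a), bb = bigrams(b);
        if (ba.isEmpty() && bb.isEmpty()) return 1.0;
        Set<String> inter = new HashSet<>(ba);
        inter.retainAll(bb);
        Set<String> union = new HashSet<>(ba);
        union.addAll(bb);
        return (double) inter.size() / union.size();
    }

    static Set<String> bigrams(String s) {
        Set<String> out = new HashSet<>();
        String t = s.toLowerCase();
        for (int i = 0; i + 1 < t.length(); i++) out.add(t.substring(i, i + 2));
        return out;
    }
}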

How do I pick the most beneficial combination of items from a set of items?

I'm designing a piece of a game where the AI needs to determine which combination of armor will give the best overall stat bonus to the character. Each character will have about 10 stats, of which only 3-4 are important, and of those important ones, a few will be more important than the others.
Armor will also give a boost to one or more stats. For example, a shirt might give +4 to the character's Int and +2 Stamina, while a pair of pants may have +7 Strength and nothing else.
So let's say that a character has a healthy choice of armor to use (5 pairs of pants, 5 pairs of gloves, etc.), and we've designated Int and Perception as the most important stats for this character. How could I write an algorithm that determines which combination of armor and items results in the highest totals for the given stats (in this example, Int and Perception)?
Targeting one statistic
This is pretty straightforward. First, a few assumptions:
You didn't mention this, but presumably one can wear at most one piece of armor in a particular slot. That is, you can't wear two pairs of pants, or two shirts.
Presumably, also, the choice of one piece of gear does not affect or conflict with others (other than the constraint of not having more than one piece of clothing in the same slot). That is, if you wear pants, this in no way precludes you from wearing a shirt. But notice, more subtly, that we're assuming you don't get some sort of synergy effect from wearing two related items.
Suppose that you want to target statistic X. Then the algorithm is as follows:
Group all the items by slot.
Within each group, sort the potential items in that group by how much they boost X, in descending order.
Pick the first item in each group and wear it.
The set of items chosen is the optimal loadout.
Proof: the only way to get a higher X stat would be if some item A provided more X than the item chosen from its group. But we already sorted all the items in each group in descending order, so there can be no such A.
What happens if the assumptions are violated?
If assumption one isn't true -- that is, you can wear multiple items in each slot -- then instead of picking the first item from each group, pick the first Q(s) items from each group, where Q(s) is the number of items that can go in slot s.
If assumption two isn't true -- that is, items do affect each other -- then we don't have enough information to solve the problem. We'd need to know specifically how items can affect each other, or else be forced to try every possible combination of items through brute force and see which ones have the best overall results.
Targeting N statistics
If you want to target multiple stats at once, you need a way to tell "how good" something is. This is called a fitness function. You'll need to decide how important the N statistics are, relative to each other. For example, you might decide that every +1 to Perception is worth 10 points, while every +1 to Intelligence is only worth 6 points. You now have a way to evaluate the "goodness" of items relative to each other.
Once you have that, instead of optimizing for X, you instead optimize for F, the fitness function. The process is then the same as the above for one statistic.
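A minimal Java sketch of this per-slot selection with a fitness function; the Item shape, stat names, and weights below are illustrative assumptions, not from the question.

import java.util.*;

class Loadout {
    record Item(String name, String slot, Map<String, Integer> boosts) {}

    // Keep, for each slot, the item with the highest fitness score.
    static Map<String, Item> bestLoadout(List<Item> items, Map<String, Double> weights) {
        Map<String, Item> best = new HashMap<>();
        for (Item it : items) {
            Item cur = best.get(it.slot());
            if (cur == null || fitness(it, weights) > fitness(cur, weights))
                best.put(it.slot(), it);
        }
        return best; // one optimal item per slot
    }

    // Fitness: each stat boost weighted by how important that stat is,
    // e.g. weights = {"Perception": 10.0, "Intelligence": 6.0}.
    static double fitness(Item it, Map<String, Double> weights) {
        double f = 0;
        for (var e : it.boosts().entrySet())
            f += weights.getOrDefault(e.getKey(), 0.0) * e.getValue();
        return f;
    }
}

With only one non-zero weight, this reduces exactly to the single-statistic algorithm above.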
If there is no restriction on the number of items per category, the following will work for multiple statistics and multiple items.
Data preparation:
Give each statistic (Int, Perception) a weight, according to how important you determine it is
Store this as a 1-D array statImportance
Give each item-statistic combination a value, according to how much said item boosts said statistic for the player
Store this as a 2-D array itemStatBoost
Algorithm:
In pseudocode. Here, assume that itemScore is a sortable Map with Item as the key and a numeric score as the value, with values initialised to 0.
Assume that the sort method is able to sort this Map by values (not keys).
// Score each item and rank them
for each statistic as S
    for each item as I
        score = itemScore.get(I) + (statImportance[S] * itemStatBoost[I,S])
        itemScore.put(I, score)
sort(itemScore)

// Decide which items to use
maxEquippableItems = 10  // use the appropriate value
selectedItems = new array[maxEquippableItems]
for 0 <= idx < maxEquippableItems
    selectedItems[idx] = itemScore.getByIndex(idx)
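One possible Java reading of this pseudocode, assuming the sort is descending so the best-scoring items come first:

import java.util.*;
import java.util.stream.*;

class ItemSelector {
    // Score every item by the weighted sum of its stat boosts,
    // sort descending, and take the indices of the top K items.
    static List<Integer> selectItems(double[] statImportance,
                                     double[][] itemStatBoost,
                                     int maxEquippableItems) {
        int items = itemStatBoost.length;
        double[] score = new double[items];
        for (int s = 0; s < statImportance.length; s++)
            for (int i = 0; i < items; i++)
                score[i] += statImportance[s] * itemStatBoost[i][s];

        return IntStream.range(0, items)
                .boxed()
                .sorted(Comparator.comparingDouble((Integer i) -> score[i]).reversed())
                .limit(maxEquippableItems)
                .collect(Collectors.toList());
    }
}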

How can I sort a 10 x 10 grid of 100 car images in two dimensions, by price and speed?

Here's the scenario.
I have one hundred car objects. Each car has a property for speed, and a property for price. I want to arrange images of the cars in a grid so that the fastest and most expensive car is at the top right, and the slowest and cheapest car is at the bottom left, and all other cars are in an appropriate spot in the grid.
What kind of sorting algorithm do I need to use for this, and do you have any tips?
EDIT: the results don't need to be exact - in reality I'm dealing with a much bigger grid, so it would be sufficient if the cars were clustered roughly in the right place.
Just an idea inspired by Mr Cantor:
calculate max(speed) and max(price)
normalize all speed and price data into range 0..1
for each car, calculate the "distance" to the possible maximum
based on a²+b²=c², distance could be something like
sqrt( (speed(car[i])/maxspeed)^2 + (price(car[i])/maxprice)^2 )
apply weighting as (visually) necessary
sort cars by distance
place "best" car in "best" square (upper right in your case)
walk the grid in zigzag and fill with next car in sorted list
Result (mirrored, top left is best):
1 - 2   6 - 7
  /   /   /
3   5   8
|  /
4
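A Java sketch of this recipe: normalize, sort by the distance score, then walk the grid along anti-diagonals as in the figure above. The Car record and the exact zigzag parity are illustrative assumptions.

import java.util.*;

class GridSort {
    record Car(String name, double speed, double price) {}

    static Car[][] arrange(List<Car> cars, int rows, int cols) {
        double maxSpeed = cars.stream().mapToDouble(Car::speed).max().orElse(1);
        double maxPrice = cars.stream().mapToDouble(Car::price).max().orElse(1);

        // Best car first: largest normalized (speed, price) vector.
        List<Car> sorted = new ArrayList<>(cars);
        sorted.sort(Comparator.comparingDouble((Car c) ->
                Math.hypot(c.speed() / maxSpeed, c.price() / maxPrice)).reversed());

        // Fill along anti-diagonals (row + col == d), alternating direction.
        Car[][] grid = new Car[rows][cols];
        int k = 0;
        for (int d = 0; d < rows + cols - 1 && k < sorted.size(); d++) {
            List<int[]> cells = new ArrayList<>();
            for (int r = Math.max(0, d - cols + 1); r <= Math.min(d, rows - 1); r++)
                cells.add(new int[] { r, d - r });
            if (d % 2 == 0) Collections.reverse(cells);
            for (int[] cell : cells) {
                if (k >= sorted.size()) break;
                grid[cell[0]][cell[1]] = sorted.get(k++);
            }
        }
        return grid; // grid[0][0] holds the "best" car, as in the mirrored figure
    }
}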
Treat this as two problems:
1: Produce a sorted list
2: Place members of the sorted list into the grid
The sorting is just a matter of defining your rules more precisely. "Fastest and most expensive first" doesn't work: which comes first, my £100,000 Rolls Royce with a top speed of 120, or my souped-up Mini costing £50,000 with a top speed of 180?
Having got your list, how will you fill the grid? First and last are easy, but where does number two go? Along the top or down? Then where next: along rows, along columns, zigzag? You've got to decide. After that, the coding should be easy.
I guess what you want is for cars with "similar" characteristics to be clustered nearby, and additionally for cost in general to increase rightwards and speed in general to increase upwards.
I would try the following approach. Suppose you have N cars and you want to put them in an X * Y grid. Assume N == X * Y.
Put all the N cars in the grid at random locations.
Define a metric that calculates the total misordering in the grid; for example, count the number of car pairs C1=(x,y) and C2=(x',y') such that C1.speed > C2.speed but y < y' plus car pairs C1=(x,y) and C2=(x',y') such that C1.price > C2.price but x < x'.
Run the following algorithm:
1. Calculate the current misordering metric M.
2. Enumerate all pairs of cars in the grid and calculate the misordering metric M' you would obtain by swapping them.
3. Swap the pair of cars that reduces the metric most, if any such pair was found.
4. If you swapped two cars, repeat from step 1.
5. Finish.
This is a standard "local search" approach to an optimization problem. What you have here is basically a simple combinatorial optimization problem. Another approach to try might be a self-organizing map (SOM) with a preseeded gradient of speed and cost in the matrix.
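A compact Java sketch of this local search, assuming row 0 is the top (fastest) row and price increases with column index; it recounts the whole metric for every candidate swap, so it illustrates the idea rather than an efficient implementation.

class GridLocalSearch {
    // pos[id] = {row, col}; speed and price are indexed by car id.
    // A pair is misordered when the faster car sits on a lower row,
    // or the pricier car sits further left.
    static int misordering(int[][] pos, double[] speed, double[] price) {
        int m = 0;
        for (int a = 0; a < speed.length; a++)
            for (int b = 0; b < speed.length; b++) {
                if (speed[a] > speed[b] && pos[a][0] > pos[b][0]) m++;
                if (price[a] > price[b] && pos[a][1] < pos[b][1]) m++;
            }
        return m;
    }

    // Repeatedly apply the single swap that lowers the metric most.
    static void improve(int[][] pos, double[] speed, double[] price) {
        boolean improved = true;
        while (improved) {
            improved = false;
            int bestM = misordering(pos, speed, price);
            int bestA = -1, bestB = -1;
            for (int a = 0; a < pos.length; a++)
                for (int b = a + 1; b < pos.length; b++) {
                    swap(pos, a, b);
                    int m = misordering(pos, speed, price);
                    if (m < bestM) { bestM = m; bestA = a; bestB = b; }
                    swap(pos, a, b); // undo
                }
            if (bestA >= 0) { swap(pos, bestA, bestB); improved = true; }
        }
    }

    static void swap(int[][] pos, int a, int b) {
        int[] t = pos[a]; pos[a] = pos[b]; pos[b] = t;
    }
}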
Basically you have to take one of speed or price as the primary key, get the cars that share each value of this primary key, sort those cars in ascending/descending order of the other property, and take the primary values themselves in ascending/descending order as needed.
Example:
c1(20,1000) c2(30,5000) c3(20, 500) c4(10, 3000) c5(35, 1000)
Let's assume Car(speed, price) as the measure in the above list, with speed as the primary.
1 Get the car with the minimum speed.
2 Then get all the cars with that same speed value.
3 Arrange these cars in ascending order of price.
4 Get the car with the next minimum speed value and repeat the above process.
c4(10, 3000)
c3(20, 500)
c1(20, 1000)
c2(30, 5000)
c5(35, 1000)
If you post what language you are using, it would be helpful, as some language constructs make this easier to implement. For example, LINQ makes your life very easy in this situation:
cars.OrderBy(x => x.Speed).ThenBy(p => p.Price);
Edit:
Now that you've got the list, as for placing the cars into the grid: unless you know that there will be a predetermined number of cars with these values, you can't do anything except go with some fixed grid size, as you are doing now.
One option would be to go with a nonuniform grid, if you prefer, with each row holding car items of a specific speed, but this is only applicable when you know there will be a considerable number of cars sharing the same speed value.
So each row would show cars of the same speed in the grid.
Thanks
Is the 10x10 constraint necessary? If it is, you must have ten speeds and ten prices, or else the diagram won't make very much sense. For instance, what happens if the fastest car isn't the most expensive?
I would rather recommend you make the grid size equal to (number of distinct speeds) x (number of distinct prices); then it would be a (rather) simple case of ordering along two axes.
If the data originates in a database, then you should order them as you fetch them from the database. This should only mean adding ORDER BY speed, price near the end of your query, but before the LIMIT part (where 'speed' and 'price' are the names of the appropriate fields).
As others have said, "fastest and most expensive" is a difficult thing to do; you ought to just pick one to sort by first. However, it would be possible to make an approximation using this algorithm:
Find the highest price and fastest speed.
Normalize all prices and speeds to e.g. a fraction out of 1. You do this by dividing each price by the highest price you found in step 1, and likewise each speed by the highest speed.
Multiply the normalized price and speed together to create one "price & speed" number.
Sort by this number.
This ensures that if car A is faster and more expensive than car B, it gets put ahead on the list. Cars where one value is higher but the other is lower get roughly sorted. I'd recommend storing these values in the database and sorting as you select.
Putting them in a 10x10 grid is easy. Start outputting items, and when you get to a multiple of 10, start a new row.
Another option is to apply a score 0 .. 200% to each car, and sort by that score.
Example:
score_i = speed_percent(min_speed, max_speed, speed_i) + price_percent(min_price, max_price, price_i)
Hmmm... a kind of bubble sort could be a simple algorithm here.
Make a random 10x10 array.
Find two neighbours (horizontal or vertical) that are in "wrong order", and exchange them.
Repeat (2) until no such neighbours can be found.
Two neighbour elements are in "wrong order" when:
a) they're horizontal neighbours and left one is slower than right one,
b) they're vertical neighbours and top one is cheaper than bottom one.
But I'm not actually sure whether this algorithm terminates for all data. I'm almost sure it is very slow :-). It should be easy to implement, though, and after some finite number of iterations the partial result might be good enough for your purposes. You can also start by generating the array using one of the other methods mentioned here. It will also maintain your condition on the array's shape.
Edit: It is too late here to prove anything, but I made some experiments in Python. It looks like a random 100x100 array can be sorted this way in a few seconds, and I always managed to get a full 2D ordering (that is, at the end there were no wrongly-ordered neighbours). Assuming that the OP can precalculate this array, he can put any reasonable number of cars into it and get sensible results. Experimental code: http://pastebin.com/f2bae9a79 (you need matplotlib, and I recommend ipython too). iterchange is the sorting method there.
