Partition matrix to minimize variance of parts - algorithm

I have a matrix of real numbers and I would like to find a partition of this matrix such that both the number of parts and the variance of the numbers in each part are minimized. Intuitively, I want as few parts as possible, but I also want all the numbers within any given part to be close together.
More formally, I suppose that for the latter I would find, for each part, the variance of the numbers in that part, and then take the average of those variances over all the parts. This would be one component of the "score" for a given solution; the other component would be, for instance, the total number of elements in the matrix minus the number of parts in the partition, so that fewer parts make this component higher. The final score for the solution would be a weighted average of the two components, and the best solution is the one with the highest score.
Obviously a lot of this is heuristic: I need to decide how to balance the number of parts against the variances. But I'm stuck on even a general approach to the problem.
For instance, given the following simple matrix:
10, 11, 12, 20, 21
8, 13, 9, 22, 23
25, 23, 24, 26, 27
It would be a reasonable solution to partition into the following submatrices:
10, 11, 12 | 20, 21
 8, 13,  9 | 22, 23
-----------+-------
25, 23, 24 | 26, 27
Partitioning is only allowed by slicing vertically and horizontally.
Note that I don't need the optimal solution, I just need an approach to get a "good" solution. Also, these matrices are several hundred by several hundred, so brute forcing it is probably not a reasonable solution, unless someone can propose a good way to pare down the search space.

I think you'd be better off by starting with a simpler problem. Let's call this
Problem A: given a fixed number of vertical and/or horizontal partitions, where should they go to minimize the sum of variances (or perhaps some other measure of variation, such as the sum of ranges within each block).
I'd suggest using a dynamic programming formulation for problem A.
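For illustration, here is a rough sketch of that DP for a 1D restriction of problem A: choosing where to place a fixed number of vertical cuts so that the sum of within-block variances is minimized. Python and the function name are my own choices, and the full 2D version would have to run something like this over row and column cut positions together, so treat it as a sketch rather than a finished algorithm.

    import numpy as np

    def best_vertical_cuts(matrix, num_blocks):
        """Split the columns into num_blocks contiguous blocks so that the sum
        of the variances of the numbers inside each block is minimized.
        Returns (total variance, column indices where blocks 2..num_blocks start).
        The DP itself is O(num_blocks * n^2); the cost table below dominates for
        large matrices (prefix sums of values and squared values would fix that)."""
        a = np.asarray(matrix, dtype=float)
        n = a.shape[1]
        assert 1 <= num_blocks <= n

        # cost[i][j] = variance of every entry in columns i..j (inclusive)
        cost = np.zeros((n, n))
        for i in range(n):
            for j in range(i, n):
                cost[i][j] = a[:, i:j + 1].var()

        INF = float("inf")
        # dp[k][j] = best cost of splitting columns 0..j-1 into k blocks
        dp = [[INF] * (n + 1) for _ in range(num_blocks + 1)]
        cut = [[0] * (n + 1) for _ in range(num_blocks + 1)]
        dp[0][0] = 0.0
        for k in range(1, num_blocks + 1):
            for j in range(1, n + 1):
                for i in range(k - 1, j):          # last block = columns i..j-1
                    c = dp[k - 1][i] + cost[i][j - 1]
                    if c < dp[k][j]:
                        dp[k][j], cut[k][j] = c, i

        # Walk back through the table to recover where each block starts.
        starts, j = [], n
        for k in range(num_blocks, 0, -1):
            j = cut[k][j]
            starts.append(j)
        return dp[num_blocks][n], sorted(starts)[1:]

Sweeping num_blocks from 1 upward then gives you the variation-versus-block-count curve that problem B has to trade off.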
Once you have that under control, then you can deal with
Problem B: find the best trade-off between variation and the number of vertical and horizontal partitions.
Obviously, you can reduce the variance to 0 by putting each element into its own block. In general, problem B requires you to solve problem A for each choice of vertical and horizontal partition counts that is considered.
To use a dynamic programming approach for problem B, you would have to formulate an objective function that encapsulates the trade-off you seek. I'm not sure how feasible this is, so I'd suggest looking for different approaches.
As it stands, problem B is a 2D problem. You might find some success looking at 2D clustering algorithms. An alternative might be possible if it can be reformulated as a 1D problem: trading off variation against the number of blocks (instead of the vertical and horizontal partition counts). Then you could use something like the Jenks natural breaks classification method to decide where to draw the line(s).
Anyway, this answer clearly doesn't give you a working algorithm. But I hope that it does at least provide an approach (which is all you asked for :)).

Related

2d Tree Nearest Neighbor Algorithm Clarification

I am trying to implement a recursive nearest neighbour algorithm for a 2d-Tree.
Recursion (and unwinding recursion) is still kind of confusing for me and the best pseudocode I have found is from this StackOverflow question:
2D KD Tree and Nearest Neighbour Search
However, the answer uses a "median" value, which I am not sure how to compute. Also, the Wikipedia article on k-d trees has nearest neighbour pseudocode that does not use a median value.
I would like to know if it is possible to construct a recursive version of the nearest neighbours algorithm without using a median value. If anyone can provide me with pseudocode for this, I would be grateful.
If you are desperate to avoid using the median, you can use the mean. Here is the simple approach:
Example 1: What is the Mean of these numbers?
6, 11, 7
Add the numbers: 6 + 11 + 7 = 24
Divide by how many numbers (there are 3 numbers): 24 / 3 = 8
The Mean is 8
However, I highly recommend going for the median, since the dimensions allow it in your case.
Example: find the Median of 12, 3 and 5
Put them in order:
3, 5, 12
The middle number is 5, so the median is 5.
Source
You do not really need to sort them. Pseudo-sorting is enough, by using Quickselect for example.
In C++, for example, you could use nth_element() to efficiently find the median. You can see my question here, where I needed the median for general dimensions. In the 2D case, it can surely be simplified.
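If it helps, here is a hedged sketch of that selection step in Python, using numpy.argpartition as a rough analogue of nth_element (the function name and the (n, 2) point layout are assumptions of mine):

    import numpy as np

    def median_split(points, axis):
        """Pick the median point along one axis for a kd-tree split.
        np.argpartition does a quickselect-style partial rearrangement: only the
        k-th order statistic is guaranteed to land in its sorted position, so no
        full sort is needed, much like std::nth_element in C++."""
        pts = np.asarray(points)               # shape (n, 2) in the 2D case
        k = len(pts) // 2
        order = np.argpartition(pts[:, axis], k)
        median = pts[order[k]]
        left = pts[order[:k]]                  # coordinates <= median on this axis
        right = pts[order[k + 1:]]             # coordinates >= median on this axis
        return median, left, right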

Algorithm to find unique set of items, one item from each of a set of sets

Assume you have a set of people J, and you need to take a photo of each person. There is only one photographer, and the photographer has a finite set of times T (|T| > |J|) available for taking photographs. At any given time t drawn from T, the photographer can take only one photograph. Each person in J is only available to have their photograph taken during some subset of the times in T, though each person has been asked to nominate at least one time they are available. Essentially, based on each person's availability, the photographer wants to try to assign one person to each available timeslot in T such that everybody gets their photo taken. Is there a polynomial-time algorithm to solve this problem? If not, what non-polynomial-time problem reduces to this problem in polynomial time, i.e. how can it be shown that this problem is not in P?
Example:
The photographer is available at times [1, 12, 15, 33, 45, 77].
Person A is available at times [12, 33].
Person B is available at times [1, 12].
Person C is available at times [1, 12].
We can photograph everyone with the selection:
Person A: 33
Person B: 1
Person C: 12
If we were to start by choosing A: 12, B: 1, we would not be able to find a place for C, i.e. we'd have to backtrack and reassign A to 33.
Essentially I'm looking for a polynomial time algorithm to find an appropriate assignment of times if one exists, and otherwise be able to report that an appropriate assignment does not exist.
This can be modelled as an Assignment Problem (or Bipartite Graph matching problem).
The sources are the people and the destinations are the times available to the photographer. The cost matrix can be built by setting the cost of a person being unavailable at a time to 1, and being available to 0.
If the matrix is not square, dummy people can be added with all their costs set to 0. If the number of people is greater than the number of times, the assignment is impossible.
If the resulting cost of an optimal solution is non-zero, it means that the assignment is not possible.
It can be solved using the Hungarian algorithm in polynomial time.
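For concreteness, here is a small sketch of that construction on the example above, assuming SciPy is acceptable (scipy.optimize.linear_sum_assignment solves the assignment problem and also accepts rectangular cost matrices, so the dummy people are handled for you):

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    times = [1, 12, 15, 33, 45, 77]
    people = {"A": [12, 33], "B": [1, 12], "C": [1, 12]}

    # cost[p][t] = 0 if person p is available at time t, else 1
    names = sorted(people)
    cost = np.array([[0 if t in people[p] else 1 for t in times] for p in names])

    rows, cols = linear_sum_assignment(cost)
    if cost[rows, cols].sum() == 0:            # zero total cost => everyone fits
        for r, c in zip(rows, cols):
            print(f"Person {names[r]}: {times[c]}")
    else:
        print("No feasible assignment exists")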
Abhishek's answer will work for this problem, but I wanted to add an alternative that I've found to be faster. Abhishek has already mentioned (in passing) bipartite matching, to which the Hopcroft-Karp algorithm is relevant. The Hopcroft-Karp algorithm finds a maximum cardinality matching and runs in O(E * sqrt(V)) time versus O(n^3) for the Hungarian algorithm. "Maximum cardinality matching" basically means it finds the maximum number of assignments that can be made: in my earlier example, the maximum number of people that can be scheduled for a photograph, given everyone's availability and the photographer's available time slots. So if the returned maximum cardinality equals the number of people, you know that an assignment is possible for everyone.
Note that the reason we can use the Hopcroft-Karp algorithm here is that we don't care about edge weights: it makes no difference who gets assigned to which timeslot, as long as everyone gets some timeslot. We would need something like the Hungarian algorithm if we did care about weights, e.g. if every person assigned an "inconvenience factor" to each of their available timeslots, since the Hungarian algorithm is designed to optimize the result under those conditions, whereas Hopcroft-Karp only determines how many assignments are possible at all.
In practice I started off using the Hungarian algorithm and it took ~30 seconds to execute on my particular dataset. After switching it out for the Hopcroft-Karp algorithm, I could produce the same result in < 1 second.
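For completeness, here is a sketch of the same toy example with the Hopcroft-Karp implementation in NetworkX; the graph construction and the node tagging are my own choices:

    import networkx as nx
    from networkx.algorithms.bipartite import hopcroft_karp_matching

    times = [1, 12, 15, 33, 45, 77]
    people = {"A": [12, 33], "B": [1, 12], "C": [1, 12]}

    # Bipartite graph: person nodes on one side, time nodes on the other.
    G = nx.Graph()
    G.add_nodes_from(people, bipartite=0)
    G.add_nodes_from((("t", t) for t in times), bipartite=1)  # tag times to avoid clashes
    for person, available in people.items():
        G.add_edges_from((person, ("t", t)) for t in available)

    matching = hopcroft_karp_matching(G, top_nodes=people)
    if all(p in matching for p in people):     # maximum matching covers everyone
        for p in people:
            print(f"Person {p}: {matching[p][1]}")
    else:
        print("Some people cannot be scheduled")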

Similarity algorithm (mathematics) of sampled signals

Let's say I have sampled some signals and constructed a vector of the samples of each. What is the most efficient way to calculate the (dis)similarity of those vectors? Note that the offset of the sampling must not count; for instance, sample vectors of sine and cosine signals should be considered similar, since as cyclic sequences they are exactly the same.
There is a simple way of doing this by "rolling" the elements of one vector, calculating the Euclidean distance at each roll position, and finally choosing the best match (smallest distance). This solution works fine, since my only goal is to find the most similar sample vector for an input signal from a pool of vectors.
However, the solution above is also very inefficient as the dimension of the vectors grows. Compared to "non-sequential vector matching" of N-dimensional vectors, the sequential kind needs N times as many vector distance calculations.
Is there any higher/better mathematics/algorithms to compare two sequences with differing offsets?
Use case for this would be in sequence similarity visualization with SOM.
EDIT: How about comparing each vector's integral and entropy? Both of them are "sequence-safe" (= time-invariant?) and very fast to calculate, but I doubt they alone are enough to distinguish all possible signals from each other. Is there something else that could be used in addition to these?
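For what it's worth, a tiny sketch of what those two shift-invariant features might look like (the histogram binning for the entropy is an arbitrary choice of mine):

    import numpy as np

    def shift_invariant_features(samples, bins=32):
        """Discrete 'integral' (plain sum) and a histogram entropy of a sample
        vector. Both stay the same when the samples are circularly shifted."""
        x = np.asarray(samples, dtype=float)
        integral = x.sum()
        hist, _ = np.histogram(x, bins=bins)
        p = hist / hist.sum()
        p = p[p > 0]                           # drop empty bins to avoid log(0)
        entropy = -(p * np.log2(p)).sum()
        return integral, entropy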
EDIT2: Victor Zamanian's reply isn't directly the answer, but it gave me an idea that might be. The solution might be to sample the original signals by calculating their Fourier transform coefficients and inserting those into the sample vectors. The first element (X_0) is the mean or "level" of the signal, and the following ones (X_n) can be directly used to compare similarity with some other sample vector. The smaller n is, the more effect it should have on the similarity calculation, since the more coefficients are calculated with the FT, the more accurate a representation of the signal we get. This brings up a bonus question:
Let's say we have FT-6 sampled vectors (values just fell out of the sky)
X = {4, 15, 10, 8, 11, 7}
Y = {4, 16, 9, 15, 62, 7}
The similarity value of these vectors could MAYBE be calculated like this: |16 - 15| + (|10 - 9| / 2) + (|8 - 15| / 3) + (|11 - 62| / 4) + (|7 - 7| / 5)
Those divisors (the bolded parts above) are the bonus question. Is there some set of coefficients, or some other way, to know how much effect each FT coefficient should have on the similarity relative to the other coefficients?
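Taken literally, the proposed measure would be something like the snippet below: it is just a transcription of the formula above (skip the level coefficient, weight the n-th absolute difference by 1/n), and whether 1/n is the right weighting is exactly the bonus question.

    def ft_distance(x, y):
        """Dissimilarity of two FT coefficient vectors as proposed above."""
        return sum(abs(a - b) / n
                   for n, (a, b) in enumerate(zip(x[1:], y[1:]), start=1))

    # ft_distance([4, 15, 10, 8, 11, 7], [4, 16, 9, 15, 62, 7])
    # == |16-15|/1 + |10-9|/2 + |8-15|/3 + |11-62|/4 + |7-7|/5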
If I understand your question correctly, maybe you would be interested in some type of cross-correlation implementation? I'm not sure if it's the most efficient thing to do or fits the purpose, but I thought I would mention it since it seems relevant.
Edit: Maybe a Fast Fourier Transform (FFT) could be an option? Fourier transforms are great for distinguishing signals from each other, and I believe they are helpful for finding similar signals too. E.g. a sine and a cosine wave have identical magnitude spectra and differ only in phase (the real vs. imaginary parts of the coefficients). FFTs can be done in O(N log N).
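To sketch how that helps with the offset problem from the question (NumPy assumed, function name mine): the best circular alignment, and therefore the minimal "rolled" Euclidean distance, can be read off an FFT-based circular cross-correlation in O(N log N), instead of trying all N shifts explicitly.

    import numpy as np

    def min_rolled_distance(x, y):
        """Smallest Euclidean distance between x and any circular shift of y."""
        x = np.asarray(x, dtype=float)
        y = np.asarray(y, dtype=float)
        # corr[k] = dot(x, roll(y, k)) for every shift k at once.
        corr = np.fft.ifft(np.fft.fft(x) * np.conj(np.fft.fft(y))).real
        # ||x - roll(y, k)||^2 = ||x||^2 + ||y||^2 - 2 * corr[k], so the best
        # shift is simply the one with the largest correlation.
        best_shift = int(np.argmax(corr))
        dist_sq = np.dot(x, x) + np.dot(y, y) - 2.0 * corr[best_shift]
        return np.sqrt(max(dist_sq, 0.0)), best_shift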
Google "translation invariant signal classificiation" and you'll find things like these.

Project Euler #75: ways to optimize the algorithm

I'm looking at ways to optimize my algorithm for solving Project Euler #75. Two things I have done so far are:
Only check even values of L, since it is easy to prove that the perimeter of an integer-sided right triangle is always even.
Store L values that have been verified to have exactly one way to form an integer-sided right triangle. Later on, when checking a new L value, I look for divisors of L that have already been verified to have this property. If there are 2 or more such divisors, the value is skipped. E.g. 12, 30 and 40 are stored (24, 36, etc. are not stored because they are really just enlarged versions of 12), so when I see 60 or 120, I can quickly determine that they should be skipped.
However, my algorithm is still not quick enough. Do you have other suggestions or links to relevant articles? Thanks.
http://en.wikipedia.org/wiki/Pythagorean_triple
and
http://en.wikipedia.org/wiki/Formulas_for_generating_Pythagorean_triples
EDIT
I just solved the problem using one of these formulas. If you need an extra hint, just post a comment.
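For anyone reading along, here is a hedged sketch of the kind of generation those articles describe: Euclid's formula a = m^2 - n^2, b = 2mn, c = m^2 + n^2 produces every primitive triple from coprime m > n of opposite parity, and every other triple is a multiple of a primitive one. Turning the per-perimeter counts into the final answer is deliberately left out.

    from math import gcd

    def perimeter_counts(limit):
        """counts[L] = number of distinct integer-sided right triangles whose
        perimeter is L, for all L <= limit."""
        counts = [0] * (limit + 1)
        m = 2
        while 2 * m * (m + 1) <= limit:        # smallest perimeter for this m (n = 1)
            for n in range(1, m):
                if (m - n) % 2 == 1 and gcd(m, n) == 1:
                    p = 2 * m * (m + n)        # perimeter of the primitive triple
                    for kp in range(p, limit + 1, p):
                        counts[kp] += 1        # every multiple of the primitive triple
            m += 1
        return counts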

Sort array in ascending order while minimizing "cost"

I'm taking comp 2210 (Data Structures) next semester and I've been doing the homework for the summer semester that is posted online. Until now, I've had no problems doing the assignments. Take a look at assignment 4 below, and see if you can give me a hint as to how to approach it. Please don't provide a complete algorithm, just an approach. Thanks!
A “costed sort” is an algorithm in which a sequence of values must be arranged in ascending order. The sort is
carried out by interchanging the position of two values one at a time until the sequence is in the proper order. Each
interchange incurs a cost, which is calculated as the sum of the two values involved in the interchange. The total
cost of the sort is the sum of the cost of the interchanges.
For example, suppose the starting sequence were {3, 2, 1}. One possible series of interchanges is
Interchange 1: {3, 1, 2} interchange cost = 3
Interchange 2: {1, 3, 2} interchange cost = 4
Interchange 3: {1, 2, 3} interchange cost = 5,
giving a total cost of 12.
You are to write a program that determines the minimal cost to arrange a specific sequence of numbers.
Edit: The professor does not allow brute forcing.
If you want to surprise your professor, you could use Simulated Annealing. Then again, if you manage that, you can probably skip a few courses :). Note that this algorithm will only give an approximate answer.
Otherwise: try a Backtracking algorithm, or Branch and Bound. These will both find the optimal answer.
What do you mean "brute forcing?" Do you mean "try all possible combinations and select the cheapest?" Just checking.
I think "branch and bound" is what you're looking for - check any source on algorithms. It is "like" brute force, except as you try a sequence of moves, as soon as that sequence of moves is less optimal than any other sequence of moves tried so far, you can abandon the sequence that got you to that point - the cost. This is one flavor of the "backtracking" mentioned above.
My preferred language for doing this would be Prolog but I'm weird.
Simulated Annealing is a PROBABILISTIC algorithm - if the solution space has local minima, you may be trapped in one and get what you think is the right answer but isn't. There are ways around that, and the literature about it is easy to find, but I don't agree that it's the tool you want.
You could try the related genetic algorithms too if that's the way you want to go.
Have you learned trees? You could create a tree with all possible changes leading to the desired result. The trick, of course, is to avoid creating the whole tree -- particularly when a part of it is obviously not the best solution, right?
I think the appropriate approach is to think hard about what defining properties a minimal "cost" sort has. Then figure out the cost by simulating this ideal sort. The key element here is you don't have to implement a general minimal cost sorting algorithm.
For example, let's say the defining property of a minimal cost sort is that every exchange puts at least one of the exchanged elements in its sorted position (I don't know if this is true). Every exchange-based sort would love to have this property, but it's not easy (possibly not even possible) in the general case. However, you can easily create a program that takes an unsorted array, takes the sorted version (which itself can be generated by a non-optimal algorithm), and then uses this information to decide the minimum cost to achieve the sorted array from the unsorted array.
Description
I think the cheapest way to do this is to swap the cheapest misplaced item with the item that belongs in its spot. I believe this reduces cost by moving the most expensive things just once. If there are n elements out of place, then there will be at most n-1 swaps to put them in place, at a cost of (n-1) * (cost of the least item) + (cost of all the other out-of-place items).
If the globally cheapest element is not misplaced, and the spread between it and the cheapest misplaced one is great enough, it can be cheaper to swap the globally cheapest one out of its correct place and use it for the swaps instead. The cost is then (n-1) * (globally cheapest item) + (cost of all out-of-place items) + (cost of the cheapest out-of-place item).
Example
For [4,1,2,3], this algorithm exchanges (1,2) to produce:
[4,2,1,3]
and then swaps (3,1) to produce:
[4,2,3,1]
and then swaps (4,1) to produce:
[1,2,3,4]
Notice that each misplaced item in [2,3,4] is moved only once and is swapped with the lowest cost item.
Code
Ooops: "Please don't provide a complete algorithm, just an approach." Removed my code.
In an effort to just get you going on this, the following may not make complete sense.
Determine all possible moves and the cost of each move, and store those somehow. Perform the least expensive move, then determine the moves that can be performed from this new arrangement, storing those with the rest of your stored moves. Perform the least expensive of those, and so on, until the array is sorted.
I love solving things like this.
This problem is also known as the Silly Sort problem in some ACM contests. Take a look at this solution using Divide & Conquer.
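Since the OP asked for an approach rather than a full algorithm, treat the following only as a hedged sketch of the usual Silly Sort analysis: decompose the permutation into cycles and, for each cycle, take the cheaper of "rotate the cycle using its own minimum" and "borrow the globally cheapest element", much like the two cases described a few answers up (assumes distinct values).

    def minimal_sort_cost(values):
        """Minimum total cost to sort, where swapping x and y costs x + y."""
        n = len(values)
        target = {v: i for i, v in enumerate(sorted(values))}
        global_min = min(values)
        seen = [False] * n
        total = 0
        for start in range(n):
            if seen[start]:
                continue
            # Walk one cycle of the permutation "current index -> sorted index".
            cycle, i = [], start
            while not seen[i]:
                seen[i] = True
                cycle.append(values[i])
                i = target[values[i]]
            k = len(cycle)
            if k <= 1:
                continue                       # already in place, costs nothing
            s, m = sum(cycle), min(cycle)
            own = s + (k - 2) * m              # fix the cycle with its own minimum
            borrowed = s + m + (k + 1) * global_min   # swap the global minimum in and out
            total += min(own, borrowed)
        return total

For {3, 2, 1} this gives 4 (swap 3 and 1 directly), and for [4, 1, 2, 3] it gives 12, matching the swap sequence in the example above.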
Try different sorting algorithms on the same input data and print the minimum.
