Effective clustering algorithm

Effective clustering algorithm - algorithm

I need help (preferably a full algorithm, but any hint or reference will be appreciated) with the following algorithmic problem:
We have a set of N elements. I can define a distance between any two elements, which satisfies the metric conditions. I need to group these elements into disjoint subsets (each element belonging to exactly one subset) according to the following rules:
The maximum distance between any two elements in each subset does not exceed specified threshold.
The number of the subsets is as small as possible.
If there is more than one possible grouping satisfying conditions (1) and (2), the maximum distance between any two elements in each subset should be as small as possible.
Example:
Assume we have the following points on a number axis: 1, 11, 12, 13, 23. The distance is simple the the difference between the points. Our distance threshold is 10. The two possible grouping satisfying conditions (1) and (2) are: (1, 11), (12), (13, 23) or (1), (11, 12, 13), (23). However, the condition (3) says that the latter grouping is the correct one.

In 1 dimensional data, sort your data, and divide into the desired number of bins, then move bin boundaries to optimize.
It gets more interesting in higher dimensionality. There, the problem will be NP hard. So finding the optimum will be expensive. You can indeed use clustering here: use complete-linkage clustering. For a O(n²) and O(n) memory approach, try CLINK. But in my experience, you will need to run this algorithm several times, on shuffled data, to get a good solution.

Related

Sorting algorithm based on subset inversion

I'm looking for a sorting algorithm based on subset inversion. It's like pancake sort, only instead of taking all the pancakes on top of the spatula, you can just invert any subset you want. Length of the subset doesn't matter.
Like this:
http://www.yourgenome.org/sites/default/files/illustrations/diagram/dna_mutations_inversion_yourgenome.png
So we can't simply swap numbers without inverting everything in between.
We're doing this to determine how one subspecies of fruitfly can mutate into the other. Both have the same genes but in a different order. The second subspecies' genome is 'sorted', i.e. the gene numbers are 1-25. The first subspecies genome is unsorted. Hence, we're looking for a sorting algorithm.
This is the "genome" we're looking at (though we should be able to have this work on all lists of numbers):
[23, 1, 2, 11, 24, 22, 19, 6, 10, 7, 25, 20, 5, 8, 18, 12, 13, 14, 15, 16, 17, 21, 3, 4, 9];
We're looking at two separate problems:
1) To sort a list of 25 numbers with the least amount of inversions
2) To sort a list of 25 numbers with the least amount of numbers moved
We also want to establish both upper and lower bounds for both.
We've already found a way to sort like this by just going from left to right, searching for the next lowest value and inverting everything in between, but we're absolutely certain we should be able to do this faster. However, we still haven't found any other methods so I'm asking for your help!
UPDATE: the method we currently use is based on the above method
but instead works both ways. It looks at the next elements needed
for both ends (e.g. 1 and 25 at the beginning) and then calculates
which inversion would be cheapest. All values at the ends can be
ignored for the rest of the algorithm because they get put into the
correct place immediately. Our first method took 18/19 steps and 148
genes, and this one does it in 17 steps and 101 genes. For both
optimalisation tactics (the two mentioned above), this is a better
method. It is however not cheaper in terms of code and processing.
Right now, we're working in Python because we have most experience with that, but I'd be happy with any pseudocode ideas on how we can more efficiently tackle this. If you think another language might be better suited, please let me know. Pseudocode, ideas, thoughts and actual code are all welcome!
Thanks in advance!

Regarding the first question: Do you know (and care about) which of the two strands the genes are on?
If so, you're in luck: This is called the inversion distance between signed permutations problem, and there is a linear-time algorithm for it: http://www.ncbi.nlm.nih.gov/pubmed/11694179. I haven't looked at the details.
If not, then unfortunately (as described on p. 2 of that paper) the problem is NP-hard, so it's very unlikely that any algorithm exists that is efficient (polynomial-time) in the worst case.
Regarding the second question: Assuming you mean that you want to find the minimum number of swaps needed to sort a list of numbers, you should be able to find solutions to this by searching here on SO and elsewhere. I think this is a clear and concise explanation. You can also use the optimal solution to this problem to get an upper bound for your first question: Any swap of positions i and j can be simulated using the two interval reversals (i, j) and (i+1, j-1). (This upper bound might be very bad, though, and in particular could be worse than your existing greedy algorithm.)

I think what you're looking for for the second question is the minimum number of swaps of adjacent elements to sort a sequence, which is equal to the number of inversions in the sequence (where a[i] > a[j] and i < j).
The first question seems quite a bit more complicated to me. One potential heuristic might be to think of the subset inversion as similar to the adjacent swap of more than one element. For example, if you've managed to get a sequence to this position,
5,6,1,2,3,4,7,8
we can "adjacent swap" indexes [0,1] with [2,3] (so inverting [0,1,2,3]),
2,1,6,5,3,4,7,8
and then [2,3] with [4,5] (inverting [2,3,4,5]),
2,1,4,3,5,6,7,8
and arrive at a sequence that now has significantly less element inversions, meaning less single adjacent swaps are needed to now complete the sort.
So maybe attempting to quantify inversions (in the sense of a[i] > a[j] and i < j) of sections rather than single elements could help move in the direction of estimating or building a method for the first question.

Dynamic algorithm to multiply elements in a sequence two at a time and find the total

I am trying to find a dynamic approach to multiply each element in a linear sequence to the following element, and do the same with the pair of elements, etc. and find the sum of all of the products. Note that any two elements cannot be multiplied. It must be the first with the second, the third with the fourth, and so on. All I know about the linear sequence is that there are an even amount of elements.
I assume I have to store the numbers being multiplied, and their product each time, then check some other "multipliable" pair of elements to see if the product has already been calculated (perhaps they possess opposite signs compared to the current pair).
However, by my understanding of a linear sequence, the values must be increasing or decreasing by the same amount each time. But since there are an even amount of numbers, I don't believe it is possible to have two "multipliable" pairs be the same (with potentially opposite signs), due to the issue shown in the following example:
Sequence: { -2, -1, 0, 1, 2, 3 }
Pairs: -2*-1, 0*1, 2*3
Clearly, since there are an even amount of pairs, the only case in which the same multiplication may occur more than once is if the elements are increasing/decreasing by 0 each time.
I fail to see how this is a dynamic programming question, and if anyone could clarify, it would be greatly appreciated!

A quick google for define linear sequence gave
A number pattern which increases (or decreases) by the same amount each time is called a linear sequence. The amount it increases or decreases by is known as the common difference.
In your case the common difference is 1. And you are not considering any other case.
The same multiplication may occur in the following sequence
Sequence = {-3, -1, 1, 3}
Pairs = -3 * -1 , 1 * 3
with a common difference of 2.
However this is not necessarily to be solved by dynamic programming. You can just iterate over the numbers and store the multiplication of two numbers in a set(as a set contains unique numbers) and then find the sum.

Probably not what you are looking for, but I've found a closed solution for the problem.
Suppose we observe the first two numbers. Note the first number by a, the difference between the numbers d. We then count for a total of 2n numbers in the whole sequence. Then the sum you defined is:
sum = na^2 + n(2n-1)ad + (4n^2 - 3n - 1)nd^2/3
That aside, I also failed to see how this is a dynamic problem, or at least this seems to be a problem where dynamic programming approach really doesn't do much. It is not likely that the sequence will go from negative to positive at all, and even then the chance that you will see repeated entries decreases the bigger your difference between two numbers is. Furthermore, multiplication is so fast the overhead from fetching them from a data structure might be more expensive. (mul instruction is probably faster than lw).

How to code the maximum set packing algorithm?

Suppose we have a finite set S and a list of subsets of S. Then, the set packing problem asks if some k subsets in the list are pairwise disjoint .
The optimization version of the problem, maximum set packing, asks for the maximum number of pairwise disjoint sets in the list.
http://en.wikipedia.org/wiki/Set_packing
So, Let S = {1,2,3,4,5,6,7,8,9,10}
and `Sa = {1,2,3,4}`
and `Sb = {4,5,6}`
and `Sc = {5,6,7,8}`
and `Sd = {9,10}`
Then the maximum number of pairwise disjoint sets are 3 ( Sa, Sc, Sd )
I could not find any articles about the algorithm involved. Can you shed some light on the same?
My approach:
Sort the sets according to the size. Start from the set of the smallest size. If no element of the next set intersects with the current set, then we unite the set and increase the count of maximum sets. Does this sound good to you? Any better ideas?

As hivert pointed out, this problem is NP-hard, so there's no efficient way to do this. However, if your input is relatively small, you can still pull it off. Exponential doesn't mean impossible, after all. It's just that exponential problems become impractical very quickly, as the input size grows. But for something like 25 sets, you can easily brute force it.
Here's one approach. Let's say you have n subsets, called S0, S1, ..., etc. We can try every combination of subsets, and pick the one with maximum cardinality. There are only 2^25 = 33554432 choices, so this is probably reasonable enough.
An easy way to do this is to notice that any non-negative number strictly below 2^N represents a particular choice of subsets. Look at the binary representation of the number, and choose the sets whose indices correspond to the bits that are on. So if the number is 11, the 0th, 1st and 3rd bits are on, and this corresponds to the combination [S0, S1, S3]. Then you just verify that these three sets are in fact disjoint.
Your procedure is as follows:
Iterate i from 0 to 2^N - 1
For each value of i, use the bits that are on to figure out the corresponding combination of subsets.
If those subsets are pairwise disjoint, update your best answer with this combination (i.e., use this if it is bigger than your current best).
Alternatively, use backtracking to generate your subsets. The two approaches are equivalent, modulo implementation tradeoffs. Backtracking will have some stack overhead, but can cut off entire lines of computation if you check disjointness as you go. For example, if S1 and S2 are not disjoint, then it will never bother with any bigger combinations containing those two, saving some time. The iterative method can't optimize itself in this way, but is fast and efficient because of the bitwise operations and tight loop.
The only nontrivial matter here is how to check if the subsets are pairwise disjoint. There are all sorts of tricks you can pull here as well, depending on the constraints.
A simple approach is to start with an empty set structure (pick whatever you want from the language of your choice) and add elements from each subset one by one. If you ever hit an element that's already in the set, then it occurs in at least two subsets, and you can give up on this combination.
If the original set S has m elements, and m is relatively small, you can map each of them to the range [0, m-1] and use bitmasks for each set. So if m <= 64, you can use a Java long to represent each subset. Turn on all the bits that correspond to the elements in the subset. This allows blazing fast set operation, because of the speed of bitwise operations. Bitwise AND corresponds to set intersection, and bitwise OR is a union. You can check if two subsets are disjoint by seeing if the intersection is empty (i.e., ANDing the two bitmasks gives you 0).
If you don't have so few elements, you can still avoid repeating the set intersections multiple times. You have very few sets, so precompute which ones are disjoint at the start. You can just store a boolean matrix D, such that D[i][j] = true iff i and j are disjoint. Then you just look up all pairs in a combination to verify pairwise disjointness, rather than doing real set operations.

You can solve the set packing problem searching a Maximum independent set. You encode your problem as follows:
for each set you put a vertex
you put an edge between two vertex if they share a common number.
Then you wan't a maximum set of vertex without two having two related vertex. Unfortunately this is a NP-Hard problem. Any know algorithm is exponential.

Algorithm to generate k element subsets in order of their sum

If I have an unsorted large set of n integers (say 2^20 of them) and would like to generate subsets with k elements each (where k is small, say 5) in increasing order of their sums, what is the most efficient way to do so?
Why I need to generate these subsets in this fashion is that I would like to find the k-element subset with the smallest sum satisfying a certain condition, and I thus would apply the condition on each of the k-element subsets generated.
Also, what would be the complexity of the algorithm?
There is a similar question here: Algorithm to get every possible subset of a list, in order of their product, without building and sorting the entire list (i.e Generators) about generating subsets in order of their product, but it wouldn't fit my needs due to the extremely large size of the set n
I intend to implement the algorithm in Mathematica, but could do it in C++ or Python too.

If your desired property of the small subsets (call it P) is fairly common, a probabilistic approach may work well:
Sort the n integers (for millions of integers i.e. 10s to 100s of MB of ram, this should not be a problem), and sum the k-1 smallest. Call this total offset.
Generate a random k-subset (say, by sampling k random numbers, mod n) and check it for P-ness.
On a match, note the sum-total of the subset. Subtract offset from this to find an upper bound on the largest element of any k-subset of equivalent sum-total.
Restrict your set of n integers to those less than or equal to this bound.
Repeat (goto 2) until no matches are found within some fixed number of iterations.
Note the initial sort is O(n log n). The binary search implicit in step 4 is O(log n).
Obviously, if P is so rare that random pot-shots are unlikely to get a match, this does you no good.

Even if only 1 in 1000 of the k-sized sets meets your condition, That's still far too many combinations to test. I believe runtime scales with nCk (n choose k), where n is the size of your unsorted list. The answer by Andrew Mao has a link to this value. 10^28/1000 is still 10^25. Even at 1000 tests per second, that's still 10^22 seconds. =10^14 years.
If you are allowed to, I think you need to eliminate duplicate numbers from your large set. Each duplicate you remove will drastically reduce the number of evaluations you need to perform. Sort the list, then kill the dupes.
Also, are you looking for the single best answer here? Who will verify the answer, and how long would that take? I suggest implementing a Genetic Algorithm and running a bunch of instances overnight (for as long as you have the time). This will yield a very good answer, in much less time than the duration of the universe.

Do you mean 20 integers, or 2^20? If it's really 2^20, then you may need to go through a significant amount of (2^20 choose 5) subsets before you find one that satisfies your condition. On a modern 100k MIPS CPU, assuming just 1 instruction can compute a set and evaluate that condition, going through that entire set would still take 3 quadrillion years. So if you even need to go through a fraction of that, it's not going to finish in your lifetime.
Even if the number of integers is smaller, this seems to be a rather brute force way to solve this problem. I conjecture that you may be able to express your condition as a constraint in a mixed integer program, in which case solving the following could be a much faster way to obtain the solution than brute force enumeration. Assuming your integers are w_i, i from 1 to N:
min sum(i) w_i*x_i
x_i binary
sum over x_i = k
subject to (some constraints on w_i*x_i)
If it turns out that the linear programming relaxation of your MIP is tight, then you would be in luck and have a very efficient way to solve the problem, even for 2^20 integers (Example: max-flow/min-cut problem.) Also, you can use the approach of column generation to find a solution since you may have a very large number of values that cannot be solved for at the same time.
If you post a bit more about the constraint you are interested in, I or someone else may be able to propose a more concrete solution for you that doesn't involve brute force enumeration.

Here's an approximate way to do what you're saying.
First, sort the list. Then, consider some length-5 index vector v, corresponding to the positions in the sorted list, where the maximum index is some number m, and some other index vector v', with some max index m' > m. The smallest sum for all such vectors v' is always greater than the smallest sum for all vectors v.
So, here's how you can loop through the elements with approximately increasing sum:
sort arr
for i = 1 to N
for v = 5-element subsets of (1, ..., i)
set = arr{v}
if condition(set) is satisfied
break_loop = true
compute sum(set), keep set if it is the best so far
break if break_loop
Basically, this means that you no longer need to check for 5-element combinations of (1, ..., n+1) if you find a satisfying assignment in (1, ..., n), since any satisfying assignment with max index n+1 will have a greater sum, and you can stop after that set. However, there is no easy way to loop through the 5-combinations of (1, ..., n) while guaranteeing that the sum is always increasing, but at least you can stop checking after you find a satisfying set at some n.

This looks to be a perfect candidate for map-reduce (http://en.wikipedia.org/wiki/MapReduce). If you know of any way of partitioning them smartly so that passing candidates are equally present in each node then you can probably get a great throughput.
Complete sort may not really be needed as the map stage can take care of it. Each node can then verify the condition against the k-tuples and output results into a file that can be aggregated / reduced later.
If you know of the probability of occurrence and don't need all of the results try looking at probabilistic algorithms to converge to an answer.

Divide list into two equal parts algorithm

Related questions:
Algorithm to Divide a list of numbers into 2 equal sum lists
divide list in two parts that their sum closest to each other
Let's assume I have a list, which contains exactly 2k elements. Now, I'm willing to split it into two parts, where each part has a length of k while trying to make the sum of the parts as equal as possible.
Quick example:
[3, 4, 4, 1, 2, 1] might be splitted to [1, 4, 3] and [1, 2, 4] and the sum difference will be 1
Now - if the parts can have arbitrary lengths, this is a variation of the Partition problem and we know that's it's weakly NP-Complete.
But does the restriction about splitting the list into equal parts (let's say it's always k and 2k) make this problem solvable in polynomial time? Any proofs to that (or a proof scheme for the fact that it's still NP)?

It is still NP complete. Proof by reduction of PP (your full variation of the Partition problem) to QPP (equal parts partition problem):
Take an arbitrary list of length k plus additional k elements all valued as zero.
We need to find the best performing partition in terms of PP. Let us find one using an algorithm for QPP and forget about all the additional k zero elements. Shifting zeroes around cannot affect this or any competing partition, so this is still one of the best performing unrestricted partitions of the arbitrary list of length k.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio