How to perform one-dimensional k-means clustering using Ruby? - ruby

My question:
I have searched through available Ruby gems to find one that performs k-means clustering. I've found quite a few: kmeans, kmeans-clustering, reddavis-k_means and k_means_pp. My problem is that none of the gems deals with one-dimensional k-means clustering. They all expect input like this:
[[1, 2], [3, 4], [5, 6]]
My input looks like this:
[1, 2, 3, 4, 5, 6]
Hence my question: How do I perform one-dimensional k-means clustering using Ruby?
The context (my task):
I have 100 input values:
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 5, 5, 5, 5, 5, 8, 8, 10, 16, 18, 22, 22, 35, 50, 50
Each value represents a response time, i.e. the number of minutes it took for some customer service agent to respond to an email from a customer. So the first value 0 indicates that the customer only waited 0 minutes for a response.
I need to find out how many fast, medium-fast and slow response time instances there are. In other words, I want to cut my input values up into 3 pools, and then count how many there are in each pool.
The complicating factor is that I have to figure out where to make the cuts based on the overall steepness of the slope. There is no fixed definition of fast, medium-fast and slow. The first cut (between fast and medium-fast) should occur where the steepness of the slope starts to increase more drastically than before. The second cut (between medium-fast and slow) should occur when an even more dramatic increase in steepness occurs.
Here is a graphical representation of the input values.
In the above example, common sense would probably define fast as 0-3, because there are many instances of 0, 1, 2, and 3. 4-8 or 4-10 looks like common sense choices for medium-fast. But how to determine something like this mathematically? If the response times were generally faster, then the customers would be expecting this, so an even smaller increase towards the end should trigger the cut.
Finishing notes:
I did find the gem davidrichards-kmeans that deals with one-dimensional k-means clustering, but it doesn't seem to work properly (the example code raises a syntax error).

k-means is the wrong tool for this job anyway.
It's not designed for fitting an exponential curve.
Here is a much more sound proposal for you:
Look at the plot, mark the three points, and then you have your three groups.
Or look at quantiles... Report the median response time, the 90% quantile, and the 99% quantile...
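The quantile suggestion is easy to sketch in plain Ruby (no gem needed). Below is a hand-rolled nearest-rank quantile helper — a hypothetical function, not part of any of the gems mentioned — applied to the 100 response times from the question:

```ruby
# Nearest-rank quantile on a pre-sorted array (hypothetical helper).
def quantile(sorted, q)
  sorted[((sorted.size - 1) * q).round]
end

# the 100 response times from the question
times = [0] * 28 + [1] * 32 + [2] * 9 + [3] * 14 + [4] * 2 + [5] * 5 +
        [8] * 2 + [10, 16, 18] + [22] * 2 + [35] + [50] * 2
sorted = times.sort
median = quantile(sorted, 0.5)  # => 1
p90    = quantile(sorted, 0.9)  # => 5
p99    = quantile(sorted, 0.99) # => 50
```

"Half of the customers waited at most 1 minute, 90% at most 5 minutes, 99% at most 50" is arguably a more honest summary of this data than any three-way cut.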
Clustering is about structure discovery in multivariate data. It's probably not what you want it to be, sorry.
If you insist on trying k-means, try encoding the data as
[[1], [2], [3], [4], [5]]
and check if the results are at least a little bit what you want them to be (also remember that k-means is randomized. Running it multiple times may yield very different results).
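If you do want to see what k-means makes of this data, one dimension is simple enough that you don't need a gem at all. Here is a minimal Lloyd's-algorithm sketch (`kmeans_1d` is a hypothetical helper, not one of the gems above); it seeds the centroids at evenly spaced order statistics to stay deterministic, whereas real k-means implementations are usually randomly seeded:

```ruby
# Minimal 1-D Lloyd's algorithm sketch. Assumes k >= 2.
def kmeans_1d(values, k, max_iters = 50)
  sorted = values.sort
  # deterministic seeding: evenly spaced order statistics
  centroids = (0...k).map { |i| sorted[i * (sorted.size - 1) / (k - 1)].to_f }
  clusters = nil
  max_iters.times do
    # assignment step: each value joins its nearest centroid
    clusters = Array.new(k) { [] }
    values.each do |v|
      clusters[(0...k).min_by { |i| (v - centroids[i]).abs }] << v
    end
    # update step: move each centroid to its cluster mean
    updated = clusters.each_with_index.map do |c, i|
      c.empty? ? centroids[i] : c.sum.to_f / c.size
    end
    break if updated == centroids
    centroids = updated
  end
  clusters
end

# the 100 response times from the question
times = [0] * 28 + [1] * 32 + [2] * 9 + [3] * 14 + [4] * 2 + [5] * 5 +
        [8] * 2 + [10, 16, 18] + [22] * 2 + [35] + [50] * 2
pools = kmeans_1d(times, 3)
```

Note that k-means minimizes within-cluster variance; it has no notion of "where the slope steepens," so the pools it returns may not match the common-sense cuts described in the question.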

Related

Minimize the difference of the distance between points on the line

My problem is as follows:
Given n points on a line segment and a threshold k, pick a subset of the points so as to minimize the average difference between each consecutive distance and the threshold.
For example:
If we were given an array of points n = [0, 2, 5, 6, 8, 9], k = 3
Output: [0, 2, 6, 9]
Explanation: when we choose this path, the difference from the threshold in each interval is [1, 1, 0] which gets an average of .66 difference.
If I chose [0, 2, 5, 8, 9], the differences would be [1, 0, 0, 2], which averages to .75.
I understand enough dynamic programming to consider several solutions including memoization and depth-first search, but I was hoping someone could offer a specific algorithm with the best efficiency.
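One standard dynamic-programming formulation, sketched below in Ruby under the assumption that the first and last points must both be kept: because the average divides by the number of intervals, track the interval count in the DP state. `dp[i][j]` is the minimal total deviation from k over any selection ending at `points[i]` with `j` intervals; the answer is the minimum of `dp[n-1][j] / j` over all `j`. This is O(n³) time, O(n²) space — a sketch, not necessarily the best possible efficiency:

```ruby
# DP over (ending point, interval count); assumes first and last points kept.
def min_avg_deviation(points, k)
  n = points.size
  inf = Float::INFINITY
  dp = Array.new(n) { Array.new(n, inf) }
  dp[0][0] = 0.0
  (1...n).each do |i|
    (0...i).each do |prev|
      (1...n).each do |j|
        next if dp[prev][j - 1] == inf
        # cost of making points[prev] -> points[i] one chosen interval
        cost = dp[prev][j - 1] + (points[i] - points[prev] - k).abs
        dp[i][j] = cost if cost < dp[i][j]
      end
    end
  end
  (1...n).map { |j| dp[n - 1][j] / j }.min
end

best = min_avg_deviation([0, 2, 5, 6, 8, 9], 3) # => 2/3, matching the example
```

Storing a parent pointer alongside each dp entry would recover the chosen points themselves rather than just the optimal average.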

Combinatorial algorithm for assigning people to groups

A coworker came to me with an interesting problem, a practical one having to do with a "new people in town" group she's a part of.
18 friends want to have dinner in groups for each of the next 4 days. The rules are as follows:
Each day the group will split into 4 groups of 4, and a group of 2.
Any given pair of people will only see each other at most once over the course of the 4 days.
Any given person will only be part of the size 2 group at most once.
A brute force recursive search for a valid set of group assignments is obviously impractical. I've thrown in some simple logic for pruning parts of the tree as soon as possible, but not enough to make it practical.
Actually, I'm starting to suspect that it might be impossible to follow all the rules, but I can't come up with a combinatorial argument for why that would be.
Any thoughts?
16 friends can be scheduled 4x4 for 4 nights using two mutually orthogonal latin squares of order 4. Assign each friend to a distinct position in the 4x4 grid. On the first night, group by row. On the second, group by column. On the third, group by similar entry in latin square #1 (card rank in the 4x4 example). On the fourth, group by similar entry in latin square #2 (card suit in the 4x4 example). Actually, the affine plane construction gives rise to three mutually orthogonal latin squares, so a fifth night could be scheduled, ensuring that each pair of friends meets exactly once.
Perhaps the schedule for 16 could be extended, using the freedom of the unused fifth night.
EDIT: here's the schedule for 16 people over 5 nights. Each row is a night. Each column is a person. The entry is the group to which they're assigned.
[0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]
[0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3]
[0, 1, 2, 3, 1, 0, 3, 2, 2, 3, 0, 1, 3, 2, 1, 0]
[0, 2, 3, 1, 1, 3, 2, 0, 2, 0, 1, 3, 3, 1, 0, 2]
[0, 3, 1, 2, 1, 2, 0, 3, 2, 1, 3, 0, 3, 0, 2, 1]
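As a sanity check, the schedule above can be verified mechanically. The claim from the affine-plane construction is that every one of the C(16,2) = 120 pairs shares a group on exactly one of the 5 nights; the following sketch (in Ruby, just for illustration) counts meetings per pair:

```ruby
# rows = nights, columns = people, entries = assigned group (from the answer)
schedule = [
  [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
  [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3],
  [0, 1, 2, 3, 1, 0, 3, 2, 2, 3, 0, 1, 3, 2, 1, 0],
  [0, 2, 3, 1, 1, 3, 2, 0, 2, 0, 1, 3, 3, 1, 0, 2],
  [0, 3, 1, 2, 1, 2, 0, 3, 2, 1, 3, 0, 3, 0, 2, 1]
]
# count, for every pair of people, on how many nights they share a group
meetings = Hash.new(0)
schedule.each do |night|
  16.times do |a|
    ((a + 1)...16).each do |b|
      meetings[[a, b]] += 1 if night[a] == night[b]
    end
  end
end
```

Every pair appearing exactly once confirms the two-latin-squares construction, and it also shows why the 16-person subproblem leaves no slack: no pair can ever meet a second time.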

Creating auto-correlated random values

We are trying to create auto-correlated random values to be used as a time series.
We have no existing data to refer to and just want to create the vector from scratch.
On the one hand we of course need a random process with a distribution and its SD.
On the other hand the autocorrelation influencing the random process has to be described. The values of the vector are autocorrelated with decreasing strength over several time lags,
e.g. lag1 has 0.5, lag2 0.3, lag3 0.1 etc.
So in the end the vector should look something like this:
2, 4, 7, 11, 10, 8, 5, 4, 2, -1, 2, 5, 9, 12, 13, 10, 8, 4, 3, 1, -2, -5
and so on.
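A hedged sketch of one common way to do this: an AR(1) process x[t] = φ·x[t-1] + noise, which has geometrically decaying autocorrelation (lag1 ≈ φ, lag2 ≈ φ², ...). With φ = 0.5 that approximates the pattern 0.5, 0.3, 0.1 but won't match it exactly; matching an arbitrary lag structure would need a higher-order AR model with coefficients fitted to the desired autocorrelations (the Yule-Walker equations). The Ruby below uses a Box-Muller transform for the Gaussian innovations and a fixed seed for reproducibility:

```ruby
# AR(1) generator: x[t] = phi * x[t-1] + N(0, sd) innovation
def ar1_series(n, phi, sd, rng = Random.new(42))
  x = [0.0]
  (n - 1).times do
    # Box-Muller transform for a normally distributed innovation
    u1 = 1.0 - rng.rand # in (0, 1], so Math.log is safe
    u2 = rng.rand
    noise = sd * Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math::PI * u2)
    x << phi * x.last + noise
  end
  x
end

# sample autocorrelation at a given lag
def sample_autocorr(x, lag)
  mean = x.sum / x.size
  num = (0...(x.size - lag)).sum { |i| (x[i] - mean) * (x[i + lag] - mean) }
  num / x.sum { |v| (v - mean)**2 }
end

series = ar1_series(5000, 0.5, 1.0)
```

With 5000 samples the measured lag-1 autocorrelation lands close to 0.5 and decays at higher lags, as intended.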

Is this equivalent to insertion sort?

Say we have a 0-indexed sequence S. Take S[0] and insert it at a place in S where the next value is higher than S[0] and the previous value is lower than S[0]. Formally, S[i] should be placed where S[i-1] < S[i] < S[i+1]. Continue in order through the list, doing the same with every item, removing each element from the list before putting it in its correct place. After one iteration over the list, the list should be ordered. I recently had an exam, I forgot insertion sort (don't laugh), and I did it like this. However, my professor marked it wrong. The algorithm, as far as I know, does produce a sorted list.
Works like this on a list:
Sorting [2, 8, 5, 4, 7, 0, 6, 1, 10, 3, 9]
[2, 8, 5, 4, 7, 0, 6, 1, 10, 3, 9]
[2, 8, 5, 4, 7, 0, 6, 1, 10, 3, 9]
[2, 5, 4, 7, 0, 6, 1, 8, 10, 3, 9]
[2, 4, 5, 7, 0, 6, 1, 8, 10, 3, 9]
[2, 4, 5, 7, 0, 6, 1, 8, 10, 3, 9]
[2, 4, 5, 0, 6, 1, 7, 8, 10, 3, 9]
[0, 2, 4, 5, 6, 1, 7, 8, 10, 3, 9]
[0, 2, 4, 5, 1, 6, 7, 8, 10, 3, 9]
[0, 1, 2, 4, 5, 6, 7, 8, 10, 3, 9]
[0, 1, 2, 4, 5, 6, 7, 8, 3, 9, 10]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
Got [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
Since every time an element is inserted into the list up to (n-1) numbers may be moved, and we must do this n times, the algorithm should run in O(n^2) time.
I had a Python implementation but I misplaced it somehow. I'll try to write it again in a bit, but it's kinda tricky to implement. Any ideas?
The Python implementation is here: http://dpaste.com/hold/522232/. It was written by busy_beaver from reddit.com when it was discussed here http://www.reddit.com/r/compsci/comments/ejaaz/is_this_equivalent_to_insertion_sort/
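In case the linked paste is unavailable, here is a Ruby reconstruction of the algorithm as described in the question and traced above — an interpretation, not the original code. The subtle part is the scan pointer: when the removed element is re-inserted to the right, a new element slides into the current position, so the pointer must not advance (this matches the traces, where some positions are processed more than once):

```ruby
def fitsort(a)
  s = a.dup
  i = 0
  while i < s.size
    v = s.delete_at(i)
    # first position where previous <= v <= next (list ends always "fit")
    j = (0..s.size).find do |p|
      (p == 0 || s[p - 1] <= v) && (p == s.size || v <= s[p])
    end
    s.insert(j, v)
    i += 1 if j <= i # element stayed or moved left: advance the pointer
  end
  s
end
```

Running it on the question's example `[2, 8, 5, 4, 7, 0, 6, 1, 10, 3, 9]` reproduces the intermediate states shown above and ends sorted.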
It's been a while since this was asked, but none of the other answers contains a proof that this bizarre algorithm does in fact sort the list. So here goes.
Suppose that the original list is v1, v2, ..., vn. Then after i steps of the algorithm, I claim that the list looks like this:
w1,1, w1,2, ..., w1,r(1), vσ(1), w2,1, ... w2,r(2), vσ(2), w3,1 ... ... wi,r(i), vσ(i), ...
Where σ is the sorted permutation of v1 to vi and the w are elements vj with j > i. In other words, v1 to vi are found in sorted order, possibly interleaved with other elements. And moreover, wj,k ≤ vj for every j and k. So each of the correctly sorted elements is preceded by a (possibly empty) block of elements less than or equal to it.
Here's a run of the algorithm, with the sorted elements in bold, and the preceding blocks of elements in italics (where non-empty). You can see that each block of italicised elements is less than the bold element that follows it.
[4, 8, 6, 1, 2, 7, 5, 0, 3, 9]
[4, 8, 6, 1, 2, 7, 5, 0, 3, 9]
[4, 6, 1, 2, 7, 5, 0, 3, 8, 9]
[4, 1, 2, 6, 7, 5, 0, 3, 8, 9]
[1, 4, 2, 6, 7, 5, 0, 3, 8, 9]
[1, 2, 4, 6, 7, 5, 0, 3, 8, 9]
[1, 2, 4, 6, 5, 0, 3, 7, 8, 9]
[1, 2, 4, 5, 6, 0, 3, 7, 8, 9]
[0, 1, 2, 4, 5, 6, 3, 7, 8, 9]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
If my claim is true, then the algorithm sorts, because after n steps all the vi are in order, and there are no remaining elements to be interleaved. But is the claim really true?
Well, let's prove it by induction. It's certainly true when i = 0. Suppose it's true for i. Then when we run the (i + 1)st step, we pick vi+1 and move it into the first position where it fits. It certainly passes over all vj with j ≤ i and vj < vi+1 (since these are sorted by hypothesis, and each is preceded only by smaller-or-equal elements). It cannot pass over any vj with j ≤ i and vj ≥ vi+1, because there's some position in the block before vj where it will fit. So vi+1 ends up sorted with respect to all vj with j ≤ i. So it ends up somewhere in the block of elements before the next vj, and since it ends up in the first such position, the condition on the blocks is preserved. QED.
However, I don't blame your professor for marking it wrong. If you're going to invent an algorithm that no-one's seen before, it's up to you to prove it correct!
(The algorithm needs a name, so I propose fitsort, because we put each element in the first place where it fits.)
Your algorithm seems to me very different from insertion sort. In particular, it's very easy to prove that insertion sort works correctly (at each stage, the first however-many elements in the array are correctly sorted; proof by induction; done), whereas for your algorithm it seems much more difficult to prove this and it's not obvious exactly what partially-sorted-ness property it guarantees at any given point in its processing.
Similarly, it's very easy to prove that insertion sort always does at most n steps (where by a "step" I mean putting one element in the right place), whereas if I've understood your algorithm correctly it doesn't advance the which-element-to-process-next pointer if it's just moved an element to the right (or, to put it differently, it may sometimes have to process an element more than once) so it's not so clear that your algorithm really does take O(n^2) time in the worst case.
Insertion sort maintains the invariant that elements to the left of the current pointer are sorted. Progress is made by moving the element at the pointer to the left into its correct place and advancing the pointer.
Your algorithm does this, but sometimes it also does an additional step of moving the element at the pointer to the right without advancing the pointer. This makes the algorithm as a whole not an insertion sort, though you could call it a modified insertion sort due to the resemblance.
This algorithm runs in O(n²) on average, like insertion sort (and like bubble sort). The best case for insertion sort is O(n), on an already sorted list; for this algorithm it is also O(n), but for a reverse-sorted list, since you find the correct position for every element in a single comparison (but only if you leave the first, largest element in place at the beginning when you can't find a good position for it).
A lot of professors are notorious for having the "that's not the answer I'm looking for" bug. Even if it's correct, they'll say it doesn't meet their criteria.
What you're doing seems like insertion sort, although using removes and inserts seems like it would only add unnecessary complexity.
What he might be saying is you're essentially "pulling out" the value and "dropping it back in" the correct spot. Your prof was probably looking for "swapping the value up (or down) until you found its correct location."
They have the same result but they're different in implementation. Swapping would be faster, but not significantly so.
I have a hard time seeing that this is insert sort. Using insert sort, at each iteration, one more element would be placed correctly in the array. In your solution I do not see an element being "fully sorted" upon each iteration.
The insertion sort algorithm:
1. let pos = 1
2. if pos == arraysize then return
3. insert the element at position pos into its proper place among the already-sorted elements to its left (shifting larger elements one step right)
4. pos = pos + 1
5. goto 2
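For contrast, the textbook insertion sort sketched in Ruby, maintaining the invariant that everything to the left of the current index is sorted:

```ruby
def insertion_sort(a)
  s = a.dup
  (1...s.size).each do |i|
    v = s[i]
    j = i - 1
    # shift larger sorted elements one step right, then drop v into the gap
    while j >= 0 && s[j] > v
      s[j + 1] = s[j]
      j -= 1
    end
    s[j + 1] = v
  end
  s
end
```

Each element is handled exactly once, which is precisely the property the question's algorithm lacks when it moves an element to the right without advancing.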

Permutations distinct under given symmetry (Mathematica 8 group theory)

Given a list of integers like {2,1,1,0} I'd like to list all permutations of that list that are not equivalent under given group. For instance, using symmetry of the square, the result would be {{2, 1, 1, 0}, {2, 1, 0, 1}}.
Approach below (Mathematica 8) generates all permutations, then weeds out the equivalent ones. I can't use it because I can't afford to generate all permutations, is there a more efficient way?
Update: actually, the bottleneck is in DeleteCases. The following list {2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 0, 0, 0} has about a million permutations and takes 0.1 seconds to compute. Apparently there are supposed to be 1292 orderings after removing symmetries, but my approach doesn't finish in 10 minutes
removeEquivalent[{}] := {};
removeEquivalent[list_] := (
  Sow[First[list]];
  equivalents = Permute[First[list], #] & /@ GroupElements[group];
  DeleteCases[list, Alternatives @@ equivalents]
);
nonequivalentPermutations[list_] := (
  reaped = Reap@FixedPoint[removeEquivalent, Permutations@list];
  reaped[[2, 1]]
);
group = DihedralGroup[4];
nonequivalentPermutations[{2, 1, 1, 0}]
What's wrong with:
nonequivalentPermutations[list_, group_] := Union[Permute[list, #] & /@ GroupElements[group]];
nonequivalentPermutations[{2, 1, 1, 0}, DihedralGroup[4]]
I don't have Mathematica 8, so I can't test this. I just have Mathematica 7.
I got an elegant and fast solution from Maxim Rytin, relying on the ConnectedComponents function:
Module[{gens, verts, edges},
  gens = PermutationList /@ GroupGenerators@DihedralGroup[16];
  verts = Permutations@{2, 2, 2, 2, 2, 2, 2, 1, 1, 0, 0, 0, 0, 0, 0, 0};
  edges = Join @@ (Transpose@{verts, verts[[All, #]]} &) /@ gens;
  Length@ConnectedComponents@Graph[Rule @@@ Union@edges]] // Timing
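The underlying idea — group permutations into orbits under the symmetry group and keep one representative per orbit — is language-agnostic. Here is a sketch in Ruby for the small square example from the question, canonicalizing each arrangement by its lexicographically smallest image under the group (the eight symmetries of the square, vertices labeled cyclically, are written out by hand):

```ruby
# The dihedral group of the square as position permutations.
D4 = [
  [0, 1, 2, 3], [1, 2, 3, 0], [2, 3, 0, 1], [3, 0, 1, 2], # rotations
  [3, 2, 1, 0], [0, 3, 2, 1], [1, 0, 3, 2], [2, 1, 0, 3]  # reflections
]

# Canonical representative: smallest image of the arrangement under the group.
def canonical(arrangement, group)
  group.map { |g| arrangement.values_at(*g) }.min
end

classes = [2, 1, 1, 0].permutation.to_a.uniq.
  map { |p| canonical(p, D4) }.uniq
# => two equivalence classes, matching {{2, 1, 1, 0}, {2, 1, 0, 1}}
```

This still enumerates all permutations, so it shares the original approach's bottleneck for large inputs; the ConnectedComponents (or a canonical-form hash) trick above is what makes the million-permutation case tractable.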
