Similar questions in the database seem to be much more complicated than my example. I want to cluster roughly 100 points on a line. The number of groups is irrelevant; the closeness of points is more important.
What is a term, method or algorithm to deal with this grouping problem? K-means, Hamming distance, hierarchical agglomeration, clique, or complete linkage?
I've reduced two examples to bare minimum for clarification:
Simple example:
Set A = {600, 610, 620, 630} and the set of differences between its elements is diff_A = {10, 20, 30, 10, 20, 10}. I can then group as follows: {10, 10, 10}, {20, 20}, and {30}. Done.
Problematic example:
Set B = {600, 609, 619, 630} and the set of differences is diff_B = {9, 10, 11, 19, 21, 30}. I try to group with a tolerance of 1, i.e. differences that are 1 (or less) are 'similar enough' to be grouped but I get a paradox: {9, 10} AND/OR {10, 11}, {19}, {21}, and {30}.
Issue:
9 and 10 are close enough, 10 and 11 are close enough, but 9 and 11 are not, so how should I handle these overlapping groups? Perhaps this small example is unsolvable because it is symmetrical?
Why do you work on the pairwise differences? Consider the values 1, 2, 101, 102, 201, 202. The pairwise differences are 1, 100, 101, 200, 201, 99, 100, 199, 200, 1, 100, 101, 99, 100, 1.
The differences of ~200 bear no information. There is a different "cluster" in between. You shouldn't use them for your analysis.
Instead, grab a statistics textbook and look up Kernel Density Estimation. Don't bother looking for clustering methods - they are usually designed for the multivariate case. Your data is 1-dimensional; it can be sorted (it probably already is), and this can be exploited for better results.
There are well-established heuristics for density estimation on such data, and you can split your data at local density minima (or simply at a low density threshold). This is much simpler, yet robust and reliable. You don't need to set a parameter such as k for k-means. There are cases where k-means is a good choice - it has its origins in signal detection, where it was known that there are k=10 different signal frequencies. Today, it is mostly used for multidimensional data.
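For instance, here is a minimal sketch of that approach in Python. It assumes SciPy's gaussian_kde; the sample data, grid resolution and default bandwidth are only illustrative choices:

import numpy as np
from scipy.stats import gaussian_kde

# Illustrative 1-d data; in practice this would be your ~100 points.
points = np.sort(np.array([600, 609, 619, 630, 1000, 1003, 1011], dtype=float))

kde = gaussian_kde(points)                       # bandwidth from Scott's rule by default
grid = np.linspace(points.min(), points.max(), 1000)
density = kde(grid)

# Local minima of the estimated density are the split points.
is_min = (density[1:-1] < density[:-2]) & (density[1:-1] < density[2:])
splits = grid[1:-1][is_min]

# Cut the sorted data at each density minimum.
clusters = [c.tolist() for c in np.split(points, np.searchsorted(points, splits)) if len(c)]
print(clusters)

Each run of points between two density dips becomes one group, so the number of groups falls out of the data rather than being set in advance.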
See also:
Cluster one-dimensional data optimally?
1D Number Array Clustering
partitioning an float array into similar segments (clustering)
What clustering algorithm to use on 1-d data?
Related
I am working on a clustering algorithm where I need to cluster values based on their frequency in the data. This would indicate which values are not important and should be treated as part of a larger cluster rather than as individual entities.
I am new to data science and would like to know the best algorithm/approach to achieve this.
For example, I have the following data set. The first column contains the property values and the second column their frequency of occurrence.
Value = [1, 1.5, 2, 3, 4, 6, 8, 16, 32, 128]
Frequency = [207, 19, 169, 92, 36, 7, 12, 5, 2, 2]
Here, Frequency[i] corresponds to Value[i]
The frequency can be thought of as the importance of a value. The other thing that indicates the importance of a value is the distance between the elements in the array. For example, 1.5 is not that significant compared to 32 or 128, since it has much closer neighbors such as 1 and 2.
When clustering these values, I need to look at the distances between values as well as their frequency of occurrence. A possible output for the above problem would be
Clust_value = [(1, 1.5), 2, 3, 4, (6, 8), 16, (32, 128)]
This is not the best cluster but one possible answer. I need to know the best algorithm to approach this problem.
At first, I tried to solve this problem without taking into account the spread of elements in the values array, but that gave wrong answers in some situations. We have also tried using the mean and median for clustering values, again without a successful outcome.
We have tried comparing the frequencies of neighbors and then merging the values into one cluster. We also tried to find the minimum distance between the elements of the values array and then put them into one cluster if their difference was greater than a threshold value, but this failed to cluster values if they had low frequencies. I also looked for clustering algorithms online but did not find any useful resource relevant to the problem defined above.
Is there any better way to approach the problem?
You need to come up with some mathematical quality criterion of what makes one solution better than another. Unless you have thousands of numbers, you can afford a rather 'brute force' method: begin with the first number, add the next as long as your quality increases, otherwise begin a new cluster. Because your data are sorted this will be fairly efficient and find a rather good solution (you can try additional splits to further improve quality).
So it all boils down to you needing to specify quality.
Do not assume that existing criteria (e.g. variance in k-means) work for you. At most, you may be able to find a data transformation such that your requirements turn into variance, but that also will be specific to your problem.
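As a rough illustration of that greedy scan (a sketch, not a recommended criterion), here is a Python version in which quality() is a made-up placeholder combining total frequency and value spread; you would replace it with whatever criterion actually captures your requirements:

# Placeholder quality criterion: total frequency minus value spread (illustrative only).
def quality(cluster):                      # cluster is a list of (value, frequency) pairs
    values = [v for v, _ in cluster]
    freqs = [f for _, f in cluster]
    return sum(freqs) - (max(values) - min(values))

def greedy_clusters(pairs):                # pairs must be sorted by value
    clusters, current = [], [pairs[0]]
    for item in pairs[1:]:
        if quality(current + [item]) >= quality(current):
            current.append(item)           # quality did not drop: keep growing the cluster
        else:
            clusters.append(current)       # quality dropped: start a new cluster
            current = [item]
    clusters.append(current)
    return clusters

values = [1, 1.5, 2, 3, 4, 6, 8, 16, 32, 128]
freqs = [207, 19, 169, 92, 36, 7, 12, 5, 2, 2]
print(greedy_clusters(list(zip(values, freqs))))

The grouping this prints depends entirely on the quality function, which is exactly the point: the result is only as good as the criterion you specify.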
I'd like to ask a variation on this question regarding Huffman tree building. Is there any way to calculate the depth of a Huffman tree from the input (or the frequencies) without drawing the tree?
If there is no quick way, how was the answer to that question found? A specific example: for 10 input symbols with frequencies 1 to 10, the depth is 5.
If you are looking for an equation to take the frequencies and give you the depth, then no, no such equation exists. The proof is that there exist sets of frequencies on which you will have arbitrary choices to make in applying the Huffman algorithm that result in different depth trees! So there isn't even a unique answer to "What is the depth of the Huffman tree?" for some sets of frequencies.
A simple example is the set of frequencies 1, 1, 2, and 2, which can give a depth of 2 or 3 depending on which minimum frequencies are paired when applying the Huffman algorithm.
The only way to get the answer is to apply the Huffman algorithm. You can take some shortcuts to get just the depth, since you won't be using the tree at the end. But you will be effectively building the tree no matter what.
You might be able to approximate the depth, or at least put bounds on it, with an entropy equation. In some special cases the bounds may be restrictive enough to give you the exact depth. E.g. if all of the frequencies are equal, then you can calculate the depth to be the ceiling of the log base 2 of the number of symbols.
A cool example that shows that a simple entropy bound won't be strong enough to get the exact answer is when you use the Fibonacci sequence for the frequencies. This assures that the depth of the tree is the number of symbols minus one. So the frequencies 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, and 610 will result in a depth of 14 bits even though the entropy of the lowest frequency symbol is 10.64 bits.
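As a sketch of that "run the algorithm but only track depth" shortcut, here is a small Python version using a heap of (frequency, subtree depth) pairs. It returns the depth of one valid Huffman tree under this particular tie-breaking, so ambiguous inputs such as 1, 1, 2, 2 will give only one of the possible answers:

import heapq

def huffman_depth(freqs):
    # Each heap entry is (total frequency, depth of that subtree).
    heap = [(f, 0) for f in freqs]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, d1 = heapq.heappop(heap)       # the two lowest-frequency subtrees
        f2, d2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, max(d1, d2) + 1))  # merging adds one level
    return heap[0][1]

print(huffman_depth(range(1, 11)))         # frequencies 1..10 -> depth 5
print(huffman_depth([1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610]))  # -> 14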
I really don't know what the name of this problem is, but it's something like lossy compression; my English isn't great, but I will try to describe it as best I can.
Suppose I have a list of unsorted unique numbers from an unknown source; the length is usually between 255 and 512, with a range from 0 to 512.
I wonder if there is some kind of algorithm that reads the data and returns something like a seed number that I can use to regenerate a list that is somewhat close to the original, but with some degree of error.
For example
original list
{5, 13, 25, 33, 3, 10}
regenerated list
{4, 10, 30, 30, 5, 5} or {8, 20, 20, 35, 5, 9} //and so on
Does this problem have a name, and is there an algorithm that can do what I just described?
Is it the same as the Monte Carlo method? From what I understand, it isn't.
Is it possible to use some of the techniques used in lossy compression to get this kind of approximation ?
What I tried to do to solve this problem was to use a simple 16-bit RNG and brute-force all the possible seed values, comparing the generated lists to the original and picking the one with the minimum difference, but I think this way is rather dumb and inefficient.
This is indeed lossy compression.
You don't tell us the range of the values in the list. From the samples you give, we can extrapolate that they take at least 6 bits each (0 to 63). In total, you have up to 3072 bits to compress.
If these sequences have no special property and appear to be random, I doubt there is any way to achieve significant compression. Consider that the probability of an arbitrary sequence being matched from a 32-bit seed is 2^32 · 2^(-3072) ≈ 7·10^(-916), i.e. less than infinitesimal. If you allow 10% error on every value, the probability of a match is 2^32 · 0.1^512 ≈ 4·10^(-503).
A trivial way to compress with 12.5% accuracy is to get rid of the three LSB of each value, leading to 50% savings (1536 bits), but I doubt this is what you are looking for.
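For illustration, a tiny sketch of that LSB-dropping idea in Python on the example list from the question; the half-step added back on reconstruction is my own addition to centre the error, not part of the scheme described above:

values = [5, 13, 25, 33, 3, 10]
compressed = [v >> 3 for v in values]            # keep only the top 3 of 6 bits
restored = [(c << 3) + 4 for c in compressed]    # re-expand, adding half a step to centre the error
print(restored)                                  # -> [4, 12, 28, 36, 4, 12]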
It would also be useful to measure the entropy of the sequences (http://en.wikipedia.org/wiki/Entropy_(information_theory)) and/or possible correlations between the values. This can be done by plotting all (V[i], V[i+1]) pairs, or (V[i], V[i+1], V[i+2]) triples, and looking for patterns.
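A quick sketch of how one might measure both, assuming NumPy and a placeholder random sequence standing in for the real data:

import numpy as np

values = np.random.randint(0, 513, size=300)     # placeholder data; use the real list here

# Empirical entropy in bits, treating each distinct value as a symbol.
_, counts = np.unique(values, return_counts=True)
p = counts / counts.sum()
print("entropy: %.2f bits per value" % -(p * np.log2(p)).sum())

# Lag-1 correlation between consecutive values (V[i], V[i+1]);
# a value near zero suggests no simple linear structure to exploit.
print("lag-1 correlation: %.3f" % np.corrcoef(values[:-1], values[1:])[0, 1])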
I have a set of frequency values and I'd like to find the most likely subsets of values meeting the following conditions:
values in each subset should be harmonically related (approximately multiples of a given value)
the number of subsets should be as small as possible
every subset should be missing as few harmonics (below its highest value) as possible
E.g. [1,2,3,4,10,20,30] should return [1,2,3,4] and [10,20,30] (a set with all the values is not optimal because, even if they are harmonically related, there are many missing values)
The brute-force method would be to compute all the possible subsets of values and compute some cost value for each, but that would take far too long.
Is there any efficient algorithm to perform this task (or something similar)?
I would reduce the problem to minimum set cover, which, although NP-hard, often is efficiently solvable in practice via integer programming. I'm assuming that it would be reasonable to decompose [1, 2, 3, 4, 8, 12, 16] as [1, 2, 3, 4] and [4, 8, 12, 16], with 4 repeating.
To solve set cover (well, to use stock integer-program solvers, anyway), we need to enumerate all of the maximal allowed subsets. If the fundamental (i.e., the given value) must belong to the set, then, for each frequency, we can enumerate its multiples in order until too many in a row are missing. If not, we try all pairs of frequencies, assume that their fundamental is their approximate greatest common divisor, and extend the subset downward and upward until too many frequencies are missing.
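Here is a rough sketch of that idea in Python. It makes several assumptions that are not stated above: every observed frequency may act as a fundamental, harmonics match within a 5% tolerance, at most two consecutive harmonics may be missing, and the final cover is picked by a greedy set-cover approximation rather than an exact integer program:

def harmonic_subset(freqs, fundamental, tol=0.05, max_missing=2):
    # Collect frequencies lying near successive multiples of the fundamental,
    # stopping once too many consecutive harmonics are missing.
    subset, missing, k = set(), 0, 1
    top = max(freqs)
    while fundamental * k <= top * (1 + tol) and missing <= max_missing:
        target = fundamental * k
        hits = [f for f in freqs if abs(f - target) <= tol * target]
        if hits:
            subset.update(hits)
            missing = 0
        else:
            missing += 1
        k += 1
    return frozenset(subset)

def greedy_cover(freqs, candidates):
    # Greedy set cover: repeatedly take the subset covering the most uncovered frequencies.
    uncovered, chosen = set(freqs), []
    while uncovered:
        best = max(candidates, key=lambda s: len(s & uncovered))
        if not best & uncovered:
            break
        chosen.append(sorted(best))
        uncovered -= best
    return chosen

freqs = [1, 2, 3, 4, 10, 20, 30]
candidates = {harmonic_subset(freqs, f) for f in freqs}
print(greedy_cover(freqs, candidates))   # -> [[1, 2, 3, 4], [10, 20, 30]]

For an exact answer you would feed the same candidate subsets into an integer-programming formulation of set cover, as described above.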
For example, the sequence 1, 2, 4, 5 has the following sums of runs of consecutive elements:
1,2,4,5
3,6,9
7,11
12
and every sum is unique.
Now, the sequence 1, 2, 3 has the following sums:
1,2,3
3,5
6
and apparently not every sum is unique.
Is there any efficient way to generate a sequence similar to the first example, with the goal of picking every number as small as possible (not just 1, 2, 4, 8, 16, ...)? I understand I could write a program to brute-force this, but I'm curious whether it can be done in a better way.
I think what you're looking for here is a Golomb Ruler. If you take the numbers you're describing above as the distances between marks, you've described a Golomb Ruler. When the distances between every pair of marks on the ruler are all distinct, as you've described, that's what makes it a Golomb Ruler.
It appears the standard way to describe a Golomb Ruler is by representing the location of each mark, not the distances between them. Therefore, your 1,2,4,5 would be described as a Golomb Ruler 0-1-3-7-12.
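As a quick check of that correspondence (just a sketch), the marks are the running sums of the differences, and the Golomb property is that all pairwise distances between marks are distinct:

from itertools import accumulate, combinations

diffs = [1, 2, 4, 5]
marks = [0] + list(accumulate(diffs))                  # -> [0, 1, 3, 7, 12]
distances = [b - a for a, b in combinations(marks, 2)]
print(marks, len(distances) == len(set(distances)))    # True: all distances are distinct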
Quoting Wikipedia:
Currently, the complexity of finding OGRs of arbitrary order n (where n is given in unary) is unknown. In the past there was some speculation that it is an NP-hard problem. Problems related to the construction of Golomb Rulers are provably shown to be NP-hard, where it is also noted that no known NP-complete problem has similar flavor to finding Golomb Rulers.
# Generate each next difference x whose new run sums are all unique so far.
def unique_sum_differences(limit):
    seen = set()        # all run sums seen so far
    open_sums = set()   # run sums ending at the last accepted element
    for x in range(1, limit + 1):
        if x in seen:
            # quick fail: x on its own already duplicates a known sum
            continue
        # build the new run sums: x itself, plus every open sum extended by x
        pending = {x}
        for s in open_sums:
            pending.add(s + x)
        # check that these sums are all unique
        if not pending & seen:
            # if so, we have a new result
            yield x
            open_sums = pending
            seen |= pending

# Example: list(unique_sum_differences(8)) -> [1, 2, 4, 5, 8]
It looks at all sums seen so far, and the sums ending at the last element. There is no need to keep track of starting and ending positions.
If n is the value of limit, this algorithm would take O(n² log n), assuming set membership checks and insertions are O(log n), and intersection/union are not slower than O(n log n).
(Though I might be mistaken on the last assumption)
The first few values would be:
1, 2, 4, 5, 8
The first 30 values (when selecting always the smallest possible value for the next difference) are
1, 2, 4, 5, 8, 10, 14, 21, 15, 16, 26, 25, 34, 22, 48, 38, 71, 40, 74, 90, 28, 69, 113, 47, 94, 54, 46, 143, 153, 83
These "lexicograpic first" Golomb rulers are less than twice as long as the best known Golomb rulers with up to 30 differences. For most purposes such rulers are sufficient - for signals it just matters that all distances between any two marks are different. Finding the OGR (the optimum Golomb ruler) is rather a mathematical challenge.
@David There is no right or wrong in using one or the other format for describing Golomb rulers. Either is valid: displaying the positions of the marks (0, 1, 3, 7, 12, 20, ...) or the values of the differences. Personally, I prefer the differences for calculations, as @zack does in the question.