I am working on a clustering algorithm where I need to cluster values based on their frequency in the data. This would indicate which values are not important and should be treated as part of a larger cluster rather than as individual entities.
I am new to data science and would like to know the best algorithm/approach to achieve this.
For example, I have the following data set. The first column contains the property values and the second column denotes their frequency of occurrence.
Value = [1, 1.5, 2, 3, 4, 6, 8, 16, 32, 128]
Frequency = [207, 19, 169, 92, 36, 7, 12, 5, 2, 2]
Here, Frequency[i] corresponds to Value[i]
The frequency can be thought of as the importance of a value. The other thing that denotes the importance of a value is the distance between the elements in the array. For example, 1.5 is not that significant compared to 32 or 128, since it has much closer neighbours, such as 1 and 2.
When clustering these values, I need to look at the distances between values as well as the frequency of their occurrence. A possible output for the above problem would be
Clust_value = [(1, 1.5), 2, 3, 4, (6, 8), 16, (32, 128)]
This is not the best cluster but one possible answer. I need to know the best algorithm to approach this problem.
Firstly, I tried to solve this problem without taking into account the spread of elements in the values array, but that gave wrong answers in some situations. I have also tried using the mean and the median for clustering values, again without a successful outcome.
I have tried comparing the frequencies of neighbours and then merging the values into one cluster. I also tried finding the minimum distance between elements of the values array and putting them into one cluster if their difference was below a threshold value, but this failed to cluster values that had low frequencies. I also looked for clustering algorithms online but did not find any useful resource relevant to the problem described above.
Is there any better way to approach the problem?
You need to come up with some mathematical quality criterion of what makes one solution better than another. Unless you have thousands of numbers, you can afford a rather 'brute force' method: begin with the first number, add the next as long as your quality increases, otherwise begin a new cluster. Because your data are sorted this will be fairly efficient and find a rather good solution (you can try additional splits to further improve quality).
So it all boils down to you needing to specify quality.
Do not assume that existing criteria (e.g. variance in k-means) work for you. At most, you may be able to find a data transformation such that your requirements turn into variance, but that also will be specific to your problem.
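To make "specify quality" concrete, here is a minimal Python sketch of that greedy pass. The quality score used below (frequency-weighted spread plus a fixed per-cluster penalty), the penalty value and all names are invented for illustration only; substitute your own criterion.

    # Greedy merge along sorted values, assuming a made-up quality criterion.
    def cluster_quality(clusters, penalty=1.0):
        """Lower is better: frequency-weighted spread per cluster plus a
        fixed cost per cluster, so that all-singletons is not always optimal."""
        cost = 0.0
        for members in clusters:                    # members = [(value, freq), ...]
            total = sum(f for _, f in members)
            mean = sum(v * f for v, f in members) / total
            cost += sum(f * abs(v - mean) for v, f in members) + penalty
        return cost

    def greedy_cluster(values, freqs, penalty=1.0):
        pairs = sorted(zip(values, freqs))          # the data must be sorted
        clusters = [[pairs[0]]]
        for p in pairs[1:]:
            extended = clusters[:-1] + [clusters[-1] + [p]]   # grow the last cluster
            started = clusters + [[p]]                        # or open a new one
            if cluster_quality(extended, penalty) <= cluster_quality(started, penalty):
                clusters = extended
            else:
                clusters = started
        return clusters

    values = [1, 1.5, 2, 3, 4, 6, 8, 16, 32, 128]
    freqs = [207, 19, 169, 92, 36, 7, 12, 5, 2, 2]
    print([[v for v, _ in c] for c in greedy_cluster(values, freqs)])

The output depends entirely on the criterion and penalty you plug in; the point is only to show where your quality definition slots into the greedy pass.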
Related
Let's say I have an array of ~20-100 integers, for example [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] (actually numbers more like [106511349, 173316561, ...], all nonnegative 64-bit integers under 2^63, but for demonstration purposes let's use these).
And many (~50,000) smaller arrays of usually 1-20 terms to match or not match:
1=[2, 3, 8, 20]
2=[2, 3, NOT 8]
3=[2, 8, NOT 16]
4=[2, 8, NOT 16] (there will be duplicates with different list IDs)
I need to find which of these are subsets of the array being tested. A matching list must have all of the positive matches, and none of the negative ones. So for this small example, I would need to get back something like [3, 4]. List 1 fails to match because it requires 20, and list 2 fails to match because it has NOT 8. The NOT can easily be represented by using the high bit/making the number negative in those cases.
I need to do this quickly, up to 10,000 times per second. The small arrays are "fixed" (they change infrequently, like once every few seconds), while a new large array arrives with each data item to be scanned (so 10,000 different large arrays per second).
This has become a bit of a bottleneck, so I'm looking into ways to optimize it.
I'm not sure the best data structures or ways to represent this. One solution would be to turn it around and see what small lists we even need to consider:
2=[1, 2, 3, 4]
3=[1, 2]
8=[1, 2, 3, 4]
16=[3, 4]
20=[1]
Then we'd build up a list of lists to check, and do the full subset matching on these. However, certain terms (often the more frequent ones) are going to end up in many of the lists, so there's not much of an actual win here.
I was wondering if anyone is aware of a better algorithm for solving this sort of problem?
You could try to make a tree with the smaller arrays, since they change less frequently, such that each subtree tries to halve the number of small arrays left.
For example, do frequency analysis on numbers in the smaller arrays. Find which number is found in closest to half of the smaller arrays. Make that the first check in the tree. In your example that would be '3' since it occurs in half the small arrays. Now that's the head node in the tree. Now put all the small lists that contain 3 to the left subtree and all the other lists to the right subtree. Now repeat this process recursively on each subtree. Then when a large array comes in, reverse index it, and then traverse the subtree to get the lists.
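A rough Python sketch of that splitting idea, assuming the small lists are stored as sets of positive terms (NOT terms are ignored here for brevity) and using an arbitrary stopping rule:

    from collections import Counter

    def build_tree(lists, min_leaf=4):
        """lists: dict of list-id -> set of required numbers."""
        if len(lists) <= min_leaf:
            return ("leaf", lists)
        counts = Counter(n for members in lists.values() for n in members)
        if not counts:
            return ("leaf", lists)
        # pick the number occurring in closest to half of the remaining lists
        pivot = min(counts, key=lambda n: abs(counts[n] - len(lists) / 2))
        with_pivot = {i: s for i, s in lists.items() if pivot in s}
        without_pivot = {i: s for i, s in lists.items() if pivot not in s}
        if not with_pivot or not without_pivot:
            return ("leaf", lists)
        return ("node", pivot,
                build_tree(with_pivot, min_leaf),
                build_tree(without_pivot, min_leaf))

    def candidates(tree, big_set):
        """Prune lists whose pivot is missing from big_set; the full
        positive/NOT check still has to run on whatever is returned."""
        if tree[0] == "leaf":
            return dict(tree[1])
        _, pivot, with_pivot, without_pivot = tree
        found = candidates(without_pivot, big_set)
        if pivot in big_set:        # lists requiring the pivot only survive if it is present
            found.update(candidates(with_pivot, big_set))
        return found

    small = {1: {2, 3, 8, 20}, 2: {2, 3}, 3: {2, 8}, 4: {2, 8}}
    tree = build_tree(small, min_leaf=1)
    print(candidates(tree, {2, 8, 16}))     # only lists 3 and 4 survive the pruning

This only narrows the candidate set; each surviving list still needs the full subset/NOT test against the large array.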
You did not state which of your arrays are sorted - if any.
Since your data is not that big, I would use a hash-map to store the entries of the source set (the one with ~20-100 integers). That would basically let you test whether an integer is present in O(1).
Then, given that 50,000 (arrays) * 20 (terms each) * 8 (bytes per term) = 8 megabytes plus hash-map overhead, which does not seem large for most systems, I would use another hash-map to store tested arrays. This way you don't have to re-test duplicates.
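In Python terms, a sketch of both maps, using the negative-number convention from the question for NOT terms; the duplicate cache is only valid within a single scan, because a small array's result depends on the large array being tested:

    def matches(big_values, small_arrays):
        """big_values: the large array; small_arrays: dict id -> list of ints,
        where a negative entry means NOT (assumes 0 never appears as a NOT term)."""
        present = set(big_values)           # O(1) membership for the large array
        cache = {}                          # one result per distinct small array
        hits = []
        for list_id, terms in small_arrays.items():
            key = frozenset(terms)
            if key not in cache:            # duplicates are evaluated only once
                cache[key] = all((t in present) if t >= 0 else (-t not in present)
                                 for t in terms)
            if cache[key]:
                hits.append(list_id)
        return hits

    print(matches(range(10), {1: [2, 3, 8, 20], 2: [2, 3, -8],
                              3: [2, 8, -16], 4: [2, 8, -16]}))   # [3, 4]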
I realize this may be less satisfying from a CS point of view, but if you're doing a huge number of tiny tasks that don't affect each other, you might want to consider parallelizing them (multithreading). 10,000 tasks per second, comparing a different array in each task, should fit the bill; you don't give any details about what else you're doing (e.g., where all these arrays are coming from), but it's conceivable that multithreading could improve your throughput by a large factor.
First, do what you were suggesting; make a hashmap from input integer to the IDs of the filter arrays it exists in. That lets you say "input #27 is in these 400 filters", and toss those 400 into a sorted set. You've then gotta do an intersection of the sorted sets for each one.
Optional: make a second hashmap from each input integer to its frequency in the set of filters. When an input comes in, sort it using the second hashmap. Then start with the least common input integer, so you have less overall work to do on each step. Also compute the frequencies for the "not" cases, so you basically get the most bang for your buck on each step.
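A rough sketch of the first hashmap plus the matching step; instead of explicitly intersecting sorted sets it counts hits per filter, which is a common variant, and the names and data layout are my assumptions:

    from collections import defaultdict, Counter

    def build_index(filters):
        """filters: dict id -> (required_terms, forbidden_terms), both sets."""
        index = defaultdict(set)
        for fid, (required, _) in filters.items():
            for term in required:
                index[term].add(fid)
        return index

    def match(big_values, filters, index):
        present = set(big_values)
        counts = Counter()
        for term in present:
            for fid in index.get(term, ()):
                counts[fid] += 1
        hits = []
        for fid, n in counts.items():
            required, forbidden = filters[fid]
            # all positive terms seen, and no forbidden term present
            if n == len(required) and not (forbidden & present):
                hits.append(fid)
        return hits

    filters = {1: ({2, 3, 8, 20}, set()), 2: ({2, 3}, {8}),
               3: ({2, 8}, {16}), 4: ({2, 8}, {16})}
    print(match(range(10), filters, build_index(filters)))   # [3, 4]

Filters with no positive terms never show up in the counts and would need separate handling.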
Finally: this could be pretty easily made into a parallel programming problem; if it's not fast enough on one machine, it seems you could put more machines on it pretty easily, if whatever it's returning is useful enough.
Similar questions in the database seem to be much more complicated than my example. I want to cluster 100'ish points on a line. Number of groups is irrelevant; the closeness of points is more important.
What is a term, method or algorithm to deal with this grouping problem? K-means, Hamming distance, hierarchical agglomeration, clique or complete linkage??
I've reduced two examples to bare minimum for clarification:
Simple example:
Set A = {600, 610, 620, 630} and the set of differences between its elements is diff_A = {10, 20, 30, 10, 20, 10}. I can then group as follows: {10, 10, 10}, {20, 20}, and {30}. Done.
Problematic example:
Set B = {600, 609, 619, 630} and the set of differences is diff_B = {9, 10, 11, 19, 21, 30}. I try to group with a tolerance of 1, i.e. differences that are 1 (or less) are 'similar enough' to be grouped but I get a paradox: {9, 10} AND/OR {10, 11}, {19}, {21}, and {30}.
Issue:
9 and 10 are close enough, 10 and 11 are close enough, but 9 and 11 are not, so how should I handle these overlapping groups? Perhaps this small example is unsolvable because it is symmetrical?
Why do you work on the pairwise differences? Consider the values 1, 2, 101, 102, 201, 202. Pairwise differences are 1, 100, 101, 200, 201, 99, 100, 199, 200, 1, 100, 101, 99, 100, 1.
The differences of ~200 bear no information. There is a different "cluster" in between. You shouldn't use them for your analysis.
Instead, grab a statistics textbook and look up Kernel Density Estimation. Don't bother to look for clustering - these methods are usually designed for the multivariate case. Your data is 1 dimensional. It can be sorted (it probably already is), and this can be exploited for better results.
There are well-established heuristics for density estimation on such data, and you can split your data at local minima of the density (or simply at a low density threshold). This is much simpler, yet robust and reliable. You don't need to set a parameter such as k for k-means. There are cases where k-means is a good choice - it has origins in signal detection, where it was known that there are k=10 different signal frequencies. Today, it is mostly used for multidimensional data.
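A minimal sketch of that density-based split using scipy's Gaussian KDE; the bandwidth, grid resolution and the rule for locating minima are illustrative defaults, not recommendations:

    import numpy as np
    from scipy.stats import gaussian_kde

    def split_1d(points):
        x = np.sort(np.asarray(points, dtype=float))
        density = gaussian_kde(x)                   # default bandwidth
        grid = np.linspace(x.min(), x.max(), 512)
        d = density(grid)
        inner = np.arange(1, len(grid) - 1)
        minima = inner[(d[inner] < d[inner - 1]) & (d[inner] < d[inner + 1])]
        boundaries = grid[minima]                   # split at local density minima
        clusters = [[] for _ in range(len(boundaries) + 1)]
        for v in x:
            clusters[np.searchsorted(boundaries, v)].append(v)
        return clusters

    print(split_1d([600, 609, 619, 630, 1000, 1005, 1010]))   # two groups expected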
See also:
Cluster one-dimensional data optimally?
1D Number Array Clustering
partitioning an float array into similar segments (clustering)
What clustering algorithm to use on 1-d data?
I really don't know what the name of this problem is, but it's something like lossy compression. My English is not great, but I will try to describe the problem as well as I can.
Suppose I have a list of unsorted unique numbers from an unknown source; the length is usually between 255 and 512, with values in the range 0 to 512.
I wonder if there is some kind of algorithm that reads the data and returns something like a seed number that I can use to regenerate a list that is close to the original, within some degree of error.
For example
original list
{5, 13, 25, 33, 3, 10}
regenerated list
{4, 10, 30, 30, 5, 5} or {8, 20, 20, 35, 5, 9} //and so on
Does this problem have a name, and is there an algorithm that can do what I just described?
Is it the same as the Monte Carlo method? From what I understand, it isn't.
Is it possible to use some of the techniques used in lossy compression to get this kind of approximation ?
What I tried to do to solve this problem is to use a simple 16 bit RNG and brute-force all the possible values comparing them to the original list and pick the one with the minimum difference, but I think this way is rather dumb and inefficient.
This is indeed lossy compression.
You don't tell us the range of the values in the list. From the samples you give we can extrapolate that they count at least 6 bits each (0 to 63). In total, you have from 0 to 3072 bits to compress.
If these sequences have no special property and appear to be random, I doubt there is any way to achieve significant compression. Consider that the probability of an arbitrary sequence being matched from a 32-bit seed is 2^32 * 2^(-3072) ≈ 7*10^(-916), i.e. less than infinitesimal. If you allow 10% error on every value, the probability of a match is about 2^32 * 0.1^512 ≈ 4*10^(-503).
A trivial way to compress with up to 12.5% error is to get rid of the three LSBs of each value, leading to 50% savings (1536 bits), but I doubt this is what you are looking for.
It would be useful to measure the entropy of the sequences (http://en.wikipedia.org/wiki/Entropy_(information_theory)) and/or possible correlations between the values. This can be done by plotting all (Vi, Vi+1) pairs, or (Vi, Vi+1, Vi+2) triples, and looking for patterns.
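For illustration, a quick sketch that estimates the empirical per-symbol entropy of a list (and of its successive differences); whether this reflects the true source entropy depends on how the lists are generated:

    import math
    from collections import Counter

    def entropy_bits(values):
        counts = Counter(values)
        n = len(values)
        return -sum((c / n) * math.log2(c / n) for c in counts.values())

    seq = [5, 13, 25, 33, 3, 10]
    print(entropy_bits(seq))                                    # bits per symbol
    print(entropy_bits([b - a for a, b in zip(seq, seq[1:])]))  # entropy of the deltas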
I'm currently implementing an algorithm where one particular step requires me to calculate subsets in the following way.
Imagine I have sets (possibly millions of them) of integers, where each set could potentially contain around 1,000 elements:
Set1: [1, 3, 7]
Set2: [1, 5, 8, 10]
Set3: [1, 3, 11, 14, 15]
...,
Set1000000: [1, 7, 10, 19]
Imagine a particular input set:
InputSet: [1, 7]
I now want to quickly calculate which sets this InputSet is a subset of. In this particular case, it should return Set1 and Set1000000.
Now, brute-forcing it takes too much time. I could also parallelise via Map/Reduce, but I'm looking for a more intelligent solution. Also, to a certain extent, it should be memory-efficient. I already optimised the calculation by making use of BloomFilters to quickly eliminate sets of which the input set could never be a subset.
Any smart technique I'm missing out on?
Thanks!
Well - it seems that the bottleneck is the number of sets, so instead of finding a set by iterating over all of them, you could enhance performance by mapping from elements to all sets containing them, and return the sets containing all the elements you searched for.
This is very similar to what is done in an AND query when searching the inverted index in the field of information retrieval.
In your example, you will have:
1 -> [set1, set2, set3, ..., set1000000]
3 -> [set1, set3]
5 -> [set2]
7 -> [set1, set1000000]
8 -> [set2]
...
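A sketch of that lookup in Python: build the index once, then intersect the posting lists of the input's elements, starting from the shortest one (names and layout are assumptions, not from the question):

    from collections import defaultdict

    def build_index(sets):
        """sets: dict set_id -> iterable of ints."""
        index = defaultdict(set)
        for sid, members in sets.items():
            for x in members:
                index[x].add(sid)
        return index

    def supersets_of(input_set, index):
        postings = [index.get(x, set()) for x in input_set]
        if not postings:
            return set()
        postings.sort(key=len)              # intersect the shortest list first
        result = set(postings[0])
        for p in postings[1:]:
            result &= p
            if not result:
                break
        return result

    sets = {1: [1, 3, 7], 2: [1, 5, 8, 10], 3: [1, 3, 11, 14, 15], 1000000: [1, 7, 10, 19]}
    print(supersets_of([1, 7], build_index(sets)))    # {1, 1000000}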
EDIT:
In inverted index in IR, to save space we sometimes use d-gaps - meaning we store the offset between documents and not the actual number. For example, [2,5,10] will become [2,3,5]. Doing so and using delta encoding to represent the numbers tends to help a lot when it comes to space.
(Of course there is also a downside: you need to read the entire list in order to find whether a specific set/document is in it, and cannot use binary search, but it is sometimes worth it, especially if it is the difference between fitting the index into RAM or not).
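For illustration, a tiny sketch of the d-gap conversion described above:

    def to_gaps(postings):                  # [2, 5, 10] -> [2, 3, 5]
        return [b - a for a, b in zip([0] + postings, postings)]

    def from_gaps(gaps):                    # [2, 3, 5] -> [2, 5, 10]
        out, total = [], 0
        for g in gaps:
            total += g
            out.append(total)
        return out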
How about storing a list of the sets which contain each number?
1 -- 1, 2, 3, 1000000
3 -- 1, 3
5 -- 2
etc.
Extending amit's solution, instead of storing the actual numbers, you could just store intervals and their associated sets.
For example, using an interval size of 5:
(1-5): [1,2,3,1000000]
(6-10): [1,2,1000000]
(11-15): [3]
(16-20): [1000000]
In the case of (1,7) you should consider intervals (1-5) and (6-10) (which can be determined simply by knowing the size of the interval). Intersecting those ranges gives you [1,2,1000000]. A binary search of the sets then shows that (1,7) indeed exists in set1 and set1000000, but not in set2.
Though you'll want to check the min and max values for each set to get a better idea of what the interval size should be. For example, 5 is probably a bad choice if the min and max values go from 1 to a million.
You should probably keep it so that a binary search can be used to check for values, so the interval size should be something like (max - min)/N, where 2N is the maximum number of values that will need to be binary searched in each set. For example, "does set 3 contain any values from 5 to 10?" This is done by finding the closest values to 5 (which is 3) and to 10 (which is 11); in this case, no, it does not. You would have to go through each set and do binary searches for the interval values that could be within the set. This means ensuring that you don't go searching for 100 when the set only goes up to 10.
You could also just store the range (min and max) of each set. However, the issue is that I suspect your numbers are going to be clustered, thus not providing much use. Although, as mentioned, it'll probably be useful for determining how to set up the intervals.
It'll still be troublesome to pick what range to use: too large and it'll take a long time to build the data structure (1000 * million * log(N)); too small, and you'll start to run into space issues. The ideal size of the range is probably such that it ensures that the number of sets related to each range is approximately equal, while also ensuring that the total number of ranges isn't too high.
Edit:
One benefit is that you don't actually need to store all intervals, just the ones you need. Although, if you have too many unused intervals, it might be wise to adjust the interval size and split the current intervals to ensure that the search is fast. This is especially true if preprocessing time isn't a major issue.
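A hedged sketch of the interval index described above, with an assumed interval size of 5 and bisect-based membership checks on the candidate sets:

    import bisect
    from collections import defaultdict

    INTERVAL = 5                                # a guess; tune to your data

    def bucket(x):
        return (x - 1) // INTERVAL              # 1-5 -> 0, 6-10 -> 1, ...

    def build_interval_index(sets):
        """sets: dict set_id -> sorted list of ints."""
        index = defaultdict(set)
        for sid, members in sets.items():
            for x in members:
                index[bucket(x)].add(sid)
        return index

    def contains(members, x):                   # binary search in one sorted set
        i = bisect.bisect_left(members, x)
        return i < len(members) and members[i] == x

    def query(input_values, sets, index):
        candidates = None
        for x in input_values:
            ids = index.get(bucket(x), set())
            candidates = set(ids) if candidates is None else candidates & ids
        return [sid for sid in (candidates or ())
                if all(contains(sets[sid], x) for x in input_values)]

    sets = {1: [1, 3, 7], 2: [1, 5, 8, 10], 3: [1, 3, 11, 14, 15], 1000000: [1, 7, 10, 19]}
    print(query([1, 7], sets, build_interval_index(sets)))   # [1, 1000000] (order may vary)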
Start searching from the biggest number (7) of the input set and eliminate the sets that do not contain it (Set1 and Set1000000 will remain). Then search for the other input elements (1) in the remaining sets.
This came out incomprehensible, so I will rephrase.
Is there an algorithm or approach that will allow sorting an array in such a way that it minimizes the differences between successive elements?
    struct element
    {
        uint32 positions[8];
    };
These records are order-insensitive.
The output file format is defined to be:
    byte   present;      // each bit indicates whether positions[i] is present
    uint32 position0;
    ...                  // only positions whose bit is set in 'present' are written
    uint32 positionN;    // N is the bitcount of "present"
    byte   nextpresent;
All records are guaranteed to be unique, so a 'present' byte of 0 represents EOF.
The file is parsed by updating a "current" structure with the present fields, and the result is added to the list.
Eg: { 1, 2, 3}, { 2, 3, 2}, { 4, 2, 3}
Would be: 111b 1 2 3 001b 4 111b 2 3 2
Saving 2 numbers off the unsorted approach.
My goal is to minimize the output file size.
Your problem
I think this question should really be tagged with 'compression'.
As I understand it, you have unordered records which consist of eight 4-byte integers: 32 bytes in total. You want to store these records with a minimum file size, and have decided to use some form of delta encoding based on a Hamming distance. You're asking how to best sort your data for the compression scheme you've constructed.
Your assumptions
From what you've told us, I don't see any real reason for you to split up your 32 bytes in the way you've described (apart from the fact that word boundaries are convenient)! If you get the same data back, do you really care if it's encoded as eight lots of 4 bytes, or sixteen lots of 2 bytes, or as one huge 32-byte integer?
Furthermore, unless there's something about the problem domain which makes your method the favourite, your best bet is probably to use a tried-and-tested compression scheme. You should be able to find code that's already written, and you'll get good performance on typical data.
Your question
Back to your original question, if you really do want to take this route. It's easy to imagine picking a starting record (I don't think it will make much difference which, but it probably makes sense to pick the 'smallest' or 'largest'), and computing the Hamming distance to all other records. You could then pick the one with the minimum distance to store next, and repeat. Obviously this is O(n^2) in the number of records. Unfortunately, this paper (which I haven't read or understood in detail) makes it look like computing the minimum Hamming distance from one string to a set of others is intrinsically hard, and doesn't have very good approximations.
You could obviously get better complexity by sorting your records based on Hamming weight (which comes down to the population count of that 32-byte integer), which is O(n log(n)) in the number of records. Then use some difference coding on the result. But I don't think this will make a terribly good compression scheme: the integers from 0 to 7 might end up as something like:
000, 100, 010, 001, 101, 011, 110, 111
0, 4, 2, 1, 5, 3, 6, 7
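For illustration, a small Python sketch of that O(n log(n)) weight sort; treating each record as a tuple of eight 32-bit words is my assumption about the layout:

    def popcount(record):
        """record: tuple of 32-bit ints (the positions array)."""
        return sum(bin(word).count("1") for word in record)

    def weight_sorted(records):
        # tie-break on the value itself so equal-weight records sit together
        return sorted(records, key=lambda r: (popcount(r), r))

    records = [(1, 2, 3, 0, 0, 0, 0, 0), (2, 3, 2, 0, 0, 0, 0, 0), (4, 2, 3, 0, 0, 0, 0, 0)]
    print(weight_sorted(records))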
Which brings us back to the question I asked before: are you sure your compression scheme is better than something more standard for your particular data?
You're looking at a pair of subproblems: defining the difference between structures, then the sort itself.
I'm not terribly clear on your description of the structure, nor on the precedence of differences, but I'll assume you can work that out and compute a difference score between two instances. For files, there are known algorithms for discussing these things, like the one used in diff.
For your ordering, you're looking at a classic travelling salesman problem. If you're sorting a few of these things, it's easy. If you are sorting a lot of them, you'll have to settle for a 'good enough' sort, unless you're ready to apply domain knowledge and many little tricks from TSP to the effort.
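If a 'good enough' sort is acceptable, a greedy nearest-neighbour pass is about the simplest TSP-style heuristic. The sketch below uses bit-level Hamming distance as an assumed difference score (for the file format above you would count differing positions instead), and it is O(n^2), so only for modest record counts:

    def hamming(a, b):
        return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

    def greedy_order(records, dist=hamming):
        remaining = list(records)
        order = [remaining.pop(0)]          # arbitrary start; could pick min or max
        while remaining:
            last = order[-1]
            i = min(range(len(remaining)), key=lambda j: dist(last, remaining[j]))
            order.append(remaining.pop(i))
        return order

    records = [(1, 2, 3), (2, 3, 2), (4, 2, 3)]
    print(greedy_order(records))            # [(1, 2, 3), (4, 2, 3), (2, 3, 2)]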