I am looking for an algorithm to generate a histogram over a large amount of streaming data; the max and min are not known in advance, but the standard deviation and mean lie within a particular range.
I appreciate your ideas.
Cheers,
I just found one solution: the on-line histogram building technique in Sec. 2.2 of the "A Streaming Parallel Decision Tree Algorithm" paper. The algorithm is implemented by the NumericHistogram class in the Hive project:
A generic, re-usable histogram class that supports partial aggregations. The algorithm is a heuristic adapted from the following paper: Yael Ben-Haim and Elad Tom-Tov, "A streaming parallel decision tree algorithm", J. Machine Learning Research 11 (2010), pp. 849--872. Although there are no approximation guarantees, it appears to work well with adequate data and a large (e.g., 20-80) number of histogram bins.
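For intuition, here is a minimal Python sketch of the bin-merging heuristic from the Ben-Haim/Tom-Tov paper (an illustration of the idea, not the Hive implementation): every incoming value becomes a unit-count bin, and whenever the bin budget is exceeded, the two closest bins are merged into their weighted average.

import bisect

class StreamingHistogram:
    def __init__(self, max_bins=50):
        self.max_bins = max_bins
        self.bins = []  # sorted list of [centroid, count] pairs

    def update(self, value):
        # Insert the new point as its own bin, keeping bins sorted by centroid.
        bisect.insort(self.bins, [value, 1])
        if len(self.bins) > self.max_bins:
            # Merge the adjacent pair whose centroids are closest.
            i = min(range(len(self.bins) - 1),
                    key=lambda j: self.bins[j + 1][0] - self.bins[j][0])
            (c1, k1), (c2, k2) = self.bins[i], self.bins[i + 1]
            self.bins[i:i + 2] = [[(c1 * k1 + c2 * k2) / (k1 + k2), k1 + k2]]

# h = StreamingHistogram(max_bins=40)
# for x in data_stream:          # data_stream is whatever source you read from
#     h.update(x)
# h.bins now holds (approximate centroid, count) pairs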
I use a package called "GoHistogram" which provides two streaming approximation histograms (NumericHistogram and Weighted Numeric Histogram). It is implemented in Golang (https://code.google.com). Here is the link:
https://github.com/VividCortex/gohistogram
Standard deviation and mean do not matter for a histogram. Simply choose your resolution and draw a bar as high as you have hits for its range. This will, of course, get more expensive with a higher resolution. You can try adjusting the resolution by trying to fit the existing data into a normal curve (or whatever model you like) and finding the standard deviation to choose a reasonable granularity.
Edit: Read it wrong the first time around. If you know the approximate standard deviation, you can choose reasonable sizes for your histogram groups from the get-go. Just compare every new entry to your current min and max and adjust your range accordingly.
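A tiny sketch of that approach, assuming you have a rough prior estimate of the standard deviation (sigma_estimate below is a made-up parameter): pick a fixed bin width from it up front and bucket each value by integer division, so you never need to know the min and max in advance; just track them as you go.

import math
from collections import defaultdict

def stream_histogram(stream, sigma_estimate, bins_per_sigma=5):
    width = sigma_estimate / bins_per_sigma   # bin width chosen from the expected spread
    counts = defaultdict(int)                 # bin index -> hit count
    lo, hi = math.inf, -math.inf
    for x in stream:
        counts[math.floor(x / width)] += 1
        lo, hi = min(lo, x), max(hi, x)       # running range, adjusted as data arrives
    return counts, width, (lo, hi)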
I am working with a sample data set to learn clustering. This data set contains the number of occurrences of each keyword.
Since all the values are occurrence counts for different keywords, is it OK not to scale the values and use them as they are?
I read a couple of articles on the internet where it is emphasized that scaling is important because it adjusts the relative weight of the frequencies. Since most of the frequencies are 0 (95%+), z-score scaling will change the shape of the distribution, which I feel could be a problem since I am changing the nature of the data.
I am thinking of not changing the values at all to avoid this. Will that affect the quality of the results I get from the clustering?
As was already noted, the answer depends heavily on the algorithm being used.
If you're using distance-based algorithms with the (usually default) Euclidean distance (for example, k-means or k-NN), they will rely more on features with a bigger range simply because a "typical difference" of values of that feature is bigger.
Non-distance-based models can be affected too. One might think that linear models do not fall into this category, since scaling (and translating, if needed) is a linear transformation, so if it makes results better, the model should learn it, right? It turns out the answer is no. The reason is that nobody uses vanilla linear models; they're always used with some sort of regularization which penalizes overly large weights. This can prevent your linear model from learning the scaling from the data.
There are models that are independent of the feature scale. For example, tree-based algorithms (decision trees and random forests) are not affected. A node of a tree partitions your data into 2 sets by comparing a feature (the one that splits the dataset best) against a threshold value. There's no regularization for the threshold (because one should keep the height of the tree small), so it's not affected by different scales.
That being said, it's usually advised to standardize (subtract mean and divide by standard deviation) your data.
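For reference, the standardization mentioned above is only a couple of lines in Python/numpy (the small epsilon is just a guard against constant, zero-variance columns):

import numpy as np

def standardize(X, eps=1e-12):
    # Column-wise z-score: subtract the mean and divide by the standard deviation.
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / (X.std(axis=0) + eps)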
It probably depends on the classification algorithm. I'm only familiar with SVMs; please see Ch. 2.2 for an explanation of scaling.
The type of feature (word counts) doesn't matter, but the feature ranges should be more or less similar. If the count of e.g. "dignity" is 10 and the count of "have" is 100000000 in your texts, then (at least with SVM) the results for such features would be less accurate than if you scaled both counts to a similar range.
The cases where no scaling is needed are those where the data is scaled implicitly, e.g. when the features are pixel values in an image: the data is already scaled to the range 0-255.
*Distance-based algorithms need scaling.
*There is no need for scaling in tree-based algorithms.
But it is good practice to scale your data and train the model; if possible, compare the model accuracy and other evaluation metrics before and after scaling, and use whichever works best.
This is as per my knowledge.
OK, so you have some historic data in the form of [say] an array of integers. This, for example, could represent free-space on a server HDD over a two-year period, with each array element representing a daily sample.
The data (free-space in this example) has a downward trend, but also has periodic positive spikes where files have been removed or compressed, etc.
How would you go about identifying the overall trend for the two-year period, i.e.: iron out the peaks and troughs in the data?
Now, I did A-level statistics and then a stats module in my degree, but I've slept over 7,000 times since then, and well, it's leaked out of my brain.
I'm not after a bit of code as such, more of a description of how you'd approach this problem...
Thanks in advance!
You'll get many different answers, and the one you choose really depends on more specific requirements you may have. Examples:
Low-pass filter, or any other spectral analysis technique, and use the low frequencies to determine trend.
Linear regression (time/value) to find "r" (the correlation between time and the value).
Moving average of the last "n" samples. If "n" is large enough this is my favorite, as it is often sufficient and very easy to code. It's a sort of approximation to #1 above (see the sketch after this list).
I'm sure there'll be others.
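Here is the moving-average sketch promised above (a plain trailing window; n is the window length you would tune to how much smoothing you want):

import numpy as np

def moving_average(values, n):
    # Mean of each window of n consecutive samples; output is n - 1 points shorter.
    values = np.asarray(values, dtype=float)
    return np.convolve(values, np.ones(n) / n, mode="valid")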
If I were doing this to produce a line through the points for me to look at, I would probably use some variant of Loess, described at http://en.wikipedia.org/wiki/Local_regression and http://stat.ethz.ch/R-manual/R-patched/library/stats/html/loess.html. Basically, you find the smoothed value at any particular point by doing a weighted regression on the data points near that point, with the nearest points given the most weight.
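If you end up working in Python rather than R, statsmodels provides a lowess function that does much the same local regression; a rough usage sketch (the data here is made up, and frac controls how local the fit is):

import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

x = np.arange(730, dtype=float)                        # e.g. two years of daily samples
y = 500.0 - 0.5 * x + 30.0 * np.random.randn(730)      # toy downward-trending free space
smoothed = lowess(y, x, frac=0.2)                      # array of (x, fitted value) pairs
trend = smoothed[:, 1]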
Articles on image compression often focus on generating the best possible image quality (PSNR) given a fixed compression ratio. I'm curious about getting the best possible compression ratio given a maximum permissible per-pixel error. My instinct is to greedily remove the smallest coefficients in the transformed data, keeping track of the error introduced, until I can't remove any more without exceeding the maximum error. But I can't find any papers to confirm it. Can anyone point me to a reference about this problem?
edit
Let me give some more details. I'm trying to compress depth images from 3D scanners, not regular images. Color is not a factor. Depth images tend to have large smooth patches, but accurate discontinuities are important. Some pixels will be empty - outside the scanner's range or low confidence level - and not require compression.
The algorithm will need to run fast - optimally at 30 fps like the Microsoft Kinect, or at least somewhere in the 100 millisecond area. The algorithm will be included in a library I distribute. I prefer to minimize dependencies, so compression schemes that I can implement myself in a reasonably small amount of code are preferable.
This answer won't satisfy your request for references, but it's too long to post as a comment.
First, depth buffer compression for computer generated imagery may apply to your case. Usually this compression is done at the hardware level with a transparent interface, so it's typically designed to be simple and fast. Given this, it may be worth your while to search for depth buffer compression.
One of the major issues you're going to have with transform-based compressors (DCTs, wavelets, etc...) is that there's no easy way to find compact coefficients that meet your hard maximum error criterion. (The problem you end up with looks a lot like linear programming. Wavelets can have localized behavior in most of their basis vectors, which can help somewhat, but it's still rather inconvenient.) To achieve the accuracy you desire you may need to add another refinement step, but this will also add more computation time and complexity, and will introduce another layer of imperfect entropy coding, leading to a loss of compression efficiency.
What you want is more akin to lossless compression than lossy compression. In this light, one approach would be to simply throw away the bits under your error threshold: if your maximum allowable error is X and your depths are represented as integers, integer-divide your depths by X and then apply lossless compression.
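A sketch of that quantize-then-lossless idea in Python, with one small variation: rounding to multiples of 2*max_error (instead of integer-dividing by the error) keeps the per-pixel error at or below max_error. zlib stands in for whatever lossless coder you prefer.

import numpy as np
import zlib

def compress(depth, max_error):
    step = 2 * max_error                          # rounding error is at most step / 2
    q = np.round(depth / step).astype(np.int32)
    return zlib.compress(q.tobytes()), depth.shape

def decompress(blob, shape, max_error):
    step = 2 * max_error
    q = np.frombuffer(zlib.decompress(blob), dtype=np.int32)
    return q.reshape(shape).astype(np.float64) * step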
Another issue you're facing is the representation of depth -- depending on your circumstances it may be a floating point number, an integer, it may be in a projective coordinate system, or even more bizarre.
Given these restrictions, I recommend a scheme like BTPC, as it allows for a more easily adapted wavelet-like scheme where errors are more clearly localized and easier to understand and account for. Additionally, BTPC has shown great robustness on many types of images and a good ability to handle continuous gradients and sharp edges with low loss of fidelity -- exactly the sorts of traits you're looking for.
Since BTPC is predictive, it doesn't matter particularly how your depth format is stored -- you just need to modify your predictor to take your coordinate system and numeric type (integer vs. floating) into account.
Since BTPC doesn't do terribly much math, it can run pretty fast on general CPUs, too, although it may not be as easy to vectorize as you'd like. (It sounds like you're possibly doing low level optimized game programming, so this may be a serious consideration for you.)
If you're looking for something simpler to implement I'd recommend a "filter" type of approach (similar to PNG) with a Golomb-Rice coder strapped on. Rather than coding the deltas perfectly to end up with lossless compression, you can code to a "good enough" degree. The advantage of doing this as compared to a quantize-then-lossless-encode style compressor is that you can potentially maintain more continuity.
"greedily remove the smallest coefficients" reminds me of SVD compression, where you use the data associated with the first k largest eigenvalues to approximate the data. The rest of the eigenvalues that are small don't hold significant information and can be discarded.
Large k -> high quality, low compression
Small k -> lower quality, high compression
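To make that concrete, a small numpy sketch of a rank-k approximation: compute the SVD and keep only the k largest singular values and their vectors.

import numpy as np

def svd_compress(image, k):
    # Rank-k approximation: keep the k largest singular values / vectors.
    U, S, Vt = np.linalg.svd(np.asarray(image, dtype=float), full_matrices=False)
    return U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]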
(disclaimer: I have no idea what I'm talking about here but it might help)
edit:
here is a better illustration of SVD compression
I am not aware of any references in case of the problem you have proposed.
However, one direction I can think of is using optimization techniques to select the best coefficients. Techniques like genetic algorithms, hill climbing, and simulated annealing can be used in this regard.
Given that I have experience with genetic algorithms, I can suggest the following process. If you are not familiar with genetic algorithms, I recommend reading the Wikipedia page on them.
Your problem can be thought of as selecting a subset of coefficients that gives the minimum reconstruction error. Say there are N coefficients; it is easy to establish that there are 2^N subsets, and each subset can be represented by a string of N binary digits. For example, for N=5,
the string 11101 represents a subset that contains all the coefficients except coefficient 4. With genetic algorithms it is possible to find an optimum bit string. The objective function can be chosen as the absolute error between the reconstructed and the original signals. Note, however, that the error is zero when all the coefficients are kept.
To get around this problem, you may choose to modulate the objective function with an appropriate function that discourages objective-function values near zero and is monotonically increasing beyond a threshold. A function like |log(epsilon + f)| may suffice.
If what I propose seems interesting to you, do let me know. I have an implementation of a genetic algorithm, but it is tailored to my needs and you might not be able to adapt it to this problem. I am willing to work with you on this, as the problem seems interesting to explore.
Do let me know.
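To make the proposal a bit more concrete, here is a self-contained Python sketch of a genetic algorithm selecting a subset of FFT coefficients for a 1-D toy signal. It simplifies the objective described above: a bitstring is penalized heavily if its reconstruction exceeds the allowed per-sample error, and otherwise rewarded for keeping fewer coefficients. Every name and parameter here is illustrative, not taken from an existing implementation.

import numpy as np

rng = np.random.default_rng(0)

# Toy signal and its transform; in practice this would be your depth/image data.
signal = np.sin(np.linspace(0, 8 * np.pi, 256)) + 0.05 * rng.standard_normal(256)
coeffs = np.fft.rfft(signal)
max_error = 0.1
n = coeffs.size

def fitness(mask):
    # Reconstruct from the selected coefficients and score the bitstring.
    recon = np.fft.irfft(coeffs * mask, n=signal.size)
    err = np.max(np.abs(signal - recon))
    if err > max_error:
        return 1e6 + err          # infeasible: large penalty
    return float(mask.sum())      # feasible: fewer kept coefficients is better

pop = rng.integers(0, 2, size=(60, n))               # population of bitstrings
for generation in range(200):
    scores = np.array([fitness(m) for m in pop])
    parents = pop[np.argsort(scores)[:20]]           # truncation selection
    children = []
    while len(children) < len(pop) - len(parents):
        a, b = parents[rng.integers(20)], parents[rng.integers(20)]
        cut = rng.integers(1, n)                     # one-point crossover
        child = np.concatenate([a[:cut], b[cut:]])
        flip = rng.random(n) < 0.01                  # bit-flip mutation
        children.append(np.where(flip, 1 - child, child))
    pop = np.vstack([parents] + children)

best = pop[np.argmin([fitness(m) for m in pop])]
print("kept", int(best.sum()), "of", n, "coefficients")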
I think you are pretty close to the solution, but there is an issue I think you should pay some attention to.
Because different wavelet coefficients correspond to functions with different scales (and shifts), the error introduced by eliminating a particular coefficient depends not only on its value but also on its position (especially its scale). So the weight of a coefficient should be something like w(c) = amp(c) * F(scale, shift), where amp(c) is the amplitude of the coefficient and F is a function that depends on the data being compressed. When you determine the weights like that, the problem reduces to the knapsack problem, which can be solved in many ways (for example, reorder the coefficients and eliminate the smallest ones until you reach the threshold error on a pixel affected by the corresponding function). The hard part is determining F(scale, shift). You can do it in the following way: if the data you are compressing is relatively stable (for example, surveillance video), you could estimate F as the average probability of getting an unacceptable error when the component with a given scale and shift is eliminated from the wavelet decomposition. You could perform an SVD (or PCA) decomposition on historical data and calculate F(scale, shift) as a weighted sum (with weights equal to the eigenvalues) of the scalar products of the component with the given scale and shift with the eigenvectors:
F(scale, shift) = sum over i of eValue(i) * <w(scale, shift), eVector(i)>, where eValue(i) is the eigenvalue corresponding to eigenvector eVector(i), and w(scale, shift) is the wavelet function with the given scale and shift.
Iteratively evaluating different sets of coefficients will not help your goal of being able to compress frames as quickly as they are generated, and will not help you to keep complexity low.
Depth maps are different from intensity maps in several ways that can help you.
Large areas of "no data" can be handled very efficiently by run-length encoding.
Measurement error in intensity images is constant across the image after fixed-noise has been subtracted, but depth maps from both Kinects and stereo vision systems have errors that increase as an inverse function of depth. If these are the scanners you are targeting then you can use lossier compression for closer pixels - because the errors your lossy function introduces are independent of sensor error, the total error won't be increased until your lossy function's error is greater than the sensor error.
A team at Microsoft had a lot of success with a very low-loss algorithm that relied heavily on run-length encoding (see paper here), beating JPEG 2000 with better compression and excellent performance; however, part of their success seemed to stem from the relatively crude depth maps their sensor produces. If you are targeting Kinects, you may find it hard to improve on their method.
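As a tiny illustration of the run-length point above, encoding a scan line so that runs of repeated values (including whatever marker your scanner uses for "no data" pixels) collapse to a single pair might look like this:

def rle_encode(row):
    # Encode a scan line as (value, run_length) pairs; large constant regions,
    # including runs of the "no data" marker, collapse to a single pair.
    runs = []
    i = 0
    while i < len(row):
        j = i
        while j < len(row) and row[j] == row[i]:
            j += 1
        runs.append((int(row[i]), j - i))
        i = j
    return runs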
I think you are looking for something like the JPEG-LS algorithm, which tries to limit the maximum per-pixel error. That said, it is mainly designed for the compression of natural or medical images and is not well suited to depth images (which are smoother).
The term "near-lossless compression" refers to a lossy algorithm for which each reconstructed image sample differs from the corresponding original image sample by not more than a pre-specified value, the (usually small) "loss." Lossless compression corresponds to loss = 0. (link to the original reference)
I'd try preprocessing the image, then compressing with a general method, such as PNG.
Preprocessing for PNG (first read this)
import numpy as np

def quantize_toward_neighbors(A, threshold):
    # A: 2-D numpy array of depth values; modified in place.
    height, width = A.shape
    for y in range(height):
        for x in range(width):
            if x > 0 and abs(int(A[y][x - 1]) - int(A[y][x])) < threshold:
                A[y][x] = A[y][x - 1]
            elif y > 0 and abs(int(A[y - 1][x]) - int(A[y][x])) < threshold:
                A[y][x] = A[y - 1][x]
I have a set of data I have generated that consists of extracted mass values (well, m/z, but that's not so important) and a time. I extract the data from a file; however, it is possible to get repeat measurements, and this results in a large amount of redundancy within the dataset. I am looking for a method to cluster these in order to group those that are related, based on either similarity in mass alone, or similarity in mass and time.
An example of data that should be group together is:
m/z time
337.65 1524.6
337.65 1524.6
337.65 1604.3
However, I have no way to determine how many clusters I will have. Does anyone know of an efficient way to accomplish this, possibly using a simple distance metric? I am not familiar with clustering algorithms sadly.
http://en.wikipedia.org/wiki/Cluster_analysis
http://en.wikipedia.org/wiki/DBSCAN
Read the section about hierarchical clustering and also look into DBSCAN if you really don't want to specify how many clusters in advance. You will need to define a distance metric and in that step is where you would determine which of the features or combination of features you will be clustering on.
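If you go the DBSCAN route, a rough sketch with scikit-learn might look like the following; eps and min_samples are guesses you would tune, and the two features are rescaled first because m/z and time live on very different ranges.

import numpy as np
from sklearn.cluster import DBSCAN

data = np.array([[337.65, 1524.6],
                 [337.65, 1524.6],
                 [337.65, 1604.3]])          # columns: m/z, time

# Put the features on comparable scales before computing distances.
scaled = (data - data.mean(axis=0)) / (data.std(axis=0) + 1e-12)

labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(scaled)
# Points sharing a label belong to the same group; -1 marks noise/outliers.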
Why don't you just set a threshold?
If successive values (by time) differ by less than ±0.1 (in m/z), they are grouped together. Alternatively, use a relative threshold: differ by less than ±0.1%. Set these thresholds according to your domain knowledge.
That sounds like the straightforward way of preprocessing this data to me.
Using a "clustering" algorithm here seems total overkill to me. Clustering algorithms will try to discover much more complex structures than what you are trying to find here. The result will likely be surprising and hard to control. The straightforward change-threshold approach (which I would not call clustering!) is very simple to explain, understand and control.
For simple one-dimensional data, K-means clustering (http://en.wikipedia.org/wiki/K-means_clustering#Standard_algorithm) is appropriate and can be used directly. The only issue is selecting an appropriate K. One way to select a good K is to plot K vs. residual variance and choose the K at which the variance stops dropping dramatically. Another strategy is to use an information criterion (e.g. the Bayesian Information Criterion).
You can extend K-means to multi-dimensional data easily, but you should beware of the scaling of the individual dimensions. E.g. among the items (1KG, 1KM) and (2KG, 2KM), the nearest point to (1.7KG, 1.4KM) is (2KG, 2KM) with these scales. But once you start expressing the second dimension in meters, the opposite is probably true.
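A rough sketch of the K-selection step with scikit-learn (fit K-means for a range of K and inspect the residual variance, i.e. the inertia; the sample values are made up):

import numpy as np
from sklearn.cluster import KMeans

values = np.array([337.65, 337.65, 337.66, 340.10, 340.11, 355.02]).reshape(-1, 1)

for k in range(1, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(values)
    print(k, km.inertia_)   # plot K vs. inertia and pick the "elbow"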
Does anyone know of an algorithm that will group pictures into events based on the date each picture was taken? Obviously I can group by date, but I'd like something a little more sophisticated that would (or might) be able to group pictures spanning multiple days based on the frequency over a certain timespan. Consider the following groupings:
1/2/2009 15 photos
1/3/2009 20 photos
1/4/2009 13 photos
1/5/2009 19 photos
1/15/2009 5 photos
Potentially these would be grouped into two groups:
1/2/2009 -> 1/5/2009
1/15/2009
Obviously there will be some tolerance(s) that need to be established.
Is there any well-established way of doing this, other than inventing my own top-down approach?
You can apply pretty much any standard clustering technique to this; it's just a matter of defining your distance function correctly. When you are building your matrix of distances between photos, you should consider a combination of the physical distance between locations - if you have it - and the temporal distance between their creation timestamps. Normalise them, put them on separate dimensions, and you may even just be able to take a regular Euclidean distance.
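A hedged sketch of that with SciPy's hierarchical clustering, using timestamps only (if you had GPS coordinates, you would normalise them and stack them on as extra columns); the cut distance of 3 days is an assumption you would tune:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Photo timestamps expressed in days (could equally be seconds since the epoch).
days = np.array([2.0, 2.1, 3.0, 3.5, 4.0, 5.0, 5.2, 15.0]).reshape(-1, 1)

Z = linkage(days, method="single")                  # nearest-neighbour (single) linkage
labels = fcluster(Z, t=3.0, criterion="distance")   # split where the gap exceeds ~3 days
# labels assigns the first seven photos to one event and the last photo to another.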
Best of luck.
Just group together the pictures that were taken on consecutive days (with no intervening days on which no pictures were taken).
You might try to dynamically calculate the tolerance based on how many clusters you want to create, or how big (in absolute terms or as a percentage) they should be.
To get a useful clustering of pictures according to date you require the following:
1) The number of clusters should be variable and not fixed a priori to the clustering
2) The diameter of each cluster should not exceed a specific amount.
The clustering algorithm that best satisfies both requirements is the QT (quality threshold) clustering algorithm. From Wikipedia:
QT (quality threshold) clustering (Heyer, Kruglyak, Yooseph, 1999) is an alternative method of partitioning data, invented for gene clustering. It requires more computing power than k-means, but does not require specifying the number of clusters a priori, and always returns the same result when run several times.
Although it is mainly used for gene clustering I think it would fit in very well for what you need.
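A simplified Python sketch of the QT idea applied to dates in one dimension (diameter measured in days; max_diameter is the quality threshold you would tune). This illustrates the structure of the algorithm rather than an optimized implementation:

def qt_cluster(points, max_diameter):
    clusters = []
    remaining = sorted(points)
    while remaining:
        best = []
        for i in range(len(remaining)):
            # Grow a candidate cluster around remaining[i], always adding the
            # point that keeps the cluster's diameter (max - min) smallest.
            candidate = [remaining[i]]
            pool = remaining[:i] + remaining[i + 1:]
            while pool:
                nxt = min(pool, key=lambda p: max(candidate + [p]) - min(candidate + [p]))
                if max(candidate + [nxt]) - min(candidate + [nxt]) > max_diameter:
                    break
                candidate.append(nxt)
                pool.remove(nxt)
            if len(candidate) > len(best):
                best = candidate
        clusters.append(best)
        for p in best:
            remaining.remove(p)
    return clusters

# qt_cluster([2, 3, 4, 5, 15], max_diameter=4)  ->  [[2, 3, 4, 5], [15]]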
Try to detect the Gaps instead of the Clusters.
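For example, a gap-based split along those lines is only a few lines of Python (gap_days plays the role of the tolerance mentioned in the question):

def split_on_gaps(dates, gap_days=3):
    # dates: sorted list of datetime.date objects; a new event starts whenever
    # the gap to the previous photo exceeds the tolerance.
    events = [[dates[0]]]
    for prev, cur in zip(dates, dates[1:]):
        if (cur - prev).days > gap_days:
            events.append([cur])
        else:
            events[-1].append(cur)
    return events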