My limited understanding is that quantile and quartile are somehow similar but distinct measures. I googled but could not find an easy-to-understand explanation. There is a D3-related question here, but no answer yet.
My specific question is: when should we use a quantile instead of a quartile, or vice versa? I would appreciate any lay-term explanation or trivial example. Thanks!
The cumulative distribution function gives you the probability of a random variable being at or below a certain value.
The quantile function is the inverse of that: you give it a probability and it tells you the corresponding value of the random variable.
So the median is the value of the quantile function at probability 0.5.
The quartiles are the values of the quantile function at the probabilities 0.25, 0.5 and 0.75.
So, in general, you can use quantiles; quartiles are a special case.
From Wikipedia:
Quantiles are values taken at regular intervals from the inverse of the cumulative distribution function (CDF) of a random variable. Dividing ordered data into q essentially equal-sized data subsets is the motivation for q-quantiles; the quantiles are the data values marking the boundaries between consecutive subsets.
The 4-quantiles are called quartiles.
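As an illustration (made-up data, using NumPy), the same quantile call gives you the quartiles, the median, or any other quantile, depending on the probabilities you pass:

import numpy as np

data = np.array([3, 7, 8, 5, 12, 14, 21, 13, 18])   # made-up sample

quartiles = np.quantile(data, [0.25, 0.5, 0.75])     # the three quartiles
median = np.quantile(data, 0.5)                      # same as np.median(data)
percentile_90 = np.quantile(data, 0.9)               # any other quantile

print(quartiles, median, percentile_90)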
Related
I understand that the F1-measure is the harmonic mean of precision and recall. But what values define how good or bad an F1-measure is? I can't seem to find any references (Google or academic) answering my question.
Consider sklearn.dummy.DummyClassifier(strategy='uniform'), which is a classifier that makes random guesses (a.k.a. a bad classifier). We can view DummyClassifier as a benchmark to beat; now let's look at its f1-score.
In a binary classification problem with a balanced dataset (6198 samples in total: 3099 labelled 0 and 3099 labelled 1), the f1-score is 0.5 for both classes, and the weighted average is 0.5.
Second example: using DummyClassifier(strategy='constant'), i.e. guessing the same label every time (label 1 in this case), the average of the f1-scores is 0.33, while the f1 for label 0 is 0.00.
I consider these to be bad f1-scores, given the balanced dataset.
P.S. The summaries were generated using sklearn.metrics.classification_report.
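For reference, a minimal sketch of how such baseline reports can be reproduced (synthetic balanced data standing in for the real dataset):

import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import classification_report

rng = np.random.default_rng(0)
X = rng.normal(size=(6198, 5))                  # dummy features
y = np.array([0] * 3099 + [1] * 3099)           # balanced labels

# Random-guessing baseline
uniform = DummyClassifier(strategy='uniform', random_state=0).fit(X, y)
print(classification_report(y, uniform.predict(X)))

# Always-predict-1 baseline
constant = DummyClassifier(strategy='constant', constant=1).fit(X, y)
print(classification_report(y, constant.predict(X)))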
You did not find any reference for an F1 range that counts as "good" because there is no such absolute range. The F1 measure is a combined metric of precision and recall.
Say you have two algorithms, where one has higher precision and lower recall. From that observation alone you cannot tell which algorithm is better, unless your goal is simply to maximize precision.
So, given this ambiguity about how to pick the superior algorithm of the two (one with higher recall, the other with higher precision), we use the f1-measure to choose between them.
The f1-measure is a relative measure, which is why there is no absolute range that defines how good your algorithm is.
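Concretely, F1 is the harmonic mean of precision and recall; a quick made-up example of the trade-off:

# Illustrative numbers only, to show how F1 combines precision and recall.
def f1(precision, recall):
    return 2 * precision * recall / (precision + recall)

print(f1(0.9, 0.5))   # high precision, low recall  -> ~0.64
print(f1(0.7, 0.7))   # balanced                    -> 0.70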
I have a 2D array of floating-point numbers, and I'd like to divide this array into an arbitrary number of regions such that the sums of the regions' elements are more or less equal. The regions must be contiguous. By as-equal-as-possible, I mean that the standard deviation of the region sums should be as small as possible.
I'm doing this because I have a map of values corresponding to the "population" in an area, and I want to divide this area into groups of relatively equal population.
Thanks!
I would do it like this:
1. Compute the whole sum.
2. Compute local centers of mass (coordinates).
3. Compute the target region sum, for example:
   region sum = whole sum / number of centers of mass
4. For each center of mass:
   - start a region,
   - incrementally increase its size until its sum matches the target region sum,
   - avoid intersection of regions (use a usage map for that),
   - stop when the region has the desired sum or has nowhere left to grow.
You will have to tweak this algorithm a little to suit your needs and input data; see the rough sketch below.
Hope it helps a little ...
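A rough Python sketch of the greedy growing idea above (my own illustration, not the answerer's code; the seeds are hand-picked here rather than computed as real centers of mass):

import heapq
import numpy as np

def grow_regions(values, seeds):
    # Grow one region per seed until each reaches the target sum or is blocked.
    # Returns (labels, region_sums); label -1 means unassigned.
    target = values.sum() / len(seeds)
    labels = np.full(values.shape, -1, dtype=int)
    sums = [0.0] * len(seeds)
    frontiers = [[] for _ in seeds]              # one priority queue of candidate cells per region

    for k, (r, c) in enumerate(seeds):
        labels[r, c] = k
        sums[k] += values[r, c]
        push_neighbours(values, labels, frontiers[k], r, c)

    grew = True
    while grew:
        grew = False
        for k in range(len(seeds)):              # round-robin so regions grow evenly
            if sums[k] >= target:
                continue
            while frontiers[k]:
                _, r, c = heapq.heappop(frontiers[k])
                if labels[r, c] == -1:           # cell still free: claim it
                    labels[r, c] = k
                    sums[k] += values[r, c]
                    push_neighbours(values, labels, frontiers[k], r, c)
                    grew = True
                    break
    return labels, sums

def push_neighbours(values, labels, frontier, r, c):
    # Add unclaimed 4-neighbours of (r, c) to the frontier, smallest values first.
    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < values.shape[0] and 0 <= nc < values.shape[1] and labels[nr, nc] == -1:
            heapq.heappush(frontier, (values[nr, nc], nr, nc))

rng = np.random.default_rng(0)
population = rng.random((20, 20))                # made-up "population" map
seeds = [(5, 5), (5, 15), (15, 5), (15, 15)]     # stand-ins for centers of mass
labels, sums = grow_regions(population, seeds)
print("region sums:", [round(s, 2) for s in sums], "target:", round(population.sum() / 4, 2))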
Standard deviation is a way to measure whether the divisions are close to equal: the lower the standard deviation, the closer the region sums are.
Since the problem appears NP-hard, like clustering problems, genetic algorithms can be used to get good solutions:
1. Use the standard deviation of the region sums as the fitness measure for chromosomes (see the sketch after this list).
2. For k contiguous regions, each gene (element) takes one of k values, subject to maintaining the contiguity of the regions.
3. Apply the genetic algorithm to the chromosomes and keep the best chromosome for that value of k after a fixed number of generations.
4. Vary k from 2 to n and take the best chromosome overall.
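A minimal sketch of the fitness measure (my own illustration, not the answerer's code): given a labelling of cells into k regions, the fitness is the standard deviation of the region sums, and lower is fitter. Contiguity checking and repair are left out here.

import numpy as np

def fitness(values, labels, k):
    # Lower is better: how unevenly the total is split across the k regions.
    region_sums = np.array([values[labels == r].sum() for r in range(k)])
    return region_sums.std()

rng = np.random.default_rng(1)
values = rng.random((10, 10))
labels = rng.integers(0, 4, size=values.shape)   # random (not contiguous) example labelling
print(fitness(values, labels, 4))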
I've implemented code in MATLAB that is similar to the Hamming distance. For input I have one matrix, and I want to apply my formula, which uses the Hamming distance. My formula is like this:
It considers two rows (x, y) and applies the formula: |x - y| is the Hamming distance of the two rows, and then the element-wise maximum of the two rows is taken. For example:
x = (1, 0.3, 0)
y = (0, 0.1, 1)
For every pair of rows of the matrix, S is obtained.
The code in MATLAB is:
for j = 1:4
    x = fin(j,:);            % j-th row of the input matrix fin
    for i = j+1:5
        y = fin(i,:);        % i-th row, paired with x
        s1 = 1 - hamming1;   % hamming1 = the distance formula (not shown) applied to x and y
    end
end
My question is: what is the complexity (big-O) of my code and formula?
What is the complexity of the Hamming distance?
The algorithm is linear in the product of lengths of x and y - O(len(x)*len(y)) - as indicated by the double sum.
Note, however, that it is very hard to be absolutely sure because of so many typos in your question, as well as hard-coded constants in your code (which, technically, make the algorithm complexity constant).
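For intuition, a rough Python sketch (my own illustration, not the asker's MATLAB) of where the cost comes from: the row-pair loops contribute a factor of O(R^2), and evaluating the per-pair formula adds its own factor on top (O(C) for a single sum over the C columns, O(C^2) if the formula really contains a double sum):

import numpy as np

def pairwise_similarity(fin, per_pair):
    R = fin.shape[0]
    S = np.zeros((R, R))
    for j in range(R):                 # R rows
        for i in range(j + 1, R):      # roughly R^2 / 2 pairs in total
            S[j, i] = 1 - per_pair(fin[j], fin[i])
    return S

# Placeholder for the omitted formula: a normalised sum of |x - y|, O(C) per pair.
example_formula = lambda x, y: np.abs(x - y).sum() / len(x)

fin = np.random.rand(5, 3)             # stand-in matrix (5 rows, as the loops suggest)
print(pairwise_similarity(fin, example_formula))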
There is a very expensive computation I must make frequently.
The computation takes a small array of numbers (with about 20 entries) that sums to 1 (i.e. the histogram) and outputs something that I can store pretty easily.
I have 2 things going for me:
1. I can accept approximate answers.
2. The "answers" change slowly. For example, [.1 .1 .8 0] and [.1 .1 .75 .05] will yield similar results.
Consequently, I want to build a look-up table of answers off-line. Then, when the system is running, I can look up an approximate answer based on the "shape" of the input histogram.
To be precise, I plan to look up the precomputed answer that corresponds to the histogram with the minimum Earth Mover's Distance (EMD) to the actual input histogram.
I can only afford to store about 80 to 100 precomputed (histogram, computation result) pairs in my look-up table.
So, how do I "spread out" my precomputed histograms so that, no matter what the input histogram is, I'll always have a precomputed result that is "close"?
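For concreteness, the lookup step could be sketched like this (placeholder histograms; SciPy's 1-D Wasserstein distance is the Earth Mover's Distance for histograms over shared bin positions):

import numpy as np
from scipy.stats import wasserstein_distance

BINS = np.arange(20)                                  # bin positions shared by all histograms

def nearest_precomputed(query_hist, table):
    # table: list of (histogram, precomputed_result) pairs
    best = min(table, key=lambda entry: wasserstein_distance(BINS, BINS, query_hist, entry[0]))
    return best[1]

# Toy table of ~100 random histograms standing in for the real precomputed ones.
rng = np.random.default_rng(0)
table = [(h, f"result_{i}") for i, h in enumerate(rng.dirichlet(np.ones(20), size=100))]
query = rng.dirichlet(np.ones(20))
print(nearest_precomputed(query, table))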
Finding N points in M-space that are a best spread-out set is more-or-less equivalent to hypersphere packing (1,2) and in general answers are not known for M>10. While a fair amount of research has been done to develop faster methods for hypersphere packings or approximations, it is still regarded as a hard problem.
It probably would be better to apply a technique like principal component analysis or factor analysis to as large a set of histograms as you can conveniently generate. The results of either analysis will be a set of M numbers such that linear combinations of histogram data elements weighted by those numbers will predict some objective function. That function could be the “something that you can store pretty easily” numbers, or could be case numbers. Also consider developing and training a neural net or using other predictive modeling techniques to predict the objective function.
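A minimal sketch of that idea (made-up histograms and objective values, not jwpat7's code): reduce the histograms with PCA, then fit a simple model from the components to the objective:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
hists = rng.dirichlet(np.ones(20), size=5000)          # many representative histograms
objective = hists @ rng.normal(size=20)                # stand-in for the expensive result

pca = PCA(n_components=3).fit(hists)
components = pca.transform(hists)                      # a few numbers per histogram
model = LinearRegression().fit(components, objective)

new_hist = rng.dirichlet(np.ones(20))
print(model.predict(pca.transform(new_hist.reshape(1, -1))))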
Building on #jwpat7's answer, I would apply k-means clustering to a huge set of randomly generated (and hopefully representative) histograms. This would ensure that your space was spanned with whatever number of exemplars (precomputed results) you can support, with roughly equal weighting for each cluster.
The trick, of course, will be generating representative data to cluster in the first place. If you can recompute from time to time, you can recluster based on the actual data in the system so that your clusters might get better over time.
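A sketch of that approach (random stand-in histograms; the real ones should come from the system), keeping the cluster centers as the 80-100 exemplars to precompute:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
hists = rng.dirichlet(np.ones(20), size=10000)     # stand-in for representative histograms

kmeans = KMeans(n_clusters=90, n_init=10, random_state=0).fit(hists)
exemplars = np.clip(kmeans.cluster_centers_, 0, None)
exemplars /= exemplars.sum(axis=1, keepdims=True)  # re-normalise so each sums to 1
print(exemplars.shape)                             # ~90 histograms to precompute answers for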
I second jwpat7's answer, but my very naive approach was to consider the count of items in each histogram bin as a y value, to consider the x values as just 0..1 in 20 steps, and then to obtain parameters a,b,c that describe x vs y as a cubic function.
To get a "covering" of the histograms I just iterated through "possible" values for each parameter.
e.g. to get 27 histograms to cover the "shape space" of my cubic histogram model I iterated the parameters through -1 .. 1, choosing 3 values linearly spaced.
Now, you could change the histogram model to be quartic if you think your data will often be represented that way, or whatever model you think is most descriptive, as well as generate however many histograms to cover. I used 27 because three partitions per parameter for three parameters is 3*3*3=27.
For a more comprehensive covering, like 100, you would have to choose your ranges for each parameter more carefully. 100**(1/3) isn't an integer, so the simple num_covers**(1/num_params) solution wouldn't work, but for 3 parameters 4*5*5 would.
Since the actual values of the parameters could vary greatly and still produce the same shape, it would probably be best to store ratios of them for comparison instead, e.g. for my 3 parameters, b/a and b/c.
Here is an 81 histogram "covering" using a quartic model, again with parameters chosen from linspace(-1,1,3):
Edit: Since you said your histograms are described by arrays of ~20 elements, I figured fitting the parameters would be very fast.
Edit 2: On second thought, I think using a constant term in the model is pointless; all that matters is the shape.
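A rough reconstruction of the covering idea (my own sketch, not the answerer's code): enumerate parameter combinations on a grid and evaluate the cubic model, without a constant term, on 20 x-values:

import itertools
import numpy as np

x = np.linspace(0, 1, 20)                       # 20 bins mapped onto 0..1
param_values = np.linspace(-1, 1, 3)

covering = []
for a, b, c in itertools.product(param_values, repeat=3):   # 3*3*3 = 27 shapes
    covering.append(a * x**3 + b * x**2 + c * x)
covering = np.array(covering)
print(covering.shape)                           # (27, 20)

# Fitting the model to an actual 20-bin histogram could look like this:
hist = np.random.default_rng(0).dirichlet(np.ones(20))
a, b, c = np.polyfit(x, hist, deg=3)[:3]        # drop the constant term, per the edit above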
I've got a relatively small set of integers (~100 values): each of them represents how long (in milliseconds) a test I ran lasted.
The trivial algorithm to calculate the average is to sum up all n values and divide the result by n, but this doesn't take into account that some ridiculously high/low values are probably wrong and should be discarded.
What algorithms are available to estimate the actual average value?
As you said, you can discard all values that diverge more than a given amount from the average and then recompute the average. Another value that can be interesting is the median, i.e. the middle value of the sorted data (not to be confused with the mode, the most frequent value).
It depends on the conditions of your test; this is a task from probability theory.
One of the simplest ways is to calculate the median, so that you can deal with ridiculously high/low values. Look at the link below:
Wiki about median
As you noted, the arithmetic mean isn't good if there are very high/low values.
You could compute the median, as others suggested, which is, in a sorted list of your values, the "middle" value (if your set contains an odd number of items) or the arithmetic mean of the two "middle" values (otherwise).
Another method would be to drop, say, the lowest and highest five percent of the values and compute the arithmetic mean of the rest (a trimmed mean).
Some options:
First discard the N highest and lowest values, then compute the arithmetic mean of the rest. Set N to a suitable value so that, for example, 1% or 10% of the values are discarded.
Use the median, or middle value.
Use the geometric mean, which gives less weight to large outliers.
Wikipedia lists some ways to compute different "mean" values
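For reference, a quick sketch of the options above with NumPy/SciPy (the timing values are made up):

import numpy as np
from scipy import stats

times = np.array([102, 98, 101, 97, 103, 99, 100, 2500, 3])   # two obvious outliers

print(np.mean(times))                    # plain mean, badly skewed by 2500
print(np.median(times))                  # middle value, robust to the outliers
print(stats.trim_mean(times, 0.2))       # mean after dropping 20% at each end (one value here)
print(stats.gmean(times))                # geometric mean (values must be > 0)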