Random values consistent with a best-fit curve - random

I'm contemplating the generation of test data with interesting distributions.
I understand methods for the generation of uniform distribution and normal distribution, but how can I transform an arbitrary function into a weighted distribution function? My terminology may be off here - I won't mind corrections.
For example, let's say that I have a function over time which generally increases, but cycles periodically. "Activity" which increases generally over a year, but weekly cycles with sharp falloff on the weekends.
The function could be algebraic, but it would be valuable if it could be any function (imperative(?) with discrete/discontinuous ranges(?)).
If the Activity curve from the example is f(t), I could just make f(t) the mean and provide a fixed standard deviation, but how do I chose t if it too needs distribution? I don't want to have to iterate through T, I just want to select among T randomly with the appropriate distributions.
So the TestActivityGenerator() function takes parameters for curves between, say, an absolute date range, another curve over weeks, and another curve over hours in the day, and spits out DateTimes in the proper distributions. Results are not generated in any specific ordering.
Another scenario might be: a generator of reals which is, say, 1.652 times more likely to spit out a prime number than a composite. No tricks on this one - there are trivial ways to do this, but I'm looking for a general solution.
Thanks!
Edit: I've change the wording of the title to look at the problem from a different angle - How can we backtrack from a curve of best-fit to random samples that are consistent with that curve. If I have a histogram of stock market data, how can I generate data that is distributed similarly to the real data. Not just pairwise-values that average to the same value for each t, because they would fail other randomness tests.

Related

How to form precision-recall curve using one test dataset for my algorithm?

I'm working on knowledge graph, more precisely in natural language processing field. To evaluate the components of my algorithm, it is necessary to be able to classify the good and the poor candidates. For this purpose, we manually classified pairs in a dataset.
My system returns the relevant pairs according to the implementation logic. now I'm able to calculate :
Precision = X
Recall = Y
For establishing a complete curve I need the rest of points (X,Y), what should I do?:
build another dataset for test ?
split my dataset ?
or any other solution ?
Neither of your proposed two methods. In short, Precision-recall or ROC curve is designed for classifiers with probabilistic output. That is, instead of simply producing a 0 or 1 (in case of binary classification), you need classifiers that can provide a probability in [0,1] range. This is the function to do it in sklearn, note how the 2nd parameter is called probas_pred.
To turn this probabilities into concrete class prediction, you can then set a threshold, say at .5. Setting such a threshold is problematic however, since you can trade-off precision/recall by varying the threshold, and an arbitrary choice can give false impression of a classifier's performance. To circumvent this, threshold-independent measures like area under ROC or Precision-Recall curve is used. They create thresholds at different intervals, say 0.1,0.2,0.3...0.9, turn probabilities into binary classes and then compute precision-recall for each such threshold.

Finding optimal solution to multivariable function with non-negligible solution time?

So I have this issue where I have to find the best distribution that, when passed through a function, matches a known surface. I have written a script that creates the distribution given some parameters and spits out a metric that compares the given surface to the known, but this script takes a non-negligible time, so I can't just run through a very large set of parameters to find the optimal set of parameters. I looked into the simplex method, and it seems to be the right path, but its not quite what I need, because I dont exactly have a set of linear equations, and dont know the constraints for the parameters, but rather one method that gives a single output (an thats all). Can anyone point me in the right direction to how to solve this problem? Thanks!
To quickly go over my process / problem again, I have a set of parameters (at this point 2 but will be expanded to more later) that defines a distribution. This distribution is used to create a surface, which is compared to a known surface, and an error metric is produced. I want to find the optimal set of parameters, but cannot run through an arbitrarily large number of parameters due to the time constraint.
One situation consistent with what you have asked is a model in which you have a reasonably tractable probability distribution which generates an unknown value. This unknown value goes through a complex and not mathematically nice process and generates an observation. Your surface corresponds to the observed probability distribution on the observations. You would be happy finding the parameters that give a good least squares fit between the theoretical and real life surface distribution.
One approximation for the fitting process is that you compute a grid of values in the space output by the probability distribution. Each set of parameters gives you a probability for each point on this grid. The not nice process maps each grid point here to a nearest grid point in the space of the surface. The least squares fit is a quadratic in the probabilities calculated for the first grid, because the probabilities calculated for a grid point in the surface are the sums of the probabilities calculated for values in the first grid that map to something nearer to that point in the surface than any other point in the surface. This means that it has first (and even second) derivatives that you can calculate. If your probability distribution is nice enough you can use the chain rule to calculate derivatives for the least squares fit in the initial parameters. This means that you can use optimization methods to calculate the best fit parameters which require not just a means to calculate the function to be optimized but also its derivatives, and these are generally more efficient than optimization methods which require only function values, such as Nelder-Mead or Torczon Simplex. See e.g. http://commons.apache.org/proper/commons-math/apidocs/org/apache/commons/math4/optim/package-summary.html.
Another possible approach is via something called the EM Algorithm. Here EM stands for Expectation-Maximization. It can be used for finding maximum likelihood fits in cases where the problem would be easy if you could see some hidden state that you cannot actually see. In this case the output produced by the initial distribution might be such a hidden state. One starting point is http://www-prima.imag.fr/jlc/Courses/2002/ENSI2.RNRF/EM-tutorial.pdf.

Theory on how to find the equation of a curve given a variable number of data points

I have recently started working on a project. One of the problems I ran into was converting changing accelerations into velocity. Accelerations at different points in time are provided through sensors. If you get the equation of these data points, the derivative of a certain time (x) on that equation will be the velocity.
I know how to do this on the computer, but how would I get the equation to start with? I have searched around but I have not found any existing programs that can form an equation given a set of points. In the past, I have created a neural net algorithm to form an equation, but it takes an incredibly long time to run.
If someone can link me a program or explain the process of doing this, that would be fantastic.
Sorry if this is in the wrong forum. I would post into math, but a programming background will be needed to know the realm of possibility of what a computer can do quickly.
This started out as a comment but ended up being too big.
Just to make sure you're familiar with the terminology...
Differentiation takes a function f(t) and spits out a new function f'(t) that tells you how f(t) changes with time (i.e. f'(t) gives the slope of f(t) at time t). This takes you from displacement to velocity or from velocity to acceleration.
Integreation takes a function f(t) and spits out a new function F(t) which measures the area under the function f(t) from the beginning of time up until a given point t. What's not obvious at first is that integration is actually the reverse of differentiation, a fact called the The Fundamental Theorem of Calculus. So integration takes you from acceleration to velocity or velocity to displacement.
You don't need to understand the rules of calculus to do numerical integration. The simplest (and most naive) method for integrating a function numerically is just by approximating the area by dividing it up into small slices between time points and summing the area of rectangles. This approximating sum is called a Reimann sum.
As you can see, this tends to really overshoot and undershoot certain parts of the function. A more accurate but still very simple method is the trapezoid rule, which also approximates the function with a series of slices, except the tops of the slices are straight lines between the function values rather than constant values.
Still more complicated, but yet a better approximation, is Simpson's rules, which approximates the function with parabolas between time points.
(source: tutorvista.com)
You can think of each of these methods as getting a better approximation of the integral because they each use more information about the function. The first method uses just one data point per area (a constant flat line), the second method uses two data points per area (a straight line), and the third method uses three data points per area (a parabola).
You could read up on the math behind these methods here or in the first page of this pdf.
I agree with the comments that numerical integration is probably what you want. In case you still want a function going through your data, let me further argue against doing that.
It's usually a bad idea to find a curve that goes exactly through some given points. In almost any applied math context you have to accept that there is a little noise in the inputs, and a curve going exactly through the points may be very sensitive to noise. This can produce garbage outputs. Finding a curve going exactly through a set of points is asking for overfitting to get a function that memorizes rather than understands the data, and does not generalize.
For example, take the points (0,0), (1,1), (2,4), (3,9), (4,16), (5,25), (6,36). These are seven points on y=x^2, which is fine. The value of x^2 at x=-1 is 1. Now what happens if you replace (3,9) with (2.9,9.1)? There is a sixth order polynomial passing through all 7 points,
4.66329x - 8.87063x^2 + 7.2281x^3 - 2.35108x^4 + 0.349747x^5 - 0.0194304x^6.
The value of this at x=-1 is -23.4823, very far from 1. While the curve looks ok between 0 and 2, in other examples you can see large oscillations between the data points.
Once you accept that you want an approximation, not a curve going exactly through the points, you have what is known as a regression problem. There are many types of regression. Typically, you choose a set of functions and a way to measure how well a function approximates the data. If you use a simple set of functions like lines (linear regression), you just find the best fit. If you use a more complicated family of functions, you should use regularization to penalize overly complicated functions such as high degree polynomials with large coefficients that memorize the data. If you either use a simple family or regularization, the function tends not to change much when you add or withhold a few data points, which indicates that it is a meaningful trend in the data.
Unfortunately, integrating accelerometer data to get velocity is a numerically unstable problem. For most applications, your error will diverge far too soon to get results of any practical value.
Recall that:
So:
However well you fit a function to your accelerometer data, you will still essentially be doing a piecewise interpolation of the underlying acceleration function:
Where the error terms from each integration will add!
Typically you will see wildly inaccurate results after just a few seconds.

clustering algorithm for objects which have multiple feature time series information

I am looking for clustering algorithm which can handle with multiple time series information for each objects.
For example, for company "A" we have time series of 3 features(ex. income, sales, inventory)
At the same way, company "B" also has same time series of same features. and so on..
Then, how we can make cluster between set of company?
Is there some wise way to handle this?
A lot of clustering algorithms ask you to provide some measure of the similarity or distance between two points. It is really up to you to decide what features are important and what the distance really is. One way forwards would be to use the correlation between two time series. This gives you a similarity. If you have to convert this to a distance I would use sqrt(1-r), where r is the correlation, because if you look e.g. at the equation at the bottom of http://www.analytictech.com/mb876/handouts/distance_and_correlation.htm you can see that this is proportional to a distance if you have points in n-dimensional space. If you have three different time series (income, sales, inventory) I would use the sum of the three distances worked out from the correlations between the two time series of the same type.
Another option, especially if the time series are not very long, would be to regard a time series of length n as a point in n-dimensional space and feed this into the clustering algorithm, or use http://en.wikipedia.org/wiki/Principal_component_analysis to reduce the n dimensions down to 1 by looking at the most significant components (while you are doing this, it never hurts to plot the points using the least significant components and investigate points that stand out from the others. Points where the data is in error sometimes stand out here).

"Covering" the space of all possible histogram shapes

There is a very expensive computation I must make frequently.
The computation takes a small array of numbers (with about 20 entries) that sums to 1 (i.e. the histogram) and outputs something that I can store pretty easily.
I have 2 things going for me:
I can accept approximate answers
The "answers" change slowly. For example: [.1 .1 .8 0] and [.1
.1 .75 .05] will yield similar results.
Consequently, I want to build a look-up table of answers off-line. Then, when the system is running, I can look-up an approximate answer based on the "shape" of the input histogram.
To be precise, I plan to look-up the precomputed answer that corresponds to the histogram with the minimum Earth-Mover-Distance to the actual input histogram.
I can only afford to store about 80 to 100 precomputed (histogram , computation result) pairs in my look up table.
So, how do I "spread out" my precomputed histograms so that, no matter what the input histogram is, I'll always have a precomputed result that is "close"?
Finding N points in M-space that are a best spread-out set is more-or-less equivalent to hypersphere packing (1,2) and in general answers are not known for M>10. While a fair amount of research has been done to develop faster methods for hypersphere packings or approximations, it is still regarded as a hard problem.
It probably would be better to apply a technique like principal component analysis or factor analysis to as large a set of histograms as you can conveniently generate. The results of either analysis will be a set of M numbers such that linear combinations of histogram data elements weighted by those numbers will predict some objective function. That function could be the “something that you can store pretty easily” numbers, or could be case numbers. Also consider developing and training a neural net or using other predictive modeling techniques to predict the objective function.
Building on #jwpat7's answer, I would apply k-means clustering to a huge set of randomly generated (and hopefully representative) histograms. This would ensure that your space was spanned with whatever number of exemplars (precomputed results) you can support, with roughly equal weighting for each cluster.
The trick, of course, will be generating representative data to cluster in the first place. If you can recompute from time to time, you can recluster based on the actual data in the system so that your clusters might get better over time.
I second jwpat7's answer, but my very naive approach was to consider the count of items in each histogram bin as a y value, to consider the x values as just 0..1 in 20 steps, and then to obtain parameters a,b,c that describe x vs y as a cubic function.
To get a "covering" of the histograms I just iterated through "possible" values for each parameter.
e.g. to get 27 histograms to cover the "shape space" of my cubic histogram model I iterated the parameters through -1 .. 1, choosing 3 values linearly spaced.
Now, you could change the histogram model to be quartic if you think your data will often be represented that way, or whatever model you think is most descriptive, as well as generate however many histograms to cover. I used 27 because three partitions per parameter for three parameters is 3*3*3=27.
For a more comprehensive covering, like 100, you would have to more carefully choose your ranges for each parameter. 100**.3 isn't an integer, so the simple num_covers**(1/num_params) solution wouldn't work, but for 3 parameters 4*5*5 would.
Since the actual values of the parameters could vary greatly and still achieve the same shape it would probably be best to store ratios of them for comparison instead, e.g. for my 3 parmeters b/a and b/c.
Here is an 81 histogram "covering" using a quartic model, again with parameters chosen from linspace(-1,1,3):
edit: Since you said your histograms were described by arrays that were ~20 elements, I figured fitting parameters would be very fast.
edit2 on second thought I think using a constant in the model is pointless, all that matters is the shape.

Resources