How to choose a custom seasonality on fbprophet without overfitting - forecast

I was looking at custom prophet seasonality parameters from this link: https://facebook.github.io/prophet/docs/seasonality,_holiday_effects,_and_regressors.html
And they say: "The default values are often appropriate, but they can be increased when the seasonality needs to fit higher-frequency changes, and generally be less smooth. Increasing the number of Fourier terms allows the seasonality to fit faster changing cycles, but can also lead to overfitting: N Fourier terms corresponds to 2N variables used for modelling the cycle"
What does the last part, "N Fourier terms corresponds to 2N variables used for modelling the cycle", mean, and what method should I follow to select a good value for this? For example, with ARIMA I can use auto_arima and/or autocorrelation to determine the values of p, d, q.
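On the "2N variables" part: each Fourier term of order k contributes one sine column and one cosine column to the regression, so N terms mean 2N fitted coefficients. Here is a minimal NumPy sketch of that feature construction. The function name `fourier_features` and its arguments are made up for illustration and are not Prophet's API:

```python
import numpy as np

def fourier_features(t, period, order):
    """Return a (len(t), 2*order) matrix of sin/cos columns.

    t      -- time stamps, in the same unit as `period` (e.g. days)
    period -- length of the seasonal cycle (e.g. 365.25 for yearly)
    order  -- number of Fourier terms N
    """
    t = np.asarray(t, dtype=float)
    columns = []
    for k in range(1, order + 1):
        angle = 2.0 * np.pi * k * t / period
        columns.append(np.sin(angle))  # one sine column per term
        columns.append(np.cos(angle))  # one cosine column per term
    return np.column_stack(columns)

t = np.arange(730)
X = fourier_features(t, period=365.25, order=10)
print(X.shape)  # (730, 20): 10 Fourier terms -> 20 fitted coefficients
```

As for picking the value, one common approach (a suggestion, not something the docs prescribe) is to compare out-of-sample forecast error for a few candidate orders, for example with the cross-validation utilities in Prophet's diagnostics module, and keep the smallest order that does not make the error worse.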

Related

Calculate "moving" Covariance

I've been trying to figure out how to efficiently calculate the covariance in a moving window, i.e. moving from a set of values (x[0], y[0])..(x[n-1], y[n-1]) to a new set of values (x[1], y[1])..(x[n], y[n]). In other words, the value (x[0], y[0]) gets replaced by the value (x[n], y[n]). For performance reasons I need to calculate the covariance incrementally, in the sense that I'd like to express the new covariance Cov(x[1]..x[n], y[1]..y[n]) in terms of the previous covariance Cov(x[0]..x[n-1], y[0]..y[n-1]).
Starting off with the naive formula for covariance as described here:
[https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Covariance][1]
All I can come up with is:
Cov(x[1]..x[n], y[1]..y[n]) =
Cov(x[0]..x[n-1], y[0]..y[n-1]) +
(x[n]*y[n] - x[0]*y[0]) / n -
AVG(x[1]..x[n]) * AVG(y[1]..y[n]) +
AVG(x[0]..x[n-1]) * AVG(y[0]..y[n-1])
I'm sorry about the notation, I hope it's more or less clear what I'm trying to express.
However, I'm not sure if this is sufficiently numerically stable. Dealing with large values I might run into arithmetic overflows or other (for example cancellation) issues.
Is there a better way to do this?
Thanks for any help.
It looks like you are trying some form of "add the new value and subtract the old one". You are correct to worry: this method is not numerically stable. Keeping sums this way is subject to drift, but the real killer is the fact that at each step you are subtracting a large number from another large number to get what is likely a very small number.
One improvement would be to maintain your sums (of x_i, y_i, and x_i*y_i) independently, and recompute the naive formula from them at each step. Your running sums would still drift, and the naive formula is still numerically unstable, but at least you would only have one step of numerical instability.
A stable way to solve this problem would be to implement a formula for (stably) merging statistical sets, and evaluate your overall covariance using a merge tree. Moving your window would update one of your leaves, requiring an update of each node from that leaf to the root. For a window of size n, this method would take O(log n) time per update instead of the O(1) naive computation, but the result would be stable and accurate. Also, if you don't need the statistics for each incremental step, you can update the tree once per each output sample instead of once per input sample. If you have k input samples per output sample, this reduces the cost per input sample to O(1 + (log n)/k).
From the comments: the wikipedia page you reference includes a section on Knuth's online algorithm, which is relatively stable, though still prone to drift. You should be able to do something comparable for covariance; and resetting your computation every K*n samples should limit the drift at minimal cost.
Not sure why no one has mentioned this, but you can use the Welford online algorithm which relies on the running mean:
The equations should look like:
the online mean given by:
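A minimal Python sketch of how a Welford-style update might be adapted to a sliding window. The class name `RollingCovariance` and its methods are made up for illustration, and this is one possible reading rather than necessarily the exact equations intended here. Adding a point uses the standard online co-moment update, and removing the oldest point applies the same update in reverse:

```python
from collections import deque

class RollingCovariance:
    """Population covariance over the last `window` (x, y) pairs."""

    def __init__(self, window):
        self.window = window
        self.points = deque()
        self.n = 0
        self.mean_x = 0.0
        self.mean_y = 0.0
        self.c = 0.0  # running sum of (x - mean_x) * (y - mean_y)

    def _add(self, x, y):
        self.n += 1
        dx = x - self.mean_x              # uses the old mean of x
        self.mean_x += dx / self.n
        self.mean_y += (y - self.mean_y) / self.n
        self.c += dx * (y - self.mean_y)  # uses the new mean of y

    def _remove(self, x, y):
        if self.n == 1:                   # window becomes empty: reset
            self.n, self.mean_x, self.mean_y, self.c = 0, 0.0, 0.0, 0.0
            return
        dx = x - self.mean_x              # mean that still includes (x, y)
        self.n -= 1
        self.mean_x -= dx / self.n
        self.mean_y -= (y - self.mean_y) / self.n
        self.c -= dx * (y - self.mean_y)  # mean of y without (x, y)

    def push(self, x, y):
        """Add a new pair, drop the oldest if the window is full, return Cov."""
        self.points.append((x, y))
        self._add(x, y)
        if len(self.points) > self.window:
            self._remove(*self.points.popleft())
        return self.c / self.n
```

`push` returns the population covariance (divide by n - 1 instead for the sample version). The reverse update still involves a subtraction, so some drift can accumulate over long runs; recomputing from scratch every so often, as suggested in the other answer, keeps that in check.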

"Covering" the space of all possible histogram shapes

There is a very expensive computation I must make frequently.
The computation takes a small array of numbers (with about 20 entries) that sums to 1 (i.e. the histogram) and outputs something that I can store pretty easily.
I have 2 things going for me:
I can accept approximate answers
The "answers" change slowly. For example: [.1 .1 .8 0] and [.1 .1 .75 .05] will yield similar results.
Consequently, I want to build a look-up table of answers off-line. Then, when the system is running, I can look-up an approximate answer based on the "shape" of the input histogram.
To be precise, I plan to look-up the precomputed answer that corresponds to the histogram with the minimum Earth-Mover-Distance to the actual input histogram.
I can only afford to store about 80 to 100 precomputed (histogram , computation result) pairs in my look up table.
So, how do I "spread out" my precomputed histograms so that, no matter what the input histogram is, I'll always have a precomputed result that is "close"?
Finding N points in M-space that are a best spread-out set is more-or-less equivalent to hypersphere packing (1,2) and in general answers are not known for M>10. While a fair amount of research has been done to develop faster methods for hypersphere packings or approximations, it is still regarded as a hard problem.
It probably would be better to apply a technique like principal component analysis or factor analysis to as large a set of histograms as you can conveniently generate. The results of either analysis will be a set of M numbers such that linear combinations of histogram data elements weighted by those numbers will predict some objective function. That function could be the “something that you can store pretty easily” numbers, or could be case numbers. Also consider developing and training a neural net or using other predictive modeling techniques to predict the objective function.
Building on @jwpat7's answer, I would apply k-means clustering to a huge set of randomly generated (and hopefully representative) histograms. This would ensure that your space was spanned with whatever number of exemplars (precomputed results) you can support, with roughly equal weighting for each cluster.
The trick, of course, will be generating representative data to cluster in the first place. If you can recompute from time to time, you can recluster based on the actual data in the system so that your clusters might get better over time.
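A rough sketch of that idea in NumPy, under two loud assumptions: the "representative" histograms are faked here by sampling from a flat Dirichlet distribution, and the clustering uses plain Euclidean distance rather than Earth Mover's Distance. The `kmeans` helper is a bare-bones Lloyd's algorithm written for illustration, not a library call:

```python
import numpy as np

def kmeans(data, k, iterations=50, seed=0):
    """Plain Lloyd's k-means with Euclidean distance; returns k centroids."""
    rng = np.random.default_rng(seed)
    centers = data[rng.choice(len(data), size=k, replace=False)].copy()
    for _ in range(iterations):
        # squared distance from every sample to every centroid
        dists = ((data ** 2).sum(axis=1)[:, None]
                 - 2.0 * data @ centers.T
                 + (centers ** 2).sum(axis=1)[None, :])
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = data[labels == j]
            if len(members):  # keep the old centroid if a cluster empties
                centers[j] = members.mean(axis=0)
    return centers

rng = np.random.default_rng(42)
# assumption: flat Dirichlet samples stand in for "representative" histograms
histograms = rng.dirichlet(np.ones(20), size=5000)
exemplars = kmeans(histograms, k=100)
print(exemplars.shape)  # (100, 20)
```

Each centroid is an average of histograms, so it is itself a valid histogram (non-negative, sums to 1) and can be fed to the expensive computation to fill the look-up table.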
I second jwpat7's answer, but my very naive approach was to consider the count of items in each histogram bin as a y value, to consider the x values as just 0..1 in 20 steps, and then to obtain parameters a,b,c that describe x vs y as a cubic function.
To get a "covering" of the histograms I just iterated through "possible" values for each parameter.
e.g. to get 27 histograms to cover the "shape space" of my cubic histogram model I iterated the parameters through -1 .. 1, choosing 3 values linearly spaced.
Now, you could change the histogram model to be quartic if you think your data will often be represented that way, or whatever model you think is most descriptive, as well as generate however many histograms to cover. I used 27 because three partitions per parameter for three parameters is 3*3*3=27.
For a more comprehensive covering, like 100, you would have to choose your ranges for each parameter more carefully. 100**(1/3) isn't an integer, so the simple num_covers**(1/num_params) solution wouldn't work, but for 3 parameters 4*5*5 would.
Since the actual values of the parameters could vary greatly and still achieve the same shape, it would probably be best to store ratios of them for comparison instead, e.g. for my 3 parameters b/a and b/c.
Here is an 81 histogram "covering" using a quartic model, again with parameters chosen from linspace(-1,1,3):
edit: Since you said your histograms were described by arrays that were ~20 elements, I figured fitting parameters would be very fast.
edit2 on second thought I think using a constant in the model is pointless, all that matters is the shape.
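A small sketch of the 27-histogram cubic covering described above, with no constant term as per edit2. How the curve is turned into a valid histogram is my assumption here (shift so the minimum is zero, then normalise to sum to 1), and the function name `cubic_covering` is made up for illustration:

```python
import itertools
import numpy as np

def cubic_covering(values_per_param=3, bins=20):
    """Histograms from y = a*x^3 + b*x^2 + c*x over a grid of (a, b, c)."""
    x = np.linspace(0.0, 1.0, bins)
    grid = np.linspace(-1.0, 1.0, values_per_param)
    histograms = []
    for a, b, c in itertools.product(grid, repeat=3):
        y = a * x**3 + b * x**2 + c * x  # no constant term: only shape matters
        y = y - y.min()                  # assumption: shift so all bins are >= 0
        total = y.sum()
        # assumption: a flat curve (a = b = c = 0) maps to the uniform histogram
        histograms.append(y / total if total > 0 else np.full(bins, 1.0 / bins))
    return np.array(histograms)

covering = cubic_covering()
print(covering.shape)  # (27, 20)
```

Switching to a quartic model just means adding one more parameter to the product, which gives 3**4 = 81 histograms for the same three values per parameter.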

Clustering by date (by distance) in Ruby

I have a huge journal with actions done by users (like, for example, moderating contents).
I would like to find the 'mass' actions, meaning the actions that are too dense (the user probably made those actions without thinking too much about them :) ).
That would translate to clustering the actions by date (in a linear space), and to marking the clusters that are too dense.
I am no expert in clustering algorithms and methods, but I think the k-means clustering would not do the trick, since I don't know the number of clusters.
Also, ideally, I would also like to 'fine tune' the algorithm.
What would you advise?
P.S. Here are some resources that I found (in Ruby):
hierclust - a simple hierarchical clustering library for spatial data
AI4R - library that implements some clustering algorithms
K-means would probably do a good job as long as you're interested in an a priori known number of clusters. Since you aren't, you might consider reading about the LBG algorithm, which is based on k-means and is used in data compression for vector quantisation. It's basically iterative k-means which splits centroids after they converge and keeps splitting until you achieve an acceptable number of clusters.
On the other hand, since your data is one-dimensional, you could do something completely different.
Assume that you've got actions which took place at 5 points in time: (8, 11, 15, 16, 17). Let's plot a Gaussian for each of these actions with μ equal to the time and σ = 3.
Now let's see what the sum of the values of these Gaussians looks like.
It shows a density of actions with a peak around 16.
Based on this observation I propose the following simple algorithm.
Create a vector of zeroes for the time range of interest.
For each action calculate the Gaussian and add it to the vector.
Scan the vector looking for values which are greater than the maximum value in the vector multiplied by α.
Note that for each action only a small section of the vector needs updates because values of a Gaussian converge to zero very quickly.
You can tune the algorithm by adjusting values of
α ∈ [0,1], which indicates how significant a peak of activity has to be to be noted,
σ, which affects the distance of actions which are considered close to each other, and
time periods per vector's element (minutes, seconds, etc.).
Notice that the algorithm is linear with regard to the number of actions. Moreover, it shouldn't be difficult to parallelise: split your data across multiple processes summing Gaussians and then sum generated vectors.
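The steps above can be sketched as follows. This is in Python with NumPy rather than Ruby, purely for illustration; the function name `dense_periods`, the 4σ cut-off and the use of "at least" instead of "greater than" for the threshold are my own choices:

```python
import numpy as np

def dense_periods(times, length, sigma=3.0, alpha=0.9):
    """Return (density, indices where density >= alpha * max density)."""
    density = np.zeros(length)            # one element per time period
    radius = int(np.ceil(4 * sigma))      # beyond ~4 sigma the Gaussian is negligible
    for t in times:
        lo = max(0, int(t) - radius)
        hi = min(length, int(t) + radius + 1)
        grid = np.arange(lo, hi)
        # only a small section of the vector is touched per action
        density[lo:hi] += np.exp(-0.5 * ((grid - t) / sigma) ** 2)
    threshold = alpha * density.max()
    return density, np.flatnonzero(density >= threshold)

density, dense = dense_periods([8, 11, 15, 16, 17], length=30)
print(dense)  # indices of the time periods flagged as dense
```

Lowering `alpha` flags wider stretches of time as dense, and increasing `sigma` makes actions that are further apart count as belonging to the same burst.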
Have a look at density based clustering. E.g. DBSCAN and OPTICS.
This sounds like exactly what you want.

why overfitting gives a bad hypothesis function

In linear or logistic regression if we find a hypothesis function which fits the training set perfectly then it should be a good thing because in that case we have used 100 % of the information given to predict new information.
Yet this is called overfitting and is said to be a bad thing.
By making the hypothesis function simpler we may be actually increasing the noise instead of decreasing it.
Why is it so?
Overfitting occurs when you try "too hard" to make the examples in the training set fit the classification rule.
It is considered a bad thing for two main reasons:
The data might have noise. Trying too hard to classify 100% of the examples correctly will make the noise count and give you a bad rule, whereas ignoring this noise would usually be much better.
Remember that the classified training set is just a sample of the real data. A solution that fits every sample is usually more complex than the one you would have got if you tolerated a few wrongly classified samples. According to Occam's Razor, you should prefer the simpler solution, so ignoring some of the samples will be better.
Example:
According to Occam's razor, you should tolerate the misclassified sample, and assume it is noise or insignificant, and adopt the simple solution (green line) in this data set:
Because you actually didn't "learn" anything from your training set, you've just fitted to your data.
Imagine you have a one-dimensional regression
x_1 -> y_1
...
x_n -> y_n
The function defined this way
f(x) = y_i, if x = x_i for some i in 1..n
f(x) = 0, otherwise
will give you a perfect fit, but it's actually useless.
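The same idea as a tiny Python sketch (the helper name `make_memorizer` and the sample numbers are made up for illustration): the function is just a look-up table over the training set, so it reproduces every training point and says nothing useful anywhere else.

```python
def make_memorizer(xs, ys):
    """Build f(x) that returns the stored y for a training x and 0 otherwise."""
    table = dict(zip(xs, ys))

    def f(x):
        return table.get(x, 0)

    return f

train_x = [1.0, 2.0, 3.0]
train_y = [2.1, 3.9, 6.2]
f = make_memorizer(train_x, train_y)
print([f(x) for x in train_x])  # reproduces train_y exactly
print(f(2.5))                   # 0: nothing was learned between training points
```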
Hope, this helped a bit:)
Assuming that your regression accounts for all sources of deviation in your data, then you might argue that your regression perfectly fits the data. However, if you know all (and I mean all) of the influences in your system, then you probably don't need a regression. You likely have an analytic solution that perfectly predicts new information.
In actuality, the information you possess will fall short of this perfect level. Noise (measurement error, partial observability, etc) will cause deviation in your data. In response, a regression (or other fitting mechanism) should seek the general trend of the data while minimizing the influence of noise.
Actually, the statement is not quite correct as written. It is perfectly fine to match 100% of your data if your hypothesis function is linear. Every continuous nonlinear function may be approximated locally by a linear function, which gives important information on its local behavior.
It is also fine to match 100 points of data to a quadratic curve if that data matches 100%. You can have high confidence that you are not overfitting your data, since the data consistently shows quadratic behavior.
However, one can always get 100% fit by using a polynomial function of high enough degree. Even without the noise that others have pointed out, though, you shouldn't assume your data has some high degree polynomial behavior without having some kind of theoretical or experimental confirmation of that hypothesis. Two good indicators that polynomial behavior is indicated are:
You have some theoretical reason for expecting the data to grow as x^n in one of the directional limits.
You have data that has been supporting a fixed degree polynomial fit as more and more data has been collected.
Notice, though, that even though exponential and reciprocal relationships may have data that fits a polynomial of high enough degree, they don't tend to obey either of the two conditions above.
The point is that your data fit needs to be useful to prediction. You always know that a linear fit will give information locally, but that information becomes more useful the more points are fit. Even if there are only two points and noise, a linear fit still gives the best theoretical look at the data collected so far, and establishes the first expectations of the data. Beyond that, though, using a quadratic fit for three points or a cubic fit for four is not validly giving more information, as it assumes both local and asymptotic behavior information with the addition of one point. You need justification for your hypothesis function. That justification can come from more points or from theory.
(A third reason that sometimes comes up is
You have theoretical and experimental reason to believe that error and noise do not contribute more than some bounds, and you can take a polynomial hypothesis to look at local derivatives and the behavior needed to match the data.
This is typically used in understanding data to build theoretical models without having a good starting point for theory. You should still strive to use the smallest polynomial degree possible, and look to substitute out patterns in the coefficients with what they may indicate (reciprocal, exponential, gaussian, etc.) in infinite series.)
Try imagining it this way. You have a function from which you pick n different values to represent a sample / training set:
y(n) = x(n), where x(n) is an element of [0, 1]
But, since you want to build a robust model, you want to add a little noise to your training set, so you actually add a little noise when generating the data:
data(n) = y(n) + noise(n) = x(n) + u(n)
where u(n) denotes Gaussian random noise with mean 0 and standard deviation 1: N(0,1). Quite simply, it's a noise signal that is most likely to take a value near 0, and less likely to take a value the farther it is from 0.
And then you draw, let's say, 10 points to be your training set. If there was no noise, they would all lie on the line y = x. Since there was noise, the lowest-degree polynomial that can represent them exactly is probably of 9th order, a function like: y = a_9 * x^9 + a_8 * x^8 + ... + a_1 * x + a_0.
If you instead used just an estimate of the information from the training set, you would probably get a simpler function than the 9th-order polynomial, and it would be closer to the real function.
Consider further that your real function can have values outside the [0, 1] interval but for some reason the samples for the training set could only be collected from this interval. Now, a simple estimation would probably act significantly better outside the interval of the training set, while if we were to fit the training set perfectly, we would get an overfitted function that meandered with lots of ups and downs all over :)
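To make this concrete, here is a small NumPy sketch of the experiment. The seed, the noise level (standard deviation 0.1 instead of 1, so the trend stays visible) and the evaluation range are my own assumptions. It fits a straight line and a polynomial of high enough degree to pass through all 10 points (degree 9), then compares both against the true function y = x outside the sampled interval:

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0.0, 1.0, 10)
# assumption: noise std of 0.1 rather than 1, so the trend stays visible
y_train = x_train + rng.normal(0.0, 0.1, size=x_train.size)

line = np.polyfit(x_train, y_train, deg=1)    # simple estimate
wiggly = np.polyfit(x_train, y_train, deg=9)  # passes through all 10 points

def rmse(coeffs, x, y):
    """Root-mean-square error of the polynomial `coeffs` against y at x."""
    return float(np.sqrt(np.mean((np.polyval(coeffs, x) - y) ** 2)))

x_outside = np.linspace(1.5, 2.0, 20)         # outside the sampled interval
train_line = rmse(line, x_train, y_train)
train_wiggly = rmse(wiggly, x_train, y_train)
test_line = rmse(line, x_outside, x_outside)      # true function is y = x
test_wiggly = rmse(wiggly, x_outside, x_outside)

print(f"train RMSE:   line={train_line:.3g}  degree 9={train_wiggly:.3g}")
print(f"outside RMSE: line={test_line:.3g}  degree 9={test_wiggly:.3g}")
```

The degree-9 fit wins on the training points because it simply reproduces the noise, and that is exactly what makes it blow up once you leave the interval the samples came from.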
Overfitting is considered bad because of the bias it has relative to the true solution. An overfit solution fits the training data 100%, but adding even a small number of data points will change the model drastically. This is called the variance of the model. Hence the bias-variance tradeoff, where we try to balance both factors so that the model does not change drastically on small data changes but still predicts the output reasonably well.

Question about Backpropagation Algorithm with Artificial Neural Networks -- Order of updating

Hey everyone, I've been trying to get an ANN I coded to work with the backpropagation algorithm. I have read several papers on them, but I'm noticing a few discrepancies.
Here seems to be the super general format of the algorithm:
1. Give input
2. Get output
3. Calculate error
4. Calculate change in weights
5. Repeat steps 3 and 4 until we reach the input level
But here's the problem: The weights need to be updated at some point, obviously. However, because we're back propagating, we need to use the weights of previous layers (ones closer to the output layer, I mean) when calculating the error for layers closer to the input layer. But we already calculated the weight changes for the layers closer to the output layer! So, when we use these weights to calculate the error for layers closer to the input, do we use their old values, or their "updated values"?
In other words, if we were to put the step of updating the weights in my super general algorithm, would it be:
(Updating the weights immediately)
1. Give input
2. Get output
3. Calculate error
4. Calculate change in weights
5. Update these weights
6. Repeat steps 3,4,5 until we reach the input level
OR
(Using the "old" values of the weights)
1. Give input
2. Get output
3. Calculate error
4. Calculate change in weights
5. Store these changes in a matrix, but don't change these weights yet
6. Repeat steps 3,4,5 until we reach the input level
7. Update the weights all at once using our stored values
In this paper I read, in both abstract examples (the ones based on figures 3.3 and 3.4), they say to use the old values, not to immediately update the values. However, in their "worked example 3.1", they use the new values (even though what they say they're using are the old values) for calculating the error of the hidden layer.
Also, in my book "Introduction to Machine Learning by Ethem Alpaydin", though there is a lot of abstract stuff I don't yet understand, he says "Note that the change in the first-layer weight delta-w_hj, makes use of the second layer weight v_h. Therefore, we should calculate the changes in both layers and update the first-layer weights, making use of the old value of the second-layer weights, then update the second-layer weights."
To be honest, it really seems like they just made a mistake and all the weights are updated simultaneously at the end, but I want to be sure. My ANN is giving me strange results, and I want to be positive that this isn't the cause.
Anyone know?
Thanks!
As far as I know, you should update weights immediately. The purpose of back-propagation is to find weights that minimize the error of the ANN, and it does so by doing a gradient descent. I think the algorithm description in the Wikipedia page is quite good. You may also double-check its implementation in the joone engine.
You are usually backpropagating deltas, not errors. These deltas are calculated from the errors, but they do not mean the same thing. Once you have the deltas for layer n (counting from input to output) you use these deltas and the weights from layer n to calculate the deltas for layer n-1 (one closer to input). The deltas only have a meaning for the old state of the network, not for the new state, so you should always use the old weights for propagating the deltas back to the input.
Deltas mean in a sense how much each part of the NN has contributed to the error before, not how much it will contribute to the error in the next step (because you do not know the actual error yet).
As with most machine-learning techniques, it will probably still work if you use the updated weights, but it might converge slower.
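A minimal NumPy sketch of the "old weights" variant for a tiny two-layer network. The sigmoid hidden layer, linear output, squared-error loss, network sizes, learning rate and the function name `train_step` are all my own assumptions for illustration. Both deltas are computed from the weights used in the forward pass, the changes are stored, and only then are all weights updated together:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_step(W1, W2, x, target, lr=0.1):
    """One backprop step for a 2-layer net (sigmoid hidden layer, linear output)."""
    # forward pass
    h = sigmoid(W1 @ x)
    y = W2 @ h
    loss = 0.5 * np.sum((y - target) ** 2)
    # backward pass: every delta comes from the weights used in the forward pass
    delta_out = y - target
    delta_hidden = (W2.T @ delta_out) * h * (1.0 - h)  # old W2, not an updated one
    # store the changes first ...
    dW2 = np.outer(delta_out, h)
    dW1 = np.outer(delta_hidden, x)
    # ... then apply them all at once
    return W1 - lr * dW1, W2 - lr * dW2, loss

rng = np.random.default_rng(0)
W1 = rng.normal(0.0, 0.5, size=(3, 2))
W2 = rng.normal(0.0, 0.5, size=(1, 3))
x, target = np.array([0.5, -1.0]), np.array([0.25])

losses = []
for _ in range(200):
    W1, W2, loss = train_step(W1, W2, x, target)
    losses.append(loss)
print(losses[0], losses[-1])  # loss before the first and the last update
```

Computed this way, `dW1` and `dW2` together are the gradient of the loss at the current weights, which is what plain gradient descent expects.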
If you simply train it on a single input-output pair my intuition would be to update weights immediately, because the gradient is not constant. But I don't think your book mentions only a single input-output pair. Usually you come up with an ANN because you have many input-output samples from a function you would like to model with the ANN. Thus your loops should repeat from step 1 instead of from step 3.
If we label your two methods as new->online and old->offline, then we have two algorithms.
The online algorithm is good when you don't know how many sample input-output relations you are going to see, and you don't mind some randomness in the way the weights update.
The offline algorithm is good if you want to fit a particular set of data optimally. To avoid overfitting the samples in your data set, you can split it into a training set and a test set. You use the training set to update the weights, and the test set to measure how good a fit you have. When the error on the test set begins to increase, you are done.
Which algorithm is best depends on the purpose of using an ANN. Since you talk about training until you "reach input level", I assume you train until the output is exactly the target value in the data set. In this case the offline algorithm is what you need. If you were building a backgammon playing program, the online algorithm would be better because you have an unlimited data set.
In this book, the author talks about how the whole point of the backpropagation algorithm is that it allows you to efficiently compute all the weights in one go. In other words, using the "old values" is efficient. Using the new values is more computationally expensive, and so that's why people use the "old values" to update the weights.
