I am using multiple linear regression to do sales quantity forecasting in retail. Due to practical constraints, I cannot use ARIMA or neural networks.
I split the historical data into training and validation sets. Walk-forward validation would be computationally quite expensive at this point, so I take the x weeks preceding the current date as my validation set; the time series prior to that is my training set. The problem I am seeing with this method is that accuracy is far higher during the validation period than for future predictions. That is, the further we move from the end of the training period, the less accurate the forecast. How can I best control this problem?
Perhaps a smaller validation period would allow the training period to come closer to the current date and hence provide a more accurate forecast, but this hurts the value of the validation.
Another thought is to cheat and train on both the training and validation data. Since I am not using neural nets, the selected algorithm should not overfit. Please correct me if this assumption is wrong.
Any other thoughts or solution would be most welcome.
Thanks
Regards,
Adeel
If you're not using ARIMA or DNNs, how about using rolling windows of regressions to train and test on the historical data?
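To make the suggestion concrete, here is a minimal sketch of rolling-window (walk-forward) validation for a linear regression forecaster. The feature matrix, window sizes, and function names are illustrative assumptions, not part of the original post; the point is that the model is refit on a sliding window so the validation error is measured at many points in time rather than only at the end of the history.

```python
# Sketch: rolling-window validation for a linear regression forecaster.
# Assumes weekly rows with numeric features already constructed.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

def rolling_window_validation(X, y, train_size=52, horizon=4):
    """Refit on a sliding window of `train_size` weeks, score the
    next `horizon` weeks, then slide the window forward by `horizon`."""
    errors = []
    start = 0
    while start + train_size + horizon <= len(y):
        tr = slice(start, start + train_size)
        te = slice(start + train_size, start + train_size + horizon)
        model = LinearRegression().fit(X[tr], y[tr])
        errors.append(mean_absolute_error(y[te], model.predict(X[te])))
        start += horizon
    return np.array(errors)

# Synthetic demo: linear trend plus noise.
rng = np.random.default_rng(0)
X = np.arange(120, dtype=float).reshape(-1, 1)
y = 2.0 * X[:, 0] + rng.normal(0, 1, 120)
maes = rolling_window_validation(X, y)
print(maes.mean())
```

Looking at how the per-window errors grow with forecast distance also directly measures the "accuracy decays away from the training period" effect described in the question.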
The problem is as follows:
I want to use a forecasting algorithm to predict the heat demand of an otherwise unspecified household over the next 24 hours, with a time resolution of a few minutes for the next three or four hours and a lower resolution for the hours after that.
The algorithm should be adaptive and learn over time. I do not have much historical data, since I want the algorithm to be usable in different settings from the start. To begin with, I only have very basic inputs such as the assumed yearly heat demand, the current outside temperature, and the time. So it will be quite general and imprecise at the beginning, but should learn from its errors over time.
If possible, the algorithm should be implemented in MATLAB.
Does anyone know an approach or an algorithm designed to predict sensible values after a short time by learning from and adapting to incoming data?
Well, this question is quite broad, as essentially any algorithm for forecasting or data assimilation could do this task in principle.
The classic approach I would look into first is Kalman filtering, which is quite general, at least once its generalizations to ensemble filters etc. are taken into account (it is also easy to implement in MATLAB).
https://en.wikipedia.org/wiki/Kalman_filter
However, more important than the actual inference algorithm is typically the design of the model you fit to your data. For your scenario, you could start with a simple prediction from past values and add daily rhythms, influences of outside temperature, and so on. The more (correct) information you put into your model a priori, the better it should be at prediction.
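As a starting point, here is a minimal scalar Kalman filter that tracks a slowly varying demand level from noisy measurements. This is shown in Python rather than MATLAB, and the random-walk state model, noise variances, and synthetic data are simplifying assumptions; a real model would add the daily rhythms and temperature influences mentioned above.

```python
# Minimal scalar Kalman filter: random-walk state tracked from noisy
# observations. A sketch of the filtering step only, not a thermal model.
import numpy as np

def kalman_filter(measurements, q=0.01, r=0.5, x0=0.0, p0=1.0):
    """State model: x_k = x_{k-1} + w, w ~ N(0, q);
    observation:   z_k = x_k + v,    v ~ N(0, r)."""
    x, p = x0, p0
    estimates = []
    for z in measurements:
        p = p + q                 # predict: uncertainty grows
        k = p / (p + r)           # Kalman gain
        x = x + k * (z - x)       # update with the measurement residual
        p = (1 - k) * p
        estimates.append(x)
    return np.array(estimates)

# Demo: a slowly ramping "demand" observed with noise.
rng = np.random.default_rng(1)
true_demand = np.linspace(5.0, 7.0, 200)
noisy = true_demand + rng.normal(0, 0.7, 200)
est = kalman_filter(noisy, x0=noisy[0])
print(abs(est[-1] - true_demand[-1]))
```

The same structure carries over to MATLAB almost line for line, and richer models (daily cycles, temperature terms) just enlarge the state vector and matrices.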
For the full mathematical analysis of this type of problem I can recommend this book: https://doi.org/10.1017/CBO9781107706804
In order to turn this into a calibration problem, we need:
a model that predicts the heat demand depending on inputs and parameters,
observations of the heat demand.
Calibrating this model means tuning the parameters so that the model best predicts the heat demand.
If you go for Python, I suggest using OpenTURNS, which provides several data assimilation methods, e.g. Kalman filtering (also called BLUE):
https://openturns.github.io/openturns/latest/user_manual/calibration.html
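To illustrate the calibration idea in library-agnostic terms (using SciPy here rather than the OpenTURNS API), the sketch below fits the parameters of an assumed demand model to observed demand. The model form (a base load plus a heating term below 18 °C), the parameter names, and the synthetic data are all illustrative assumptions.

```python
# Generic least-squares calibration sketch: tune model parameters so the
# model best reproduces observed heat demand.
import numpy as np
from scipy.optimize import least_squares

def demand_model(params, t_out):
    base, slope = params
    # Assumed form: base load plus heating demand below 18 degrees C.
    return base + slope * np.maximum(18.0 - t_out, 0.0)

# Synthetic observations generated from "true" parameters [1.0, 0.3].
rng = np.random.default_rng(2)
t_out = rng.uniform(-5, 25, 100)
observed = demand_model([1.0, 0.3], t_out) + rng.normal(0, 0.05, 100)

# Calibrate: minimize the residual between model and observations.
fit = least_squares(lambda p: demand_model(p, t_out) - observed, x0=[0.0, 0.0])
print(fit.x)  # recovered parameters, close to [1.0, 0.3]
```

OpenTURNS wraps this same pattern in its calibration classes and additionally propagates parameter uncertainty, which plain least squares does not.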
In "look up metrics" I'm trying to see how my players improve at playing my game.
I have the score (both as a design event and as progression, just to try), and in look up metrics I try to "filter" by session number or days since install, but even if I group by Dimension, this doesn't produce any result.
For instance, if I do the same with the device filter, it shows me the right histogram with the mean score per device.
What am I doing wrong?
From the customer care:
The session filter works only on core metrics at this point (like DAU). We hope to make this filter compatible with custom metrics as well but this might take time as we first need to include this improvement to our roadmap and then evaluate it by comparing it with our other tasks. As a result, there is no ETA on making a release.
I would recommend downloading the raw data (go to "Export data" in the game's settings) and performing this sort of "per user" analysis on your own. You should be able to create stats per user. GA does not do this, since your game could reach millions of users and there is no way to plot that many entries in a browser.
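As a sketch of what that per-user analysis might look like on the exported raw events (the column names `user_id`, `session_num`, and `score` are assumptions about the export format, not the actual schema):

```python
# Sketch: per-user progression analysis on exported raw events.
import pandas as pd

# Stand-in for the exported CSV (pd.read_csv("raw_events.csv") in practice).
events = pd.DataFrame({
    "user_id":     ["a", "a", "a", "b", "b"],
    "session_num": [1, 2, 3, 1, 2],
    "score":       [10, 15, 22, 8, 14],
})

# Mean score per session number across users: does the score improve
# as players accumulate sessions?
progression = events.groupby("session_num")["score"].mean()
print(progression)
```

The same `groupby` pattern works for "days since install" instead of session number, which is exactly the filter that is unavailable for custom metrics in the dashboard.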
I have some time series to analyze.
Given the domain the data is coming from -
Time series is supposed to have some fluctuations.
A regular periodicity might not be present at all in some cases; there might be irregular periods of drought (no fluctuations happening at all).
These fluctuations may be a part of an overall down/up trend.
I am trying to avoid modeling techniques like ARIMA, since for each series I am only interested in knowing the following features:
Average amplitude of fluctuations.
Average time period of fluctuations (how long it takes for values to rise and fall back to almost the same level).
Average frequency of fluctuations (after what interval do these fluctuations occur?).
Following is what some of the data looks like:
The approach I am taking is to -
First build some sort of annotation on the time-axis (e.g. flat, increasing, decreasing)
Then, based on these tags, study the patterns further to answer the above questions. If there is an overall up/down trend in the series, I de-trend it by removing the mean, a linear fit, etc.
I was wondering if there is any other approach or technique to answer the above mentioned questions for my data.
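One model-free alternative is simple peak detection: locate the fluctuation peaks, then take the average prominence as the amplitude and the average spacing between consecutive peaks as the period. The sketch below assumes the series has already been de-trended as described above; the prominence threshold is an illustrative assumption to be tuned per domain.

```python
# Peak-based sketch for the three requested features:
# average amplitude, average period, and average frequency.
import numpy as np
from scipy.signal import find_peaks

def fluctuation_features(y, dt=1.0, prominence=0.5):
    peaks, props = find_peaks(y, prominence=prominence)
    if len(peaks) < 2:
        return None                           # not enough fluctuations
    avg_amplitude = props["prominences"].mean()
    avg_period = np.diff(peaks).mean() * dt   # mean time between peaks
    avg_frequency = 1.0 / avg_period
    return avg_amplitude, avg_period, avg_frequency

# Demo: a clean oscillation with period 20.
t = np.arange(0, 100, 1.0)
y = np.sin(2 * np.pi * t / 20)
amp, period, freq = fluctuation_features(y)
print(period)
```

The "drought" stretches with no fluctuations simply contribute no peaks, so they can be detected separately as long gaps between consecutive peak positions.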
Take a look at Singular Spectrum Analysis (SSA), which has an R package, Rssa, behind it. We did some research in which SSA was compared with established auto-regressive algorithms, and SSA did quite well.
http://axibase.com/environmental-monitoring-using-big-data/
I'm building, for example, an application that monitors your health. Each day you jog and do push-ups, and you enter the information on a web site.
What I would like to do is build a chart combining the hours you jogged and the number of push-ups/sit-ups you did. Say on the first day you jogged 1 hour and did 10 push-ups, and on the second day you jogged 50 minutes and did 20 push-ups; you would then see a progression in your training.
I know it may sound strange, but I want an overall view of your health, not separate views for jogging and push-ups. I don't want a double y-axis chart because if I have, for example, 6 runners, I will end up with 12 lines on the chart.
First I would redefine your terms. You are not tracking "health" here; you are tracking level of exertion through exercise.
Max Exertion != Max Health. If you exert yourself to the max and don't eat or drink, you will actually damage your health. :-)
To combine and plot your total "level of exertion" for multiple exercises, you need to convert them to a common unit, something like "calories burned".
I'm pretty sure there are many sources for reference tables with rough conversion factors for how many calories various exercises burn.
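A tiny sketch of that conversion, where the two factors are rough illustrative assumptions rather than reference values; substitute numbers from a proper calorie table:

```python
# Sketch: combine exercises into one plottable series via a common unit.
# The conversion factors below are assumed placeholders, not real values.
CAL_PER_MIN_JOGGING = 10.0   # assumed kcal per minute of jogging
CAL_PER_PUSHUP = 0.5         # assumed kcal per push-up

def exertion_kcal(jog_minutes, pushups):
    return jog_minutes * CAL_PER_MIN_JOGGING + pushups * CAL_PER_PUSHUP

# (minutes jogged, push-ups) per day, from the question's example.
days = [(60, 10), (50, 20)]
totals = [exertion_kcal(m, p) for m, p in days]
print(totals)  # one number per day, plottable as a single line per runner
```

With one combined number per day, six runners become six lines instead of twelve.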
Does that help any?
Then you need a model of how push-ups and jogging affect you, and for that you should be asking a doctor or fitness expert, not a programmer :-). This question should probably be taken elsewhere.
Sounds like a double y-axis chart.
You can just do a regular Excel-type chart with two lines, scaled appropriately: one for push-ups, one for jogging time. There are graphics libraries that let you do this in the back-end language of your choice. The x-axis is the date.
You may want to have two scaled graphs, one for the last week and one for the last year (à la Yahoo Finance charts for different intervals).
Show the first set of values as a line graph above the x-axis and the second set below it. If both sets of values increase over time, this will show as an "expansion" of the graph; it should be easy to recognize when one set is growing but the other is not.
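The above/below-axis idea can be sketched with matplotlib by simply negating the second series (bars used here for clarity; the data values are illustrative):

```python
# Sketch: one series above the x-axis, the other negated below it.
import matplotlib
matplotlib.use("Agg")          # headless backend for scripted use
import matplotlib.pyplot as plt

days = [1, 2, 3, 4]
jog_minutes = [60, 50, 55, 65]
pushups = [10, 20, 25, 30]

fig, ax = plt.subplots()
ax.bar(days, jog_minutes, label="jogging (min)")
ax.bar(days, [-p for p in pushups], label="push-ups (negated)")
ax.axhline(0, color="black", linewidth=0.8)   # the shared baseline
ax.set_xlabel("day")
ax.legend()
fig.savefig("exertion.png")
```

When both series grow, the plot widens symmetrically; a one-sided widening immediately shows which exercise is lagging.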
Because the two quantities have no intrinsic relationship, you're stuck with either displaying them independently, such as two curves with two y-axes, or making up a measure that combines them, such as an estimate of calories burned, muscles used, mental anguish from exercising, etc. But it's tricky: taking from your example, I suspect one will never approach the calories burned by a 50-minute run through push-ups alone. Combining these in a meaningful way depends not on mathematics but on approximations and knowledge of the quantities you start with and are interested in.
One compromise might be a graph with a single y-axis that shows some combined quantity, but where the independent values at each point are also graphically represented, for example, by a line where the local color represents the ratio of miles to pushups, or any of the many variants that display information in the shapes or colors in the plot.
Another option is to do a 3D plot, and then rotate it around and look for trends or whatever interests you.
If you want one overall measure of exercise level, you could try using total exercise time. Another alternative is to define a points system whereby users score points for each exercise.
I do think there is virtue in letting users see how much of each individual exercise they have done; in this case, use a different graph for each exercise rather than dual y-axes if the scales are not comparable (e.g. time jogging vs. number of push-ups). There is a very good article on the problems with dual y-axes by business intelligence guru Stephen Few, here (pdf).
If you want to know more about presenting data well, I can also recommend his book "Now You See It" and the classic "The Visual Display of Quantitative Information" by Edward Tufte.