I am using multiple linear regression to forecast sales quantities in retail. Due to practical constraints, I cannot use ARIMA or neural networks.
I split the historical data into training and validation sets. Walk-forward validation would be computationally quite expensive at this point, so I take the x weeks preceding the current date as my validation set; the time series prior to that is my training set. The problem I notice with this method is that accuracy is far higher during the validation period than in the future predictions. That is, the further we move from the end of the training period, the less accurate the forecast becomes. How can I best control this problem?
Perhaps a smaller validation period would allow the training period to come closer to the current date and hence provide a more accurate forecast, but this hurts the value of the validation.
Another thought is to cheat and give the model both the training and the validation data during training. Since I am not using neural nets, the selected algorithm should not overfit. Please correct me if this assumption is wrong.
Any other thoughts or solution would be most welcome.
Thanks
Regards,
Adeel
If you're not using ARIMA or a DNN, how about using rolling-window regressions to train and test on the historical data?
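A rolling-window backtest can be sketched with plain least squares. The window lengths and the mean-absolute-error metric below are illustrative choices, not prescriptions:

```python
import numpy as np

def rolling_window_backtest(X, y, train_weeks, test_weeks):
    """Refit a linear model on a sliding window and score the weeks after it.

    X, y: feature matrix and target, ordered oldest to newest (one row per week).
    Returns the mean absolute error of each held-out window.
    """
    errors = []
    for start in range(0, len(y) - train_weeks - test_weeks + 1, test_weeks):
        tr = slice(start, start + train_weeks)
        te = slice(start + train_weeks, start + train_weeks + test_weeks)
        # Ordinary least squares with an intercept column.
        A = np.column_stack([np.ones(train_weeks), X[tr]])
        coef, *_ = np.linalg.lstsq(A, y[tr], rcond=None)
        A_te = np.column_stack([np.ones(test_weeks), X[te]])
        pred = A_te @ coef
        errors.append(np.mean(np.abs(pred - y[te])))
    return errors
```

Plotting these per-window errors over time also shows directly how fast accuracy decays as the forecast horizon moves away from the training window, which is the effect described in the question.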
I want to train a CNN, but I want to use all of the data to train the network, thus not performing validation. Is this a good choice? Am I risking overfitting my CNN if I use only the training loss as the criterion for early stopping?
In other words, what is the best 'monitor' parameter in Keras (for example) for early stopping, among the options below?
early_stopper=EarlyStopping(monitor='train_loss', min_delta=0.0001, patience=20)
early_stopper=EarlyStopping(monitor='train_acc', min_delta=0.0001, patience=20)
early_stopper=EarlyStopping(monitor='val_loss', min_delta=0.0001, patience=20)
early_stopper=EarlyStopping(monitor='val_acc', min_delta=0.0001, patience=20)
There is a similar discussion on Stack Overflow (Keras: Validation error is a good measure for stopping criteria or validation accuracy?), but it covers only the validation metrics. Is it better to use a criterion on the validation data or on the training data for early stopping a CNN?
I want to train a CNN, but I want to use all of the data to train the network, thus not performing validation. Is this a good choice? Am I risking overfitting my CNN if I use only the training loss as the criterion for early stopping?
Answer: No. Your purpose is to predict on new samples; even if you get 100% training accuracy, you may still get bad predictions on new samples, and you have no way to check whether you are overfitting.
In other words, what is the best 'monitor' parameter in Keras (for example) for early stopping, among the options below?
Answer: It should be the criterion closest to your real objective:
early_stopper=EarlyStopping(monitor='val_acc', min_delta=0.0001, patience=20)
In addition, you may need training, validation, and test data. The training set trains your model; the validation set is used to compare models and parameters and select the best; and the test set independently verifies your result (it is not used for choosing models or parameters, so it is equivalent to new samples).
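For intuition, the patience/min_delta behaviour behind Keras's EarlyStopping can be sketched in a few lines of plain Python. This is a simplified illustration of the stopping logic, not the actual Keras implementation:

```python
class EarlyStopper:
    """Minimal sketch of patience-based early stopping on a monitored metric.

    mode='min' for losses (stop when val_loss stops decreasing),
    mode='max' for accuracies (stop when val_acc stops increasing).
    """
    def __init__(self, min_delta=0.0001, patience=20, mode='min'):
        self.min_delta = min_delta
        self.patience = patience
        self.mode = mode
        self.best = float('inf') if mode == 'min' else float('-inf')
        self.wait = 0

    def should_stop(self, value):
        """Call once per epoch with the monitored value; True means stop."""
        if self.mode == 'min':
            improved = value < self.best - self.min_delta
        else:
            improved = value > self.best + self.min_delta
        if improved:
            self.best = value
            self.wait = 0
        else:
            self.wait += 1
        return self.wait >= self.patience
```

The point of monitoring a validation metric rather than a training one is visible here: `value` should come from data the optimizer never saw, otherwise the counter never fires on overfitting.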
I've already up-voted Tin Luu's answer, but wanted to refine one critical, practical point: the best criterion is the one that best matches your success criteria. To wit, you have to define your practical scoring function before your question makes complete sense for us.
What is important to the application for which you're training this model? If it's nothing more than top-1 prediction accuracy, then validation accuracy (val_acc) is almost certainly your sole criterion. If you care about confidence levels (e.g. hedging your bets when it's 48% cat, 42% wolf, 10% Ferrari), then a properly implemented error function will make validation loss (val_loss) the better choice.
Finally, I stress again that the ultimate metric is actual performance according to your chosen criteria. Test data are a representative sampling of your actual input. You can use an early stopping criterion for faster training turnaround, but you're not ready for deployment until your real-world criteria are tested and satisfied.
The problem is as follows:
I want to use a forecasting algorithm to predict the heat demand of an otherwise unspecified household over the next 24 hours, with a time resolution of a few minutes for the first three or four hours and a lower resolution for the hours after that.
The algorithm should be adaptive and learn over time. I do not have much historical data, since initially I want the algorithm to be usable in different settings. To begin with, I only have very basic inputs such as the assumed yearly heat demand, the current outside temperature, and the time. So it will be quite general and imprecise at the beginning, but should learn from its errors over time.
The algorithm should be implemented in Matlab, if possible.
Does anyone know an approach or an algorithm designed to produce sensible predictions after a short time by learning from and adapting to incoming data?
Well, this question is quite broad, as essentially any algorithm for forecasting or data assimilation could do this task in principle.
The classic approach I would look into first is Kalman filtering, which is quite general, at least once its generalizations to ensemble filters etc. are taken into account (and it is easy to implement in MATLAB):
https://en.wikipedia.org/wiki/Kalman_filter
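For intuition, a scalar Kalman filter with a random-walk state model fits in a few lines. The noise variances below are illustrative assumptions, and a real heat-demand model would use a richer state than a single scalar:

```python
def kalman_1d(measurements, q=0.01, r=1.0, x0=0.0, p0=1.0):
    """Scalar Kalman filter with a random-walk state model.

    q: process noise variance, r: measurement noise variance,
    x0/p0: initial state estimate and its variance.
    Returns the filtered state estimate after each measurement.
    """
    x, p = x0, p0
    estimates = []
    for z in measurements:
        # Predict: a random walk keeps the state and inflates its variance.
        p = p + q
        # Update: blend prediction and measurement by the Kalman gain.
        k = p / (p + r)
        x = x + k * (z - x)
        p = (1 - k) * p
        estimates.append(x)
    return estimates
```

The adaptivity the question asks for comes for free: each new measurement updates the estimate recursively, with no need to store or refit on the full history.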
However, more important than the actual inference algorithm is typically the design of the model you fit to your data. For your scenario, you could start with a simple prediction from past values and add daily rhythms, the influence of outside temperature, etc. The more (correct) information you put into your model a priori, the better it should be at prediction.
For the full mathematical analysis of this type of problem I can recommend this book: https://doi.org/10.1017/CBO9781107706804
In order to turn this into a calibration problem, we need:
a model that predicts the heat demand depending on inputs and parameters,
observations of the heat demand.
Calibrating this model means tuning the parameters so that the model best predicts the heat demand.
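As a toy illustration of such a calibration, suppose demand is roughly linear in the gap between a base temperature and the outside temperature; both the model form and the assumed 18 °C base temperature are example choices, not part of the original problem:

```python
import numpy as np

def calibrate_heat_model(t_out, demand):
    """Fit demand ~ a + b * (18.0 - t_out) by least squares.

    t_out: observed outside temperatures; demand: observed heat demand.
    18 degrees C is an assumed base temperature. Returns the tuned (a, b).
    """
    x = 18.0 - np.asarray(t_out, dtype=float)
    A = np.column_stack([np.ones(len(x)), x])
    (a, b), *_ = np.linalg.lstsq(A, np.asarray(demand, dtype=float), rcond=None)
    return a, b
```

Here the "parameters" being tuned are a and b, and the "observations" are the recorded demand values, matching the two ingredients listed above.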
If you go for Python, I suggest using OpenTURNS, which provides several data assimilation methods, e.g. Kalman filtering (also called BLUE):
https://openturns.github.io/openturns/latest/user_manual/calibration.html
I have a dataset of gold prices, and after some modification and preprocessing I ended up with the dataframe below:
There are 50,000 records in the dataset, covering more than 500 different markets with different frequencies. All columns except date are of int type; date is a datetime object. I need to predict the price per unit on some specific dates, but I am baffled by the number of available methods.
My question is: what regression algorithm/method gives good predictions for this kind of data?
In machine learning and data mining, as they always say, a lot of things can be done in a lot of ways. Let's try to use elimination to decide on an algorithm for the given problem. The primary observation is that the target variable (the feature to be predicted) is continuous, hence you should use a regression algorithm. I would suggest starting with linear regression and checking the fit using the r² score, the coefficient of determination, which compares your model's squared errors against those of simply predicting the mean. If that is not up to par, try a random forest regressor.
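A minimal sketch of that first step with plain NumPy follows; note that in practice you would compute r² on held-out data rather than on the training data, as fitting and scoring on the same rows overstates performance:

```python
import numpy as np

def fit_and_score(X, y):
    """Fit OLS linear regression and return (coef, r2) on the given data.

    r2 is the coefficient of determination: 1 - SS_res / SS_tot.
    """
    A = np.column_stack([np.ones(len(y)), X])  # intercept + features
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    pred = A @ coef
    ss_res = np.sum((y - pred) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return coef, 1.0 - ss_res / ss_tot
```

An r² near 1 means the linear model explains almost all the variance; near 0 means it does no better than predicting the mean, which is the cue to try a more flexible model such as a random forest.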
So I am trying to perform 4-fold cross-validation on my training set. I have divided my training data into four quarters. I use three quarters for training and one quarter for validation, and repeat this three more times so that each quarter serves as the validation set exactly once.
Now after training I have four caffemodels. I test each model on its validation set and get a different accuracy in each case. How should I proceed from here? Should I just choose the model with the highest accuracy?
Maybe it is a late reply, but in any case...
The short answer is that, if the performances of the four models are similar and good enough, you then re-train the model on all the available data, because you don't want to waste any of it.
The n-fold cross validation is a practical technique to get some insights on the learning and generalization properties of the model you are trying to train, when you don't have a lot of data to start with. You can find details everywhere on the web, but I suggest the open-source book Introduction to Statistical Learning, Chapter 5.
The general rule says that after you have trained your n models, you average their prediction error (MSE, accuracy, or whatever) to get a general idea of the performance of that particular model (in your case, the network architecture and learning strategy) on that dataset.
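That averaging step can be sketched generically; `fit` and `score` below are placeholders for whatever training procedure and error metric you actually use:

```python
import numpy as np

def kfold_mean_error(X, y, fit, score, k=4):
    """Average the validation score of a model family over k folds.

    fit(X_train, y_train) -> model; score(model, X_val, y_val) -> float.
    Returns the mean score across the k validation folds.
    """
    idx = np.arange(len(y))
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train], y[train])
        scores.append(score(model, X[val], y[val]))
    return float(np.mean(scores))
```

Note that the output is one number describing the model family, not a winner among the k fitted models; that is exactly why, once the number looks acceptable, you retrain once on all the data.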
The main idea is to assess the models learned on the training splits by checking whether they perform acceptably on the validation set. If they do not, your models probably overfitted the training data. If the errors on both the training and validation splits are high, the models should be reconsidered, since they have no predictive capacity.
In any case, I would also consider the advice of Yoshua Bengio who says that for the kind of problem deep learning is meant for, you usually have enough data to simply go with a training/test split. In this case this answer on Stackoverflow could be useful to you.
I have a dataset that contains the following information: time of day, day of the week, and performance of the post. The post is a blog post made on a certain blog; performance is computed from the number of visits, comments, etc. We are trying to find a correlation between the time of posting, the day of posting, and performance. I am inclined to use a clustering algorithm, but I am not sure how to go about this. What algorithm would you recommend, and why?
Giving advice on general questions like the choice of method is usually not easy -- even more so when there is no data and only the principles are concerned.
Nevertheless, put in the usual terms, it seems that you want a model f(time of day, day of the week) which outputs a prediction of the performance. For this, you can use basically any regression method fed with your measured data, such as neural networks, kernel regression, regression trees (CART), etc.
Moreover, to get a first graphical impression, you can use a histogram: choose a time window (say, a quarter of an hour) and attribute to it the average performance within that window.
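Such a windowed average can be sketched as follows; the 15-minute window is just an example choice, and the function assumes the posting times have been converted to minutes of the day:

```python
import numpy as np

def windowed_average(minutes_of_day, performance, window=15):
    """Average performance per fixed time-of-day window.

    minutes_of_day: posting times as minutes since midnight (0..1439).
    Returns a dict mapping window start minute -> mean performance.
    """
    minutes = np.asarray(minutes_of_day)
    perf = np.asarray(performance, dtype=float)
    bins = (minutes // window) * window  # snap each time to its window start
    return {int(b): float(perf[bins == b].mean()) for b in np.unique(bins)}
```

Plotting these per-window means (possibly one curve per day of the week) gives the first visual check for a time-of-posting effect before fitting any model.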
As said, so far these are only general things -- I hope that helps nevertheless.