I was wondering if there is some mechanism in pymc3 to re-run a model with new data. After setting up the model and before sampling, I assume that pymc3 does some optimization (and compilation?) of the model, which takes quite some time. I would like to set up the model once and then run a long sequence of different (independent) data sets through it.
I tried setting up the model outside a loop (defining all the priors, etc.) and only updating the likelihood with new measurements inside the loop (running the sampling inside the loop as well). The estimates, however, do not change with changing data, so I think the model is always using the data provided first.
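For reference, here is a minimal sketch of the loop structure described above, assuming the observed data is wrapped in a theano shared variable so it can be swapped between runs (the model, variable names and data are made up):

    import numpy as np
    import pymc3 as pm
    import theano

    # Shared container for the measurements; the placeholder contents are hypothetical.
    y_shared = theano.shared(np.zeros(100))

    with pm.Model() as model:
        mu = pm.Normal("mu", mu=0.0, sigma=10.0)
        sigma = pm.HalfNormal("sigma", sigma=1.0)
        pm.Normal("y_obs", mu=mu, sigma=sigma, observed=y_shared)

    # A stand-in for the sequence of independent data sets.
    datasets = [np.random.normal(loc=m, scale=1.0, size=100) for m in (0.0, 5.0, -3.0)]

    for data in datasets:
        y_shared.set_value(data)              # swap in the new measurements
        with model:                           # the already-built model is reused
            trace = pm.sample(1000, tune=1000)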
Many thanks and best regards
Jan
I am doing research on object detection using YOLO, although I come from a civil engineering background and am not familiar with computer science. My advisor is asking me to validate my YOLO detection model, which was trained on a custom dataset, but my problem is that I really don't know how to validate the model. Could you please point me to how to validate it?
Thanks in advance.
I think you first need to make sure that all the cases you are interested in (location of objects, their size, general view of the scene, etc.) are represented in your custom dataset - in other words, that the collected data reflects your task. You can discuss this with your advisor. The main rule: label the data in the same manner and to the same quality as you want to see in the output. More information can be found here.
This is really important - garbage in, garbage out: the quality of your trained model's output is determined by the quality of the input (the labelled data).
Once this is done, it is common practice to split your data into training and test sets. During model training only the training set is used, and you can later validate the quality (generalizing ability, robustness, etc.) on data the model has never seen - the test set. It is also important that these two subsets do not overlap - otherwise your model will be overfitted and will not perform the task properly.
Then you can train a few different models (with some architectural changes, for example) on the same training set and validate them on the same test set; this is a regular validation process.
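Not specific to YOLO, but here is a minimal sketch of such a non-overlapping split on a list of labelled images (the file names are hypothetical):

    import random

    # Hypothetical list of labelled image files from the custom dataset.
    image_files = ["img_%04d.jpg" % i for i in range(1000)]

    random.seed(0)
    random.shuffle(image_files)

    split = int(0.8 * len(image_files))       # e.g. 80% train / 20% test
    train_files = image_files[:split]
    test_files = image_files[split:]          # never shown to the model during training

    # The two subsets must not overlap, otherwise the evaluation is meaningless.
    assert not set(train_files) & set(test_files)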
I am trying to continue training a pre-trained model with new labelled documents (TaggedDocument).
The pre-trained model was trained on documents whose unique ids use label1_index, for instance Good_0, Good_1, ... up to Good_999,
and the total size of that training data is about 7000.
Now I want to train the pre-trained model with new documents whose unique ids use label2_index, for instance Bad_0, Bad_1, ... up to Bad_1211,
and the total size of this training data is about 1211.
The training itself completed without any error, but the problem is that whenever I try to use 'most_similar' it only suggests similar documents labelled Good_..., where I expect documents labelled Bad_....
If I train on everything from the beginning, it gives me the answers I expected - it infers that a newly given document is similar to documents labelled either Good or Bad.
However, the incremental approach above does not behave like the model trained on everything from the beginning.
Is continued training not working properly, or did I make some mistake?
The gensim Doc2Vec class can always be fed extra examples via train(), but it only discovers the working vocabulary of both word-tokens and document-tags during an initial build_vocab() step. So unless words/tags were available during the build_vocab(), they'll be ignored as unknown later. (The words get silently dropped from the text; the tags aren't trained or remembered inside the model.)
The Word2Vec superclass, from which Doc2Vec borrows a lot of functionality, has a newer, more experimental parameter on its build_vocab() called update. If set to True, that call to build_vocab() will add to, rather than replace, any prior vocabulary. However, as of February 2018, this option doesn't yet work with Doc2Vec, and indeed often causes memory-fault crashes.
But even if/when that can be made to work, providing incremental training examples isn't necessarily a good idea. By only updating parts of the model – those exercised by the new examples – the overall model can get worse, or its vectors made less self-consistent with each other. (The essence of these dense-embedding models is that the optimization over all varied examples results in generally-useful vectors. Training over just some subset causes the model to drift towards being good on just that subset, at likely cost to earlier examples.)
If you need new examples to also become part of the results for most_similar(), you might want to create your own separate set-of-vectors outside of Doc2Vec. When you infer new vectors for new texts, you could add those to that outside set, and then implement your own most_similar() (using the gensim code as a model) to search over this expanding set of vectors, rather than just the fixed set that is created by initial bulk Doc2Vec training.
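A rough sketch of that separate set-of-vectors idea, assuming tokenized input texts and an already-trained model (the file name and helper functions here are hypothetical):

    import numpy as np
    from gensim.models.doc2vec import Doc2Vec

    # Hypothetical: a Doc2Vec model trained once on the original bulk corpus.
    model = Doc2Vec.load("good_docs.d2v")

    # Inferred vectors for new texts live in a plain dict outside the model.
    extra_vectors = {}

    def add_document(tag, words):
        """Infer a vector for a new tokenized text and store it under its tag."""
        extra_vectors[tag] = model.infer_vector(words)

    def most_similar_external(words, topn=10):
        """Rank the externally stored vectors by cosine similarity to a query text."""
        query = model.infer_vector(words)
        query = query / np.linalg.norm(query)
        scores = [(tag, float(np.dot(query, vec / np.linalg.norm(vec))))
                  for tag, vec in extra_vectors.items()]
        return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]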
I was reading about cross-validation and about how it is used to select the best model and estimate parameters, but I did not really understand what that means.
Suppose I build a linear regression model and go for 10-fold cross-validation. I think each of the 10 fits will have different coefficient values; now, from those 10 different models, which one should I pick as my final model or parameter estimates?
Or do we use cross-validation only for the purpose of finding an average error (the average over the 10 models in our case) and comparing it against another model?
If you build a linear regression model and go for 10-fold cross-validation, then indeed each of the 10 fits will have different coefficient values. The reason you use cross-validation is to get a robust idea of the error of your linear model, rather than evaluating it on a single train/test split, which could be unlucky or too lucky. CV is more robust because it is very unlikely that all ten splits are lucky or all ten unlucky.
Your final model is then trained on the whole training set - this is where your final coefficients come from.
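A minimal scikit-learn sketch of this, with synthetic data just for illustration: cross-validation reports the error, and the final coefficients come from one fit on the full training set.

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score

    # Synthetic data standing in for the real training set.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))
    y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

    model = LinearRegression()

    # 10-fold CV gives a robust estimate of the out-of-sample error...
    scores = cross_val_score(model, X, y, cv=10, scoring="neg_mean_squared_error")
    print("CV mean squared error: %.4f (+/- %.4f)" % (-scores.mean(), scores.std()))

    # ...while the final coefficients come from fitting on the whole training set.
    model.fit(X, y)
    print("final coefficients:", model.coef_)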
Cross-validation is used to see how good your model's predictions are. It is a smart way of running multiple tests on the same data by splitting it, as you probably know (it is especially useful if you don't have much training data).
For example, it can be used to make sure you aren't overfitting: once you have finished your model, you evaluate it with cross-validation, and if you see that the error grows a lot somewhere, you go back to tweaking the parameters.
Edit:
Read the Wikipedia article for a deeper understanding of how it works: https://en.wikipedia.org/wiki/Cross-validation_%28statistics%29
You are basically confusing grid search with cross-validation. The idea behind cross-validation is to check how well a model will perform in, say, a real-world application. So we repeatedly split the data at random in different proportions and validate its performance. Note that the parameters of the model remain the same throughout the cross-validation process.
In grid search we try to find the parameter values that give the best results over one specific split of the data (say 70% train and 30% test). So in this case, for the different parameter combinations of the same model, the dataset remains constant.
Read more about cross-validation here.
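To make the contrast concrete, here is a small scikit-learn sketch of a grid search over one fixed split, as described above (synthetic data and a hypothetical hyperparameter grid):

    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import train_test_split

    # Synthetic data standing in for the real dataset.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 5))
    y = X @ rng.normal(size=5) + rng.normal(scale=0.1, size=300)

    # One fixed split (say 70% train / 30% test) that stays constant during the search.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    # Grid search: only the hyperparameter changes, the data split does not.
    best_alpha, best_score = None, -np.inf
    for alpha in [0.01, 0.1, 1.0, 10.0]:
        score = Ridge(alpha=alpha).fit(X_train, y_train).score(X_test, y_test)
        if score > best_score:
            best_alpha, best_score = alpha, score

    print("best alpha:", best_alpha, "with R^2 %.3f on the fixed test split" % best_score)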
Cross-validation is mainly used for the comparison of different models.
For each model, you get the average generalization error over the k validation sets; you can then choose the model with the lowest average generalization error as your optimal model.
Cross-Validation or CV allows us to compare different machine learning methods and get a sense of how well they will work in practice.
Scenario-1 (Directly related to the question)
Yes, CV can be used to find out which method (SVM, random forest, etc.) will perform best, and we can pick that method to work with further.
(For each method, different models are generated and evaluated, an average metric is calculated per method, and the best average metric helps in selecting the method.)
After identifying the best method (or the best parameters), we can train/retrain our model on the training dataset.
Parameters or coefficients can be determined by grid search techniques; see grid search.
Scenario-2:
Suppose you have a small amount of data and you want to perform training, validation and testing on it. Dividing such a small amount of data into three sets reduces the number of training samples drastically, and the result will depend on the particular choice of training and validation sets.
CV comes to the rescue here. In this case we don't need a separate validation set, but we still need to hold out the test data.
A model is trained on k-1 folds of the training data and the remaining fold is used for validation. The mean and standard deviation of the metric across folds show how well the model is likely to perform in practice.
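A small scikit-learn sketch of this scenario (the dataset here is only a stand-in): a test set is held out, and the mean and standard deviation of the metric are computed over the k folds of the remaining data.

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import KFold, train_test_split

    # Hold out a test set, then cross-validate on the rest.
    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    scores = []
    for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X_train):
        model = LogisticRegression(max_iter=1000)
        model.fit(X_train[train_idx], y_train[train_idx])               # train on k-1 folds
        scores.append(model.score(X_train[val_idx], y_train[val_idx]))  # validate on the held-out fold

    print("accuracy: %.3f +/- %.3f" % (np.mean(scores), np.std(scores)))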
On my test set, the observed variables are not the same for each data point. A given variable can be observed on a data point, and not on the next one. Thus I would like to change the observed flag of those variables without reconstructing the full PyMC model. I read that it wasn't possible (and couldn't manage to do it). Is there any way to do it?
I thus decided to rebuild a PyMC model for each of my test set data point. I instantiate a new PyMC model at each iteration of a for loop.
The problem is that the memory used by each PyMC model does not seem to be freed. My network is huge (1000 binomial/sigmoid nodes) and densely connected, and the model takes about 200MB (just the model, without the traces). I am wondering whether the Python garbage collector is unable to delete it because of the numerous circular references between the PyMC nodes of my network.
What do you think? Do you see a proper way to do such a thing?
If you are rebuilding the PyMC model for each data point, then presumably you are not using the built-in samplers (e.g. MCMC). In that case, you can use the set_value() method of the nodes you need to set at each iteration and then call model.draw_from_prior() to draw random values for the other nodes.
In other words, instead of using observed=True, you can create your nodes with observed=False and then manually fix the value with set_value().
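A minimal sketch of that second suggestion using the PyMC2-style API (the node and the looped values are made up); the node is created with observed=False so its value can still be set by hand:

    import pymc

    # Hypothetical node: observed=False keeps the value settable after creation.
    y = pymc.Normal("y", mu=0.0, tau=1.0, value=0.0, observed=False)

    for datum in [1.2, -0.7, 3.4]:    # stand-in for the loop over test-set data points
        y.set_value(datum)            # manually fix the node, as suggested above
        print(y.value, y.logp)        # the log-probability now reflects the fixed value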
I am building a multiple regression model - wrapped in a function - with one dependent variable and a dozen independent variables. The reason why I am building a function is that I need to do this analysis with approximately 75 different datasets.
The challenge is that the independent variables correlate better with the dependent variable when they are lagged in time. Unfortunately, the optimal time lag is not the same for each variable, and I would like to determine the mix of time lags per variable that gives the best adjusted R^2 for the multiple regression model. Moreover, after building an initial model I will try to reduce it using the step(modelbase, direction="both") function.
In my current approach I lag all the independent variables by the same number of weeks. This gives the best possible model in which all independent variables share the same time lag, but I believe (with a valid hypothesis supporting this) that there is a better model in which the time lag differs per independent variable. My question is: what is the best strategy for finding the best-fitting model without the number of options becoming huge? If I want to consider time lags between 0 and 20 weeks, in weekly steps, for 12 independent variables, I quickly end up with roughly 4.096e+15 combinations (= 20^12) to try.
I can imagine reducing the problem with the following strategy: start by finding the best-fitting model with one independent variable at different time lags. The second step is to add a second independent variable and find the best model with the two variables, trying the second at different time lags while keeping the first constant. Then add a third variable, taking a similar approach: keep the first two constant and try the third at different time lags, and so on. Something tells me this might be a decent approach, but also that the best overall model might contain lags that are not individually optimal for each independent variable.
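The workflow in the question is in R, but here is a rough Python (pandas/statsmodels) sketch of the greedy lag search just described, with hypothetical column names. It fixes one predictor's lag at a time, scoring each candidate lag by the adjusted R^2 of the resulting OLS fit:

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    def greedy_lag_search(df, target, predictors, max_lag=20):
        """Choose lags one predictor at a time, keeping previously chosen lags fixed,
        and keep the lag that maximizes adjusted R^2. (Hypothetical helper.)"""
        chosen = {}
        for var in predictors:
            best_lag, best_adj_r2 = 0, -np.inf
            for lag in range(max_lag + 1):
                trial = dict(chosen, **{var: lag})
                lagged = pd.DataFrame({v: df[v].shift(k) for v, k in trial.items()})
                data = pd.concat([df[target], lagged], axis=1).dropna()
                fit = sm.OLS(data[target], sm.add_constant(data[list(trial)])).fit()
                if fit.rsquared_adj > best_adj_r2:
                    best_lag, best_adj_r2 = lag, fit.rsquared_adj
            chosen[var] = best_lag
        return chosen

    # Usage (column names are hypothetical):
    # best_lags = greedy_lag_search(weekly_df, "sales", ["temperature", "rainfall", "price"])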
Is there anybody who can shine some light on how to tackle this challenge?