General question:
Using scikit-optimize for a black box optimization. Can't find in the doc what model_queue_size does. I'm doing the ask-tell because I can parallelize the calculation of y as described in the example. So, doing some profiling, it looks like the opt.tell() call runs faster when model_queue_size is set smaller. Is that what model_queue_size does - limits the number of samples used in the opt.tell() call?
2nd question, how can I set kappa when using the Optimizer ask-tell method?
When using the default model_queue_size=None all surrogate models are stored in the optimizer's models attribute. If a number is specified, only model_queue_size models will be remembered. That is in the docs.
A new model is added each time tell is called and old models will be discarded once model_queue_size is reached. So only the most recent models are remembered. That can be seen by looking at the code.
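To make the queue behaviour concrete, here is a minimal sketch using a bounded deque. ToyOptimizer and its string "models" are made up for illustration; this is not scikit-optimize's actual code, only the same keep-the-most-recent-N idea.

```python
from collections import deque


class ToyOptimizer:
    """Toy stand-in for an ask-tell optimizer; NOT scikit-optimize's code."""

    def __init__(self, model_queue_size=None):
        # deque(maxlen=None) is unbounded; otherwise the oldest entry is
        # dropped automatically once maxlen is reached.
        self.models = deque(maxlen=model_queue_size)
        self.n_told = 0

    def tell(self, x, y):
        self.n_told += 1
        # Hypothetical "model": just a label for the fit made at this step.
        self.models.append(f"model_after_{self.n_told}_points")


opt = ToyOptimizer(model_queue_size=3)
for i in range(5):
    opt.tell(x=i, y=i * i)

# Only the three most recent "models" are still remembered.
print(list(opt.models))
```

Note that in this sketch the queue only limits how many fitted models are kept around, not how many observations each new model is fitted on.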
Not sure why this would affect runtime in your case. I suppose if you run many iterations and models are very large it could be a memory thing.
Kappa can be set by using the acq_func_kwargs parameter of the Optimizer constructor as shown in the exploration vs exploitation example.
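As a rough illustration of what kappa trades off, the sketch below computes a lower-confidence-bound (LCB) style value, mean - kappa * std, in plain Python. The candidate names and numbers are invented, and this is not the library's implementation. With scikit-optimize itself you would pass something like acq_func="LCB" together with acq_func_kwargs={"kappa": ...} to the Optimizer constructor.

```python
def lower_confidence_bound(mean, std, kappa):
    """LCB acquisition value for a minimization problem: mean - kappa * std."""
    return mean - kappa * std


def pick_next(candidates, kappa):
    """Return the name of the candidate with the lowest LCB value."""
    return min(
        candidates,
        key=lambda name: lower_confidence_bound(*candidates[name], kappa),
    )


# Hypothetical surrogate predictions: name -> (predicted mean, predicted std).
candidates = {
    "well_explored": (1.0, 0.1),
    "uncertain": (2.0, 1.0),
}

print(pick_next(candidates, kappa=0.5))  # small kappa favours exploitation
print(pick_next(candidates, kappa=2.0))  # large kappa favours exploration
```

With the made-up numbers above, the small kappa picks the point with the best predicted mean, while the large kappa picks the point the surrogate is least sure about.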
I am working on a Word2Vec model. Is there any way to get the ideal value for one of its parameters, i.e. iter, like the way we use the elbow-curve plot in K-Means to get the K value? Or is there any other way to tune the parameters of this model?
There's no one ideal set of parameters for a word2vec session – it depends on your intended usage of the word-vectors.
For example, some research has suggested that using a larger window tends to position the final vectors in a way that's more sensitive to topical/domain similarity, while a smaller window value shifts the word-neighborhoods to be more syntactic/functional drop-in replacements for each other. So depending on your particular project goals, you'd want a different value here.
(Similarly, because the original word2vec paper evaluated models, & tuned model meta-parameters, based on the usefulness of the word-vectors to solve a set of English-language analogy problems, many have often tuned their models to do well on the same analogy task. But I've seen cases where the model that scores best on those analogies does worse when contributing to downstream classification tasks.)
So what you really want is a project-specific way to score a set of word-vectors, well-matched to your goals. Then, you run many alternate word2vec training sessions, and pick the parameters that do best on your score.
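A minimal sketch of that "train many, keep the best-scoring" loop might look like the following. The function names are hypothetical, and fake_train_and_score is only a placeholder: a real version would train word2vec with the given parameters and return your own project-specific score.

```python
from itertools import product


def best_params(param_grid, train_and_score):
    """Try every combination in param_grid; return (best_score, best_params)."""
    names = sorted(param_grid)
    results = []
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        results.append((train_and_score(params), params))
    return max(results, key=lambda pair: pair[0])


# Hypothetical stand-in: a real version would train word2vec with `params`
# and return your project-specific score for the resulting vectors.
def fake_train_and_score(params):
    return -abs(params["window"] - 5) - abs(params["epochs"] - 10) / 10


grid = {"window": [2, 5, 10], "epochs": [5, 10, 20]}
score, params = best_params(grid, fake_train_and_score)
print(score, params)
```

The grid values here are arbitrary; the point is only that the selection is driven by your score, not by a generic benchmark.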
The case of iter/epochs is special, in that by the logic of the underlying stochastic-gradient-descent optimization method, you'd ideally want to use as many training-epochs as necessary for the per-epoch running 'loss' to stop improving. At that point, the model is plausibly as good as it can be – 'converged' – given its inherent number of free-parameters and structure. (Any further internal adjustments that improve it for some examples worsen it for others, and vice-versa.)
So potentially, you'd watch this 'loss', and choose a number of training-iterations that's just enough to show the 'loss' stagnating (jittering up-and-down in a tight window) for a few passes. However, the loss-reporting in gensim isn't yet quite optimal – see project bug #2617 – and many word2vec implementations, including gensim and going back to the original word2vec.c code released by Google researchers, just let you set a fixed count of training iterations, rather than implement any loss-sensitive stopping rules.
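If you do manage to collect a per-epoch loss value, a stagnation check could be sketched roughly like this. The function name, the tolerance and the patience value are all assumptions for illustration, and the loss numbers are made up; nothing here is a gensim API.

```python
def epochs_until_plateau(losses, tolerance=0.01, patience=3):
    """Return the 1-based epoch count at which the loss has stayed within a
    relative `tolerance` band for `patience` consecutive epochs, or None."""
    stagnant = 0
    for epoch in range(1, len(losses)):
        previous, current = losses[epoch - 1], losses[epoch]
        change = abs(previous - current) / max(abs(previous), 1e-12)
        # Count consecutive small changes; any large change resets the count.
        stagnant = stagnant + 1 if change < tolerance else 0
        if stagnant >= patience:
            return epoch + 1
    return None


# Made-up per-epoch losses: fast improvement, then jitter in a tight window.
losses = [100.0, 60.0, 40.0, 30.0, 29.9, 30.0, 29.95, 29.97]
print(epochs_until_plateau(losses))
```

Since the training itself takes a fixed epoch count, you would run once with a generous count, inspect the losses, and then pick the count this suggests for later runs.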
Is there a way to infer multiple documents at the same time to preserve the random state of the model using Gensim Doc2Vec?
The function infer_vector is defined as
infer_vector(doc_words, alpha=None, min_alpha=None, epochs=None, steps=None)
where doc_words (list of str) – A document for which the vector representation will be inferred. And I could not find any other option to infer multiple documents at the same time.
There's no current option to infer multiple documents at once. It's one of many wishlist improvements for infer_vector() (collected in an open issue), but there's no work in progress or targeted release for that to arrive.
I'm not sure what you mean by "preserve the random state of the model". The main motivations for batching that I can see would be user convenience, or added performance via multithreading.
If what you really want is deterministic inference, see an answer in the Gensim FAQ which explains why deterministic Doc2Vec inference isn't necessarily a good idea. (It also includes a link to an issue with some ideas for how to force it, if you're determined to do that despite the good reasons not to.)
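Since infer_vector() takes one document at a time, the usual workaround is simply a loop. Below is a small sketch; infer_many and StubModel are hypothetical names, and the stub only exists so the example runs without a trained model. With a real Doc2Vec model you would pass that model instead.

```python
def infer_many(model, documents, **infer_kwargs):
    """Infer one vector per tokenized document by calling infer_vector()
    once for each document, in order."""
    return [model.infer_vector(doc, **infer_kwargs) for doc in documents]


class StubModel:
    """Hypothetical stand-in for a trained Doc2Vec model, for demonstration."""

    def infer_vector(self, doc_words, epochs=None):
        # Fake "vector": token count and total character count.
        return [len(doc_words), sum(len(w) for w in doc_words)]


docs = [["hello", "world"], ["gensim", "doc2vec", "inference"]]
vectors = infer_many(StubModel(), docs, epochs=20)
print(vectors)
```

This gives the convenience of a batch call, but not any speed-up; each document is still inferred separately.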
I'm working with embedded systems. For the sake of explanation, I'm working with a dsPIC33EP and a simple serial EEPROM.
Suppose I'm building a controller that uses a linear control scheme (y=mx+b). If the controller needs many different settings, it's easy: store the m and the b in EEPROM and retrieve them for the different settings.
Now suppose I want to have different equations for different settings. I would have to pre-program all the equations and then have a method for selecting that equation and pulling the settings from the EEPROM. It's harder because you need to know the equations ahead of time, but still doable.
Now suppose that you don't know the equations ahead of time. Maybe you have to do a piecewise approximation, for example. How could you store something like that in memory, so that all a controller has to do is feed it a sensor reading and it would give back a control variable? Kind of like passing a variable to a function and getting the answer passed back.
How could you store a function like that in memory if only the current state is important?
How could you store a function like that if past states are important (if the control equation is second, third or fourth order for example)?
The dsPICs have limited RAM, but quite a bit of FLASH, enough for a small, but effective text parser. Have you thought of using some form of text based script? These can be translated to a more efficient data format at run-time.
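As a rough sketch of that idea, the snippet below parses a tiny text "script" of x:y breakpoints into a table and evaluates it by piecewise-linear interpolation. The script format and function names are invented for illustration, and Python is used only to show the logic; on a dsPIC the same table would more likely be a C array read from EEPROM or FLASH.

```python
from bisect import bisect_right


def parse_table(script):
    """Parse 'x:y' pairs separated by semicolons into sorted breakpoints."""
    points = []
    for entry in script.split(";"):
        if entry.strip():
            x_text, y_text = entry.split(":")
            points.append((float(x_text), float(y_text)))
    return sorted(points)


def evaluate(points, x):
    """Piecewise-linear interpolation; clamps outside the table's range."""
    xs = [p[0] for p in points]
    if x <= xs[0]:
        return points[0][1]
    if x >= xs[-1]:
        return points[-1][1]
    i = bisect_right(xs, x)  # index of the first breakpoint to the right of x
    (x0, y0), (x1, y1) = points[i - 1], points[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


# Hypothetical stored script: sensor reading -> control variable.
table = parse_table("0:0; 10:5; 20:20; 30:25")
print(evaluate(table, 15))
```

The controller then only needs the table and one generic evaluate routine, whatever the shape of the stored curve.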
I was reading about cross validation and about how it is used to select the best model and estimate parameters, but I did not really understand the meaning of it.
Suppose I build a Linear regression model and go for a 10 fold cross validation. I think each of the 10 will have different coefficient values; now, from the 10 different ones, which should I pick as my final model or estimated parameters?
Or do we use Cross Validation only for the purpose of finding an average error (average of 10 models in our case) and comparing against another model?
If you build a Linear regression model and go for a 10 fold cross validation, indeed each of the 10 will have different coefficient values. The reason why you use cross validation is that you get a robust idea of the error of your linear model - rather than just evaluating it on one train/test split only, which could be unfortunate or too lucky. CV is more robust, as it is very unlikely that all ten splits are lucky or all ten are unfortunate.
Your final model is then trained on the whole training set - this is where your final coefficients come from.
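Here is a small pure-Python sketch of that workflow for a one-variable linear regression: the k folds are used only to estimate the error, and the final coefficients come from a single fit on all the training data. The data set and helper names are made up for illustration.

```python
def fit_line(points):
    """Ordinary least squares for y = m*x + b on a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    m = sxy / sxx
    return m, mean_y - m * mean_x


def mse(model, points):
    """Mean squared error of the line (m, b) on the given points."""
    m, b = model
    return sum((y - (m * x + b)) ** 2 for x, y in points) / len(points)


def cross_validate(points, k):
    """Return the k validation errors, one per held-out fold."""
    errors = []
    for fold in range(k):
        held_out = points[fold::k]
        training = [p for i, p in enumerate(points) if i % k != fold]
        errors.append(mse(fit_line(training), held_out))
    return errors


# Made-up data: y = 2x + 1 plus a small deterministic wiggle.
data = [(x, 2 * x + 1 + (0.1 if x % 2 else -0.1)) for x in range(30)]

fold_errors = cross_validate(data, k=10)
print("mean CV error:", sum(fold_errors) / len(fold_errors))

# The final coefficients come from one fit on the whole training set.
final_m, final_b = fit_line(data)
print("final model:", final_m, final_b)
```

The ten per-fold fits are thrown away after their errors are recorded; none of them is "the" model.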
Cross-validation is used to see how good your model's predictions are. It's a smart way of running multiple tests on the same data by splitting it, as you probably know (so it is especially useful if you don't have much training data).
As an example it might be used to make sure you aren't overfitting the function. So basically you try your function when you've finished it with Cross-validation and if you see that the error grows a lot somewhere you go back to tweaking the parameters.
Read the Wikipedia article on cross-validation for a deeper understanding of how it works.
You are basically confusing Grid-search with cross-validation. The idea behind cross-validation is basically to check how well a model will perform in, say, a real world application. So we basically try randomly splitting the data in different proportions and validate its performance. It should be noted that the parameters of the model remain the same throughout the cross-validation process.
In Grid-search we try to find the best possible parameters that would give the best results over a specific split of data (say 70% train and 30% test). So in this case, for different combinations of the same model, the dataset remains constant.
Read more about cross-validation here.
Cross Validation is mainly used for the comparison of different models.
For each model, you may get the average generalization error on the k validation sets. Then you will be able to choose the model with the lowest average generalization error as your optimal model.
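As a tiny illustration of that selection step, assuming you already have the per-fold validation errors for each candidate (the model names and numbers below are made up):

```python
from statistics import mean, stdev

# Hypothetical per-fold validation errors for two candidate models.
fold_errors = {
    "linear_regression": [0.42, 0.39, 0.45, 0.40, 0.44],
    "random_forest": [0.35, 0.50, 0.31, 0.48, 0.33],
}

# Average error (and spread across folds) for each model.
summary = {name: (mean(errs), stdev(errs)) for name, errs in fold_errors.items()}

# Pick the model with the lowest average validation error.
best = min(summary, key=lambda name: summary[name][0])

for name, (avg, spread) in summary.items():
    print(f"{name}: mean={avg:.3f} std={spread:.3f}")
print("selected:", best)
```

Looking at the spread next to the mean is worthwhile, since a model with a slightly lower average but a much larger fold-to-fold variation may not be a clear winner.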
Cross-Validation or CV allows us to compare different machine learning methods and get a sense of how well they will work in practice.
Scenario-1 (Directly related to the question)
Yes, CV can be used to know which method (SVM, Random Forest, etc) will perform best and we can pick that method to work further.
(For each method, different models are generated and evaluated, an average metric is calculated per method, and the best average metric helps in selecting the method.)
After getting the information about the best method or the best parameters, we can train/retrain our model on the training dataset.
For parameters or coefficients, these can be determined by grid search techniques. See grid search
Suppose you have a small amount of data and you want to perform training, validation and testing on it. Dividing such a small amount of data into three sets reduces the training samples drastically, and the result will depend on the choice of pairs of training and validation sets.
CV will come to the rescue here. In this case, we don't need the validation set but we still need to hold the test data.
A model will be trained on k-1 folds of training data and the remaining 1 fold will be used for validating the data. A mean and standard deviation metric will be generated to see how well the model will perform in practice.
I have the following setup:
boolean data: (userid, itemid)
hadoop based mahout itemSimilarityJob with following arguments:
--similarityClassname Similarity_Loglikelihood
--maxSimilaritiesPerItem 50 & others (input,output..)
item based boolean recommender:
-model MySqlBooleanPrefJDBCDataModel
-similarity MySQLJDBCInMemoryItemSimilarity
-candidatestrategy AllSimilarItemsCandidateItemsStrategy
-mostSimilarItemsCandidateStrategy AllSimilarItemsCandidateItemsStrategy
Is there a way to use co-occurrence similarity in my setup to get final recommendations? If I plug SIMILARITY_COOCCURENCE in the job, the MySQLJDBCInMemoryItemSimilarity precondition checks fail since the counts become greater than 1. I know I can get final recommendations by running the recommender job on the precomputed similarities. Is there a way to do this in real time using the API, like in the case of log-likelihood similarity (and other similarity metrics with similarity values between -1 & 1) using MySQLJDBCInMemoryItemSimilarity?
How can we cap the max number of similar items per item in the item similarity job? What I mean here is that the AllSimilarItemsCandidateItemsStrategy calls .allsimilaritems(item) to get all possible candidates. Is there a way I can get, say, the top 10/20/50 similar items using the API? I know we can pass --maxSimilaritiesPerItem to the item similarity job, but I am not completely sure what it stands for and how it works. If I set this to 10/20/50, will I be able to achieve what I stated above? Also, is there a way to accomplish this via the API?
I am using a rescorer for filtering out and rescoring final recommendations. With the rescorer, the calls to /recommend/userid?howMany=10&rescore={..} & to /similar/itemid?howMany=10&rescore={..} are taking way longer (300ms-400ms) compared to (30-70ms) without the rescorer. I'm using redis as an in-memory store to fetch rescore data. The rescorer also receives some run-time data as shown above. There are only a few checks that happen in the rescorer. The problem is that as the number of item preferences for a particular user increases (> 100), the number of calls to isFiltered() & rescore() increases massively. This is mainly due to the fact that for every user preference, the call to candidateStrategy.getCandidateItems(item) returns around 100+ similar items, and the rescorer is called for each of these items. Hence the need to cap the max number of similar items per item in the job. Is this correct or am I missing something here? What's the best way to optimise the rescorer in this case?
The MySQLJDBCInMemoryItemSimilarity uses GenericItemSimilarity to load item similarities in memory, and its .allsimilaritems(item) returns all possible similar items for a given item from the precomputed item similarities in mysql. Do I need to implement my own item similarity class to return the top 10/20/50 similar items? What if the user's number of preferences continues to grow?
It would be really great if anyone can tell me how to achieve the above? Thanks heaps !
What Preconditions check are you referring to? I don't see them; I'm not sure if similarity is actually prohibited from being > 1. But you seem to be asking whether you can make a similarity function that just returns co-occurrence, as an ItemSimilarity that is not used with Hadoop. Yes you can; it does not exist in the project. I would not advise this; LogLikelihoodSimilarity is going to be much smarter.
You need a different CandidateItemsStrategy; in particular, look at SamplingCandidateItemsStrategy and its javadoc. But this is a run-time element and is not related to Hadoop, whereas you mention a flag to the Hadoop job. That is not the same thing.
If rescoring is slow, it means, well, the IDRescorer is slow. It is called so many times that you certainly need to cache any lookup data in memory. But, reducing the number of candidates per above will also reduce the number of times this is called.
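To show why an in-memory cache helps when the same items are checked over and over, here is a language-neutral sketch in Python (the real IDRescorer is a Java interface, so this is only the idea, not Mahout code). fetch_item_metadata is a hypothetical stand-in for the slow lookup, and the item numbers are invented.

```python
from functools import lru_cache

backend_calls = 0


def fetch_item_metadata(item_id):
    """Hypothetical slow lookup (e.g. a network round trip to a key-value store)."""
    global backend_calls
    backend_calls += 1
    return {"blocked": item_id % 7 == 0}


@lru_cache(maxsize=100_000)
def is_filtered(item_id):
    """Cached filter decision; the backend is hit once per distinct item."""
    return fetch_item_metadata(item_id)["blocked"]


# 100 user preferences x 100 candidates each, drawn from 500 distinct items.
candidates = [(pref * 37 + offset) % 500 for pref in range(100) for offset in range(100)]
kept = [item for item in candidates if not is_filtered(item)]

print(len(candidates), "checks,", backend_calls, "backend calls")
```

This only works for data that depends on the item alone; anything derived from the per-request rescore parameters would have to be kept out of such a cache.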
No, don't implement your own similarity. Your issue is not the similarity measure but how many items are considered as candidates.
I am the author of much of the code you are talking about. I think you are wrestling with exactly the kinds of issues most people run into when trying to make item-based work at significant scale. You can, with enough sampling and tuning.
However, I am putting new development into a different project and company called Myrrix, which is developing a sort of 'next-gen' recommender based on the same APIs, but which ought to scale without these complications as it's based on matrix factorization. If you have time and interest, I strongly encourage you to have a look at Myrrix. Same APIs, the real-time Serving Layer is free/open, and the Hadoop-based Computation Layer back end is also available for testing.