n-gram modeling: how to conduct cross-validation

I'm trying to understand how cross-validation works in the context of n-gram models. I understand that in training, the model essentially records the probability of each n-gram in a corpus. However, how does cross-validation work here? What is the parameter that I should be adjusting? I know that I want to get 100% accuracy on the validation set, but I'm not sure what I need to adjust to get there. Is it something to do with smoothing?
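Since the question comes down to which knob cross-validation would tune for an n-gram model, here is a minimal from-scratch sketch of one possible setup, assuming the tunable hyperparameter is an add-k smoothing constant and that held-out perplexity (rather than raw accuracy) is used as the validation metric; the toy corpus, candidate k values and fold assignment are made up for illustration.

    import math
    from collections import Counter

    # Toy corpus of tokenised sentences (illustrative only).
    sentences = [
        "the cat sat on the mat".split(),
        "the dog sat on the rug".split(),
        "a cat saw the dog".split(),
        "the dog saw a cat".split(),
        "a dog sat on a mat".split(),
    ]
    # Vocabulary fixed up front for simplicity; in practice it would come from the training folds.
    vocab = {w for s in sentences for w in s} | {"<s>", "</s>"}

    def train_bigram_counts(train_sents):
        unigrams, bigrams = Counter(), Counter()
        for sent in train_sents:
            tokens = ["<s>"] + sent + ["</s>"]
            unigrams.update(tokens)
            bigrams.update(zip(tokens, tokens[1:]))
        return unigrams, bigrams

    def perplexity(test_sents, unigrams, bigrams, k, vocab_size):
        # Add-k smoothed bigram probability: (c(w1, w2) + k) / (c(w1) + k * V)
        log_prob, n_transitions = 0.0, 0
        for sent in test_sents:
            tokens = ["<s>"] + sent + ["</s>"]
            for w1, w2 in zip(tokens, tokens[1:]):
                p = (bigrams[(w1, w2)] + k) / (unigrams[w1] + k * vocab_size)
                log_prob += math.log(p)
                n_transitions += 1
        return math.exp(-log_prob / n_transitions)

    n_folds = 5
    for k in (0.01, 0.1, 0.5, 1.0):  # candidate smoothing constants
        fold_scores = []
        for fold in range(n_folds):
            held_out = [s for i, s in enumerate(sentences) if i % n_folds == fold]
            train = [s for i, s in enumerate(sentences) if i % n_folds != fold]
            uni, bi = train_bigram_counts(train)
            fold_scores.append(perplexity(held_out, uni, bi, k, len(vocab)))
        # The k with the lowest mean held-out perplexity is the one to keep.
        print(k, sum(fold_scores) / len(fold_scores))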

Related

If I train a custom tokenizer on my dataset, I should still be able to leverage the pre-trained model weights

This is stated as a fact in the title, but I'm not sure it is correct, so let me elaborate.
I have a fairly large dataset (23 GB). I'd like to continue pre-training RoBERTa-base or XLM-RoBERTa-base on it, so that the language model fits my data better and is more useful for downstream tasks.
I know I can just run it against my dataset for a few epochs and get good results. But what if I also train the tokenizer, producing a new vocabulary and merges file? Will the weights from the pre-trained model I started from still be usable, or will the new set of tokens demand complete training from scratch?
I'm asking because maybe some layers can still contribute knowledge, so the final model would have the best of both worlds: a tokenizer that fits my dataset, and the weights from the previous training.
Does that make sense?
In short: no.
You cannot use your own pretrained tokenizer with a pretrained model. The reason is that your tokenizer's vocabulary and the vocabulary of the tokenizer that was used to pretrain the model are different. Thus a word-piece token that is present in your tokenizer's vocabulary may not be present in the pretrained model's vocabulary.
Detailed answers can be found here.
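To make the vocabulary mismatch concrete, here is a hedged sketch using the Hugging Face tokenizers and transformers libraries (the tiny corpus, vocabulary size and example text are made up): the same string receives different token IDs under the two vocabularies, so the rows of the pretrained embedding matrix no longer line up with the custom tokenizer's IDs.

    from tokenizers import ByteLevelBPETokenizer
    from transformers import AutoTokenizer

    # Pretrained RoBERTa tokenizer: its IDs index rows of the pretrained embedding matrix.
    pretrained_tok = AutoTokenizer.from_pretrained("roberta-base")

    # A custom byte-level BPE tokenizer trained on your own corpus (toy corpus here).
    corpus = ["some domain specific text", "more text from my dataset"]
    custom_tok = ByteLevelBPETokenizer()
    custom_tok.train_from_iterator(corpus, vocab_size=500, min_frequency=1)

    text = "domain specific text"
    print(pretrained_tok(text)["input_ids"])  # IDs meaningful for roberta-base's embeddings
    print(custom_tok.encode(text).ids)        # IDs from a completely different vocabulary

    # Because the same ID maps to different strings in the two vocabularies, reusing the
    # pretrained embedding weights with the custom tokenizer would be meaningless.
    # If you only need a few extra domain tokens, an alternative is to extend the original
    # tokenizer and resize the embeddings (the new rows start randomly initialised):
    #   pretrained_tok.add_tokens(["mynewdomaintoken"])
    #   model.resize_token_embeddings(len(pretrained_tok))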

Model tuning with Cross validation

I have a model tuning object that fits multiple models and tunes each one of them to find the best hyperparameter combination for each of the models. I want to perform cross-validation on the model tuning part and this is where I am facing a dilemma.
Let's assume that I am fitting just one model, a random forest classifier, and performing 5-fold cross-validation. Currently, for the first fold that I leave out, I fit the random forest model and perform the model tuning. I am performing model tuning using the dlib package. I calculate the evaluation metric (accuracy, precision, etc.) and select the best hyper-parameter combination.
Now when I am leaving out the second fold, should I be tuning the model again? Because if I do, I will get a different combination of hyperparameters than I did in the first case. If I do this across the five folds, what combination do I select?
The cross-validators in Spark and sklearn use grid search, so for each fold they have the same hyper-parameter combinations and don't have to worry about the combinations changing across folds.
Choosing the best hyper-parameter combination that I get when I leave out the first fold and using it for the subsequent folds doesn't sound right because then my entire model tuning is dependent on which fold got left out first. However, if I am getting different hyperparameters each time, which one do I settle on?
TLDR:
If you are performing, let's say, derivative-based model tuning along with cross-validation, your hyper-parameter combination changes as you iterate over the folds. How do you select the best combination then? Generally speaking, how do you use cross-validation with derivative-based model tuning methods?
PS: Please let me know if you need more details
This is more of a comment, but it is too long for one, so I am posting it as an answer instead.
Cross-validation and hyperparameter tuning are two separate things. Cross-validation is done to get a sense of the out-of-sample prediction error of the model. You can do this by having a dedicated validation set, but this raises the question of whether you are overfitting to this particular validation data. As a consequence, we often use cross-validation, where the data are split into k folds and each fold is used once for validation while the others are used for fitting. After you have done this for each fold, you combine the prediction errors into a single metric (e.g. by averaging the error from each fold). This then tells you something about the expected performance on unseen data, for a given set of hyperparameters.
Once you have this single metric, you can change your hyperparameters, repeat, and see whether you get a lower error with the new hyperparameters. This is the hyperparameter tuning part. The CV part is just about getting a good estimate of the model performance for the given set of hyperparameters, i.e. you do not change hyperparameters 'between' folds.
I think one source of confusion might be the distinction between hyperparameters and parameters (sometimes also referred to as 'weights', 'feature importances', 'coefficients', etc). If you use a gradient-based optimization approach, these change between iterations until convergence or a stopping rule is reached. This is however different from hyperparameter search (e.g. how many trees to plant in the random forest?).
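To make that ordering concrete, here is a minimal sklearn sketch (synthetic data and a made-up candidate grid): one full 5-fold CV run is performed per hyperparameter setting, the setting with the best mean score wins, and nothing changes between the folds of a single run.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=500, n_features=20, random_state=0)

    candidate_settings = [{"n_estimators": 50}, {"n_estimators": 200}, {"n_estimators": 500}]

    results = []
    for params in candidate_settings:
        model = RandomForestClassifier(random_state=0, **params)
        # One complete 5-fold CV run per setting; the setting is fixed across its folds.
        scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
        results.append((scores.mean(), params))

    best_score, best_params = max(results, key=lambda r: r[0])
    print(best_params, best_score)

    # Final model: refit with the winning hyperparameters on all the training data.
    final_model = RandomForestClassifier(random_state=0, **best_params).fit(X, y)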
By the way, I think questions like these are better posted on the Cross Validated or Data Science Stack Exchange sites.

Which model to pick from K fold Cross Validation

I was reading about cross-validation and about how it is used to select the best model and estimate parameters, but I did not really understand what that means.
Suppose I build a linear regression model and go for 10-fold cross-validation. I think each of the 10 fits will have different coefficient values, so which of the 10 should I pick as my final model, or how do I estimate the parameters?
Or do we use cross-validation only for the purpose of finding an average error (the average over the 10 models in our case) and comparing it against another model?
If you build a linear regression model and go for 10-fold cross-validation, then indeed each of the 10 fits will have different coefficient values. The reason you use cross-validation is to get a robust idea of the error of your linear model, rather than evaluating it on one train/test split only, which could be unfortunate or too lucky. CV is more robust because it is very unlikely that all ten splits are lucky or all ten unfortunate.
Your final model is then trained on the whole training set - this is where your final coefficients come from.
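A short sketch of that workflow on synthetic data: the 10-fold scores are only used to judge the model, and the reported coefficients come from one final fit on the full training set.

    from sklearn.datasets import make_regression
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score

    X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

    # 10-fold CV: ten temporary fits, used only to estimate the out-of-sample error.
    scores = cross_val_score(LinearRegression(), X, y, cv=10,
                             scoring="neg_mean_squared_error")
    print("estimated MSE:", -scores.mean())

    # The final model (and its coefficients) comes from fitting on all the training data.
    final_model = LinearRegression().fit(X, y)
    print("final coefficients:", final_model.coef_)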
Cross-validation is used to see how good your model's predictions are. It is a clever way of running multiple tests on the same data by splitting it, as you probably know (i.e. it is useful when you don't have enough training data).
As an example, it can be used to make sure you aren't overfitting the function. So basically you try your function with cross-validation when you've finished it, and if you see that the error grows a lot somewhere, you go back to tweaking the parameters.
Edit:
Read the Wikipedia article for a deeper understanding of how it works: https://en.wikipedia.org/wiki/Cross-validation_%28statistics%29
You are basically confusing grid search with cross-validation. The idea behind cross-validation is to check how well a model will perform in, say, a real-world application. So we repeatedly split the data in different ways and validate the model's performance on the held-out parts. It should be noted that the hyperparameters of the model remain the same throughout the cross-validation process.
In grid search we try to find the hyperparameter values that give the best results over a specific split of the data (say 70% train and 30% test). So in this case, for different configurations of the same model, the dataset remains constant.
Read more about cross-validation here.
Cross Validation is mainly used for the comparison of different models.
For each model, you can compute the average generalization error over the k validation sets. You can then choose the model with the lowest average generalization error as your optimal model.
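For example, a minimal sklearn sketch of such a comparison (dataset and models chosen purely for illustration): both methods are scored on the same folds, and the one with the better average validation score is kept.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import KFold, cross_val_score
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=400, n_features=15, random_state=1)
    cv = KFold(n_splits=5, shuffle=True, random_state=1)  # same folds for both models

    for name, model in [("SVM", SVC()),
                        ("Random Forest", RandomForestClassifier(random_state=1))]:
        scores = cross_val_score(model, X, y, cv=cv)
        print(f"{name}: mean accuracy = {scores.mean():.3f}")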
Cross-Validation or CV allows us to compare different machine learning methods and get a sense of how well they will work in practice.
Scenario-1 (Directly related to the question)
Yes, CV can be used to know which method (SVM, Random Forest, etc) will perform best and we can pick that method to work further.
(For each method, several models are generated and evaluated on the folds, an average metric is calculated per method, and the best average metric tells you which method to select.)
After identifying the best method (or the best parameters), we can train/retrain our model on the full training dataset.
Hyperparameters can be determined by grid-search techniques. See grid search.
Scenario-2:
Suppose you have a small amount of data and you want to perform training, validation and testing on it. Dividing such a small amount of data into three sets reduces the number of training samples drastically, and the result will depend on the particular choice of training and validation sets.
CV comes to the rescue here. In this case, we don't need a separate validation set, but we still need to hold out the test data.
A model is trained on k-1 folds of the training data and the remaining fold is used for validation. A mean and standard deviation of the metric are computed to see how well the model will perform in practice.
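A brief sketch of Scenario-2 on synthetic data (the choice of model is arbitrary): the test set is held out once, and cross-validation on the remaining data stands in for a separate validation set, reporting a mean and standard deviation of the metric.

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score, train_test_split

    X, y = make_classification(n_samples=300, random_state=2)

    # Hold out a test set once; no separate validation set is needed.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)

    model = LogisticRegression(max_iter=1000)
    scores = cross_val_score(model, X_train, y_train, cv=5)
    print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")

    # Only once the model choices are final: fit on all training data, evaluate once on the test set.
    print("test accuracy:", model.fit(X_train, y_train).score(X_test, y_test))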

Validation of hurdle model?

I built a hurdle model, and then used that model to predict from known to unknown data points using the predict command. Is there a way to validate the model and these predictions? Do I have to do this in two parts, for example using sensitivity and specificity for the binomial part of the model?
Any other ideas for how to assess the validity of this model?
For validating predictive models, I usually trust Cross-Validation.
In short: with cross-validation you can measure the predictive performance of your model using only the training data (data with known results). Thus you can get a general idea of how well your model works. Cross-validation works quite well for a wide variety of models. The downside is that it can get quite computationally heavy.
With large datasets, 10-fold cross-validation is enough. The smaller your dataset is, the more folds you have to use (with very small datasets you end up doing leave-one-out cross-validation).
With cross-validation, you get predictions for the whole data set. You can then compare these predictions to the actual outputs and measure how well your model performed.
Cross-validated results can take a bit to understand in more complicated comparisons, but for your general purpose question "how to assess the validity of the model", the results should be quite easy to use.
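If you do want to score the two parts separately, here is a hedged sketch of the metrics only; the arrays y_true and y_pred are placeholders for your observed counts and cross-validated predictions, and the 0.5 threshold is an arbitrary choice. This is not tied to any particular hurdle-model implementation.

    import numpy as np
    from sklearn.metrics import confusion_matrix, mean_squared_error

    # Placeholder arrays: observed counts and cross-validated predicted counts.
    y_true = np.array([0, 0, 3, 1, 0, 4, 0, 2, 0, 5])
    y_pred = np.array([0.2, 0.8, 2.5, 0.4, 0.1, 3.6, 1.2, 1.9, 0.3, 4.4])

    # Binomial (hurdle) part: did the model get zero vs non-zero right?
    obs_nonzero = (y_true > 0).astype(int)
    pred_nonzero = (y_pred >= 0.5).astype(int)  # arbitrary threshold for illustration
    tn, fp, fn, tp = confusion_matrix(obs_nonzero, pred_nonzero).ravel()
    print("sensitivity:", tp / (tp + fn), "specificity:", tn / (tn + fp))

    # Count part: error on the observations that cleared the hurdle.
    positive = y_true > 0
    print("RMSE on positive counts:", mean_squared_error(y_true[positive], y_pred[positive]) ** 0.5)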

WEKA PartitionMembership filter

I have a question regarding the supervised PartitionMembership filter in WEKA.
When applying this filter using J48 as partition generator, I am able to achieve a much higher accuracy in combination with the KStar classifier.
What exactly does this filter do? The documentation provided by WEKA is quite limited. And is it valid to use this filter to get increased accuracy?
When applying this filter to my training set, it generates a number of classes. When I try to reapply the model to my test set, the filter generates a different number of classes. Hence, I am not able to use the supervised PartitionMembership filter trained on the training set on my test set. How can I apply the PartitionMembership filter that was trained on the training set to the test set as well?
You are asking two or three questions here. Regarding the first two (what does the PartitionMembership filter do, and how do I use it?): that I don't know how to answer properly. Ultimately you can read the source code to check it out.
For the latter question (how do I get it to evaluate my test set?): use the FilteredClassifier, and choose your filter and your classifier in the dialog box of that classifier.
NAME: weka.classifiers.meta.FilteredClassifier
SYNOPSIS: Class for running an arbitrary classifier on data that has been passed through an arbitrary filter. Like the classifier, the structure of the filter is based exclusively on the training data and test instances will be processed by the filter without changing their structure.
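For readers more used to Python, the same idea, fitting the filter on the training data only and then applying it unchanged to the test data, is what an sklearn Pipeline does. This is only an analogy (with PCA standing in for the PartitionMembership filter and k-NN for KStar), not WEKA's own API.

    from sklearn.datasets import make_classification
    from sklearn.decomposition import PCA
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import make_pipeline

    X, y = make_classification(n_samples=300, n_features=10, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # The "filter" (here PCA) is fitted on the training data only; test instances are
    # transformed with that fixed structure, which is what wrapping a filter and a
    # classifier in WEKA's FilteredClassifier achieves.
    clf = make_pipeline(PCA(n_components=5), KNeighborsClassifier())
    clf.fit(X_train, y_train)
    print("test accuracy:", clf.score(X_test, y_test))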
