BayesSearchCV of LGBMRegressor: how to weight samples in both training and CV scoring? - lightgbm

While optimizing LightGBM hyperparameters, I'd like to individually weight samples during both training and CV scoring. From the BayesSearchCV docs, it seems that one way to do that could be to insert an LGBMRegressor sample_weight key into the BayesSearchCV fit_params option. But this is not clear, because both BayesSearchCV and LGBMRegressor have fit methods.
To which fit method do the BayesSearchCV fit_params go? And is using fit_params really the way to weight samples during both training and CV scoring?

Based on the documentation I believe fit_params is passed as an argument upon BayesSearchCV() instantiation, not when the .fit() method is called.
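One way to be sure the weights enter both steps is to write the cross-validation loop out by hand. The following is only a sketch under stated assumptions, not a description of what BayesSearchCV does internally: it uses scikit-learn's Ridge as a stand-in for LGBMRegressor (any regressor whose fit accepts sample_weight could take its place), synthetic data, and a hypothetical helper named weighted_cv_mse.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold


def weighted_cv_mse(estimator_factory, X, y, sample_weight, n_splits=5, seed=0):
    """Mean weighted MSE over K folds (hypothetical helper, not a library API)."""
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in cv.split(X):
        model = estimator_factory()
        # The weights are used while training on the fold ...
        model.fit(X[train_idx], y[train_idx], sample_weight=sample_weight[train_idx])
        pred = model.predict(X[test_idx])
        # ... and again when scoring the held-out part of the fold.
        scores.append(
            mean_squared_error(y[test_idx], pred, sample_weight=sample_weight[test_idx])
        )
    return float(np.mean(scores))


# Synthetic data and weights, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)
w = rng.uniform(0.5, 2.0, size=200)

# Ridge is a stand-in estimator; alpha plays the role of a hyperparameter.
score = weighted_cv_mse(lambda: Ridge(alpha=1.0), X, y, w)
print(score)
```

A value computed this way could serve as the objective that a hyperparameter search minimizes, since the weights are then guaranteed to reach both the fit call and the metric.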

Related

Why does the decision tree algorithm in Python change every run?

I am following a course on Udemy about data science with Python.
The course is focused on the output of the algorithm and less on the algorithm itself.
In particular, I am fitting a decision tree. Every time I run the algorithm in Python, even with the same samples, it gives me a slightly different decision tree. I asked the tutors and they told me "The decision trees does not guarantee the same results each run because of its nature." Can someone explain why in more detail, or recommend a good book about it?
I built the decision tree for my data with these imports:
import numpy as np
import pandas as pd
from sklearn import tree
and these commands:
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X,y)
where X are my feature data and y is my target data
Thank you
The DecisionTreeClassifier() function is apparently documented here:
https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
So this function has many arguments. But in Python, function arguments may have default values. Here, all arguments have default values, so you can even call the function with an empty argument list, like this:
clf = tree.DecisionTreeClassifier()
The parameter of interest, random_state, is documented like this:
random_state: int, RandomState instance or None, default=None
So your call is equivalent to, among many other things:
clf = tree.DecisionTreeClassifier(random_state=None)
The None value tells the library that you don't want to bother with providing a seed (that is, an initial state) to the underlying pseudo-random number generator. Hence, the library has to come up with some seed.
Typically, it will take the current time value, with microsecond precision if possible, and apply some hash function. So at every call you will get a different initial state, and so a different sequence of pseudo-random numbers. Hence, a different tree.
You might want to try forcing the seed. For example:
clf = tree.DecisionTreeClassifier(random_state=42)
and see if your problem persists.
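A quick way to check the effect of a fixed seed is to fit the tree twice and compare the printed structures. This is a minimal sketch: the synthetic dataset from make_classification stands in for the X and y of the question, and fitted_tree_text is a hypothetical helper.

```python
from sklearn import tree
from sklearn.datasets import make_classification

# Synthetic data standing in for the X and y of the question.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)


def fitted_tree_text(seed):
    """Fit a tree with the given seed and return its structure as text."""
    clf = tree.DecisionTreeClassifier(random_state=seed)
    clf.fit(X, y)
    return tree.export_text(clf)


# With the same seed and the same data, the two fits should match exactly.
same_seed_equal = fitted_tree_text(42) == fitted_tree_text(42)
print(same_seed_equal)
```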
As for why the decision tree requires pseudo-random numbers, this is discussed, for example, here:
According to scikit-learn’s “best” and “random” implementation [4], both the “best” splitter and the “random” splitter uses Fisher-Yates-based algorithm to compute a permutation of the features array.
The Fisher-Yates algorithm is the most common way to compute a random permutation. Also, if stopped before completion, it can be used to extract a random subset of the data sample, for example if you need a random 10% of the sample to be excluded from the data fitting and set aside for a later cross-validation step.
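For illustration, here is a small pure-Python sketch of the Fisher-Yates shuffle and of the "stop early" variant that draws a random subset. It is not scikit-learn's implementation; the function names and the toy list of feature indices are made up for the example.

```python
import random


def fisher_yates_shuffle(items, rng):
    """Return a shuffled copy of items using the Fisher-Yates algorithm."""
    a = list(items)
    for i in range(len(a) - 1, 0, -1):
        j = rng.randint(0, i)  # randint bounds are inclusive
        a[i], a[j] = a[j], a[i]
    return a


def random_subset(items, k, rng):
    """Stop Fisher-Yates after k swaps to draw k distinct items."""
    a = list(items)
    n = len(a)
    for i in range(k):
        j = rng.randint(i, n - 1)
        a[i], a[j] = a[j], a[i]
    return a[:k]


features = list(range(10))  # toy feature indices

# The same seed gives the same permutation; a different seed usually does not.
perm_a = fisher_yates_shuffle(features, random.Random(42))
perm_b = fisher_yates_shuffle(features, random.Random(42))

# Three distinct items set aside, e.g. for a later validation step.
holdout = random_subset(features, 3, random.Random(7))
print(perm_a, holdout)
```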
Side note: in some circumstances, non-reproducibility can become a pain point, for example if you want to study the influence of an external parameter, say some global Y values bias. In that case, you don't want uncontrolled changes in the random numbers to blur the effects of your parameter changes. Hence the need for the API to provide some way to control the seed value.

Exogenous variables in hmmlearn's GaussianHMM

I am trying to use hmmlearn's GaussianHMM to fit a Hidden Markov Model with 2 main states, while allowing for multiple exogenous variables. My goal is to determine two states of GDP growth (one with low variance and the other with high variance); these states then depend on lagged unemployment, lagged commercial confidence level, etc. I have a couple of questions:
Using hmmlearn's GaussianHMM, I have read through the documentation but I cannot find any mention of exogenous variables. Looking at the method fit(X, lengths=None), I see that X can have n_features columns. Do I understand correctly that I should pass in an array with the first column being the endogenous variable (GDP growth in my case) and the remaining columns being the exogenous variables?
Is hmmlearn's GaussianHMM equivalent to statsmodels.tsa.regime_switching.markov_regression.MarkovRegression? That model allows for exog_tvtp, which means that exogenous variables are used to calculate a time-varying transition probability matrix.
Here is an example of fitting the monthly returns of the S&P500, with no exogenous variable.
import numpy as np
import pandas as pd
from hmmlearn.hmm import GaussianHMM
import yfinance as yf
sp500 = yf.download("^GSPC")["Adj Close"]
# Fitting an absolute return model because we only care about volatility #
rets = np.log(sp500/sp500.shift(1)).dropna()
rets.index = pd.to_datetime(rets.index)
rets = rets.resample("M").sum()
model = GaussianHMM(n_components=2)
model.fit(rets.to_frame())
state_sequence = model.predict(rets.to_frame())
Suppose I want to add a dependency on exogenous variables to the returns of the S&P500, for example on economic growth or past volatilities. Is there a way to do this?
Thanks for any help.
In hmmlearn, X has shape (n_samples, n_features): the rows are the time steps and the columns are the observed variables. n_features should not be conflated with the regressors of, e.g., a regression model.
If your hidden states are the two states of GDP growth, then the observed variables (or emissions) that you are trying to infer the hidden states from form the feature space (a.k.a. n_features).
Every column you pass is modeled as part of the emission, jointly with the others. GaussianHMM has no notion of exogenous variables, so extra columns will not act as drivers of the state transitions.
Suggestions
If I understand your question correctly, perhaps what you might be looking for are Kalman filters. A Kalman filter produces estimates of unknowns based on multiple measurements (i.e. all of your exogenous variables), and these estimates tend to be more accurate than those based on a single measurement.
If you wish each hidden state to have multiple independent emissions then what you might be looking for is a structured perceptron. This is discussed here: Hidden Markov Model for multiple observed variables
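To make concrete what "time-varying transition probabilities" means in the question, here is a minimal NumPy sketch of a two-state chain whose probability of staying in each state is a logistic function of one lagged exogenous series. Everything in it is an assumption made for illustration: the series z, the coefficients, and the helper transition_matrix are made up, and the code uses neither the hmmlearn nor the statsmodels API, nor necessarily the parameterization statsmodels uses.

```python
import numpy as np


def transition_matrix(z_prev, coef):
    """2x2 transition matrix whose rows depend on a lagged exogenous value.

    coef[j] = (intercept, slope) for the probability of staying in state j.
    """
    stay = 1.0 / (1.0 + np.exp(-(coef[:, 0] + coef[:, 1] * z_prev)))
    return np.array([[stay[0], 1.0 - stay[0]],
                     [1.0 - stay[1], stay[1]]])


rng = np.random.default_rng(0)
z = rng.normal(size=50)                     # made-up exogenous series
coef = np.array([[2.0, 1.0], [1.0, -1.5]])  # made-up coefficients

# Simulate the hidden state path: the transition matrix changes at every step.
states = [0]
for t in range(1, len(z)):
    P = transition_matrix(z[t - 1], coef)
    states.append(int(rng.choice(2, p=P[states[-1]])))
states = np.array(states)
print(states)
```

In a plain HMM the matrix P would be the same at every time step; here it is recomputed from the lagged exogenous value before each transition.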

H2O document question for stopping_tolerance, score_each_iteration, score_tree_interval, etc

I have the following questions that still confuse me after reading the H2O documentation. Can someone explain them?
For stopping_tolerance = 0.001, let's use AUC as an example, with a current AUC of 0.8. Does the AUC need to increase to 0.8 + 0.001, or to 0.8*(1 + 0.1%)?
For score_each_iteration, the H2O documentation (http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/score_each_iteration.html) just says "iteration". But what exactly is the definition of an "iteration": each tree, each grid search, each K-fold cross-validation, or something else?
Can I define score_tree_interval and set score_each_iteration = True at the same time, or can I only use one of them to make the grid search repeatable?
Is there any difference between putting 'stopping_metric', 'stopping_tolerance', and 'stopping_rounds' in H2OGradientBoostingEstimator and putting them in the search_criteria of H2OGridSearch? I found that putting them in H2OGradientBoostingEstimator makes the code run much faster when I test it in a Spark environment.
0.001 is the same as 0.1%. For AUC, since bigger is better, you will want to see an increase of at least .001 after a specified number of scoring rounds.
You have linked to a portion of the documentation that is specific to the algorithms listed under "Available in" at the top of the page, so let's answer this question with respect to individual models and not grid search. If you want to see what is being scored at each iteration, take a look at your model results in Flow or use my_model.plot() (for the Python API). For GBM and DRF an iteration is a tree (ntrees), but because different algorithms iterate over different things, the more generic word "iteration" is used.
Did you test this out? What did you find when you did? Take a look at the scoring history plot in Flow and notice what happens when you set both score_tree_interval and score_each_iteration = True versus when you only set score_tree_interval. (I would recommend trying to understand these parameters at the individual model level before you use grid search.)
Yes. In one case you are specifying early stopping as you build an individual model; in the case of grid search you are indicating whether or not to build more models.
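For reference, the two readings of stopping_tolerance raised in the question lead to different thresholds. The sketch below is plain arithmetic on the numbers from the question; it does not settle which reading H2O applies, and required_score is a hypothetical helper, not part of the H2O API.

```python
import math


def required_score(current, tolerance, relative):
    """Score needed to count as an improvement when bigger is better."""
    if relative:
        return current * (1.0 + tolerance)
    return current + tolerance


current_auc = 0.8
tolerance = 0.001

# Absolute reading: 0.8 + 0.001
absolute_target = required_score(current_auc, tolerance, relative=False)
# Relative reading: 0.8 * (1 + 0.1%)
relative_target = required_score(current_auc, tolerance, relative=True)

print(round(absolute_target, 6), round(relative_target, 6))
```

With a current AUC below 1, the relative reading always asks for a smaller gain than the absolute one, so the two can disagree on whether a given round counts as an improvement.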

Validation Split and Checkpoint Best Model in Keras

Let us use a validation split of 0.3 when fitting a Sequential model. What will be used for validation, the first or the last 30% samples?
Secondly, checkpointing the best model saves the best model weights in .hdf5 file format. Does this mean that, for a certain experiment, the saved model is the best tuned model?
For your first question, the last 30% samples will be used for validation.
From Keras documentation:
validation_split: Float between 0 and 1. Fraction of the training data to be used as validation data. The model will set apart this fraction of the training data, will not train on it, and will evaluate the loss and any model metrics on this data at the end of each epoch. The validation data is selected from the last samples in the x and y data provided, before shuffling
For your second question, I assume that you're talking about ModelCheckpoint with save_best_only=True. In this case, this callback saves the weights of a given epoch only if monitor ('val_loss', by default) is better than the best monitored value. Concretely, this happens here. If monitor is 'val_loss', this should be the tuned model for a particular setting of hyperparameters, according to the validation loss.
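Both behaviours can be mimicked without Keras. The sketch below only illustrates the logic described above and is not Keras's own code: split_last_fraction and best_epoch are hypothetical helpers, and the validation losses are made-up numbers.

```python
import numpy as np


def split_last_fraction(x, y, validation_split):
    """Hold out the last fraction of the samples, without shuffling."""
    split_at = int(len(x) * (1.0 - validation_split))
    return (x[:split_at], y[:split_at]), (x[split_at:], y[split_at:])


def best_epoch(val_losses):
    """Index of the epoch a 'save only when improved' rule would keep."""
    best, best_idx = float("inf"), None
    for epoch, loss in enumerate(val_losses):
        if loss < best:  # overwrite the checkpoint only on improvement
            best, best_idx = loss, epoch
    return best_idx


x = np.arange(10)
y = x * 2
(train_x, train_y), (val_x, val_y) = split_last_fraction(x, y, 0.3)
print(train_x, val_x)

# Made-up validation losses for five epochs.
kept = best_epoch([0.9, 0.7, 0.75, 0.6, 0.65])
print(kept)
```

The checkpoint that survives is the best epoch of one training run only; a different setting of hyperparameters would need its own run and its own comparison.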

Confusion Matrix of Bayesian Network

I'm trying to understand Bayesian networks. I have a data file with 10 attributes, and I want to obtain the confusion table for this data. I thought I needed to calculate TP, FP, FN, and TN for all fields. Is that true? If so, what do I need to do for a Bayesian network?
I really need some guidance, I'm lost.
The process usually goes like this:
You have some labeled data instances which you want to use to train a classifier, so that it can predict the class of new unlabeled instances.
Using your classifier of choice (neural networks, Bayes net, SVM, etc.), we build a model with your training data as input.
At this point, you usually would like to evaluate the performance of the model before deploying it. So using a previously unused subset of the data (test set), we compare the model classification for these instances against that of the actual class. A good way to summarize these results is by a confusion matrix which shows how each class of instances is predicted.
For binary classification tasks, the convention is to assign one class as positive, and the other as negative. Thus from the confusion matrix, the percentage of positive instances that are correctly classified as positive is known as the True Positive (TP) rate. The other definitions follow the same convention...
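As a small worked example of these definitions, the four cells of a binary confusion matrix can be counted directly from the true and predicted labels. The labels below are made up, and binary_confusion is a hypothetical helper written for this illustration.

```python
def binary_confusion(y_true, y_pred, positive=1):
    """Count TP, FP, FN and TN for one class treated as positive."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1
            else:
                fp += 1
        else:
            if actual == positive:
                fn += 1
            else:
                tn += 1
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}


# Made-up test-set labels and model predictions.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

counts = binary_confusion(y_true, y_pred)
# TP rate: share of actual positives that were predicted positive.
tp_rate = counts["TP"] / (counts["TP"] + counts["FN"])
print(counts, tp_rate)
```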
Confusion matrix is used to evaluate the performance of a classifier, any classifier.
What you are asking is a confusion matrix with more than two classes.
Here are the steps:
Build a classifier for each class, where the training set consists of the set of documents in the class (positive labels) and its complement (negative labels).
Given the test document, apply each classifier separately.
Assign the document to the class with the maximum score, the maximum confidence value, or the maximum probability.
Here is the reference for a paper where you can find more information:
Picca, Davide, Benoît Curdy, and François Bavaud. 2006. Non-linear correspondence analysis in text retrieval: A kernel view. In Proc. JADT.
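The three steps above can be sketched with scikit-learn. This is only an illustration under stated assumptions: the iris dataset stands in for the documents, logistic regression stands in for the classifier of choice (a Bayesian network or any other model producing a score would fit the same scheme), and the resulting multi-class confusion matrix is computed with sklearn.metrics.confusion_matrix.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Iris is a stand-in dataset with three classes.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)
classes = np.unique(y_train)

# Step 1: one binary classifier per class (the class vs. its complement).
models = [
    LogisticRegression(max_iter=1000).fit(X_train, (y_train == c).astype(int))
    for c in classes
]

# Step 2: apply every classifier to each test instance.
scores = np.column_stack([m.predict_proba(X_test)[:, 1] for m in models])

# Step 3: assign the class whose classifier gives the maximum probability.
y_pred = classes[scores.argmax(axis=1)]

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred, labels=classes)
print(cm)
```

Per-class TP, FP, FN and TN can then be read off this matrix by treating one class as positive and all the others as negative.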
