How to deal with categorical variables in a Calibrated Classifier? - categorical-data

I am working on a calibration curve for a CatBoost model.
cat=CatBoostClassifier()
calib=CalibratedClassifierCV(base_estimator=cat, method='sigmoid', cv=2)
calib.fit(XX,yy,cat_features=??)
How can I deal with categorical variables in the fit of the calibrated classifier?
Thanks :)

You need to pass the categorical feature indices to the model constructor.
In your case:
cat=CatBoostClassifier(cat_features=categorical_positions)
and then continue as you wrote.
categorical_positions is a list of the categorical feature indices.
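Putting this together with the original snippet, a minimal sketch with made-up toy data (note that newer scikit-learn versions name the base_estimator parameter estimator):

import pandas as pd
from catboost import CatBoostClassifier
from sklearn.calibration import CalibratedClassifierCV

# Toy data: column 0 is categorical (already integer-coded), column 1 is numeric.
XX = pd.DataFrame({"industry": [0, 1, 2, 0, 1, 2, 0, 1],
                   "amount":   [10.0, 5.2, 7.1, 3.3, 8.8, 2.4, 6.0, 9.9]})
yy = [0, 1, 0, 1, 0, 1, 0, 1]
categorical_positions = [0]  # indices of the categorical columns

cat = CatBoostClassifier(cat_features=categorical_positions, iterations=50, verbose=0)
calib = CalibratedClassifierCV(base_estimator=cat, method="sigmoid", cv=2)
calib.fit(XX, yy)            # no cat_features argument needed here
probs = calib.predict_proba(XX)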

The problem is that sklearn's CalibratedClassifierCV doesn't support string values.
To overcome this, change the string values of the categorical features to integer values (for example, enumerate them). CatBoost will still treat them as categorical, because you have listed them in the cat_features parameter of CatBoostClassifier, so the metrics will be the same.
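A minimal sketch of the enumeration idea with pandas (the column name is made up):

import pandas as pd

# A categorical column holding strings; map each distinct value to an integer code.
XX = pd.DataFrame({"industry": ["retail", "mining", "retail", "tech"]})
codes = {value: i for i, value in enumerate(XX["industry"].unique())}
XX["industry"] = XX["industry"].map(codes)
# CatBoost still treats the column as categorical because its index
# appears in cat_features, so the metrics are unchanged.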

Related

When to use StringIndexer vs StringIndexer+OneHotEncoder?

When / in what context should you use StringIndexer vs StringIndexer+OneHotEncoder?
Looking at the docs for Spark ML's StringIndexer (https://spark.apache.org/docs/latest/ml-features#stringindexer) and OneHotEncoder (https://spark.apache.org/docs/latest/ml-features#onehotencoder), it's not obvious to me when to use just StringIndexer vs StringIndexer+OneHotEncoder (I've been using just a StringIndexer on a benchmarking dataset and getting pretty good results as is, but I suppose that does not mean that doing this is necessarily "correct"). The OneHotEncoder docs refer to a StringIndexer > OneHotEncoder > VectorAssembler staging pipeline, but the way it is worded makes that seem optional (vs just doing StringIndexer > VectorAssembler).
Can anyone clarify this for me?
First, it is necessary to use StringIndexer before OneHotEncoder, because OneHotEncoder needs a column of category indices as input.
To answer your question, StringIndexer may bias some machine learning models. For instance, if you pass a data frame with a categorical column indexed into three classes (0, 1, and 2) to a linear regression model, the model may conclude that class 2 is "twice" class 1, even though the indices are just labels for different classes. A vector of zeros with a one at a specific position conveys the class difference without implying any ordering. So in the end it depends on the model used during training; tree-based models are sensitive to one-hot encoding and tend to perform worse with one-hot encoded vectors.
You may consider reading the "Create a Pipeline" section of Learning Spark for more details behind one-hot encoding.
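To make the staging concrete, here is a minimal PySpark sketch of the StringIndexer > OneHotEncoder > VectorAssembler pipeline (column names and data are made up):

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("red", 1.0), ("blue", 0.0), ("green", 1.0)],
    ["color", "label"],
)

indexer = StringIndexer(inputCol="color", outputCol="color_idx")
encoder = OneHotEncoder(inputCols=["color_idx"], outputCols=["color_ohe"])
assembler = VectorAssembler(inputCols=["color_ohe"], outputCol="features")

# Dropping the encoder stage and assembling "color_idx" directly gives the
# StringIndexer-only variant discussed above.
pipeline = Pipeline(stages=[indexer, encoder, assembler])
pipeline.fit(df).transform(df).show()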

Exogenous variables in hmmlearn's GaussianHMM

I am trying to use hmmlearn's GaussianHMM to fit a Hidden Markov Model with 2 main states, while allowing for multiple exogenous variables. My goal is to determine two states of GDP growth (one with low variance and the other with high variance), these states then depend on lagged unemployment, lagged commercial confidence level etc. I have a couple of questions:
Using hmmlearn's GaussianHMM, I have read through the documentation but cannot find any mention of exogenous variables. Looking at the method fit(X, lengths=None), I see that X can have n_features columns. Do I understand correctly that I should pass in an array whose first column is the endogenous variable (GDP growth in my case) and whose remaining columns are the exogenous variables?
Is hmmlearn's GaussianHMM equivalent to statsmodels.tsa.regime_switching.markov_regression.MarkovRegression? That model allows for exog_tvtp, which means that exogenous variables are used to calculate a time-varying transition probability matrix.
Here is an example of fitting the monthly returns of the S&P 500, with no exogenous variables.
import numpy as np
import pandas as pd
from hmmlearn.hmm import GaussianHMM
import yfinance as yf
sp500 = yf.download("^GSPC")["Adj Close"]
# Fitting an absolute return model because we only care about volatility #
rets = np.log(sp500/sp500.shift(1)).dropna()
rets.index = pd.to_datetime(rets.index)
rets = rets.resample("M").sum()
model = GaussianHMM(n_components=2)
model.fit(rets.to_frame())
state_sequence = model.predict(rets.to_frame())
Now imagine I want to add a dependency on exogenous variables to the returns of the S&P 500, for example on economic growth or past volatilities. Is there a way to do this?
Thanks for any help.
n_features can be thought of as the temporal domain, and should not be conflated with features that describe the complexity of, e.g., a regression model.
If your hidden states are the two states of GDP growth, then the observed variable (or emissions) that you are trying to infer the hidden states from should be the feature space (a.k.a. n_features).
This should be a single measurement (emission) descriptive of a combination of your "exogenous variables", collected over time. hmmlearn will not be able to take multivariate emissions.
Suggestions
If I understand your question correctly, perhaps what you are looking for are Kalman filters. A Kalman filter produces estimates of unknowns based on multiple measurements (i.e., all of your exogenous variables), ultimately producing a model more accurate than one based on a single measurement.
If you wish each hidden state to have multiple independent emissions, then what you might be looking for is a structured perceptron. This is discussed here: Hidden Markov Model for multiple observed variables
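Regarding the second question in this thread, a minimal sketch of statsmodels' MarkovRegression with time-varying transition probabilities; the data here are random placeholders, so only the API usage is illustrative:

import numpy as np
import statsmodels.api as sm
from statsmodels.tsa.regime_switching.markov_regression import MarkovRegression

np.random.seed(0)
gdp_growth = np.random.randn(200)           # placeholder for the endogenous series
lagged_unemployment = np.random.randn(200)  # placeholder exogenous driver
exog_tvtp = sm.add_constant(lagged_unemployment)  # column of ones adds the intercept

model = MarkovRegression(
    gdp_growth,
    k_regimes=2,
    switching_variance=True,  # low- vs high-variance regime
    exog_tvtp=exog_tvtp,      # drives the time-varying transition probabilities
)
res = model.fit()
print(res.smoothed_marginal_probabilities)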

Shall I treat Industry Classification codes as double data type in K-means clustering?

Since K-means cannot handle categorical variables directly, I want to know whether it is correct to convert International Standard Industrial Classification of All Economic Activities (ISIC) codes into double data types to cluster them using K-means along with other financial and transactional data, or whether I should try other techniques such as one-hot encoding.
The biggest assumption is that ISIC codes are categorical, not numeric, variables, since code "2930" refers to "Manufacture of parts and accessories for motor vehicles" and not money, kilos, feet, etc. However, there is a sort of pattern in such codes, since they are not assigned randomly and have a hierarchy; for instance, 2930 belongs to Section C "Manufacturing" and Division 29 "Manufacture of motor vehicles, trailers and semi-trailers".
Since you want to use standard K-means, your data needs a geometric meaning. If your mapping of the codes into the geometric space is linear, you will not get a proper clustering result, because numeric distance between codes does not reflect their actual similarity: code 2930 ends up as close to code 2931 as to code 2929, regardless of how related the underlying activities are. Therefore, you need a nonlinear mapping from the categorical space to the geometric space in order to use standard k-means clustering.
One solution is to use machine learning techniques similar to word2vec (for vectorizing words), if you have enough data on co-occurrences of these codes.
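A minimal sketch of that embedding idea with gensim (the code sequences are made up; in practice they would come from co-occurrence data such as codes appearing on the same customer or invoice):

from gensim.models import Word2Vec  # third-party package, assumed installed

# Each "sentence" is a list of ISIC codes that co-occur, treated like words in a document.
code_sequences = [
    ["2930", "2910", "4520"],
    ["2930", "2920"],
    ["6201", "6202", "6311"],
]

model = Word2Vec(code_sequences, vector_size=8, window=2, min_count=1, seed=0)
embedding_2930 = model.wv["2930"]  # dense vector usable as k-means input
print(embedding_2930)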
Clustering is all about distance measurement.
Discretizing a numeric variable to categorical is a partial solution. As highlighted earlier, the underlying question is how to measure the distance between a discretized variable and other discretized or numeric variables.
In literature, there are several unsupervised algorithms for treating mixed data. Take a look at the k-prototypes algorithm and the Gower distance.
The k-prototypes algorithm in R is provided by the clustMixType package. The Gower distance in R is provided by the daisy function in the cluster package. If using Python, you can look at this post.
Huang, Z. (1997). Clustering large data sets with mixed numeric and categorical values. Paper presented at the Proceedings of the 1st Pacific-Asia Conference on Knowledge Discovery and Data Mining,(PAKDD).
Gower, J. C. (1971). A general coefficient of similarity and some of its properties. Biometrics, 857-871.
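For the Python route mentioned above, a minimal sketch using the third-party kmodes package (assumed installed); the mixed data are made up:

import numpy as np
from kmodes.kprototypes import KPrototypes

# Two numeric financial columns plus the ISIC code as a categorical string.
X = np.array([
    [1200.0, 3.5, "2930"],
    [800.0,  1.2, "2929"],
    [400.0,  0.7, "6201"],
    [950.0,  2.1, "2930"],
], dtype=object)

kproto = KPrototypes(n_clusters=2, init="Huang", random_state=0)
# `categorical` lists the column positions holding categorical values.
labels = kproto.fit_predict(X, categorical=[2])
print(labels)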
K-means is designed to minimize the sum of squares.
Does minimizing the sum of squares make sense for your problem? Probably not!
While 29, 2903 and 2930 are supposedly all related, 2899 likely is not very much related to 2900. Hence, a least-squares approach will produce undesired results.
The method is really designed for continuous variables of the same type and scale. One-hot encoded variables cause more problems than they solve - they are a naive hack to make the function "run", but the results are statistically questionable.
Try to figure out what the right thing to do is. It's probably not least squares here.

Numeric to nominal filter

When is it compulsory to use the filter to change the data type to nominal? I am doing classification right now, and the results differ by a huge margin if I change the type to nominal compared to leaving it as it is. Thank you in advance.
I don't think your question is well formed, but I will try to answer it anyway.
Nominal and numeric attributes represent different types of attributes and therefore are treated differently by machine learning algorithms.
Nominal attributes are limited to a closed set of values, and there is no order or any other relation between them. Usually nominal attributes should have a small number of possible values (a large set of possible values may cause over-fitting). The color of a car is an example of an attribute that would probably be represented as a nominal attribute.
Numeric attributes are usually more common. They represent values on some axis and are not limited to specific values. Usually a classification algorithm will try to find a point on that axis that differentiates well between the classes, or use the value to calculate distances between instances. The salary of an employee is an example of an attribute that would probably be used as a numeric attribute.
One more thing you need to take into account is how the classification algorithm treats nominal and numeric attributes. Some algorithms don't handle nominal attributes well. Others will not work well with several numeric attributes if the attribute values are not normalized.
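A minimal scikit-learn sketch (version 1.2 or later for sparse_output) illustrating how the same attribute is treated differently as numeric versus nominal; the data are made up:

import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# A single attribute holding category IDs 0..2.
X_numeric = np.array([[0], [1], [2], [0], [1], [2]])
y = np.array([0, 1, 0, 0, 1, 0])

# Treated as numeric: the tree splits on thresholds such as "x <= 1.5".
tree_numeric = DecisionTreeClassifier(random_state=0).fit(X_numeric, y)

# Treated as nominal: each category becomes its own binary column,
# so splits separate individual categories instead of value ranges.
X_nominal = OneHotEncoder(sparse_output=False).fit_transform(X_numeric)
tree_nominal = DecisionTreeClassifier(random_state=0).fit(X_nominal, y)

print(tree_numeric.predict(X_numeric), tree_nominal.predict(X_nominal))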

C4.5 algorithm with unbounded attributes

The current implementation of C4.5 in VFDT (http://www.cs.washington.edu/dm/vfml/vfdt.html), or for that matter any other implementation, uses the C4.5 file format for providing inputs to construct the decision tree. According to this, the attributes can have the following formats:
continuous
If the attribute has a continuous value.
discrete
The word 'discrete' followed by an integer which indicates how many values the attribute can take.
list of identifiers
This is a discrete attribute with the values enumerated (this is the preferred method for discrete attributes). The identifiers should be separated by commas.
ignore
means the attribute should be ignored - it won't be used.
Does anybody know how we can specify discrete valued attributes whose complete set of possible values is too large to list down?
For example "IP-Address" attribute can have Math.Pow(255,4) possible discrete values;
"QueryString" attribute can have infinite number of possible values ... etc.
Can the C4.5 algorithm handle the case where the attribute has say 100,000 discrete distinct values, OR where the exact bound is not known, but only an approximation is known?
Thanks.
The usual choice is to enumerate all the values of a discrete feature that occur in your training set. Since the algorithm can never gather enough statistics for values that are not seen during training, those would be ignored no matter how you'd implement them.
Mind you, it's quite hard to gather statistics for such features anyway, so you might want to think about different representations. In particular, multi-word strings of text can be tokenized and treated as bags of words.
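A minimal sketch of the bag-of-words idea with scikit-learn's CountVectorizer (the query strings are made up):

from sklearn.feature_extraction.text import CountVectorizer

# Query strings that would otherwise form one attribute with an unbounded set of values.
queries = [
    "cheap flights to paris",
    "paris weather forecast",
    "cheap hotels paris",
]

# Each token becomes its own column; the learner sees word counts
# instead of a single discrete attribute with too many distinct values.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(queries)
print(vectorizer.get_feature_names_out())
print(X.toarray())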
