How does h2o.cor deal with categorical data - h2o

The h2o.cor function is very powerful because it deals with categorical data. However, H2O's k-means function, which also traditionally deals only with numerical data, lets you specify the categorical_encoding to use. How does the h2o.cor function deal with categorical data?

For H2O-3 version 3.22 and earlier, the h2o.cor() function is only meant to be applied to numeric columns, so it does not handle categorical columns. If you try to run h2o.cor() on categorical columns it should return NA (the exception being a binary column, because it can be mapped to numeric 0/1).
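For illustration, here is a minimal sketch using the h2o Python API, where H2OFrame.cor() is the counterpart of R's h2o.cor(); the toy frame and the asnumeric() conversion of the binary column are illustrative assumptions, not part of the original answer.
import h2o
h2o.init()  # assumes a local H2O cluster can be started

hf = h2o.H2OFrame({
    "num1": [1.0, 2.0, 3.0, 4.0],
    "num2": [2.0, 4.0, 6.0, 8.0],
    "bin": ["yes", "no", "yes", "no"],  # binary categorical (enum) column
})

# Correlations are defined for numeric columns only.
print(hf[["num1", "num2"]].cor())

# A binary enum can be mapped to numeric 0/1 explicitly before correlating.
hf["bin_num"] = hf["bin"].asnumeric()
print(hf[["num1", "bin_num"]].cor())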

Related

Nuisance parameters which are related between channels

I want to do a combined fit, where the normalization of a background is related between two channels (regions in my case) using an ABCD formula. I've been using a standard normfactor so far to vary the normalization in a correlated way, but that also means that the absolute parameter value is the same.
Is there a possibility to define two parameters related by a formula (effectively a scale parameter in my case)?
(I suppose this is more or less the same as pyhf: POI application using formula and corresponding issue)

LightGBM incrementally construct Dataset

I want to construct a LightGBM Dataset object from very large X and y, which cannot be loaded into memory. Is there any method that can construct the Dataset in "batches"? E.g. something like
import lightgbm as lgb
ds = lgb.Dataset()
for X, y in data_generator():
    ds.add_new_data(data=X, label=y)
Regarding the data, there are a few hacks. For example, if your data is numeric, make sure the precision is not longer than needed; probably two digits would be enough (it depends on your data). Or, if you have categorical data, make sure you store it as integer codes rather than strings. But you are probably looking for a better approach.
There is a concept called incremental learning. Basically, you build a model (a tree ensemble) in your first iteration using the first batch of data. Then for your next model, you use that model as a starting point and only update the values (you can also allow for shrinkage). You can use keep_training_booster for such a scenario (see the sketch below), and please read up on the mechanism on your own.
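A minimal sketch of that pattern; data_generator() below is a hypothetical stand-in for the asker's batch source, while init_model and keep_training_booster are actual lgb.train parameters:
import numpy as np
import lightgbm as lgb

def data_generator():
    # Hypothetical stand-in for the asker's out-of-memory batch source.
    rng = np.random.default_rng(0)
    for _ in range(3):
        X = rng.normal(size=(1000, 5))
        yield X, X @ np.arange(1, 6) + rng.normal(size=1000)

params = {"objective": "regression", "verbosity": -1}
booster = None
for X, y in data_generator():
    ds = lgb.Dataset(X, label=y)
    booster = lgb.train(
        params,
        ds,
        num_boost_round=20,
        init_model=booster,          # continue from the previous batch's model
        keep_training_booster=True,  # keep the booster trainable in memory
    )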
The third technique is to build multiple models: say you divide your data into N pieces, train N models, and then use an ensemble approach (sketched below). This way your entire data set, split across the N models, has been used.
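A sketch of the ensemble approach, assuming the same hypothetical data_generator() as above:
import numpy as np
import lightgbm as lgb

# Train one model per manageable chunk of data.
models = []
for X, y in data_generator():
    ds = lgb.Dataset(X, label=y)
    models.append(lgb.train({"objective": "regression", "verbosity": -1},
                            ds, num_boost_round=50))

def ensemble_predict(X_new):
    # Average the per-chunk models' predictions into one estimate.
    return np.mean([m.predict(X_new) for m in models], axis=0)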

How to deal with categorical variables in a calibrated classifier?

I am working with a calibration curve for a CatBoost model.
cat = CatBoostClassifier()
calib = CalibratedClassifierCV(base_estimator=cat, method='sigmoid', cv=2)
calib.fit(XX, yy, cat_features=??)
How can I deal with categorical variables in the fit of calibrated classifier?
Thanks :)
You need to pass the categorical feature indices to the model constructor.
In your case:
cat = CatBoostClassifier(cat_features=categorical_positions)
and then continue as you wrote.
categorical_positions is a list of categorical feature indices.
The problem is that sklearn's CalibratedClassifierCV doesn't support string values.
To overcome this problem you need to change the string values of the categorical features to integer values (for example, enumerate them). CatBoost will still treat them as categorical, because you specified them in the cat_features parameter of CatBoostClassifier, so the metrics will be the same.
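Putting it together, a minimal sketch; the toy XX/yy data and the "city" column are hypothetical stand-ins for the asker's data:
import pandas as pd
from catboost import CatBoostClassifier
from sklearn.calibration import CalibratedClassifierCV

# Hypothetical toy data standing in for the asker's XX / yy.
XX = pd.DataFrame({
    "city": ["a", "b", "a", "c", "b", "a"],  # string categorical
    "income": [30, 50, 40, 60, 55, 35],
})
yy = [0, 1, 0, 1, 1, 0]
categorical_positions = [0]  # index of the "city" column

# Enumerate the string categories as integers so sklearn accepts them;
# CatBoost still treats the column as categorical via cat_features.
for idx in categorical_positions:
    XX.iloc[:, idx] = XX.iloc[:, idx].astype("category").cat.codes

cat = CatBoostClassifier(cat_features=categorical_positions, verbose=False)
# Note: newer scikit-learn versions rename base_estimator to estimator.
calib = CalibratedClassifierCV(base_estimator=cat, method="sigmoid", cv=2)
calib.fit(XX, yy)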

Best method to identify and replace outliers for salary columns in Python

What is the best method to identify and replace outliers in the ApplicantIncome, CoapplicantIncome, LoanAmount, and Loan_Amount_Term columns in pandas (Python)?
I tried IQR with a seaborn boxplot: I identified the outliers, replaced them with NaN, then took the mean of ApplicantIncome and filled the NaN records with it.
I would also like to do this per group over a combination of columns, e.g. Gender, Education, Self_Employed, Property_Area.
My dataframe has the following columns (one sample record):
Loan_ID LP001357
Gender Male
Married NaN
Dependents NaN
Education Graduate
Self_Employed No
ApplicantIncome 3816
CoapplicantIncome 754
LoanAmount 160
Loan_Amount_Term 360
Credit_History 1
Property_Area Urban
Loan_Status Y
Outliers
Just like missing values, your data might also contain values that diverge heavily from the large majority of your other data. These data points are called "outliers". To find them, you can check the distribution of your single variables by means of a box plot, or you can make a scatter plot of your data to identify data points that don't lie in the "expected" area of the plot.
The causes of outliers in your data might vary, from system errors to people interfering with the data through data entry or data processing, but it's important to consider the effect they can have on your analysis: they will distort summary statistics such as the mean and standard deviation, and they can potentially reduce normality and impact the results of statistical models such as regression or ANOVA.
To deal with outliers, you can either delete, transform, or impute them; the decision will again depend on the data context. That's why it's again important to understand your data and identify the cause of the outliers:
If the outlier value is due to data entry or data processing errors, you might consider deleting the value.
You can transform the outliers by assigning weights to your observations, or use the natural log to reduce the variation that the outlier values in your data set cause.
Just like the missing values, you can also use imputation methods to replace the extreme values of your data with median, mean or mode values.
You can use the functions that were described in the above section to deal with outliers in your data.
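For example, a sketch of the IQR-plus-group-mean approach described in the question, assuming df is a pandas DataFrame already loaded with the loan columns shown above:
import numpy as np
import pandas as pd

# Assumes df is the loan DataFrame with the columns shown in the question.
cols = ["ApplicantIncome", "CoapplicantIncome", "LoanAmount", "Loan_Amount_Term"]
group_cols = ["Gender", "Education", "Self_Employed", "Property_Area"]

for col in cols:
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    # Mark values outside the 1.5 * IQR whiskers (the boxplot rule) as missing.
    mask = (df[col] < q1 - 1.5 * iqr) | (df[col] > q3 + 1.5 * iqr)
    df.loc[mask, col] = np.nan
    # Impute group-wise means, falling back to the overall mean where a group is empty.
    df[col] = df.groupby(group_cols)[col].transform(lambda s: s.fillna(s.mean()))
    df[col] = df[col].fillna(df[col].mean())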
The following links may be useful for you:
Python data cleaning
Ways to detect and remove the outliers

Numeric to nominal filter

When is it compulsory to use the filter to change the data type to nominal? I am doing classification right now, and the results differ by a huge margin if I change the type to nominal compared to leaving it as it is. Thank you in advance.
I don't think your question is well formed, but I will try to answer it anyway.
Nominal and numeric attributes represent different types of attributes and therefore are treated differently by machine learning algorithms.
Nominal attributes are limited to a closed set of values, and there is no order or any other relation between them. Usually nominal attributes should have a small number of possible values (a large set of possible values may cause over-fitting). The color of a car is an example of an attribute that would probably be represented as a nominal attribute.
Numeric attributes are usually more common. They represent values on some axis and are not limited to specific values. Usually a classification algorithm will try to find a point on that axis that differentiates well between the classes, or use the value to calculate distances between instances. The salary of an employee is an example of an attribute I would probably use as a numeric attribute.
One more thing you need to take into account is how the classification algorithm treats nominal and numeric attributes. Some algorithms don't handle nominal attributes well. Other algorithms will not work well with numeric attributes if the values are not normalized.
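Outside Weka, the same distinction can be illustrated in a few lines of pandas (an illustrative sketch, not Weka code): the same integer codes look completely different to an algorithm depending on whether they are treated as an ordered axis or as unordered labels.
import pandas as pd

# Toy example: the same codes behave differently as numeric vs. nominal.
df = pd.DataFrame({"color_code": [1, 2, 3, 2]})

# Treated as numeric, an algorithm sees an ordered axis where 3 > 2 > 1.
numeric_view = df["color_code"]

# Treated as nominal, there is no order: one indicator column per category.
nominal_view = pd.get_dummies(df["color_code"].astype("category"), prefix="color")
print(nominal_view)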
