Is it possible to define how many variables to use for the final model in H2O Driverless AI?

I am exploring the functionalities of H2O DAI at the moment. I understand that H2O DAI can choose which variables to use and which transformers to apply to them during the feature selection/engineering phase. But is there a way to configure H2O DAI to limit the maximum number of features it can use out of the provided list? E.g., given 100 features, I only want H2O DAI to select 20 of them and apply feature engineering to those. I tried to browse through the user manual but did not find any hints on this so far.
Many thanks in advance.

There are a few options in config.toml to control the number of features used:
# Maximum number of columns selected out of the original set of columns, using feature selection.
# The selection is based upon how well target encoding (or frequency encoding if not available)
# performs on categoricals and on numerics treated as categoricals.
# This is useful to reduce the final model complexity. First the best
# [max_orig_cols_selected] columns are found through feature selection methods, and then
# these features are used in feature evolution (to derive other features) and in modelling.
#max_orig_cols_selected = 10000
# Maximum number of numeric columns selected, above which feature selection is applied;
# same as above (max_orig_cols_selected) but for numeric columns.
#max_orig_numeric_cols_selected = 10000
# Maximum number of non-numeric columns selected, above which feature selection is applied
# on all features (avoiding treating numerics as categorical);
# same as above (max_orig_numeric_cols_selected) but for categorical columns.
#max_orig_nonnumeric_cols_selected = 300
# Like max_orig_cols_selected, but the number of columns above which a special individual
# with a reduced set of original columns is added.
#fs_orig_cols_selected = 500
# Maximum features per model (and each model within the final model if ensemble) kept.
# Keeps top variable importance features, prunes rest away, after each scoring.
# Final ensemble will exclude any pruned-away features and only train on kept features,
# but may contain a few new features due to fitting on different data view (e.g. new clusters)
# Final scoring pipeline will exclude any pruned-away features,
# but may contain a few new features due to fitting on different data view (e.g. new clusters)
# -1 means no restrictions except internally-determined memory and interpretability restrictions.
# Notes:
# * If interpretability > remove_scored_0gain_genes_in_postprocessing_above_interpretability, then
# every GA iteration post-processes features down to this value just after scoring them. Otherwise,
# only mutations of scored individuals will be pruned (until the final model where limits are strictly applied).
# * If ngenes_max is not also limited, then some individuals will have more genes and features until
# pruned by mutation or by preparation for final model.
# * E.g. to generally limit every iteration to exactly 1 feature, one must set nfeatures_max=ngenes_max=1
# and remove_scored_0gain_genes_in_postprocessing_above_interpretability=0, but the genetic algorithm
# will have a harder time finding good features.
#
#nfeatures_max = -1
See the config.toml file, or look in the expert settings.
Note that you can't control whether a specific feature gets transformers applied or not.
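For the scenario in the question (keep roughly 20 of 100 original columns), a minimal config.toml sketch would be to uncomment and lower the relevant caps from the excerpt above; the values here are illustrative, not recommendations:

# keep at most 20 of the original columns after feature selection
max_orig_cols_selected = 20
# optionally also cap the number of engineered features kept per model
nfeatures_max = 40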

Related

Exogenous variables in hmmlearn's GaussianHMM

I am trying to use hmmlearn's GaussianHMM to fit a Hidden Markov Model with 2 main states, while allowing for multiple exogenous variables. My goal is to determine two states of GDP growth (one with low variance and the other with high variance); these states then depend on lagged unemployment, lagged commercial confidence level, etc. I have a couple of questions:
Using hmmlearn's GaussianHMM, I have read through the documentation but I cannot find any mention of exogenous variables. Using the method fit(X, lengths=None), I see that X can have n_features columns; do I understand correctly that I should pass in an array with the first column being the endogenous variable (GDP growth in my case) and the rest of the columns being the exogenous variables?
Is hmmlearn's GaussianHMM equivalent to statsmodels.tsa.regime_switching.markov_regression.MarkovRegression? That model allows exog_tvtp, which means that exogenous variables are used to calculate a time-varying transition probability matrix.
Here is an example of fitting the monthly returns of the S&P 500, with no exogenous variables.
import numpy as np
import pandas as pd
from hmmlearn.hmm import GaussianHMM
import yfinance as yf

# Download S&P 500 adjusted closes and compute monthly log returns;
# we only care about volatility here, so no exogenous variables yet.
sp500 = yf.download("^GSPC")["Adj Close"]
rets = np.log(sp500 / sp500.shift(1)).dropna()
rets.index = pd.to_datetime(rets.index)
rets = rets.resample("M").sum()

# Two hidden states: low- and high-volatility regimes.
model = GaussianHMM(n_components=2)
model.fit(rets.to_frame())
state_sequence = model.predict(rets.to_frame())
Now imagine that I want to add a dependency on exogenous variables to the returns of the S&P 500, for example on economic growth or past volatilities. Is there a way to do this?
Thanks for any help.
n_features can be thought of as the temporal domain, and should not be conflated with features that describe the complexity of, e.g., a regression model.
If your hidden states are the two states of GDP growth, then the observed variable (the emissions) from which you are trying to infer the hidden states should be the feature space (a.k.a. n_features).
This should be a single measurement (emission), descriptive of a combination of your "exogenous variables", collected over time. hmmlearn will not be able to take multivariate emissions.
Suggestions
If I understand your question correctly, perhaps what you are looking for are Kalman filters. A KF produces estimates of unknowns based on multiple measurements (i.e., all of your exogenous variables), ultimately producing a model more accurate than one based on a single measurement.
If you wish each hidden state to have multiple independent emissions, then what you might be looking for is a structured perceptron. This is discussed here: Hidden Markov Model for multiple observed variables
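Since the question mentions statsmodels' MarkovRegression, here is a minimal sketch of the exog_tvtp route; the data is synthetic and the variable names are illustrative, so treat it as a starting point rather than a validated model:

import numpy as np
import statsmodels.api as sm

# Synthetic stand-ins: gdp_growth is the endogenous series; the
# exog_tvtp matrix (with a constant) drives the time-varying
# transition probabilities between the two regimes.
rng = np.random.default_rng(0)
gdp_growth = rng.normal(size=200)
exog_tvtp = sm.add_constant(rng.normal(size=(200, 2)))

mod = sm.tsa.MarkovRegression(
    gdp_growth,
    k_regimes=2,
    switching_variance=True,  # low- vs high-variance regimes
    exog_tvtp=exog_tvtp,
)
res = mod.fit()
print(res.summary())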

How to handle categorical features for Decision Tree, Random Forest in spark ml?

I am trying to build decision tree and random forest classifiers on the UCI bank marketing data -> https://archive.ics.uci.edu/ml/datasets/bank+marketing. There are many categorical features (having string values) in the data set.
In the Spark ML documentation, it's mentioned that categorical variables can be converted to numeric by indexing, using either StringIndexer or VectorIndexer. I chose to use StringIndexer (VectorIndexer requires a vector feature column, and VectorAssembler, which assembles features into a vector column, accepts only numeric types). Using this approach, each level of a categorical feature is assigned a numeric value based on its frequency (0 for the most frequent label of a categorical feature).
My question is: how will the Random Forest or Decision Tree algorithm understand that the new features (derived from categorical features) are different from continuous variables? Will an indexed feature be considered continuous by the algorithm? Is it the right approach? Or should I go ahead with one-hot encoding for categorical features?
I read some of the answers on this forum, but I didn't get clarity on the last part.
One-hot encoding should be done for categorical variables with more than two categories.
To understand why, you should know the difference between the two subcategories of categorical data: ordinal data and nominal data.
Ordinal Data: the values have some sort of ordering between them. Example:
customer feedback (excellent, good, neutral, bad, very bad). As you can see, there is a clear ordering between them (excellent > good > neutral > bad > very bad). In this case, StringIndexer alone is sufficient for modelling purposes.
Nominal Data: the values have no defined ordering between them.
Example: colours (black, blue, white, ...). In this case, StringIndexer alone is NOT sufficient, and one-hot encoding is required after string indexing.
After string indexing, let's assume the output is:
 id | colour | categoryIndex
----|--------|--------------
  0 | black  | 0.0
  1 | white  | 1.0
  2 | yellow | 2.0
  3 | red    | 3.0
Then, without one-hot encoding, the machine learning algorithm will assume red > yellow > white > black, which we know is not true.
OneHotEncoder() will help us avoid this situation.
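As a minimal PySpark sketch of the two-step encoding (the toy data and column names are made up for illustration; the OneHotEncoder API shown is the Spark 3.x one):

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(0, "black"), (1, "white"), (2, "yellow"), (3, "red")],
    ["id", "colour"],
)

# Step 1: index the string column; step 2: one-hot encode the index.
indexer = StringIndexer(inputCol="colour", outputCol="colourIndex")
encoder = OneHotEncoder(inputCols=["colourIndex"], outputCols=["colourVec"])
Pipeline(stages=[indexer, encoder]).fit(df).transform(df).show()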
So, to answer your questions:
Will indexed feature be considered as continuous in the algorithm?
It will be considered a continuous variable.
Is it the right approach? Or should I go ahead with One-Hot-Encoding for categorical features?
It depends on your understanding of the data. Although Random Forest and some boosting methods don't require one-hot encoding, most ML algorithms need it.
Refer: https://spark.apache.org/docs/latest/ml-features.html#onehotencoder
In short, Spark's RandomForest does NOT require OneHotEncoder for categorical features created by StringIndexer or VectorIndexer.
Longer explanation: in general, decision trees can handle both ordinal and nominal types of data. However, depending on the implementation, OneHotEncoder may be required (as it is in Python's scikit-learn).
Luckily, Spark's implementation of RandomForest honors categorical features if they are properly handled, and OneHotEncoder is NOT required!
Proper handling means that categorical features carry the corresponding metadata, so that RF knows what it is working with. Features created by StringIndexer or VectorIndexer carry metadata in the DataFrame marking them as generated by the indexer and as categorical.
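To illustrate that point, here is a hedged sketch (toy data, made-up column names) in which the indexed column is fed straight to the forest via VectorAssembler, with no one-hot encoding; the indexer's metadata travels with the column:

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import RandomForestClassifier

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("married", 0.0), ("single", 1.0), ("divorced", 0.0), ("single", 1.0)],
    ["marital", "label"],
)

indexer = StringIndexer(inputCol="marital", outputCol="maritalIdx")
assembler = VectorAssembler(inputCols=["maritalIdx"], outputCol="features")
rf = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=10)

# The assembled "features" vector retains the categorical metadata,
# so the forest treats maritalIdx as categorical, not continuous.
model = Pipeline(stages=[indexer, assembler, rf]).fit(df)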
According to vdep's answer, StringIndexer is enough for ordinal data. However, StringIndexer orders values by label frequency; for example, "excellent > good > neutral > bad > very bad" might become "good, excellent, neutral, ...". So for ordinal data, StringIndexer on its own is not suitable.
Secondly, for nominal data, the documentation tells us that:
for a binary classification problem with one categorical feature with three categories A, B and C whose corresponding proportions of label 1 are 0.2, 0.6 and 0.4, the categorical features are ordered as A, C, B. The two split candidates are A | C, B and A , C | B where | denotes the split.
The "corresponding proportions of label 1" is same as the label frequency? So I am confused of the feasibility with the StringInder to DecisionTree in Spark.

AMPL: what's a good way to specify equality constraints for a large list of pairs of variable-size sets?

I'm working on a problem that involves reconciling data that represents estimates of the same system under two different classification hierarchies. I want to enforce the requirement that equivalent classes or groups of classes have the same sum.
For example, say Classification A divides industries into: Agriculture (sheep/cattle), Agriculture (non-sheep/cattle), Mining, Manufacturing (textiles), Manufacturing (non-textiles), ...
Meanwhile, Classification B has a different breakdown: Agriculture, Mining (iron ore), Mining (non-iron-ore), Manufacturing (chemical), Manufacturing (non-chemical), ...
In this case, any total for A_Agric_SheepCattle + A_Agric_NonSheepCattle should match the equivalent total for B_Agric; A_Mining should match B_Mining_IronOre + B_Mining_NonIronOre; and A_MFG_Textiles + A_MFG_NonTextiles should match B_MFG_Chemical + B_MFG_NonChemical.
For bonus complication, one category may be involved in multiple equivalencies, e.g. B_Mining_IronOre might be involved in an equivalency with both A_Mining and A_Mining_Metallic.
I will be working with multi-dimensional tables, with this sort of concordance applied to more than one dimension; e.g., I might be compiling data on Industry x Product, so each equivalency will be used in multiple constraints. Hence I need an efficient way to define them once and invoke them repeatedly, instead of just setting a direct constraint like "A_Agric_SheepCattle + A_Agric_NonSheepCattle = B_Agric".
The most natural way to represent this sort of concordance would seem to be as a list of pairs of sets. The catch is that the set sizes will vary - sometimes we have a 1:1 equivalence, sometimes it's "these 5 categories equate to those 7 categories", etc.
I found this related question, which offers two answers for dealing with variable-sized sets. One is to define all set members in a single ordered set with indices, then define the starting index for each set within that. However, this seems unwieldy for my problem; both classifications are likely to be long, so I'd need to be hopping between two long lists of industries and two long lists of indices to see a single equivalency. This seems like it would be a nuisance to check, and hard to modify (since any change to membership of one of the early sets changes the index numbers for all following sets).
The other is to define pairs of long fixed-length sets, and then pad each set to the required length with null members.
This would be a much better option for my purposes, since it lets me eyeball a single line and see the equivalence it represents. But it would require a LOT of padding; most of the equivalence groups will be small, but a few might be quite large, and everything has to be padded to the largest expected length.
Is there a better approach?

Cross-validation in Lenskit

I'm trying to understand how exactly cross-validation is performed in LensKit. The documentation says that, by default, the data are partitioned by user. Does that mean that, in each fold, none of the users in the test set has been used for training? Is this achieved through the "holdout" option? If so, does this option break the user-based partitioning and yield folds in which each user shows up in both the training and test sets?
Right now, my evaluation code looks something like this:
dataset crossfold("data") {
    source csvfile(sourceFile) {
        delimiter "\t"
        domain {
            minimum 0.0
            maximum 10.0
            precision 0.1
        }
    }
    // order RandomOrder
    holdoutFraction 0.1
}
I commented out the "order" option because, when using it, lenskit eval throws an error.
Cheers!!!
Each user appears in both the training and the test sets, no matter the holdout, holdoutFraction, or retain options.
However, for each test user (when using 5 partitions, 20% of the users), part of their ratings (the test ratings) are held out and placed in the test set. The remainder of their ratings are placed in the training set, along with all ratings from other users.
This simulates the common case of a recommender system: you have users, for whom some of their history is already known and can be used in model training, and you're trying to recommend or predict their future behavior.
The holdout, holdoutFraction, and retain options are different ways of deciding how many ratings are put in the test set. If you say holdout 5, then 5 ratings from each test user are put in the test set, and the rest are used for training. If you say holdoutFraction 0.2, then 20% are used for testing and 80% for training. If you say retain 5, then 5 are used for training and the rest are used for testing.
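For example (values illustrative), each of these options would go inside the crossfold block, in place of the holdoutFraction line from the question; use only one at a time:

dataset crossfold("data") {
    // ... source as before ...
    holdout 5               // put exactly 5 ratings per test user in the test set
    // holdoutFraction 0.2  // or: test on 20% of each test user's ratings
    // retain 5             // or: train on 5 ratings, test on the rest
}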

What is the optimal way to choose a set of features for excluding items based on a bitmask when matching against a large set?

Suppose I have a large, static set of objects, and I have an object that I want to match against all of them according to a complicated set of criteria that entails an expensive test.
Suppose also that it's possible to identify a large set of features that can be used to exclude potential matches, thereby avoiding the expensive test. If a feature is present in the object I am testing, then I can exclude any objects in the set that don't have this feature. In other words, the presence of the feature is necessary but not sufficient for the test to pass.
In that case, I can precompute a bitmask for each object in the set indicating whether each feature is present or absent in the object. I can also compute it for the object that I want to test, and then loop through the array like this (pseudo-code):
objectMask = computeObjectMask(myObject)
for(each testObject in objectSet)
{
    if((testObject.mask & objectMask) != objectMask)
    {
        // early out: some features are in objectMask
        // but not in testObject.mask, so the test can't pass
    }
    else if(doComplicatedTest(testObject, myObject))
    {
        // found a match!
    }
}
So my question is: given a limited bitmask size, a large list of possible features, and a table of the frequencies of each feature in the object set (plus access to the object set, if you want to compute correlations between features and so on), what algorithm can I use to choose the optimal set of features for inclusion in my bitmask, so as to maximize the number of early outs and minimize the number of expensive tests?
If I just choose the top x most common features, then the chance of a feature being in both masks is higher, so it seems the number of early outs would be reduced. However, if I choose the x least common features, then objectMask might frequently be zero, meaning no early outs are possible. It seems fairly easy to experiment and come up with a set of middling-frequency features that gives good performance, but I'm interested in whether there is a theoretically best way of doing it.
Note: the frequency of each feature is assumed to be the same in the set of possible myObjects as in the objectSet, although I'd be interested to know how to handle it if it isn't. I'd also be interested to know whether there is an algorithm for finding the best feature set given a large sample of candidate objects that are to be matched against the set.
Possible applications: matching an input string against a large number of regexes, matching a string against a large dictionary of words using a criteria such as "must contain the same letters in the same order, but possibly with extra characters inserted anywhere in the word", etc. Example features: "contains the literal character D", "contains the character F followed by the character G later in the string" etc. Obviously the set of possible features will be highly dependent on the specific application.
You can try the Aho-Corasick algorithm. It is a very fast multi-pattern string matcher. Basically, it is a finite state machine with failure links computed via a breadth-first search of the trie.
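As a minimal sketch using the third-party pyahocorasick package (the patterns here are the usual toy example, not taken from the question):

import ahocorasick  # pip install pyahocorasick

# Build an automaton over a set of patterns.
automaton = ahocorasick.Automaton()
for idx, word in enumerate(["he", "she", "his", "hers"]):
    automaton.add_word(word, (idx, word))
automaton.make_automaton()  # computes the failure links via BFS

# Scan a haystack once; every pattern occurrence is reported
# as (end index, value) pairs.
for end_index, (idx, word) in automaton.iter("ushers"):
    print(end_index, word)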