Can I use only some of the columns that were used to create a GBM model and still predict in supervised learning? - apache-spark-mllib

My GBM model was trained on nearly 150 columns. For some records I won't receive all of those columns. Will the model still work in that case? I don't want to set the missing values to 0.

Your question title and description talk about two different things, and the title is not clear about what you are asking. My answer is based on the question in your description:
If you use H2O to build your GBM model, H2O treats missing numerical values, missing categorical values, and unseen categorical levels as NA. The following documentation on "handling missing values in GBM" should help you understand your case:
http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/gbm-faq/missing_values.html?highlight=missing%20values
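To illustrate the difference between marking a value as missing and setting it to 0, the sketch below pads an incoming record with None for every training column it lacks. The helper name pad_with_missing and the column names are hypothetical. This is plain Python, not H2O code, and it only assumes (per the documentation linked above) that missing values are handled as NA rather than as zeros.

```python
def pad_with_missing(record, training_columns):
    """Return a copy of `record` that has every training column.

    Columns absent from `record` are set to None (a missing-value marker),
    not 0, so a genuine zero stays distinguishable from "no value".
    """
    return {col: record.get(col) for col in training_columns}


# Hypothetical training columns and an incomplete incoming record.
training_columns = ["age", "income", "region"]
incoming = {"age": 42, "income": 0}

padded = pad_with_missing(incoming, training_columns)
print(padded)
```

Here "income" keeps its real value of 0, while "region" is explicitly missing.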

Related

Generating Predicted Vs Actual table using SCORE statement

I'm trying to build a logistic model and I have already divided the training and validation data sets. I used the SCORE statement in order to validate the model against the validation data. In reviewing the SAS documentation, I read the following: "Similarly, an actual by predicted table can be created for a validation data set by using the SCORE statement which also produces a data set containing predicted probability variables and a variable (I_y, where y is the name of your response variable) containing the predicted response category. Note that the validation data set must contain the observed responses in order to produce the table." However my code does not produce the actual by predicted table.
I have also tried an OUTMODEL and INMODEL code with similar results.
proc logistic data=train plots(only)=(effect oddsratio);
  class Gender Geography;
  model Exited(event="1") = &cat &interval / selection=stepwise clodds=pl slstay=.05 slentry=.05;
  score data=valid out=churn.churn_pred_sw;
run;
The only warning that I receive is as follows: WARNING: Some plots have more than 5000 observations and are suppressed. Specify the PLOTS(MAXPOINTS=NONE) option in the PROC LOGISTIC statement to display the plots.
If I remove the PLOTS option, the warning goes away, but the actual vs predicted table for the validation set is still not produced.

How do h2o models determine what columns to use for predictions (position, name, etc.)?

I'm using the h2o Python API to train some models and am a bit confused about how to correctly use some parts of it. Specifically: which columns should be ignored in a training dataset, and how do models find the actual predictor features in a dataset when the model's predict() method is called? Also, how should weight columns be handled when the actual prediction datasets don't have weights?
The details of the code here are (I think) not very important, but the basic training logic looks something like
drf_dx = h2o.h2o.H2ORandomForestEstimator(
    # denoting update version name by epoch timestamp
    model_id='drf_dx_v'+str(version)+'t'+str(int(time.time())),
    response_column='dx_outcome',
    ignored_columns=[
        'ucl_id', 'patient_id', 'account_id', 'tar_id', 'charge_line', 'ML_data_begin',
        'procedure_outcome', 'provider_outcome',
        'weight'
    ],
    weights_column='weight',
    ntrees=64,
    nbins=32,
    balance_classes=True,
    binomial_double_trees=True)
.
.
.
drf_dx.train(x=X_train, y=Y_train,
             training_frame=train_u, validation_frame=val_u,
             max_runtime_secs=max_train_time_hrs*60*60)
(note the ignored columns) and the prediction logic just looks like
preds = model.predict(X)
where X is some (h2o) dataframe with more (or fewer) columns than the X_train used to train the model (it includes some columns for post-processing exploration in a Jupyter notebook). E.g. X_train columns may look like
<columns to ignore (as seen in the code)> <columns to use as features for training> <outcome label>
and X columns may look like
<columns to ignore (as seen in the code)> <EVEN MORE COLUMNS TO IGNORE> <columns to use as features for training>
My question is: is this going to confuse the model when making predictions? I.e. does the model find the feature columns by column name (in which case I don't think the different dataframe width would be a problem), by column position (in which case adding more data columns to each sample would shift the positions and become a problem), or by something else? And what happens given that these extra columns were not explicitly listed in the ignored_columns arg of the model constructor?
** Slight aside: should the weights_column name be in the ignored_columns list or not? The example in the docs (http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/weights_column.html#weights-column) seems to use it as a predictor feature as well, and the docs seem to recommend this:
For scoring, all computed metrics will take the observation weights into account (for Gains/Lift, AUC, confusion matrices, logloss, etc.), so it’s important to also provide the weights column for validation or test sets if you want to up/down-weight certain observations (ideally consistently between training and testing).
but these weight values are not something that comes with the data used in actual predictions.
I've summarized your question into a few distinct parts, so the answers will be in a Q/A type fashion.
1) When I use my_model.predict(X), how does H2O-3 know which columns to predict with?
H2O-3 will use the columns that you passed as predictors when you built your model, i.e. whatever you passed to the x argument, or otherwise all the columns in your training_frame that were not ignored using ignored_columns, passed as the target via the y argument, or dropped because the column has a constant value. My recommendation would be to use the x argument to specify your predictors and not to use the ignored_columns parameter. If X, the new dataframe you are predicting on, includes columns that were not used when building the model, those columns will be ignored. So columns are matched by name, not by position.
2) Should the weights column name be in the ignored column list?
No, if you pass the weights column to the ignored column list, that column will not be considered in any fashion during the model building phase. In fact, if you test this out, you should get a null pointer error or something similar.
3) Why is the "weights" column specified as a predictor and as the weights_column in the following code example?
This is a great question! I've created two Jira tickets: one to update the documentation to clear up the confusion, and another to potentially add a user warning.
The short answer is that if you pass the same column to the predictors argument x and to the weights_column argument, that column will only be used as a weight. It will not be used as a feature.
4) Does the user guide recommend using the weights as a feature and as a weight?
No, in the paragraph you are pointing to, the recommendation is to ensure that the column you pass as your weights_column exists in your training frame and validation frame - not that it should also be included as a feature.
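The rules described in answers 1) and 3) can be sketched in plain Python. This is only a model of the behaviour described above, not H2O's implementation: the helper names resolve_predictors and select_for_scoring are hypothetical, the columns "age" and "charge" are made up for the example, and the dropping of constant columns is left out because it depends on the data itself.

```python
def resolve_predictors(training_columns, y, x=None, ignored_columns=(),
                       weights_column=None):
    """Model the predictor-selection rule described in the answer.

    If x is given it is the candidate list; otherwise every training
    column that is not in ignored_columns is a candidate.
    """
    ignored = set(ignored_columns)
    if x is not None:
        candidates = list(x)
    else:
        candidates = [c for c in training_columns if c not in ignored]
    # The response and the weights column are never used as features.
    return [c for c in candidates if c != y and c != weights_column]


def select_for_scoring(row, predictors):
    """Pick feature values by column name; extra columns in `row` are ignored."""
    return [row[name] for name in predictors]


train_cols = ["patient_id", "age", "charge", "weight", "dx_outcome"]
predictors = resolve_predictors(
    train_cols, y="dx_outcome",
    ignored_columns=["patient_id"], weights_column="weight")
print(predictors)

# A wider scoring row with extra columns in a different order.
wide_row = {"notes": "n/a", "charge": 120.0, "patient_id": 7, "age": 63}
print(select_for_scoring(wide_row, predictors))
```

Because the lookup is by name, the extra "notes" column and the changed column order have no effect on which values are selected.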

Qlik Sense - Rank() within a specific dimension when you have multiple ones

I am new to Qlik and trying to solve the following issue.
I have a table with two dimensions, one with the entry's unique ID, and one with a category, as in the example below.
Table example
My goal is to create a new column with a ranking of 'Score' - my measure - per category:
Table with desired output
If I use the expression
Rank(Score)
I get a column of ones, as the command takes the most granular dimension (Unique ID) as the default one. If I use
Rank(TOTAL Score)
it returns a ranking that disregards all dimensions. From reading the documentation and similar questions asked by other users, I reckon that it should be possible to specify which dimension to use for TOTAL, with the following syntax:
Rank(TOTAL <Category> Score)
Yet, the formula returns an error and only null column values. I've tried different syntax, use of brackets but I still cannot grasp what I am doing wrong.
Please note that I cannot create the ranking column when loading the data.
I would immensely appreciate if someone were so kind to help on this!
Try with
=aggr(rank(sum(Score)), Category, UniqueID)
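For reference, the result the question asks for (a rank of Score within each Category) can be written out in plain Python. This is not Qlik code and does not verify the expression above. The function name and sample rows are hypothetical, and it assumes that the highest score gets rank 1 and that ties share a rank.

```python
from collections import defaultdict


def rank_within_category(rows):
    """Rank each row's Score within its Category (1 = highest score).

    Tied scores share a rank and the next rank is skipped
    (competition ranking); this tie rule is an assumption.
    """
    scores_by_category = defaultdict(list)
    for row in rows:
        scores_by_category[row["Category"]].append(row["Score"])

    ranked = []
    for row in rows:
        # Rank = 1 + number of strictly higher scores in the same category.
        higher = sum(1 for s in scores_by_category[row["Category"]]
                     if s > row["Score"])
        ranked.append({**row, "Rank": higher + 1})
    return ranked


rows = [
    {"UniqueID": 1, "Category": "A", "Score": 10},
    {"UniqueID": 2, "Category": "A", "Score": 30},
    {"UniqueID": 3, "Category": "B", "Score": 20},
    {"UniqueID": 4, "Category": "B", "Score": 5},
]
result = rank_within_category(rows)
for r in result:
    print(r["UniqueID"], r["Category"], r["Score"], r["Rank"])
```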

How to change nominal attribute value order in WEKA GUI?

I have two data sets, one for training and one for testing, in Weka. Both have the same number of attributes and the same data types (numeric or nominal), but they are not compatible with each other because the order of the nominal values is different.
e.g. Training set
Occupation
1 Doctor 40%
2 Engineer 40%
3 Teacher 20%
Test set
1 Engineer 40%
2 doctor 40%
3 Teacher 20%
So the two sets are incompatible. My question is: how can I change the order of these distinct values to make them compatible?
It looks a bit like a data pre-processing issue. I am quite curious as to how the training and testing data ended up looking like this!
If you would like to change the nominal values, you could use RenameNominalValues to rename the labels of your data. One possible method is to apply this to your testing data:
This solution assumes that you are dealing with a Nominal attribute, that it is your last attribute and they are labelled as shown in the valueReplacements field.
Failing this, depending on the number of cases, you could edit the values manually or use your favourite spreadsheet to replace the values.
Hope this Helps!
Use "SwapValues" under unsupervised > attribute
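To see why the declared order matters, recall that a nominal value is stored as an index into the attribute's list of labels, so the same index means different labels in the two sets. The plain-Python sketch below (not Weka code) re-expresses test-set indices in the training set's label order. The function name is hypothetical, and it assumes the labels are spelled identically in both sets.

```python
def remap_nominal_indices(test_values, test_labels, train_labels):
    """Convert indices into test_labels to indices into train_labels."""
    position = {label: i for i, label in enumerate(train_labels)}
    return [position[test_labels[v]] for v in test_values]


train_labels = ["Doctor", "Engineer", "Teacher"]
test_labels = ["Engineer", "Doctor", "Teacher"]

# Test rows encoded against the test set's own label order:
# Engineer, Doctor, Teacher, Engineer
test_values = [0, 1, 2, 0]

remapped = remap_nominal_indices(test_values, test_labels, train_labels)
print(remapped)
```

After remapping, index 0 means "Doctor" in both sets, which is the effect that reordering the nominal values is meant to achieve.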

Hibernate - Ordering criteria by formula property

Say I have an entity MyEntity, and it has a formula-based property fmlaProp. Now say I create a criteria:
s.createCriteria(MyEntity.class)
    .setProjection(
        Projections.distinct(
            Projections.property("fmlaProp")))
    .addOrder(Order.asc("fmlaProp"));
in this case I get the following SQL:
SELECT DISTINCT fmlaProp-sql FROM MY_ENTITY_TABLE ORDER BY fmlaProp-sql
Which gives an error on Oracle saying that order-by expression is non-selected. Then I tried the following criteria:
s.createCriteria(MyEntity.class)
    .setProjection(
        Projections.distinct(
            Projections.alias(
                Projections.property("fmlaProp"),
                "alias1")))
    .addOrder(Order.asc("alias1"));
Which generates "order by alias1" which works fine. But it is kind of ugly -- the code must "know" of those formula properties, which violates "write once" principle. Any thoughts or suggestions on that? Thank you in advance.
This is expected behavior from Hibernate. It doesn't have to do with the formula property specifically, but that you want to do ordering with a projected value. From the Hibernate Docs:
An alias can be assigned to a projection so that the projected value can be referred to in restrictions or orderings. Here are two different ways to do this...
As for alternatives, you could try making the formula property a virtual column (in Oracle 11 and above) or wrapping the table in a view with this column computed. That way, Oracle will know fmlaProp directly, and it can be used just like a "normal" column.
