I'm trying to build a logistic model and have already split the data into training and validation sets. I used the SCORE statement to validate the model against the validation data. In reviewing the SAS documentation, I read the following: "Similarly, an actual by predicted table can be created for a validation data set by using the SCORE statement which also produces a data set containing predicted probability variables and a variable (I_y, where y is the name of your response variable) containing the predicted response category. Note that the validation data set must contain the observed responses in order to produce the table." However, my code does not produce the actual-by-predicted table.
I have also tried using the OUTMODEL and INMODEL options, with similar results.
proc logistic data=train plots(only)=(effect oddsratio);
class Gender Geography;
model Exited(event="1") = &cat &interval / selection=stepwise clodds=pl slstay=.05 slentry=.05;
score data=valid out=churn.churn_pred_sw;
run;
The only warning that I receive is as follows: WARNING: Some plots have more than 5000 observations and are suppressed. Specify the PLOTS(MAXPOINTS=NONE) option in the PROC LOGISTIC statement to display the plots.
If I remove the PLOTS option, the warning goes away, but the actual-by-predicted table based on the validation set is still not produced.
I have a table ("Issues") which I am creating in PowerBI from a JIRA data connector, so this changes each time I refresh it. I have three columns I am using
Form Name
Effort
Status
I created a second table and have summarized the Form Names and obtained the Total Effort:
SUMMARIZE(Issues, Issues[Form Name], "Total Effort", SUM(Issues[Effort (Days)]))
But I also want to add in a column for
Total Effort for each form name where the Status field is "Done"
My issue is that I don't know how to compare both tables / form names since these might change each time I refresh the table.
I need to write a conditional, something like:
For each form name, print the total effort, and also print the total effort where the status is "Done".
I have tried SUMX, CALCULATE, SUM, FILTER but cannot get these to work - can someone help, please?
If all you need is to add a column to your summarized table that sums "Effort" only when Status is set to "Done", then this is the right place to use CALCULATE.
Table =
SUMMARIZE(
Issues,
Issues[Form Name],
"Total Effort", SUM(Issues[Effort]),
"Total Effort (Done)", CALCULATE(SUM(Issues[Effort]), Issues[Status] = "Done")
)
Here is a quick capture of the mock data I used to test this. The matrix is just the mock data with [Form Name] on the rows and [Status] on the columns. The last table shows the 'summarized' data calculated by the DAX above; you can compare it to the values in the matrix and see that they tie out.
I am using Automated ML to run a time series forecasting pipeline.
When the AutoMLStep gets triggered, I get this error: Non numeric value(s) were encountered in the target column.
The data for this step is passed through an OutputTabularDatasetConfig, after applying read_delimited_files() to an OutputFileDatasetConfig. I've inspected the prior step, and the data consists of a 'Date' column and a numeric column called 'Place' with 80+ observations at a monthly frequency.
Nothing seems to be wrong with the column type or the data. I've also applied a number of techniques on the data prep side, e.g. pd.to_numeric() and astype(float), to ensure it is numeric.
I've also tried forcing this through FeaturizationConfig().add_column_purpose('Place', 'Numeric'), but in this case I get another error: Expected column(s) Place in featurization config's column purpose not found in X.
Any thoughts on how to solve this?
So, a few learnings on this from interacting with the stellar Azure Machine Learning engineering team:
When calling the read_delimited_files() method, ensure that the output folder does not contain many other inputs or files. For example, if all intermediate outputs are saved to a common folder, the method may read all the prior files in that folder and, depending on the shape of the data, borrow the schema from the first file or mix them all together. This can lead to inconsistencies and errors. In my case, I was dumping many files to the same location, which was confusing this method. The fix is either to distinctly mark the output folder (e.g. with a UUID) or to use different paths; a rough sketch is below.
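For example, something along these lines worked for me; treat it as a sketch only, since the datastore, folder prefix, and output name here are placeholders rather than my actual pipeline's values:

from uuid import uuid4
from azureml.core import Workspace
from azureml.data import OutputFileDatasetConfig

ws = Workspace.from_config()
datastore = ws.get_default_datastore()

# Write this run's intermediate output to its own UUID-marked folder so that
# read_delimited_files() only picks up files produced by this run.
prepped_output = OutputFileDatasetConfig(
    name="prepped_data",  # placeholder output name
    destination=(datastore, "intermediate/prepped_" + uuid4().hex)
)

# Tabular view over just this run's files, passed on to the AutoMLStep.
prepped_tabular = prepped_output.read_delimited_files()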
The dataframe from read_delimited_files() may treat all columns as object type, which can lead to a data type check failure (i.e. label_column needs to be numeric). To mitigate this, explicitly state the type. For example:
from azureml.data import DataType
prepped_data = prepped_data.read_delimited_files(set_column_types={"Place":DataType.to_float()})
I'm using the H2O Python API to train some models and am a bit confused about how to correctly implement some parts of the API: specifically, which columns should be ignored in a training dataset, how models look for the actual predictor features in a data set when actually using the model's predict() method, and how weight columns should be handled (when the actual prediction datasets don't really have weights).
The details of the code here (I think) are not majorly important, but the basic training logic looks something like
drf_dx = h2o.h2o.H2ORandomForestEstimator(
# denoting update version name by epoch timestamp
model_id='drf_dx_v'+str(version)+'t'+str(int(time.time())),
response_column='dx_outcome',
ignored_columns=[
'ucl_id', 'patient_id', 'account_id', 'tar_id', 'charge_line', 'ML_data_begin',
'procedure_outcome', 'provider_outcome',
'weight'
],
weights_column='weight',
ntrees=64,
nbins=32,
balance_classes=True,
binomial_double_trees=True)
.
.
.
drf_dx.train(x=X_train, y=Y_train,
training_frame=train_u, validation_frame=val_u,
max_runtime_secs=max_train_time_hrs*60*60)
(note the ignored columns) and the prediction logic just looks like
preds = model.predict(X)
where X is some (H2O) dataframe with more (or fewer) columns than the X_train used to train the model (it includes some columns for post-processing exploration in a Jupyter notebook). E.g., X_train columns may look like
<columns to ignore (as seen in the code)> <columns to use as features for training> <outcome label>
and X columns may look like
<columns to ignore (as seen in the code)> <EVEN MORE COLUMNS TO IGNORE> <columns to use as features for training>
My question is: is this going to confuse the model when making predictions? I.e., is the model getting the columns to use as features by column name (in which case, I don't think the different dataframe width would be a problem), or is it going by column position (in which case adding more data columns to each sample would shift the positions and become a problem), or something else? What happens given that these extra columns were not listed in the ignored_columns arg in the model constructor?
** Slight aside: should the weights_column name be in the ignored_columns list or not? The example in the docs (http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/weights_column.html#weights-column) seems to use it as a predictor feature, and also seems to recommend it:
For scoring, all computed metrics will take the observation weights into account (for Gains/Lift, AUC, confusion matrices, logloss, etc.), so it’s important to also provide the weights column for validation or test sets if you want to up/down-weight certain observations (ideally consistently between training and testing).
but these weight values are not something that comes with the data used in actual predictions.
I've summarized your question into a few distinct parts, so the answers will be in a Q/A type fashion.
1) When I use my_model.predict(X), how does H2O-3 know which columns to predict with?
H2O-3 will use the columns that you passed as predictors when you built your model (i.e. whatever you passed to the x argument in the estimator, or all the columns in your training_frame that were not ignored via ignored_columns, passed as the target to the y argument, or dropped because the column has a constant value). My recommendation would be to use the x argument to specify your predictors and ignore the ignored_columns parameter. If X, the new dataframe you are predicting on, includes columns that were not used when you were building the model, those columns will be ignored - so columns are matched by name, not by position.
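To illustrate the name-based matching, here is a tiny self-contained sketch with made-up column names and data (not your actual schema):

import h2o
from h2o.estimators import H2ORandomForestEstimator

h2o.init()

# Tiny made-up frame: an ID column to ignore, two features, and a target.
train_u = h2o.H2OFrame({
    "patient_id": [1, 2, 3, 4, 5, 6],
    "feat_a": [0.1, 0.5, 0.3, 0.9, 0.2, 0.7],
    "feat_b": [10, 20, 15, 30, 12, 25],
    "dx_outcome": ["no", "yes", "no", "yes", "no", "yes"],
})
train_u["dx_outcome"] = train_u["dx_outcome"].asfactor()

predictors = ["feat_a", "feat_b"]          # only these are used as features
drf = H2ORandomForestEstimator(ntrees=10)
drf.train(x=predictors, y="dx_outcome", training_frame=train_u)

# A scoring frame with extra, unseen columns: they are simply ignored,
# because the predictors are looked up by column name, not by position.
scoring = h2o.H2OFrame({
    "extra_note": ["a", "b"],
    "feat_b": [18, 22],
    "feat_a": [0.4, 0.8],
})
preds = drf.predict(scoring)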
2) Should the weights column name be in the ignored column list?
No, if you pass the weights column to the ignored column list, that column will not be considered in any fashion during the model building phase. In fact, if you test this out, you should get a null pointer error or something similar.
3) Why is the "weights" column specified as a predictor and as the weights_column in the code example from the docs?
This is a great question! I've created two Jira tickets: one to update the documentation to clear up the confusion, and another to potentially add a user warning.
The short answer is: if you pass the same column to the predictors argument x and to the weights_column argument, that column will only be used as a weight - it will not be used as a feature.
4) Does the user guide recommend using the weights as a feature and as a weight?
No, in the paragraph you are pointing to, the recommendation is to ensure that the column you pass as your weights_column exists in your training frame and validation frame - not that it should also be included as a feature.
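Tying the weights part back to your code, here is a rough continuation of the sketch above (the weight values are made up, and the column names are still the hypothetical ones, not yours):

# Continuing from the mock frame above: add a made-up observation-weight column.
train_u["weight"] = h2o.H2OFrame({"weight": [1, 1, 2, 1, 3, 1]})

# "weight" is NOT listed in the predictors; it is picked up via weights_column only.
# If a validation_frame is supplied, it should also contain the "weight" column.
drf_w = H2ORandomForestEstimator(ntrees=10, weights_column="weight")
drf_w.train(x=predictors, y="dx_outcome", training_frame=train_u)

# The frame being scored needs no "weight" column; weights only influence fitting
# and the metrics computed on the training/validation frames, not predict().
preds_w = drf_w.predict(scoring)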
I have a report that I already created in which this works. I have three parameters in my first report, which contains a matrix. The matrix columns and rows are based on two calculated fields. The row field is called Time Summary and is basically a time range (8AM-10AM, 10AM-12PM, 12PM-2PM, etc.). The column field is just days of the week (Monday, Tuesday, Wednesday, etc.). They were both calculated from a field called 'CreatedDateTime'. The value column is a count of request numbers, so we can see when our call center is receiving the most service requests and at which times.
I'm drilling down to a detail report that lists each request and many of the request details. I created an action on the value (count of request) textbox in my matrix report that makes it drill into the detail report. In the detail report I have the same three parameters as in my first report, but I also have parameters for the row and column fields of my matrix (Time summary and day of the week respectively). Here is a screenshot of my text box properties screen in the action tab.
The problem I'm having is that when I run my matrix report and click on the value that I want to drill into, it drills into the detail report and updates the values for the three parameters that are also in my matrix report, but it won't update the parameters for the row and column values that I selected in my matrix.
Here is another screenshot of what happens when I select a value in my matrix and it drills into the detail report. I have the calculated fields in my detail report too, and it filters the main data set based on the Day parameter and the Time Summary parameter. I made a very similar report yesterday and it worked, and I can't figure out why I can't get this report to work. I'm almost positive it has to do with how I have my parameters defined in my detail report, or with the text box properties and the action. Any help in figuring this out would be appreciated.
I have a master-block with a details-block. One of the fields in the master-block holds a calculated value which depends on the details-block, and is persisted to the database.
The details-block has POST-INSERT, POST-UPDATE and POST-DELETE form triggers, in which the value of the master-block field is calculated and set:
MASTERBLOCK.FIELD1:=FUNC1; --DB Function that queries the details block's table
When a form is committed, the following happens:
the master block is saved with the stale value
the details-block is saved
the form triggers are executed and the value of the master block is calculated and set.
the master-block field now contains the updated value, but the master-block's record status is not CHANGED and the updated value is not saved.
How can I force the persistence of the calculated field in the master-block?
"One of the fields in the master-block holds a calculated value which depends on the details-block"
Generally the ongoing maintenance of calculated totals exceeds the effort required to calculate them on-demand. But there are exceptions, so let's assume this is the case here.
I think this is your problem: --DB Function that queries the details block's table. Your processing is split between the client and the server in an unhelpful manner. A better approach would be to either:
maintain the total in the master block by capturing the relevant changes in the detail block as they happen (say in navigation triggers); or
calculate the total and update the master record in a database procedure, returning the total for display in the form.
It's not possible to give a definitive answer without knowing more about the specifics of your case. The key thing is you need to understand the concept of a Transaction as the Unit Of Work, and make sure that all the necessary changes are readied before the database issues the COMMIT.