Interpret Google AutoML Online Prediction Results - google-cloud-automl

We are using Google AutoML Tables with CSV files as input. We imported the data, linked the schema with nullable columns, trained a model, deployed it, and used online prediction to predict the value of one column.
The target column has values ranging from a minimum of 44 to a maximum of 263.
When we deployed the model and ran online prediction, it returned values like this:
Prediction result
0.49457597732543945
95% prediction interval
[-8.209495544433594, 0.9892584085464478]
Most of the result set is in the above format. How can we convert it to values in the range 44-263? We didn't find much documentation online about this.
We are looking for a documentation reference and an explanation of how to interpret the result, including the 95% prediction interval.

Actually to clarify (I'm the PM of AutoML Tables)--
AutoML Tables does not do any normalization of the predicted values for your label data, so if you expect your label data to have a distribution of min/max 44-263, then the output predictions should also be in that range. Two possibilities would make it significantly different:
1) You selected the wrong label column
2) Your input features for this prediction are dramatically different than what was seen in the training data used.
Please feel free to reach out to cloud-automl-tables-discuss@googlegroups.com if you'd like us to help debug further.
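One way to check the second possibility is to compare the feature values sent for prediction against the ranges seen in the training CSV. The following is a minimal sketch in plain Python, not part of AutoML Tables; the function name, column names, and values are all hypothetical.

```python
def out_of_range_features(training_rows, prediction_row):
    """Return {feature: (value, train_min, train_max)} for numeric features
    whose value in prediction_row lies outside the range seen in training."""
    flagged = {}
    for name, value in prediction_row.items():
        seen = [row[name] for row in training_rows
                if isinstance(row.get(name), (int, float))]
        if not seen or not isinstance(value, (int, float)):
            continue  # skip non-numeric or missing features
        lo, hi = min(seen), max(seen)
        if value < lo or value > hi:
            flagged[name] = (value, lo, hi)
    return flagged


# Hypothetical data: column names and values are made up for illustration.
training_rows = [
    {"age": 34, "income": 52000.0},
    {"age": 51, "income": 78000.0},
    {"age": 29, "income": 41000.0},
]
prediction_row = {"age": 45, "income": 5.2}  # income possibly in the wrong unit

print(out_of_range_features(training_rows, prediction_row))
```

A flagged feature does not prove the prediction is wrong, but it is a reasonable first thing to look at before checking the label column selection.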

Related

Can MappingScore() be used to get an absolute measure of scRNAseq dataset similarity to the reference dataset?

I have been using Seurat v4 Reference Mapping to align some query scRNAseq datasets that come from iPSC-derived cells that were subjected to several directed cortical differentiation protocols at multiple timepoints. I made the reference dataset by merging several individual fetal cortical sample datasets that I had annotated based on their unsupervised cluster DEGs (following this vignette using the default parameters).
I am interested in seeing which protocol produces cells most similar to the cells found in the fetal datasets, as well as which fetal timepoints the query datasets tend to map to. I understand that the MappingScore() function can show me query cells that aren't well represented in the reference dataset, so I figured that these scores could tell me which datasets are most similar to the reference dataset. However, when I compare the violin plots of the mapping scores for a query dataset from one of the differentiation protocols with those for a query dataset that contains only pluripotent cells, there are cells with high mapping scores in both cases (see attached images), even though only the differentiated cells should closely resemble the fetal cortical tissue cells. I attached the code as a .txt file.
My question is whether or not the mapping score can be used as an absolute measurement of query to reference dataset similarity or if it is always just a relative measure where the high and low thresholds are set by the query dataset. If the latter, what alternative functions might I use here to get information about absolute similarity?
Thanks.
Attachments:
Pluripotent Cell Mapping Score
Differentiated Cell Mapping Score
Code Used For Mapping

Time series based features for binary classification

Is it possible (and if so, how) to provide time series for binary classification in H2O.ai's Driverless AI? I have a dataframe that looks like this:
ID
Status/Target [0/1]
TimeStamp for events that happened on given ID, in last 90 days
Details of those events (category, description, values, etc...)
Ideally, what I want is to build a model that predicts the status for a given ID, based on the provided history of events.
H2O's Driverless AI supports time-series modeling out of the box. See this section. You need to set your TimeStamp as the "Time Column" and add ID to your "Time Groups Column".
If your target column contains only 0s and 1s, it should automatically be identified as binary. If it is not, you can toggle the problem type from regression to binary classification.

Use of validation_frame in H2O AutoML

Just started with H2O AutoML so apologies in advance if I have missed something basic.
I have a binary classification problem where the data are observations from K years. I want to train on the first K-1 years, then tune the models and select the best one explicitly based on the remaining year K.
If I switch off cross-validation (with nfolds=0) to avoid randomly blending years into the N folds, and define the year-K data as the validation_frame, then no ensemble is created (as expected according to the documentation), but I do need one.
If I train with cross-validation (default nfolds) and define the year-K data as the validation frame:
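from h2o.automl import H2OAutoML
aml = H2OAutoML(max_runtime_secs=3600, seed=1)
# k_minus_1_years and k_year are H2OFrames ("k-1_years" is not a valid Python name)
aml.train(x=x, y=y, training_frame=k_minus_1_years, validation_frame=k_year)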
then according to
http://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html
the validation_frame is ignored
"...By default and when nfolds > 1, cross-validation metrics will be used for early stopping and thus validation_frame will be ignored."
Is there a way to have the models tuned and the best one (ensemble or not) selected based on the year-K data only, while still having the ensemble of models available in the output?
Thanks a lot!
You don't want cross-validation (CV) if you are dealing with time-series (non-IID) data, since you don't want folds from the future to predict the past.
I would explicitly add nfolds=0 so that CV is disabled in AutoML:
aml = H2OAutoML(max_runtime_secs=3600, seed=1, nfolds=0)
aml.train(x=x, y=y, training_frame=k_minus_1_years, validation_frame=k_year)
To get an ensemble, add a blending_frame, which also works for time-series data. See more info here.
Additionally, since you are dealing with time-series data, I would recommend adding time-series transformations (e.g. lags), so that your model gets information from previous years and their aggregates (e.g. a weighted moving average).
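The lag and weighted-moving-average features mentioned above can be sketched in plain Python before the data is loaded into H2O. This is only an illustration under assumed names: the column names ("id", "year", "value"), the function name, and the weights are hypothetical and not part of H2O's API. Each feature uses only earlier observations for the same id, so no information leaks from the future.

```python
from collections import defaultdict


def add_lag_features(rows, id_col="id", time_col="year", value_col="value",
                     weights=(3, 2, 1)):
    """Return copies of rows with 'lag_1' (previous value) and 'wma'
    (weighted moving average of prior values, most recent weighted first)."""
    by_id = defaultdict(list)
    for row in rows:
        by_id[row[id_col]].append(row)

    out = []
    for group in by_id.values():
        group.sort(key=lambda r: r[time_col])  # oldest first
        history = []  # values strictly before the current row
        for row in group:
            new_row = dict(row)
            new_row["lag_1"] = history[-1] if history else None
            recent = history[::-1][:len(weights)]  # most recent first
            if recent:
                used = weights[:len(recent)]
                new_row["wma"] = sum(w * v for w, v in zip(used, recent)) / sum(used)
            else:
                new_row["wma"] = None
            out.append(new_row)
            history.append(row[value_col])
    return out


# Hypothetical input, deliberately unsorted.
rows = [
    {"id": "a", "year": 2021, "value": 40.0},
    {"id": "a", "year": 2019, "value": 10.0},
    {"id": "a", "year": 2020, "value": 20.0},
]
for r in add_lag_features(rows):
    print(r["year"], r["lag_1"], r["wma"])
```

The first observation of each id has no history, so both features are None there; how to handle those rows (drop or impute) is a modeling choice.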

hbase design concat long key-value pairs vs many columns

Please help me understand the best way storing information in HBase.
Basically, I have a rowkey like hashed_uid+date+session_id with metrics like duration, date, time, location, depth and so on.
I have read a lot of material and am a bit confused. People have suggested using fewer column families for better performance, so I am facing three options to choose from:
Have each metric sit in its own row, like rowkey_key cf1->alias1:value
Have many columns, like rowkey cf1->key1:val1, cf1->key2:val2 ...
Have all the key-value pairs encoded into one big string, like rowkey cf1->"k1:v1,k2:v2,k3:v3..."
Thank you in advance. I don't know which to choose. The goal of my HBase design is to prepare for incremental windowing functions over user profiling output, such as percentiles, engagement and summary statistics for the last 60 days. Most likely, I will use Hive for that.
Possibly you are confused by the similar naming of column family and column. These are different concepts in HBase. A column family consists of several columns. This design speeds up data access when you need to read only certain kinds of columns. For example, say you have raw data and processed data: reading the processed data will not touch the raw data if they are stored in separate column families. You can have practically any number of columns per row key, but a row must be stored in one region, so it should be no more than 10GB. The design depends on what you want:
The first variant has no alternative when you need to store a lot of data per row key that can't be stored in one region (more than 10GB).
The second is good when you need to get only a few metrics per single read per row key.
The last variant is suitable when you always get all metrics per single read per row key.
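To make the three options concrete, here is a small sketch that lays out the same session record in each form, using plain Python dictionaries in place of an HBase table. Nothing here calls HBase; the function names, the "value" and "all" qualifiers, and the sample row key are assumptions for illustration only.

```python
def layout_one_metric_per_row(row_key, metrics, family="cf1"):
    """Option 1: one row per metric; the metric name is appended to the row key."""
    return {f"{row_key}_{name}": {f"{family}:value": str(val)}
            for name, val in metrics.items()}


def layout_many_columns(row_key, metrics, family="cf1"):
    """Option 2: one row, one column qualifier per metric."""
    return {row_key: {f"{family}:{name}": str(val)
                      for name, val in metrics.items()}}


def layout_single_blob(row_key, metrics, family="cf1", qualifier="all"):
    """Option 3: one row, one cell holding all pairs as a delimited string.
    Values containing ',' or ':' would need escaping, which is omitted here."""
    blob = ",".join(f"{name}:{val}" for name, val in metrics.items())
    return {row_key: {f"{family}:{qualifier}": blob}}


# Hypothetical record: hashed_uid + date + session_id as the row key.
row_key = "ab12_20240101_s1"
metrics = {"duration": 120, "depth": 3}

print(layout_one_metric_per_row(row_key, metrics))
print(layout_many_columns(row_key, metrics))
print(layout_single_blob(row_key, metrics))
```

With option 2, a read can request just cf1:duration; with option 3, every read fetches and parses the whole string, which is only a good trade when all metrics are always needed together.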

Reducing data with data stage

I've been asked to reduce an existing data model using the DataStage ETL tool.
It's more of an exercise and a way to get to know this program, which I'm very new to.
Of course, the data must be reduced according to some functional rules.
Table : MEMBERSHIP (..,A,B,C) # where A,B,C are different attributes (our filters)
Reducing data from ~700k rows to 7k rows or so.
I was thinking about keeping the same percentage as in the data source.
So if the source has 70% of A, 20% of B and 10% of C, the reduced version would have roughly the same percentages.
I'm looking for the best way to do so and the built-in tools to use (maybe the Aggregator stage?).
Is there any way to do some scripting similar to PL with DataStage ?
I hope I've been clear enough. If you have any advice I'd be very grateful.
Thanks to all of you.
~Whitoo
DataStage does not do percentage-wise reductions.
What you can do is use a Transformer stage or a Filter stage to filter the source data based on certain conditions. But as I said, the conditions have to be very specific (for example, select only those records which have A = [somevalue] or A not= [somevalue]).
DataStage PX has the sample stage that allows you to specify what percent of data you want it to sample: http://datastage4you.blogspot.com/2014/01/sample-stage-in-datastage.html.
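The proportional reduction described in the question (keeping about 70% A, 20% B and 10% C) amounts to stratified sampling. The logic can be sketched in Python as below; this is not DataStage code, and the function name, the "attr" field, and the row counts are hypothetical.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(rows, stratum_of, fraction, seed=0):
    """Sample the same fraction from every stratum so that the reduced
    data keeps roughly the same proportions as the source."""
    rng = random.Random(seed)  # fixed seed for a reproducible sample
    groups = defaultdict(list)
    for row in rows:
        groups[stratum_of(row)].append(row)

    sample = []
    for members in groups.values():
        k = max(1, round(len(members) * fraction))  # keep at least one row
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical source: 70% A, 20% B, 10% C.
rows = ([{"attr": "A"}] * 4900) + ([{"attr": "B"}] * 1400) + ([{"attr": "C"}] * 700)
reduced = stratified_sample(rows, lambda r: r["attr"], fraction=0.01)

counts = Counter(r["attr"] for r in reduced)
print(len(reduced), dict(counts))
```

Sampling each group separately is what preserves the proportions; a single random sample over the whole table would only match them approximately.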

Resources