h2o SHAP predict contributions with MOJO - h2o

As per the release notes (https://www.h2o.ai/blog/h2o-release-3-26-yau/), SHAP values can be retrieved from a MOJO as well. However, there is no function such as h2o.mojo_predict_contributions or an equivalent?
Once the model is imported: gbm_m = h2o.import_mojo('GBM_model_R.zip')
and h2o.predict_contributions(gbm_m, data) is run (note: data is already an H2O data frame and an H2O cluster is active). Below is the output:
There is another link (http://docs.h2o.ai/sparkling-water/2.2/latest-stable/doc/tutorials/shap_values.html) which doesn't give clear guidance on how to retrieve SHAP values outside of the Sparkling Water version of H2O. How can we extract SHAP values from a MOJO object directly, without the need to spin up a cluster, i.e. via functions such as h2o.mojo_predict_df?
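For reference, here is a minimal sketch of the cluster-based workflow described above, using the h2o Python client (the question uses R; the Python calls below are assumed equivalents, and this still requires a running H2O cluster, so it is not the cluster-free MOJO path being asked about). The scoring file name is a placeholder.

import h2o

h2o.init()                                          # an H2O cluster is still required
mojo_model = h2o.import_mojo("GBM_model_R.zip")     # the MOJO from the question
data = h2o.import_file("scoring_data.csv")          # placeholder data set
contribs = mojo_model.predict_contributions(data)   # one column per feature plus a bias term
print(contribs.head())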

Related

How do I see the training splits for an AutoML Tables training job?

When training, AutoML will create three data splits: training, validation, and test. How do I see these splits when training?
When doing custom code training, these splits will be materialized on GCS/BigQuery with URIs given by the environment variables: AIP_TRAINING_DATA_URI, AIP_VALIDATION_DATA_URI, and AIP_TEST_DATA_URI. Is there something similar for AutoML?
We don't provide the training/validation set if you're using the managed AutoML training API. But you can optionally export the test set when you create the training job. There's a checkbox in the training workflow.
However, if you're using AutoML through the new Tabular Workflows, you will have access to the split training data as an intermediate training artifact.
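For completeness, a small sketch of how a custom-code training job would pick up the split URIs mentioned above; the environment variable names come from the question, everything else is illustrative.

import os

# Populated for custom-code training jobs; not available for managed AutoML training.
train_uri = os.environ.get("AIP_TRAINING_DATA_URI")
val_uri = os.environ.get("AIP_VALIDATION_DATA_URI")
test_uri = os.environ.get("AIP_TEST_DATA_URI")
print(f"train={train_uri} validation={val_uri} test={test_uri}")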

Kedro pipeline on partitioned data

I work on partitioned data (partitioned parquet or SQL table with a "partition" column). I want Kedro to load and save data from a partition I provide at runtime (e.g. kedro run --params partition:A). The number of partitions is large and dynamic.
I use Spark. Is there a way to load/save data the way I need with SparkDataSet or SparkJDBCDataSet?
A quick Google search suggests the Spark JDBC driver can use a timestamp column for partitioning. All Kedro does behind the scenes is pass the catalog load_args and save_args to the native driver, so this may work.
Another way is to use a lifecycle hook like before_pipeline_run, inspect the run parameters, and inject some custom logic at that point, since the --params run arguments are easy to inspect there (see the sketch below).
A last thought: if you subclass the SQL dataset you want to use, you can easily extend it to partition the way you want. You won't easily be able to pass run --params, but it would be easy to retrieve environment variables or custom catalog arguments.
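A minimal sketch of the lifecycle-hook suggestion above, assuming a standard Kedro project. Only before_pipeline_run is Kedro's API; the hook class name and the way the partition value is exposed are illustrative.

import os
from kedro.framework.hooks import hook_impl

class PartitionHook:
    @hook_impl
    def before_pipeline_run(self, run_params, pipeline, catalog):
        # `kedro run --params partition:A` arrives under run_params["extra_params"]
        partition = (run_params.get("extra_params") or {}).get("partition")
        if partition:
            # illustrative: expose it where a custom dataset or node can pick it up
            os.environ["PIPELINE_PARTITION"] = str(partition)

In recent Kedro versions the hook would then be registered in settings.py, e.g. HOOKS = (PartitionHook(),).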

How to do data transformation using Apache NiFi standard processors?

I have to do data transformation using Apache NiFi standard processors for the input data mentioned below. I have to add two new fields, class and year, and drop the extra price fields.
Below are my input data and transformed data.
Input data
Expected output
Disclaimer: I am assuming that your input headers are not dynamic, which means that you can maintain a predictable input schema. If that is true, you can do this with the standard processors as of 1.12.0, but it will require a little work.
Here's a blog post of mine about how to use ScriptedTransformRecord to take input from one schema, build a new data structure and mix it with another schema. It's a bit involved.
I've used that methodology recently to convert a much larger set of data into summary records, so I know it works. The summary of what's involved is this:
Create two schemas, one that matches input and one for output.
Set up ScriptedTransformRecord to use a writer that explicitly sets which schema to use, since ScriptedTransformRecord doesn't support changing the schema configuration internally.
Create a fat jar with Maven or Gradle that compiles your Avro schema into an object that can be used with the NiFi API to expose a static RecordSchema (NiFi API) to your script.
Write a Groovy script that generates a new MapRecord.

Questions about migration, data model and performance of CDH/Impala

I have some questions about migration, data model and performance of Hadoop/Impala.
1. How to migrate an Oracle application to Cloudera Hadoop/Impala
1.1 How to replace an Oracle stored procedure with Impala, M/R, or a Java/Python app.
For example, the original SP includes several parameters and SQL statements. (A rough sketch of the Python-app route is given below, after 1.3.)
1.2 How to replace unsupported or complex SQL, such as OVER (PARTITION BY ...), when moving from Oracle to Impala.
Are there any existing examples or Impala UDFs?
1.3 How to handle update operations, since part of the data has to be updated.
For example, use a data timestamp? Use a storage model that supports updates, like HBase? Or delete the whole data set/partition/directory and insert it again (INSERT OVERWRITE)?
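Regarding 1.1, a hedged sketch of the Java/Python-app route: the stored procedure's parameters become function arguments, and its SQL statements are run against Impala through a client such as impyla. Host, table and column names are placeholders, not taken from the question.

from impala.dbapi import connect

def refresh_daily_sales(run_date):
    # stand-in for an SP that takes one parameter and runs a few statements in order
    conn = connect(host="impala-host", port=21050)   # placeholder endpoint
    cur = conn.cursor()
    # statements are built as plain strings for brevity; real code should validate inputs
    cur.execute(f"INSERT OVERWRITE TABLE sales_daily PARTITION (ds='{run_date}') "
                f"SELECT item_id, SUM(amount) FROM sales_raw "
                f"WHERE ds = '{run_date}' GROUP BY item_id")
    cur.execute(f"COMPUTE INCREMENTAL STATS sales_daily PARTITION (ds='{run_date}')")
    cur.close()
    conn.close()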
2. Data store model, partition design and query performance
2.1 How to choose between an Impala internal table and an external table (CSV, Parquet, HBase)?
For example, if there are several kinds of data, such as existing large data imported from Oracle into Hadoop, new business data arriving in Hadoop, data computed in Hadoop, and frequently updated data in Hadoop, how do we choose the data model? Does it need special attention if the different kinds of data have to be joined?
We have XX TB of data from Oracle; do you have any suggestions about the file format, like CSV or Parquet? Do we need to import the calculated results into an Impala internal table or leave them on HDFS? If those kinds of data can be updated, how should we handle that?
2.2 How to partition the table/external table when joining
For example, there is a huge amount of sensor data, and each record includes a measurement, an acquisition timestamp, and region information.
We need to:
aggregate measurements by region;
query a series of measurements during a certain time interval for a specific sensor or region;
query a specific sensor's data across all time from the huge data set;
query data for all sensors on a specific date.
Would you please provide some suggestions on how to set up the partitions for internal tables and the directory structure for external tables (CSV)?
In addition, for the directory structure, which is better: date=20090101/area=BEIJING or year=2009/month=01/day=01/area=BEIJING? Is there any guide about that?
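To make the partitioning question concrete, here is a hedged illustration (placeholder names, run through impyla) of the second layout being asked about. Each partition column becomes one directory level, so date=20090101/area=BEIJING implies two partition columns, while year=.../month=.../day=.../area=... implies four.

from impala.dbapi import connect

conn = connect(host="impala-host", port=21050)   # placeholder endpoint
cur = conn.cursor()
cur.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS sensor_readings (
        sensor_id STRING,
        measure   DOUBLE,
        ts        TIMESTAMP
    )
    PARTITIONED BY (year INT, month INT, day INT, area STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION '/data/sensor_readings'
""")
# each added partition maps to a directory such as
#   /data/sensor_readings/year=2009/month=1/day=1/area=BEIJING/
cur.execute("ALTER TABLE sensor_readings ADD IF NOT EXISTS "
            "PARTITION (year=2009, month=1, day=1, area='BEIJING')")
cur.close()
conn.close()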

Filter data on row key in Random Partitioner

I'm working on Cassandra Hadoop integration (MapReduce). We have used RandomPartitioner to insert data to gain faster write speed. Now we have to read that data from Cassandra in MapReduce and perform some calculations on it.
Out of all the data we have in Cassandra, we want to fetch data only for particular row keys, but we are unable to do so because of the RandomPartitioner; there is an assertion in the code.
Can anyone please guide me on how to filter data by row key at the Cassandra level itself (I know the data is distributed across the cluster using a hash of the row key)?
Would using secondary indexes (I'm still trying to understand how they work) solve my problem, or is there some other way around it?
I want to use Cassandra MapReduce to calculate some KPIs on data that is stored in Cassandra continuously. Fetching the whole data set from Cassandra every time seems like an overhead to me. The row key I'm using is like "(timestamp/60000)_otherid"; this CF contains references to the row keys of the actual data stored in another CF. So to calculate a KPI, I will work on a particular minute, fetch the data from the other CF, and process it.
When using RandomPartitioner, keys are not sorted, so you cannot do a range query on your keys to limit the data. Secondary indexes work on columns, not keys, so they won't help you either. You have two options for filtering the data:
Choose a data model that allows you to specify a Thrift SlicePredicate, which will give you a range of columns regardless of key, like this:
// Restrict which columns are returned for every input row, regardless of its key.
SlicePredicate predicate = new SlicePredicate().setSlice_range(
        new SliceRange(ByteBufferUtil.bytes(start), ByteBufferUtil.bytes(end),
                       false, Integer.MAX_VALUE));
ConfigHelper.setInputSlicePredicate(conf, predicate);
Or use your map stage to do this by simply ignoring input keys that are outside your desired range.
I am unfamiliar with the Cassandra Hadoop integration, but trying to understand how to use the hash system to query the data yourself is likely the wrong way to go.
I would look at the Cassandra client you are using (Hector, Astyanax, etc.) and ask how to query by row keys with it.
Querying by the row key is a very common operation in Cassandra.
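Illustrative only: the answer above names Java clients (Hector, Astyanax); this sketch shows the same fetch-by-row-key operation with the Python Thrift client pycassa. Keyspace, column family and key are placeholders.

from pycassa.pool import ConnectionPool
from pycassa.columnfamily import ColumnFamily

pool = ConnectionPool("my_keyspace", ["localhost:9160"])   # placeholder cluster
cf = ColumnFamily(pool, "minute_index")                    # placeholder column family
# direct lookup by row key, e.g. one of the "(timestamp/60000)_otherid" keys described above
row = cf.get("26520301_device42")                          # placeholder key
print(row)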
Essentially, if you still want to use the RandomPartitioner and want the ability to do range slices, you will need to create a reverse index (a.k.a. an inverted index). I have answered a similar question here that involved timestamps.
Having the ability to generate your row keys programmatically allows you to emulate a range slice on row keys. To do this, you must write your own InputFormat class and generate your splits manually.
