Machine learning with Spark: data preparation performance problem, MLeap performance

I found a lot of good responses about MLeap, a library that allows fast scoring. It works on the basis of a model converted into an MLeap bundle.
But what about the data preparation stage before scoring?
Is there an effective approach to convert a Spark ML data preparation pipeline (which works during training, but inside the Spark framework) into robust, performant, optimized byte-code?

You can easily serialize your entire PipelineModel (containing both the feature-engineering stages and the trained model) with MLeap.
NOTE: The following code is a bit old and you probably have access to a cleaner API now.
// Imports used below (MLeap's Spark and runtime support; `managed` comes from scala-arm's resource._)
import ml.combust.bundle.BundleFile
import ml.combust.mleap.spark.SparkSupport._
import ml.combust.mleap.runtime.MleapSupport._
import org.apache.spark.ml.bundle.SparkBundleContext
import resource._

// MLeap PipelineModel serialization into a single .zip file
val sparkBundleContext = SparkBundleContext().withDataset(pipelineModel.transform(trainData))
for (bundleFile <- managed(BundleFile(s"jar:file:${mleapSerializedPipelineModel}"))) {
  pipelineModel.writeBundle.save(bundleFile)(sparkBundleContext).get
}

// Deserialize the model from the local filesystem (no Spark dependency required)
val mleapPipeline = (for (bf <- managed(BundleFile(s"jar:file:${modelPath}"))) yield {
  bf.loadMleapBundle().get.root
}).tried.get
Be aware that the tricky part is when you define your own Estimators/Transformers in Spark, as they will need a corresponding MLeap implementation as well.

Related

Is there a Go implementation of fs.ReadDir for Google Cloud Storage?

I am creating a web application in Go.
I have modified my working code so that it can read and write files on both a local filesystem and a bucket of Google Cloud Storage based on a flag.
Basically I included a small package in the middle, and I implemented my-own-pkg.readFile or my-own-pkg.WriteFile and so on...
I have replaced all calls in my code where I read or save files from the local filesystem with calls to my methods.
Finally, these methods include a simple switch case that runs the standard code to read/write locally or the code to read/write from/to a GCP bucket.
My current problem
In some parts I need to perform a ReadDir to get the list of DirEntries and then cycle through them. I do not want to change my code except for replacing os.ReadDir with my-own-pkg.ReadDir.
So far I understand that there is no native function for this in the GCP module. So I suppose (but here I need your help because I am just guessing) that I would need an implementation of fs.FS for GCP. Since it is a new feature of Go 1.16, I guess it's too early to find one.
So I am trying to simply create a my-own-pkg.ReadDir(folderpath) function that does the following:
case "local": { }
case "gcp": {
Use the GCP code sample to list objects in my bucket with Query.Prefix = folderpath and Query.Delimiter = "/".
Then create a slice of my-own-pkg.DirEntry (because fs.DirEntry is just an interface and so it needs to be implemented... :-( ) and return them.
}
In order to do so I also need to implement the fs.DirEntry interface (which requires implementing the FileInfo interface and maybe something else...).
Question 1) Is this the right path to follow to solve my issue, or is there a better way?
Question 2) (Only) if so, does the GCP method that lists objects with a prefix and a delimiter return just files? I can't see a method that also returns the list of prefixes found.
(If I have prefix/file1.txt and prefix/a/file2.txt I would like to get both "file1.txt" and "a" as files and prefixes...)
I hope I was clear enough... This time I can't include code because it's incomplete, but if it helps I can paste what I have.
NOTE: By the way, Go 1.16 allowed me to solve a similar issue elegantly when dealing with assets either embedded or on the filesystem, thanks to the existing implementation of fs.FS and the related ReadDirFS. It would be great if I could follow the same route 🙂
By the way, I will keep studying and experimenting, so if I am successful I will contribute back as well :-)
I think your abstraction layer is good, but you need to know something about Cloud Storage: directories don't exist.
In fact, all objects are placed at the root of the bucket, /, and the fully qualified name of an object is /path/to/object.file. You can filter on a prefix; that returns all objects (i.e. files, because directories don't exist) with the same path prefix.
It's not a full answer to your question, but I'm sure you can rethink and redesign the rest of your code with this particularity in mind.
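To make the object/prefix split concrete: the question is about Go, but the listing semantics are the same across clients, so here is a minimal illustrative sketch using the Python google-cloud-storage client (bucket and prefix names are hypothetical). The Go client's Query.Prefix/Query.Delimiter behave analogously, with the sub-prefixes reported separately from the objects.

from google.cloud import storage

client = storage.Client()

# Listing with a delimiter returns the objects directly under the prefix,
# while the "sub-directories" (really just longer prefixes) come back separately.
blobs = client.list_blobs("my-bucket", prefix="prefix/", delimiter="/")

files = [blob.name for blob in blobs]   # e.g. ["prefix/file1.txt"]
subdirs = sorted(blobs.prefixes)        # e.g. ["prefix/a/"]; populated only after iterating

print(files, subdirs)

So a my-own-pkg.ReadDir for the "gcp" case would merge these two lists: objects become file entries and prefixes become directory entries.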

Tool for periodic mirroring of full data-set from GraphQL endpoint?

I'm looking for a tool that can make a clone of data exposed on a GraphQL API.
Basically something that can run periodically and recursively copy the raw data response to disk, making use of connection-based pagination and cursors to ensure consistent progress of the mirrored content.
Assuming this would be a runner that extracts data 24/7, it would either have to rewrite/transform already copied data or, even better, apply updates in a more event-sourced way to make it easier to provide diff-sets of changes in the source API data.
I'm not aware of any such tool. I'm not sure such a tool will exist, because
retrieving data from GraphQL requires only the thinnest of layers over the existing GraphQL libraries, which are quite feature-rich
the transformation/writing will likely be part of a different tool. I'm sure several tools for this already exist. The simplest example I could think of is Git. Getting a diff is as simple as running git diff after overwriting an existing version-controlled file.
A simple example of retrieving the data, adapted from the graphql-request documentation Quickstart:
import { request, gql } from 'graphql-request'
import { writeFile } from 'fs/promises'
const query = gql`
  {
    Movie(title: "Inception") {
      releaseDate
      actors {
        name
      }
    }
  }
`

request('https://api.graph.cool/simple/v1/movies', query)
  // request() resolves with a plain object, so serialize it before writing to disk
  .then((data) => writeFile('data.json', JSON.stringify(data, null, 2)))
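The snippet above fetches a single response. For the cursor-based (connection-style) pagination the question asks about, the mirroring loop could look roughly like the following sketch, written here in Python with requests; the endpoint URL, the movies connection field, and the page size are all hypothetical.

import json
import requests

ENDPOINT = "https://example.com/graphql"  # hypothetical endpoint

# Hypothetical Relay-style connection query; edges/node/pageInfo follow the usual convention
QUERY = """
query Movies($first: Int!, $after: String) {
  movies(first: $first, after: $after) {
    edges { node { id title releaseDate } }
    pageInfo { hasNextPage endCursor }
  }
}
"""

def mirror(page_size=100):
    cursor = None
    page = 0
    while True:
        resp = requests.post(
            ENDPOINT,
            json={"query": QUERY, "variables": {"first": page_size, "after": cursor}},
            timeout=30,
        )
        resp.raise_for_status()
        connection = resp.json()["data"]["movies"]

        # Persist each raw page to disk; a separate step can diff/transform these files later
        with open(f"movies_page_{page:06d}.json", "w") as fh:
            json.dump(connection["edges"], fh, indent=2)

        if not connection["pageInfo"]["hasNextPage"]:
            break
        cursor = connection["pageInfo"]["endCursor"]
        page += 1

if __name__ == "__main__":
    mirror()

Each raw page lands on disk unchanged, which keeps the mirroring step thin and leaves diffing (e.g. with git diff, as suggested above) to a separate tool.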

Reusing h2o model mojo or pojo file from python

As H2O models are only reusable with the same major version of H2O they were saved with, an alternative is to save the model in MOJO/POJO format. Is there a way these saved models can be reused/loaded from Python code? Or is there any way to keep the model for further development when upgrading the H2O version?
If you want to use your model for scoring via Python, you could use either h2o.mojo_predict_pandas or h2o.mojo_predict_csv. Otherwise, if you want to load a binary model that you previously saved, you will need compatible versions.
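For instance, a minimal sketch of the mojo_predict_pandas route (the file paths here are hypothetical, and the h2o-genmodel.jar that ships alongside the MOJO download is assumed to be available):

import pandas as pd
import h2o

# Hypothetical paths: the exported MOJO zip and the matching h2o-genmodel.jar
mojo_path = "/models/my_gbm.zip"
genmodel_jar = "/models/h2o-genmodel.jar"

test_df = pd.read_csv("/data/test.csv")

# Scores the MOJO locally via a Java subprocess; no running H2O cluster is required
predictions = h2o.mojo_predict_pandas(
    dataframe=test_df,
    mojo_zip_path=mojo_path,
    genmodel_jar_path=genmodel_jar,
)
print(predictions.head())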
Outside of H2O-3 you can look into pyjnius as Tom recommended: https://github.com/kivy/pyjnius
Another alternative is to use pysparkling, if you only need it for scoring:
from pysparkling.ml import H2OMOJOModel
# Load test data to predict
df = spark.read.parquet(test_data_path)
# Load mojo model
mojo = H2OMOJOModel.createFromMojo(mojo_path)
# Make predictions
predictions = mojo.transform(df)
# Show predictions with ground truth (y_true and y_pred)
predictions.select('your_target_column', 'prediction').show()

Is it possible to train XGBoost4J-Spark with tree_method='exact' ?

I intend to use the trained xgboost model with tree_method='exact' in a SparkML pipeline, so I need to use XGBoost4J-Spark; however, the documentation says "Distributed and external memory version only support approximate algorithm." (https://xgboost.readthedocs.io/en/latest//parameter.html). Is there any way to work around this?
Alternatively, I can train the model with C-based xgboost and somehow convert the trained model to XGBoostEstimator, which is a SparkML estimator and seamless to integrate into a SparkML pipeline. Has anyone come across such a converter?
I don't mind running on a single node instead of a cluster as I can afford to wait.
Any insights are appreciated.
So there is this way:
import ml.dmlc.xgboost4j.scala.XGBoost
import ml.dmlc.xgboost4j.scala.spark.XGBoostRegressionModel

// Load the booster trained outside Spark (the "C-based" xgboost model saved as the file "xgb1")
val xgb1 = XGBoost.loadModel("xgb1")

// Wrap it as a Spark ML model so it can be used in a SparkML pipeline
val xgbSpark = new XGBoostRegressionModel(xgb1)
where xgb1 is the model trained with C-based xgboost. There is a problem, however: their predictions don't match. I have reported the issue on the GitHub repo: https://github.com/dmlc/xgboost/issues/3190
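For reference, the single-node "C-based xgboost" model loaded as xgb1 above could be produced with the Python xgboost package, which does support the exact algorithm; a hedged sketch with illustrative data, parameters, and file name:

import numpy as np
import xgboost as xgb

# Toy data standing in for the real training set
X = np.random.rand(1000, 10)
y = np.random.rand(1000)
dtrain = xgb.DMatrix(X, label=y)

params = {
    "objective": "reg:squarederror",
    "tree_method": "exact",   # exact greedy algorithm, available on a single node
    "max_depth": 6,
    "eta": 0.1,
}
bst = xgb.train(params, dtrain, num_boost_round=100)

# Save in XGBoost's native format so it can be loaded from XGBoost4J
bst.save_model("xgb1")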

Is there a supported way to get the list of features used by an H2O model during its training?

This is my situation. I have over 400 features, many of which are probably useless and often zero. I would like to be able to:
train a model with a subset of those features
query that model for the features actually used to build that model
build a H2OFrame containing just those features (I get a sparse list of non-zero values for each row I want to predict.)
pass this newly constructed frame to H2OModel.predict() to get a prediction
I am pretty sure what I found is unsupported but works for now (v 3.13.0.341). Is there a more robust/supported way of doing this?
model._model_json['output']['names']
The response variable appears to be the last item in this list.
In a similar vein, it would be nice to have a supported way of finding out which H2O version the model was built under. I cannot find the version number in the JSON.
If you want to know which feature columns the model used after you have built a model, you can do the following in Python:
my_training_frame = your_model.actual_params['training_frame']
which will return some frame id
and then you can do
col_used = h2o.get_frame(my_training_frame)
col_used
EDITED (after comment was posted)
To get the columns use:
col_used.columns
Also, a quick way to check the version of a saved binary model is to try to load it into H2O: if it loads, it is the same version of H2O; if it doesn't, you will get a warning.
You can also open the saved model file; the first line will list the version of H2O used to create it.
For a model saved as a MOJO you can look at the model.ini file. It will list the version of H2O.
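Tying the (unsupported) _model_json route from the question to the prediction step, a sketch of the subset-and-predict workflow might look like the following; the model and frame ids are hypothetical:

import h2o

h2o.init()

model = h2o.get_model("my_gbm_model")            # hypothetical model id
full_frame = h2o.get_frame("my_scoring_frame")   # hypothetical frame id

# Unsupported but workable: the last entry of 'names' is the response column
used_features = model._model_json['output']['names'][:-1]

# Build a frame containing only the columns the model was trained on, then predict
subset = full_frame[used_features]
predictions = model.predict(subset)
predictions.head()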
