I'd like to train a model using Spark MLlib but then be able to export the model in a platform-agnostic format. Essentially, I want to decouple how models are created and consumed.
My reason for wanting this decoupling is so that I can deploy a model in other projects. E.g.:
Use the model to perform predictions in a separate standalone program which doesn't depend on Spark for the evaluation.
Use the model with existing projects such as OpenScoring and provide APIs which can make use of the model.
Load an existing model back into Spark for high throughput prediction.
Has anyone done something like this with Spark MLlib?
Spark 1.4 now has support for this via PMML model export; see the latest documentation. Not all models can be exported yet (see the JIRA issue SPARK-4587 for the remaining work).
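For a supported model, the export is a one-liner. A minimal sketch, assuming a spark-shell session on Spark 1.4+ where sc is already defined; the input and output paths are illustrative:

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Train one of the models that supports PMML export (KMeans is one of them).
val data = sc.textFile("/tmp/kmeans_data.txt")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
val model = KMeans.train(data, 2, 20)

// Write the model out as PMML, an XML format that non-Spark tools
// (e.g. OpenScoring) can consume.
model.toPMML("/tmp/kmeans.pmml")

// Or get the PMML document as a String.
val pmmlXml = model.toPMML()
```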
HTHs
Being a beginner with the Apache Beam programming model, I would like to know the difference between plain JDBC and JdbcIO. I have developed a simple dataflow using a normal JDBC connection and it is working as expected.
Is it mandatory to use JdbcIO instead of plain JDBC? If yes, what issues do we face when we go with normal JDBC code?
Within a Beam pipeline there are various options for reading and writing out to external sources of data. The most common method is to make use of inbuilt sinks and sources that have been built by the Beam community (Built-in I/O Transforms). These connectors will often have had considerable development effort spent on them and will have been production hardened. For example the BigQueryIO has been used in production for many years, with continuous development throughout that period. The general advice will therefore be to make use of the standard Sinks and Sources whenever possible.
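For the JDBC case specifically, the built-in connector is JdbcIO. A minimal read sketch in Scala against the Beam Java SDK; the driver class, connection URL, and query are purely illustrative:

```scala
import org.apache.beam.sdk.Pipeline
import org.apache.beam.sdk.coders.StringUtf8Coder
import org.apache.beam.sdk.io.jdbc.JdbcIO
import org.apache.beam.sdk.options.PipelineOptionsFactory

val pipeline = Pipeline.create(PipelineOptionsFactory.create())

// Read rows through the built-in JdbcIO source instead of hand-rolled JDBC.
val names = pipeline.apply(
  JdbcIO.read[String]()
    .withDataSourceConfiguration(
      JdbcIO.DataSourceConfiguration.create(
        "org.postgresql.Driver",                  // illustrative driver
        "jdbc:postgresql://localhost:5432/shop")) // illustrative URL
    .withQuery("SELECT name FROM items")
    .withCoder(StringUtf8Coder.of())
    .withRowMapper(new JdbcIO.RowMapper[String] {
      override def mapRow(rs: java.sql.ResultSet): String = rs.getString(1)
    }))

pipeline.run().waitUntilFinish()
```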
However, not all interactions with external data sources should be via Sources and Sinks; there are use cases where hand-built communication from a DoFn to the external source is the correct path. A few examples are below (there are more, of course!):
There is no Sink/Source for the data source, or a Source exists but does not yet support all the switches/modes etc. that you need. Of course, you can always enhance the existing Sink/Source, or build a new I/O connector from scratch if one does not exist; if possible, it would be great to contribute this back to the community :)
You are enriching elements flowing through your streaming pipeline with a small subset of data from a large data set. For example, say you're processing events coming from sales orders and you would like to add information for each item. The information for the items lives in a large multi-TB store, but on average you will only access a small percentage of the data as lookup keys. In this case it makes sense to enrich each element by making an external call to the data store within a DoFn, rather than reading all of the data in as a Source and doing the join operation within the pipeline.
Extra notes / hints:
When calling external systems, keep in mind that Apache Beam is designed to distribute work across many threads, which can place significant load on your external data source. You can often reduce this load by making use of the start and finish bundle annotations (see the sketch after this list):
Java (SDK 2.9.0)
DoFn.StartBundle
DoFn.FinishBundle
Python (SDK 2.9.0)
start_bundle()
finish_bundle()
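To make that concrete, here is a minimal sketch in Scala against the Beam Java SDK. The LookupClient trait and the connect function are hypothetical stand-ins for whatever client your external datastore provides; the point is that the connection is opened once per bundle rather than once per element:

```scala
import org.apache.beam.sdk.transforms.DoFn
import org.apache.beam.sdk.transforms.DoFn.{FinishBundle, ProcessElement, StartBundle}

// Hypothetical external-datastore client; replace with your real client.
trait LookupClient extends Serializable {
  def lookup(key: String): String
  def close(): Unit
}

class EnrichFn(connect: () => LookupClient) extends DoFn[String, String] {
  @transient private var client: LookupClient = _

  @StartBundle
  def startBundle(): Unit = {
    client = connect() // one connection per bundle, not per element
  }

  @ProcessElement
  def processElement(c: ProcessContext): Unit = {
    // Enrich each element with a small lookup against the external store.
    c.output(c.element() + "," + client.lookup(c.element()))
  }

  @FinishBundle
  def finishBundle(): Unit = {
    client.close()
  }
}
```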
I created a UIMA stack using OpenNLP that runs locally across all cores. It does a variety of tasks including reading from a CSV file, inserting text into a database, parsing the text, POS tagging text, chunking text, etc. I also got it to run a variety of tasks across a Spark cluster.
We want to add some machine learning algorithms to the stack and DeepLearning4j came up as a very viable option. Unfortunately, it was not clear how to integrate DL4J within what we currently have or if it simply replicates the stack I have now.
What I have not found on the UIMA, ClearTK, and Deeplearning4j sites is how these three libraries fit together. Does Deeplearning4j implement a set of ClearTK abstract classes that call OpenNLP functions? What benefit does ClearTK provide? Do I need to worry about how Deeplearning4j implements anything within the ClearTK framework?
Thanks!
As far as I understand, you're running a UIMA pipeline which uses some OpenNLP-based AnalysisEngines; so far that's fine.
What is not clear from your question is what you're looking for in terms of features, rather than tooling.
So I think that's the first thing to clarify.
Other than that, Apache UIMA is an architectural framework; there you can integrate OpenNLP, DL4J, ClearTK or anything else that is useful for your unstructured information processing task.
In the Apache OpenNLP project we're doing some experiments with integrating different DL frameworks; you can have a look at https://issues.apache.org/jira/browse/OPENNLP-1009 (current prototypes are based on DL4J).
Since you mentioned you're leveraging an Apache Spark cluster, DL4J might be a good fit as it should integrate smoothly with it.
We only use UIMA as part of a set of interfaces for NLP with DL4J: a tokenizer factory and tokenizer that use UIMA internally for tokenization and sentence segmentation, together with our SentenceIterator interface. That's very different from building your own models with Deeplearning4j itself.
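To illustrate what that integration looks like in practice, a minimal sketch assuming the deeplearning4j-nlp-uima module is on the classpath; the corpus path is illustrative and the exact class names may differ between DL4J versions:

```scala
import org.deeplearning4j.models.word2vec.Word2Vec
import org.deeplearning4j.text.sentenceiterator.UimaSentenceIterator
import org.deeplearning4j.text.tokenization.tokenizerfactory.UimaTokenizerFactory

// UIMA does the sentence segmentation and tokenization; DL4J does the learning.
val sentences = UimaSentenceIterator.createWithPath("/path/to/corpus")
val tokenizer = new UimaTokenizerFactory()

val word2Vec = new Word2Vec.Builder()
  .iterate(sentences)
  .tokenizerFactory(tokenizer)
  .minWordFrequency(5)
  .layerSize(100)
  .build()

word2Vec.fit()
```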
I want to store a model created within Sparkling Water as a binary file so that I can reload it with a different application.
What is the best way?
The support article is outdated and was just demonstrating something we already incorporated into our API. That sample code is using an outdated water.serial.ObjectTreeBinarySerializer, which doesn't exist anymore.
The most convenient way is to use the ModelSerializationSupport.
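A minimal sketch of that approach, assuming a Sparkling Water version that ships water.support.ModelSerializationSupport; the GBM model type and the file path are illustrative:

```scala
import java.net.URI
import hex.tree.gbm.GBMModel
import water.support.ModelSerializationSupport

// Save a trained H2O model (here a GBM) as a binary file, then load it
// back, e.g. from a different application.
def saveAndReload(model: GBMModel): GBMModel = {
  val uri = new URI("file:///tmp/gbm.model")
  ModelSerializationSupport.exportH2OModel(model, uri)
  ModelSerializationSupport.loadH2OModel[GBMModel](uri)
}
```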
I am new to Sparkling Water, I want to ask some quick questions:
Does Sparkling Water support all the algorithms that both Spark MLlib and H2O provide?
Does Sparkling Water itself provide algorithms that Spark MLlib and H2O don't support?
If I want to write code with pure Spark MLlib within a Sparkling Water context, do I have to use H2OContext or other Sparkling Water related APIs?
Per the above 3 questions, I think what I want to understand is how Sparkling Water works. (At present, I know only that Sparkling Water brings Spark and H2O together.)
Thanks.
Follow-up questions (2017-01-11):
I am able to run the AirlinesWithWeatherDemo2 example with run-example.sh successfully, but I have two questions:
The H2O Flow web UI is opened while the application is running (it can be accessed through port 54321), but when the application is finished, the process that opens port 54321 is also shut down (the web UI becomes inaccessible). I would ask: while I am running the example, what functionality does this Flow UI provide, given that it may be short-lived?
Sparkling Water is meant to integrate Spark and H2O. When I submit the example, I only need sparkling-water-assembly_2.11-2.0.3-all as the application jar (it contains the example classes). It looks like, if I want to run H2O algorithms that Sparkling Water doesn't provide, I should add the H2O jars (h2o.jar) as dependent jars?
Yes
Not really, though we are working on wrapping Spark MLlib algorithms so that you can run them from H2O's Flow UI, and on wrapping H2O's algorithms so that you can use them in MLlib pipelines.
You need H2OContext only if you want to run H2O specific functionality.
Sparkling Water simply allows you to run H2O nodes inside Spark nodes, instead of bootstrapping the H2O cluster by hand. This also allows you to use data in both H2O and Spark.
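A minimal sketch of that setup, assuming Sparkling Water 2.x; the conversions in the comments use H2OContext's standard converters:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.h2o.H2OContext

val conf = new SparkConf().setAppName("sparkling-water-example")
val sc = new SparkContext(conf)

// Starts H2O nodes inside the Spark executors; no hand-built H2O cluster.
val hc = H2OContext.getOrCreate(sc)

// Data can now move between the two worlds, e.g.:
//   val h2oFrame  = hc.asH2OFrame(someDataFrame) // Spark -> H2O
//   val dataFrame = hc.asDataFrame(h2oFrame)     // H2O -> Spark
```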
Edit:
None, but you might have a long-running Spark job where you don't exit after doing some initial computation but instead lock the job (and need to kill it somehow). Then you can use the Flow UI as normal. We simply start the HTTP server every time (even for demos); there's no reason not to do it.
You can either use one of our droplets (https://github.com/h2oai/h2o-droplets/tree/master/sparkling-water-droplet), which is a template project: you add your logic in the main class, run ./gradlew shadowJar, and submit the jar with spark-submit; it already contains all the jars. Or, as you mentioned, you'll need to provide (through --jars or --packages) all the necessary dependencies, h2o.jar included.
I know that if you change the Core Data model and you have previously run the app with the old model, you will get a persistent store error. How would you handle changes to the Core Data model so you do not get this error? Is there a way to upgrade an old model so that the already-saved data is not lost?
Core Data comes with a built-in mechanism to handle changes to your model.
Take a look at the Core Data Model Versioning and Data Migration Programming Guide for details.
If 10.6 is your baseline OS then you can use lightweight migration, specifically NSInferMappingModelAutomaticallyOption.
The article I wrote is similar and useful if 10.6 is not your baseline OS.