Automating H2O Flow: run flow from CLI - hadoop

I’ve been an h2o user for a little over a year and a half now, but my work has been limited to the R API; h2o Flow is relatively new to me. If it's new to you as well, it's basically 0xdata's version of iPython. However, iPython lets you export your notebook to a script, and I can't find a similar option in Flow.
I’m at the point of moving a model (built in Flow) to production, and I'm wondering how to automate it. With the R API, after the model was built and saved, I could easily load it in R and make predictions on new data simply by running nohup Rscript <the_file> & from the CLI, but I’m not sure how I can do something similar with Flow, especially since it’s running on Hadoop.
As it currently stands, every run is broken into three pieces with the flow creating a relatively clunky process in the middle:
preprocess data, move it to hdfs
start h2o on hadoop, nslookup the IP address h2o is running on, manually run the flow cell-by-cell
run the post-prediction clean-up and final steps
This is a terribly intrusive production process, and I want to tie up all the loose ends, but Flow is making that rather difficult. To distill the question: is there a way to compress the flow into a Hadoop jar and then later just run the jar with something like hadoop jar <my_flow_jar.jar> ...?
Edit:
Here's the h2o R package documentation. The R API allows you to load an H2O model, so I tried loading the flow (as if it were an H2O model), and unsurprisingly it did not work (failed with a water.api.FSIOException) as it's not technically an h2o model.

This is really late, but h2o Flow models now have auto-generated Java code representing the trained model (called a POJO) that can be cut and pasted (say, from your remote Hadoop session to a local Java file). See here for a quickstart tutorial on how to use the Java object (https://h2o-release.s3.amazonaws.com/h2o/rel-turing/1/docs-website/h2o-docs/pojo-quick-start.html). You'll have to refer to the h2o Java API (https://h2o-release.s3.amazonaws.com/h2o/rel-turing/8/docs-website/h2o-genmodel/javadoc/hex/genmodel/easy/EasyPredictModelWrapper.html) to start customizing how you use the POJO, but you essentially use it as a black box that makes predictions on properly formatted inputs.
Assuming your Hadoop session is remote, replace "localhost" in the example with the IP address of your (remote) Flow session.
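For the automation part of the question, one way to tie the three stages together is a small driver script. What follows is only a sketch under assumptions: the script names (preprocess.R, postprocess.R), the ScoreNewData class (a hypothetical Java program you would write around the exported POJO and h2o-genmodel.jar, as in the quickstart), and the CSV file names are all placeholders, not anything H2O generates for you.

```python
import subprocess
from typing import Callable, List, Sequence

# Hypothetical commands: every script, class, and file name below is a placeholder.
PIPELINE: List[List[str]] = [
    # Step 1: preprocess the new data.
    ["Rscript", "preprocess.R"],
    # Step 2: score with a Java program wrapping the POJO (Unix classpath separator assumed).
    ["java", "-cp", ".:h2o-genmodel.jar", "ScoreNewData", "input.csv", "predictions.csv"],
    # Step 3: post-prediction clean-up and final steps.
    ["Rscript", "postprocess.R"],
]


def run_pipeline(steps: Sequence[Sequence[str]],
                 runner: Callable = subprocess.run) -> int:
    """Run each step in order, stopping at the first failure.

    Returns the number of steps that completed successfully.
    """
    completed = 0
    for cmd in steps:
        result = runner(list(cmd), check=False)
        if result.returncode != 0:
            raise RuntimeError(
                f"step {completed + 1} failed with exit code "
                f"{result.returncode}: {' '.join(cmd)}"
            )
        completed += 1
    return completed
```

Calling run_pipeline(PIPELINE) from cron or under nohup would then replace the manual cell-by-cell run. The runner argument exists only so that the sequencing can be checked without executing anything.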

Related

Magic committer not improving performance in a Spark3+Yarn3+S3 setup

What I am trying to achieve?
I am trying to enable the S3A magic committer for my Spark 3.3.0 application running on a YARN (Hadoop 3.3.1) cluster, to see performance improvements in my app during S3 writes. IIUC, my Spark application is writing about 21 GB of data with 30 tasks in the corresponding Spark stage (see below image).
My setup
I have a server that hosts the Spark client. The Spark client submits the application to the YARN cluster in client mode with PySpark.
What I tried
I am using the following config (set through the Spark conf in PySpark) to enable the committer:
"spark.sql.sources.commitProtocolClass": "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol"
"spark.sql.parquet.output.committer.class": "org.apache.hadoop.mapreduce.lib.output.BindingPathOutputCommitter"
"spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a": "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory"
"spark.hadoop.fs.s3a.committer.name": "magic"
"spark.hadoop.fs.s3a.committer.magic.enabled": "true"
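For reference, the same settings can be kept in one place and turned into spark-submit arguments. This is a minimal sketch: the dictionary just restates the five keys above, and to_submit_args is a hypothetical helper name. In PySpark, each pair could equally be passed to SparkSession.builder.config(key, value).

```python
from typing import Dict, List

# The five settings listed above, collected in one mapping.
MAGIC_COMMITTER_CONF: Dict[str, str] = {
    "spark.sql.sources.commitProtocolClass":
        "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol",
    "spark.sql.parquet.output.committer.class":
        "org.apache.hadoop.mapreduce.lib.output.BindingPathOutputCommitter",
    "spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a":
        "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory",
    "spark.hadoop.fs.s3a.committer.name": "magic",
    "spark.hadoop.fs.s3a.committer.magic.enabled": "true",
}


def to_submit_args(conf: Dict[str, str]) -> List[str]:
    """Turn a config mapping into a flat list of spark-submit --conf arguments."""
    args: List[str] = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args
```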
I also downloaded the spark-hadoop-cloud jar to the jars/ directory of the Spark home on the NodeManagers and on my Spark client server.
Changes that I see after applying the aforementioned configs:
I see a PRE __magic/ directory if I run aws s3 ls <write-path> while the job is running.
I no longer see the warning WARN AbstractS3ACommitterFactory: Using standard FileOutputCommitter to commit work. This is slow and potentially unsafe.
A _SUCCESS file gets created with JSON content. One of the key-value pairs I see in that file is "committer" : "magic".
Hence, I believe my configs are getting applied correctly.
What I expect
I have read in multiple articles that this committer is expected to show a performance boost (e.g. this article claims a 57-77% time reduction). Hence, I expect to see a significant reduction (from 39s) in the "duration" column of my "parquet" stage when I use the configs shared above.
Some other points that might be of value
When I use "spark.sql.sources.commitProtocolClass": "com.hortonworks.spark.cloud.commit.PathOutputCommitProtocol", my app fails with the error java.lang.ClassNotFoundException: com.hortonworks.spark.cloud.commit.PathOutputCommitProtocol.
I have not looked into enabling S3Guard, as S3 now provides strong consistency.
correct. you don't need s3guard
the com.hortonworks binding was for the wip committer work. the binding classes for wiring up spark/parquet are all in spark-hadoop-cloud and have org.apache.spark prefixes. you seem to be ok there
the simple test for what committer is live is to print the JSON _SUCCESS file. If that is a 0 byte file, you are still using the old committer. since yours contains JSON naming the magic committer, it does sound like you are using the new one.
grab the latest spark+hadoop build you can get, there's always ongoing improvements, with hadoop 3.3.5 doing a big enhancement there.
you should see performance improvements compared to the v1 committer, with commit speed O(files) rather than O(data). it is also correct, which the v1 algorithm doesn't offer on s3 (and which v2 doesn't offer anywhere).
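The _SUCCESS check described above can be sketched in a few lines. This assumes you have already fetched the file's bytes (for example with aws s3 cp); the function name is made up for illustration, and the only field it relies on is the "committer" key the question reports seeing.

```python
import json
from typing import Optional


def committer_from_success(raw: bytes) -> Optional[str]:
    """Return the committer named in a _SUCCESS file, or None.

    None means the file was empty (the classic FileOutputCommitter writes a
    zero-byte marker) or did not contain a JSON object with a "committer" key.
    """
    if not raw.strip():
        return None  # zero-byte marker: the old committer is still in use
    try:
        manifest = json.loads(raw.decode("utf-8"))
    except ValueError:
        return None  # not valid UTF-8 or not valid JSON
    if isinstance(manifest, dict):
        return manifest.get("committer")
    return None
```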

Any performance difference between loading ML modules (XQuery / JavaScript) from a physical disk location and loading them from the ML Module DB inside ML?

By default, ML HTTP server will use the Module DB inside ML.
(It seems all ML training materials refer to that type of configuration.)
Any changes to the XQuery programs need to be uploaded into the Module DB first. That can be accomplished with the mlLoadModules or mlReloadModules ml-gradle commands.
CI/CD does not access the ML cluster directly. Everything goes through ml-gradle from a machine dedicated to deploying code to the different ML environments (dev/uat/prod, etc.).
However, it is also possible to configure the ML app server to use the XQuery programs from a physical disk location, as in the screenshot below.
With that configuration, it is not required to reload the programs into the ML Module DB.
The program changes have to be on the ML server itself, so CI/CD will need direct access to the ML cluster. One advantage of this approach is that developers can easily see whether program changes have indeed been deployed, as all changes sit on disk as readable text files.
Questions:
Which way is better? Why?
Is there any ML query performance difference between these two approaches?
For the physical file approach, does it mean that CI/CD will need to deploy the program changes to all the ML hosts in the ML cluster? (I guess this is not a concern if the HTTP server refers to XQuery programs in the Module DB inside ML; the ML cluster will automatically sync the code among the hosts.)
In general, it's recommended to deploy modules to a database rather than the filesystem.
This makes deployment simpler, as you only have to load the module once into the modules database rather than putting the file on every single host. If you use the filesystem, then you need to put those files on every host in the cluster.
With a modules database, if you were to add nodes to the cluster, you don't have to also deploy the modules. You can then also take advantage of High Availability, backup and restore, and all the other features of a database.
Once a module is read, it is loaded into caches, so the performance impact should be negligible.
If you plan to use REST extensions, then you would need a modules database so that the configurations can be installed in that database.
Some might look to use filesystem for simple development on a single node, in which changes saved to the filesystem are made available without re-deploying. However, you could use something like the ml-gradle mlWatch task to auto-deploy modules as they are modified on the filesystem and achieve effectively the same thing using a modules database.

As a Hadoop Regular User, Is There a Way to See Details about Running Jobs?

I do not have access to any CLI on any of the Hadoop nodes, but I have access to the cluster via Hue and Jupyter. The engineering team has also configured the Hadoop UI that shows New, Running, Submitted, Finish, etc. applications. However, it appears all spark jobs have a generic name, for instance, something like this:
HIVE-f23fa1a1-4444-4ab2-1c44-12345a123456
or similar and when I click on the application_id, I get a Failed to read the attempts of the application error. (even for my own jobs). Similarly, spark jobs, which you can normally name using setAppName, are all named generic "Spark-something" because the spark context is already initialized upon bringing up Jupyter on an edge node (i.e. I can't establish a name because one already exists).
Is there a way for an unprivileged Hadoop user to see what job is actually running (i.e. the Hive query or the Spark / Hadoop command), without having some sort of CLI privilege?
I have tried using a few urls that I suspect have job information in them, for instance:
http://cluster_master:<portnum>/history/application_1234123412341234_12345/jobs/ or
http://cluster_master:<portnum>/jobs/application_1234123412341234_12345/
but neither attempt returns any details about the job itself (even for things I named myself within the hive / spark context using setAppName).
Please let me know if there's a better way to ask this question. I am relatively new to Hadoop/Spark. All the reference docs and SO answers I've found assume CLI or privileged access and I can't find any documentation in either Spark or Hadoop that applies to this problem.

How to run mahout from command line with KNN based Item Recommender?

I'm new to mahout and still trying to figure things out.
I'm trying to run a KNN-based recommender using Mahout 0.8 on a Hadoop cluster (distributed recommender). KNN is deprecated in Mahout 0.8, but it is still usable (at least when I build it in Java code).
I have several questions:
(1) Is it true that there are basically two Mahout implementations?
    distributed (runs from the command line)
    non-distributed (runs from a jar file)
(2) Assuming (1) is correct, does Mahout support running a KNN-based recommender from the command line? Can someone point me in the right direction?
(3) Assuming (1) is wrong, how can I build a recommender in Java (I'm using Eclipse) that runs on a Hadoop cluster (distributed)?
Thanks!
KNN is being deprecated because it is being replaced with item-based and user-based cooccurrence recommenders and the ALS-WR recommender, which are better and more modern.
(1) Yes, but not all code has a CLI. For the most part, the CLI jobs in Mahout are Hadoop/distributed jobs that write their output to files in HDFS. These can also be run from jar files with your own code wrapping them, as you must do with the local/non-distributed/non-Hadoop versions, which do not have a CLI. The in-memory recommenders require you to pass in a user ID to get recs, so you have to write code to do that. The Hadoop versions do have a CLI, since they precalculate all recs for all users and put them in files. You'll probably insert them into your DB or serve them up some other way.
(2) No, to my knowledge only the user-based, item-based, and ALS-WR recommenders are supported from the command line. This runs the Hadoop/distributed version of the recommenders. It can of course work on a single machine, even using the local filesystem, since Hadoop can be set up that way.
(3) For the in-memory recommenders, just write your driver code and run them in Eclipse; since Hadoop is not involved, it works fine. If you want to use the Hadoop versions, set up Hadoop on your dev machine to run locally using the local filesystem, and everything works fine in Eclipse. Once you have things debugged, move it to your Hadoop cluster. You can also debug remotely on the cluster, but that is another question altogether.
The latest thing in Mahout recommenders is one that is trained in the background using Hadoop, with the output then indexed by Solr. You then query Solr with the items the user has expressed a preference for. There is no need to precalculate all recs for all users, since they are returned from a Solr query in near real time. This is in Mahout 1.0-SNAPSHOT's mahout/examples/ or here https://github.com/pferrel/solr-recommender
BTW this code is being integrated with Mahout 1.0 and moved to run on Spark instead of Hadoop so even the training step will be much much faster.
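To make the idea of an item-based cooccurrence recommender concrete, here is a toy in-memory sketch. It is not Mahout's implementation (Mahout's jobs use more refined similarity measures and run on Hadoop); it only counts how often two items appear in the same user's preferences and scores unseen items by those counts. The function names and the sample data are invented for illustration.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Set


def cooccurrence(prefs: Dict[str, Set[str]]) -> Dict[str, Dict[str, int]]:
    """Count, for every pair of items, how many users expressed a preference for both."""
    counts: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for items in prefs.values():
        for a, b in combinations(sorted(items), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


def recommend(prefs: Dict[str, Set[str]], user: str, top_n: int = 3) -> List[str]:
    """Rank items the user has not seen by summed cooccurrence with the items they have."""
    counts = cooccurrence(prefs)
    seen = prefs[user]
    scores: Dict[str, int] = defaultdict(int)
    for item in seen:
        for other, n in counts[item].items():
            if other not in seen:
                scores[other] += n
    # Highest score first; ties broken alphabetically for a stable order.
    ranked = sorted(scores, key=lambda i: (-scores[i], i))
    return ranked[:top_n]


# Invented sample data: user ID -> set of preferred item IDs.
SAMPLE_PREFS = {
    "u1": {"a", "b", "c"},
    "u2": {"a", "b"},
    "u3": {"b", "c", "d"},
    "u4": {"a"},
}
```

Note that recommend needs a user ID as input, which is the reason the in-memory recommenders have to be driven from your own code, whereas the Hadoop jobs precalculate recs for every user.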
Update:
I've clarified what can be run from the CLI above.

What is the best components stack for building distributed log aggregator (like Splunk)?

I'm trying to find the best components to build something similar to Splunk in order to aggregate logs from a large number of servers in a computing grid. It should also be distributed, because I get gigabytes of logs every day and no single machine will be able to store them.
I'm particularly interested in something that will work with Ruby and will work on Windows and latest Solaris (yeah, I got a zoo).
I see architecture as:
Log crawler (Ruby script).
Distributed log storage.
Distributed search engine.
Lightweight front end.
The log crawler and the distributed search engine are not in question: logs will be parsed by a Ruby script, and ElasticSearch will be used to index log messages. The front end is also very easy to choose: Sinatra.
My main problem is distributed log storage. I looked at MongoDB, CouchDB, HDFS, Cassandra and HBase.
MongoDB was rejected because it doesn't work on Solaris.
CouchDB doesn't support sharding (smartproxy is required to make it work but this is something I don't want to even try).
Cassandra works great, but it's a disk space hog and it requires running autobalance every day to spread the load between Cassandra nodes.
HDFS looked promising, but the FileSystem API is Java-only and JRuby was a pain.
HBase looked like the best solution around, but deploying and monitoring it is just a disaster: in order to start HBase I need to start HDFS first, check that it started without problems, then start HBase and check it too, and then start the REST service and check that as well.
So I'm stuck. Something tells me HDFS or HBase is the best thing to use as log storage, but HDFS only works smoothly with Java and HBase is just a deploying/monitoring nightmare.
Can anyone share their thoughts or experience building similar systems, using the components I described above or something completely different?
I'd recommend using Flume to aggregate your data into HBase. You could also use the Elastic Search Sink for Flume to keep a search index up to date in real time.
For more, see my answer to a similar question on Quora.
With regard to Java and HDFS: using a scripting tool like BeanShell, you can interact with the HDFS store through its Java API from scripts instead of compiled Java code.

Resources