How to change port of web UI with pysparkling - h2o

I'm just trying to get pysparkling working while changing the port of the web UI. I've looked in the help files, but they seem to reference old versions of Sparkling Water. Currently I am running
from pysparkling import *
hc = H2OContext.getOrCreate(spark)
and it starts up on the default port, 54321. I see there is a conf object to pass in, but I am unsure how to set it correctly. Any help would be appreciated.

This is the script you can use to launch the H2O cluster on a different port:
## Importing Libraries
from pysparkling import *
import h2o
## Setting H2O Conf Object (sc is the SparkContext; in a pyspark shell both sc and spark are predefined)
h2oConf = H2OConf(sc)
h2oConf
## Setting H2O Conf for different port
h2oConf.set_client_port_base(54300)
h2oConf.set_node_base_port(54300)
## Getting H2O Conf Object to see the configuration
h2oConf
## Launching H2O Cluster
hc = H2OContext.getOrCreate(spark, h2oConf)
## Getting H2O Cluster status
h2o.cluster_status()
I have also written a blog post that explains this in detail.
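As a quick check, the h2o Python client that pysparkling attaches should now report the new port in its connection URL. A minimal sketch: 54300 is the base port set above, so the actual port may be 54300 or the next free port chosen from that base.
import h2o
# Print the REST endpoint of the H2O cluster the client is attached to
print(h2o.connection().base_url)  # e.g. http://10.0.0.5:54300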

Related

How to specify the H2O version in Sparkling Water?

In a Databricks notebook, I am trying to load an H2O model that was trained for H2O version 3.30.1.3.
I have installed the version of Sparkling Water which corresponds to the Spark version used for the model training (3.0), h2o-pysparkling-3.0, which I pulled from PyPI.
The Sparkling Water server is using the latest version of H2O rather than the version I need. Maybe there is a way to specify the H2O version when I initiate the Sparkling Water context? Something like this:
import h2o
from pysparkling import H2OContext
from pysparkling.ml import H2OBinaryModel
hc = H2OContext.getOrCreate(h2o_version='3.30.1.3')
model = H2OBinaryModel.read('s3://bucket/model_file')
I run the above code without an argument to H2OContext.getOrCreate() and I get this error:
IllegalArgumentException:
The binary model has been trained in H2O of version
3.30.1.3 but you are currently running H2O version of 3.34.0.6.
Please make sure that running Sparkling Water/H2O-3 cluster and the loaded binary
model correspond to the same H2O-3 version.
Where is the Python API reference for Sparkling Water? If I could find that, I might be able to determine whether there is an H2O version argument for the context initializer, but surprisingly it has been impossible for me to find so far with Google and by poking around in the docs.
Or is this something that's instead handled by installing an H2O version-specific build of Sparkling Water? Or perhaps there's another relevant configuration setting someplace?
Did you try the notebook-scoped libraries concept? Notebook-scoped libraries let you create, modify, save, reuse, and share custom Python environments that are specific to a notebook. When you install a notebook-scoped library, only the current notebook and any jobs associated with that notebook have access to that library. Other notebooks attached to the same cluster are not affected. For reference, see: link
Limitations: Notebook-scoped libraries do not persist across sessions. You must reinstall notebook-scoped libraries at the beginning of each session, or whenever the notebook is detached from a cluster.
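To illustrate the mechanism only: a notebook-scoped install of pinned versions could look like the sketch below. The h2o version is taken from the question; the Sparkling Water version string is a placeholder, since you would need to pick the h2o-pysparkling-3.0 release that is actually built against H2O 3.30.1.3 (check the release table).
%pip install h2o==3.30.1.3
%pip install h2o-pysparkling-3.0==<release built against H2O 3.30.1.3>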

Using the dask labextension to connect to a remote cluster

I'm interested in running a Dask cluster on EMR and interacting with it from inside of a Jupyter Lab notebook running on a separate EC2 instance (e.g. an EC2 instance not within the cluster and not managed by EMR).
The Dask documentation points to dask-labextension as the tool of choice for this use case. dask-labextension relies on a YAML config file (and/or some environment vars) to understand how to talk to the cluster. However, as far as I can tell, this configuration can only be set to point to a local Dask cluster. In other words, you must be in a Jupyter Lab notebook running on an instance within the cluster (presumably on the master instance?) in order to use this extension.
Is my read correct? Is it not currently possible to use dask-labextension with an external Dask cluster?
Dask Labextension can talk to any Dask cluster that is visible from where your web client is running. If you can connect to a dashboard in a web browser then you can copy that same address to the Dask-Labextension search bar and it will connect.
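As a minimal sketch (assuming the EMR scheduler's port, 8786 by default, and its dashboard are reachable from the EC2 instance; the hostname below is a placeholder), you can connect a client to the remote scheduler and take the dashboard address from it:
from dask.distributed import Client
# Connect to the scheduler running on the EMR master node
client = Client("tcp://<emr-master-dns>:8786")
# Dashboard URL; paste this into the dask-labextension panel in JupyterLab
print(client.dashboard_link)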

How to change cluster IP in a replication controller run time

I am using Kubernetes 1.0.3, with a master and 5 minion nodes deployed.
I have an Elasticsearch application that is deployed on 3 nodes using a replication controller, and a service is defined.
Now I have added a new minion node to the cluster and want to run the Elasticsearch container on the new node.
I am scaling my replication controller to 4 so that, based on the node label, the Elasticsearch container is deployed on the new node. Below is my issue; please let me know if there is any solution.
The cluster IP defined in the RC is wrong, as it is not the same as the one in the service.yaml file. Now when I scale the RC, the new node gets the ES container pointing to the wrong cluster IP, due to which the new node does not join the ES cluster. Is there any way I can modify the cluster IP of the deployed RC so that when I scale it, the image is deployed on the new node with the correct cluster IP?
Since I am using an old version, I don't have the kubectl edit command; I tried changing it using the kubectl patch command, but the IP didn't change.
The problem is that I need to do this on a production cluster, so I can't delete the existing pods; the only option is to change the cluster IP of the deployed RC and then scale, so that it picks up the new IP and the image starts accordingly.
Please let me know if there is any way I can do this.
Kubernetes creates that (virtual) ClusterIP for every service.
Whatever you defined in your service definition (which you should have posted along with your question) is being ignored by Kubernetes, if I recall correctly.
I don't quite understand the issue with scaling, but basically, you want to point at the service name (resolved by Kubernetes's internal DNS) rather than the ClusterIP.
E.g., http://myelasticsearchservice instead of http://1.2.3.4
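For illustration only (the service name, namespace, and port below are assumptions, not values from the question), a client running inside the cluster would reach Elasticsearch through the service's DNS name, which Kubernetes resolves to the service's ClusterIP:
import requests
# Hypothetical service "elasticsearch" in the "default" namespace on the
# standard ES HTTP port; pods use the DNS name, never a hard-coded IP.
ES_URL = "http://elasticsearch.default.svc.cluster.local:9200"
print(requests.get(ES_URL + "/_cluster/health").json())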

Nifi Clustering: Embedded Zookeeper setup issue

I have followed the NiFi clustering steps mentioned in the NiFi admin guide, but the NiFi nodes are not forming a working cluster with embedded ZooKeeper. Am I missing something? Please help.
The configuration in zookeeper.properties is as follows. 192.168.99.101 is the local IP address where NiFi is running and listening on port 9090:
clientPort=2181
initLimit=10
autopurge.purgeInterval=24
syncLimit=5
tickTime=2000
dataDir=./state/zookeeper
autopurge.snapRetainCount=30
server.1=192.168.99.101:2888:3888
The configuration pertaining to Zookeeper in nifi.properties is as follows:
nifi.state.management.embedded.zookeeper.start=true
nifi.state.management.embedded.zookeeper.properties=./conf/zookeeper.properties
nifi.zookeeper.connect.string=192.168.99.101:2181
nifi.zookeeper.connect.timeout=3 secs
nifi.zookeeper.session.timeout=3 secs
nifi.zookeeper.root.node=/nifi
nifi.zookeeper.auth.type=
nifi.zookeeper.kerberos.removeHostFromPrincipal=
nifi.zookeeper.kerberos.removeRealmFromPrincipal=
Following the detailed ZooKeeper-based NiFi clustering steps documented in these articles helped: Pierre Villard on NiFi clustering and Elton Atkins on NiFi clustering.
Also, following Matt Clarke's advice about using a dedicated external ZooKeeper ensemble instead of the embedded ZooKeeper helped.
I'm documenting what helped me in case it helps someone else who struggles with a similar problem.
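For reference, a sketch of what a three-node embedded ZooKeeper ensemble would add to the configuration shown above (the hostnames are placeholders; the admin guide also requires a matching myid file under dataDir on each node):
# zookeeper.properties on every node
server.1=nifi-node1.example.com:2888:3888
server.2=nifi-node2.example.com:2888:3888
server.3=nifi-node3.example.com:2888:3888
# nifi.properties on every node
nifi.zookeeper.connect.string=nifi-node1.example.com:2181,nifi-node2.example.com:2181,nifi-node3.example.com:2181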

Submit a Spark application that connects to a Cassandra database from IntelliJ IDEA

I found a similar question here: How to submit code to a remote Spark cluster from IntelliJ IDEA
I want to submit a Spark application to a cluster on which Spark and Cassandra are installed.
My application is on a Windows machine. The application is written in IntelliJ using:
Maven
Scala
Spark
Below is a code snippet:
val spark = SparkSession
.builder().master("spark://...:7077") // the actual code contains the IP of the master node from the cluster
.appName("Cassandra App")
.config("spark.cassandra.connection.host", cassandraHost) // is the same as the IP of the master node from the cluster
.getOrCreate()
val sc = spark.sparkContext
val trainingdata = sc.cassandraTable("sparkdb", "trainingdata").map(a => a.get[String]("attributes"))
The cluster contains two nodes on which Ubuntu is installed. Also, Cassandra and Spark are installed on each node.
When I use local[*] instead of spark://...:7077, everything works fine. However, when I use the version described in this post, I get the following error:
WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
On the cluster, the error is detailed further:
java.lang.ClassNotFoundException: MyApplication$$anonfun$1
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
Also, I want to note that the application written on Windows uses Spark as a Maven dependency.
I would like to know whether it is possible to submit this Spark application from the Windows node to the Ubuntu cluster and, if it is not possible, what alternative I should use. If I have to create a jar from the Scala object, what approach should I use to call the cluster from IntelliJ?
In order to launch your application, it should be present on the cluster; in other words, your packaged jar should reside either in HDFS or on every node of your cluster at the same path. Then you can use an SSH client, the RESTful interface, or whatever else enables triggering the spark-submit command; a sketch of such an invocation is shown below.
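For example (a hedged sketch: the jar path, addresses, and connector version are placeholders; MyApplication is the class name taken from the error above):
spark-submit \
  --master spark://<master-ip>:7077 \
  --class MyApplication \
  --conf spark.cassandra.connection.host=<cassandra-host> \
  --packages com.datastax.spark:spark-cassandra-connector_2.11:2.3.0 \
  /path/on/cluster/myapplication.jar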
