Run Apache Zeppelin as a different user - hadoop

How can I run a Zeppelin interpreter as a different user than the user who started the process?
I want to run Zeppelin as "root" and then launch a Spark application as the "admin" user.

You can keep running Zeppelin as you're currently doing, but start the Spark process separately as that admin user.
The Spark interpreter can be pointed to an external master. Open the Zeppelin interpreter config and change the value of the spark master config key, pointing it to the instance started by the admin user.
In other words, you have one process for spark:
# First run spark as admin:
$ /path/to/spark/sbin/start-all.sh
# Then run zeppelin as root:
$ /path/to/zeppelin/bin/zeppelin-daemon.sh start
According to the Zeppelin documentation for the Spark interpreter, you can point Zeppelin to a separate master by changing the value of the master configuration.
The default value for this config is local[*], which makes Zeppelin start a Spark context just as the Spark shell does.
And just as the Spark shell can be pointed to an external master, you can use a value for the master URL, such as spark://masterhost:7077.
After this change (and possibly a restart), Zeppelin will only be running the driver program, while all the workers and scheduling will be handled by your master.
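If it helps, here is a minimal sketch of that change (masterhost is a placeholder for wherever the admin user started the standalone master):
# Option 1: in the Zeppelin UI, Interpreter -> spark -> set the "master" property
#           from the default local[*] to spark://masterhost:7077
# Option 2: add this line to conf/zeppelin-env.sh, then restart Zeppelin:
export MASTER=spark://masterhost:7077
$ /path/to/zeppelin/bin/zeppelin-daemon.sh restart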

Related

Installing Hive 2.1.0 Interactive Query (LLAP) on Kerberized HDP 2.6.2 environment

I had a lot of issues surrounding the installation/activation of Hive 2.1.0 on our HDP 2.6.2 cluster. But finally I got it working, so I wanted to share the steps involved with the community. I got these steps from different sources, which I will also mention below each step. My specifications:
Clustered HDP 2.6.2 (Hortonworks) environment
Kerberos
Hive 1.2.1000 -> Hive 2.1.0
Step 1: Enable Hive Interactive Query
Follow the steps on the Hortonworks website. This includes enabling YARN pre-emption and some other YARN settings. After adjusting YARN you can enable Hive Interactive Query via Ambari. You also have to specify a default queue that is at least 20% of your total cluster capacity; a sketch of the relevant settings follows below.
Source
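As a rough sketch of what those settings boil down to (the queue name "llap" and the 20/80 split are examples only; Ambari normally manages these values for you):
# yarn-site.xml: enable pre-emption
yarn.resourcemanager.scheduler.monitor.enable=true
# capacity-scheduler.xml: dedicate a queue (here called "llap") with at least 20% capacity
yarn.scheduler.capacity.root.queues=default,llap
yarn.scheduler.capacity.root.llap.capacity=20
yarn.scheduler.capacity.root.llap.maximum-capacity=20
yarn.scheduler.capacity.root.default.capacity=80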
Step 2: Kerberos related settings
Make sure you add the following settings to the custom hiveserver2-interactive site in Ambari, where ${REALMNAME} is the name of your Kerberos realm.
hive.llap.zk.sm.keytab.file=/etc/security/keytabs/hive.llap.zk.sm.keytab
hive.llap.zk.sm.principal=hive/_HOST@${REALMNAME}
hive.llap.daemon.keytab.file=/etc/security/keytabs/hive.service.keytab
hive.llap.daemon.service.principal=hive/_HOST@${REALMNAME}
Now you have to put those two keytabs (basically the same keytab under two names) on every YARN node. This can be done manually or through Ambari (Kerberos service). Make sure those keytabs are owned by hive:hadoop (chown hive:hadoop) and have mode 440 (chmod 440, owner and group read).
Note: you also need a hive user on all those nodes; a sketch of both steps follows below.
Source
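A minimal sketch of what that looks like on a YARN node (the source host and the useradd call are examples; Ambari's Kerberos service can also do the distribution for you):
# Copy the service keytab to the node and reuse it for the LLAP ZK SM keytab
scp somehost:/etc/security/keytabs/hive.service.keytab /etc/security/keytabs/
cp /etc/security/keytabs/hive.service.keytab /etc/security/keytabs/hive.llap.zk.sm.keytab
chown hive:hadoop /etc/security/keytabs/hive.service.keytab /etc/security/keytabs/hive.llap.zk.sm.keytab
chmod 440 /etc/security/keytabs/hive.service.keytab /etc/security/keytabs/hive.llap.zk.sm.keytab
# Make sure the hive user exists on the node
id hive || useradd -g hadoop hive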
Step 3: Zookeeper configuration
It could be that Hive is not recognized by ZooKeeper, which will cause ACL errors when trying to start HiveServer2 Interactive. To cope with this issue I added the right Hive ACL nodes through a ZooKeeper client host.
su -
# First, authenticate with the hive keytab
kinit hive/$(hostname -f) -kt /etc/security/keytabs/hive.service.keytab
# Second, connect to a zookeeper client on your cluster
/usr/hdp/current/zookeeper-server/bin/zkCli.sh -server ${ZOOKEEPER_CLIENT}
# Third, check the current status of the user-hive acl
getAcl /llap-sasl/user-hive
# Fourth, if these nodes do not exist, create them as follows
create /llap-sasl/user-hive "" sasl:hive:cdrwa,world:anyone:r
create /llap-sasl/user-hive/llap0 "" sasl:hive:cdrwa,world:anyone:r
create /llap-sasl/user-hive/llap0/workers "" sasl:hive:cdrwa,world:anyone:r
# Fifth, change the llap-sasl node to add the user hive
setAcl /llap-sasl sasl:hive:cdrwa,world:anyone:r
Source 1, Source 2
Basically, this should work for Kerberized environments. If you get errors related to ACLs, go back to your ZooKeeper settings and check that everything is in order. If you have errors related to a missing Hive user, check whether the hive user was added correctly to the nodes. If you have an error related to Kerberos (principal or keytabs), check that the keytabs are present on the designated (YARN) nodes with the correct permissions.
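For those last two checks, something along these lines on each YARN node is usually enough (a sketch using standard tools, with the paths from step 2):
# Keytabs present with the right owner and mode?
ls -l /etc/security/keytabs/hive.service.keytab /etc/security/keytabs/hive.llap.zk.sm.keytab
# Do the principals inside match hive/_HOST@${REALMNAME}?
klist -kt /etc/security/keytabs/hive.service.keytab
# Does the hive user exist?
id hive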

Spark History Server on Yarn only shows Python application

I have two Spark contexts running on a box, one from Python and one from Scala. They are similarly configured, yet only the Python application appears in the Spark history page pointed to by the YARN tracking URL. Is there extra configuration I am missing here? (Both run in yarn-client mode.)
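One thing worth double-checking (a sketch, not a confirmed answer): whether an application shows up in the History Server depends on that application writing its event log, so both the Python and the Scala app need the event-log settings, e.g. in spark-defaults.conf or via --conf on spark-submit (the directory below is an example):
# Must point at the same directory the History Server reads (spark.history.fs.logDirectory)
spark.eventLog.enabled true
spark.eventLog.dir hdfs:///spark-history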

Apache Spark: History server (logging) + non super-user access (HDFS)

I have a working HDFS and a running Spark framework in a remote server.
I am running SparkR applications and hope to see the logs of the completed UI as well.
I followed all the instructions here: Windows: Apache Spark History Server Config
and was able to start the History Server on the server.
However, the logging only succeeds when the super-user (the person who started the Hadoop NameNode and the Spark processes) fires a Spark application remotely; in that case the logs appear in the HDFS path and we are able to view the Spark History Web UI as well.
When I run the same application from my user ID (remotely), the History Server on port 18080 shows as up and running, but it does not log any of my applications.
I have been given read, write and execute access to the folder in HDFS.
The spark-defaults.conf file now looks like this:
spark.eventLog.enabled true
spark.history.fs.logDirectory hdfs://XX.XX.XX.XX:19000/user/logs
spark.eventLog.dir hdfs://XX.XX.XX.XX:19000/user/logs
spark.history.ui.acls.enable false
spark.history.fs.cleaner.enabled true
spark.history.fs.cleaner.interval 1d
spark.history.fs.cleaner.maxAge 7d
Am I missing some permissions or config settings somewhere (Spark? HDFS)?
Any pointers/tips to proceed from here would be appreciated.
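One possible place to look (a sketch, assuming the HDFS super-user runs these): event logs are written by the application itself, so the spark.eventLog.dir directory must be writable, not just readable, by every submitting user.
# Inspect ownership/permissions of the log directory
hdfs dfs -ls hdfs://XX.XX.XX.XX:19000/user/
# A common setup is group/world-writable with the sticky bit, like /tmp
hdfs dfs -chmod 1777 hdfs://XX.XX.XX.XX:19000/user/logs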

H2O: unable to connect to h2o cluster through python

I have a 5-node Hadoop cluster running HDP 2.3.0. I set up an H2O cluster on YARN as described here.
On running the following command
hadoop jar h2odriver_hdp2.2.jar water.hadoop.h2odriver -libjars ../h2o.jar -mapperXmx 512m -nodes 3 -output /user/hdfs/H2OTestClusterOutput
I get the following output
H2O cluster (3 nodes) is up
(Note: Use the -disown option to exit the driver after cluster formation)
(Press Ctrl-C to kill the cluster)
Blocking until the H2O cluster shuts down...
When I try to execute the command
h2o.init(ip="10.113.57.98", port=54321)
The process remains stuck at this stage. On trying to connect to the web UI using ip:54321, the browser endlessly tries to load the H2O admin page but nothing ever displays.
On forcefully terminating the init process I get the following error
No instance found at ip and port: 10.113.57.98:54321. Trying to start local jar...
However, if I use H2O with Python without setting up an H2O cluster, everything runs fine.
I executed all commands as the root user. The root user has permission to read and write to the /user/hdfs HDFS directory.
I'm not sure if this is a permissions error or whether the port is not accessible.
Any help would be greatly appreciated.
It looks like you are using H2O2 (H2O Classic). I recommend upgrading your H2O to the latest (H2O 3). There is a build specifically for HDP2.3 here: http://www.h2o.ai/download/h2o/hadoop
Running H2O3 is a little cleaner too:
hadoop jar h2odriver.jar -nodes 1 -mapperXmx 6g -output hdfsOutputDirName
Also, 512 MB per node is tiny - what is your use case? I would give the nodes some more memory.
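For example, a rough sketch of a roomier H2O 3 launch plus a quick reachability check before calling h2o.init (node count, memory and output directory are placeholders):
hadoop jar h2odriver.jar -nodes 3 -mapperXmx 4g -output /user/hdfs/h2o_out
# From the machine running Python, check that the node's port answers at all;
# if this hangs, it is a network/firewall problem rather than an H2O problem
curl http://10.113.57.98:54321/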

Hadoop on Google Compute Engine

I am trying to set up a Hadoop cluster in Google Compute Engine through the "Launch click-to-deploy software" feature. I have created 1 master and 1 slave node and tried to start the cluster using the start-all.sh script from the master node, and I got the error "permission denied (publickey)".
I have generated public and private keys on both the slave and master nodes.
Currently I am logged into the master with my username. Is it mandatory to log into the master as the "hadoop" user? If so, what is the password for that user ID?
Please let me know how to overcome this problem.
The deployment creates a hadoop user which owns Hadoop-specific SSH keys that were generated dynamically at deployment time; since start-all.sh uses SSH under the hood, you must do the following:
sudo su hadoop
/home/hadoop/hadoop-install/bin/start-all.sh
Otherwise, your "normal" username doesn't have SSH keys properly set up so you won't be able to launch the Hadoop daemons, as you saw.
Another thing to note is that the deployment should have already started all the Hadoop daemons automatically, so you shouldn't need to manually run start-all.sh unless you're rebooting the daemons after some manual configuration updates. If the daemons weren't running after the deployment ran, you may have encountered some unexpected error during initialization.
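A quick way to check that (a sketch; assumes the JDK's jps tool is on the hadoop user's PATH):
sudo su hadoop
jps
# Expect NameNode / SecondaryNameNode / JobTracker (or ResourceManager) on the master,
# and DataNode / TaskTracker (or NodeManager) on the worker - names vary by Hadoop version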
