History server is not receiving any events - hadoop

I am working on a streaming application. I tried to configure the history server to persist the application's events in the Hadoop file system (HDFS). However, it is not logging any events.
I am running Apache Spark 1.4.1 (pyspark) under Ubuntu 14.04 with three nodes.
Here is my configuration:
File - /usr/local/spark/conf/spark-defaults.conf
#In all three nodes
spark.eventLog.enabled true
spark.eventLog.dir hdfs://master-host:port/usr/local/hadoop/spark_log
#in master node
export SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=hdfs://host:port/usr/local/hadoop/spark_log"
Can someone give a list of steps to configure the history server?
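For reference, here is a minimal sketch of the pieces usually involved in this setup (host, port, and paths mirror the placeholders above; spark.history.fs.logDirectory is read by the history daemon, while spark.eventLog.dir is where applications write):
# Create the log directory on HDFS first:
hdfs dfs -mkdir -p /usr/local/hadoop/spark_log
# conf/spark-defaults.conf on every node that submits applications:
#   spark.eventLog.enabled true
#   spark.eventLog.dir hdfs://master-host:port/usr/local/hadoop/spark_log
# conf/spark-env.sh on the host that runs the history server:
#   export SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=hdfs://master-host:port/usr/local/hadoop/spark_log"
# Then start the daemon and check port 18080:
$SPARK_HOME/sbin/start-history-server.sh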

Related

Run Apache Zeppelin as different User

How can I run a Zeppelin interpreter as a different user than the user who started the process?
I want to run Zeppelin as "root" and then launch a Spark application as the "admin" user.
You can keep running Zeppelin as you're currently doing, but start the Spark process separately as that admin user.
The Spark interpreter can be pointed to an external master. Open the Zeppelin interpreter config and change the value of the spark master config key, pointing it to the instance started by the admin user.
In other words, you run Spark and Zeppelin as two separate processes:
# First run spark as admin:
$ /path/to/spark/sbin/start-all.sh
# Then run zeppelin as root:
# /path/to/zeppelin/bin/zeppelin-daemon.sh start
According to the Zeppelin documentation for the Spark interpreter, you can point Zeppelin to a separate master by changing the value of the master configuration.
The default value for this config is local[*], which makes Zeppelin start a spark context just as done through the spark shell.
And just as the Spark shell can be pointed to an external master, you can use a value for the master URL, such as spark://masterhost:7077.
After this change (and possibly a restart), Zeppelin will only be running the driver program, while all the workers and scheduling will be handled by your master.
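A minimal sketch of that change, assuming a standalone master reachable at spark://masterhost:7077 (the URL and paths are examples; the same value can also be set on the spark interpreter's master property in the Zeppelin web UI):
# conf/zeppelin-env.sh
export MASTER=spark://masterhost:7077
# Restart Zeppelin so the interpreter picks up the new master:
/path/to/zeppelin/bin/zeppelin-daemon.sh restart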

Installing Hive 2.1.0 Interactive Query (LLAP) on Kerberized HDP 2.6.2 environment

I had a lot of issues surrounding the installation/activation of Hive 2.1.0 on our HDP 2.6.2 cluster. But finally I got it working, so I wanted to share the steps involved with the community. I got these steps from different sources, which I will also mention below each step. My specifications:
Clustered HDP 2.6.2 (hortonworks) environment
Kerberos
Hive 1.2.1000 -> Hive 2.1.0
Step 1: Enable Hive Interactive Query
Follow the steps on the Hortonworks website. This includes enabling YARN pre-emption and some other YARN settings. After adjusting YARN you can enable Hive Interactive Query via Ambari. You also have to specify a default queue that is at least 20% of your total cluster capacity.
Source
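As a rough sketch, the pre-emption part usually comes down to YARN/Capacity Scheduler properties along these lines (the llap queue name and the 20% value are examples; confirm the exact settings against the Hortonworks guide):
yarn.resourcemanager.scheduler.monitor.enable=true
yarn.scheduler.capacity.root.llap.capacity=20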
Step 2: Kerberos related settings
Make sure you add the following settings to the custom hiveserver2-interactive site in Ambari, where ${REALMNAME} is the name of your LDAP realm.
hive.llap.zk.sm.keytab.file=/etc/security/keytabs/hive.llap.zk.sm.keytab
hive.llap.zk.sm.principal=hive/_HOST@${REALMNAME}
hive.llap.daemon.keytab.file=/etc/security/keytabs/hive.service.keytab
hive.llap.daemon.service.principal=hive/_HOST@${REALMNAME}
Now you have to put those two keytabs (which are basically the same keytab) on every YARN node. This can be done manually or through Ambari (Kerberos service). Make sure those keytabs are owned by hive:hadoop (chown hive:hadoop) and have mode 440 (chmod 440, group read).
Note: you also need a hive user on all those nodes.
Source
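For example, on each YARN node, something along these lines (keytab paths as in the properties above):
chown hive:hadoop /etc/security/keytabs/hive.llap.zk.sm.keytab /etc/security/keytabs/hive.service.keytab
chmod 440 /etc/security/keytabs/hive.llap.zk.sm.keytab /etc/security/keytabs/hive.service.keytab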
Step 3: Zookeeper configuration
It could be that Hive is not recognized by ZooKeeper, which will give ACL errors when trying to start HiveServer2 Interactive. To cope with this issue, I added the right Hive ACL nodes through a ZooKeeper client host.
su -
# First, authenticate with the hive keytab
kinit hive/$(hostname) -kt /etc/security/keytabs/hive.service.keytab
# Second, connect to a zookeeper client on your cluster
/usr/hdp/current/zookeeper-server/bin/zkCli.sh -server ${ZOOKEEPER_CLIENT}
# Third, check the current status of the user-hive acl
getAcl /llap-sasl/user-hive
# Fourth, if these are not there, create the following nodes
create /llap-sasl/user-hive "" sasl:hive:cdrwa,world:anyone:r
create /llap-sasl/user-hive/llap0 "" sasl:hive:cdrwa,world:anyone:r
create /llap-sasl/user-hive/llap0/workers "" sasl:hive:cdrwa,world:anyone:r
# Fifth, change the llap-sasl node to add the user hive
setAcl /llap-sasl sasl:hive:cdrwa,world:anyone:r
Source 1, Source 2
Basically, this should work for Kerberized environments. If you get errors related to ACLs, go back to your ZooKeeper settings and check that everything is fine. If you have errors related to a missing Hive user, check whether the hive user is added correctly to the nodes. If you have an error related to Kerberos (principal or keytabs), check whether the keytabs are on the designated (YARN) nodes with the correct rights.

Hadoop client node installation

I have a 12-node cluster. Its hardware information is:
NameNode : CPU Core i3 2.7 Ghz | 8GB RAM | 500 GB HDD
DataNode : CPU Core i3 2.7 Ghz | 2GB RAM | 500 GB HDD
I have installed Hadoop 2.7.2 using the normal Hadoop installation process on Ubuntu, and it works fine. But I want to add a client machine, and I have no clue how to add one.
Questions:
What is the installation process for the client machine?
How do I run Pig/Hive scripts on that client machine?
The client should have the same copy of the Hadoop distribution and configuration that is present on the NameNode. Only then will the client know on which node the JobTracker/ResourceManager is running, and the IP of the NameNode for accessing HDFS data.
You also need to update /etc/hosts on the client machine with the IP addresses and hostnames of the NameNode and DataNodes.
Note that you shouldn't start any Hadoop services on the client machine.
Steps to follow on the client machine:
create a user account on the cluster, say user1
create an account on the client machine with the same name: user1
configure the client machine to access the cluster machines (SSH without a passphrase, i.e. passwordless login)
copy/get a Hadoop distribution matching the cluster's onto the client machine and extract it to /home/user1/hadoop-2.x.x
copy (or edit) the Hadoop configuration files (*-site.xml) from the NameNode of the cluster; from these the client will know where the NameNode/ResourceManager is running
set the environment variables JAVA_HOME and HADOOP_HOME (/home/user1/hadoop-2.x.x); see the sketch after this list
add the Hadoop bin directory to your PATH: export PATH=$HADOOP_HOME/bin:$PATH
test it out: hadoop fs -ls /, which should list the root directory of the cluster HDFS
you may face some issues, like privileges; you may need to set JAVA_HOME in places like conf/hadoop-env.sh on the client machine. Update/comment with any error you get.
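Minimal sketch of the client-side environment described in the steps above (the Java path and Hadoop version are examples; adjust them to your layout):
# e.g. in /home/user1/.bashrc on the client machine
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/home/user1/hadoop-2.7.2
export PATH=$HADOOP_HOME/bin:$PATH
# quick test against the cluster HDFS
hadoop fs -ls /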
Answers to more questions from comments:
How do I load data from the client node to HDFS? - Just run hadoop fs commands from the client machine: hadoop fs -put /home/user1/data/* /user/user1/data. You can also write shell scripts that run these command(s) if you need to run them many times.
Why am I installing Hadoop on the client if we only use SSH to connect remotely to the master node?
Because the client needs to communicate with the cluster and needs to know where the cluster nodes are. The client will be running Hadoop jobs: hadoop fs commands, Hive queries, hadoop jar commands, Spark jobs, developing MapReduce jobs, etc., and for these the client needs the Hadoop binaries on the client node.
Basically, you are not only using SSH to connect; you are performing operations on the Hadoop cluster from the client node, so you need the Hadoop binaries there. SSH is used by the Hadoop binaries on the client node when you run operations like hadoop fs -ls / from the client node against the cluster (remember adding $HADOOP_HOME/bin to PATH as part of the installation process above).
When you say "we only use ssh", that sounds like connecting to the cluster nodes over SSH when you want to change or access the Hadoop configuration files on the cluster. You do that as part of administrative work, but when you need to run Hadoop commands/jobs against the cluster from the client node, you don't need to SSH manually; the Hadoop installation on the client node takes care of it. Without a Hadoop installation, how could you run Hadoop commands/jobs/queries from the client node against the cluster?
Should the user name 'user1' be the same? What if it is different? - It will still work. You can install Hadoop on the client node under a group user, say qa or dev, and give all users on the client node sudo access under that group. Then, when user1 on the client node needs to run a Hadoop job on the cluster, user1 should be able to sudo -i -u qa and run the Hadoop command from there.

Apache Spark: History server (logging) + non super-user access (HDFS)

I have a working HDFS and a running Spark framework in a remote server.
I am running SparkR applications and hope to see the logs of the completed UI as well.
I followed all the instructions here: Windows: Apache Spark History Server Config
and was able to start the History Server on the server.
However, the event logging only takes place successfully in the HDFS path, and we can only view the Spark History Web UI, when the super-user (the person who started the Hadoop NameNode and the Spark processes) fires a Spark application remotely.
When I run the same application from my own user ID (remotely), although port 18080 shows that a History Server is up and running, it does not log any of my applications.
I have been given read, write and execute access to the folder in HDFS.
The spark-defaults.conf file now looks like this:
spark.eventLog.enabled true
spark.history.fs.logDirectory hdfs://XX.XX.XX.XX:19000/user/logs
spark.eventLog.dir hdfs://XX.XX.XX.XX:19000/user/logs
spark.history.ui.acls.enable false
spark.history.fs.cleaner.enabled true
spark.history.fs.cleaner.interval 1d
spark.history.fs.cleaner.maxAge 7d
Am I missing some permissions or config settings somewhere (Spark? HDFS)?
Any pointers/tips to proceed from here would be appreciated.
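For reference, a sketch of how the permissions on that event log directory could be inspected from the command line (the path is taken from the configuration above; whether permissions are actually the cause here is exactly the open question):
hdfs dfs -ls -d /user/logs
hdfs dfs -getfacl /user/logs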

Setting up Kerberos on HDP 2.1

I have a 2-node Ambari Hadoop cluster running on CentOS 6. Recently I set up Kerberos for the services in the cluster as per the instructions detailed here:
http://docs.hortonworks.com/HDPDocuments/Ambari-1.6.0.0/bk_ambari_security/content/ambari-kerb.html
In addition to the above documentation, I found that you have to add extra configuration for the NameNode web UI and so on (the QuickLinks in the Ambari server console for each of the Hadoop services) to work. Hence I followed the configuration options listed in the question portion of this article to set up HTTP authentication: Hadoop Web Authentication using Kerberos
Also, to create the HTTP secret file, I used the following command to generate the file on node 1, and then copied it to the same folder location on node 2 of the cluster as well:
sudo dd if=/dev/urandom of=/etc/security/keytabs/http_secret bs=1024 count=1
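The copy to node 2 can then be done with something like the following (the hostname is a placeholder):
scp /etc/security/keytabs/http_secret node2:/etc/security/keytabs/http_secret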
I updated the ZooKeeper JAAS client file under /etc/zookeeper/conf/zookeeper_client_jaas.conf to add the following:
Client { com.sun.security.auth.module.Krb5LoginModule required
useKeyTab=false
useTicketCache=true
keyTab='/etc/security/keytabs/zk.service.keytab'
principal='zookeeper/host@realm-name';
};
This step followed from the article: http://blog.spryinc.com/2014/03/configuring-kerberos-security-in.html
When I restarted my Hadoop services, I get a 401 Authentication Required error when I try to access the NameNode UI, NameNode logs, NameNode JMX, and so on. None of the links given in the QuickLinks drop-down is able to connect and pull up the data.
Any thoughts to resolve this error?
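For what it's worth, one common way to test a SPNEGO-protected endpoint from the command line is a sketch like this (the principal, host, and default HDP NameNode UI port 50070 are examples):
# Obtain a Kerberos ticket first:
kinit user@REALM-NAME
# Then request the page with SPNEGO negotiation:
curl --negotiate -u : http://namenode-host:50070/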
