Apache Spark: History server (logging) + non super-user access (HDFS) - hadoop

I have a working HDFS and a running Spark framework in a remote server.
I am running SparkR applications and would like to view the logs of completed applications in the History Server UI as well.
I followed all the instructions here: Windows: Apache Spark History Server Config
and was able to start the History Server on the server.
However, logging to the HDFS path only succeeds, and the completed application only appears in the Spark History Web UI, when the super-user (the person who started the Hadoop NameNode and the Spark processes) submits a Spark application remotely.
When I run the same application remotely from my own user ID, the History Server still shows as up and running on port 18080, but none of my applications are logged.
I have been given read, write, and execute access to the folder in HDFS.
The spark-defaults.conf file now looks like this:
spark.eventLog.enabled true
spark.history.fs.logDirectory hdfs://XX.XX.XX.XX:19000/user/logs
spark.eventLog.dir hdfs://XX.XX.XX.XX:19000/user/logs
spark.history.ui.acls.enable false
spark.history.fs.cleaner.enabled true
spark.history.fs.cleaner.interval 1d
spark.history.fs.cleaner.maxAge 7d
Am I missing some permissions or config settings somewhere (Spark? HDFS)?
Any pointers/tips to proceed from here would be appreciated.
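For reference, a hedged way to check and, if needed, open up the HDFS permissions on that directory (the path comes from the config above; myuser is a placeholder for the non-superuser account, and the commands would be run by the HDFS superuser):
hdfs dfs -ls -d /user/logs                       # confirm current owner, group and mode
hdfs dfs -chmod 1777 /user/logs                  # world-writable with sticky bit, like /tmp
hdfs dfs -setfacl -m user:myuser:rwx /user/logs  # finer-grained; needs dfs.namenode.acls.enabled=true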

Related

Run Apache Zeppelin as different User

How can I run a Zeppelin interpreter as a different user than the user who started the process?
I want to run Zeppelin as "root" and then launch a spark application as "admin" user
You can keep running Zeppelin as you're currently doing, but start the Spark process separately as that admin user.
The Spark interpreter can be pointed to an external master. Open the Zeppelin interpreter config and change the value of the spark master config key, pointing it to the instance started by the admin user.
In other words, you run the two processes separately:
# First run spark as admin:
$ /path/to/spark/sbin/start-all.sh
# Then run zeppelin as root:
$ /path/to/zeppelin/bin/zeppelin-daemon.sh start
According to the Zeppelin documentation for the Spark interpreter, you can point Zeppelin to a separate master by changing the value of the master configuration.
The default value for this config is local[*], which makes Zeppelin start a Spark context just as the Spark shell does.
And just as the Spark shell can be pointed to an external master, you can use a value for the master URL, such as spark://masterhost:7077.
After this change (and possibly a restart), Zeppelin will only run the driver program, while all the workers and scheduling will be handled by your master.
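If you prefer setting this in a file rather than in the interpreter UI, a minimal sketch using conf/zeppelin-env.sh (assuming the admin user's standalone master is at masterhost:7077; paths are placeholders):
# /path/to/zeppelin/conf/zeppelin-env.sh
export MASTER=spark://masterhost:7077   # same value you would pass to spark-shell --master
export SPARK_HOME=/path/to/spark        # the Spark installation started by the admin user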

Getting "User [dr.who] is not authorized to view the logs for application <AppID>" while running a YARN application

I'm running a custom Yarn Application using Apache Twill in HDP 2.5 cluster, but I'm not able to see my own container logs (syslog, stderr and stdout) when I go to my container web page:
The logged-in user also changes from my Kerberos principal to "dr.who" when I navigate to this page.
But I can see the logs of MapReduce jobs. The Hadoop version is 2.7.3, and the cluster has YARN ACLs enabled.
I had this issue with the Hadoop UI. I found in the documentation that hadoop.http.staticuser.user is set to dr.who by default, and you need to override it in the relevant settings file (in my case, core-site.xml).
A late answer, but I hope it is useful.
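In case it helps, a minimal core-site.xml override might look like this (the value below is only an example; use an account that is allowed to view the logs under your ACLs):
<!-- core-site.xml -->
<property>
  <name>hadoop.http.staticuser.user</name>
  <value>yarn</value>  <!-- example value; dr.who is the default -->
</property>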

Spark History Server on Yarn only shows Python application

I have two Spark contexts running on a box, one from Python and one from Scala. They are similarly configured, yet only the Python application appears in the Spark history page pointed to by the YARN tracking URL. Is there extra configuration I am missing here? (Both run in yarn-client mode.)
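One thing worth double-checking is whether both applications were actually submitted with event logging enabled; a hedged sketch of passing the settings explicitly at submit time (the class name, jar, and log path are placeholders):
spark-submit \
  --master yarn-client \
  --conf spark.eventLog.enabled=true \
  --conf spark.eventLog.dir=hdfs:///spark-history \
  --class com.example.MyApp myapp.jar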

spark history server does not show jobs or stages

We are trying to use the Spark history server to further improve our Spark jobs. The job correctly writes its event log into HDFS, and the history server can access it: the job appears in the history server's job listing, but aside from the environment variables and the executors, everything is empty...
Any ideas on how we can make the spark history server show everything (we really want to see the DAG for instance) ?
We are using spark 1.4.1.
Thanks.
I had a similar issue. I browse the history server through SSH port forwarding. After granting read permission on all the files in the log directory, the jobs appeared in my history server!
cd {SPARK_EVENT_LOG_DIR}
chmod +r * # grant the read permission to all users for all files
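If the event log directory lives in HDFS rather than on the local filesystem (as in the other questions here), the equivalent permission change would be along these lines:
hdfs dfs -chmod -R a+r {SPARK_EVENT_LOG_DIR}  # grant read on the event log files to all users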

History server is not receiving any event

I am working on a streaming application. I tried to configure the history server to persist the application's events to the Hadoop file system (HDFS). However, it is not logging any events.
I am running Apache Spark 1.4.1 (pyspark) under Ubuntu 14.04 with three nodes.
Here is my configuration:
File - /usr/local/spark/conf/spark-defaults.conf
#In all three nodes
spark.eventLog.enabled true
spark.eventLog.dir hdfs://master-host:port/usr/local/hadoop/spark_log
#in master node
export SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=hdfs://host:port/usr/local/hadoop/spark_log"
Can someone give list of steps to configure history server?
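For what it's worth, a minimal sketch of the usual setup (the host, port, and paths are taken from the question or are placeholders; note that the SPARK_HISTORY_OPTS export normally lives in conf/spark-env.sh, not in spark-defaults.conf):
# 1. Create the log directory in HDFS and make it writable by the submitting users
hdfs dfs -mkdir -p /usr/local/hadoop/spark_log
hdfs dfs -chmod 1777 /usr/local/hadoop/spark_log
# 2. conf/spark-defaults.conf on the nodes that submit applications
spark.eventLog.enabled true
spark.eventLog.dir hdfs://master-host:port/usr/local/hadoop/spark_log
# 3. conf/spark-env.sh on the node that runs the history server
export SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=hdfs://master-host:port/usr/local/hadoop/spark_log"
# 4. Start the server and browse to it on port 18080
/usr/local/spark/sbin/start-history-server.sh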
