I have a hadoop cluster setup using Ambari which has services like HDFS,YARN,spark running on the hosts.
When i run the sample spark pi in cluster mode as master yarn, the application gets successfully executed and I can view the same from resource manager logs.
But when i click on the history link, it does not show the spark history UI. How to enable/view the same?
First, check if your spark-history server is already configured by looking for spark.yarn.historyServer.address in spark-defaults.conf file.
If not configured, this link should help you configure the server: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.6/bk_installing_manually_book/content/ch19s04s01.html
If already configured, check if the history server host is accessible from all the nodes in the cluster, and also the port is open.
Related
I just installed Hortonwroks Sandbox via virtualbox. And when i started Ambari every services was red like you can see in this screenshot . Have i missed something? i'm a beginner in hadoop
Actually, when we start HDP Sandbox all Services services go into the starting stage except Strome, Atlas, Hbase (This can be checked by Gear Icon on the top right side from there you can check the reason behind of failed Services).
Try to Manually Start services in the following manner
Zookeeper
HDFS
YARN
MapReduce
Hive
I'm new to Apache Hadoop. I've installed a cluster of YARN with one master and two slaves on AWS. When I just start the cluster YARN, I could observe that some applications are launched by user dr.who with app type YARN automatically. It bothers me a lot. Hoping someone could help me out of this. Thanks!
application_1531399885156_0041 dr.who hadoop YARN default Thu Jul 12 14:58:37 +0200 2018 N/A ACCEPTED UNDEFINED ApplicationMaster 0
This is a known bug in latest launch of Hadoop and a JIRA has also been created. The apps submission by dr.who and when the user kills all the jobs then the NodeManager goes down.
EDIT: Problem Resolution
PROBLEM Customer unable to see logs via Resource Manager UI due to incorrect permissions for the default user dr.who.
RESOLUTION Customer changed the following property in core-site.xml to resolve the issue. Other values such as hdfs or mapred also resolve the issue. If the cluster is managed by Ambari, this should be added in Ambari > HDFS > Configurations >Advanced core-site > Add Property
hadoop.http.staticuser.user=yarn
The same threat was posted on Hortonworks and was answered by Sandeep Nemuri who wrote:
Stop further attacks:
a. Use Firewall / IP table settings to allow access only to whitelisted IP addresses for Resource Manager port (default 8088). Do this on both Resource Managers in your HA setup. This only addresses the current attack. To permanently secure your clusters, all HDP end-points ( e.g WebHDFS) must be blocked from open access outside of firewalls.
b. Make your cluster secure (kerberized).
Clean up existing attacks:
a. If you already see the above problem in your clusters, please filter all applications named “MYYARN” and kill them after verifying that these applications are not legitimately submitted by your own users.
b. You will also need to manually login into the cluster machines and check for any process with “z_2.sh” or “/tmp/java” or “/tmp/w.conf” and kill them.
The link to that thread is : dr.who
I have a hadoop cluster setup in AWS EC2, but my development setup(spark) is in local windows system. When I am trying to connect AWS Hive thrift server I able to connect , but it is showing some connection refused error when trying to submit a job from my local spark configuration. Please note in windows my user name is different that the user name for which Hadoop eco system is running in AWS server. Can any one explain me how the underlying system works in this setup?
1) When I am submitting a job from my local Spark to HIVE thrift , if it is associated any MR job , ASW Hive setup will submit that job NN with its own identity or it will carry forward my spark setup identity.
2) In my configuration do I need to run spark in local with same user name as I have for hadoop cluster in AWS ?
3) Do I need to configure SSL also to authenticate my local system?
Please note , my local system is not part of hadoop cluster and I can not include also in AWS Hadoop cluster.
Please let me know what will be actual setup for environment where my hadoop cluster is in AWS and spark is running on my local.
To simplify the problem, you are free to compile your code locally, produce an uber/shaded JAR, SCP to any spark-client in AWS, then run spark-submit --master yarn --class <classname> <jar-file>.
However, if you want to just Spark against EC2 locally, then you can set a few properties programmatically.
Spark submit YARN mode HADOOP_CONF_DIR contents
Alternatively, as mentioned in that post, the best way would be getting your cluster's XML files from HADOOP_CONF_DIR, and copying them over into your application's classpath. This is typically src/main/resources for a Java/Scala application.
Not sure about Python, R, or the SSL configs.
And yes, you need to add a remote user account for your local Windows username on all the nodes. This is how user impersonation will be handled by Spark executors.
I set-up a new Hadoop Cluster with Hortonworks Data Platform 2.5. In the "old" cluster (installed HDP 2.4) I was able to see the information about running Spark jobs via the History Server UI by clicking the link show incomplete applications:
Within the new installation this link opens the page, but it always sais No incomplete applications found! (when there's still an application running).
I just saw, that the YARN ResourceManager UI shows two different kind of links in the "Tracking UI" column, dependent on the status of the Spark application:
application running: Application Master
this link opens http://master_url:8088/proxy/application_1480327991583_0010/
application finished: History
this link opens http://master_url:18080/history/application_1480327991583_0009/jobs/
Via the YARN RM link I can see the running Spark app infos, but why can't I access them via Spark History Server UI? Was there somethings changed from HDP 2.4 to 2.5?
I solved it, it was a network problem: Some of the cluster hosts (Spark slaves) couldn't reach each other due to a incorrect switch configuration. Found it out, as I tried to ping each host from each other.
Since all hosts can ping each other hosts the problem is gone and I can see active and finished jobs in my Spark History server UI again!
I didn't noticed the problem, because the ambari-agents worked on each host, and the ambari-server was also reachable from each cluster host! However, since ALL hosts can reach each other the problem is solved!
I made a spark application that analyze file data. Since input file data size could be big, It's not enough to run my application as standalone. With one more physical machine, how should I make architecture for it?
I'm considering using mesos for cluster manager but pretty noobie at hdfs. Is there any way to make it without hdfs (for sharing file data)?
Spark maintain couple cluster modes. Yarn, Mesos and Standalone. You may start with the Standalone mode which means you work on your cluster file-system.
If you are running on Amazon EC2, you may refer to the following article in order to use Spark built-in scripts that loads Spark cluster automatically.
If you are running on an on-prem environment, the way to run in Standalone mode is as follows:
-Start a standalone master
./sbin/start-master.sh
-The master will print out a spark://HOST:PORT URL for itself. For each worker (machine) on your cluster use the URL in the following command:
./sbin/start-slave.sh <master-spark-URL>
-In order to validate that the worker was added to the cluster, you may refer to the following URL: http://localhost:8080 on your master machine and get Spark UI that shows more info about the cluster and its workers.
There are many more parameters to play with. For more info, please refer to this documentation
Hope I have managed to help! :)