Some automatically launched Hadoop YARN application - hadoop

I'm new to Apache Hadoop. I've installed a YARN cluster with one master and two slaves on AWS. As soon as I start YARN, I can see that applications with app type YARN are being launched automatically by the user dr.who. This bothers me a lot. I'm hoping someone can help me figure out what is going on. Thanks!
application_1531399885156_0041 dr.who hadoop YARN default Thu Jul 12 14:58:37 +0200 2018 N/A ACCEPTED UNDEFINED ApplicationMaster 0

This is a known bug in the latest Hadoop release, and a JIRA has been created for it: apps get submitted as dr.who, and when the user kills all of those jobs, the NodeManager goes down.
EDIT: Problem Resolution
PROBLEM Customer is unable to see logs via the Resource Manager UI due to incorrect permissions for the default user dr.who.
RESOLUTION Customer changed the following property in core-site.xml to resolve the issue. Other values such as hdfs or mapred also resolve it. If the cluster is managed by Ambari, this should be added in Ambari > HDFS > Configurations > Advanced core-site > Add Property
hadoop.http.staticuser.user=yarn
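For reference, the same setting expressed as a core-site.xml entry (a minimal sketch; the value can be yarn, hdfs, or mapred as noted above):
<property>
  <name>hadoop.http.staticuser.user</name>
  <value>yarn</value>
</property>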
The same issue was raised in a thread on the Hortonworks Community and was answered by Sandeep Nemuri, who wrote:
Stop further attacks:
a. Use firewall / iptables rules to allow access to the Resource Manager port (default 8088) only from whitelisted IP addresses (see the iptables sketch after this list). Do this on both Resource Managers in your HA setup. This only addresses the current attack; to permanently secure your clusters, all HDP end-points (e.g. WebHDFS) must be blocked from open access outside of your firewalls.
b. Make your cluster secure (kerberized).
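As a rough illustration of step (a), iptables rules along these lines restrict port 8088 to a single whitelisted address (203.0.113.10 is a placeholder; run the equivalent on each Resource Manager host and adapt it to your own firewall tooling):
# allow the whitelisted client, drop everyone else on the RM web port
iptables -A INPUT -p tcp --dport 8088 -s 203.0.113.10 -j ACCEPT
iptables -A INPUT -p tcp --dport 8088 -j DROP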
Clean up existing attacks:
a. If you already see the above problem in your clusters, please filter all applications named “MYYARN” and kill them after verifying that these applications are not legitimately submitted by your own users.
b. You will also need to log in to the cluster machines manually, check for any process referencing “z_2.sh”, “/tmp/java” or “/tmp/w.conf”, and kill them (a sketch of this check follows below).
The link to that thread is: dr.who
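A quick sketch of the manual check in step (b) — it only lists candidate processes; verify a PID really belongs to the attack before killing it:
# list suspicious processes on each cluster host
ps aux | grep -E 'z_2\.sh|/tmp/java|/tmp/w\.conf' | grep -v grep
# after confirming a PID is malicious:
# kill -9 <PID>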

Related

Spark History UI not working | Ambari | YARN

I have a Hadoop cluster set up using Ambari, with services like HDFS, YARN and Spark running on the hosts.
When I run the sample SparkPi in cluster mode with master yarn, the application executes successfully and I can see it in the Resource Manager logs.
But when I click on the history link, it does not show the Spark History UI. How can I enable/view it?
First, check whether your Spark History Server is already configured by looking for spark.yarn.historyServer.address in the spark-defaults.conf file.
If not configured, this link should help you configure the server: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.6/bk_installing_manually_book/content/ch19s04s01.html
If it is already configured, check that the history server host is reachable from all the nodes in the cluster and that the port is open.
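For orientation, a typical spark-defaults.conf fragment looks roughly like this (the hostname and HDFS path are placeholders; the exact values come from your own setup and the Hortonworks guide above):
spark.eventLog.enabled            true
spark.eventLog.dir                hdfs:///spark-history
spark.history.fs.logDirectory     hdfs:///spark-history
spark.yarn.historyServer.address  historyserver.example.com:18080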

HDP 2.5: Spark History Server UI won't show incomplete applications

I set up a new Hadoop cluster with Hortonworks Data Platform 2.5. In the "old" cluster (HDP 2.4) I was able to see information about running Spark jobs via the History Server UI by clicking the link "show incomplete applications":
In the new installation this link opens the page, but it always says "No incomplete applications found!" (even when there is still an application running).
I just saw, that the YARN ResourceManager UI shows two different kind of links in the "Tracking UI" column, dependent on the status of the Spark application:
application running: Application Master
this link opens http://master_url:8088/proxy/application_1480327991583_0010/
application finished: History
this link opens http://master_url:18080/history/application_1480327991583_0009/jobs/
Via the YARN RM link I can see the running Spark app's info, but why can't I access it via the Spark History Server UI? Did something change from HDP 2.4 to 2.5?
I solved it; it was a network problem: some of the cluster hosts (Spark slaves) couldn't reach each other due to an incorrect switch configuration. I found this out by trying to ping each host from every other host.
Now that all hosts can ping each other, the problem is gone and I can see active and finished jobs in my Spark History Server UI again!
I hadn't noticed the problem before because the ambari-agents were working on each host and the ambari-server was also reachable from every cluster host.
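If you want to check this systematically, a small shell loop like the following can ping every host from every other host (the host names are placeholders, and it assumes passwordless ssh between the hosts):
HOSTS="master slave1 slave2"   # replace with your cluster hosts
for src in $HOSTS; do
  for dst in $HOSTS; do
    [ "$src" = "$dst" ] && continue
    if ssh "$src" "ping -c1 -W2 $dst > /dev/null"; then
      echo "OK   $src -> $dst"
    else
      echo "FAIL $src -> $dst"
    fi
  done
done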

hadoop port configuration within a firewall

I have a customer for whom we manage a Hadoop installation. In the current setup, all the nodes in the cluster have all ports open to each other, but the customer is quite reluctant to keep it that way. Can anyone tell me whether a configuration is possible where we instruct Hadoop to use only a restricted set of ports?
My findings: I have been able to configure a test setup where I opened only the required ports, as per the page mentioned below:
https://hadoop.apache.org/docs/r2.6.2/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
But I still see that the MR jobs are not executed in a distributed manner.
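For reference, restricting ports mostly means pinning each daemon to a fixed port in the *-site.xml files so the firewall can whitelist them. A minimal hdfs-site.xml sketch (the port numbers are just the 2.x defaults from the page above, not a full configuration):
<property>
  <name>dfs.datanode.address</name>
  <value>0.0.0.0:50010</value>
</property>
<property>
  <name>dfs.datanode.http.address</name>
  <value>0.0.0.0:50075</value>
</property>
<property>
  <name>dfs.datanode.ipc.address</name>
  <value>0.0.0.0:50020</value>
</property>
The YARN/MapReduce side (NodeManager and ApplicationMaster ports) needs similar pinning, which may be why the MR jobs still aren't running in a distributed manner behind the firewall.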

Submitting jobs to Spark EC2 cluster remotely

I've set up the EC2 cluster with Spark. Everything works, all master/slaves are up and running.
I'm trying to submit a sample job (SparkPi). When I ssh to the cluster and submit it from there, everything works fine. However, when the driver is created on a remote host (my laptop), it doesn't work. I've tried both modes for --deploy-mode:
--deploy-mode=client:
From my laptop:
./bin/spark-submit --master spark://ec2-52-10-82-218.us-west-2.compute.amazonaws.com:7077 --class SparkPi ec2test/target/scala-2.10/ec2test_2.10-0.0.1.jar
This results in the following warnings/errors repeating indefinitely:
WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
15/02/22 18:30:45 ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 0
15/02/22 18:30:45 ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 1
...and failed drivers appear in the Spark Web UI under "Completed Drivers" with "State=ERROR".
I've tried passing limits for cores and memory to the submit script, but it didn't help...
--deploy-mode=cluster:
From my laptop:
./bin/spark-submit --master spark://ec2-52-10-82-218.us-west-2.compute.amazonaws.com:7077 --deploy-mode cluster --class SparkPi ec2test/target/scala-2.10/ec2test_2.10-0.0.1.jar
The result is:
.... Driver successfully submitted as driver-20150223023734-0007 ... waiting before polling master for driver state ... polling master for driver state
State of driver-20150223023734-0007 is ERROR
Exception from cluster was: java.io.FileNotFoundException: File file:/home/oleg/spark/spark12/ec2test/target/scala-2.10/ec2test_2.10-0.0.1.jar does not exist.
java.io.FileNotFoundException: File file:/home/oleg/spark/spark12/ec2test/target/scala-2.10/ec2test_2.10-0.0.1.jar does not exist.
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:397)
    at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251)
    at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:329)
    at org.apache.spark.deploy.worker.DriverRunner.org$apache$spark$deploy$worker$DriverRunner$$downloadUserJar(DriverRunner.scala:150)
    at org.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunner.scala:75)
So, I'd appreciate any pointers on what is going wrong and some guidance on how to deploy jobs from a remote client. Thanks.
UPDATE:
So, for the second issue in cluster mode: the jar must be visible to every cluster node, so it has to be in a globally accessible location. This solves the IOException, but leads to the same issue as in client mode.
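For example (a sketch; the HDFS path is arbitrary), uploading the jar to HDFS first and pointing spark-submit at that location makes it visible to every worker:
# put the application jar somewhere every node can read
hdfs dfs -put ec2test/target/scala-2.10/ec2test_2.10-0.0.1.jar /user/oleg/
./bin/spark-submit --master spark://ec2-52-10-82-218.us-west-2.compute.amazonaws.com:7077 --deploy-mode cluster --class SparkPi hdfs:///user/oleg/ec2test_2.10-0.0.1.jar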
The documentation at:
http://spark.apache.org/docs/latest/security.html#configuring-ports-for-network-security
lists all the different communication channels used in a Spark cluster. As you can see, there are a bunch where the connection is made from the Executor(s) to the Driver. When you run with --deploy-mode=client, the driver runs on your laptop, so the executors will try to make a connection to your laptop. If the AWS security group that your executors run under blocks outbound traffic to your laptop (which the default security group created by the Spark EC2 scripts doesn't), or you are behind a router/firewall (more likely), they fail to connect and you get the errors you are seeing.
So to resolve it, you have to forward all the necessary ports to your laptop or reconfigure your firewall to allow connections on those ports. Since a number of the ports are chosen at random, this means opening up a wide range of ports, if not all of them, unless you pin them explicitly (see the sketch below). So using --deploy-mode=cluster, or client mode from within the cluster, is probably less painful.
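If you do want to try client mode from outside, one way to shrink the problem is to pin the driver-side ports in spark-defaults.conf so only a handful need to be forwarded (the port numbers are arbitrary examples, and which properties exist depends on your Spark version):
spark.driver.port        7001
spark.blockManager.port  7005
spark.port.maxRetries    4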
I advise against submitting Spark jobs remotely using the port-opening strategy, because it can create security problems and is, in my experience, more trouble than it's worth, especially when you have to troubleshoot the communication layer.
Alternatives:
1) Livy - now an Apache project! http://livy.io or http://livy.incubator.apache.org/
2) Spark Job server - https://github.com/spark-jobserver/spark-jobserver
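With Livy, for instance, a remote submission becomes a plain REST call (the host, port and jar path below are placeholders; the jar must already be somewhere the cluster can read, such as HDFS):
# submit a batch job through Livy's REST API (default port 8998)
curl -X POST -H "Content-Type: application/json" \
  -d '{"file": "hdfs:///jars/ec2test_2.10-0.0.1.jar", "className": "SparkPi"}' \
  http://livy-host:8998/batches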

Errors in setting up HBase on Distributed Hadoop, ZooKeeperServer not running

I'm trying to set up HBase on Hadoop and have been following various great tutorials online by Michael G. Noll and here. Basically everything is fine; my HDFS and MapReduce work well, and the web interface shows that I have 2 nodes (my NameNode is both a NameNode and a DataNode, but that's just for testing purposes).
When I got to the point of installing HBase, that's where I ran into problems; I get lots of different errors. The latest one is in the log file on my slave node:
INFO org.apache.zookeeper.server.NIOServerCnxn: Closed socket connection for client /10.2.xx.xx:43089 (no session established for client)
INFO org.apache.zookeeper.server.NIOServerCnxn: Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
But when I type in
$ zkServer.sh status
It shows the mode that both machines are running in!
Does anyone have any idea what this problem is? Or does anyone know of another guide/tutorial that I can follow to set this up? I've tried following the HBase documentation on setting up HBase on a distributed HDFS, but that doesn't work either.
Thanks for any help offered!
Are both ZooKeeper servers configured in a quorum? If so, have they managed to connect to one another and vote on who is the leader? (This should all be in the logs for both servers.)
ZooKeeper may be running, but if the servers can't communicate with one another (because of firewall rules or misconfiguration, for example), then ZooKeeper will not accept incoming client connections.
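A minimal quorum setup, assuming two hosts named master-node and slave-node (the hostnames and paths are placeholders), looks roughly like this in zoo.cfg on every ZooKeeper host:
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/lib/zookeeper
clientPort=2181
server.1=master-node:2888:3888
server.2=slave-node:2888:3888
Each host additionally needs a myid file in dataDir containing just its own id (1 or 2), and ports 2181, 2888 and 3888 must be open between the hosts.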
