spark submit on edge node - hadoop

I am submitting my spark-submit command through my edge node, using client mode. I access the edge node (which is on the same network as my cluster) from my laptop over SSH. I know that the driver program runs on the edge node; what I want to know is why my Spark job is automatically killed when I close my SSH session with the edge node. Also, does opening the PuTTY connection to the edge node over VPN/wireless internet have any effect on the Spark job, compared to using an Ethernet cable from within the network? At present the spark-submit job is very slow even though the cluster is really powerful! Please help!
Thanks!

You are submitting the job with --master yarn, but you are probably not specifying --deploy-mode cluster, so the driver application (your Java code) runs locally on the edge node machine. With --deploy-mode cluster the driver runs on the cluster itself and is overall more robust.
The Spark job dies when you close the SSH connection because the driver is running inside your terminal session, so closing the session kills it. To avoid this, send the command to the background by adding & at the end of your spark-submit. For example:
spark-submit --master yarn --class foo bar zaz &
This sends the driver into the background; its stdout is still written to your tty, polluting your session, but the process is not killed when you close the SSH connection.
If you don't want your session polluted, you can send stdout to /dev/null instead:
spark-submit --master yarn --class foo bar zaz &>/dev/null &
However, you then won't know why things failed. You can redirect stdout to a file instead of /dev/null.
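Note that, depending on your shell settings, a plain backgrounded job can still receive SIGHUP when the session closes; a common, more defensive variant (reusing the same illustrative class/jar placeholders as above) is to combine nohup with a log redirect:
nohup spark-submit --master yarn --class foo bar zaz > spark-job.log 2>&1 &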
Finally, I strongly recommend not deploying your Spark jobs like this, because if the driver process on the edge node fails for any reason, it kills the job running in the cluster. It also behaves strangely the other way around: a job dying in the cluster (some runtime problem) will not stop or kill your driver on the edge node, which leads to a lot of wasted memory on that machine if you don't take care to manually kill all those stale driver processes.
All of this is avoided by using the --deploy-mode cluster flag in your spark-submit.
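For example, the same illustrative submission in cluster mode would look like this (foo/bar/zaz are placeholders as above):
spark-submit --master yarn --deploy-mode cluster --class foo bar zaz
yarn application -list    # the driver now runs inside YARN; track the application here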

Related

Spark won't run final `saveAsNewAPIHadoopFile` method in yarn-cluster mode

I wrote a Spark application that reads some CSV files (~5-10 GB), transforms the data, and converts it into HFiles. The data is read from and saved to HDFS.
Everything seems to work fine when I run the application in the yarn-client mode.
But when I try to run it as a yarn-cluster application, the final saveAsNewAPIHadoopFile action on my transformed and ready-to-save RDD never seems to run!
In my Spark UI I can see that all the other jobs and their corresponding stages are processed (screenshots omitted).
Here is the last step of my application, where the saveAsNewAPIHadoopFile method is called:
JavaPairRDD<ImmutableBytesWritable, KeyValue> cells = ...
try {
    Connection c = HBaseKerberos.createHBaseConnectionKerberized("userpricipal", "/etc/security/keytabs/user.keytab");
    Configuration baseConf = c.getConfiguration();
    baseConf.set("hbase.zookeeper.quorum", HBASE_HOST);
    baseConf.set("zookeeper.znode.parent", "/hbase-secure");
    Job job = Job.getInstance(baseConf, "Test Bulk Load");
    HTable table = new HTable(baseConf, "map_data");
    HBaseAdmin admin = new HBaseAdmin(baseConf);
    HFileOutputFormat2.configureIncrementalLoad(job, table);
    Configuration conf = job.getConfiguration();
    cells.saveAsNewAPIHadoopFile(outputPath, ImmutableBytesWritable.class, KeyValue.class, HFileOutputFormat2.class, conf);
    System.out.println("Finished!!!!!");
} catch (IOException e) {
    e.printStackTrace();
    System.out.println(e.getMessage());
}
I'm running the application via:
spark-submit --master yarn --deploy-mode cluster --class sparkhbase.BulkLoadAsKeyValue3 --driver-cores 8 --driver-memory 11g --executor-cores 4 --executor-memory 9g /home/myuser/app.jar
When I look into the output directory on HDFS, it is still empty! I'm using Spark 1.6.3 on an HDP 2.5 platform.
So I have two questions: Where does this behavior come from (maybe memory problems)? And what is the difference between yarn-client and yarn-cluster mode (I don't understand it yet, and the documentation isn't clear to me)? Thanks for your help!
It seems that the job doesn't start. Before starting a job, Spark checks the available resources, and I think the available resources are not enough. So try reducing the driver and executor memory, and the driver and executor cores, in your configuration.
Here you can read how to calculate appropriate resource values for the executors and driver: https://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
Your job runs in client mode because in client mode the driver can use all available resources on the node, whereas in cluster mode the resources are limited.
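For example, a more conservative submission might look like the sketch below; the exact numbers are illustrative and should be tuned following the article above:
spark-submit --master yarn --deploy-mode cluster \
  --class sparkhbase.BulkLoadAsKeyValue3 \
  --driver-cores 2 --driver-memory 4g \
  --num-executors 4 --executor-cores 2 --executor-memory 4g \
  /home/myuser/app.jar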
Difference between cluster and client mode:
Client:
Driver runs on a dedicated server (Master node) inside a dedicated process. This means it has all available resources at its disposal to execute work.
Driver opens up a dedicated Netty HTTP server and distributes the JAR files specified to all Worker nodes (big advantage).
Because the Master node has dedicated resources of its own, you don't need to "spend" worker resources for the Driver program.
If the driver process dies, you need an external monitoring system to reset its execution.
Cluster:
Driver runs on one of the cluster's Worker nodes. The worker is chosen by the Master leader.
Driver runs as a dedicated, standalone process inside the Worker.
The Driver program takes up at least 1 core and a dedicated amount of memory from one of the workers (this can be configured).
Driver program can be monitored from the Master node using the --supervise flag and be reset in case it dies.
When working in Cluster mode, all JARs related to the execution of your application need to be publicly available to all the workers. This means you can either manually place them in a shared place or in a folder for each of the workers.
I found out that this problem is related to a Kerberos issue! When I run the application in yarn-client mode from my Hadoop NameNode, the driver runs on that node, which is also where my Kerberos server runs. Therefore the principal used in /etc/security/keytabs/user.keytab is present on this machine.
When running the app in yarn-cluster mode, the driver process is started on an arbitrary one of my Hadoop nodes. Since I forgot to copy the key files to the other nodes after creating them, the driver process of course couldn't find the keytab file at that local path!
So, to be able to work with Spark in a Kerberized Hadoop cluster (even in yarn-cluster mode), you have to copy the needed keytab files of the user who runs the spark-submit command to the corresponding path on all nodes of the cluster:
scp /etc/security/keytabs/user.keytab user@workernode:/etc/security/keytabs/user.keytab
After that you should be able to run kinit -kt /etc/security/keytabs/user.keytab user on each node of the cluster.
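As an aside, spark-submit on YARN also has --principal and --keytab options (Spark 1.4+) that ship the keytab with the application, which may reduce the need for manual copying; whether that covers the explicit keytab path used inside the HBase code above depends on your setup, so treat this as a sketch to test:
spark-submit --master yarn --deploy-mode cluster \
  --principal userpricipal --keytab /etc/security/keytabs/user.keytab \
  --class sparkhbase.BulkLoadAsKeyValue3 /home/myuser/app.jar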

terminating a spark step in aws

I want to set up a series of Spark steps on an EMR Spark cluster, and terminate the current step if it's taking too long. However, when I ssh into the master node and run hadoop jobs -list, the master node seems to believe that there are no jobs running. I don't want to terminate the cluster, because doing so would force me to buy a whole new hour of whatever cluster I'm running. Can anyone please help me terminate a Spark step in EMR without terminating the entire cluster?
That's easy:
yarn application -kill [application id]
You can list your running applications with:
yarn application -list
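Putting both steps together (the application id below is made up, and -appStates narrows the list to running applications on Hadoop 2.x):
yarn application -list -appStates RUNNING
yarn application -kill application_1487000000000_0042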
You can also kill the application from the Resource Manager (in the links at the top right under cluster status).
In the Resource Manager, click on the application you want to kill; on the application page there is a small "kill" link (top left) you can click to kill the application.
Obviously you can also SSH in, but this way is, I think, faster and easier for some users.

Understand Spark: Cluster Manager, Master and Driver nodes

Having read this question, I would like to ask additional questions:
The Cluster Manager is a long-running service, on which node it is running?
Is it possible that the Master and the Driver nodes will be the same machine? I presume that there should be a rule somewhere stating that these two nodes should be different?
In the case where the Driver node fails, who is responsible for re-launching the application? And what will happen exactly, i.e. how will the Master node, Cluster Manager and Worker nodes get involved (if they do), and in which order?
Similarly to the previous question: in the case where the Master node fails, what will happen exactly and who is responsible for recovering from the failure?
1. The Cluster Manager is a long-running service, on which node it is running?
In Spark standalone mode the Cluster Manager is the Master process; it can be started on any node with ./sbin/start-master.sh. On YARN it is the Resource Manager.
2. Is it possible that the Master and the Driver nodes will be the same machine? I presume that there should be a rule somewhere stating that these two nodes should be different?
Master is per cluster, and Driver is per application. For standalone/yarn clusters, Spark currently supports two deploy modes.
In client mode, the driver is launched in the same process as the client that submits the application.
In cluster mode, however, the driver is launched from one of the Workers (standalone) or inside the YARN application master (YARN), and the client process exits as soon as it fulfils its responsibility of submitting the application, without waiting for the app to finish.
If an application is submitted with --deploy-mode client on the Master node, both Master and Driver will be on the same node. Check the deployment of a Spark application over YARN, and see the sketch below.
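For reference, the two modes differ only in the --deploy-mode flag passed to spark-submit; the class and jar names below are placeholders:
# driver runs on the machine where you invoke spark-submit
spark-submit --master yarn --deploy-mode client --class com.example.MyApp myapp.jar
# driver runs inside the YARN application master on the cluster
spark-submit --master yarn --deploy-mode cluster --class com.example.MyApp myapp.jar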
3. In the case where the Driver node fails, who is responsible for re-launching the application? And what will happen exactly? i.e. how the Master node, Cluster Manager and Workers nodes will get involved (if they do), and in which order?
If the driver fails, all executor tasks will be killed for that submitted/triggered Spark application.
4. In the case where the Master node fails, what will happen exactly and who is responsible for recovering from the failure?
Master node failures are handled in two ways.
Standby Masters with ZooKeeper:
Utilizing ZooKeeper to provide leader election and some state storage, you can launch multiple Masters in your cluster connected to the same ZooKeeper instance. One will be elected “leader” and the others will remain in standby mode. If the current leader dies, another Master will be elected, recover the old Master’s state, and then resume scheduling. The entire recovery process (from the time the first leader goes down) should take between 1 and 2 minutes. Note that this delay only affects scheduling new applications – applications that were already running during Master failover are unaffected. Check here for configurations.
Single-Node Recovery with Local File System:
ZooKeeper is the best way to go for production-level high availability, but if you just want to be able to restart the Master if it goes down, FILESYSTEM mode can take care of it. When applications and Workers register, they have enough state written to the provided directory so that they can be recovered upon a restart of the Master process. Check here for conf and more details.
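Both recovery modes are configured through system properties on the Master (and Worker) daemons, typically via SPARK_DAEMON_JAVA_OPTS in conf/spark-env.sh; the hostnames and directories below are illustrative:
# Standby Masters with ZooKeeper
SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181 -Dspark.deploy.zookeeper.dir=/spark"
# Single-node recovery with the local file system
SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=FILESYSTEM -Dspark.deploy.recoveryDirectory=/var/spark/recovery"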
The Cluster Manager is a long-running service, on which node it is running?
A cluster manager is just a manager of resources, i.e. CPUs and RAM, that SchedulerBackends use to launch tasks.
A cluster manager does nothing more for Apache Spark than offer resources; once Spark executors launch, they communicate directly with the driver to run tasks.
You can start a standalone master server by executing:
./sbin/start-master.sh
It can be started on any node.
To run an application on the Spark cluster:
./bin/spark-shell --master spark://IP:PORT
Is it possible that the Master and the Driver nodes will be the same machine?
I presume that there should be a rule somewhere stating that these two nodes should be different?
In standalone mode, when you start your machines, certain JVMs will start: the Spark Master will start up, and on each machine a Worker JVM will start and register with the Spark Master.
Together they act as the resource manager. When you start or submit your application, a Driver JVM will start up: on the machine where you ran spark-submit in client mode, or on one of the Workers in cluster mode.
The Driver JVM will contact the Spark Master to request executors (Ex), and in standalone mode the Worker will start the Ex.
So Spark Master is per cluster and Driver JVM is per application.
In case where the Driver node fails, who is responsible of re-launching the application? and what will happen exactly?
i.e. how the Master node, Cluster Manager and Workers nodes will get involved (if they do), and in which order?
If an Ex (executor) JVM crashes, the Worker JVM will restart it, and if a Worker JVM crashes, the Spark Master will restart it.
And with a Spark standalone cluster in cluster deploy mode, you can also specify --supervise to make sure the driver is automatically restarted if it fails with a non-zero exit code; the Spark Master will start the Driver JVM.
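For example (standalone cluster mode; the master URL, class and jar are placeholders):
spark-submit --master spark://master-host:7077 --deploy-mode cluster --supervise --class com.example.MyApp myapp.jar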
Similarly to the previous question: In case where the Master node fails,
what will happen exactly and who is responsible of recovering from the failure?
A failure of the Master will result in executors not being able to communicate with it, so they will stop working. A failure of the Master will also make the driver unable to communicate with it for job status, so your application will fail.
Master loss will be acknowledged by the running applications, but otherwise they should continue to work more or less as if nothing happened, with two important exceptions:
1. The application won't be able to finish in an elegant way.
2. If the Spark Master is down, a Worker will try to reregisterWithMaster. If this fails multiple times, workers will simply give up.
reregisterWithMaster() -- re-register with the active master this worker has been communicating with. If there is none, it means this worker is still bootstrapping and hasn't established a connection with a master yet, in which case we should re-register with all masters.
It is important to re-register only with the active master during failures; if the worker unconditionally attempts to re-register with all masters, a race condition may arise, as detailed in SPARK-4592:
At this moment long-running applications won't be able to continue processing, but it still shouldn't result in immediate failure.
Instead, the application will wait for a master to come back online (file-system recovery) or for contact from a new leader (ZooKeeper mode), and if that happens it will continue processing.

Submitting jobs to Spark EC2 cluster remotely

I've set up an EC2 cluster with Spark. Everything works; all master/slave nodes are up and running.
I'm trying to submit a sample job (SparkPi). When I ssh to the cluster and submit it from there, everything works fine. However, when the driver is created on a remote host (my laptop), it doesn't work. I've tried both modes for --deploy-mode:
--deploy-mode=client:
From my laptop:
./bin/spark-submit --master spark://ec2-52-10-82-218.us-west-2.compute.amazonaws.com:7077 --class SparkPi ec2test/target/scala-2.10/ec2test_2.10-0.0.1.jar
This results in the following warnings/errors repeating indefinitely:
WARN TaskSchedulerImpl: Initial job has not accepted any resources;
check your cluster UI to ensure that workers are registered and have
sufficient memory 15/02/22 18:30:45
ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 0 15/02/22 18:30:45
ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 1
...and failed drivers appear in the Spark Web UI under "Completed Drivers" with "State=ERROR".
I've tried to pass limits for cores and memory to the submit script, but it didn't help...
--deploy-mode=cluster:
From my laptop:
./bin/spark-submit --master spark://ec2-52-10-82-218.us-west-2.compute.amazonaws.com:7077 --deploy-mode cluster --class SparkPi ec2test/target/scala-2.10/ec2test_2.10-0.0.1.jar
The result is:
.... Driver successfully submitted as driver-20150223023734-0007 ...
waiting before polling master for driver state ... polling master for
driver state State of driver-20150223023734-0007 is ERROR Exception
from cluster was: java.io.FileNotFoundException: File
file:/home/oleg/spark/spark12/ec2test/target/scala-2.10/ec2test_2.10-0.0.1.jar
does not exist. java.io.FileNotFoundException: File
file:/home/oleg/spark/spark12/ec2test/target/scala-2.10/ec2test_2.10-0.0.1.jar
does not exist. at
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:397)
at
org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:329) at
org.apache.spark.deploy.worker.DriverRunner.org$apache$spark$deploy$worker$DriverRunner$$downloadUserJar(DriverRunner.scala:150)
at
org.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunner.scala:75)
So, I'd appreciate any pointers on what is going wrong and some guidance on how to deploy jobs from a remote client. Thanks.
UPDATE:
So for the second issue: in cluster mode the jar file must be globally visible to each cluster node, i.e. it has to be in an accessible location. This solves the IOException, but then leads to the same issue as in client mode.
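One way to do that is to push the jar to HDFS (or S3) and pass that URI to spark-submit; the HDFS path below is illustrative, and this assumes the worker nodes can resolve it:
hdfs dfs -put ec2test/target/scala-2.10/ec2test_2.10-0.0.1.jar /user/oleg/
./bin/spark-submit --master spark://ec2-52-10-82-218.us-west-2.compute.amazonaws.com:7077 --deploy-mode cluster --class SparkPi hdfs:///user/oleg/ec2test_2.10-0.0.1.jar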
The documentation at:
http://spark.apache.org/docs/latest/security.html#configuring-ports-for-network-security
lists all the different communication channels used in a Spark cluster. As you can see, there are a bunch where the connection is made from the Executor(s) to the Driver. When you run with --deploy-mode=client, the driver runs on your laptop, so the executors will try to make a connection to your laptop. If the AWS security group that your executors run under blocks outbound traffic to your laptop (which the default security group created by the Spark EC2 scripts doesn't), or you are behind a router/firewall (more likely), they fail to connect and you get the errors you are seeing.
So to resolve it, you have to forward all the necessary ports to your laptop, or reconfigure your firewall to allow connections to those ports. Since a bunch of the ports are chosen at random, this means opening up a wide range of ports, if not all of them. So using --deploy-mode=cluster, or running client mode from within the cluster, is probably less painful.
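If you nevertheless want to keep --deploy-mode=client from a remote machine, one commonly suggested workaround is to pin the otherwise random ports to fixed values so only a handful need to be forwarded; the port numbers below are arbitrary, and other channels listed in the security documentation may need pinning too:
./bin/spark-submit --master spark://ec2-52-10-82-218.us-west-2.compute.amazonaws.com:7077 \
  --conf spark.driver.port=51000 \
  --conf spark.blockManager.port=51100 \
  --conf spark.port.maxRetries=16 \
  --class SparkPi ec2test/target/scala-2.10/ec2test_2.10-0.0.1.jar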
I advise against submitting Spark jobs remotely using the port-opening strategy, because it can create security problems and is, in my experience, more trouble than it's worth, especially due to having to troubleshoot the communication layer.
Alternatives:
1) Livy - now an Apache project! http://livy.io or http://livy.incubator.apache.org/
2) Spark Job server - https://github.com/spark-jobserver/spark-jobserver

Stopping a Hadoop 2x container

Can someone tell me how to kill a container? I see that nodes are still running containers even after the application is finished, and I want to know the command to kill them. Because of this issue, my subsequent applications stay in the ACCEPTED state.
Thanks
hadoop job -list
This gives you the jobs that are running, with their JobIDs.
To kill a job:
hadoop job -kill JobID
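Note that the hadoop job command is deprecated in Hadoop 2.x; the mapred and yarn equivalents are (the ids below are placeholders):
mapred job -list
mapred job -kill job_1487000000000_0042
yarn application -kill application_1487000000000_0042   # for YARN applications such as Spark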
If the YARN application is finished and some containers are still running, I'd say this is a bug somewhere. Is this an MR app? I don't think there are any commands to kill individual containers, and anyway those should be handled by the NodeManager. The ResourceManager and NodeManager should kill all containers when the application is finished.
You didn't provide any info on what this app is, the Hadoop version, the operating system, etc. Having said that, I once had a problem on my Ubuntu hosts, which were affected by the HADOOP-9752 bug that prevented the NodeManager from killing a container.
