Spark won't run final `saveAsNewAPIHadoopFile` method in yarn-cluster mode - hadoop

I wrote a Spark application that reads some CSV files (~5-10 GB), transforms the data, and converts it into HFiles. The data is read from and saved to HDFS.
Everything seems to work fine when I run the application in yarn-client mode.
But when I try to run it as a yarn-cluster application, the process seems not to run the final saveAsNewAPIHadoopFile action on my transformed and ready-to-save RDD!
Here is a snapshot of my Spark UI, where you can see that all the other jobs are processed, along with the corresponding stages (screenshots omitted).
Here is the last step of my application, where the saveAsNewAPIHadoopFile method is called:
JavaPairRDD<ImmutableBytesWritable, KeyValue> cells = ...
try {
    // Obtain a Kerberos-authenticated HBase connection and its configuration
    Connection c = HBaseKerberos.createHBaseConnectionKerberized("userprincipal", "/etc/security/keytabs/user.keytab");
    Configuration baseConf = c.getConfiguration();
    baseConf.set("hbase.zookeeper.quorum", HBASE_HOST);
    baseConf.set("zookeeper.znode.parent", "/hbase-secure");

    // Configure the job for an incremental (HFile) bulk load into the target table
    Job job = Job.getInstance(baseConf, "Test Bulk Load");
    HTable table = new HTable(baseConf, "map_data");
    HBaseAdmin admin = new HBaseAdmin(baseConf);
    HFileOutputFormat2.configureIncrementalLoad(job, table);
    Configuration conf = job.getConfiguration();

    // Write the prepared cells as HFiles into the HDFS output path
    cells.saveAsNewAPIHadoopFile(outputPath, ImmutableBytesWritable.class, KeyValue.class, HFileOutputFormat2.class, conf);
    System.out.println("Finished!!!!!");
} catch (IOException e) {
    e.printStackTrace();
    System.out.println(e.getMessage());
}
I'm running the application via:
spark-submit --master yarn --deploy-mode cluster --class sparkhbase.BulkLoadAsKeyValue3 --driver-cores 8 --driver-memory 11g --executor-cores 4 --executor-memory 9g /home/myuser/app.jar
When I look into the output directory in HDFS, it is still empty! I'm using Spark 1.6.3 on an HDP 2.5 platform.
So I have two questions here: Where does this behavior come from (maybe memory problems)? And what is the difference between yarn-client and yarn-cluster mode (I don't understand it yet, and the documentation isn't clear to me)? Thanks for your help!

It seems that the job doesn't start. Before starting a job, Spark checks the available resources, and I think the available resources are not enough. So try to reduce the driver and executor memory, as well as the driver and executor cores, in your configuration.
Here you can read how to calculate appropriate resource values for the executors and the driver: https://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
Your job runs in client mode because, in client mode, the driver can use all available resources on the node. But in cluster mode, resources are limited.
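As an illustration (in Scala; the values below are hypothetical and not a recommendation for your cluster), the executor settings can also be expressed in a SparkConf. Note that in yarn-cluster mode the driver memory has to be given to spark-submit, because the driver JVM is already running by the time your code executes:

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical, reduced resource settings for a small cluster.
// Pass --driver-memory to spark-submit instead of setting it here,
// since in yarn-cluster mode the driver JVM has already started.
val conf = new SparkConf()
  .setAppName("BulkLoadAsKeyValue3")
  .set("spark.executor.instances", "4")
  .set("spark.executor.cores", "2")
  .set("spark.executor.memory", "4g")
val sc = new SparkContext(conf)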
Difference between cluster and client mode:
Client:
- The driver runs on a dedicated server (the Master node) inside a dedicated process. This means it has all available resources at its disposal to execute work.
- The driver opens up a dedicated Netty HTTP server and distributes the specified JAR files to all Worker nodes (a big advantage).
- Because the Master node has dedicated resources of its own, you don't need to "spend" worker resources for the driver program.
- If the driver process dies, you need an external monitoring system to restart it.
Cluster:
- The driver runs on one of the cluster's Worker nodes. The worker is chosen by the Master leader.
- The driver runs as a dedicated, standalone process inside the Worker.
- The driver program takes up at least 1 core and a dedicated amount of memory from one of the workers (this can be configured).
- The driver program can be monitored from the Master node using the --supervise flag and restarted in case it dies.
- When working in cluster mode, all JARs related to the execution of your application need to be publicly available to all the workers. This means you can either place them manually in a shared location or in a folder on each of the workers.
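If it helps to see the distinction from inside an application, here is a minimal sketch (assuming the standard spark.submit.deployMode and spark.driver.host properties that spark-submit sets) that logs which mode was used and on which host the driver ended up:

import java.net.InetAddress
import org.apache.spark.{SparkConf, SparkContext}

// Minimal sketch: log the deploy mode and the host running the driver JVM.
val sc = new SparkContext(new SparkConf().setAppName("where-is-my-driver"))
val mode = sc.getConf.get("spark.submit.deployMode", "client")
val driverHost = sc.getConf.get("spark.driver.host", InetAddress.getLocalHost.getHostName)
println(s"deploy mode = $mode, driver host = $driverHost")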

I found out that this problem is related to a Kerberos issue! When running the application in yarn-client mode from my Hadoop NameNode, the driver runs on that node, which is also where my Kerberos server is running. Therefore, the keytab /etc/security/keytabs/user.keytab for the used userprincipal is present on this machine.
When running the app in yarn-cluster mode, the driver process is started on an arbitrary one of my Hadoop nodes. As I forgot to copy the keytab files to the other nodes after creating them, the driver process of course couldn't find the keytab file at that local location!
So, to be able to work with Spark in a Kerberized Hadoop cluster (even in yarn-cluster mode), you have to copy the needed keytab files of the user who runs the spark-submit command to the corresponding path on all nodes of the cluster:
scp /etc/security/keytabs/user.keytab user@workernode:/etc/security/keytabs/user.keytab
After that, you should be able to run kinit -kt /etc/security/keytabs/user.keytab user on each node of the cluster.
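As a quick safeguard, here is a minimal sketch (in Scala) of a fail-fast check on the driver; the keytab path is the illustrative one from above:

import java.io.File
import java.net.InetAddress

// Fail fast if the keytab is not readable on whichever node the driver landed on.
// In yarn-cluster mode the driver may be scheduled on any node, so the keytab
// has to exist at the same local path everywhere.
val keytabPath = "/etc/security/keytabs/user.keytab"
val keytab = new File(keytabPath)
require(keytab.canRead,
  s"Keytab $keytabPath is missing or unreadable on ${InetAddress.getLocalHost.getHostName}")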

Related

Understanding of Hadoop parallel processing

I'm very new to Hadoop and recently configured Hadoop inside VirtualBox with Ubuntu. The NameNode and ResourceManager are configured on independent machines, together with 3 separate DataNodes and one client node.
After reading some more articles, I got the understanding that MapReduce jobs run on multiple nodes in parallel.
To test my understanding, I have written a MapReduce program that uses the host name of the system as the key in the map function; I did this because I want to observe the parallelism.
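A minimal sketch of what such a mapper might look like (the class name and value types here are illustrative, not the actual code):

import java.net.InetAddress
import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
import org.apache.hadoop.mapreduce.Mapper

// Illustrative sketch: emit the name of the host executing this map task as the key,
// so the reducer output shows which machines actually ran map tasks.
class HostNameMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
  private val one = new IntWritable(1)
  private val host = new Text(InetAddress.getLocalHost.getHostName)

  override def map(key: LongWritable, value: Text,
                   context: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit = {
    context.write(host, one)
  }
}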
I have loaded the data into HDFS: 200 MB of data with a 64 MB block size, and I confirmed that all 3 DataNodes hold blocks.
After exporting the jar and running it from the client using yarn jar as well as hadoop jar, my expectation was that I would get the three DataNode names inside the reducer output, but it shows the client system's name.
Can you please explain how this execution (hadoop jar) works? Does it run my MapReduce jar on all three nodes, and if so, why does it show the client host name instead of the three DataNodes?

Submit a Spark application that connects to a Cassandra database from IntelliJ IDEA

I found a similar question here: How to submit code to a remote Spark cluster from IntelliJ IDEA
I want to submit a Spark application to a cluster on which Spark and Cassandra are installed.
My Application is on a Windows OS. The application is written in IntelliJ using:
Maven
Scala
Spark
Below is a code snippet:
val spark = SparkSession
  .builder().master("spark://...:7077") // the actual code contains the IP of the master node from the cluster
  .appName("Cassandra App")
  .config("spark.cassandra.connection.host", cassandraHost) // the same as the IP of the master node from the cluster
  .getOrCreate()

val sc = spark.sparkContext
val trainingdata = sc.cassandraTable("sparkdb", "trainingdata").map(a => a.get[String]("attributes"))
The cluster contains two nodes on which Ubuntu is installed. Cassandra and Spark are installed on each node.
When I use local[*] instead of spark://...:7077, everything works fine. However, when I use the version described in this post, I get the following error:
WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
On the cluster, the error is detailed further:
java.lang.ClassNotFoundException: MyApplication$$anonfun$1
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
Also, I want to note that the application written on Windows uses Spark as a Maven dependency.
I would like to know if it is possible to submit this Spark application from the Windows node to the Ubuntu cluster, and if it is not possible, what alternative I should use. If I have to create a jar from the Scala object, what approach should I use to call the cluster from IntelliJ?
In order to launch your application, it has to be available to the cluster; in other words, your packaged jar should reside either in HDFS or on every node of your cluster at the same path. Then you can use an SSH client, a RESTful interface, or whatever else enables triggering the spark-submit command.
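As a related note, the java.lang.ClassNotFoundException: MyApplication$$anonfun$1 on the workers usually means the executors never received the application jar. One sketch of addressing this from code, assuming you build a fat jar with sbt-assembly or the Maven shade plugin (the jar path below is illustrative), is to tell Spark to ship it:

import org.apache.spark.sql.SparkSession

// Sketch: register the packaged application jar so executors can load the
// application's classes, including the anonymous functions created by map().
val spark = SparkSession
  .builder()
  .master("spark://<master-ip>:7077")
  .appName("Cassandra App")
  .config("spark.cassandra.connection.host", cassandraHost)
  .config("spark.jars", "C:/path/to/myapp-assembly.jar") // illustrative path to the built fat jar
  .getOrCreate()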

Make spark environment for cluster

I made a Spark application that analyzes file data. Since the input file data could be big, it's not enough to run my application standalone on one machine. With one more physical machine, how should I set up the architecture for it?
I'm considering using Mesos as the cluster manager, but I'm pretty much a newbie at HDFS. Is there any way to do it without HDFS (for sharing file data)?
Spark supports a couple of cluster modes: YARN, Mesos, and Standalone. You may start with Standalone mode, which means you work on your cluster's file system.
If you are running on Amazon EC2, you may refer to the following article in order to use Spark's built-in scripts that bring up a Spark cluster automatically.
If you are running in an on-prem environment, the way to run in Standalone mode is as follows:
- Start a standalone master:
./sbin/start-master.sh
- The master will print out a spark://HOST:PORT URL for itself. For each worker (machine) in your cluster, use that URL in the following command:
./sbin/start-slave.sh <master-spark-URL>
- To validate that the worker was added to the cluster, open http://localhost:8080 on your master machine to see the Spark UI, which shows more info about the cluster and its workers.
There are many more parameters to play with. For more info, please refer to this documentation.
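Once the master and workers are up, a minimal sketch of pointing an application at the cluster (the host below is a placeholder for whatever start-master.sh printed):

import org.apache.spark.sql.SparkSession

// Sketch: connect an application to the standalone master started above.
val spark = SparkSession
  .builder()
  .master("spark://<master-host>:7077") // the spark://HOST:PORT URL printed by start-master.sh
  .appName("file-data-analysis")
  .getOrCreate()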
Hope I have managed to help! :)

Hadoop YARN clusters - Adding node at runtime

I am working on a solution for adding resources at run time to a Hadoop YARN cluster, in order to handle heavy peaks in our application.
I am not an expert, and I need help to confirm or contest what I understand.
Hadoop YARN
This application can run in cluster mode. It provides resource management (CPU & RAM).
A Spark application, for example, asks for a job to be done. YARN handles the request and provides executors computing on the YARN cluster.
HDFS - Data & Executors
The data is not shared between executors, so it has to be stored in a file system, in my case HDFS. That means I will have to run a copy of my Spark Streaming application on the new server (Hadoop node).
I am not sure about this:
The YARN cluster and HDFS are different; writing to HDFS won't write to the new Hadoop node's local disk (because it is not an HDFS node).
As I will only write new data to HDFS from a Spark Streaming application, creating a new application should not be a problem.
Submit the job to YARN
--- peak, resources needed
Provision a new server
Install / configure Hadoop & YARN, making it a slave
Modify hadoop/conf/slaves, adding its IP address (or DNS name from the hosts file)
Modify dfs.include and mapred.include
On the host machine:
yarn rmadmin -refreshNodes
bin/hadoop dfsadmin -refreshNodes
bin/hadoop mradmin -refreshNodes
Should this work? refreshQueues doesn't sound really useful here, as it seems to only deal with the scheduler queues.
I am not sure whether the running job will increase its capacity. Another idea is to wait for the new resources to be available and then submit a new job.
Thanks for your help.

Submitting jobs to Spark EC2 cluster remotely

I've set up the EC2 cluster with Spark. Everything works, all master/slaves are up and running.
I'm trying to submit a sample job (SparkPi). When I SSH to the cluster and submit it from there, everything works fine. However, when the driver is created on a remote host (my laptop), it doesn't work. I've tried both modes for --deploy-mode:
--deploy-mode=client:
From my laptop:
./bin/spark-submit --master spark://ec2-52-10-82-218.us-west-2.compute.amazonaws.com:7077 --class SparkPi ec2test/target/scala-2.10/ec2test_2.10-0.0.1.jar
This results in the following indefinite warnings/errors:
WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
15/02/22 18:30:45 ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 0
15/02/22 18:30:45 ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 1
...and failed drivers appear in the Spark Web UI under "Completed Drivers" with "State=ERROR".
I've tried passing limits for cores and memory to the submit script, but it didn't help...
--deploy-mode=cluster:
From my laptop:
./bin/spark-submit --master spark://ec2-52-10-82-218.us-west-2.compute.amazonaws.com:7077 --deploy-mode cluster --class SparkPi ec2test/target/scala-2.10/ec2test_2.10-0.0.1.jar
The result is:
.... Driver successfully submitted as driver-20150223023734-0007 ...
waiting before polling master for driver state ... polling master for driver state
State of driver-20150223023734-0007 is ERROR
Exception from cluster was: java.io.FileNotFoundException: File file:/home/oleg/spark/spark12/ec2test/target/scala-2.10/ec2test_2.10-0.0.1.jar does not exist.
java.io.FileNotFoundException: File file:/home/oleg/spark/spark12/ec2test/target/scala-2.10/ec2test_2.10-0.0.1.jar does not exist.
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:397)
    at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251)
    at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:329)
    at org.apache.spark.deploy.worker.DriverRunner.org$apache$spark$deploy$worker$DriverRunner$$downloadUserJar(DriverRunner.scala:150)
    at org.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunner.scala:75)
So, I'd appreciate any pointers on what is going wrong and some guidance on how to deploy jobs from a remote client. Thanks.
UPDATE:
For the second issue, in cluster mode the file must be globally visible to each cluster node, so it has to be in an accessible location. This solves the IOException but leads to the same issue as in client mode.
The documentation at:
http://spark.apache.org/docs/latest/security.html#configuring-ports-for-network-security
lists all the different communication channels used in a Spark cluster. As you can see, there are a bunch where the connection is made from the Executor(s) to the Driver. When you run with --deploy-mode=client, the driver runs on your laptop, so the executors will try to make a connection to your laptop. If the AWS security group that your executors run under blocks outbound traffic to your laptop (which the default security group created by the Spark EC2 scripts doesn't), or you are behind a router/firewall (more likely), they fail to connect and you get the errors you are seeing.
So to resolve it, you have to forward all the necessary ports to your laptop, or reconfigure your firewall to allow connections on those ports. Seeing as a bunch of the ports are chosen at random, this means opening up a wide range of ports, if not all of them. So using --deploy-mode=cluster, or client mode from within the cluster, is probably less painful.
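If you do want to keep the driver on your laptop, one partial mitigation (a sketch only; the port numbers are arbitrary, and your firewall still has to allow them) is to pin the normally random driver-side ports to fixed values so that only a few known ports need forwarding:

import org.apache.spark.{SparkConf, SparkContext}

// Sketch: pin the ports that executors connect back to on the driver,
// so a remote driver only needs a handful of known ports to be reachable.
val conf = new SparkConf()
  .setAppName("SparkPi")
  .setMaster("spark://ec2-52-10-82-218.us-west-2.compute.amazonaws.com:7077")
  .set("spark.driver.host", "<public-address-of-your-laptop>") // must be reachable from the workers
  .set("spark.driver.port", "51000")
  .set("spark.blockManager.port", "51010")
val sc = new SparkContext(conf)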
I advise against submitting Spark jobs remotely using the port-opening strategy, because it can create security problems and is, in my experience, more trouble than it's worth, especially due to having to troubleshoot the communication layer.
Alternatives:
1) Livy - now an Apache project! http://livy.io or http://livy.incubator.apache.org/
2) Spark Job server - https://github.com/spark-jobserver/spark-jobserver
