Data locality with Spark standalone and HDFS - hadoop

I have a job that needs to access Parquet files on HDFS, and I would like to minimise network activity. So far I have HDFS DataNodes and Spark workers started on the same nodes, but when I launch my job the data locality is always ANY, where it should be NODE_LOCAL since the data is distributed among all the nodes.
Is there any option I should configure to tell Spark to start the tasks where the data is?

The property you are looking for is spark.locality.wait. If you increase its value, tasks will execute more locally, as Spark won't ship work to another worker just because the one holding the data is busy. Setting the value too high, however, might result in longer execution times because you don't utilise your workers efficiently.
Also have a look here:
http://spark.apache.org/docs/latest/configuration.html
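For example, a minimal PySpark sketch of raising the wait before the scheduler falls back to a less local node (the 10s value is illustrative, not a recommendation):

from pyspark import SparkConf, SparkContext

# Hold tasks a bit longer before giving up on node-local placement.
# spark.locality.wait defaults to 3s; per-level overrides also exist.
conf = (SparkConf()
        .setAppName("locality-demo")
        .set("spark.locality.wait", "10s")
        .set("spark.locality.wait.node", "10s"))
sc = SparkContext(conf=conf)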

Related

Does Spark schedule workers on the same nodes where the data resides?

The Google MapReduce paper said that workers were scheduled on the same node as the data resided, or at least on the same rack if that was possible. I haven't read through the entire Hadoop documentation, but I assume that it moves the computation to the data if possible, rather than the data to the computation.
(When I first learned about Hadoop, all data from HDFS to the workers had to go through a TCP connection, even when the worker was on the same node as the data. Is this still the case?)
In any event, with Apache Spark, do workers get scheduled on the same nodes as the data, or does the RDD concept make it harder to do that?
Generally speaking, it depends. Spark recognizes multiple levels of locality (including PROCESS_LOCAL, NODE_LOCAL, and RACK_LOCAL) and tries to schedule tasks to achieve the best possible locality level. See Data Locality in Tuning Spark.
The exact behavior can be controlled with the spark.locality.* properties, which include the amount of time the scheduler waits for free resources before settling for a node with a lower locality level. See Scheduling in Spark Configuration.
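If you want to see where those locality hints come from, here is a small sketch (the path is hypothetical, and sc._jsc / sc._jvm are internal PySpark attributes) that asks HDFS for the block locations of a file; these hosts are what Spark uses as preferred locations when it aims for NODE_LOCAL scheduling:

# Ask the NameNode where the blocks of a file live.
hadoop_conf = sc._jsc.hadoopConfiguration()
Path = sc._jvm.org.apache.hadoop.fs.Path
path = Path("hdfs:///user/me/data.parquet")          # hypothetical path
fs = path.getFileSystem(hadoop_conf)
status = fs.getFileStatus(path)
for block in fs.getFileBlockLocations(status, 0, status.getLen()):
    print(block.getOffset(), block.getLength(), list(block.getHosts()))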

Spark running on YARN - What does a real life example's workflow look like?

I have been reading up on Hadoop, YARN and SPARK. What makes sense to me thus far is what I have summarized below.
Hadoop MapReduce: the client chooses an input file and hands it off to Hadoop (or YARN). Hadoop takes care of splitting the file based on the user's InputFormat and stores it on as many nodes as are available and configured. The client submits a job (map-reduce) to YARN, which copies the jar to the available DataNodes and executes the job. YARN is the orchestrator that takes care of all the scheduling and running of the actual tasks.
Spark: given a job, an input and a bunch of configuration parameters, it can run your job, which could be a series of transformations, and provide you the output.
I also understand that MapReduce is a batch-based processing paradigm and Spark is more suited for micro-batch or stream-based data.
There are a lot of articles that talk about how Spark can run on YARN and how they are complementary, but none have managed to help me understand how the two come together during an actual workflow. For example, when a client has a job to submit (read a huge file and do a bunch of transformations), what does the workflow look like when using Spark on YARN? Let us assume that the client's input file is a 100GB text file. Please include as much detail as possible.
Any help with this would be greatly appreciated
Thanks
Kay
Let's assume the large file is stored in HDFS. In HDFS the file is divided into blocks of some size (default 128 MB).
That means your 100GB file will be divided into 800 blocks. Each block will be replicated and can be stored on different nodes in the cluster.
When the file is read with a Hadoop InputFormat, a list of splits (with their locations) is obtained first, and then one task is created per split. So you get 800 parallel tasks that are executed by the runtime.
Basically the input process is the same for MapReduce and Spark, because both of them use Hadoop InputFormats.
Both of them will process each InputSplit in a separate task. The main difference is that Spark has a richer set of transformations and can optimize the workflow if there is a chain of transformations that can be applied at once, as opposed to MapReduce, where there is always only a map and a reduce phase.
YARN stands for "Yet Another Resource Negotiator". When a new job with some resource requirement (memory, processors) is submitted, it is the responsibility of YARN to check whether the needed resources are available on the cluster. If other jobs running on the cluster are taking up too many of the resources, the new job will be made to wait until the previous jobs complete and resources become available.
YARN will allocate enough containers in the cluster for the workers and also one for the Spark driver. In each of these containers a JVM is started with the given resources. Each Spark worker (executor) can process multiple tasks in parallel, depending on the configured number of cores per executor.
e.g.
If you set 8 cores per Spark executor, YARN tries to allocate 101 containers in the cluster to run 100 Spark workers + 1 for the Spark driver. Each of the workers will process 8 tasks in parallel (because of the 8 cores).
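To make that concrete, here is a rough PySpark sketch of the sizing described above (all numbers and the path are illustrative; on older Spark versions the master URL may be "yarn-client" rather than "yarn"):

from pyspark import SparkConf, SparkContext

# ~800 input splits (100 GB / 128 MB blocks), 8 cores per executor,
# 100 executors, plus one container for the driver.
conf = (SparkConf()
        .setMaster("yarn")
        .setAppName("spark-on-yarn-example")
        .set("spark.executor.instances", "100")
        .set("spark.executor.cores", "8")
        .set("spark.executor.memory", "8g"))
sc = SparkContext(conf=conf)

lines = sc.textFile("hdfs:///data/input-100gb.txt")  # hypothetical path
print(lines.getNumPartitions())                      # roughly 800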

spark + hadoop data locality

I got an RDD of filenames, so an RDD[String]. I get that by parallelizing a list of filenames (of files inside hdfs).
Now I map this rdd and my code opens a hadoop stream using FileSystem.open(path). Then I process it.
When I run my task, I look at the Spark UI Stages page and I see "Locality Level" = "PROCESS_LOCAL" for all the tasks. I don't think Spark could possibly achieve data locality the way I run the task (on a cluster of 4 data nodes), so how is that possible?
When FileSystem.open(path) gets executed in a Spark task, the file content is loaded into a local variable in the same JVM process that prepares the RDD partition(s), so the data locality for that RDD is always PROCESS_LOCAL.
-- vanekjar has already commented this on the question
Additional information about data locality in Spark:
There are several levels of locality based on the data’s current location. In order from closest to farthest:
PROCESS_LOCAL data is in the same JVM as the running code. This is the best locality possible
NODE_LOCAL data is on the same node. Examples might be in HDFS on the same node, or in another executor on the same node. This is a little slower than PROCESS_LOCAL because the data has to travel between processes
NO_PREF data is accessed equally quickly from anywhere and has no locality preference
RACK_LOCAL data is on the same rack of servers. Data is on a different server on the same rack so needs to be sent over the network, typically through a single switch
ANY data is elsewhere on the network and not in the same rack
Spark prefers to schedule all tasks at the best locality level, but this is not always possible. In situations where there is no unprocessed data on any idle executor, Spark switches to lower locality levels.
Data locality is one of Spark's features that increases its processing speed. The Data Locality section can be found in the Spark tuning guide. When you write sc.textFile("path"), the locality level at that point is determined by the path you specified, but after that Spark tries to bring the locality level to PROCESS_LOCAL in order to optimize processing speed by starting the processing where the data is already present (locally).
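As a small illustration of the difference (paths are hypothetical, and sc is the PySpark shell context): an RDD built with parallelize() carries no HDFS block locations, so its tasks show up as PROCESS_LOCAL, while an RDD built with textFile() gets preferred hosts from the HDFS blocks and can be scheduled NODE_LOCAL:

# 1. Parallelizing filename strings: the "data" each task starts from is the
#    string itself, already in the executor's JVM; any reading done inside
#    the map function is invisible to the scheduler.
names = sc.parallelize(["hdfs:///data/part-00000", "hdfs:///data/part-00001"])

# 2. Letting Spark read through a Hadoop InputFormat: partitions are derived
#    from HDFS blocks and carry preferred hosts.
lines = sc.textFile("hdfs:///data/*.txt")
print(names.getNumPartitions(), lines.getNumPartitions())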

How to fully utilize all Spark nodes in cluster?

I have launched a 10 node cluster with the ec2-script in standalone mode for Spark. I am accessing data in S3 buckets from within the PySpark shell, but when I perform transformations on the RDD, only one node is ever used. For example, the below will read in data from the CommonCorpus:
bucket = ("s3n://#aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2014-23/"
"/segments/1404776400583.60/warc/CC-MAIN-20140707234000-00000-ip-10"
"-180-212-248.ec2.internal.warc.gz")
data = sc.textFile(bucket)
data.count()
When I run this, only one of my 10 slaves processes the data. I know this because only one slave (213) has any logs of the activity when viewed from the Spark web console. When I view the activity in Ganglia, this same node (213) is the only slave with a spike in memory usage when the activity was run.
Furthermore, I get exactly the same performance when I run the same script with an EC2 cluster of only one slave. I am using Spark 1.1.0 and any help or advice is greatly appreciated.
...ec2.internal.warc.gz
I think you've hit a fairly typical problem with gzipped files in that they cannot be loaded in parallel. More specifically, a single gzipped file cannot be loaded in parallel by multiple tasks, so Spark will load it with 1 task and thus give you an RDD with 1 partition.
(Note, however, that Spark can load 10 gzipped files in parallel just fine; it's just that each of those 10 files can only be loaded by 1 task. You can still get parallelism across files, just not within a file.)
You can confirm that you only have 1 partition by checking the number of partitions in your RDD explicitly:
data.getNumPartitions()
The upper bound on the number of tasks that can run in parallel on an RDD is the number of partitions in the RDD or the number of slave cores in your cluster, whichever is lower.
In your case, it's the number of RDD partitions. You can increase that by repartitioning your RDD as follows:
data = sc.textFile(bucket).repartition(sc.defaultParallelism * 3)
Why sc.defaultParallelism * 3?
The Spark Tuning guide recommends having 2-3 tasks per core, and sc.defaultParallelism gives you the number of cores in your cluster.
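As a side note on the parallelism-across-files point above, a hedged sketch (the glob is illustrative and may not match the actual Common Crawl layout): pointing textFile at many .gz files gives you roughly one partition per file, so the files can be read by separate tasks even though each individual gzip stream is still read by a single task.

many = sc.textFile("s3n://aws-publicdatasets/common-crawl/crawl-data/"
                   "CC-MAIN-2014-23/segments/*/warc/*.warc.gz")
print(many.getNumPartitions())   # roughly one partition per matched file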

Is there a way to have a secondary storage or backup for data blocks in Hadoop?

I have Hadoop running on a cluster that has non-dedicated nodes (i.e. it shares nodes with other applications/users). When the other users are using a cluster node, Hadoop jobs are not allowed to run on that node. Thus, it is possible that only a few nodes are available at a given moment, and that these few nodes do not have all the data blocks (replicas) needed by the Hadoop job.
I also have a big Network-Attached Storage that is used for backup. So, I am wondering if there is a way to use it as a secondary storage for Hadoop. For example, if some data block is missing in the cluster, Hadoop would get the block from the secondary/backup storage.
Any ideas?
Thanks in advance!
I am not aware of such a "mixed" storage mode for Hadoop, so I do not think your scenario is directly supported by Hadoop.
To me it looks like you need a more "elastic" solution. If EMR were available as open source it might be a good choice, with the NAS playing the role of S3.
I would suggest the following solution in your case:
Install and run DataNodes on all available servers. They are not as resource-hungry as TaskTrackers, since they only read/write data sequentially.
Install TaskTrackers on all machines as well, but run them only on those which are not currently in use. Hadoop is smart enough to preserve data locality when possible. At the same time, Hadoop handles changes in the number of TaskTrackers much more easily than disappearing DataNodes.
Alternatively, you can build a cluster of TaskTrackers only, not use HDFS, and run jobs against the NAS.
In all cases, the main interference with other users I would still expect is network congestion: during the shuffle stage Hadoop usually saturates the network.
