I have been using Spark and I am curious about how exactly RDDs work. I understand that an RDD is a pointer to the data. If I create an RDD for an HDFS file, I understand that the RDD will be a pointer to the actual data in that HDFS file.
What I do not understand is where the data gets stored in memory. When a task is sent to a worker node, does the data for a specific partition get stored in memory on that worker node? If so, what happens when an RDD partition is stored in memory on worker node 1, but worker node 2 has to compute a task on the same partition of the RDD? Does worker node 2 communicate with worker node 1 to get the data for the partition and store it in its own memory?
In principle, tasks are divided across executors, each one working on its own separate chunk of data (for instance, from HDFS files or folders). The data for a task is loaded into local memory on that executor. Multiple transformations can be chained within the same task.
If, however, a transformation needs to pull data from more than one executor, a new set of tasks will be created, and the results from the previous tasks will be shuffled and redistributed across executors. For instance, many of the *byKey transformations shuffle the entire dataset around, over the network and through local disk, so that executors can perform the second set of tasks. The number of shuffles and the volume of shuffled data are critical to Spark's performance.
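To make the *byKey shuffle concrete, here is a minimal sketch in plain Python (not Spark internals): each (key, value) record is routed to a reducer-side partition by hashing the key, which is why all values for one key end up on the same executor. The function names are illustrative, not Spark API.

```python
def partition_for(key, num_partitions):
    """Pick the target partition for a key, hash-partitioner style."""
    return hash(key) % num_partitions

def shuffle_by_key(records, num_partitions):
    """Group (key, value) records into the partitions a shuffle would create."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        # Every record with the same key lands in the same partition,
        # so a subsequent reduceByKey can run locally within a partition.
        partitions[partition_for(key, num_partitions)].append((key, value))
    return partitions

records = [("a", 1), ("b", 2), ("a", 3), ("c", 4)]
partitions = shuffle_by_key(records, num_partitions=2)
```

The key property, which the real shuffle also guarantees, is that a given key never ends up split across two partitions.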
Related
We all know Spark does its computation in memory. I am just curious about the following.
If I create 10 RDDs in my pySpark shell from HDFS, does it mean the data for all 10 RDDs will reside in the Spark workers' memory?
If I do not delete an RDD, will it stay in memory forever?
If my dataset (file) size exceeds the available RAM, where will the data be stored?
If I create 10 RDDs in my pySpark shell from HDFS, does it mean the data for all 10 RDDs will reside in Spark memory?
Yes, the data for all 10 RDDs will be spread across the Spark workers' RAM, but not every machine necessarily holds a partition of each RDD. Of course, an RDD will have data in memory only once an action has been performed on it, since RDDs are lazily evaluated.
If I do not delete an RDD, will it stay in memory forever?
Spark automatically unpersists RDDs or DataFrames that are no longer used. To find out whether an RDD or DataFrame is cached, go to the Spark UI → Storage tab and check the memory details. You can use df.unpersist() or sqlContext.uncacheTable("sparktable") to remove the DataFrame or table from memory.
If my dataset size exceeds the available RAM, where will the data be stored?
If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time, when they're needed.
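The recompute-on-miss behavior can be sketched in plain Python (this is a conceptual model, not Spark code): a bounded cache holds only some partitions, and any partition that did not fit is recomputed from its lineage function on every access.

```python
def make_partition_store(compute, capacity):
    """Toy model: cache up to `capacity` partitions, recompute the rest."""
    cache = {}
    recomputations = []   # log of every recompute, for illustration

    def get(partition_id):
        if partition_id not in cache:
            recomputations.append(partition_id)    # cache miss: recompute
            value = compute(partition_id)
            if len(cache) < capacity:              # cache only if there is room
                cache[partition_id] = value
            return value
        return cache[partition_id]                 # cached: no recompute

    return get, recomputations

# Lineage function: partition i holds three consecutive numbers.
get, recomputed = make_partition_store(
    lambda pid: list(range(pid * 3, pid * 3 + 3)), capacity=2)

get(0); get(1); get(2)   # partition 2 does not fit, so it is never cached
get(2)                   # accessed again -> recomputed on the fly
```

Partitions 0 and 1 are computed once and then served from memory; partition 2 is recomputed on every access, which is exactly the trade-off the quoted sentence describes.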
If the RDD is already in RAM, i.e. in memory, what is the need for persist()? (asked in a comment)
To answer your question: when an action is triggered on an RDD and that action cannot find enough memory, Spark can evict uncached/unpersisted RDDs to make room.
In general, we persist RDDs that require a lot of computation and/or shuffling (by default, Spark persists shuffled RDDs to avoid costly network I/O), so that when an action is performed on a persisted RDD, only that action is executed, rather than recomputing everything from the start of the lineage graph. Check the RDD persistence levels for details.
If I create 10 RDDs in my pySpark shell, does it mean the data for all 10 RDDs will reside in Spark memory?
Answer: An RDD only contains the lineage graph (the transformations applied). So an RDD is not data! Whenever we perform an action on an RDD, all of its transformations are applied before the action. So unless the RDD is explicitly cached (there are some optimisations that cache implicitly), each time an action is performed, the whole chain of transformations plus the action is executed again.
E.g., if you create an RDD from HDFS, apply some transformations, and perform 2 actions on the transformed RDD, the HDFS read and the transformations will be executed twice.
So, if you want to avoid the re-computation, you have to persist the RDD. For persisting, you can choose any combination of on-heap memory, off-heap memory, and disk.
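The "two actions means two HDFS reads" behavior above can be sketched in plain Python (a conceptual model, not PySpark): an RDD is just a recipe of read + transformations, and each action replays the whole recipe unless the result was materialized first. The counter below stands in for the HDFS read.

```python
reads = {"count": 0}   # counts how many times the "HDFS file" is read

def read_source():
    reads["count"] += 1              # stands in for the HDFS read
    return [1, 2, 3, 4]

def transformed():
    # The lineage: read -> map. Nothing is stored between calls.
    return [x * 10 for x in read_source()]

# Two actions on the unpersisted lineage: the source is read twice.
total = sum(transformed())
size = len(transformed())

# "Persisting" = computing once and reusing the materialized result,
# so further actions cost no extra reads.
persisted = transformed()
total2 = sum(persisted)
size2 = len(persisted)
```

After this runs, the source was read three times in total: twice for the two unpersisted actions and once to materialize the persisted copy, which both later actions reuse for free.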
If I do not delete an RDD, will it stay in memory forever?
Answer: Since an RDD is just a lineage graph, it follows the same scope and lifetime rules as the hosting language. But if you have already persisted the computed result, you can unpersist it.
If my dataset size exceeds the available RAM, where will the data be stored?
Answer: Assuming you have actually persisted/cached the RDD in memory, it will be stored in memory, and an LRU policy is used to evict data when memory runs low. Refer to the Spark memory management documentation for more information on how this is done.
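The LRU eviction mentioned above can be sketched in a few lines of plain Python (a conceptual model of the policy, not Spark's actual memory manager): when the cache is full, the least-recently-used entry is evicted to make room for a newly cached one.

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU cache: evicts the least-recently-used entry when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()   # insertion order doubles as recency order

    def put(self, key, value):
        if key in self.entries:
            self.entries.move_to_end(key)       # refresh recency
        self.entries[key] = value
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)    # evict least recently used

    def get(self, key):
        value = self.entries[key]
        self.entries.move_to_end(key)           # mark as recently used
        return value

cache = LRUCache(capacity=2)
cache.put("part-0", "A")
cache.put("part-1", "B")
cache.get("part-0")        # part-0 is now the most recently used
cache.put("part-2", "C")   # cache is full: evicts part-1, the LRU entry
```

Because part-0 was touched just before part-2 arrived, it survives the eviction while part-1 does not, which is the essence of the policy.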
I have a job that needs to access Parquet files on HDFS, and I would like to minimise the network activity. So far I have HDFS DataNodes and Spark workers started on the same nodes, but when I launch my job, the data locality is always ANY, where it should be NODE_LOCAL since the data is distributed among all the nodes.
Is there any option I should configure to tell Spark to start the tasks where the data is?
The property you are looking for is spark.locality.wait. If you increase its value, Spark will execute tasks more locally, as it won't send the data to other workers just because the one on which the data resides is busy. However, setting the value too high might result in longer execution times, because you do not utilise the workers efficiently.
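For reference, the setting can go into spark-defaults.conf; the 10s values below are purely illustrative and should be tuned for your workload (the per-level variants also exist if you want finer control):

```
# spark-defaults.conf -- illustrative values, tune for your workload
spark.locality.wait          10s   # how long to wait for a local slot before falling back
spark.locality.wait.process  10s   # wait specifically at the PROCESS_LOCAL level
spark.locality.wait.node     10s   # wait specifically at the NODE_LOCAL level
spark.locality.wait.rack     10s   # wait specifically at the RACK_LOCAL level
```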
Also have a look here:
http://spark.apache.org/docs/latest/configuration.html
I got an RDD of filenames, so an RDD[String]. I get that by parallelizing a list of filenames (of files inside hdfs).
Now I map this rdd and my code opens a hadoop stream using FileSystem.open(path). Then I process it.
When I run my task, I use the Spark UI/Stages page and I see "Locality Level" = "PROCESS_LOCAL" for all the tasks. I don't think Spark could possibly achieve data locality the way I run the task (on a cluster of 4 data nodes). How is that possible?
When FileSystem.open(path) gets executed in a Spark task, the file content is loaded into a local variable in the same JVM process, which then prepares the RDD partition(s). So the data locality for that RDD is always PROCESS_LOCAL -- as vanekjar already commented on the question.
Additional information about data locality in Spark:
There are several levels of locality based on the data’s current location. In order from closest to farthest:
PROCESS_LOCAL data is in the same JVM as the running code. This is the best locality possible
NODE_LOCAL data is on the same node. Examples might be in HDFS on the same node, or in another executor on the same node. This is a little slower than PROCESS_LOCAL because the data has to travel between processes
NO_PREF data is accessed equally quickly from anywhere and has no locality preference
RACK_LOCAL data is on the same rack of servers. Data is on a different server on the same rack so needs to be sent over the network, typically through a single switch
ANY data is elsewhere on the network and not in the same rack
Spark prefers to schedule all tasks at the best locality level, but this is not always possible. In situations where there is no unprocessed data on any idle executor, Spark switches to lower locality levels.
Data locality is one of Spark's features that increases its processing speed. The Data Locality section can be found in the Spark tuning guide. At the start, when you write sc.textFile("path"), the data locality level will be according to the path you specified, but after that Spark tries to make the locality level PROCESS_LOCAL, in order to optimize processing speed by starting the processing where the data is present (locally).
I understand that the Resource Manager sends the MapReduce program to each Node Manager so that MapReduce gets executed on each node.
But after seeing this image, I am confused about where the actual Map & Reduce tasks are executed and how shuffling happens between data nodes.
Isn't it a time-consuming process to sort and shuffle/send data across different data nodes to perform the Reduce job? Please explain.
Also, let me know what the Map Node and Reduce Node are in this diagram.
Image Src: http://gppd-wiki.inf.ufrgs.br/index.php/MapReduce
An input split is a logical chunk of a file stored on HDFS; by default, an input split represents one block of the file, and the blocks of a file might be stored on many data nodes in the cluster.
A container is a task execution template allocated by the Resource Manager on any of the data nodes in order to execute Map/Reduce tasks.
First, the Map tasks get executed by containers on data nodes, where the Resource Manager allocates each container as near as possible to the input split's location by adhering to the rack awareness policy (node local / rack local / DC local).
The Reduce tasks can be executed by containers on any data nodes; each reducer copies its relevant data from every mapper through the shuffle/sort process.
The mappers prepare their results so that the output is internally partitioned and, within each partition, the records are sorted by key; the partitioner determines which reducer should fetch each partition.
During shuffle and sort, the reducers copy their relevant partitions from every mapper's output over HTTP; eventually, every reducer merge-sorts the copied partitions and prepares a single final sorted file before the reduce() method is invoked.
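The reducer-side merge can be sketched in plain Python (a conceptual model, not Hadoop code): each mapper emits a key-sorted run for a given reducer's partition, and the reducer merges those sorted runs into one sorted stream before applying reduce() per key. The word-count data below is made up for illustration.

```python
import heapq
from itertools import groupby
from operator import itemgetter

# Key-sorted (word, 1) runs fetched from three mappers,
# all belonging to the same reducer's partition.
mapper_runs = [
    [("apple", 1), ("cherry", 1)],
    [("apple", 1), ("banana", 1)],
    [("banana", 1), ("cherry", 1)],
]

# Merge the already-sorted runs into a single sorted stream,
# as the reducer's merge/sort phase does.
merged = heapq.merge(*mapper_runs)

# reduce(): with the stream sorted, all values for a key are adjacent,
# so one pass with groupby sums them per key.
counts = {
    key: sum(value for _, value in group)
    for key, group in groupby(merged, key=itemgetter(0))
}
```

Sorting before reducing is what makes the per-key aggregation a single streaming pass: the reducer never needs to hold more than one key's values at a time.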
The image below may give more clarification.
[Imagesrc:http://www.ibm.com/developerworks/cloud/library/cl-openstack-deployhadoop/]
I am new to hadoop and I have following questions on the same.
This is what I have understood in hadoop.
1) Whenever a file is written to Hadoop, it is stored across all the data nodes in chunks (64 MB by default).
2) When we run an MR job, a split will be created from each block, and on each data node the split will be processed.
3) From each split, the record reader is used to generate key/value pairs on the mapper side.
Questions :
1) Can one data node process more than one split at a time? What if the data node's capacity is higher?
I think this was a limitation in MR1, and with MR2 (YARN) we have better resource utilization.
2) Will a split be read serially at the data node, or can it be processed in parallel to generate key/value pairs? [By randomly accessing disk locations within the data node's split]
3) What is the 'slot' terminology in the map/reduce architecture? I was reading through one of the blogs, and it says YARN will provide better slot utilization on the data nodes.
Let me first address the "what I have understood in hadoop" part.
A file stored on the Hadoop file system is NOT stored across all data nodes. Yes, it is split into chunks (64 MB by default), but the number of DataNodes on which these chunks are stored depends on (a) file size, (b) current load on the data nodes, (c) replication factor, and (d) physical proximity. The NameNode takes these factors into account when deciding which DataNodes will store the chunks of a file.
Again, each data node MAY NOT process a split. Firstly, DataNodes are only responsible for managing the storage of data, not for executing jobs/tasks; the TaskTracker is the slave daemon responsible for executing tasks on individual nodes. Secondly, only those nodes which contain the data required for that particular job will process the splits, unless the load on those nodes is too high, in which case the data in the split is copied to another node and processed there.
Now coming to the questions,
Again, DataNodes are not responsible for processing jobs/tasks. We usually refer to the combination of DataNode + TaskTracker as a node, since they are commonly found on the same machine, handling different responsibilities (data storage & running tasks). A given node can process more than one split at a time: usually a single split is assigned to a single Map task, so this translates to multiple Map tasks running on a single node, which is possible.
Data from the input file is read serially.
A node's processing capacity is defined by its number of slots. If a node has 10 slots, it means it can process 10 tasks in parallel (these may be Map or Reduce tasks). The cluster administrator usually configures the number of slots for each node considering its physical configuration, such as memory, physical storage, number of processor cores, etc.