For example, I cache a number of RDDs in memory.
Then I leave the application running for a few days or more.
And then I try to access the cached RDDs.
Will they still be in memory?
Or will Spark clean up unused cached RDDs after some period of time?
Please help!
Spark automatically monitors cache usage on each node and drops out old data partitions in a least-recently-used (LRU) fashion. If you would like to manually remove an RDD instead of waiting for it to fall out of the cache, use the RDD.unpersist() method.
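For illustration, here is a minimal PySpark sketch of both paths: letting the LRU policy evict a cached RDD versus dropping it explicitly. The app name and HDFS path are placeholders, not anything from the question:

    from pyspark import SparkContext

    sc = SparkContext(appName="cache-demo")

    rdd = sc.textFile("hdfs:///data/events.txt")  # hypothetical input
    rdd.cache()    # mark the RDD for in-memory storage
    rdd.count()    # an action actually materializes the cache

    # Days later the partitions may already have been evicted (LRU).
    # To free the memory deterministically, drop the cache yourself:
    rdd.unpersist()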
Related
I am trying to put data into an Ignite cache, and it is taking too much time and making my process slow.
I have an Ignite cache of type IgniteCache<Integer,List<Integer>>, and adding data to the cache takes too much time. I am using 4 nodes in the cluster.
How can I reduce this time and make the processing faster?
We all know Spark does its computation in memory. I am just curious about the following.
If I create 10 RDDs in my pySpark shell from HDFS, does that mean the data of all 10 RDDs will reside in the Spark workers' memory?
If I do not delete an RDD, will it stay in memory forever?
If my dataset (file) size exceeds the available RAM, where will the data be stored?
If I create 10 RDDs in my pySpark shell from HDFS, does it mean all these 10 RDDs' data will reside in Spark memory?
Yes, the data of all 10 RDDs will be spread across the Spark worker machines' RAM, but it is not necessary that every machine holds a partition of each RDD. Of course, an RDD will only have data in memory once an action has been performed on it, since RDDs are lazily evaluated.
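To make the lazy-evaluation point concrete, a minimal sketch in the pySpark shell (where sc is predefined; the path is a placeholder):

    # Nothing is read or computed here: textFile and map only record lineage.
    lengths = sc.textFile("hdfs:///data/sample.txt").map(lambda line: len(line))

    lengths.cache()   # still nothing in memory yet
    lengths.sum()     # first action: now the data is read, computed, and cached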
If I do not delete the RDD, will it be in memory forever?
Spark automatically unpersists RDDs or DataFrames that are no longer used. To check whether an RDD or DataFrame is cached, go to the Spark UI -> Storage tab and look at the memory details. You can use df.unpersist() or sqlContext.uncacheTable("sparktable") to remove the DataFrame or table from memory.
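As a sketch of the table-caching path, here is the Spark 2.x catalog variant of the same idea (assuming a shell where spark is predefined; the file and table names are made up):

    df = spark.read.csv("hdfs:///data/users.csv", header=True)
    df.createOrReplaceTempView("sparktable")

    spark.catalog.cacheTable("sparktable")    # shows up under Spark UI -> Storage
    print(spark.catalog.isCached("sparktable"))

    spark.catalog.uncacheTable("sparktable")  # or df.unpersist()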
If my dataset size exceeds the available RAM, where will the data be stored?
If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they are needed.
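If recomputation is too costly, one option is a storage level that spills to disk instead; a minimal sketch (path is a placeholder):

    from pyspark import StorageLevel

    rdd = sc.textFile("hdfs:///data/big.txt")
    rdd.persist(StorageLevel.MEMORY_AND_DISK)  # partitions that don't fit in RAM spill to disk
    rdd.count()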
If we are saying the RDD is already in RAM, meaning it is in memory, what is the need to persist()? (as per a comment)
To answer your question: when an action is triggered on an RDD and there is not enough memory for it, Spark can evict uncached/unpersisted RDDs.
In general, we persist RDDs that need a lot of computation and/or shuffling (by default Spark persists shuffled RDDs to avoid costly network I/O), so that when an action is performed on a persisted RDD, only that action runs instead of recomputing everything from the start of the lineage graph. Check the RDD persistence levels here.
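A sketch of that pattern, persisting a shuffled RDD that several actions then reuse (the path and key logic are illustrative):

    pairs = sc.textFile("hdfs:///data/logs.txt") \
              .map(lambda line: (line.split()[0], 1))

    counts = pairs.reduceByKey(lambda a, b: a + b)  # the shuffle happens here
    counts.persist()

    top10 = counts.takeOrdered(10, key=lambda kv: -kv[1])  # action 1
    total = counts.count()                                 # action 2: no re-shuffle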
If I create 10 RDDs in my pySpark shell, does it mean all these 10 RDDs' data will reside in Spark memory?
Answer: An RDD only contains the "lineage graph" (the transformations applied). So an RDD is not data! Whenever we perform an action on an RDD, all the transformations are applied before the action. So, unless it is explicitly cached (of course there are some optimisations which cache implicitly), each time an action is performed the whole chain of transformations plus the action is executed again!
E.g., if you create an RDD from HDFS, apply some transformations, and perform 2 actions on the transformed RDD, the HDFS read and the transformations will be executed twice!
So, if you want to avoid the re-computation, you have to persist the RDD. For persisting you can choose a combination of on-heap memory, off-heap memory, and disk.
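A small sketch of the difference (path is a placeholder):

    words = sc.textFile("hdfs:///data/corpus.txt").flatMap(lambda l: l.split())

    # Without persist(), every action replays the whole lineage:
    words.count()             # reads HDFS and runs flatMap
    words.distinct().count()  # reads HDFS and runs flatMap again

    # Persisting the computed result stops the replay:
    words.persist()
    words.count()             # this action materializes the cache
    words.distinct().count()  # now served from the cached partitions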
If I do not delete the RDD, will it be in memory forever?
Answer: Considering an RDD is just a "lineage graph", it follows the same scope and lifetime rules as any object of the hosting language. But if you have already persisted the computed result, you can unpersist it!
If my dataset size exceeds the available RAM, where will the data be stored?
Answer: Assuming you have actually persisted/cached the RDD in memory, it will be stored in memory, and LRU is used to evict data. See the docs for more information on how memory management is done in Spark.
My question is about the concept of a distributed cache, specifically for Hadoop, and whether it should be called a distributed cache at all. A conventional definition of a distributed cache is: "A distributed cache spans multiple servers so that it can grow in size and in transactional capacity".
This is not true in Hadoop, as the distributed cache distributes the same file (the one referenced in the driver code) to all the nodes that run tasks.
Shouldn't this be called a replicated cache? Going by the conventional definition, the intersection of the caches on all nodes should be null (or close to it). But in Hadoop the result of the intersection is the same file, present on all nodes.
Is my understanding correct, or am I missing something? Please guide.
Thanks
The general understanding and concept of any cache is to make data available in memory and avoid hitting disk to read the data, because reading data from disk is a costlier operation than reading it from memory.
Now let's take the same analogy to the Hadoop ecosystem. Here the disk is your HDFS and the memory is the local file system where the actual tasks run. During the life cycle of an application, multiple tasks may run on the same node. When the first task is launched on a node, it fetches the data from HDFS and puts it on the local file system. The subsequent tasks on the same node will not fetch the same data again. That way it saves the cost of getting the data from HDFS, compared to getting it from the local file system. This is the concept of the Distributed Cache in the MapReduce framework.
The size of the data is usually small enough that it can be loaded into the mapper's memory, usually a few MB.
I too agree that it's not really a "distributed cache". But I am convinced by YoungHobbit's comments on the efficiency of not hitting disk for IO operations.
The only merit I have seen in this mechanism, as per the Apache documentation:
The framework will copy the necessary files on to the slave node before any tasks for the job are executed on that node. Its efficiency stems from the fact that the files are only copied once per job and the ability to cache archives which are un-archived on the slaves.
Please note that DistributedCache has been deprecated since the 2.6.0 release. You have to use the new APIs in the Job class to achieve the same functionality.
I have the following Spark job, and some RDDs show a "Fraction Cached" of more than 100%. How can this be possible? What did I miss? Thanks!
I believe this is because you can have the same partition cached in multiple locations. See SPARK-4049 for more details.
EDIT:
I'm wondering if maybe you have speculative execution (see spark.speculation) enabled? If you have straggling tasks, they will be relaunched, which I believe can duplicate a partition. Also, another useful thing might be to call rdd.toDebugString, which will provide lots of info on an RDD, including its transformation history and the number of cached partitions.
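For reference, a quick sketch of using it from the pySpark shell (the path is a placeholder; PySpark returns bytes here, hence the decode):

    rdd = sc.textFile("hdfs:///data/events.txt").map(lambda l: l.upper())
    rdd.cache()
    rdd.count()

    # Prints the lineage, including cached-partition details:
    print(rdd.toDebugString().decode("utf-8"))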
How does Spark handle concurrent queries? I have read a bit about Spark and the underlying RDDs, but I am unable to understand how concurrent queries would be handled.
For example, if I run a query which loads data into memory and the entire available memory is consumed, and at the same time someone else runs a query involving another set of data, how would Spark allocate the memory to both queries? Also, what would be the impact if priorities are taken into account?
Also, could running lots of parallel queries result in the machines hanging?
Firstly, Spark doesn't take more memory (RAM) than a threshold limit.
Spark tries to allocate a default amount of memory to every job.
If there is insufficient memory for a new job, it tries to spill the in-memory content of the least-recently-used (LRU) RDDs to disk and then allocates memory to the new job.
Optionally, you can also specify the storage level of an RDD: memory only, disk only, memory and disk, etc.
Scenario: consider a machine with little memory and a huge number of jobs; following the approach above, most of the RDDs will end up on disk only.
So the jobs will continue to run, but they will not take advantage of Spark's in-memory processing.
Spark does the memory allocation very intelligently.
If Spark is used on top of YARN, then the ResourceManager also takes part in the resource allocation.
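As an illustrative sketch of those knobs (the values are arbitrary examples, not recommendations):

    from pyspark import SparkConf, SparkContext, StorageLevel

    conf = (SparkConf()
            .setAppName("memory-demo")
            .set("spark.executor.memory", "2g")    # per-executor RAM cap
            .set("spark.memory.fraction", "0.6"))  # share usable for execution + storage

    sc = SparkContext(conf=conf)

    rdd = sc.textFile("hdfs:///data/big.txt")
    rdd.persist(StorageLevel.DISK_ONLY)  # explicitly keep this RDD out of RAM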