Monitoring the Memory Usage of Spark Jobs - memory-management

How can we get the overall memory used by a Spark job? I am not able to find the exact parameter to look at. I have checked the Spark UI but am not sure which field to use. In Ganglia we have the following options:
a) Memory Buffer
b) Cache Memory
c) Free Memory
d) Shared Memory
e) Free Swap Space
None of these seems to correspond to memory used. Does anyone have any ideas about this?

If you persist your RDDs you can see how big they are in memory via the UI.
It's hard to get an idea of how much memory is being used by intermediate work (e.g. shuffles). Spark will basically use as much memory as it needs from what's available, so if your cached RDDs take up more than about half of the available memory, your application may slow down because there is less memory left for execution.
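For a concrete starting point, here is a minimal PySpark sketch (the input path and app name are placeholders) that persists an RDD so its in-memory size shows up under the Storage tab of the UI; the same figures are also exposed by the monitoring REST API.

from pyspark import SparkContext, StorageLevel

sc = SparkContext(appName="memory-monitoring-demo")

rdd = sc.textFile("hdfs:///data/events")   # placeholder input path
rdd.persist(StorageLevel.MEMORY_ONLY)
rdd.count()                                # an action materializes the blocks, so they appear in the UI

# The Storage tab numbers are also available programmatically, e.g. at
# http://<driver-host>:4040/api/v1/applications/<app-id>/storage/rdd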

Related

Utilizing memory to its fullest in Spark

I am running a PySpark application and am trying to persist a dataframe because I reuse it later in the code.
I am using the following:
sourceDF.persist(StorageLevel.MEMORY_AND_DISK_SER)
I am processing 30GB of data.
I have 3 nodes, all 16 GB RAM and 4 Virtual Cores.
From the Spark UI, I see that the Size in Memory after persisting is very small. I want as much of the cached data as possible to be stored in RAM.
How can I make the best use of the RAM?
Also, the GC time for the tasks seems quite high. How can I reduce it?
You're already making the best use of memory by using dataframes and storing data with serialization. There's not much more you can do besides filtering out as much data as possible that isn't needed for the final result before caching.
Garbage collection is tricky. When you work with the DataFrame API and untyped transformations, Catalyst does its best to avoid unnecessary object creation, so you don't have much say over GC when using dataframes. Some operations are inherently more expensive in terms of performance and object creation, but you can only control those through the typed Dataset API and the RDD API. You're best off doing what you're doing now. If GC is truly an issue, the most useful step is to run a JVM profiling tool, find which pieces of code create the most objects, and optimize those. Minimizing data skew as much as possible and using broadcast joins where possible should also help avoid some GC.
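As a rough illustration of the advice above (the paths, table and column names are made up), trimming the dataframe before persisting and broadcasting the small side of a join might look like this:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("persist-demo").getOrCreate()

sourceDF = spark.read.parquet("hdfs:///data/source")   # large input (placeholder path)
lookupDF = spark.read.parquet("hdfs:///data/lookup")   # small dimension table (placeholder)

# Keep only the rows and columns the rest of the job needs before caching.
trimmedDF = sourceDF.select("id", "amount", "country").where(F.col("amount") > 0)
trimmedDF.persist()   # DataFrames default to MEMORY_AND_DISK

# Broadcasting the small table avoids shuffling the large one, which also reduces GC pressure.
joinedDF = trimmedDF.join(F.broadcast(lookupDF), on="country")
joinedDF.count()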

How to set up Apache Spark to use local hard disk when data does not fit in RAM in local mode?

I have a 50 GB dataset that doesn't fit in the 8 GB of RAM of my work computer, but that machine has a 1 TB local hard disk.
The link below from the official documentation mentions that Spark can use the local hard disk if the data doesn't fit in memory.
http://spark.apache.org/docs/latest/hardware-provisioning.html
Local Disks
While Spark can perform a lot of its computation in memory, it still uses local disks to store data that doesn't fit in RAM, as well as to preserve intermediate output between stages.
For me, computation time is not a priority at all; what matters is fitting the data on a single computer's RAM/hard disk for processing, because there are no alternative options.
Note:
I am looking for a solution that does not involve any of the following:
Increase the RAM
Sample & reduce data size
Use cloud or cluster computers
My end objective is to use Spark MLlib to build machine learning models.
I am looking for real-life, practical cases where people have successfully used Spark to operate on data that doesn't fit in RAM in standalone/local mode on a single computer. Has anyone done this successfully without major limitations?
Questions
SAS has a similar out-of-core processing capability, with which it can use both RAM and the local hard disk for model building, etc. Can Spark be made to work in the same way when the data is larger than the RAM?
SAS persists the complete dataset to the hard disk in the ".sas7bdat" format; can Spark do a similar persist to the hard disk?
If this is possible, how do I install and configure Spark for this purpose?
Look at http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence
You can use the various persistence levels according to your needs. MEMORY_AND_DISK is what will solve your problem. If you want better performance, use MEMORY_AND_DISK_SER, which stores the data in serialized form.
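A rough local-mode sketch of that advice (the directory paths and memory size are assumptions): point Spark's scratch space at the 1 TB disk and persist with MEMORY_AND_DISK so partitions that don't fit in RAM spill to that disk.

from pyspark import StorageLevel
from pyspark.sql import SparkSession

# Driver memory normally has to be set before the JVM starts, e.g.
#   spark-submit --driver-memory 6g my_script.py
spark = (SparkSession.builder
         .master("local[*]")
         .appName("out-of-core-demo")
         .config("spark.local.dir", "/mnt/bigdisk/spark-tmp")   # shuffle/spill space on the 1 TB disk
         .getOrCreate())

df = spark.read.parquet("/mnt/bigdisk/dataset")   # the 50 GB input (placeholder path)
df.persist(StorageLevel.MEMORY_AND_DISK)          # blocks that don't fit in RAM go to disk
print(df.count())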

Running parallel queries in Spark

How does Spark handle concurrent queries? I have read a bit about Spark and the underlying RDDs, but I can't figure out how concurrent queries are handled.
For example, if I run a query that loads data into memory and consumes all of the available memory, and at the same time someone else runs a query involving another set of data, how would Spark allocate memory to both queries? And what would the impact be if priorities are taken into account?
Also, could running lots of parallel queries result in the machines hanging?
Firstly, Spark does not take more memory (RAM) than its threshold limit.
Spark tries to allocate the default amount of memory to every job.
If there is insufficient memory for a new job, it spills the in-memory content of the least-recently-used (LRU) RDDs to disk and then allocates memory to the new job.
Optionally, you can also specify the storage level of an RDD, e.g. MEMORY_ONLY, DISK_ONLY, MEMORY_AND_DISK, etc. (a short sketch follows this answer).
Scenario: on a machine with little memory and a huge number of jobs, most of the RDDs will end up on disk only, per the approach above.
So the jobs will continue to run, but they won't take advantage of Spark's in-memory processing.
Spark handles memory allocation very intelligently.
If Spark is used on top of YARN, the ResourceManager also takes part in resource allocation.
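A small sketch of those storage levels in PySpark (the dataframes here are placeholders):

from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("storage-level-demo").getOrCreate()

hotDF = spark.range(1000000)
coldDF = spark.range(1000000)

hotDF.persist(StorageLevel.MEMORY_ONLY)    # keep in RAM; evicted blocks are recomputed
coldDF.persist(StorageLevel.DISK_ONLY)     # always read from disk, leaving RAM for execution
# StorageLevel.MEMORY_AND_DISK spills blocks that don't fit instead of dropping them.

hotDF.count()
coldDF.count()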

Caching in RAM using HDFS

I need to process some big files (~2 TBs) with a small cluster (~10 servers), in order to produce a relatively small report (some GBs).
I only care about the final report, not intermediate results, and the machines have a great amount of RAM, so it would be fantastic to use it to reduce disk access as much as possible (and consequently increase speed), ideally by storing the data blocks in volatile memory and using the disk only when necessary.
Looking at the configuration files and a previous question, it seems Hadoop doesn't offer this feature. The Spark website talks about a memory_and_disk option, but I'd rather not ask the company to deploy new software based on a new language.
The only "solution" I found is to set dfs.datanode.data.dir as /dev/shm/ in hdfs-default.xml, to trick it to use volatile memory instead of the filesystem to store data, still in this case it would behave badly, I assume, when the RAM gets full and it uses the swap.
Is there a trick to make Hadoop store datablocks as much as possible on RAM and write on disk only when necessary?
Since the release of Hadoop 2.3 you can use HDFS in-memory caching (centralized cache management).
You can toy around with mapred.job.reduce.input.buffer.percent (it defaults to 0; try something closer to 1.0, see for example this blog post) and also try setting mapred.inmem.merge.threshold to 0. Note that finding the right values is a bit of an art and requires some experimentation.
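For illustration, a mapred-site.xml fragment with the two parameters named above might look like the following; the values are only starting points to experiment with, not recommendations.

<!-- mapred-site.xml (illustrative values) -->
<property>
  <name>mapred.job.reduce.input.buffer.percent</name>
  <value>0.9</value>   <!-- retain most map outputs in the reducer heap instead of spilling them -->
</property>
<property>
  <name>mapred.inmem.merge.threshold</name>
  <value>0</value>     <!-- disable the file-count trigger for in-memory merges -->
</property>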

Hadoop single-node configuration on a high-memory machine

I have a single-node instance of Apache Hadoop 1.1.1 with default parameter values (see e.g. [1] and [2]) on a machine with a lot of RAM and very limited free disk space. I notice that this Hadoop instance wastes a lot of disk space during map tasks. What configuration parameters should I pay attention to in order to take advantage of the high RAM capacity and decrease disk space usage?
You can use several of the mapred.* params to compress map output, which will greatly reduce the amount of disk space needed to store mapper output. See this question for some good pointers.
Note that different compression codecs will have different issues (i.e. GZip needs more CPU than LZO, but you have to install LZO yourself). This page has a good discussion of compression issues in Hadoop, although it is a bit dated.
The amount of RAM you need depends on what you are doing in your map-reduce jobs, although you can increase the map-task heap size via mapred.map.child.java.opts in conf/mapred-site.xml.
See cluster setup for more details on this.
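For example, a mapred-site.xml fragment for Hadoop 1.x that enables map-output compression and raises the map-task heap could look like this (the codec and heap size are illustrative; use LZO or Snappy only if they are installed):

<!-- mapred-site.xml (Hadoop 1.x property names, illustrative values) -->
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.GzipCodec</value>
</property>
<property>
  <name>mapred.map.child.java.opts</name>
  <value>-Xmx2048m</value>   <!-- bigger map-task heap on a high-RAM machine -->
</property>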
You can use dfs.datanode.du.reserved in hdfs-site.xml to specify an amount of disk space you won't use (a sketch follows below). I don't know whether Hadoop is able to compensate with higher memory usage.
You'll have a problem, though, if you run a mapreduce job that's disk i/o intensive. I don't think any amount of configuring will help you then.
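As a sketch of the dfs.datanode.du.reserved setting mentioned above (the 10 GB figure is an arbitrary example):

<!-- hdfs-site.xml -->
<property>
  <name>dfs.datanode.du.reserved</name>
  <value>10737418240</value>   <!-- bytes per volume that the DataNode leaves free for non-DFS use (10 GB here) -->
</property>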
