Utilizing memory to its fullest in Spark - hadoop

I have running a Pyspark application and I am trying to persist dataframe as I am using the dataframe again in the code.
I am using the following:
sourceDF.persist(StorageLevel.MEMORY_AND_DISK_SER)
I am processing 30GB of data.
I have 3 nodes, all 16 GB RAM and 4 Virtual Cores.
From Spark UI, I see the Size in Memory after persistence is very less. I'd want it to store the cached data in RAM Memory as much as possible.
How can I best utilise RAM Memory?
Also, the GC time for the tasks seems quite high. How can I reduce it ?

You're already making the best use of memory by using dataframes and storing data with serialization. There's not much more you can do besides filtering out as much data as possible that isn't needed for the final result before caching.
Garbage collection is tricky. When working with Dataframe API and untyped transformations, catalyst is going to do its best to avoid unnecessary object creation. You really don't have much of a say when using dataframes and running into GC issues. Some operations are inherently more expensive as far as performance and object creation, but you can only control those using the typed dataset api and rdd api. You're best off doing what you're currently doing now. If GC is truly an issue, the best thing you can do is use a JVM profiling tool and find which pieces of code are creating the most objects and looking to optimize that. In addition, trying to minimize as much as possible data skew, and leveraging broadcast joins where possible should help avoid some GC.

Related

Spark as the storage layer for Mapreduce

I am facing a unique problem, and wanted your opinions here.
I have a legacy map-reduce application, where multiple map-reduce jobs run sequentially, the intermediate data is written back and forth to HDFS. Because of intermediate data written to HDFS, the jobs with small data lose more than gain from HDFS's features, and take considerably more time than what a non-Hadoop equivalent would have taken. Eventually I plan to convert all my map reduce jobs to Spark DAGs, however that's a big-bang change, so I am reasonably procrastinating.
What I really want as a short term solution is that, change the storage layer, so that I continue to benefit from map-reduce parallelism, but do not pay much penalty for storage layer. In that direction, I am thinking of using Spark as the storage layer, where map-reduce jobs will store their outputs in Spark through Spark Context, and the inputs will be read again (by creating Spark input split, each split will have it's own Spark RDD) from Spark Context.
In this way, I will be able to operate intermediate data read/write at memory speed, which will theoretically give me significant performance improvement.
My question is, does this architectural scheme make sense? Has anyone encountered situations like this? Am I missing something significant, which I should have considered even at this preliminary stage of the solution?
Thanks in advance!
does this architectural scheme make sense?
It doesn't. Spark has no standalone storage layer so there is nothing you can use here. If it wasn't enough at its core it is using standard Hadoop input formats for reading and writing data.
If you want to reduce overhead of a storage layer you should rather consider accelerated accelerated storage (like Alluxio) or memory grid (like Ignite Hadoop Accelerator).

Monitoring the Memory Usage of Spark Jobs

How can we get the overall memory used for a spark job. I am not able to get the exact parameter which we can refer to retrieve the same. Have referred to Spark UI but not sure of the field which we can refer. Also in Ganglia we have the following options:
a)Memory Buffer
b)Cache Memory
c)Free Memory
d)Shared Memory
e)Free Swap Space
Not able to get any option related to Memory Used. Does anyone have some idea regarding this.
If you persist your RDDs you can see how big they are in memory via the UI.
It's hard to get an idea of how much memory is being used for intermediate tasks (e.g. for shuffles). Basically Spark will use as much memory as it needs given what's available. This means that if your RDDs take up more than 50% of your available resources, your application might slow down because there are fewer resources available for execution.

How to setup Apache Spark to use local hard disk when data does not fit in RAM in local mode?

I have 50 GB dataset which doesn't fit in 8 GB RAM of my work computer but it has 1 TB local hard disk.
The below link from offical documentation mentions that Spark can use local hard disk if data doesnt fit in the memory.
http://spark.apache.org/docs/latest/hardware-provisioning.html
Local Disks
While Spark can perform a lot of its computation in memory, it still
uses local disks to store data that doesn’t fit in RAM, as well as to
preserve intermediate output between stages.
For me computational time is not at all a priority but fitting the data into a single computer's RAM/hard disk for processing is more important due to lack of alternate options.
Note:
I am looking for a solution which doesn't include the below items
Increase the RAM
Sample & reduce data size
Use cloud or cluster computers
My end objective is to use Spark MLLIB to build machine learning models.
I am looking for real-life, practical solutions that people successfully used Spark to operate on data that doesn't fit in RAM in standalone/local mode in a single computer. Have someone done this successfully without major limitations?
Questions
SAS have similar capability of out-of-core processing using which it can use both RAM & local hard disk for model building etc. Can Spark be made to work in the same way when data is more than RAM size?
SAS writes persistent the complete dataset to hardisk in ".sas7bdat" format can Spark do similar persistent to hard disk?
If this is possible, how to install and configure Spark for this purpose?
Look at http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence
You can use various persistence models as per your need. MEMORY_AND_DISK is what will solve your problem . If you want a better performance, use MEMORY_AND_DISK_SER which stores data in serialized fashion.

Mongodb - make inmemory or use cache

I will be creating a 5 node mongodb cluster. It will be more read heavy than write and had a question which design would bring better performance. These nodes will be dedicated to only mongodb. For the sake of an example, say each node will have 64GB of ram.
From the mongodb docs it states:
MongoDB automatically uses all free memory on the machine as its cache
Does this mean as long as my data is smaller than the available ram it will be like having an in-memory database?
I also read that it is possible to implement mongodb purely in memory
http://edgystuff.tumblr.com/post/49304254688/how-to-use-mongodb-as-a-pure-in-memory-db-redis
If my data was quite dynamic (can range from 50gb to 75gb every few hours), would it be theoretically be better performing to design mongodb in a way which allows mongodb to manage itself with its cache (default setup of mongo), or to put the mongodb into memory initially and if the data grows over the size of ram use swap space (SSD)?
MongoDB default storage engine maps the files in memory. It provides an efficient way to access the data, while avoiding double caching (i.e. MongoDB cache is actually the page cache of the OS).
Does this mean as long as my data is smaller than the available ram it will be like having an in-memory database?
For read traffic, yes. For write traffic, it is different, since MongoDB may have to journalize the write operation (depending on the configuration), and maintain the oplog.
Is it better to run MongoDB from memory only (leveraging tmpfs)?
For read traffic, it should not be better. Putting the files on tmpfs will also avoid double caching (which is good), but the data can still be paged out. Using a regular filesystem instead will be as fast once the data have been paged in.
For write traffic, it is faster, provided the journal and oplog are also put on tmpfs. Note that in that case, a system crash will result in a total data loss. Usually, the performance gain does not worth the risk.

Is Hadoop a good candidate for use as a key-value store?

Question
Would Hadoop be a good candidate for the following use case:
Simple key-value store (primarily needs to GET and SET by key)
Very small "rows" (32-byte key-value pairs)
Heavy deletes
Heavy writes
On the order of a 100 million to 1 billion key-value pairs
Majority of data can be contained on SSDs (solid state drives) instead of in RAM.
More info
The reason I ask is because I keep seeing references to the Hadoop file system and how Hadoop is used as the foundation for a lot of other database implementations that aren't necessarily designed for Map-Reduce.
Currently, we are storing this data in Redis. Redis performs great, but since it contains all of its data within RAM, we have to use expensive machines with upwards of 128gb RAM. It would be nice to instead use a system that relies on SSDs. This way we would have the freedom to build much bigger hash tables.
We have also stored this data using Cassandra, but Cassandra tends to "break" if the deletes become too heavy.
Hadoop (unlike popular media opinions) is not a database. What you describe is a database. Thus Hadoop is not a good candidate for you. Also the below post is opinionated, so feel free to prove me wrong with benchmarks.
If you care about "NoSql DB's" that are on top of Hadoop:
HBase would be suited for heavy writes, but sucks on huge deletes
Cassandra same story, but writes are not as fast as in HBase
Accumulo might be useful for very frequent updates, but will suck on deletes as well
None of them make "real" use of SSDs, I think that all of them do not get a huge speedup by them.
All of them suffer from the costly compactions if you start to fragment your tablets (in BigTable speech), thus deleting is a fairly obvious limiting factor.
What you can do to mitigate the deletion issues is to just overwrite with a constant "deleted" value, which work-arounds the compaction. However, grows your table which can be costly on SSDs as well. Also you will need to filter, which likely affects the read latency.
From what you describe, Amazon's DynamoDB architecture sounds like the best candidate here. Although deletes here are also costly- maybe not as much as the above alternatives.
BTW: the recommended way of deleting lots of rows from the tables in any of the above databases is to just completely delete the table. If you can fit your design into this paradigm, any of those will do.
Although this isnt an answer to you question, but in context with what you say about
It would be nice to instead use a system that relies on SSDs. This way
we would have the freedom to build much bigger hash tables.
you might consider taking a look at Project Voldemort.
Specifically being a Cassandra user I know when you say Its the compaction and the tombstones that are a problem. I have myself ran into TombstoneOverwhelmingException couple of times and hit dead ends.
You might want to have a look at this article by Linked In
It says:
Memcached is all in memory so you need to squeeze all your data into
memory to be able to serve it (which can be an expensive proposition
if the generated data set is large).
And finally
all we do is just mmap the entire data set into the process address
space and access it there. This provides the lowest overhead caching
possible, and makes use of the very efficient lookup structures in the
operating system.
I dont know if this fits your case. But you can consider evaluating Voldemort once! Best of luck.

Resources