Caching in RAM using HDFS

I need to process some big files (~2 TB) with a small cluster (~10 servers) in order to produce a relatively small report (a few GB).
I only care about the final report, not the intermediate results, and the machines have a large amount of RAM, so it would be fantastic to use it to reduce disk access as much as possible (and consequently increase speed), ideally by storing the data blocks in volatile memory and touching the disk only when necessary.
Looking at the configuration files and a previous question, it seems Hadoop doesn't offer this feature. The Spark website talks about a MEMORY_AND_DISK option, but I'd prefer not to ask the company to deploy new software based on a new language.
The only "solution" I found is to set dfs.datanode.data.dir to /dev/shm/ in hdfs-default.xml, to trick it into using volatile memory instead of the filesystem to store data; even then, I assume, it would behave badly once the RAM fills up and the system starts swapping.
Is there a trick to make Hadoop keep data blocks in RAM as much as possible and write to disk only when necessary?

Since the release of Hadoop 2.3 you can use HDFS in-memory caching (centralized cache management).
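For reference, a rough sketch of how centralized cache management is typically set up (the pool and path names here are made up, and the memory limit is just an example; it must not exceed the DataNode's memlock limit, i.e. ulimit -l):

    <!-- hdfs-site.xml: allow each DataNode to pin up to 64 GB of block data in RAM -->
    <property>
      <name>dfs.datanode.max.locked.memory</name>
      <value>68719476736</value>
    </property>

    # Create a cache pool and ask HDFS to keep the hot dataset pinned in DataNode memory
    hdfs cacheadmin -addPool reports
    hdfs cacheadmin -addDirective -path /data/input -pool reports
    hdfs cacheadmin -listDirectives -pool reports   # check what is actually cached

Blocks cached this way are served from RAM; everything else still comes from disk, which matches the "RAM first, disk only when necessary" behaviour asked for above.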

You can toy around with mapred.job.reduce.input.buffer.percent (it defaults to 0; try something closer to 1.0, see for example this blog post) and also set mapred.inmem.merge.threshold to 0. Note that finding the right values is a bit of an art and requires some experimentation.
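As a hedged example, this is roughly what that looks like in mapred-site.xml (the 0.9 is just a starting point to experiment from):

    <!-- mapred-site.xml: keep reduce-side map outputs in memory instead of spilling them to disk -->
    <property>
      <name>mapred.job.reduce.input.buffer.percent</name>
      <value>0.9</value>  <!-- fraction of the reducer heap allowed to retain map outputs during the reduce -->
    </property>
    <property>
      <name>mapred.inmem.merge.threshold</name>
      <value>0</value>    <!-- disable the segment-count trigger; in-memory merges are driven by memory usage only -->
    </property>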

Related

Monitoring the Memory Usage of Spark Jobs

How can we get the overall memory used by a Spark job? I am not able to find the exact metric to refer to for this. I have looked at the Spark UI but am not sure which field to use. Also, in Ganglia we have the following options:
a) Memory Buffer
b) Cache Memory
c) Free Memory
d) Shared Memory
e) Free Swap Space
None of these seems to correspond to "memory used". Does anyone have an idea regarding this?
If you persist your RDDs you can see how big they are in memory via the UI.
It's hard to get an idea of how much memory is being used for intermediate tasks (e.g. for shuffles). Basically, Spark will use as much memory as it needs, given what's available. This means that if your RDDs take up more than 50% of the available resources, your application might slow down because fewer resources are left for execution.
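To make the persisted-RDD numbers concrete, here is a minimal Scala sketch (the data and names are invented); sc.getRDDStorageInfo reports the same figures you see on the Storage tab of the UI:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    object RddMemoryCheck {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("rdd-memory-check").setMaster("local[*]"))

        // Persist and materialize an RDD so its blocks show up in the block manager
        val rdd = sc.parallelize(1 to 1000000)
          .map(i => (i, i.toString))
          .setName("pairs")
          .persist(StorageLevel.MEMORY_ONLY)
        rdd.count()

        // Same numbers as the "Storage" tab of the Spark UI
        sc.getRDDStorageInfo.foreach { info =>
          println(s"${info.name}: ${info.numCachedPartitions}/${info.numPartitions} partitions cached, " +
            s"${info.memSize} bytes in memory, ${info.diskSize} bytes on disk")
        }
        sc.stop()
      }
    }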

How to setup Apache Spark to use local hard disk when data does not fit in RAM in local mode?

I have a 50 GB dataset which doesn't fit in the 8 GB of RAM of my work computer, but it has a 1 TB local hard disk.
The link below, from the official documentation, mentions that Spark can use the local hard disk if the data doesn't fit in memory.
http://spark.apache.org/docs/latest/hardware-provisioning.html
Local Disks
While Spark can perform a lot of its computation in memory, it still
uses local disks to store data that doesn’t fit in RAM, as well as to
preserve intermediate output between stages.
For me computation time is not a priority at all; fitting the data onto a single computer's RAM/hard disk for processing is more important, due to the lack of alternative options.
Note:
I am looking for a solution which doesn't involve the items below:
Increase the RAM
Sample & reduce data size
Use cloud or cluster computers
My end objective is to use Spark MLlib to build machine learning models.
I am looking for real-life, practical solutions in which people have successfully used Spark to operate on data that doesn't fit in RAM in standalone/local mode on a single computer. Has someone done this successfully without major limitations?
Questions
SAS has a similar out-of-core processing capability, with which it can use both RAM and the local hard disk for model building etc. Can Spark be made to work in the same way when the data is larger than the RAM?
SAS persists the complete dataset to the hard disk in ".sas7bdat" format; can Spark persist to the hard disk in a similar way?
If this is possible, how do I install and configure Spark for this purpose?
Look at http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence
You can use various persistence levels as per your need. MEMORY_AND_DISK is what will solve your problem. If you want to be more memory-efficient, use MEMORY_AND_DISK_SER, which stores the data in serialized form (more compact, at the cost of extra CPU to deserialize).
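As a rough sketch of what that looks like in local mode (the paths are hypothetical; spark.local.dir points shuffle spills and disk-persisted blocks at the large drive):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    object SpillToDiskExample {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("spill-to-disk")
          .setMaster("local[*]")
          .set("spark.local.dir", "/mnt/bigdisk/spark-tmp") // scratch space on the 1 TB drive

        val sc = new SparkContext(conf)

        // Partitions that do not fit in memory are written to disk and re-read on demand
        val data = sc.textFile("/mnt/bigdisk/dataset.csv")
          .persist(StorageLevel.MEMORY_AND_DISK) // or MEMORY_AND_DISK_SER to trade CPU for space

        println(s"rows: ${data.count()}")
        sc.stop()
      }
    }

The same persisted RDD can then be fed to MLlib for model building.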

Is Hadoop a good candidate for use as a key-value store?

Question
Would Hadoop be a good candidate for the following use case:
Simple key-value store (primarily needs to GET and SET by key)
Very small "rows" (32-byte key-value pairs)
Heavy deletes
Heavy writes
On the order of a 100 million to 1 billion key-value pairs
Majority of data can be contained on SSDs (solid state drives) instead of in RAM.
More info
The reason I ask is because I keep seeing references to the Hadoop file system and how Hadoop is used as the foundation for a lot of other database implementations that aren't necessarily designed for Map-Reduce.
Currently we are storing this data in Redis. Redis performs great, but since it keeps all of its data in RAM, we have to use expensive machines with upwards of 128 GB of RAM. It would be nice to instead use a system that relies on SSDs. That way we would have the freedom to build much bigger hash tables.
We have also stored this data using Cassandra, but Cassandra tends to "break" if the deletes become too heavy.
Hadoop (contrary to popular media opinion) is not a database. What you describe is a database, so Hadoop is not a good candidate for you. Also, the rest of this post is opinionated, so feel free to prove me wrong with benchmarks.
If you care about "NoSQL DBs" that sit on top of Hadoop:
HBase would be suited for heavy writes, but sucks on huge deletes
Cassandra same story, but writes are not as fast as in HBase
Accumulo might be useful for very frequent updates, but will suck on deletes as well
None of them makes "real" use of SSDs; I don't think any of them gets a huge speedup from them.
All of them suffer from costly compactions if you start to fragment your tablets (in BigTable parlance), so deletion is a fairly obvious limiting factor.
What you can do to mitigate the deletion issue is to simply overwrite rows with a constant "deleted" marker value, which works around the compaction. However, this grows your table, which can be costly on SSDs as well, and you will need to filter the markers out at read time, which likely affects read latency.
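A toy Scala sketch of that marker approach, over an invented key-value interface, just to illustrate the pattern (deletes become ordinary writes of a sentinel value, and reads filter the sentinel out):

    // Invented interface standing in for whatever store (HBase, Cassandra, ...) is underneath
    trait KvStore {
      def put(key: Array[Byte], value: Array[Byte]): Unit
      def get(key: Array[Byte]): Option[Array[Byte]]
    }

    object MarkerDeletes {
      // Any constant that cannot occur in real data
      private val Deleted: Array[Byte] = "__deleted__".getBytes("UTF-8")

      // "Delete" is just an overwrite, so no tombstones pile up for compaction
      def delete(store: KvStore, key: Array[Byte]): Unit =
        store.put(key, Deleted)

      // Reads must filter the marker, which is the extra read-time cost mentioned above
      def read(store: KvStore, key: Array[Byte]): Option[Array[Byte]] =
        store.get(key).filterNot(_.sameElements(Deleted))
    }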
From what you describe, Amazon's DynamoDB architecture sounds like the best candidate here, although deletes there are also costly, maybe not as much as with the above alternatives.
BTW: the recommended way of deleting lots of rows from the tables in any of the above databases is to just delete the table completely. If you can fit your design into this paradigm, any of them will do.
Although this isn't an answer to your question, in the context of what you say about
It would be nice to instead use a system that relies on SSDs. This way
we would have the freedom to build much bigger hash tables.
you might consider taking a look at Project Voldemort.
Speaking specifically as a Cassandra user, I know what you mean when you say it's the compaction and the tombstones that are the problem. I have run into TombstoneOverwhelmingException a couple of times myself and hit dead ends.
You might want to have a look at this article by LinkedIn.
It says:
Memcached is all in memory so you need to squeeze all your data into
memory to be able to serve it (which can be an expensive proposition
if the generated data set is large).
And finally
all we do is just mmap the entire data set into the process address
space and access it there. This provides the lowest overhead caching
possible, and makes use of the very efficient lookup structures in the
operating system.
I don't know if this fits your case, but you could consider evaluating Voldemort. Best of luck.

Hadoop single node configuration on the high memory machine

I have a single-node instance of Apache Hadoop 1.1.1 with default parameter values (see e.g. [1] and [2]) on a machine with a lot of RAM and very limited free disk space. I notice that this Hadoop instance wastes a lot of disk space during map tasks. What configuration parameters should I pay attention to in order to take advantage of the high RAM capacity and decrease disk space usage?
You can use several of the mapred.* params to compress map output, which will greatly reduce the amount of disk space needed to store mapper output. See this question for some good pointers.
Note that different compression codecs will have different issues (e.g. GZip needs more CPU than LZO, but you have to install LZO yourself). This page has a good discussion of compression issues in Hadoop, although it is a bit dated.
The amount of RAM you need depends upon what you are doing in your map-reduce jobs, although you can increase your heap-size in:
conf/mapred-site.xml mapred.map.child.java.opts
See cluster setup for more details on this.
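For example, a hedged mapred-site.xml sketch combining map-output compression with a bigger map-task heap (the codec and the sizes are assumptions to adjust; io.sort.mb is an extra knob not mentioned above, and raising it within the larger heap is what actually reduces the number of spill files):

    <!-- mapred-site.xml -->
    <property>
      <name>mapred.compress.map.output</name>
      <value>true</value>  <!-- compress intermediate map output spilled to disk -->
    </property>
    <property>
      <name>mapred.map.output.compression.codec</name>
      <value>org.apache.hadoop.io.compress.GzipCodec</value>  <!-- more CPU, less disk; use LZO/Snappy if installed -->
    </property>
    <property>
      <name>mapred.map.child.java.opts</name>
      <value>-Xmx2048m</value>  <!-- larger heap for the map-task JVMs -->
    </property>
    <property>
      <name>io.sort.mb</name>
      <value>512</value>  <!-- bigger in-memory sort buffer, so fewer spill files hit the disk -->
    </property>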
You can use dfs.datanode.du.reserved in hdfs-site.xml to specify an amount of disk space Hadoop won't use. I don't know whether Hadoop is able to compensate with higher memory usage.
You'll have a problem, though, if you run a MapReduce job that's disk-I/O intensive. I don't think any amount of configuration will help you then.
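A minimal hdfs-site.xml sketch for the dfs.datanode.du.reserved setting mentioned above (the value is in bytes per volume and is an arbitrary example):

    <!-- hdfs-site.xml: keep 10 GB per volume free for non-HDFS use -->
    <property>
      <name>dfs.datanode.du.reserved</name>
      <value>10737418240</value>
    </property>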

what does " local caching of data" mean in the context of this article?

The following paragraphs of text (from http://developer.yahoo.com/hadoop/tutorial/module2.html) mention that large, sequentially readable files are not suitable for local caching, but I don't understand what "local" means here.
I see two possible interpretations: either the client caches the data it reads from HDFS, or the DataNode caches HDFS data in its local filesystem or memory so that clients can access it quickly. Can anyone explain this in more detail? Thanks a lot.
But while HDFS is very scalable, its high performance design also restricts it to a
particular class of applications; it is not as general-purpose as NFS. There are a large
number of additional decisions and trade-offs that were made with HDFS. In particular:
Applications that use HDFS are assumed to perform long sequential streaming reads from
files. HDFS is optimized to provide streaming read performance; this comes at the expense of
random seek times to arbitrary positions in files.
Data will be written to the HDFS once and then read several times; updates to files
after they have already been closed are not supported. (An extension to Hadoop will provide
support for appending new data to the ends of files; it is scheduled to be included in
Hadoop 0.19 but is not available yet.)
Due to the large size of files, and the sequential nature of reads, the system does
not provide a mechanism for local caching of data. The overhead of caching is great enough
that data should simply be re-read from HDFS source.
Individual machines are assumed to fail on a frequent basis, both permanently and
intermittently. The cluster must be able to withstand the complete failure of several
machines, possibly many happening at the same time (e.g., if a rack fails all together).
While performance may degrade proportional to the number of machines lost, the system as a
whole should not become overly slow, nor should information be lost. Data replication
strategies combat this problem.
Any real MapReduce job is probably going to process GBs (10s/100s/1000s) of data from HDFS.
Therefore any one mapper instance is most probably going to be processing a fair amount of data (the typical block size is 64/128/256 MB, depending on your configuration) in a sequential manner (it will read the file/block in its entirety from start to end).
It is also unlikely that another mapper instance running on the same machine will want to process that data block again in the immediate future, especially since multiple mapper instances will be processing data alongside this mapper in any one TaskTracker (hopefully with a fair few of them being 'local' to the actual physical location of the data, i.e. a replica of the data block also exists on the same machine the mapper instance is running on).
With all this in mind, caching the data read from HDFS is probably not going to gain you much: you'll most probably not get a cache hit on that data before another block is queried and ultimately replaces it in the cache.
