Caching data on Hadoop worker nodes

My Map/Reduce program requests files from Amazon S3 very frequently in the reducer, and the same file may be requested multiple times (about 10 K files in total, each between 1 MB and 12 MB). Using the Hadoop Distributed Cache is not efficient because it copies all of these files to every worker node (as I understand it), and I don't want that: in the reducer phase I may request only about 1,000 of the 10 K files. Moreover, if the reducer has already requested a file, I don't want to request it again when the reducer needs it later. Has anyone implemented a caching framework like ehcache or oscache on the worker nodes? Or is there any way to cache only the requested files on the worker machines' disks?
Thanks
Yahia

Have a look at SHARK; it should not take much time to configure. Another option is memcached.

You probably need a mature in-memory data grid with partitioned cache support. GridGain is one of them; take a look at www.gridgain.com

I would suggest using HDFS as a cache. S3 is usually much slower than local disks, so HDFS can be treated as a local cache.
I am not aware of a fully automatic solution, but I believe distcp will help (http://hadoop.apache.org/common/docs/r0.19.2/distcp.html). It has an "update" option, so it will not copy files whose size has not changed.
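If you do end up fetching directly from S3 in the reducer, a simple lazy cache on the local disk of each worker covers the "don't download the same file twice" requirement without copying all 10 K files to every node. The sketch below is only a minimal illustration and is not tied to any particular S3 client; the Fetcher callback and the cache directory are placeholders you would replace with your own download logic.

    import java.io.File;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    /**
     * Minimal sketch of a per-node, on-disk cache: a file is downloaded
     * from S3 only the first time a reducer on this node asks for it.
     */
    public class LocalS3Cache {

        /** Callback that performs the actual S3 download (client-specific). */
        public interface Fetcher {
            void fetch(String s3Key, File destination) throws IOException;
        }

        private final Path cacheDir;
        private final Fetcher fetcher;

        public LocalS3Cache(String cacheDir, Fetcher fetcher) throws IOException {
            this.cacheDir = Files.createDirectories(Paths.get(cacheDir));
            this.fetcher = fetcher;
        }

        /** Returns a local copy of the object, downloading it only if missing. */
        public synchronized File get(String s3Key) throws IOException {
            File local = cacheDir.resolve(s3Key.replace('/', '_')).toFile();
            if (!local.exists()) {                        // first request on this node
                File tmp = File.createTempFile("s3dl", null, cacheDir.toFile());
                fetcher.fetch(s3Key, tmp);                // download from S3
                Files.move(tmp.toPath(), local.toPath()); // publish the finished copy
            }
            return local;                                 // later requests hit local disk
        }
    }

In the reducer you would construct one such cache in setup() with a node-local directory (e.g. under the task's local temp space) and call get() whenever a file is needed; repeated requests for the same file then hit the local disk instead of S3.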

Related

use spark to copy data across hadoop cluster

I have a situation where I have to copy data/files from PROD to UAT (Hadoop clusters). For that I am using 'distcp' now, but it is taking forever. As distcp uses map-reduce under the hood, is there any way to use Spark to make the process faster? Just as we can set the Hive execution engine to 'TEZ' (to replace map-reduce), can we set the execution engine to Spark for distcp? Or is there any other 'Spark' way to copy data across clusters that may not even bother with distcp?
And here comes my second question (assuming we can set the distcp execution engine to Spark instead of map-reduce; please don't bother to answer this one otherwise):
As far as I know, Spark is faster than map-reduce mainly because it keeps data it may need to process several times in memory, so that it does not have to load the data all the way from disk each time. Here we are copying data across clusters, so there is no need to process one file (or block or split) more than once: each file goes into memory, is sent over the network, and gets copied to the destination cluster's disk, and that's the end of the story for that file. So how does Spark make the process faster if its main feature is not used?
Your bottlenecks on bulk cross-cluster IO are usually
bandwidth between clusters
read bandwidth off the source cluster
write bandwidth to the destination cluster (and with 3x replication, writes do take up disk and switch bandwidth)
allocated space for work (i.e. number of executors, tasks)
Generally on long-distance uploads it's your long-haul network that is the bottleneck: you don't need that many workers to flood the network.
There's a famous tale of a distcp operation between two Yahoo! clusters which did manage to do exactly that to part of the backbone: the Hadoop ops team was happy that the distcp was going so fast, while the network ops team panicked because their core services were somehow suffering due to the traffic between the two sites. I believe this incident is the reason distcp now has a -bandwidth option :)
Where there may be limitations in distcp, it's probably in task setup and execution: the decision of which files to copy is made in advance and there's not much (any?) intelligence in rescheduling work if some files copy fast but others are outstanding.
Distcp just builds up the list in advance and hands it off to the special distcp mappers, each of which reads its list of files and copies it over.
Someone could try writing a Spark version of distcp; it could be an interesting project if someone wanted to work on better scheduling, relying on the fact that Spark is very efficient at pushing out new work to existing executors: a Spark version could push out work dynamically rather than listing everything in advance. Indeed, it could even start copying while still enumerating the files to copy, for a faster startup time. Even so: cross-cluster bandwidth will usually be the choke point.
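To make that "push work out dynamically" idea concrete, here is a very rough sketch of such a Spark-based copier using Spark's Java API and Hadoop's FileUtil.copy. It is only an illustration of handing individual file copies out as tasks; the source/destination URIs are placeholders, it only handles a flat directory, and it does none of the checksumming, retry, or attribute-preservation work that distcp does.

    import java.net.URI;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FileUtil;
    import org.apache.hadoop.fs.Path;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    public class NaiveSparkCopy {
        public static void main(String[] args) throws Exception {
            // Placeholder cluster URIs -- replace with your PROD/UAT namenodes.
            final String srcRoot = "hdfs://prod-nn:8020/data/input";
            final String dstRoot = "hdfs://uat-nn:8020/data/input";

            JavaSparkContext sc =
                new JavaSparkContext(new SparkConf().setAppName("naive-spark-copy"));

            // Enumerate source files on the driver (distcp builds a similar list).
            Configuration conf = new Configuration();
            FileSystem srcFs = FileSystem.get(URI.create(srcRoot), conf);
            List<String> files = new ArrayList<>();
            for (FileStatus st : srcFs.listStatus(new Path(srcRoot))) {
                if (st.isFile()) {
                    files.add(st.getPath().getName());
                }
            }

            // Each task copies one file; Spark hands new files to idle executors.
            sc.parallelize(files, files.size()).foreach(name -> {
                Configuration c = new Configuration(); // built per task: not serializable
                FileSystem src = FileSystem.get(URI.create(srcRoot), c);
                FileSystem dst = FileSystem.get(URI.create(dstRoot), c);
                FileUtil.copy(src, new Path(srcRoot, name),
                              dst, new Path(dstRoot, name),
                              false /* don't delete source */, c);
            });

            sc.stop();
        }
    }

As the answer notes, though, this mainly changes how the work is handed out; the cross-cluster bandwidth remains the limiting factor.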
Spark is not really intended for data movement between Hadoop clusters. You may want to look into additional mappers for your distcp job using the "-m" option.

Distributed Cache Concept in Hadoop

My question is about the concept of distributed cache specifically for Hadoop and whether it should be called distributed Cache. A conventional definition of distributed cache is "A distributed cache spans multiple servers so that it can grow in size and in transactional capacity".
This is not true in Hadoop, as the Distributed Cache is distributed to all the nodes that run the tasks, i.e. the same file mentioned in the driver code is copied to every node.
Shouldn't this be called a replicative cache? The intersection of the caches on all nodes should be null (or close to it) if we go by the conventional distributed cache definition, but in Hadoop the result of the intersection is the same file, which is present on all nodes.
Is my understanding correct, or am I missing something? Please guide.
Thanks
The general idea of any cache is to make data available in memory and avoid hitting disk to read it, because reading data from disk is a costlier operation than reading it from memory.
Now let's apply the same analogy to the Hadoop ecosystem. Here the disk is your HDFS and the memory is the local file system, where the actual tasks run. During the life cycle of an application, multiple tasks may run on the same node. When the first task is launched on a node, it fetches the data from HDFS and puts it on the local file system; the subsequent tasks on the same node do not fetch the same data again. That way it saves the cost of getting the data from HDFS compared with getting it from the local file system. This is the concept of the Distributed Cache in the MapReduce framework.
The size of the data is usually small enough that it can be loaded into the mapper's memory, usually a few MB.
I too agree that it's not really a "distributed cache", but I am convinced by YoungHobbit's comments on the efficiency of not hitting disk for IO operations.
The only merit I have seen in this mechanism, as per the Apache documentation, is:
The framework will copy the necessary files on to the slave node before any tasks for the job are executed on that node. Its efficiency stems from the fact that the files are only copied once per job and the ability to cache archives which are un-archived on the slaves.
Please note that DistributedCache has been deprecated since the 2.6.0 release. You have to use the new APIs in the Job class to achieve the same functionality.
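For reference, a minimal sketch of the replacement API (the file path and job name here are made up): the driver adds files with Job.addCacheFile, and the framework localizes them on every node that runs a task for the job.

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class CacheFileDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "cache-file-example");
            job.setJarByClass(CacheFileDriver.class);

            // Replacement for DistributedCache.addCacheFile(...):
            // the file is copied to every node that runs a task for this job.
            // The "#lookup" fragment gives it a stable symlink name in the
            // task working directory.
            job.addCacheFile(new URI("/user/me/lookup.txt#lookup"));

            // ... set mapper/reducer/input/output as usual ...
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

On the task side, context.getCacheFiles() returns the same URIs, and the localized copies (or their symlinks) are available in the task's working directory.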

Distributed Cache in Hadoop

What is the Distributed Cache in Hadoop?
How does it work?
Could someone give me an inline description of it with a real-world example?
The distributed cache can contain small data files needed for initialization or libraries of code that may need to be accessed on all nodes in the cluster.
Say, for example, you have to count the number of occurrences of each word in a huge set of files.
And you are instructed to count every word except those listed in a given file (say ignore.csv, which is also a large file).
Then you read this ignore.csv from the distributed cache in the setup function of your mapper or reducer (depending on your logic) and store it in a data structure where you can look up each word easily (e.g. a HashMap).
This file is read and stored before the mappers and reducers on any machine start, and the distributed cache contents are the same for all the machines running in the cluster.
I hope you understand now. Please comment with your doubts, if any.
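A rough sketch of what that looks like in code, assuming the driver added the file with job.addCacheFile(new URI("/user/me/ignore.csv#ignore")) so it is symlinked into the task working directory as "ignore"; the path and the comma-separated format are made up for illustration.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashSet;
    import java.util.Set;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class FilteredWordCountMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Set<String> ignore = new HashSet<>();

        @Override
        protected void setup(Context context) throws IOException {
            // "ignore" is the symlink created by the #ignore fragment
            // in the URI passed to job.addCacheFile(...).
            try (BufferedReader reader = new BufferedReader(new FileReader("ignore"))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    for (String w : line.split(",")) {
                        ignore.add(w.trim().toLowerCase());
                    }
                }
            }
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Emit (word, 1) for every word that is not in the ignore set.
            for (String word : value.toString().split("\\s+")) {
                if (!word.isEmpty() && !ignore.contains(word.toLowerCase())) {
                    context.write(new Text(word), ONE);
                }
            }
        }
    }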
DistributedCache is a deprecated class in Hadoop. Here is the right way to use it now:
Hadoop DistributedCache is deprecated - what is the preferred API?
DistributedCache copies the files to all the slave nodes, so that access is faster for the MR tasks running locally. The cache is not in RAM; it's just a file-system cache on the local disk volumes of all the slave nodes.

Copying a large file (~6 GB) from S3 to every node of an Elastic MapReduce cluster

Turns out that copying a large file (~6 GB) from S3 to every node in an Elastic MapReduce cluster in a bootstrap action doesn't scale well; the pipe is only so big, and downloads to the nodes get throttled as # nodes gets large.
I'm running a job flow with 22 steps, and this file is needed by maybe 8 of them. Sure, I can copy from S3 to HDFS and cache the file before every step, but that's a major speed kill (and can affect scalability). Ideally, the job flow would start with the file on every node.
There are StackOverflow questions at least obliquely addressing persisting a cached file through a job flow:
Re-use files in Hadoop Distributed cache,
Life of distributed cache in Hadoop .
I don't think they help me. Anyone have some fresh ideas?
Two ideas, please consider your case specifics and disregard at will:
Share the file through NFS from a server whose instance type has good enough networking, in the same placement group or AZ.
Have EBS PIOPS volumes and EBS-Optimized instances with the file pre-loaded, and just attach them to your nodes in a bootstrap action.

How big is too big for a DistributedCache file hadoop?

Are there any guidelines for whether to distribute a file using a distributed cache or not ?
I have a file of size 86746785 (I got this from hadoop dfs -dus; I don't know if this is in bytes or what). Is it a good idea to distribute this file?
The only viable answer is "it depends".
What you have to consider about the distributed cache is that the file gets copied to every node that is involved in your job, which obviously takes bandwidth. Also, usually if you want a file in the distributed cache, you'll keep it in memory, so you have to take that into consideration too.
As for your case: yes, those are bytes. The size is roughly 86 MB, which is perfectly fine for the distributed cache. Anything within a couple of hundred MB should probably still be.
In addition to TC1's answer, also consider:
When/where are you going to use the file(s), and how big is your cluster?
In a many-mappers, single-reducer (or small number of reducers) scenario where you only need the file in the reducer, I would advise against it: you might as well just pull the file down yourself in the reducer's setup method (see the sketch after this answer), rather than unnecessarily on every task node your mappers run on, especially if the file is large (this depends on how many nodes you have in your cluster).
How many files are you putting into the cache?
If for some reason you have hundreds of files to distribute, you're better off tarring them up and putting the tar file in the distributed cache's archives set (the dist cache will take care of untarring it for you). The thing you're trying to avoid here is this: if you didn't put them in the dist cache but loaded them directly from HDFS, you may run into a scenario where thousands of mappers and/or reducers try to open the same file, which could cause too-many-open-files problems for the name node and data nodes.
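For the "pull it down yourself in the reducer's setup" suggestion above, a minimal sketch might look like the following; the HDFS path and the way the side data is used are hypothetical.

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class SideFileReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        private final List<String> sideData = new ArrayList<>();

        @Override
        protected void setup(Context context) throws IOException {
            // Read the side file straight from HDFS once per reducer task,
            // instead of shipping it to every mapper node via the dist cache.
            Path sideFile = new Path("/user/me/side-data.txt"); // hypothetical path
            FileSystem fs = sideFile.getFileSystem(context.getConfiguration());
            try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(fs.open(sideFile)))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    sideData.add(line);
                }
            }
        }

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // ... use sideData here as needed; below is just a plain sum ...
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }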
The size of the Distributed Cache is 10 GB by default, but it is better to keep only a few MB of data in it; otherwise it will affect the performance of your application.
