I've been strugling with distcp for several days and I swear I have googled enough. Here is my use-case:
USE CASE
I have a main folder in a certain location say /hdfs/root, with a lot of subdirs (deepness is not fixed) and files.
Volume: 200,000 files ~= 30 GO
I need to copy only a subset for a client, /hdfs/root in another location, say /hdfs/dest
This subset is defined by a list of absolute path that can be updated over time.
Volume: 50,000 files ~= 5 GO
You understand that I can't use a simple hdfs dfs -cp /hdfs/root /hdfs dest because it is not optimized, it will take every files, and it hasn't an -update mode.
SOLUTION POC
I ended up using hadoop distcp in two ways:
Algo 1 (simplified):
# I start up to N distcp jobs in parallel for each subdir, with N=MAX_PROC (~30)
foreach subdir in mylist:
# mylist = /hdfs/root/dirX/file1 /hdfs/root/dirX/file2 ...
mylist = buildList(subdirs)
hadoop distcp -i -pct -update mylist /hdfs/dest/subdir &
and
Algo 2
# I start one distcp that has a blacklist
blacklist = buildBlackList()
hadoop distcp -numListstatusThread 10 -filters blacklist -pct -update /hdfs/root /hdfs/dest
Algo 2 does not even start, it seems that building a diff between source and blacklist is too hard for him, so I use Algo 1, and it works.
OOZIE WORKFLOW
Know I need to schedule all the workflow in a Oozie workflow.
I have put the algo 2 in a shell action, since I have a lot of distcp command and I don't master recursion or loop in oozie.
Once started, after a while, I get the following error:
Container runs beyond physical memory limits. Current usage: 17.2 GB of 16 GB physical memory used
Alright then, i'm gonna add more memory :
<configuration>
<property>
<name>oozie.launcher.mapreduce.map.memory.mb</name>
<value>32768</value>
</property>
<property>
<name>oozie.launcher.mapreduce.map.java.opts</name>
<value>-Xmx512m</value>
</property>
</configuration>
And still I get: Container runs beyond physical memory limits. Current usage: 32.8 GB of 32 GB physical memory used But the job lived twice as long as the previous one.
The RAM on my cluster is not infinite, so I can't go further. Here are my hypothesis:
A distcp job does not release memory (JVM garbage collector ?)
Oozie sees the addition of all distcp jobs as the current memory usage, which is stupid
This is not the right way to do this (well I know, but still)
Also, there are a lot of things I did not understand about memory management, it's pretty foggy (yarn, oozie, jvm, mapreduce).
While googling, I noticed few people are talking about real distcp use case, this post is 4 days old: https://community.hortonworks.com/articles/71775/managing-hadoop-dr-with-distcp-and-snapshots.html and explains the snapshot usage, that I can't use in my case.
I've also heard about http://atlas.incubator.apache.org that would eventually solve my problem by "tagging" files and grant access to specific users, so we can avoid copying to a certain location. My admin team is working on it, but we won't get it to production know.
I'm quite desperate. Help me.
YARN containers are built on top of Linux "cgroups". These "cgroups" are used to put soft limits on CPU, but not on RAM...
Therefore YARN uses a clumsy workaround: it periodically checks how much RAM each container uses, and kills brutally anything that got over quota. So you lose the execution logs, and only get that dreadful message you have seen.
In most cases, you are running some kind of JVM binary (i.e. a Java/Scala utility or custom program) so you can get away by setting your own JVM quotas (especially -Xmx) so that you always stay under the YARN limit. Which means some wasted RAM because of the safety margin. But then the worse case is an clean failure of the JVM when it's out of memory, you get the execution logs in extenso and can start adjusting the quotas -- or fixing your memory leaks :-/
So what happens in your specific case? You are using Oozie to start a shell -- then the shell starts a hadoop command, which runs in a JVM. It is on that embedded JVM that you must set the Max Heap Size.
Long story short: if you allocate 32GB to the YARN container that runs your shell (via oozie.launcher.mapreduce.map.memory.mb) then you must ensure that the Java commands inside the shell do not consume more than, say, 28GB of Heap (to stay on the safe side).
If you are lucky, setting a single env variable will do the trick:
export HADOOP_OPTS=-Xmx28G
hadoop distcp ...........
If you are not lucky, you will have to unwrap the whole mess of hadoop-env.sh mixing different env variables with different settings (set by people that visibly hate you, in init scripts that you cannot even know about) to be interpreted by the JVM using complex precedence rules. Have fun. You may peek at that very old post for hints about where to dig.
Related
What is the main constraint on running larger YARN jobs (Hadoop version HDP-3.1.0.0 (3.1.0.0-78)) and how do I increase it? Basically, want to do more (all of which are pretty large) sqoop jobs concurrently.
I am currently assuming that I need to increase the Resource Manager heap size (since that is what I see going up on the Ambari dashboard when I run YARN jobs). How to add more resources to RM heap / why does RM heap appear to be such a small fraction of total RAM available (to YARN?) across the cluster?
Looking in Ambari: YARN cluster memory is 55GB, but RM heap is only 900MB.
Could anyone with more experience tell me what is the difference and which is the limiting factor in running more YARN applications (and again, how do I increase it)? Anything else that I should be looking at? Any docs explaining this in more detail?
The convenient way to tune your YARN and MapReduce memory is to use yarn-utils script.
Download Companion Files ## Ref
wget http://public-repo-1.hortonworks.com/HDP/tools/2.6.0.3/hdp_manual_install_rpm_helper_files-2.6.0.3.8.tar.gz
tar zxvf hdp_manual_install_rpm_helper_files-2.6.0.3.8.tar.gz
Executing YARN Utility Script ## Ref
You can execute yarn-utils.py python script by providing Available Cores, Available Memory, No. of Disks, HBase is installed or not.
If you have a heterogeneous Hadoop Cluster then you have to create Configuration groups based on Nodes specification. If you need more info on that let me know I will update my answer according to that.
I am following the tutorial on the hadoop website: https://hadoop.apache.org/docs/r3.1.2/hadoop-project-dist/hadoop-common/SingleCluster.html.
I run the following example in Pseudo-Distributed Mode.
time hadoop jar hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.2.jar grep input output 'dfs[a-z.]+'
It takes 1:47min to complete. When I turn off the network (wifi), it finishes in approx 50 seconds.
When I run the same command using the Local (Standalone) Mode, it finishes in approx 5 seconds (on a mac).
I understand that in Pseudo-Distributed Mode there is more overhead involved and hence it will take more time, but in this case it takes way more time. The CPU is completely idle during the run.
Do you have any idea what can cause this issue?
First, I don't have an explanation for why turning off your network would result in faster times. You'd have to dig through the Hadoop logs to figure out that problem.
This is typical behavior most people encounter running Hadoop on a single node. Effectively, you are trying to use Fedex to deliver something to your next door neighbor. It will always be faster to walk it over because the inherent overhead of operating a distributed system. When you run local mode, you are only performing the Map-Reduce function. When you run pseudo-distributed, it will use all the Hadoop servers (NameNode, DataNodes for data; Resource Manager, NodeManagers for compute) and what you are seeing is the latencies involved in that.
When you submit your job, the Resource Manager has to schedule it. As your cluster is not busy, it will ask for resources from the Node Manager. The Node Manager will give it a container which will run your Application Master. Typically, this loop takes about 10 seconds. Once your AM is running it will ask for resources from the Resource Manager for it's Map and Reduce tasks. This takes another 10 seconds. Also when you submit your job there is around a 3 second wait before this job is actually submitted to the Resource Manager. So far that's 23 seconds and you haven't done any computation yet.
Once the job is running, the most likely cause of waiting is allocating memory. On smaller systems (> 32GB of memory) the OS might take a while to allocate space. If you were to run the same thing on what is considered commodity hardware for Hadoop (16+ core, 64+ GB) you would probably see run time closer to 25-30 seconds.
I have a situation where I have to copy data/files from PROD to UAT (hadoop clusters). For that I am using 'distcp' now. but it is taking forever. As distcp uses map-reduce under the hood, is there any way to use spark to make the process any faster? Like we can set hive execution engine to 'TEZ' (to replace map-reduce), can we set execution engine to spark for distcp? Or is there any other 'spark' way to copy data across clusters which may not even bother about distcp?
And here comes my second question (assuming we can set distcp execution engine to spark instead of map-reduce, please don't bother to answer this one otherwise):-
As per my knowledge Spark is faster than map-reduce mainly because it stores data in the memory which it might need to process in several occasions so that it does not have to load the data all the way from disk. Here we are copying data across clusters, so there is no need to process one file (or block or split) more than once as each file will go up into the memory then will be sent over the network, gets copied to the destination cluster disk, end of the story for that file. Then how come Spark makes the process faster if the main feature is not used?
Your bottlenecks on bulk cross-cluster IO are usually
bandwidth between clusters
read bandwidth off the source cluster
write bandwidth to the destination cluster (and with 3x replication, writes do take up disk and switch bandwidth)
allocated space for work (i.e. number of executors, tasks)
Generally on long-distance uploads its your long-haul network that is the bottleneck: you don't need that many workers to flood the network.
There's a famous tale of a distcp operation between two Yahoo! clusters which did manage to do exactly that to part of the backbone: the Hadoop ops team happy that the distcp was going so fast, while the networks ops team are panicing that their core services were somehow suffering due to the traffic between two sites. I believe this incident is the reason that distcp now has a -bandwidth option :)
Where there may be limitations in distcp, it's probably in task setup and execution: the decision of which files to copy is made in advance and there's not much (any?) intelligence in rescheduling work if some files copy fast but others are outstanding.
Distcp just builds up the list in advance and hands it off to the special distcp mappers, each of which reads its list of files and copies it over.
Someone could try doing a spark version of distcp; it could be an interesting project if someone wanted to work on better scheduling, relying on the fact that spark is very efficient at pushing out new work to existing executors: a spark version could push out work dynamically, rather than listing everything in advance. Indeed, it could still start the copy operation while enumerating the files to copy, for a faster startup time. Even so: cross-cluster bandwidth will usually be the choke point.
Spark is not really intended for data movement between Hadoop clusters. You may want to look into additional mappers for your distcp job using the "-m" option.
The project I'm currently working on uses a small Hadoop cluster to iterate over about 300gb of data. This data is analyzed and it fills up a mongoDb that is used later on by our system.
Right now the Hadoop cluster is running on 4 physical machine (old Dell Precision t3500's). For testing this was a great setup as I could easily interact, install and test with the machines. But obviously this is less desired when the program releases. For this step the most desired outcome would be to virtualize Hadoop. Spread it out over a set of Docker containers that can run within a cluster.
When searching the internet it quickly became clear that Hadoop can run in a environment like that. Most search results speak about Yarn and the actual hadoop instances and how to start them. That is all great but I was wondering: what happens to HDFS.
In my current test setup HDFS contains 300gb of data that is stored in triple (to prevent data loss). When the system goes live this data set will grow with approximately 250mb each day. Uploading all of these files into HDFS takes a...while.
Now to get to my question:
How would HDFS act when docker starts or stops certain containers. Can it still guarantee that it will not loose any data. And would it not takes ages to re-sync a new node? Also it is very well possible that I'm looking at this from the wrong perspective. I've never done this before so if I'm going the wrong way, please let me know.
ps: I'm sorry if this is a bit of a long/vague question. But like I said this is uncharted territory for me so I'm looking for something that can point me in the right direction, Google only got me sofar but limits its information to YARN and Hadoop self
I'm having trouble figuring out the best way to configure my Hadoop cluster (CDH4), running MapReduce1. I'm in a situation where I need to run both mappers that require such a large amount of Java heap space that I couldn't possible run more than 1 mapper per node - but at the same time I want to be able to run jobs that can benefit from many mappers per node.
I'm configuring the cluster through the Cloudera management UI, and the Max Map Tasks and mapred.map.child.java.opts appear to be quite static settings.
What I would like to have is something like a heap space pool with X GB available, that would accommodate both kinds of jobs without having to reconfigure the MapReduce service each time. If I run 1 mapper, it should assign X GB heap - if I run 8 mappers, it should assign X/8 GB heap.
I have considered both the Maximum Virtual Memory and the Cgroup Memory Soft/Hard limits, but neither will get me exactly what I want. Maximum Virtual Memory is not effective, since it still is a per task setting. The Cgroup setting is problematic because it does not seem to actually restrict the individual tasks to a lower amount of heap if there is more of them, but rather will allow the task to use too much memory and then kill the process when it does.
Can the behavior I want to achieve be configured?
(PS you should use the newer name of this property with Hadoop 2 / CDH4: mapreduce.map.java.opts. But both should still be recognized.)
The value you configure in your cluster is merely a default. It can be overridden on a per-job basis. You should leave the default value from CDH, or configure it to something reasonable for normal mappers.
For your high-memory job only, in your client code, set mapreduce.map.java.opts in your Configuration object for the Job before you submit it.
The answer gets more complex if you are running MR2/YARN since it no longer schedules by 'slots' but by container memory. So memory enters the picture in a new, different way with new, different properties. (It confuses me, and I'm even at Cloudera.)
In a way it would be better, because you express your resource requirement in terms of memory, which is good here. You would set mapreduce.map.memory.mb as well to a size about 30% larger than your JVM heap size since this is the memory allowed to the whole process. It would be set higher by you for high-memory jobs in the same way. Then Hadoop can decide how many mappers to run, and decide where to put the workers for you, and use as much of the cluster as possible per your configuration. No fussing with your own imaginary resource pool.
In MR1, this is harder to get right. Conceptually you want to set the maximum number of mappers per worker to 1 via mapreduce.tasktracker.map.tasks.maximum, along with your heap setting, but just for the high-memory job. I don't know if the client can request or set this though on a per-job basis. I doubt it as it wouldn't quite make sense. You can't really approach this by controlling the number of mappers just because you have to hack around to even find out, let alone control, the number of mappers it will run.
I don't think OS-level settings will help. In a way these resemble more how MR2 / YARN thinks about resource scheduling. Your best bet may be to (move to MR2 and) use MR2's resource controls and let it figure the rest out.