How to configure Hadoop parameters on Amazon EMR?

I run an MR job with one master and two slave nodes on Amazon EMR, but I get many error messages like: running beyond physical memory limits. Current usage: 3.0 GB of 3 GB physical memory used; 3.7 GB of 15 GB virtual memory used. Killing container (after map 100% reduce 35%).
I modified my code by adding the following lines to the Hadoop 2.6.0 MR configuration, but I still get the same error messages:
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "jobtest2");
//conf.set("mapreduce.input.fileinputformat.split.minsize","3073741824");
conf.set("mapreduce.map.memory.mb", "8192");
conf.set("mapreduce.map.java.opts", "-Xmx8192m");
conf.set("mapreduce.reduce.memory.mb", "8192");
conf.set("mapreduce.reduce.java.opts", "-Xmx8192m");
What is the correct way to configure these parameters (mapreduce.map.memory.mb, mapreduce.map.java.opts, mapreduce.reduce.memory.mb, mapreduce.reduce.java.opts) on Amazon EMR? Thank you!

Hadoop 2.x allows you to set the map and reduce settings per job, so you are setting them in the correct place. The problem is that the Java opts -Xmx heap must be less than mapreduce.map.memory.mb / mapreduce.reduce.memory.mb: those properties represent the total container memory for heap plus off-heap usage. Take a look at the defaults as an example: http://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-hadoop-task-config.html. If YARN was killing off the containers for exceeding memory at the default settings, then you need to give more memory to the off-heap portion, i.e. increase the gap between -Xmx and the total mapreduce.map.memory.mb / mapreduce.reduce.memory.mb.
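For instance, a common rule of thumb (an assumption here, not an EMR-mandated ratio) is to cap -Xmx at roughly 80% of the container size and leave the rest for off-heap use:

```java
public class ContainerHeap {
    // Leave ~20% of the container for off-heap needs (thread stacks, code
    // cache, direct NIO buffers). The 0.8 ratio is an illustrative assumption.
    static int heapMbFor(int containerMb) {
        return (int) (containerMb * 0.8);
    }

    public static void main(String[] args) {
        int containerMb = 8192; // value of mapreduce.map.memory.mb
        // the -Xmx to put in mapreduce.map.java.opts
        System.out.println("-Xmx" + heapMbFor(containerMb) + "m");
    }
}
```

With mapreduce.map.memory.mb set to 8192, this yields -Xmx6553m, so YARN's physical-memory check has headroom above the heap instead of the heap filling the entire container as in the question's code.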

Take a look at the documentation for the AWS CLI. There is a section on Hadoop and how to map settings to the specific XML config files at EMR cluster creation. I have found this to be the best approach available on EMR.
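On EMR release 4.x and later, that mapping can be expressed as a configurations JSON passed at cluster creation with `aws emr create-cluster ... --configurations file://./config.json`. A sketch (classification names as documented in the EMR Release Guide for your release; the memory values are illustrative, with the heap kept below the container size):

```json
[
  {
    "Classification": "mapred-site",
    "Properties": {
      "mapreduce.map.memory.mb": "8192",
      "mapreduce.map.java.opts": "-Xmx6553m",
      "mapreduce.reduce.memory.mb": "8192",
      "mapreduce.reduce.java.opts": "-Xmx6553m"
    }
  }
]
```

On older EMR releases the equivalent is done with a configure-hadoop bootstrap action; check the guide for the mechanism your release supports.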

Related

How to unzip large xml files into one HDFS directory

I have a requirement to load zip files from an HDFS directory, unzip them, and write the unzipped files back to a single HDFS directory. The files are XML and their sizes run into gigabytes.
First, I implemented a MapReduce program with a custom InputFormat and custom RecordReader to unzip the files and provide the contents to the mapper; each mapper then processes and writes to HDFS using MultipleOutputs. The job runs on YARN.
This approach works fine and produces the unzipped files in HDFS when the input size is in MBs, but when the input size is in GBs, the job fails to write and ends with the following error:
17/06/16 03:49:44 INFO mapreduce.Job:  map 94% reduce 0%
17/06/16 03:49:53 INFO mapreduce.Job:  map 100% reduce 0%
17/06/16 03:51:03 INFO mapreduce.Job: Task Id : attempt_1497463655394_61930_m_000001_2, Status : FAILED
Container [pid=28993,containerID=container_e50_1497463655394_61930_01_000048] is running beyond physical memory limits. Current usage: 2.6 GB of 2.5 GB physical memory used; 5.6 GB of 12.5 GB virtual memory used. Killing container.
It is apparent that each unzipped file is processed by one mapper, and the YARN child container running the mapper cannot hold the large file in memory.
On the other hand, I would like to try Spark to unzip the files and write them to a single HDFS directory on YARN, but I wonder whether with Spark each executor also has to process a single file.
I'm looking for a solution that processes the files in parallel but, at the end, writes them to a single directory.
Please let me know whether this is possible in Spark, and share some code snippets.
Any help is appreciated.
Actually, the task itself is not failing! YARN is killing the container (inside which the map task runs) because the YARN child is using more memory than it requested from YARN. Since you are planning to do it in Spark anyway, you can simply increase the memory given to the MapReduce tasks.
I would recommend you to:
Increase the YARN child memory, as you are handling GBs of data. Some key properties:
yarn.nodemanager.resource.memory-mb => memory available for containers on a node
yarn.scheduler.maximum-allocation-mb => maximum container allocation
mapreduce.map.memory.mb => map task memory (must be less than yarn.scheduler.maximum-allocation-mb at any point at runtime)
Focus this job on data processing (unzipping) only, and invoke another job/command to merge the files.
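Whichever engine you pick, the underlying fix is the same: stream each zip entry through a fixed-size buffer instead of materializing a whole file in memory, so the mapper's footprint is independent of entry size. A minimal local sketch with java.util.zip (a hypothetical helper, not the asker's RecordReader; in the real job the local streams would be HDFS input/output streams, and entry names should be sanitized against path traversal):

```java
import java.io.*;
import java.nio.file.*;
import java.util.zip.*;

public class StreamingUnzip {
    // Copy each zip entry with a small fixed buffer so memory use stays
    // constant regardless of how large any single entry is.
    public static void unzip(InputStream in, Path outDir) throws IOException {
        Files.createDirectories(outDir);
        byte[] buf = new byte[64 * 1024];
        try (ZipInputStream zin = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                if (entry.isDirectory()) continue;
                // Entry names are trusted here; validate them in production.
                Path target = outDir.resolve(entry.getName());
                Files.createDirectories(target.getParent());
                try (OutputStream out = Files.newOutputStream(target)) {
                    int n;
                    while ((n = zin.read(buf)) > 0) out.write(buf, 0, n);
                }
            }
        }
    }
}
```

In Spark the same idea applies: open the zip as a stream per partition and write entries out as you decode them, rather than collecting a whole file into one value.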

Getting node utilization % in YARN (Hadoop 2.6.0)

In a YARN 2.6.0 cluster, is there a way to be able to get all the connected node's CPU utilization at the ResourceManager? Also, is the source code modifiable such that we can decide the nodes for a map-reduce job based on the utilization. If yes, where would this change take place?
Please find the implementation of the container monitor (CPU utilization) here:
hadoop-2.6.0-src/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/monitor/ContainersMonitorImpl.java
It has the methods that check whether a container is over its limit; isProcessTreeOverLimit shows you how YARN gets the memory usage of a given container (process).
hadoop-2.6.0-src/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/ProcfsBasedProcessTree.java
The file above shows how YARN obtains memory usage: by tracking each process's files in /proc.
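The core of isProcessTreeOverLimit can be sketched as follows (a simplified paraphrase for illustration, not the production code): a container is flagged only when processes older than one monitoring interval exceed the limit, so a momentary fork, which briefly doubles apparent usage, is tolerated up to twice the limit.

```java
public class OverLimitCheck {
    // Simplified paraphrase of YARN's isProcessTreeOverLimit:
    //   curMemUsage  - memory of the whole process tree right now
    //   agedMemUsage - memory of processes older than one monitoring interval
    //   limit        - the container's memory limit
    static boolean overLimit(long curMemUsage, long agedMemUsage, long limit) {
        // A fresh fork can transiently double usage, so only kill when total
        // usage exceeds 2x the limit, or the aged processes alone exceed it.
        return curMemUsage > 2 * limit || agedMemUsage > limit;
    }
}
```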

Hive memory setting for local task during map join

I'm using an HDInsight cluster (Hive version 0.13) to run some Hive queries. One of the queries (query 7 from the TPC-H suite), which launches a local task during a map join, fails due to insufficient memory (Hive aborts it because the hash table has reached the configured limit).
Hive seems to be allocating 1 GB to the local task; where is this size picked up from, and how can I increase it?
2015-05-03 05:38:19 Starting to launch local task to process map join; maximum memory = 932184064
I assumed the local task would use the same heap size as the mapper, but that does not seem to be the case. Any help is appreciated.
Quite late on this thread, but for others who face the same issue:
Although the documentation states that the local (child) JVM will have the same size as the map JVM (https://cwiki.apache.org/confluence/display/Hive/MapJoinOptimization), that does not seem to be the case. Instead, the JVM size is governed by the HADOOP_HEAPSIZE setting from hive-env.sh. So, in the case of the original post from Shradha, I suspect HADOOP_HEAPSIZE is set to 1 GB.
This property controls it:
yarn.app.mapreduce.am.command-opts
These are the Application Master JVM opts, since the local task runs in the AM.
Can you also try this property :
set hive.mapjoin.localtask.max.memory.usage = 0.999;
You can use HADOOP_HEAPSIZE=512 or HADOOP_CLIENT_OPTS=-Xmx512m, both of which can be tweaked in hadoop-env.sh. Note, however, that this might lead to unexpected behavior for some jobs, and you will probably have to play with
mapreduce.map.memory.mb and mapreduce.map.java.opts
as well as
mapreduce.reduce.memory.mb and mapreduce.reduce.java.opts in the mapred-site config file in order to make sure that jobs remain stable.
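A minimal sketch of the relevant env lines (the 2048 value is illustrative, not a recommendation; file names as in a stock Hadoop/Hive layout):

```shell
# hadoop-env.sh (or hive-env.sh, which governs Hive's local map-join task)
export HADOOP_HEAPSIZE=2048            # child JVM heap, in MB
export HADOOP_CLIENT_OPTS="-Xmx2048m"  # opts passed to client-side JVMs
```

After changing either file, re-run the query from a fresh session so the new environment is picked up.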

What effects do dfs.blocksize, file.blocksize, kfs.blocksize and etc have in hadoop mapreduce job?

When I check the job.xml file of a Hadoop (version 0.21.0) MapReduce job, I find that multiple block-size settings exist:
dfs.blocksize = 134217728 (i.e. 128MB)
file.blocksize = 67108864 (i.e. 64MB)
kfs.blocksize = 67108864
s3.blocksize = 67108864
s3native.blocksize = 67108864
ftp.blocksize = 67108864
I am expecting answers that explain the following related questions:
What do dfs, file, kfs, s3, etc. mean in this context?
What are the differences among them?
What effect do they have when running a MapReduce job?
Thank you very much!
MapReduce can work on data stored in different types of storage systems; the settings above are the default block sizes for each storage scheme. dfs (the Hadoop Distributed File System) is what we commonly use in Hadoop, and here it has a default block size of 128 MB. The other settings are for file (the local filesystem), kfs (the Kosmos distributed filesystem), s3 and s3native (Amazon S3 storage), and ftp (files on an FTP server).
While running a MapReduce job, the block-size setting for the particular storage scheme being used determines how the input is split. You may research each of them further for a better understanding and for using them with Hadoop features.
I hope it was helpful.
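The practical effect is on split, and hence map task, count: by default FileInputFormat creates one split per block of the file's underlying filesystem. A small illustration (assuming splits simply follow block boundaries, i.e. min/max split sizes are untouched):

```java
public class SplitCount {
    // Number of input splits for a file, assuming one split per block
    // (the FileInputFormat default when split sizes are not overridden).
    static long splits(long fileBytes, long blockBytes) {
        return (fileBytes + blockBytes - 1) / blockBytes; // ceiling division
    }
}
```

A 1 GiB input yields 8 map tasks under dfs.blocksize = 134217728 (128 MB) but 16 under the 67108864 (64 MB) defaults of s3, kfs, or ftp, so the same job parallelizes differently depending on where its input lives.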

Hadoop Error: Java heap space

So, after a percent or so of the job has run, I get an error that says "Error: Java heap space" and then something along the lines of "Application container killed".
I am literally running an empty map and reduce job. However, the job does take an input that is roughly 100 GB. For whatever reason, I run out of heap space, although the job does nothing.
I am using the default configuration on a single machine, running Hadoop 2.2 on Ubuntu. The machine has 4 GB of RAM.
Thanks!
// Note
Got it figured out.
It turns out I had configured a different terminating token/string. The format of the data had changed, so that token/string no longer existed, and the job was trying to read all 100 GB into RAM as a single record for one key.
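The failure mode generalizes: when the configured record delimiter never occurs in the input, the whole input collapses into one giant record. A toy illustration (a hypothetical helper, not Hadoop's record reader):

```java
public class DelimitedRecords {
    // Count the records produced by splitting on a delimiter. If the
    // delimiter never appears, everything becomes a single record,
    // which is exactly the situation that blew the heap above.
    static int countRecords(String data, String delimiter) {
        return data.split(java.util.regex.Pattern.quote(delimiter), -1).length;
    }
}
```

So when a job OOMs on trivially simple map logic, checking the record-boundary configuration against the actual data format is worth doing before reaching for bigger heaps.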
