Hadoop EMR job runs out of memory before RecordReader initialized - hadoop

I'm trying to figure out what could be causing my EMR job to run out of memory before it has even started processing my file inputs. I'm getting a
"java.lang.OutOfMemoryError cannot be cast to java.lang.Exception" error before my RecordReader is even initialized (i.e., before it has even tried to unzip the files and process them). I am running the job on a directory with a very large number of input files. I am able to run the job just fine on a smaller input set. Does anyone have any ideas?

I realized that the answer is that there was too much metadata overhead on the master node. The master node must store roughly 150 KB of data for each file that will be processed, so with millions of files this adds up to gigabytes of metadata, which was too much and caused the master node to crash.
Here's a good source for more information: http://www.inquidia.com/news-and-info/working-small-files-hadoop-part-1
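For anyone hitting the same wall: the usual mitigation (discussed in the article above) is to stop generating one split per small file. As a rough illustration only, and not the code from this job (which used a custom zip RecordReader), a driver based on CombineTextInputFormat, available in newer Hadoop releases, packs many small files into a bounded number of splits:
// Illustrative sketch: pack many small files into fewer splits so the
// master/client does not have to track one split (and one task) per file.
// Paths and class names are made up for the example.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class CombineSmallFilesJob {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "combine-small-files");
    job.setJarByClass(CombineSmallFilesJob.class);

    // Group many small files into splits of up to ~256 MB each,
    // instead of one split (and one mapper) per file.
    job.setInputFormatClass(CombineTextInputFormat.class);
    CombineTextInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);

    job.setMapperClass(Mapper.class);   // identity mapper, just for the sketch
    job.setNumReduceTasks(0);
    job.setOutputFormatClass(TextOutputFormat.class);
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
With the maximum split size capped at 256 MB, the number of splits (and the per-task bookkeeping on the master) grows with total input size rather than with the file count, though the initial file listing still has to touch every file.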

Related

How to unzip large xml files into one HDFS directory

I have a requirement to load zip files from an HDFS directory, unzip them, and write all of the unzipped files back to a single HDFS directory. The files are XML and their sizes range into the gigabytes.
First, I approached this with a MapReduce program, writing a custom InputFormat and custom RecordReader to unzip the files and pass their contents to the mappers; each mapper then processes the content and writes to HDFS using a multiple-output format (a simplified sketch of this reader appears at the end of this question). The MapReduce job runs on YARN.
This approach works fine and produces the unzipped files in HDFS when the input size is in the MBs, but when the input size is in the GBs the job fails to write, ending with the following error.
17/06/16 03:49:44 INFO mapreduce.Job:  map 94% reduce 0%
17/06/16 03:49:53 INFO mapreduce.Job:  map 100% reduce 0%
17/06/16 03:51:03 INFO mapreduce.Job: Task Id : attempt_1497463655394_61930_m_000001_2, Status : FAILED
Container [pid=28993,containerID=container_e50_1497463655394_61930_01_000048] is running beyond physical memory limits. Current usage: 2.6 GB of 2.5 GB physical memory used; 5.6 GB of 12.5 GB virtual memory used. Killing container.
It is apparent that each unzipped file is processed by a single mapper, and the YARN child container running the mapper cannot hold the large file in memory.
On the other hand, I would like to try Spark on YARN to unzip the files and write the unzipped output to a single HDFS directory, but I wonder whether in Spark, too, each executor has to process a whole file by itself.
I'm looking for a solution that processes the files in parallel but, in the end, writes them to a single directory.
Please let me know whether this is possible in Spark, and share some code snippets.
Any help is appreciated.
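For reference, the reader I wrote is roughly shaped like the simplified sketch below (not the actual production code; class and field names are just for illustration). Each call to map() receives one zip entry's name as the key and its entire contents as the value, which is where the memory goes:
// Simplified sketch of the custom RecordReader described above.
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class ZipEntryRecordReader extends RecordReader<Text, Text> {
  private ZipInputStream zip;
  private final Text currentKey = new Text();
  private final Text currentValue = new Text();

  @Override
  public void initialize(InputSplit split, TaskAttemptContext context) throws IOException {
    Path path = ((FileSplit) split).getPath();
    zip = new ZipInputStream(path.getFileSystem(context.getConfiguration()).open(path));
  }

  @Override
  public boolean nextKeyValue() throws IOException {
    ZipEntry entry = zip.getNextEntry();
    if (entry == null) {
      return false;
    }
    // Buffer the whole entry in memory -- this is exactly what pushes the
    // container past its limit once individual XML files reach gigabytes.
    ByteArrayOutputStream contents = new ByteArrayOutputStream();
    byte[] buffer = new byte[8192];
    int read;
    while ((read = zip.read(buffer)) != -1) {
      contents.write(buffer, 0, read);
    }
    currentKey.set(entry.getName());
    currentValue.set(contents.toByteArray());
    return true;
  }

  @Override public Text getCurrentKey() { return currentKey; }
  @Override public Text getCurrentValue() { return currentValue; }
  @Override public float getProgress() { return 0f; }
  @Override public void close() throws IOException { if (zip != null) zip.close(); }
}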
Actually, the task itself is not failing! YARN is killing the container (inside which the map task is running) because that YARN child is using more memory than it requested from YARN. Instead of switching to Spark for this, you can simply increase the memory given to the MapReduce tasks.
I would recommend that you:
Increase the YARN child memory, since you are handling GBs of data. Some key properties (a driver-side sketch of setting these follows below):
yarn.nodemanager.resource.memory-mb => container memory available on each node
yarn.scheduler.maximum-allocation-mb => maximum container allocation
mapreduce.map.memory.mb => map task memory (must be less than yarn.scheduler.maximum-allocation-mb at any point during runtime)
Focus this job on the data processing (unzipping) only, and invoke another job or command afterwards to merge the files.
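As a minimal sketch (the 4 GB container size and matching heap are example values, not something derived from your cluster), the map-side memory can be raised in the driver like this; the same two properties can equally be passed with -D on the command line:
// Illustrative sketch only: bump the map-task memory request and the JVM heap.
// Values are assumptions and must stay within yarn.scheduler.maximum-allocation-mb.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class UnzipJobDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("mapreduce.map.memory.mb", "4096");       // container size requested from YARN
    conf.set("mapreduce.map.java.opts", "-Xmx3277m");  // JVM heap, roughly 80% of the container
    Job job = Job.getInstance(conf, "unzip-xml");
    // ... set input/output formats, mapper and paths as in the existing job ...
  }
}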

Understanding file handling in hadoop

I am new to the Hadoop ecosystem and have only a basic idea of it. Please assist with the following queries to get me started:
If the file I am trying to copy into HDFS is very large and cannot be accommodated by the commodity hardware available in my Hadoop cluster, what can be done? Will the copy wait until space becomes free, or will it fail with an error?
How can I predict, well in advance, that the above scenario will occur in a Hadoop production environment where we continuously receive files from outside sources?
How do I add a new node to a live HDFS cluster? There are many methods, but which files do I need to alter?
How many blocks does a node hold? Suppose a node is a machine with 500 GB of storage (HDD), 1 GB of RAM and a dual-core processor. In this scenario, is it roughly 500 GB / 64 MB, assuming each block is configured to be 64 MB?
If I copyFromLocal a 1 TB file into HDFS, which portion of the file will be placed in which block on which node? How can I know this?
How can I find which record/row of the input file ends up in which of the multiple files/splits created by Hadoop?
What is the purpose of each of the configuration XMLs (core-site.xml, hdfs-site.xml & mapred-site.xml)? In a distributed environment, which of these files should be placed on all the slave DataNodes?
How can I know how many map and reduce tasks will run for any read/write activity? Will a write operation always have 0 reducers?
Apologies for asking such basic questions. Kindly suggest ways to find answers to all of the above queries.

Adding new files to a running hadoop cluster

Consider that you have 10 GB of data and you want to process it with a MapReduce program using Hadoop. Instead of copying all 10 GB to HDFS at the beginning and then running the program, I want to, for example, copy 1 GB, start the job, and gradually add the remaining 9 GB over time. I wonder whether this is possible in Hadoop.
Thanks,
Morteza
Unfortunately this is not possible with MapReduce. When you initiate a MapReduce job, part of the setup process is determining the block locations of your input. If the input is only partially there, the setup process will only work on those blocks and won't dynamically add inputs.
If you are looking for a stream processor, have a look at Apache Storm (https://storm.apache.org/) or Apache Spark Streaming (https://spark.apache.org/).
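If the goal is simply to start processing before all 10 GB has landed in HDFS, a small Spark Streaming sketch along these lines (the directory path and the 60-second batch interval are made-up values, assuming Spark is available on the cluster) picks up each file as it is copied in:
// Hedged sketch: textFileStream processes only files that newly appear
// in the monitored HDFS directory, batch by batch.
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class IncrementalIngest {
  public static void main(String[] args) throws Exception {
    SparkConf conf = new SparkConf().setAppName("incremental-ingest");
    JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(60));

    // Every batch, process only the files that are new in this directory.
    JavaDStream<String> lines = ssc.textFileStream("hdfs:///data/incoming");
    lines.count().print();   // placeholder action; real processing goes here

    ssc.start();
    ssc.awaitTermination();
  }
}
Each batch only sees files that appeared in the directory since the previous batch, so the remaining 9 GB is processed as it arrives.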

Hadoop - data block caching techniques

I am running some experiments to benchmark the time it takes MapReduce to read and process data stored on HDFS with varying parameters. I use Pig scripts to launch the MapReduce jobs. Since I am working with the same set of files frequently, my results may be affected by file/block caching.
I want to understand the various caching techniques employed in a MapReduce environment.
Let's say a file foo (containing some data to be processed) stored on HDFS occupies 1 HDFS block and is stored on machine STORE. During a MapReduce task, machine COMPUTE reads that block over the network and processes it. Caching can happen at two levels:
Cached in memory of machine STORE (in-memory file cache)
Cached in memory/disk of machine COMPUTE.
I am pretty sure that caching #1 happens. I want to confirm whether something like #2 happens as well. From the post here, it looks like there is no client-level caching going on, since it is very unlikely that the block cached by COMPUTE will be needed again on the same machine before the cache is flushed.
Also, is the Hadoop distributed cache used only to distribute application-specific files (not task-specific input data files) to all TaskTracker nodes? Or is the task-specific input file data (like the foo file block) cached in the distributed cache? I assume local.cache.size and related parameters only control the distributed cache.
Please clarify.
The only caching that is ever applied within HDFS is the OS caching to minimize disk accesses.
So if you access a block from a datanode, it is likely to be cached if nothing else is going on there.
On your client side, this depends on what you do with the block. If you directly write it to disk, it is also very likely that your client OS caches it.
The distributed cache is just for jars and files that need to be distributed across the cluster where your job launches tasks. The name is thus a bit misleading, as it "caches" nothing.
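For example (the file name and path are made up), shipping a lookup file to every task with the newer Job.addCacheFile API looks like this; the job's input blocks never go through this mechanism:
// Small illustration of what the distributed cache is for: a side file,
// such as a lookup table, copied to each node before tasks start.
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CacheFileExample {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "with-side-file");
    // Tasks can then open "lookup.txt" as a local file in their working directory.
    job.addCacheFile(new URI("hdfs:///shared/lookup.txt#lookup.txt"));
    // ... the rest of the job setup is unchanged ...
  }
}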

How to achieve desired block size with Hadoop with data on local filesystem

I have a 2 TB sequence file that I am trying to process with Hadoop; it resides on a cluster set up to use a local (Lustre) filesystem for storage instead of HDFS. My problem is that no matter what I try, I am always forced to have about 66,000 map tasks when I run a map/reduce job with this data as input. This seems to correspond to a block size of 2 TB / 66,000 ≈ 32 MB. The actual computation in each map task executes very quickly, but the overhead associated with so many map tasks slows things down substantially.
For the job that created the data, and for all subsequent jobs, I have dfs.block.size=536870912 and fs.local.block.size=536870912 (512 MB). I also found suggestions that said to try this:
hadoop fs -D fs.local.block.size=536870912 -put local_name remote_location
to make a new copy with larger blocks, which I did to no avail. I have also changed the stripe size of the file on Lustre. It seems that any parameter having to do with block size is ignored for the local filesystem.
I know that using Lustre instead of HDFS is a non-traditional use of Hadoop, but this is what I have to work with. I'm wondering whether others have experience with this, or have any ideas to try beyond what I have mentioned.
I am using cdh3u5 if that is useful.
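One more thing I am considering, as an untested sketch only (the property name and API calls are what I believe apply to this Hadoop version, so treat them as assumptions): forcing a larger minimum split size in the driver, since the split size rather than the reported block size ultimately decides how many map tasks are created:
// Untested suggestion, not a confirmed fix for the Lustre setup.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;

public class LargeSplitDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Ask for 512 MB minimum splits regardless of what the local
    // filesystem reports as its block size (536870912 bytes).
    conf.setLong("mapred.min.split.size", 512L * 1024 * 1024);

    Job job = new Job(conf, "large-split-job");
    job.setJarByClass(LargeSplitDriver.class);
    job.setInputFormatClass(SequenceFileInputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    // ... mapper, reducer and output configuration as in the existing job ...
  }
}
Since FileInputFormat computes the split size as max(minSize, min(maxSize, blockSize)), a 512 MB minimum should collapse the ~66,000 maps to roughly 4,000.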
