How to unzip large XML files into one HDFS directory - hadoop

I have a requirement to load zip files from an HDFS directory, unzip them, and write the contents back to HDFS into a single directory holding all the unzipped files. The files are XML and their sizes run into gigabytes.
I first approached this by implementing a MapReduce program with a custom InputFormat and a custom RecordReader that unzip the files and hand their contents to the mappers; each mapper then processes its content and writes it to HDFS using MultipleOutputs. The MapReduce job runs on YARN.
This approach works fine and yields the unzipped files in HDFS when the input size is in megabytes, but when the input size is in gigabytes the job fails to write and ends with the following error:
17/06/16 03:49:44 INFO mapreduce.Job:  map 94% reduce 0%
17/06/16 03:49:53 INFO mapreduce.Job:  map 100% reduce 0%
17/06/16 03:51:03 INFO mapreduce.Job: Task Id : attempt_1497463655394_61930_m_000001_2, Status : FAILED
Container [pid=28993,containerID=container_e50_1497463655394_61930_01_000048] is running beyond physical memory limits. Current usage: 2.6 GB of 2.5 GB physical memory used; 5.6 GB of 12.5 GB virtual memory used. Killing container.
It is apparent that each unzipped file is processed by one mapper, and the YARN child container running the mapper cannot hold the large file in memory.
On the other hand, I would like to try Spark on YARN to unzip the files and write them to a single HDFS directory, but I wonder whether with Spark, too, each executor has to process a single file in its entirety.
I'm looking for a solution that processes the files in parallel but, in the end, writes them to a single directory.
Please let me know whether this is possible in Spark, and share some code snippets.
Any help is appreciated.

Actually, the task itself is not failing! YARN is killing the container (inside which the map task is running) because the YARN child is using more memory than it requested from YARN. Before switching to Spark, you can simply increase the memory given to the MapReduce tasks.
I would recommend that you:
Increase the YARN child memory, since you are handling gigabytes of data. Some key properties (a short sketch of the per-job settings follows this list):
yarn.nodemanager.resource.memory-mb => Container Memory
yarn.scheduler.maximum-allocation-mb => Container Memory Maximum
mapreduce.map.memory.mb => Map Task Memory (must be less than yarn.scheduler.maximum-allocation-mb at any point in time at runtime)
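For illustration, a minimal sketch of setting the per-job values in a MapReduce driver. The numbers and the job name are placeholders, not values from this thread, and mapreduce.map.java.opts is not mentioned above but normally has to track mapreduce.map.memory.mb; the two yarn.* properties are cluster-side settings in yarn-site.xml.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// In the driver, before submitting the job (assumes the enclosing method declares throws Exception):
Configuration conf = new Configuration();
conf.set("mapreduce.map.memory.mb", "8192");       // container size requested per map task
conf.set("mapreduce.map.java.opts", "-Xmx6554m");  // JVM heap, kept at roughly 80% of the container
Job job = Job.getInstance(conf, "unzip-xml");      // hypothetical job name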
Focus this job on data processing (unzipping) only, and invoke another job/command to merge the files.
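If the Spark route from the question is still preferred, below is a minimal sketch using Spark's Java API; the class name, path variables, and buffer size are illustrative and not from the original thread. Each zip file becomes one record via binaryFiles, and every entry is streamed straight to HDFS, so no executor has to hold a whole XML file in memory and all unzipped files land in the single output directory.
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class UnzipToHdfs {
  public static void main(String[] args) {
    String inputDir = args[0];   // HDFS directory holding the .zip files
    String outputDir = args[1];  // single HDFS directory for the unzipped XML
    SparkConf sparkConf = new SparkConf().setAppName("UnzipToHdfs");
    try (JavaSparkContext sc = new JavaSparkContext(sparkConf)) {
      // One record per zip file: (path, lazily opened stream); zip archives are not split.
      sc.binaryFiles(inputDir).foreach(pair -> {
        FileSystem fs = FileSystem.get(new Configuration());
        try (ZipInputStream zis = new ZipInputStream(pair._2().open())) {
          byte[] buffer = new byte[64 * 1024];
          ZipEntry entry;
          while ((entry = zis.getNextEntry()) != null) {
            if (entry.isDirectory()) continue;
            // Note: entries with the same name in different zips would overwrite each other here.
            Path out = new Path(outputDir, new Path(entry.getName()).getName());
            try (FSDataOutputStream os = fs.create(out, true)) {
              int n;
              while ((n = zis.read(buffer)) != -1) {
                os.write(buffer, 0, n);   // stream the entry's bytes straight to HDFS
              }
            }
          }
        }
      });
    }
  }
}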

Related

Hadoop EMR job runs out of memory before RecordReader initialized

I'm trying to figure out what could be causing my EMR job to run out of memory before it has even started processing my file inputs. I'm getting a "java.lang.OutOfMemoryError cannot be cast to java.lang.Exception" error before my RecordReader is even initialized (i.e., before it has even tried to unzip the files and process them). I am running my job on a directory with a large number of inputs. I am able to run my job just fine on a smaller input set. Does anyone have any ideas?
I realized that the answer is that there was too much metadata overhead on the master node. The master node must store ~150 KB of data for each file that will be processed. With millions of files this amounts to gigabytes of data, which was too much and caused the master node to crash.
Here's a good source for more information: http://www.inquidia.com/news-and-info/working-small-files-hadoop-part-1

Adding new files to a running hadoop cluster

Consider that you have 10 GB of data and you want to process it with a MapReduce program using Hadoop. Instead of copying all 10 GB to HDFS at the beginning and then running the program, I want to, for example, copy 1 GB, start the work, and gradually add the remaining 9 GB over time. I wonder whether this is possible in Hadoop.
Thanks,
Morteza
Unfortunately this is not possible with MapReduce. When you initiate a MapReduce job, part of the setup process is determining the block locations of your input. If the input is only partially there, the setup process will only work on those blocks and won't dynamically add inputs.
If you are looking for a stream processor, have a look at Apache Storm https://storm.apache.org/ or Apache Spark https://spark.apache.org/

How to make Hadoop MapReduce process multiple files in a single run?

We run a Hadoop MapReduce program by executing this command: $ hadoop jar my.jar DriverClass input1.txt hdfsDirectory. How can we make MapReduce process multiple files (input1.txt and input2.txt) in a single run?
Like this:
hadoop jar my.jar DriverClass hdfsInputDir hdfsOutputDir
where
hdfsInputDir is the path on HDFS where your input files are stored (i.e., the parent directory of input1.txt and input2.txt)
hdfsOutputDir is the path on HDFS where the output will be stored (it should not exist before running this command).
Note that your input should be copied to HDFS before running this command.
To copy it to HDFS, you can run:
hadoop dfs -copyFromLocal localPath hdfsInputDir
This is the small files problem: one mapper will run for every file.
A small file is one which is significantly smaller than the HDFS block size (default 64MB). If you’re storing small files, then you probably have lots of them (otherwise you wouldn’t turn to Hadoop), and the problem is that HDFS can’t handle lots of files.
Every file, directory and block in HDFS is represented as an object in the namenode’s memory, each of which occupies 150 bytes, as a rule of thumb. So 10 million files, each using a block, would use about 3 gigabytes of memory. Scaling up much beyond this level is a problem with current hardware. Certainly a billion files is not feasible.
Solution
HAR files
Hadoop Archives (HAR files) were introduced to HDFS in 0.18.0 to alleviate the problem of lots of files putting pressure on the namenode’s memory. HAR files work by building a layered filesystem on top of HDFS. A HAR file is created using the hadoop archive command, which runs a MapReduce job to pack the files being archived into a small number of HDFS files. To a client using the HAR filesystem nothing has changed: all of the original files are visible and accessible (albeit using a har:// URL). However, the number of files in HDFS has been reduced.
Sequence Files
The usual response to questions about “the small files problem” is: use a SequenceFile. The idea here is that you use the filename as the key and the file contents as the value. This works very well in practice. Taking, say, 10,000 files of 100 KB each, you can write a program to put them into a single SequenceFile, and then you can process them in a streaming fashion (directly or using MapReduce) operating on the SequenceFile. There are a couple of bonuses too. SequenceFiles are splittable, so MapReduce can break them into chunks and operate on each chunk independently. They support compression as well, unlike HARs. Block compression is the best option in most cases, since it compresses blocks of several records (rather than per record).
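As a concrete illustration of that idea, here is a minimal sketch of such a packing program (the class name and arguments are made up for the example): it lists a directory of small files and appends each one to a block-compressed SequenceFile, with the file name as the key and the raw bytes as the value.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PackIntoSequenceFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path inputDir = new Path(args[0]);   // directory full of small files
    Path seqFile = new Path(args[1]);    // resulting SequenceFile

    try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
            SequenceFile.Writer.file(seqFile),
            SequenceFile.Writer.keyClass(Text.class),
            SequenceFile.Writer.valueClass(BytesWritable.class),
            SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK))) {
      for (FileStatus status : fs.listStatus(inputDir)) {
        if (!status.isFile()) continue;
        byte[] contents = new byte[(int) status.getLen()];
        try (FSDataInputStream in = fs.open(status.getPath())) {
          in.readFully(contents);                           // files are small, so read them whole
        }
        writer.append(new Text(status.getPath().getName()),  // key = file name
                      new BytesWritable(contents));          // value = file contents
      }
    }
  }
}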

How much overhead a cachedDistributed file has in a mapreduce program?

How much overhead does each distributed cache file add in a MapReduce program? I have a MapReduce program in which I need 50 distributed cache files (of very small size); it seems that the overhead they add is much larger than in the case where I have only one. Is that true?
As far as I understand, distributed cache files are copied to each machine that runs a mapper, so access to a distributed cache file is local and shouldn't have too much overhead.
I think you may try to use archive files (files are unarchived on the task node automatically).
You can add archive files to the DistributedCache by two means:
With a tool that uses GenericOptionsParser: you specify the files to be distributed as a comma-separated list of URIs as the argument to the -archives option. If you don't specify a scheme, the files are assumed to be local, so when you launch the job the local file is copied to the distributed filesystem (often HDFS):
$> hadoop jar foo.jar ClassUsingDistributedCacheFile -archives archive.jar input output
With the distributed cache API (see the Javadoc): with the API, the files specified by the URI must already be in a shared filesystem (the Java API does not copy the file); a short sketch follows below.
Before a task is run, the tasktracker copies the files from the distributed filesystem to a local disk, as you say. I think the overhead comes from retrieving all your little files from HDFS.
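For the API route mentioned above, here is a minimal sketch using the Job-level variant of that API (the HDFS path and job name are purely illustrative; the archive must already sit on the shared filesystem, as noted):
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// In the driver, before job submission (assumes the enclosing method declares throws Exception):
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "uses-distributed-cache");         // hypothetical job name
// The archive is fetched and unpacked on each task node before its tasks start.
job.addCacheArchive(new URI("hdfs:///user/example/archive.jar"));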

How to achieve desired block size with Hadoop with data on local filesystem

I have a 2 TB sequence file that I am trying to process with Hadoop; it resides on a cluster set up to use a local (Lustre) filesystem for storage instead of HDFS. My problem is that no matter what I try, I am always forced to have about 66000 map tasks when I run a map/reduce job with this data as input. This seems to correspond to a block size of 2TB/66000 =~ 32MB. The actual computation in each map task executes very quickly, but the overhead associated with so many map tasks slows things down substantially.
For the job that created the data and for all subsequent jobs, I have dfs.block.size=536870912 and fs.local.block.size=536870912 (512MB). I also found suggestions that said to try this:
hadoop fs -D fs.local.block.size=536870912 -put local_name remote_location
to make a new copy with larger blocks, which I did to no avail. I have also changed the stripe size of the file on Lustre. It seems that any parameters having to do with block size are ignored for the local filesystem.
I know that using lustre instead of HDFS is a non-traditional use of hadoop, but this is what I have to work with. I'm wondering if others either have experience with this, or have any ideas to try other than what I have mentioned.
I am using cdh3u5 if that is useful.
