Merge multiple files into one on Hadoop

A rather stupid question, but how do I combine multiple files in a folder into one file without copying them to the local machine? I do not care about the order.
I thought hadoop fs -getmerge would do the job, but I have since found out that it copies the data to your local machine.
I would do it in my original Spark application, but adding coalesce is increasing my runtime by a large amount.
I am on Hadoop 2.4 if that matters.

how do I combine multiple files in a folder into one file without copying them to local machine?
You have to either copy the files to the local node or to one of the computation nodes.
HDFS is a file system. It doesn't care about your file format. If your files are raw text/binary, you can try the concatenation API, which only manipulates metadata in the NameNode without copying data. But if your files are Parquet/gzip/LZO or similar, they can't simply be concatenated: you have to download them from HDFS, merge them into one, and upload the merged file. Spark's coalesce(1) does the same thing, except it runs on an executor node instead of your local node.
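For raw, concatenable files, a minimal sketch of that metadata-only concat call might look like the following; the paths are made up, the target file must already exist, and HDFS imposes some version-specific restrictions on concat (for example matching block sizes and where the sources may live), so check the documentation for your release:

    // Sketch: merge several HDFS files into one with the metadata-only
    // FileSystem.concat() call (no data is copied). Paths are hypothetical.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsConcat {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);   // must resolve to HDFS for concat to work

            // The target must already exist; the sources are appended to it
            // and removed from the namespace by the NameNode.
            Path target = new Path("/data/merged/part-00000");
            Path[] sources = {
                new Path("/data/merged/part-00001"),
                new Path("/data/merged/part-00002")
            };
            fs.concat(target, sources);   // only valid for raw, concatenable formats
            fs.close();
        }
    }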
If you have many folders with files that need to be merged, Spark/MR is definitely the right choice. One reason is parallelism. The other reason is that if your files use a format like gzip that doesn't support splitting, one huge gzip file may slow down your job. With a bit of math you can merge the small files into relatively large ones (file size equal to or slightly smaller than the block size). That's very easy with the coalesce(n) API.
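As a rough illustration of that sizing idea, here is a sketch using Spark's Java API; the paths, the 128 MB target size, and reading the data as plain text are all assumptions, not part of the original answer:

    // Sketch: compact a folder of small files into roughly block-sized outputs
    // with coalesce(n). Paths and the 128 MB target size are assumptions.
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class CompactSmallFiles {
        public static void main(String[] args) throws Exception {
            JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("compact"));

            Path input = new Path("/data/small-files");
            FileSystem fs = FileSystem.get(sc.hadoopConfiguration());
            long totalBytes = fs.getContentSummary(input).getLength();
            long targetFileBytes = 128L * 1024 * 1024;               // ~ one HDFS block
            int numFiles = Math.max(1, (int) (totalBytes / targetFileBytes));

            JavaRDD<String> lines = sc.textFile(input.toString());
            lines.coalesce(numFiles).saveAsTextFile("/data/compacted");
            sc.stop();
        }
    }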
I suggest you merge the small files. But as @cricket_007 mentioned in the comments, merging doesn't always gain a benefit.

Related

HDFS small file design

I want to be able to store millions of small binary files (images, exes, etc., ~1 MB each) on HDFS. My requirements are basically to be able to query random files, not to run MapReduce jobs.
The main problem for me is the NameNode memory issue, not the MapReduce mappers issue.
So my options are:
HAR files - aggregate the small files and only then save them, with their har:// paths, somewhere else
Sequence files - append them as they come in; this is more suitable for MapReduce jobs, so I pretty much eliminated it
HBase - saving the small files to HBase is another solution described in a few articles on Google
I guess I'm asking if there is anything I missed? Can I achieve what I need by appending binary files to big Avro/ORC/Parquet files and then querying them by name or by hash from a Java/client program?
Thanks,
If you append multiple files into large files, then you'll need to maintain an index of which large file each small file resides in. This is basically what HBase will do for you. It combines data into large files, stores them in HDFS and uses sorting on keys to support fast random access. It sounds to me like HBase would suit your needs, and if you hand rolled something yourself, you may end up redoing a lot of work that HBase already does.
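If you go the HBase route, a hedged sketch of the read/write path with the standard HBase client API could look like this; the table name "files", the column family "f", and the example row key are assumptions, and the table would need to be created beforehand:

    // Sketch: store and fetch a small binary file in HBase, keyed by file name.
    // The table "files" with column family "f" is assumed to exist already.
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.util.Bytes;

    public class SmallFileStore {
        public static void main(String[] args) throws Exception {
            try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Table table = conn.getTable(TableName.valueOf("files"))) {

                // Write: row key = file name, value = raw bytes (a ~1 MB file fits in a cell).
                byte[] content = java.nio.file.Files.readAllBytes(java.nio.file.Paths.get("image001.png"));
                Put put = new Put(Bytes.toBytes("image001.png"));
                put.addColumn(Bytes.toBytes("f"), Bytes.toBytes("data"), content);
                table.put(put);

                // Random read by name, no MapReduce involved.
                Result result = table.get(new Get(Bytes.toBytes("image001.png")));
                byte[] fetched = result.getValue(Bytes.toBytes("f"), Bytes.toBytes("data"));
            }
        }
    }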

Will sequence files help improve read performance in HDFS compared to the local file system?

I want to compare the read performance of HDFS and the local file system for 1000 small files (1-2 MB). Without using sequence files, HDFS takes almost double the time to read 1000 files compared to the local file system.
I heard of sequence files here - Small Files Problem in HDFS
I want to show better response time for HDFS for retrieving these records than Local FS. Will sequence files help or should I look for something else? (HBase maybe)
Edit: I'm using a Java program to read the files, as in HDFS Read through Java.
Yes, for simple file retrieval, grabbing a single sequence file will be much quicker than grabbing 1000 files. When reading from HDFS you incur much more overhead, including spinning up the JVM (assuming you're using hadoop fs -get ...), getting the location of each of the files from the NameNode, as well as network time (assuming you have more than one DataNode).
A sequence file can be thought of as a form of container. If you put all 1000 files into a sequence file, you only need to grab 32 blocks (if your block size is set to 64 MB) rather than 1000. This will reduce location lookups and the total number of network connections made. You do run into another issue at this point with reading the sequence file, though: it is a binary format.
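For illustration, here is a sketch of reading such a container back with SequenceFile.Reader, assuming the files were packed as (file name, file bytes) pairs with Text keys and BytesWritable values; the path and types are assumptions:

    // Sketch: iterate over files packed into one SequenceFile as
    // (file name, file bytes) pairs. Key/value types and the path are assumptions.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class ReadPackedFiles {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Path packed = new Path("/data/packed.seq");

            try (SequenceFile.Reader reader = new SequenceFile.Reader(
                    conf, SequenceFile.Reader.file(packed))) {
                Text fileName = new Text();
                BytesWritable contents = new BytesWritable();
                // One pass over the container touches far fewer blocks than
                // opening 1000 individual small files.
                while (reader.next(fileName, contents)) {
                    System.out.println(fileName + " -> " + contents.getLength() + " bytes");
                }
            }
        }
    }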
HBase is better suited for low-latency and random reads, so it may be a better option for you. Keep in mind that disk seeks still occur (unless you're working from memory), so reading a bunch of small files locally may be a better solution than using HDFS as a file store.

How to handle unsplittable 500 MB+ input files in hadoop?

I am writing a hadoop MapReduce job that is running over all source code files of a complete Debian mirror (≈ 40 GB). Since the Debian mirror data is on a separate machine and not in the hadoop cluster, the first step is to download the data.
My first implementation downloads a file and outputs key=$debian_package, value=$file_contents. The various values (typically 4) per key should then be reduced to a single entry. The next MapReduce job will then operate on debian packages as keys and all their files as values.
However, I noticed that hadoop works very poorly with output values that can sometimes be really big (700 MB is the biggest I’ve seen). In various places in the MapReduce framework, entire files are stored in memory, sometimes twice or even three times. I frequently encounter out of memory errors, even with a java heap size of 6 GB.
Now I wonder how I could split the data so that it better matches hadoop’s 64 MB block size.
I cannot simply split the big files into multiple pieces, because they are compressed (tar/bz2, tar/xz, tar/gz, perhaps others in the future). Until I shell out to dpkg-source on them to extract the package as a whole (necessary!), the files need to keep their full size.
One idea that came to my mind was to store the files on hdfs in the first MapReduce and only pass the paths to them to the second MapReduce. However, then I am circumventing hadoop’s support for data locality, or is there a way to fix that?
Are there any other techniques that I have been missing? What do you recommend?
You are correct. This is NOT a good case for Hadoop internals. Lots of copying... There are two obvious solutions, assuming you can't just untar it somewhere:
break up the tarballs using any of several libraries that will allow you to recursively read compressed and archive files (Apache VFS has limited capability for this, but the Apache Commons Compress library has more capability).
NFS-mount a chunk of the data nodes' local space to your master node, then fetch and untar into that directory structure... then use forqlift or a similar utility to load the small files into HDFS.
Another option is to write a utility to do this. I have done this for a client: Apache VFS and Commons Compress, TrueZIP, then the Hadoop libraries to write (since I wrote a general-purpose utility I used a LOT of other libraries, but this is the basic flow).
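A rough sketch of that kind of utility, here using Apache Commons Compress plus the Hadoop FileSystem API rather than the exact library mix described above; the archive name and output directory are made up:

    // Sketch: stream a tar.gz apart and write each entry to HDFS as its own file,
    // using Apache Commons Compress and the Hadoop FileSystem API.
    import java.io.FileInputStream;
    import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
    import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;
    import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream;
    import org.apache.commons.compress.utils.IOUtils;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class TarballToHdfs {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            try (TarArchiveInputStream tar = new TarArchiveInputStream(
                    new GzipCompressorInputStream(new FileInputStream("archive.tar.gz")))) {
                TarArchiveEntry entry;
                while ((entry = tar.getNextTarEntry()) != null) {
                    if (entry.isDirectory()) continue;
                    try (FSDataOutputStream out =
                             fs.create(new Path("/user/hadoop/untarred/" + entry.getName()))) {
                        IOUtils.copy(tar, out);   // copies only the current entry's bytes
                    }
                }
            }
        }
    }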

multiple file streaming hdfs

I have two matrices in separate files. I have to read the files into cache so that I can multiply them. I have been wondering if HDFS would help me. I suspect that HDFS does not, because it does not have enough cache memory to read the files and process them. In short, can I open two files at the same time?
To answer the shorter version of your question: yes, the HDFS API does allow concurrent reads of two files at a time. You may simply create two input streams over the two files and read them in parallel (as you would with regular files) and manage your logic around that.
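A minimal sketch of that, assuming line-oriented matrix files at hypothetical paths:

    // Sketch: open two HDFS files at once and read them side by side.
    // Paths are hypothetical; the matrix-parsing logic is up to the application.
    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class TwoFilesAtOnce {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            try (BufferedReader a = new BufferedReader(
                     new InputStreamReader(fs.open(new Path("/data/matrixA.txt"))));
                 BufferedReader b = new BufferedReader(
                     new InputStreamReader(fs.open(new Path("/data/matrixB.txt"))))) {
                String rowA, rowB;
                while ((rowA = a.readLine()) != null && (rowB = b.readLine()) != null) {
                    // both streams are open concurrently; parse the two rows
                    // and feed them to the multiplication logic here
                }
            }
        }
    }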
However, HDFS is a simple file system and has no cache of its own to offer (other than the OS buffer cache), so any caching the computation needs has to be handled by your own application.
As another general recommendation, since you look to be multiplying matrices, perhaps look at the Apache Mahout and Apache Hama projects that support HDFS.

how to load a tarball to pig

I have log files in a tarball (access.logs.tar.gz) loaded into my Hadoop cluster. I was wondering, is there a way to load it directly into Pig without untarring it?
@ChrisWhite's answer is technically correct and you should accept his answer instead of mine (IMO at least).
You need to get away from tar.gz files with Hadoop. Gzip files are not splittable, so if your gzip files are large you're going to see hotspotting in your mappers. For example, if you have a .tar.gz file that is 100 GB, you aren't going to be able to split the computation.
Let's say, on the other hand, that they are tiny. In that case, Pig will do a nice job of collecting them together and the splitting problem goes away. The downside is that now you are hitting the NameNode with tons of tiny files. Also, since the files are tiny, it should be relatively cheap computationally to reform them into a more reasonable format.
So what format should you reformulate the files into? Good question!
Concatenating them all into one large block-level compressed sequence file might be the most challenging, but the most rewarding in terms of performance.
The other option is to ignore compression entirely and just explode those files out, or at least concatenate them (you do see performance hits without compression).
Finally, you could blob files into ~100 MB chunks and then gzip them.
I think it would be completely reasonable to write some sort of tarball loader into piggybank, but I personally would just rather lay the data out differently.
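As a sketch of the first layout (one large block-compressed sequence file), assuming the small files are packed as (file name, file bytes) pairs and compressed with gzip; the paths and codec choice are assumptions:

    // Sketch: pack small files into one block-compressed SequenceFile,
    // keyed by file name. Paths and the gzip codec choice are assumptions.
    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.compress.GzipCodec;

    public class PackLogs {
        public static void main(String[] args) throws IOException {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(new Path("/data/logs-packed.seq")),
                    SequenceFile.Writer.keyClass(Text.class),
                    SequenceFile.Writer.valueClass(BytesWritable.class),
                    SequenceFile.Writer.compression(
                        SequenceFile.CompressionType.BLOCK, new GzipCodec()))) {
                for (FileStatus status : fs.listStatus(new Path("/data/logs"))) {
                    // small files only: each one is buffered in memory here
                    byte[] bytes = new byte[(int) status.getLen()];
                    try (FSDataInputStream in = fs.open(status.getPath())) {
                        in.readFully(bytes);
                    }
                    writer.append(new Text(status.getPath().getName()), new BytesWritable(bytes));
                }
            }
        }
    }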
PigStorage will recognize that the file is compressed (by the .gz extension; this is actually implemented in the TextInputFormat which PigTextInputFormat extends), but after that you'll be dealing with a tar file. If you're able to handle the header lines between the files in the tar, then you can just use PigStorage as is; otherwise you'll need to write your own extension of PigTextInputFormat to handle stripping out the tar header lines between each file.
