I have log files in a tarball (access.logs.tar.gz) loaded into my Hadoop cluster. Is there a way to load it into Pig directly, without untarring it?
@ChrisWhite's answer is technically correct and you should accept his answer instead of mine (IMO at least).
You need to get away from tar.gz files with Hadoop. Gzip files are not splittable, so if your gzipped files are large you're going to see hotspotting in your mappers. For example, if you have a .tar.gz file that is 100 GB, you aren't going to be able to split the computation: the whole file goes to one mapper.
Let's say, on the other hand, that they are tiny. In that case, Pig will do a nice job of collecting them together and the splitting problem goes away. The downside is that you're now burdening the NameNode with tons of tiny files. Also, since the files are tiny, it should be relatively cheap computationally to reform them into a more reasonable format.
So what format should you reformulate the files into? Good question!
Just concatenating them all into one large block-level compressed sequence file might be the most challenging but the most rewarding in terms of performance.
The other is to just ignore compression entirely and explode those files out, or at least concatenate them (you do see performance hits without compression).
Finally, you could blob files into ~100MB chunks and then gzip them.
I think it would be completely reasonable to write some sort of tarball loader into piggybank, but I personally would just rather lay the data out differently.
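For the first option (one large block-level compressed sequence file), here is a minimal sketch of the repacking idea, assuming the small files already sit in an HDFS directory; the paths, the Text/BytesWritable key/value choice and the codec are illustrative, not prescriptive:

    import java.io.InputStream;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.compress.DefaultCodec;

    // Pack a directory of small files into one block-compressed SequenceFile,
    // keyed by the original file name (hypothetical layout, for illustration).
    public class PackLogsIntoSequenceFile {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path input = new Path(args[0]);   // directory of small log files
            Path output = new Path(args[1]);  // target SequenceFile

            try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(output),
                    SequenceFile.Writer.keyClass(Text.class),
                    SequenceFile.Writer.valueClass(BytesWritable.class),
                    SequenceFile.Writer.compression(
                            SequenceFile.CompressionType.BLOCK, new DefaultCodec()))) {
                for (FileStatus status : fs.listStatus(input)) {
                    if (status.isDirectory()) continue;
                    byte[] buf = new byte[(int) status.getLen()];   // fine for small files
                    try (InputStream in = fs.open(status.getPath())) {
                        IOUtils.readFully(in, buf, 0, buf.length);
                    }
                    writer.append(new Text(status.getPath().getName()),
                            new BytesWritable(buf));
                }
            }
        }
    }

Block-compressed sequence files stay splittable, which is the whole point of that layout.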
PigStorage will recognize that the file is compressed (by the .gz extension; this is actually implemented in TextInputFormat, which PigTextInputFormat extends), but after that you'll be dealing with a tar stream. If you can handle the header lines between the files in the tar, then you can use PigStorage as is; otherwise you'll need to write your own extension of PigTextInputFormat to strip out the tar header lines between each file.
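If you do go the custom input format route, the shape of it looks roughly like the sketch below. To keep it self-contained it extends plain TextInputFormat (from the mapreduce API) rather than PigTextInputFormat, and it uses a crude "does the line contain the ustar magic" heuristic to spot tar headers; real tar headers are 512-byte binary blocks, so treat this as a starting point only:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    // Text input format that drops "lines" which look like tar entry headers,
    // so the remaining records are just the concatenated file contents.
    public class TarAwareTextInputFormat extends TextInputFormat {
        @Override
        public RecordReader<LongWritable, Text> createRecordReader(
                InputSplit split, TaskAttemptContext context) {
            return new RecordReader<LongWritable, Text>() {
                private final LineRecordReader lines = new LineRecordReader();

                @Override
                public void initialize(InputSplit s, TaskAttemptContext c) throws IOException {
                    lines.initialize(s, c);   // gzip decompression is handled by the codec here
                }

                @Override
                public boolean nextKeyValue() throws IOException {
                    while (lines.nextKeyValue()) {
                        // crude heuristic: tar headers carry the "ustar" magic string
                        if (!lines.getCurrentValue().toString().contains("ustar")) {
                            return true;
                        }
                    }
                    return false;
                }

                @Override public LongWritable getCurrentKey() { return lines.getCurrentKey(); }
                @Override public Text getCurrentValue() { return lines.getCurrentValue(); }
                @Override public float getProgress() throws IOException { return lines.getProgress(); }
                @Override public void close() throws IOException { lines.close(); }
            };
        }
    }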
Related
I want to be able to store millions of small binary files (images, exes, etc., ~1 MB each) on HDFS. My requirement is basically to be able to fetch random files by name, not to run MapReduce jobs over them.
The main problem for me is NameNode memory, not the number of MapReduce mappers.
So my options are:
HAR files - aggregate the small files and only then save them, referencing them elsewhere by their har:// paths
Sequence files - append them as they come in; this is more suitable for MapReduce jobs, so I've pretty much eliminated it
HBase - saving the small files to HBase is another solution, described in a few articles found via Google
I guess I'm asking if there is anything I've missed. Can I achieve what I need by appending binary files to big Avro/ORC/Parquet files and then querying them by name or by hash from a Java client program?
Thanks,
If you append multiple files into large files, then you'll need to maintain an index of which large file each small file resides in. This is basically what HBase does for you: it combines data into large files, stores them in HDFS, and uses sorting on keys to support fast random access. It sounds to me like HBase would suit your needs, and if you hand-rolled something yourself, you might end up redoing a lot of work that HBase already does.
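To make that concrete, here is a hypothetical sketch of what the HBase route looks like with the standard client API: one row per file, keyed by file name (or a hash of it), with the bytes in a single cell. The table and column names are made up:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    // Store each small binary file as one cell and fetch it by key,
    // with no MapReduce job involved.
    public class SmallFileStore {
        private static final byte[] CF = Bytes.toBytes("f");
        private static final byte[] COL = Bytes.toBytes("data");

        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("small_files"))) {
                // write one file
                byte[] content = java.nio.file.Files.readAllBytes(
                        java.nio.file.Paths.get("image-0001.png"));
                Put put = new Put(Bytes.toBytes("image-0001.png"));
                put.addColumn(CF, COL, content);
                table.put(put);

                // random read by name
                Result result = table.get(new Get(Bytes.toBytes("image-0001.png")));
                byte[] fetched = result.getValue(CF, COL);
                System.out.println("fetched " + fetched.length + " bytes");
            }
        }
    }

Values around 1 MB are fine for HBase; for much larger blobs you'd want to look at its MOB (medium object) support.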
A rather stupid question, but how do I combine multiple files in a folder into one file without copying them to the local machine? I do not care about the order.
I thought hadoop fs -getmerge would do the job, but I have since found out that it copies the data to your local machine.
I would do it in my original Spark application, but adding coalesce increases my runtime by a large amount.
I am on Hadoop 2.4 if that matters.
how do I combine multiple files in a folder into one file without copying them to local machine?
You have to either copy the files to the local node or to one of the computation nodes.
HDFS is a file system; it doesn't care about your file format. If your files are raw text/binary, you can try the concatenation API, which only manipulates metadata in the NameNode without copying data. But if your files are Parquet/gzip/LZO or similar, they can't simply be concatenated: you have to download them from HDFS, merge them into one, and upload the merged file. Spark's coalesce(1) does the same thing, except it's done on an executor node instead of your local node.
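For reference, the concatenation API mentioned above is FileSystem.concat; a minimal (hypothetical) call looks like this, keeping in mind that HDFS enforces restrictions on it (the files must live in the same directory and their blocks must line up with the block size; details vary by version):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Merge the part files into the target by rewriting NameNode metadata only;
    // no data blocks are copied. Paths here are made up for illustration.
    public class ConcatExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path target = new Path("/data/part-00000");
            Path[] sources = {
                new Path("/data/part-00001"),
                new Path("/data/part-00002")
            };
            fs.concat(target, sources);   // only sensible for raw text/binary, per the answer above
        }
    }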
If you have many folders with files that need to be merged, Spark/MR is definitely the right choice. One reason is parallelism. The other is that if your format, like gzip, doesn't support splitting, one huge gzip file may slow down your job. With a little arithmetic you can merge the small files into relatively large ones (file size equal to, or slightly smaller than, the block size). That's easy with the coalesce(n) API.
I suggest you merge the small files, but as @cricket_007 mentioned in the comments, merging doesn't always bring a benefit.
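For completeness, the coalesce(n) approach in the Spark Java API looks roughly like this (paths and n are made up; as the question notes, coalescing down to very few partitions trades away parallelism, so pick n from your total input size divided by the block size):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    // Read every file in the folder and rewrite the data as n larger files.
    public class CoalesceFolder {
        public static void main(String[] args) {
            JavaSparkContext sc =
                    new JavaSparkContext(new SparkConf().setAppName("merge-folder"));
            int n = 4;   // aim for output files close to the HDFS block size
            sc.textFile("hdfs:///logs/input")
              .coalesce(n)
              .saveAsTextFile("hdfs:///logs/merged");
            sc.stop();
        }
    }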
I am writing a hadoop MapReduce job that is running over all source code files of a complete Debian mirror (≈ 40 GB). Since the Debian mirror data is on a separate machine and not in the hadoop cluster, the first step is to download the data.
My first implementation downloads a file and outputs key=$debian_package, value=$file_contents. The various values (typically 4) per key should then be reduced to a single entry. The next MapReduce job will then operate on debian packages as keys and all their files as values.
However, I noticed that hadoop works very poorly with output values that can sometimes be really big (700 MB is the biggest I’ve seen). In various places in the MapReduce framework, entire files are stored in memory, sometimes twice or even three times. I frequently encounter out of memory errors, even with a java heap size of 6 GB.
Now I wonder how I could split the data so that it better matches hadoop’s 64 MB block size.
I cannot simply split the big files into multiple pieces, because they are compressed (tar/bz2, tar/xz, tar/gz, perhaps others in the future). Until I shell out to dpkg-source on them to extract the package as a whole (necessary!), the files need to keep their full size.
One idea that came to my mind was to store the files on hdfs in the first MapReduce and only pass the paths to them to the second MapReduce. However, then I am circumventing hadoop’s support for data locality, or is there a way to fix that?
Are there any other techniques that I have been missing? What do you recommend?
You are correct. This is NOT a good case for Hadoop internals. Lots of copying... There are two obvious solutions, assuming you can't just untar it somewhere:
Break up the tarballs using any of several libraries that allow you to recursively read compressed and archive files (Apache VFS has limited capability for this, but the Apache Commons Compress library has more).
NFS-mount some data nodes' local space onto your master node, then fetch and untar into that directory structure... then use forqlift or a similar utility to load the small files into HDFS.
Another option is to write a utility to do this. I have done this for a client: Apache VFS and Commons Compress (or TrueZIP) to read the archives, then the Hadoop libraries to write the output. (Since I wrote a general-purpose utility I used a LOT of other libraries, but that is the basic flow.)
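The core of such a utility is short. Here is a rough sketch using Apache Commons Compress instead of VFS/TrueZIP (paths are hypothetical): it streams the tar.gz and writes each entry out as its own HDFS file, after which you can repack the small files however you like:

    import java.io.BufferedInputStream;
    import java.io.InputStream;
    import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
    import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;
    import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Stream a .tar.gz that already sits on HDFS and unpack each entry
    // into its own file under the output directory.
    public class UntarToHdfs {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path tarball = new Path(args[0]);   // e.g. /staging/source.tar.gz
            Path outDir = new Path(args[1]);    // e.g. /data/unpacked

            try (InputStream raw = new BufferedInputStream(fs.open(tarball));
                 TarArchiveInputStream tar =
                         new TarArchiveInputStream(new GzipCompressorInputStream(raw))) {
                TarArchiveEntry entry;
                byte[] buffer = new byte[65536];
                while ((entry = tar.getNextTarEntry()) != null) {
                    if (!entry.isFile()) continue;
                    try (FSDataOutputStream out = fs.create(new Path(outDir, entry.getName()))) {
                        int n;
                        // the tar stream returns EOF at the end of the current entry
                        while ((n = tar.read(buffer)) != -1) {
                            out.write(buffer, 0, n);
                        }
                    }
                }
            }
        }
    }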
I have around 14,000 small .gz files (from ~90 KB to 4 MB) which are loaded into HDFS, all in the same directory.
So the size of each of them is far below the standard 64 MB or 128 MB block size of HDFS, which can lead to serious trouble (the "small files problem", see this blog post by Cloudera) when running MR jobs that process these files.
The aforementioned blog post contains a number of solutions to this problem, which mostly involve writing a MapReduce Job or using Hadoop Archives (HAR).
However, I would like to tackle the problem at the source and merge the small files into 64mb or 128mb .gz files which will then be fed directly into HDFS.
What's the simplest way of doing this?
cat small-*.gz > large.gz
should be enough, assuming you don't need to extract the individual files from it later; concatenated gzip members form a single valid gzip stream, so no data is lost.
If you want separate files, just tar it:
tar cf large.tar small-*.gz
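If the small files are already sitting in HDFS (as in the question) and you want to merge them there rather than locally, the same trick works in Java, because concatenated gzip members form a valid gzip stream. A hypothetical sketch, with made-up paths and a ~64 MB target size (note that very old Hadoop versions had trouble reading multi-member gzip streams):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    // Concatenate many small .gz files on HDFS into ~64 MB .gz files
    // by copying their raw bytes, no recompression needed.
    public class MergeSmallGzipFiles {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path inDir = new Path(args[0]);
            Path outDir = new Path(args[1]);
            long target = 64L * 1024 * 1024;   // ~64 MB per merged file

            long written = 0;
            int part = 0;
            FSDataOutputStream out = fs.create(new Path(outDir, "merged-" + part + ".gz"));
            for (FileStatus status : fs.listStatus(inDir, p -> p.getName().endsWith(".gz"))) {
                if (written >= target) {        // start a new merged file
                    out.close();
                    part++;
                    written = 0;
                    out = fs.create(new Path(outDir, "merged-" + part + ".gz"));
                }
                try (FSDataInputStream in = fs.open(status.getPath())) {
                    IOUtils.copyBytes(in, out, 65536, false);   // raw byte copy, keep out open
                }
                written += status.getLen();
            }
            out.close();
        }
    }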
After experimenting a bit further, the following two steps do what I want:
zcat small-*.gz | split -d -l2000000 -a 3 - large_
This works in my case, because there is very little variance in the length of a line.
2000000 lines correspond to almost exactly 300Mb files.
Unfortunately, for some reason, gzip cannot be piped like this, so I have to do another step:
gzip *
This will then also compress the generated large files.
Gzip compresses each of these files by a factor of ~5, leading to 60mb files and thus satisfying my initial constraint of receiving .gz files < 64mb.
I need to parse a few XMLs to TSV. The size of the XML files is on the order of 50 GB. I am unsure about the implementation I should choose to parse them; I have two options:
using SAXParser
use Hadoop
I have a fair idea about a SAXParser implementation, but I think that, having access to a Hadoop cluster, I should use Hadoop, as this is what Hadoop is for, i.e. big data.
It would be great if someone could provide a hint/doc on how to do this in Hadoop, or an efficient SAXParser implementation for such a big file; or rather, which should I go for: Hadoop or SAXParser?
I process large XML files in Hadoop quite regularly. I found it to be the best way (not the only way... the other is to write SAX code), since you can still operate on the records in a DOM-like fashion.
With these large files, one thing to keep in mind is that you'll most definitely want to enable compression on the mapper output: Hadoop, how to compress mapper output but not the reducer output... this will speed things up quite a bit.
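Turning that on is just two configuration properties in the job driver; a minimal sketch (the Snappy codec is my choice here, not something the linked answer mandates, and it needs the native library to be installed):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.SnappyCodec;
    import org.apache.hadoop.mapreduce.Job;

    // Compress the intermediate map output only; reducer output stays as-is
    // unless mapreduce.output.fileoutputformat.compress is also set.
    public class CompressedMapOutputJob {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.setBoolean("mapreduce.map.output.compress", true);
            conf.setClass("mapreduce.map.output.compress.codec",
                    SnappyCodec.class, CompressionCodec.class);
            Job job = Job.getInstance(conf, "xml-to-tsv");
            // ... set mapper/reducer classes, input/output formats and paths as usual ...
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }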
I've written a quick outline of how I've handled all this, maybe it'll help: http://davidvhill.com/article/processing-xml-with-hadoop-streaming. I use Python and Etrees which makes things really simple....
I don't know about SAXParser, but Hadoop will definitely do the job if you have a cluster with enough data nodes. 50 GB is nothing; I was running operations on more than 300 GB of data on my cluster. Write a MapReduce job in Java; the documentation for Hadoop can be found at http://hadoop.apache.org/
It is relatively trivial to process XML on Hadoop by having one mapper per XML file. This approach is fine for a large number of relatively small XMLs.
The problem is that in your case the files are big and their number is small, so without splitting, Hadoop's benefit will be limited. Taking Hadoop's overhead into account, the benefit may even be negative...
In Hadoop we need to be able to split input files into logical parts (called splits) to efficiently process large files.
In general, XML does not look like a "splittable" format, since there is no well-defined division into blocks that can be processed independently. At the same time, if the XML contains "records" of some kind, splitting can be implemented.
A good discussion about splitting XML in Hadoop is here:
http://oobaloo.co.uk/articles/2010/1/20/processing-xml-in-hadoop.html
where Mahout's XML input format is suggested.
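Driving that input format from a job looks roughly like the sketch below. The record tag is hypothetical, and the class lives in different packages depending on the Mahout version (older releases ship it as org.apache.mahout.classifier.bayes.XmlInputFormat), so check your jar:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.mahout.classifier.bayes.XmlInputFormat;

    // Split the XML on a start/end tag so each map() call receives one
    // complete <record>...</record> blob as its value.
    public class XmlSplittingJob {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("xmlinput.start", "<record>");   // hypothetical record element
            conf.set("xmlinput.end", "</record>");
            Job job = Job.getInstance(conf, "xml-splitting");
            job.setJarByClass(XmlSplittingJob.class);
            job.setInputFormatClass(XmlInputFormat.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            // ... set the mapper that parses each blob, the reducer, and the output path ...
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }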
Regarding your case - I think as long as number of your files is not much bigger then number of cores you have on single system - hadoop will not be efficient solution.
In the same time - if you want to accumulate them over time - you can profit from hadoop as a scalable storage also.
I think that SAX has traditionally been mistakenly associated with processing big XML files... in reality, VTD-XML is often the best option, far better than SAX in terms of performance, flexibility, code readability and maintainability... on the issue of memory, VTD-XML's in-memory model is only 1.3x~1.5X the size of the corresponding XML document.
VTD-XML has another significant benefit over SAX: its unparalleled XPath support. Because of it, VTD-XML users routinely report performance gains of 10 to 60x over SAX when parsing XML files of hundreds of MB.
http://www.infoq.com/articles/HIgh-Performance-Parsers-in-Java#anch104307
Read this paper that comprehensively compares the existing XML parsing frameworks in Java.
http://sdiwc.us/digitlib/journal_paper.php?paper=00000582.pdf
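As a taste of the VTD-XML XPath style mentioned above, a small hypothetical example (file name and XPath are made up):

    import com.ximpleware.AutoPilot;
    import com.ximpleware.VTDGen;
    import com.ximpleware.VTDNav;

    // Parse a document once, then iterate an XPath over its records.
    public class VtdXpathExample {
        public static void main(String[] args) throws Exception {
            VTDGen gen = new VTDGen();
            if (!gen.parseFile("packages.xml", true)) {       // true = namespace aware
                throw new RuntimeException("parse failed");
            }
            VTDNav nav = gen.getNav();
            AutoPilot ap = new AutoPilot(nav);
            ap.selectXPath("/packages/package/name");
            int i;
            while ((i = ap.evalXPath()) != -1) {              // walk every match
                int text = nav.getText();                     // index of the element's text node
                if (text != -1) {
                    System.out.println(nav.toNormalizedString(text));
                }
            }
        }
    }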