What's the best way to process small files? I have been reading answers for a while and I still haven't found a really good way to do it. If I have 20 TB of small files in HDFS, what should I do?
If I were going to process my data many times, I would turn the files into SequenceFiles, but what happens if I will only process them once?
I have read about a few options; if there are more, or if someone could correct me on any of these, that would be great.
SequenceFiles.
CONS: The problem is that I have to run a MapReduce job to create them, so if I only want to process the data once, I don't think it's worth it. If the conversion job itself runs as many map tasks as I have files, why should I waste my time turning the files into a SequenceFile?
PROS: It saves memory in the NameNode, and there's a SequenceFileInputFormat already implemented.
Files
CONS: As many map tasks as I have files, and the small files themselves use too much memory in the NameNode.
CombineFileInputFormat
CONS: It still uses too much memory in the NameNode.
PROS: It can combine files into block-sized splits, so I don't have to execute as many map tasks as I have files.
HARs (Hadoop Archives)
CONS: To generate one, I have to execute a MapReduce job, which is the same problem as with SequenceFiles. Also, at some point the files are duplicated, so I need extra space while generating the archive; after that, I can delete the original files.
PROS: They pack files together, although I'm not sure whether each file inside a HAR still gets its own map task.
What I'm looking for is a way to pack files (and compress them, if possible) so that I don't have to execute one map task per file, without executing a MapReduce job just to generate those "new" files, and that at the same time saves memory in the NameNode.
SequenceFiles look pretty good, but they seem too expensive to generate.
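For reference, packing files into a SequenceFile can also be done with a plain client-side writer rather than a MapReduce job; a rough sketch (the paths and the Text/BytesWritable key/value layout are placeholder assumptions):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class PackSmallFiles {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path srcDir = new Path(args[0]);   // directory of small files on HDFS
            Path target = new Path(args[1]);   // the packed SequenceFile

            try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(target),
                    SequenceFile.Writer.keyClass(Text.class),
                    SequenceFile.Writer.valueClass(BytesWritable.class),
                    SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK))) {
                for (FileStatus status : fs.listStatus(srcDir)) {
                    if (status.isDirectory()) continue;
                    byte[] content = new byte[(int) status.getLen()];  // assumes each file fits in memory
                    try (FSDataInputStream in = fs.open(status.getPath())) {
                        IOUtils.readFully(in, content, 0, content.length);
                    }
                    // key = original file name, value = raw bytes
                    writer.append(new Text(status.getPath().getName()), new BytesWritable(content));
                }
            }
        }
    }

This is single-threaded, so for 20 TB you would still want to parallelise it somehow, but it avoids launching one map task per file just to do the packing.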
Related
I want to be able to store millions of small binary files (images, exe, etc., each around 1 MB) on HDFS. My requirement is basically to be able to fetch random files by name, without running MapReduce jobs.
The main problem for me is NameNode memory, not the number of MapReduce mappers.
So my options are:
HAR files - aggregate the small files into archives, and then save their har:// paths in another place
Sequence files - append the files as they come in; this is more suited to MapReduce jobs, so I have pretty much eliminated it
HBase - saving the small files to HBase is another solution described in a few articles I found on Google
I guess I'm asking if there is anything I missed. Can I achieve what I need by appending binary files to big Avro/ORC/Parquet files and then querying them by name or by hash from a Java client program?
Thanks,
If you append multiple files into large files, then you'll need to maintain an index of which large file each small file resides in. This is basically what HBase will do for you: it combines data into large files, stores them in HDFS and uses sorting on keys to support fast random access. It sounds to me like HBase would suit your needs, and if you hand-rolled something yourself, you may end up redoing a lot of work that HBase already does.
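To make that concrete, here is a rough sketch of storing and fetching a ~1 MB file by key with the HBase client API (the table name, column family and row-key scheme are assumptions, not a recommendation):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class SmallFileStore {
        private static final byte[] FAMILY = Bytes.toBytes("f");
        private static final byte[] QUALIFIER = Bytes.toBytes("data");

        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("small_files"))) {

                // write: row key = file name (or a hash of it), value = the raw bytes
                byte[] contents = "...file bytes...".getBytes();
                Put put = new Put(Bytes.toBytes("images/cat-42.png"));
                put.addColumn(FAMILY, QUALIFIER, contents);
                table.put(put);

                // random read by key, no MapReduce involved
                Result result = table.get(new Get(Bytes.toBytes("images/cat-42.png")));
                byte[] fetched = result.getValue(FAMILY, QUALIFIER);
                System.out.println("fetched " + fetched.length + " bytes");
            }
        }
    }

For values around 1 MB it may also be worth looking at HBase's MOB (medium object) support, which is aimed at roughly this size range.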
The task is to process a HUGE number (around 10,000,000) of small files (each around 1 MB) independently (i.e. the result of processing file F1 is independent of the result of processing F2).
Someone suggested Map-Reduce (on Amazon-EMR Hadoop) for my task. However, I have serious doubts about MR.
The reason is that, in my case, processing the files is independent. As far as I understand MR, it works best when the output depends on many individual input files (for example, counting the frequency of each word across many documents, since a word might appear in any document in the input). But in my case, I just need a lot of independent CPUs/cores.
I was wondering if you have any advice on this.
Side note: there is another issue, namely that MR works best for huge files rather than a huge number of small ones. There seem to be solutions for that, though, so I am ignoring it for now.
It is possible to use MapReduce for your needs. MapReduce has two phases, Map and Reduce, but the Reduce phase is not mandatory. For your situation you could write a map-only MapReduce job and put all the calculations on a single file into a customised map function.
However, I haven't processed such a huge number of files in a single job, so I have no idea how efficient it will be. Try it yourself and share with us :)
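A minimal sketch of such a map-only job (the class names and the process() placeholder are assumptions, not a tested setup):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MapOnlyJob {
        public static class FileMapper extends Mapper<LongWritable, Text, Text, Text> {
            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws java.io.IOException, InterruptedException {
                // all per-record work goes here; there is no shuffle and no reduce
                context.write(value, new Text(process(value.toString())));
            }
            private String process(String input) {
                return input;  // placeholder for the real per-file calculation
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "map-only");
            job.setJarByClass(MapOnlyJob.class);
            job.setMapperClass(FileMapper.class);
            job.setNumReduceTasks(0);            // this is what makes the job map-only
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

With zero reducers, each mapper's output is written straight to HDFS, which matches the independent-files assumption naturally.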
This is quite easy to do. In such cases the input for the MR job is typically the list of files (and not the files themselves), so the size of the data submitted to Hadoop is the size of 10M file names - on the order of a couple of gigabytes at most.
One uses MR to split up the list of files into smaller fragments (how many can be controlled by various options). Each mapper then gets a list of files, processes one file at a time and generates the output.
(FWIW - I would suggest Qubole (where I am a founder) instead of EMR because it would save you a ton of money with auto-scaling and spot integration.)
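One way to feed mappers a list of file names is NLineInputFormat, where the job input is a text file of HDFS paths and each mapper receives a fixed number of lines; a rough sketch (the paths and the batch size of 1000 are just assumptions):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class ProcessFileList {
        public static class PathMapper extends Mapper<LongWritable, Text, Text, Text> {
            @Override
            protected void map(LongWritable offset, Text pathLine, Context context)
                    throws java.io.IOException, InterruptedException {
                Path path = new Path(pathLine.toString());       // one HDFS path per input line
                FileSystem fs = path.getFileSystem(context.getConfiguration());
                try (FSDataInputStream in = fs.open(path)) {
                    byte[] buf = new byte[8192];
                    long total = 0;
                    int n;
                    while ((n = in.read(buf)) > 0) {
                        total += n;                              // placeholder for the real per-file processing
                    }
                    context.write(pathLine, new Text("bytesRead=" + total));
                }
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "process-file-list");
            job.setJarByClass(ProcessFileList.class);
            job.setMapperClass(PathMapper.class);
            job.setNumReduceTasks(0);
            job.setInputFormatClass(NLineInputFormat.class);
            NLineInputFormat.setNumLinesPerSplit(job, 1000);        // ~1000 file names per mapper
            NLineInputFormat.addInputPath(job, new Path(args[0]));  // the text file listing the paths
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Note that with this pattern a mapper usually isn't local to the files it reads, so you trade data locality for having far fewer tasks.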
I am writing a hadoop MapReduce job that is running over all source code files of a complete Debian mirror (≈ 40 GB). Since the Debian mirror data is on a separate machine and not in the hadoop cluster, the first step is to download the data.
My first implementation downloads a file and outputs key=$debian_package, value=$file_contents. The various values (typically 4) per key should then be reduced to a single entry. The next MapReduce job will then operate on debian packages as keys and all their files as values.
However, I noticed that hadoop works very poorly with output values that can sometimes be really big (700 MB is the biggest I’ve seen). In various places in the MapReduce framework, entire files are stored in memory, sometimes twice or even three times. I frequently encounter out of memory errors, even with a java heap size of 6 GB.
Now I wonder how I could split the data so that it better matches hadoop’s 64 MB block size.
I cannot simply split the big files into multiple pieces, because they are compressed (tar/bz2, tar/xz, tar/gz, perhaps others in the future). Until I shell out to dpkg-source on them to extract the package as a whole (necessary!), the files need to keep their full size.
One idea that came to my mind was to store the files on hdfs in the first MapReduce and only pass the paths to them to the second MapReduce. However, then I am circumventing hadoop’s support for data locality, or is there a way to fix that?
Are there any other techniques that I have been missing? What do you recommend?
You are correct. This is NOT a good case for Hadoop internals. Lots of copying... There are two obvious solutions, assuming you can't just untar it somewhere:
Break up the tarballs using any of several libraries that let you recursively read compressed and archive files (Apache VFS has limited capability for this, but the Apache compression library has more).
NFS-mount some of the data nodes' local space onto your master node, fetch and untar into that directory structure... then use forqlift or a similar utility to load the small files into HDFS.
Another option is to write a utility to do this. I have done this for a client: Apache VFS and the compression library, plus TrueZIP, then the Hadoop libraries to write (since I built a general-purpose utility I used a LOT of other libraries, but that is the basic flow).
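As a very rough illustration of what such a utility's core loop might look like, here is a sketch that reads a .tar.gz with Apache Commons Compress and repacks its entries into a SequenceFile on HDFS (the paths, the gzip-only handling and the Text/BytesWritable layout are assumptions of mine, not the answerer's):

    import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
    import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;
    import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream;
    import org.apache.commons.compress.utils.IOUtils;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import java.io.InputStream;

    public class TarToSequenceFile {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path tarball = new Path(args[0]);   // a .tar.gz already on HDFS (xz/bz2 would need other streams)
            Path out = new Path(args[1]);       // the SequenceFile to produce

            try (InputStream raw = fs.open(tarball);
                 TarArchiveInputStream tar = new TarArchiveInputStream(new GzipCompressorInputStream(raw));
                 SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                         SequenceFile.Writer.file(out),
                         SequenceFile.Writer.keyClass(Text.class),
                         SequenceFile.Writer.valueClass(BytesWritable.class))) {
                TarArchiveEntry entry;
                while ((entry = tar.getNextTarEntry()) != null) {
                    if (!entry.isFile()) continue;
                    byte[] data = new byte[(int) entry.getSize()];  // assumes each entry fits in memory
                    IOUtils.readFully(tar, data);
                    // key = path inside the archive, value = the entry's raw bytes
                    writer.append(new Text(entry.getName()), new BytesWritable(data));
                }
            }
        }
    }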
Are there any guidelines for whether or not to distribute a file using the distributed cache?
I have a file of size 86746785 (from hadoop dfs -dus - I don't know if this is in bytes or what). Is it a good idea to distribute this file?
The only viable answer is "it depends".
What you have to consider about using the distributed cache is that the file gets copied to every node that is involved in your job, which obviously takes bandwidth. Also, if you want the file in the distributed cache, you'll usually end up keeping it in memory, so you have to take that into consideration as well.
As for your case -- yes, those are bytes. The size is roughly 86 MB, which is perfectly fine for distributed cache. Anything within a couple hundred MBs should probably still be.
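For reference, a rough sketch of shipping such a file with the newer mapreduce API (the HDFS path and the "lookup" alias are made up):

    import java.io.IOException;
    import java.net.URI;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;

    public class CacheFileExample {
        // driver side: ship the ~86 MB file with the job; the "#lookup" fragment
        // makes it appear as ./lookup in every task's working directory
        public static void configure(Job job) throws Exception {
            job.addCacheFile(new URI("hdfs:///shared/lookup.dat#lookup"));
        }

        public static class MyMapper extends Mapper<LongWritable, Text, Text, Text> {
            private byte[] lookup;

            @Override
            protected void setup(Context context) throws IOException {
                // read the localised copy once per task, not once per record
                lookup = Files.readAllBytes(Paths.get("lookup"));
            }
        }
    }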
In addition to TC1's answer, also consider:
When/where are you going to use the file(s) and how big is your cluster?
In a scenario with many mappers and a single reducer (or a small number of reducers), where you only need the file in the reducer, I would advise against it: you might as well just pull the file down yourself in the reducer's setup method, rather than distribute it unnecessarily to every task node your mappers run on - especially if the file is large (this also depends on how many nodes you have in your cluster).
How many files are you putting into the cache?
If for some reason you have hundreds of files to distribute, you're better off tar'ing them up and putting the tar file in the distributed cache's archives set (the cache will take care of untarring it for you), as sketched below. The thing you're trying to avoid here is the scenario where, instead of using the distributed cache, you load the files directly from HDFS and end up with thousands of mappers and/or reducers trying to open the same files, which can cause too-many-open-files problems on the NameNode and DataNodes.
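A rough sketch of that archive approach, with made-up paths (Hadoop unpacks the tar on each node and exposes it under the "#" alias):

    import java.net.URI;
    import org.apache.hadoop.mapreduce.Job;

    public class CacheArchiveExample {
        // created beforehand, e.g.:
        //   tar cf side-files.tar -C small-file-dir . && hdfs dfs -put side-files.tar /shared/
        public static void configure(Job job) throws Exception {
            job.addCacheArchive(new URI("hdfs:///shared/side-files.tar#side-files"));
            // each task can then read the unpacked contents from the local
            // directory "side-files/" instead of opening the files on HDFS directly
        }
    }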
The size of the distributed cache is 10 GB by default, but it's better to keep only a few MB of data in it; otherwise it will affect the performance of your application.
I have a system I wish to distribute where I have a number of very large non-splittable binary files I wish to process in a distributed fashion. These are of the order of a couple of hundreds of Gb. For a variety of fixed, implementation specific reasons, these files cannot be processed in parallel but have to be processed sequentially by the same process through to the end.
The application is developed in C++ so I would be considering Hadoop Pipes to stream the data in and out. Each instance will need to process on the order of 100 GB to 200 GB of its own data sequentially (currently stored in one file), and the application is currently (probably) IO limited, so it's important that each job is run entirely locally.
I'm very keen on HDFS for hosting this data - the ability to automatically maintain redundant copies and to rebalance as new nodes are added will be very useful. I'm also keen on map reduce for its simplicity of computation and its requirement to host the computation as close as possible to the data. However, I'm wondering how suitable Hadoop is for this particular application.
I'm aware that for representing my data it's possible to generate non-splittable files, or alternatively to generate huge sequence files (in my case, these would be of the order of 10Tb for a single file - should I pack all my data into one). And that it's therefore possible to process my data using Hadoop. However it seems like my model doesn't fit Hadoop that well: does the community agree? Or have suggestions for laying this data out optimally? Or even for other cluster computing systems that might fit the model better?
This question is perhaps a duplicate of existing questions on hadoop, but with the exception that my system requires an order of magnitude or two more data per individual file (previously I've seen the question asked about individual files of a few Gb in size). So forgive me if this has been answered before - even for this size of data.
Thanks,
Alex
It seems like you are working with a relatively small number of large files. Since your files are huge and not splittable, Hadoop will have trouble scheduling and distributing jobs effectively across the cluster. I think the more files you can process in one batch (hundreds, say), the more worthwhile it will be to use Hadoop.
Since you're only working with a few files, have you tried a simpler distribution mechanism, like launching processes on multiple machines using ssh or GNU Parallel? I've had a lot of success using this approach for simple tasks. Using an NFS-mounted drive shared across all your nodes also limits the amount of copying you would have to do.
You can write a custom InputSplit for your file, but as bajafresh4life said it won't really be ideal, because unless your HDFS block size is the same as your file size, your files are going to be spread all around the cluster and there will be network overhead. And if you do make your HDFS block size match your file size, then you're not getting the benefit of all your cluster's disks. Bottom line is that Hadoop may not be the best tool for you.
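For completeness, making an input format non-splittable is just an override of isSplitable; a sketch (extending TextInputFormat only to keep it short - a real binary format would also need its own RecordReader):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    // forces one split (and therefore one mapper) per file, so a file is
    // never broken up across tasks, whatever the HDFS block size is
    public class NonSplittableInputFormat extends TextInputFormat {
        @Override
        protected boolean isSplitable(JobContext context, Path file) {
            return false;
        }
    }

As the answer says, this only stops Hadoop from splitting the file; the blocks of a 100-200 GB file will still be scattered across the cluster, so most of the reads will be remote anyway.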