is there any way to control inputsplit in map reduce - hadoop

I have lots of small(150-300 KB) text file 9000 per hour,I need to process them through map reduce. I created a simple MR which will process all the file and create single output file. when i run job this job for 1 hour data, it took 45 min. i started digging reason of poor performance, i found it takes as many input-split as the number of file. as i am guessing one reason for poor performance.
is there any way to control the input split by which i can say 1000 file would be entertained by one input split/Map.

Hadoop is designed for huge files in small numbers and not the other way. There are some ways to get around it like preprocessing the data, using the CombineFileInputFormat.

Related

Processing HUGE number of small files independently

The task is to process HUGE (around 10,000,000) number of small files (each around 1MB) independently (i.e. the result of processing file F1, is independent of the result of processing F2).
Someone suggested Map-Reduce (on Amazon-EMR Hadoop) for my task. However, I have serious doubts about MR.
The reason is that processing files in my case, are independent. As far as I understand MR, it works best when the output is dependent on many individual files (for example counting the frequency of each word, given many documents, since a word might be included in any document in the input file). But in my case, I just need a lot of independent CPUs/Cores.
I was wondering if you have any advice on this.
Side Notes: There is another issue which is that MR works best for "huge files rather than huge number of small size". Although there seems to be solutions for that. So I am ignoring it for now.
It is possible to use map reduce for your needs. In MapReduce, there are two phases Map and Reduce, however, the reduce phase is not a must, just for your situation, you could write a map-only MapReduce job, and all the calculations on a single file should be put into a customised Map function.
However, I haven't process such huge num of files in a single job, no idea on its efficiency. Try it yourself, and share with us :)
This is quite easy to do. In such cases - the data for MR job is typically the list of files (and not the files themselves). So the size of the data submitted to Hadoop is the size of 10M file names - which is order of a couple of gigs max.
One uses MR to split up the list of files into smaller fragments (how many can be controlled by various options). Then each mapper gets a list of files. It can process one file at a time and generate the output.
(fwiw - I would suggest Qubole (where I am a founder) instead of EMR cause it would save you a ton of money with auto-scaling and spot integration).

how can & where can i edit the InputSplit size in CDH4.7. By default it is 64 MB, but i want to mention it as 1MB

How can and where can i edit the Input Split size in CDH4.7 By default it is 64 MB but i want to mention it as 1MB because my MR job is running slow and i want increase the speed of MR job. i guess need to edit cor-site property IO.file.buffer.size but CDH4.7 is not allowing me to edit as it is read only.
just reapeating the question below the get my question posted
How can and where can i edit the Input Split size in CDH4.7 By default it is 64 MB but i want to mention it as 1MB because my MR job is running slow and i want increase the speed of MR job. i guess need to edit cor-site property IO.file.buffer.size but CDH4.7 is not allowing me to edit as it is read only.
The parameter "mapred.max.split.size" which can be set per job individually is what you looking for.
You don't change "dfs.block.size" because Hadoop Works better with a small number of large files than a large number of small files. One reason for this is that FileInputFormat generates splits in such a way that each split is all or part of a single file. If the file is very small ("small" means significantly smaller than an HDFS block) and there are a lot of them, then each map task will process very little input, and there will be a lot of them (one per file), each of which imposes extra bookkeeping overhead. Compare a 1gb file broken into sixteen 64mb blocks, and 10.000 or so 100kb files. The 10.000 files use one map each, and the job time can be tens or hundreds of times slower than the equivalent one with a single input file and 16 map tasks.
You can change it directly from within the command using -D mapred.max.split.size=.. from command line and don't necessarily change any file permanently.

improving performance when you have many small input files using Pig Latin

Currently I'm working with approximately 19 gigabytes of log data,
and they are much seperated so that the nubmer of input files is 145258(pig stat).
Between executing application and starting mapreduce job in web UI,
enormous time is wasted to get prepared(about 3hours?) and then the mapreduce job starts.
and also mapreduce job itself(through Pig script) is pretty slow, it takes about an hour.
mapreduce logic is not that complex, just like a group by operation.
I have 3 datanodes and 1 namenode, 1 secondary namenode.
How can I optimize configuration to improve mapreduce performance?
You should set pig.maxCombinedSplitSize to a reasonable size and make sure that pig.splitCombination is set to its default true.
Where is your data? on HDFS? on S3? If the data is on S3, you should merge the data into larger files once and then execute your pig scripts on it, otherwise, it's going to take a long time anyway - S3 returns object lists with pagination and it takes a long time to fetch the list (also if you have more objects in the bucket and you're not searching for your files with a prefix only pattern, hadoop will list all of the objects (because there's no other option in S3).
Try a hadoop fs -ls /path/to/files | wc -l and look at how long that takes to come back - you have two problems:
Discovering the files to process - the above ls will probably take a good number of minutes to complete. Each file then has to be queried for its block size to determine whether it can be split / processed by multiple mappers
Retaining all the information from the above is most probably going to push the JVM limits of your client, you'll probably see a huge amount of GC trying to assign, allocate and grow the collection used to store the split information for the at minimum 145k splits.
So as already suggested, try to combine your files into more sensible file sizes (somewhere near you block size, or a multiple thereof). Maybe you can combine all files for the same hour into a single concatenated file (or to day, depends on your processing use case).
Looks like the problem is more of Hadoop than Pig. You might want to try to combine all the small files into a Hadoop Archive and see if it improves the performance. For details refer to this link
Another approach you can try is run a separate Pig job which periodically UNIONs all the log files into one "big" log file. This should help in reducing the processing time for your main job.

Not getting performance with MapReduce

I am new to Map-reduce and I am using Hadoop Pipes. I have an input file which contains the number of records, one per line. I have written one simple program to print those lines in which three words are common. In map function I have emitted the word as a key and record as a value and compared those records in reduce function. Then I compared Hadoop's performance with simple C++ program in which I read the records from file and split it into words and load the data in map. Map contains word as a key and record as a value. After loading all the data, I compared that data. But I found that for doing the same task Hadoop Map-reduce takes long time compared with plain C++ program. When I run my program on hadoop it takes about 37 minutes where as it takes only about 5 minutes for simple C++ program. Please, somebody help me to figure out whether I am doing wrong somewhere? Our application needs performance.
There are several points which should be made here:
Hadoop is not high performance - it is scalable. Local program doing the same on small data set will always outperform hadoop. So its usage makes sense only when you want to run on cluster on machine and enjoy Hadoop's parallel processing.
Hadoop streaming is also not best thing performance wise since there are task switches per line. In many cases native hadoop program written in Java will have better performance

Hadoop file size Clarification

I am having clarification regarding using Hadoop for large file size around 2 million. I have file data that consists of 2 million lines for which I want to split each line as single file, copy it in Hadoop File System and do perform calculation of term frequency using Mahout. Mahout uses map-reduce computation in a distributed fashion. But for this, say If I have a file that consist of 2 million lines, I want to take each line as a document for calculation of term-frequency. I will finally have one directory where I will have 2 million documents, each document consist of single line. Will this create n-maps for n-files, here 2 million maps for the process. This takes lot of time for computation. Is there is any alternative way of representing documents for faster computation.
2 millions files is a lot for hadoop. More then that - running 2 million tasks will have roughly 2M seconds overhead, what means a few days of small cluster work.
I think that the problem is of algorithmic nature - how to map your computation to the map reduce paradigm in the way that you will have modest number of mappers. Please drop a few lines about task you need, and I might suggest algorithm.
Mahout has implementation for calcualating TF and IDF for text.
check mahout liberary for it,
and splitting each line as a file is not good idea in hadoop map reduce framework.

Resources