I am new to Hadoop and just installed Oracle's VirtualBox and Hortonworks' sandbox. I then downloaded the latest version of Hadoop and imported the jar files into my Java program. I copied a sample WordCount program, built a new jar file, and ran that jar as a job using the sandbox. The word count works perfectly fine, as expected. However, on the job status page, I see that the number of mappers for my input file is determined as 28. In my input file, I have only the following line.
Ramesh is studying at XXXXXXXXXX XX XXXXX XX XXXXXXXXX.
How is the total mappers determined as 28?
I added the below line into my wordcount.java program to check.
FileInputFormat.setMaxInputSplitSize(job, 2);
Also, I would like to know whether the input file can contain only 2 rows. That is, suppose I have an input file like the one below.
row1,row2,row3,row4,row5,row6.......row20
Should I split the input file into 20 different files each having only 2 rows?
HDFS blocks and MapReduce splits are two different things. Blocks are the physical division of the data, while a split is just a logical division made during an MR job. It is the duty of the InputFormat to create the splits from a given set of data, and the number of mappers is decided based on the number of splits. When you use setMaxInputSplitSize, you override this behavior and supply a split size of your own. But giving a very small value to setMaxInputSplitSize would be overkill, as there will be a lot of very small splits and you'll end up with a lot of unnecessary map tasks.
Actually, I don't see any need for you to use FileInputFormat.setMaxInputSplitSize(job, 2); in your WordCount program at all. Also, it looks like you have misread the 2 here. It is not the number of lines in a file; it is the maximum split size, in bytes (a long), that you would like for your MR job. You can have any number of lines in the file that you use as your MR input.
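For reference, here is a minimal sketch of how these knobs are usually set in a driver. Both values are byte counts; the 32 MB / 64 MB figures and the class name are just illustrative choices, not anything from the question. With the defaults left alone, the split size simply falls back to the HDFS block size.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeDriverSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "wordcount");

        FileInputFormat.addInputPath(job, new Path(args[0]));

        // Both values are byte counts, not line counts.
        FileInputFormat.setMinInputSplitSize(job, 32 * 1024 * 1024L);   // 32 MB
        FileInputFormat.setMaxInputSplitSize(job, 64 * 1024 * 1024L);   // 64 MB

        // ... mapper, reducer, output path, etc. exactly as in the usual WordCount driver
    }
}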
Does this sound OK?
That means your input file is split into roughly 28 parts (blocks) in HDFS, since you said 28 map tasks were scheduled. However, those 28 map tasks may not all run in parallel; parallelism will depend on the number of slots you have in your cluster. I'm speaking in terms of Apache Hadoop; I don't know whether Hortonworks made any modifications to this.
Hadoop likes to work with large files, so do you really want to split your input file into 20 different files?
Related
Couldn't find enough information on the internet, so asking here:
Suppose I'm writing a huge file to disk, hundreds of terabytes, which is the result of MapReduce (or Spark or whatever). How would MapReduce write such a file to HDFS efficiently (potentially in parallel?) so that it could later be read in a parallel way as well?
My understanding is that HDFS is simply block based (e.g. 128 MB blocks), so in order to write the second block, you must have written the first block (or at least determined what content goes into block 1). Let's say it's a CSV file; it is quite possible that a line in the file will span two blocks. How could such a CSV be read by different mappers in MapReduce? Does it have to do some smart logic to read two blocks, concatenate them, and read the proper line?
Hadoop uses RecordReaders and InputFormats as the two interfaces which read and understand bytes within blocks.
By default, in Hadoop MapReduce each record ends on a newline with TextInputFormat, and in the scenario where just one line crosses the end of a block, the next block must be read, even if it's literally just the \r\n characters.
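Here is a simplified, illustrative sketch of that boundary handling, using a plain local file instead of HDFS; it is not the actual LineRecordReader source, just the same idea in miniature:

import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

public class SplitBoundarySketch {

    // Reads the records "owned" by the byte range [start, end) of a text file.
    // If the split does not begin at offset 0, the first partial line is skipped,
    // because the reader of the previous split finishes it. The last line is read
    // to completion even if that means reading past 'end' into the next block.
    static void readSplit(File file, long start, long end) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
            raf.seek(start);
            long pos = start;
            if (start != 0) {
                raf.readLine();               // discard the tail of the previous record
                pos = raf.getFilePointer();
            }
            while (pos < end) {               // a line only has to *start* before 'end'
                String line = raf.readLine();
                if (line == null) {
                    break;
                }
                pos = raf.getFilePointer();
                System.out.println("record: " + line);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        File f = new File(args[0]);
        long mid = f.length() / 2;            // pretend the file has two "blocks"
        readSplit(f, 0, mid);                 // split 1
        readSplit(f, mid, f.length());        // split 2
    }
}

The key point is the asymmetry: a reader never starts a record it doesn't own, but it always finishes the one it started, even if that means reading a few bytes from the next block.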
Writing data is done from reduce tasks, or Spark executors, etc., in that each task is responsible for writing only a subset of the entire output. You'll generally never get a single file for non-small jobs, and this isn't an issue, because the input arguments to most Hadoop processing engines are meant to scan directories, not point at single files.
I tried to run a MapReduce job on a small file (200 KB) with only 10 lines. I used Hadoop Streaming, as the MapReduce job is written as a shell script. I assumed it would use only 1 map task, but the job tracker shows 'Map Total' as 2.
Please advise what the reason could be, as I assumed the number of mappers would depend on the input file size and the allocated block size.
Thanks
I have a large text file of around 13 GB. I want to process the file using Hadoop. I know that Hadoop uses FileInputFormat to create InputSplits, which are assigned to mapper tasks. I want to know whether Hadoop creates these InputSplits sequentially or in parallel. I mean, does it read the large text file sequentially on a single host and create split files that are then distributed to datanodes, or does it read chunks of, say, 50 MB in parallel?
Does Hadoop replicate the big file on multiple hosts before splitting it up?
Is it recommended that I split up the file into 50 MB chunks to speed up the processing? There are many questions about the appropriate split size for mapper tasks, but not about the exact split process itself.
Thanks
InputSplits are created on the client side, and each one is just a logical representation of the file, in the sense that it only contains the file path and the start and end offset values (calculated in LineRecordReader's initialize function). Calculating this logical representation does not take much time, so there is no need to split your file into chunks yourself; the real work happens on the mapper side, where execution is done in parallel. The client then places the InputSplits into HDFS, and the JobTracker takes it from there, allocating TaskTrackers depending on the splits. One mapper's execution is not dependent on another's; each mapper knows exactly where it has to start processing its split, so the mapper executions run in parallel.
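For what it's worth, the size of those logical splits follows a simple clamping rule. The sketch below mirrors what FileInputFormat's split-size computation does, with made-up numbers purely for illustration:

public class SplitSizeRule {
    // Mirrors the clamping rule FileInputFormat uses when sizing splits:
    // the split size is the block size, clamped between the configured
    // minimum and maximum split sizes.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;   // a 128 MB HDFS block
        long minSize   = 1L;                   // effectively the default minimum
        long maxSize   = Long.MAX_VALUE;       // effectively the default maximum

        // With the defaults, the split size equals the block size,
        // so you get one mapper per block.
        System.out.println(computeSplitSize(blockSize, minSize, maxSize));

        // Forcing a tiny maximum (like the 2 from the first question above)
        // yields 2-byte splits and therefore a flood of map tasks.
        System.out.println(computeSplitSize(blockSize, minSize, 2L));
    }
}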
I suppose you want to process the file using MapReduce, not Hadoop. Hadoop is a platform which provides tools to process and store large amounts of data.
When you store the file in HDFS (the Hadoop filesystem), it splits the file into multiple blocks. The block size is defined in the hdfs-site.xml file as dfs.block.size. For example, if the block size is set to 128 MB, then your input file will be split into 128 MB blocks. This is how HDFS stores the data internally; to the user it always appears as a single file.
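If you want to see those blocks for yourself, here is a small sketch using the HDFS FileSystem API (the path is a placeholder); it prints the block size and the datanodes holding each block of a file that is already stored in HDFS:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlocks {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path(args[0]);              // e.g. the 13 GB text file
        FileStatus status = fs.getFileStatus(file);

        System.out.println("block size: " + status.getBlockSize() + " bytes");

        // Each BlockLocation is one physical block plus the datanodes holding its replicas.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation b : blocks) {
            System.out.println("offset " + b.getOffset()
                    + " length " + b.getLength()
                    + " hosts " + String.join(",", b.getHosts()));
        }
    }
}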
When you provide the input file (stored in HDFS) to MapReduce, it launches a mapper task for each block/split of the file. This is the default behavior.
You do not need to split the file into chunks; just store the file in HDFS and it will do the rest for you.
First let us understand what is meant by input split.
When your text file is divided into blocks of 128 MB (the default) by HDFS, assume that the 10th line of the file is cut in two: the first half is in the first block and the other half is in the second block. But when you submit a map program, Hadoop understands that the last line of the 1st block (which becomes the input split here) is not complete, so it carries the second half of the 10th line over to the first input split. Which implies:
1) 1st input split = 1st Block + 2nd part of 10th line from 2nd block
2) 2nd input split = 2nd Block - 2nd part of 10th line from 2nd block.
This carrying of partial records across block boundaries is built into Hadoop's input handling; it is not something you configure. The block size in Hadoop v2 is 128 MB by default and can be changed via configuration.
My job is computationally intensive, so I am really only using Hadoop for distribution, and I want all my output in one single file, so I have set the number of reducers to 1. My reducer is actually doing nothing...
If I explicitly set the number of reducers to 0, how can I control the mappers so that all the output is written into the same single output file? Thanks.
You can't do that in Hadoop. Your mappers each have to write to independent files. This makes them efficient (no contention or network transfer). If you want to combine all those files, you need a single reducer. Alternatively, you can let them be separate files, and combine the files when you download them (e.g., using HDFS's command-line cat or getmerge options).
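For instance, in Hadoop 2.x that merge can also be done programmatically with FileUtil.copyMerge; the paths below are made up, and hadoop fs -getmerge achieves the same thing from the command line:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class MergeOutput {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Concatenate all part-* files under the job output directory into one file.
        Path outputDir = new Path("/user/me/job-output");   // hypothetical job output dir
        Path merged    = new Path("/user/me/merged.txt");   // hypothetical merged file
        FileUtil.copyMerge(fs, outputDir, fs, merged, false, conf, null);
    }
}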
EDIT: From your comment, I see that what you want is to avoid the hassle of writing a reducer. This is definitely possible. To do this, you can use the IdentityReducer. You can check its API here, and an explanation of 0 reducers vs. using the IdentityReducer is available here.
Finally, when I say that having multiple mappers generate a single output is not possible, I mean it is not possible with plain files in HDFS. You could do this with other types of output, like having all mappers write to a single database. This is OK if your mappers are not generating much output. Details on how this would work are available here.
cabad is correct for the most part. However, if you want to process the file with a single mapper into a single output file, you could use a FileInputFormat that marks the file as not splittable, and also set the number of reducers to 0. This gives up the performance benefit of using multiple datanodes, but it skips the shuffle and sort.
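A sketch of that idea; the class name is hypothetical, and the isSplitable hook on TextInputFormat is what does the actual work:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Input format that refuses to split its files, so each input
// file is handled by exactly one mapper.
public class NonSplittableTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }
}

In the driver you would then call job.setInputFormatClass(NonSplittableTextInputFormat.class) and job.setNumReduceTasks(0), so the single mapper's output file is the job's single output file.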
Currently I'm working with approximately 19 gigabytes of log data,
and it is spread across so many files that the number of input files is 145258 (per the Pig stats).
Between executing the application and the MapReduce job starting in the web UI,
an enormous amount of time is spent getting prepared (about 3 hours?) before the MapReduce job starts.
The MapReduce job itself (run through a Pig script) is also pretty slow; it takes about an hour.
The MapReduce logic is not that complex, just a group-by operation.
I have 3 datanodes and 1 namenode, 1 secondary namenode.
How can I optimize the configuration to improve MapReduce performance?
You should set pig.maxCombinedSplitSize to a reasonable size and make sure that pig.splitCombination is set to its default true.
Where is your data? On HDFS? On S3? If the data is on S3, you should merge the data into larger files once and then execute your Pig scripts on it; otherwise it's going to take a long time anyway. S3 returns object lists with pagination, and it takes a long time to fetch the list. Also, if you have more objects in the bucket and you're not searching for your files with a prefix-only pattern, Hadoop will list all of the objects (because there's no other option in S3).
Try a hadoop fs -ls /path/to/files | wc -l and look at how long that takes to come back - you have two problems:
Discovering the files to process - the above ls will probably take a good number of minutes to complete. Each file then has to be queried for its block size to determine whether it can be split / processed by multiple mappers.
Retaining all the information from the above will most probably push the JVM limits of your client; you'll probably see a huge amount of GC as it tries to assign, allocate and grow the collection used to store the split information for the at-minimum 145k splits.
So, as already suggested, try to combine your files into more sensible file sizes (somewhere near your block size, or a multiple thereof). Maybe you can combine all files for the same hour into a single concatenated file (or per day, depending on your processing use case).
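As a rough sketch of such a pre-merge step (the directory layout and hourly grouping here are hypothetical), you could concatenate the small files with the HDFS FileSystem API before running the Pig script:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ConcatSmallLogs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path inputDir = new Path("/logs/2014/01/01/00");        // hypothetical "one hour" directory
        Path merged   = new Path("/logs-merged/2014-01-01-00.log");

        try (FSDataOutputStream out = fs.create(merged)) {
            for (FileStatus stat : fs.listStatus(inputDir)) {
                if (!stat.isFile()) continue;                    // skip subdirectories
                try (FSDataInputStream in = fs.open(stat.getPath())) {
                    IOUtils.copyBytes(in, out, conf, false);     // append this small file
                }
            }
        }
    }
}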
Looks like the problem is more with Hadoop than with Pig. You might want to try combining all the small files into a Hadoop Archive and see if it improves the performance. For details, refer to this link.
Another approach you can try is to run a separate Pig job which periodically UNIONs all the log files into one "big" log file. This should help reduce the processing time for your main job.