Hadoop HDFS: Read/Write parallelism? - hadoop

Couldn't find enough information on the internet, so asking here:
Assuming I'm writing a huge file to disk, hundreds of terabytes, which is the result of MapReduce (or Spark or whatever): how would MapReduce write such a file to HDFS efficiently (potentially in parallel?) so that it could later be read in a parallel way as well?
My understanding is that HDFS is simply block based (e.g. 128 MB blocks). So in order to write the second block, you must have written the first block (or at least determined what content will go into block 1). Let's say it's a CSV file; it is quite possible that a line in the file will span two blocks -- how could we read such a CSV into different mappers in MapReduce? Does it have to do some smart logic to read the two blocks, concatenate them, and read the proper line?

Hadoop uses RecordReaders and InputFormats as the two interfaces which read and understand bytes within blocks.
By default, with TextInputFormat in Hadoop MapReduce, each record ends on a newline, and in the scenario where a line crosses the end of a block, the next block must be read, even if it is literally just the \r\n characters.
Writing data is done from the reduce tasks (or Spark executors, etc.), in that each task is responsible for writing only a subset of the entire output. You'll generally never get a single file for non-small jobs, and this isn't an issue, because the input arguments to most Hadoop processing engines are meant to scan directories, not point at single files.
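As a hedged illustration of the point about scanning directories rather than single files, here is a minimal driver sketch; the class name and the /data/... paths are made up for the example, and a real job would add its own mapper and output types:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class DirectoryInputDriver {
      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "read-previous-output");
        job.setJarByClass(DirectoryInputDriver.class);

        // TextInputFormat and its LineRecordReader turn the files into line
        // records, handling lines that straddle HDFS block boundaries.
        job.setInputFormatClass(TextInputFormat.class);

        // The input is the whole output *directory* of the previous job:
        // every part-r-NNNNN file inside it becomes input.
        FileInputFormat.addInputPath(job, new Path("/data/previous-job-output"));
        FileOutputFormat.setOutputPath(job, new Path("/data/next-job-output"));

        job.setMapperClass(Mapper.class);          // identity mapper, just for the sketch
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }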

Related

Does hadoop create InputSplits parallely

I have a large text file of around 13 GB. I want to process the file using Hadoop. I know that Hadoop uses FileInputFormat to create InputSplits which are assigned to mapper tasks. I want to know whether Hadoop creates these InputSplits sequentially or in parallel. I mean, does it read the large text file sequentially on a single host and create split files which are then distributed to datanodes, or does it read chunks of, say, 50 MB in parallel?
Does Hadoop replicate the big file on multiple hosts before splitting it up?
Is it recommended that I split up the file into 50 MB chunks to speed up the processing? There are many questions on the appropriate split size for mapper tasks, but not on the exact split process itself.
Thanks
InputSplits are created on the client side, and each one is just a logical representation of a piece of the file: it contains only the file path plus start and end offsets (the actual record boundaries are worked out later, in LineRecordReader's initialize function). Computing this logical representation takes very little time, so there is no need to split your file into chunks yourself; the real execution happens on the mapper side, where the work is done in parallel. The client then places the input split information into HDFS as part of job submission, and the JobTracker takes it from there, allocating a TaskTracker for each split. One mapper's execution is not dependent on another's; each mapper knows exactly where it has to start processing its split, so the mapper executions happen in parallel.
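To make the "logical representation" point concrete, here is a simplified sketch (not the actual Hadoop source) of the kind of split list the client builds: each split is only metadata (path, start offset, length), so no file data is read at this stage.

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    public class SplitSketch {
      // Produce one logical split per splitSize bytes of the file.
      static List<FileSplit> logicalSplits(Path file, long fileLength, long splitSize) {
        List<FileSplit> splits = new ArrayList<>();
        long remaining = fileLength;
        while (remaining > 0) {
          long start = fileLength - remaining;
          long length = Math.min(splitSize, remaining);
          // No data is read here; preferred hosts are omitted in this sketch.
          splits.add(new FileSplit(file, start, length, new String[0]));
          remaining -= length;
        }
        return splits;
      }
    }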
I suppose you want to process the file using MapReduce, not just Hadoop. Hadoop is a platform which provides tools to process and store large amounts of data.
When you store the file in HDFS (the Hadoop filesystem), it splits the file into multiple blocks. The block size is defined in hdfs-site.xml as dfs.block.size (dfs.blocksize in newer releases). For example, with a 128 MB block size your input file will be split into 128 MB blocks. This is how HDFS stores the data internally; to the user it always appears as a single file.
When you provide the input file (stored in HDFS) to MapReduce, it launches a mapper task for each block/split of the file. This is the default behavior.
You need not split the file into chunks yourself; just store the file in HDFS and it will do the rest for you.
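As a side note, and as a hedged sketch only, the HDFS block size can also be chosen per file at write time through the standard FileSystem.create overload; the path and the 128 MB value below are just examples.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockSizeExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        long blockSize = 128L * 1024 * 1024;                     // 128 MB blocks for this file
        short replication = 3;
        int bufferSize = conf.getInt("io.file.buffer.size", 4096);
        try (FSDataOutputStream out = fs.create(
            new Path("/data/example.csv"), true, bufferSize, replication, blockSize)) {
          out.writeBytes("id,value\n1,foo\n");                   // HDFS chunks the stream into blocks
        }
      }
    }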
First, let us understand what is meant by an input split.
When your text file is divided into blocks of 128 MB (the default) by HDFS, assume that the 10th line of the file is cut in two: the first half of the line is in the first block and the other half is in the second block. When you submit a map program, Hadoop understands that the last line of the 1st block (which becomes an input split here) is not complete, so it carries the second half of the 10th line over to the first input split. Which implies:
1) 1st input split = 1st block + 2nd part of the 10th line from the 2nd block
2) 2nd input split = 2nd block - 2nd part of the 10th line from the 2nd block
This is an inbuilt process of Hadoop; you do not have to set the input split size yourself. The block size in Hadoop v2 is 128 MB by default and can be changed in the configuration.
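A simplified sketch (not Hadoop's actual LineRecordReader, and assuming '\n' line endings) of the rule described above: a reader for any split other than the first one discards the bytes up to the first newline, because that partial line is finished by the previous split, which reads past its own end.

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;

    public class SplitLineReaderSketch {
      // start/end are the byte offsets of this split within the file; 'in' is
      // assumed to be positioned at 'start' already (e.g. via FSDataInputStream.seek).
      static void readSplit(InputStream in, long start, long end) throws IOException {
        BufferedReader reader =
            new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
        long pos = start;
        if (start != 0) {
          String skipped = reader.readLine();   // tail of a line owned by the previous split
          if (skipped != null) pos += skipped.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        String line;
        while (pos <= end && (line = reader.readLine()) != null) {
          pos += line.getBytes(StandardCharsets.UTF_8).length + 1;   // +1 for '\n'
          process(line);   // the last line read may run past 'end' into the next block
        }
      }
      static void process(String line) { /* hand the record to the mapper */ }
    }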

Hadoop - set reducer number to 0 but write to same file?

My job is computationally intensive, so I am really only using Hadoop for its distribution; I want all my output to be in one single file, so I have set the number of reducers to 1. My reducer is actually doing nothing...
If I explicitly set the number of reducers to 0 instead, how can I control the mappers to force all the output to be written into the same single output file? Thanks.
You can't do that in Hadoop. Your mappers each have to write to independent files. This makes them efficient (no contention or network transfer). If you want to combine all those files, you need a single reducer. Alternatively, you can let them be separate files, and combine the files when you download them (e.g., using HDFS's command-line cat or getmerge options).
EDIT: From your comment, I see that what you want is to get away with the hassle of writing a reducer. This is definitely possible. To do this, you can use the IdentityReducer. You can check its API here and an explanation of 0 reducers vs. using the IdentityReducer is available here.
Finally, when I say that having multiple mappers generate a single output is not possible, I mean it is not possible with plain files in HDFS. You could do this with other types of output, like having all mappers write to a single database. This is OK if your mappers are not generating much output. Details on how this would work are available here.
cabad is correct for the most part. However, if you want to process the file with a single mapper into a single output file, you could use a FileInputFormat that marks the file as not splittable, and also set the number of reducers to 0. This gives up the performance of using multiple data nodes, but skips the shuffle and sort.
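A minimal sketch of that approach, assuming the new (org.apache.hadoop.mapreduce) API: subclass TextInputFormat so the file is never split, then run a map-only job. In the driver you would wire it in with job.setInputFormatClass(NonSplittableTextInputFormat.class) and job.setNumReduceTasks(0), so one mapper reads the whole file and writes a single part-m-00000 output file.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    // A TextInputFormat that refuses to split its input.
    public class NonSplittableTextInputFormat extends TextInputFormat {
      @Override
      protected boolean isSplitable(JobContext context, Path file) {
        return false;   // the whole file becomes one split, hence one mapper
      }
    }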

fs -put (or copyFromLocal) and data type awareness

If I upload a text file of size 117MB to HDFS using hadoop fs -put filename, I can see that one datanode contains a filepart of size 64.98MB (the default file split size) and another data node contains a filepart of size 48.59MB.
My question is whether this split position was calculated in a data aware way (recognising somehow that the file is text and thus splitting the file at "\n", for example).
I realise that FileInputFormat can be used to tell running jobs how to split the file in an intelligent way, but as I didn't specify the file type in the fs -put command, I was wondering if (and if so, how) an intelligent split would be done in this case.
Ellie
I think you are mixing up two things here; the following two types of splitting are completely separate:
Splitting files into HDFS blocks
Splitting files to be distributed to the mappers
And no, the split position was not calculated in a data-aware way.
Now, by default, if you are using FileInputFormat, these two types of splitting roughly overlap (and hence are identical).
But you can always have a custom way of splitting for the second point above (or even no splitting at all, i.e. have one complete file go to a single mapper).
You can also change the HDFS block size independently of the way your InputFormat splits the input data; a sketch of this per-job tuning follows the quoted example below.
Another important point to note is that, while the files are physically broken up when stored in HDFS, the split used to distribute work to mappers involves no physical splitting of files at all; it is only a logical split.
Taking an example from here:
Suppose we want to load a 110 MB text file into HDFS, with both the HDFS block size and the input split size set to 64 MB.
The number of mappers is based on the number of input splits, not the number of HDFS blocks.
When we set the HDFS block size to 64 MB, it is exactly 67108864 (64*1024*1024) bytes; it does not matter if the file gets cut in the middle of a line.
Now we have 2 input splits (so two maps). The last line of the first block and the first line of the second block are not meaningful on their own. TextInputFormat is responsible for reading meaningful lines and giving them to the map tasks. What TextInputFormat does is:
The first mapper reads until the end of the first block and also processes the last incomplete line of the first block plus the first incomplete line of the second block.
In the second block, it seeks to the second line, which is a complete line, reads from there, and gives those lines to the second mapper.
Read more here.
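Building on the point above that the input split size can be tuned independently of the HDFS block size, here is a minimal sketch of bounding the split size per job with the standard FileInputFormat helpers; the 256 MB and 512 MB values are arbitrary illustrations, not recommendations.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class SplitSizeTuning {
      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "custom-split-size");
        // Ask for splits of at least 256 MB, regardless of a 64/128 MB block size.
        FileInputFormat.setMinInputSplitSize(job, 256L * 1024 * 1024);
        FileInputFormat.setMaxInputSplitSize(job, 512L * 1024 * 1024);
        // ... set mapper, input/output formats and paths as usual, then submit.
      }
    }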

Writing to the same file from two mappers

In Hadoop MR (basically HDFS), is it possible to write to the same file from two mappers belonging to a single job in synchronous/serialized fashion?
Also, is writing to a single file from two mappers running in different jobs possible in a serialized fashion?
There are semaphores in other filesystems. What is the mechanism in HDFS?
There is no communication between the map tasks in Hadoop, so synchronization between them is not possible.
A file in HDFS can be written by a single writer at a time, while many readers can read it.
I think MapR allows multiple writers to the same file.
FYI, a file can only be appended to at the end; modifications at arbitrary offsets are not possible.
Just curious, what is the use case for multiple map tasks writing to a single file?
Set the number of reducers to 1 (mapred.reduce.tasks=1).
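For the new API, the same thing expressed in the driver would look roughly like this sketch (the job name and the omitted mapper/reducer setup are placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class SingleReducerJob {
      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "single-output-file");
        job.setNumReduceTasks(1);   // new-API equivalent of mapred.reduce.tasks=1
        // ... set mapper, reducer, input/output formats and paths as usual
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }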

Hadoop Pipes: how to pass large data records to map/reduce tasks

I'm trying to use map/reduce to process large amounts of binary data. The application is characterized by the following: the number of records is potentially large, such that I don't really want to store each record as a separate file in HDFS (I was planning to concatenate them all into a single binary sequence file), and each record is a large coherent (i.e. non-splittable) blob, between one and several hundred MB in size. The records will be consumed and processed by a C++ executable. If it weren't for the size of the records, the Hadoop Pipes API would be fine: but this seems to be based around passing the input to map/reduce tasks as a contiguous block of bytes, which is impractical in this case.
I'm not sure of the best way to do this. Does any kind of buffered interface exist that would allow each M/R task to pull multiple blocks of data in manageable chunks? Otherwise I'm thinking of passing file offsets via the API and streaming in the raw data from HDFS on the C++ side.
I'd like to have any opinions from anyone who's tried anything similar - I'm pretty new to hadoop.
Hadoop is not designed for records around 100 MB in size. You will get OutOfMemoryError and uneven splits because some records are 1 MB and some are 100 MB. By Amdahl's Law your parallelism will suffer greatly, reducing throughput.
I see two options. You can use Hadoop streaming to map your large files into your C++ executable as-is. Since this will send your data via stdin it will naturally be streaming and buffered. Your first map task must break up the data into smaller records for further processing. Further tasks then operate on the smaller records.
If you really can't break it up, make your MapReduce job operate on file names. The first mapper gets some file names, runs them through your mapper C++ executable, and stores the results in more files. The reducer is given all the names of the output files; repeat with a reducer C++ executable. This will not run out of memory, but it will be slow. Besides the parallelism issue, you won't get reduce tasks scheduled onto nodes that already have the data, resulting in non-local HDFS reads.
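A hedged sketch of that second option, where the job's input records are file names rather than the data itself; the class name is made up, and the chunk size and output value are illustrative only:

    import java.io.IOException;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class FileNameMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        Path path = new Path(value.toString());               // one file name per input line
        FileSystem fs = path.getFileSystem(context.getConfiguration());
        byte[] buffer = new byte[4 * 1024 * 1024];             // 4 MB chunks, never the whole record
        long total = 0;
        try (FSDataInputStream in = fs.open(path)) {
          int n;
          while ((n = in.read(buffer)) > 0) {
            total += n;   // here the bytes would be streamed to the C++ executable
          }
        }
        context.write(value, new LongWritable(total));         // e.g. emit the processed byte count
      }
    }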
