Writing to the same file from two mappers - hadoop

In Hadoop MR (basically HDFS), is it possible to write to the same file from two mappers belonging to a single job in a synchronous/serialized fashion?
Is it also possible to write to a single file from two mappers running in different jobs in a serialized fashion?
Other filesystems have semaphores. What is the equivalent mechanism in HDFS?

There is no communication between the map tasks in Hadoop, so some sort of synchronization between them is not possible.
A file in HDFS can have only a single writer at a time, while many readers can read it concurrently.
I think MapR allows multiple writers to the same file.
FYI, HDFS files are append-only: data can only be added at the end of the file, and modifications at an arbitrary offset are not possible.
Just curious, what is the use case for multiple map tasks writing to a single file?

Set the number of reducers to 1 (mapred.reduce.tasks=1; in newer Hadoop releases the property is named mapreduce.job.reduces).

Related

Hadoop HDFS: Read/Write parallelism?

I couldn't find enough information on the internet, so I'm asking here:
Assume I'm writing a huge file to disk, hundreds of terabytes, as the result of a MapReduce job (or Spark or whatever). How would MapReduce write such a file to HDFS efficiently (potentially in parallel?) so that it could later be read in parallel as well?
My understanding is that HDFS is simply block based (e.g. 128 MB), so in order to write the second block, you must have written the first block (or at least determined what content will go into block 1). Let's say it's a CSV file. It is quite possible that a line in the file will span two blocks. How could we read such a CSV into different mappers in MapReduce? Does it have to do some smart logic to read two blocks, concatenate them and read the proper line?
Hadoop uses RecordReaders and InputFormats as the two interfaces which read and understand bytes within blocks.
By default, with TextInputFormat, each record in Hadoop MapReduce ends at a newline. When a line crosses the end of a block, the reader must continue into the next block to finish the record, even if all that remains there is literally the \r\n characters.
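The boundary rule described above can be illustrated with a small simulation. This is a minimal sketch in plain Python, not Hadoop's actual record reader: the function names `split_ranges` and `read_split` are hypothetical, the "file" is an in-memory byte string, and only "\n" is treated as a line terminator. The assumed convention is that every split except the first discards its first (possibly partial) line, and every split keeps reading until it has finished the line that starts at or before its end offset.

```python
from typing import List, Tuple


def split_ranges(total: int, block_size: int) -> List[Tuple[int, int]]:
    """Cut [0, total) into (start, length) ranges of at most block_size bytes."""
    return [(s, min(block_size, total - s)) for s in range(0, total, block_size)]


def read_split(data: bytes, start: int, length: int) -> List[bytes]:
    """Return the newline-terminated records owned by one split."""
    end = start + length
    pos = start
    if start != 0:
        # Every split except the first discards its (possibly partial) first
        # line; the reader of the previous split is responsible for it.
        newline = data.find(b"\n", start)
        if newline == -1:
            return []
        pos = newline + 1
    records = []
    # A record is owned by this split if it starts at or before `end`, so the
    # last record may run past `end` into the bytes of the next block.
    while pos <= end and pos < len(data):
        newline = data.find(b"\n", pos)
        stop = len(data) if newline == -1 else newline
        records.append(data[pos:stop])
        pos = stop + 1
    return records


if __name__ == "__main__":
    csv = b"alpha,1\nbravo,2\ncharlie,3\ndelta,4\n"
    # Pretend the block size is 10 bytes, so lines straddle block boundaries.
    for start, length in split_ranges(len(csv), 10):
        print(start, read_split(csv, start, length))
```

With this convention every line is returned by exactly one split, no matter where the block boundaries fall, which is why mappers can read a CSV in parallel without coordinating with each other.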
Writing is done from reduce tasks, Spark executors, etc., and each task is responsible for writing only a subset of the entire output. You'll generally never get a single file for non-small jobs, and this isn't an issue because the inputs to most Hadoop processing engines are meant to scan directories, not point at single files.

Concept of blocks in Hadoop HDFS

I have some questions regarding blocks in Hadoop. I read that Hadoop uses HDFS, which creates blocks of a specific size.
First question: Do the blocks physically exist on the hard disk in the normal filesystem, such as NTFS? That is, can we see the blocks on the hosting filesystem (NTFS), or can they only be seen using the hadoop commands?
Second question: Does Hadoop create the blocks before running the tasks, i.e. do the blocks exist from the beginning whenever there is a file, or does Hadoop create the blocks only when running the task?
Third question: Will the blocks be determined and created before splitting (i.e. the getSplits method of the InputFormat class) regardless of the number of splits, or afterwards, depending on the splits?
Fourth question: Are the blocks before and after running the task the same, or does it depend on the configuration? And are there two types of blocks, one for storing the files and one for grouping the files and sending them over the network to the data nodes for executing the task?
1. Do the blocks physically exist on the hard disk in the normal filesystem, such as NTFS? That is, can we see the blocks on the hosting filesystem (NTFS), or can they only be seen using the hadoop commands?
Yes. Blocks exist physically. You can use commands like hadoop fsck /path/to/file -files -blocks
See the SE question below for commands to view blocks:
Viewing the number of blocks for a file in hadoop
2. Does Hadoop create the blocks before running the tasks, i.e. do the blocks exist from the beginning whenever there is a file, or does Hadoop create the blocks only when running the task?
Hadoop = Distributed storage (HDFS) + Distributed processing (MapReduce & YARN).
A MapReduce job works on input splits, and the input splits are derived from the data blocks on the DataNodes. Data blocks are created when a file is written. If you are running a job on existing files, the data blocks already exist before the job starts, and the InputSplits are computed when the job is submitted, before the map tasks run. You can think of a data block as a physical entity and an InputSplit as a logical entity. A MapReduce job does not change the input data blocks. The reducers generate output data as new data blocks.
Mappers process input splits and emit output to the reducers.
3. Will the blocks be determined and created before splitting (i.e. the getSplits method of the InputFormat class) regardless of the number of splits, or afterwards, depending on the splits?
The input is already available as physical DFS blocks. A MapReduce job works on InputSplits. Blocks and InputSplits may or may not be the same. A block is a physical entity and an InputSplit is a logical entity. Refer to the SE question below for more details:
How does Hadoop perform input splits?
4. Are the blocks before and after running the task the same, or does it depend on the configuration? And are there two types of blocks, one for storing the files and one for grouping the files and sending them over the network to the data nodes for executing the task?
Mapper input: The input blocks already exist. The map process starts on input blocks/splits that were stored in HDFS before the map phase began.
Mapper output: Not stored in HDFS. It is written to the local disk of the node running the map task, because it does not make sense to store intermediate results in HDFS with a replication factor greater than 1.
Reducer output: Reducer output is stored in HDFS. The number of blocks depends on the size of the reducer output data.
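The statement that blocks and InputSplits "may or may not be the same" can be made concrete with a little arithmetic. The following is a minimal sketch, assuming the split size is chosen as max(min_size, min(max_size, block_size)), which is my understanding of how file-based input formats pick it. The function and parameter names are hypothetical, and the sketch ignores details such as the small slack Hadoop allows on the last split and non-splittable files.

```python
from typing import List, Tuple

MB = 1024 * 1024
NO_LIMIT = 2**63 - 1  # stands in for "no maximum split size configured"


def compute_split_size(block_size: int, min_size: int = 1,
                       max_size: int = NO_LIMIT) -> int:
    """Pick the logical split size from the physical block size and the limits."""
    return max(min_size, min(max_size, block_size))


def plan_splits(file_size: int, block_size: int, min_size: int = 1,
                max_size: int = NO_LIMIT) -> List[Tuple[int, int]]:
    """Return (offset, length) pairs describing the logical input splits."""
    split_size = compute_split_size(block_size, min_size, max_size)
    splits = []
    offset = 0
    while offset < file_size:
        length = min(split_size, file_size - offset)
        splits.append((offset, length))
        offset += length
    return splits


if __name__ == "__main__":
    # A 300 MB file stored in 128 MB blocks occupies three physical blocks.
    for label, kwargs in [("defaults", {}),
                          ("max 64 MB", {"max_size": 64 * MB}),
                          ("min 256 MB", {"min_size": 256 * MB})]:
        splits = plan_splits(300 * MB, 128 * MB, **kwargs)
        print(label, [length // MB for _, length in splits])
```

With the default limits each split lines up with one block. Lowering the maximum produces more splits than blocks, and raising the minimum produces fewer, while the blocks on disk stay exactly as they were written.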
Do the blocks physically exist on the hard disk in the normal filesystem, such as NTFS? That is, can we see the blocks on the hosting filesystem (NTFS), or can they only be seen using the hadoop commands?
Yes, the blocks exist physically on disk across the datanodes in your cluster. I suppose you could "see" them if you were on one of the datanodes and you really wanted to, but it would likely not be illuminating. Each one would only be a random 128 MB fragment of the file (or whatever dfs.block.size, named dfs.blocksize in newer releases, is set to in hdfs-site.xml) with no meaningful filename. The hdfs dfs commands enable you to treat HDFS as a "real" filesystem.
Does Hadoop create the blocks before running the tasks, i.e. do the blocks exist from the beginning whenever there is a file, or does Hadoop create the blocks only when running the task?
Hadoop takes care of splitting the file into blocks and distributing them among the datanodes when you put a file in HDFS (through whatever method applies to your situation).
Will the blocks be determined and created before splitting (i.e. the getSplits method of the InputFormat class) regardless of the number of splits, or afterwards, depending on the splits?
Not entirely sure what you mean, but the blocks exist before, and irrespective of, any processing you do with them.
Are the blocks before and after running the task the same, or does it depend on the configuration? And are there two types of blocks, one for storing the files and one for grouping the files and sending them over the network to the data nodes for executing the task?
Again, blocks in HDFS are determined before any processing is done, if any is done at all. HDFS is simply a way to store a large file in a distributed fashion. When you do processing, for example with a MapReduce job, Hadoop will write intermediate results to disk. This is not related to how the raw file is divided into blocks in HDFS.

Write to a single file from multiple reducers in hadoop

I am trying to run Kmeans using Hadoop. I want to save the centroids of the clusters calculated in the cleanup method of the Reducer to some file say centroids.txt. Now, I would like to know what will happen if multiple reducers' cleanup method starts at the same time and all of them try to write to this file simultaneously. Will it be handled internally? If not is there a way to synchronize this task?
Note that this is not my output file of reducer. It is an additional file that I am maintaining to keep track of the centroids. I am using BufferedWriter from the reducer's cleanup method to do this.
Yes, you are right. You cannot achieve that with the existing framework.
Cleanup is called once in every reduce task, so it runs many times across the job, and the tasks cannot be synchronized. Possible approaches you can follow are:
1 Call merge after the job completes successfully.
hadoop fs -getmerge <src> <localdst> [addnl]
2 Clearly specify where your output file(s) should go. Use this folder as input to your next job.
3 Chain one more MR job in which map and reduce don't change the data and the partitioner assigns all data to a single reducer.
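The effect of the first approach (merging after the job) can be sketched on a local directory. This is only an analogy for what a getmerge-style step achieves, not the HDFS implementation: the function name `merge_parts` is hypothetical, the `part-*` pattern assumes the conventional reducer output file names, and merging in name order is an assumption of this sketch.

```python
import shutil
import tempfile
from pathlib import Path
from typing import List


def merge_parts(src_dir, dst_file, pattern: str = "part-*") -> List[str]:
    """Concatenate the per-reducer part files in src_dir into dst_file."""
    parts = sorted(Path(src_dir).glob(pattern))  # name order, by assumption
    with open(dst_file, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)  # stream, do not load whole files
    return [p.name for p in parts]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        out_dir = Path(tmp) / "job-output"
        out_dir.mkdir()
        # Each reducer wrote its own centroids to its own file.
        (out_dir / "part-r-00000").write_text("centroid-0\n")
        (out_dir / "part-r-00001").write_text("centroid-1\n")
        (out_dir / "_SUCCESS").write_text("")  # job marker, not merged
        merged = Path(tmp) / "centroids.txt"
        print(merge_parts(out_dir, merged))
        print(merged.read_text())
```

Because every reducer only ever touches its own part file, no locking is needed during the job. The single combined file is produced afterwards by one process.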
Each reducer writes to a separate file. Multiple reducers can never modify the same file.
Since the centroids are relatively few, you can write them into ZooKeeper. If you had a high read/write load you would probably need HBase (which you can also use here, but it would be overkill).
Also note that there are several k-means implementations for Hadoop, such as Mahout. Some alternatives are more efficient than map/reduce, such as Apache Hama, which uses BSP, or Spark, which runs in memory.

Hadoop - set reducer number to 0 but write to same file?

My job is computationally intensive, so I am really only using Hadoop for its distribution capability, and I want all my output to be in one single file, so I have set the number of reducers to 1. My reducer is actually doing nothing...
If I explicitly set the number of reducers to 0, how can I control the mappers so that all the output is written into the same single output file? Thanks.
You can't do that in Hadoop. Your mappers each have to write to independent files. This makes them efficient (no contention or network transfer). If you want to combine all those files, you need a single reducer. Alternatively, you can let them be separate files, and combine the files when you download them (e.g., using HDFS's command-line cat or getmerge options).
EDIT: From your comment, I see that what you want is to avoid the hassle of writing a reducer. This is definitely possible. To do this, you can use the IdentityReducer. You can check its API here and an explanation of 0 reducers vs. using the IdentityReducer is available here.
Finally, when I say that having multiple mappers generate a single output is not possible, I mean it is not possible with plain files in HDFS. You could do this with other types of output, like having all mappers write to a single database. This is OK if your mappers are not generating much output. Details on how this would work are available here.
cabad is correct for the most part. However, if you want to process the file with a single mapper into a single output file, you could use a FileInputFormat that marks the file as not splittable. Do this and also set the number of reducers to 0. This gives up the parallelism of multiple data nodes but skips the shuffle and sort.

processing very small file with hadoop

I have a question about using Hadoop to process a small file. My file only has about 1,000 records, but I want the records to be distributed roughly evenly among the nodes. Is there a way to do this? I'm new to Hadoop, and so far it seems that all the execution is happening on one node instead of on multiple nodes simultaneously. Let me know if my question makes sense or if I need to clarify anything. Like I said, I'm very new to Hadoop but am hoping to get some clarification. Thanks.
Use the NLineInputFormat and specify the number of records to be processed by each mapper. This way the records in a single block will be processed by multiple mappers.
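The idea behind this suggestion can be sketched as follows: instead of one split per block, the input is cut into groups of N lines, and each group goes to its own mapper. This is a minimal illustration in plain Python, not the Hadoop class itself. The function name `n_line_splits` is hypothetical. In Hadoop, N is set through the job configuration (I believe the property is mapreduce.input.lineinputformat.linespermap, but check the documentation for your version).

```python
from typing import List, Sequence


def n_line_splits(lines: Sequence[str], n: int) -> List[List[str]]:
    """Group the input lines into consecutive splits of at most n lines each."""
    if n <= 0:
        raise ValueError("n must be a positive number of lines per split")
    return [list(lines[i:i + n]) for i in range(0, len(lines), n)]


if __name__ == "__main__":
    # About 1,000 records, as in the question; far less than one HDFS block.
    records = [f"record-{i}" for i in range(1000)]
    splits = n_line_splits(records, 100)
    # Each split would be handed to a separate map task.
    print(len(splits), "map tasks,", len(splits[0]), "records each")
```

A smaller N gives more map tasks and more parallelism, at the cost of more per-task startup overhead, so for a computation-heavy job with few records a fairly small N is usually reasonable.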
The other option is to split your one input file into multiple input files (in the one input path directory). Each of those input files can then be spread across HDFS, and the map operations will run on the worker machines that hold those input splits.
