Write to a single file from multiple reducers in Hadoop

I am trying to run K-means using Hadoop. I want to save the centroids of the clusters, calculated in the cleanup method of the Reducer, to some file, say centroids.txt. Now, I would like to know what will happen if multiple reducers' cleanup methods start at the same time and all of them try to write to this file simultaneously. Will it be handled internally? If not, is there a way to synchronize this task?
Note that this is not the reducer's output file; it is an additional file that I am maintaining to keep track of the centroids. I am using a BufferedWriter from the reducer's cleanup method to do this.
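For context, here is a minimal sketch of the setup described above, assuming the new (org.apache.hadoop.mapreduce) API; the output path, key/value types, and centroid bookkeeping are all hypothetical. Note that, as written, every reducer targets the same HDFS path, which is exactly the conflict in question.

import java.io.BufferedWriter;
import java.io.IOException;
import java.io.OutputStreamWriter;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class KMeansReducer extends Reducer<Text, Text, Text, Text> {
  // Accumulated by reduce() as clusters are processed (reduce() omitted here).
  private final StringBuilder centroids = new StringBuilder();

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    FileSystem fs = FileSystem.get(context.getConfiguration());
    // Every reducer writes to this one path: the problematic shared file.
    Path out = new Path("/user/me/centroids.txt");
    try (BufferedWriter writer =
             new BufferedWriter(new OutputStreamWriter(fs.create(out, true)))) {
      writer.write(centroids.toString());
    }
  }
}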

Yes, you are right: you cannot achieve that with the existing framework. cleanup() is called once in every reducer, and you cannot synchronize across reducers. Possible approaches you can follow are:
1. Merge the output files after the job completes successfully:
hadoop fs -getmerge <src> <localdst> [addnl]
2. Clearly specify where your output file(s) should go, and use that folder as input to your next job.
3. Chain one more MR job whose map and reduce don't change the data and whose partitioner assigns all records to a single reducer (see the sketch after this list).
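For option 3, a minimal sketch, assuming the new (org.apache.hadoop.mapreduce) API and hypothetical input/output paths. The stock Mapper and Reducer classes are identity implementations, so the data passes through unchanged, and forcing a single reducer leaves exactly one output file:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MergeCentroids {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "merge-centroids");
    job.setJarByClass(MergeCentroids.class);

    // Stock Mapper/Reducer are identity: records pass through untouched.
    job.setMapperClass(Mapper.class);
    job.setReducerClass(Reducer.class);

    // One reducer => one part file containing every centroid.
    job.setNumReduceTasks(1);

    job.setInputFormatClass(KeyValueTextInputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);

    FileInputFormat.addInputPath(job, new Path("/user/me/kmeans-out"));      // hypothetical
    FileOutputFormat.setOutputPath(job, new Path("/user/me/centroids-out")); // hypothetical
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}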

Each reducer writes to a separate file. Multiple reducers can never modify the same file.

Since the centroids are relatively few, you can write them into ZooKeeper. If you had a high read/write load you would probably need HBase (which you could also use here, but it would be overkill).
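A minimal sketch of the ZooKeeper route, assuming a quorum reachable at zk-host:2181 and a pre-created /centroids parent znode (both hypothetical). Each reducer creates its own sequential child, so no cross-reducer locking is needed:

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class CentroidStore {
  // Called from the reducer's cleanup() for each final centroid.
  public static void publish(double[] centroid) throws Exception {
    ZooKeeper zk = new ZooKeeper("zk-host:2181", 30000, event -> { });
    byte[] data = java.util.Arrays.toString(centroid).getBytes("UTF-8");
    // PERSISTENT_SEQUENTIAL yields a unique path per write, e.g. /centroids/c0000000007.
    zk.create("/centroids/c", data, ZooDefs.Ids.OPEN_ACL_UNSAFE,
              CreateMode.PERSISTENT_SEQUENTIAL);
    zk.close();
  }
}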
Also note that there are several k-means implementations on Hadoop, such as Mahout. Some of these are built on frameworks more efficient than MapReduce, like Apache Hama (which uses BSP) or Spark (which runs in-memory).

Related

Is one map per line in the Hadoop grep example justified?

I'm a Hadoop newbie. While going through the Hadoop examples for a similar implementation on a rather large cluster, I was wondering: in the grep example that comes along with the Hadoop code, why do they have one map per line?
I know that it makes sense from the perspective of a teaching example. But in a real Hadoop cluster, where a grep is to be done at industry scale (1 PB of log files), is it worth creating a map() per line? Is the overhead of creating a map(), the tasktracker keeping track of it, and the associated bandwidth usage justified if we create a map per line?
A separate map task will not be created for every line; you are confusing the programming model of MapReduce with its execution model.
When you implement a mapper, you are implementing a function that operates on a single piece of data (let's say a line in a log file). The Hadoop framework takes care of essentially looping over all your log files, reading each line, and passing that line into your mapper.
MapReduce lets you write your code against a useful abstraction, and a line in a log file is a good example. The advantage of using something like Hadoop is that it takes care of parallelizing this code for you: it distributes your program to a set of processes that execute it (TaskTrackers), and those TaskTrackers read chunks of data from the HDFS nodes that store it (DataNodes).
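To make the distinction concrete, here is a minimal grep-style mapper sketch (the hard-coded "ERROR" pattern is a placeholder; the real grep example reads its regex from the job configuration). map() is invoked once per line, but one map task handles an entire input split:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class GrepLineMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    // The framework calls this once per line of the split; no new task is spawned per line.
    if (line.toString().contains("ERROR")) {
      context.write(new Text("ERROR"), ONE);
    }
  }
}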

Time spent by a Hadoop MapReduce mapper task to read input files from HDFS or S3

I am running a Hadoop MapReduce job that takes its input files from HDFS or Amazon S3. I am wondering if it's possible to measure how long it takes a mapper task to read its input from HDFS or S3. I'd like to know the time spent just reading the data, not including the mapper's processing of that data. The result I am looking for is something like MB/second for a given mapper task, which indicates how fast the mapper can read from HDFS or S3; essentially an I/O performance figure.
Thanks.
Maybe you can just use a unit (identity) mapper and set the number of reducers to zero. Then the only thing done in your run is I/O; there will be no sorting and shuffling. Or, if you specifically want to focus on reading, you can replace the unit mapper with a mapper that doesn't write any output (a sketch follows below).
Next, I would set mapred.job.reuse.jvm.num.tasks=-1 to remove the JVM startup overhead. It isn't perfect, but it is probably the easiest way to get a quick idea. If you want to measure it precisely, I would consider implementing your own Hadoop counters, but currently I have no experience with that.
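A minimal sketch of that read-only measurement, assuming the new (org.apache.hadoop.mapreduce) API and hypothetical command-line paths: the mapper emits nothing and there are zero reducers, so there is no sort, shuffle, or reduce. Dividing a map task's HDFS_BYTES_READ counter by its run time then gives a rough MB/s figure.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ReadOnlyBenchmark {
  public static class DiscardMapper
      extends Mapper<LongWritable, Text, NullWritable, NullWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context) {
      // The record has already been read from HDFS/S3; nothing is emitted or computed.
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "read-only-benchmark");
    job.setJarByClass(ReadOnlyBenchmark.class);
    job.setMapperClass(DiscardMapper.class);
    job.setNumReduceTasks(0);                 // map-only: no shuffle, no sort, no reduce
    job.setOutputKeyClass(NullWritable.class);
    job.setOutputValueClass(NullWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS or s3:// input
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // will contain empty part files
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}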

Order of execution / priority of Hadoop map tasks

I have ~5000 entries in my Hadoop input file, but I know in advance that some of the lines will take much longer to process than others (in the map stage).
(Mainly because I need to download a file from Amazon S3, and the size of the file will vary between tasks)
I want to make sure that the biggest map tasks are processed first, to make sure that all my hadoop nodes will finish working roughly at the same time.
Is there a way to do that with Hadoop? Or do I need to rework the whole thing? (I am new to Hadoop)
Thanks!
Well, if you implement your own custom InputFormat (the getSplits() method contains the logic for split creation), then theoretically you could achieve what you want.
BUT you have to take special care, because the order in which the InputFormat returns the splits is not the order in which Hadoop will process them.
There is split re-ordering code inside the JobClient:
// sort the splits into order based on size, so that the biggest
// go first
Arrays.sort(array, new NewSplitComparator());
which will make the whole thing more tricky.
But you could implement a custom InputFormat plus a custom InputSplit and make InputSplit#getLength() depend on the split's expected execution time (a sketch follows below).
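A minimal sketch of such a split, where getLength() reports an estimated cost (here a hypothetical S3 object size) instead of the on-disk split size, so the size-based sort in the JobClient schedules the expensive splits first. A matching custom InputFormat would have to build these splits and supply a RecordReader for them:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.InputSplit;

public class WeightedSplit extends InputSplit implements Writable {
  private Text s3Key = new Text();   // which S3 object this split should download
  private long estimatedCost;        // e.g. the object's size in bytes

  public WeightedSplit() { }         // no-arg constructor required for deserialization

  public WeightedSplit(String key, long cost) {
    this.s3Key = new Text(key);
    this.estimatedCost = cost;
  }

  @Override
  public long getLength() {
    return estimatedCost;            // the largest "length" is scheduled first
  }

  @Override
  public String[] getLocations() {
    return new String[0];            // no data locality for remote S3 objects
  }

  @Override
  public void write(DataOutput out) throws IOException {
    s3Key.write(out);
    out.writeLong(estimatedCost);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    s3Key.readFields(in);
    estimatedCost = in.readLong();
  }
}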

Synchronize data to HBase/HDFS and use it as input to MapReduce job

I would like to synchronize data to a Hadoop filesystem. This data is intended to be used as input for a scheduled MapReduce job.
This example might explain more:
Let's say I have an input stream of documents which contain a bunch of words, and these words are needed as input for a MapReduce WordCount job. So, for each document, all words should be parsed out and uploaded to the filesystem. However, if the same document arrives from the input stream again, I only want the changes to be uploaded to (or deleted from) the filesystem.
How should the data be stored; should I use HDFS or HBase? The amount of data is not very large, maybe a couple of GB.
Is it possible to start scheduled MapReduce jobs with input from HDFS and/or HBase?
I would first pick the best tool for the job, or do some research to make a reasonable choice. You're asking the question, which is the most important step. Given the amount of data you're planning to process, Hadoop is probably just one option. If this is the first step towards bigger and better things, then that would narrow the field.
I would then start off with the simplest approach that I expect to work, which typically means using the tools I already know. Write code flexibly to make it easier to replace original choices with better ones as you learn more or run into roadblocks. Given what you've stated in your question, I'd start off by using HDFS, using the Hadoop command-line tools to push the data to an HDFS folder (hadoop fs -put ...). Then I'd write an MR job or jobs to do the processing, running them manually. Once that was working, I'd probably use cron to handle scheduling of the jobs.
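If the ingest step ends up scripted in Java rather than in the shell, here is a minimal sketch of the same push using the FileSystem API (both paths are hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PushToHdfs {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);
    // Equivalent of: hadoop fs -put /local/docs/words.txt /user/me/wordcount-input/
    fs.copyFromLocalFile(new Path("/local/docs/words.txt"),
                         new Path("/user/me/wordcount-input/"));
    fs.close();
  }
}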
That's a place to start. As you build the process, if you reach a point where HBase seems like a natural fit for what you want to store, then switch over to that. Solve one problem at a time, and that will give you clarity on which tools are the right choice each step of the way. For example, you might get to the scheduling step and know by that time that cron won't do what you need - perhaps your organization has requirements for job scheduling that cron won't fulfil. So, you pick a different tool.

Writing to the same file from two mappers

In Hadoop MR (basically HDFS), is it possible to write to the same file from two mappers belonging to a single job, in a synchronous/serialized fashion?
And is it possible to write to a single file from two mappers running in different jobs, again in a serialized fashion?
There are semaphores in other filesystems. What is the mechanism in HDFS?
There is no communication between the map tasks in Hadoop, so any kind of synchronization between them is not possible.
A file in HDFS can be written by only a single writer at a time, while many readers can read it.
I think MapR allows multiple writers to the same file.
FYI, an HDFS file can only be appended to at its end; modifications at an arbitrary offset are not possible.
Just curious, what is the use case for multiple map tasks writing to a single file?
Set the number of reducers to 1 (mapred.reduce.tasks=1).
