Is it possible to write the output of a MapReduce job directly into the data that was the input of that job?
Thanks!
I suppose your job does something new, even dealing with the same data (maybe it takes care of the timestamp of execution or it accesses an external service).
You could try to set the same path for both input and output data:
FileInputFormat.addInputPath(job, new Path(configuration.get("/path/to/data")));
FileOutputFormat.setOutputPath(job, new Path(configuration.get("/path/to/data")));
Since the mappers write their data onto a temp directory, it could work (caveat: I never tried to do that!).
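For what it's worth, Hadoop normally rejects an output folder that already exists (as noted elsewhere on this page), so a safer variation is to write to a separate location and swap it in once the job has succeeded. Below is a minimal sketch of that swap. It uses the local filesystem through java.nio.file as a stand-in for HDFS, and the class name, file names and the upper-casing "job logic" are all made up for illustration:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical local-filesystem stand-in for "write the output elsewhere,
// then swap it in". On a cluster the rename would go through the Hadoop
// FileSystem API rather than java.nio.file.
class StagedRewrite {

    // Reads `data`, writes the transformed lines to a sibling staging file,
    // and replaces `data` only after the staging file is fully written.
    static void rewriteInPlace(Path data) {
        Path staging = data.resolveSibling(data.getFileName() + ".staging");
        try {
            List<String> out = new ArrayList<>();
            for (String line : Files.readAllLines(data, StandardCharsets.UTF_8)) {
                out.add(line.toUpperCase(Locale.ROOT)); // placeholder for the real job logic
            }
            Files.write(staging, out, StandardCharsets.UTF_8);
            // The input stays untouched until this single move succeeds.
            Files.move(staging, data, StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Small self-check: returns the file content after one rewrite.
    static List<String> demo() {
        try {
            Path dir = Files.createTempDirectory("staged-rewrite");
            Path data = dir.resolve("data.txt");
            Files.write(data, Arrays.asList("a", "b"), StandardCharsets.UTF_8);
            rewriteInPlace(data);
            return Files.readAllLines(data, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The point of the staging file is that a failed run leaves the original data as it was.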
As you mentioned, you don't want duplicate data, which suggests you don't actually need to perform any operation or analysis on the data in DFS. MapReduce is only needed when analyzing data, so in that case you can simply read the existing data from the same location repeatedly.
Note: If you are using a language like Pig/Hive, you need to keep a copy/history of the previous data, because Pig/Hive require you to clear the location before/after processing. The history location can then be used to recall the same data again. :)
Let's imagine I want to store a large number of URLs with associated metadata
URL => Metadata
in a file
hdfs://db/urls.seq
I would like this file to grow (if new URLs are found) after every run of MapReduce.
Would that work with Hadoop? As I understand it, MapReduce outputs data to a new directory. Is there any way to take that output and append it to the file?
The only idea that comes to mind is to create a temporary urls.seq and then replace the old one. It works, but it feels wasteful. Also, from my understanding Hadoop likes the "write once" approach, and this idea seems to be in conflict with that.
As blackSmith has explained, you can append to an existing file in HDFS, but it would hurt performance because HDFS is designed around a "write once" strategy. My suggestion is to avoid this approach unless no other option is left.
One approach to consider is writing a new file for every MapReduce output. If each output is large enough, this technique works best, because writing a new file does not hurt performance the way appending does. And if the next MapReduce job reads the output of the previous one, reading a new file won't affect performance as much as appending would.
So there is a trade-off: it depends on whether you want performance or simplicity.
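To make the "one new file per run" idea concrete, here is a rough sketch. It assumes a tab-separated URL/metadata text layout and run files named urls-00001.tsv, urls-00002.tsv and so on; those names, the UrlRuns class, and the use of the local filesystem instead of HDFS are all assumptions for illustration only:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: each run adds a fresh file, nothing is ever appended.
class UrlRuns {

    // Writes the URLs found by one run to a new file named after the run id.
    static Path writeRun(Path dbDir, int runId, Map<String, String> newUrls) {
        Path runFile = dbDir.resolve(String.format("urls-%05d.tsv", runId));
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, String> e : newUrls.entrySet()) {
            lines.add(e.getKey() + "\t" + e.getValue());
        }
        try {
            // CREATE_NEW fails instead of overwriting the file of an earlier run.
            return Files.write(runFile, lines, StandardCharsets.UTF_8,
                    StandardOpenOption.CREATE_NEW, StandardOpenOption.WRITE);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads every run file in run order; entries from later runs win.
    static Map<String, String> loadAll(Path dbDir) {
        List<Path> runs = new ArrayList<>();
        Map<String, String> all = new TreeMap<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dbDir, "urls-*.tsv")) {
            for (Path p : stream) {
                runs.add(p);
            }
            Collections.sort(runs); // zero-padded ids sort chronologically
            for (Path run : runs) {
                for (String line : Files.readAllLines(run, StandardCharsets.UTF_8)) {
                    int tab = line.indexOf('\t');
                    if (tab > 0) {
                        all.put(line.substring(0, tab), line.substring(tab + 1));
                    }
                }
            }
            return all;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Small self-check: two runs, each contributing one URL.
    static Map<String, String> demo() {
        try {
            Path dbDir = Files.createTempDirectory("url-runs");
            writeRun(dbDir, 1, Collections.singletonMap("http://a.example", "first"));
            writeRun(dbDir, 2, Collections.singletonMap("http://b.example", "second"));
            return loadAll(dbDir);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In a real job the reading side would simply be the next job's input pointing at the whole directory, so no merge step is needed between runs.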
(Anyway, Merry Christmas!)
(from a Hadoop newbie)
I want to avoid files where possible in a toy Hadoop proof-of-concept example. I was able to read data from non-file-based input (thanks to http://codedemigod.com/blog/?p=120) - which generates random numbers.
I want to store the result in memory so that I can do some further (non-Map-Reduce) business logic processing on it. Essentially:
conf.setOutputFormat(InMemoryOutputFormat.class); // hypothetical output format
JobClient.runJob(conf);
Map result = conf.getJob().getResult(); // ?
The closest thing that seems to do what I want is store the result in a binary file output format and read it back in with the equivalent input format. That seems like unnecessary code and unnecessary computation (am I misunderstanding the premises which Map Reduce depends on?).
The problem with this idea is that Hadoop has no notion of "distributed memory". If you want the result "in memory" the next question has to be "which machine's memory?" If you really want to access it like that, you're going to have to write your own custom output format, and then also either use some existing framework for sharing memory across machines, or again, write your own.
My suggestion would be to simply write to HDFS as normal, and then for the non-MapReduce business logic just start by reading the data from HDFS via the FileSystem API, i.e.:
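FileSystem fs = new JobClient(conf).getFs();
// If /foo/bar is the job's output directory, open the part files inside it
// rather than the directory itself.
Path outputPath = new Path("/foo/bar");
FSDataInputStream in = fs.open(outputPath);
// read data and store in memory
in.close();
// remove the output once it has been loaded
fs.delete(outputPath, true);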
Sure, it does some unnecessary disk reads and writes, but if your data is small enough to fit in-memory, why are you worried about it anyway? I'd be surprised if that was a serious bottleneck.
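As a rough illustration of the "read it back into memory" step, the sketch below loads every part file of an output directory into a map. It assumes plain text output with one key, a tab, and a numeric value per line, and it reads from the local filesystem with java.nio.file purely as a stand-in for the HDFS FileSystem API; the PartFileLoader name and the sample data are invented:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: turns "key<TAB>number" part files into an in-memory map.
class PartFileLoader {

    static Map<String, Long> load(Path outputDir) {
        Map<String, Long> result = new HashMap<>();
        // The glob skips marker files such as _SUCCESS.
        try (DirectoryStream<Path> parts = Files.newDirectoryStream(outputDir, "part-*")) {
            for (Path part : parts) {
                for (String line : Files.readAllLines(part, StandardCharsets.UTF_8)) {
                    int tab = line.indexOf('\t');
                    if (tab < 0) {
                        continue; // not a key/value line
                    }
                    result.put(line.substring(0, tab),
                            Long.parseLong(line.substring(tab + 1).trim()));
                }
            }
            return result;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Small self-check with two fake part files and an empty marker file.
    static Map<String, Long> demo() {
        try {
            Path dir = Files.createTempDirectory("job-output");
            Files.write(dir.resolve("part-00000"), Arrays.asList("apple\t3"), StandardCharsets.UTF_8);
            Files.write(dir.resolve("part-00001"), Arrays.asList("pear\t5"), StandardCharsets.UTF_8);
            Files.write(dir.resolve("_SUCCESS"), new byte[0]);
            return load(dir);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Once the map is built, the non-MapReduce business logic can work on it like any other Java collection.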
When I try to run a Pig script that has two "store" statements writing to the same file, like this:
store Alert_Message_Count into 'out';
store Warning_Message_Count into 'out';
it hangs; that is, it does not proceed after showing 50% done.
Is this wrong? Can't we store both results in the same file (folder)?
HDFS does not have append mode. So in most cases where you are running map-reduce programs, the output file is opened once, data is written, and then it is closed. Given this approach, you cannot write data to the same file simultaneously.
Try writing to separate files and check whether the map-reduce programs still hang. If they do, then there are some other issues.
You can obtain the result and the map-reduce logs to analyze what went wrong.
[Edit:]
You cannot write to the same file or append to an existing file. The HDFS Append feature is a work in progress.
To work around this you can do two things:
1) If you have the same schema content in both Alert_Message_Count and Warning_Message_Count, you could use union as suggested by Chris.
2) Do post-processing when the schema is not the same. That is, write a map-reduce program to merge the two separate outputs into one.
Normally Hadoop MapReduce won't allow you to save job output to a folder that already exists, so I would guess that this isn't possible either (seeing as Pig translates the commands into a series of M/R steps), but I would expect some form of error message rather than it just hanging.
If you open the cluster job tracker, and look at the logs for the task, does the log yield anything of note which can help diagnose this further?
It might also be worth checking with the Pig mailing lists (if you haven't already).
If you want to append one dataset to another, use the union keyword:
grunt> All_Count = UNION Alert_Message_Count, Warning_Message_Count;
grunt> store All_Count into 'out';
For records that meet certain criteria, we want the mapper to do all the work and write its output to HDFS; we don't want that data transmitted to the reducer (it would use extra bandwidth; please correct me if there are cases where that is wrong).
Pseudocode would be:
def mapper(k, v_list):
    for v in v_list:
        if criteria:
            write to HDFS
        else:
            emit
I found it hard because the only thing we can play with is OutputCollector.
One thing I can think of is to extend OutputCollector, override OutputCollector.collect and do the work there.
Is there a better way?
You can just set the number of reduce tasks to 0 by using JobConf.setNumReduceTasks(0). This will make the results of the mapper go straight into HDFS.
From the Map-Reduce manual: http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
Reducer NONE
It is legal to set the number of reduce-tasks to zero if no reduction is desired.
In this case the outputs of the map-tasks go directly to the FileSystem,
into the output path set by setOutputPath(Path). The framework does not sort
the map-outputs before writing them out to the FileSystem.
I'm assuming that you're using streaming, in which case there is no standard way of doing this.
It's certainly possible in a Java Mapper. For streaming you'd need to amend the PipeMapper Java file or, like you say, write your own output collector. But if you're going to that much trouble, you might as well just write a Java mapper.
Not sending something to the Reducer may not actually save bandwidth if you are still going to write it to HDFS. Data written to HDFS is still replicated to other nodes, so the replication traffic happens anyway.
There are other good reasons to write output from the mapper though. There is a FAQ about this, but it is a little short on details except to say that you can do it.
I found another question which is potentially a duplicate of yours here. That question has answers that are more helpful if you are writing a Mapper in Java. If you are trying to do this in a streaming way, you can just use the hadoop fs commands in scripts to do it.
We can in fact write output to HDFS and pass it on to the Reducer at the same time. I understand that you are using Hadoop Streaming; I've implemented something similar using Java MapReduce.
We can generate named output files from a Mapper or Reducer using MultipleOutputs. So, in your Mapper implementation, after all the business logic for processing input data, you can write the output to MultipleOutputs using multipleOutputs.write("NamedOutputFileName", OutputKey, OutputValue), and for the data you want to pass on to the reducer you can write to the context using context.write(OutputKey, OutputValue).
I think if you can find something to write the data from the mapper to a named output file in the language you are using (e.g. Python), this will definitely work.
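The routing decision itself is small. Here is a plain-Java sketch of it, with no Hadoop classes involved: the two lists merely stand in for the named output (multipleOutputs.write) and for the data passed on to the reducer (context.write), and the SideOutputRouter name is made up for this example:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical stand-in for a mapper that splits its records two ways.
class SideOutputRouter<V> {
    // Records matching the criteria; stands in for multipleOutputs.write(...).
    final List<V> sideOutput = new ArrayList<>();
    // Everything else; stands in for context.write(...), i.e. data for the reducer.
    final List<V> emitted = new ArrayList<>();

    private final Predicate<V> criteria;

    SideOutputRouter(Predicate<V> criteria) {
        this.criteria = criteria;
    }

    // Mirrors the pseudocode from the question: test each value, then route it.
    void map(Iterable<V> values) {
        for (V v : values) {
            if (criteria.test(v)) {
                sideOutput.add(v);
            } else {
                emitted.add(v);
            }
        }
    }
}
```

For example, new SideOutputRouter<Integer>(v -> v % 2 == 0) would send even numbers to the side output and leave the odd ones for the reducer.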
I hope this helps.
I am working on a MapReduce job that generates a CSV file out of some data that is read from HBase. Is there a way to write to a single file from the mappers without a reduce phase (or to merge the multiple files generated by the mappers at the end of the job)? I know that I can set the output format to write to a file at the Job level; is it possible to do a similar thing for mappers?
Thanks
It is possible (and not uncommon) to have a Map/Reduce-Job without a reduce phase (example). For that you just use job.setNumReduceTasks(0).
However, I am not sure how job output is handled in this case. Usually you get one result file per reducer. Without reducers I could imagine that you either get one file per mapper or that you cannot produce job output. You will have to try/research that.
If the above does not work for you, you could still use the default Reducer implementation, that just forwards the mapper output (identity function).
Seriously, this is not how MapReduce works.
Why do you even need a Job for that? Write a simple Java application that does the same for you. There are also command line utilities that do the same for you.
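On the command line, hadoop fs -getmerge is one such utility. As for the "simple Java application", a minimal sketch could look like the following. It assumes the part files have already been copied to a local directory and are named part-*, and it uses only java.nio.file; the PartMerger class and the file names in the self-check are hypothetical:

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: concatenates all part files of a job into one CSV file.
class PartMerger {

    static Path merge(Path outputDir, Path target) {
        List<Path> parts = new ArrayList<>();
        // The glob skips marker files such as _SUCCESS.
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(outputDir, "part-*")) {
            for (Path p : stream) {
                parts.add(p);
            }
            Collections.sort(parts); // keep a stable, name-based order
            try (OutputStream out = Files.newOutputStream(target)) {
                for (Path part : parts) {
                    Files.copy(part, out); // append the bytes of each part file
                }
            }
            return target;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Small self-check with two fake mapper outputs written out of order.
    static List<String> demo() {
        try {
            Path dir = Files.createTempDirectory("part-merger");
            Files.write(dir.resolve("part-m-00001"), Arrays.asList("b,2"), StandardCharsets.UTF_8);
            Files.write(dir.resolve("part-m-00000"), Arrays.asList("a,1"), StandardCharsets.UTF_8);
            Files.write(dir.resolve("_SUCCESS"), new byte[0]);
            Path merged = merge(dir, dir.resolve("merged.csv"));
            return Files.readAllLines(merged, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

This only works cleanly when the part files carry no per-file header row; with headers, every file after the first would need its first line skipped.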