I am working with a set of files from a directory, which is the output of another task. I need to process the content of each entire file at once (calculate MD5 checksums and do some transformations). I'm not sure what the signature of my Mapper should look like. If I declare it as
class MyMapper extends Mapper<LongWritable, Text, NullWritable, NullWritable> { ... }
then I will get the entire content of an input file in the map method. That content will be held in memory, but the files could be quite big.
Is there any way to avoid reading the complete "record" into memory for processing by the Hadoop map task, and instead get a "stream" for the record?
You actually don't need to worry about it. Hadoop is optimized to leverage all the resources of your cluster to do the job. The whole point of it is to abstract away the low-level details and let you focus on your use case.
I assure you Hadoop can handle your files. If they are really big and/or your cluster has less powerful or unreliable machines, then the jobs might take longer. But they won't fail (absent any other errors).
So I think your approach is fine. The only suggestion I would make is to consider avoiding canonical MapReduce because it isn't a high enough level of abstraction. Consider Cascading or JCascalog instead.
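For what it's worth, here is a minimal sketch of the whole-file Mapper described in the question, computing the MD5 with java.security.MessageDigest. It assumes an InputFormat that hands each map() call the entire file as one Text value (Hadoop does not ship one out of the box), and the output types are changed to Text/Text here so the checksum can actually be emitted:

import java.io.IOException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Assumes an InputFormat that delivers one whole file per map() call as the Text value
public class MyMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        try {
            // MD5 over the raw bytes of the record (the whole file content)
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            md5.update(value.getBytes(), 0, value.getLength());
            StringBuilder hex = new StringBuilder();
            for (byte b : md5.digest()) {
                hex.append(String.format("%02x", b));
            }
            // Further transformations of 'value' would go here
            context.write(new Text("md5"), new Text(hex.toString()));
        } catch (NoSuchAlgorithmException e) {
            throw new IOException(e);
        }
    }
}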
Related
I am using Hadoop 1.0.3. Can the input split/block be changed (increased/decreased) at run time based on some constraint? Is there a class to override to accomplish this, like FileSplit/TextInputFormat? Can we have variable-size blocks in HDFS depending on a logical constraint in one job?
You're not limited to TextInputFormat... That's entirely configurable based on the data source you are reading. Most examples are line-delimited plain text, but that obviously doesn't work for XML, for example.
No, block boundaries can't change at runtime, as your data should already be on disk and ready to read.
The InputSplit, however, depends on the InputFormat for the given job, which should remain consistent throughout a particular job. The Configuration object in the code, on the other hand, is basically a HashMap, which can certainly be changed while the job is running.
If you want to change the block size only for a particular run or application, you can do so by passing "-D dfs.block.size=134217728". This changes the block size for your application instead of changing the overall block size in hdfs-site.xml.
-D dfs.block.size=134217728
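For the -D option to be picked up from the command line, the driver has to go through ToolRunner/GenericOptionsParser. A rough sketch, with class, jar, and path names invented for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Hypothetical driver: ToolRunner parses generic options such as
// "-D dfs.block.size=134217728" and puts them into the job Configuration.
public class MyDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        // Reflects the value passed with -D on the command line, if any
        System.out.println("dfs.block.size = " + conf.get("dfs.block.size"));
        // ... configure and submit the job with this conf ...
        return 0;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new MyDriver(), args));
    }
}

It would be invoked as something like hadoop jar myjob.jar MyDriver -D dfs.block.size=134217728 /input /output (jar and paths made up).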
Let's imagine I want to store a large number of URLs with associated metadata
URL => Metadata
in a file
hdfs://db/urls.seq
I would like this file to grow (if new URLs are found) after every run of MapReduce.
Would that work with Hadoop? As I understand it, MapReduce outputs data to a new directory. Is there any way to take that output and append it to the file?
The only idea which comes to my mind is to create a temporary urls.seq and then replace the old one. It works, but it feels wasteful. Also, from my understanding, Hadoop likes the "write once" approach, and this idea seems to be in conflict with that.
As blackSmith has explained, you can easily append to an existing file in HDFS, but it will bring down your performance because HDFS is designed around a "write once" strategy. My suggestion is to avoid this approach until you have no other option.
One approach you may consider is to make a new file for every MapReduce output. If each output is large enough, this technique will benefit you most, because writing a new file doesn't hurt performance the way appending does. And if you are reading the output of each MapReduce job in the next one, reading a new file won't affect your performance as much as appending would either.
So there is a trade-off; it depends on whether you want performance or simplicity.
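If you go with a new file per run, the next job can simply take all of the per-run outputs as input. A rough sketch with the old (mapred) API, with the directory layout invented for illustration:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;

public class UrlJobInput {
    // Hypothetical helper: point the next job at every per-run output directory,
    // e.g. /db/urls/run-0001, /db/urls/run-0002, ... written by earlier runs.
    public static void addRunOutputs(JobConf conf) {
        conf.setInputFormat(SequenceFileInputFormat.class);
        // A glob picks up all runs without ever appending to an existing file
        FileInputFormat.setInputPaths(conf, new Path("/db/urls/run-*"));
    }
}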
(Anyways, Merry Christmas!)
I have a few tens of full-sky maps, in binary format (FITS), of about 600 MB each.
For each sky map I already have a catalog of the positions of a few thousand sources, i.e. stars, galaxies, radio sources.
For each source I would like to:
open the full sky map
extract the relevant section, typically 20MB or less
run some statistics on them
aggregate the outputs to a catalog
I would like to run Hadoop, possibly using Python via the streaming interface, to process them in parallel.
I think the input to the mapper should be each record of the catalogs;
then the Python mapper can open the full sky map, do the processing, and print the output to stdout.
Is this a reasonable approach?
If so, I need to be able to configure Hadoop so that a full sky map is copied locally to the nodes that are processing one of its sources. How can I achieve that?
Also, what is the best way to feed the input data to Hadoop? For each source I have a reference to the full sky map, a latitude, and a longitude.
Though it doesn't sound like a few tens of your sky maps make up a very big data set, I've used Hadoop successfully as a simple way to write distributed applications/scripts.
For the problem you describe, I would try implementing a solution with Pydoop, and specifically Pydoop Script (full disclosure: I'm one of the Pydoop developers).
You could set up a job that takes as input the list of sections of the sky map that you want to process, serialized in some sort of text format with one record per line. Each map task should process one of these; you can achieve this split easily with the standard NLineInputFormat.
You don't need to copy the sky map locally to all the nodes as long as the map tasks can access the file system on which it is stored. Using the pydoop.hdfs module, the map function can read the section of the sky map that it needs to process (given the coordinates it received as input) and then emit the statistics as you were saying so that they can be aggregated in the reducer. pydoop.hdfs can read from both "standard" mounted file systems and HDFS.
Though the problem domain is totally unrelated, this application may serve as an example:
https://github.com/ilveroluca/seal/blob/master/seal/dist_bcl2qseq.py#L145
It uses the same strategy, preparing a list of "coordinates" to be processed, serializing them to a file, and then launching a simple pydoop job that takes that file as input.
Hope that helps!
(from a Hadoop newbie)
I want to avoid files where possible in a toy Hadoop proof-of-concept example. I was able to read data from a non-file-based input (thanks to http://codedemigod.com/blog/?p=120), which generates random numbers.
I want to store the result in memory so that I can do some further (non-MapReduce) business logic processing on it. Essentially:
conf.setOutputFormat(InMemoryOutputFormat)
JobClient.runJob(conf);
Map result = conf.getJob().getResult(); // ?
The closest thing that seems to do what I want is to store the result in a binary file output format and read it back in with the equivalent input format. That seems like unnecessary code and unnecessary computation (am I misunderstanding the premises that MapReduce depends on?).
The problem with this idea is that Hadoop has no notion of "distributed memory". If you want the result "in memory" the next question has to be "which machine's memory?" If you really want to access it like that, you're going to have to write your own custom output format, and then also either use some existing framework for sharing memory across machines, or again, write your own.
My suggestion would be to simply write to HDFS as normal, and then for the non-MapReduce business logic just start by reading the data from HDFS via the FileSystem API, i.e.:
FileSystem fs = new JobClient(conf).getFs();
Path outputPath = new Path("/foo/bar");
FSDataInputStream in = fs.open(outputPath);
// read data and store in memory
fs.delete(outputPath, true);
Sure, it does some unnecessary disk reads and writes, but if your data is small enough to fit in-memory, why are you worried about it anyway? I'd be surprised if that was a serious bottleneck.
Under certain criteria we want the mapper to do all the work and output directly to HDFS; we don't want the data transmitted to the reducer (that would use extra bandwidth; please correct me if there is a case where that's wrong).
Pseudo-code would be:
def mapper(k, v_list):
    for v in v_list:
        if criteria:
            write to HDFS
        else:
            emit
I find this hard because the only thing we can play with is the OutputCollector.
One thing I can think of is to extend OutputCollector, override OutputCollector.collect, and do the work there.
Is there any better ways?
You can just set the number of reduce tasks to 0 by using JobConf.setNumReduceTasks(0). This will make the results of the mapper go straight into HDFS.
From the Map-Reduce manual: http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
Reducer NONE
It is legal to set the number of reduce-tasks to zero if no reduction is desired.
In this case the outputs of the map-tasks go directly to the FileSystem,
into the output path set by setOutputPath(Path). The framework does not sort
the map-outputs before writing them out to the FileSystem.
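Putting that together, a rough map-only driver with the old (mapred) API; IdentityMapper and the argument paths are just placeholders for your own mapper and directories:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;

public class MapOnlyJob {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(MapOnlyJob.class);
        conf.setJobName("map-only");
        conf.setMapperClass(IdentityMapper.class);   // stand-in for your real mapper
        conf.setOutputKeyClass(LongWritable.class);  // matches TextInputFormat's keys
        conf.setOutputValueClass(Text.class);
        conf.setNumReduceTasks(0);                   // zero reducers: map output goes straight to HDFS, unsorted
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}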
I'm assuming that you're using streaming, in which case there is no standard way of doing this.
It's certainly possible in a Java Mapper. For streaming you'd need to amend the PipeMapper java file, or, like you say, write your own output collector - but if you're going to that much trouble, you might as well just write a Java mapper.
Not sending something to the Reducer may not actually save bandwidth if you are still going to write it to HDFS. HDFS is still replicated to other nodes, and the replication is going to happen.
There are other good reasons to write output from the mapper though. There is a FAQ about this, but it is a little short on details except to say that you can do it.
I found another question which is potentially a duplicate of yours here. That question has answers that are more helpful if you are writing a Mapper in Java. If you are trying to do this in a streaming way, you can just use the hadoop fs commands in scripts to do it.
We can in fact write output to HDFS and pass it on to the Reducer at the same time. I understand that you are using Hadoop Streaming; I've implemented something similar using Java MapReduce.
We can generate named output files from a Mapper or Reducer using MultipleOutputs. So, in your Mapper implementation, after all the business logic for processing the input data, you can write the output to MultipleOutputs using multipleOutputs.write("NamedOutputFileName", outputKey, outputValue), and for the data you want to pass on to the reducer you can write to the context using context.write(outputKey, outputValue).
I think if you can find something to write the data from the mapper to a named output file in the language you are using (e.g. Python), this will definitely work.
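A rough Java sketch of that pattern with the new (mapreduce) API; the named output "filtered", the placeholder keys, and the criterion are all invented for illustration, and the driver also has to register the named output via MultipleOutputs.addNamedOutput:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

// Hypothetical mapper: records matching the criterion go straight to a named
// output file in the job's output directory; everything else is shuffled to the reducer.
public class SplitMapper extends Mapper<LongWritable, Text, Text, Text> {

    private MultipleOutputs<Text, Text> multipleOutputs;

    @Override
    protected void setup(Context context) {
        multipleOutputs = new MultipleOutputs<Text, Text>(context);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        if (value.toString().startsWith("#")) {            // placeholder criterion
            // Lands in files named "filtered-m-*" under the job output path
            multipleOutputs.write("filtered", new Text("filtered"), value);
        } else {
            // Normal path: this pair is passed on to the reducer
            context.write(new Text("kept"), value);
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        multipleOutputs.close();   // flush the named output files
    }
}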
I hope this helps.