How to report a value back to the driver from mapper? - hadoop

In my Hadoop application I need to report a value (let's say the time when a mapper finishes processing) back to the driver program. How can I do that?

You may be able to get such information by looking at the various reports Hadoop generates for any MapReduce job.
In general, however, you can pass information back to the driver using counters. In your mapper you can do something like:
context.getCounter("records", "last_seen").setValue(System.currentTimeMillis());
and then read it from the driver as:
job.getCounters().getGroup("records").findCounter("last_seen").getValue();
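The snippets above use Hadoop's Java counter API. If you are on Hadoop Streaming instead, the same kind of counter update can be emitted from a Python mapper by writing a reporter:counter: line to standard error; a minimal sketch, with group and counter names chosen only for illustration:

```python
import sys
import time

def report_counter(group, counter, amount, err=sys.stderr):
    """Emit a Hadoop Streaming counter update on stderr."""
    err.write("reporter:counter:%s,%s,%d\n" % (group, counter, amount))

def mapper(lines, out=sys.stdout, err=sys.stderr):
    for line in lines:
        out.write(line)  # pass records through unchanged
    # report the time this mapper finished, mirroring the Java example
    report_counter("records", "last_seen", int(time.time() * 1000), err)

if __name__ == "__main__":
    mapper(sys.stdin)
```

Note that streaming counters can only be incremented, so "last_seen" here accumulates rather than being set; the Java setValue API is the cleaner fit for a timestamp.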

Related

How can Spark take input after it is submitted

I am designing an application which requires very fast responses and needs to retrieve and process a large volume of data (>40 GB) from the Hadoop file system, given one input (a command).
I am wondering whether it is possible to cache such a large amount of data in distributed memory using Spark, and keep the application running all the time. When I give the application a command, it would start processing data based on that input.
I think caching that much data is not a problem. However, how can I keep the application running and accepting input?
As far as I know, nothing can be done after the "spark-submit" command...
You can try Spark Job Server and Named Objects to cache the dataset in distributed memory and use it across different input commands.
The requirement is not entirely clear, but based on my understanding:
1) With spark-submit you can provide application-specific command-line arguments after the application JAR. But if you want to send commands after the job has started, you can write a Spark Streaming job that processes Kafka messages.
2) HDFS is already optimised for processing large volumes of data. You can cache intermediate reusable data so it does not get recomputed. But for better performance you might consider something like Elasticsearch/Cassandra, so that data can be fetched and stored even faster.
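To illustrate the "keep running and take input" part independently of Spark: the driver can simply be a long-lived process that reads commands from a queue (or stdin) and dispatches each one against data it has already cached. A minimal stand-alone sketch; the command names and the dict standing in for the cached dataset are invented for illustration:

```python
def run_command(cmd, cache):
    """Dispatch one command against an in-memory cache of datasets."""
    parts = cmd.strip().split()
    if not parts:
        return None
    if parts[0] == "count":          # e.g. "count logs"
        return len(cache.get(parts[1], []))
    if parts[0] == "sum":            # e.g. "sum metrics"
        return sum(cache.get(parts[1], []))
    return "unknown command"

def serve(commands, cache):
    """Process commands until 'quit'; a real service would loop forever."""
    results = []
    for cmd in commands:
        if cmd.strip() == "quit":
            break
        results.append(run_command(cmd, cache))
    return results
```

With Spark Job Server, the cache would be a Named Object holding the cached RDD/DataFrame, and each command would arrive as a job request rather than a line of input.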

Mesos & Hadoop: How to get the running job input data size?

I'm running Hadoop 1.2.1 on top of Mesos 0.14. My goal is to log the input data size, running time, cpu usage, memory usage, and so on for optimization purposes later. All of these but data size are obtained using Sigar.
Is there any way I can get the input data size of any job which is running?
For example, when I'm running hadoop example's terasort, I need to get the teragen's generated data size before the job actually runs. If I'm running Wordcount example, I need to get the wordcount input file size. I need to get the data size automatically since I won't be able to know what job will be run inside this framework later.
I'm using Java to write some of the mesos library code. Preferably, I want to get the data size inside MesosExecutor class. For some reason, upgrading Hadoop/Mesos isn't an option.
Any suggestions or related API will be appreciated. Thank you.
Does hadoop fs -dus satisfy your requirement? Before submitting the job to Hadoop, calculate the input file size and pass it as a parameter to your executor.
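If you prefer to compute the size programmatically rather than shelling out to hadoop fs -dus (the Java HDFS API offers FileSystem.getContentSummary for the same purpose), the logic is just a recursive sum of file lengths. A local-filesystem sketch of that logic, for illustration only:

```python
import os

def total_input_size(path):
    """Recursively sum file sizes under path, like `hadoop fs -dus`."""
    if os.path.isfile(path):
        return os.path.getsize(path)
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total
```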

Java Vs Scripting for HDFS map/reduce

I am a DB person, so Java is new to me. I am looking for a scripting language for working with HDFS; maybe Python is what I am looking for. But in a previous question it was mentioned that the "heartbeat" between the NameNode and DataNodes will not happen if we use a scripting language. Why? I could not understand this. We write our application logic to process data in scripts or Java code, so how does that matter for the heartbeat?
Any ideas on this?
Python is a good choice for Hadoop if you already know how to code with it. I've used PHP and Perl with success. This part of the Hadoop framework is called Streaming.
As for the "heartbeat", I believe you are thinking of counters. They are user-defined "variables" that can only be incremented. Hadoop will terminate a task attempt if no counters are incremented for 10 minutes. However, you shouldn't worry about this, as there are system counters that are automatically incremented for you. If you do have a job that takes very long, you can still use counters with Python (Hadoop Streaming) by sending something like this to the standard error output:
reporter:counter:MyGroup,MyCounter,1
For more info on counters with Hadoop Streaming, see this.
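Putting the pieces together, a word-count style streaming mapper that periodically increments a counter as a keep-alive might look like this sketch (the group/counter names and the reporting interval are arbitrary):

```python
import sys

def mapper(lines, out=sys.stdout, err=sys.stderr, every=1000):
    """Word-count mapper that reports progress via a streaming counter."""
    for n, line in enumerate(lines, 1):
        for word in line.split():
            out.write("%s\t1\n" % word)
        if n % every == 0:
            # keep-alive: tell Hadoop this task is still making progress
            err.write("reporter:counter:MyGroup,MyCounter,1\n")

if __name__ == "__main__":
    mapper(sys.stdin)
```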

hadoop - How can I use data in memory as an input format?

I'm writing a MapReduce job, and I have input in memory that I want to pass to the mappers.
The usual method of passing input to the mappers is via HDFS - SequenceFileInputFormat or TextInputFormat. These input formats need files in HDFS, which are loaded and split among the mappers.
I can't find a simple way to pass, let's say, a List of elements to the mappers.
I find myself having to write these elements to disk and then use FileInputFormat.
Any solution?
I'm writing the code in Java, of course.
Thanks.
An input format does not have to load data from the disk or file system.
There are also input formats that read data from other systems, such as HBase's TableInputFormat (http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapred/TableInputFormat.html), where the data is not assumed to sit on disk. It is only assumed to be available via some API on all nodes of the cluster.
So you need to implement an input format that splits the data with your own logic (since there are no files, that is your own task) and chops the data into records.
Please note that your in-memory data source should be distributed and running on all nodes of the cluster. You will also need some efficient IPC mechanism to pass data from your process to the mapper process.
I would also be glad to know what use case leads to this unusual requirement.
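The custom input format's two jobs - producing splits and iterating the records within a split - boil down to logic like the following sketch (written in Python for brevity; a real implementation would subclass Hadoop's InputFormat and RecordReader in Java):

```python
def make_splits(data, num_splits):
    """Chop an in-memory list into roughly equal splits (getSplits)."""
    if num_splits <= 0:
        raise ValueError("num_splits must be positive")
    size = max(1, -(-len(data) // num_splits))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]

def read_records(split):
    """Yield (key, value) records from one split (the RecordReader's job)."""
    for offset, element in enumerate(split):
        yield offset, element
```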

how to output to HDFS from mapper directly?

Under certain criteria we want the mapper to do all the work and output directly to HDFS; we don't want the data transmitted to the reducer (it would use extra bandwidth - please correct me if there is a case where this is wrong).
a pseudo code would be:
def mapper(k, v_list):
    for v in v_list:
        if criteria:
            write to HDFS
        else:
            emit
I find it hard because the only thing we can play with is the OutputCollector.
One thing I can think of is to extend OutputCollector, override OutputCollector.collect, and do the work there.
Are there any better ways?
You can just set the number of reduce tasks to 0 by using JobConf.setNumReduceTasks(0). This will make the results of the mapper go straight into HDFS.
From the Map-Reduce manual: http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
Reducer NONE
It is legal to set the number of reduce-tasks to zero if no reduction is desired.
In this case the outputs of the map-tasks go directly to the FileSystem,
into the output path set by setOutputPath(Path). The framework does not sort
the map-outputs before writing them out to the FileSystem.
I'm assuming that you're using Streaming, in which case there is no standard way of doing this.
It's certainly possible in a Java Mapper. For Streaming you'd need to amend the PipeMapper Java file, or, as you say, write your own output collector - but if you're going to that much trouble, you might as well just write a Java mapper.
Not sending something to the reducer may not actually save bandwidth if you are still going to write it to HDFS. HDFS blocks are still replicated to other nodes, and that replication is going to happen regardless.
There are other good reasons to write output from the mapper though. There is a FAQ about this, but it is a little short on details except to say that you can do it.
I found another question which is potentially a duplicate of yours here. That question has answers that are more help if you are writing a Mapper in Java. If you are trying to do this in a streaming way, you can just use the hadoop fs commands in scripts to do it.
We can in fact write output to HDFS and pass it on to the reducer at the same time. I understand that you are using Hadoop Streaming; I've implemented something similar using Java MapReduce.
We can generate named output files from a mapper or reducer using MultipleOutputs. So, in your mapper implementation, after all the business logic for processing the input data, you can write the output to MultipleOutputs using multipleOutputs.write("NamedOutputFileName", OutputKey, OutputValue), and for the data you want to pass on to the reducer you can write to the context using context.write(OutputKey, OutputValue).
I think if you can find something to write data from the mapper to a named output file in the language you are using (e.g. Python), this will definitely work.
I hope this helps.
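In a streaming mapper, the routing logic from the pseudocode above amounts to partitioning each record: matching records go to a side output (which a wrapper script could then copy to HDFS with hadoop fs -put, or which MultipleOutputs would write directly in Java), and the rest are emitted to the reducer. A stand-alone sketch, with the criteria function left as a caller-supplied predicate:

```python
def route(records, matches):
    """Split records: matches go to a side output, the rest to the reducer."""
    side_output, to_reducer = [], []
    for key, value in records:
        if matches(key, value):
            side_output.append((key, value))   # would be written to HDFS directly
        else:
            to_reducer.append((key, value))    # would be emitted to stdout
    return side_output, to_reducer
```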
