Job-wide custom cleanup after all the map tasks are completed - hadoop

While running a map-reduce job that has only a mapper, I have a counter that counts the number of failed documents. After all the mappers are done, I want the job to fail if the total number of failed documents is above a fixed fraction. (I need to check at the end because I don't know the total number of documents initially.) How can I achieve this without implementing a reducer just for this?
I know that there is a task-level cleanup method, but is there any job-level cleanup method that can be used to perform this check after all the tasks are done?

This can be done quite easily with the newer mapreduce API.
The execution of a mapper can be controlled by overriding the run method of the Mapper class, and the same applies to the reducer. I do not know the exact outcome you are expecting, but I have prepared a small example.
In my mapper class I have overridden the run method; in the sample below, it interrupts execution when the key value is greater than 200.
public class ReversingMapper extends Mapper<LongWritable, Text, ReverseIntWritable, Text> {

    public final LongWritable border = new LongWritable(100);

    @Override
    public void run(Context context) throws IOException, InterruptedException {
        setup(context);
        while (context.nextKeyValue()) {
            /* extra code added to the standard run method starts here */
            // You could place your counter check here instead, e.g.
            // if (context.getCounter(<ENUM>).getValue() > 200) { ... }
            if (context.getCurrentKey().get() > 200) {
                throw new InterruptedException();
            } else {
                /* extra code added to the standard run method ends here */
                map(context.getCurrentKey(), context.getCurrentValue(), context);
            }
        }
        cleanup(context);
    }
}
and you need to handle it properly in the driver as well, for example:
} catch (InterruptedException e) {
    e.printStackTrace();
    System.exit(1); // exit with a non-zero status so the run is reported as failed
}
You can also use a logger and log a proper message here if required.
I hope this solves your problem. Let me know if you need any more help.
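Another approach is to let all the mappers run and check the aggregated counter in the driver after waitForCompletion(), failing the run there. A minimal sketch (the DocCounters enum, the counter names and the 10% threshold are just placeholders):
// In the mapper: context.getCounter(DocCounters.FAILED_DOCS).increment(1); etc.
public enum DocCounters { FAILED_DOCS, TOTAL_DOCS }

// In the driver, after submitting the job:
boolean completed = job.waitForCompletion(true);
long failed = job.getCounters().findCounter(DocCounters.FAILED_DOCS).getValue();
long total  = job.getCounters().findCounter(DocCounters.TOTAL_DOCS).getValue();

// Treat the run as failed if more than 10% of the documents failed
if (!completed || (total > 0 && (double) failed / total > 0.10)) {
    System.exit(1);
}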

Related

JobRunr Spring Boot: how to get notified if a recurring job - including retries - has failed

I'm using JobRunr 5.1.4 in my Spring Boot application. I have a simple service declaring a recurring job which allows for some retries. A single failing job run is not that relevant for me. Instead, I'm interested in getting notified after all runs, i.e. the initial job including all the retries, have failed.
I thought JobRunr's JobServerFilter would be a good idea, but the onProcessed() method never gets triggered in case of an exception, only in case of a successful job run. And the ApplyStateFilter gets triggered on every state change, far too often for my requirement. This leaves me clueless as to whether a change to a FAILED state was the last in a series of runs belonging together (the initial job plus the allowed retries).
A simple example would look like this:
@Service
public class JobScheduler {

    @Job(name = "My Recurring Job", retries = 2, jobFilters = ExceptionFilter.class)
    @Recurring(id = "my-recurring-job", cron = "*/10 * * * *")
    public void recurringJob() {
        throw new RuntimeException("foo");
    }
}
A basic implementation of my JobFilter looks like this:
@Component
public class ExceptionFilter implements JobServerFilter, ApplyStateFilter {

    @Override
    public void onProcessing(Job job) {
        log.info("onProcessing: {}", job.getJobName());
        log.info(job.getJobState().getName().name());
    }

    @Override
    public void onProcessed(Job job) {
        log.info("onProcessed: {}", job.getJobName());
        log.info(job.getJobState().getName().name());
    }

    @Override
    public void onStateApplied(Job job, JobState jobState1, JobState jobState2) {
        log.info("onStateApplied: {}", job.getJobName());
        log.info("jobState1: {}", jobState1.getName().name());
        log.info("jobState2: {}", jobState2.getName().name());
    }
}
Is this use case even possible with JobRunr? Or does anyone have an idea how to solve this issue in a different way?
Thank you very much in advance for your support.
I think you're on the right track with onStateApplied from ApplyStateFilter.
You can use the following approach:
@Override
public void onStateApplied(Job job, JobState oldState, JobState newState) {
    if (isFailed(newState) && maxAmountOfRetriesReached(job)) {
        // your logic here
    }
}
onProcessed is not triggered because your job was never successfully processed (due to the failure).
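For reference, a rough sketch of what those two helpers could look like. It assumes JobRunr's StateName enum and that Job exposes its state history (e.g. via getJobStates()), and it hard-codes the retry limit from @Job(retries = 2), so treat it as an illustration rather than the exact API:
private boolean isFailed(JobState state) {
    return StateName.FAILED.equals(state.getName());
}

private boolean maxAmountOfRetriesReached(Job job) {
    // Initial run + 2 retries (see @Job(retries = 2)) => the third FAILED state is the final one
    long failedAttempts = job.getJobStates().stream()
            .filter(state -> StateName.FAILED.equals(state.getName()))
            .count();
    return failedAttempts >= 3;
}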

How can I see the current output of a running Storm topology?

I am currently learning how to use Storm (version 2.1.0), and I am a bit confused about a specific aspect of this data stream processing (DSP) engine: how is output data handled? Tutorials provide good explanations of system setup and running a first application. Unfortunately, I didn't find a page providing details on the results generated by a topology.
With DSP applications, there is no final output because the input is a continuously incoming stream of data (or maybe we can say there is a final output when the application is stopped). What I would like is to be able to see the current output (the actual output data generated at the current time) of a running topology.
I'm able to run WordCountTopology. I understand the output of this topology is generated by the following snippet of code:
public static class WordCount extends BaseBasicBolt {
    Map<String, Integer> counts = new HashMap<String, Integer>();

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        String word = tuple.getString(0);
        Integer count = counts.get(word);
        if (count == null) {
            count = 0;
        }
        count++;
        counts.put(word, count);
        collector.emit(new Values(word, count));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word", "count"));
    }
}
My confusion is about the location of the <"word":string, "count":int> output. Is it only in memory, written to a database somewhere, or written to a file?
Going further with this question: what are the existing possibilities for storing in-progress output data? What is the "good way" of handling such data?
I hope my question is not too naive. And thanks to the StackOverflow community for always providing good help.
A few days have passed since I posted this question, and I am back to share what I have tried. Although I cannot tell whether it is the right way of doing things, the two following approaches answer my question.
Simple System.out.println()
The first thing I tried was to call System.out.println("Hello World!") directly within the prepare() method of my BaseBasicBolt. This method is called only once, at the beginning of each bolt thread's execution.
@Override
public void prepare(Map topoConf, TopologyContext context) {
    System.out.println("Hello World!");
}
The big challenge was to figure out where the log is written. By default, it is written within <storm installation folder>/logs/workers-artifacts/<topology name>/<worker-port>/worker.log where <worker-port> is the port of a requested worker/slot.
For instance, with conf.setNumWorkers(3), the topology requests access to 3 workers (3 slots). Therefore, the values of <worker-port> will be 6700, 6701 and 6702. Those values are the port numbers of the 3 slots (defined in storm.yaml under supervisor.slots.ports).
Note: you will have as many "Hello World!" lines as the parallelism of your BaseBasicBolt. When the split bolt is instantiated with builder.setBolt("split", new SplitSentence(), 8), it results in 8 parallel threads, each writing to its own log.
Writing to a file
For research purposes I have to analyse large amounts of logs, and I need them in a specific format. The solution I found is to append the logs to a specific file managed by each bolt.
Below is my own implementation of this file-logging solution for the count bolt.
public static class WordCount extends BaseBasicBolt {
    private String workerName;
    private FileWriter fw;
    private BufferedWriter bw;
    private PrintWriter out;
    private String logFile = "/var/log/storm/count.log";
    private Map<String, Integer> counts = new HashMap<String, Integer>();

    public void prepare(Map topoConf, TopologyContext context) {
        this.workerName = this.toString();
        try {
            // open the log file in append mode so each (re)start adds to it instead of overwriting it
            this.fw = new FileWriter(logFile, true);
            this.bw = new BufferedWriter(fw);
            this.out = new PrintWriter(bw);
        } catch (Exception e) {
            System.out.println(e);
        }
    }

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        String word = tuple.getString(0);
        Integer count = counts.get(word);
        if (count == null) {
            count = 0;
        }
        count++;
        counts.put(word, count);
        collector.emit(new Values(word, count));
        out.println(this.workerName + ": Hello World!");
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word", "count"));
    }
}
In this code, my log file is located at /var/log/storm/count.log and calling out.println(text) appends the text at the end of this file. As I am not sure whether this is thread-safe, having all parallel threads write into the same file at the same time might result in data loss.
Note: if your bolts are distributed across multiple machines, each machine is going to have its own log file. During my tests, I configured a simple cluster with 1 machine (running Nimbus + Supervisor + UI), therefore I had only 1 log file.
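One way to sidestep the shared-file concern is to give each bolt instance its own file in prepare(). This is just a sketch (the path is arbitrary, and getThisTaskId() comes from Storm's TopologyContext):
@Override
public void prepare(Map topoConf, TopologyContext context) {
    // one file per task, e.g. /var/log/storm/count-42.log, so parallel executors never share a writer
    String logFile = "/var/log/storm/count-" + context.getThisTaskId() + ".log";
    try {
        this.out = new PrintWriter(new BufferedWriter(new FileWriter(logFile, true)), true); // autoflush on println
    } catch (IOException e) {
        System.out.println(e);
    }
}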
Conclusion
There are multiple ways to deal with output data and, more generally, with logging anything in Storm. I didn't find any official way of doing it, and the documentation is very light on this subject.
While some of us would be satisfied with a simple System.out.println(), others might need to push large quantities of data into specific files, or maybe into a specialized database engine. Anything you can do with Java is possible with Storm, because it is plain Java programming.
Any advice or additional comments to complete this answer will be gladly appreciated.

Hadoop: When does the setup method gets invoked in reducer?

As far as I understand, the reduce task has three phases: shuffle, sort, and the actual reduce invocation.
So usually in a Hadoop job's output we see something like:
map 0% reduce 0%
map 20% reduce 0%
.
.
.
map 90% reduce 10%
.
.
.
So I assume that the reduce tasks start before all the maps are finished and this behavior is controlled by the slow start configuration.
Now, I don't yet understand when the setup method of the reducer is actually called.
In my use case, I have some files to parse in the setup method. The file is about 60 MB in size and is picked up from the distributed cache. While the file is being parsed, there is another set of data from the configuration that can update the just-parsed records. After parsing and possibly updating, the data is stored in a HashMap for fast lookups. So I would like this method to be invoked as soon as possible, ideally while the mappers are still doing their thing.
Is it possible to do this? Or is that what already happens?
Thanks
setup() is called right before the reducer is able to read the first key/value pair from the stream, which is effectively after all mappers have run and all the merging for a given reducer partition is finished.
As explained in the Hadoop docs, the setup() method is called once at the start of the task. It should be used for instantiating resources/variables or reading configurable parameters, which in turn can be used in the reduce() method. Think of it like a constructor.
Here is an example reducer:
class ExampleReducer extends TableReducer<ImmutableBytesWritable, ImmutableBytesWritable, ImmutableBytesWritable> {

    private int runId;
    private ObjectMapper objectMapper;

    @Override
    protected void setup(Context context) throws IOException {
        Configuration conf = context.getConfiguration();
        this.runId = Integer.valueOf(conf.get("stackoverflow_run_id"));
        this.objectMapper = new ObjectMapper();
    }

    @Override
    protected void reduce(ImmutableBytesWritable keyFromMap, Iterable<ImmutableBytesWritable> valuesFromMap, Context context) throws IOException, InterruptedException {
        // your code, e.g. serializing a value with the ObjectMapper created in setup():
        // String json = objectMapper.writeValueAsString(someValue);
        // your code
        context.write(new ImmutableBytesWritable(somekey.getBytes()), put); // somekey and put are placeholders
    }
}
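As a side note on the slow-start behaviour mentioned in the question: the fraction of map tasks that must complete before reducers are launched is controlled by mapreduce.job.reduce.slowstart.completedmaps. This only affects when the reduce tasks (and their shuffle) start; as explained above, setup() still runs only once the merge for that reducer's partition is done. A minimal driver-side sketch (the 0.80 value is just an illustration):
Configuration conf = new Configuration();
// launch reducers once 80% of the map tasks have completed
conf.setFloat("mapreduce.job.reduce.slowstart.completedmaps", 0.80f);
Job job = Job.getInstance(conf, "my job");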

Find which data split caused the job to fail in hadoop

I was wondering if I can get some help on how to debug this situation?
Basically, I am reading data from HDFS, performing some basic computation, and writing the result back to HDFS.
But in the job tracker I see that one of the tasks is always in the initializing phase:
Task Complete Phase ..... Counter
task_201312040108_0001_m_003006 0 Initializing 0
After a few attempts (3), this task failed, forcing the whole job to fail, while all the others succeeded.
How do I debug this situation?
I was wondering if I can take a look at what data split this mapper is getting. Oh, and this is a map-only job.
All my Java mappers extend a base mapper that has the following code:
// hook for subclasses
protected void doSetup(Context ctx) throws IOException, InterruptedException {}

@Override
public final void setup(Context ctx) throws IOException, InterruptedException {
    String strSplitMsg = "Input split: " + ctx.getInputSplit();
    LOG.info(strSplitMsg);
    ctx.setStatus(strSplitMsg);
    doSetup(ctx);
}
so that I never get bitten by that problem. However, your freeze might be happening before the call to setup(); perhaps you can look at the task tracker log on the host where the failures occurred, or at the task attempt log itself.
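For completeness, a subclass then hooks into doSetup() instead of overriding setup() itself, so the split logging always happens. A minimal sketch (MyMapper and the base class name are hypothetical):
public class MyMapper extends BaseMapperWithSplitLogging {
    @Override
    protected void doSetup(Context ctx) throws IOException, InterruptedException {
        // per-task initialisation goes here; the input split has already been logged by setup()
    }
}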

Working of RecordReader in Hadoop

Can anyone explain how the RecordReader actually works? How do the methods nextKeyValue(), getCurrentKey() and getProgress() work after the program starts executing?
(new API): The default Mapper class has a run method which looks like this:
public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    while (context.nextKeyValue()) {
        map(context.getCurrentKey(), context.getCurrentValue(), context);
    }
    cleanup(context);
}
The Context.nextKeyValue(), Context.getCurrentKey() and Context.getCurrentValue() methods are wrappers for the RecordReader methods. See the source file src/mapred/org/apache/hadoop/mapreduce/MapContext.java.
So this loop executes and calls your Mapper implementation's map(K, V, Context) method.
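To make that delegation concrete, here is a rough sketch (not the actual Hadoop source) of how a MapContext-style wrapper simply forwards those calls to the RecordReader supplied by the InputFormat:
import java.io.IOException;
import org.apache.hadoop.mapreduce.RecordReader;

public class MapContextSketch<KEYIN, VALUEIN> {
    private final RecordReader<KEYIN, VALUEIN> reader; // created by the InputFormat for this task's split

    public MapContextSketch(RecordReader<KEYIN, VALUEIN> reader) {
        this.reader = reader;
    }

    public boolean nextKeyValue() throws IOException, InterruptedException {
        return reader.nextKeyValue();   // advances the reader to the next record, false at end of split
    }

    public KEYIN getCurrentKey() throws IOException, InterruptedException {
        return reader.getCurrentKey();
    }

    public VALUEIN getCurrentValue() throws IOException, InterruptedException {
        return reader.getCurrentValue();
    }

    public float getProgress() throws IOException, InterruptedException {
        return reader.getProgress();    // fraction of the split consumed so far, between 0 and 1
    }
}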
Specifically, what else would you like to know?
See org.apache.hadoop.mapred.MapTask
- runNewMapper()
The important steps are:
- create a new mapper
- get the input split for the mapper
- get the RecordReader for the split
- initialize the RecordReader
- using the RecordReader, iterate with nextKeyValue() and pass each key/value to the mapper's map method
- clean up
