Why can't Reducer.class be used as a real reducer in Hadoop MapReduce? - hadoop

I noticed that Mapper.class can be used as a real mapper in a phase, together with a user-defined reducer. For example,
Phase 1:
Mapper.class -> WordCountReduce.class
This will work.
However, Reducer.class cannot be used the same way. Namely something like
Phase 2:
WordReadMap.class -> Reducer.class
will not work.
Why is that?

I don't see why it wouldn't work, as long as the output types are the same as the input types. The default Reducer in the new API just writes out whatever you pass into it; it's implemented as:
@SuppressWarnings("unchecked")
protected void reduce(KEYIN key, Iterable<VALUEIN> values, Context context
                      ) throws IOException, InterruptedException {
    for (VALUEIN value : values) {
        context.write((KEYOUT) key, (VALUEOUT) value);
    }
}
For the old API, it's an interface, and you can't directly instantiate an interface. If you're using that, then that's the reason it fails. Then again, the Mapper is an interface as well, and you shouldn't be able to instantiate it...
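If you are on the new API (org.apache.hadoop.mapreduce), a driver along the lines of the following sketch should accept the stock Reducer as an identity reducer. WordReadMap and the Text/IntWritable output types are placeholders based on the question, not a tested configuration.
// Sketch of a new-API driver that uses the stock Reducer as an identity reducer.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Phase2Driver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "phase2");
        job.setJarByClass(Phase2Driver.class);

        job.setMapperClass(WordReadMap.class);
        // Identity reducer: passes mapper output through unchanged.
        job.setReducerClass(Reducer.class);

        // Assumed output types; adjust to whatever WordReadMap actually emits.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}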

Related

Multiple writers for different types in the same Spring Batch step

I am writing a Spring Batch application with the following workflow:
Read some items of type A (using a FlatFileItemReader<A>).
Process an item, transforming it from A to B.
Write the processed items of type B (using a JdbcBatchItemWriter<B>)
Eventually, I should call an external service (a RESTful API, but it could be a SimpleMailMessageItemWriter<A>) using data from the source type A.
How can I configure such a workflow?
So far, I have found the following workaround:
Configuring a CompositeItemWriter<B> which delegates to:
The actual ItemWriter<B>
A custom ItemWriter<B> implementation which converts B back to A and then writes an A
But this is a cumbersome solution because it forces me to either:
Duplicate processing logic: from A to B and back again.
Sneakily hide some attributes from the source object A inside B, polluting the domain model.
Note: since my custom item writer for A needs to invoke an external service, I would like to perform this operation after B has been successfully written.
Here are the relevant parts of the batch configuration code.
@Bean
public Step step(StepBuilderFactory steps, ItemReader<A> reader, ItemProcessor<A, B> processor, CompositeItemWriter<B> writer) {
    return steps.get("step")
            .<A, B>chunk(10)
            .reader(reader)
            .processor(processor)
            .writer(writer)
            .build();
}

@Bean
public CompositeItemWriter<B> writer(JdbcBatchItemWriter<B> jdbcBatchItemWriter, CustomItemWriter<B, A> customItemWriter) {
    return new CompositeItemWriterBuilder<B>()
            .delegates(jdbcBatchItemWriter, customItemWriter)
            .build();
}
For your use case, I would encapsulate A and B in a wrapper type, such as AB:
class AB {
    private A originalItem;
    private B transformedItem;
}
With that, you would have: ItemReader<A>, ItemProcessor<A, AB> and ItemWriter<AB>. The processor creates instances of AB in which it keeps a reference to the original item. The writer can then get access to both types and delegate to the JdbcBatchItemWriter<B> and SimpleMailMessageItemWriter<A> as needed, something like:
class ABItemWriter implements ItemWriter<AB> {
    private JdbcBatchItemWriter<B> jdbcBatchItemWriter;
    private SimpleMailMessageItemWriter mailMessageItemWriter;

    // constructor with delegates

    @Override
    public void write(List<? extends AB> items) throws Exception {
        jdbcBatchItemWriter.write(getBs(items));
        mailMessageItemWriter.write(getAs(items)); // this would not be called if the jdbc writer fails
    }
}
The methods getAs and getBs would extract items of type A/B from AB. Encapsulation for the win! BTW, a Java record is a good option for type AB.
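As an illustration only, the record and the extraction helpers could look like the sketch below. The names getAs/getBs come from the answer above; the record accessors, the AbExtraction holder class, and the stream-based bodies are assumptions, not code from the answer.
// Sketch: AB as a Java record (requires Java 16+) plus the extraction helpers.
import java.util.List;
import java.util.stream.Collectors;

record AB(A originalItem, B transformedItem) {}

class AbExtraction {
    // Extract the transformed items for the JDBC writer.
    static List<B> getBs(List<? extends AB> items) {
        return items.stream().map(AB::transformedItem).collect(Collectors.toList());
    }

    // Extract the original items for the mail writer.
    static List<A> getAs(List<? extends AB> items) {
        return items.stream().map(AB::originalItem).collect(Collectors.toList());
    }
}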

Could the reducer class not be launched by any chance? Can't see System.out.println statements in the reducer logs

I have a driver class, mapper class and reducer class. The MapReduce job runs fine, but the desired output is not produced. I have put System.out.println statements in the reducer. I looked at the logs of the mapper and reducer: the println statements I put in the mapper can be seen in the logs, but the println statements in the reducer are not. Could it be that the reducer is not launched at all?
This is the log file from the reducer.
I assume this question is based on the code in your earlier question: mapreduce composite Key sample - doesn't show the desired output
public class CompositeKeyReducer extends Reducer<Country, IntWritable, Country, IntWritable> {
    public void reduce(Country key, Iterator<IntWritable> values, Context context) throws IOException, InterruptedException {
    }
}
The reduce isn't running because the reduce method signature is wrong. You have:
public void reduce(Country key, Iterator<IntWritable> values, Context context)
It should be:
public void reduce(Country key, Iterable<IntWritable> values, Context context)
To make sure this doesn't happen again you should add the @Override annotation to the reduce method. This will tell you if you've got the signature wrong.
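For reference, a corrected version might look like the sketch below. Country is the custom key type from the linked question; the summing body is only illustrative, since the original reduce body is not shown here.
// Sketch of the corrected reducer: Iterable instead of Iterator, plus
// @Override so the compiler flags any signature mismatch.
public class CompositeKeyReducer extends Reducer<Country, IntWritable, Country, IntWritable> {
    @Override
    public void reduce(Country key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();   // illustrative aggregation
        }
        context.write(key, new IntWritable(sum));
    }
}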
No change in the code. It works now.
All I did was restart my Hadoop Cloudera image and it works now. I can't believe this happened.

Are member variables of Hadoop Reducer class thread-safe?

I'm a newbie to the Hadoop ecosystem.
What I want to ask is: "Are member variables of the Reducer class thread-safe?"
1. The Mapper passes data to the Reducer with a unique key.
2. There is a collection (ConcurrentLinkedQueue) which is a member variable of the Reducer class.
3. The collection is initialized in the setup(Context) method of the Reducer class.
4. Some Query objects (jOOQ) are created and appended to the collection in the reduce(...) method of the Reducer class.
5. jooq.batch(collection).execute() is called in the last line of the reduce(...) method whenever a specified threshold (e.g. 1000) is reached, and the collection is then cleared with clear().
6. The remainder of the collection from step 4 is processed in the same way as step 5, in the cleanup(Context) method.
Question: Do I need to synchronize step 5?
Codes
public class SomeReducer extends TableReducer<Text, Text, ImmutableBytesWritable> {
    private Queue<Query> queries;

    @Override
    protected void setup(Context context) {
        ...
        queries = new ConcurrentLinkedQueue<>();
    }

    @Override
    protected void cleanup(Context context) {
        if (!queries.isEmpty()) db.batch(queries).execute();
        ...
    }

    @Override
    public void reduce(Text key, Iterable<Session> sessions, Context context) {
        for (...iteration...) { queries.add(...create Query object...); }

        // Should the code snippet below be synchronized?
        if (queries.size() >= 1000) {
            db.batch(queries).execute();
            queries.clear();
        }
    }
}
A Reducer is thread-safe. You will most likely have multiple Reducers running in parallel, but they are completely isolated from each other and only see their own data and instance variables.
So to answer your question, you do not need to synchronize your code or even use a ConcurrentLinkedQueue; it could just be a normal ArrayList.
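In other words, per this answer, the only change needed in the reducer shown above would be to swap the queue for a plain list, along these lines (a fragment, not a complete class):
// A plain ArrayList is sufficient; no synchronization is needed, because a
// reducer task invokes setup/reduce/cleanup from a single thread.
private List<Query> queries;

@Override
protected void setup(Context context) {
    queries = new ArrayList<>();   // instead of new ConcurrentLinkedQueue<>()
}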

about context object in map-reduce

Can anyone explain why we write arguments in angle brackets in the statement below, and why we define the output key/value pairs in the arguments?
public static class Map extends Mapper <LongWritable, Text, Text, IntWritable>
What is the Context object, and why are we using it in the statement below?
public void map(LongWritable key, Text value, Context context ) throws IOException, InterruptedException
To add to what @Vasu answered..
Context stores references to RecordReader and RecordWriter.
Whenever context.getCurrentKey() and context.getCurrentValue() are used to retrieve key and value pair, the request is assigned to RecordReader. And when context.write() is called, it is assigned to RecordWriter.
Here RecordReader and RecordWriter are actually abstract classes.
<> is used to indicate generics in Java.
Mapper<LongWritable, Text, Text, IntWritable> takes only <LongWritable, Text> as the input key/value pair and emits <Text, IntWritable> as the output key/value pair. If you try to provide any other writable types to your mapper, it will throw an error.
The Context object is used to write output key/value pairs, as well as to get the configuration, counters, cache files, etc. in the Mapper.
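For illustration, the classic WordCount mapper shows how the four generic parameters and context.write() fit together. This is the standard example, not code from the question (there it appears as a nested static class inside the driver):
// Input key/value are <LongWritable, Text> (byte offset and line of text);
// output key/value are <Text, IntWritable> (word and count of 1).
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);   // emits a <Text, IntWritable> pair
        }
    }
}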

How to get command line arguments of an Eclipse 4 application from code

I need to somehow get the command line arguments of a running Eclipse 4 application. I'm working on a small application based on the Eclipse 4 RCP, but I think this problem is more common. I'm unable to find out how to get, from the code of a product or a plug-in, the command line arguments the application has been executed with.
I need to use a custom command line parameter to pass information to my code. Does anybody have a hint?
Since E4 uses Equinox as its runtime, you can use the Platform class to get the application arguments.
Platform.getApplicationArgs()
See Javadoc:
http://help.eclipse.org/kepler/index.jsp?topic=%2Forg.eclipse.platform.doc.isv%2Freference%2Fapi%2Findex.html
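For example, a minimal sketch iterating over the returned array (Platform here is org.eclipse.core.runtime.Platform; the helper class name is just for illustration):
// Dumps the application arguments captured by Equinox.
import org.eclipse.core.runtime.Platform;

public final class ArgDump {
    public static void dump() {
        String[] args = Platform.getApplicationArgs();
        for (String arg : args) {
            System.out.println("application arg: " + arg);
        }
    }
}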
I've got it. It is not very intuitive, but it works for me. There is an instance implementing the IApplicationContext interface. (The interface lives in the org.eclipse.equinox.app bundle.) The instance is reachable via the injection mechanism. The method getArguments() returns a map, but it is not a map of command line parameters and their values. It is a map in which an array is stored under the key "application.args". For example:
@PostConstruct
public void createControls(Composite parent, HtmlEditorService editorService, IApplicationContext iac) {
    System.out.println(iac.getArguments().get("application.args").getClass().getCanonicalName());
    ...
}
This prints out java.lang.String[]. However, the array contains just my custom arguments instead of all arguments. Fortunately, that does not matter for me; I only need my custom arguments.
Additional hint for a plug-in activator
public class Aktivator implements BundleActivator {
    @Override
    public void start(BundleContext context) throws Exception {
        ServiceReference<?> ser = context.getServiceReference(IApplicationContext.class);
        IApplicationContext iac = (IApplicationContext) context.getService(ser);
        System.out.println(iac.getArguments().get("application.args").getClass().getCanonicalName());
    }

    @Override
    public void stop(BundleContext context) throws Exception {
    }
}
