Hadoop MapReduce design / routing mapper and reducer in one job - hadoop

I want to make a MapReduce design like this inside one job.
Example, within a single job:
[Mapper A] ---> [Mapper C]
[Mapper B] ---> [Reducer B]
After that [Reducer B] ---> [Mapper C]
[Mapper C] ---> [Reducer C]
So [Mapper A] & [Reducer B] ---> [Mapper C], and then [Mapper C] continues to [Reducer C]. I want the whole scenario above to run in one job.
It's like routing inside one MapReduce job: I can route many mappers to a particular reducer and then continue to another mapper and reducer again, all inside one job. Any suggestions?
Thanks.

--Edit starts
To simplify the problem, let's say you have three jobs JobA, JobB, JobC, each comprising a map and a reduce phase.
Now you want to use the mapper output of JobA in the mapper task of JobC, so JobC just needs to wait for JobA to finish its map task. You can use the MultipleOutputs class in JobA to preserve/write the map-phase output at a location which JobC can poll for.
--Edit ends
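As a rough illustration of the MultipleOutputs suggestion above, a mapper in JobA could write a side copy of its map output like the sketch below. The class shape, the named output "mapSideCopy", and the Text key/value types are assumptions for the example, not code from the question.
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

// Assumes the driver registered the named output beforehand, e.g.:
// MultipleOutputs.addNamedOutput(job, "mapSideCopy", TextOutputFormat.class, Text.class, Text.class);
public class MapperA extends Mapper<LongWritable, Text, Text, Text> {
    private MultipleOutputs<Text, Text> mos;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<>(context);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        Text outKey = new Text(value);    // derive your real key here
        Text outValue = new Text(value);  // derive your real value here
        context.write(outKey, outValue);            // normal map output for JobA's reducer
        mos.write("mapSideCopy", outKey, outValue); // extra copy that JobC can poll for and read
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();
    }
}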
Programmatically you can do something like the code below, where getJob() should be defined in the respective map-reduce class; that is where you specify the configuration, DistributedCache, input formats, etc.
main () {
    processMapperA();
    processMapReduceB();
    processMapReduceC();
}

processMapperA() {
    // configure the paths/inputs needed; for example's sake I am taking two paths
    String path1 = "path1";
    String path2 = "path2";
    String[] mapperApaths = new String[]{path1, path2};

    Job mapperAjob = MapperA.getJob(mapperApaths, <some other params you want to pass>);
    mapperAjob.submit();
    mapperAjob.waitForCompletion(true);
}

processMapReduceB() {
    // init input params to job
    // ...
    Job mapReduceBjob = MapReduceB.getJob(<input params you want to pass>);
    mapReduceBjob.submit();
    mapReduceBjob.waitForCompletion(true);
}

processMapReduceC() {
    // init input params to job
    // ...
    Job mapReduceCjob = MapReduceC.getJob(<input params you want to pass like outputMapperA, outputReducerB>);
    mapReduceCjob.submit();
    mapReduceCjob.waitForCompletion(true);
}
To gain more control over the workflow, you can consider using Oozie or Spring Batch.
With Oozie you can define workflow.xml files and schedule the execution of each job as required.
Spring Batch can also be used for the same purpose, but it requires some coding and understanding; if you have a background in it, it can be used straight away.
--Edit starts
Oozie is a workflow management tool; it allows you to configure and schedule jobs.
--Edit Ends
Hope this helps.

Related

Multi-Module Task Dependencies

I want the outputs of one task to be available to an identical task in another submodule.
I'm trying to make yet another plugin for compilation (of C/C++, .hs, .coffee, .js et al.) and source code generation.
So, I'm making a plugin and tasks that (so far) generate CMakeLists.txt, Android.mk, .vcxproj or whatever for each module to build the source code.
I have a multi-module build for this.
I can reach around and find the tasks from "other" submodules, but I can't seem to enforce any execution order.
So, with ...
root project: RootModule
sub project: NativeCommandLine (requires SharedModule)
sub project: NativeGUI (requires SharedModule)
sub project: SharedModule
... I find that the NativeGUI tasks are executed before SharedModule which means that the SharedModule results aren't ready.
Bad.
Since the dependency { ... } stuff happens after plugins are installed (AFAIK), I'm guessing that the dependencies are connected afterwards.
I need my tasks executed in order based on the dependency relations ... right? How can I do that?
I have created a (scala) TaskBag that lazily registers a collection of all participating Task instances.
I add instances of my task to this, along with a handler for when a new task appears.
During configure, any task can include logic in the lambda to filter and act on other tasks and it will be executed as soon as both tasks are participating.
package peterlavalle

import java.util

import org.gradle.api.Task

import scala.collection.JavaConverters._

object TaskBag {

  class AnchorExtension extends util.LinkedList[(Task, Task => Unit)]()

  /**
   * connect to the group of tasks
   */
  def apply(task: Task)(react: Task => Unit): Unit =
    synchronized {
      // lazily create the central anchor ... thing ...
      val anchor: AnchorExtension =
        task.getProject.getRootProject.getExtensions.findByType(classOf[AnchorExtension]) match {
          case null =>
            task.getProject.getRootProject.getExtensions.create(classOf[AnchorExtension].getName, classOf[AnchorExtension])
          case anchor: AnchorExtension =>
            anchor
        }

      // show us off to the old ones (asScala is needed to iterate the Java list)
      anchor.asScala.foreach {
        case (otherTask, otherReact) =>
          require(otherTask != task, "Don't double register a task!")
          otherReact(task)
          react(otherTask)
      }

      // add us to the list
      anchor.add(task -> react)
    }
}

Gradle Plugin: Task C should run if Task A or Task B

I am creating Custom Plugin tasks and got into a situation where I need some help.
I created 3 tasks, A, B and C, in Gradle.
Task C should get executed only if A or B succeeded.
Note that A and B are 2 separate tasks and are not related.
class A extends DefaultTask { }
class B extends DefaultTask { }
class C extends DefaultTask { }
If I try C.dependsOn(A); C.dependsOn(B);, then I think C is dependent on both A and B (not A or B). Is there any way to specify an "A or B" condition here?
Gradle offers four methods (and related containers) for task dependencies and ordering:
Style: t1.<method>(t2)
dependsOn - Ensures that t2 is executed if and before t1 is executed.
finalizedBy - Ensures that t2 is executed if and after t1 is executed.
mustRunAfter - Ensures that if both t2 and t1 are executed (caused by other triggers), t1 is executed after t2.
shouldRunAfter - Basically the same as mustRunAfter, but it may be ignored in special cases (check the docs).
Your requirement is special and won't be solved by a simple method like the ones above. If I understand your question right, you want to ensure that task C gets executed after tasks A and B, but only if it will be executed anyhow (and not trigger it automatically). You can use mustRunAfter for the first part of the requirement. However, you also want to ensure that either task A or task B was executed before. I suggest using the onlyIf method to skip task execution for task C if neither A nor B was executed before. Example:
task A { }
task B { }
task C {
    mustRunAfter A, B
    onlyIf { A.state.executed || B.state.executed }
}
You can add multiple dependencies.
task B << {
    println 'Hello from B'
}
task C << {
    println 'Hello from C'
}
task D(dependsOn: ['B', 'C']) << {
    println 'Hello from D'
}
The output is:
> gradle -q D
Hello from B
Hello from C
Hello from D
I think the answer to the original question in Gradle 7.x+ would be something like:
A.finalizedBy('C')
B.finalizedBy('C')
tasks.register('C') {
    onlyIf {
        (gradle.taskGraph.hasTask('A') &&
            tasks.A.getState().getExecuted() &&
            tasks.A.getState().getFailure() == null) ||
        (gradle.taskGraph.hasTask('B') &&
            tasks.B.getState().getExecuted() &&
            tasks.B.getState().getFailure() == null)
    }
}
With this condition you can't execute C by itself as in 'gradle C'.
Another option could be a Task listener:
A.finalizedBy('C')
B.finalizedBy('C')
gradle.taskGraph.afterTask { Task task, TaskState taskState ->
    if (['A', 'B'].contains(task.getName())) {
        if (taskState.getFailure() == null) {
            project.tasks.getByName('C').setEnabled(true)
        } else {
            project.tasks.getByName('C').setEnabled(false)
        }
    }
}
With this option it is possible to just execute C with 'gradle C'.
Yet another option could be to declare inputs to C and then, in a task listener, change C's inputs as necessary to influence its up-to-date check.
Then A and B are executed directly or by some other trigger. If both A and B are in the task graph, then C will probably be executed twice. If A and B are executed in parallel, the result IMO is unpredictable.

Apache storm final bolt that should not emit tuples?

Assuming we have the following topology
spout A -> bolt B -> bolt C -> bolt E
where bolt E is the final one: it persists info to the database and therefore does not need to emit any tuples. How do I implement such a solution?
If I define no output_fields, then I get an exception:
Exception in thread "main" java.io.IOException: org.apache.storm.thrift.protocol.TProtocolException: Required field 'output_fields' is unset! Struct:StreamInfo(output_fields:null, direct:false)
at storm.petrel.ThriftReader.read(ThriftReader.java:77)
at storm.petrel.GenericTopology.readTopology(GenericTopology.java:36)
at storm.petrel.GenericTopology.main(GenericTopology.java:53)
Caused by: org.apache.storm.thrift.protocol.TProtocolException: Required field 'output_fields' is unset! Struct:StreamInfo(output_fields:null, direct:false)
at org.apache.storm.generated.StreamInfo.validate(StreamInfo.java:407)
at org.apache.storm.generated.StreamInfo$StreamInfoStandardScheme.read(StreamInfo.java:485)
at org.apache.storm.generated.StreamInfo$StreamInfoStandardScheme.read(StreamInfo.java:441)
at org.apache.storm.generated.StreamInfo.read(StreamInfo.java:377)
at org.apache.storm.generated.ComponentCommon$ComponentCommonStandardScheme.read(ComponentCommon.java:681)
at org.apache.storm.generated.ComponentCommon$ComponentCommonStandardScheme.read(ComponentCommon.java:636)
at org.apache.storm.generated.ComponentCommon.read(ComponentCommon.java:552)
at org.apache.storm.generated.Bolt$BoltStandardScheme.read(Bolt.java:451)
at org.apache.storm.generated.Bolt$BoltStandardScheme.read(Bolt.java:427)
at org.apache.storm.generated.Bolt.read(Bolt.java:358)
at org.apache.storm.generated.StormTopology$StormTopologyStandardScheme.read(StormTopology.java:727)
at org.apache.storm.generated.StormTopology$StormTopologyStandardScheme.read(StormTopology.java:683)
at org.apache.storm.generated.StormTopology.read(StormTopology.java:595)
at storm.petrel.ThriftReader.read(ThriftReader.java:75)
... 2 more
Please re-check bolt E and make sure it is not subscribed to by any other bolt (that is, bolt E is not used as a source in any TopologyBuilder.setBolt call), e.g.:
TopologyBuilder.setBolt("mybolt", new MyBolt()).fieldsGrouping("bolt E", new Fields(new String[] { "user_id" }));
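For reference, a minimal wiring sketch in which bolt E is a pure sink (SpoutA, BoltB, BoltC and BoltE are hypothetical stand-ins for your own components):
// Bolt E is terminal: nothing below ever uses "boltE" as a grouping source.
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("spoutA", new SpoutA());
builder.setBolt("boltB", new BoltB()).shuffleGrouping("spoutA");
builder.setBolt("boltC", new BoltC()).shuffleGrouping("boltB");
builder.setBolt("boltE", new BoltE()).shuffleGrouping("boltC");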

How do I log to file in Scalding?

In my Scalding map reduce code, I want to log out certain steps that are happening so that I can debug the map-reduce jobs if something goes wrong.
How can I add logging to my scalding job?
E.g.
import com.twitter.scalding._
class WordCountJob(args: Args) extends Job(args) {
  //LOG: Starting job at time blah..
  TextLine(args("input"))
    .read
    .flatMap('line -> 'word) {
      line: String =>
        line.trim.toLowerCase.split("\\W+")
    }
    .groupBy('word) { group => group.size('count) }
    .write(Tsv(args("output")))
  //LOG - ending job at time...
}
Any logging framework will do. You can obviously also use println() - it will appear in your job's stdout log file in the job history of your Hadoop cluster (in hdfs mode) or in your console (in local mode).
Also consider defining a trap with the addTrap() method for catching erroneous records.

How to trace a function call in Erlang?

I have a function in my_sup.erl like this:
init([ems_media_sup]) ->
    {ok, {{simple_one_for_one, ?MAX_RESTART, ?MAX_TIME}, [
        {ems_media_sup, {ems_media, start_link, []}, temporary, 2000, worker, [ems_media]}
    ]}};
But there is no function named start_link/1 in ems_media.erl, so I want to know why there is no error when I run
supervisor:start_link(?MODULE, [ems_media_sup])
So, how can I know what happens next after the call to init([ems_media_sup])?
That's because my_sup is of type simple_one_for_one - so it will only start child processes when explicitly asked to do so through supervisor:start_child/2.
If the supervisor had been of any other type (one_for_one, one_for_all or rest_for_one), it would have attempted to start all children in the child specification at startup. A simple_one_for_one supervisor, however, is meant for creating large numbers of children that only vary by their argument list, so in that case the child specification in the init function only plays the role of a template.
