I would like to import my Hadoop job output to Hive table. How do I implement a post-hooking in a map-reduce job/flow? Or any other automated options?
Also I would a notification after the job is done, such as sending a email to user. I found this: https://issues.apache.org/jira/browse/HADOOP-1111, but I don't quite understand how to do it since I'm new to map-reducing.
Thanks.
conf.set("mapreduce.job.end-notification.url","url")
would do. The url should be an http url where you would receive the callback.
From javadocs :
Set the uri to be invoked in-order to send a notification after the job has completed (success/failure).
The uri can contain 2 special parameters: $jobId and $jobStatus. Those, if present, are replaced by the job's identifier and completion-status respectively.
This is typically used by application-writers to implement chaining of Map-Reduce jobs in an asynchronous manner.
Note that older hadoop versions use job.end.notification.url.
It has been deprecated in newer versions in favour of mapreduce.job.end-notification.url.
Reference mapred-default.xml#mapreduce.job.end-notification.url.
Related
I use rest api in my program,I made a processor group for convent a mongodb collection to json file:
I want to run the scheduling only one time,so I set the "Run schedule" to 10000 sec.Then I will stop the group when the data flow have ran one time,and I made a Notify processor and add a DistributedMapCacheService.But the DistributedMapCacheClientService of the Notify processor only comunicates with the DistributedMapCacheService in nifi itself,It never nofity my program.
I try to use my own socket server,but I only get a message "nifi" but no more message.
My question is:If I only want scheduling run once and stop it,how do I know when shall I stop it?Or is there some other way to achieve my purpose,like detect if the json file exists or use incremental data(If the scheduling run twice,the data will be repeated twice)?
As #daggett said you can do it in a synchronous way you can use HandleHttpRequest as trigger and HandleHttpResponse to manage the response.
For an asynchronous was you have several options for the notification like PutTCP, PostHTTP, GetHTTP, use FTP, file system, XMPP or whatever.
If the scheduling run twice the duplicated elements depends on the processors you use, some of them have state others no, but if you are facing problems with repeated elements you can use the DetectDuplicate processor.
Running a streaming beam pipeline where i stream files/records from gcs using avroIO and then create minutely/hourly buckets to aggregate events and add it to BQ. In case the pipeline fails how can i recover correctly and process the unprocessed events only ? I do not want to double count events .
One approach i was thinking was writing to spanner or bigtable but it may be the case the write to BQ succeeds but the DB fails and vice versa ?
How can i maintain a state in reliable consistent way in streaming pipeline to process only unprocessed events ?
I want to make sure the final aggregated data in BQ is the exact count for different events and not under or over counting ?
How does spark streaming pipeline solve this (I know they have some checkpointing directory for managing state of query and dataframes ) ?
Are there any recommended techniques to solve accurately these kind of problem in streaming pipelines ?
Based on clarification from the comments, this question boils down to 'can we achieve exactly-once semantics across two successive runs of a streaming job, assuming both runs are start from scratch?'. Short answer is no. Even if the user is willing store some state in external storage, it needs to be committed atomically/consistently with streaming engine internal state. Streaming engines like Dataflow, Flink store required state internally, which is needed for to 'resume' a job. With Flink you could resume from latest savepoint, and with Dataflow you can 'update' a running pipeline (note that Dataflow does not actually kill your job even when there are errors, you need to cancel a job explicitly). Dataflow does provide exactly-once processing guarantee with update.
Some what relaxed guarantees would be feasible with careful use of external storage. The details really depend on specific goals (often it is is no worth the extra complexity).
What exactly is this keyword Context in Hadoop MapReduce world in new API terms?
Its extensively used to write output pairs out of Maps and Reduce, however I am not sure if it can be used somewhere else and what's exactly happening whenever I use context. Is it a Iterator with different name?
What is relation between Class Mapper.Context, Class Reducer.Context and Job.Context?
Can someone please explain this starting with Layman's terms and then going in detail. Not able understand much from Hadoop API documentations.
Thanks for your time and help.
Context object: allows the Mapper/Reducer to interact with the rest of the Hadoop system. It includes configuration data for the job as well as interfaces which allow it to emit output.
Applications can use the Context:
to report progress
to set application-level status messages
update Counters
indicate they are alive
to get the values that are stored in job configuration across map/reduce phase.
The new API makes extensive use of Context objects that allow the user code to communicate with MapRduce system.
It unifies the role of JobConf, OutputCollector, and Reporter from old API.
What exactly is this keyword Context in Hadoop MapReduce world in new API terms?
Its extensively used to write output pairs out of Maps and Reduce, however I am not sure if it can be used somewhere else and what's exactly happening whenever I use context. Is it a Iterator with different name?
What is relation between Class Mapper.Context, Class Reducer.Context and Job.Context?
Can someone please explain this starting with Layman's terms and then going in detail. Not able understand much from Hadoop API documentations.
Thanks for your time and help.
Context object: allows the Mapper/Reducer to interact with the rest of the Hadoop system. It includes configuration data for the job as well as interfaces which allow it to emit output.
Applications can use the Context:
to report progress
to set application-level status messages
update Counters
indicate they are alive
to get the values that are stored in job configuration across map/reduce phase.
The new API makes extensive use of Context objects that allow the user code to communicate with MapRduce system.
It unifies the role of JobConf, OutputCollector, and Reporter from old API.
I want to chain 2 Map/Reduce jobs. I am trying to use JobControl to achieve the same. My problem is -
JobControl needs org.apache.hadoop.mapred.jobcontrol.Job which in turn needs org.apache.hadoop.mapred.JobConf which is deprecated. How do I get around this problem to chain my Map/Reduce?
Anyone has any better ideas for chaining (other than Cascading).
You could use Riffle, it allows you to chain arbitrary processes together (anything you stick its Annotations on).
It has a rudimentary dependency scheduler, so it will order and execute your jobs for you. And it's Apache licensed. Its also on the Conjars repo if you're a maven user.
I'm the author, and wrote it so Mahout and other custom applications would be able to have a common tool that was also compatible with Cascading Flows.
I'm also the author of Cascading. But MapReduceFlow + Cascade in Cascading works quite well for most raw MR job chaining.
Cloudera has a workflow tool called Oozie that can help with this sort of chaining. Might be overkill for just getting one job to run after another.