What is Keyword Context in Hadoop programming world? - hadoop

What exactly is this keyword Context in Hadoop MapReduce world in new API terms?
Its extensively used to write output pairs out of Maps and Reduce, however I am not sure if it can be used somewhere else and what's exactly happening whenever I use context. Is it a Iterator with different name?
What is relation between Class Mapper.Context, Class Reducer.Context and Job.Context?
Can someone please explain this starting with Layman's terms and then going in detail. Not able understand much from Hadoop API documentations.
Thanks for your time and help.

Context object: allows the Mapper/Reducer to interact with the rest of the Hadoop system. It includes configuration data for the job as well as interfaces which allow it to emit output.
Applications can use the Context:
to report progress
to set application-level status messages
update Counters
indicate they are alive
to get the values that are stored in job configuration across map/reduce phase.

The new API makes extensive use of Context objects that allow the user code to communicate with MapRduce system.
It unifies the role of JobConf, OutputCollector, and Reporter from old API.

Related

Apache Flink relating/caching data options

This is a very broad question, I’m new to Flink and looking into the possibility of using it as a replacement for a current analytics engine.
The scenario is, data collected from various equipment, the data is received As a JSON encoded string with the format of {“location.attribute”:value, “TimeStamp”:value}
For example a unitary traceability code is received for a location, after which various process parameters are received in a real-time stream. The analysis is to be ran over the process parameters however the output needs to include a relation to a traceability code. For example {“location.alarm”:value, “location.traceability”:value, “TimeStamp”:value}
What method does Flink use for caching values, in this case the current traceability code whilst running analysis over other parameters received at a later time?
I’m mainly just looking for the area to research as so far I’ve been unable to find any examples of this kind of scenario. Perhaps it’s not the kind of process that Flink can handle
A natural way to do this sort of thing with Flink would be to key the stream by the location, and then use keyed state in a ProcessFunction (or RichFlatMapFunction) to store the partial results until ready to emit the output.
With a keyed stream, you are guaranteed that every event with the same key will be processed by the same instance. You can then use keyed state, which is effectively a sharded key/value store, to store per-key information.
The Apache Flink training includes some explanatory material on keyed streams and working with keyed state, as well as an exercise or two that explore how to use these mechanisms to do roughly what you need.
Alternatively, you could do this with the Table or SQL API, and implement this as a join of the stream with itself.

How Context object is working in hadoop? [duplicate]

What exactly is this keyword Context in Hadoop MapReduce world in new API terms?
Its extensively used to write output pairs out of Maps and Reduce, however I am not sure if it can be used somewhere else and what's exactly happening whenever I use context. Is it a Iterator with different name?
What is relation between Class Mapper.Context, Class Reducer.Context and Job.Context?
Can someone please explain this starting with Layman's terms and then going in detail. Not able understand much from Hadoop API documentations.
Thanks for your time and help.
Context object: allows the Mapper/Reducer to interact with the rest of the Hadoop system. It includes configuration data for the job as well as interfaces which allow it to emit output.
Applications can use the Context:
to report progress
to set application-level status messages
update Counters
indicate they are alive
to get the values that are stored in job configuration across map/reduce phase.
The new API makes extensive use of Context objects that allow the user code to communicate with MapRduce system.
It unifies the role of JobConf, OutputCollector, and Reporter from old API.

Comparison with mainstream workflow engines

I'd like to use Spring SM in my next future that has very simple workflows, 3-4 states, rule based transitions, and max actors.
The WF is pretty fixed, so storing its definition in java config is quite ok.
I'd prefer to use SM than WF engine which comes with the whole machinery, but I couldnt find out if there is a notion of Actor.
Meaning, only one particular user (determined by login string) can trigger a transition between states.
Also, can I run the same State machine definition in parallel. Is there a notion of instance, like process instance in WF jargon?
Thanks,
Milan
Actor with a security is an interesting concept but we don't have anything build in right now. I'd say that this can be accomplished via Spring Security i.e. https://spring.io/blog/2013/07/04/spring-security-java-config-preview-method-security/ and there's more in its reference doc.
I could try to think if there's something what we could do to make this easier with Spring Security.
Parallel machines are on my todo list. It is a big topic so takes while to implement. Follow https://github.com/spring-projects/spring-statemachine/issues/35 and other related tickets. That issue is a foundation of making distributed state machines.

Hadoop map/reduce chaining

I want to chain 2 Map/Reduce jobs. I am trying to use JobControl to achieve the same. My problem is -
JobControl needs org.apache.hadoop.mapred.jobcontrol.Job which in turn needs org.apache.hadoop.mapred.JobConf which is deprecated. How do I get around this problem to chain my Map/Reduce?
Anyone has any better ideas for chaining (other than Cascading).
You could use Riffle, it allows you to chain arbitrary processes together (anything you stick its Annotations on).
It has a rudimentary dependency scheduler, so it will order and execute your jobs for you. And it's Apache licensed. Its also on the Conjars repo if you're a maven user.
I'm the author, and wrote it so Mahout and other custom applications would be able to have a common tool that was also compatible with Cascading Flows.
I'm also the author of Cascading. But MapReduceFlow + Cascade in Cascading works quite well for most raw MR job chaining.
Cloudera has a workflow tool called Oozie that can help with this sort of chaining. Might be overkill for just getting one job to run after another.

ETL , Esper or Drools?

The question environment relates to JavaEE, Spring
I am developing a system which can start and stop arbitrary TCP (or other) listeners for incoming messages. There could be a need to authenticate these messages. These messages need to be parsed and stored in some other entities. These entities model which fields they store.
So for example if I have property1 that can have two text fields FillLevel1 and FillLevel2, I could receive messages on TCP which have both fill levels specified in text as F1=100;F2=90
Later I could add another filed say FillLevel3 when I start receiving messages F1=xx;F2=xx;F3=xx. But this is a conscious decision on the part of system modeler.
My question is what do you think is better to use for parsing and storing the message. ETL (using Pantaho, which is used in other system) where you store the raw message and use task executor to consume them one by one and store the transformed messages as per your rules.
One could use Espr or Drools to do the same thing , storing rules and executing them with timer, but I am not sure how dynamic you could get with making rules (they have to be made by end user in a running system and preferably in most user friendly way, ie no scripts or code, only GUI)
The end user should be capable of changing the parse rules. It is also possible that end user might want to change the archived data as well (for example in the above example if a new value of FillLevel is added, one would like to put a FillLevel=-99 in the previous values to make the data consistent).
Please ask for explanations, I have the feeling that I need to revise this question a bit.
Thanks
Well Esper is a great CEP engine, but drools has it's own implementation Drools Fusion which integrates really well with jBpm. That would be a good choice.

Resources