How to pull data in the Map/Reduce functions? - hadoop

According to Hadoop: The Definitive Guide:
The new API supports both a “push” and a “pull” style of iteration. In both APIs, key-value record pairs are pushed to the mapper, but in addition, the new API allows a mapper to pull records from within the map() method. The same goes for the reducer. An example of how the “pull” style can be useful is processing records in batches, rather than one by one.
Has anyone pulled data in the Map/Reduce functions? I am interested in the API or an example of doing this.

I posted a query to mapreduce-user@hadoop.apache.org and got an answer.
The next key-value pair can be retrieved from the Context object that is passed to the map() method, by calling nextKeyValue() on it. So in the new API you are able to pull the next record from it yourself.
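To make the pull style concrete, here is a minimal sketch (my own illustration, not from the book) of a mapper in the new org.apache.hadoop.mapreduce API that overrides run() and pulls records from the Context in batches; BatchingMapper, BATCH_SIZE and processBatch() are names made up for this example:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class BatchingMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final int BATCH_SIZE = 100; // arbitrary batch size for illustration

    @Override
    public void run(Context context) throws IOException, InterruptedException {
        setup(context);
        List<Text> batch = new ArrayList<>();
        // nextKeyValue() advances to the next record and returns false at end of input
        while (context.nextKeyValue()) {
            // copy the value because Hadoop reuses the Writable object between records
            batch.add(new Text(context.getCurrentValue()));
            if (batch.size() == BATCH_SIZE) {
                processBatch(batch, context);
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            processBatch(batch, context);
        }
        cleanup(context);
    }

    // hypothetical helper: here it just emits the size of each batch
    private void processBatch(List<Text> batch, Context context)
            throws IOException, InterruptedException {
        context.write(new Text("batch"), new IntWritable(batch.size()));
    }
}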
Is the performance of pull better than push in this scenario? Also, what are the scenarios in which pull is useful?

Related

Apache Flink relating/caching data options

This is a very broad question, I’m new to Flink and looking into the possibility of using it as a replacement for a current analytics engine.
The scenario is: data is collected from various pieces of equipment and received as a JSON-encoded string with the format {“location.attribute”: value, “TimeStamp”: value}.
For example, a unitary traceability code is received for a location, after which various process parameters are received in a real-time stream. The analysis is to be run over the process parameters, but the output needs to include a reference to the traceability code, for example {“location.alarm”: value, “location.traceability”: value, “TimeStamp”: value}.
What method does Flink use for caching values, in this case the current traceability code whilst running analysis over other parameters received at a later time?
I’m mainly just looking for the area to research, as so far I’ve been unable to find any examples of this kind of scenario. Perhaps it’s not the kind of process that Flink can handle.
A natural way to do this sort of thing with Flink would be to key the stream by the location, and then use keyed state in a ProcessFunction (or RichFlatMapFunction) to store the partial results until ready to emit the output.
With a keyed stream, you are guaranteed that every event with the same key will be processed by the same instance. You can then use keyed state, which is effectively a sharded key/value store, to store per-key information.
The Apache Flink training includes some explanatory material on keyed streams and working with keyed state, as well as an exercise or two that explore how to use these mechanisms to do roughly what you need.
Alternatively, you could do this with the Table or SQL API, implementing it as a join of the stream with itself.
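As an illustration only (the Event record and the string output below are simplified stand-ins for your JSON format), a keyed process function along these lines could hold the latest traceability code per location and attach it to the other parameters as they arrive:

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// minimal stand-in for one decoded {"location.attribute": value, "TimeStamp": value} record
class Event {
    public String location;
    public String attribute;   // e.g. "alarm" or "traceability"
    public String value;
    public long timeStamp;
}

public class TraceabilityEnricher extends KeyedProcessFunction<String, Event, String> {

    // keyed state: the current traceability code for this location
    private transient ValueState<String> traceabilityCode;

    @Override
    public void open(Configuration parameters) {
        traceabilityCode = getRuntimeContext().getState(
                new ValueStateDescriptor<>("traceability", String.class));
    }

    @Override
    public void processElement(Event event, Context ctx, Collector<String> out) throws Exception {
        if ("traceability".equals(event.attribute)) {
            // remember the latest code received for this location
            traceabilityCode.update(event.value);
        } else {
            // attach the stored code to every other parameter/alarm for the same location
            out.collect(event.location + "." + event.attribute + "=" + event.value
                    + ", traceability=" + traceabilityCode.value()
                    + ", TimeStamp=" + event.timeStamp);
        }
    }
}

The stream would be keyed first, e.g. stream.keyBy(e -> e.location).process(new TraceabilityEnricher()), so each parallel instance only sees the state for its own locations.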

How does the Context object work in Hadoop? [duplicate]

What exactly is the Context object in the Hadoop MapReduce world, in terms of the new API?
It's extensively used to write output pairs out of Maps and Reduces, but I am not sure whether it can be used anywhere else, or what exactly is happening whenever I use the context. Is it an Iterator with a different name?
What is the relation between the classes Mapper.Context, Reducer.Context, and Job.Context?
Can someone please explain this, starting in layman's terms and then going into detail? I am not able to understand much from the Hadoop API documentation.
Thanks for your time and help.
The Context object allows the Mapper/Reducer to interact with the rest of the Hadoop system. It includes configuration data for the job as well as interfaces that allow it to emit output.
Applications can use the Context:
to report progress
to set application-level status messages
to update Counters
to indicate that they are alive
to get values stored in the job configuration across the map/reduce phases
The new API makes extensive use of Context objects that allow the user code to communicate with the MapReduce system.
It unifies the roles of JobConf, OutputCollector, and Reporter from the old API.
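For example, here is a small word-count-style mapper that touches most of these Context facilities (the configuration key and counter names are made up for the example):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // read a job-configuration value set by the driver (Configuration replaces the old JobConf)
        String separator = context.getConfiguration().get("wordcount.separator", " ");

        for (String word : value.toString().split(separator)) {
            // emit an output pair (replaces OutputCollector.collect from the old API)
            context.write(new Text(word), ONE);
            // update a counter (one of the roles of the old Reporter)
            context.getCounter("wordcount", "words.emitted").increment(1);
        }
        // report progress / status so the framework knows the task is alive
        context.setStatus("processed offset " + key.get());
    }
}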

Aggregating processor or aggregating reader

I have a requirement like this: I read items from a DB, if possible in a paging way, where a page of items corresponds to the later "batch size". I do some processing steps, like filtering, and then I want to accumulate the items and send them to a REST service in batches, e.g. n of them at once instead of one by one.
Parallelising it at the step level is what I am doing, but I am not sure how to get the batching to work. Do I need to implement a reader that returns a list and a processor that receives a list? If so, I have read that you will not get a proper count of the number of items processed.
I am trying to find the most appropriate Spring Batch way to do it instead of hacking a fix. I also assume that I need to keep state in the reader, and I wondered if there is a better way to avoid that.
You cannot have something like an aggregating processor. Every single item that is read is processed as a single item.
However, you can implement a reader that groups items and forwards them as a whole group. To get an idea of how this could be done, have a look at my answer to the question Spring Batch Processor, or Dean Clark's answer to Spring Batch - How to process multiple records at the same time in the processor?.
Both use Spring Batch's SingleItemPeekableItemReader.
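To give a rough idea of the shape of such a reader (a sketch only; AggregatingItemReader and the group size are placeholders, and the linked answers show more complete variants), you could wrap your real reader in a SingleItemPeekableItemReader and return whole lists:

import java.util.ArrayList;
import java.util.List;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.support.SingleItemPeekableItemReader;

// Groups single items from a delegate reader into lists of a fixed size,
// so that one "item" seen by the processor/writer is a whole batch.
public class AggregatingItemReader<T> implements ItemReader<List<T>> {

    private final SingleItemPeekableItemReader<T> delegate = new SingleItemPeekableItemReader<>();
    private final int groupSize;

    public AggregatingItemReader(ItemReader<T> reader, int groupSize) {
        this.delegate.setDelegate(reader);
        this.groupSize = groupSize;
    }

    @Override
    public List<T> read() throws Exception {
        List<T> group = new ArrayList<>();
        // pull up to groupSize items; peek() lets us stop cleanly at the end of the input
        while (group.size() < groupSize && delegate.peek() != null) {
            group.add(delegate.read());
        }
        // returning null signals the end of the input to Spring Batch
        return group.isEmpty() ? null : group;
    }
}

The writer (or a processor) then receives one List per "item", which it can send to the REST service in a single call; note that the step's item count will then count groups rather than individual records.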

How to construct DStream from continued RDDs?

I'm reading data from Elasticsearch into Spark every 5 minutes, so there will be an RDD every 5 minutes.
I hope to construct a DStream based on these RDDs, so that I can get reports for data within the last day, last hour, last 5 minutes, and so on.
To construct the DStream, I was thinking about creating my own receiver, but the official Spark documentation only shows how to do this in Scala or Java, and I use Python.
So do you know any way to do it? I know it should be possible; after all, a DStream is a series of RDDs, so we should be able to create a DStream from a continuing series of RDDs. I just do not know how. Please give some advice.
Writing your own receiver would be one way, as you mentioned, but it seems like a lot of overhead. What you can do instead is use a QueueReceiver, which creates a QueueInputDStream, as in this example. It's Scala, but you should be able to do a similar thing in Python:
val rddQueue = new Queue[RDD[Map[String, Any]]]()
val inputStream = ssc.queueStream(rddQueue)
Afterwards you simply query your ES instance every X sec/min/h/day/whatever and you put the results into that queue.
With Python I guess it would be something like this:
rddQueue = []
rddQueue.append(es_rdd())  # method that returns an RDD from ES
inputStream = ssc.queueStream(rddQueue)
# some kind of loop that keeps appending new RDDs to rddQueue
Apparently you need to have something in the queue before you use it inside queueStream (or at least I'm getting exceptions in pyspark if it's empty).
It's not necessary to use receivers. You can directly subclass InputDStream to implement your Elasticsearch data-pulling logic. It's a better approach not to rely on receivers when your data already benefits from replicated and replayable storage.
See : http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.streaming.dstream.InputDStream
Though, I'm not sure you can easily create InputDStream subclasses directly from Python.

Parse.com. Execute backend code before response

I need to know the relative position of an object in a list. Let's say I need to know the position of a certain wine among all the wines added to the database, based on the votes received from users. The app should be able to receive the ranking position as an object property when retrieving a "wine" class object.
This should be easy to do on the backend side, but I've looked at Cloud Code and it seems it is only able to execute code before or after saving or deleting, not before reading and returning a response.
Is there any way to do this? Any workaround?
Thanks.
I think you would have to write a Cloud function to perform this calculation for a particular wine.
https://www.parse.com/docs/cloud_code_guide#functions
This would be a function you would call manually. You would have to provide the "wine" object or objectId as a parameter and then have your Cloud function return the value you need. Keep in mind there are limitations on Cloud functions; read the documentation about time limits. You also don't want to make too many API calls every time you run this. It sounds like your computation could be fairly heavy if your dataset is large and you aren't caching at least some of the information.
