How do I input an array to a MapReduce job? - hadoop

I have a service that is continuously retrieving some data. I am dumping this data into an array, and this data has to be further processed. Is it possible to create a dynamic array that keeps getting updated by the service, while I execute the MapReduce job on it at the same time?
Also, what class do I use to simply take an array as input (instead of a file)?
PS: I'm new to Hadoop/MapReduce.
I'm coding in Java.

Hadoop is built for batch processing, so it is really only a good fit when you have data already stored (for example in files) that needs to be processed, and the job then finishes. For data that keeps arriving continuously, you might have a look at Storm. I think it will suit your use case better.

Related

Apache Flink relating/caching data options

This is a very broad question, I’m new to Flink and looking into the possibility of using it as a replacement for a current analytics engine.
The scenario is: data is collected from various pieces of equipment and received as a JSON-encoded string with the format {“location.attribute”:value, “TimeStamp”:value}.
For example, a unitary traceability code is received for a location, after which various process parameters are received in a real-time stream. The analysis is to be run over the process parameters, but the output needs to include a relation to the traceability code, for example {“location.alarm”:value, “location.traceability”:value, “TimeStamp”:value}.
What method does Flink use for caching values, in this case the current traceability code whilst running analysis over other parameters received at a later time?
I’m mainly just looking for the area to research, as so far I’ve been unable to find any examples of this kind of scenario. Perhaps it’s not the kind of process that Flink can handle.
A natural way to do this sort of thing with Flink would be to key the stream by the location, and then use keyed state in a ProcessFunction (or RichFlatMapFunction) to store the partial results until ready to emit the output.
With a keyed stream, you are guaranteed that every event with the same key will be processed by the same instance. You can then use keyed state, which is effectively a sharded key/value store, to store per-key information.
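For illustration, here is a minimal sketch of that pattern in Java, assuming the JSON has already been parsed into a per-location event type (Event, EnrichedEvent and their accessors are hypothetical names invented for this example):

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Keyed by location: keeps the latest traceability code per location in keyed state
// and attaches it to every later process-parameter event.
public class TraceabilityEnricher
        extends KeyedProcessFunction<String, Event, EnrichedEvent> {

    private transient ValueState<String> traceabilityCode;

    @Override
    public void open(Configuration parameters) {
        traceabilityCode = getRuntimeContext().getState(
                new ValueStateDescriptor<>("traceabilityCode", String.class));
    }

    @Override
    public void processElement(Event event, Context ctx, Collector<EnrichedEvent> out)
            throws Exception {
        if (event.isTraceabilityCode()) {
            traceabilityCode.update(event.getValue());   // remember the current code
        } else {
            out.collect(new EnrichedEvent(event, traceabilityCode.value()));
        }
    }
}

You would apply it to the keyed stream, e.g. stream.keyBy(Event::getLocation).process(new TraceabilityEnricher()).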
The Apache Flink training includes some explanatory material on keyed streams and working with keyed state, as well as an exercise or two that explore how to use these mechanisms to do roughly what you need.
Alternatively, you could do this with the Table or SQL API, and implement this as a join of the stream with itself.

Network shuffle in streaming

So, keyBy or groupBy causes a network shuffle that repartitions the stream. It is said to be pretty expensive, since it involves network communication along with serialization, deserialization, etc.
For an example, if I run the following operators:
map(Mapper1).keyBy(0).map(Mapper2)
with a parallelism of 2, I would get something like this:
Mapper1(1) -\-/- Mapper2(1)
             X
Mapper1(2) -/-\- Mapper2(2)
And in the end all records with the same key within the Mapper1 are assigned to the same partition in Mapper2.
My question is:
I want to know what happens during keyBy or groupBy in streaming. Is every processed element serialized and deserialized by every subtask? How can I compare the cost of keyBy or groupBy with that of another operation?
Also, I am familiar with the concept of partitioner in batch systems, but I am getting a bit confused when I am trying to apply that in streaming.
Thank you!
Apache Flink buffers the outgoing records of a task and then sends them to the next task for processing. setBufferTimeout is a job-level parameter that can be configured via the StreamExecutionEnvironment, and its default value is 100 ms. After this time, the buffers are sent automatically even if they are not full.
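As a small illustration (the concrete values here are arbitrary), the timeout is set on the StreamExecutionEnvironment when building the job:

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class BufferTimeoutExample {
    public static void main(String[] args) {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Default is 100 ms: output buffers are flushed after this timeout
        // even if they are not yet full.
        env.setBufferTimeout(10);    // lower latency, more network overhead

        // env.setBufferTimeout(0);  // flush after every record (lowest latency)
        // env.setBufferTimeout(-1); // flush only when a buffer is full (max throughput)
    }
}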
Also the following links are really helpful to understand the details:
https://flink.apache.org/2019/06/05/flink-network-stack.html
https://flink.apache.org/2019/07/23/flink-network-stack-2.html

How to construct a DStream from continuous RDDs?

I'm reading data from Elasticsearch into Spark every 5 minutes, so there will be an RDD every 5 minutes.
I hope to construct a DStream based on these RDDs, so that I can get a report for the data within the last 1 day, the last 1 hour, the last 5 minutes, and so on.
To construct the DStream, I was thinking about creating my own receiver, but the official Spark documentation only shows how to do that in Scala or Java, and I use Python.
So do you know any way to do it? I know it should be possible; after all, a DStream is a series of RDDs, so of course we should be able to create a DStream from a continuous series of RDDs. I just do not know how. Please give some advice.
Writing your own receiver would be one way, as you mentioned, but it seems like a lot of overhead. What you can do instead is use queueStream, which creates a QueueInputDStream, like in this example. It's Scala, but you should be able to do a similar thing in Python:
import scala.collection.mutable.Queue
import org.apache.spark.rdd.RDD
val rddQueue = new Queue[RDD[Map[String, Any]]]()
val inputStream = ssc.queueStream(rddQueue)
Afterwards you simply query your ES instance every X sec/min/h/day/whatever and you put the results into that queue.
With Python I guess it would be something like this:
rddQueue = []
rddQueue.append(es_rdd())  # es_rdd() is a method that returns an RDD from ES
inputStream = ssc.queueStream(rddQueue)
# ... some kind of loop that keeps adding new RDDs to rddQueue
Apparently you need to have something in the queue before you use it inside queueStream (or at least I'm getting exceptions in pyspark if it's empty).
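For completeness, roughly the same pattern with the Java API, seeding the queue before queueStream is called and then refilling it from a loop (fetchFromEs() is a hypothetical helper standing in for the actual Elasticsearch query):

import java.util.Arrays;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class EsQueueStreamExample {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("es-queue-stream");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.minutes(5));

        Queue<JavaRDD<String>> rddQueue = new ConcurrentLinkedQueue<>();
        rddQueue.add(fetchFromEs(jssc));          // seed the queue before creating the stream

        JavaDStream<String> inputStream = jssc.queueStream(rddQueue);
        inputStream.count().print();              // some output operation

        jssc.start();
        while (true) {                            // feeder loop: enqueue a fresh RDD per interval
            rddQueue.add(fetchFromEs(jssc));
            Thread.sleep(5 * 60 * 1000);
        }
    }

    // Hypothetical stand-in for the real Elasticsearch query; returns its result as an RDD.
    private static JavaRDD<String> fetchFromEs(JavaStreamingContext jssc) {
        return jssc.sparkContext().parallelize(Arrays.asList("stub"));
    }
}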
It's not necessary to use receivers. You can directly extend the InputDStream class to implement your Elasticsearch data-pulling logic. It's a better approach not to rely on receivers when your data already benefits from replicated, replayable storage.
See : http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.streaming.dstream.InputDStream
Though, I'm not sure you can easily create InputDStream classes directly from Python.

Amazon Elastic MapReduce Hadoop Jobs

I'm new to Amazon Web Services and MapReduce stuff. My basic problem is that I am trying to build an academic project where, basically, I am processing a large bunch of images and I need to detect a particular object in them. After that I need a Map filled with objects where key = averageRGB and value = BufferedImage of the detected object. I managed to do this application single-threaded and that was not a problem. My questions are: If I make a MapReduce job, can I achieve the Map mentioned earlier? If this is possible, can I use the Map to do something with it before the job finishes, so that I get the final results? And one last question: if I upload my sample data into a single folder in an S3 bucket, will Amazon Elastic MapReduce take care of splitting that data across the cluster and parallelizing the process, or do I have to split the data over the cluster myself?
Excuse my ignorance but I cannot find the right answers on the net.
Thanks
Yes, you can build the Map you mentioned.
In the reducer you again get a key and its values, and there you can do more calculations before the final results are written.
When you upload your data to an S3 bucket, you can use an s3n path for your input, and also specify an s3n path in the bucket to store the output.
When you provide the input path using s3n, EMR will automatically pull the files to the EMR nodes, split them, and distribute them over all the nodes. You need not do anything for that.
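As a rough sketch of a driver for such a job (the bucket paths, mapper and reducer classes here are made-up placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ImageJobDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "image-analysis");
        job.setJarByClass(ImageJobDriver.class);

        // job.setMapperClass(ImageMapper.class);   // your image-processing mapper
        // job.setReducerClass(ImageReducer.class); // your averageRGB reducer
        job.setOutputKeyClass(Text.class);          // placeholder output types
        job.setOutputValueClass(Text.class);

        // EMR reads the input splits straight from S3 and distributes them
        // across the cluster nodes; no manual splitting is needed.
        FileInputFormat.addInputPath(job, new Path("s3n://my-bucket/images/"));
        FileOutputFormat.setOutputPath(job, new Path("s3n://my-bucket/output/"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}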

How to pull data in the Map/Reduce functions?

According to Hadoop: The Definitive Guide:
The new API supports both a “push” and a “pull” style of iteration. In both APIs, key-value record pairs are pushed to the mapper, but in addition, the new API allows a mapper to pull records from within the map() method. The same goes for the reducer. An example of how the “pull” style can be useful is processing records in batches, rather than one by one.
Has anyone pulled data in the Map/Reduce functions? I am interested in the API or example for the same.
I posted a query on mapreduce-user@hadoop.apache.org and got the answer.
The next key-value pair can be retrieved from the context object that is passed to map(), by calling nextKeyValue() on it. So in the new API you are able to pull the next record from the context.
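For illustration, a minimal sketch of one way to use this pull style: overriding run() on a Mapper and consuming records in batches via context.nextKeyValue() (the batch size and processBatch() logic are arbitrary placeholders):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class BatchPullMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

    private static final int BATCH_SIZE = 100;

    @Override
    public void run(Context context) throws IOException, InterruptedException {
        setup(context);
        List<String> batch = new ArrayList<>(BATCH_SIZE);
        // Pull records from the context instead of having map() pushed one record at a time.
        while (context.nextKeyValue()) {
            batch.add(context.getCurrentValue().toString());
            if (batch.size() == BATCH_SIZE) {
                processBatch(batch, context);
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            processBatch(batch, context);
        }
        cleanup(context);
    }

    // Placeholder batch processing: emits each record with a count of 1.
    private void processBatch(List<String> batch, Context context)
            throws IOException, InterruptedException {
        for (String value : batch) {
            context.write(new Text(value), new LongWritable(1));
        }
    }
}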
Is the performance of pull better than push in this scenario? Also, what are the scenarios in which the pull will be useful?
