Amazon Elastic Map Reduce Hadoop Jobs - hadoop

I'm new to Amazon Web Services and MapReduce. For an academic project I am processing a large set of images, and I need to detect a particular object in each of them. Afterwards I need a Map whose entries have key = averageRGB and value = BufferedImage of the detected object. I have already written this application single-threaded, and that was not a problem. My questions are: If I write a MapReduce job, can I build the Map mentioned above? If that is possible, can I do something with the Map before the job finishes, so that I get the final results? And one last question: if I upload my sample data to a single folder in an S3 bucket, will Amazon Elastic MapReduce take care of splitting that data across the cluster and parallelizing the processing, or do I have to split the data across the cluster myself?
Excuse my ignorance but I cannot find the right answers on the net.
Thanks

Yes, you can build the map you describe.
In the reducer you receive each key together with all of its values, so you can do further calculations there before the final results are written.
When you upload your data to an S3 bucket, you can use an s3n path as the job input. Also specify an S3 bucket path, again using s3n, for storing the output.
When you provide the input path using s3n, EMR automatically fetches the files onto the EMR nodes, splits them, and distributes them across all nodes. You do not need to do anything for that.
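The map-then-reduce flow in this answer can be sketched in plain Java without Hadoop. This is only an illustration under assumptions: the class `AverageRgbGrouping`, its method names, and the choice to represent a detected object as an array of packed 0xRRGGBB pixels are hypothetical, and the grouping step stands in for what the framework does between the map and reduce phases.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Hadoop or EMR.
class AverageRgbGrouping {

    // "Map" step: compute the key for one detected object.
    // Pixels are packed as 0xRRGGBB; the array is assumed to be non-empty.
    static int averageRgb(int[] pixels) {
        long r = 0, g = 0, b = 0;
        for (int p : pixels) {
            r += (p >> 16) & 0xFF;
            g += (p >> 8) & 0xFF;
            b += p & 0xFF;
        }
        int n = pixels.length;
        return (int) (((r / n) << 16) | ((g / n) << 8) | (b / n));
    }

    // "Shuffle" step: collect all object names that share a key,
    // which is the shape of data a reducer is handed (key plus all its values).
    static Map<Integer, List<String>> groupByAverage(Map<String, int[]> objects) {
        Map<Integer, List<String>> grouped = new TreeMap<>();
        for (Map.Entry<String, int[]> e : objects.entrySet()) {
            grouped.computeIfAbsent(averageRgb(e.getValue()), k -> new ArrayList<>())
                   .add(e.getKey());
        }
        return grouped;
    }
}
```

In a real job the grouping would be done by the framework, and the values would have to be serializable by Hadoop rather than plain Java objects.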

Related

Apache NiFi: Difference between the flowfile state and State Management

From what I've read here and there, the flowfile repository serves as a Write Ahead Log for Apache NiFi.
While going through the configuration files, I saw that there is a state-management configuration section. In standalone mode, a local-provider is used, and it writes the state (by default) to .state/local/.
It seems that both the flowfile repo and the state are used, for example, to recover from a system failure.
Would someone please explain the difference between them? Do they work together?
Also, it is a best practice to keep the flowfile repo and the content repo on two separate disks. What about the local state? Should we avoid the "boot" disk and offload the state to another one? If so, which one: a dedicated disk, or one shared with another repository (I'm co-locating the database and flowfile repos)?
Thanks.
The flow file repository keeps track of all the flow files in the system, which content they point to, which attributes they have, and where they are in the flow.
State Management is an API provided to processors/services that can be used to store and retrieve key/value pairs, typically for remembering where something left off. For example, a source processor that pulls data since some timestamp would want to store the last timestamp it used so that if NiFi restarts it can retrieve this value and start from there again.
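The "remember where something left off" idea can be sketched in plain Java. This is not NiFi's State Management API: the class `IncrementalSource`, the key name `last.timestamp`, and the `Map<String, String>` standing in for the persisted key/value state are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical source that only returns records newer than the last one it saw.
class IncrementalSource {
    static final String LAST_TS_KEY = "last.timestamp"; // hypothetical state key

    // Stand-in for a key/value state store that survives restarts.
    private final Map<String, String> state;

    IncrementalSource(Map<String, String> state) {
        this.state = state;
    }

    // Returns the timestamps newer than the stored one, then stores the newest seen.
    List<Long> pull(List<Long> available) {
        long last = Long.parseLong(state.getOrDefault(LAST_TS_KEY, "0"));
        long newest = last;
        List<Long> fresh = new ArrayList<>();
        for (long ts : available) {
            if (ts > last) {
                fresh.add(ts);
                newest = Math.max(newest, ts);
            }
        }
        state.put(LAST_TS_KEY, Long.toString(newest));
        return fresh;
    }
}
```

Creating a second `IncrementalSource` over the same state map plays the role of a restart: it resumes from the stored timestamp instead of re-reading everything.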

Apache Flink relating/caching data options

This is a very broad question. I'm new to Flink and am looking into the possibility of using it as a replacement for a current analytics engine.
The scenario is that data is collected from various equipment and received as a JSON-encoded string with the format {"location.attribute":value, "TimeStamp":value}
For example, a unitary traceability code is received for a location, after which various process parameters are received in a real-time stream. The analysis is to be run over the process parameters, but the output needs to include a relation to the traceability code. For example {"location.alarm":value, "location.traceability":value, "TimeStamp":value}
What method does Flink use for caching values, in this case the current traceability code, while running analysis over other parameters received at a later time?
I'm mainly just looking for the area to research, as so far I've been unable to find any examples of this kind of scenario. Perhaps it's not the kind of process that Flink can handle.
A natural way to do this sort of thing with Flink would be to key the stream by the location, and then use keyed state in a ProcessFunction (or RichFlatMapFunction) to store the partial results until ready to emit the output.
With a keyed stream, you are guaranteed that every event with the same key will be processed by the same instance. You can then use keyed state, which is effectively a sharded key/value store, to store per-key information.
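The keyed-state idea can be sketched in plain Java. This does not use Flink's API: the classes `Event` and `TraceabilityEnricher`, the output string format, and the `HashMap` that stands in for per-key state are all assumptions made for illustration. In Flink the map lookup would be replaced by keyed state scoped to the current key.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical event shape: "location.attribute" split into two fields.
final class Event {
    final String location;
    final String attribute;
    final String value;
    final long timestamp;

    Event(String location, String attribute, String value, long timestamp) {
        this.location = location;
        this.attribute = attribute;
        this.value = value;
        this.timestamp = timestamp;
    }
}

class TraceabilityEnricher {
    // Stand-in for keyed state: the latest traceability code per location.
    private final Map<String, String> traceabilityByLocation = new HashMap<>();

    // Handles one event; returns zero or one enriched output lines.
    List<String> process(Event e) {
        List<String> out = new ArrayList<>();
        if (e.attribute.equals("traceability")) {
            // Remember the code for this location; nothing is emitted yet.
            traceabilityByLocation.put(e.location, e.value);
        } else {
            String code = traceabilityByLocation.get(e.location);
            if (code != null) {
                out.add(e.location + "." + e.attribute + "=" + e.value
                        + " traceability=" + code + " ts=" + e.timestamp);
            }
        }
        return out;
    }
}
```

Events that arrive for a location before any traceability code has been seen are dropped in this sketch; buffering them instead would be a design choice for the real job.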
The Apache Flink training includes some explanatory material on keyed streams and working with keyed state, as well as an exercise or two that explore how to use these mechanisms to do roughly what you need.
Alternatively, you could do this with the Table or SQL API, and implement this as a join of the stream with itself.

Nutch as stand-by spider with custom processing pipelines

I would like to use Apache Nutch as a spider that only fetches a given list of URLs (no crawling). The URLs are going to be stored in Redis, and I want Nutch to constantly pop them from the list and fetch the HTML. The spider needs to be in stand-by mode: it always waits for new URLs coming into Redis until the user decides to stop the job. Also, I would like to apply my own processing pipelines to the extracted HTML files (not only text extraction). Is this possible with Nutch?
StormCrawler would be a much better fit for achieving this. It was designed to cater for scenarios like the one you describe. You'd need to write a custom spout to connect to Redis, reuse the fetcher and parser bolts, and then add bolts with your own processing. Some of SC's early users were doing exactly that.

How do I input an array to a MapReduce job?

I have a service that is continuously retrieving some data. I am dumping this data into an array, and the data has to be processed further. Is it possible to create a dynamic array that keeps getting updated by the service while I run the MapReduce job alongside it?
Also, what class do I use to simply take an array as input (instead of a file)?
PS: I'm new to Hadoop/MapReduce.
I'm coding in Java.
Hadoop is for batch processing, so it is powerful only when you have stored data, such as files, that needs to be processed by a job that then finishes. You might have a look at Storm. I think it will suit your use case better.

How to pull data in the Map/Reduce functions?

According to Hadoop: The Definitive Guide:
The new API supports both a “push” and a “pull” style of iteration. In both APIs, key-value record pairs are pushed to the mapper, but in addition, the new API allows a mapper to pull records from within the map() method. The same goes for the reducer. An example of how the “pull” style can be useful is processing records in batches, rather than one by one.
Has anyone pulled data in the Map/Reduce functions? I am interested in the API or an example of it.
I posted a query to mapreduce-user@hadoop.apache.org and got the answer.
The next key value pair can be retrieved from the context object which is passed to the map, by calling nextKeyValue() on it. So you will be able to pull the next data from it in the new API.
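The batch-processing use of "pull" can be sketched in plain Java. This does not depend on Hadoop: the interface `RecordSource` is a hypothetical stand-in that borrows the method names `nextKeyValue()`, `getCurrentKey()` and `getCurrentValue()` from the new API's context object, and `ListRecordSource` and `BatchPuller` are likewise made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the record-access methods of the context object.
interface RecordSource {
    boolean nextKeyValue();   // advance to the next record; false when exhausted
    long getCurrentKey();
    String getCurrentValue();
}

// Simple in-memory source so the sketch can run without a cluster.
class ListRecordSource implements RecordSource {
    private final List<String> lines;
    private int index = -1;

    ListRecordSource(List<String> lines) {
        this.lines = lines;
    }

    public boolean nextKeyValue() {
        index++;
        return index < lines.size();
    }

    public long getCurrentKey() {
        return index;
    }

    public String getCurrentValue() {
        return lines.get(index);
    }
}

class BatchPuller {
    // Pulls records itself and processes them batchSize at a time,
    // instead of being handed one record per call.
    static List<String> run(RecordSource source, int batchSize) {
        List<String> batches = new ArrayList<>();
        List<String> current = new ArrayList<>();
        while (source.nextKeyValue()) {
            current.add(source.getCurrentValue());
            if (current.size() == batchSize) {
                batches.add(String.join(",", current));
                current.clear();
            }
        }
        if (!current.isEmpty()) {
            batches.add(String.join(",", current)); // final partial batch
        }
        return batches;
    }
}
```

In an actual Hadoop job, a loop of this shape would presumably live in an overridden `run(Context)` method of the mapper, with the context object taking the place of `RecordSource`.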
Is the performance of pull better than push in this scenario? Also, what are the scenarios in which the pull will be useful?
