How to use common data in MapReduce? - hadoop

I want to load data into memory and have each Mapper use that data.
How do I do it?
Should I just use the setup method in Mapper?
Then, will each Mapper be able to use the common data once it is loaded?

Yes, that is exactly the way to go.
You read the data in setup() and keep it in your task's memory.
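A minimal sketch of that pattern (the file name, the tab-separated format, and the class names here are hypothetical): the lookup data is read once per map task in setup() and is then available to every map() call of that task.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LookupMapper extends Mapper<LongWritable, Text, Text, Text> {

    // Shared lookup data, loaded once per map task.
    private final Map<String, String> lookup = new HashMap<>();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Hypothetical local file, e.g. shipped to the task via the distributed cache.
        try (BufferedReader reader = new BufferedReader(new FileReader("lookup.txt"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split("\t", 2);
                if (parts.length == 2) {
                    lookup.put(parts[0], parts[1]);
                }
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Every record processed by this task can now use the in-memory lookup.
        String enriched = lookup.getOrDefault(value.toString(), "UNKNOWN");
        context.write(value, new Text(enriched));
    }
}
```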

Related

How to save a Kedro dataset in Azure and still have it in memory

I want to save a Kedro memory dataset in Azure as a file and still have it in memory, as my pipeline will be using it later. Is this possible in Kedro? I tried looking at transcoding datasets, but it looks like that is not possible. Is there any other way to achieve this?
This may be a good opportunity to use CachedDataSet. It allows you to wrap any other dataset and, once that dataset has been read into memory, make it available to downstream nodes without re-performing the IO operations.
I would try explicitly saving the dataset to Azure as part of your node logic, i.e. with catalog.save(). Then you can feed the dataset to downstream nodes in memory using the standard node inputs and outputs.

Camel Bindy: parallel writing

I have a collection of objects that I would like to serialize into the same CSV file.
What is the fastest way to write these objects into the same file ?
Is using parallelProcessing() a safe approach ?
I'd rather implement it with two routes: one with parallel processing that collects the data, and a second one that writes the collected data into the CSV. The latter will of course not be parallel.
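A sketch of that two-route layout in the Java DSL (the endpoint URIs, the Order record, and the processing step are all hypothetical): the first route fans the work out with parallelProcessing(), while the second route is the only one that touches the file, so the CSV writes stay single-threaded.

```java
import org.apache.camel.builder.RouteBuilder;
import org.apache.camel.dataformat.bindy.annotation.CsvRecord;
import org.apache.camel.dataformat.bindy.annotation.DataField;
import org.apache.camel.model.dataformat.BindyType;

// Hypothetical Bindy record (normally kept in its own file).
@CsvRecord(separator = ",")
class Order {
    @DataField(pos = 1)
    private String id;

    @DataField(pos = 2)
    private String product;
}

public class CsvRoutes extends RouteBuilder {

    @Override
    public void configure() {
        // Route 1: transform each object in parallel, but do not write the file here.
        from("direct:process")
            .split(body()).parallelProcessing()
                .process(exchange -> {
                    // hypothetical per-object enrichment
                })
                .to("seda:collected")
            .end();

        // Route 2: a single SEDA consumer, so only one thread appends to the CSV file.
        from("seda:collected")
            .marshal().bindy(BindyType.Csv, Order.class)
            .to("file:target/out?fileName=orders.csv&fileExist=Append");
    }
}
```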

Key Value in Map Reduce

I am just a beginner with the Hadoop framework. I would like to understand a few concepts here; I have browsed many links, but I would like to get clear answers.
1) Why does MapReduce work only with key-value pairs? I also read that I can create a MapReduce job without actually using reduce.
2) The key for the input of the map phase is the file-offset key. Can I use an explicit key-value? Or a custom input?
Good, you are digging into Hadoop concepts.
1) Can I use an explicit key-value or a custom input?: Yes, write your own RecordReader (overriding the default) to do so.
2) Why does MapReduce work only with key-value pairs?:
MapReduce, as the name suggests, maps (filters) the required data from the data set fed to the program and then reduces it (combines values by unique key).
Now, why key-value pairs? Since you are processing unstructured data, you usually do not want the same data back as output; some manipulation of the data is needed. Think of using a Map in Java: it lets you uniquely identify a pair, and Hadoop does the same with the help of sort & shuffle.
Can I create a MapReduce job without actually using reduce?:
Of course. It completely depends on your use case, but it is recommended only for small operations and for scenarios where the mapper outputs do not need to be combined to produce the expected output (a minimal map-only driver sketch follows this answer).
Reason: this is where the distributed concept and commodity hardware take priority. For example, say I have a large data set to process. While processing that data set with a plain Java program (just Java, not Hadoop), we store the required data in Collection objects (which simply uses RAM). Hadoop does the same job in a different fashion: it stores the required data in the context. In the mapper, the context refers to intermediate data (local filesystem); in the reducer, it refers to the output (HDFS). In both cases the context is, of course, stored on disk.
Hadoop thus lets the work happen against the disk instead of relying on RAM alone.
I suggest reading Hadoop: The Definitive Guide and the Data Algorithms book for a better understanding.
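As mentioned above, a map-only job just needs the number of reduce tasks set to zero; the mapper output is then written straight to the output path and no sort/shuffle runs. A minimal driver sketch (the CSV-like input layout and class names are hypothetical) that also shows a mapper emitting its own key instead of the file-offset key:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapOnlyDriver {

    // Hypothetical mapper: ignores the byte-offset key and emits its own key
    // (the first comma-separated field of each line).
    public static class FirstFieldMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split(",", 2);
            if (fields.length == 2) {
                context.write(new Text(fields[0]), new Text(fields[1]));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "map-only example");
        job.setJarByClass(MapOnlyDriver.class);
        job.setMapperClass(FirstFieldMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // Zero reducers: mapper output goes straight to the output path, no sort/shuffle.
        job.setNumReduceTasks(0);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```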

Is it possible to obtain objects from distributed cache in a Hadoop Partitioner?

Hadoop's Distributed Cache lets the developer add small files to the MR context, which can be used to obtain additional information during the Map or Reduce phases. However, I did not find a way to access this cache in a Partitioner. I need the contents of a small file (the output of an earlier MR job) in a custom Partitioner to determine how the keys are sent to the reducers.
Unfortunately, I cannot find any useful documentation on this, and my only idea at the moment is a somewhat "hackish" approach: serializing the contents of the file to a Base64 string and putting it into the Configuration. Configurations can be used in a partitioner by letting it implement Configurable. While the file is small enough for this approach (around 50 KB), I suppose the distributed cache is better suited for this.
EDIT:
I found another approach which I consider slightly better. Since the file I need to access in the partitioner is on HDFS, I put its fully-qualified URI into the Configuration. In my Partitioner's setConf method I can then re-create the Path via new Path(new URI(conf.get("some.file.key"))) and read it with the help of the Configuration. Still hackish though...
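A sketch of that setConf approach (the configuration key "some.file.key" comes from the question; the tab-separated file format and field names are hypothetical): the partitioner implements Configurable, loads the HDFS file when Hadoop injects the Configuration, and uses the resulting map in getPartition().

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.net.URISyntaxException;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configurable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class LookupPartitioner extends Partitioner<Text, Text> implements Configurable {

    private Configuration conf;
    private final Map<String, Integer> keyToPartition = new HashMap<>();

    @Override
    public void setConf(Configuration conf) {
        this.conf = conf;
        try {
            // Fully-qualified HDFS URI stored in the job Configuration.
            Path path = new Path(new URI(conf.get("some.file.key")));
            FileSystem fs = path.getFileSystem(conf);
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    // Hypothetical line format: <key>\t<partition>
                    String[] parts = line.split("\t");
                    if (parts.length == 2) {
                        keyToPartition.put(parts[0], Integer.parseInt(parts[1]));
                    }
                }
            }
        } catch (IOException | URISyntaxException e) {
            throw new RuntimeException("Could not load partition lookup file", e);
        }
    }

    @Override
    public Configuration getConf() {
        return conf;
    }

    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        Integer p = keyToPartition.get(key.toString());
        // Fall back to hash partitioning for keys not present in the lookup file.
        return (p != null ? p : key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```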

Extending Hive: writing a UDF that does both Map and Reduce operations

I am working on a project to extend Hive to support some image processing functions.
To do this, we need to read in an image, break it up into multiple files, pass each into a separate Map task that does some processing on it and then reduce them back into one image to be returned to the user.
To do this, we had planned to implement a UDF that would call a MapReduce task in Hadoop. However, from what we understand the UDF would only operate either on the Map side OR the Reduce side of the HQL query, while we need it to ideally 'bridge the gap' between the Map and the Reduce side.
The Hive documentation isn't the most helpful, and I was looking for some pointers on where to start looking for more information about this.
Please feel free to ask more questions if I haven't been clear enough in the question.
Looking into HIPI (Hadoop Image Processing Interface) might give you a start.
Particularly, the example on computing the Principal Components of a bunch of images might be of interest.
Use a UDAF (User Defined Aggregate Function), which has something like a map and a reduce phase.
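A skeleton of the classic UDAF pattern (the class is a hypothetical row counter; a real image-processing aggregate would carry image fragments in its partial state, and newer Hive versions favour GenericUDAF): iterate() and terminatePartial() run on the map side, merge() and terminate() on the reduce side.

```java
import org.apache.hadoop.hive.ql.exec.UDAF;
import org.apache.hadoop.hive.ql.exec.UDAFEvaluator;

public class ExampleCountUDAF extends UDAF {

    public static class Evaluator implements UDAFEvaluator {

        private long count;

        public Evaluator() {
            init();
        }

        // Reset the aggregation state.
        public void init() {
            count = 0;
        }

        // Map side: called once per input row.
        public boolean iterate(Object value) {
            if (value != null) {
                count++;
            }
            return true;
        }

        // Map side: emit the partial aggregate for this group of rows.
        public Long terminatePartial() {
            return count;
        }

        // Reduce side: combine partial aggregates from different mappers.
        public boolean merge(Long other) {
            if (other != null) {
                count += other;
            }
            return true;
        }

        // Reduce side: produce the final result.
        public Long terminate() {
            return count;
        }
    }
}
```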
