Custom Partitioner vs MultipleOutputFormat - hadoop

I am new to MapReduce. I would like to know the difference between creating multiple outputs based on certain conditions with a custom partitioner and doing so with MultipleOutputs.

With a custom partitioner, you send the data to different reducers, and each reducer writes one file containing all the data it processed:
part-r-00001, part-r-00002 . . .
With MultipleOutputs, each reducer can write several files (multiple outputs), each with a custom name:
Tag1-r-00001, Tag2-r-00001, Tag1-r-00002, Tag2-r-00002 . . .
A custom partitioner groups related data together before processing, whereas MultipleOutputs splits the data in the output after processing.
With MultipleOutputs you can identify the data without keeping track of the reducer number. If you later need to increase or decrease the number of reducers (as the data changes), you will still be able to identify the old data by its prefix.
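To make the naming difference concrete, here is a minimal plain-Java sketch that does not depend on Hadoop. The class and method names (OutputNaming, partitionFor, defaultFileName, namedFileName) are made up for illustration. The partition formula is the one used by Hadoop's default HashPartitioner, and the file name patterns follow the examples above.

```java
public class OutputNaming {
    // Same formula Hadoop's default HashPartitioner uses to pick a reducer.
    static int partitionFor(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Default reducer output: one file per reducer, identified only by its number.
    static String defaultFileName(int partition) {
        return String.format("part-r-%05d", partition);
    }

    // MultipleOutputs-style name: the chosen prefix replaces "part".
    static String namedFileName(String tag, int partition) {
        return String.format("%s-r-%05d", tag, partition);
    }

    public static void main(String[] args) {
        int reducers = 3; // assumed reducer count, for illustration only
        int p = partitionFor("Tag1", reducers);
        System.out.println(defaultFileName(p));
        System.out.println(namedFileName("Tag1", p));
    }
}
```

If the reducer count changes, the number in the name may change for a given key, but the prefix chosen for a named output stays the same.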

Related

Hadoop Multi Output maintaining record order emitted from map

I am trying to produce multiple outputs from the reducer in Hadoop. The files are created properly; the problem is that the header and footer of each file do not end up in the proper places (i.e., the ordering of the records emitted from the map is changed). I have one mapper and multiple reducers.
I tried adding an index (an integer) to each map record and removing it from the keys in the reducer, but that gave a "file already exists" exception. I was using a custom comparator to sort the keys by the index values.
Any ideas on what I am missing?

MapReduce filter before reduce

I have a Hadoop MapReduce job that splits documents of different kinds (Places, People, Organisations, Algorithms, etc.). Each document has a tag that identifies its type, plus links to other documents; however, I don't know the type of a linked document until its page is reached in the task.
In the map phase I identify the links and the type of the current page, and then emit the link information and the current document with its tag as values to a single reducer: key NullWritable, value "CurrentDoc::Type::Link".
In the reduce phase all the documents are grouped by type using the "CurrentDoc::Type" part of the values, and a "Document::Link" relation is emitted only for those that belong to certain types.
However, I have a memory issue because this whole final step is performed in a single reducer.
Is there a way to perform a grouping task after the map phase and before the reduce phase, to identify all the documents with their tags and then distribute them to different reducers?
I mean grouping every document/tag pair as "CurrentDoc::Type" in an ArrayWritable of Text, then emitting to the reducers the "CurrentDoc::Link" tuple as key and the ArrayWritable as value, so that the filtering in the reduce phase runs in parallel.
Thanks for your help!
Unfortunately the system does not work the way you expect.
We can't change how the Mapper, Reducer, and Combiner work.
Hadoop allows the user to specify a combiner function to be run on the map output, and the combiner function’s output forms the input to the reduce function. Hadoop gives no guarantee of how many times it will call the combiner for a given map output record. In other words, calling the combiner function zero, one, or many times should produce the same output from the reducer.
A Combiner can't combine data from multiple maps. Let's leave that job to the Reducer.
For your problem:
1) Use a custom Partitioner to decide which reducer should process a specific key (CurrentDoc::Type)
2) The Combiner will combine the data within a single Mapper
3) Output from the Mapper will be routed to a specific Reducer depending on the key's partition (shuffling)
4) The Reducer will combine the data for each key received from the respective Mappers
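The combiner contract can be illustrated with a small plain-Java sketch that uses no Hadoop classes. All names here (CombinerSketch, sum, reduceWithoutCombiner, reduceWithCombiner) are hypothetical, and summing is just an assumed example of an operation that is safe to use as a combiner.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CombinerSketch {
    // Used as both combiner and reducer: addition is associative and commutative.
    static int sum(List<Integer> values) {
        int total = 0;
        for (int v : values) total += v;
        return total;
    }

    // Reducer result when the combiner is never called.
    static int reduceWithoutCombiner(List<List<Integer>> mapOutputs) {
        List<Integer> all = new ArrayList<>();
        for (List<Integer> oneMap : mapOutputs) all.addAll(oneMap);
        return sum(all);
    }

    // Reducer result when the combiner runs once per map task,
    // each time seeing only that task's own output.
    static int reduceWithCombiner(List<List<Integer>> mapOutputs) {
        List<Integer> partials = new ArrayList<>();
        for (List<Integer> oneMap : mapOutputs) partials.add(sum(oneMap));
        return sum(partials);
    }

    public static void main(String[] args) {
        List<List<Integer>> mapOutputs = Arrays.asList(
            Arrays.asList(1, 2, 3),   // values emitted by one map task for a key
            Arrays.asList(4, 5));     // values emitted by another map task
        System.out.println(reduceWithoutCombiner(mapOutputs));
        System.out.println(reduceWithCombiner(mapOutputs));
    }
}
```

Both paths give the same total, which is the property a combiner has to preserve. An operation such as a plain average would not satisfy it without carrying counts along.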
Working code of Partitioner & Combiner
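Independently of the linked code, step 1 can be sketched in plain Java as follows. This is only an assumption about how the key might be split: TypePartitionSketch, typeOf, and partitionFor are hypothetical names, and in a real job the same logic would go inside getPartition(key, value, numPartitions) of a class extending org.apache.hadoop.mapreduce.Partitioner.

```java
public class TypePartitionSketch {
    // Pulls "Type" out of a key shaped like "CurrentDoc::Type".
    // Returns an empty string if the key has no "::" separator.
    static String typeOf(String key) {
        String[] parts = key.split("::");
        return parts.length > 1 ? parts[1] : "";
    }

    // Every key with the same type lands on the same reducer.
    static int partitionFor(String key, int numReducers) {
        return (typeOf(key).hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        int reducers = 4; // assumed reducer count, for illustration only
        System.out.println(partitionFor("DocA::People", reducers));
        System.out.println(partitionFor("DocB::People", reducers));
        System.out.println(partitionFor("DocC::Places", reducers));
    }
}
```

Partitioning on the type alone spreads the work across reducers only as evenly as the types themselves are distributed, so one very common type can still overload a single reducer.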

Merge multiple document categorizer models in OpenNLP

I am trying to write a MapReduce implementation of a Document Categorizer using OpenNLP.
During the training phase, I plan to read a large number of files and create a model file as the result of the MapReduce computation (possibly a chain of jobs). I will distribute the files to different mappers, which would produce a number of model files. I then want to reduce these model files to a single model file to be used for classification.
I understand that this is not the most intuitive of use cases, but I am ready to get my hands dirty and extend or modify the OpenNLP source code, assuming it is possible to tweak the maxent algorithm to work this way.
In case this seems too far-fetched, I would welcome suggestions for doing it the other way round: generate document samples from the input files as the output of the MapReduce step, and reduce them to model files by feeding them to the document categorizer trainer.
Thanks!
I've done this before, and my approach was not to have each reducer produce a model, but only to produce the properly formatted data.
Rather than using the category as the key, which separates all the categories, just use a single key and make the value the proper format (category, sample, newline). Then, in the single reducer, you can read that data in as a string, wrap it in a ByteArrayInputStream, and train the model. Of course this is not the only way. You wouldn't have to modify OpenNLP at all to do this.
Simply put, my recommendation is to use a single job that behaves like this:
Map: read in your data and create a category label and sample pair. Use a key called 'ALL' and context.write each pair with that key.
Reduce: use a StringBuilder to concatenate all the category/sample pairs into the proper training format. Convert the string into a ByteArrayInputStream and feed it to the training API. Write the model somewhere.
A problem may occur if your sample data is too large to send to one node. If so, you can write the values to a NoSQL database and read them in from a beefier training node. Or you can use randomization in your mapper to produce many keys and build many models, then at classification time write a wrapper that tests data across all of them and gets the best from each one. Lots of options.
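The reduce step could look roughly like the following plain-Java sketch, which stops just before the OpenNLP training call. The names TrainingDataSketch, buildTrainingText, and asStream are hypothetical, and the sketch assumes the one-line-per-sample "category sample" layout described above.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

public class TrainingDataSketch {
    // Joins {category, sample} pairs into one "category sample" line each.
    static String buildTrainingText(List<String[]> pairs) {
        StringBuilder sb = new StringBuilder();
        for (String[] pair : pairs) {
            sb.append(pair[0]).append(' ').append(pair[1]).append('\n');
        }
        return sb.toString();
    }

    // Wraps the text so it can be handed to an API that expects an InputStream.
    static InputStream asStream(String trainingText) {
        return new ByteArrayInputStream(trainingText.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample data standing in for the values received under the 'ALL' key.
        List<String[]> pairs = Arrays.asList(
            new String[] {"sports", "the team won the match"},
            new String[] {"politics", "the vote was postponed"});
        String text = buildTrainingText(pairs);
        try (InputStream in = asStream(text)) {
            System.out.println(in.available() + " bytes of training data");
        }
    }
}
```

Everything is held in memory here, which is exactly the limitation the single-reducer approach has when the sample data grows large.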
HTH

Increasing mapper in pig

I am using Pig to load data from Cassandra using CqlStorage. I have 4 data nodes, each of which can run 7 mappers, and there are ~30 million rows in Cassandra. When I run
LOAD 'cql://keyspace/columnfamily' using CqlStorage
it uses 27 mappers.
But if I give a where clause in the load function, like
LOAD 'cql://keyspace/columnfamily?where_clause=id%3D100' using CqlStorage
it always uses one mapper.
Can anyone help me increase the number of mappers?
It looks from your WHERE clause like your map input will only be a single key, which would be the reason why you only get one mapper. Hadoop will allocate mappers based on the number of input keys. If you have only one input key, additional mappers will do nothing.
The bottom line is that if you specify your partition key in the where clause, you will get one mapper (since that's the way it gets distributed). Based on the comments I presume you are doing analysis for more than just one student, so there's no reason you'd be specifying the partition key. You also don't seem to have any columns that make sense for a secondary index. So I'm not sure why you even have a where clause.
It looks from your data model like you'll have to map over all your data to get aggregate marks for a combination of student and time range. It's possible you could change to a time-series data model and successfully filter in the where clause, but your current model doesn't support this.

Using Hadoop to process data from multiple datasources

Do MapReduce and the other Hadoop technologies (HBase, Hive, Pig, etc.) lend themselves well to situations where you have multiple input files and data needs to be compared between the different data sources?
In the past I've written a few MapReduce jobs using Hadoop and Pig. However, these tasks were quite simple since they involved manipulating only a single dataset. The requirements we have now dictate that we read data from multiple sources and compare various data elements against another data source. We then report on the differences. The datasets we are working with are in the region of 10 million to 60 million records, and so far we haven't managed to make these jobs fast enough.
Is there a case for using MapReduce to solve such problems, or am I going down the wrong route?
Any suggestions are much appreciated.
I guess I'd preprocess the different datasets into a common format (being sure to include a "data source" id column with a single unique value for each row coming from the same dataset). Then move the files into the same directory, load the whole dir and treat it as a single data source in which you compare the properties of rows based on their dataset id.
Yes, you can join multiple datasets in a mapreduce job. I would recommend getting a copy of the book/ebook Hadoop In Action which addresses joining data from multiple sources.
When you have multiple input files you can use the MapReduce API method FileInputFormat.addInputPaths(), which takes the job and a comma-separated list of paths, as below:
FileInputFormat.addInputPaths(job, "dir1/file1,dir2/file2,dir3/file3");
You can also pass multiple inputs into a Mapper in Hadoop using the Distributed Cache; more info is described here: multiple input into a Mapper in hadoop
If I am not misunderstanding, you are trying to normalize the structured data in records coming in from several inputs and then process it. Based on this, I think you really need to look at this article, which helped me in the past. It covers how to normalize data using Hadoop/MapReduce, as below:
Step 1: Extract the column value pairs from the original data.
Step 2: Extract column-value Pairs Not In Master ID File
Step 3: Calculate the Maximum ID for Each Column in the Master File
Step 4: Calculate a New ID for the Unmatched Values
Step 5: Merge the New Ids with the Existing Master IDs
Step 6: Replace the Values in the Original Data with IDs
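As a rough single-machine illustration of steps 2 to 6 for one column, here is a plain-Java sketch. It is one reading of the steps rather than the article's code: NormalizeSketch, extendMaster, and replaceWithIds are hypothetical names, and the master ID file is assumed to be a simple value-to-ID map.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class NormalizeSketch {
    // Steps 2-5: return a copy of the master map extended with IDs for unseen values.
    static Map<String, Integer> extendMaster(Map<String, Integer> master, List<String> values) {
        Map<String, Integer> merged = new LinkedHashMap<>(master);
        int maxId = 0;
        for (int id : master.values()) maxId = Math.max(maxId, id); // step 3: maximum ID
        for (String v : values) {
            if (!merged.containsKey(v)) {   // step 2: value not in the master file
                merged.put(v, ++maxId);     // steps 4 and 5: new ID, merged with existing ones
            }
        }
        return merged;
    }

    // Step 6: replace each original value with its ID.
    static List<Integer> replaceWithIds(List<String> values, Map<String, Integer> ids) {
        List<Integer> out = new ArrayList<>();
        for (String v : values) out.add(ids.get(v));
        return out;
    }

    public static void main(String[] args) {
        Map<String, Integer> master = new LinkedHashMap<>();
        master.put("red", 1);
        master.put("blue", 2);
        List<String> column = Arrays.asList("blue", "green", "red", "green"); // step 1 output
        Map<String, Integer> merged = extendMaster(master, column);
        System.out.println(merged);
        System.out.println(replaceWithIds(column, merged));
    }
}
```

In a MapReduce setting each step would typically be its own job, because assigning new IDs needs a global view of the current maximum.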
We can do this using MultipleInputs.
MultipleInputs.addInputPath(job, path1, TextInputFormat.class, Mapper1.class);
MultipleInputs.addInputPath(job, path2, TextInputFormat.class, Mapper2.class);
job.setReducerClass(Reducer1.class);
// FileOutputFormat.setOutputPath(job, ...); set the output path here
If both mappers emit a common key, their records can be joined in the reducer, which can then apply the necessary logic.
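The reducer-side comparison can be sketched without Hadoop as below. This is an illustration under assumptions: JoinSketch, mapSource, and reduceDifferences are hypothetical names, the source tags "A" and "B" are made up, and a sorted in-memory map stands in for the shuffle that groups values by key.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class JoinSketch {
    // Stand-in for one mapper: tags each value with its source under the shared key.
    static void mapSource(String tag, Map<String, String> records,
                          Map<String, List<String>> shuffled) {
        for (Map.Entry<String, String> e : records.entrySet()) {
            shuffled.computeIfAbsent(e.getKey(), k -> new ArrayList<>())
                    .add(tag + ":" + e.getValue());
        }
    }

    // Stand-in for the reducer: reports keys whose values differ or exist on one side only.
    static List<String> reduceDifferences(Map<String, List<String>> shuffled) {
        List<String> report = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : shuffled.entrySet()) {
            String a = null;
            String b = null;
            for (String tagged : e.getValue()) {
                if (tagged.startsWith("A:")) a = tagged.substring(2);
                else if (tagged.startsWith("B:")) b = tagged.substring(2);
            }
            if (a == null || !a.equals(b)) {
                report.add(e.getKey() + " A=" + a + " B=" + b);
            }
        }
        return report;
    }

    public static void main(String[] args) {
        Map<String, String> sourceA = new HashMap<>();
        sourceA.put("k1", "x");
        sourceA.put("k2", "y");
        Map<String, String> sourceB = new HashMap<>();
        sourceB.put("k1", "x");
        sourceB.put("k2", "z");
        sourceB.put("k3", "w");

        Map<String, List<String>> shuffled = new TreeMap<>(); // sorted by key, like the shuffle
        mapSource("A", sourceA, shuffled);
        mapSource("B", sourceB, shuffled);
        System.out.println(reduceDifferences(shuffled));
    }
}
```

Tagging each value with its source is what lets the reducer tell the two inputs apart once they arrive under the same key.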
