I have a map reduce job (say Job1) in which the mapper extends
Mapper < Object, Object, KeySet, ValueSet >
Let's say I want to sum all the values in ValueSet in the reduce step.
After reducing (key, Iterable), I want to write the final reduced values to an HBase table instead of HDFS, from the reducer of Job1. The HBase table will be used by future jobs.
I know I can write a mapper-only Job2 that reads the reduced file in HDFS (written by Job1) and imports the data into the HBase table, but I want to avoid two redundant I/O operations.
I don't want to change the Mapper class of Job1 to write to HBase because there are only specific values that I want to write to the HBase table, others I want to continue writing to HDFS.
Has anyone tried something similar and can provide pointers?
I've looked at "HBase mapreduce: write into HBase in Reducer", but my question is different since I don't want to write anything to HBase in the mapper.
I have gone through the Sqoop documentation and did not find any explanation of why Sqoop 1 does not have a reduce phase. Can someone please explain this?
The purpose of the Reducer is to aggregate the input values and return a single output value.
Look at the simple example of WordCount in MapReduce. The Reducer is used to aggregate the number of occurrences of a single word.
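To make the aggregation role concrete, here is a minimal plain-Java model of WordCount. It does not use any Hadoop classes; the class name WordCountModel and its methods are made up for illustration, and the grouping loop only stands in for the shuffle that the framework does between map and reduce.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java model of WordCount; no Hadoop classes are involved.
class WordCountModel {

    // "Map" step: emit one (word, 1) pair per whitespace-separated token.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                pairs.add(new SimpleEntry<>(token, 1));
            }
        }
        return pairs;
    }

    // "Reduce" step: collapse all values of one key into a single sum.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Driver: map every line, group values by key (the shuffle), then reduce.
    static Map<String, Integer> count(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(count(Arrays.asList("to be or not", "to be")));
    }
}
```

The point is that reduce only has work to do because several map outputs share a key. If no two records ever need to be combined, that step has nothing to aggregate.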
Since the nature of a Sqoop job is to fetch the input records from the given RDBMS and put the records into the given output directory in HDFS or into a Hive table, the job does not require any aggregation and therefore no Reduce phase is needed.
A reduce phase is not needed when every task can run independently and in parallel, with nothing to combine across tasks.
Sqoop does not need a reducer because it only imports/exports data between an RDBMS and HDFS (or Hive tables).
Since an RDBMS holds structured data, there is no need for a shuffle or sort, and any aggregation can be done in the mapper itself.
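As a rough illustration of why such a transfer parallelises without a reducer, the sketch below divides a key range into independent slices, one per mapper. This is a simplified model and not Sqoop's actual splitter code; SplitRangeModel and splits are hypothetical names, and the assumption is a numeric split column with known minimum and maximum values.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of partitioning a table by a numeric key so that
// each mapper can copy its own slice without talking to the others.
class SplitRangeModel {

    // Divide the inclusive range [min, max] into at most numMappers
    // contiguous, non-overlapping ranges. Each long[] is {low, high},
    // both inclusive.
    static List<long[]> splits(long min, long max, int numMappers) {
        List<long[]> ranges = new ArrayList<>();
        long total = max - min + 1;
        long base = total / numMappers;   // minimum slice size
        long extra = total % numMappers;  // first 'extra' slices get one more key
        long low = min;
        for (int i = 0; i < numMappers && low <= max; i++) {
            long size = base + (i < extra ? 1 : 0);
            long high = low + size - 1;
            ranges.add(new long[] {low, high});
            low = high + 1;
        }
        return ranges;
    }

    public static void main(String[] args) {
        for (long[] r : splits(1, 10, 4)) {
            System.out.println("mapper copies keys " + r[0] + ".." + r[1]);
        }
    }
}
```

Each slice can be read and written on its own, so there is no cross-mapper state that a reduce phase would have to combine.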
I need to send only selected records from the mapper to the reducer, and write the remaining (filtered-out) records to HDFS from the mapper itself. The reducer will write the records that were sent to it. My job processes a huge amount of data, in the 20 TB range, and uses 30K mappers, so I believe I cannot write from the mapper's cleanup method either: loading the data from the output files of 30K mappers (30K files) would be another problem for the next job. I am using CDH4. Has anyone implemented a similar scenario with a different approach?
When you say you want to write the data to HDFS, do you mean writing directly to HDFS through the Java client? If yes, you can add conditional logic in the mapper: records that do not need reducing are written straight to HDFS, and the selected records are written to the mapper's normal output location, from where the reducer picks them up.
By default the output location is also an HDFS location, but you have to decide how you want the data laid out in HDFS for your case.
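The conditional logic could look roughly like the plain-Java model below. It only models the routing decision; RoutingMapperModel and its two lists are hypothetical stand-ins for the mapper's normal output (which goes to the reducer) and a direct HDFS write. In a real job the side output is often done with Hadoop's MultipleOutputs class rather than a hand-rolled writer, but that is not shown here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Plain-Java model of a mapper that routes each record to one of two sinks.
class RoutingMapperModel {
    // Stand-in for context.write(...): records that go on to the reducer.
    final List<String> toReducer = new ArrayList<>();
    // Stand-in for a direct write to HDFS from the mapper.
    final List<String> sideOutput = new ArrayList<>();

    private final Predicate<String> selected;

    RoutingMapperModel(Predicate<String> selected) {
        this.selected = selected;
    }

    // Called once per input record.
    void map(String record) {
        if (selected.test(record)) {
            toReducer.add(record);
        } else {
            sideOutput.add(record);
        }
    }

    public static void main(String[] args) {
        RoutingMapperModel m = new RoutingMapperModel(r -> r.startsWith("keep"));
        m.map("keep,1");
        m.map("drop,2");
        System.out.println("to reducer: " + m.toReducer);
        System.out.println("side output: " + m.sideOutput);
    }
}
```

Note that this does not by itself solve the 30K-files concern from the question: each mapper still produces its own side file unless the files are compacted afterwards.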
I have written a map-reduce job for data in HBase. It contains multiple mappers and just a single reducer. The reduce method takes the data supplied by the mappers and does some analytics on it. After the processing is complete for all the data in HBase, I want to write the result back to a file in HDFS through the single reducer. At present I am able to write to HDFS every time I get a new record, but I cannot figure out how to write only the final conclusion to HDFS at the end.
So, if you are trying to write a final result from a single reducer to HDFS, you can try any one of the approaches below:
1. Use the create() function of the Hadoop FileSystem API to write to HDFS from the reducer.
2. Emit a single key and value from the reducer after the final calculation.
3. Override the Reducer's cleanup() function and do point (1) there.
Details on 3:
http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapreduce/Reducer.html#cleanup-org.apache.hadoop.mapreduce.Reducer.Context-
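A plain-Java model of approach 3, just to show the lifecycle: reduce() is called once per key and only accumulates, and cleanup() is called once after the last key and is the only place that writes. FinalResultReducerModel is a made-up class, the summation is an assumed stand-in for the real analytics, and the output list stands in for the HDFS file; in a real job the class would extend Hadoop's Reducer and cleanup(Context) would do the write.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java model of "accumulate in reduce(), write once in cleanup()".
class FinalResultReducerModel {
    // State carried across reduce() calls within the single reducer.
    private long runningTotal = 0;
    // Stand-in for the file in HDFS.
    final List<String> output = new ArrayList<>();

    // Called once per key: accumulate only, write nothing.
    void reduce(String key, Iterable<Long> values) {
        for (long v : values) {
            runningTotal += v;
        }
    }

    // Called once, after the last reduce() call: write the final result.
    void cleanup() {
        output.add("total\t" + runningTotal);
    }

    public static void main(String[] args) {
        FinalResultReducerModel r = new FinalResultReducerModel();
        r.reduce("a", Arrays.asList(1L, 2L));
        r.reduce("b", Arrays.asList(3L));
        r.cleanup();
        System.out.println(r.output);
    }
}
```

This only works as a global result because the job uses a single reducer; with several reducers each one would write its own partial result.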
Hope this helps.
I am very new to HBase and the MapReduce API, and I am very confused by the MapReduce concepts. I need to load a text file into an HBase table using the MapReduce API. I googled some examples, but in them I can only find a mapper, not a reducer method. I am confused about when to use a mapper and when to use a reducer. My current thinking is:
To write data to HBase we use a mapper.
To read data from HBase we use a mapper and a reducer.
Can anyone please clear this up for me with a detailed explanation?
I am trying to load data from a text file into an HBase table. I googled and tried some code, but I don't know how to load the text file and read it with the HBase MapReduce API.
I would be really thankful for any help.
With regard to your questions:
The Mapper receives splits of data and emits <key, value> pairs, which the framework then groups by key.
The Reducer receives the Mapper output as a key with its set of values, and generates <key, value> pairs.
Generally it will be your Reducer task that writes the results (to the filesystem or to HBase), but the Mapper can do that too. There are MapReduce jobs which don't require a Reducer. With regard to reading from HBase, it's the Mapper class that carries the configuration of which table to read from. But nothing says that a Mapper has to be a reader and a Reducer a writer. The article "HBase MapReduce Examples" provides good examples of how to read from and write into HBase using MapReduce.
In any case, if what you need is to bulk import some .csv files into HBase, you don't really need to do it with a MapReduce job. You can do it directly with the HBase API. In pseudocode:
table = hbase.createTable(tablename, fields);
foreach (File file: dir) {
    content = readfile(file);
    hbase.insert(table, content);
}
I wrote an importer of .mbox files into HBase. Take a look at the code, it may give you some ideas.
Once your data is imported into HBase, you do need to code a MapReduce job to operate on that data.
Using HFileOutputFormat with CompleteBulkLoad is the best and fastest way to load data into HBase.
You will find sample code here
Here are a couple of responses of mine that address loading data into HBase.
What is the fastest way to bulk load data into HBase programmatically?
Writing to HBase in MapReduce using MultipleOutputs
EDIT: Adding additional link based on comment
This link might help make the file available for processing.
Import external libraries in an Hadoop MapReduce script
I have several HBase tables. I wish to run a map task on each table (each map being a different Mapper class, since each table contains heterogeneous data), followed by one reduce.
I cannot work out whether this is possible without explicitly reducing the data after each map into an interim SequenceFile.
Any help would be gratefully received.
It seems you can only run an MR job on one table at a time (see TableMapReduceUtil). So most probably your best bet is what you suspected: save the output of each table into an interim location (e.g. a SequenceFile or a temporary HBase table) and then write a final MR job that takes that location as input and merges the results. Also, if each MR job outputs data in a common format, you may not even need the last MR merge job.
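The "common format" idea can be sketched in plain Java as below. Everything here is hypothetical: the two row layouts ("user:amount" and "amount|user"), the class MultiSourceMergeModel, and the use of a sum as the merge step. The only point is that each table gets its own parsing function, both emit the same (key, value) shape, and a single merge step can then treat the interim outputs uniformly.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java model: one mapper per heterogeneous source, one shared merge.
class MultiSourceMergeModel {

    // Mapper for hypothetical table A, whose rows look like "user:amount".
    static Map.Entry<String, Long> mapTableA(String row) {
        String[] parts = row.split(":");
        return new SimpleEntry<>(parts[0], Long.parseLong(parts[1]));
    }

    // Mapper for hypothetical table B, whose rows look like "amount|user".
    static Map.Entry<String, Long> mapTableB(String row) {
        String[] parts = row.split("\\|");
        return new SimpleEntry<>(parts[1], Long.parseLong(parts[0]));
    }

    // Final step: both interim outputs share one format, so one merge suffices.
    static Map<String, Long> merge(List<String> tableA, List<String> tableB) {
        Map<String, Long> totals = new TreeMap<>();
        for (String row : tableA) {
            Map.Entry<String, Long> e = mapTableA(row);
            totals.merge(e.getKey(), e.getValue(), Long::sum);
        }
        for (String row : tableB) {
            Map.Entry<String, Long> e = mapTableB(row);
            totals.merge(e.getKey(), e.getValue(), Long::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        System.out.println(merge(Arrays.asList("ann:5", "bob:2"),
                                 Arrays.asList("3|ann")));
    }
}
```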