I am very new to HBase and Hadoop and I am getting confused by the MapReduce concept; I want to understand the flow of execution in the MapReduce framework. I searched Google for a way to read data from a file and load it into an HTable with the Put class, using a reducer. I have a file in HDFS which I need to read in an HBase MapReduce job and load into an HTable.
Can anyone show me where I went wrong?
You can use a Mapper without a Reducer, since the Reducer is there for the sort step and you just need the file data stored in HBase directly.
Don't use the reduce step. In your map class, when you get a record, directly insert it into HBase. There is no need to shuffle / sort your puts before sending them to HBase. This means that all you have to do is create an instance variable for your HTable and initialize it in the setup method; then in your map method, create a put for your record, and add it to your HTable. Finally, in your cleanup method, make sure you flush your HTable.
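A minimal sketch of that pattern, using the older HTable client API the answer refers to (the table name, column family, qualifier and input line format are made-up placeholders):

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class FileToHBaseMapper extends Mapper<LongWritable, Text, NullWritable, NullWritable> {

    private HTable table;

    @Override
    protected void setup(Context context) throws IOException {
        // Open the target table once per mapper ("my_table" is a placeholder name)
        table = new HTable(HBaseConfiguration.create(context.getConfiguration()), "my_table");
        table.setAutoFlush(false);   // buffer Puts client-side instead of sending them one by one
    }

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException {
        // Assumes each input line looks like "rowkey,value" -- adjust the parsing to your file
        String[] parts = value.toString().split(",", 2);
        Put put = new Put(Bytes.toBytes(parts[0]));
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(parts[1]));
        table.put(put);              // no context.write(): nothing goes to the shuffle
    }

    @Override
    protected void cleanup(Context context) throws IOException {
        table.flushCommits();        // push any buffered Puts before the mapper exits
        table.close();
    }
}

In the driver you would point the job at the file in HDFS as usual and call job.setNumReduceTasks(0), so there is no shuffle or sort phase at all.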
Related
I need to send only selected records from the mapper to the reducer, and write the remaining (filtered-out) records to HDFS from the mapper itself; the reducer will write the records that are sent to it. My job processes huge data, about 20 TB, and uses 30K mappers, so I believe I cannot write from the mapper's cleanup method either, because loading that data from the 30K mappers' output files (30K files) would be another problem for the next job. I am using CDH4. Has anyone implemented a similar scenario, perhaps with a different approach?
When you say you want to write the data to HDFS, is that through the Java client, directly to HDFS? If yes, then you can put conditional logic in the mapper: one branch writes directly to HDFS, and the other emits records to the job's output location, from where the reducer picks them up.
By default the job's output location is also an HDFS location, so you have to decide in which form you want the data to end up in HDFS for your case.
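One way to express that conditional split (not necessarily what the answer has in mind, and not by itself a fix for the 30K-small-files concern) is MultipleOutputs in the mapper: records that pass the condition go to the reducer via context.write, everything else goes to a named side output. The condition, the key parsing and the output name "filtered" below are placeholders:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class SplittingMapper extends Mapper<LongWritable, Text, Text, Text> {

    private MultipleOutputs<Text, Text> sideOutput;

    @Override
    protected void setup(Context context) {
        sideOutput = new MultipleOutputs<Text, Text>(context);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        Text rowKey = new Text(line.split(",", 2)[0]);        // placeholder key extraction
        if (line.startsWith("A")) {                           // placeholder condition
            context.write(rowKey, value);                     // selected records go to the reducer
        } else {
            sideOutput.write("filtered", rowKey, value);      // the rest is written out by the mapper
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        sideOutput.close();
    }
}

In the driver you would register the side output with MultipleOutputs.addNamedOutput(job, "filtered", TextOutputFormat.class, Text.class, Text.class); note that the side files still land under the job's output directory, one per mapper.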
I have a MapReduce job (say Job1) in which the mapper extends Mapper<Object, Object, KeySet, ValueSet>.
Let's say I want to sum all the values in ValueSet in the reduce step.
After reducing (key, Iterable), I want to write the final reduced values to an HBase table instead of HDFS, in the reducer of Job1. The table in HBase will be used by future jobs.
I know I could write a mapper-only Job2 which reads the reduced file in HDFS (written by Job1) and imports the data into the HBase table, but I want to avoid those two redundant I/O operations.
I don't want to change the Mapper class of Job1 to write to HBase, because there are only specific values that I want in the HBase table; the others I want to keep writing to HDFS.
Has anyone tried something similar and can provide pointers?
I've looked at "HBase mapreduce: write into HBase in Reducer", but my question is different since I don't want to write anything to HBase in the mapper.
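A rough sketch of the setup being described (key/value types simplified to Text/LongWritable; the table, column names and selection rule are placeholders): the reducer keeps its normal HDFS output and additionally opens the table directly, much like the map-side sketch earlier, so Job1's Mapper stays untouched.

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {

    private HTable table;

    @Override
    protected void setup(Context context) throws IOException {
        table = new HTable(HBaseConfiguration.create(context.getConfiguration()), "summary_table");
    }

    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        long sum = 0;
        for (LongWritable v : values) {
            sum += v.get();
        }
        if (key.toString().startsWith("total_")) {             // placeholder selection rule
            Put put = new Put(Bytes.toBytes(key.toString()));
            put.add(Bytes.toBytes("cf"), Bytes.toBytes("sum"), Bytes.toBytes(sum));
            table.put(put);                                     // selected values go to HBase...
        } else {
            context.write(key, new LongWritable(sum));          // ...the rest keeps going to HDFS
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException {
        table.close();                                          // flushes buffered Puts as well
    }
}

The trade-off of this pattern is that the reducer holds an open HBase connection for its whole lifetime.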
I have written a map-reduce job over the data in HBase. It contains multiple mappers and just a single reducer. The reducer receives the data supplied by the mappers and does some analysis on it. After the processing is complete for all the data in HBase, I want to write the result back to a file in HDFS through the single reducer. At present I am able to write data to HDFS every time I get a new record, but I cannot figure out how to write only the final conclusion to HDFS at the very end.
So, if you are trying to write a final result from a single reducer to HDFS, you can try any one of the approaches below:
1. Use the Hadoop FileSystem API's create() method to write to HDFS from the reducer.
2. Emit a single key and value from the reducer after the final calculation.
3. Override the Reducer's cleanup() method and do point (1) there.
Details on 3:
http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapreduce/Reducer.html#cleanup-org.apache.hadoop.mapreduce.Reducer.Context-
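A minimal sketch of approach (3), accumulating across reduce() calls and writing once in cleanup() (the output path and the "final result" being computed are placeholders):

import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class FinalResultReducer extends Reducer<Text, LongWritable, NullWritable, NullWritable> {

    private long runningTotal = 0;    // whatever "final conclusion" your analysis accumulates

    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context) {
        for (LongWritable v : values) {
            runningTotal += v.get();  // accumulate instead of writing per record
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException {
        // Runs exactly once, after the single reducer has seen all of its input
        FileSystem fs = FileSystem.get(context.getConfiguration());
        FSDataOutputStream out = fs.create(new Path("/tmp/final-result.txt"), true);  // placeholder path
        out.writeBytes("final result: " + runningTotal + "\n");
        out.close();
    }
}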
Hope this helps.
I am very new to HBase and the MapReduce API, and I am very confused by the MapReduce concepts. I need to load a text file into an HBase table using the MapReduce API. I googled some examples, but in them I can find a mapper() but not a reducer method. I am confused about when to use a mapper and when to use a reducer. I am thinking of it like this: to write data to HBase we use a mapper; to read data from HBase we use a mapper and a reducer. Can anyone clear this up for me with a detailed explanation?
I am trying to load data from a text file into an HBase table. I googled and tried some code, but I don't know how to load the text file and read it in the HBase MapReduce API. I would be really thankful for any help.
With regard to your questions:
The Mapper receives splits of data and emits pairs of the form <key, set<values>>.
The Reducer receives the output of the Mapper and generates a pair <key, value>.
Generally it will be your Reducer task which writes the results (to the filesystem or to HBase), but the Mapper can do that too, and there are MapReduce jobs which don't require a Reducer at all. With regard to reading from HBase, it is the Mapper class that carries the configuration saying which table to read from; there is no rule that the Mapper is the reader and the Reducer the writer. The article "HBase MapReduce Examples" provides good examples of how to read from and write into HBase using MapReduce.
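As a rough illustration of the read side (a sketch only; the table name, column family/qualifier and output path are placeholders), the driver tells the Mapper which table to scan via TableMapReduceUtil:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ReadFromHBaseDriver {

    // The Mapper receives one HBase row at a time from the table configured below
    public static class MyTableMapper extends TableMapper<Text, Text> {
        @Override
        protected void map(ImmutableBytesWritable row, Result value, Context context)
                throws IOException, InterruptedException {
            byte[] cell = value.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col"));  // placeholder family/qualifier
            context.write(new Text(Bytes.toString(row.get())),
                          new Text(cell == null ? "" : Bytes.toString(cell)));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "read-from-hbase");
        job.setJarByClass(ReadFromHBaseDriver.class);

        Scan scan = new Scan();
        scan.setCaching(500);         // fetch rows in batches when scanning from MapReduce
        scan.setCacheBlocks(false);   // don't pollute the region server block cache with a full scan

        // This is where the job is told which table the Mapper reads from
        TableMapReduceUtil.initTableMapperJob("my_table", scan, MyTableMapper.class,
                Text.class, Text.class, job);

        job.setNumReduceTasks(0);     // map-only here; add a Reducer class if you need aggregation
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(job, new Path("/tmp/hbase-read-output"));  // placeholder
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}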
In any case, if what you need is to bulk import some .csv files into HBase, you don't really need to do it with a MapReduce job. You can do it directly with the HBase API. In pseudocode:
table = connection.getTable(tablename);
foreach (File file : dir) {
    content = readFile(file);
    put = new Put(rowKeyFor(file));
    put.addColumn(family, qualifier, content);
    table.put(put);
}
table.close();
I wrote an importer of .mbox files into HBase. Take a look at the code, it may give you some ideas.
Once your data is imported into HBase, then you do need to code a MapReduce job to operate with that data.
Using HFileOutputFormat with CompleteBulkLoad is the best and fastest way to load data into HBase.
You will find sample code here
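Roughly, the driver for that looks like the sketch below (assumptions: the HBase 1.x client API, the table name "my_table", the input/HFile paths and the "rowkey,value" line format are all placeholders):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoadDriver {

    // Turns each "rowkey,value" text line into a Put (the line format is an assumption)
    public static class TextToPutMapper extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split(",", 2);
            byte[] row = Bytes.toBytes(parts[0]);
            Put put = new Put(row);
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(parts[1]));
            context.write(new ImmutableBytesWritable(row), put);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "hbase-bulk-load");
        job.setJarByClass(BulkLoadDriver.class);
        job.setMapperClass(TextToPutMapper.class);
        job.setMapOutputKeyClass(ImmutableBytesWritable.class);
        job.setMapOutputValueClass(Put.class);
        FileInputFormat.addInputPath(job, new Path("/input/textfiles"));   // placeholder input
        Path hfileDir = new Path("/tmp/hfiles");                           // placeholder staging dir
        FileOutputFormat.setOutputPath(job, hfileDir);

        TableName tableName = TableName.valueOf("my_table");               // placeholder table
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(tableName);
             RegionLocator locator = conn.getRegionLocator(tableName)) {
            // Sorts the Puts and writes HFiles that line up with the table's region boundaries
            HFileOutputFormat2.configureIncrementalLoad(job, table, locator);
            if (job.waitForCompletion(true)) {
                // The "completebulkload" step: move the generated HFiles into the live table
                new LoadIncrementalHFiles(conf).doBulkLoad(hfileDir, conn.getAdmin(), table, locator);
            }
        }
    }
}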
Here are a couple of responses of mine that address loading data into HBase.
What is the fastest way to bulk load data into HBase programmatically?
Writing to HBase in MapReduce using MultipleOutputs
EDIT: Adding additional link based on comment
This link might help make the file available for processing.
Import external libraries in an Hadoop MapReduce script
I have several HBase tables. I wish to run a map task on each table (each map being a different Mapper class, since each table contains heterogeneous data) followed by one reduce.
I cannot work out whether this is possible without explicitly reducing the data after each map into an interim SequenceFile.
Any help would be gratefully received.
It seems you can only run an MR job over one table at a time (see TableMapReduceUtil), so most probably your best bet is what you suspected: save the output of each table's job into an interim location (e.g. a SequenceFile or a temporary HBase table) and then write a final MR job that takes that location as its input and merges the results. Also, if each per-table job outputs data in a common format, you may not even need the last merge job.
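A sketch of that two-phase layout, assuming the per-table mappers all emit a common (Text, Text) format (the table names, the per-table mapper bodies, the merge reducer and the paths are all placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class MultiTableDriver {

    // One Mapper class per table; each normalizes its table's rows into (Text, Text)
    public static class TableAMapper extends TableMapper<Text, Text> { /* table_a-specific map() */ }
    public static class TableBMapper extends TableMapper<Text, Text> { /* table_b-specific map() */ }

    // The single reduce over everything the per-table jobs produced
    public static class MergeReducer extends Reducer<Text, Text, Text, Text> { /* combined reduce() */ }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();

        // Phase 1: one map-only job per table, all writing SequenceFiles to an interim directory
        runTableJob(conf, "table_a", TableAMapper.class, "/tmp/interim/table_a");
        runTableJob(conf, "table_b", TableBMapper.class, "/tmp/interim/table_b");

        // Phase 2: a single job that reads every interim file and performs the one reduce
        Job merge = Job.getInstance(conf, "merge");
        merge.setJarByClass(MultiTableDriver.class);
        merge.setInputFormatClass(SequenceFileInputFormat.class);
        FileInputFormat.addInputPath(merge, new Path("/tmp/interim/table_a"));
        FileInputFormat.addInputPath(merge, new Path("/tmp/interim/table_b"));
        merge.setReducerClass(MergeReducer.class);
        merge.setOutputKeyClass(Text.class);
        merge.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(merge, new Path("/output/merged"));
        merge.waitForCompletion(true);
    }

    private static void runTableJob(Configuration conf, String table,
            Class<? extends TableMapper> mapperClass, String outDir) throws Exception {
        Job job = Job.getInstance(conf, "scan-" + table);
        job.setJarByClass(MultiTableDriver.class);
        TableMapReduceUtil.initTableMapperJob(table, new Scan(), mapperClass, Text.class, Text.class, job);
        job.setNumReduceTasks(0);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path(outDir));
        job.waitForCompletion(true);
    }
}

The stub classes above would of course each implement their own map() and reduce() methods; they are left empty here only to keep the two-phase structure visible.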