Load data into HBase table using HBase MapReduce API - hadoop

I am very new to HBase and the MapReduce API, and I am confused by the MapReduce concepts. I need to load a text file into an HBase table using the MapReduce API. In the examples I found through Google there is only a mapper and no reducer method, so I am confused about when to use a mapper and when to use a reducer. I am thinking of it this way:
To write data to HBase we use a mapper.
To read data from HBase we use a mapper and a reducer.
Please, can anyone clear this up for me with a detailed explanation?
I am trying to load data from a text file into an HBase table. I googled and tried some code, but I don't know how to load the text file and read it with the HBase MapReduce API.
I would really be thankful for any help.

With regard to your questions:
The Mapper receives splits of data and emits <key, value> pairs; the framework groups them so that the Reducer sees each key with the set of its values.
The Reducer receives that grouped output from the Mapper and generates its own <key, value> pairs.
Generally it will be your Reducer task that writes the results (to the filesystem or to HBase), but the Mapper can do that too, and there are MapReduce jobs which don't require a Reducer at all. With regard to reading from HBase, it's the Mapper class that holds the configuration saying which table to read from, but there is no rule that the Mapper is the reader and the Reducer the writer. The article "HBase MapReduce Examples" provides good examples of how to read from and write into HBase using MapReduce.
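To make that concrete, here is a minimal driver sketch in the spirit of those examples (it is only an illustration: "sourcetable", "targettable", MyMapper and MyTableReducer are placeholders, where MyMapper would extend TableMapper<Text, IntWritable> and MyTableReducer would extend TableReducer):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class MyDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "read-from-and-write-to-hbase");
        job.setJarByClass(MyDriver.class);

        Scan scan = new Scan();
        scan.setCaching(500);        // larger scanner caching is usually recommended for MR jobs
        scan.setCacheBlocks(false);  // don't fill the region servers' block cache with a full scan

        // The Mapper side is configured with the table it reads from.
        TableMapReduceUtil.initTableMapperJob(
                "sourcetable", scan, MyMapper.class,
                Text.class, IntWritable.class, job);

        // The Reducer side is configured with the table it writes to.
        TableMapReduceUtil.initTableReducerJob("targettable", MyTableReducer.class, job);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}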
In any case, if what you need is to bulk import some .csv files into HBase, you don't really need to do it with a MapReduce job. You can do it directly with the HBase API. In pseudocode:
table = hbase.createTable(tablename, fields);
foreach (File file : dir) {
    content = readfile(file);
    hbase.insert(table, content);
}
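A rough version of the same loop with the actual HBase 2.x client API might look like this (connection and admin setup omitted; the "cf"/"content" column family and qualifier, and using the file name as the row key, are just assumptions for illustration):

TableName tableName = TableName.valueOf(tablename);
admin.createTable(TableDescriptorBuilder.newBuilder(tableName)
        .setColumnFamily(ColumnFamilyDescriptorBuilder.of("cf")).build());
Table table = connection.getTable(tableName);
for (File file : new File(dir).listFiles()) {
    Put put = new Put(Bytes.toBytes(file.getName()));   // row key = file name
    put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("content"),
            Files.readAllBytes(file.toPath()));
    table.put(put);
}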
I wrote an importer of .mbox files into HBase. Take a look at the code, it may give you some ideas.
Once your data is imported into HBase, you will then need to code a MapReduce job to operate on that data.

Using HFileOutputFormat with CompleteBulkLoad is the best and fastest way to load data into HBase.
You will find sample code here.
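To give an idea of the shape of such a job (this is only a sketch, not the linked sample, written against the HBase 1.x API where bulk loading is done by HFileOutputFormat2 plus LoadIncrementalHFiles; the table name, the paths, BulkLoadDriver and MyHFileMapper are placeholders, and MyHFileMapper is assumed to emit ImmutableBytesWritable/Put pairs):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoadDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Connection conn = ConnectionFactory.createConnection(conf);
        TableName tableName = TableName.valueOf("mytable");
        Table table = conn.getTable(tableName);
        RegionLocator locator = conn.getRegionLocator(tableName);

        Job job = Job.getInstance(conf, "hbase-bulk-load");
        job.setJarByClass(BulkLoadDriver.class);
        job.setMapperClass(MyHFileMapper.class);          // emits ImmutableBytesWritable / Put
        job.setMapOutputKeyClass(ImmutableBytesWritable.class);
        job.setMapOutputValueClass(Put.class);
        FileInputFormat.addInputPath(job, new Path("/input/data.csv"));
        FileOutputFormat.setOutputPath(job, new Path("/tmp/hfiles"));

        // Sets up the reducer, the total-order partitioner and HFileOutputFormat2.
        HFileOutputFormat2.configureIncrementalLoad(job, table, locator);

        if (job.waitForCompletion(true)) {
            // The CompleteBulkLoad step: move the generated HFiles into the table's regions.
            LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
            loader.doBulkLoad(new Path("/tmp/hfiles"), conn.getAdmin(), table, locator);
        }
        conn.close();
    }
}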

Here are a couple of responses of mine that address loading data into HBase.
What is the fastest way to bulk load data into HBase programmatically?
Writing to HBase in MapReduce using MultipleOutputs
EDIT: Adding additional link based on comment
This link might help make the file available for processing.
Import external libraries in an Hadoop MapReduce script

Related

What does Spark's API newHadoopRDD really do?

I know that internally it uses the MapReduce API to get its input from Hadoop, but can anyone explain this in more detail?
Thanks.
What you are thinking is right. HadoopRDD is the RDD that provides core functionality for reading data stored in Hadoop (e.g., files in HDFS, sources in HBase, or S3) using the older MapReduce API (org.apache.hadoop.mapred).
It uses HadoopPartition for its partitions.
When a HadoopRDD is computed, you can see "Input split" entries in the logs, for example:
INFO HadoopRDD: Input split: file:/Users/jacek/dev/oss/spark/README.md:0+1784
The following Hadoop properties are set upon partition execution:
mapred.tip.id - task id of this task's attempt
mapred.task.id - task attempt's id
mapred.task.is.map - true
mapred.task.partition - split id
mapred.job.id
HadoopRDD does nothing when checkpoint() is called.
You can also see the comment section in HadoopRDD.scala; each of these properties is explained there.
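For instance, a plain sc.textFile call is backed by a HadoopRDD (with the old-API TextInputFormat), and computing it produces the "Input split" log lines shown above. A minimal sketch with the Java API, using a local file path purely as an example:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class HadoopRddDemo {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("hadoop-rdd-demo").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);
        JavaRDD<String> lines = sc.textFile("README.md");   // a HadoopRDD underneath
        System.out.println(lines.count());                  // computing it logs the input splits
        sc.stop();
    }
}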
NewHadoopRDD provides the same core functionality for reading data stored in Hadoop (e.g., files in HDFS, sources in HBase, or S3), but using the new MapReduce API (org.apache.hadoop.mapreduce).
It also provides various other methods for finding out configuration details about the partitions, input splits, etc.
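A NewHadoopRDD is what you get, for example, from newAPIHadoopFile. A small sketch reusing the JavaSparkContext sc from the previous example (imports omitted; the input path is a placeholder, and TextInputFormat here is the new-API org.apache.hadoop.mapreduce.lib.input.TextInputFormat):

Configuration hadoopConf = new Configuration();
JavaPairRDD<LongWritable, Text> records = sc.newAPIHadoopFile(
        "hdfs:///data/input.txt",    // any path/URI a Hadoop InputFormat can read
        TextInputFormat.class,       // new MapReduce API input format
        LongWritable.class,          // key type produced by the input format
        Text.class,                  // value type produced by the input format
        hadoopConf);                 // backed by a NewHadoopRDD under the hood
System.out.println(records.count());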
You can visit the documentation for a more detailed overview:
https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/rdd/NewHadoopRDD.html
Hope this solves your query.

Mapper writing to HDFS and Reducer writing to HBase table

I have a MapReduce job (say Job1) in which the mapper extends
Mapper<Object, Object, KeySet, ValueSet>
Let's say I want to sum all the values in ValueSet in the reduce step.
After reducing (key, Iterable), I want to write the final reduced values to an HBase table instead of HDFS, in the reducer of Job1. The HBase table will be used by future jobs.
I know I could write a mapper-only Job2 which reads the reduced file from HDFS (written by Job1) and imports the data into the HBase table, but I want to avoid that redundant pair of I/O operations.
I don't want to change the Mapper class of Job1 to write to HBase, because only specific values should go to the HBase table; the others I want to keep writing to HDFS.
Has anyone tried something similar and can provide pointers?
I've looked at HBase mapreduce: write into HBase in Reducer, but my question is different since I don't want to write anything to HBase in the mapper.

Reading a text file in HBase MapReduce and storing it to an HTable

I am new to HBase MapReduce and the Hadoop database. I need to read a raw text file from a MapReduce job and store the retrieved data into an HTable using the HBase MapReduce API.
I have been googling for many days but I am not able to understand the exact flow. Can anyone please provide me with some sample code for reading data from a file?
I need to read data from text/CSV files. I can find some examples of reading data from the command prompt. Which method can we use to read an XML file, FileInputFormat or something else? Please help me learn the MapReduce API and provide simple read and write examples.
You can import your CSV data into HBase using the importtsv and completebulkload tools. importtsv loads the CSVs into files in Hadoop and completebulkload loads them into a specified HTable. You can use these tools both from the command line and from Java code. If this helps, let me know and I can provide sample code or commands.
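For reference, a typical command-line run looks roughly like this (the table name, the "cf" column family, the column layout and the paths are placeholders, and the exact class names vary a little between HBase versions):

hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
    -Dimporttsv.separator=, \
    -Dimporttsv.columns=HBASE_ROW_KEY,cf:col1,cf:col2 \
    -Dimporttsv.bulk.output=/tmp/hfiles \
    mytable /input/data.csv

hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /tmp/hfiles mytable

If you leave out -Dimporttsv.bulk.output, importtsv writes Puts directly into the table and the second (completebulkload) step is not needed.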

Unable to load data into an HTable using MapReduce

I am very new to HBase and Hadoop. I am getting confused by the MapReduce concepts and I want to know the flow of execution in the MapReduce framework. I tried searching Google for a way to read data from a file and load the data into an HTable with the Put class using a reducer. I have a file in HDFS which I need to read in an HBase MapReduce job and load into an HTable.
Can any one show me where I went wrong?
You can use a Mapper without a Reducer. A Reducer is mainly useful when you need sorting or grouping, and here you just need the file data to be stored in HBase directly.
Don't use the reduce step. In your map class, when you get a record, directly insert it into HBase. There is no need to shuffle / sort your puts before sending them to HBase. This means that all you have to do is create an instance variable for your HTable and initialize it in the setup method; then in your map method, create a put for your record, and add it to your HTable. Finally, in your cleanup method, make sure you flush your HTable.
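A minimal sketch of that map-only approach, assuming a "rowkey,value" line format, a table called "mytable" and a column family "cf" (all placeholders), and using the Connection/BufferedMutator API in place of the raw HTable mentioned above:

import java.io.IOException;

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.BufferedMutator;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TextToHBaseMapper extends Mapper<LongWritable, Text, NullWritable, NullWritable> {

    private Connection connection;
    private BufferedMutator mutator;   // buffers Puts and flushes them in batches

    @Override
    protected void setup(Context context) throws IOException {
        connection = ConnectionFactory.createConnection(context.getConfiguration());
        mutator = connection.getBufferedMutator(TableName.valueOf("mytable"));
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context) throws IOException {
        String[] parts = line.toString().split(",", 2);   // assumed "rowkey,value" input lines
        Put put = new Put(Bytes.toBytes(parts[0]));
        put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("value"), Bytes.toBytes(parts[1]));
        mutator.mutate(put);           // no shuffle/sort, the Put goes straight to HBase
    }

    @Override
    protected void cleanup(Context context) throws IOException {
        mutator.flush();               // make sure buffered Puts actually reach HBase
        mutator.close();
        connection.close();
    }
}

In the driver you would also call job.setNumReduceTasks(0) so that no reduce phase runs at all.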

Can Hadoop MapReduce run over other filesystems?

I heard that the input for MapReduce jobs need not be in HDFS; it can be on another file system. Can someone please give me more details on this?
I am a little confused about it. In standalone mode the data can be on the local file system, but in cluster mode how can we point MapReduce jobs to some other file system?
No, it does not need to be in HDFS. For instance, jobs which target HBase using its TableInputFormat pull records over the network from the HBase nodes as input to their map tasks. DBInputFormat can be used to pull data from a SQL database into a job. You could also build an InputFormat that does something like read data off of an NFS mount.
In practice you want to avoid pulling data over the network if you can. MR performance is much better if you can have your data locally on the nodes where the job is being run since Disk Throughput > Network Throughput.
Based on the InputFormat set on the job, Hadoop can read from any source. Hadoop provides a number of InputFormats out of the box, and it's not difficult to write a custom InputFormat either, say to provide a proprietary format as input to a job.
Along the same lines, Hadoop provides a number of OutputFormats, and it shouldn't be difficult to write a custom OutputFormat also.
Here is a nice article on the DBInputFormat.
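As a small illustration of pointing a job somewhere other than HDFS (the bucket and paths are placeholders, and this assumes the relevant FileSystem implementation, here the s3a connector, is on the classpath and configured with credentials), you can simply use a fully qualified URI for the input path in the driver:

Job job = Job.getInstance(new Configuration(), "non-hdfs-input");
// Input comes straight from S3 via the s3a filesystem; output still goes to HDFS.
FileInputFormat.addInputPath(job, new Path("s3a://my-bucket/input/"));
FileOutputFormat.setOutputPath(job, new Path("hdfs:///user/me/output/"));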
Another way to achieve this is to put files into HDFS that describe where the real data lives. The mapper will read this information and pull the real data in for processing.
For example, we can have several files containing the URLs of the data to be processed.
What we lose in this case is data locality; otherwise it is fine.
