Multiple maps followed by one reduce with Hadoop and HBase - hadoop

I have several HBase tables. I wish to run a map task on each table (each map being a different Mapper class, since each table contains heterogeneous data) followed by one reduce.
I cannot work out if this is possible without explicitly reducing the data after each map into an interim SequenceFile.
Any help would be gratefully received.

It seems you can only run an MR job on one table at a time (see TableMapReduceUtil). So your best bet is probably what you suspected: save the output of each table's job to an interim location (e.g. a SequenceFile or a temporary HBase table), then write a final MR job that takes that location as its input and merges the results. Also, if each MR job outputs data in a common format, you may not even need the final merge job.
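To make the shape of that approach concrete, here is a minimal plain-Java sketch. It is only a simulation: it uses no Hadoop or HBase classes, the two "tables" are in-memory lists, and the mapper names and row formats are hypothetical. Each table gets its own mapper, both mappers emit the same (key, count) format into an interim list, and one reduce step runs over the merged result.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

// Plain-Java simulation only: no Hadoop or HBase classes are involved.
class MultiMapSingleReduce {

    // Hypothetical mapper for a table whose rows look like "user:clicks".
    static Map.Entry<String, Long> mapColonRows(String row) {
        String[] parts = row.split(":");
        return Map.entry(parts[0], Long.valueOf(parts[1]));
    }

    // Hypothetical mapper for a table whose rows look like "clicks,user".
    static Map.Entry<String, Long> mapCommaRows(String row) {
        String[] parts = row.split(",");
        return Map.entry(parts[1], Long.valueOf(parts[0]));
    }

    // Runs one mapper over one "table" and appends its output to the interim store.
    static void runMap(List<String> table,
                       Function<String, Map.Entry<String, Long>> mapper,
                       List<Map.Entry<String, Long>> interim) {
        for (String row : table) {
            interim.add(mapper.apply(row));
        }
    }

    // The single reduce step: sums the counts per key over the merged interim output.
    static Map<String, Long> reduce(List<Map.Entry<String, Long>> interim) {
        Map<String, Long> totals = new TreeMap<>();
        for (Map.Entry<String, Long> pair : interim) {
            totals.merge(pair.getKey(), pair.getValue(), Long::sum);
        }
        return totals;
    }

    static Map<String, Long> run() {
        List<Map.Entry<String, Long>> interim = new ArrayList<>();
        runMap(List.of("alice:3", "bob:1"), MultiMapSingleReduce::mapColonRows, interim);
        runMap(List.of("4,alice", "2,carol"), MultiMapSingleReduce::mapCommaRows, interim);
        return reduce(interim);
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

In a real cluster the interim list would be the SequenceFile or temporary table mentioned above, and each runMap call would be its own job.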

Related

HBase aggregation, Get And Put operation, Bulk Operation

I would like to know how I can map (transform) the value of a key.
I know that it can be done with a Get followed by a Put. Is there any other way to do it efficiently? 'checkAndPut' is not very helpful.
Can it be done with something like:
(key,value) => value+g()
I have read the book HBase: The Definitive Guide, and it seems that a MapReduce job is translated into Put/Get operations on top of HBase. Does this mean that it is not a 'bulk operation' (since it is one operation per key)?
Is Spark relevant here, and if so, how?
HBase has scans (1) to retrieve multiple rows; and MapReduce jobs can and do use this command (2).
For HBase, 'bulk' mostly [or solely] means 'bulk load'/'bulk import', where one adds data by constructing HFiles and 'injecting' them into the HBase cluster (as opposed to using PUTs) (3).
Your task can be implemented as a MapReduce job as well as a Spark app (4 being one example, maybe not the best one), or a Pig script, or a Hive query if you use the HBase table from Hive (5); pick your poison.
If you set up a Table with a counter then you can use an Increment to add a certain amount to the existing value in an atomic operation.
From a MapReduce job you would aggregate your input in micro batches (wherever you have your incremental counts), group them by key/value, sum them up, and then issue a Put from your job (1 Put per key).
What I mentioned above is not a 'bulk' operation, but it would probably work just fine if the number of rows that you modify in each batch is relatively small compared to the total number of rows in your table.
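As a rough illustration of the micro-batch idea, the following plain-Java sketch sums all the deltas for each row key first and then issues a single write per key. The "table" is just an in-memory map standing in for HBase, and the class and method names are made up for the example; in a real job the marked line is where the Put or Increment would go.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation only: the "table" is a Map, not an HBase table.
class MicroBatchAggregator {

    // Collapses a batch of (rowKey, delta) pairs into one summed delta per row key.
    static Map<String, Long> sumPerKey(List<Map.Entry<String, Long>> batch) {
        Map<String, Long> sums = new TreeMap<>();
        for (Map.Entry<String, Long> e : batch) {
            sums.merge(e.getKey(), e.getValue(), Long::sum);
        }
        return sums;
    }

    // Applies one write per key to the stand-in table and returns how many writes were issued.
    static int apply(Map<String, Long> table, Map<String, Long> sums) {
        int writes = 0;
        for (Map.Entry<String, Long> e : sums.entrySet()) {
            table.merge(e.getKey(), e.getValue(), Long::sum); // stands in for one Put or Increment
            writes++;
        }
        return writes;
    }

    public static void main(String[] args) {
        Map<String, Long> table = new HashMap<>();
        table.put("row1", 10L);
        List<Map.Entry<String, Long>> batch = List.of(
                Map.entry("row1", 1L), Map.entry("row2", 5L), Map.entry("row1", 2L));
        int writes = apply(table, sumPerKey(batch));
        System.out.println(writes + " writes, table = " + new TreeMap<>(table));
    }
}
```

Three incoming deltas turn into two writes here, which is the whole point: the write count depends on the number of distinct keys in the batch, not on the number of input records.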
If you expect to modify your entire table in each batch, then you should look at Bulk Loads. This will require you to write a job that reads your existing values from HBase and your new values from the incremental sources, adds them together, and writes them back to HBase (in a 'bulk load' fashion, not directly).
A Bulk Load writes HFiles directly to HDFS without going through the HBase 'write pipeline' (Memstore, minor compactions, major compactions, etc.), and then issues a command to swap the existing files with the new ones. The swap is FAST! Note that you could also generate the new HFiles outside the HBase cluster (so as not to overload it), then copy them over and issue the swap command.

Apache Sqoop-1 reducer phase

I have gone through the Sqoop documentation and did not find any information on why Sqoop-1 does not have a reducer phase. Can someone please explain this?
The purpose of the Reducer is to aggregate the input values and return a single output value.
Look at the simple example of WordCount in MapReduce. The Reducer is used to aggregate the number of occurrences of a single word.
Since the nature of a Sqoop job is to fetch the input records from the given RDBMS and put the records into the given output directory in HDFS or into a Hive table, the job does not require any aggregation and therefore no Reduce phase is needed.
A reduce phase is not needed when every task can run independently and in parallel, with nothing to combine across tasks.
Sqoop does not need a reducer because it only imports/exports data between an RDBMS and the HDFS file system (or Hive tables).
Since an RDBMS holds structured data, there is no need for a shuffle or sort, and any aggregation can be done in the mapper itself.
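A small plain-Java sketch may help show why no reducer is needed. It is loosely modelled on the idea of dividing an id range among mappers and is not Sqoop's actual code; the class, the method names, and the "row-N" records are all invented for illustration. Each range is copied by its own "mapper", and no mapper needs anything from another one.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java simulation of a map-only import; this is not Sqoop's implementation.
class MapOnlyImport {

    // Splits the inclusive id range [min, max] into at most numMappers contiguous ranges.
    static List<long[]> splits(long min, long max, int numMappers) {
        List<long[]> out = new ArrayList<>();
        long total = max - min + 1;
        long size = (total + numMappers - 1) / numMappers; // ceiling division
        for (long lo = min; lo <= max; lo += size) {
            out.add(new long[] {lo, Math.min(lo + size - 1, max)});
        }
        return out;
    }

    // One "mapper": copies the rows of its own id range and needs nothing from other mappers.
    static List<String> importRange(long lo, long hi) {
        List<String> part = new ArrayList<>();
        for (long id = lo; id <= hi; id++) {
            part.add("row-" + id);
        }
        return part;
    }

    public static void main(String[] args) {
        for (long[] s : splits(1, 10, 4)) {
            System.out.println(s[0] + ".." + s[1] + " -> " + importRange(s[0], s[1]));
        }
    }
}
```

Every range produces its own output and the job is finished once all ranges are copied, so there is nothing left for a shuffle, sort, or reduce step to do.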

How to use map reduce output as an input for another map reduce job?

In the first MapReduce job I am processing an HBase table and outputting a smaller list of row keys. I need to use this list of strings in a second MapReduce job, which pulls from a different HBase table and outputs to another HBase table. What is the proper way to store and access the output of the first MapReduce job?
Hadoop doesn't support streaming the output of one MR job to another. So the output of the first MR job has to be stored in HDFS (or some other persistent storage) and then read by the second MR job. Create a DAG of jobs using Oozie or Azkaban. For a simple workflow, use Hadoop's JobControl API.
Apache Tez allows streaming of data across MR tasks, but it is still in the Apache Incubator phase, so use it with a bit of caution.
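The store-then-read pattern described above can be sketched in plain Java. This is a local simulation only: a temporary file stands in for the HDFS location, in-memory maps stand in for the two HBase tables, and the filtering and upper-casing are placeholder logic chosen just to have something to run.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Plain-Java simulation only: a temp file plays the role of the interim HDFS output.
class ChainedJobs {

    // "Job 1": keeps the row keys whose value exceeds a threshold and persists them, one per line.
    static void jobOne(Map<String, Integer> tableA, int threshold, Path out) throws IOException {
        List<String> keys = tableA.entrySet().stream()
                .filter(e -> e.getValue() > threshold)
                .map(Map.Entry::getKey)
                .sorted()
                .collect(Collectors.toList());
        Files.write(out, keys);
    }

    // "Job 2": reads the persisted keys back and processes only the matching rows of a second table.
    static Map<String, String> jobTwo(Path keysFile, Map<String, String> tableB) throws IOException {
        Set<String> wanted = new HashSet<>(Files.readAllLines(keysFile));
        Map<String, String> result = new TreeMap<>();
        for (Map.Entry<String, String> e : tableB.entrySet()) {
            if (wanted.contains(e.getKey())) {
                result.put(e.getKey(), e.getValue().toUpperCase());
            }
        }
        return result;
    }

    // Runs the two jobs in order; job 2 starts only after job 1 has written its output.
    static Map<String, String> run() {
        try {
            Path interim = Files.createTempFile("job1-output", ".txt");
            try {
                jobOne(Map.of("r1", 5, "r2", 50, "r3", 500), 10, interim);
                return jobTwo(interim, Map.of("r1", "a", "r2", "b", "r3", "c", "r4", "d"));
            } finally {
                Files.deleteIfExists(interim);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

The run method is the part a workflow tool such as Oozie, Azkaban, or JobControl would take over: it only enforces that the second job runs after the first one has finished writing.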

Reduce job pending in HFileOutputFormat

I am using
Hbase:0.92.1-cdh4.1.2, and
Hadoop:2.0.0-cdh4.1.2
I have a MapReduce program that loads data from HDFS into HBase using HFileOutputFormat in cluster mode.
In that MapReduce program I'm using HFileOutputFormat.configureIncrementalLoad() to bulk load an 800,000-record data set of 7.3 GB, and it runs fine, but it does not run for a 900,000-record data set of 8.3 GB.
With the 8.3 GB data set my MapReduce program has 133 maps and one reducer, and all maps complete successfully. My reducer stays in the Pending state for a long time. There is nothing wrong with the cluster, since other jobs are running fine and this job also runs fine with up to 7.3 GB of data.
What could I be doing wrong?
How do I fix this issue?
I ran into the same problem. Looking at the JobTracker logs, I noticed there was not enough free space for the single reducer to run on any of my nodes:
2013-09-15 16:55:19,385 WARN org.apache.hadoop.mapred.JobInProgress: No room for reduce task. Node tracker_slave01.mydomain.com:localhost/127.0.0.1:43455 has 503,777,017,856 bytes free; but we expect reduce input to take 978136413988
The 503 GB figure is the free space available on one of the hard drives of that particular slave ("tracker_slave01.mydomain.com"), so the reducer apparently needs to copy all of its input (about 978 GB according to the log) onto a single drive.
The reason this happens is your table only has one region when it is brand new. As data is inserted into that region, it'll eventually split on its own.
A solution to this is to pre-create your regions when creating your table. The Bulk Loading Chapter in the HBase book discusses this, and presents two options for doing this. This can also be done via the HBase shell (see create's SPLITS argument I think). The challenge though is defining your splits such that the regions get an even distribution of keys. I've yet to solve this problem perfectly, but here's what I'm doing currently:
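// Assumes an existing HBaseAdmin named admin and a String tableName.
HTableDescriptor desc = new HTableDescriptor();
desc.setName(Bytes.toBytes(tableName));
desc.addFamily(new HColumnDescriptor("my_col_fam"));
// Pre-create 100 regions that evenly divide the key range 0 to Integer.MAX_VALUE.
admin.createTable(desc, Bytes.toBytes(0), Bytes.toBytes(Integer.MAX_VALUE), 100);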
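To see what "evenly dividing the key range" means in practice, here is a small plain-Java sketch that computes evenly spaced boundaries itself. It does not call HBase; the class and method names are hypothetical, and toBytes only mirrors the 4-byte big-endian encoding of an int. Note that evenly spaced boundaries only give evenly loaded regions if the row keys are spread uniformly over the range, which is exactly the challenge mentioned above.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java sketch only: computes split boundaries without touching HBase.
class SplitPoints {

    // Returns numRegions - 1 evenly spaced boundaries strictly between min and max.
    static List<Integer> boundaries(int min, int max, int numRegions) {
        List<Integer> out = new ArrayList<>();
        long span = (long) max - min; // long arithmetic avoids int overflow
        for (int i = 1; i < numRegions; i++) {
            out.add((int) (min + span * i / numRegions));
        }
        return out;
    }

    // Encodes an int as 4 big-endian bytes, so non-negative keys sort in numeric order.
    static byte[] toBytes(int value) {
        return ByteBuffer.allocate(4).putInt(value).array();
    }

    public static void main(String[] args) {
        System.out.println(boundaries(0, 100, 4));
        System.out.println(Arrays.toString(toBytes(256)));
    }
}
```

If the real keys cluster in one part of the range, the boundaries would have to come from a sample of the data instead of simple arithmetic like this.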
An alternative solution would be to not use configureIncrementalLoad, and instead: 1) just generate your HFiles via MapReduce with no reducers; 2) use the completebulkload feature in hbase.jar to import your records into HBase. Of course, I think this runs into the same problem with regions, so you'll want to create the regions ahead of time too (I think).
Your job is running with a single reducer, which means the whole data set (7 GB and more) is being processed by a single task.
The main reason for this is that HFileOutputFormat starts a reducer that sorts and merges the data to be loaded into the HBase table.
Here, the number of reducers = the number of regions in the HBase table.
Increase the number of regions and you will achieve parallelism in the reducers. :)
You can get more details here:
http://databuzzprd.blogspot.in/2013/11/bulk-load-data-in-hbase-table.html
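The "one reducer per region" relationship can be illustrated with a plain-Java sketch of range partitioning. It is not the partitioner that HBase or Hadoop actually ships; the class name and the string row keys are assumptions for the example. Given the sorted region split keys, each row key is routed to the reducer that owns its range, so a table with no split keys sends everything to reducer 0.

```java
import java.util.Arrays;

// Plain-Java sketch of range partitioning; not the partitioner used by HBase itself.
class RegionPartitioner {

    // Returns the index of the region (and so of the reducer) that owns rowKey.
    // splitKeys must be sorted; region i starts at splitKeys[i - 1] (inclusive).
    static int partition(String rowKey, String[] splitKeys) {
        int pos = Arrays.binarySearch(splitKeys, rowKey);
        return pos >= 0 ? pos + 1 : -(pos + 1);
    }

    // Number of reducers implied by a set of split keys: one per region.
    static int numReducers(String[] splitKeys) {
        return splitKeys.length + 1;
    }

    public static void main(String[] args) {
        String[] splits = {"g", "n", "t"};
        for (String key : new String[] {"apple", "g", "mango", "zebra"}) {
            System.out.println(key + " -> reducer " + partition(key, splits));
        }
        System.out.println("brand-new table -> reducer " + partition("mango", new String[0]));
    }
}
```

With three split keys the work spreads over four reducers, while the empty split list reproduces the single-reducer situation from the question.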

Using Hadoop to process data from multiple datasources

Do MapReduce and the other Hadoop technologies (HBase, Hive, Pig, etc.) lend themselves well to situations where you have multiple input files and where data needs to be compared between the different data sources?
In the past I've written a few MapReduce jobs using Hadoop and Pig. However, these tasks were quite simple, since they involved manipulating only a single dataset. The requirements we have now dictate that we read data from multiple sources and compare various data elements against another data source. We then report on the differences. The datasets we are working with are in the region of 10 million to 60 million records, and so far we haven't managed to make these jobs fast enough.
Is there a case for using MapReduce to solve such problems, or am I going down the wrong route?
Any suggestions are much appreciated.
I guess I'd preprocess the different datasets into a common format (being sure to include a "data source" id column with a single unique value for each row coming from the same dataset). Then move the files into the same directory, load the whole dir and treat it as a single data source in which you compare the properties of rows based on their dataset id.
Yes, you can join multiple datasets in a mapreduce job. I would recommend getting a copy of the book/ebook Hadoop In Action which addresses joining data from multiple sources.
When you have multiple input files you can use the MapReduce API method FileInputFormat.addInputPaths(), which takes the job and a comma-separated list of files, as below:
FileInputFormat.addInputPaths(job, "dir1/file1,dir2/file2,dir3/file3");
You can also pass multiple inputs into a Mapper in hadoop using Distributed Cache, more info is described here: multiple input into a Mapper in hadoop
If I am not misunderstanding, you are trying to normalize the structured data in records coming in from several inputs and then process it. Based on this, I think you really need to look at this article, which helped me in the past. It describes how to normalize data using Hadoop/MapReduce in the following steps:
Step 1: Extract the column value pairs from the original data.
Step 2: Extract column-value Pairs Not In Master ID File
Step 3: Calculate the Maximum ID for Each Column in the Master File
Step 4: Calculate a New ID for the Unmatched Values
Step 5: Merge the New Ids with the Existing Master IDs
Step 6: Replace the Values in the Original Data with IDs
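The six steps above can be sketched in a few lines of plain Java. This is a simplified, single-column, in-memory reading of the steps and not the article's MapReduce implementation; the class name, the sample values, and the choice to hand out new IDs in sorted order are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Plain-Java, single-column simplification of the steps; not the article's code.
class IdNormalizer {

    // Rewrites values as IDs, adding any unseen value to the master mapping.
    static List<Integer> normalize(List<String> values, Map<String, Integer> master) {
        // Step 3: find the largest ID already in the master mapping.
        int nextId = master.values().stream().mapToInt(Integer::intValue).max().orElse(0) + 1;
        // Steps 1, 2, 4 and 5: take the distinct values, give each unseen one a new ID,
        // and merge it into the master mapping.
        for (String value : new TreeSet<>(values)) {
            if (!master.containsKey(value)) {
                master.put(value, nextId++);
            }
        }
        // Step 6: replace each original value with its ID.
        List<Integer> ids = new ArrayList<>();
        for (String value : values) {
            ids.add(master.get(value));
        }
        return ids;
    }

    public static void main(String[] args) {
        Map<String, Integer> master = new HashMap<>(Map.of("red", 1, "blue", 2));
        System.out.println(normalize(List.of("red", "green", "blue", "amber", "green"), master));
    }
}
```

In the multi-column case described by the steps, the master mapping and the maximum ID would be kept per column rather than as one global map.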
Using MultipleInputs we can do this:
MultipleInputs.addInputPath(job, path1, TextInputFormat.class, Mapper1.class);
MultipleInputs.addInputPath(job, path2, TextInputFormat.class, Mapper2.class);
job.setReducerClass(Reducer1.class);
// FileOutputFormat.setOutputPath(); set output path here
If both mappers emit a common key, their records can be joined in the reducer, which can then apply the necessary logic.
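What the reducer sees in such a join can be sketched in plain Java. This is a simulation without any Hadoop classes; the source tags "A" and "B", the sample records, and the wording of the report are invented for the example. Records from both inputs are tagged with their source and grouped by the common key, and the "reduce" step then compares the two sides and reports the differences.

```java
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of a reduce-side join used to compare two data sources.
class ReduceSideCompare {

    static Map<String, String> diff(Map<String, String> sourceA, Map<String, String> sourceB) {
        // "Map" phase: emit (key, source-tagged value) from both inputs, grouped by key.
        Map<String, Map<String, String>> grouped = new TreeMap<>();
        sourceA.forEach((k, v) -> grouped.computeIfAbsent(k, x -> new TreeMap<>()).put("A", v));
        sourceB.forEach((k, v) -> grouped.computeIfAbsent(k, x -> new TreeMap<>()).put("B", v));

        // "Reduce" phase: for each key, compare the values that arrived from each source.
        Map<String, String> report = new TreeMap<>();
        for (Map.Entry<String, Map<String, String>> e : grouped.entrySet()) {
            String a = e.getValue().get("A");
            String b = e.getValue().get("B");
            if (a == null) {
                report.put(e.getKey(), "only in B");
            } else if (b == null) {
                report.put(e.getKey(), "only in A");
            } else if (!a.equals(b)) {
                report.put(e.getKey(), "differs: " + a + " vs " + b);
            }
        }
        return report;
    }

    public static void main(String[] args) {
        Map<String, String> a = Map.of("k1", "x", "k2", "y", "k4", "same");
        Map<String, String> b = Map.of("k2", "z", "k3", "w", "k4", "same");
        System.out.println(diff(a, b));
    }
}
```

Keys whose values match on both sides produce no output, so the report contains only the differences the question asks about.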
