HBase aggregation, Get And Put operation, Bulk Operation - hadoop

I would like to know how can I map a value of a key.
I know that it can be done with Get and then Put operations. Is there any other way to do it efficiently? 'checkAndPut' is not ver helpful
can it be done with something like :
(key,value) => value+g()
I have read the book HBase the Definitive Guide and it seems like Map Reduce Job interpreted to Put/Get operations on top of HBase. Does it means that it is not a 'Bulk Operation' (since it's an operation per key) ?
How /Does Spark relevant here ?

HBase has scans (1) to retrieve multiple rows; and MapReduce jobs can and do use this command (2).
For HBase 'bulk' is mostly [or solely] is 'bulk load'/'bulk import' where one adds data via constructing HFiles and 'injecting' them to HBase cluster (as opposed to PUT-s) (3).
Your task can be implemented as a MapReduce Job as well as a Spark app (4 being one of examples, maybe not the best one), or a Pig script, or a Hive query if you use HBase table from Hive (5); pick your poison.

If you set up a Table with a counter then you can use an Increment to add a certain amount to the existing value in an atomic operation.
From a MapReduce job you would aggregate your input in micro batches (wherever you have your incremental counts), group them by key/value, sum them up, and then issue a Put from your job (1 Put per key).
What I mentioned above is not a 'bulk' operation but it would probably work just fine if the amount of rows that you modify in each batch is relatively small compared to the total number or rows in your table.
IFF you expect to modify your entire table at each batch then you should look at Bulk Loads. This will require you to write a job that reads your existing values in HBase, your new values from the incremental sources, add them together, and write them back to HBase (In a 'bulk load' fashion, not directly)
A Bulk Load writes HFiles directly to HDFS without going through the HBase 'write pipeline' (Memstore, minor compactions, major compactions, etc), and then issue a command to swap the existing files with the new ones. The swap is FAST! Note, you could also generate the new HFile outside the HBase cluster (not to overload it) and then copy them over and issue the swap command.

Related

Hive or Hbase when we need to pull more number of columns?

I have a data structure in Hadoop with 100 columns and few hundred rows. Most of the times I need to query 65% of columns. In this case which is better to use HBASE or HIVE? Please advice.
Just number of columns you are accessing is NOT the criteria for deciding hbase or hive.
HIVE (SQL) :
Use Hive when you have warehousing needs and you are good at SQL and don't want to write MapReduce jobs. One important point though, Hive queries get converted into a corresponding MapReduce job under the hood which runs on your cluster and gives you the result. Hive does the trick for you. But each and every problem cannot be solved using HiveQL. Sometimes, if you need really fine grained and complex processing you might have to take MapReduce's shelter.
Hbase (NoSQL database):
You can use Hbase to serve that purpose. If you have some data which you want to access real time, you could store it in Hbase.
hbase get 'rowkey' is powerful when you know your access pattern
Hbase follows CP of CAP Theorm
Consistency:
Every node in the system contains the same data (e.g. replicas are never out of data)
Availability:
Every request to a non-failing node in the system returns a response
Partition Tolerance:
System properties (consistency and/or availability) hold even when the system is partitioned (communicate lost) and data is lost (node lost)
also have a look at this
Its very difficult to answer the question in one line.
HBASE is NoSQL database: your data need to store denormalized data because HBASE is very bad for joi
ning tables.
Hive: You can store data in similar format (normalized) in Hive, but would only see benefits when doing batch processing.

What is the fastest and most efficient way to Hbase bulk delete

What is the fastest and most efficient way to bulk delete hbase records? Hbase client API or a MapReduce job?
The HBase Client API does not allow to do bulk deletions unless you know the row keys of the cells you want to delete.
The BulkDeleteEndpoint can be leveraged to do bulk deletes based on the results of a scanner.
The fastest and most efficient way for large contiguous datasets is to drop entire regions by deleting their HDFS directories and removing them from the META table. This incurs virtually no IO so it's arguably almost free.
Note, however that this is not yet available directly through the high level APIs, so you have to script / code it in order to get it done.
Here's an example, from the HBase mailing lists, of how you could do it using the shell.
Close the region from the shell (read up on how this works using shell
help -- don't do unassign)
Then just delete the content of the region in HDFS once the region is
closed (the region dir name in HDFS is the same as the region encoded name,
the last portion of a region name -- check refguide).
After the delete in HDFS, call assign region.
Source http://search-hadoop.com/m/YGbbl9ZaSQ2HLT&subj=Re+Delete+a+region+from+hbase
HBase Client API is faster because you perform operations directly on the database while using MapReduce which means that tasks will run over jobs and that's take time according to my experience.
Over than that HBase will allow you to run specific operations in Column families that MapReduce can't do.

Reduce job pending in HFileOutputFormat

I am using
Hbase:0.92.1-cdh4.1.2, and
Hadoop:2.0.0-cdh4.1.2
I have a mapreduce program that will load data from HDFS to HBase using HFileOutputFormat in cluster mode.
In that mapreduce program i'm using HFileOutputFormat.configureIncrementalLoad() to bulk load a 800000 record
data set which is of 7.3GB size and it is running fine, but it's not running for 900000 record data set which is of 8.3GB.
In the case of 8.3GB data my mapreduce program have 133 maps and one reducer,all maps completed successfully.My reducer status is always in Pending for a long time. There is nothing wrong with the cluster since other jobs are running fine and this job also running fine upto 7.3GB of data.
What could i be doing wrong?
How do I fix this issue?
I ran into the same problem. Looking at the DataTracker logs, I noticed there was not enough free space for the single reducer to run on any of my nodes:
2013-09-15 16:55:19,385 WARN org.apache.hadoop.mapred.JobInProgress: No room for reduce task. Node tracker_slave01.mydomain.com:localhost/127.0.0.1:43455 has 503,777,017,856 bytes free; but we expect reduce input to take 978136413988
This 503gb refers to the free space available on one of the hard drives on the particular slave ("tracker_slave01.mydomain.com"), thus the reducer apparently needs to copy all the data to a single drive.
The reason this happens is your table only has one region when it is brand new. As data is inserted into that region, it'll eventually split on its own.
A solution to this is to pre-create your regions when creating your table. The Bulk Loading Chapter in the HBase book discusses this, and presents two options for doing this. This can also be done via the HBase shell (see create's SPLITS argument I think). The challenge though is defining your splits such that the regions get an even distribution of keys. I've yet to solve this problem perfectly, but here's what I'm doing currently:
HTableDescriptor desc = new HTableDescriptor();
desc.setName(Bytes.toBytes(tableName));
desc.addFamily(new HColumnDescriptor("my_col_fam"));
admin.createTable(desc, Bytes.toBytes(0), Bytes.toBytes(2147483647), 100);
An alternative solution would be to not use configureIncrementalLoad, and instead: 1) just generate your HFile's via MapReduce w/ no reducers; 2) use completebulkload feature in hbase.jar to import your records to HBase. Of course, I think this runs into the same problem with regions, so you'll want to create the regions ahead of time too (I think).
Your job is running with single reduces, means 7GB data getting processed on single task.
The main reason of this is HFileOutputFormat starts reducer that sorts and merges data to be loaded in HBase table.
here, Num of Reducer = num of regions in HBase table
Increase the number of regions and you will achieve parallelism in reducers. :)
You can get more details here:
http://databuzzprd.blogspot.in/2013/11/bulk-load-data-in-hbase-table.html

Running a MR Job on a portion of the HDFS file

Imagine you have a big file stored in hdtf which contains structured data. Now the goal is to process only a portion of data in the file like all the lines in the file where second column value is between so and so. Is it possible to launch the MR job such that hdfs only stream the relevant portion of the file versus streaming everything to the mappers.
The reason is that I want to expedite the job speed by only working on the portion that I need. Probably one approach is to run a MR job to get create a new file but I am wondering if one can avoid that?
Please note that the goal is to keep the data in HDFS and I do not want to read and write from database.
HDFS stores files as a bunch of bytes in blocks, and there is no indexing, and therefore no way to only read in a portion of your file (at least at the time of this writing). Furthermore, any given mapper may get the first block of the file or the 400th, and you don't get control over that.
That said, the whole point of MapReduce is to distribute the load over many machines. In our cluster, we run up to 28 mappers at a time (7 per node on 4 nodes), so if my input file is 1TB, each map slot may only end up reading 3% of the total file, or about 30GB. You just perform the filter that you want in the mapper, and only process the rows you are interested in.
If you really need filtered access, you might want to look at storing your data in HBase. It can act as a native source for MapReduce jobs, provides filtered reads, and stores its data on HDFS, so you are still in the distributed world.
One answer is looking at the way that hive solves this problem. The data is in "tables" which are really just meta data about files on disk. Hive allows you to set columns on which a table is partitioned. This creates a separate folder for each partition so if you were partitioning a file by date you would have:
/mytable/2011-12-01
/mytable/2011-12-02
Inside of the date directory would be you actual files. So if you then ran a query like:
SELECT * FROM mytable WHERE dt ='2011-12-01'
Only files in /mytable/2011-12-01 would be fed into the job.
Tho bottom line is that if you want functionality like this you either want to move to a higher level language (hive/pig) or you need to roll your own solutions.
Big part of the processing cost - is data parsing to produce Key-Values to the Mapper. We create there (usually) one java object per value + some container. It is costly both in terms of CPU and garbage collector pressure
I would suggest the solution "in the middle". You can write input format which will read the input stream and skip non-relevant data in the early stage (for example by looking into few first bytes of the string). As a result you will read all data, but actually parse and pass to the Mapper - only portion of it.
Another approach I would consider - is to use RCFile format (or other columnar format), and take care that relevant and non relevant data will sit in the different columns.
If the files that you want to process have some unique attribute about their filename (like extension or partial filename match), you can also use the setInputPathFilter method of FileInputFormat to ignore all but the ones you want for your MR job. Hadoop by default ignores all ".xxx" and _xxx" files/dirs, but you can extend with setInputPathFilter.
As others have noted above, you will likely get sub-optimal performance out of your cluster doing something like this which breaks the "one block per mapper" paradigm, but sometimes this is acceptable. Can sometimes take more to "do it right", esp if you're dealing with a small amount of data & the time to re-architect and/or re-dump into HBase would eclipse the extra time required to run your job sub-optimally.

Mutiple maps followed by one reduce with Hadoop and HBase

I have several Hbase tables. I wish to run a map task on each table (each map being a different Mapper class since each table contains heterogeneous data) followed by one reduce.
I cannot work out if this is possible without explictly reducing the data after each map into an interim SequenceFile.
Any help would be gratefully received.
It seems you can only run an MR on one table at a time (see TableMapReduceUtil). So most probably, your best bet is as you suspected: save the output of each table into an interim location (e.g. SequenceFile or a tmp hbase table) and then write a final MR job that takes that location as an input and merges the results. Also, if each MR job outputs data in a common format, you may not even need the last MR merge job.

Resources