HBase bulk load usage - hadoop

I am trying to import some HDFS data to an already existing HBase table.
The table I have was created with 2 column families, and with all the default settings that HBase comes with when creating a new table.
The table is already filled up with a large volume of data, and it has 98 online regions.
Its row keys have the following (simplified) form:
2-CHAR-ID + 6-DIGIT-NUMBER + 3 x 32-CHAR-MD5-HASH.
Example of key: IP281113ec46d86301568200d510f47095d6c99db18630b0a23ea873988b0fb12597e05cc6b30c479dfb9e9d627ccfc4c5dd5fef.
The data I want to import is on HDFS, and I am using a Map-Reduce process to read it. I emit Put objects from my mapper, which correspond to each line read from the HDFS files.
The data I want to import has keys which will all start with "XX181113".
The job is configured with:
HFileOutputFormat.configureIncrementalLoad(job, hTable)
Once I start the process, I see that it is configured with 98 reducers (equal to the number of online regions the table has), but the issue is that 4 reducers get 100% of the data split among them, while the rest do nothing.
As a result, I see only 4 output folders, which are very large.
Do these files correspond to 4 new regions which I can then import into the table? And if so, why only 4, when 98 reducers are created?
Reading the HBase docs:
In order to function efficiently, HFileOutputFormat must be configured such that each output HFile fits within a single region. In order to do this, jobs whose output will be bulk loaded into HBase use Hadoop's TotalOrderPartitioner class to partition the map output into disjoint ranges of the key space, corresponding to the key ranges of the regions in the table.
confused me even more as to why I get this behaviour.
Thanks!

The number of reducers that actually receive data doesn't depend on the number of regions you have in the table, but rather on how your data falls into those regions (each region contains a range of keys). Since you mention that all your new data starts with the same prefix, it likely fits into only a few regions.
You can pre-split your table so that the new data is divided among more regions.
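For example, one way to pre-split is to create the table with explicit split keys that cover the key range of the data being loaded, so the bulk-load output (and therefore the reducers that actually receive data) is spread over more regions. A rough sketch, assuming an HBase 0.96-era client API; the table name, column families, and split points below are placeholders, not taken from the question:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitTable {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("mytable")); // placeholder name
        desc.addFamily(new HColumnDescriptor("cf1")); // placeholder column families
        desc.addFamily(new HColumnDescriptor("cf2"));

        // Split points inside the common "XX181113" prefix, spread over the hex
        // hash part of the key, so keys sharing that prefix land in different regions.
        byte[][] splits = new byte[][] {
            Bytes.toBytes("XX1811134"),
            Bytes.toBytes("XX1811138"),
            Bytes.toBytes("XX181113c")
        };
        admin.createTable(desc, splits);
        admin.close();
    }
}

For a table that already exists, the same idea applies: you can ask HBase to split its existing regions at keys inside the new data's range (via the admin API or the shell's split command) before running the bulk load.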

Related

HBase HFiles size generation

I am working on an HBase cluster with 28 region servers.
I have a table, which uses a wide-table definition. The row key is a Hex string, while each row has exactly one column family, which in turn has 80 qualifiers.
Each qualifier name is an int (starting from 1 to 80) and each value is a long.
The table has been pre-split into 28 regions, using the classic getHexSplits method defined in the HBase manual here.
I have a Map-Reduce job which creates the table, and has to load about 1.8 TB of data in it.
I am using HFileOutputFormat to create the HFiles. The problem is that, despite the job being configured with 28 reducers and hbase.hregion.max.filesize being set to the default (10 GB), I get a lot more HFiles than I expect (1149 of approximately 1.61 GB each!).
The trouble is that, once the table is created and the HFiles are loaded, the table immediately starts both MAJOR and MINOR compactions, which trigger a lot of I/O and affect my next Map-Reduce job, which reads from the table. I suppose this happens because there are multiple HFiles per region, and HBase tries to compact them to optimize the reads?
How can I make sure I get a smaller number of HFiles, in order to avoid the compactions? What would be the ideal number of regions for the table, and what other parameters can I set to make sure I get no compactions?
My table is written only once, and then used just for reads.
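For reference, the getHexSplits method from the HBase manual looks roughly like this; it computes numRegions - 1 evenly spaced split keys between a start and an end key, both given as hex strings (reproduced here as a sketch):

import java.math.BigInteger;

public class HexSplits {
    // Roughly the pre-splitting helper from the HBase manual: evenly spaced
    // hex split keys between startKey and endKey for numRegions regions.
    public static byte[][] getHexSplits(String startKey, String endKey, int numRegions) {
        byte[][] splits = new byte[numRegions - 1][];
        BigInteger lowestKey = new BigInteger(startKey, 16);
        BigInteger highestKey = new BigInteger(endKey, 16);
        BigInteger range = highestKey.subtract(lowestKey);
        BigInteger regionIncrement = range.divide(BigInteger.valueOf(numRegions));
        lowestKey = lowestKey.add(regionIncrement);
        for (int i = 0; i < numRegions - 1; i++) {
            BigInteger key = lowestKey.add(regionIncrement.multiply(BigInteger.valueOf(i)));
            splits[i] = String.format("%016x", key).getBytes();
        }
        return splits;
    }
}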

Increasing mappers in Pig

I am using Pig to load data from Cassandra using CqlStorage. I have 4 data nodes, each of which can run 7 mappers, and there are ~30 million rows in Cassandra. When I run a load like this:
LOAD 'cql://keyspace/columnfamily' using CqlStorage
it takes 27 mappers to run.
But if I give a where clause in the load function, like:
LOAD 'cql://keyspace/columnfamily?where_clause=id%3D100' using CqlStorage
it always takes one mapper.
Can anyone help me increase the number of mappers?
It looks from your WHERE clause like your map input will only be a single key, which would be the reason why you only get one mapper. Hadoop will allocate mappers based on the number of input keys. If you have only one input key, additional mappers will do nothing.
The bottom line is that if you specify your partition key in the where clause, you will get one mapper (since that's the way it gets distributed). Based on the comments I presume you are doing analysis for more than just one student, so there's no reason you'd be specifying the partition key. You also don't seem to have any columns that make sense for a secondary index. So I'm not sure why you even have a where clause.
It looks from your data model like you'll have to map over all your data to get aggregate marks for a combination of student and time range. It's possible you could change to a time-series data model and successfully filter in the where clause, but your current model doesn't support this.

How does HBase enable Random Access to HDFS?

Given that HBase is a database with its files stored in HDFS, how does it enable random access to a singular piece of data within HDFS? By which method is this accomplished?
From the Apache HBase Reference Guide:
HBase internally puts your data in indexed "StoreFiles" that exist on HDFS for high-speed lookups. See the Chapter 5, Data Model and the rest of this chapter for more information on how HBase achieves its goals.
Scanning both chapters didn't reveal a high-level answer for this question.
So how does HBase enable random access to files stored in HDFS?
HBase stores data in HFiles that are indexed (sorted) by their key. Given a random key, the client can determine which region server to ask for the row from. The region server can determine which region to retrieve the row from, and then do a binary search through the region to access the correct row. This is accomplished by having sufficient statistics to know the number of blocks, block size, start key, and end key.
For example: a table may contain 10 TB of data. But, the table is broken up into regions of size 4GB. Each region has a start/end key. The client can get the list of regions for a table and determine which region has the key it is looking for. Regions are broken up into blocks, so that the region server can do a binary search through its blocks. Blocks are essentially long lists of key, attribute, value, version. If you know what the starting key is for each block, you can determine one file to access, and what the byte-offset (block) is to start reading to see where you are in the binary search.
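To make this concrete, a single-row lookup through the Java client looks roughly like the sketch below (the table name and row key are placeholders, and this assumes the older HTable API); all of the region lookup and block-level search described above happens behind this one get() call:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class RandomRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "mytable");       // client looks up and caches region locations
        Get get = new Get(Bytes.toBytes("some-row-key")); // one arbitrary row key
        Result result = table.get(get);                   // routed to the region server holding that key's region
        System.out.println("Row empty? " + result.isEmpty());
        table.close();
    }
}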
HBase accesses HDFS files by using the HFile format. You can check this URL for details: http://hbase.apache.org/book/hfilev2.html

HBase as Input -> unable to balance load over available map tasks

I want each Hadoop mapper to process a separate portion of the data in an M/R job, and I would like to test, on a pseudo-distributed (single-node) setup, the case where many mappers would be needed because of a bigger input data size. Given the size of my current input and the standalone mode I am experimenting on, I can only see 1 map task.
My input comes from an HBase table, and I thought that the number of regions of an HBase table is equal to the number of mappers used to process the table's data.
So, to reproduce a case where many mappers would process the input data, I predefined the regions of the table through the shell like this:
create 't1', 'f1', {NUMREGIONS => 4, SPLITALGO => 'HexStringSplit'}
or with 'UniformSplit' as the SPLITALGO. But even though the number of mappers does increase to the specified number of regions (after importing data into the respective table), all the input data (in a subsequent test job where I try to read from this table) passes through only one mapper, with the others processing none of the input rows.
I work on a pseudo-distributed (single-node) setup and I really don't know how to solve this. Does anyone have any ideas? Thanks!
Are you scanning the entire table or just a section of it? If you are scanning a section of the table, then that might be the cause of your problem as your data source isn't big enough to trigger multiple mappers.
You can try to decrease the region size in your hbase-site.xml configuration and restart HBase to achieve the desired effect.
Lastly, in your mapred-site.xml configuration, how many mapper slots do you have? If it is just 1, this will not limit the total number of map tasks for the job, but it will limit the number of map tasks that can run at a time on that server.
Other than that, I don't think you have much control over specifying the number of mappers per job, not like you do with the number of reducers.
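For completeness, this is roughly how a whole-table scan is wired up as M/R input (a sketch only; the mapper, job name, and counter are made up, and it assumes Hadoop 2 / HBase 0.96-era APIs). This is the setup where you would normally expect one map task per region:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class FullTableScanJob {
    // Trivial mapper just to show the wiring; it only bumps a counter per row.
    static class RowCountMapper extends TableMapper<NullWritable, NullWritable> {
        @Override
        protected void map(ImmutableBytesWritable key, Result value, Context context) {
            context.getCounter("stats", "rows").increment(1);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "full-table-scan"); // hypothetical job name
        job.setJarByClass(FullTableScanJob.class);

        Scan scan = new Scan();       // no start/stop row: scan the whole table,
        scan.setCaching(500);         // so each region should become one input split / map task
        scan.setCacheBlocks(false);   // usually recommended for MR scans

        TableMapReduceUtil.initTableMapperJob(
            "t1",                     // the pre-split table from the question
            scan,
            RowCountMapper.class,
            NullWritable.class,
            NullWritable.class,
            job);

        job.setNumReduceTasks(0);                         // map-only for this illustration
        job.setOutputFormatClass(NullOutputFormat.class); // no output files needed
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

If the scan is restricted to a start/stop row range that falls inside a single region, only that region contributes an input split, which matches the "only one mapper does the work" symptom.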

Write data in HBase

I have a problem while writing data to HBase. I have 4 region servers. When I write data using random keys, the data gets written to various regions, but they all end up on one region server. One server is busy while the other three are free. How do I spread the writes evenly across all region servers?
HBase partitions its tables across region servers. See:
How HBase partitions table across regionservers?
http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html
I am not sure how random or far apart your random keys need to be in order to write to different partitions.
See the discussions on hbase.hregion.max.filesize and base.hregion.maxfilesize, which suggest that tables are split into new regions when the appropriate data size has been reached.
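One quick thing to check is how many regions the table actually has and which region servers host them; if there is only one region, every write goes to the server hosting it, no matter how random the keys are, until the region grows large enough to split. A small sketch, assuming an older (0.94/0.96-era) HTable client API; the table name is a placeholder:

import java.util.Map;
import java.util.NavigableMap;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HRegionInfo;
import org.apache.hadoop.hbase.ServerName;
import org.apache.hadoop.hbase.client.HTable;

public class ShowRegionLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "mytable"); // placeholder table name
        // Each region of the table mapped to the region server currently hosting it.
        NavigableMap<HRegionInfo, ServerName> regions = table.getRegionLocations();
        for (Map.Entry<HRegionInfo, ServerName> e : regions.entrySet()) {
            System.out.println(e.getKey().getRegionNameAsString() + " -> " + e.getValue().getHostname());
        }
        table.close();
    }
}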
