HBase region load not balanced over region servers - hadoop

I'm running a small HBase 0.94.7 cluster with two region servers, and I find that the request load across the region servers is very unbalanced. From the web UI, I got:
Region1: numberOfOnlineRegions=1, usedHeapMB=26, maxHeapMB=3983
Region2: numberOfOnlineRegions=22, usedHeapMB=44, maxHeapMB=3983
Region2 also serves as the master. I checked that the load balancer is on, and I found these entries in the master log:
INFO org.apache.hadoop.hbase.master.LoadBalancer: Skipping load balancing because balanced cluster; servers=2 regions=1 average=0.5 mostloaded=1 leastloaded=0
DEBUG org.apache.hadoop.hbase.master.LoadBalancer: Balance parameter: numRegions=10, numServers=2, max=5, min=5
INFO org.apache.hadoop.hbase.master.LoadBalancer: Calculated a load balance in 12ms. Moving 5 regions off of 1 overloaded servers onto 1 less loaded servers
DEBUG org.apache.hadoop.hbase.master.LoadBalancer: Balance parameter: numRegions=8, numServers=2, max=4, min=4
INFO org.apache.hadoop.hbase.master.LoadBalancer: Calculated a load balance in 0ms. Moving 4 regions off of 1 overloaded servers onto 1 less loaded servers
INFO org.apache.hadoop.hbase.master.LoadBalancer: Skipping load balancing because balanced cluster; servers=2 regions=1 average=0.5 mostloaded=1 leastloaded=0
INFO org.apache.hadoop.hbase.master.HMaster: balance hri=LogTable,\x00\x00\x01\xE8\x00\x00\x01#\x09\xB2\xBA4$\xC3Oe,1374591174086.65391b7a54e9c8e85a3d94bf7627fd20., src=region2,60020,1374587851008, dest=region1,60020,1374587851018
DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region LogTable,\x00\x00\x01\xE8\x00\x00\x01#\x09\xB2\xBA4$\xC3Oe,1374591174086.65391b7a54e9c8e85a3d94bf7627fd20. (offlining)
It seems that the load cannot be balanced from region2 to region1. Is this a configuration problem? What parameters should I check on region1?
Thanks

Are you using sequential row keys, like a timestamp? If that is the case, you might end up with region server hotspotting, which puts uneven load on the servers. Avoid sequential keys if you can. If that is not possible, create pre-split tables.

For example, if your row key is composed of an ID, a date and a hash value, you could reorder the row key as hash value + date, as sketched below.
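For illustration only, a rough sketch of that idea in Java (the MD5 choice and the helper name are assumptions, not something from the original question):

import java.security.MessageDigest;
import org.apache.hadoop.hbase.util.Bytes;

// Hypothetical helper: build a row key as hash(id) + date so writes spread
// across regions instead of piling onto the most recent date.
public class RowKeyBuilder {
    public static byte[] build(String id, long dateMillis) throws Exception {
        byte[] hash = MessageDigest.getInstance("MD5").digest(Bytes.toBytes(id));
        return Bytes.add(hash, Bytes.toBytes(dateMillis));
    }
}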

Related

Performance of OpenTSDB

It takes 6 seconds to return the JSON for 9000 data points.
I have approximately 10 GB of data across 12 metrics, say x.open, x.close, and so on.
Data Storage pattern:
Metric : x.open
tagk : symbol
tagv : stringValue
Metric : x.close
tagk : symbol
tagv : stringValue
My cluster setup is as follows:
Node 1: (Real 16GB ActiveNN) JournalNode, Namenode, Zookeeper, RegionServer, HMaster, DFSZKFailoverController, TSD
Node 2: (VM 8GB StandbyNN) JournalNode, Namenode, Zookeeper, RegionServer
Node 3: (Real 16GB) Datanode, RegionServer, TSD
Node 4: (VM 4GB) JournalNode, Datanode, Zookeeper, RegionServer
The setup is for POC/dev, not for production.
The timestamp spread is one data point per day for each symbol under each metric, from 1980 to today.
In other words, in a continuous run each of my 12 metrics gets 3095 data points added every day, one per symbol.
Tag value cardinality in the current scenario is 3095+ symbols.
Query Sample:
http://myIPADDRESS:4242/api/query?start=1980/01/01&end=2016/02/18&m=sum:stock.Open{symbol=IBM}&arrays=true
Debugger Result:
8.44 sec; data points retrieved: 8859; data size: 55 KB
Data writing speed is also slow: it takes 6.5 hours to write 2.2 million data points.
Am I wrong somewhere with the configuration, or am I expecting too much?
Method for writing: JSON objects via HTTP
Salting enabled: not yet
Too much data in one metric will hurt performance. The result may be only 9000 data points, but the raw data set that has to be scanned may be much bigger. Retrieving 9000 data points out of one million is very different from retrieving 9000 data points out of one billion.

HBase Data Access performance improvement using HBase API

I am trying to scan some rows using a prefix filter on an HBase table. I am on HBase 0.96.
I want to increase the throughput of each RPC call so as to reduce the number of requests hitting the region.
I tried setCaching(int) and setCacheBlocks(true) on the scan object. I also tried resultScanner.next(int). With all these combinations I am still not able to reduce the number of RPC calls; I am still hitting the HBase region for each key instead of bringing back multiple keys per RPC call.
The HBase region server/datanode has enough CPU and memory allocated. My data is also evenly distributed across the different region servers, and the data that I am bringing back per key is not a lot.
I observed that when I add more data to the table, the time taken per request increases. It also increases when the number of requests increases.
Thank you for your help.
A prefix filter is usually a performance killer because it results in a full table scan; always use a start and stop row in your scans rather than a prefix filter:
// Scan only rows starting with "prefix"; '~' (0x7E) sorts after the other printable ASCII characters, so it works as a stop row for ASCII keys.
Scan scan = new Scan(Bytes.toBytes("prefix"), Bytes.toBytes("prefix~"));
When you iterate over the Results from the ResultScanner, every iteration is by default an RPC call; you can call resultScanner.next(n) to get a batch of results in one go.
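Putting both suggestions together, a minimal sketch (the table name and the caching value are made up for illustration; adjust them to your schema):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

Configuration conf = HBaseConfiguration.create();
HTable table = new HTable(conf, "my_table");                  // hypothetical table name
Scan scan = new Scan(Bytes.toBytes("prefix"), Bytes.toBytes("prefix~"));
scan.setCaching(500);         // fetch up to 500 rows per RPC instead of one at a time
scan.setCacheBlocks(false);   // avoid polluting the block cache with a one-off scan
ResultScanner scanner = table.getScanner(scan);
try {
    for (Result result : scanner) {
        // process result ...
    }
} finally {
    scanner.close();
    table.close();
}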

How to run Hue Hive Queries sequentially

I have set up Cloudera Hue on a cluster with a master node (200 GiB disk, 16 GiB RAM) and 3 datanodes (150 GiB disk and 8 GiB RAM each).
I have a database of roughly 70 GiB. The problem is that when I run Hive queries from the Hive editor (Hue GUI) and submit 5 or 6 queries for execution, the jobs are started but they hang and never run. How can I run the queries sequentially? Even though I can submit several queries, a new query should only start when the previous one has completed. Is there any way to make the queries run one by one?
You can run all your queries in one go by separating them with ';' in Hue.
For example:
Query1;
Query2;
Query3
In this case query1, query2 and query3 will run sequentially, one after another.
Hue submits all the queries; if they hang, you are probably hitting a misconfiguration in YARN, like gotcha #5 in http://blog.cloudera.com/blog/2014/04/apache-hadoop-yarn-avoiding-6-time-consuming-gotchas/
The entire flow of YARN/MR2 is as follows:
A query is submitted from the Hue Hive query editor.
A job is started and the ResourceManager launches an ApplicationMaster on one of the datanodes.
The ApplicationMaster asks the ResourceManager for resources (e.g. 2 * 1 GiB / 1 core).
The ResourceManager grants these resources as containers on the NodeManagers, which then run the map and reduce tasks for the ApplicationMaster.
So resource allocation is handled by YARN. In the case of a Cloudera cluster, Dynamic Resource Pools (a kind of queue) is where jobs are submitted, and YARN then allocates resources to those jobs. By default, the maximum number of concurrent jobs is set in such a way that the ResourceManager hands all the resources to the jobs/ApplicationMasters, leaving no space for the task containers that the ApplicationMasters need later to actually run the tasks.
http://www.cloudera.com/content/cloudera/en/resources/library/recordedwebinar/introduction-to-yarn-and-mapreduce-2-slides.html
So if we submit a large number of queries in the Hue Hive editor for execution, they are all submitted as jobs concurrently, the ApplicationMasters for them are allocated the available resources, no space is left for task containers, and all the jobs end up in the pending state.
The solution is as mentioned above by @Romain:
Set the maximum number of concurrent jobs according to the size and capacity of your cluster; in my case a value of 4 worked.
Now only 4 jobs run concurrently from the pool, and the ResourceManager allocates resources to them.
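For reference, a minimal sketch of the equivalent setting in a plain fair-scheduler.xml allocation file (Cloudera Manager's Dynamic Resource Pools UI generates a file of this shape; the pool name root.default and the value 4 here are just examples):

<allocations>
  <queue name="root.default">
    <maxRunningApps>4</maxRunningApps>
  </queue>
</allocations>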

How to balance load of HBase while loading file?

I am new to Apache Hadoop. I have an Apache Hadoop cluster of 3 nodes and am trying to load a file with 4.5 billion records, but the load is not getting distributed to all the nodes. The behavior looks like region hotspotting.
I have removed the "hbase.hregion.max.filesize" parameter from the hbase-site.xml config file.
I observed that with a 4-node cluster the data gets distributed to 3 nodes, and with a 3-node cluster it gets distributed to 2 nodes.
I think I am missing some configuration.
Generally with HBase the main issue is to design row keys that are not monotonically increasing. If they are, only one region server is used at a time:
http://ikaisays.com/2011/01/25/app-engine-datastore-tip-monotonically-increasing-values-are-bad/
This is the HBase Reference Guide section about row key design:
http://hbase.apache.org/book.html#rowkey.design
And one more really good article:
http://hortonworks.com/blog/apache-hbase-region-splitting-and-merging/
In our case, pre-defining the region splits also improved the loading time:
create 'Some_table', { NAME => 'fam'}, {SPLITS=> ['a','d','f','j','m','o','r','t','z']}
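If you create tables from code rather than from the shell, the same pre-splitting can be done through the client API; a rough sketch (exact constructors vary a little between HBase versions):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

Configuration conf = HBaseConfiguration.create();
HBaseAdmin admin = new HBaseAdmin(conf);
HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("Some_table"));
desc.addFamily(new HColumnDescriptor("fam"));
byte[][] splits = {
    Bytes.toBytes("a"), Bytes.toBytes("d"), Bytes.toBytes("f"),
    Bytes.toBytes("j"), Bytes.toBytes("m"), Bytes.toBytes("o"),
    Bytes.toBytes("r"), Bytes.toBytes("t"), Bytes.toBytes("z")
};
admin.createTable(desc, splits);   // the table starts with 10 regions instead of 1
admin.close();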
Regards
Pawel

Poor write Performance by HBase client

I'm using the HBase client in my application server (cum web server) with an HBase cluster of 6 nodes running CDH3u4 (HBase 0.90). The HBase/Hadoop services running on the cluster are:
NODENAME-- ROLE
Node1 -- NameNode
Node2 -- RegionServer, SecondaryNameNode, DataNode, Master
Node3 -- RegionServer, DataNode, Zookeeper
Node4 -- RegionServer, DataNode, Zookeeper
Node5 -- RegionServer, DataNode, Zookeeper
Node6 -- Cloudera Manager, RegionServer, DataNode
I'm using the following optimizations for my HBase client:
auto-flush = false
clearBufferOnFail = true
HTable write buffer size = 12 MB
Put setWriteToWAL = false (I'm fine with losing a little data)
To stay closely consistent between reads and writes, I'm calling flushCommits on all the buffered tables every 2 seconds.
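For context, those client-side settings correspond roughly to code like this (a sketch against the 0.90-era client API; the table, family and qualifier names are invented):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

Configuration conf = HBaseConfiguration.create();
HTable table = new HTable(conf, "events");             // hypothetical table
table.setAutoFlush(false, true);                        // auto-flush off, clearBufferOnFail on
table.setWriteBufferSize(12 * 1024 * 1024);             // 12 MB client-side write buffer
Put put = new Put(Bytes.toBytes("some-row-key"));
put.setWriteToWAL(false);                               // skip the WAL, accepting possible data loss
put.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("value"));
table.put(put);                                         // buffered until flushCommits() or the buffer fills
// elsewhere, a timer thread calls table.flushCommits() every 2 seconds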
In my application, I place each HBase write call in a queue (asynchronously) and drain the queue with 20 consumer threads. When hitting the web server locally with curl, I see a TPS of 2500 for HBase after curl completes, but under a load test, with requests coming in at a rate of 1200 hits per second across 3 application servers, the consumer (drain) threads responsible for writing to HBase do not write data at a rate comparable to the input rate. I see no more than 600 TPS when the request rate is 1200 hits per second.
Can anyone suggest what we can do to improve performance? I've tried reducing the number of threads to 7 on each of the 3 app servers, but it had no effect. An expert opinion would be helpful. As this is a production cluster, I'm not thinking of swapping the roles unless someone points out a severe performance benefit.
[EDIT]:
Just to highlight/clarify our HBase writing pattern: the first transaction checks for the row in Table-A (using HTable.exists). It fails to find the row the first time and therefore writes to three tables. The subsequent 4 transactions make the existence check on Table-A and, as they find the row, write to only 1 table.
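In code, that pattern looks roughly like the following (a reconstruction from the description above; the method and variable names are invented):

import java.io.IOException;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;

// Reconstructed write pattern: check Table-A, then fan out accordingly.
void writeEvent(HTable tableA, HTable tableB, HTable tableC,
                byte[] rowKey, Put putA, Put putB, Put putC) throws IOException {
    if (!tableA.exists(new Get(rowKey))) {
        // first transaction for this key: write to all three tables
        tableA.put(putA);
        tableB.put(putB);
        tableC.put(putC);
    } else {
        // subsequent transactions find the row and write to only one table
        tableB.put(putB);
    }
}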
So that's a pretty ancient version of HBase. As of Aug 18, 2013, I would recommend upgrading to something based off of 0.94.x.
Other than that, it's really hard to tell you for sure. There are lots of tuning knobs. You should:
Make sure that HDFS has enough xceivers.
Make sure that HBase has enough heap space.
Make sure there is no swapping.
Make sure there are enough handlers.
Make sure that you have compression turned on. [1]
Check disk I/O.
Make sure that your row keys, column family names, column qualifiers, and values are as small as possible.
Make sure that your writes are well distributed across your key space.
Make sure your regions are (pre-)split.
If you're on a recent version then you might want to look at data block encoding [2] (see the sketch after the links).
After all of those things are taken care of, you can start looking at logs and jstacks.
[1] https://hbase.apache.org/book/compression.html
[2] https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/io/encoding/FastDiffDeltaEncoder.html
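As an illustration of the compression [1] and encoding [2] items, the column-family settings can be applied like this (a sketch against the 0.94 API; the family name is a placeholder, the package for Compression moved in later versions, and Snappy must be installed on the region servers):

import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.io.encoding.DataBlockEncoding;
import org.apache.hadoop.hbase.io.hfile.Compression;

HColumnDescriptor family = new HColumnDescriptor("cf");
family.setCompressionType(Compression.Algorithm.SNAPPY);     // [1] enable compression
family.setDataBlockEncoding(DataBlockEncoding.FAST_DIFF);    // [2] FastDiffDeltaEncoder
// add the family to an HTableDescriptor and create or alter the table with HBaseAdmin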
