It takes 6 seconds to return the JSON for 9000 datapoints.
I have approximately 10 GB of data across 12 metrics, say x.open, x.close, and so on.
Data Storage pattern:
Metric : x.open
tagk : symbol
tagv : stringValue
Metric : x.close
tagk : symbol
tagv : stringValue
My cluster setup is as follows:
Node 1: (Real 16GB ActiveNN) JournalNode, Namenode, Zookeeper, RegionServer, HMaster, DFSZKFailoverController, TSD
Node 2: (VM 8GB StandbyNN) JournalNode, Namenode, Zookeeper, RegionServer
Node 3: (Real 16GB) Datanode, RegionServer, TSD
Node 4: (VM 4GB) JournalNode, Datanode, Zookeeper, RegionServer
The setup is for POC/dev, not for production.
The timestamp range is wide: one datapoint per day for each symbol under each metric, from 1980 to today. In other words, in a continuous run each of my 12 metrics gets 3095 datapoints added every day, one per symbol.
Tag-value cardinality in the current scenario is 3095+ symbols.
Query Sample:
http://myIPADDRESS:4242/api/query?start=1980/01/01&end=2016/02/18&m=sum:stock.Open{symbol=IBM}&arrays=true
Debugger Result:
8.44 sec; datapoints retrieved: 8859; data size: 55 KB
Write speed is also slow: it takes 6.5 hours to write 2.2 million datapoints.
Am I wrong somewhere in my configuration, or am I expecting too much?
Method for writing: JSON objects via HTTP
Salting enabled: not yet
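For reference, the write path is roughly the sketch below (simplified; it assumes OpenTSDB's standard /api/put JSON endpoint, and the metric, timestamp, and value shown are just illustrative):

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class TsdbPutSketch {
    public static void main(String[] args) throws Exception {
        // One JSON object per datapoint; /api/put accepts an array, so several
        // datapoints can be batched into a single HTTP request.
        String body = "["
            + "{\"metric\":\"stock.Open\",\"timestamp\":1455753600,"
            + "\"value\":121.5,\"tags\":{\"symbol\":\"IBM\"}},"
            + "{\"metric\":\"stock.Close\",\"timestamp\":1455753600,"
            + "\"value\":122.1,\"tags\":{\"symbol\":\"IBM\"}}"
            + "]";

        HttpURLConnection conn = (HttpURLConnection)
            new URL("http://myIPADDRESS:4242/api/put").openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body.getBytes("UTF-8"));
        }
        System.out.println("HTTP status: " + conn.getResponseCode());
    }
}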
Too much data in one metric will hurt performance. The result may be only 9000 datapoints, but the raw dataset behind them may be very large: retrieving 9000 datapoints out of one million is very different from retrieving 9000 datapoints out of one billion.
Related
I am learning Hadoop. I want to understand how datasets/databases are set up for environments like dev, test, and pre-prod.
Of course, in the PROD environment we will be dealing with terabytes of data, but keeping an identical replica of those terabytes in the other environments doesn't seem feasible.
How are datasets replicated to the other environments? Is only a portion of the data loaded and used in the non-prod environments, and if so, how is that done?
As for how data is replicated, the HDFS concepts of NameNodes and DataNodes are the place to start your research. When you create a new file, the client contacts the NameNode, which updates its metadata and hands back block IDs. As you write, the client picks the nearest DataNode based on rack location and streams the block to it. Once that first DataNode has the block, it replicates it to the second DataNode, the second to the third, and so on. In other words, the client only writes to the very first node, and the HDFS framework handles the subsequent replication pipeline.
Can somebody let me know what will happen if my Hadoop cluster (replication factor = 3) is left with only 15 GB of space and I try to save a file which is 6 GB in size?
hdfs dfs -put 6gbfile.txt /some/path/on/hadoop
Will the put operation fail with an error (probably "cluster full"), or will it save two replicas of the 6 GB file, mark the blocks it cannot place as under-replicated, and thereby occupy the whole of the 15 GB left over?
You should be able to store the file.
It will try to accommodate as many replicas as possible. When it cannot store all of them, it will log a warning but not fail, and as a result you will end up with under-replicated blocks.
The warning you would see is:
WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Not able to place enough replicas
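If you want to see where the replicas actually landed (and spot under-replicated blocks), here is a small sketch using the HDFS Java API; it assumes the client Configuration points at your cluster, and the path is the one from the question:

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicaCheck {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus st = fs.getFileStatus(new Path("/some/path/on/hadoop/6gbfile.txt"));

        // getReplication() is the *requested* factor (3 here); the actual
        // placement per block is what getFileBlockLocations() reports.
        System.out.println("Requested replication: " + st.getReplication());
        for (BlockLocation loc : fs.getFileBlockLocations(st, 0, st.getLen())) {
            System.out.println("offset " + loc.getOffset()
                + " -> replicas on " + Arrays.toString(loc.getHosts()));
        }
        fs.close();
    }
}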
Whenever you fire the put command:
The dfs utility is acting as a client here.
The client contacts the NameNode first; the NameNode tells the client where to write the blocks and maintains the metadata for that file. It is then the client's responsibility to split the data into blocks according to the configured block size.
The client then makes a direct connection to the different DataNodes where it has to write the blocks, as directed by the NameNode.
Only the first copy of the data is written by the client to the DataNodes; subsequent copies are created by the DataNodes on one another, with guidance from the NameNode.
So you should be able to put the 6 GB file if 15 GB of space is available, because initially only the original copy is written; the problem only arises later, once the replication process tries to place the additional copies.
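A sketch of that write path from the Java client side, assuming again that the default Configuration points at your cluster: the client streams a single copy, and the DataNode pipeline is responsible for the remaining replicas.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // The client writes each block once; the receiving DataNode forwards it
        // to the next DataNode in the pipeline until the replication factor
        // (here 3) is satisfied or no more space/nodes are available.
        Path target = new Path("/some/path/on/hadoop/6gbfile.txt");
        short replication = 3;
        long blockSize = fs.getDefaultBlockSize(target);
        try (FSDataOutputStream out =
                 fs.create(target, true, 4096, replication, blockSize)) {
            out.writeBytes("example payload\n");
        }
        fs.close();
    }
}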
I am new to Apache Hadoop. I have an Apache Hadoop cluster of 3 nodes. I am trying to load a file with 4.5 billion records, but it is not getting distributed to all nodes; the behavior looks like region hotspotting.
I have removed the "hbase.hregion.max.filesize" parameter from the hbase-site.xml config file.
I observed that with a 4-node cluster the data is distributed to 3 nodes, and with a 3-node cluster it is distributed to 2 nodes.
I think I am missing some configuration.
Generally with HBase the main issue is to design rowkeys that are not monotonically increasing.
If they are, only one region server is used at a time:
http://ikaisays.com/2011/01/25/app-engine-datastore-tip-monotonically-increasing-values-are-bad/
This is HBase Reference Guide about RowKey Design:
http://hbase.apache.org/book.html#rowkey.design
And one more really good article:
http://hortonworks.com/blog/apache-hbase-region-splitting-and-merging/
In our case, pre-defining the region splits also improved the loading time:
create 'Some_table', { NAME => 'fam'}, {SPLITS=> ['a','d','f','j','m','o','r','t','z']}
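The same pre-splitting can be done from the Java client; a rough sketch, assuming a 0.94-era HBaseAdmin API and the table/family names from the shell example:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitTable {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        HTableDescriptor desc = new HTableDescriptor("Some_table");
        desc.addFamily(new HColumnDescriptor("fam"));

        // Same split points as the shell example above: each region owns a
        // contiguous slice of the key space, so a bulk load spreads across
        // all region servers instead of hammering a single one.
        byte[][] splits = {
            Bytes.toBytes("a"), Bytes.toBytes("d"), Bytes.toBytes("f"),
            Bytes.toBytes("j"), Bytes.toBytes("m"), Bytes.toBytes("o"),
            Bytes.toBytes("r"), Bytes.toBytes("t"), Bytes.toBytes("z")
        };
        admin.createTable(desc, splits);
        admin.close();
    }
}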
Regards
Pawel
I'm using the HBase client in my application server (cum web server) with an
HBase cluster of 6 nodes running CDH3u4 (HBase 0.90). The HBase/Hadoop services
running on the cluster are:
NODENAME-- ROLE
Node1 -- NameNode
Node2 -- RegionServer, SecondaryNameNode, DataNode, Master
Node3 -- RegionServer, DataNode, Zookeeper
Node4 -- RegionServer, DataNode, Zookeeper
Node5 -- RegionServer, DataNode, Zookeeper
Node6 -- Cloudera Manager, RegionServer, DataNode
I'm using the following optimizations for my HBase client (sketched in code below):
auto-flush = false
clearBufferOnFail = true
HTable bufferSize = 12MB
Put setWriteToWAL = false (I'm fine with losing some data).
To keep reads and writes closely consistent, I call flushCommits on all the
buffered tables every 2 seconds.
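Roughly, the client-side setup looks like the sketch below (simplified; table, family, and row values are placeholders, and it assumes the 0.90-era HTable API):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class WriterSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "my_table");

        // Buffer puts on the client and clear the buffer if a flush fails.
        table.setAutoFlush(false, true);
        table.setWriteBufferSize(12 * 1024 * 1024);   // 12 MB

        Put put = new Put(Bytes.toBytes("some-row-key"));
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("value"));
        put.setWriteToWAL(false);                     // accept possible data loss
        table.put(put);

        // Called every ~2 seconds from a timer in the real application.
        table.flushCommits();
        table.close();
    }
}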
In my application, I place each HBase write call on a queue (asynchronously) and
drain the queue with 20 consumer threads. When hitting the web server locally
with curl, I see a TPS of 2500 for HBase after curl completes, but under a
load test where requests arrive at a high rate of 1200 hits per second across
3 application servers, the consumer (drain) threads responsible for writing to
HBase do not write data at a rate comparable to the input rate. I see no more
than 600 TPS when the request rate is 1200 hits per second.
Can anyone suggest what we can do to improve performance? I've tried reducing
the threads to 7 on each of the 3 app servers, but still no effect. An expert
opinion would be helpful. As this is a production cluster, I'm not planning to
swap the node roles unless someone points out a severe performance benefit.
[EDIT]:
Just to highlight/clarify our HBase writing pattern: the 1st transaction checks for the row in Table-A (using HTable.exists). It fails to find the row the first time and so writes to three tables. The subsequent 4 transactions make the same existence check on Table-A and, since the row is now found, write only to 1 table.
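In code, the pattern looks roughly like the sketch below (table, family, and qualifier names are placeholders):

import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class TxnSketch {
    // tableA/B/C are pre-created HTable instances; rowKey identifies the entity.
    static void handleTransaction(HTable tableA, HTable tableB, HTable tableC,
                                  byte[] rowKey, byte[] value) throws Exception {
        boolean seenBefore = tableA.exists(new Get(rowKey));   // extra RPC per txn

        Put put = new Put(rowKey);
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), value);
        tableA.put(put);

        if (!seenBefore) {
            // First transaction for this row: also write to the other two tables.
            Put putB = new Put(rowKey);
            putB.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), value);
            tableB.put(putB);

            Put putC = new Put(rowKey);
            putC.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), value);
            tableC.put(putC);
        }
    }
}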
So that's a pretty ancient version of HBase. As of Aug 18, 2013, I would recommend upgrading to something based off of 0.94.x.
Other than that, it's really hard to say for sure. There are lots of tuning knobs. You should:
Make sure that HDFS has enough xceivers.
Make sure that HBase has enough heap space.
Make sure there is no swapping
Make sure there are enough handlers.
Make sure that you have compression turned on. [1]
Check disk I/O
Make sure that your row keys, column family names, column qualifiers, and values are as small as possible
Make sure that your writes are well distributed across your key space
Make sure your regions are (pre-)split
If you're on a recent version then you might want to look at data block encoding [2] (a column-family sketch follows the links below)
After all of those things are taken care of, you can start looking at logs and jstacks.
[1] https://hbase.apache.org/book/compression.html
[2] https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/io/encoding/FastDiffDeltaEncoder.html
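For [1] and [2], a rough sketch of enabling compression and FAST_DIFF data block encoding on a column family, assuming a 0.94-era client API, a placeholder table name, and that the Snappy native libraries are installed on the region servers:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.io.encoding.DataBlockEncoding;
import org.apache.hadoop.hbase.io.hfile.Compression;

public class EnableCompression {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        // Modify an existing family; the table must be disabled first.
        admin.disableTable("my_table");
        HColumnDescriptor cf = new HColumnDescriptor("cf");
        cf.setCompressionType(Compression.Algorithm.SNAPPY);
        cf.setDataBlockEncoding(DataBlockEncoding.FAST_DIFF);
        admin.modifyColumn("my_table", cf);
        admin.enableTable("my_table");
        admin.close();
    }
}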
I'm running a small HBase 0.94.7 cluster with two region servers. I find that the load across the region servers is very unbalanced. From the web UI, I got:
Region1: numberOfOnlineRegions=1, usedHeapMB=26, maxHeapMB=3983
Region2: numberOfOnlineRegions=22, usedHeapMB=44, maxHeapMB=3983
Region2 also serves as the master. I checked that the load balancer is on, and I found these entries in the master log:
INFO org.apache.hadoop.hbase.master.LoadBalancer: Skipping load balancing because balanced cluster; servers=2 regions=1 average=0.5 mostloaded=1 leastloaded=0
DEBUG org.apache.hadoop.hbase.master.LoadBalancer: Balance parameter: numRegions=10, numServers=2, max=5, min=5
INFO org.apache.hadoop.hbase.master.LoadBalancer: Calculated a load balance in 12ms. Moving 5 regions off of 1 overloaded servers onto 1 less loaded servers
DEBUG org.apache.hadoop.hbase.master.LoadBalancer: Balance parameter: numRegions=8, numServers=2, max=4, min=4
INFO org.apache.hadoop.hbase.master.LoadBalancer: Calculated a load balance in 0ms. Moving 4 regions off of 1 overloaded servers onto 1 less loaded servers
INFO org.apache.hadoop.hbase.master.LoadBalancer: Skipping load balancing because balanced cluster; servers=2 regions=1 average=0.5 mostloaded=1 leastloaded=0
INFO org.apache.hadoop.hbase.master.HMaster: balance hri=LogTable,\x00\x00\x01\xE8\x00\x00\x01#\x09\xB2\xBA4$\xC3Oe,1374591174086.65391b7a54e9c8e85a3d94bf7627fd20., src=region2,60020,1374587851008, dest=region1,60020,1374587851018
DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region LogTable,\x00\x00\x01\xE8\x00\x00\x01#\x09\xB2\xBA4$\xC3Oe,1374591174086.65391b7a54e9c8e85a3d94bf7627fd20. (offlining)
It seems the load cannot be balanced from region2 to region1. Is this a configuration problem? What parameters should I check on region1?
Thanks
Are you using sequential rowkeys, like a timestamp? If that is the case, you might end up with region server hotspotting, which puts uneven load on the servers. Avoid sequential keys if you can; if that is not possible, create pre-split tables.
For example, if your rowkey is composed of an ID, a date, and a hash value, you could reorder it so the hash value comes first: hashvalue + date.
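A minimal sketch of that kind of key construction (the field names and the hash choice are just illustrative; the point is that the leading bytes are no longer monotonically increasing):

import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.hbase.util.MD5Hash;

public class RowKeyBuilder {
    // Build a rowkey of the form hash(id) + date + id so that writes spread
    // across the key space instead of always hitting the "latest" region.
    static byte[] buildKey(String id, String yyyymmdd) {
        // A short, stable hash of the ID as the leading component.
        String hashPrefix = MD5Hash.getMD5AsHex(Bytes.toBytes(id)).substring(0, 8);
        return Bytes.toBytes(hashPrefix + yyyymmdd + id);
    }

    public static void main(String[] args) {
        System.out.println(Bytes.toString(buildKey("IBM", "20160218")));
    }
}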