My situation is the following: I have a 20-node Hadoop/HBase cluster with 3 ZooKeepers. I do a lot of processing of data from HBase tables to other HBase tables via MapReduce.
Now, if I create a new table and tell a job to use that table as an output sink, all of its data goes onto the same regionserver. That wouldn't surprise me if there were only a few regions, but a particular table of mine has about 450 regions, and here is the problem: most of those regions (about 80%) are on the same region server!
I was wondering how HBase distributes the assignment of new regions throughout the cluster, and whether this behaviour is normal/desired or a bug. Unfortunately, I don't know where to start looking for a bug in my code.
The reason I ask is that this makes jobs incredibly slow. Only once the jobs are completely finished does the table get balanced across the cluster, but that doesn't explain this behaviour. Shouldn't HBase distribute new regions to different servers at the moment of creation?
Thanks for your input!
I believe that this is a known issue. Currently HBase distributes regions across the cluster as a whole without regard for which table they belong to.
Consult the HBase book for background:
http://hbase.apache.org/book/regions.arch.html
It could be that you are on an older version of HBase:
http://comments.gmane.org/gmane.comp.java.hadoop.hbase.user/19155
See the following for a discussion of load balancing and region moving:
http://comments.gmane.org/gmane.comp.java.hadoop.hbase.user/12549
By default, the balancer just balances regions across the region servers without taking the table into account.
You can set hbase.master.loadbalance.bytable to true to get per-table balancing.
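If it helps, enabling per-table balancing is a one-property change in hbase-site.xml (a sketch; double-check the property against your HBase version, as older releases do not honor it):

```xml
<property>
  <name>hbase.master.loadbalance.bytable</name>
  <value>true</value>
</property>
```

After changing it, restart the HMaster (or trigger a balance from the shell) for the new policy to take effect.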
Related
In our application we have data from 3 different countries, and we persist that data in HBase.
In each country, we will keep the data of all 3 countries.
To achieve this, could we build our Hadoop cluster using data centers in all 3 countries and set the replication factor to 3, so that, thanks to the rack-awareness feature, our data gets automatically replicated across all 3 countries?
Any pointers will be of great help.
Thanks
You can't have an HBase cluster span countries. This won't work because of latency, failover problems, network issues, etc.
A good option would be to have 3 clusters, one HBase table per country, and sync the tables between the clusters as proposed above.
As far as I know, only Google has successfully implemented a multi-country database providing both consistency and availability: Spanner. But the key elements of the solution are a private physical network between the data centers and their own implementation of NTP, which guarantees that all servers across the world share the same clock to within a few milliseconds.
This solution looks theoretically feasible, but writes may become pretty slow since data needs to be replicated to 3 nodes located in different geographies. It needs to be tried out to check whether the latency is within a tolerable limit.
Another option could be to have three different HBase clusters at the three locations and design the tables in such a way that tables from one HBase cluster can be copied to the others during night hours to keep the data in sync daily. In this case, each HBase cluster will have current data from its own location, but the data from the other two countries will lag by a day.
I have a large dataset of items in HBase that I want to load into a Spark RDD for processing. My understanding is that HBase is optimized for low-latency single-item lookups on Hadoop, so I am wondering whether it's possible to efficiently query for 100 million items in HBase (~10 TB in size)?
Here is some general advice on making Spark and HBase work together.
Data colocation and partitioning
Spark avoids shuffling: if your Spark workers and HBase region servers are located on the same machines, Spark will create partitions according to regions.
A good region split in HBase will map to a good partitioning in Spark.
If possible, consider working on your rowkeys and region splits.
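One common rowkey trick (my illustration, not part of the answer above) is salting: prefixing the natural key with a stable hash bucket so that sequential keys spread across pre-split regions instead of hot-spotting one region server. A minimal sketch in Python, with hypothetical names:

```python
import hashlib

# Assumption: the table was pre-split into 16 regions, one per salt prefix.
NUM_BUCKETS = 16

def salted_rowkey(natural_key: str) -> bytes:
    """Prefix the natural key with a deterministic hash bucket.

    Sequential natural keys (timestamps, incrementing IDs) then land on
    different regions rather than all hitting the same one.
    """
    digest = hashlib.md5(natural_key.encode()).hexdigest()
    bucket = int(digest, 16) % NUM_BUCKETS
    return f"{bucket:02d}|{natural_key}".encode()
```

A point lookup recomputes the prefix from the natural key, so it still touches a single region; a full-table job becomes NUM_BUCKETS parallel scans, one per salt prefix, which maps nicely onto Spark partitions.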
Operations in Spark vs operations in HBase
Rule of thumb: use HBase scans only, and do everything else with Spark.
To avoid shuffling in your Spark operations, you can consider working on your partitions. For example: you can join two Spark RDDs built from HBase scans on their rowkey or rowkey prefix without any shuffling.
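To make the no-shuffle join concrete, here is a toy sketch (plain Python, not the Spark API) of the per-partition merge that becomes possible when two scans are sorted and aligned on the same rowkey ranges:

```python
def merge_join_partition(left, right):
    """Join two rowkey-sorted streams (e.g. two HBase scans over the same
    key range) in a single linear pass, with no hashing and no shuffle.

    `left` and `right` are lists of (rowkey, value) tuples sorted by
    rowkey; rowkeys are assumed unique within each stream, as HBase rows are.
    """
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk == rk:
            out.append((lk, (left[i][1], right[j][1])))
            i += 1
            j += 1
        elif lk < rk:
            i += 1  # left key has no match in right
        else:
            j += 1  # right key has no match in left
    return out
```

In Spark, the same effect comes from co-partitioned RDDs: when both sides are already partitioned by the same rowkey ranges, the join runs partition-by-partition on the same worker.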
HBase configuration tweaks
This discussion is a bit old (some configurations are not up to date) but still interesting : http://community.cloudera.com/t5/Storage-Random-Access-HDFS/How-to-optimise-Full-Table-Scan-FTS-in-HBase/td-p/97
And the link below also has some leads:
http://blog.asquareb.com/blog/2015/01/01/configuration-parameters-that-can-influence-hbase-performance/
You might find multiple sources (including the ones above) suggesting to change the scanner cache config, but this holds only with HBase < 1.x
We had this exact question at Splice Machine. We found the following based on our tests.
HBase had performance challenges when you attempted to perform remote scans from Spark/MapReduce.
The large scans hurt performance of ongoing small scans by forcing garbage collection.
There was not a clear resource management dividing line between OLTP and OLAP queries and resources.
We ended up writing a custom reader that reads the HFiles directly from HDFS and performs incremental deltas with the memstore during scans. With this, Spark could perform quick enough for most OLAP applications. We also separated the resource management so the OLAP resources were allocated via YARN (On Premise) or Mesos (Cloud) so they would not disturb normal OLTP apps.
I wish you luck on your endeavor. Splice Machine is open source and you are welcome to check out our code and approach.
I was wondering what would be the best way to replicate the data present in a Hadoop cluster H1 in data center DC1 to another Hadoop cluster H2 in data center DC2 (warm backup preferably). I know that Hadoop does data replication and the number of copies of the data created is decided by the replication factor set in hdfs-site.xml. I have a few questions related to this
Would it make sense to have the data nodes of one cluster be spread across both data centers so that the data nodes for H1 would be present in both DC1 and DC2. If this makes sense and is viable, then does it mean we do not need H2?
Would it make sense to have the namenodes and datanodes distributed across both data centers rather than having only the datanodes distributed across both data centers?
I have also heard of people using distcp and the many tools built on top of it. But distcp only does lazy (periodic batch) backups, and I would prefer warm backups over cold ones.
Some people suggest using Kafka for this but I am not sure how to go about using it.
Any help would be appreciated. Thanks.
It depends on what you are trying to protect against. If you want to protect against site failure, distcp seems to be the only option for cross datacenter replication. However, as you pointed out, distcp has limitations. You can use snapshots to protect against user mistakes or application corruptions because replication or multiple replicas will not protect against that. Other commercial tools are available for automating the backup process as well if you don't want to write code and maintain it.
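For reference, a cross-datacenter distcp run typically looks like the sketch below (cluster hostnames and paths are hypothetical; available flags vary by Hadoop version):

```shell
# Copy /data/events from the DC1 cluster to the DC2 cluster.
# -update copies only files that changed since the last run;
# -p preserves file attributes. Scheduling this (e.g. via cron or
# Oozie) approximates a warm standby, at the cost of staleness
# between runs.
hadoop distcp -update -p \
    hdfs://nn-dc1.example.com:8020/data/events \
    hdfs://nn-dc2.example.com:8020/data/events
```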
I have a lot of data saved into Cassandra on a daily basis, and I want to compare one datapoint with the last 5 versions of the data for different regions.
Let's say there is a price datapoint for a product, and there are 2000 products in a context/region (say, the US). I want to show a heat-map dashboard indicating when the price changes happened for different regions.
I am new to Hadoop, Hive and Pig. Which path would help me achieve my goal? Some details would be appreciated.
Thanks.
This sounds like a good use case for either traditional MapReduce or Spark. You have relatively infrequent updates, so a batch job running over the data and updating a table that in turn provides the data for the heatmap seems like the right way to go. Since the updates are infrequent, you probably don't need to worry about Spark Streaming; a traditional batch job run a few times a day is fine.
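As a toy illustration (names and data shapes are hypothetical, and in practice this logic would run inside Spark/MR tasks, not locally), the per-product computation such a batch job performs might look like:

```python
def price_changes(history):
    """Given one product's price history as (day, region, price) tuples,
    return the (day, region) pairs where the price differs from the
    previous observation in that region -- the cells to light up on a
    heat map."""
    last = {}       # region -> last seen price
    changes = []
    for day, region, price in sorted(history):
        if region in last and last[region] != price:
            changes.append((day, region))
        last[region] = price
    return changes
```

The batch job would run this per product key, then write the resulting (day, region) cells back to a serving table that the dashboard reads.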
Here's some info from datastax on reading from cassandra in a spark job: http://www.datastax.com/documentation/datastax_enterprise/4.5/datastax_enterprise/spark/sparkSCcontext.html
For either Spark or MapReduce, you are going to want to leverage the framework's ability to partition the work: if you are manually connecting to Cassandra and reading/writing the data like you would from a traditional RDBMS, you are probably doing something wrong. If you write your job correctly, the framework will be responsible for spinning up multiple readers (one for each node that contains the source data you are interested in), distributing the calculation tasks, and routing the results to the appropriate machine to store them.
Some more examples are here:
http://www.datastax.com/documentation/datastax_enterprise/4.5/datastax_enterprise/spark/sparkIntro.html
and
http://www.datastax.com/documentation/datastax_enterprise/4.5/datastax_enterprise/byoh/byohIntro.html
Either way, MapReduce is probably a little simpler, and Spark is probably a little more future proof.
I would like to avoid Impala nodes unnecessarily requesting data from other nodes over the network in cases when the ideal data locality or layout is known at table creation time. This would be helpful with 'non-additive' operations where all records from a partition are needed at the same place (node) anyway (for ex. percentiles).
Is it possible to tell Impala that all data in a partition should always be co-located on a single node for any HDFS replica?
In Impala SQL, I am not sure whether the "PARTITIONED BY" clause provides this feature. In my understanding, Impala chunks its partitions into separate files on HDFS, but HDFS does not guarantee co-location of related files or blocks by default (rather, it tries to achieve the opposite).
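For context, a minimal sketch of a partitioned Impala table (names hypothetical): each distinct partition value maps to an HDFS sub-directory, but the blocks of the files inside it are still placed by HDFS's normal policy, so partitioning alone does not co-locate a partition on one node.

```sql
-- Each distinct `country` value becomes a sub-directory such as
-- .../prices/country=US/, yet the file blocks within it are still
-- scattered across datanodes by the default placement policy.
CREATE TABLE prices (
  product_id BIGINT,
  price      DECIMAL(10,2)
)
PARTITIONED BY (country STRING)
STORED AS PARQUET;
```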
I found some information about Impala's impact on HDFS development, but it is not clear whether these features are already implemented or still planned:
http://www.slideshare.net/deview/aaron-myers-hdfs-impala
(slides 23-24)
Thank you in advance.
About the slides you mention ("Co-located block replicas") - it's about an HDFS feature (HDFS-2576) implemented in Hadoop 2.1. It provides a Java API to give hints to HDFS as to where the blocks should be placed.
It's not used in Impala as of 2014, but it definitely looks like groundwork for that, as it would give Impala the performance equivalent of specifying a distribution key in traditional MPP databases.
No, that completely defeats the purpose of having a distributed file system and MPP computing. It also creates a single point of failure and a bottleneck, especially if you're talking about a 250 GB table that is joined to itself. Those are exactly the kinds of problems Hadoop was designed to solve. Partitioning data creates sub-directories in the HDFS namespace (tracked by the namenode), and that data is then replicated across the datanodes in the cluster.