HBase Replication - Replicate data in 3 data centers - hadoop

I our application we are having data from 3 different countries and we are persisting data in HBase.
In each country, we will be keeping data of all the 3 countries.
To achieve this, is it possible that we create our Hadoop cluster using data centers in all these 3 countries and we keep data replication as 3. So due to rack-awareness feature, our data will get auto replicated in all the 3 countries?
Any pointers will be of great help.
Thanks

You can’t have HBASE cluster across countries. This won’t work because of latency, failover problems, network issues, etc.
A good option would be to have 3 clusters, one HBase table per country and sync the tables between clusters as proposed above
As far as I know only Google has successfully implemented a multi-country database providing both consistency and availability: Spanner. But the key elements of the solution are: a private physical network between the Data Centers and their own implementation of NTP which guarantee that all servers across the world have the same clock with just a few milliseconds precision.

This solution looks theoretically feasible but writes may become pretty slow as data needs to replicated to 3 nodes located in different geographies. It needs to be tried out and check whether the latency is within tolerable limit.
Another option could be, to have three different HBase clusters at three locations and design tables in such a way that tables from one HBase cluster could be copied to another one during night hours to keep the data in sync daily. In this case, an HBase cluster will have current data from it's own location but the data from other two cities will lag by a day.

Related

Strange replication in Cassandra

I have configured locally 3 nodes in on 'Test Cluster' of Cassandra. When I run them and create some keyspace or table also on all three nodes the keyspace or the table appears.
The problem I'm dealing with is, when I'm importing from CSV millions of rows in the table I already built the whole data suddenly appears on all three nodes. I have the same data replicated over the three nodes.
As I'm familiar with, the data I'm importing should be replicated/distributed over the nodes but partially. One partition on the first node, second on third, third on second node, fourth again on first node and ...
Am I right or I'm missing something big?
Also, my write speed locally is about 10k rows / second for the multi-node cluster. Isn't that a little bit too low?
I want to create discussion so I can maybe learn something more from your experience and see where I'm messing things.
Thank you!
The number of nodes that data is written to in your cluster is determined by the Replication Factor for that keyspace. If you have 3 nodes and the data is being written to all the nodes, then this setting must be set to 3. If you only want the data the be replicated to two nodes, you'd set this value to two.
Your write speed will be affected by the consistency level you are specifying on the write. If you have it set to ALL then you have to wait until all the nodes that are going to write the data have written the data (in your case all 3 nodes based on your replication factor). Dropping your consistency level on the write will probably net you faster write times. There is a balance between your replication factor, write consistency level, and read consistency level that you can research further.

Hive query having 15 tables join is expected to generate 1 Billion records, on 3 datanodes, 16GB RAM each Is this the right way to do?

My name is Vitthal.
The Hortonworks HDP 2.4 Cluster on Amazon is 3 Datanodes, Masters on different Instances.
7 Instances 16GB RAM each.
Total 1TB HDD Space
3 Data Nodes
Hadoop version 2.7
I have pulled data from Postgres into Hadoop Distributed Environment.
The Data is 15 Tables, Among them 4 tables are having 15 Million Records, rest are Masters.
I've pulled them in HDFS, compressed as ORC, and SnappyCodec. Created Hive External Tables with schema.
Now I'm firing a query which joins all the 15 tables and selects the columns which I need in a final flat table. The records expected are more than 1.5 Billion.
I have optimized Hive, Yarn, MapReduce Engine viz. Parallel Execution, Vectorization, Optimized Joins, Small Table Condition, Heap Size etc.
The query is running on Cluster / Hive / Tez since 20 hours & it's reached 90% where the last reducer is running. The 90% is reached long back like since 18 hours it's stuck at 90%.
Am I doing it the right way ?
If I understand, you have effectively copied tables in their raw form from your RDBMs into Hadoop in order to create a flattened view into one or more new tables. You're using Hive to do this. All of this sounds fine.
There are many possibilities why this is taking so long, but several come to mind.
First, YARN will allocate containers (one per CPU core, typically) that mappers and reducers will use to run the parallelized parts of the query. This should allow you to utilize all of the resources you have available.
I use Cloudera, but I assume Hortonworks has similar tools that let you see how many containers are in use, how many mappers and reducers are created by Hive, and so on. You should see that most or all of your available CPUs are in use constantly. Jobs should be finishing at some reasonable rate (perhaps every minute, or every 15 minutes). Depending on the query, Hive is often able to break it into distinct "stages" that are executed distinctly from others, then reassembled at the end.
If this is the case, everything may be fine, but your cluster may be under-resourced. But before you throw more AWS instances at the problem, consider the query itself.
First, Hive has several tools that are essential for optimizing performance, most importantly, partitioning. When you create tables, you should find some means of partitioning the resulting datasets into roughly equal subsets. A common method is to use dates, for example year+month+day (perhaps 20160417), or if you expect to have lots of historical data, maybe just year+month. This will also allow you to dramatically optimize queries that can be constrained by date. I seem to recall that Hive (or maybe it's YARN) will allocate partitions to different containers, so if you don't see all your workers working, then this would be a possible cause. Use the PARTITIONED BY clause in your CREATE TABLE statement.
The reason to choose something like date is that presumably your data is relatively evenly distributed over time (dates). We had chosen a customer_id as a partition key in an early implementation but as we grew, so did our customers. Hundreds of smaller customers would finish in a few minutes, then hundreds of mid-sized customers would finish in an hour, then a couple of our largest customers would take 10 or more hours to complete. We would see complete utilization of the cluster for that first hour, then only a couple containers in use for the last couple of customers. Not good.
This phenomenon is known as "data skew", so you want to carefully choose partitions to avoid skew. There are some options involving SKEW BY and CLUSTER BY that can help deal with getting evenly sized or smaller data files that you could consider.
Note that the raw import data should also be partitioned, as partitions act like indexes in a RDBMS, so are important for performance. In this case, choose partitions that use the keys that your larger query joins on. It is possible and common to have multiple partitions, so a date-based top-level partition, with a sub-partition on the join key could be helpful ... maybe ... depends on your data.
We have also found that it's very important to optimize the query itself. Hive has some hinting mechanisms that can direct it to run the query differently. While quite rudimentary compared to RDBMS, EXPLAIN is very helpful for understanding how Hive will break up the query and when it needs to scan a full dataset. It's hard to read the explain output, so get comfortable with the Hive documentation :-).
Lastly, if you can't make Hive do things in a sensible manner (if its optimizer still results in imbalanced stages) you can create intermediate tables with an additional Hive query that runs to create a partially transformed dataset before building the final one. This seems expensive since you're adding an additional write, and read of new tables, but in the case you describe it may be much faster overall. Also, it's sometimes useful to have intermediate tables just to test or sample data.
Writing Hive is a lot less like writing regular software -- you can get the Hive query done pretty quickly in most cases. Getting it to run fast has taken us 10 or 15 tries in a few cases. Good luck, and I hope this is helpful.

Replicating data between multiple Hadoop clusters residing in different data centers

I was wondering what would be the best way to replicate the data present in a Hadoop cluster H1 in data center DC1 to another Hadoop cluster H2 in data center DC2 (warm backup preferably). I know that Hadoop does data replication and the number of copies of the data created is decided by the replication factor set in hdfs-site.xml. I have a few questions related to this
Would it make sense to have the data nodes of one cluster be spread across both data centers so that the data nodes for H1 would be present in both DC1 and DC2. If this makes sense and is viable, then does it mean we do not need H2?
Would it make sense to have the namenodes and datanodes distributed across both data centers rather than having only the datanodes distributed across both data centers?
I have also heard people use distcp and many tools build on top of distcp. But distcp does lazy backups and would prefer warm backups over cold ones.
Some people suggest using Kafka for this but I am not sure how to go about using it.
Any help would be appreciated. Thanks.
It depends on what you are trying to protect against. If you want to protect against site failure, distcp seems to be the only option for cross datacenter replication. However, as you pointed out, distcp has limitations. You can use snapshots to protect against user mistakes or application corruptions because replication or multiple replicas will not protect against that. Other commercial tools are available for automating the backup process as well if you don't want to write code and maintain it.

Does HBase use the compute capacity of all nodes in the cluster for query execution?

We are having a setup of 1 master and 2 slave nodes. The data is setup in postgres and in hbase and its a similar dataset (same number of rows) - 65 million rows. Yet, we dont find a measurable increase in performance from HBase for the same query.
My first thought is - does HBase use the compute capacity of all nodes to fork the query out? Perhaps this is why the performance is not measurably better.
Any other reasons for why the performance between Postgres and HBase would be about the same? Any specific configuration items to look for?
EDIT : Something I found while researching this : http://www.flurry.com/2012/06/12/137492485#.VaQP_5QpBpg
This is kind of a yes and no answer. Depending on what you are doing for your 'query' and your region distribution, you may or may not be using all the nodes. For example, if you are running a scan across the table it will run against each region (assuming more then one) in sequence. However if you are using a multi-get for keys that are in different regions, this will run in parallel.
The real benefit is going to come as the number of regions increase and you start parallelizing requests (multiple clients). Regions will be distributed across region servers by the Master as regions are split.

How does HBase distribute new regions from MapReduce across the cluster?

My situation is the following: I have a 20-node Hadoop/HBase cluster with 3 ZooKeepers. I do a lot of processing of data from HBase tables to other HBase tables via MapReduce.
Now, if I create a new table, and tell any job to use that table as an output sink, all of its data goes onto the same regionserver. This wouldn't surprise me if there are only a few regions. A particular table I have has about 450 regions and now comes the problem: Most of those regions (about 80%) are on the same region server!
I was wondering now how HBase distributes the assignment of new regions throughout the cluster and whether this behaviour is normal/desired or a bug. I unfortunately don't know where to start looking in a bug in my code.
The reason I ask is that this makes jobs incredibly slow. Only when the jobs are completely finished the table gets balanced across the cluster but that does not explain this behaviour. Shouldn't HBase distibute new regions at the moment of the creation to different servers?
Thanks for you input!
I believe that this is a known issue. Currently HBase distributes regions across the cluster as a whole without regard for which table they belong to.
Consult the HBase book for background:
http://hbase.apache.org/book/regions.arch.html
It could be that you are on an older version of hbase:
http://comments.gmane.org/gmane.comp.java.hadoop.hbase.user/19155
See the following for a discussion of load balancing and region moving
http://comments.gmane.org/gmane.comp.java.hadoop.hbase.user/12549
By default, it just balance regions on each RS without take table into account.
You can set hbase.master.loadbalance.bytable to get it.

Resources