How to balance load of HBase while loading file? - hadoop

I am new to Apache-Hadoop. I have Apache-Hadoop cluster of 3 nodes. I am trying to load a file having 4.5 billion records,but its not getting distributed to all nodes. The behavior is kind of region hotspotting.
I have removed "hbase.hregion.max.filesize" parameter from hbase-site.xml config file.
I observed that if I use 4 node's cluster then it distributes data to 3 nodes and if I use 3 node's cluster then it distributes to 2 nodes.
I think, I am missing some configuration.

Generaly with HBase the main issue is to prepare rowkeys that are not monotonically.
If they are, only oneregion server is used at the time:
http://ikaisays.com/2011/01/25/app-engine-datastore-tip-monotonically-increasing-values-are-bad/
This is HBase Reference Guide about RowKey Design:
http://hbase.apache.org/book.html#rowkey.design
And one more really good article:
http://hortonworks.com/blog/apache-hbase-region-splitting-and-merging/
In our case predefinition of Region servers also improved the loading time:
create 'Some_table', { NAME => 'fam'}, {SPLITS=> ['a','d','f','j','m','o','r','t','z']}
Regards
Pawel

Related

Clustered NIFI, Only one node is working

I'm using NIFI in a clustered mode with two nodes, and I have noticed that only one node that do all the work.
Any idea why is that ? and how can I make nifi2 do some of the processing of the dataflow ?
It depends how data is coming in to your cluster. It is up to you as the data flow designer to create an approach that allows the data to be partitioned across your cluster for processing.
See this post for an overview of strategies to do this:
https://community.hortonworks.com/articles/16120/how-do-i-distribute-data-across-a-nifi-cluster.html

Are all the data with the same row key stored in the same node?

I have got a question regarding hbase databases. We access the data first by defining a row key, column family and in the last by column qualifier.
My question is will HBase store all column families with the same row key together in one node or not?
UPDATE: As an example, I want to multiply val1 and val2 in a map/reduce job. While val1 and val2 are stored in database like this: Row=00000 Column Family:M, m000001_1234567=val1, Row=00000 Column Family: R, r000001_1234567=val2. Can I make sure that I have access to both val1 and val2 in the same node running the map?
As you might be aware its actually the HFile that has the actual key value data stored and it would be distributed accross the datanodes. The zookeeper / HLog /Memestore help in locating the rowkey data and retrieve it.
The Key-value storage would be grouped and stored in each node , say keys [A-L] goes to one node and the rest [M-z] to another node , considering 2 node scenario.
Question 1: Will HBase store all column families with the same row key together in one node?
Yes, but there are a few special cases.
The recommened way to set up an HBase cluster is the collocated (or co-located) configuration: use the some machines for HDFS Data Nodes and HBase Region Servers (in contrast to dedicating the machines to specifically one of these roles, in which case all reads would be remote and performance would suffer). In such a setup, when a Region Server saves data to HDFS, the first replica of the data will always get saved to the local disk. However, the placement of any further replicas are not consistent - different parts may be placed on different nodes. This means that if a machine dies, no data will get lost, but the data of that region will not be found on any single machine any more, bit will be scattered all around the cluster instead. Even in this case, a single row will probably still to be stored on a single Data Node, but it won't be local to the new Region Server any more.
This is not the only way how data locality can get lost, previously even restarting HBase had this effect. A lot of older posts mention this, but this has actually been fixed since then in HBASE-2896.
Even if data locality gets lost, the next major compaction will restore it.
Sources and recommended reading:
How Scaling Really Works in Apache HBase
HBase and data locality
HBase File Locality in HDFS
Major compaction and data locality
Question 2: When reading an HBase table from a MapReduce job, does each mapper run on the node where the data it uses is stored?
My understanding is that apart from the special case mentioned above, the answer is yes, but I couldn't find this explicitly mentioned anywhere.
Sources and recommended reading:
Understanding Map Reduce on HTable
The MapReduce Integration section of Tutorial: HBase

Dataset for Hadoop Dev environment?

I am learning hadoop. I want to understand how dataset/database is setup for environments like Dev, Test and Pre-prod.
Of course in PROD environment we will be dealing with Terabytes of data, but having the same replica of tera bytes of data to other environments, i dont think it is possible.
For other environments how the datasets are replicated? only certain portions of data will be loaded and used in these non prod environments? if so how it is done?
How it is replicated, basically the concept of hdfs relevant to namenodes and datanodrs should give you some research. When you create a new file it goes to name node which updated the metadata and give you a blank block id once you write it finds the nearest datanodes base on the rack location. It replicates to the first datanodes, once its done replicating. Datanode first will replicate it to the next second then thirds and so fourth. It basically just re0licate on the very first node and the hdfs framework will handle the next preceedi g replication

Rethink DB Cross Cluster Replication

I have 3 different pool of clients in 3 different geographical locations.
I need configure Rethinkdb with 3 different clusters and replicate data between the (insert, update and deletes). I do not want to use shard, only replication.
I didn't found in documentation if this is possible.
I didn't found in documentation how to configure multi-cluster replication.
Any help is appreciated.
I think that multi cluster is just same a single clusters with nodes in different data center
First, you need to setup a cluster, follow this document: http://www.rethinkdb.com/docs/start-a-server/#a-rethinkdb-cluster-using-multiple-machines
Basically using below command to join a node into cluster:
rethinkdb --join IP_OF_FIRST_MACHINE:29015 --bind all
Once you have your cluster setup, the rest is easy. Go to your admin ui, select the table, in "Sharding and replication", click Reconfigure and enter how many replication you want, just keep shard at 1.
You can also read more about Sharding and Replication at http://rethinkdb.com/docs/sharding-and-replication/#sharding-and-replication-via-the-web-console

Running pig on a multi node Cassandra cluster

I am working on BI process that will read data from cassandra, create summaries using Map Reduce and write back to a different keyspace.
Starting with a single node, everything worked as i expected, but when moving to a multi-node, i am not sure I fully understand the topology and configuration.
I have a setup with 3 nodes. Each has a Cassandra node (version 1.1.9), data node and task tracker (version 0.20.2+923.421- CDH3U5) . The NameNode and job tracker are on a different server. At this point i am trying to run Pig script from the DataNode server.
The thing i am not sure of is the pig argument PIG_INITIAL_ADDRESS. I assumed the query would run on all Cassandra nodes, each task tracker would only query the local Cassandra node, and the reducer would handle any duplicates. Based on that assumption i thought the PIG_INITIAL_ADDRESS should be localhost. But when running the pig script it fails:
java.io.IOException: Unable to connect to server localhost:9160
My questions are- should the initial address be any one of the Cassandra nodes, and Splitting the map on the cluster is done from Cassandra keys partitions (will i get the distribution i need)?
IF I where to use java map reduce, will i still need to supply the initial address?
Is the current implementation assumes pig is running from a Cassandra node?
The PIG_INITIAL_ADDRESS is the address of one of the Cassandra nodes in your ring. In order to have the Hadoop job read data from or write data to Cassandra, it just needs to have some properties set. Those properties are also available to set in the job properties or in the default Hadoop configuration on the server that you're running the job from. Other than that, it's just like submitting a job to a job tracker.
For more information, I would look at the readme that's in the cassandra source download under examples/pig. There is a lot of explanation in there as well.

Resources