Are all the data with the same row key stored in the same node? - hadoop

I have a question regarding HBase. We access data first by row key, then by column family, and finally by column qualifier.
My question is will HBase store all column families with the same row key together in one node or not?
UPDATE: As an example, I want to multiply val1 and val2 in a map/reduce job, where val1 and val2 are stored in the database like this: Row=00000, Column Family: M, m000001_1234567=val1; Row=00000, Column Family: R, r000001_1234567=val2. Can I be sure that both val1 and val2 are accessible on the same node that runs the map?

As you might be aware, it is actually the HFile that holds the key-value data, and HFiles are distributed across the DataNodes. ZooKeeper, the HLog (WAL) and the MemStore help in locating a row key's data and retrieving it.
The key-value data is grouped by key range and stored per node: in a two-node scenario, say, keys [A-L] go to one node and the rest [M-Z] to the other.
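To make the example from the question concrete, here is a minimal sketch of fetching both column families of row 00000 with a single Get, using the HBase Java client (the table name "mytable" is a placeholder). Because all column families of a row live in the same region, one region server serves both values:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class SingleRowLookup {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("mytable"))) {
            // All column families of row "00000" belong to the same region,
            // so a single Get to one region server returns val1 and val2 together.
            Get get = new Get(Bytes.toBytes("00000"));
            get.addFamily(Bytes.toBytes("M"));
            get.addFamily(Bytes.toBytes("R"));
            Result result = table.get(get);
            byte[] val1 = result.getValue(Bytes.toBytes("M"), Bytes.toBytes("m000001_1234567"));
            byte[] val2 = result.getValue(Bytes.toBytes("R"), Bytes.toBytes("r000001_1234567"));
            System.out.println(Bytes.toString(val1) + " / " + Bytes.toString(val2));
        }
    }
}
```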

Question 1: Will HBase store all column families with the same row key together in one node?
Yes, but there are a few special cases.
The recommended way to set up an HBase cluster is the collocated (or co-located) configuration: use the same machines for HDFS DataNodes and HBase Region Servers (in contrast to dedicating machines to just one of these roles, in which case all reads would be remote and performance would suffer). In such a setup, when a Region Server saves data to HDFS, the first replica always gets saved to the local disk. However, the placement of the further replicas is not consistent - different parts may be placed on different nodes. This means that if a machine dies, no data is lost, but the data of that region will no longer be found on any single machine; it will be scattered all around the cluster instead. Even in this case, a single row will probably still be stored on a single DataNode, but it won't be local to the new Region Server any more.
This is not the only way data locality can be lost; previously, even restarting HBase had the same effect. A lot of older posts mention this, but it has since been fixed in HBASE-2896.
Even if data locality gets lost, the next major compaction will restore it.
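If you don't want to wait for the next scheduled major compaction, you can request one explicitly. A minimal sketch using the HBase Admin API ("mytable" is a placeholder):

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class CompactTable {
    public static void main(String[] args) throws Exception {
        try (Connection connection = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = connection.getAdmin()) {
            // Asynchronously requests a major compaction; each region server rewrites
            // its HFiles locally, which restores HDFS locality for its regions.
            admin.majorCompact(TableName.valueOf("mytable"));
        }
    }
}
```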
Sources and recommended reading:
How Scaling Really Works in Apache HBase
HBase and data locality
HBase File Locality in HDFS
Major compaction and data locality
Question 2: When reading an HBase table from a MapReduce job, does each mapper run on the node where the data it uses is stored?
My understanding is that apart from the special case mentioned above, the answer is yes, but I couldn't find this explicitly mentioned anywhere.
Sources and recommended reading:
Understanding Map Reduce on HTable
The MapReduce Integration section of Tutorial: HBase
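To tie this back to the val1 × val2 example from the question, here is a hedged sketch of a map-only job built with TableMapReduceUtil. It creates one input split per region, so the framework prefers to schedule each mapper on the region server hosting that region. The table name "mytable" is a placeholder, the families and qualifiers are taken from the question, and output handling is deliberately simplified:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class MultiplyJob {

    static class MultiplyMapper extends TableMapper<Text, DoubleWritable> {
        @Override
        protected void map(ImmutableBytesWritable row, Result value, Context context)
                throws IOException, InterruptedException {
            // Both column families arrive in the same Result because they belong to one row.
            byte[] v1 = value.getValue(Bytes.toBytes("M"), Bytes.toBytes("m000001_1234567"));
            byte[] v2 = value.getValue(Bytes.toBytes("R"), Bytes.toBytes("r000001_1234567"));
            if (v1 != null && v2 != null) {
                double product = Double.parseDouble(Bytes.toString(v1))
                        * Double.parseDouble(Bytes.toString(v2));
                context.write(new Text(Bytes.toString(row.get())), new DoubleWritable(product));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "multiply-val1-val2");
        job.setJarByClass(MultiplyJob.class);

        Scan scan = new Scan();
        scan.addFamily(Bytes.toBytes("M"));
        scan.addFamily(Bytes.toBytes("R"));

        // One input split per region; the scheduler prefers the region server
        // hosting that region when placing each map task.
        TableMapReduceUtil.initTableMapperJob("mytable", scan,
                MultiplyMapper.class, Text.class, DoubleWritable.class, job);

        job.setNumReduceTasks(0);                         // map-only job
        job.setOutputFormatClass(NullOutputFormat.class); // discard output in this sketch
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```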

Related

Hive or Hbase when we need to pull more number of columns?

I have a data structure in Hadoop with 100 columns and a few hundred rows. Most of the time I need to query about 65% of the columns. In this case, which is better to use: HBase or Hive? Please advise.
The number of columns you are accessing is NOT, by itself, the criterion for deciding between HBase and Hive.
HIVE (SQL) :
Use Hive when you have warehousing needs, you are comfortable with SQL, and you don't want to write MapReduce jobs. One important point, though: each Hive query is converted under the hood into a corresponding MapReduce job that runs on your cluster and gives you the result. Hive does the trick for you. But not every problem can be solved with HiveQL; sometimes, if you need really fine-grained and complex processing, you may have to fall back to hand-written MapReduce.
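For illustration, a minimal sketch of running a HiveQL query from Java over JDBC; it assumes a HiveServer2 endpoint and the hive-jdbc driver on the classpath, and the host, credentials and events table are placeholders:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC URL; host, database and table are placeholders.
        String url = "jdbc:hive2://hiveserver-host:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "user", "");
             Statement stmt = conn.createStatement();
             // Hive compiles this query into one or more jobs (MapReduce by default)
             // that run on the cluster and stream the result back.
             ResultSet rs = stmt.executeQuery(
                     "SELECT category, COUNT(*) FROM events GROUP BY category")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}
```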
Hbase (NoSQL database):
You can use HBase to serve that purpose. If you have data that you want to access in real time, you can store it in HBase.
An HBase get on a row key is powerful when you know your access pattern.
HBase follows the CP side of the CAP theorem:
Consistency:
Every node in the system contains the same data (e.g. replicas are never out of date)
Availability:
Every request to a non-failing node in the system returns a response
Partition Tolerance:
System properties (consistency and/or availability) hold even when the system is partitioned (communication lost) and data is lost (node lost)
also have a look at this
It's very difficult to answer the question in one line.
HBase is a NoSQL database: you need to store your data denormalized, because HBase is very bad at joining tables.
Hive: you can store data in a similar (normalized) format in Hive, but you will only see the benefits when doing batch processing.

Dataset for Hadoop Dev environment?

I am learning Hadoop. I want to understand how datasets/databases are set up for environments like Dev, Test and Pre-prod.
Of course, in the PROD environment we will be dealing with terabytes of data, but keeping an identical replica of those terabytes in the other environments doesn't seem feasible to me.
For the other environments, how are the datasets replicated? Are only certain portions of the data loaded and used in these non-prod environments? If so, how is it done?
As for how it is replicated: the HDFS concepts around NameNodes and DataNodes are worth researching. When you create a new file, the request goes to the NameNode, which updates the metadata and hands out block IDs; when you write, the client picks the nearest DataNode based on rack location and writes the block there. The first DataNode then replicates the block to the second, the second to the third, and so on. In other words, the client only writes to the very first node, and the HDFS framework handles the subsequent pipeline replication.
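As a small illustration of that write path, here is a sketch using the HDFS Java API; the path is a placeholder, and lowering dfs.replication for a dev/test cluster is just one common way to keep a sampled dataset small:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Dev/test clusters often use a smaller replication factor than the
        // production default of 3, which also helps keep sampled data small.
        conf.set("dfs.replication", "2");

        try (FileSystem fs = FileSystem.get(conf);
             FSDataOutputStream out = fs.create(new Path("/data/sample/part-00000"))) {
            // The client streams the block to the first DataNode; that DataNode
            // forwards it to the next one in the replication pipeline, and so on.
            out.writeBytes("sample,record,1\n");
        }
    }
}
```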

How does HBase enable Random Access to HDFS?

Given that HBase is a database with its files stored in HDFS, how does it enable random access to a singular piece of data within HDFS? By which method is this accomplished?
From the Apache HBase Reference Guide:
HBase internally puts your data in indexed "StoreFiles" that exist on HDFS for high-speed lookups. See the Chapter 5, Data Model and the rest of this chapter for more information on how HBase achieves its goals.
Scanning both chapters didn't reveal a high-level answer for this question.
So how does HBase enable random access to files stored in HDFS?
HBase stores data in HFiles that are indexed (sorted) by their key. Given a random key, the client can determine which region server to ask for the row from. The region server can determine which region to retrieve the row from, and then do a binary search through the region to access the correct row. This is accomplished by having sufficient statistics to know the number of blocks, block size, start key, and end key.
For example: a table may contain 10 TB of data. But, the table is broken up into regions of size 4GB. Each region has a start/end key. The client can get the list of regions for a table and determine which region has the key it is looking for. Regions are broken up into blocks, so that the region server can do a binary search through its blocks. Blocks are essentially long lists of key, attribute, value, version. If you know what the starting key is for each block, you can determine one file to access, and what the byte-offset (block) is to start reading to see where you are in the binary search.
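A small sketch of the client-side part of that lookup, using the HBase 2.x RegionLocator API (table name and row key are placeholders): it resolves which region, and therefore which region server, owns a given key; the region server then uses the block index inside its HFiles to seek to the row.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HRegionLocation;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.util.Bytes;

public class WhichRegion {
    public static void main(String[] args) throws Exception {
        try (Connection connection = ConnectionFactory.createConnection(HBaseConfiguration.create());
             RegionLocator locator = connection.getRegionLocator(TableName.valueOf("mytable"))) {
            // The client looks up in hbase:meta which region (and region server)
            // owns this key; the region server then uses the block index in its
            // HFiles to seek directly to the row.
            HRegionLocation location = locator.getRegionLocation(Bytes.toBytes("some-row-key"));
            System.out.println("Region: " + location.getRegion().getRegionNameAsString());
            System.out.println("Hosted on: " + location.getHostname() + ":" + location.getPort());
        }
    }
}
```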
HBase accesses HDFS files by using HFiles. You can check this URL for the details: http://hbase.apache.org/book/hfilev2.html

Is there any way to control in Hadoop MapReduce framework on which node reducer will be started?

In short, I need a way to give the Hadoop MapReduce API a hint about which host I'd like a certain reducer to run on, based on its partition. Is there any way?
Somewhat longer story:
I have a few mapper tasks which generate (or import from another source) records for a certain HBase table. The emitted records have ImmutableBytesWritable keys. The number of reducers for this job exactly matches the number of table regions, and a custom partitioner is used so that every region's records go to the appropriate reducer.
The reducers are intended to generate HFile images, one image per region, so that a bulk load can later be used on them. The only serious problem is that I'd like the reducers to at least 'try to run' on the same hosts where the corresponding region servers are running, to get a good probability of HDFS locality of the generated HFiles for the corresponding HBase region servers.
Any idea how to get this behavior?
An alternative would be a way to 'request' that an HDFS file 'become local'. With that, I could start another MR job with mappers bound to region servers (through splits) and request that the corresponding HFile become local.
There is no out-of-the-box way to do this yet, short of writing a custom scheduler, which would be overkill.
An upstream ticket does track this feature request at https://issues.apache.org/jira/browse/MAPREDUCE-199.
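For reference, the standard bulk-load setup looks roughly like the sketch below (table name and output path are placeholders). HFileOutputFormat2.configureIncrementalLoad sets up a TotalOrderPartitioner over the current region boundaries, so one reducer produces the HFiles for exactly one region; that covers the partitioning part of the question but, as noted above, not reducer placement:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoadSetup {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "hfile-bulk-load");
        job.setJarByClass(BulkLoadSetup.class);

        // The mapper (not shown) emits ImmutableBytesWritable row keys and Put values.
        job.setMapOutputKeyClass(ImmutableBytesWritable.class);
        job.setMapOutputValueClass(Put.class);

        TableName tableName = TableName.valueOf("mytable");
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(tableName);
             RegionLocator locator = connection.getRegionLocator(tableName)) {
            // Configures a TotalOrderPartitioner over the current region start keys,
            // so each reducer writes the HFiles for exactly one region. It does not
            // control which host each reducer runs on.
            HFileOutputFormat2.configureIncrementalLoad(job, table, locator);
        }

        FileOutputFormat.setOutputPath(job, new Path("/tmp/hfiles"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```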

write data in hbase

I have a problem while writing data to HBase. I have 4 region servers. When I write data using random keys, the data gets written to various regions, but they all end up on one region server. One server is busy while three servers are free. How do I spread the writes evenly across all region servers?
HBase partitions its tables across region servers. See:
How HBase partitions table across regionservers?
http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html
I am not sure how random or how far apart your keys need to be in order to write to different partitions.
See the discussions on hbase.hregion.max.filesize, which suggest that a table is split into new regions once a region reaches the configured size.
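One common remedy is to pre-split the table so that writes with well-distributed keys hit several regions (and region servers) from the start, instead of one initial region that only splits after it grows. A hedged sketch using the HBase 2.x admin API, with placeholder table name, column family and split points:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitTable {
    public static void main(String[] args) throws Exception {
        try (Connection connection = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = connection.getAdmin()) {
            // Three split points create four regions, one per region server in a
            // four-node cluster, assuming row keys are spread over this key space.
            byte[][] splitKeys = {
                    Bytes.toBytes("4"), Bytes.toBytes("8"), Bytes.toBytes("c")
            };
            admin.createTable(
                    TableDescriptorBuilder.newBuilder(TableName.valueOf("mytable"))
                            .setColumnFamily(ColumnFamilyDescriptorBuilder.of("d"))
                            .build(),
                    splitKeys);
        }
    }
}
```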
