I am learning Hadoop. I want to understand how datasets/databases are set up for environments like Dev, Test and Pre-prod.
Of course, in the PROD environment we will be dealing with terabytes of data, but I don't think it is possible to keep a full replica of those terabytes in the other environments.
How are the datasets replicated for the other environments? Is only a certain portion of the data loaded and used in these non-prod environments? If so, how is it done?

As for how it is replicated, the HDFS concepts of the NameNode and DataNodes are a good starting point for your research. When you create a new file, the request goes to the NameNode, which updates its metadata and gives you a new block ID. When you write, it picks the nearest DataNodes based on rack location. The client writes the block to the first DataNode only. The first DataNode forwards the data to the second, the second to the third, and so forth. In other words, you only write to the very first node, and the HDFS framework handles the subsequent replication.
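The write path described above can be sketched as a chain of hops. This is only an illustrative model, not HDFS code: the function name pipeline_hops and the DataNode names dn1..dn3 are made up for the example, and a real pipeline streams data in small packets rather than as a single unit.

```python
def pipeline_hops(pipeline):
    """Return the (sender, receiver) hops for one block written through `pipeline`."""
    hops = []
    sender = "client"  # the client only ever sends to the first DataNode
    for datanode in pipeline:
        hops.append((sender, datanode))
        sender = datanode  # each DataNode forwards to the next one in line
    return hops


if __name__ == "__main__":
    # Hypothetical pipeline, as the NameNode might return it for one block
    for sender, receiver in pipeline_hops(["dn1", "dn2", "dn3"]):
        print(f"{sender} -> {receiver}")
```

The point of the sketch is that the client appears as a sender exactly once, no matter what the replication factor is.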


Understanding how HBase uses HDFS

I'm trying to understand how HBase uses HDFS.
Here is what I understand (please correct me if I'm wrong):
I know that HBase uses HDFS to store data, that the data is split into regions, and that each region server may serve many regions. So I guess that one region server (exclusively) may communicate with many DataNodes to get and put data. If that is correct, then if that region server fails, the data stored on those DataNodes will not be accessible anymore.
In general, a RegionServer runs on a DataNode.
Due to how HDFS works, the RegionServer will perform its reads and writes against the local DataNode when possible, and HDFS will then ensure that the data is replicated onto two other DataNodes. So at all times, the data written by that RegionServer is stored on 3 nodes in HDFS.
While a RegionServer is serving a region, only it will read / write the data for that region, but if the RegionServer process crashes, the HBase master will select another RegionServer to serve that region. The data will be unavailable for a few minutes, but HBase will recover quickly.
If the entire host fails, the scenario is the same, because HDFS ensured the data was written onto two other nodes: the master will select a new RegionServer to open the failed region, and the data will not be lost.

Is all the data with the same row key stored on the same node?

I have a question regarding HBase databases. We access data by specifying first a row key, then a column family, and finally a column qualifier.
My question is: will HBase store all column families with the same row key together on one node or not?
UPDATE: As an example, I want to multiply val1 and val2 in a map/reduce job, where val1 and val2 are stored in the database like this: Row=00000 Column Family:M, m000001_1234567=val1, Row=00000 Column Family: R, r000001_1234567=val2. Can I make sure that I have access to both val1 and val2 on the same node that runs the map?
As you might be aware, it is actually the HFile that stores the key-value data, and HFiles are distributed across the DataNodes. ZooKeeper / HLog / MemStore help in locating the row key data and retrieving it.
The key-value storage is grouped by key range and stored per node: in a 2-node scenario, say, keys [A-L] go to one node and the rest [M-z] go to the other.
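To make the key-range idea concrete, here is a minimal sketch of looking up the node for a row key by binary search over region start keys. The split point "M", the node names and the helper names are assumptions taken from the 2-node example above, not real HBase API. Note that the lookup uses the row key alone, so every column family of a given row resolves to the same region.

```python
import bisect

# Hypothetical layout: region i holds row keys in [START_KEYS[i], START_KEYS[i + 1])
START_KEYS = ["", "M"]          # "" means "from the very first key"
REGION_NODES = ["node1", "node2"]


def region_index(row_key, start_keys=START_KEYS):
    """Index of the region whose key range contains row_key."""
    # bisect_right counts the start keys that are <= row_key;
    # the last of those is the start key of the owning region.
    return bisect.bisect_right(start_keys, row_key) - 1


def node_for(row_key):
    """Node serving the region that contains row_key (all column families)."""
    return REGION_NODES[region_index(row_key)]


if __name__ == "__main__":
    for key in ["Apple", "L", "M", "zebra"]:
        print(key, "->", node_for(key))
```

Keys are compared as plain strings here, which mirrors the byte-wise ordering of row keys only for simple ASCII keys.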
Question 1: Will HBase store all column families with the same row key together in one node?
Yes, but there are a few special cases.
The recommended way to set up an HBase cluster is the collocated (or co-located) configuration: use the same machines for HDFS DataNodes and HBase Region Servers (in contrast to dedicating each machine to just one of these roles, in which case all reads would be remote and performance would suffer). In such a setup, when a Region Server saves data to HDFS, the first replica of the data always gets saved to the local disk. However, the placement of any further replicas is not consistent: different parts may be placed on different nodes. This means that if a machine dies, no data gets lost, but the data of that region can no longer be found on any single machine; it is scattered all around the cluster instead. Even in this case, a single row will probably still be stored on a single DataNode, but it won't be local to the new Region Server any more.
This is not the only way data locality can get lost. Previously, even restarting HBase had this effect. A lot of older posts mention this, but it has since been fixed in HBASE-2896.
Even if data locality gets lost, the next major compaction will restore it.

Backing up Hadoop in order to install a new cluster: best practice

I am building a new Hadoop cluster (expanding number of nodes and extending capacity of current nodes) and need to back up all of the existing data. Right now I am just tar-ing everything and sending it to another server.
Is there a smarter way of doing this which will allow me to easily deploy once the new cluster is set up?
Edit: I should also point out that I don't store any data on the cluster. I bring data to the cluster, process it, and then send the processed data back to the original server. Any temporary data on the cluster is then deleted.
Use DistCp to transfer the HDFS data to another cluster, or to cloud storage, in order to keep a copy of the data.
If you want to schedule the backup process, you can use Oozie's DistCp action for it.
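As a rough sketch of what such a backup job could invoke, the helper below assembles a DistCp command line. The helper name distcp_command, the NameNode host names and the paths are placeholders made up for the example; hadoop distcp and its -update option (copy only files that are missing or differ at the destination) are the only real parts.

```python
import subprocess


def distcp_command(src, dst, update=True):
    """Build the argument list for a `hadoop distcp` run from src to dst."""
    cmd = ["hadoop", "distcp"]
    if update:
        cmd.append("-update")  # skip files that already match at the destination
    cmd += [src, dst]
    return cmd


def run_backup(src, dst):
    """Run DistCp and raise if it exits with a non-zero status."""
    subprocess.run(distcp_command(src, dst), check=True)


if __name__ == "__main__":
    # Placeholder URIs; substitute your own NameNode addresses and paths
    print(" ".join(distcp_command("hdfs://old-nn:8020/data", "hdfs://new-nn:8020/data")))
```

Calling run_backup requires the hadoop client to be installed and on the PATH of the machine that launches the copy.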

Read operations on Hadoop and consistency level

I am setting up distributed HBase on HDFS and I am trying to understand the behavior of the system during read operations.
This is how I understand the high-level steps of a read operation.
The client connects to the NameNode to get the list of DataNodes that contain replicas of the rows it is interested in.
From here, the client caches the list of DataNodes and starts talking to the chosen DataNode directly until it needs other rows from another DataNode, in which case it asks the NameNode again.
My questions are as follows:
Who chooses the best replica DataNode to contact? How does the client choose the "closest" replica? Does the NameNode return the list of relevant DataNodes in sorted order?
What are the scenarios (if any) in which the client switches to another DataNode that has the requested rows? For example, if one of the DataNodes becomes overloaded/slow, can the client library figure out that it should contact another DataNode from the list returned by the NameNode?
Is there a possibility of getting stale data from one of the replicas? For example, a client acquires the list of DataNodes and starts reading from one of them. In the meantime, a write request comes from another client to the NameNode. We have dfs.replication == 3 and dfs.replication.min = 2. The NameNode considers the write successful after flushing to disk on 2 out of 3 nodes, while the first client is reading from the 3rd node and doesn't know (yet) that another write has been committed?
Does Hadoop maintain the same reading policy when supporting HBase?
Who chooses the best replica DataNode to contact? How does the client choose the "closest" replica? Does the NameNode return the list of relevant DataNodes in sorted order?
The client is the one that decides who best to contact. It picks them in this order:
The file is on the same machine. In this case (if properly configured) it will short circuit the DataNode and go directly to the file as an optimization.
The file is in the same rack (if rack awareness is configured).
The file is somewhere else.
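The preference order above can be sketched as a sort by a simple distance score. This is an illustration of the idea rather than the HDFS client's actual code: the Node type, the host and rack names, and the 0/1/2 scores are assumptions made for the example.

```python
from typing import NamedTuple


class Node(NamedTuple):
    host: str
    rack: str


def distance(client, replica):
    """Smaller is closer: 0 = same machine, 1 = same rack, 2 = anywhere else."""
    if replica.host == client.host:
        return 0
    if replica.rack == client.rack:
        return 1
    return 2


def order_replicas(client, replicas):
    """Return the replicas sorted from most to least preferred for this client."""
    # sorted() is stable, so replicas at equal distance keep their original order
    return sorted(replicas, key=lambda replica: distance(client, replica))


if __name__ == "__main__":
    client = Node("dn2", "rackA")
    replicas = [Node("dn5", "rackB"), Node("dn1", "rackA"), Node("dn2", "rackA")]
    print([replica.host for replica in order_replicas(client, replicas)])
```

Without rack awareness configured, every node would report the same rack, and the ordering would collapse to "local first, then everything else".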
What are the scenarios (if any) in which the client switches to another DataNode that has the requested rows? For example, if one of the DataNodes becomes overloaded/slow, can the client library figure out that it should contact another DataNode from the list returned by the NameNode?
It's not that smart. It'll switch if it thinks the DataNode is down (meaning it times out), but not in any other situation that I know of. I believe it will just go to the next one in the list, but it might contact the NameNode again; I'm not 100% sure.
Is there a possibility of getting stale data from one of the replicas? For example, a client acquires the list of DataNodes and starts reading from one of them. In the meantime, a write request comes from another client to the NameNode. We have dfs.replication == 3 and dfs.replication.min = 2. The NameNode considers the write successful after flushing to disk on 2 out of 3 nodes, while the first client is reading from the 3rd node and doesn't know (yet) that another write has been committed?
Stale data is possible, but not in the situation you describe. Files are write-once and immutable (other than append, but don't append if you don't have to). The NameNode won't tell you the file is there until it is completely written. In the case of append, shame on you then. The behavior of reading from an actively-being-appended-to file on a local filesystem is unpredictable as well. You should expect the same in HDFS.
One way stale data could happen is if you retrieve your list of block locations and the NameNode decides to migrate all three of them at once before you access it. I don't know what would happen there. In the 5 years of using Hadoop, I've never had this be a problem. Even when running the balancer at the same time as doing stuff.
Does Hadoop maintain the same reading policy when supporting HBase?
HBase is not treated specially by HDFS. There is some talk about using a custom block placement strategy with HBase to get better data locality, but that's in the weeds.

How does HDFS choose a DataNode to store a file?

As the title indicates, when a client requests to write a file to HDFS, how does HDFS or the NameNode choose which DataNode stores the file?
Does HDFS try to store all the blocks of the file on the same node, or on some node in the same rack if the file is too big?
Does HDFS provide any API for applications to store a file on a DataNode of their choosing?
How does HDFS or the NameNode choose which DataNode stores the file?
HDFS has a BlockPlacementPolicyDefault; check the API documentation for more details. It should be possible to extend BlockPlacementPolicy for custom behavior.
Does HDFS provide any API for applications to store a file on a DataNode of their choosing?
The placement behavior should not be specific to a particular DataNode. That's what makes HDFS resilient to failure and also scalable.
The code for choosing DataNodes is in the function ReplicationTargetChooser.chooseTarget().
The comment says:
The replica placement strategy is that if the writer is on a
datanode, the 1st replica is placed on the local machine, otherwise
a random datanode. The 2nd replica is placed on a datanode that is on
a different rack. The 3rd replica is placed on a datanode which is on
the same rack as the first replica.
It doesn't provide any API for applications to store the file on the DataNode they want.
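The strategy in the quoted comment can be modelled in a few lines. This is a simplified sketch that follows the comment as quoted and is not the real ReplicationTargetChooser: the function name choose_targets, the host-to-rack dict and the node names are made up, and it assumes at least two racks and at least two DataNodes on the first replica's rack.

```python
import random


def choose_targets(writer, nodes, rng=random):
    """Pick three DataNodes for one block.

    writer: host name of the writing client
    nodes:  dict mapping DataNode host name -> rack name
    rng:    object with a choice() method (the random module by default)
    """
    hosts = sorted(nodes)
    # 1st replica: the writer's own machine if it is a DataNode, else a random one
    first = writer if writer in nodes else rng.choice(hosts)
    # 2nd replica: a DataNode on a different rack from the first replica
    second = rng.choice([h for h in hosts if nodes[h] != nodes[first]])
    # 3rd replica: per the quoted comment, another DataNode on the first replica's rack
    third = rng.choice([h for h in hosts if nodes[h] == nodes[first] and h != first])
    return [first, second, third]


if __name__ == "__main__":
    cluster = {"dn1": "rackA", "dn2": "rackA", "dn3": "rackB", "dn4": "rackB"}
    print(choose_targets("dn1", cluster))
```

The real policy also weighs things like free space and node load, which this sketch ignores.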
If someone prefers charts, here is a picture (source):
Now, with the Hadoop-385 patch, we can choose the block placement policy, so as to place all blocks of a file on the same node (and similarly for the replica nodes). Read this blog about this topic - look at the comments section.
When the NameNode instructs DataNodes to store data, the first replica is stored on the local machine and the other two replicas are placed on another rack.
If any replica fails, the data is restored from another replica. The chance of every replica failing is about the same as a ceiling fan falling on your head while you sleep :p, i.e. it is very unlikely.
