Poor write performance by HBase client

I'm using the HBase client in my application server (which is also the web server) with an HBase
cluster of 6 nodes running CDH3u4 (HBase 0.90). The HBase/Hadoop services
running on the cluster are:
NODE NAME -- ROLES
Node1 -- NameNode
Node2 -- RegionServer, SecondaryNameNode, DataNode, Master
Node3 -- RegionServer, DataNode, Zookeeper
Node4 -- RegionServer, DataNode, Zookeeper
Node5 -- RegionServer, DataNode, Zookeeper
Node6 -- Cloudera Manager, RegionServer, DataNode
I'm using the following optimizations for my HBase client:
auto-flush = false
clearBufferOnFail = true
HTable writeBufferSize = 12 MB
Put setWriteToWAL = false (I'm fine with losing a little data).
To keep reads closely consistent with writes, I'm calling
flushCommits on all the buffered tables every 2 seconds.
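A minimal sketch of that client setup against the 0.90-era API (table, family, and qualifier names here are hypothetical stand-ins):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class BufferedHBaseWrite {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "table_a");     // hypothetical table name

        table.setAutoFlush(false, true);                // auto-flush = false, clearBufferOnFail = true
        table.setWriteBufferSize(12 * 1024 * 1024);     // 12 MB client-side write buffer

        Put put = new Put(Bytes.toBytes("row-1"));
        put.setWriteToWAL(false);                       // skip the WAL: faster, but data can be lost
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("value"));
        table.put(put);                                 // lands in the client buffer, not on the cluster yet

        table.flushCommits();                           // what the 2-second timer would call
        table.close();
    }
}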
In my application, I place each HBase write call on a queue (asynchronously) and
drain the queue with 20 consumer threads. When hitting the web server locally
using curl, I see an HBase TPS of 2500 once curl completes; but under
load testing, with requests arriving at the high rate of 1200 hits per second
across 3 application servers, the consumer (drain) threads responsible for
writing to HBase do not write data at a rate comparable to the input rate. I
see no more than 600 TPS when the request rate is 1200 hits per second.
Can anyone suggest what we can do to improve performance? I've tried
reducing the thread count to 7 on each of the 3 app servers, but still no effect.
An expert opinion would be helpful. As this is a production cluster, we are not
considering swapping the roles unless someone points out a severe performance benefit.
[EDIT]:
Just to highlight/clarify our HBase write pattern: the first transaction checks for the row in Table-A (using HTable.exists). It fails to find the row the first time, and so writes to three tables. The subsequent 4 transactions make the existence check on Table-A and, since they find the row, write to only 1 table.
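In code, the pattern looks roughly like this (table names and column contents are hypothetical); note that HTable.exists is a synchronous read RPC on every transaction, so it is not helped by the client-side write buffer:

import java.io.IOException;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class WritePattern {
    static void handleTransaction(HTable tableA, HTable tableB, HTable tableC,
                                  byte[] rowKey) throws IOException {
        Put put = new Put(rowKey);
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("v"));
        if (!tableA.exists(new Get(rowKey))) {  // blocking read RPC
            tableA.put(put);                    // first transaction for this key:
            tableB.put(put);                    // write to all three tables
            tableC.put(put);
        } else {
            tableA.put(put);                    // subsequent transactions: one table only
        }
    }
}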

So that's a pretty ancient version of HBase. As of Aug 18, 2013, I would recommend upgrading to something based on 0.94.x.
Other than that, it's really hard to tell you anything for sure. There are lots of tuning knobs. You should:
Make sure that HDFS has enough xceivers.
Make sure that HBase has enough heap space.
Make sure there is no swapping
Make sure there are enough handlers.
Make sure that you have compression turned on. [1]
Check disk io
Make sure that your row keys, column family names, column qualifiers, and values are as small as possible
Make sure that your writes are well distributed across your key space
Make sure your regions are (pre-)split (a sketch follows the references below)
If you're on a recent version then you might want to look at encoding [2]
After all of those things are taken care of, you can start looking at logs and jstacks.
[1] https://hbase.apache.org/book/compression.html
[2] https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/io/encoding/FastDiffDeltaEncoder.html
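For the pre-splitting point above, a minimal sketch against the 0.90-era admin API (table name, column family, and split keys are hypothetical; pick split keys that match your real key distribution):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplit {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        HTableDescriptor desc = new HTableDescriptor("table_a");
        desc.addFamily(new HColumnDescriptor("cf"));
        byte[][] splits = {                       // 4 split keys => 5 initial regions
            Bytes.toBytes("2"), Bytes.toBytes("4"),
            Bytes.toBytes("6"), Bytes.toBytes("8")
        };
        admin.createTable(desc, splits);
    }
}

This way the initial write load is spread across all region servers instead of hammering the single region a new table starts with.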

Related

Dataset for Hadoop Dev environment?

I am learning Hadoop. I want to understand how datasets/databases are set up for environments like Dev, Test, and Pre-prod.
Of course, in the PROD environment we will be dealing with terabytes of data, but keeping the same replica of those terabytes in the other environments is, I think, not feasible.
For the other environments, how are the datasets replicated? Are only certain portions of the data loaded and used in these non-prod environments? If so, how is that done?
As for how it is replicated: the HDFS concepts around the NameNode and DataNodes are what you should research. When you create a new file, the request goes to the NameNode, which updates its metadata and gives you a block ID; when you write, the client finds the nearest DataNode based on rack location and writes the block there. Once the first DataNode has the data, it replicates it to a second DataNode, the second to a third, and so forth. The client only writes to the very first node; the HDFS framework handles the subsequent replication.
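As a side note for smaller non-prod clusters: the replication factor can be set per file, so a dev copy of a dataset does not have to cost 3x its size. A minimal sketch (paths and values are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DevLoad {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // create(path, overwrite, bufferSize, replication, blockSize)
        FSDataOutputStream out = fs.create(new Path("/dev/sample.txt"),
                true, 4096, (short) 2, 64L * 1024 * 1024);
        out.writeBytes("sample record\n");
        out.close();
        // or lower the replication of an existing file after the fact:
        fs.setReplication(new Path("/dev/sample.txt"), (short) 2);
    }
}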

Storing a file on Hadoop when not all of its replicas can be stored on the cluster

Can somebody let me know what will happen if my Hadoop cluster (replication factor = 3) has only 15 GB of space left and I try to save a file which is 6 GB in size?
hdfs dfs -put 6gbfile.txt /some/path/on/hadoop
Will the put operation fail with an error (probably "cluster full"), or will it save two replicas of the 6 GB file, mark the blocks which it cannot save as under-replicated, and thereby occupy the whole of the 15 GB left over?
You should be able to store the file.
It will try to accommodate as many replicas as possible. When it fails to store all the replicas, it will throw a warning but not fail. As a result, you will end up with under-replicated blocks.
The warning that you would see is:
WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Not able to place enough replicas
Whenever you fire the put command:
The dfs utility behaves like a client here.
The client first contacts the NameNode; the NameNode then guides the client on where to write the blocks and maintains the metadata for that file. It is the client's responsibility to break the data into blocks according to the configured block size.
The client then makes a direct connection to the different DataNodes where it has to write the different blocks, per the NameNode's reply.
The first copy of the data is written by the client only; the subsequent copies the DataNodes create on each other, with guidance from the NameNode.
So you should be able to put the 6 GB file if 15 GB of space is there, because the original copies get created on Hadoop first; the problem only arises later, once the replication process starts.
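To see how each block actually landed, one rough check (the path is hypothetical) is to compare each block's placed locations against the file's target replication:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicaCheck {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus st = fs.getFileStatus(new Path("/some/path/on/hadoop/6gbfile.txt"));
        BlockLocation[] blocks = fs.getFileBlockLocations(st, 0, st.getLen());
        for (BlockLocation b : blocks) {
            // fewer hosts than the target replication => under-replicated block
            System.out.printf("offset %d: %d of %d replicas placed%n",
                    b.getOffset(), b.getHosts().length, st.getReplication());
        }
    }
}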

Reading operations on hadoop and consistency level

I am setting up distributed HBase on HDFS and I am trying to understand the behavior of the system during read operations.
This is how I understand the high-level steps of the read operation.
The client connects to the NameNode to get the list of DataNodes which contain replicas of the rows that it is interested in.
From there, the client caches the list of DataNodes and starts talking to the chosen DataNode directly, until it needs some other rows from another DataNode, in which case it asks the NameNode again.
My questions are as follows:
Who chooses the best replica DataNode to contact? How does the client choose the "closest" replica? Does the NameNode return the list of relevant DataNodes in sorted order?
What are the scenarios (if any) in which the client switches to another DataNode that has the requested rows? For example, if one of the DataNodes becomes overloaded/slow, can the client library figure out that it should contact another DataNode from the list returned by the NameNode?
Is there a possibility of getting stale data from one of the replicas? For example, a client acquires the list of DataNodes and starts reading from one of them. In the meantime, a write request comes from another client to the NameNode. We have dfs.replication == 3 and dfs.replication.min == 2. The NameNode considers the write successful after flushing to disk on 2 out of 3 nodes, while the first client is reading from the 3rd node and doesn't know (yet) that there is another write that has been committed.
Does Hadoop maintain the same reading policy when supporting HBase?
Thank you
Who chooses the best replica DataNode to contact? How does the client choose the "closest" replica? Does the NameNode return the list of relevant DataNodes in sorted order?
The client is the one that decides whom best to contact. It picks them in this order:
The file is on the same machine. In this case (if properly configured) it will short-circuit the DataNode and read the file directly as an optimization (see the config sketch after this list).
The file is in the same rack (if rack awareness is configured).
The file is somewhere else.
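A minimal sketch of enabling short-circuit reads on the client side; the property names vary by Hadoop version (dfs.domain.socket.path belongs to the newer, HDFS-347-style implementation and must match the DataNode's setting):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class ShortCircuitClient {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Let a client co-located with the DataNode read block files directly.
        conf.setBoolean("dfs.client.read.shortcircuit", true);
        // Shared domain socket between client and DataNode (newer versions).
        conf.set("dfs.domain.socket.path", "/var/run/hadoop-hdfs/dn_socket");
        FileSystem fs = FileSystem.get(conf);
        System.out.println("connected to " + fs.getUri());
    }
}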
What are the scenarios (if any) in which the client switches to another DataNode that has the requested rows? For example, if one of the DataNodes becomes overloaded/slow, can the client library figure out that it should contact another DataNode from the list returned by the NameNode?
It's not that smart. It'll switch if it thinks the DataNode is down (meaning it times out), but not in any other situation that I know of. I believe that it will just go to the next one in the list, but it might contact the NameNode again; I'm not 100% sure.
Is there a possibility of getting stale data from one of the replicas? For example, a client acquires the list of DataNodes and starts reading from one of them. In the meantime, a write request comes from another client to the NameNode. We have dfs.replication == 3 and dfs.replication.min == 2. The NameNode considers the write successful after flushing to disk on 2 out of 3 nodes, while the first client is reading from the 3rd node and doesn't know (yet) that there is another write that has been committed.
Stale data is possible, but not in the situation you describe. Files are write-once and immutable (other than append, but don't append if you don't have to). The NameNode won't tell you the file is there until it is completely written. In the case of append, shame on you then: the behavior of reading from a file that is actively being appended to is unpredictable on a local filesystem as well, and you should expect the same in HDFS.
One way stale data could happen is if you retrieve your list of block locations and the NameNode decides to migrate all three of them at once before you access them. I don't know what would happen there, but in 5 years of using Hadoop, I've never had it be a problem, even when running the balancer at the same time as doing other work.
Does Hadoop maintain the same reading policy when supporting HBase?
HBase is not treated specially by HDFS. There is some talk about using a custom block placement strategy with HBase to get better data locality, but that's in the weeds.

How is data written to HDFS?

I'm trying to understand how data writing is managed in HDFS, by reading the hadoop-2.4.1 documentation.
According to the following schema:
whenever a client writes something to HDFS, it has no contact with the NameNode and is itself in charge of chunking and replication. I assume that in this case, the client is a machine running an HDFS shell (or equivalent).
However, I don't understand how this is managed.
Indeed, according to the same documentation:
The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.
Is the schema presented above correct? If so:
is the NameNode only informed of new files when it receives a Blockreport (which can take time, I suppose)?
why does the client write to multiple nodes?
If this schema is not correct, how does file creation work in HDFS?
As you said, DataNodes are responsible for serving read/write requests and for block creation/deletion/replication.
They then send "heartbeats" (state-of-health reports) and "block reports" (the list of blocks on the DataNode) to the NameNode on a regular basis.
According to this article:
Data Nodes send heartbeats to the Name Node every 3 seconds via a TCP
handshake, ... Every tenth heartbeat is a Block Report,
where the Data Node tells the Name Node about all the blocks it has.
So block reports are sent every 30 seconds. I don't think this affects Hadoop jobs, because in general they run independently of the reporting.
For your question:
why does the client write to multiple nodes?
I'd say that, actually, the client writes to just one DataNode and tells it to forward the data to the other DataNodes (see this link picture: CLIENT START WRITING DATA), but this is transparent. That's why your schema shows the client as the one writing to multiple nodes.
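From the client's point of view all of this is hidden behind an ordinary output stream; a minimal sketch (the path is hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SimpleWrite {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // The NameNode picks the target DataNodes; the stream writes to the
        // first one, which pipelines the data on to the remaining replicas.
        FSDataOutputStream out = fs.create(new Path("/tmp/hello.txt"));
        out.writeBytes("hello hdfs\n");
        out.close();  // close() waits for the pipeline to acknowledge
    }
}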

Hadoop Datanode, namenode, secondary-namenode, job-tracker and task-tracker

I am new to Hadoop, so I have some doubts. If the master node fails, what happens to the Hadoop cluster? Can we recover that node without any loss? Is it possible to keep a secondary master node that switches automatically to the master when the current one fails?
We have the backup of the NameNode (secondary NameNode), so we can restore the NameNode from the secondary NameNode when it fails. Similarly, how can we restore the data in a DataNode when the DataNode fails? The secondary NameNode is a backup of the NameNode only, not of the DataNodes, right? If a node fails before the completion of a job, so that there is a job pending in the JobTracker, does that job continue, or does it restart from the beginning on a free node?
How can we restore the entire cluster data if anything happens?
And my final question: can we use a C program in MapReduce (for example, bubble sort in MapReduce)?
Thanks in advance
Although it is too late to answer your question, it may just help others...
First of all, let me introduce you to the secondary NameNode:
It contains a backup of the namespace image and edit log files for the past hour (configurable). Its job is to merge the latest NameNode namespace image with the edit log files and upload the result back to the NameNode as a replacement for the old image. Having a secondary NN in a cluster is not mandatory.
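The checkpoint interval is driven by two Hadoop 1.x settings; a minimal sketch with the default values (in Hadoop 2.x the equivalents live under dfs.namenode.checkpoint.*):

import org.apache.hadoop.conf.Configuration;

public class CheckpointSettings {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // How often (in seconds) the secondary NameNode builds a checkpoint.
        conf.setLong("fs.checkpoint.period", 3600);
        // Checkpoint early if the edit log grows past this size (in bytes).
        conf.setLong("fs.checkpoint.size", 64L * 1024 * 1024);
        System.out.println("checkpoint every "
                + conf.getLong("fs.checkpoint.period", 3600) + " seconds");
    }
}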
Now, coming to your concerns...
If the master node fails, what happens to the Hadoop cluster?
Supporting Frail's answer: yes, Hadoop has a single point of failure, so all of your currently running work, such as MapReduce jobs or anything else using the failed master node, will stop. The whole cluster, including clients, will stop working.
Can we recover that node without any loss?
That is hypothetical. Recovering without loss is hardly possible, as all the data (block reports) sent by the DataNodes to the NameNode after the last backup taken by the secondary NameNode will be lost. I say "hardly" because if the NameNode fails just after a successful backup run by the secondary NameNode, then it is in a safe state.
Is it possible to keep a secondary master node to switch automatically to the master when the current one fails?
Doing it manually is straightforward for an administrator (a user). To switch automatically, you would have to write native code outside the cluster: code to monitor the cluster that configures the secondary NameNode appropriately and restarts the cluster with the new NameNode address.
We have the backup of the NameNode (secondary NameNode), so we can restore the NameNode from the secondary NameNode when it fails. Similarly, how can we restore the data in a DataNode when the DataNode fails?
It is about the replication factor. We have 3 replicas (the default, as a best practice; configurable) of each file block, all on different DataNodes. So in case of a failure, we have 2 backup DataNodes for the time being. Later, the NameNode will create one more replica of the data that the failed DataNode contained.
The secondary NameNode is a backup of the NameNode only, not of the DataNodes, right?
Right. It just contains all the metadata about the DataNodes, such as DataNode addresses and properties, including the block report of each DataNode.
If a node fails before the completion of a job, so that there is a job pending in the JobTracker, does that job continue or restart from the beginning on a free node?
HDFS will forcefully try to continue the job. But again, it depends on the replication factor, rack awareness, and the other configuration made by the admin. If Hadoop's best practices for HDFS are followed, the job will not fail; the JobTracker will get a replica's node address to continue.
How can we restore the entire cluster data if anything happens?
By restarting it.
And my final question: can we use a C program in MapReduce (for example, bubble sort in MapReduce)?
Yes, you can use any programming language that supports standard file read/write operations.
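For example, via Hadoop Streaming, any executable that reads lines on stdin and writes tab-separated key/value lines on stdout, including a compiled C binary, can act as the mapper or reducer. A rough Hadoop 1.x invocation (the jar path and binary names are illustrative):
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar -input /data/in -output /data/out -mapper ./bubble_mapper -reducer ./bubble_reducer -file bubble_mapper -file bubble_reducer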
I just gave it a try. Hope it helps you as well as others.
Suggestions/improvements are welcome.
Currently the Hadoop cluster has a single point of failure, which is the NameNode.
And about the secondary node issue (from the Apache wiki):
The term "secondary name-node" is somewhat misleading. It is not a
name-node in the sense that data-nodes cannot connect to the secondary
name-node, and in no event it can replace the primary name-node in
case of its failure.
The only purpose of the secondary name-node is to perform periodic
checkpoints. The secondary name-node periodically downloads current
name-node image and edits log files, joins them into new image and
uploads the new image back to the (primary and the only) name-node.
See User Guide.
So if the name-node fails and you can restart it on the same physical
node then there is no need to shutdown data-nodes, just the name-node
need to be restarted. If you cannot use the old node anymore you will
need to copy the latest image somewhere else. The latest image can be
found either on the node that used to be the primary before failure if
available; or on the secondary name-node. The latter will be the
latest checkpoint without subsequent edits logs, that is the most
recent name space modifications may be missing there. You will also
need to restart the whole cluster in this case.
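For reference, the "copy the latest image" recovery described above corresponds to starting a replacement NameNode from the secondary's checkpoint in Hadoop 1.x, assuming fs.checkpoint.dir is configured and populated:
hadoop namenode -importCheckpoint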
There are tricky ways to overcome this single point of failure. If you are using the Cloudera distribution, one of the ways is explained here. The MapR distribution has a different way to handle this SPOF.
Finally, you can use any programming language to write MapReduce jobs over Hadoop Streaming.
Although it is too late to answer your question, it may just help others. First we will discuss the roles of the Hadoop 1.x daemons, and then your issues...
1. What is the role of the secondary NameNode?
It is not exactly a backup node. It reads the edit logs and periodically creates an updated fsimage file for the NameNode. It gets the metadata from the NameNode periodically and keeps it, to be used when the NameNode fails.
2. What is the role of the NameNode?
It is the manager of all the daemons, a master JVM process which runs on the master node. It interacts with the DataNodes.
3. What is the role of the JobTracker?
It accepts jobs and distributes them to the TaskTrackers for processing on the DataNodes.
4. What is the role of the TaskTrackers?
Each one executes the provided program on the data existing at its DataNode; that process is called a map.
Limitations of Hadoop 1.x
Single point of failure: the NameNode. We should therefore maintain high-quality hardware for the NameNode; if the NameNode fails, everything becomes inaccessible.
Solutions
The solution to the single point of failure is Hadoop 2.x, which provides high availability (see: high availability with Hadoop 2.x).
Now to your topics...
How can we restore the entire cluster data if anything happens?
If the cluster fails, we can restart it.
If a node fails before the completion of a job, so that there is a job pending in the JobTracker, does that job continue or restart from the beginning on a free node?
We have 3 replicas of the data (that is, of the blocks) by default to get high availability; how many replicas there are depends on what the admin has set. So the JobTracker will continue with another copy of the data on another DataNode.
Can we use a C program in MapReduce (for example, bubble sort in MapReduce)?
Basically, MapReduce is an execution engine which solves and processes big-data problems (storage plus processing) in a distributed manner. With MapReduce programming we are doing file handling and all the other basic operations, so we can use any language in which we can handle files as per the requirements.
Hadoop 1.x architecture
Hadoop 1.x has the 4 basic daemons described above.
I just gave it a try. Hope it helps you as well as others.
Suggestions/improvements are welcome.
