HBase: too many bad rows in VerifyReplication

We have set up 2 HBase clusters with replication between them, where one acts as the active cluster and the other as the DR cluster. Some months ago, we had an error in the DR cluster, which caused a replication lag on the active cluster.
After some days we brought the DR cluster back up and the replication lag cleared. Although the lag cleared, we still have a considerable amount (3.5 TB) of oldWALs on the active cluster.
I ran VerifyReplication in HBase between the clusters for just one day's worth of data, and it reports a large number of bad rows (about 25%) between the clusters. I suspect the replication has not been replayed properly between the clusters.
Is there any way to fix the difference between the 2 clusters? Is there any way to replicate data to the peer by replaying the oldWALs? What are the issues with replaying oldWALs?
Please help.
HBase version - 1.4.11
Hadoop version - 2.7.3
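For reference, a single-day VerifyReplication run is normally launched as a MapReduce job from the source cluster; a minimal sketch, where the peer id (1) and table name (my_table) are placeholders and the start/end times are epoch-millisecond bounds for the day being checked:

    # Run from the active (source) cluster; the job compares rows against the
    # peer and reports GOODROWS / BADROWS counters in its output.
    hbase org.apache.hadoop.hbase.mapreduce.replication.VerifyReplication \
      --starttime=1600000000000 \
      --endtime=1600086400000 \
      1 my_table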

Related

Cloudera Hadoop: storing data on specific nodes

I have a 10-node Amazon EC2 cluster used for daily data processing, and I want to use all 10 nodes for each day's batch process (a 2-hour job only). Once the reporting data points have been generated, I want to shut down 5 nodes and keep only 5 nodes active for the rest of the day, for cost optimization.
I have a replication factor of 3.
In some scenarios all 3 copies of a block (the original and its replicas) end up on the 5 nodes which I shut down, because of which I am not able to read the data properly.
Can I make some setting in Cloudera Manager to persist a specific database or specific tables on given nodes, so that I have no issue reading the data with only 5 nodes active?
Any other suggestions would also be appreciated.
You can use rack awareness (virtually) to separate your cluster into 2 "racks", and place the 5 nodes that you shut down regularly in a separate "rack". The replication policy will require that the NameNode place the replicas on separate racks, if configured. Again, I'm referring to racks in the virtual sense here. That should get you what you want.
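In Cloudera Manager the rack of each host is assigned from the Hosts page; outside CM the equivalent is pointing net.topology.script.file.name in core-site.xml at a topology script. A minimal sketch of such a script, assuming hypothetical hostnames node01..node05 for the 5 always-on nodes:

    #!/bin/bash
    # Hypothetical topology script: Hadoop passes host names/IPs as arguments
    # and expects one rack path printed per argument.
    for host in "$@"; do
      case "$host" in
        node0[1-5]*) echo "/rack-always-on" ;;  # the 5 nodes that stay up
        *)           echo "/rack-shutdown"  ;;  # the 5 nodes shut down daily
      esac
    done

You can verify the resulting mapping afterwards with hdfs dfsadmin -printTopology.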

Ingesting data into Elasticsearch from HDFS: cluster setup and usage

I am setting up a Spark cluster. I have HDFS data nodes and Spark master nodes on the same instances.
The current setup is:
1-master (spark and hdfs)
6-spark workers and hdfs data nodes
All instances are the same, 16 GB RAM and dual core (unfortunately).
I have 3 more machines, again with the same specs.
Now I have these options:
1. Just deploy ES on these 3 machines. The cluster will look like:
1-master (spark and hdfs)
6-spark workers and hdfs data nodes
3-elasticsearch nodes
2. Deploy an ES master on 1 machine, and extend Spark, HDFS and ES across all the others.
The cluster will look like:
1-master (spark and hdfs)
1-master elasticsearch
8-spark workers, hdfs data nodes, es data nodes
My application uses Spark heavily for joins, ML etc., but we are also looking for search capabilities. Search definitely doesn't need to be real time; a refresh interval of up to 30 minutes is fine for us.
At the same time the Spark cluster has other long-running tasks apart from ES indexing.
The solution need not be one of the above; I am open to experimentation if someone has a suggestion. It would also be handy for other devs once this is concluded.
I have also tried es-hadoop and the es-spark project, but I felt ingestion was very slow with 3 dedicated nodes, around 0.6 million records/minute.
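For reference, the refresh interval mentioned above is a per-index Elasticsearch setting; a minimal sketch of relaxing it during bulk loads, using a hypothetical host and index name:

    # Relax the refresh interval on a hypothetical index "my_index"
    # (restore a smaller value once the bulk load is done).
    curl -XPUT 'http://es-node1:9200/my_index/_settings' \
         -H 'Content-Type: application/json' \
         -d '{"index": {"refresh_interval": "30m"}}'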
The optimal approach here mostly depends on your network bandwidth and whether or not it's the bottleneck in your operation in my opinion.
I would just check whether my network links are saturated, via e.g.
iftop -i any or similar, and see if that is the case. If you see data rates close to the physical capacity of your network, then you could try running HDFS + Spark on the same machines that run ES, to save the network round trip and speed things up.
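A quick sketch of that check during an ingestion run (eth0 is an assumed interface name; sar comes from the sysstat package):

    # Interactive per-connection bandwidth view on a worker or ES node
    sudo iftop -i eth0

    # Rolling per-interface throughput, sampled every 5 seconds
    sar -n DEV 5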
If network turns out not to be the bottleneck here, I would look into the way Spark and HDFS are deployed next.
Are you using all the RAM available (Java Xmx set high enough? Spark memory limits? YARN memory limits if Spark is deployed via YARN?)?
Also, you should check whether ES or Spark is the bottleneck here; in all likelihood it's ES. Maybe you could spawn additional ES instances; 3 ES nodes feeding 6 Spark workers seems very sub-optimal.
If anything, I'd probably try to invert that ratio: fewer Spark executors and more ES capacity. ES is likely a lot slower at ingesting the data than HDFS is at providing it (though this really depends on the configuration of both ... just an educated guess here :)). It is highly likely that more ES nodes and fewer Spark workers will be the better approach here.
So in a nutshell:
Add more ES nodes and reduce Spark worker count
Check if your network links are saturated; if so, put both on the same machines (this could be detrimental with only 2 cores, but I'd still give it a shot ... you gotta try this out)
Adding more ES nodes is the better bet of the two things you can do :)

1 big Hadoop and HBase cluster vs 1 Hadoop cluster + 1 HBase cluster

Hadoop will run a lot of jobs that read data from HBase and write data to HBase. Suppose I have 100 nodes; then there are two ways that I can build my Hadoop/HBase cluster:
A 100-node Hadoop & HBase cluster (1 big Hadoop & HBase)
Separate the database (HBase), so we have two clusters: a 60-node Hadoop cluster and a 40-node HBase cluster (1 Hadoop + 1 HBase)
Which option is better? Why?
Thanks.
I would say option 2 is better. My reasoning: even though your requirement is mostly to run lots of MapReduce jobs that read and write data out of HBase, there is a lot going on behind the scenes for HBase to optimise those reads and writes for your submitted jobs. The HMaster will have to do load balancing often, unless your region keys are perfectly balanced. Table hotspotting can occur. On the RegionServer side there will be major compactions, and if your JVM tuning skills are not that good, occasional stop-the-world garbage collections can happen. All the regions may start splitting at the same time. A RegionServer can go down, and so on. The moot point is: tuning HBase takes time. If you have just one node dedicated to HBase, the probability of the aforementioned problems is higher. It's always better to have more than one node, so all the performance pressure doesn't fall on a single node. And by the way, the strong point of HBase is its inherently distributed nature; you wouldn't want to kill that. All that said, you can experiment with the ratio of nodes between Hadoop and HBase, maybe 70:30 or 80:20. Mileage may vary according to your application requirements.
The main reason to separate HBase and Hadoop is when they have different usage scenarios - i.e. HBAse does random read-write in low latency and Hadoop does sequential batches. In this case the different access patterns can interfere with each other and it can be better to separate the clusters.
If you're just using HBase in batch mode you can use the same cluster (and probably rethink using HBase since it is slower than raw hadoop in batch).
Note that you would need to tune HBase along the lines mentioned by Chandra Kant regardless of the path you take

Unable to add nodes to existing Cassandra Cluster

We have a Cassandra cluster of 6 nodes on EC2, and we have to double its capacity to 12 nodes.
So to add 6 more nodes I followed these steps:
1. Calculated the tokens for 12 nodes and configured the new nodes accordingly.
2. With the proper configuration, started the new nodes so that they would bisect the existing token ranges.
In the beginning all the new nodes were showing streaming in progress.
In the ring status all the new nodes were in the "Joining" state.
After 12 hours, 2 nodes completed the streaming and came into the normal state.
But the remaining 4 nodes, after streaming some amount of data, are not showing any progress; it looks like they are stuck.
We have installed Cassandra 0.8.2, have around 500 GB of data on each existing node, and store the data on EBS volumes.
How can I resolve this issue and get a balanced cluster of 12 nodes?
Can I restart the nodes?
If I clean the data directories of the stuck Cassandra nodes and restart them with a fresh installation, will it cause any data loss?
There will not be any data loss if your replication factor is 2 or greater.
Version 0.8.2 of Cassandra has several known issues - please upgrade to 0.8.8 on all original nodes as well as the new nodes that came up, and then start the procedure over for the nodes that did not complete.
Also, be aware that storing data on EBS volumes is a bad idea:
http://www.mail-archive.com/user@cassandra.apache.org/msg11022.html
While this won't answer your question directly, hopefully it points you in the right direction:
there is a fairly active #cassandra IRC channel on freenode.
So here is the answer to why some of our nodes were stuck.
1) We had upgraded from Cassandra 0.7.2 to Cassandra 0.8.2.
2) We were loading the sstables with the sstable-loader utility.
3) But some data for some of the column families was inserted directly from a Hadoop job,
and the data of those column families was reporting a different version, as we had not upgraded the Cassandra API used in the Hadoop job.
4) Because of this version mismatch, Cassandra throws a 'version mismatch exception' and terminates the streaming.
5) So the solution for this is to use "nodetool scrub keyspace columnfamily". I used this and my issue was resolved.
So the main thing here is: if you are upgrading the Cassandra cluster capacity, you must run nodetool scrub.
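A hedged sketch of what that looks like on the command line (keyspace and column family names are placeholders; run it on each node that holds the mismatched sstables, then re-check streaming):

    # Rewrite the sstables of one column family so they match the running version
    nodetool -h localhost scrub my_keyspace my_columnfamily

    # Check ring membership and streaming state afterwards
    nodetool -h localhost ring
    nodetool -h localhost netstats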

Hadoop NameNode: single point of failure

The NameNode in the Hadoop architecture is a single point of failure.
How do people who have large Hadoop clusters cope with this problem?
Is there an industry-accepted solution that has worked well, wherein a secondary NameNode takes over in case the primary one fails?
Yahoo has certain recommendations for configuration settings at different cluster sizes to take NameNode failure into account. For example:
The single point of failure in a Hadoop cluster is the NameNode. While the loss of any other machine (intermittently or permanently) does not result in data loss, NameNode loss results in cluster unavailability. The permanent loss of NameNode data would render the cluster's HDFS inoperable.
Therefore, another step should be taken in this configuration to back up the NameNode metadata.
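On Hadoop 2.x, one hedged way to take such an out-of-band backup is to pull the latest fsimage checkpoint from the NameNode (the backup path below is a placeholder):

    BACKUP_DIR=/backups/namenode/$(date +%F)   # placeholder location
    mkdir -p "$BACKUP_DIR"
    # Download the most recent fsimage from the NameNode into the backup dir
    hdfs dfsadmin -fetchImage "$BACKUP_DIR"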
Facebook uses a tweaked version of Hadoop for its data warehouses; it has some optimizations that focus on NameNode reliability. In addition to the patches available on GitHub, Facebook appears to use AvatarNode specifically for quickly switching between primary and secondary NameNodes. Dhruba Borthakur's blog contains several other entries offering further insights into the NameNode as a single point of failure.
Edit: Further info about Facebook's improvements to the NameNode.
High Availability of the NameNode was introduced with the Hadoop 2.x releases.
It can be achieved in two modes - with NFS and with QJM.
High availability with the Quorum Journal Manager (QJM) is the preferred option.
In a typical HA cluster, two separate machines are configured as NameNodes. At any point in time, exactly one of the NameNodes is in an Active state, and the other is in a Standby state. The Active NameNode is responsible for all client operations in the cluster, while the Standby is simply acting as a slave, maintaining enough state to provide a fast failover if necessary.
Have a look at the SE questions below, which explain the complete failover process.
Secondary NameNode usage and High availability in Hadoop 2.x
How does Hadoop Namenode failover process works?
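Once QJM-based HA is configured, the current state can be inspected and a failover exercised from the command line; a minimal sketch, where nn1 and nn2 are whatever NameNode IDs are configured under dfs.ha.namenodes.<nameservice>:

    # Which NameNode is currently Active and which is Standby?
    hdfs haadmin -getServiceState nn1
    hdfs haadmin -getServiceState nn2

    # Gracefully fail over from nn1 to nn2 (configured fencing applies)
    hdfs haadmin -failover nn1 nn2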
Large Hadoop clusters have thousands of data nodes and one name node. The probability of failure goes up linearly with machine count (all else being equal). So if Hadoop didn't cope with data node failures it wouldn't scale. Since there's still only one name node the Single Point of Failure (SPOF) is there, but the probability of failure is still low.
That said, Bkkbrad's answer about Facebook adding failover capability to the name node is right on.
