restore with AWS EBS' snapshots on separate cluster - amazon-ec2

I have a cluster with 3 nodes - say cluster1 on AWS EC2 instances. The cluster is up and running, took snapshot of the keyspace's volume.
Now I want to restore few tables/keyspaces from the snapshot volumes, so I created another cluster say cluster2 and attached the snapshot volumes on to the new cluster's ec2 nodes (same number of nodes). Cluster2 is not starting bcz the system keyspace in the snapshot taken was having cluster name as cluster1 and the cluster on which it is being restored is cluster2. How do I do a restore in this case? I do not want to do any modifications to the existing cluster.
Also when I do restore do I need to think about the token ranges of the old and new cluster's mapping?

Before starting the cluster2, it's important to ensure that none of the IP addresses of the cluster1 are included in the seed list of the cluster2 to ensure that they are kept unaware between them. Also, to remove from the path data_file_directories (as defined in the cassandra.yaml), the following directories:
system
system_auth
system_distributed
system_traces
system_schema should not be touched, as it contains the schema definition of the keyspaces and tables.
Start the cluster, one node at a time; the first node should include its own IP address at the beginning of the seed list; This will be a one time change, and the change should be removed once that the cluster is up and running.
At this moment you should have a separate cluster, with the information and structure of the original cluster at the time that the snapshot was taken. To test this, execute nodetool gossipinfo and only the nodes of the cluster2 should be listed, login into cqlsh describe keyspaces should list all your keyspaces, and executing queries of your application should retrieve your data. You will note that Cassandra already generated the system* keyspaces, as well as dealt with the token distribution.
The next step is to update the name of the restored cluster, in each one of the nodes:
Log into cqlsh
Execute UPDATE system.local SET cluster_name = 'cluster2' where key='local';
exit cqlsh
run nodetool flush
run nodetool drain
edit the cassandra.yaml file, update cluster_name with the name 'cluster2'
restart the cassandra service
wait until the node is reported as NORMAL with nodetool status or nodetool netstats
repeat with a different node
At this point you will have 2 independent clusters, with different name.

Related

Apache Ignite Backup cache Node identification

I have started using Apache Ignite for my current project. I have set up the ignite Cluster with 3 Server Nodes with Backup Cache count as 1. Ignite Client Node is able to create a primary Cache as well as Backup cache in the cluster. But here I want to know for a particular cache which is Primary node and on which Node the Backup Cache is stored. Is there any tool available or any Visor command to do so along with finding the size of each cache.
Thank you.
Visor CLI shows how many primary and backup partitions each node holds.
By default, a cache is split into 1024 partitions. You can change that by configuring affinity function.
You may take a look at control.sh and inspect some specific partition distribution.
--cache distribution nodeId|null [cacheName1,...,cacheNameN] [--user-attributes attrName1,...,attrNameN]
Prints the information about partition distribution.
This commands prints partition distribution across nodes.
Sample:
./control.sh --cache distribution null myCache
[groupId,partition,nodeId,primary,state,updateCounter,partitionSize,nodeAddresses]
[next group: id=1482644790, name=myCache]
1482644790,0,e27ad549,P,OWNING,0,0,[0:0:0:0:0:0:0:1, 10.0.75.1, 127.0.0.1, 172.23.45.97, 172.25.4.211]

How to migrate single datacenter cluster to multiple datacenter cluster in cassandra>

Provide recommended configuration to migrate the data from the single data center cassandra cluster to multiple data center cassandra cluster. Currenlty i have the single data center cluster environment with following configurations,
i) No of nodes: 3
ii) Replication Factor : 2
iii) Strategy: SimpleStrategy
iv) endpoint_snitch: SimpleSnitch
And now i am planning to add 2 more nodes which is in different location. So i thought of moving to Multiple data center cluster with following confiruations.
i) No of nodes: 5
ii) RF: dc1=2, dc2=2
iii) Strategy: NetworkTopolofyStrategy
iv). endpoint_snitch: PropertyFileSnitch (I have the cassandra.topolofy.properties file)
What is the procedure to migrate the data without losing any data?
Please let me know the recommended steps to follow or any guide which i can refer. Please let me know if further info is required.
Complete repairs on all nodes.
Take snapshot on all nodes to have a fall back point.
Decommission each node that is not a pure Cassandra workload. Repair the ring each time you decommission a node.
Update keyspaces with NetworkTopologyStrategy and replication factor to match the original RF
ALTER KEYSPACE keyspace_name
WITH REPLICATION =
{ 'class' : 'NetworkTopologyStrategy', 'datacenter_name' : 2 };
Change snitch on each node with restart.
Add nodes in a different datacenter. Make sure that when you add them you have auto_bootstrap: false in the cassandra.yaml
Run nodetool rebuild original_dc_name on each new node.
I just found this excellent tutorial on migrating Cassandra:
Cassandra Migration To EC2 by highscalability.com
Although the details will be found at the original article, an outline of the main steps are:
1. Cassandra Multi DC Support
Configure the PropertyFileSnitch
Update the replication strategy
Update the client connection setting
2. Setup Cassandra On EC2
Start the nodes
Stop the EC2 nodes and cleanup
Start the nodes
Place data replicas in the cluster on EC2
3. Decommission The Old DC And Cleanup
Decommission the seed node(s)
Update your client settings
Decommission the old data center

Hadoop Nodes and Roles

I've a Hadoop Cluster at work that has over 50 nodes, We occasionally face disk failures and require to decommission the datanode roles.
My Question is - if I were to only decommission the datanode and leave the tasktracker running, would this result in failed tasks/jobs on this node due to unavailability of HDFS Service on that node?
Does the TaskTracker on Node1 sit idle since there is no DataNode service on that Node? Correct, if the data node is disabled then the task tracker will not be able to process the data as the data will not be avaiable; it will be idle. 2. or Does the TaskTracker work on data from DataNodes on other Nodes? Nope, due to data locality principle, the task tracker will not process the data from other nodes.. 3. Do we get errors from TaskTracker Service on Node1 due to the DN on it's node being down? , Task tracker will not be able to process any data, so no errors.; 4. if I have services like Hive, Impala, etc running on HDFS - would those services throw error upon contact with TaskTracker on Node1? They will not be able to contact the task tracker on node 1. When client requests for the processing of the data, Name node tells the client about the data locations, so based on the data locations all other applications will communicate with data nodes
I would expect any task that tries to read from HDFS on the "dead" node to fail. This should result in the node being blacklisted by M/R after N failures (default is 3 I think). Also, I believe this happens each time a job runs.
However, jobs should still finish since the tasks that got routed to the bad node will simply be retried on other nodes.
Firstly, in order to run a job you need to have the input file. So when you load the input file to HDFS this will be split into 64 MB block size by default. Also there will be 3 replications with default settings. Now since one of your data node in the cluster is failed, Name node will not store the data in that node. Even if it tries to store also, it gets the frequent updates from data node about the status. So it will not choose that specific data node to store the data.
It should throw exception when you don't have the disk space and the only dead data node is left in the cluster. Then its time for you to replace the data node and scale up the cluster.
Hope this helps.

restore a cassandra cluster from snapshot failed

Hope someone can help. We are having issues restoring all nodes of a cassandra 2.0 cluster from a snapshot. I have reviewed the instructions [Restoring from a snapshot][1]
Specific steps done include:
All data had been flushed from the memtables.
All nodes were compacted down to 1 sstable
Snapshots were taken on all nodes and saved off elsewhere
New cluster stood up, install from sratch of identical cluster (less data)
keyspace and column families were created
All nodes were stopped
commitlogs were cleared on all nodes and verified no sstable files existed
snapshot sstables were copied to each corresponding node under the base table folder
All nodes were restarted
Nodetool repair was run on all nodes
Result of these steps that appear to match the documentation is:
For a 2 node cluster, nodetool cfstats on each node seems to report approximate number of keys each node would have. nodetool status shows correct division of data by host
logging into cqlsh and doing a select count(*) on one of the columnfamily with limit high enough to return all rows does not report back the correct/original number of rows. It appears to report just the results of one node.
Is there a step missing from the documentation? Why doesn't a select count(*) show all the rows?
Thanks,
dfgriffith

Hadoop Datanode, namenode, secondary-namenode, job-tracker and task-tracker

I am new in hadoop so I have some doubts. If the master-node fails what happened the hadoop cluster? Can we recover that node without any loss? Is it possible to keep a secondary master-node to switch automatically to the master when the current one fails?
We have the backup of the namenode (Secondary namenode), so we can restore the namenode from Secondary namenode when it fails. Like this, How can we restore the data's in datanode when the datanode fails? The secondary namenode is the backup of namenode only not to datenode, right? If a node is failed before completion of a job, so there is job pending in job tracker, is that job continue or restart from the first in the free node?
How can we restore the entire cluster data if anything happens?
And my final question, can we use C program in Mapreduce (For example, Bubble sort in mapreduce)?
Thanks in advance
Although, It is too late to answer your question but just It may help others..
First of all let me Introduce you with Secondary Name Node:
It Contains the name space image, edit log files' back up for past one
hour (configurable). And its work is to merge latest Name Node
NameSpaceImage and edit logs files to upload back to Name Node as
replacement of the old one. To have a Secondary NN in a cluster is not
mandatory.
Now coming to your concerns..
If the master-node fails what happened the hadoop cluster?
Supporting Frail's answer, Yes hadoop has single point of failure so
whole of your currently running task like Map-Reduce or any other that
is using the failed master node will stop. The whole cluster including
client will stop working.
Can we recover that node without any loss?
That is hypothetical, Without loss it is least possible, as all the
data (block reports) will lost which has sent by Data nodes to Name
node after last back up taken by secondary name node. Why I mentioned
least, because If name node fails just after a successful back up run
by secondary name node then it is in safe state.
Is it possible to keep a secondary master-node to switch automatically to the master when the current one fails?
It is staright possible by an Administrator (User). And to switch it
automatically you have to write a native code out of the cluster, Code
to moniter the cluster that will cofigure the secondary name node
smartly and restart the cluster with new name node address.
We have the backup of the namenode (Secondary namenode), so we can restore the namenode from Secondary namenode when it fails. Like this, How can we restore the data's in datanode when the datanode fails?
It is about replication factor, We have 3 (default as best practice,
configurable) replicas of each file block all in different data nodes.
So in case of failure for time being we have 2 back up data nodes.
Later Name node will create one more replica of the data that failed
data node contained.
The secondary namenode is the backup of namenode only not to datenode, right?
Right. It just contains all the metadata of data nodes like data node
address,properties including block report of each data node.
If a node is failed before completion of a job, so there is job pending in job tracker, is that job continue or restart from the first in the free node?
HDFS will forcely try to continue the job. But again it depends on
replication factor, rack awareness and other configuration made by
admin. But if following Hadoop's best practices about HDFS then it
will not get failed. JobTracker will get replicated node address to
continnue.
How can we restore the entire cluster data if anything happens?
By Restarting it.
And my final question, can we use C program in Mapreduce (For example, Bubble sort in mapreduce)?
yes, you can use any programming language which support Standard file
read write operations.
I Just gave a try. Hope it will help you as well as others.
*Suggestions/Improvements are welcome.*
Currently hadoop cluster has a single point of failure which is namenode.
And about the secondary node isssue (from apache wiki) :
The term "secondary name-node" is somewhat misleading. It is not a
name-node in the sense that data-nodes cannot connect to the secondary
name-node, and in no event it can replace the primary name-node in
case of its failure.
The only purpose of the secondary name-node is to perform periodic
checkpoints. The secondary name-node periodically downloads current
name-node image and edits log files, joins them into new image and
uploads the new image back to the (primary and the only) name-node.
See User Guide.
So if the name-node fails and you can restart it on the same physical
node then there is no need to shutdown data-nodes, just the name-node
need to be restarted. If you cannot use the old node anymore you will
need to copy the latest image somewhere else. The latest image can be
found either on the node that used to be the primary before failure if
available; or on the secondary name-node. The latter will be the
latest checkpoint without subsequent edits logs, that is the most
recent name space modifications may be missing there. You will also
need to restart the whole cluster in this case.
There are tricky ways to overcome this single point of failure. If you are using cloudera distribution, one of the ways explained here. Mapr distribution has a different way to handle to this spof.
Finally, you can use every single programing language to write map reduce over hadoop streaming.
Although, It is too late to answer your question but just It may help others..firstly we will discuss role of Hadoop 1.X daemons and then your issues..
1. What is role of secondary name Node
it is not exactly a backup node. it reads a edit logs and create updated fsimage file for name node periodically. it get metadata from name node periodically and keep it and uses when name node fails.
2. what is role of name node
it is manager of all daemons. its master jvm proceess which run at master node. it interact with data nodes.
3. what is role of job tracker
it accepts job and distributes to task trackers for processing at data nodes. its called as map process
4. what is role of task trackers
it will execute program provided for processing on existing data at data node. that process is called as map.
limitations of hadoop 1.X
single point of failure
which is name node so we can maintain high quality hardware for the name node. if name node fails everything will be inaccessible
Solutions
solution to single point of failure is hadoop 2.X which provides high availability.
high availability with hadoop 2.X
now your topics ....
How can we restore the entire cluster data if anything happens?
if cluster fails we can restart it..
If a node is failed before completion of a job, so there is job pending in job tracker, is that job continue or restart from the first in the free node?
we have default 3 replicas of data(i mean blocks) to get high availability it depends upon admin that how much replicas he has set...so job trackers will continue with other copy of data on other data node
can we use C program in Mapreduce (For example, Bubble sort in mapreduce)?
basically mapreduce is execution engine which will solve or process big data problem in(storage plus processing) distributed manners. we are doing file handling and all other basic operations using mapreduce programming so we can use any language of where we can handle files as per the requirements.
hadoop 1.X architecture
hadoop 1.x has 4 basic daemons
I Just gave a try. Hope it will help you as well as others.
Suggestions/Improvements are welcome.

Resources