Setting keyspace for replication strategy - cassandra-2.0

I am pretty new to Cassandra, so forgive me if I have some fundamental misunderstanding of the concept of keyspaces. What I am trying to do is set up a multi-datacenter ring across different regions, with data replication using NetworkTopologyStrategy and endpoint_snitch set to GossipingPropertyFileSnitch.
Hence, as explained in the docs, I need to set the replication strategy for a keyspace:
CREATE KEYSPACE "mykey"
WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'dc1' : 2, 'dc2' : 2};
I also read that in CQL I can do "USE mykey" to set the keyspace.
Would that then be persistently set in the Cassandra configuration? As far as I understand, each application client in a cluster uses its own keyspace, right? Hence I would need to set this in the application?
The examples only show how to create a keyspace and configure its replication strategy options. I think I managed to understand the basics behind it. What I am looking for are examples of how I would tell Cassandra to use a certain keyspace strategy (consistently and/or per application).
I dug some more into the Cassandra docs and think I got a better understanding of the use of keyspaces. Am I correct that, to tell Cassandra to use a certain keyspace, I can create the keyspace like so:
CREATE KEYSPACE "MyKey" WITH replication = {'class':
'SimpleStrategy', 'replication_factor': '1'} AND durable_writes = true;
and then create tables in this keyspace like so:
CREATE TABLE "MyKey"."TableName" (
...
Would this make Cassandra always use the replication strategy configured for the "MyKey" keyspace for that table?

"As far as i understand each application client in a cluster uses its own keyspace right. Hence i would need to set this in the application??"
No. You can think of a keyspace as just a collection of tables, which all your users would access. You would really only create multiple keyspaces if you had dramatically different replication needs for some reason, or if you had a multi-tenant application that required it for security purposes.
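To the "use mykey" part of the question: USE only sets the working keyspace for the current cqlsh or driver session; it is not persisted anywhere in the server configuration. A minimal sketch (the table name here is hypothetical):

USE mykey;
SELECT * FROM tablename;

-- or skip USE and qualify the table name on every statement
SELECT * FROM mykey.tablename;

Most drivers also let you pass the keyspace when the session is opened, which is the usual way an application picks its keyspace.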
"Would this make cassandra to always use the configured replication strategy in the "MyKey" keyspace for that table?"
Yes. TableName table permanently lives in the MyKey keyspace and will inherit the properties of that keyspace.
Once you set your replication factor, you don't typically change it. You can but it would require a fairly IO intensive process in the background. Replication factor is used to determine how many copies of a singe piece of data lives in a particular datacenter and therefor will tell you how many nodes can fail before you have an outage. 3 is by far the most common setting here, but if you do not have 3 nodes in your data center, then a smaller number is fine.
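If you ever do change the replication settings later, the usual pattern is an ALTER KEYSPACE followed by a repair; a sketch, reusing the keyspace from the question and assuming you want 3 copies per datacenter:

ALTER KEYSPACE "MyKey"
WITH REPLICATION = {'class': 'NetworkTopologyStrategy', 'dc1': 3, 'dc2': 3};

After that you would run nodetool repair MyKey on the affected nodes so the additional replicas actually get streamed; that repair is the IO-intensive background work mentioned above.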

Related

ClickHouse Distributed tables and insert_quorum

I'm trying to configure a cluster with both sharding and replication, and I have some doubts about how insert_quorum works with the Distributed engine and internal replication.
insert_quorum controls synchronous insertion into multiple replicas of Replicated* tables (if insert_quorum >= 2, the client returns only after the data has been successfully inserted into insert_quorum replicas).
insert_distributed_sync controls synchronous insertion into a Distributed table. If insert_distributed_sync=1, the client returns only after the data has been successfully inserted into the target tables (one replica per shard if internal_replication is true).
But how do insert_distributed_sync, insert_quorum and internal_replication work together?
Is my understanding correct that if I execute an INSERT into a Distributed table with insert_distributed_sync=1 and insert_quorum=2, the statement will return only after the data has been inserted into at least two replicas?
Or is insert_quorum ignored for the Distributed engine and only applied when writing directly to Replicated* tables?
As I understand it:
internal_replication and insert_distributed_sync apply to the Distributed engine.
insert_quorum applies to ReplicatedMergeTree.
An INSERT into a Distributed table created over multiple *ReplicatedMergeTree tables, with insert_distributed_sync=1, will invoke multiple inserts into the ReplicatedMergeTree tables inside the initial clickhouse-server process, using the authentication from the remote_servers part of the config.
It will issue one insert per shard, according to the sharding key you defined when creating the Distributed table.
If you define internal_replication=true, then only one *ReplicatedMergeTree replica per shard is written to. But when the Distributed engine inserts into ReplicatedMergeTree, the initial clickhouse-server acts as a client, so insert_quorum applies on the destination clickhouse-server, and the initial server gets an answer only after the inserted parts have replicated (as tracked through ZooKeeper).
If you define internal_replication=false, then the Distributed engine initiates inserts to all *ReplicatedMergeTree replicas, and insert_quorum also applies, but replication conflicts are resolved via the ZooKeeper queues on the ReplicatedMergeTree side, because the inserted parts have the same checksums and names.
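To make this concrete, here is a minimal sketch; the cluster name my_cluster, the table names, and the schema are all illustrative, and internal_replication is the per-shard flag in the remote_servers section of the server config:

<remote_servers>
  <my_cluster>
    <shard>
      <internal_replication>true</internal_replication>
      <replica><host>host1</host><port>9000</port></replica>
      <replica><host>host2</host><port>9000</port></replica>
    </shard>
  </my_cluster>
</remote_servers>

-- one ReplicatedMergeTree table per node, plus a Distributed table over it
CREATE TABLE events_local ON CLUSTER my_cluster
(
    id UInt64,
    ts DateTime
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/events_local', '{replica}')
ORDER BY id;

CREATE TABLE events_dist ON CLUSTER my_cluster AS events_local
ENGINE = Distributed(my_cluster, default, events_local, rand());

-- make the distributed insert synchronous and require both replicas to acknowledge
SET insert_distributed_sync = 1;
SET insert_quorum = 2;
INSERT INTO events_dist VALUES (1, now());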

Row Level Transactions in Hive

I'm a newbie in HiveQL. While creating a table, I came to know that we need to set some transaction-related properties to TRUE. I have gone through what those are:
hive> set hive.support.concurrency = true;
hive> set hive.enforce.bucketing = true;
hive> set hive.exec.dynamic.partition.mode = nonstrict;
hive> set hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
hive> set hive.compactor.initiator.on = true;
hive> set hive.compactor.worker.threads = 1;
(the last one should be a positive number, set on at least one instance of the Thrift metastore service)
What exactly are concurrency, bucketing, and dynamic.partition.mode = 'nonstrict'?
I have been trying to learn about these things, but the information I find is mixed in with locking mechanisms, ZooKeeper, and in-memory concepts.
As I'm completely new to this area, I'm unable to get a proper understanding of these properties.
Can anyone throw some light on this?
From the Hive documentation:
hive.support.concurrency
Whether Hive supports concurrency or not. A ZooKeeper instance must be
up and running for the default Hive lock manager to support read-write
locks.
Set to true to support INSERT ... VALUES, UPDATE, and DELETE
transactions (Hive 0.14.0 and later). For a complete list of
parameters required for turning on Hive transactions, see
hive.txn.manager.
hive.enforce.bucketing
Whether bucketing is enforced. If true, while inserting into the
table, bucketing is enforced.
hive.exec.dynamic.partition.mode
In strict mode, the user must specify at least one static partition in
case the user accidentally overwrites all partitions. In nonstrict
mode all partitions are allowed to be dynamic.
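For example, nonstrict mode is what permits a fully dynamic insert like the following, where the partition value comes from the data rather than from a static literal (table and column names are hypothetical):

set hive.exec.dynamic.partition.mode = nonstrict;

INSERT INTO TABLE sales PARTITION (dt)
SELECT id, amount, dt FROM staging_sales;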
hive.txn.manager
Set this to org.apache.hadoop.hive.ql.lockmgr.DbTxnManager as part of
turning on Hive transactions. The default DummyTxnManager replicates
pre-Hive-0.13 behavior and provides no transactions.
hive.compactor.initiator.on
Whether to run the initiator and cleaner threads on this metastore
instance. Set this to true on one instance of the Thrift metastore
service as part of turning on Hive transactions. For a complete list
of parameters required for turning on transactions, see
hive.txn.manager.
It's critical that this is enabled on exactly one metastore service
instance (not enforced yet).
hive.compactor.worker.threads
How many compactor worker threads to run on this metastore instance.
Set this to a positive number on one or more instances of the Thrift
metastore service as part of turning on Hive transactions. For a
complete list of parameters required for turning on transactions, see
hive.txn.manager.
Worker threads spawn MapReduce jobs to do compactions. They do not do
the compactions themselves. Increasing the number of worker threads
will decrease the time it takes tables or partitions to be compacted
once they are determined to need compaction. It will also increase the
background load on the Hadoop cluster as more MapReduce jobs will be
running in the background.
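Once those settings are in place, what they enable is ACID tables and row-level operations on them. A minimal sketch of such a table (names are hypothetical; ORC storage and bucketing are exactly what the settings above enforce):

CREATE TABLE mydb.events (
  id INT,
  payload STRING
)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');

UPDATE mydb.events SET payload = 'fixed' WHERE id = 42;
DELETE FROM mydb.events WHERE id = 43;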

Does HBase have a replication policy of its own or is it inherited from HDFS?

Since HBase is built on top of HDFS, which has a replication policy for fault tolerance, does this mean HBase is inherently fault tolerant and that data stored in HBase will always be accessible thanks to the underlying HDFS? Or does HBase implement a replication policy of its own (e.g. replication of tables across regions)?
Yes, you can create replicas of regions in HBase, as mentioned here. However, note that HBase high availability is for reads only; it is not highly available for writes. If a region server goes down, then until its regions are assigned to a new region server, you will not be able to write to them.
To enable read replicas, you need to enable async WAL replication by setting hbase.region.replica.replication.enabled to true. You also need to enable high availability for the table at creation time by specifying a REGION_REPLICATION value greater than 1, as in the docs:
create 't1', 'f1', {REGION_REPLICATION => 2}
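For reference, that async WAL replication switch is a cluster-level setting; a sketch of the corresponding hbase-site.xml entry:

<property>
  <name>hbase.region.replica.replication.enabled</name>
  <value>true</value>
</property>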
More details can be found here.
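Also note that secondary replicas only serve reads that explicitly accept possibly stale data, e.g. from the HBase shell (row key is hypothetical):

get 't1', 'row1', {CONSISTENCY => 'TIMELINE'}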
The concept of replication in HBase is different from HDFS replication; the two apply in different contexts. HDFS is the file system and replicates data for fault tolerance and high availability at the level of data files, while HBase replication is mainly about fault tolerance, high availability, and data integrity from a database-system perspective.
Of course, the HDFS replication capability is used for file-level replication of HBase's data. Along with it, HBase also maintains copies of its metadata on backup nodes (which are again replicated by HDFS by default).
HBase also has backup processes to monitor and recover from failures, such as primary and secondary region servers. But against data loss at the region-server level, it is HDFS replication alone that protects the data.
Hence, HBase replication is mainly about recovering from failures and maintaining data integrity as a database engine, just like any other robust database system such as Oracle.

Multi-datacenter Replication with RethinkDB

I have two servers in two different geographic locations (alfa1 and alfa2).
r.tableCreate('dados', {shards:1, replicas:{alfa1:1, alfa2:1}, primaryReplicaTag:'alfa1'})
I need to be able to write to both servers, but when I shut down alfa1 and write to alfa2, RethinkDB only allows reads: Table test.dados is available for outdated reads, but not up-to-date reads or writes.
I need a way to write to all replicas, not only to the primary.
Is this possible? Does RethinkDB allow multi-datacenter replication?
I think that multi-datacenter replication needs to permit writes in both datacenters.
I tried to remove "primaryReplicaTag" but the system doesn't accept that!
Any help is welcome!
RethinkDB does support multi-datacenter replication/sharding.
I think the problem here is that you've set up a cluster of two, which means that when one server fails you only have 50% of the nodes in the cluster, which is less than the required majority.
From the failover docs - https://rethinkdb.com/docs/failover/
To perform automatic failover for a table, the following requirements
must be met:
The cluster must have three or more servers
The table must be configured to have three or more replicas
A majority (greater than half) of replicas for the table must be available
Try adding just one additional server and your problems should be resolved.
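With a third server joined, you can then bump the table to three replicas; a sketch using the table from the question:

r.table('dados').reconfigure({shards: 1, replicas: 3})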

RethinkDB Cross-Cluster Replication

I have 3 different pool of clients in 3 different geographical locations.
I need to configure RethinkDB with 3 different clusters and replicate data between them (inserts, updates, and deletes). I do not want to use sharding, only replication.
I couldn't find in the documentation whether this is possible.
I couldn't find in the documentation how to configure multi-cluster replication.
Any help is appreciated.
I think that a multi-datacenter setup is just the same as a single cluster with nodes in different datacenters.
First, you need to set up a cluster; follow this document: http://www.rethinkdb.com/docs/start-a-server/#a-rethinkdb-cluster-using-multiple-machines
Basically, use the command below to join a node into the cluster:
rethinkdb --join IP_OF_FIRST_MACHINE:29015 --bind all
Once you have your cluster set up, the rest is easy. Go to your admin UI, select the table, and under "Sharding and replication" click Reconfigure and enter how many replicas you want; just keep shards at 1.
You can also read more about Sharding and Replication at http://rethinkdb.com/docs/sharding-and-replication/#sharding-and-replication-via-the-web-console
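If you'd rather script it than use the web console, the same reconfiguration can be done from a client driver; a sketch with hypothetical database and table names:

r.db('mydb').table('mytable').reconfigure({shards: 1, replicas: 3})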
