system_auth replication in Cassandra - spring

I'm trying to configure authentication on Cassandra. It seems that, because of the replication strategy used for system_auth, user credentials don't get replicated to all the nodes in the cluster, so I end up getting "Incorrect credentials" on one node and a successful connection on another.
This is a related question. The answer there says you have to make sure the credentials are always present on all nodes.
How do I do that? The option offered there is to alter the keyspace so its replication factor equals the number of nodes in the cluster, then run a repair on each node. That's a whole ton of work if you want your Cassandra cluster to be dynamically scalable. If I add one node today and another node tomorrow, then alter the keyspace replication and keep restarting nodes manually, that will end up in some kind of chaos.
An hour of googling eventually led to a passing mention of EverywhereStrategy, but I don't see it mentioned as available anywhere in the docs. How do people configure their APIs to work with Cassandra authentication, then, if you can't be sure that your user is actually present on the node you specify as a contact point?
Obviously, I'm talking about true scale, where you can change the size of the cluster without restarting each node.

When you enable authentication in Cassandra, then yes, you have to increase the system_auth keyspace replication factor to N (the total number of nodes) and run a complete repair, but you don't need to restart the nodes after you add a new node.
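For example, on a three-node cluster using SimpleStrategy (a sketch; adjust the strategy class and factor to your own topology), the keyspace change looks like this in cqlsh:
ALTER KEYSPACE system_auth WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};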
If the repair is consuming too much time, you can optimize it by repairing only the system_auth keyspace:
nodetool repair system_auth
(or)
nodetool repair -pr system_auth
As per Cassandra, a complete repair should be done regularly. For more details on repair, see the links below:
http://www.datastax.com/dev/blog/repair-in-cassandra
https://www.pythian.com/blog/effective-anti-entropy-repair-cassandra/
http://docs.datastax.com/en/archived/cassandra/2.2/cassandra/tools/toolsRepair.html
Answering your questions:
Question: How do people configure APIs to work with Cassandra authentication, then, if you can't be sure that your user is actually present on the node you specify as a contact point?
Answer: I'm using Cassandra 2.2 and the Astyanax Thrift API from my Spring project, and with it I'm able to handle Cassandra authentication effectively. Please specify which version of Cassandra you are using and which driver you use to connect: the CQL driver or the Astyanax Thrift API.
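One quick check that is independent of the driver (my addition; hosts and credentials are placeholders): try logging in with cqlsh against each contact point. If the login succeeds on every node, the credentials have replicated there:
cqlsh 10.0.0.1 -u appuser -p app_password
cqlsh 10.0.0.2 -u appuser -p app_password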
Question: Obviously, I'm talking about true scale, where you can change the size of the cluster without restarting each node.
Answer: Yes, you can scale your Cassandra cluster without restarting nodes; please check the DataStax documentation for the Cassandra 2.2 version:
http://docs.datastax.com/en/archived/cassandra/2.2/cassandra/operations/opsAddNodeToCluster.html
Check the DataStax docs for the version you are using.
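After a new node joins (my addition), you can confirm it is up and normal before raising the system_auth replication factor and repairing again; each node should be listed with state UN (Up/Normal):
nodetool status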

Related

How to configure and install a standby master in Greenplum?

I've installed a single-node Greenplum DB with 2 segment hosts, with 2 primary and mirror segments residing inside them, and I want to configure a standby master. Can anyone help me with it?
It is pretty simple:
gpinitstandby -s smdw -a
Note: If you are using one of the cloud Marketplaces that deploys Greenplum for you, the standby master runs on the first segment host. The overhead of running the standby master is pretty small, so it doesn't impact performance. The cloud Marketplaces also have self-healing, so if that node fails, it is replaced and all services are automatically restored.
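To verify the standby master afterwards (my addition), gpstate has a flag that prints the standby master details:
gpstate -f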
As Jon said, this is fairly straightforward. Here is a link to the documentation: https://gpdb.docs.pivotal.io/5170/utility_guide/admin_utilities/gpinitstandby.html
If you have follow up questions, post them here.

Can I use a SnappyData JDBC connection with only a Locator and Server nodes?

SnappyData documentation and architecture diagrams seem to indicate that a JDBC thin client connection goes from a client to a Locator and then it is routed to a direct connection to a Server.
If this is true, then I can run JDBC queries without a Lead node, correct?
Yes, that is correct. The locator provides load and connectivity information back to the client, which is then able to connect to one or more servers, either for direct access to a bucket for low-latency queries or, more importantly, for HA: it can fail over and fail back.
So, yes, your connected clients will continue to function even when the locator goes away. Note that the "lead" plays a different role than the locator. Its primary function is to host the Spark driver, orchestrate Spark jobs, and provide HA to Spark. With no lead, you won't be able to run such jobs.
In addition to what @jagsr has mentioned, if you do not intend to run the lead nodes (and thus no Spark jobs or the column store), you can run the cluster as a pure row store using snappy-start-all.sh rowstore (see the rowstore docs).
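As a sketch (the host name is a placeholder; 1527 is the default client port), connecting a thin client through the locator from the bundled snappy shell looks like:
snappy> connect client 'locator-host:1527';
The equivalent JDBC URL would be jdbc:snappydata://locator-host:1527/ (my assumption, based on the documented thin-client driver).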

alter table add column not always propagating in cassandra

I am using apache cassandra (v. 2.0.9) on a 4-node cluster, replication factor = 3 and Datastax Java Driver for Cassandra (v. 2.0.2). I am using CQL queries from inside my Java code to add columns to existing tables.
I observed this issue when my CREATE INDEX queries and SELECT queries on the newly added columns failed, reason being that the column was not found. No error was logged in cassandra logs.
Note that this issue did not appear when I ran Cassandra on a single node, but it occurs persistently on the 4-node cluster. Currently I am working around it by retrying up to 5 times, and I notice that the columns are added by the third or fourth retry at the latest. I also observed that the more existing columns a table has, the fewer such failures occur.
I found a bug already reported at:
https://issues.apache.org/jira/browse/CASSANDRA-7186
It worked fine after I disabled all firewalls, so this may be happening because Cassandra uses particular ports for communication between nodes and those were blocked by the firewall.
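One way to diagnose this (my addition): check whether all nodes agree on the schema. If the output lists more than one schema version, schema propagation is being blocked somewhere:
nodetool describecluster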

Cloudera installation Doubts?

I am new to Cloudera. I installed Cloudera on my system successfully, and I have two questions:
Consider a machine with some nodes already using Hadoop with some data. Can we install Cloudera to use the existing Hadoop without making any changes or modifications to the data stored in the existing Hadoop?
I installed Cloudera on my machine, and I have another three machines to add to the cluster. Do I need to install Cloudera on those three machines before adding them to the cluster, or can we add a node to the cluster without installing Cloudera on that particular node?
Thanks in advance; can anyone please give me some information about the above questions?
Answers to your questions:
1. If you want to migrate to CDH from an existing Apache distribution, you can follow this link.
Excerpt:
Overview
The migration process does require a moderate understanding of Linux
system administration. You should make a plan before you start. You
will be restarting some critical services such as the name node and
job tracker, so some downtime is necessary. Given the value of the
data on your cluster, you’ll also want to be careful to take recent
back ups of any mission-critical data sets as well as the name node
meta-data.
Backing up your data is most important if you’re upgrading from a
version of Hadoop based on an Apache Software Foundation release
earlier than 0.20.
2. The CDH binary needs to be installed and configured on all the nodes to have a CDH-based cluster up and running.
From the Cloudera Manual
You can migrate the data from a CDH3 (or any Apache Hadoop) cluster to a CDH4 cluster by
using a tool that copies out data in parallel, such as the DistCp tool
offered in CDH4.
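For illustration (a sketch; host names, ports, and paths are placeholders), a DistCp copy run from the CDH4 cluster might look like the following; when the source cluster runs an older HDFS version, the source is typically read over hftp:
hadoop distcp hftp://cdh3-namenode:50070/user/data hdfs://cdh4-namenode:8020/user/data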
Other sources
Regarding your second question,
Again from the manual page
Important:
Before proceeding, you need to decide:
As a general rule:
The NameNode and JobTracker run on the same "master" host unless
the cluster is large (more than a few tens of nodes), and the master
host (or hosts) should not
run the Secondary NameNode (if used), DataNode or TaskTracker
services. In a large cluster, it is especially important that the
Secondary NameNode (if used) runs on a separate machine from the
NameNode. Each node in the cluster except the master host(s) should
run the DataNode and TaskTracker services.
Additionally, if you use Cloudera Manager, it will automatically do all the necessary setup, i.e., install the selected components on the nodes in the cluster.
Off-topic: I had a bad habit of not referring to the manual properly. Have a close look at it; it answers all our questions.
Answer to your second question:
You can add them directly after installing a few prerequisites, such as openssh-clients and Java, and configuring the firewalls.
These machines (the existing node and the three new nodes) should accept the same username and password, or you should set up passwordless SSH to these hosts.
You should be connected to the internet while adding the nodes.
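A minimal passwordless-SSH setup (my sketch; user and host names are placeholders) looks like:
ssh-keygen -t rsa                # generate a key pair on the manager host, accept the defaults
ssh-copy-id user@new-node-1      # copy the public key to each new node
ssh-copy-id user@new-node-2
ssh-copy-id user@new-node-3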
I hope it will help you. :)

Redis Cluster - production ready?

I was reading Redis documentation, and I am most interested in the partitioning feature.
Redis documentation states the following:
Data store or cache? Partitioning when using Redis as a data store or
cache is conceptually the same, however there is a huge difference.
While when Redis is used as a data store you need to be sure that a
given key always maps to the same instance, when Redis is used as a
cache if a given node is unavailable it is not a big problem if we
start using a different node, altering the key-instance map as we wish
to improve the availability of the system (that is, the ability of the
system to reply to our queries). Consistent hashing implementations
are often able to switch to other nodes if the preferred node for a
given key is not available. Similarly if you add a new node, part of
the new keys will start to be stored on the new node. The main concept
here is the following: If Redis is used as a cache scaling up and down
using consistent hashing is easy. If Redis is used as a store, we need
to take the map between keys and nodes fixed, and a fixed number of
nodes. Otherwise we need a system that is able to rebalance keys
between nodes when we add or remove nodes, and currently only Redis
Cluster is able to do this, but Redis Cluster is not production ready.
From the last sentence I understand that Redis Cluster is not production-ready. Does anyone know whether this documentation is up to date, or whether Redis Cluster is already production-ready?
[Update] Redis Cluster was released in Redis 3.0.0 on 1 Apr 2015.
Redis Cluster is currently in active development. See this article from the Redis author, antirez.
So I can pause other incremental improvements for a bit to focus on Redis Cluster. Basically my plan is to work mostly to cluster as long as it does not reach beta quality, and for beta quality I mean, something that brave users may put into production.
Redis Cluster will support up to ~1000 nodes.
The first release will have the following features (extracted from Antirez post):
Automatic partition of key space.
Hot resharding.
Only single key operations supported (and it will always be that way).
As of today, antirez is working on the first Redis Cluster client (redis-rb-cluster), intended to be used as a reference implementation.
I'll update this answer as soon as Redis Cluster goes production ready.
[Update] 03/28/2014: Redis Cluster is already used on large clusters in production (source: antirez's tweets).
Today the first Release Candidate for Redis 3.0.0 has been released, which includes a stable version of Clustering: http://redis.io/download.
See also this post by Antirez: http://antirez.com/news/79.
Redis Cluster is included in Redis 3.0.0, released 1 Apr 2015.
--[ Redis 3.0.0 ] Release date: 1 Apr 2015
What's new in Redis 3.0 compared to Redis 2.8?
Redis Cluster: a distributed implementation of a subset of Redis.
https://raw.githubusercontent.com/antirez/redis/3.0/00-RELEASENOTES
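For reference (my addition; the ports follow the Redis Cluster tutorial), a minimal six-node cluster with one replica per master can be created with the redis-trib.rb utility that ships with Redis 3.0:
./redis-trib.rb create --replicas 1 127.0.0.1:7000 127.0.0.1:7001 127.0.0.1:7002 127.0.0.1:7003 127.0.0.1:7004 127.0.0.1:7005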