Using sstableloader to migrate a Cassandra cluster - amazon-ec2

I am trying to migrate a Cassandra cluster from a dedicated DC to EC2. I took a snapshot of the sstables as a start. To load the data I tried sstableloader, but I cannot figure out how to specify the target hosts (I am trying to run sstableloader from one of my Cassandra nodes). Any guidance would be really helpful. I am using Cassandra 1.0.1.

sstableloader in Cassandra 1.0.1 uses the gossip protocol to get the schema and ring information it needs for bulk loading. This means the tool has to listen on the same port as Cassandra itself, so it cannot run on a node that is already running Cassandra (this is not the case for Cassandra > 1.1).
You need to bulk load from another host, or follow the directions here (http://www.datastax.com/docs/1.0/operations/backup_restore) to load the snapshot straight onto the new node.
Alternatively, you can bulk load files that are already on an existing Cassandra node through the JMX "bulkLoad" operation found under the StorageService MBean.
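For the newer releases mentioned above, where the loader no longer joins gossip, the target nodes are passed on the command line with the -d option. The sketch below only assembles and runs that command from Python; the function names, IP addresses and directory are placeholders I made up, and -d is not available in 1.0.1.

```python
import subprocess


def sstableloader_argv(initial_hosts, sstable_dir):
    """Build the argument list for an sstableloader run on a newer release."""
    if not initial_hosts:
        raise ValueError("at least one target node is required")
    # -d takes a comma-separated list of nodes used to discover the ring.
    return ["sstableloader", "-d", ",".join(initial_hosts), sstable_dir]


def run_bulk_load(argv):
    """Run the loader; raises CalledProcessError if it exits non-zero."""
    subprocess.run(argv, check=True)
```

Calling sstableloader_argv(["10.0.0.11", "10.0.0.12"], "/var/tmp/MyKeyspace/MyTable") gives the list to hand to run_bulk_load on a machine that has the snapshot files and can reach the EC2 nodes.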

Related

Is there a framework that can push bulk data?

Our MySQL and Hadoop platforms are deployed in different networks: Hadoop cannot reach MySQL, but MySQL can connect to Hadoop.
So is there a tool that can push bulk data from RDS to HDFS, the way Sqoop fetches data? Or is there any other method that can address this issue?
Any help is appreciated.
By the way, the Hadoop cluster already consumes a Kafka topic from our network; we did that by deploying Flume on both sides and connecting them together. But how can we achieve the same for bulk RDS data?
For moving bulk data from MySQL to Hadoop you have another option: a Spark JDBC connection to the MySQL database that brings the data into the Hadoop cluster.
Sqoop is an older technique for importing data from an RDBMS into a Hadoop cluster. The Spark JDBC solution is built on Spark's in-memory execution engine, which has a well-proven performance record and is reliable to use.
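A minimal PySpark sketch of that idea follows. It assumes the job runs somewhere that can reach both MySQL and HDFS (given the network layout in the question, that would be the MySQL side), that the MySQL Connector/J jar is on Spark's classpath, and that the host, database, table, credentials and HDFS path are all placeholders. The driver class name shown is the Connector/J 8 one and may differ for older connector versions.

```python
def mysql_jdbc_options(host, port, database, table, user, password):
    """Build the option map for Spark's JDBC data source (placeholder values)."""
    return {
        "url": f"jdbc:mysql://{host}:{port}/{database}",
        "dbtable": table,
        "user": user,
        "password": password,
        # Assumption: Connector/J 8 driver class.
        "driver": "com.mysql.cj.jdbc.Driver",
    }


def copy_table_to_hdfs(options, hdfs_path):
    """Read one table over JDBC and write it to HDFS as Parquet."""
    # Imported lazily so the helper above works without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("mysql-to-hdfs").getOrCreate()
    try:
        df = spark.read.format("jdbc").options(**options).load()
        df.write.mode("overwrite").parquet(hdfs_path)
    finally:
        spark.stop()
```

For large tables, Spark's JDBC source also accepts partitionColumn, lowerBound, upperBound and numPartitions options so the read is split across several connections.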

Flink Cluster Performance is much worse than standalone

I use Flink to process HDFS files or local files.
When I use the standalone setup, the server can process the data at 500k/s.
But when I use a cluster, the server can only process the data at 100k/s.
It is so weird, and I cannot figure out what is going on.
I found that when I use a cluster (2 servers), there is always one server that reads and writes data slowly. The Flink cluster is based on Hadoop.
Can anyone help me?

RethinkDB local and cloud clusters connection

I am thinking about my app architecture and want to know whether it is possible at all to create a local cluster for specific tables and connect it with a cloud cluster.
An additional question: is it possible to choose where (on which machine) the shards for a particular table are created, so that I can tell the cloud cluster that this table's shards belong in the local cluster?
For example, I want the table db.localTable to be sharded in the local cluster, to reduce latency and increase performance by running queries locally, while still being able to run queries in the cloud cluster when the local cluster is not accessible. All data should stay consistent between the clusters.
Thanks in advance.
Actually, I've found the solution: to assign specific servers to replicas and shards, you use server tags and apply the changes through ReQL and the table settings. For details, see "RethinkDB - Scaling, sharding and replication" and "RethinkDB - Architecture FAQ".
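As a rough illustration of the server-tag approach, the sketch below builds the arguments for ReQL's reconfigure command and applies them to a table. It assumes the servers were started with tags such as --server-tag local and --server-tag cloud; the tag names, replica counts and helper functions are made up for the example, and r and conn are whatever driver handle and open connection your application already has.

```python
def reconfigure_args(local_tag, cloud_tag, shards=1,
                     local_replicas=1, cloud_replicas=1):
    """Keyword arguments for table.reconfigure() using server tags."""
    return {
        "shards": shards,
        # One replica count per server tag.
        "replicas": {local_tag: local_replicas, cloud_tag: cloud_replicas},
        # Primary replicas are placed on servers carrying this tag.
        "primary_replica_tag": local_tag,
    }


def pin_table(r, conn, db_name, table_name, args):
    """Apply the layout to one table through an existing connection."""
    return r.db(db_name).table(table_name).reconfigure(**args).run(conn)
```

With primary_replica_tag set to the local tag, writes and up-to-date reads for db.localTable are served by the local servers, while the cloud servers hold replicas of the same data.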

system_auth replication in Cassandra

I'm trying to configure authentication on Cassandra. It seems that, because of the replication strategy used for system_auth, user credentials are not replicated to all the nodes in the cluster, so I end up getting "Incorrect credentials" on one node and a successful connection on another.
There is a related question. The answer there says you have to make sure credentials are always on all nodes.
How do I do that? The option offered there is to alter the keyspace so that the replication factor equals the number of nodes in the cluster, then run repair on each node. That is a ton of work if you want your Cassandra cluster to be dynamically scalable. If I add 1 node today and 1 node another day, alter the keyspace replication each time and then keep restarting nodes manually, it will end in some kind of chaos.
An hour of googling led me to a brief mention of EverywhereStrategy, but I don't see it listed as available anywhere in the docs. How do people configure APIs to work with Cassandra authentication, then, if you can't be sure that your user is actually present on the node you specify as the contact point?
Obviously, I am talking about true scale, where you can change the size of the cluster without restarting each node.
When you enable authentication in Cassandra, then yes, you have to increase the replication_factor of the system_auth keyspace to N (the total number of nodes) and run a complete repair, but you don't need to restart the nodes after you add a new node.
If the repair takes too long, you can optimize it, for example by repairing only the system_auth keyspace:
nodetool repair system_auth
(or)
nodetool repair -pr system_auth
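The keyspace change and the repair described above could be scripted roughly as follows. This is a sketch that assumes system_auth uses SimpleStrategy and that the replication factor should equal the node count; the function name is made up, and a multi-datacenter cluster would use NetworkTopologyStrategy with per-datacenter counts instead.

```python
def system_auth_statements(node_count):
    """Return the CQL statement and repair command for a cluster of node_count nodes."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    cql = (
        "ALTER KEYSPACE system_auth WITH replication = "
        "{'class': 'SimpleStrategy', 'replication_factor': %d};" % node_count
    )
    # Run this on each node after the ALTER so existing credentials are copied.
    repair = ["nodetool", "repair", "system_auth"]
    return cql, repair
```

The CQL string can be executed once through cqlsh or a driver session; the repair command then has to run on every node.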
According to the Cassandra recommendations, a complete repair should be run regularly. For more details on repair, see the links below:
http://www.datastax.com/dev/blog/repair-in-cassandra
https://www.pythian.com/blog/effective-anti-entropy-repair-cassandra/
http://docs.datastax.com/en/archived/cassandra/2.2/cassandra/tools/toolsRepair.html
Answering your questions:
Question: How do people configure APIs to work with Cassandra authentication, then, if you can't be sure that your user is actually present on the node you specify as the contact point?
Answer: I'm using Cassandra 2.2 and the Astyanax Thrift API from my Spring project, and with that I am able to handle Cassandra authentication effectively. Which version of Cassandra are you using, and which driver do you use to connect: the CQL driver or the Astyanax Thrift API?
Question: Obviously, I am talking about true scale, where you can change the size of the cluster without restarting each node.
Answer: Yes, you can scale your Cassandra cluster without restarting nodes. Please check the DataStax documentation for Cassandra 2.2:
http://docs.datastax.com/en/archived/cassandra/2.2/cassandra/operations/opsAddNodeToCluster.html
Check the DataStax docs for the version you are using.

Storm-ZooKeeper transaction logs extremely large

I'm using a ZooKeeper cluster (3 machines) for my Storm cluster (4 machines). The problem is that, because of the topologies deployed on the Storm cluster, the ZooKeeper transaction logs grow extremely large and fill the ZooKeeper disk. What is really strange is that those logs are not divided into multiple files; instead I have one big transaction log file on every ZooKeeper machine, so the autopurge setting in my ZooKeeper configuration has no effect on those files.
Is there a way to solve this problem from the ZooKeeper side, or can I change the way Storm uses ZooKeeper to minimize the size of those logs?
Note: I'm using ZooKeeper 3.6.4 and Storm 0.9.6.
I was able to resolve this problem by using Pacemaker to process heartbeats from workers instead of ZooKeeper. That allowed me to avoid writing to the ZooKeeper disk in order to maintain consistency, and to use an in-memory store instead. To be able to use Pacemaker I upgraded to Storm 1.0.2.
