How to connect to multiple Cassandra nodes in different DCs - spring-boot

I'm setting up an application in which I use a Spark session to read data from Cassandra. I am able to read the data if I pass one Cassandra node from a single DC.
But how can I connect to 3 different Cassandra nodes that belong to 3 different DCs in the Spark session?
Here is the code I am using:
Spark session:
spark = SparkSession.builder().appName("SparkCassandraApp")
        .config("spark.cassandra.connection.host", cassandraContactPoints)
        .config("spark.cassandra.connection.port", cassandraPort)
        .config("spark.cassandra.auth.username", userName)
        .config("spark.cassandra.auth.password", password)
        .config("spark.dynamicAllocation.enabled", "false")
        .config("spark.shuffle.service.enabled", "false")
        .master("local[4]").getOrCreate();
Property file:
spring.data.cassandra.contact-points=cassandra1ofdc1, cassandra2ofdc2, cassandra3ofdc3
spring.data.cassandra.port=9042
When I try the above scenario, I get the following exception:
Caused by:
java.lang.IllegalArgumentException: requirement failed: Contact points contain multiple data centers: dc1, dc2, dc3
Any help would be appreciated.
Thanks in advance.

The Spark Cassandra Connector (SCC) only allows using nodes from the local data center, which is either defined by the spark.cassandra.connection.local_dc configuration parameter or determined from the DC of the contact point(s) (this is done by the function LocalNodeFirstLoadBalancingPolicy.determineDataCenter). SCC will never use nodes from other DCs.
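As a minimal sketch based on the builder from the question (the contact point and the DC name "dc1" below are placeholders for your environment), pass only nodes from one data center and name that DC explicitly:

spark = SparkSession.builder().appName("SparkCassandraApp")
        // contact point(s) from a single DC only
        .config("spark.cassandra.connection.host", "cassandra1ofdc1")
        .config("spark.cassandra.connection.port", cassandraPort)
        // tell the connector which DC is "local"
        .config("spark.cassandra.connection.local_dc", "dc1")
        .config("spark.cassandra.auth.username", userName)
        .config("spark.cassandra.auth.password", password)
        .master("local[4]").getOrCreate();

If you need data from all three DCs, one option is to run separate reads, each configured with its own contact points and local_dc, rather than mixing contact points from different DCs in a single connection.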

Related

Getting NoNodeAvailableException running multiple tests in Spring Boot + Cassandra

I just upgraded to Spring Boot 2.7 on JDK 19 and decided to use the Cassandra Bitnami 3 image running in Docker for my JUnit 5 tests. The error I'm getting is "No node was available to execute the query", and it happens for the same test cases every time.
No node was available to execute the query; nested exception is \
com.datastax.oss.driver.api.core.NoNodeAvailableException: \
No node was available to execute the query
Here is the code I'm using to connect:
var loader = DriverConfigLoader.programmaticBuilder()
        .withDuration(DefaultDriverOption.REQUEST_TIMEOUT, Duration.ofMinutes(1))
        .withString(DefaultDriverOption.LOAD_BALANCING_POLICY_CLASS,
                DcInferringLoadBalancingPolicy.class.getName())
        .build();
if (session == null || session.isClosed()) {
    var host = System.getenv("CASSANDRA_HOST") == null ? "localhost" : System.getenv("CASSANDRA_HOST");
    var username = "localhost".equals(host) ? "" : "cassandra";
    var password = "localhost".equals(host) ? "" : "cassandra";
    LOG.info("Cassandra host '{}'.", host);
    LOG.info("Cassandra username '{}'.", username);
    LOG.info("Cassandra password '{}'.", password);
    var sessionBuilder = new CqlSessionBuilder()
            .addContactPoint(new InetSocketAddress(host, 9042))
            .withLocalDatacenter("datacenter1")
            .withConfigLoader(loader);
    if (!username.isEmpty()) {
        sessionBuilder.withAuthCredentials(username, password);
    }
    session = sessionBuilder.build();
}
It is also important to mention that I have 170+ test cases distributed across different files, and with every file execution I try to clean and repopulate the DB using this code:
session.execute("create keyspace if not exists \"schema_x\" WITH replication = {'class':'SimpleStrategy', 'replication_factor' : 1};");
for (final String stmt : getCassaandraStatementsFromFile(CASSANDRA_SCHEMA_FILE)) {
    session.execute(stmt);
    LOG.info("Cassandra. Executed statement: '{}'.", stmt.replaceAll("\n", ""));
}
The error happens exactly on that create keyspace line. I tried to apply some tuning on my side by adapting the connection loader and using throttling, but it didn't help.
I also checked the local-datacenter value inside the Docker container itself, and it matches mine.
Finally, here is the complete error stack trace in case it is required:
org.springframework.data.cassandra.CassandraConnectionFailureException: \
Query; CQL [com.datastax.oss.driver.internal.core.cql.DefaultSimpleStatement#65b70f9e]; \
No node was available to execute the query; nested exception is \
com.datastax.oss.driver.api.core.NoNodeAvailableException: \
No node was available to execute the query
at org.springframework.data.cassandra.core.cql.CassandraExceptionTranslator.translate(CassandraExceptionTranslator.java:137)
at org.springframework.data.cassandra.core.cql.CassandraAccessor.translate(CassandraAccessor.java:422)
at org.springframework.data.cassandra.core.cql.CqlTemplate.translateException(CqlTemplate.java:764)
at org.springframework.data.cassandra.core.cql.CqlTemplate.query(CqlTemplate.java:300)
at org.springframework.data.cassandra.core.cql.CqlTemplate.query(CqlTemplate.java:320)
at org.springframework.data.cassandra.core.CassandraTemplate.select(CassandraTemplate.java:337)
at org.springframework.data.cassandra.repository.query.CassandraQueryExecution$CollectionExecution.execute(CassandraQueryExecution.java:136)
at
...
Caused by: com.datastax.oss.driver.api.core.NoNodeAvailableException: \
No node was available to execute the query
at com.datastax.oss.driver.api.core.NoNodeAvailableException.copy(NoNodeAvailableException.java:40)
at com.datastax.oss.driver.internal.core.util.concurrent.CompletableFutures.getUninterruptibly(CompletableFutures.java:149)
at com.datastax.oss.driver.internal.core.cql.CqlRequestSyncProcessor.process(CqlRequestSyncProcessor.java:53)
at com.datastax.oss.driver.internal.core.cql.CqlRequestSyncProcessor.process(CqlRequestSyncProcessor.java:30)
at com.datastax.oss.driver.internal.core.session.DefaultSession.execute(DefaultSession.java:230)
at com.datastax.oss.driver.api.core.cql.SyncCqlSession.execute(SyncCqlSession.java:54)
at org.springframework.data.cassandra.core.cql.CqlTemplate.query(CqlTemplate.java:298)
... 39 common frames omitted
I would appreciate your support on this, and thanks in advance.
Spring uses the Cassandra Java driver to connect to Cassandra clusters.
For each query execution, the Java driver generates a query plan, which contains a list of nodes to connect to in order to execute the query. The configured load-balancing policy (DcInferringLoadBalancingPolicy in your case) determines the nodes to be included in the query plan. The query plan only contains nodes which are known to be available, which means the policy will not include nodes which are known to be down or ignored (see Load balancing with the Java driver for details).
In a scenario where ALL the nodes have been marked "down" or "ignored", the driver has no nodes left to connect to, so it throws NoNodeAvailableException. As the error message you posted states, there is literally no node available to execute the query.
The driver marks nodes as "down" or "ignored" when they haven't responded in some time, usually because they are overloaded. Consider throttling the load even further so the nodes are not overloaded.
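As a rough sketch (the limits below are illustrative assumptions, not tuned recommendations), the driver's built-in concurrency-limiting request throttler can be enabled on the same programmatic config loader used in your test code:

var loader = DriverConfigLoader.programmaticBuilder()
        .withDuration(DefaultDriverOption.REQUEST_TIMEOUT, Duration.ofMinutes(1))
        .withString(DefaultDriverOption.LOAD_BALANCING_POLICY_CLASS,
                DcInferringLoadBalancingPolicy.class.getName())
        // cap in-flight requests so the single test node is not overwhelmed;
        // excess requests wait in the queue instead of piling onto the node
        .withString(DefaultDriverOption.REQUEST_THROTTLER_CLASS,
                "ConcurrencyLimitingRequestThrottler")
        .withInt(DefaultDriverOption.REQUEST_THROTTLER_MAX_CONCURRENT_REQUESTS, 32)
        .withInt(DefaultDriverOption.REQUEST_THROTTLER_MAX_QUEUE_SIZE, 10000)
        .build();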
Additionally, schema changes do not follow the same path as regular writes (INSERT, UPDATE, DELETE). Each DDL change (CREATE, ALTER, DROP) is propagated to other nodes via the gossip protocol so depending on the size of the cluster, it can take some time for all nodes in the cluster to reach schema agreement.
When performing schema changes programmatically, don't fire off changes in quick succession or you risk nodes getting out of sync. Your application should pause after each schema change and check that all nodes have reached schema agreement BEFORE executing the next schema change, for example with a call to isSchemaInAgreement() or asynchronously with checkSchemaAgreementAsync(). For details, see Schema agreement with the Java driver.
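Here is a minimal sketch of that pause-and-check loop, adapted to the statement loop from your test setup (the retry count and sleep interval are illustrative assumptions):

for (final String stmt : getCassaandraStatementsFromFile(CASSANDRA_SCHEMA_FILE)) {
    var rs = session.execute(stmt);
    // agreement status as seen right after this DDL statement
    boolean agreed = rs.getExecutionInfo().isSchemaInAgreement();
    int attempts = 0;
    while (!agreed && attempts++ < 10) {
        try {
            Thread.sleep(500); // give gossip time to propagate the change
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            break;
        }
        agreed = session.checkSchemaAgreement(); // re-check the whole cluster
    }
    if (!agreed) {
        throw new IllegalStateException("No schema agreement after: " + stmt);
    }
    LOG.info("Cassandra. Executed statement: '{}'.", stmt.replaceAll("\n", ""));
}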
As a side note, the default cassandra superuser is not designed for general use. It should only be used to provision another superuser account and then be deleted.
The default cassandra superuser account is also expensive to use since it requires a QUORUM of nodes to authenticate. In contrast, all accounts other than cassandra authenticate with a consistency of ONE. Cheers!
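For illustration (the role name and password below are placeholders):

// run while authenticated as the default cassandra superuser
session.execute("CREATE ROLE dba WITH SUPERUSER = true AND LOGIN = true AND PASSWORD = 'change_me'");
// then reconnect with a session authenticated as the new 'dba' role and retire the default account:
// dbaSession.execute("DROP ROLE cassandra");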

Clustered NIFI, Only one node is working

I'm using NiFi in clustered mode with two nodes, and I have noticed that only one node does all the work.
Any idea why that is, and how can I make nifi2 do some of the processing of the dataflow?
It depends on how data is coming into your cluster. It is up to you as the dataflow designer to create an approach that allows the data to be partitioned across your cluster for processing.
See this post for an overview of strategies to do this:
https://community.hortonworks.com/articles/16120/how-do-i-distribute-data-across-a-nifi-cluster.html

Rethink DB Cross Cluster Replication

I have 3 different pools of clients in 3 different geographical locations.
I need to configure RethinkDB with 3 different clusters and replicate data between them (inserts, updates and deletes). I do not want to use sharding, only replication.
I couldn't find in the documentation whether this is possible, or how to configure multi-cluster replication.
Any help is appreciated.
I think that a multi-cluster setup is just the same as a single cluster with nodes in different data centers.
First, you need to set up a cluster; follow this document: http://www.rethinkdb.com/docs/start-a-server/#a-rethinkdb-cluster-using-multiple-machines
Basically, use the command below to join a node to the cluster:
rethinkdb --join IP_OF_FIRST_MACHINE:29015 --bind all
Once you have your cluster set up, the rest is easy. Go to the admin UI, select the table, and under "Sharding and replication" click Reconfigure and enter how many replicas you want; just keep shards at 1.
You can also read more about Sharding and Replication at http://rethinkdb.com/docs/sharding-and-replication/#sharding-and-replication-via-the-web-console
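If you prefer to script it, here is a rough sketch using the RethinkDB Java driver (host, port, table name and replica count are placeholder assumptions); it performs the same reconfiguration as the web console step above:

import com.rethinkdb.RethinkDB;
import com.rethinkdb.net.Connection;

RethinkDB r = RethinkDB.r;
Connection conn = r.connection().hostname("localhost").port(28015).connect();
// keep a single shard, replicate it to 3 servers (one per data center)
r.table("my_table")
    .reconfigure()
    .optArg("shards", 1)
    .optArg("replicas", 3)
    .run(conn);
conn.close();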

How to balance load of HBase while loading file?

I am new to Apache Hadoop. I have an Apache Hadoop cluster of 3 nodes. I am trying to load a file with 4.5 billion records, but it is not getting distributed to all the nodes. The behavior looks like region hotspotting.
I have removed the "hbase.hregion.max.filesize" parameter from the hbase-site.xml config file.
I observed that if I use a 4-node cluster it distributes data to 3 nodes, and if I use a 3-node cluster it distributes to 2 nodes.
I think I am missing some configuration.
Generally, the main issue with HBase is to design rowkeys that are not monotonically increasing.
If they are, only one region server is used at a time:
http://ikaisays.com/2011/01/25/app-engine-datastore-tip-monotonically-increasing-values-are-bad/
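As a rough illustration (the key layout and bucket count are assumptions), a common workaround is to prefix each naturally monotonic key, such as a timestamp, with a small hash-derived salt so consecutive writes land on different regions; the number of buckets should line up with how the table's regions are pre-split:

// Sketch: spread monotonically increasing keys across N buckets.
public final class SaltedRowKey {
    private static final int SALT_BUCKETS = 10; // keep in sync with the table's pre-split regions

    static String salted(String naturalKey) {
        int bucket = Math.abs(naturalKey.hashCode() % SALT_BUCKETS);
        return bucket + "_" + naturalKey; // e.g. "3_20240101120000"
    }

    public static void main(String[] args) {
        // the same natural key always maps to the same bucket, so reads stay predictable
        System.out.println(salted("20240101120000"));
    }
}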
This is the HBase Reference Guide chapter on rowkey design:
http://hbase.apache.org/book.html#rowkey.design
And one more really good article:
http://hortonworks.com/blog/apache-hbase-region-splitting-and-merging/
In our case, pre-defining the region splits also improved the loading time:
create 'Some_table', { NAME => 'fam'}, {SPLITS=> ['a','d','f','j','m','o','r','t','z']}
Regards
Pawel

Running pig on a multi node Cassandra cluster

I am working on a BI process that will read data from Cassandra, create summaries using MapReduce and write them back to a different keyspace.
Starting with a single node, everything worked as I expected, but when moving to a multi-node setup I am not sure I fully understand the topology and configuration.
I have a setup with 3 nodes. Each has a Cassandra node (version 1.1.9), a DataNode and a TaskTracker (version 0.20.2+923.421-CDH3U5). The NameNode and JobTracker are on a different server. At this point I am trying to run the Pig script from the DataNode server.
The thing I am not sure of is the Pig argument PIG_INITIAL_ADDRESS. I assumed the query would run on all Cassandra nodes, each TaskTracker would only query the local Cassandra node, and the reducer would handle any duplicates. Based on that assumption I thought PIG_INITIAL_ADDRESS should be localhost. But when running the Pig script it fails:
java.io.IOException: Unable to connect to server localhost:9160
My questions are: should the initial address be any one of the Cassandra nodes, and is the map split across the cluster based on Cassandra key partitions (will I get the distribution I need)?
If I were to use Java MapReduce, would I still need to supply the initial address?
Does the current implementation assume Pig is running from a Cassandra node?
The PIG_INITIAL_ADDRESS is the address of one of the Cassandra nodes in your ring. For the Hadoop job to read data from or write data to Cassandra, it just needs to have some properties set. Those properties can also be set in the job properties or in the default Hadoop configuration on the server that you're running the job from. Other than that, it's just like submitting a job to a job tracker.
For more information, I would look at the README in the Cassandra source download under examples/pig. There is a lot of explanation in there as well.
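For the plain Java MapReduce case, here is a sketch (not a complete job setup) of setting those properties through the Cassandra Hadoop helpers that ship with 1.x; the host, keyspace and column family names below are placeholder assumptions:

import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

Job job = new Job(new Configuration(), "summary-job");
Configuration conf = job.getConfiguration();
ConfigHelper.setInputInitialAddress(conf, "cassandra-node-1");   // any node in the ring
ConfigHelper.setInputRpcPort(conf, "9160");
ConfigHelper.setInputPartitioner(conf, "org.apache.cassandra.dht.RandomPartitioner");
ConfigHelper.setInputColumnFamily(conf, "my_keyspace", "my_column_family");
ConfigHelper.setOutputInitialAddress(conf, "cassandra-node-1");
ConfigHelper.setOutputColumnFamily(conf, "summary_keyspace", "summary_table");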
