Getting NoNodeAvailableException running multiple tests in Spring Boot + Cassandra - spring-boot

I just upgraded to Spring Boot 2.7 on JDK 19 and decided to run Cassandra (Bitnami 3) in Docker for my JUnit 5 tests. The error I'm getting is "No node was available to execute the query", and it happens to the same test cases every time.
No node was available to execute the query; nested exception is \
com.datastax.oss.driver.api.core.NoNodeAvailableException: \
No node was available to execute the query
Here is the code I'm using to connect:
var loader = DriverConfigLoader.programmaticBuilder()
        .withDuration(DefaultDriverOption.REQUEST_TIMEOUT, Duration.ofMinutes(1))
        .withString(DefaultDriverOption.LOAD_BALANCING_POLICY_CLASS,
                DcInferringLoadBalancingPolicy.class.getName())
        .build();
if (session == null || session.isClosed()) {
    var host = System.getenv("CASSANDRA_HOST") == null ? "localhost" : System.getenv("CASSANDRA_HOST");
    var username = "localhost".equals(host) ? "" : "cassandra";
    var password = "localhost".equals(host) ? "" : "cassandra";
    LOG.info("Cassandra host '{}'.", host);
    LOG.info("Cassandra username '{}'.", username);
    LOG.info("Cassandra password '{}'.", password);
    var sessionBuilder = new CqlSessionBuilder()
            .addContactPoint(new InetSocketAddress(host, 9042))
            .withLocalDatacenter("datacenter1")
            .withConfigLoader(loader);
    if (!username.isEmpty()) {
        sessionBuilder.withAuthCredentials(username, password);
    }
    session = sessionBuilder.build();
}
It's also important to mention that I have 170+ test cases distributed across different files, and with every file execution I clean and repopulate the DB using this code:
session.execute("create keyspace if not exists \"schema_x\" WITH replication = {'class':'SimpleStrategy', 'replication_factor' : 1};");
for (final String stmt : getCassandraStatementsFromFile(CASSANDRA_SCHEMA_FILE)) {
    session.execute(stmt);
    LOG.info("Cassandra. Executed statement: '{}'.", stmt.replaceAll("\n", ""));
}
The error happens exactly on that create keyspace line. I tried some tuning on my side by adapting the connection loader and using throttling, but it didn't help.
I also checked the local-datacenter value inside Docker itself, and it matches mine.
Finally, here is the complete error stack trace in case it is required:
org.springframework.data.cassandra.CassandraConnectionFailureException: \
Query; CQL [com.datastax.oss.driver.internal.core.cql.DefaultSimpleStatement#65b70f9e]; \
No node was available to execute the query; nested exception is \
com.datastax.oss.driver.api.core.NoNodeAvailableException: \
No node was available to execute the query
at org.springframework.data.cassandra.core.cql.CassandraExceptionTranslator.translate(CassandraExceptionTranslator.java:137)
at org.springframework.data.cassandra.core.cql.CassandraAccessor.translate(CassandraAccessor.java:422)
at org.springframework.data.cassandra.core.cql.CqlTemplate.translateException(CqlTemplate.java:764)
at org.springframework.data.cassandra.core.cql.CqlTemplate.query(CqlTemplate.java:300)
at org.springframework.data.cassandra.core.cql.CqlTemplate.query(CqlTemplate.java:320)
at org.springframework.data.cassandra.core.CassandraTemplate.select(CassandraTemplate.java:337)
at org.springframework.data.cassandra.repository.query.CassandraQueryExecution$CollectionExecution.execute(CassandraQueryExecution.java:136)
at
...
Caused by: com.datastax.oss.driver.api.core.NoNodeAvailableException: \
No node was available to execute the query
at com.datastax.oss.driver.api.core.NoNodeAvailableException.copy(NoNodeAvailableException.java:40)
at com.datastax.oss.driver.internal.core.util.concurrent.CompletableFutures.getUninterruptibly(CompletableFutures.java:149)
at com.datastax.oss.driver.internal.core.cql.CqlRequestSyncProcessor.process(CqlRequestSyncProcessor.java:53)
at com.datastax.oss.driver.internal.core.cql.CqlRequestSyncProcessor.process(CqlRequestSyncProcessor.java:30)
at com.datastax.oss.driver.internal.core.session.DefaultSession.execute(DefaultSession.java:230)
at com.datastax.oss.driver.api.core.cql.SyncCqlSession.execute(SyncCqlSession.java:54)
at org.springframework.data.cassandra.core.cql.CqlTemplate.query(CqlTemplate.java:298)
... 39 common frames omitted
I would appreciate your support on this, and thanks in advance

Spring uses the Cassandra Java driver to connect to Cassandra clusters.
For each query execution, the Java driver generates a query plan which contains a list of nodes to contact to execute the query. The configured load-balancing policy (DcInferringLoadBalancingPolicy in your case) determines which nodes are included in the query plan, and it will only include nodes which are known to be available; it will not include nodes which are known to be down or ignored (see Load balancing with the Java driver for details).
In the scenario where ALL the nodes have been marked "down" or "ignored", the driver has no nodes left to connect to, so it throws NoNodeAvailableException. As the error message you posted states, there is literally no node available to execute the query.
The driver marks nodes as "down" or "ignored" when they haven't responded for some time, usually because they are overloaded. Consider throttling the load even further so the nodes are not overloaded.
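For reference, a request-throttling sketch using the driver's built-in RateLimitingRequestThrottler in the 4.x HOCON configuration format; the numeric values here are illustrative placeholders that you would tune for your cluster:

```hocon
# application.conf (java-driver 4.x); values are examples, tune to your cluster
datastax-java-driver {
  advanced.throttler {
    class = RateLimitingRequestThrottler
    max-requests-per-second = 500
    max-queue-size = 10000
    drain-interval = 10 milliseconds
  }
}
```

The same options can be set programmatically via DriverConfigLoader.programmaticBuilder(), which would fit the connection code you already have.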
Additionally, schema changes do not follow the same path as regular writes (INSERT, UPDATE, DELETE). Each DDL change (CREATE, ALTER, DROP) is propagated to the other nodes via the gossip protocol, so depending on the size of the cluster, it can take some time for all nodes in the cluster to reach schema agreement.
When performing schema changes programmatically, don't fire off changes in quick succession or you risk nodes getting out of sync. Your application should pause after each schema change and check that all nodes have reached schema agreement BEFORE executing the next schema change, for example with a call to isSchemaInAgreement() or asynchronously with checkSchemaAgreementAsync(). For details, see Schema agreement with the Java driver.
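A minimal sketch of that pause-and-check loop. The helper below polls any schema-agreement check until it succeeds or a timeout elapses; with a real session you would pass the driver's isSchemaInAgreement() mentioned above as the check. The class and method names here are hypothetical, not part of the driver:

```java
import java.util.function.BooleanSupplier;

public class SchemaAgreement {

    // Polls the agreement check until it returns true or the timeout elapses.
    // Returns false if agreement was not reached within timeoutMillis.
    public static boolean awaitAgreement(BooleanSupplier inAgreement,
                                         long timeoutMillis,
                                         long pollIntervalMillis) throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (!inAgreement.getAsBoolean()) {
            if (System.currentTimeMillis() >= deadline) {
                return false; // gave up waiting
            }
            Thread.sleep(pollIntervalMillis);
        }
        return true;
    }

    public static void main(String[] args) throws InterruptedException {
        // With a real driver session you would call, per DDL statement:
        //   awaitAgreement(session::isSchemaInAgreement, 10_000, 200);
        // Here a stub check stands in for the session:
        System.out.println(awaitAgreement(() -> true, 1_000, 50)); // prints true
    }
}
```

You would call the helper after each session.execute(ddlStatement) in your schema-population loop, before executing the next statement.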
As a side note, the default cassandra superuser is not designed for general use. It should only ever be used to provision another superuser account and then be deleted.
Use of the default cassandra superuser account is also expensive, since it requires a QUORUM of nodes to authenticate. In contrast, all accounts other than cassandra authenticate with a consistency of ONE. Cheers!
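A sketch of that provisioning step in CQL; the role name and password are placeholders:

```sql
-- Run while logged in as the default superuser; role name and password are examples only
CREATE ROLE admin WITH SUPERUSER = true AND LOGIN = true AND PASSWORD = 'choose-a-strong-password';

-- Then log back in as the new role and remove the default account:
DROP ROLE cassandra;
```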

Related

How to connect to multiple Cassandra in different dc

I'm setting up an application in which I am using a Spark session to read data from Cassandra. I am able to read the data from Cassandra if I pass one Cassandra node from a DC.
But how can I connect to 3 different Cassandra nodes which belong to 3 different DCs in the Spark session?
Here is the code I am using:
spark session
spark = SparkSession.builder().appName("SparkCassandraApp")
.config("spark.cassandra.connection.host", cassandraContactPoints)
.config("spark.cassandra.connection.port", cassandraPort)
.config("spark.cassandra.auth.username", userName).config("spark.cassandra.auth.password", password)
.config("spark.dynamicAllocation.enabled", "false").config("spark.shuffle.service.enabled", "false")
.master("local[4]").getOrCreate();
property file :
spring.data.cassandra.contact-points=cassandra1ofdc1, cassandra2ofdc2, cassandra3ofdc3
spring.data.cassandra.port=9042
When I try the above scenario, I get the following exception:
Caused by:
java.lang.IllegalArgumentException: requirement failed: Contact points contain multiple data centers: dc1, dc2, dc3
Any help would be appreciated
Thanks in advance.
The Spark Cassandra Connector (SCC) only uses nodes from the local data center, which is either defined by the spark.cassandra.connection.local_dc configuration parameter or determined from the DC of the contact point(s) (this is performed by the function LocalNodeFirstLoadBalancingPolicy.determineDataCenter). SCC will never use nodes from other DCs...
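In practice that means pinning the connection to a single DC. A sketch, reusing the host and DC names from the question (adjust to your environment):

```properties
# Pass contact point(s) from ONE data center only, and/or name that DC explicitly
spark.cassandra.connection.host=cassandra1ofdc1
spark.cassandra.connection.local_dc=dc1
```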

Distributing data read from GetMongo in a nifi cluster

I have a clustered NiFi setup and we are running the GetMongo processor with Primary mode on, so that duplicate data is not fetched. This seems to be working fine. However, once I have this data, I want the subsequent processors in the chain to run on the cluster, i.e. the fetched data should be processed in parallel. Somehow this is not happening. So my questions are below, assuming GetMongo has fetched 30000 records and they are in the queue:
1) How do I check whether a processor is running its process on a single node or on all nodes? The config has been set to all nodes, but when the processor is running I see it displays 1 in the top right corner.
2) If one processor has been set to run only on primary node, do all other processors in the flow also run on Primary mode?
Example:
In the screenshot above, my GetMongo is running on the primary node. How do I make sure that the ExecuteScript processor runs in parallel on all 3 NiFi nodes? As of now, if I check View Status History on the ExecuteScript processor, I see data flowing only through the primary node.
Yes, that's correct. When you mark the source processor to run only on the Primary Node, all the subsequent steps will happen on that node alone, since the data resides only on that node (the primary node), even when you have NiFi in clustered mode. To make it work the way you want, you can follow either of these two approaches:
Approach #1 : Combination of RPG and Site-To-Site
Here your flow will look like this:
Create an Input Port on the Root Group (the very top level of the NiFi canvas)
Make GetMongo run only on Primary Node.
Connect the success relationship of the processor to a Remote Process Group (RPG). Configure this RPG with the cluster details and point it at the port you added in step #1.
From the input port, connect it to your processing logic.
Useful Links:
https://pierrevillard.com/2017/02/23/listfetch-pattern-and-remote-process-group-in-apache-nifi/
This is cumbersome and would make your flow very complex, but this is how it had to be done until NiFi 1.8. With NiFi 1.8, you can use the following approach.
Approach #2 : Load-Balanced Connections (Apache NiFi 1.8+)
Apache NiFi had a new release, 1.8, a week ago. With this release, a new feature (a long-time-coming and much-desired one) was introduced: Load-Balanced Connections.
In this approach, you can simply ignore the RPG/Site-To-Site combination and rather do the following:
Connect the output of your source processor, in this case GetMongo with the subsequent processors.
Right click the success relationship of the source processor.
Click configure
Go to Settings tab
Set the Load Balance Strategy to the desired one, preferably Round robin in your case.
Useful Links:
https://blogs.apache.org/nifi/entry/load-balancing-across-the-cluster
https://pierrevillard.com/2018/10/29/nifi-1-8-revolutionizing-the-list-fetch-pattern-and-more/

rethinkdb cluster, what if some of servers are down?

I have 5 servers. On my first, "primary", server I have this in the config:
join=ip2:port
join=ip3:port
join=ip4:port
join=ip5:port
I am connecting to RethinkDB via a proxy:
proxy --join ip1:port --join ip2:port
When I stop RethinkDB on ip1, everything stops. I do not know how to solve this; the RethinkDB docs are not complete. Do I have to define these joins in every config?
UPDATE
In fact, when I stop any server in the cluster my app crashes! In the web UI I get something like "Table db.table is available for outdated reads, but not up-to-date reads or writes."
Apart from the table shards, I do not see the point.
Yes, you usually want every node to know the address of every other node so that they can connect to each other if any subset of the nodes is down.
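For example, each server's configuration file would list the other nodes (ports written generically here, as in the question; the hostnames mirror your setup):

```conf
# Example: config on server ip1 (repeat analogously on every server)
join=ip2:port
join=ip3:port
join=ip4:port
join=ip5:port

# And start the proxy with several --join targets, so it survives any single node going down:
# rethinkdb proxy --join ip1:port --join ip2:port --join ip3:port
```

This way, whichever subset of servers is up, every surviving node and the proxy can still find a peer to join.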

Elasticsearch-hadoop & Elasticsearch-spark sql - Tracing of statements scan&scroll

We are trying to integrate ES (1.7.2, 4-node cluster) with Spark (1.5.1, compiled with Hive and Hadoop with Scala 2.11, 4-node cluster). HDFS comes into the equation (Hadoop 2.7, 4 nodes), as well as the Thrift JDBC server and elasticsearch-hadoop-2.2.0-m1.jar.
Thus, there are two ways of executing statements on ES.
Spark SQL with scala
val conf = new SparkConf().setAppName("QueryRemoteES").setMaster("spark://node1:37077").set("spark.executor.memory","2g")
conf.set("spark.logConf", "true")
conf.set("spark.cores.max","20")
conf.set("es.index.auto.create", "false")
conf.set("es.batch.size.bytes", "100mb")
conf.set("es.batch.size.entries", "10000")
conf.set("es.scroll.size", "10000")
conf.set("es.nodes", "node2:39200")
conf.set("es.nodes.discovery","true")
conf.set("pushdown", "true")
sc.addJar("executorLib/elasticsearch-hadoop-2.2.0-m1.jar")
sc.addJar("executorLib/scala-library-2.10.1.jar")
sqlContext.sql("CREATE TEMPORARY TABLE geoTab USING org.elasticsearch.spark.sql OPTIONS (resource 'geo_2/kafkain')" )
val all: DataFrame = sqlContext.sql("SELECT count(*) FROM geoTab WHERE transmittersID='262021306841042'")
.....
Thrift server (code executed on spark)
....
polledDataSource = new ComboPooledDataSource()
polledDataSource.setDriverClass("org.apache.hive.jdbc.HiveDriver")
polledDataSource.setJdbcUrl("jdbc:hive2://node1:30001")
polledDataSource.setMaxPoolSize(5)
dbConnection = polledDataSource.getConnection
dbStatement = dbConnection.createStatement
val dbResult = dbStatement.execute("CREATE TEMPORARY EXTERNAL TABLE IF NOT EXISTS geoDataHive6(transmittersID STRING,lat DOUBLE,lon DOUBLE) STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler' TBLPROPERTIES('es.resource' = 'geo_2/kafkain','es.query'='{\"query\":{\"term\":{\"transmittersID\":\"262021306841042\"}}}','es.nodes'='node2','es.port'='39200','es.nodes.discovery' = 'false','es.mapping.include' = 'trans*,point.*','es.mapping.names' = 'transmittersID:transmittersID,lat:point.lat,lon:point.lon','pushdown' = 'true')")
dbStatement.setFetchSize(50000)
dbResultSet = dbStatement.executeQuery("SELECT count(*) FROM geoDataHive6")
.....
I have the following issues, and since they are connected, I have decided to pack them into one question on Stack Overflow:
It seems that the method using Spark SQL supports pushdown of what goes into the WHERE clause (whether es.query is specified or not); execution time is the same and is acceptable. But solution number 1 definitely does not support pushdown of aggregating functions, i.e. the presented count(*) is not executed on the ES side, but only after all the data is retrieved: ES returns rows and Spark SQL counts them. Please confirm if this is correct behaviour.
Solution number one behaves strangely: whether pushdown is set to true or false, the execution time is the same.
Solution number 2 seems to support no pushdown at all. It does not matter how I try to specify the sub-query, be it part of the table definition or in the WHERE clause of the statement; it seems it just fetches the whole huge index and then does the maths on it. Is it that Thrift/Hive is not able to do pushdown against ES?
I'd like to trace queries in Elasticsearch; I set the following:
//logging.yml
index.search.slowlog: TRACE, index_search_slow_log_file
index.indexing.slowlog: TRACE, index_indexing_slow_log_file
additivity:
index.search.slowlog: true
index.indexing.slowlog: true
All of index.search.slowlog.threshold.query, index.search.slowlog.threshold.fetch and even index.indexing.slowlog.threshold.index are set to 0ms.
And I do see in the slowlog file common statements executed from Sense (so it works), but I don't see Spark SQL or Thrift statements executed against ES. I suppose these are scan&scroll statements, because if I execute scan&scroll from Sense, those are also not logged. Is it possible to somehow trace scan&scroll on the ES side?
As far as I know, it is expected behavior. All sources I know behave exactly the same way, and intuitively it makes sense: Spark SQL is designed for analytical queries, so it makes more sense to fetch the data, cache it and process it locally. See also Does spark predicate pushdown work with JDBC?
I don't think that conf.set("pushdown", "true") has any effect at all. If you want to configure connection-specific settings, they should be passed in the OPTIONS map, as in the second case. Using the es prefix should work as well.
This is strange indeed. Martin Senne reported a similar issue with PostgreSQL but I couldn't reproduce that.
After a discussion I had with Costin Leau on the Elasticsearch discussion group, he pointed out the following, and I ought to share it with you:
There are a number of issues with your setup:
You mention using Scala 2.11 but are using Scala 2.10. Note that if you want to pick your Scala version, elasticsearch-spark should be used, elasticsearch-hadoop provides binaries for Scala 2.10 only.
The pushdown functionality is only available through Spark DataSource. If you are not using this type of declaration, the pushdown is not passed to ES (that's how Spark works). Hence declaring pushdown there is irrelevant.
Notice that how all params in ES-Hadoop start with es. - the only exceptions are pushdown and location which are Spark DataSource specific (following Spark conventions as these are Spark specific features in a dedicated DS)
Using a temporary table does count as a DataSource; however, you need to specify pushdown there. If you don't, it is activated by default, which is why you see no difference between your runs: you haven't changed any relevant param.
Count and other aggregations are not pushed down by Spark. There might be something in the future, according to the Databricks team, but there isn't anything currently. For count, you can do a quick call by using dataFrame.rdd.esCount. But it's an exceptional case.
I'm not sure whether Thrift server actually counts as a DataSource since it loads data from Hive. You can double check this by enabling logging on the org.elasticsearch.hadoop.spark package to DEBUG. You should see whether the SQL does get translated to the DSL.
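Point 4 above can be exercised directly: since pushdown is on by default in the DataSource declaration, flipping it off in the OPTIONS is what would actually change between runs. A sketch, reusing the resource name from the question:

```sql
-- pushdown defaults to true for the ES DataSource; set it explicitly to compare runs
CREATE TEMPORARY TABLE geoTab
USING org.elasticsearch.spark.sql
OPTIONS (resource 'geo_2/kafkain', pushdown 'false');
```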
I hope this helps!

Hive - Multiple clusters pointing to same metastore

We have two clusters, say one old and one new, both on AWS EMR. Hive on these clusters points to the same Hive metastore, which is on RDS. We are migrating from old to new.
Now the question is: if I stop the old cluster, will there be any issue accessing the old tables? All the data is on S3 and all tables are EXTERNAL, but the databases are still on HDFS, like
hdfs://old:1234/user/hive/warehouse/myfirst.db
If I stop the old cluster, will this location become void, making the database invalid, and the tables too, even though they are external?
I am really not sure if this will be an issue, but this is on prod, so I am trying to find out if anyone has already faced it.
Thanks!
As long as all your tables have the LOCATION set to S3, losing the location for the DATABASE/SCHEMA will not impact access to your metadata.
The only impact it will have in your new cluster is that CREATE TABLE statements performed in the custom database ("myfirstdb" in your example) without an explicit LOCATION will fail to reach the default HDFS path, which is inherited from the DATABASE location.
Tables created in the "default" schema will not fail as Hive will resolve the location for the new table to the value of the property "hive.metastore.warehouse.dir", which is "/user/hive/warehouse" in Elastic MapReduce.
Again, this does not affect tables with an explicit LOCATION set at creation time.
In general, to achieve a completely "portable" Metastore what you will want to do is:
Make sure all the TABLES have LOCATION set to S3 (any data in HDFS is obviously bound to the cluster lifecycle).
This can be achieved by:
explicitly setting LOCATION in the CREATE TABLE statement, or
setting LOCATION for all the DATABASES/SCHEMAS (other than 'default') to a path in S3
Optionally (but strongly recommended): use EXTERNAL (user-managed, a.k.a. non-managed) tables to prevent accidental data loss due to DDL statements
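A sketch of what those steps look like in DDL; the bucket and table names are hypothetical:

```sql
-- Database location on S3, so new tables inherit an S3 path by default
CREATE DATABASE myfirst LOCATION 's3://my-bucket/warehouse/myfirst.db';

-- External table with an explicit S3 location
CREATE EXTERNAL TABLE myfirst.events (id STRING, ts BIGINT)
LOCATION 's3://my-bucket/warehouse/myfirst.db/events';
```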
