RocksDB NoClassDefFoundError - how to set up RocksDB for Kafka Streams

I'm attempting to solve a problem using kstreams. I'm currently hitting this error when doing an aggregation.
Exception in thread "main" java.lang.NoClassDefFoundError: org/rocksdb/RocksDBException
    at org.apache.kafka.streams.state.internals.RocksDbWindowBytesStoreSupplier.get(RocksDbWindowBytesStoreSupplier.java:50)
    at org.apache.kafka.streams.state.internals.RocksDbWindowBytesStoreSupplier.get(RocksDbWindowBytesStoreSupplier.java:24)
    at org.apache.kafka.streams.state.internals.WindowStoreBuilder.build(WindowStoreBuilder.java:40)
    at org.apache.kafka.streams.state.internals.WindowStoreBuilder.build(WindowStoreBuilder.java:26)
    at org.apache.kafka.streams.processor.internals.InternalTopologyBuilder$StateStoreFactory.build(InternalTopologyBuilder.java:141)
    at org.apache.kafka.streams.processor.internals.InternalTopologyBuilder.buildProcessorNode(InternalTopologyBuilder.java:966)
    at org.apache.kafka.streams.processor.internals.InternalTopologyBuilder.build(InternalTopologyBuilder.java:869)
    at org.apache.kafka.streams.processor.internals.InternalTopologyBuilder.build(InternalTopologyBuilder.java:822)
    at org.apache.kafka.streams.processor.internals.InternalTopologyBuilder.build(InternalTopologyBuilder.java:805)
    at org.apache.kafka.streams.KafkaStreams.<init>(KafkaStreams.java:667)
    at org.apache.kafka.streams.KafkaStreams.<init>(KafkaStreams.java:624)
    at org.apache.kafka.streams.KafkaStreams.<init>(KafkaStreams.java:534)
My code is effectively this:
KStream<String, InputData> input = builder.stream(topicname);
KTable<Windowed<String>, CustomAgg> grouped =
    input.groupByKey()
        .windowedBy(TimeWindows.of(Duration.ofMillis(60000)))
        .aggregate(
            CustomAgg::new,
            (k, v, agg) -> agg.add(v),
            Materialized.<String, CustomAgg, WindowStore<Bytes, byte[]>>as("aggs")
                .withValueSerde(new CustomAggSerde()));
grouped.toStream().print(Printed.toSysOut());
kafka-streams version 2.1.0
I can't seem to find any resources online on how to set up RocksDB for Kafka Streams, so any advice would be much appreciated. (I have it installed with brew, but I'm not sure how to point to it, whether any setup is needed, whether it needs to be in my pom.xml file, etc.) I'm currently developing on macOS.
Thanks!

You do not need to install RocksDB for Kafka Streams. RocksDB is a dependency of Kafka Streams. If you have Kafka Streams as a dependency in your build automation tool (e.g. maven or gradle), the RocksDB JAR should be automatically downloaded during a build and put onto your class path.
Without a build automation tool you probably need to put the RocksDB JAR on the class path manually. The correct version of RocksDB for Kafka Streams 2.1.0 should be 5.14.2.
The error you get seems to be a class path issue, so maybe it is related to the above.
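One quick way to confirm that this really is a class path problem is to ask the running JVM whether it can locate the class named in the stack trace. The following is only a minimal sketch using the plain JDK; the class name `ClasspathCheck` and the helper `isOnClasspath` are made up for this illustration and are not part of Kafka Streams or RocksDB.

```java
// Hypothetical diagnostic helper, not part of Kafka Streams or RocksDB.
class ClasspathCheck {

    // Returns true if the named class can be located by this class's class loader.
    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Not found, or found but not loadable (e.g. a missing dependency)
            return false;
        }
    }

    public static void main(String[] args) {
        // Class taken from the stack trace in the question
        String name = "org.rocksdb.RocksDBException";
        System.out.println(name + " on class path: " + isOnClasspath(name));
    }
}
```

Run it with the same class path as the failing application. If it reports that the class is missing, the rocksdbjni JAR is not reaching the JVM, and the build or launch configuration is the place to look.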

Try adding the dependency below to your pom.xml:
<dependency>
    <groupId>org.rocksdb</groupId>
    <artifactId>rocksdbjni</artifactId>
    <version>4.9.0</version>
</dependency>
This link might be helpful to you:
https://technology.amis.nl/software-development/java/getting-started-with-kafka-streams-building-a-streaming-analytics-java-application-against-a-kafka-topic/

Related

How do I programmatically install Maven libraries to a cluster using init scripts?

I have been trying for a while now and I'm sure the solution is simple enough; I'm just struggling to find it. I'm pretty new, so go easy on me!
It's a requirement to do this using a premade init script, which is then selected in the UI when configuring the cluster.
I am trying to install com.microsoft.azure:azure-eventhubs-spark_2.12:2.3.18 on a cluster on Azure Databricks. The documentation's example (which installs a PostgreSQL driver) produces an init script using the following command:
dbutils.fs.put("/databricks/scripts/postgresql-install.sh","""
#!/bin/bash
wget --quiet -O /mnt/driver-daemon/jars/postgresql-42.2.2.jar https://repo1.maven.org/maven2/org/postgresql/postgresql/42.2.2/postgresql-42.2.2.jar""", True)
My question is, what is the /mnt/driver-daemon/jars/postgresql-42.2.2.jar section of this code? And what would I have to do to make this work for my situation?
Many thanks in advance.
/mnt/driver-daemon/jars/postgresql-42.2.2.jar is the output path where the jar file will be written. That location makes little sense, though, because the jar won't be on the CLASSPATH there and won't be found by Spark. Jars need to be put into the /databricks/jars/ directory, where Spark picks them up automatically.
However, downloading jars this way only works for jars without dependencies. For libraries like the EventHubs connector this is not the case: they won't work unless their dependencies are downloaded as well. It's better to use the Cluster UI or the Libraries API (or the Jobs API for jobs), because with these methods all dependencies are fetched too.
P.S. Instead of using the EventHubs connector, it's really better to use the Kafka protocol, which EventHubs supports as well. There are several reasons for that:
It's better from a performance standpoint
It's better from a stability standpoint
The Kafka connector is included in DBR, so you don't need to install anything extra
You can read how to use Spark + EventHubs + Kafka connector in the EventHubs documentation.

Ksql sbt dependency maven

Maybe a naive question, but can anyone provide me the sbt dependency for KSQL?
I checked on Maven, but couldn't find any.
Is the dependency hosted somewhere other than Maven? If so, what resolver would I have to add to my build.sbt file?
I'm trying to write a Scala app which uses KSQL to query some Kafka topics to create a dashboard with some metrics.
None of the Confluent dependencies are in Maven Central.
See:
https://docs.confluent.io/current/installation/clients.html#maven-repository-for-jars
And I think this is the KSQL client target
<dependency>
    <groupId>io.confluent.ksql</groupId>
    <artifactId>ksql-engine</artifactId>
</dependency>
Example Java code - https://github.com/confluentinc/ksql/tree/master/ksqldb-examples/src/main/java/io/confluent/ksql/embedded
You don't need to embed KSQL in your code, though. It's meant to run independently on the KSQL Server, to which you can submit queries from code or from the KSQL CLI. In your application, you'd use a regular consumer or the Kafka Streams API directly.
I would suggest trying the new Scala Kafka Streams wrapper, too

How to write from Apache Flink to Elasticsearch

I am trying to connect Flink to Elasticsearch, and when I run the Maven project I get this error:
Or is there another way to do it? I am using this example: https://github.com/keiraqz/KafkaFlinkElastic
The example you linked depends on various Flink modules at different versions, which is strongly discouraged. Try setting them all to the same version and see if that fixes the issue.

Apache Spark and gRPC

I'm writing an application that uses Apache Spark. For communicating with a client, I would like to use gRPC.
In my Gradle build file, I use
dependencies {
    compile('org.apache.spark:spark-core_2.11:1.5.2')
    compile 'org.apache.spark:spark-sql_2.11:1.5.2'
    compile 'io.grpc:grpc-all:0.13.1'
    ...
}
When I leave out gRPC, everything works fine. However, when gRPC is used, I can create the build but not execute it, because the packages pull in different versions of netty. Spark seems to use netty-all, which contains the same methods as the netty that gRPC uses (but with potentially different signatures).
I tried shading (using com.github.johnrengelman.shadow), but somehow it still does not work. How can I approach this problem?
The general solution to this sort of thing is shading with relocation. See the answer to a similar problem with protobuf dependencies: https://groups.google.com/forum/#!topic/grpc-io/ABwMhW9bU34
I think the problem is that Spark uses netty 4.0.x while gRPC uses netty 4.1.0.
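To check a suspected version clash like this, it can help to see which JAR a conflicting class is actually loaded from at runtime. The sketch below uses only the JDK; `JarLocator` and `locationOf` are hypothetical names chosen for this example, and the class to inspect is passed on the command line.

```java
import java.net.URL;
import java.security.CodeSource;
import java.security.ProtectionDomain;

// Hypothetical diagnostic helper; not part of Spark, gRPC, or netty.
class JarLocator {

    // Returns the location (typically a JAR or directory) a class was loaded from,
    // or "<unknown>" when the JVM does not expose one, as can happen for core JDK classes.
    static String locationOf(Class<?> type) {
        ProtectionDomain domain = type.getProtectionDomain();
        CodeSource source = domain == null ? null : domain.getCodeSource();
        URL location = source == null ? null : source.getLocation();
        return location == null ? "<unknown>" : location.toString();
    }

    public static void main(String[] args) throws ClassNotFoundException {
        // Assumption: the fully qualified class name to inspect is the first argument
        String name = args.length > 0 ? args[0] : "java.lang.String";
        System.out.println(name + " loaded from " + locationOf(Class.forName(name)));
    }
}
```

Running it on the application's class path with a netty class such as io.netty.channel.Channel as the argument shows which netty JAR wins. If the winner is not the version gRPC expects, that supports relocating one copy via shading.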

Spring XD on YARN: ver 1.2.1 direct binding support for kafka source

1. I know this is not supported yet (as of ver 1.3.0). Is there a definite date or version for it? That would help our project schedule.
2. Direct binding support for the kafka source is critical for our project. We are in a situation where we may have to abandon Spring XD on YARN entirely just because of this.
Trying to do
stream create --name directkafkatohdfs --definition "kafka | hdfs"
stream deploy directkafkatohdfs --properties "module.*.count=0"
Hitting the exception "must be a positive number. 0-count kafka sources are not currently supported"
I just want to eliminate the use of a message bus/transport (redis/kafka/rabbitMQ) and have a direct binding of source (kafka) and sink (hdfs) in the same YARN container.
Thanks
Satish Srinivasan
satsrinister#gmail.com
Thanks for the interest in Spring XD :).
For Spring XD 1.x, we suggest using composition instead of direct binding with the Kafka bus - or, in your case, the Kafka source. However, apart from that, in Spring XD 1.x it is not possible to create an entire stream without at least one hop over the bus (regardless of the type of bus or modules being used).
We are addressing direct binding (including support for entire directly bound streams) as part of Spring Cloud Data Flow (http://cloud.spring.io/spring-cloud-dataflow/) - which is the next evolution of Spring XD. We are intending to support it as a specific configuration option, rather than as a side-effect of zero-count modules. From an end-user perspective, SCDF supports the same DSL as Spring XD (with minor variations) and has the same administration UI, and definitely supports YARN, so it should be a fairly seamless transition. I would suggest starting to take a look at that. The upcoming 1.0.0.M2 release of Spring Cloud Data Flow will not support direct binding via DSL yet, but the intent is to support it in the final release which is currently planned for Q1 2016.
