I have a question regarding key deserialization in Kafka Streams. Specifically, I use Kafka Connect with the Debezium connector to read data from a Postgres table. The data were imported into a Kafka topic, and two Avro schemas were created on the Kafka Schema Registry: one for the key and one for the value (which contains all the columns of the table).
I read these data into a GlobalKTable like this:
properties.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, SpecificAvroSerde.class);
properties.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, SpecificAvroSerde.class);
GlobalKTable<my.namespace.db.Key, my.namespace.db.Value> tableData = builder.globalTable("topic_name");
My issue is that I have a topology where I need to join this GlobalKTable with a KStream as the one below:
SpecificAvroSerde<EventObj> eventsSpecificAvroSerde = new SpecificAvroSerde<>();
eventsSpecificAvroSerde.configure(Collections.singletonMap(KafkaAvroSerializerConfig.SCHEMA_REGISTRY_URL_CONFIG,
        conf.getString("kafka.schema.registry.url")), false);
KStream<Integer, EventObj> events = builder.stream("another_topic_name", Consumed.with(Serdes.Integer(), eventsSpecificAvroSerde));
Note that the Avro schema for my.namespace.db.Key is:
{
  "type": "record",
  "name": "Key",
  "namespace": "my.namespace.db",
  "fields": [
    {
      "name": "id",
      "type": "int"
    }
  ]
}
Obviously the keys of the GlobalKTable and the KStream are different types, and I do not know how to make the join work. I initially tried the following, but it did not work.
events.join(tableData,
        // KeyValueMapper: convert the Integer key of the KStream into the Avro Key
        // object used by the GlobalKTable so that the join can match records
        (key, val) -> my.namespace.db.Key.newBuilder().setId(key).build(),
        // ValueJoiner
        (ev, tData) -> ... );
The output I get is the following. I can see a WARN about one of my joined topics (which seems suspicious), but there is no other output: no joined entities appear, as if there were nothing to consume.
INFO [Consumer clientId=kafka-streams-0401c29c-30a9-4969-93f9-5a83b3c834b4-StreamThread-1-consumer, groupId=kafka-streams] (Re-)joining group (org.apache.kafka.clients.consumer.internals.AbstractCoordinator:336)
INFO stream-thread [kafka-streams-0401c29c-30a9-4969-93f9-5a83b3c834b4-StreamThread-1-consumer] Assigned tasks to clients as {0401c29c-30a9-4969-93f9-5a83b3c834b4=[activeTasks: ([0_0]) standbyTasks: ([]) assignedTasks: ([0_0]) prevActiveTasks: ([]) prevAssignedTasks: ([]) capacity: 1]}. (org.apache.kafka.streams.processor.internals.StreamPartitionAssignor:341)
WARN [Consumer clientId=kafka-streams-0401c29c-30a9-4969-93f9-5a83b3c834b4-StreamThread-1-consumer, groupId=kafka-streams] The following subscribed topics are not assigned to any members: [my-topic] (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator:241)
INFO [Consumer clientId=kafka-streams-0401c29c-30a9-4969-93f9-5a83b3c834b4-StreamThread-1-consumer, groupId=kafka-streams] Successfully joined group with generation 1 (org.apache.kafka.clients.consumer.internals.AbstractCoordinator:341)
INFO [Consumer clientId=kafka-streams-0401c29c-30a9-4969-93f9-5a83b3c834b4-StreamThread-1-consumer, groupId=kafka-streams] Setting newly assigned partitions [mip-events-2-0] (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator:341)
INFO stream-thread [kafka-streams-0401c29c-30a9-4969-93f9-5a83b3c834b4-StreamThread-1] State transition from PARTITIONS_REVOKED to PARTITIONS_ASSIGNED (org.apache.kafka.streams.processor.internals.StreamThread:346)
INFO KafkaAvroSerializerConfig values:
schema.registry.url = [http://kafka-schema-registry:8081]
auto.register.schemas = true
max.schemas.per.subject = 1000
(io.confluent.kafka.serializers.KafkaAvroSerializerConfig:175)
INFO KafkaAvroDeserializerConfig values:
schema.registry.url = [http://kafka-schema-registry:8081]
auto.register.schemas = true
max.schemas.per.subject = 1000
specific.avro.reader = true
(io.confluent.kafka.serializers.KafkaAvroDeserializerConfig:175)
INFO KafkaAvroSerializerConfig values:
schema.registry.url = [http://kafka-schema-registry:8081]
auto.register.schemas = true
max.schemas.per.subject = 1000
(io.confluent.kafka.serializers.KafkaAvroSerializerConfig:175)
INFO KafkaAvroDeserializerConfig values:
schema.registry.url = [http://kafka-schema-registry:8081]
auto.register.schemas = true
max.schemas.per.subject = 1000
specific.avro.reader = true
(io.confluent.kafka.serializers.KafkaAvroDeserializerConfig:175)
INFO stream-thread [kafka-streams-0401c29c-30a9-4969-93f9-5a83b3c834b4-StreamThread-1] partition assignment took 10 ms.
current active tasks: [0_0]
current standby tasks: []
previous active tasks: []
(org.apache.kafka.streams.processor.internals.StreamThread:351)
INFO stream-thread [kafka-streams-0401c29c-30a9-4969-93f9-5a83b3c834b4-StreamThread-1] State transition from PARTITIONS_ASSIGNED to RUNNING (org.apache.kafka.streams.processor.internals.StreamThread:346)
INFO stream-client [kafka-streams-0401c29c-30a9-4969-93f9-5a83b3c834b4] State transition from REBALANCING to RUNNING (org.apache.kafka.streams.KafkaStreams:346)
Can I make this join work in Kafka Streams?
Note that this works if I read the topic into a KTable and use selectKey on the KStream to convert the key, but I want to avoid the repartition (a sketch of that alternative follows below).
Or is the right approach to import my data from the database in another way, so as to avoid creating Avro objects? If so, how is that possible using Debezium connectors and Kafka Connect with the AvroConverter enabled?
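For reference, here is a minimal sketch of the repartitioning alternative mentioned above, assuming the topic is read into a regular KTable with the same default Avro serdes; the ValueJoiner body is only a placeholder:
// Assumption: "topic_name" is read as a KTable instead of a GlobalKTable,
// relying on the default SpecificAvroSerde configured above.
KTable<my.namespace.db.Key, my.namespace.db.Value> table = builder.table("topic_name");

// Re-key the event stream from Integer to the Avro Key type; because selectKey
// changes the key, Kafka Streams repartitions the stream before the join.
KStream<my.namespace.db.Key, EventObj> rekeyedEvents = events.selectKey(
        (intKey, event) -> my.namespace.db.Key.newBuilder().setId(intKey).build());

// Placeholder ValueJoiner: combine the event and the table row however you need.
KStream<my.namespace.db.Key, String> joined = rekeyedEvents.join(
        table,
        (event, row) -> event.toString() + " / " + row.toString());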
BACKGROUND
I have some time series data that looks like this:
1671407650806101433,GROUPSTATS,group_id=0,last_rx=1671407650694827472,last_rx_local=18:54:10.694827472,time_since_last_rx=111273960,desc=00AL.R24097.-,exch=mdx,pid=12345
Here GROUPSTATS is the name of the type, group_id is either 0 or 1, last_rx, last_rx_local and time_since_last_rx are the data I care about, and desc is the channel descriptor (many per pid).
Telegraf collects this data, which is then scraped by Prometheus, resulting in the following metrics and labels:
metrics:
GROUPSTATS_group_id
GROUPSTATS_last_rx
GROUPSTATS_pid
GROUPSTATS_time
GROUPSTATS_time_since_last_rx
labels:
descriptor
exch
last_rx_local
type
instance (telegraf data)
I then visualise this data in Grafana: Telegraf -> Prometheus -> Grafana.
QUESTION
Why are some of those fields arbitrarily taken as labels and some as metrics? I.e., why is pid a metric but exch a label?
Is there any way to convert metrics to labels?
I want to be able to average time_since_last_rx on a host and show it by group_id, or average time_since_last_rx across a pid. At the moment I can't filter by group_id as it's a separate metric.
My Telegraf config is below; someBashScript.sh removes the keys from the CSV:
commands = [
"someBashScript.sh"
]
data_format = "csv"
csv_header_row_count = 0
csv_column_names = ["time", "type", "group_id", "last_rx", "local_last_rx", "time_since_last_rx", "descriptor", "pid", "exch"]
csv_measurement_column = "type"
csv_timestamp_column = "time"
csv_timestamp_format = "unix_ns"
I need to insert multiple rows into one table in one batch.
In DatabaseClient I found the insert() statement and its using(Publisher objectToInsert) method, which takes multiple objects as an argument. But would it insert them in one batch or not?
Another possible solution is connection.createBatch(), but it has a drawback: I cannot pass my entity object there, and I cannot generate the SQL query from the entity.
So, is it possible to do a batch insert in R2DBC?
There are two questions:
Would DatabaseClient.insert() insert them in one batch or not?
Not as a batch.
Is it possible to create a batch insert in R2DBC (other than with Connection.createBatch())?
No, Connection.createBatch() is currently the only way to create a Batch (a minimal sketch of it follows after the linked issues).
See also these issues:
spring-data-r2dbc#259
spring-framework#27229
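For contrast, here is a minimal sketch of what Connection.createBatch() looks like, assuming an io.r2dbc.spi.ConnectionFactory and the same posts table used below. Note that Batch.add(String) accepts only literal SQL, with no parameter binding, which is why it does not fit entity-based inserts:
import io.r2dbc.spi.Connection;
import io.r2dbc.spi.ConnectionFactory;
import io.r2dbc.spi.Result;
import reactor.core.publisher.Flux;

// Sketch only: run two raw INSERTs in one batch on a bare R2DBC Connection.
// Values are inlined in the SQL because Batch has no bind() support.
Flux<Result> batchInsert(ConnectionFactory connectionFactory) {
    return Flux.usingWhen(
            connectionFactory.create(),                   // acquire a Connection
            connection -> Flux.from(connection
                    .createBatch()
                    .add("INSERT INTO posts (title, content) VALUES ('t1', 'c1')")
                    .add("INSERT INTO posts (title, content) VALUES ('t2', 'c2')")
                    .execute()),
            Connection::close);                           // release it afterwards
}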
There is no direct support so far, but I found it is possible to work with the Connection directly to overcome this barrier; check out this issue: spring-data-r2dbc#259.
The Statement has an add() method that lets you repeat the parameter bindings for each row.
The complete code of my solution can be found here.
return this.databaseClient.inConnectionMany(connection -> {
    var statement = connection.createStatement("INSERT INTO posts (title, content) VALUES ($1, $2)")
            .returnGeneratedValues("id");
    for (var p : data) {
        statement.bind(0, p.getTitle()).bind(1, p.getContent()).add();
    }
    return Flux.from(statement.execute())
            .flatMap(result -> result.map((row, rowMetadata) -> row.get("id", UUID.class)));
});
A test for this method.
@Test
public void testSaveAll() {
    var data = Post.builder().title("test").content("content").build();
    var data1 = Post.builder().title("test1").content("content1").build();
    var result = posts.saveAll(List.of(data, data1)).log("[Generated result]")
            .doOnNext(id -> log.info("generated id: {}", id));
    assertThat(result).isNotNull();
    result.as(StepVerifier::create)
            .expectNextCount(2)
            .verifyComplete();
}
The generated ids are printed as expected in the console.
...
2020-10-08 11:29:19,662 INFO [reactor-tcp-nio-2] reactor.util.Loggers$Slf4JLogger:274 onNext(a3105647-a4bc-4986-9ad4-1e6de901449f)
2020-10-08 11:29:19,664 INFO [reactor-tcp-nio-2] com.example.demo.PostRepositoryTest:31 generated id: a3105647-a4bc-4986-9ad4-1e6de901449f
//.....
2020-10-08 11:29:19,671 INFO [reactor-tcp-nio-2] reactor.util.Loggers$Slf4JLogger:274 onNext(a611d766-f983-4c8e-9dc9-fc78775911e5)
2020-10-08 11:29:19,671 INFO [reactor-tcp-nio-2] com.example.demo.PostRepositoryTest:31 generated id: a611d766-f983-4c8e-9dc9-fc78775911e5
//......
Process finished with exit code 0
I just implemented a Trident DRPC function to process incoming messages, and I am trying to persist the count of processed tuples at the final stage of the topology as a Trident state. Here is my topology:
topology.newDRPCStream("portfolio")
.map(parseMapFunction,
new Fields("portfolioUrn", "portfolioSourceSystem", "portfolioRegion",
"portfolioTimestamp", "portfolioPayload"))
.filter(new FilterNull())
.flatMap(splitMapFunction,
new Fields("portfolioUrn", "portfolioSourceSystem", "portfolioRegion",
"portfolioTimestamp", "strategyCode"))
.parallelismHint(1)
.shuffle()
.each(new Fields("strategyCode"), findMongoTradesFunction,
new Fields("uitid", "id", "sourceSystem", "sourceTransactionTime", "publicationTime",
"tradeVersion", "urn", "riskViewFrom", "riskViewTo", "authorized"))
.parallelismHint(10)
.shuffle()
.filter(tradeFilterFunction)
.parallelismHint(150)
.groupBy(new Fields("uitid"))
.aggregate(
new Fields("portfolioUrn", "portfolioTimestamp", "strategyCode", "id", "sourceSystem",
"sourceTransactionTime", "publicationTime", "tradeVersion", "urn", "riskViewFrom",
"riskViewTo", "uitid"), reduceAggregateFunction,
new Fields("portfolioUrn", "portfolioTimestamp", "strategyCode", "id", "sourceSystem",
"sourceTransactionTime", "publicationTime", "tradeVersion", "urn", "riskViewFrom",
"riskViewTo"))
.parallelismHint(200)
.groupBy(new Fields("portfolioUrn"))
.persistentAggregate(stateFactory, new Count(), new Fields("count"));
When I try to submit this topology to Storm, I get this error:
Exception in thread "main" java.lang.RuntimeException: InvalidTopologyException(msg:Component: [b-4] subscribes from non-existent component [$mastercoord-bg0])
at org.apache.storm.StormSubmitter.submitTopologyAs(StormSubmitter.java:273)
at org.apache.storm.StormSubmitter.submitTopology(StormSubmitter.java:387)
at org.apache.storm.StormSubmitter.submitTopology(StormSubmitter.java:159)
at com.citi.tm.portfolio.tps.PortfolioLauncher.main(PortfolioLauncher.java:34)
Caused by: InvalidTopologyException(msg:Component: [b-4] subscribes from non-existent component [$mastercoord-bg0])
at org.apache.storm.generated.Nimbus$submitTopology_result$submitTopology_resultStandardScheme.read(Nimbus.java:8070)
at org.apache.storm.generated.Nimbus$submitTopology_result$submitTopology_resultStandardScheme.read(Nimbus.java:8047)
at org.apache.storm.generated.Nimbus$submitTopology_result.read(Nimbus.java:7981)
at org.apache.storm.thrift.TServiceClient.receiveBase(TServiceClient.java:86)
at org.apache.storm.generated.Nimbus$Client.recv_submitTopology(Nimbus.java:306)
at org.apache.storm.generated.Nimbus$Client.submitTopology(Nimbus.java:290)
at org.apache.storm.StormSubmitter.submitTopologyInDistributeMode(StormSubmitter.java:326)
at org.apache.storm.StormSubmitter.submitTopologyAs(StormSubmitter.java:260)
... 3 more
I can successfully submit the topology if I remove its last two calls, namely:
.groupBy(new Fields("portfolioUrn"))
.persistentAggregate(stateFactory, new Count(), new Fields("count"));
After my aggregation function (aggregate()) runs, I'd like to group the tuples by the 'portfolioUrn' field and persist the count into MongoDB. I don't understand why the final groupBy().persistentAggregate() section causes this error. Could you please help me find the reason?
After some research I found this page, which seems to describe a similar case to mine. Nathan Marz states that DRPC topologies do not support partition persist (as of 2013), and I believe the same applies to my case. I think (though I have not fully verified this) that Storm 1.2.1 DRPC topologies may not support state persistence at all.
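If that is indeed the limitation, one commonly cited Trident pattern, shown here only as a hedged sketch (not verified against this topology; portfolioSpout and stateFactory are assumed names), is to do the persistentAggregate on a regular spout-fed stream and let the DRPC stream merely query that state with stateQuery():
// Persist the per-portfolio counts from a normal (non-DRPC) stream, where
// persistentAggregate is supported, and serve them over DRPC via stateQuery().
// MapGet looks up the grouped key in the MapState built by persistentAggregate.
TridentState counts = topology.newStream("portfolio-spout", portfolioSpout)
        .groupBy(new Fields("portfolioUrn"))
        .persistentAggregate(stateFactory, new Count(), new Fields("count"));

topology.newDRPCStream("portfolio-count")
        .stateQuery(counts, new Fields("args"), new MapGet(), new Fields("count"));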
According to the Project Reactor documentation regarding processors:
direct (DirectProcessor and UnicastProcessor): These processors can only push data through direct user action (calling their Sink's methods directly).
synchronous (EmitterProcessor and ReplayProcessor): These processors can push data both through user action and by subscribing to an upstream Publisher and synchronously draining it.
So a UnicastProcessor shouldn't be able to subscribe to an upstream Publisher. The documentation offers an example of direct user invocation of the Sink:
UnicastProcessor<String> hotSource = UnicastProcessor.create();
Flux<String> hotFlux = hotSource.publish()
.autoConnect()
.map(String::toUpperCase);
hotFlux.subscribe(d -> System.out.println("Subscriber 1 to Hot Source: "+d));
hotSource.onNext("blue");
However, I have tried subscribing a UnicastProcessor directly to a Publisher and it works. This shouldn't be possible according to the documentation. Is the doc wrong, or am I missing something?
In the following example, I'm subscribing the UnicastProcessor to an upstream Flux without any problem:
val latch = CountDownLatch(20)
val numberGenerator: Flux<Long> = counter(1000)
val processor = UnicastProcessor.create<Long>()
val connectableFlux = numberGenerator.subscribeWith(processor)
connectableFlux.subscribe {
logger.info("Element [{}]", it)
}
latch.await()
Log:
12:50:12.193 [main] INFO reactor.Flux.Map.1 - onSubscribe(FluxMap.MapSubscriber)
12:50:12.196 [main] INFO reactor.Flux.Map.1 - request(unbounded)
12:50:13.203 [parallel-1] INFO reactor.Flux.Map.1 - onNext(0)
12:50:13.203 [parallel-1] INFO com.codependent.Test - Element [0]
Yes, it seems this aspect of the documentation is outdated: even DirectProcessor can be used as a Subscriber and will propagate signals to its own subscribers.
NB: You used an EmitterProcessor in your snippet, but it still behaves the same with UnicastProcessor.
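As a small illustration of that point (a self-contained Java sketch of my own, not from the question; note these processor classes are deprecated in later Reactor versions):
import java.time.Duration;
import reactor.core.publisher.DirectProcessor;
import reactor.core.publisher.Flux;

public class ProcessorAsSubscriber {
    public static void main(String[] args) throws InterruptedException {
        // A processor is both a Publisher and a Subscriber, so it can be
        // subscribed to an upstream Flux and relay the signals downstream.
        DirectProcessor<Long> processor = DirectProcessor.create();

        processor.subscribe(v -> System.out.println("Element [" + v + "]"));

        Flux.interval(Duration.ofMillis(100))
            .take(5)
            .subscribe(processor);   // the processor acts as the Subscriber here

        Thread.sleep(1000);          // keep the JVM alive long enough for the demo
    }
}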
I'm trying to create a Kafka sink connector using the Spredfast S3 connector. However, for some reason the log output reports a SourceConnectorConfig:
INFO ConnectorConfig values:
connector.class = com.spredfast.kafka.connect.s3.sink.S3SinkConnector
key.converter = null
name = transactions-s3-sink
tasks.max = 1
transforms = null
value.converter = class org.apache.kafka.connect.storage.StringConverter
(org.apache.kafka.connect.runtime.ConnectorConfig:180)
INFO Creating connector transactions-s3-sink of type com.spredfast.kafka.connect.s3.sink.S3SinkConnector (org.apache.kafka.connect.runtime.Worker:178)
INFO Instantiated connector transactions-s3-sink with version 0.0.1 of type class com.spredfast.kafka.connect.s3.sink.S3SinkConnector (org.apache.kafka.connect.runtime.Worker:181)
INFO Finished creating connector transactions-s3-sink (org.apache.kafka.connect.runtime.Worker:194)
INFO SourceConnectorConfig values:
connector.class = com.spredfast.kafka.connect.s3.sink.S3SinkConnector
key.converter = null
name = transactions-s3-sink
tasks.max = 1
transforms = null
value.converter = class org.apache.kafka.connect.storage.StringConverter
(org.apache.kafka.connect.runtime.SourceConnectorConfig:180)
INFO Finished starting connectors and tasks (org.apache.kafka.connect.runtime.distributed.DistributedHerder:824)
...
INFO Sink task WorkerSinkTask{id=transactions-s3-sink-0} finished initialization and start (org.apache.kafka.connect.runtime.WorkerSinkTask:232)
Why is a SourceConnectorConfig reported, when further on in the log output I can see that a WorkerSinkTask was created?
The reason is that this connector extends the Connector abstract class instead of the SinkConnector abstract class from Connect's API (see the source code here).
Thus, the Connect framework can't tell whether this connector is a source or a sink, and the current logic in the code is: if it's not a sink, assume it's a source. That's why you see this inconsistency.
The solution is for the connector to extend the appropriate abstract class (here, org.apache.kafka.connect.sink.SinkConnector).
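As a rough illustration only (this is not the Spredfast project's actual code; MySinkConnector and MySinkTask are hypothetical names standing in for the connector's existing classes), the fix amounts to changing which class the connector extends:
import java.util.Collections;
import java.util.List;
import java.util.Map;
import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.connect.connector.Task;
import org.apache.kafka.connect.sink.SinkConnector;

// Hypothetical sketch: declaring the connector as a sink by extending
// SinkConnector (instead of the bare Connector base class) lets the Connect
// framework classify it correctly and log SinkConnectorConfig values.
public class MySinkConnector extends SinkConnector {

    private Map<String, String> configProps;

    @Override
    public void start(Map<String, String> props) {
        this.configProps = props;
    }

    @Override
    public Class<? extends Task> taskClass() {
        return MySinkTask.class;   // hypothetical SinkTask implementation
    }

    @Override
    public List<Map<String, String>> taskConfigs(int maxTasks) {
        // Every task receives the same configuration in this sketch.
        return Collections.nCopies(maxTasks, configProps);
    }

    @Override
    public void stop() {
        // nothing to release in this sketch
    }

    @Override
    public ConfigDef config() {
        return new ConfigDef();
    }

    @Override
    public String version() {
        return "0.0.1";
    }
}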