I'm trying to set up an application which produces change events with MySQL+Debezium+Kafka. I'd like to consume messages from the Debezium topic with a Quarkus Microprofile application.
I'm using the following configuration on the Quarkus side to capture incoming messages:
mp.messaging.incoming.customers.connector=smallrye-kafka
mp.messaging.incoming.customers.topic=dbserver1.inventory.customers
mp.messaging.incoming.customers.value.deserializer=org.apache.kafka.common.serialization.StringDeserializer
That works, however the change event, when captured with a StringDeserializer, does not just contain the changed record:
{"schema":{"type":"struct","fields":[{"type":"struct","fields":[{"type":"int32","optional":false,"field":"id"},{"type":"string","optional":false,"field":"first_name"},{"type":"string","optional":false,"field":"last_name"},{"type":"string","optional":false,"field":"email"}],"optional":true,"name":"dbserver1.inventory.customers.Value","field":"before"},{"type":"struct","fields":[{"type":"int32","optional":false,"field":"id"},{"type":"string","optional":false,"field":"first_name"},{"type":"string","optional":false,"field":"last_name"},{"type":"string","optional":false,"field":"email"}],"optional":true,"name":"dbserver1.inventory.customers.Value","field":"after"},{"type":"struct","fields":[{"type":"string","optional":false,"field":"version"},{"type":"string","optional":false,"field":"connector"},{"type":"string","optional":false,"field":"name"},{"type":"int64","optional":false,"field":"ts_ms"},{"type":"string","optional":true,"name":"io.debezium.data.Enum","version":1,"parameters":{"allowed":"true,last,false"},"default":"false","field":"snapshot"},{"type":"string","optional":false,"field":"db"},{"type":"string","optional":true,"field":"table"},{"type":"int64","optional":false,"field":"server_id"},{"type":"string","optional":true,"field":"gtid"},{"type":"string","optional":false,"field":"file"},{"type":"int64","optional":false,"field":"pos"},{"type":"int32","optional":false,"field":"row"},{"type":"int64","optional":true,"field":"thread"},{"type":"string","optional":true,"field":"query"}],"optional":false,"name":"io.debezium.connector.mysql.Source","field":"source"},{"type":"string","optional":false,"field":"op"},{"type":"int64","optional":true,"field":"ts_ms"},{"type":"struct","fields":[{"type":"string","optional":false,"field":"id"},{"type":"int64","optional":false,"field":"total_order"},{"type":"int64","optional":false,"field":"data_collection_order"}],"optional":true,"field":"transaction"}],"optional":false,"name":"dbserver1.inventory.customers.Envelope"},"payload":{"before":null,"after":{"id":1005,"first_name":"myname","last_name":"myusername","email":"amail@mail.com"},"source":{"version":"1.3.0.Final","connector":"mysql","name":"dbserver1","ts_ms":1603634203000,"snapshot":"false","db":"inventory","table":"customers","server_id":223344,"gtid":null,"file":"mysql-bin.000003","pos":364,"row":0,"thread":6,"query":null},"op":"c","ts_ms":1603634203419,"transaction":null}}
How can I extract the changed data from this huge JSON? In my case that is:
{"id":1005,"first_name":"myname","last_name":"myusername","email":"amail@mail.com"}
Should I keep using a StringDeserializer, parse the message with JSON-B, and walk through the JSON payload myself? Or is there a better solution?
I don't think there's a better approach for that. However, since the text is JSON, a custom deserializer that extends JsonbDeserializer would work:
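import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.json.Json;
import javax.json.JsonObject;
import javax.json.JsonReader;
import io.quarkus.kafka.client.serialization.JsonbDeserializer;
import io.quarkus.runtime.annotations.RegisterForReflection;

@RegisterForReflection
public class CustomerDeserializer extends JsonbDeserializer<Customer> {
    public CustomerDeserializer() {
        super(Customer.class); // JsonbDeserializer needs the target type
    }
    @Override
    public Customer deserialize(String topic, byte[] data) {
        // Parse the Debezium envelope and keep only payload.after (the new row state)
        try (JsonReader reader = Json.createReader(new StringReader(new String(data, StandardCharsets.UTF_8)))) {
            JsonObject after = reader.readObject().getJsonObject("payload").getJsonObject("after");
            return new Customer(after.getString("first_name"), after.getString("last_name"), after.getString("email"));
        }
    }
}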
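The deserializer above calls new Customer(firstName, lastName, email), but the Customer class itself is not shown in the post. A minimal sketch of what it might look like follows; the field names and the extra no-arg constructor are assumptions, not the author's code. Declare it public in its own file if it lives in a different package than the deserializer.

```java
// Hypothetical Customer class; the original post does not show it.
class Customer {
    private String firstName;
    private String lastName;
    private String email;

    // No-arg constructor so that JSON-B or other frameworks can instantiate the class.
    public Customer() {
    }

    // Matches the three-argument call made in CustomerDeserializer.
    public Customer(String firstName, String lastName, String email) {
        this.firstName = firstName;
        this.lastName = lastName;
        this.email = email;
    }

    public String getFirstName() {
        return firstName;
    }

    public String getLastName() {
        return lastName;
    }

    public String getEmail() {
        return email;
    }

    @Override
    public String toString() {
        return "Customer{" + firstName + " " + lastName + ", " + email + "}";
    }
}
```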
Edit: You can find the full Debezium example here.
I'm very new to Kafka Streams and encountered a problem.
I have two tables - one is for long-term data (descriptions) and the other is for live data (live). They have a common id.
And the idea is to store data from descriptions (presumably in KTable, keep latest description for each id) and when new messages appear in live - join with data from descriptions on corresponding id and send it further.
For simplicity let's just make all types String.
So the basic idea was like in every tutorial I've seen:
interface Processor {
    @Input("live")
    KStream<String, String> live();
    @Input("descriptions")
    KTable<String, String> descriptions();
    @Output("output")
    KStream<String, String> output();
}
And then:
@StreamListener
@SendTo("output")
public KStream<String, String> process(
        @Input("live") KStream<String, String> live,
        @Input("descriptions") KTable<String, String> descriptions) {
    // ...
}
The problem is that the descriptions topic is not suitable for a KTable (null keys, just messages).
So I can't use it as an input, and I can't create any new intermediate topics to store a valid keyed stream derived from it (the cluster is basically read-only for me).
I was searching for some sort of in-memory Binding destination, but to no avail.
The way I thought it could be possible is something like creating an intermediate output that just stores KTable in-memory or something and then using this intermediate as an input in live processing. Like:
@StreamListener("descriptions")
@SendTo("intermediate")
public KTable<String, String> process(@Input("descriptions") KStream<String, String> descriptions) {
    // ...
}
Hope it's possible with this Binding semantics.
I think you can try to introduce an intermediate topic for storing the key/value by introducing an initial processor. Then use that stream as a table for the input in your regular processor. Here are some templates. I am using the new functional model in Spring Cloud Stream to write these processors.
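@Bean
public Function<KStream<String, String>, KStream<String, String>> processDescriptions() {
    return descriptions ->
        descriptions.map((key, value) -> {
            Pojo p = parseIntoPojo(value); // extract the id from the message value
            return new KeyValue<>(p.getId(), value);
        })
        .groupByKey()
        .reduce((v1, v2) -> v2) // keep only the latest description per id
        .toStream();
}
@Bean
public BiFunction<KStream<String, String>, KTable<String, String>, KStream<String, String>> realStream() {
    return (live, descriptions) ->
        // Join each live record with the latest description for the same id.
        // The value combiner is only a placeholder; replace it with your own logic.
        live.join(descriptions, (liveValue, description) -> liveValue + " " + description);
}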
The first processor receives the descriptions as a KStream, enriches each record with a key, and outputs the result as a KStream. Now that this topic has both a key and a value, we can use it as a KTable in the next processor. The next processor is a java.util.function.BiFunction, which receives two inputs and generates one output. The inputs are a KStream and a KTable respectively, and the output is a KStream.
You can set destinations on them as below:
spring.cloud.stream.function.definition=processDescriptions;realStream
spring.cloud.stream.bindings.processDescriptions-in-0.destination=description-topic
spring.cloud.stream.bindings.processDescriptions-out-0.destination=description-table-topic
spring.cloud.stream.bindings.realStream-in-0.destination=live-topic
spring.cloud.stream.bindings.realStream-in-1.destination=description-table-topic
spring.cloud.stream.bindings.realStream-out-0.destination=output
You can achieve the same results by using the StreamListener approach as well.
The downside of this approach is that you need to maintain an extra intermediate topic in Kafka, but if you really want it as a KTable and the underlying information is non-keyed, I don't think there are too many options here.
If you don't need the descriptions as a top-level KTable, you might be able to store this somehow in a state store and later query that store all within a single processor. I haven't tried that out, so you need to play around with that idea. Basically, you get two streams, live and descriptions
(live, descriptions) -> Reduce key/value for descriptions and keep that in a state store.
Then, do the processing on live by joining with what is in the state store.
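The state-store idea above can be modelled in plain Java to make the semantics concrete. This is only a sketch: a HashMap stands in for the Kafka Streams state store, the class and method names are hypothetical, and the way the two values are combined is an arbitrary placeholder. It is not Kafka Streams API code.

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java model only: a HashMap stands in for the Kafka Streams state store.
class DescriptionJoinSketch {
    // Latest description per id, the equivalent of reduce((v1, v2) -> v2).
    private final Map<String, String> store = new HashMap<>();

    // Called for every descriptions record, after the id was extracted from the value.
    void onDescription(String id, String description) {
        store.put(id, description);
    }

    // Called for every live record. Returns the joined value, or null when no
    // description is known for that id yet (an inner join would drop the record).
    String onLive(String id, String liveValue) {
        String description = store.get(id);
        return description == null ? null : liveValue + " | " + description;
    }

    public static void main(String[] args) {
        DescriptionJoinSketch sketch = new DescriptionJoinSketch();
        sketch.onDescription("42", "old description");
        sketch.onDescription("42", "new description");
        System.out.println(sketch.onLive("42", "live-1")); // live-1 | new description
        System.out.println(sketch.onLive("7", "live-2"));  // null
    }
}
```

In a real topology the map would be a persistent key-value store attached to the processor, so the latest descriptions survive restarts.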
Kafka Streams allows various ways to accomplish things like that. Check their reference docs for more info.
Hope this helps.
I have to create a batch job where I need to fetch data from 1 DB and after processing dump that data to another DB where auto generated ID would be assigned to persisted data. I need to send that data along with generated ID to solace queue.
Reader(DB1) --data1--> Processor --data2--> Writer (DB2) --data3--> Writer (Solace Publisher)
I am using spring boot-2.2.5.RELEASE and spring-boot-starter-batch.
I have created a job having 1 step that read data from DB1 and write data to DB2 via RepositoryItemReader and RepositoryItemWriter respectively. This is working fine.
Now next task is to send persisted data having generated ID to solace stream (using spring-cloud-starter-stream-solace).
I have the questions below. Please assist, as I am totally new to Spring Batch.
How can I get the complete record after it's saved to DB2, based on some parameter? Do I have to write my own RepositoryItemWriter with access to the StepExecution context, or can I somehow use the existing RepositoryItemWriter?
Once I have the record, I need to use the Solace stream, where I have a publish method that expects the record to be published as its argument. I think I again need to write my own ItemWriter, and either use the record passed from the RepositoryItemWriter above via the StepExecution context, or query DB2 directly from here based on some parameter?
In either case I need to use the StepExecution context, but can I use the available RepositoryItemWriter or do I have to write my own?
Is there any other concept that is handy in this case instead of the approaches above?
Passing data to future steps is a common pattern in Spring Batch. According to the documentation (https://docs.spring.io/spring-batch/docs/current/reference/html/common-patterns.html#passingDataToFutureSteps), you can use the StepExecution to store and retrieve your generated IDs. In your case the writers also act as listeners, with methods annotated with @BeforeStep. For example:
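import java.util.List;
import org.springframework.batch.core.StepExecution;
import org.springframework.batch.core.annotation.BeforeStep;
import org.springframework.batch.item.ExecutionContext;
import org.springframework.batch.item.ItemWriter;

public class DB2ItemWriter implements ItemWriter<Object> {
    private StepExecution stepExecution;
    @Override
    public void write(List<? extends Object> items) throws Exception {
        // ... persist the items and collect the generated IDs into `ids`
        ExecutionContext stepContext = this.stepExecution.getExecutionContext();
        stepContext.put("generatedIds", ids);
    }
    // Captures the StepExecution before the step starts so that write() can reach its context
    @BeforeStep
    public void saveStepExecution(StepExecution stepExecution) {
        this.stepExecution = stepExecution;
    }
}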
and then you retrieve the IDs in the next writer:
public class SolacePublisherItemWriter implements ItemWriter<Object> {
    private Object generatedIds; // same type that DB2ItemWriter stored under "generatedIds"
    @Override
    public void write(List<? extends Object> items) throws Exception {
        // ...
    }
    @BeforeStep
    public void retrieveGeneratedIds(StepExecution stepExecution) {
        ExecutionContext stepExecutionContext = stepExecution.getExecutionContext();
        this.generatedIds = stepExecutionContext.get("generatedIds");
    }
}
Regarding "I have created a job having 1 step that read data from DB1 and write data to DB2 via RepositoryItemReader and RepositoryItemWriter respectively. This is working fine.":
I would add a second step that reads the data from the table (in which the records have been persisted by step 1 and have had their IDs generated) and pushes it to Solace using a custom writer.
I am new to the functional paradigm and wondering how to go about running a query before creating a new object.
@Override
public Mono<Order> create(CreateOrderRequest specs) {
    // itemRepository.findAll(specs.getItemCodes()) // returns Flux<Item>
    final Order newOrder = new Order(items);
    return orderRepository.insert(newOrder)
        .switchIfEmpty(Mono.error(new ResponseStatusException(HttpStatus.BAD_REQUEST, "Failed to create order")));
}
How do I chain the commented-out code in a non-blocking way? The query returns a Flux<Item>, while the Order constructor requires a List<Item>.
You can use the collectList() method, which will change your Flux<Item> into a Mono<List<Item>>.
After that, you can use the map() method to convert your List<Item> into an Order object, and the flatMap() method to get the saved result.
For example:
return itemRepository
    .findAll(specs.getItemCodes())
    .collectList()
    .map(Order::new)
    .flatMap(orderRepository::insert)
    .switchIfEmpty(Mono.error(new ResponseStatusException(HttpStatus.BAD_REQUEST, "Failed to create order")));
Can fields grouping be done on tuples emitted by a Kafka spout? If yes, how does Storm get to know the fields in a Kafka record?
Fields grouping (and grouping in general) in Storm is declared on bolts, not on spouts. This is done via the InputDeclarer class.
When you call setBolt() on TopologyBuilder, an InputDeclarer is returned.
The Kafka spout declares its output fields like any other component. My explanation is based on the current implementation of KafkaSpout.
In the KafkaSpout.java class, the declareOutputFields method calls the getOutputFields() method of the scheme held by KafkaConfig:
@Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
    declarer.declare(_spoutConfig.scheme.getOutputFields());
}
By default, KafkaConfig uses RawMultiScheme, which implements this method as follows:
@Override
public Fields getOutputFields() {
    return new Fields("bytes");
}
So what does this mean? If you declare a bolt that reads tuples from KafkaSpout with fieldsGrouping on "bytes", every tuple with an equal value in the "bytes" field is processed by the same task. If you want to emit other fields, you should implement a new scheme for your needs.
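To illustrate why equal field values end up on the same task, here is a simplified plain-Java model of fields grouping: hash the values of the grouping fields and take the result modulo the number of tasks. The class and method names are hypothetical, and this is a sketch of the idea, not Storm's actual implementation.

```java
import java.util.Arrays;
import java.util.List;

// Simplified model of fields grouping; not Storm's actual code.
class FieldsGroupingSketch {
    // Maps the values of the grouping fields to one of numTasks task indexes.
    static int chooseTask(List<?> groupingValues, int numTasks) {
        // deepHashCode hashes array contents, so equal byte[] payloads hash alike.
        int hash = Arrays.deepHashCode(groupingValues.toArray());
        // floorMod keeps the index non-negative even for negative hashes.
        return Math.floorMod(hash, numTasks);
    }

    public static void main(String[] args) {
        int tasks = 4;
        int first = chooseTask(List.of("customer-1"), tasks);
        int second = chooseTask(List.of("customer-1"), tasks);
        System.out.println(first == second); // true
    }
}
```

Different values may still land on the same task; the only guarantee in this model is that equal values never land on different tasks.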
TL;DR
The default implementation of KafkaSpout declares the following output fields in declareOutputFields:
new Fields("topic", "partition", "offset", "key", "value");
So in the topology-building code you can directly do:
topologyBuilder.setSpout(spoutName, mySpout, parallelismHintSpout);
topologyBuilder.setBolt(boltName, myBolt, parallelismHintBolt).fieldsGrouping(spoutName, new Fields("key"));
Details: a quick look into the code shows the following.
In KafkaSpout, declareOutputFields is implemented as follows:
@Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
    RecordTranslator<K, V> translator = kafkaSpoutConfig.getTranslator();
    for (String stream : translator.streams()) {
        declarer.declareStream(stream, translator.getFieldsFor(stream));
    }
}
It gets fields from RecordTranslator interface and its instance is fetched from kafkaSpoutConfig i.e. KafkaSpoutConfig<K, V>. KafkaSpoutConfig<K, V> extends from CommonKafkaSpoutConfig (this is slightly different in 1.1.1 version though). The builder of this returns DefaultRecordTranslator. If you check the Fields in this class implementation, you will find:
public static final Fields FIELDS = new Fields("topic", "partition", "offset", "key", "value");
So we can use Fields("key") directly in fields grouping in topology code:
topologyBuilder.setBolt(boltName, myBolt, parallelismHintBolt).fieldsGrouping(spoutName, new Fields("key"));