Correct flatMapValues in Kafka Streams to split a single message based on a field value - apache-kafka-streams

Needing some guidance w.r.t. Kafka Streams splitting.
I have a message whose value has fields like this:
{"name": "val1", "role": "val2"}
The key of the message is a String field which we don't need to worry about here.
When the name field contains multiple values separated by a /, like {"name": "tom/dick/harry", "role": "manager"}, I want to detect those records in my stream, split (or branch) based on that field, and send each resulting message to the output topic. So basically 1 message becomes 3 messages in this case:
{"name": "tom", "role": "manager"}
{"name": "dick", "role": "manager"}
{"name": "harry", "role": "manager"}
and send each of these to the output topic.
I have tried Kafka Streams' flatMapValues() and branch but it doesn't work. I'm just looking for a one-liner or a method I can use to achieve this.
Here is my code:
modifiedStream
    .filter((key, value) -> value.getPerformerName().contains("/"))
    .peek((key, value) -> log.info("Splitting this record to multiple ones..."))
    .flatMapValues(new ValueMapper<work_reg_performer_int, Iterable<?>>() {
        @Override
        public Iterable<?> apply(work_reg_performer_int value) {
            return Arrays.asList(value.getPerformerName().split("/"));
        }
    })
    .to("split_performers_topic");
Here is my consumer's stream config:
consumer:
  keySerde: org.apache.kafka.common.serialization.Serdes$StringSerde
  valueSerde: io.confluent.kafka.streams.serdes.avro.SpecificAvroSerde
  startOffset: earliest
Running this code throws the exception stack below, which I think is because each performer name becomes its own String message without the rest of the record?
org.apache.kafka.streams.errors.StreamsException: ClassCastException while producing data to topic split_performers_topic. A serializer (key: org.apache.kafka.common.serialization.StringSerializer / value: io.confluent.kafka.streams.serdes.avro.SpecificAvroSerializer) is not compatible to the actual key or value type (key type: java.lang.String / value type: java.lang.String). Change the default Serdes in StreamConfig or provide correct Serdes via method parameters (for example if using the DSL, `#to(String topic, Produced<K, V> produced)` with `Produced.keySerde(WindowedSerdes.timeWindowedSerdeFrom(String.class))`).
PS: I am using Spring Cloud Stream for Kafka for this.

Can you update your problem statement with the Streams config values? From the error, your stream is producing records with a key of type String and a value of type String, but the serdes configured for the destination topic expect a String key and an Avro specific-record value.
Also, as per the code snippet you have hardcoded the output topic as "split_performers_topic", but your error is complaining about the producer topic APRADB_work_reg_performer. Not sure where the mismatch is coming from. Kindly check and confirm.
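For what it's worth, the serde mismatch in that stack trace goes away once the values handed to .to() actually match the configured value serde. Below is a minimal sketch (using java.util.stream) of one way to keep the Avro value serde by emitting one full record per performer name; it assumes the generated work_reg_performer_int class has the usual Avro newBuilder()/setPerformerName() methods (an assumption based on the getter shown above):
modifiedStream
    .filter((key, value) -> value.getPerformerName().contains("/"))
    .peek((key, value) -> log.info("Splitting this record to multiple ones..."))
    // fan one record out into one full Avro record per performer name,
    // so the configured SpecificAvroSerde still matches the value type
    .flatMapValues(value -> Arrays.stream(value.getPerformerName().split("/"))
        .map(name -> work_reg_performer_int.newBuilder(value)   // assumed Avro-generated builder
            .setPerformerName(name)
            .build())
        .collect(Collectors.toList()))
    .to("split_performers_topic");
Alternatively, if plain String values are really what you want in the output topic, keep the original flatMapValues and pass matching serdes explicitly, e.g. .to("split_performers_topic", Produced.with(Serdes.String(), Serdes.String())).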

Related

Nested keys in Redis using Spring Boot

I want to run a job in Spring Boot using Quartz where multiple threads will execute the method.
What I want is to save the result in Redis for every processed item, so I can get an idea of how well the job is working.
I want to save the data in Redis in this form:
{
  "2020-04-20": [
    {
      "item_1": {
        "success": "true",
        "message": ""
      }
    },
    {
      "item_2": {
        "success": "true",
        "message": ""
      }
    }
  ]
}
I want to insert all the items under the date key.
Since multiple threads are working, every thread works on some item, so all items should be inserted under a single key (the date).
Is it possible?
One solution is to overwrite the data of the (date) key again and again: first getting the data from Redis, appending the item to it, and saving the key in Redis again.
Is there another way, or some annotation like @Cacheable, @CachePut etc., so that I can create a nested key and the item is automatically appended under the (date) key?
Have you considered RedisJSON?
Something like this (I haven't tested it, I don't have RedisJSON handy):
JSON.SET "2020-04-20" . [] // create the object once
JSON.ARRAPPEND "2020-04-20". '{ // every thread issues a command like this.
"item": {
"success": "true",
"message": "thread 123"
} }'
JSON.ARRAPPEND "2020-04-20". '{ // every thread issues a command like this.
"item": {
"success": "true",
"message": "thread 456"
} }'
JSON.ARRAPPEND is supposed to be atomic.
I solved it using Redis set functionality.
I am using the Jedis client in my project.
It has very useful functions like:
1) sadd => insertion of an element, O(1)
2) srem => deletion of an element from the set, O(1)
3) smembers => getting all members, O(N)
This is what I needed.
In my case the date is the key, and the other details (one JSON object) are the members of the set. So I convert my JSON data to a string when adding a member to the set, and when reading the data I convert it back from string to JSON.
This solved my problem.
Note: there is also list functionality that could be used, but the time complexities for lists are not all O(1). In my case I am sure I will not have duplicates, so a set works for me.
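A minimal sketch of that approach with Jedis (the class name, connection details, and example key/member are just illustrative assumptions):
import redis.clients.jedis.Jedis;
import java.util.Set;

public class DailyResultStore {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {   // hypothetical connection details
            // each worker thread adds its result, already serialised to a JSON string, to the day's set
            String member = "{\"item_1\": {\"success\": \"true\", \"message\": \"\"}}";
            jedis.sadd("2020-04-20", member);

            // later, read all results for that date back; each member is one JSON string
            Set<String> results = jedis.smembers("2020-04-20");
            results.forEach(System.out::println);
        }
    }
}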

Spring Cloud contracts with a generic API

How do I use Spring Cloud contracts with a generic API? I'm asking about REST contracts on the producer service. So consider an example: I have a service which allows storing user data in different formats in a database and acts like a proxy between a service and the database. It has parameters required for all consumers, and parameters which depend on the consumer.
class Request<T> {
    Long requestId;
    String documentName;
    T documentContent;
}
And it has two consumers.
Consumer 1:
{
  "requestId": 1,
  "documentName": "login-events",
  "documentContent": {
    "userId": 2,
    "sessionId": 3
  }
}
Consumer 2:
{
  "requestId": 1,
  "documentName": "user-details",
  "documentContent": {
    "userId": 2,
    "name": "Levi Strauss",
    "age": 11
  }
}
As you can see, documentContent depends on the consumer. I want to write contracts which will check the content of this field on the consumer side and ignore it on the producer side. Options like
"documentContent": ["age": $(consumer(11))] //will produce .field(['age']").isEqualTo(11)
and
"documentContent": ["age": $(consumer(11), producer(optional(anInteger())))] //will require field presence
didn't work. Of course I could write "documentContent": [] or even ignore this field in the contracts, but I want them to act like REST API documentation. Does anybody have ideas on how to solve this?
Ignore the optional element and define 2 contracts: one with the age value and one without it. The one with the age value should also contain a priority field. You can read about priority here: https://cloud.spring.io/spring-cloud-static/spring-cloud-contract/2.2.0.RELEASE/reference/html/project-features.html#contract-dsl-http-top-level-elements
It would look more or less like this (contract in YAML):
priority: 5 # lower value of priority == higher priority
request:
  ...
  body:
    documentContent:
      age: 11
response:
  ...
and then the less concrete case (in YAML):
priority: 50 # higher value of priority == lower priority
request:
  ...
  body:
    documentContent:
      # no age
response:
  ...
I found a solution that is more applicable to my case (Groovy code):
def documentContent = [
    "userId": 2,
    "sessionId": 3
]
Contract.make {
    response {
        body(
            [
                ............
                "documentContent": $(consumer(documentContent), producer(~/.+/)),
                ............
            ]
        )
    }
}
But please take into consideration that I stubbed the documentContent value with a String ("documentContent") in the producer contract test.

Publishing Avro messages using Kafka REST Proxy throws "Conversion of JSON to Avro failed"

I am trying to publish a message which has a union for one field, with a schema like:
{
  "name": "somefield",
  "type": [
    "null",
    {
      "type": "array",
      "items": {
        "type": "record",
Publishing the message using the Kafka REST Proxy keeps throwing the following error when somefield has a populated array:
{
  "error_code": 42203,
  "message": "Conversion of JSON to Avro failed: Failed to convert JSON to Avro: Expected start-union. Got START_ARRAY"
}
The same schema with "somefield": null works fine.
The Java classes are built from the Avro schemas in the Spring Boot project using the Gradle plugin. When I use the generated Java classes and publish a message with the array populated via the Spring KafkaTemplate, the message is published correctly with the correct schema (the schema is taken from the generated Avro specific record). When I copy the same JSON value and schema and publish via the REST Proxy, it fails with the above error.
I have these content types in the API call:
accept:application/vnd.kafka.v2+json, application/vnd.kafka+json, application/json
content-type:application/vnd.kafka.avro.v2+json
What am I missing here? Any pointers to troubleshoot the issue are appreciated.
The messages I tested with were:
{
  "somefield": null
}
and
{
  "somefield": [
    {"field1": "hello"}
  ]
}
However, the Avro JSON encoding requires non-null union values to be wrapped in an object keyed by the union branch, so it should instead be passed as:
{
  "somefield": {
    "array": [
      {"field1": "hello"}
    ]
  }
}

Kafka Connect Elasticsearch sink - using if-else blocks to extract and transform fields for different topics

I have a Kafka Elasticsearch sink properties file like the following:
name=elasticsearch.sink.direct
connector.class=io.confluent.connect.elasticsearch.ElasticsearchSinkConnector
tasks.max=16
topics=data.my_setting
connection.url=http://dev-elastic-search01:9200
type.name=logs
topic.index.map=data.my_setting:direct_my_setting_index
batch.size=2048
max.buffered.records=32768
flush.timeout.ms=60000
max.retries=10
retry.backoff.ms=1000
schema.ignore=true
transforms=InsertKey,ExtractId
transforms.InsertKey.type=org.apache.kafka.connect.transforms.ValueToKey
transforms.InsertKey.fields=MY_SETTING_ID
transforms.ExtractId.type=org.apache.kafka.connect.transforms.ExtractField$Key
transforms.ExtractId.field=MY_SETTING_ID
This works perfectly for a single topic (data.my_setting). I would like to use the same connector for data coming in from more than one topic. A message in a different topic will have a different key which I'll need to transform. I was wondering if there's a way to use if-else statements with a condition on the topic name, or on a single field in the message, so that I can transform the key differently. All the incoming messages are JSON with schema and payload.
UPDATE based on the answer:
In my jdbc connector I add the key as follows:
name=data.my_setting
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
poll.interval.ms=500
tasks.max=4
mode=timestamp
query=SELECT * FROM MY_TABLE with (nolock)
timestamp.column.name=LAST_MOD_DATE
topic.prefix=investment.ed.data.app_setting
transforms=ValueToKey
transforms.ValueToKey.type=org.apache.kafka.connect.transforms.ValueToKey
transforms.ValueToKey.fields=MY_SETTING_ID
However, I still get the error when a message produced by this connector is read by the Elasticsearch sink:
org.apache.kafka.connect.errors.ConnectException: Exiting WorkerSinkTask due to unrecoverable exception.
Caused by: org.apache.kafka.connect.errors.DataException: STRUCT is not supported as the document id
The payload looks like this:
{
  "schema": {
    "type": "struct",
    "fields": [{
      "type": "int32",
      "optional": false,
      "field": "MY_SETTING_ID"
    }, {
      "type": "string",
      "optional": true,
      "field": "MY_SETTING_NAME"
    }],
    "optional": false
  },
  "payload": {
    "MY_SETTING_ID": 9,
    "MY_SETTING_NAME": "setting_name"
  }
}
The Connect standalone properties file looks like this:
bootstrap.servers=localhost:9092
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
converter.schemas.enable=false
internal.key.converter=org.apache.kafka.connect.json.JsonConverter
internal.value.converter=org.apache.kafka.connect.json.JsonConverter
internal.key.converter.schemas.enable=false
internal.value.converter.schemas.enable=false
offset.storage.file.filename=/apps/{env}/logs/infrastructure/offsets/connect.offsets
rest.port=8084
plugin.path=/usr/share/java
Is there a way to achieve my goal, which is to have messages from multiple topics (in my case DB tables), each with their own unique IDs (which will also be the document IDs in ES), sent through a single ES sink?
Can I use Avro for this task? Is there a way to define the key in the Schema Registry, or will I run into the same problem?
This isn't possible. You'd need multiple Connectors if the key fields are different.
One option to think about is pre-processing your Kafka topics through a stream processor (e.g. Kafka Streams, KSQL, Spark Streaming etc.) to standardise the key fields, so that you can then use a single connector. It depends on what you're building whether this would be worth doing, or overkill.
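As a rough illustration of that pre-processing idea, a small Kafka Streams app could re-key each source topic onto its own id field and write to new topics that a single connector then reads. This is only a sketch under assumptions: the es.* output topics, the second table, and the extractId helper are placeholders, not part of the original setup.
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;

public class RekeyTopics {
    // hypothetical helper: parse the JSON value and return the given id field as a String,
    // e.g. with Jackson: new ObjectMapper().readTree(jsonValue).at("/payload/" + idField).asText()
    static String extractId(String jsonValue, String idField) {
        return jsonValue; // placeholder implementation
    }

    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // re-key data.my_setting on MY_SETTING_ID and write to a new topic for the sink
        builder.stream("data.my_setting", Consumed.with(Serdes.String(), Serdes.String()))
               .selectKey((oldKey, value) -> extractId(value, "MY_SETTING_ID"))
               .to("es.data.my_setting", Produced.with(Serdes.String(), Serdes.String()));

        // repeat for each additional table/topic with its own key field
        builder.stream("data.other_setting", Consumed.with(Serdes.String(), Serdes.String()))
               .selectKey((oldKey, value) -> extractId(value, "OTHER_SETTING_ID"))
               .to("es.data.other_setting", Produced.with(Serdes.String(), Serdes.String()));

        // builder.build() would then be passed to new KafkaStreams(topology, streamsConfig) and started
    }
}
With every output topic keyed on a plain String id, the key-extraction transforms in the sink config shouldn't be needed any more, and a single Elasticsearch connector could subscribe to all of the es.* topics.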

How to stream repeated fields into BigQuery for non-records?

I'm streaming data into Google BigQuery. I have a repeated field, but I'm receiving the following error:
[{"errors"=[
{"debugInfo"="generic::invalid_argument: This field is not a record.",
"location"="hashtags",
"message"="This field is not a record.",
"reason"="invalid"}],
"index"=0}]
The schema contains:
...,
{
  "name": "hashTags",
  "type": "string",
  "mode": "repeated"
}
I'm passing a list of strings for hashTags in the JSON I'm sending.
What's going wrong and how can I fix it? I don't really want to have to turn a single-value field into a record.
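For reference, this is roughly how a repeated STRING column is normally populated with the google-cloud-bigquery Java client: each row carries a plain list of strings for that column. The dataset/table names and values are placeholders, and this sketch is only meant as a comparison point, not a confirmed fix for the error above.
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.InsertAllRequest;
import com.google.cloud.bigquery.InsertAllResponse;
import com.google.cloud.bigquery.TableId;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class RepeatedFieldInsert {
    public static void main(String[] args) {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

        // a REPEATED STRING column is sent as a plain list/array of strings, not a record
        Map<String, Object> row = new HashMap<>();
        row.put("hashTags", Arrays.asList("kafka", "bigquery"));

        InsertAllResponse response = bigquery.insertAll(
                InsertAllRequest.newBuilder(TableId.of("my_dataset", "my_table")) // placeholder names
                        .addRow(row)
                        .build());

        if (response.hasErrors()) {
            System.out.println(response.getInsertErrors());
        }
    }
}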
