Kafka Connect without schema, only JSON - apache-kafka-connect

I want to use JDBC sink connector with JSON and without schema.
The documentation says (source):
If you need to use JSON without Schema Registry for Connect data, you
can use the JsonConverter supported with Kafka. The example below
shows the JsonConverter key and value properties that are added to the
configuration:
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
key.converter.schemas.enable=false
value.converter.schemas.enable=false
When the properties key.converter.schemas.enable and
value.converter.schemas.enable are set to true, the key or value is
not treated as plain JSON, but rather as a composite JSON object
containing both an internal schema and the data. When these are
enabled for a source connector, both the schema and data are in the
composite JSON object. When these are enabled for a sink connector,
the schema and data are extracted from the composite JSON object.
Note that this implementation never uses Schema Registry.
When the properties key.converter.schemas.enable and
value.converter.schemas.enable are set to false (the default), only
the data is passed along, without the schema. This reduces the payload
overhead for applications that do not need a schema.
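For illustration, with schemas.enable=true the JsonConverter expects each message to be a schema-and-payload envelope roughly like the sketch below (the field names here are made up); with schemas.enable=false only the bare payload object is sent:
{
  "schema": {
    "type": "struct",
    "fields": [
      {"field": "id", "type": "int64", "optional": false},
      {"field": "name", "type": "string", "optional": true}
    ],
    "optional": false,
    "name": "utp_record"
  },
  "payload": {"id": 1, "name": "example"}
}
versus the plain-JSON form:
{"id": 1, "name": "example"}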
I configured the connector:
{
  "name": "noschemajustjson",
  "config": {
    "key.converter.schemas.enable": "false",
    "value.converter.schemas.enable": "false",
    "schemas.enable": "false",
    "name": "noschemajustjson",
    "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
    "key.converter": "org.apache.kafka.connect.json.JsonConverter",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter",
    "config.action.reload": "restart",
    "errors.log.enable": "true",
    "errors.log.include.messages": "true",
    "topics": "testconnect2",
    "connection.url": "jdbc:postgresql://postgres:5432/postgres",
    "connection.user": "postgres",
    "connection.password": "********",
    "dialect.name": "PostgreSqlDatabaseDialect",
    "table.name.format": "utp",
    "auto.create": "false",
    "auto.evolve": "false"
  }
}
But I still get this error:
Caused by: org.apache.kafka.connect.errors.ConnectException: Sink
connector 'noschemajustjson2' is configured with
'delete.enabled=false' and 'pk.mode=none' and therefore requires
records with a non-null Struct value and non-null Struct schema, but
found record at
(topic='testconnect2',partition=0,offset=0,timestamp=1626416739697)
with a HashMap value and null value schema.
So what should I do to force Connect work without schema at all (only plain JSON)?

I want to use JDBC sink connector with JSON and without schema
You cannot do this - the JDBC Sink connector streams to a relational database, and relational databases have schemas :-D The JDBC Sink connector therefore requires a schema to be present for the data.
Depending on where your data is coming from you have different options.
If it's ingested from Kafka Connect, use a converter that supports schemas (Avro, Protobuf, JSON Schema)
If it's produced by an application that you have control over, get that application to serialise that data with a schema (Avro, Protobuf, JSON Schema)
If it's coming from somewhere you don't have control over then you'll need to pre-process the topic to add an explicit schema and write it to a new topic that is then consumed by the JDBC Sink connector (see the ksqlDB sketch below).
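As a rough sketch of that third option, assuming ksqlDB with a Schema Registry is available and guessing illustrative column names for the testconnect2 values, the schemaless JSON can be re-serialised with an explicit schema like this:
-- Declare the shape of the schemaless JSON topic (column names are assumptions)
CREATE STREAM testconnect2_json (id BIGINT, name VARCHAR)
  WITH (KAFKA_TOPIC='testconnect2', VALUE_FORMAT='JSON');
-- Write the same data to a new topic with an embedded (Avro) schema
CREATE STREAM testconnect2_avro
  WITH (KAFKA_TOPIC='testconnect2_avro', VALUE_FORMAT='AVRO') AS
  SELECT * FROM testconnect2_json;
The JDBC sink would then read testconnect2_avro with value.converter=io.confluent.connect.avro.AvroConverter and value.converter.schema.registry.url pointing at the Schema Registry.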
References & resources:
Kafka Connect JDBC Sink deep-dive: Working with Primary Keys
Applying a schema to JSON data with ksqlDB

Related

Kafka JDBC Sink handle array datatype

I know that the Kafka JDBC Sink Connector has some drawbacks with the array datatype. However, is it possible to combine the Sink Connector with a simple Kafka connector that can support the array datatype? How can I filter them from the Kafka configuration and switch them into a simple Kafka connector configuration? What does "simple Kafka configuration" mean? How can Kafka Connect support array fields such as
name: topic_name
type: array
item: Topic file
Is it possible for this to be consumed into the DB as a string rather than an array?
"fields":[{
"name":"item_id",
"type":{
"type":"array",
"items":["null", "string"]
},
"default":[]
}]
}
The Kafka Connect framework itself doesn't impose limitations on types; it's the source code of the JDBC sink that rejects arrays.
There's an outstanding PR to support it for Postgres - https://github.com/confluentinc/kafka-connect-jdbc/pull/805
Unclear what you mean by "simple", but if you want to use a different connector, then you'd need to install it and then change the class. For example, maybe the MongoDB sink handles arrays; I know that the S3 and HDFS sinks do...
is it possible to combine the Sink Connector with a simple Kafka Connector
Again, unsure what you mean by this, but connectors generally don't "chain together". While you could use MirrorMaker2 with a transform to effectively do the same work as Kafka Streams, it's best to use the more appropriate tool for that.
Is this possible where it will consume to the db as a string not an array
Sure, if the message field is actually a string. As suggested, you're going to need to process the message before the sink connector consumes it, for example with ksqlDB (see the sketch below).
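A minimal sketch of that pre-processing in ksqlDB, assuming the source topic is Avro-serialised, the topic name items is made up, only the array column is shown, and a ksqlDB version that provides ARRAY_JOIN:
-- Register the existing Avro topic (schema is picked up from the Schema Registry)
CREATE STREAM items_src WITH (KAFKA_TOPIC='items', VALUE_FORMAT='AVRO');
-- Flatten the array into a comma-separated string before the JDBC sink reads it
CREATE STREAM items_flat WITH (KAFKA_TOPIC='items_flat', VALUE_FORMAT='AVRO') AS
  SELECT ARRAY_JOIN(item_id, ',') AS item_id FROM items_src;
The JDBC sink would then be pointed at items_flat instead of the original topic.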

How to change Elasticsearch document source content by modifying Elasticsearch source code?

I need to encrypt Elasticsearch document source content for security.
The final effect to be achieved is as follows:
input:
{
  "title": "you know, for search",
  "viewcount": 20
}
In es:
{
  "title": "zpv!lopx-!gps!tfbsdi", // whatever, encrypted title
  "viewcount": ☯ // whatever, encrypted viewcount
}
Instead of storing encrypted data in ES, communication between ES nodes and clients can be encrypted with X-Pack. That means that if a client is allowed to query the data, they will be able to read it in the end, and that access can be controlled with X-Pack.
Indexing encrypted data in Elasticsearch is not recommended IMO, since it involves the additional overhead of decrypting and encrypting the data.

Kafka Connect to persist topic to Elasticsearch index using field from (json) message

I'm attempting to index messages in Elasticsearch using SMTs from Kafka's Connect API only.
So far I had luck with simply using the topic and timestamp router functionality.
However, now I'd like to create separate indices based on a certain field in the message.
Suppose the messages are formatted as such:
{"productId": 1, "category": "boat", "price": 135000}
{"productId": 1, "category": "helicopter", "price": 300000}
{"productId": 1, "category": "car", "price": 25000}
Would it somehow be possible to index these to the following indices based on product category?
product-boat
product-helicopter
product-car
or would I have to create separate topics for every single category (knowing that it could become hundreds or thousands of them)?
Am I overlooking a transform that could do this, or is this simply not possible and will a custom component have to be built?
There's nothing out of the box with Kafka Connect that will do this. You have a few options:
The Elasticsearch sink connector will route messages to a target index based on its topic, so you could write a custom SMT that would inspect a message and route it to a different topic accordingly
Use a stream processor to pre-process the messages such that they're already on different topics by the time they are consumed by the Elasticsearch sink connector. For example, Kafka Streams or KSQL.
With KSQL you would need to hard-code each category (CREATE STREAM product-boat AS SELECT * FROM messages WHERE category='boat' etc.)
Kafka Streams now has Dynamic Routing (KIP-303), which would be a more flexible way of doing it (see the sketch after this list)
Handcode a bespoke Elasticsearch sink connector with the logic coded in to route the messages to indices based on message contents. This feels like the worst of the three approaches IMO.
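A minimal sketch of the KIP-303 route, assuming string-serialised JSON values and a source topic named products (both assumptions):
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class CategoryRouter {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "category-router");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> products = builder.stream("products");

        // TopicNameExtractor lambda: route each record to product-<category>.
        // Crude string parsing is used for brevity; a real implementation would use a JSON serde.
        products.to((key, value, recordContext) -> "product-" + extractCategory(value));

        new KafkaStreams(builder.build(), props).start();
    }

    private static String extractCategory(String json) {
        // Pulls the value of the "category" field out of a flat JSON string.
        int field = json.indexOf("\"category\"");
        int start = json.indexOf('"', json.indexOf(':', field) + 1) + 1;
        return json.substring(start, json.indexOf('"', start));
    }
}
Note that Streams does not auto-create the dynamically chosen output topics, so product-boat, product-helicopter, etc. need to exist (or broker-side topic auto-creation must be enabled) before records arrive.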
If you are using Confluent Platform, you can do routing that depends on a field value in the message.
To do that you can use the ExtractTopic SMT from Confluent. More details regarding that SMT can be found at https://docs.confluent.io/current/connect/transforms/extracttopic.html#extracttopic
A Kafka sink connector processes messages represented as SinkRecords. Each SinkRecord contains several fields: topic, partition, value, key, etc. Those fields are set by Kafka Connect, and using a transformation you can change their values. The ExtractTopic SMT changes the topic value based on the value or key of the message.
The transformation configuration will look something like this:
{
  ...
  "transforms": "ExtractTopic",
  "transforms.ExtractTopic.type": "io.confluent.connect.transforms.ExtractTopic$Value",
  "transforms.ExtractTopic.field": "name",  <-- the field whose value will be used as the index name
  ...
}
One limitation is that you have to create the indices in advance.
I assume you are using the Elasticsearch sink connector. The Elasticsearch connector has the ability to create an index, but it does so when it opens the writers for a particular partition (ElasticsearchSinkTask::open). In your use case, not all indices can be created at that moment, because the values of all the messages are not yet available.
Maybe it isn't the purest approach, because ExtractTopic is rather meant for source connectors, but in your case it might work.

Hibernate Search Dynamic Mapping

I am attempting to use Elasticsearch with Hibernate via the hibernate-search Elasticsearch integration. I have multi-tenant data that uses a discriminator strategy, so it would be great to index entities with that tenant identifier automatically added. This seems to work so far:
Session session = sessionFactory
    .withOptions()
    .tenantIdentifier("parkId-" + p.getId().toString())
    .openSession();
However, during indexing, Elasticsearch complains with a strict_dynamic_mapping_exception:
Response:
{
  "index": {
    "_index": "entities.productmodel",
    "_type": "entities.ProductModel",
    "_id": "parkId-1_29426",
    "status": 400,
    "error": {
      "type": "strict_dynamic_mapping_exception",
      "reason": "mapping set to strict, dynamic introduction of [__HSearch_TenantId] within [entities.ProductModel] is not allowed"
    }
  }
}
This is all despite the fact that I am overriding the default behavior of Hibernate Search and setting dynamic mapping to true, as shown in the docs:
configuration.setProperty("hibernate.search.default.elasticsearch.dynamic_mapping", "true");
(Other settings are properly being set via this method, so I know that is not the issue.)
Any idea what I'm missing? Even setting dynamic_mapping to false results in no changes; Elasticsearch still complains that the mapping is set to strict. My Elasticsearch cluster is running locally via Docker.
Make sure to re-generate your schema before each attempt while developing. You shouldn't need the dynamic_mapping setting for this, but if you generated the schema before you tried adding multitenancy and did not update the schema since, you will experience errors like this one.
Just drop the indexes in your Elasticsearch cluster, or set the property hibernate.search.default.elasticsearch.index_schema_management_strategy to drop-and-create (NOT FOR USE IN PRODUCTION, YOU WILL LOSE ALL INDEX DATA). See this section of the documentation for more information about schema generation.
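For example, during development only, that strategy could be set the same way as the dynamic_mapping property in the question (a sketch reusing the question's configuration object):
// DEVELOPMENT ONLY: drops and recreates the Elasticsearch indexes on startup, so the
// regenerated schema includes the __HSearch_TenantId field. All index data is lost.
configuration.setProperty(
        "hibernate.search.default.elasticsearch.index_schema_management_strategy",
        "drop-and-create");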
The tenant id field should be part of the schema if you have set hibernate.multiTenancy in your ORM configuration.
Did you?
If so, we might have a bug somewhere and a test case would help. We have a test case template here: https://github.com/hibernate/hibernate-test-case-templates/tree/master/search/hibernate-search-elasticsearch/hibernate-search-elasticsearch-5 .

Elasticsearch Kafka Connector - setting index based on message value

I have messages coming through a Kafka topic in the following format:
{"elasticsearch_index": "index_1", "first_name": "Jane"}
{"elasticsearch_index": "index_2", "first_name": "John"}
Note that each message contains the desired Elasticsearch index that we would like to route the record to. Is it possible to use Confluent's Elasticsearch Kafka Connector to route these records to the appropriate index name (e.g. whatever is listed under the elasticsearch_index key)?
It doesn't look like Single Message Transforms (SMTs) support this behavior currently, but maybe I am misreading. Any info would be greatly appreciated.
Two options:
Write your own Transform using the Single Message Transform API
Use KSQL (or Kafka Streams) to route the messages to the required topics first, and then use the new (Apache Kafka 1.1) regex capabilities to land those topics to Elasticsearch from Kafka Connect (see the sketch below).
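If the routed topics share a naming convention (here index_* to match the sample messages, which is an assumption, as is the connection URL), the sink side of that second option could look roughly like this:
{
  "name": "es-sink-routed",
  "config": {
    "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
    "topics.regex": "index_.*",
    "connection.url": "http://elasticsearch:9200",
    ...
  }
}
By default the connector writes each topic to an index of the same name, so index_1 and index_2 each land in their own index.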
