How to transform more than 1 field in Kafka Connect?

I am using a Kafka Connect sink config to get data from a topic and persist it to an Oracle DB. It works like a champ. I'm doing a transformation on a timestamp column that comes in via an Avro schema as a long, which I then transform into an Oracle Timestamp column.
"transforms": "TimestampConverter",
"transforms.TimestampConverter.type": "org.apache.kafka.connect.transforms.TimestampConverter$Value",
"transforms.TimestampConverter.format": "mm/dd/yyyy HH:mm:ss",
"transforms.TimestampConverter.target.type": "Timestamp",
"transforms.TimestampConverter.field": "created_ts"
But, I can't figure out how to do this on multiple timestamps. That is, in addition to the created_ts, I also have an updated_ts I need to transform.
I tried this:
"transforms.TimestampConverter.field": "created_ts, updated_ts"
That does not work, nor can I repeat the whole block for the other field, because Connect only allows one entry per property name.
Lastly, I tried this:
"transforms.TimestampConverter.field.1": "created_ts",
"transforms.TimestampConverter.field.2": "updated_ts"

You would add two transforms:
"transforms": "CreatedConverter,UpdatedConverter",
"transforms.CreatedConverter.type": "org.apache.kafka.connect.transforms.TimestampConverter$Value",
"transforms.CreatedConverter.field": "created_ts",
...
"transforms.UpdatedConverter.type": "org.apache.kafka.connect.transforms.TimestampConverter$Value",
"transforms.UpdatedConverter.field": "updated_ts",
...
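Spelled out in full, the chained configuration might look like the following sketch (target.type is carried over from the question's single-converter setup; when converting a numeric epoch to a Timestamp the format property mainly applies to string conversions, so it is omitted here):

```json
"transforms": "CreatedConverter,UpdatedConverter",
"transforms.CreatedConverter.type": "org.apache.kafka.connect.transforms.TimestampConverter$Value",
"transforms.CreatedConverter.field": "created_ts",
"transforms.CreatedConverter.target.type": "Timestamp",
"transforms.UpdatedConverter.type": "org.apache.kafka.connect.transforms.TimestampConverter$Value",
"transforms.UpdatedConverter.field": "updated_ts",
"transforms.UpdatedConverter.target.type": "Timestamp"
```

Transforms run in the order listed in "transforms", each one seeing the output of the previous, so two independent single-field converters chained this way convert both fields.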

Related

Transform date format inside CSV using Apache Nifi

I need to modify a CSV file in an Apache NiFi environment.
My CSV file looks like this:
Advertiser ID,Campaign Start Date,Campaign End Date,Campaign Name
10730729,1/29/2020 3:00:00 AM,2/20/2020 3:00:00 AM,Nestle
40376079,2/1/2020 3:00:00 AM,4/1/2020 3:00:00 AM,Heinz
...
I want to transform the dates with AM/PM values to a simple date format, from 1/29/2020 3:00:00 AM to 2020-01-29, for each row. I read about the UpdateRecord processor, but there is a problem. As you can see, the CSV headers contain spaces, and I can't even parse these fields with either Replacement Value Strategy (Literal or Record Path).
Any ideas to solve this problem? Maybe somehow I should modify headers from Advertiser ID to advertiser_id, etc?
You don't need to actually make the transformation yourself, you can let your Readers and Writers handle it for you. To get the CSV Reader to recognize dates though, you will need to define a schema for your rows. Your schema would look something like this (I've removed the spaces from the column names because they are not allowed):
{
"type": "record",
"name": "ExampleCSV",
"namespace": "Stackoverflow",
"fields": [
{"name": "AdvertiserID", "type": "string"},
{"name": "CampaignStartDate", "type" : {"type": "long", "logicalType" : "timestamp-micros"}},
{"name": "CampaignEndDate", "type" : {"type": "long", "logicalType" : "timestamp-micros"}},
{"name": "CampaignName", "type": "string"}
]
}
To configure the reader, set the following properties:
Schema Access Strategy = Use 'Schema Text' property
Schema Text = (Above codeblock)
Treat First Line as Header = True
Timestamp Format = "MM/dd/yyyy hh:mm:ss a"
Additionally, you can set this property to ignore the header of the CSV if you don't want to, or are unable to, change the upstream system to remove the spaces:
Ignore CSV Header Column Names = True
Then in your CSVRecordSetWriter service you can specify the following:
Schema Access Strategy = Inherit Record Schema
Timestamp Format = "yyyy-MM-dd"
You can use UpdateRecord or ConvertRecord (or others, as long as they allow you to specify both a reader and a writer) and it will just do the conversion for you. The difference between UpdateRecord and ConvertRecord is that UpdateRecord requires you to specify a user-defined property, so if this is the only change you will make, just use ConvertRecord. If you have other transformations, use UpdateRecord and make those changes at the same time.
Caveat: This will rewrite the file using the new column names (in my example, ones without spaces) so keep that in mind for downstream usage.

Making timestamps out of Kafka Connect Single Message Transformer

I am moving data from Kafka to Elasticsearch using Kafka Connect's Single Message Transforms (SMTs), more specifically TimestampConverter. I fiddled around with it and couldn't get it to output a Timestamp format.
When I used the types "Date", "Time", or "Timestamp" as the value for transforms.TimestampConverter.target.type, I couldn't get data into Elasticsearch. Only when I set the value to "string" did it output the values into Elasticsearch as a date data type. Unfortunately, this means I can only get the value at day-level accuracy.
Here is the transformer configs:
"transforms": "TimestampConverter",
"transforms.TimestampConverter.type": "org.apache.kafka.connect.transforms.TimestampConverter$Value",
"transforms.TimestampConverter.field": "UPDATED",
"transforms.TimestampConverter.format": "yyyy-MM-dd",
"transforms.TimestampConverter.target.type": "string"
Any known way to achieve this with a more accurate timestamp? I tried all kinds of configurations, altering the target.type and format fields.
The UPDATED value is an epoch bigint.
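One configuration worth trying (a hedged sketch, not verified against this exact setup): keep target.type as "string", but use a format that carries the full time component. Elasticsearch's default date detection accepts ISO-8601-style strings, so the field can still be mapped as a date while keeping second-level accuracy:

```json
"transforms": "TimestampConverter",
"transforms.TimestampConverter.type": "org.apache.kafka.connect.transforms.TimestampConverter$Value",
"transforms.TimestampConverter.field": "UPDATED",
"transforms.TimestampConverter.format": "yyyy-MM-dd'T'HH:mm:ss",
"transforms.TimestampConverter.target.type": "string"
```

The format string follows Java's SimpleDateFormat conventions, so the literal T must be single-quoted.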

Kafka Connect JDBC sink issue with a table name containing a "."

I am trying to build a Kafka Connect JDBC sink connector. The issue is that the database table name contains a dot, and when the connector is created, the process splits the table name in two, leading to a table-not-found error. I tried multiple ways to escape the dot so it is read as part of the table name, but nothing worked.
Here is the actual name :
"table.name.format":"Bte3_myname.centrallogging",
here is the error :
Caused by: org.apache.kafka.connect.errors.ConnectException: Table \"Bte3_myname\".\"centrallogrecord\" is missing.
Here is my config file :
{
"name": "jdbc-connect-central-logging-sink",
"config":
{
"connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
"tasks.max": "3",
"topics": "central_logging",
"connection.url": "...",
"connection.user": "...",
"connection.password": "...",
"table.name.format":"Bte3_myname.centrallogging",
"pk.mode": "kafka",
"auto.create": "false",
"auto.evolve": "false"
}
}
Would someone have any idea about how to parse that correctly in the config file ?
Thanks a lot !
When a topic name contains dots for naming reasons, but the table name is just a part of it, like topic.prefix.MY_TABLE_NAME.topic.suffix, the sink connector can be configured with a RegexRouter transformation, which extracts MY_TABLE_NAME for the sink operation.
The transformation may look like:
"transforms": "changeTopicName",
"transforms.changeTopicName.type": "org.apache.kafka.connect.transforms.RegexRouter",
"transforms.changeTopicName.regex": "topic.prefix.(MY_.*).topic.suffix",
"transforms.changeTopicName.replacement": "$1",
then the connector will use MY_TABLE_NAME as a table name.
P.S. Indeed, the regex should be written more carefully (the unescaped dots match any character), but that's up to your case, right? ;)
If bte3_myname is actually your schema, this may work
"table.name.format": "bte3_myname.${topic}"
(give or take one extra underscore).
I also notice you are using mixed case - so you may need to set "quote.sql.identifiers" accordingly.
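If the schema name is correct and the failure comes down to identifier quoting and case, these are the two relevant sink properties together. A sketch (the "never" value is an assumption to try, not verified against your Oracle setup, and ${topic} only helps if the topic name matches the table name):

```json
"table.name.format": "bte3_myname.${topic}",
"quote.sql.identifiers": "never"
```

With "never", the JDBC sink leaves identifiers unquoted, so the database resolves them with its default case rules instead of doing a case-sensitive lookup on a quoted mixed-case name.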

Reindex Elasticsearch converting unixtime to date

I have an Elasticsearch index which uses the @timestamp field to store the date in a date field.
Many records are missing the @timestamp field but have a timestamp field containing a Unix timestamp (generated from PHP, so seconds, not milliseconds).
Note, the timestamp field is of date type, but numeric data seems to be stored there.
How can I use a Painless script in a reindex to set @timestamp where it is missing, if there is a numeric timestamp field with a Unix timestamp?
Here's an example record that I would want to transform.
{
"_index": "my_log",
"_type": "doc",
"_id": "AWjEkbynNsX24NVXXmna",
"_score": 1,
"_source": {
"name": null,
"pid": "148651",
"timestamp": 1549486104
}
},
Did you have a look at the ingest module of Elasticsearch?
https://www.elastic.co/guide/en/elasticsearch/reference/current/date-processor.html
Parses dates from fields, and then uses the date or timestamp as the timestamp for the document. By default, the date processor adds the parsed date as a new field called @timestamp. You can specify a different field by setting the target_field configuration parameter. Multiple date formats are supported as part of the same date processor definition. They will be used sequentially to attempt parsing the date field, in the same order they were defined as part of the processor definition.
It does exactly what you want :) In your reindex statement you can direct documents through this ingest processor.
If you need more help let me know, then I can jump behind a computer and help out :D
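As a sketch of that approach (the pipeline name and destination index are made up for the example; "UNIX" tells the date processor the field is epoch seconds, and the if conditional only fires when @timestamp is missing and timestamp is present):

```json
PUT _ingest/pipeline/unix-to-timestamp
{
  "processors": [
    {
      "date": {
        "if": "ctx['@timestamp'] == null && ctx.timestamp != null",
        "field": "timestamp",
        "formats": ["UNIX"],
        "target_field": "@timestamp"
      }
    }
  ]
}

POST _reindex
{
  "source": { "index": "my_log" },
  "dest": { "index": "my_log_v2", "pipeline": "unix-to-timestamp" }
}
```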

Logstash inserting dates as strings instead of dateOptionalTime

I have an Elasticsearch index with the following mapping:
"pickup_datetime": {
"type": "date",
"format": "dateOptionalTime"
}
Here is an example of a date contained in the file that is being read in
"pickup_datetime": "2013-01-07 06:08:51"
I am using Logstash to read and insert data into ES with the following lines to attempt to convert the date string into the date type.
date {
match => [ "pickup_datetime", "yyyy-MM-dd HH:mm:ss" ]
target => "pickup_datetime"
}
But the match never seems to occur.
What am I doing wrong?
It turns out the date filter was before the csv filter, where the columns get named, hence the date filter was not finding the pickup_datetime column since it had not yet been named.
It might be a good idea to clearly mention the sequentiality of the filters in the documentation to avoid others having similar problems in the future.
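For reference, the fix amounts to reordering the filters so that csv (which names the columns) runs before date. A sketch, with a hypothetical column list:

```
filter {
  csv {
    # name the columns first, so later filters can reference them
    columns => ["pickup_datetime", "fare_amount"]   # hypothetical column list
  }
  date {
    # now pickup_datetime exists and can be parsed into a date type
    match  => [ "pickup_datetime", "yyyy-MM-dd HH:mm:ss" ]
    target => "pickup_datetime"
  }
}
```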
