I am trying to build a Kafka Connect JDBC sink connector. The issue is that the database table name contains a dot, and when the connector is created, the process splits the table name in two, leading to a "table not found" error. I have tried multiple ways to escape the dot so it is read as part of the table name, but nothing worked.
Here is the actual name:
"table.name.format":"Bte3_myname.centrallogging",
Here is the error:
Caused by: org.apache.kafka.connect.errors.ConnectException: Table \"Bte3_myname\".\"centrallogrecord\" is missing.
Here is my config file:
{
"name": "jdbc-connect-central-logging-sink",
"config":
{
"connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
"tasks.max": "3",
"topics": "central_logging",
"connection.url": "...",
"connection.user": "...",
"connection.password": "...",
"table.name.format":"Bte3_myname.centrallogging",
"pk.mode": "kafka",
"auto.create": "false",
"auto.evolve": "false"
}
}
Does anyone have an idea how to express this correctly in the config file?
Thanks a lot !
If a topic name contains dots for naming reasons, but the table name is only a part of it (e.g. topic.prefix.MY_TABLE_NAME.topic.suffix), then the sink connector can be configured with a RegexRouter transformation, which can extract MY_TABLE_NAME for the sink.
The transformation may look like:
"transforms": "changeTopicName",
"transforms.changeTopicName.type": org.apache.kafka.connect.transforms.RegexRouter",
"transforms.changeTopicName.regex": "topic.prefix.(MY_.*).topic.suffix",
"transforms.changeTopicName.replacement": "$1",
then the connector will use MY_TABLE_NAME as a table name.
P.S. Indeed, the regex should be defined more carefully (the dots above are unescaped and will match any character; see the sketch below), but that depends on your case, right? ;)
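One possible tightening, escaping the literal dots of the hypothetical topic name above (in a JSON config the backslash itself must be escaped, hence the doubling):
"transforms.changeTopicName.regex": "topic\\.prefix\\.(MY_.*)\\.topic\\.suffix",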
If bte3_myname is actually your schema, this may work:
"table.name.format": "bte3_myname.${topic}"
(give or take one extra underscore).
I also notice you are using mixed case, so you may need to set "quote.sql.identifiers" accordingly.
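Putting the two suggestions together, a minimal sketch (assuming bte3_myname is your schema, the topic is centrallogging, and you want identifiers left unquoted; the connector accepts "always" and "never" for quote.sql.identifiers):
"table.name.format": "bte3_myname.${topic}",
"quote.sql.identifiers": "never"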
I need to modify a CSV file in an Apache NiFi environment.
My CSV file looks like this:
Advertiser ID,Campaign Start Date,Campaign End Date,Campaign Name
10730729,1/29/2020 3:00:00 AM,2/20/2020 3:00:00 AM,Nestle
40376079,2/1/2020 3:00:00 AM,4/1/2020 3:00:00 AM,Heinz
...
I want to transform the dates with AM/PM values to a simple date format: from 1/29/2020 3:00:00 AM to 2020-01-29 for each row. I read about the UpdateRecord processor, but there is a problem. As you can see, the CSV headers contain spaces, and I can't parse these fields with either Replacement Value Strategy (Literal or Record Path).
Any ideas how to solve this problem? Maybe I should somehow modify the headers from Advertiser ID to advertiser_id, etc.?
You don't need to actually make the transformation yourself; you can let your Readers and Writers handle it for you. To get the CSV Reader to recognize dates, though, you will need to define a schema for your rows. Your schema would look something like this (I've removed the spaces from the column names because they are not allowed):
{
"type": "record",
"name": "ExampleCSV",
"namespace": "Stackoverflow",
"fields": [
{"name": "AdvertiserID", "type": "string"},
{"name": "CampaignStartDate", "type" : {"type": "long", "logicalType" : "timestamp-micros"}},
{"name": "CampaignEndDate", "type" : {"type": "long", "logicalType" : "timestamp-micros"}},
{"name": "CampaignName", "type": "string"}
]
}
To configure the reader, set the following properties:
Schema Access Strategy = Use 'Schema Text' property
Schema Text = (Above codeblock)
Treat First Line as Header = True
Timestamp Format = "MM/dd/yyyy hh:mm:ss a"
Additionally, you can set this property to ignore the header of the CSV if you don't want to, or are unable to, change the upstream system to remove the spaces:
Ignore CSV Header Column Names = True
Then in your CSVRecordSetWriter service you can specify the following:
Schema Access Strategy = Inherit Record Schema
Timestamp Format = "yyyy-MM-dd"
You can use UpdateRecord or ConvertRecord (or others, as long as they allow you to specify both a reader and a writer) and it will just do the conversion for you. The difference between UpdateRecord and ConvertRecord is that UpdateRecord requires you to specify a user-defined property, so if this is the only change you will make, just use ConvertRecord. If you have other transformations, use UpdateRecord and make those changes at the same time.
Caveat: This will rewrite the file using the new column names (in my example, ones without spaces) so keep that in mind for downstream usage.
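For illustration, with the reader and writer configured as above, the sample rows from the question should come out looking roughly like this (a sketch; the exact output depends on your writer settings, and note the renamed headers):
AdvertiserID,CampaignStartDate,CampaignEndDate,CampaignName
10730729,2020-01-29,2020-02-20,Nestle
40376079,2020-02-01,2020-04-01,Heinz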
I am using a Kafka Connect sink config to get data from a topic and persist it to an Oracle DB. It works like a champ, and I'm doing a transformation on a timestamp column that comes in via an Avro schema as a long, which I then transform to an Oracle Timestamp column.
"transforms": "TimestampConverter",
"transforms.TimestampConverter.type": "org.apache.kafka.connect.transforms.TimestampConverter$Value",
"transforms.TimestampConverter.format": "mm/dd/yyyy HH:mm:ss",
"transforms.TimestampConverter.target.type": "Timestamp",
"transforms.TimestampConverter.field": "created_ts"
But, I can't figure out how to do this on multiple timestamps. That is, in addition to the created_ts, I also have an updated_ts I need to transform.
I tried this:
"transforms.TimestampConverter.field": "created_ts, updated_ts"
That does not work, nor can I repeat the whole block for the other field, because Connect only allows one entry with the same name.
Lastly, I tried this:
"transforms.TimestampConverter.field.1": "created_ts",
"transforms.TimestampConverter.field.2": "updated_ts"
You would add 2 transforms:
"transforms": "CreatedConverter,UpdatedConverter",
"transforms.CreatedConverter.type": "org.apache.kafka.connect.transforms.TimestampConverter$Value",
"transforms.CreatedConverter.field": "created_ts",
...
"transforms.UpdatedConverter.type": "org.apache.kafka.connect.transforms.TimestampConverter$Value",
"transforms.UpdatedConverter.field": "updated_ts"
...
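Spelling out the "..." with the format and target.type from the original config, the complete pair might look like this (a sketch reusing the question's field names and format):
"transforms": "CreatedConverter,UpdatedConverter",
"transforms.CreatedConverter.type": "org.apache.kafka.connect.transforms.TimestampConverter$Value",
"transforms.CreatedConverter.format": "MM/dd/yyyy HH:mm:ss",
"transforms.CreatedConverter.target.type": "Timestamp",
"transforms.CreatedConverter.field": "created_ts",
"transforms.UpdatedConverter.type": "org.apache.kafka.connect.transforms.TimestampConverter$Value",
"transforms.UpdatedConverter.format": "MM/dd/yyyy HH:mm:ss",
"transforms.UpdatedConverter.target.type": "Timestamp",
"transforms.UpdatedConverter.field": "updated_ts"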
We are having problems enforcing the order in which messages from a Kafka topic are sent to Elasticsearch using the Kafka Connect Elasticsearch Connector. In the topic the messages are in the right order with the correct offsets, but if there are two messages with the same ID created in quick succession, they are intermittently sent to Elasticsearch in the wrong order. This causes Elasticsearch to have the data from the second last message, not from the last message. If we add some artificial delay of a second or two between the two messages in the topic, the problem disappears.
The documentation here states:
Document-level update ordering is ensured by using the partition-level
Kafka offset as the document version, and using version_mode=external.
However I can't find any documentation anywhere about this version_mode setting, and whether it's something we need to set ourselves somewhere.
In the log files from the Kafka Connect system we can see the two messages (for the same ID) being processed in the wrong order, a few milliseconds apart. It might be significant that it looks like these are processed in different threads. Also note that there is only one partition for this topic, so all messages are in the same partition.
Below is the log snippet, slightly edited for clarity. The messages in the Kafka topic are populated by Debezium, which I don't think is relevant to the problem, but handily happens to include a timestamp value. This shows that the messages are processed in the wrong order (though they're in the correct order in the Kafka topic, populated by Debezium):
[2019-01-17 09:10:05,671] DEBUG http-outgoing-1 >> "
{
"op": "u",
"before": {
"id": "ac025cb2-1a37-11e9-9c89-7945a1bd7dd1",
... << DATA FROM BEFORE SECOND UPDATE >> ...
},
"after": {
"id": "ac025cb2-1a37-11e9-9c89-7945a1bd7dd1",
... << DATA FROM AFTER SECOND UPDATE >> ...
},
"source": { ... },
"ts_ms": 1547716205205
}
" (org.apache.http.wire)
...
[2019-01-17 09:10:05,696] DEBUG http-outgoing-2 >> "
{
"op": "u",
"before": {
"id": "ac025cb2-1a37-11e9-9c89-7945a1bd7dd1",
... << DATA FROM BEFORE FIRST UPDATE >> ...
},
"after": {
"id": "ac025cb2-1a37-11e9-9c89-7945a1bd7dd1",
... << DATA FROM AFTER FIRST UPDATE >> ...
},
"source": { ... },
"ts_ms": 1547716204190
}
" (org.apache.http.wire)
Does anyone know how to force this connector to maintain message order for a given document ID when sending the messages to Elasticsearch?
The problem was that our Elasticsearch connector had the key.ignore configuration set to true.
We spotted this line in the Github source for the connector (in DataConverter.java):
final Long version = ignoreKey ? null : record.kafkaOffset();
This meant that, with key.ignore=true, the indexing operations that were being generated and sent to Elasticsearch were effectively "versionless" ... basically, the last set of data that Elasticsearch received for a document would replace any previous data, even if it was "old data".
From looking at the log files, the connector seems to have several consumer threads reading the source topic, then passing the transformed messages to Elasticsearch, but the order that they are passed to Elasticsearch is not necessarily the same as the topic order.
Using key.ignore=false, each Elasticsearch message now contains a version value equal to the Kafka record offset, and Elasticsearch refuses to update the index data for a document if it has already received data for a later "version".
That wasn't the only thing that fixed this. We still had to apply a transform to the Debezium message from the Kafka topic to get the key into a plain text format that Elasticsearch was happy with:
"transforms": "ExtractKey",
"transforms.ExtractKey.type": "org.apache.kafka.connect.transforms.ExtractField$Key",
"transforms.ExtractKey.field": "id"
I have connected Kibana to my ES instance.
GET _cat/indices returns:
yellow open .kibana 1 1 1 0 3.1kb 3.1kb
yellow open tests 5 1 413042 0 3.4gb 3.4gb
However, I can't get past the Kibana index configuration screen. What am I missing?
Update:
My sample document looks like this:
{
"_index": "tests",
"_type": "test7",
"_id": "AVGlIKIM1CQ8BZRgLZVg",
"_score": 1.7840601,
"_source": {
"severity": "ERROR",
"code": "CODE",
"message": "MESSAGE",
"environment": "TEST",
"error_uuid": "cbe99080-0bf3-495c-a417-77384ba0fd39",
"correlation_id": "cf5a1fd5-4fd2-40bb-9cdf-405b91dcbd6f",
"timestamp": "2015-11-20 15:24:39.831"
}
}
Disable the option Use event times to create index names and put the index name (tests) instead of the pattern.
The option you are trying to use is used when you have index names based on timestamp (imagine you create a new index per day with tests-2015.12.01, tests-2015.12.02...). It's quite clear if you read the message when you enable that option:
Patterns allow you to define dynamic index names. Static text in an index name is denoted using brackets. Example: [logstash-]YYYY.MM.DD. Please note that weeks are setup to use ISO weeks which start on Monday
EDIT: The empty dropdown for the time-field name is because you don't have any field with the date type in the mapping of your index. You can check this with GET /<index-name>/_mapping?pretty: the timestamp field is of type "string", not "date". This happens because its format didn't match the regex used for date detection (yyyy/MM/dd HH:mm:ss Z||yyyy/MM/dd Z). To solve this:
You can change the format of the timestamp you are inserting to match the default regex.
You can modify the dynamic_date_formats property and put a format that matches the current format of your timestamp.
You can set an index template and set the type "date" for the "timestamp" field.
In any of these cases, you will need to delete the index and create a new one, or reindex the data.
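For example, the index template route (the third option) might look something like this; a sketch using the pre-5.x template syntax to match the era of this question, with a made-up template name and a date format matching the sample document's timestamp:
PUT /_template/tests_template
{
"template": "tests*",
"mappings": {
"test7": {
"properties": {
"timestamp": { "type": "date", "format": "yyyy-MM-dd HH:mm:ss.SSS" }
}
}
}
}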
I am using Codeigniter and Alex Bilbie's MongoDB library.
In my API that I am developing users can upload images and other users can comment on them.
I have chosen to include the comments as sub documents to the images.
Each comment contains:
Fullname (of author)
Comment
Created_at
So in other words, the user's full name is "hard coded" into each comment, so if they later decide to change their name I have a problem.
I read that I can use atomic updates to update all occurrences of the name (like in comments), but how can I do this using Alex's library? Can I update all places where the name is wrong?
UPDATE
This is how the image document looks like with the comments.
I think that it is pretty strange that MongoDB encourages the use of subdocuments but then does not include a way to update multiple items in an array.
{
"_id": ObjectId("4e9ead773dc793dc01020000"),
"description": "An image",
"category": "accident",
"comments": [
{
"id": ObjectId("4e96bd063dc7937202000000"),
"fullname": "James Bond",
"comment": "This is a comment.",
"created_at": "2011-10-19 13:02:40"
}
],
"created_at": "2011-10-19 12:59:03"
}
Thankful for all help!
I am not familiar with CodeIgniter, but maybe the MongoDB shell syntax will help you:
db.comments.update( {"Fullname":"Andrew Orsich"},
{ $set : { Fullname: "New name"} }, false, true )
The last true flag indicates that you want to update multiple documents, so it is possible to update all comments in one update operation.
BTW: denormalizing (not "hard coding") data is a usual practice in MongoDB, and in NoSQL in general. Also, operations that require updating a lot of documents usually run asynchronously. But it is up to you.
Update:
db.comments.update( {"comments.Fullname":"Andrew Orsich"},
{ $set : { comments.$.Fullname: "New name"} }, false, true )
But the above query will only update the full name in the first matching comment of the nested array. If you need to change more than one array element, you will need to use multiple update statements (though see the sketch below for what newer MongoDB versions offer).
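For completeness: on modern MongoDB (3.6+, long after this answer was written) the filtered positional operator can update every matching array element in one statement. A sketch, assuming the image documents live in a collection called images and using the lowercase fullname field from the sample document:
db.images.update(
{ "comments.fullname": "Andrew Orsich" },
{ $set: { "comments.$[elem].fullname": "New name" } },
{ multi: true, arrayFilters: [ { "elem.fullname": "Andrew Orsich" } ] }
)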