Kafka Connect JDBC Sink: error flattening JSON records

I'm using the Kafka Connect JDBC Sink Connector to store data from topics into a SQL Server table. The data needs to be flattened. I've created a SQL Server table and a JSON record based on the example provided by Confluent.
So my record is the following:
{
"payload":{
"id": 42,
"name": {
"first": "David"
}
},
"schema": {
"fields": [
{
"field": "id",
"optional": true,
"type": "int32"
},
{
"name": "name",
"optional": "false",
"type": "struct",
"fields": [
{
"field": "first",
"optional": true,
"type": "string"
}
]
}
],
"name": "Test",
"optional": false,
"type": "struct"
}
}
As you can see, I want to flatten the nested fields, concatenating them with the delimiter "_". So my Sink Connector configuration is as follows:
connector.class=io.confluent.connect.jdbc.JdbcSinkConnector
table.name.format=MyTable
transforms.flatten.type=org.apache.kafka.connect.transforms.Flatten$Value
topics=myTopic
tasks.max=1
transforms=flatten
value.converter.schemas.enable=true
value.converter=org.apache.kafka.connect.json.JsonConverter
connection.url=jdbc:sqlserver:[url]
transforms.flatten.delimiter=_
When I write that record to the topic, I get the following exception:
org.apache.kafka.connect.errors.ConnectException: Tolerance exceeded in error handler
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndHandleError(RetryWithToleranceOperator.java:178)
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execute(RetryWithToleranceOperator.java:104)
at org.apache.kafka.connect.runtime.WorkerSinkTask.convertAndTransformRecord(WorkerSinkTask.java:487)
at org.apache.kafka.connect.runtime.WorkerSinkTask.convertMessages(WorkerSinkTask.java:464)
at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:320)
at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:224)
at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:192)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:177)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:227)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.kafka.connect.errors.DataException: Struct schema's field name not specified properly
at org.apache.kafka.connect.json.JsonConverter.asConnectSchema(JsonConverter.java:512)
at org.apache.kafka.connect.json.JsonConverter.toConnectData(JsonConverter.java:360)
at org.apache.kafka.connect.runtime.WorkerSinkTask.lambda$convertAndTransformRecord$1(WorkerSinkTask.java:487)
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndRetry(RetryWithToleranceOperator.java:128)
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndHandleError(RetryWithToleranceOperator.java:162)
... 13 more
With records that don't require flattening, the sink connector works fine. Is there anything wrong with the configuration? Is it possible to flatten JSON records that carry a schema like this?
P.S. Kafka Connect version: 5.3.0-css
Any help would be greatly appreciated.
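As a side note, converter failures like the one in the stack trace above can be tolerated and inspected through Kafka Connect's built-in error-handling properties while debugging; a sketch of the extra sink properties (assuming Kafka Connect 2.0+ and a hypothetical dead-letter topic name):
errors.tolerance=all
errors.log.enable=true
errors.log.include.messages=true
errors.deadletterqueue.topic.name=myTopic-dlq
errors.deadletterqueue.context.headers.enable=true
This doesn't fix the conversion itself; it just routes failing records to the dead-letter topic instead of stopping the task.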

OK, the problem was the key used for the nested field's name in the schema. The correct key is "field", not "name":
{
"payload":{
"id": 42,
"name": {
"first": "David"
}
},
"schema": {
"fields": [
{
"field": "id",
"optional": true,
"type": "int32"
},
{
**"field": "name",**
"optional": "false",
"type": "struct",
"fields": [
{
"field": "first",
"optional": true,
"type": "string"
}
]
}
],
"name": "Test",
"optional": false,
"type": "struct"
}
}
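With that schema in place, the Flatten transform concatenates the nested path with the configured "_" delimiter, so the converted value should come out as a flat structure whose field names match the target table's columns (a sketch of the expected flattened payload, assuming default Flatten behaviour):
{
"id": 42,
"name_first": "David"
}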

Related

Is there a better/faster way to insert data to a database from an external api in Laravel?

I am currently getting data from an external API for use in my Laravel API. I have everything working, but I feel like it is slow.
I'm getting the data from the API with Http::get('url') and that part is fast. It is only when I start looping through the data and making edits that things slow down.
I don't need all the data, but it would still be nice to edit it before entering it into the database, if possible, as things aren't very consistent. I also have a few columns that use the data plus some logic to build new columns, so that each app/site doesn't need to do it.
I am saving to the database in each foreach iteration with the Eloquent Model::updateOrCreate() method, which works, but these JSON files can easily be 6000 lines long or more, so it obviously takes time to loop through each set, modify values, and then save to the database each time. There usually aren't more than 200 or so entries, but it still takes time. I will probably eventually switch this to the newer upsert() method to make fewer queries to the database. Running on localhost, it currently takes about a minute and a half, which just seems way too long.
Here is a shortened version of how I was looping through the data.
$json = json_decode($contents, true);
$features = $json['features'];
foreach ($features as $feature){
// Get ID
$id = $feature['id'];
// Get primary condition data
$geometry = $feature['geometry'];
$properties = $feature['properties'];
// Get secondary geometry data
$geometryType = $geometry['type'];
$coordinates = $geometry['coordinates'];
Model::updateOrCreate(
[
'id' => $id,
],
[
'coordinates' => $coordinates,
'geometry_type' => $geometryType,
]);
}
Most of what I'm doing to the data behind the scenes before it goes into the database is cleaning up some text strings, but there are a few pieces of logic to normalize or prep the data for websites and apps.
Is there a more efficient way to get the same result? This will ultimately be used in a scheduler and run on an interval.
Example Data structure from API documentation
{
"$schema": "http://json-schema.org/draft-04/schema#",
"additionalProperties": false,
"properties": {
"features": {
"items": {
"additionalProperties": false,
"properties": {
"attributes": {
"type": [
"object",
"null"
]
},
"geometry": {
"additionalProperties": false,
"properties": {
"coordinates": {
"items": {
"items": {
"type": "number"
},
"type": "array"
},
"type": "array"
},
"type": {
"type": "string"
}
},
"required": [
"coordinates",
"type"
],
"type": "object"
},
"properties": {
"additionalProperties": false,
"properties": {
"currentConditions": {
"items": {
"properties": {
"additionalData": {
"type": "string"
},
"conditionDescription": {
"type": "string"
},
"conditionId": {
"type": "integer"
},
"confirmationTime": {
"type": "integer"
},
"confirmationUserName": {
"type": "string"
},
"endTime": {
"type": "integer"
},
"id": {
"type": "integer"
},
"sourceType": {
"type": "string"
},
"startTime": {
"type": "integer"
},
"updateTime": {
"type": "integer"
}
},
"required": [
"id",
"userName",
"updateTime",
"startTime",
"conditionId",
"conditionDescription",
"confirmationUserName",
"confirmationTime",
"sourceType",
"endTime"
],
"type": "object"
},
"type": "array"
},
"id": {
"type": "string"
},
"name": {
"type": "string"
},
"nameId": {
"type": "string"
},
"parentAreaId": {
"type": "integer"
},
"parentSubAreaId": {
"type": "integer"
},
"primaryLatitude": {
"type": "number"
},
"primaryLongitude": {
"type": "number"
},
"primaryMP": {
"type": "number"
},
"routeId": {
"type": "integer"
},
"routeName": {
"type": "string"
},
"routeSegmentIndex": {
"type": "integer"
},
"secondaryLatitude": {
"type": "number"
},
"secondaryLongitude": {
"type": "number"
},
"secondaryMP": {
"type": "number"
},
"sortOrder": {
"type": "integer"
}
},
"required": [
"id",
"name",
"nameId",
"routeId",
"routeName",
"primaryMP",
"secondaryMP",
"primaryLatitude",
"primaryLongitude",
"secondaryLatitude",
"secondaryLongitude",
"sortOrder",
"parentAreaId",
"parentSubAreaId",
"routeSegmentIndex",
"currentConditions"
],
"type": "object"
},
"type": {
"type": "string"
}
},
"required": [
"type",
"geometry",
"properties",
"attributes"
],
"type": "object"
},
"type": "array"
},
"type": {
"type": "string"
}
},
"required": [
"type",
"features"
],
"type": "object"
}
Second, related question.
Since this is being updated on an interval I have it updating and creating records from the json data, but is there an efficient way to delete old records that are no longer in the json file? I currently get an array of current ids and compare them to the new ids and then loop through each and delete them. There has to be a better way.
I have no idea what to say to your first question, but regarding the second question I think you could try something like this:
SomeModel::query()->whereNotIn('id', $newIds)->delete();
You can collect $newIds during the first loop.
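For the first question, since the post already mentions moving to upsert(), one common approach is to build the rows in memory and write them in a few batched queries, collecting the IDs for the delete in the same pass. A minimal sketch, assuming Laravel 8+ (where upsert() is available) and that id is the table's unique key:
$json = json_decode($contents, true);

$rows = [];
$newIds = [];

foreach ($json['features'] as $feature) {
    // Collect everything in memory first instead of hitting the database per feature
    $newIds[] = $feature['id'];
    $rows[] = [
        'id' => $feature['id'],
        // Bulk upsert() won't run Eloquent casts/mutators, so encode nested arrays manually
        'coordinates' => json_encode($feature['geometry']['coordinates']),
        'geometry_type' => $feature['geometry']['type'],
    ];
}

// One query per chunk of 500 rows instead of one query per feature
foreach (array_chunk($rows, 500) as $chunk) {
    Model::upsert($chunk, ['id'], ['coordinates', 'geometry_type']);
}

// Remove rows that no longer appear in the feed
Model::whereNotIn('id', $newIds)->delete();
Whether this helps much depends on where the time actually goes, so it's worth timing the loop and the queries separately before and after.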

Oracle Cloud: How to fill json file for custom metrics

I am trying to send custom t2 telemetry metrics to Oracle Cloud. Using the command below, I am able to generate the param JSON file.
oci monitoring metric-data post --generate-param-json-input metric-data > metric-data.json
Below is the generated metric-data.json file:
[
{
"compartmentId": "string",
"datapoints": [
{
"count": 0,
"timestamp": "2017-01-01T00:00:00+00:00",
"value": 0.0
},
{
"count": 0,
"timestamp": "2017-01-01T00:00:00+00:00",
"value": 0.0
}
],
"dimensions": {
"string1": "string",
"string2": "string"
},
"metadata": {
"string1": "string",
"string2": "string"
},
"name": "string",
"namespace": "string",
"resourceGroup": "string"
},
{
"compartmentId": "string",
"datapoints": [
{
"count": 0,
"timestamp": "2017-01-01T00:00:00+00:00",
"value": 0.0
},
{
"count": 0,
"timestamp": "2017-01-01T00:00:00+00:00",
"value": 0.0
}
],
"dimensions": {
"string1": "string",
"string2": "string"
},
"metadata": {
"string1": "string",
"string2": "string"
},
"name": "string",
"namespace": "string",
"resourceGroup": "string"
}
]
My metrics requirement is as follows. I need to send the information below whenever any agent ID is late or missing.
MetricsName: [late/missing]
Hostname: somexyz.oraclecloud.com
agentid: asdfkjgsjdg723
category: custom/DB/Webserver
Region:
AD:
Information1:
Information2:
So my questions are:
How do I accommodate my information in the metric-data.json file?
In the Cloud, how do I visualise my data?
Do I need to register my service in the Cloud before sending it?
On Cloud how to visualise my data
Use Metrics Explorer in OCI Console
Do I need to register my service on cloud before sending it
No need to register
Sample data -
[
{
"namespace":"monitoring",
"compartmentId":"$compartmentID",
"resourceGroup":"gpu_0_monitoring",
"name":"gpuTemperature",
"dimensions":{
"resourceId":"$instanceOCID",
"instanceName":"$instanceName"
},
"metadata":{
"unit":"degrees Celcius",
"displayName":"GPU Temperature"
},
"datapoints":[
{
"timestamp":"2022-12-06T12:43:40Z",
"value":43
}
]
}
]
Save this data in the metric-data.json file.
The above is sample data that you post to the monitoring service:
oci monitoring metric-data post --metric-data file://metric-data.json
For visualisation you can refer to the document below:
https://docs.oracle.com/en-us/iaas/Content/Monitoring/Tasks/publishingcustommetrics.htm
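For the first question, attributes such as the hostname, agent ID, category, region and AD can go into dimensions, and free-form text such as Information1/Information2 into metadata. A sketch of how the requirement above might map onto the same format (assuming a hypothetical custom namespace agent_monitoring and a value of 1 per late/missing event):
[
  {
    "namespace": "agent_monitoring",
    "compartmentId": "$compartmentID",
    "name": "agentLateOrMissing",
    "dimensions": {
      "hostname": "somexyz.oraclecloud.com",
      "agentId": "asdfkjgsjdg723",
      "category": "DB",
      "region": "$region",
      "ad": "$availabilityDomain"
    },
    "metadata": {
      "information1": "free-form text",
      "information2": "free-form text"
    },
    "datapoints": [
      {
        "timestamp": "2022-12-06T12:43:40Z",
        "value": 1
      }
    ]
  }
]
It can then be posted with the same oci monitoring metric-data post command and charted in Metrics Explorer by filtering on those dimensions.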

Why is the Druid sum aggregator always getting zero?

We are trying to aggregate (sum) one of the columns, which has a string datatype, using the query below. I have two Druid servers with the same data: one runs Imply and the other is from an Ambari installation. On Imply it works and we get the expected output, but from the Ambari Druid I get zero as the output for the queries below. Here is my input Kafka spec for both (my Ambari Druid server and Imply as well):
{"type": "kafka",
"dataSchema": {"dataSource": "DRUID_RAIN","parser": {"type": "string", "parseSpec": { "format": "json", "timestampSpec": { "column": "DATE_TIME","format": "auto"},"flattenSpec": {"fields": [{ "type": "path","name": "deviceType","expr": "$.ENVIRONMENT.deviceType"},{ "type": "path","name": "NS","expr":"$.ENVIRONMENT.NS"},
{"type": "path","name": "latitude","expr": "$.ENVIRONMENT.latitude"},{ "type": "path","name": "TIME","expr": "$.ENVIRONMENT.TIME"},{ "type": "path","name": "tenantCode","expr": "$.ENVIRONMENT.tenantCode"},{ "type": "path","name": "deviceName","expr": "$.ENVIRONMENT.deviceName"},{ "type": "path","name": "MAC","expr": "$.ENVIRONMENT.MAC"},{ "type": "path","name": "DATE","expr": "$.ENVIRONMENT.DATE"},{ "type": "path","name": "RAIN","expr": "$.ENVIRONMENT.RAIN"},
{ "type": "path","name": "MESSAGE_ID","expr": "$.ENVIRONMENT.MESSAGE_ID"},{ "type": "path","name": "tenantId","expr": "$.ENVIRONMENT.tenantId"},{ "type": "path","name": "zoneId","expr": "$.ENVIRONMENT.zoneId"},{ "type": "path","name": "DATE_TIME","expr": "$.ENVIRONMENT.DATE_TIME"},{ "type": "path","name": "zoneName","expr": "$.ENVIRONMENT.zoneName" }, { "type": "path","name": "longitude","expr": "$.ENVIRONMENT.longitude"},{ "type": "path","name": "STATUS","expr": "$.ENVIRONMENT.STATUS"}]},"dimensionsSpec": {"dimensions":["deviceType","NS","latitude","TIME","tenantCode","deviceName","MAC","DATE","RAIN","MESSAGE_ID","tenantId","zoneId","DATE_TIME","zoneName","longitude","STATUS"]}}},"metricsSpec": [ ],"granularitySpec": { "type": "uniform", "segmentGranularity": "DAY", "queryGranularity": {"type": "none"},"rollup": true, "intervals": null},"transformSpec": { "filter": null, "transforms": []}},"tuningConfig": {"type": "kafka","maxRowsInMemory": 1000000,"maxBytesInMemory": 0,"maxRowsPerSegment": 5000000,"maxTotalRows": null,"intermediatePersistPeriod": "PT10M","maxPendingPersists": 0,"indexSpec": { "bitmap": { "type": "concise"}, "dimensionCompression": "lz4", "metricCompression": "lz4","longEncoding": "longs"},"buildV9Directly": true,"reportParseExceptions": false,"handoffConditionTimeout": 0,"resetOffsetAutomatically": false,"segmentWriteOutMediumFactory": null,"workerThreads": null,"chatThreads": null,"chatRetries": 8,"httpTimeout": "PT10S","shutdownTimeout": "PT80S","offsetFetchPeriod": "PT30S","intermediateHandoffPeriod": "P2147483647D","logParseExceptions": false,"maxParseExceptions": 2147483647,"maxSavedParseExceptions": 0,"skipSequenceNumberAvailabilityCheck": false},
"ioConfig": {"topic": "rain_out","replicas": 2,"taskCount": 1,"taskDuration": "PT5S","consumerProperties": { "bootstrap.servers": "XXXX:6667,XXXX:6667,XXXX:6667"},"pollTimeout": 100,"startDelay": "PT10S","period": "PT30S","useEarliestOffset": true,"completionTimeout": "PT20S","lateMessageRejectionPeriod": null,"earlyMessageRejectionPeriod": null,"stream": "env_out","useEarliestSequenceNumber": true,"type": "kafka"},
"context": null,
"suspended": false}
And below is the query I have written:
{
"queryType": "groupBy",
"dataSource": "DRUID_RAIN",
"granularity": "hour",
"dimensions": [
"zoneName",
"deviceName"
],
"limitSpec": {
"type": "default",
"limit": 5000,
"columns": [
"zoneName",
"deviceName"
]
},
"aggregations": [
{
"type": "doubleSum",
"name": "RAIN",
"fieldName": "RAIN"
}
],
"intervals": [
"2020-10-27T18:30:00.000/2020-10-28T18:30:00.000"
],
"context": {
"skipEmptyBuckets": "true"
}
}
The output I am getting for the RAIN column is always '0'. I am not getting the expected sum from Postman or from the Druid command line, but using Imply I get the exact expected output for the same query.

Nifi JoltTransformRecord UUID in default transform not working as expected

I have a NiFi workflow which uses JoltTransformRecord to do some manipulation of record-based data. I have to add a default UUID value to each message in the flow file.
My JoltTransformRecord configuration is as below.
Jolt specification:
[{
"operation": "shift",
"spec": {
"payload": "data.payload"
}
}, {
"operation": "default",
"spec": {
"header": {
"source": "${source}",
"client_id": "${client_id}",
"uuid": "${UUID()}",
"payload_type":"${payload_type}"
}
}
}]
The shift operation and all the other default operations are working fine as expected, but the UUID is coming out the same for all the messages. I need a different UUID for each message, and I don't want to add another processor just for this purpose.
My workflow is below:
Reader & Writer configurations for the JoltTransformRecord processor are:
IngestionSchemaJsonTreeReader (from JsonTreeReader processor):
IngestionSchemaAvroRecordSetWriter (from AvroWriter processor):
The configured schema registry has the schemas below defined in it.
com.xyz.ingestion.pre_json
{
"type": "record",
"name": "event",
"namespace": "com.xyz.ingestion.raw",
"doc": "Event ingested to kafka",
"fields": [
{
"name": "payload",
"type": [
"null",
"string"
],
"default": "null"
}
]
}
com.xyz.ingestion.raw -
{
"type": "record",
"name": "event",
"namespace": "com.xyz.ingestion.raw",
"doc": "Event ingested to kafka",
"fields": [
{
"type": {
"name": "header",
"type": "record",
"namespace": "com.xyz.ingestion.raw.header",
"doc": "Header data for event ingested",
"fields": [
{
"name": "payload_type",
"type": "string"
},
{
"name": "uuid",
"type": "string",
"size": "36"
},
{
"name": "client_id",
"type": "string"
},
{
"name": "source",
"type": "string"
}
]
},
"name": "header"
},
{
"type": {
"name": "data",
"type": "record",
"namespace": "com.xyz.ingestion.raw.data",
"doc": "Payload for event ingested",
"fields": [
{
"name": "payload",
"type": [
"null",
"string"
],
"default": "null"
}
]
},
"name": "data"
}
]
}
The expression language is evaluated per record, and UUID() is executed for each evaluation, so the uuid should be unique for each record. From the information you provided I cannot see why you are getting duplicate uuids.
I tried to reproduce your problem with the following flow:
GenerateFlowFile:
SplitJson: configure $ as the JsonPath Expression to split the JSON array into records.
JoltTransformRecord:
As you can see, the way I am adding the UUID is no different from how you do it, but I am getting different UUIDs as expected:

How to search full-width Japanese numbers in Couchbase

I'm getting an error when I try a full text search for the full-width number "9" in Couchbase 6.0.3. The exception thrown is: err: bleve: QueryBleve validating request, err: parse error: error parsing number: strconv.ParseFloat: parsing.
If I search with a string such as "9abc", the search succeeds, so I think the Couchbase search library recognizes "9" as a number and fails to parse it. I don't know how to resolve the problem. Please help me!
Couchbase 6.0.3
ConjunctionQuery fts = SearchQuery.conjuncts(SearchQuery.queryString(source));
fts = fts.and(SearchQuery.matchPhrase("123").field("tm"));
fts = fts.and(SearchQuery.booleanField(true).field("active"));
SearchQuery query = new SearchQuery("segmentIndex", fts);
SearchQueryResult result = bucket.query(query);
The exception thrown is: err: bleve: QueryBleve validating request, err: parse error: error parsing number: strconv.ParseFloat: parsing. My index definition is:
{
"name": "tmSegmentIndex",
"type": "fulltext-index",
"params": {
"doc_config": {
"docid_prefix_delim": "",
"docid_regexp": "",
"mode": "type_field",
"type_field": "type"
},
"mapping": {
"analysis": {
"analyzers": {
"remove_fullsize_number": {
"char_filters": [
"remove_fullsize_number"
],
"token_filters": [
"cjk_bigram",
"cjk_width"
],
"tokenizer": "whitespace",
"type": "custom"
}
},
"char_filters": {
"remove_fullsize_number": {
"regexp": "9",
"replace": "9",
"type": "regexp"
}
}
},
"default_analyzer": "cjk",
"default_datetime_parser": "dateTimeOptional",
"default_field": "_all",
"default_mapping": {
"default_analyzer": "cjk",
"dynamic": true,
"enabled": true
},
"default_type": "_default",
"docvalues_dynamic": true,
"index_dynamic": true,
"store_dynamic": false,
"type_field": "_type"
},
"store": {
"indexType": "scorch",
"kvStoreName": "mossStore"
}
},
"sourceType": "couchbase",
"sourceName": "tm-segment",
"sourceUUID": "973fdbffc567cdfe8f423289b9700f19",
"sourceParams": {},
"planParams": {
"maxPartitionsPerPIndex": 171,
"numReplicas": 0
},
"uuid": "1265a6bedbfd027c"
}
Can you try a custom analyser with an asciifolding character filter like the one below?
Also, when you search directly from the UI box without a field name, the search goes against the "_all" field, which won't use the right/intended analyser for parsing the query text.
You may field-scope the query there, like => field:"9"
{
"type": "fulltext-index",
"name": "FTS",
"uuid": "401ee8132818cee3",
"sourceType": "couchbase",
"sourceName": "sample",
"sourceUUID": "6bd6d0b1c714fcd7697a349ff8166bf8",
"planParams": {
"maxPartitionsPerPIndex": 171,
"indexPartitions": 6
},
"params": {
"doc_config": {
"docid_prefix_delim": "",
"docid_regexp": "",
"mode": "type_field",
"type_field": "type"
},
"mapping": {
"analysis": {
"analyzers": {
"custom": {
"char_filters": [
"asciifolding"
],
"tokenizer": "unicode",
"type": "custom"
}
}
},
"default_analyzer": "standard",
"default_datetime_parser": "dateTimeOptional",
"default_field": "_all",
"default_mapping": {
"dynamic": false,
"enabled": true,
"properties": {
"id": {
"dynamic": false,
"enabled": true,
"fields": [
{
"analyzer": "custom",
"docvalues": true,
"include_in_all": true,
"include_term_vectors": true,
"index": true,
"name": "id",
"type": "text"
}
]
}
}
},
"default_type": "_default",
"docvalues_dynamic": true,
"index_dynamic": true,
"store_dynamic": false,
"type_field": "_type"
},
"store": {
"indexType": "scorch"
}
},
"sourceParams": {}
}
Asciifolding filters are part of the Couchbase 6.5.0 release. They are available in beta for trials.
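On the SDK side, the same idea can be applied by scoping the text to a field with a match query instead of sending it through queryString; a sketch, assuming the Java SDK 2.x API already used in the question and the tm field:
// Scoped match query: the text is analysed with the field's analyzer instead of
// going through the query-string parser, which tries to parse "9" as a number.
ConjunctionQuery fts = SearchQuery.conjuncts(SearchQuery.match(source).field("tm"));
fts = fts.and(SearchQuery.booleanField(true).field("active"));
SearchQueryResult result = bucket.query(new SearchQuery("segmentIndex", fts));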
