I am using Confluent and kafka-connect-oracle (https://github.com/erdemcer/kafka-connect-oracle) to track changes in an Oracle Database 11g XE instance, and I can see the schema content via the Schema Registry API, e.g. "curl -X GET http://localhost:8081/schemas/ids/44":
{"subject":"TEST.KAFKAUSER.TEST-value","version":1,"id":44,"schema":"{"type":"record","name":"row","namespace":"test.kafkauser.test","fields":[{"name":"SCN","type":"long"},{"name":"SEG_OWNER","type":"string"},{"name":"TABLE_NAME","type":"string"},{"name":"TIMESTAMP","type":{"type":"long","connect.version":1,"connect.name":"org.apache.kafka.connect.data.Timestamp","logicalType":"timestamp-millis"}},{"name":"SQL_REDO","type":"string"},{"name":"OPERATION","type":"string"},{"name":"data","type":["null",{"type":"record","name":"value","namespace":"","fields":[{"name":"ID","type":["null","double"],"default":null},{"name":"NAME","type":["null","string"],"default":null}],"connect.name":"value"}],"default":null},{"name":"before","type":["null","value"],"default":null}],"connect.name":"test.kafkauser.test.row"}","deleted":false}
However, this schema cannot be parsed by Confluent's Schema Registry client in Python:
from confluent.schemaregistry.client import CachedSchemaRegistryClient
from confluent.schemaregistry.serializers import MessageSerializer

schemaRegistryClientURL = "http://localhost:8081"
schema_registry_client = CachedSchemaRegistryClient(url=schemaRegistryClientURL)
schema_registry_client.get_by_id(44)
I get the following error:
Traceback (most recent call last):
File "", line 1, in
File "build/bdist.linux-x86_64/egg/confluent/schemaregistry/client/CachedSchemaRegistryClient.py", line 140, in get_by_id
confluent.schemaregistry.client.ClientError: Received bad schema from registry.
Does kafka-connect-oracle send an invalid schema to the Schema Registry? How can I get this schema into the proper format?
Thanks.
It looks like there is a problem with your schema: a JSON formatter says it is in an invalid format. You can check whether your JSON is formatted correctly here: https://jsonformatter.curiousconcept.com/#
Looking at it, I see there are two extra quote marks:
The first one is in the first row, after "schema":
The second one is in the last row, between test.row"} and ,"deleted":false}
After deleting these two, the payload is valid. If you are asking for a way to do this automatically, I don't know of one; maybe you can look for some Python code that validates and fixes the JSON format (a minimal sketch follows the corrected payload below).
This is the valid format:
{
"subject":"TEST.KAFKAUSER.TEST-value",
"version":1,
"id":44,
"schema":{
"type":"record",
"name":"row",
"namespace":"test.kafkauser.test",
"fields":[
{
"name":"SCN",
"type":"long"
},
{
"name":"SEG_OWNER",
"type":"string"
},
{
"name":"TABLE_NAME",
"type":"string"
},
{
"name":"TIMESTAMP",
"type":{
"type":"long",
"connect.version":1,
"connect.name":"org.apache.kafka.connect.data.Timestamp",
"logicalType":"timestamp-millis"
}
},
{
"name":"SQL_REDO",
"type":"string"
},
{
"name":"OPERATION",
"type":"string"
},
{
"name":"data",
"type":[
"null",
{
"type":"record",
"name":"value",
"namespace":"",
"fields":[
{
"name":"ID",
"type":[
"null",
"double"
],
"default":null
},
{
"name":"NAME",
"type":[
"null",
"string"
],
"default":null
}
],
"connect.name":"value"
}
],
"default":null
},
{
"name":"before",
"type":[
"null",
"value"
],
"default":null
}
],
"connect.name":"test.kafkauser.test.row"
},
"deleted":false
}
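If you want to check such a payload programmatically, here is a minimal sketch using only the Python standard library; it only validates the JSON, it does not fix it:

import json

def is_valid_json(text):
    """Return True if text parses as JSON, False otherwise."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

# The unescaped inner quotes make the original payload invalid JSON,
# while the corrected form parses fine.
print(is_valid_json('{"schema":"{"type":"record"}"}'))  # False
print(is_valid_json('{"schema":{"type":"record"}}'))    # True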
I am trying to load records from a Kafka topic into Elasticsearch using the Elasticsearch Sink Connector, but I'm struggling to construct the document IDs the way I would like. I would like the document ID that is written to Elasticsearch to be a composition of two values from my Kafka topic's message, separated by an underscore.
For example:
My Kafka topic value has the following Avro schema:
{
"type": "record",
"name": "SampleValue",
"namespace": "com.abc.test",
"fields": [
{
"name": "value1",
"type": [
"null",
{
"type": "int",
"java-class": "java.lang.Integer"
}
],
"default": null
},
{
"name": "value2",
"type": [
"null",
{
"type": "int",
"java-class": "java.lang.Integer"
}
],
"default": null
},
{
"name": "otherValue",
"type": [
"null",
{
"type": "int",
"java-class": "java.lang.Integer"
}
],
"default": null
}
]
}
I would like the document ID that is written to Elasticsearch to be the combined values of value1 and value2, separated by an underscore. If the given Avro value looked like
{"value1": {"int": 123}, "value2": {"int": 456}, "otherValue": {"int": 0}}
then I would like the document ID for Elasticsearch to be 123_456.
I can't figure out the correct way to chain transformations in my connector config to create a key that is composed of two values separated by an underscore.
I don't think there is a Single Message Transform out of the box that will do what you want.
You can either write your own, using the Transform API, or you can use a stream processor such as Kafka Streams or ksqlDB.
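If you go the stream-processor route, a rough sketch of the re-keying idea in Python with confluent-kafka is below. JSON payloads and the topic names are assumptions for brevity; with Avro you would plug in the Schema Registry deserializer instead. Once the topic is re-keyed, the Elasticsearch sink configured with key.ignore=false uses the record key as the document ID.

import json
from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "rekey-sample",
    "auto.offset.reset": "earliest",
})
producer = Producer({"bootstrap.servers": "localhost:9092"})

consumer.subscribe(["source-topic"])  # hypothetical topic name

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    record = json.loads(msg.value())
    # Compose the key "value1_value2", e.g. "123_456"
    key = f"{record['value1']}_{record['value2']}"
    producer.produce("rekeyed-topic", key=key, value=msg.value())
    producer.poll(0)  # serve delivery callbacks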
I have ECK set up and I'm using Filebeat to ship logs from Kubernetes to Elasticsearch.
I've recently added the decode_json_fields processor to my configuration, so that I'm able to decode the JSON that is usually in the message field:
- decode_json_fields:
    fields: ["message"]
    process_array: false
    max_depth: 10
    target: "log"
    overwrite_keys: true
    add_error_key: true
However, logs have stopped appearing since I added it.
Example log:
{
"_index": "filebeat-7.9.1-2020.10.01-000001",
"_type": "_doc",
"_id": "wF9hB3UBtUOF3QRTBcts",
"_score": 1,
"_source": {
"#timestamp": "2020-10-08T08:43:18.672Z",
"kubernetes": {
"labels": {
"controller-uid": "9f3f9d08-cfd8-454d-954d-24464172fa37",
"job-name": "stream-hatchet-cron-manual-rvd"
},
"container": {
"name": "stream-hatchet-cron",
"image": "<redacted>.dkr.ecr.us-east-2.amazonaws.com/stream-hatchet:v0.1.4"
},
"node": {
"name": "ip-172-20-32-60.us-east-2.compute.internal"
},
"pod": {
"uid": "041cb6d5-5da1-4efa-b8e9-d4120409af4b",
"name": "stream-hatchet-cron-manual-rvd-bh96h"
},
"namespace": "default"
},
"ecs": {
"version": "1.5.0"
},
"host": {
"mac": [],
"hostname": "ip-172-20-32-60",
"architecture": "x86_64",
"name": "ip-172-20-32-60",
"os": {
"codename": "Core",
"platform": "centos",
"version": "7 (Core)",
"family": "redhat",
"name": "CentOS Linux",
"kernel": "4.9.0-11-amd64"
},
"containerized": false,
"ip": []
},
"cloud": {
"instance": {
"id": "i-06c9d23210956ca5c"
},
"machine": {
"type": "m5.large"
},
"region": "us-east-2",
"availability_zone": "us-east-2a",
"account": {
"id": "<redacted>"
},
"image": {
"id": "ami-09d3627b4a09f6c4c"
},
"provider": "aws"
},
"stream": "stdout",
"message": "{\"message\":{\"log_type\":\"cron\",\"status\":\"start\"},\"level\":\"info\",\"timestamp\":\"2020-10-08T08:43:18.670Z\"}",
"input": {
"type": "container"
},
"log": {
"offset": 348,
"file": {
"path": "/var/log/containers/stream-hatchet-cron-manual-rvd-bh96h_default_stream-hatchet-cron-73069980b418e2aa5e5dcfaf1a29839a6d57e697c5072fea4d6e279da0c4e6ba.log"
}
},
"agent": {
"type": "filebeat",
"version": "7.9.1",
"hostname": "ip-172-20-32-60",
"ephemeral_id": "6b3ba0bd-af7f-4946-b9c5-74f0f3e526b1",
"id": "0f7fff14-6b51-45fc-8f41-34bd04dc0bce",
"name": "ip-172-20-32-60"
}
},
"fields": {
"#timestamp": [
"2020-10-08T08:43:18.672Z"
],
"suricata.eve.timestamp": [
"2020-10-08T08:43:18.672Z"
]
}
}
In the Filebeat logs I can see the following error:
2020-10-08T09:25:43.562Z WARN [elasticsearch] elasticsearch/client.go:407 Cannot
index event
publisher.Event{Content:beat.Event{Timestamp:time.Time{wall:0x36b243a0,
ext:63737745936, loc:(*time.Location)(nil)}, Meta:null,
Fields:{"agent":{"ephemeral_id":"5f8afdba-39c3-4fb7-9502-be7ef8f2d982","hostname":"ip-172-20-32-60","id":"0f7fff14-6b51-45fc-8f41-34bd04dc0bce","name":"ip-172-20-32-60","type":"filebeat","version":"7.9.1"},"cloud":{"account":{"id":"700849607999"},"availability_zone":"us-east-2a","image":{"id":"ami-09d3627b4a09f6c4c"},"instance":{"id":"i-06c9d23210956ca5c"},"machine":{"type":"m5.large"},"provider":"aws","region":"us-east-2"},"ecs":{"version":"1.5.0"},"host":{"architecture":"x86_64","containerized":false,"hostname":"ip-172-20-32-60","ip":["172.20.32.60","fe80::af:9fff:febe:dc4","172.17.0.1","100.96.1.1","fe80::6010:94ff:fe17:fbae","fe80::d869:14ff:feb0:81b3","fe80::e4f3:b9ff:fed8:e266","fe80::1c19:bcff:feb3:ce95","fe80::fc68:21ff:fe08:7f24","fe80::1cc2:daff:fe84:2a5a","fe80::3426:78ff:fe22:269a","fe80::b871:52ff:fe15:10ab","fe80::54ff:cbff:fec0:f0f","fe80::cca6:42ff:fe82:53fd","fe80::bc85:e2ff:fe5f:a60d","fe80::e05e:b2ff:fe4d:a9a0","fe80::43a:dcff:fe6a:2307","fe80::581b:20ff:fe5f:b060","fe80::4056:29ff:fe07:edf5","fe80::c8a0:5aff:febd:a1a3","fe80::74e3:feff:fe45:d9d4","fe80::9c91:5cff:fee2:c0b9"],"mac":["02:af:9f:be:0d:c4","02:42:1b:56:ee:d3","62:10:94:17:fb:ae","da:69:14:b0:81:b3","e6:f3:b9:d8:e2:66","1e:19:bc:b3:ce:95","fe:68:21:08:7f:24","1e:c2:da:84:2a:5a","36:26:78:22:26:9a","ba:71:52:15:10:ab","56:ff:cb:c0:0f:0f","ce:a6:42:82:53:fd","be:85:e2:5f:a6:0d","e2:5e:b2:4d:a9:a0","06:3a:dc:6a:23:07","5a:1b:20:5f:b0:60","42:56:29:07:ed:f5","ca:a0:5a:bd:a1:a3","76:e3:fe:45:d9:d4","9e:91:5c:e2:c0:b9"],"name":"ip-172-20-32-60","os":{"codename":"Core","family":"redhat","kernel":"4.9.0-11-amd64","name":"CentOS
Linux","platform":"centos","version":"7
(Core)"}},"input":{"type":"container"},"kubernetes":{"container":{"image":"700849607999.dkr.ecr.us-east-2.amazonaws.com/stream-hatchet:v0.1.4","name":"stream-hatchet-cron"},"labels":{"controller-uid":"a79daeac-b159-4ba7-8cb0-48afbfc0711a","job-name":"stream-hatchet-cron-manual-c5r"},"namespace":"default","node":{"name":"ip-172-20-32-60.us-east-2.compute.internal"},"pod":{"name":"stream-hatchet-cron-manual-c5r-7cx5d","uid":"3251cc33-48a9-42b1-9359-9f6e345f75b6"}},"log":{"level":"info","message":{"log_type":"cron","status":"start"},"timestamp":"2020-10-08T09:25:36.916Z"},"message":"{"message":{"log_type":"cron","status":"start"},"level":"info","timestamp":"2020-10-08T09:25:36.916Z"}","stream":"stdout"},
Private:file.State{Id:"native::30998361-66306", PrevId:"",
Finished:false, Fileinfo:(*os.fileStat)(0xc001c14dd0),
Source:"/var/log/containers/stream-hatchet-cron-manual-c5r-7cx5d_default_stream-hatchet-cron-4278d956fff8641048efeaec23b383b41f2662773602c3a7daffe7c30f62fe5a.log",
Offset:539, Timestamp:time.Time{wall:0xbfd7d4a1e556bd72,
ext:916563812286, loc:(*time.Location)(0x607c540)}, TTL:-1,
Type:"container", Meta:map[string]string(nil),
FileStateOS:file.StateOS{Inode:0x1d8ff59, Device:0x10302},
IdentifierName:"native"}, TimeSeries:false}, Flags:0x1,
Cache:publisher.EventCache{m:common.MapStr(nil)}} (status=400):
{"type":"mapper_parsing_exception","reason":"failed to parse field
[log.message] of type [keyword] in document with id
'56aHB3UBLgYb8gz801DI'. Preview of field's value: '{log_type=cron,
status=start}'","caused_by":{"type":"illegal_state_exception","reason":"Can't
get text on a START_OBJECT at 1:113"}}
It throws an error because apparently log.message is of type "keyword"; however, this field does not exist in the index mapping.
I thought this might be an issue with "target": "log", so I've tried changing it to something arbitrary like "my_parsed_message", "m_log", or "mlog", and I get the same error for all of them.
{"type":"mapper_parsing_exception","reason":"failed to parse field
[mlog.message] of type [keyword] in document with id
'J5KlDHUB_yo5bfXcn2LE'. Preview of field's value: '{log_type=cron,
status=end}'","caused_by":{"type":"illegal_state_exception","reason":"Can't
get text on a START_OBJECT at 1:217"}}
Elastic version: 7.9.2
The problem is that some of your JSON messages contain a message field that is sometimes a simple string and other times a nested JSON object (like in the case you're showing in your question).
After this index was created, the very first message that was parsed was probably a string, and hence the mapping was modified to add the following field (line 10553):
"mlog": {
"properties": {
...
"message": {
"type": "keyword",
"ignore_above": 1024
},
}
}
You'll find the same pattern for my_parsed_message (line 10902), my_parsed_logs (line 10742), etc...
Hence, the next message that arrives with message being a JSON object, like
{"message":{"log_type":"cron","status":"start"}, ...
will not work, because it's an object, not a string...
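You can reproduce the conflict outside of Filebeat with a couple of lines of Python (elasticsearch-py 7.x assumed, index name hypothetical):

from elasticsearch import Elasticsearch
from elasticsearch.exceptions import RequestError

es = Elasticsearch("http://localhost:9200")

# First document: "message" is a plain string, so ES maps it dynamically.
es.index(index="mapping-demo", body={"message": "cron finished"})

# Second document: "message" is now an object, which triggers the same kind
# of mapper_parsing_exception that shows up in the Filebeat logs above.
try:
    es.index(index="mapping-demo", body={"message": {"log_type": "cron", "status": "start"}})
except RequestError as e:
    print(e.info["error"]["type"])  # mapper_parsing_exception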
Looking at the fields of your custom JSON, it seems you don't really have control over either their taxonomy (i.e. naming) or their contents...
If you're serious about searching within those custom fields (which I think you are, since you're parsing the field; otherwise you'd just store the stringified JSON), then I can only suggest figuring out a proper taxonomy in order to make sure that they all get a standard type.
If all you care about is logging your data, then I suggest simply disabling the indexing of that message field. Another solution is to set dynamic: false in your mapping to ignore those fields, i.e. not modify your mapping.
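For the "disable indexing of that message field" option, here is a minimal sketch with elasticsearch-py, assuming a fresh index (an existing field's mapping cannot be changed in place; in practice this would go into the Filebeat index template so every rolled-over index picks it up):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# "enabled": false keeps log.message in _source but skips indexing it entirely,
# so its sub-fields never enter the mapping.
es.indices.create(
    index="filebeat-custom-000001",  # hypothetical index name
    body={
        "mappings": {
            "properties": {
                "log": {
                    "properties": {
                        "message": {"type": "object", "enabled": False}
                    }
                }
            }
        }
    },
)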
I'm attempting to migrate CSV data from S3 to DynamoDB using Data Pipeline. The data is not in a DynamoDB export format but instead in a normal CSV.
I understand that Data Pipeline is more typically used for importing or exporting the DynamoDB export format rather than standard CSV. I think I've read in my Googling that it is possible to use a normal file, but I haven't been able to put together something that works. The AWS documentation hasn't been terribly helpful either, and I haven't been able to find reference posts that are relatively recent (< 2 years old).
If this is possible, can anyone provide some insight into why my pipeline may not be working? I've pasted the pipeline and error message below. The error seems to indicate an issue plugging data into DynamoDB, I'm guessing because it's not in the export format.
I'd do it in Lambda but the data load takes longer than 15 minutes.
Thanks
{
"objects": [
{
"myComment": "Activity used to run the hive script to import CSV data",
"output": {
"ref": "dynamoDataTable"
},
"input": {
"ref": "s3csv"
},
"name": "S3toDynamoLoader",
"hiveScript": "DROP TABLE IF EXISTS tempHiveTable;\n\nDROP TABLE IF EXISTS s3TempTable;\n\nCREATE EXTERNAL TABLE tempHiveTable (#{myDDBColDef}) \nSTORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler' \nTBLPROPERTIES (\"dynamodb.table.name\" = \"#{myDDBTableName}\", \"dynamodb.column.mapping\" = \"#{myDDBTableColMapping}\");\n \nCREATE EXTERNAL TABLE s3TempTable (#{myS3ColDef}) \nROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\\n' LOCATION '#{myInputS3Loc}';\n \nINSERT OVERWRITE TABLE tempHiveTable SELECT * FROM s3TempTable;",
"id": "S3toDynamoLoader",
"runsOn": { "ref": "EmrCluster" },
"stage": "false",
"type": "HiveActivity"
},
{
"myComment": "The DynamoDB table that we are uploading to",
"name": "DynamoDB",
"id": "dynamoDataTable",
"type": "DynamoDBDataNode",
"tableName": "#{myDDBTableName}",
"writeThroughputPercent": "1.0",
"dataFormat": {
"ref": "DDBTableFormat"
}
},
{
"failureAndRerunMode": "CASCADE",
"resourceRole": "DataPipelineDefaultResourceRole",
"role": "DataPipelineDefaultRole",
"pipelineLogUri": "#{myLogUri}",
"scheduleType": "ONDEMAND",
"name": "Default",
"id": "Default"
},
{
"name": "EmrCluster",
"coreInstanceType": "m1.medium",
"coreInstanceCount": "1",
"masterInstanceType": "m1.medium",
"releaseLabel": "emr-5.29.0",
"id": "EmrCluster",
"type": "EmrCluster",
"terminateAfter": "2 Hours"
},
{
"myComment": "The S3 file that contains the data we're importing",
"directoryPath": "#{myInputS3Loc}",
"dataFormat": {
"ref": "csvFormat"
},
"name": "S3DataNode",
"id": "s3csv",
"type": "S3DataNode"
},
{
"myComment": "Format for the S3 Path",
"name": "S3ExportFormat",
"column": "not_used STRING",
"id": "csvFormat",
"type": "CSV"
},
{
"myComment": "Format for the DynamoDB table",
"name": "DDBTableFormat",
"id": "DDBTableFormat",
"column": "not_used STRING",
"type": "DynamoDBExportDataFormat"
}
],
"parameters": [
{
"description": "S3 Column Mappings",
"id": "myS3ColDef",
"default": "phoneNumber string,firstName string,lastName string, spend double",
"type": "String"
},
{
"description": "DynamoDB Column Mappings",
"id": "myDDBColDef",
"default": "phoneNumber String,firstName String,lastName String, spend double",
"type": "String"
},
{
"description": "Input S3 foder",
"id": "myInputS3Loc",
"default": "s3://POCproject-dev1-data/upload/",
"type": "AWS::S3::ObjectKey"
},
{
"description": "DynamoDB table name",
"id": "myDDBTableName",
"default": "POCproject-pipeline-data",
"type": "String"
},
{
"description": "S3 to DynamoDB Column Mapping",
"id": "myDDBTableColMapping",
"default": "phoneNumber:phoneNumber,firstName:firstName,lastName:lastName,spend:spend",
"type": "String"
},
{
"description": "DataPipeline Log Uri",
"id": "myLogUri",
"default": "s3://POCproject-dev1-data/",
"type": "AWS::S3::ObjectKey"
}
]
}
Error
[INFO] (TaskRunnerService-df-09432511OLZUA8VN0NLE_#EmrCluster_2020-03-06T02:52:47-0) df-09432511OLZUA8VN0NLE amazonaws.datapipeline.taskrunner.LogMessageUtil: Returning tail errorMsg :Caused by: java.lang.RuntimeException: com.amazonaws.services.dynamodbv2.model.AmazonDynamoDBException: One or more parameter values were invalid: An AttributeValue may not contain an empty string (Service: AmazonDynamoDBv2; Status Code: 400; Error Code: ValidationException; Request ID: UM56KGVOU511P6LS7LP1N0Q4HRVV4KQNSO5AEMVJF66Q9ASUAAJG)
at org.apache.hadoop.dynamodb.DynamoDBFibonacciRetryer.handleException(DynamoDBFibonacciRetryer.java:108)
at org.apache.hadoop.dynamodb.DynamoDBFibonacciRetryer.runWithRetry(DynamoDBFibonacciRetryer.java:83)
at org.apache.hadoop.dynamodb.DynamoDBClient.writeBatch(DynamoDBClient.java:258)
at org.apache.hadoop.dynamodb.DynamoDBClient.putBatch(DynamoDBClient.java:215)
at org.apache.hadoop.dynamodb.write.AbstractDynamoDBRecordWriter.write(AbstractDynamoDBRecordWriter.java:112)
at org.apache.hadoop.hive.dynamodb.write.HiveDynamoDBRecordWriter.write(HiveDynamoDBRecordWriter.java:42)
at org.apache.hadoop.hive.ql.exec.FileSinkOperator.process(FileSinkOperator.java:762)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:897)
at org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:95)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:897)
at org.apache.hadoop.hive.ql.exec.TableScanOperator.process(TableScanOperator.java:130)
at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.forward(MapOperator.java:148)
at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:550)
... 18 more
Caused by: com.amazonaws.services.dynamodbv2.model.AmazonDynamoDBException: One or more parameter values were invalid: An AttributeValue may not contain an empty string (Service: AmazonDynamoDBv2; Status Code: 400; Error Code: ValidationException; Request ID: UM56KGVOU511P6LS7LP1N0Q4HRVV4KQNSO5AEMVJF66Q9ASUAAJG)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1712)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1367)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1113)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:770)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:744)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:726)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:686)
at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:668)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:532)
Have you tried this sample yet? It uses Hive to import the CSV file into a DynamoDB table:
https://github.com/aws-samples/data-pipeline-samples/tree/master/samples/DynamoDBImportCSV
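Since the error above is DynamoDB rejecting empty-string attribute values, it can also help to pre-check the CSV before kicking off the pipeline. A minimal sketch (the file name and column list are assumptions matching the column mapping parameters):

import csv

columns = ["phoneNumber", "firstName", "lastName", "spend"]

# Flag every empty field; the ValidationException above is DynamoDB refusing
# an empty string as an attribute value.
with open("upload.csv", newline="") as f:
    for line_no, row in enumerate(csv.reader(f), start=1):
        for col, value in zip(columns, row):
            if value.strip() == "":
                print(f"line {line_no}: empty value in column {col}")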
I am currently getting some data in CSV format that has one field which is a string encoding of a JSON array, like so:
CED7B5D9-0378-4A37-B746-D6ED7BB35593,"[{\"a\":1},{\"a\":2}]"
D000C576-112C-45BE-BA0F-5DB0E8AF409E,"[{\"a\":3}]"
With millions of lines per file, I'll want to use only Record-based processors. I would like to parse it with the following avro schema:
{
"type": "record",
"name": "test",
"fields": [
{"name": "id", "type": "string"},
{
"name": "json_array",
"type": {
"type": "array",
"items": {
"type": "record",
"name": "array_item",
"fields": [
{"name": "a", "type": "int"}
]
}
}
}
]
}
But attempting to parse this file with ConvertRecord gives the error Cannot convert [[{"a":1},{"a":2}]] of type class java.lang.String to Object Array...
I think I want to use an UpdateRecord processor to parse the string as an object array, but I'm not sure which Expression Language or RecordPath function to use. Any suggestions?
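For reference, here is the transformation being asked for expressed in plain Python, just to illustrate the target structure (it is not a record-based NiFi answer). The escaped quotes mean the CSV reader needs a backslash escape character:

import csv
import json
from io import StringIO

sample = 'CED7B5D9-0378-4A37-B746-D6ED7BB35593,"[{\\"a\\":1},{\\"a\\":2}]"'

# escapechar='\\' handles the backslash-escaped quotes inside the quoted field,
# after which the second column is ordinary JSON.
for row in csv.reader(StringIO(sample), escapechar="\\"):
    record = {"id": row[0], "json_array": json.loads(row[1])}
    print(record)  # {'id': 'CED7B5D9-...', 'json_array': [{'a': 1}, {'a': 2}]}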