Loading Numeric data into BigQuery with Avro files created with goavro - go

I am trying to figure out how to load dollar values into a NUMERIC column in BigQuery using an Avro file. I am using Go and the goavro package to generate the Avro file.
It appears that the appropriate data type in Go for handling money is big.Rat.
The BigQuery documentation indicates it should be possible to use Avro's decimal logical type for this.
I can see from a few goavro test cases that encoding a *big.Rat into a fixed.decimal type is possible.
I am using a goavro.OCFWriter to encode data with the following simple Avro schema:
{
  "type": "record",
  "name": "MyData",
  "fields": [
    {
      "name": "ID",
      "type": [
        "string"
      ]
    },
    {
      "name": "Cost",
      "type": [
        "null",
        {
          "type": "fixed",
          "size": 12,
          "logicalType": "decimal",
          "precision": 4,
          "scale": 2
        }
      ]
    }
  ]
}
I am attempting to Append data with the "Cost" field as follows:
map[string]interface{}{"fixed.decimal": big.NewRat(617, 50)}
This encodes successfully, but the resulting Avro file fails to load into BigQuery:
Err: load Table MyTable Job: {Location: ""; Message: "Error while reading data, error message: The Apache Avro library failed to parse the header with the following error: Missing Json field \"name\": {\"logicalType\":\"decimal\",\"precision\":4,\"scale\":2,\"size\":12,\"type\":\"fixed\"}"; Reason: "invalid"}
So I am doing something wrong here... Hoping someone can point me in the right direction.

I figured it out. I need to use bytes.decimal instead of fixed.decimal:
{
  "type": "record",
  "name": "MyData",
  "fields": [
    {
      "name": "ID",
      "type": [
        "string"
      ]
    },
    {
      "name": "Cost",
      "type": [
        "null",
        {
          "type": "bytes",
          "logicalType": "decimal",
          "precision": 4,
          "scale": 2
        }
      ]
    }
  ]
}
Then encode the value similarly:
map[string]interface{}{"bytes.decimal": big.NewRat(617, 50)}
And it works nicely!
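For completeness, here is a minimal end-to-end sketch of the working approach. It is only a sketch: the output path and record values are made up, and it relies on the goavro convention used above of wrapping union values in a map keyed by the branch name ("string", "bytes.decimal").

package main

import (
    "math/big"
    "os"

    "github.com/linkedin/goavro/v2"
)

// Same shape as the working schema above: Cost is a nullable bytes/decimal.
const schema = `{
  "type": "record",
  "name": "MyData",
  "fields": [
    {"name": "ID", "type": ["string"]},
    {"name": "Cost", "type": ["null",
      {"type": "bytes", "logicalType": "decimal", "precision": 4, "scale": 2}]}
  ]
}`

func main() {
    f, err := os.Create("mydata.avro") // illustrative path
    if err != nil {
        panic(err)
    }
    defer f.Close()

    // The OCF writer builds its codec from the schema string.
    w, err := goavro.NewOCFWriter(goavro.OCFConfig{W: f, Schema: schema})
    if err != nil {
        panic(err)
    }

    // Union values are wrapped in a map keyed by the branch name:
    // "string" for the ID branch, "bytes.decimal" for the decimal branch.
    record := map[string]interface{}{
        "ID":   map[string]interface{}{"string": "abc-123"},
        "Cost": map[string]interface{}{"bytes.decimal": big.NewRat(617, 50)},
    }
    if err := w.Append([]interface{}{record}); err != nil {
        panic(err)
    }
}

If the column ever comes back as BYTES instead of NUMERIC, check that the load job honours Avro logical types (for the bq CLI that is the --use_avro_logical_types flag).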

Related

Unable to parse schemas received by schema registry while tracking oracle database changes

I am using Confluent and kafka-connect-oracle (https://github.com/erdemcer/kafka-connect-oracle) to track changes in an Oracle Database 11g XE instance, and I can see the schema content by using the schema registry API, e.g. "curl -X GET http://localhost:8081/schemas/ids/44":
{"subject":"TEST.KAFKAUSER.TEST-value","version":1,"id":44,"schema":"{"type":"record","name":"row","namespace":"test.kafkauser.test","fields":[{"name":"SCN","type":"long"},{"name":"SEG_OWNER","type":"string"},{"name":"TABLE_NAME","type":"string"},{"name":"TIMESTAMP","type":{"type":"long","connect.version":1,"connect.name":"org.apache.kafka.connect.data.Timestamp","logicalType":"timestamp-millis"}},{"name":"SQL_REDO","type":"string"},{"name":"OPERATION","type":"string"},{"name":"data","type":["null",{"type":"record","name":"value","namespace":"","fields":[{"name":"ID","type":["null","double"],"default":null},{"name":"NAME","type":["null","string"],"default":null}],"connect.name":"value"}],"default":null},{"name":"before","type":["null","value"],"default":null}],"connect.name":"test.kafkauser.test.row"}","deleted":false}
However, this schema cannot be parsed by Confluent's schema registry client in Python:
schemaRegistryClientURL="http://localhost:8081"
from confluent.schemaregistry.client import CachedSchemaRegistryClient
from confluent.schemaregistry.serializers import MessageSerializer
schema_registry_client= CachedSchemaRegistryClient(url=schemaRegistryClientURL)
schema_registry_client.get_by_id(44)
I get the following error:
Traceback (most recent call last):
File "", line 1, in
File "build/bdist.linux-x86_64/egg/confluent/schemaregistry/client/CachedSchemaRegistryClient.py", line 140, in get_by_id
confluent.schemaregistry.client.ClientError: Received bad schema from registry.
Does kafka-connect-oracle send an invalid schema to the schema registry? How can I get this schema into the proper format?
Thanks.
Looks like there is a problem with your schema: a JSON formatter says it's invalid. You can check whether your JSON is formatted correctly here: https://jsonformatter.curiousconcept.com/#
Looking at it, I see two extra quote marks:
The first one is in the first row, after "schema":
The second one is in the last row, between test.row"} and ,"deleted":false}
After deleting these two, it is in valid form. If you are asking for a way to do this automatically, I don't know of one; maybe you can search for some Python code that validates and fixes JSON.
This is the valid format:
{
"subject":"TEST.KAFKAUSER.TEST-value",
"version":1,
"id":44,
"schema":{
"type":"record",
"name":"row",
"namespace":"test.kafkauser.test",
"fields":[
{
"name":"SCN",
"type":"long"
},
{
"name":"SEG_OWNER",
"type":"string"
},
{
"name":"TABLE_NAME",
"type":"string"
},
{
"name":"TIMESTAMP",
"type":{
"type":"long",
"connect.version":1,
"connect.name":"org.apache.kafka.connect.data.Timestamp",
"logicalType":"timestamp-millis"
}
},
{
"name":"SQL_REDO",
"type":"string"
},
{
"name":"OPERATION",
"type":"string"
},
{
"name":"data",
"type":[
"null",
{
"type":"record",
"name":"value",
"namespace":"",
"fields":[
{
"name":"ID",
"type":[
"null",
"double"
],
"default":null
},
{
"name":"NAME",
"type":[
"null",
"string"
],
"default":null
}
],
"connect.name":"value"
}
],
"default":null
},
{
"name":"before",
"type":[
"null",
"value"
],
"default":null
}
],
"connect.name":"test.kafkauser.test.row"
},
"deleted":false
}
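As a quick local check, in the same spirit as the validator link above, here is a small Go sketch that verifies the payload is valid JSON and pretty-prints it. The file name is hypothetical; note also that if the "schema" value is itself a string (the registry API typically returns the schema as an escaped JSON string), the Avro schema inside it needs a second json.Unmarshal.

package main

import (
    "bytes"
    "encoding/json"
    "fmt"
    "os"
)

func main() {
    // schema_response.json is a hypothetical file holding the pasted registry payload.
    raw, err := os.ReadFile("schema_response.json")
    if err != nil {
        panic(err)
    }
    if !json.Valid(raw) {
        fmt.Println(`payload is not valid JSON; check for stray quotes around the "schema" value`)
        return
    }
    // Pretty-print the valid payload.
    var pretty bytes.Buffer
    if err := json.Indent(&pretty, raw, "", "  "); err != nil {
        panic(err)
    }
    fmt.Println(pretty.String())
}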

Create document id from two value fields separated by underscore using Elasticsearch Sink Connector for Kafka

I am trying to load records from a Kafka topic into Elasticsearch using the Elasticsearch Sink Connector, but I'm struggling to construct the document ids the way I would like them. I would like the document id written to Elasticsearch to be a composite of two values from my Kafka topic's message, separated by an underscore.
For example:
My Kafka topic value has the following Avro schema:
{
"type": "record",
"name": "SampleValue",
"namespace": "com.abc.test",
"fields": [
{
"name": "value1",
"type": [
"null",
{
"type": "int",
"java-class": "java.lang.Integer"
}
],
"default": null
},
{
"name": "value2",
"type": [
"null",
{
"type": "int",
"java-class": "java.lang.Integer"
}
],
"default": null
},
{
"name": "otherValue",
"type": [
"null",
{
"type": "int",
"java-class": "java.lang.Integer"
}
],
"default": null
}
]
}
I would like the document id written to Elasticsearch to be the values of value1 and value2 combined with an underscore. For example, if the value in Avro looked like
{"value1": {"int": 123}, "value2": {"int": 456}, "value3": {"int": 0}}
then I would like the document id for Elasticsearch to be 123_456.
I can't figure out the correct way to chain transformations in my connector config to create a key that is composed of two values separated by an underscore.
I don't think there is a Single Message Transform out of the box that will do what you want.
You can either write your own, using the Transform API, or you can use a stream processor such as Kafka Streams or ksqlDB.
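For illustration only, here is a rough Go sketch of the stream-processor idea (the answer names Kafka Streams and ksqlDB; this is the same re-keying idea written with segmentio/kafka-go and goavro instead). The broker address, topic names, and the simplified value schema are assumptions, and it assumes the topic values use the Confluent wire format (1 magic byte + 4-byte schema id before the Avro payload).

package main

import (
    "context"
    "fmt"
    "log"

    "github.com/linkedin/goavro/v2"
    "github.com/segmentio/kafka-go"
)

// Simplified version of the question's value schema (the java-class hints are dropped;
// the Avro types are equivalent).
const valueSchema = `{
  "type": "record",
  "name": "SampleValue",
  "namespace": "com.abc.test",
  "fields": [
    {"name": "value1", "type": ["null", "int"], "default": null},
    {"name": "value2", "type": ["null", "int"], "default": null},
    {"name": "otherValue", "type": ["null", "int"], "default": null}
  ]
}`

func main() {
    codec, err := goavro.NewCodec(valueSchema)
    if err != nil {
        log.Fatal(err)
    }

    r := kafka.NewReader(kafka.ReaderConfig{
        Brokers: []string{"localhost:9092"}, // illustrative broker address
        GroupID: "rekey-for-elasticsearch",
        Topic:   "source-topic", // illustrative topic names
    })
    w := &kafka.Writer{
        Addr:     kafka.TCP("localhost:9092"),
        Topic:    "source-topic-rekeyed",
        Balancer: &kafka.Hash{}, // keep identical keys on the same partition
    }

    ctx := context.Background()
    for {
        msg, err := r.ReadMessage(ctx)
        if err != nil {
            log.Fatal(err)
        }
        // Confluent wire format: 1 magic byte + 4-byte schema id, then the Avro payload.
        native, _, err := codec.NativeFromBinary(msg.Value[5:])
        if err != nil {
            log.Fatal(err)
        }
        // Union values decode as map[string]interface{} keyed by branch name;
        // null handling is omitted in this sketch.
        rec := native.(map[string]interface{})
        v1 := rec["value1"].(map[string]interface{})["int"]
        v2 := rec["value2"].(map[string]interface{})["int"]
        key := fmt.Sprintf("%v_%v", v1, v2) // e.g. "123_456"

        if err := w.WriteMessages(ctx, kafka.Message{Key: []byte(key), Value: msg.Value}); err != nil {
            log.Fatal(err)
        }
    }
}

The Elasticsearch sink would then consume source-topic-rekeyed with key.ignore=false and org.apache.kafka.connect.storage.StringConverter as the key converter, so the document id becomes 123_456.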

elasticsearch filebeat mapper_parsing_exception when using decode_json_fields

I have ECK set up and I'm using Filebeat to ship logs from Kubernetes to Elasticsearch.
I've recently added the decode_json_fields processor to my configuration so that I'm able to decode the JSON that is usually in the message field.
- decode_json_fields:
    fields: ["message"]
    process_array: false
    max_depth: 10
    target: "log"
    overwrite_keys: true
    add_error_key: true
However, logs have stopped appearing since I added it.
Example log:
{
"_index": "filebeat-7.9.1-2020.10.01-000001",
"_type": "_doc",
"_id": "wF9hB3UBtUOF3QRTBcts",
"_score": 1,
"_source": {
"#timestamp": "2020-10-08T08:43:18.672Z",
"kubernetes": {
"labels": {
"controller-uid": "9f3f9d08-cfd8-454d-954d-24464172fa37",
"job-name": "stream-hatchet-cron-manual-rvd"
},
"container": {
"name": "stream-hatchet-cron",
"image": "<redacted>.dkr.ecr.us-east-2.amazonaws.com/stream-hatchet:v0.1.4"
},
"node": {
"name": "ip-172-20-32-60.us-east-2.compute.internal"
},
"pod": {
"uid": "041cb6d5-5da1-4efa-b8e9-d4120409af4b",
"name": "stream-hatchet-cron-manual-rvd-bh96h"
},
"namespace": "default"
},
"ecs": {
"version": "1.5.0"
},
"host": {
"mac": [],
"hostname": "ip-172-20-32-60",
"architecture": "x86_64",
"name": "ip-172-20-32-60",
"os": {
"codename": "Core",
"platform": "centos",
"version": "7 (Core)",
"family": "redhat",
"name": "CentOS Linux",
"kernel": "4.9.0-11-amd64"
},
"containerized": false,
"ip": []
},
"cloud": {
"instance": {
"id": "i-06c9d23210956ca5c"
},
"machine": {
"type": "m5.large"
},
"region": "us-east-2",
"availability_zone": "us-east-2a",
"account": {
"id": "<redacted>"
},
"image": {
"id": "ami-09d3627b4a09f6c4c"
},
"provider": "aws"
},
"stream": "stdout",
"message": "{\"message\":{\"log_type\":\"cron\",\"status\":\"start\"},\"level\":\"info\",\"timestamp\":\"2020-10-08T08:43:18.670Z\"}",
"input": {
"type": "container"
},
"log": {
"offset": 348,
"file": {
"path": "/var/log/containers/stream-hatchet-cron-manual-rvd-bh96h_default_stream-hatchet-cron-73069980b418e2aa5e5dcfaf1a29839a6d57e697c5072fea4d6e279da0c4e6ba.log"
}
},
"agent": {
"type": "filebeat",
"version": "7.9.1",
"hostname": "ip-172-20-32-60",
"ephemeral_id": "6b3ba0bd-af7f-4946-b9c5-74f0f3e526b1",
"id": "0f7fff14-6b51-45fc-8f41-34bd04dc0bce",
"name": "ip-172-20-32-60"
}
},
"fields": {
"#timestamp": [
"2020-10-08T08:43:18.672Z"
],
"suricata.eve.timestamp": [
"2020-10-08T08:43:18.672Z"
]
}
}
In the Filebeat logs I can see the following error:
2020-10-08T09:25:43.562Z WARN [elasticsearch] elasticsearch/client.go:407 Cannot
index event
publisher.Event{Content:beat.Event{Timestamp:time.Time{wall:0x36b243a0,
ext:63737745936, loc:(*time.Location)(nil)}, Meta:null,
Fields:{"agent":{"ephemeral_id":"5f8afdba-39c3-4fb7-9502-be7ef8f2d982","hostname":"ip-172-20-32-60","id":"0f7fff14-6b51-45fc-8f41-34bd04dc0bce","name":"ip-172-20-32-60","type":"filebeat","version":"7.9.1"},"cloud":{"account":{"id":"700849607999"},"availability_zone":"us-east-2a","image":{"id":"ami-09d3627b4a09f6c4c"},"instance":{"id":"i-06c9d23210956ca5c"},"machine":{"type":"m5.large"},"provider":"aws","region":"us-east-2"},"ecs":{"version":"1.5.0"},"host":{"architecture":"x86_64","containerized":false,"hostname":"ip-172-20-32-60","ip":["172.20.32.60","fe80::af:9fff:febe:dc4","172.17.0.1","100.96.1.1","fe80::6010:94ff:fe17:fbae","fe80::d869:14ff:feb0:81b3","fe80::e4f3:b9ff:fed8:e266","fe80::1c19:bcff:feb3:ce95","fe80::fc68:21ff:fe08:7f24","fe80::1cc2:daff:fe84:2a5a","fe80::3426:78ff:fe22:269a","fe80::b871:52ff:fe15:10ab","fe80::54ff:cbff:fec0:f0f","fe80::cca6:42ff:fe82:53fd","fe80::bc85:e2ff:fe5f:a60d","fe80::e05e:b2ff:fe4d:a9a0","fe80::43a:dcff:fe6a:2307","fe80::581b:20ff:fe5f:b060","fe80::4056:29ff:fe07:edf5","fe80::c8a0:5aff:febd:a1a3","fe80::74e3:feff:fe45:d9d4","fe80::9c91:5cff:fee2:c0b9"],"mac":["02:af:9f:be:0d:c4","02:42:1b:56:ee:d3","62:10:94:17:fb:ae","da:69:14:b0:81:b3","e6:f3:b9:d8:e2:66","1e:19:bc:b3:ce:95","fe:68:21:08:7f:24","1e:c2:da:84:2a:5a","36:26:78:22:26:9a","ba:71:52:15:10:ab","56:ff:cb:c0:0f:0f","ce:a6:42:82:53:fd","be:85:e2:5f:a6:0d","e2:5e:b2:4d:a9:a0","06:3a:dc:6a:23:07","5a:1b:20:5f:b0:60","42:56:29:07:ed:f5","ca:a0:5a:bd:a1:a3","76:e3:fe:45:d9:d4","9e:91:5c:e2:c0:b9"],"name":"ip-172-20-32-60","os":{"codename":"Core","family":"redhat","kernel":"4.9.0-11-amd64","name":"CentOS
Linux","platform":"centos","version":"7
(Core)"}},"input":{"type":"container"},"kubernetes":{"container":{"image":"700849607999.dkr.ecr.us-east-2.amazonaws.com/stream-hatchet:v0.1.4","name":"stream-hatchet-cron"},"labels":{"controller-uid":"a79daeac-b159-4ba7-8cb0-48afbfc0711a","job-name":"stream-hatchet-cron-manual-c5r"},"namespace":"default","node":{"name":"ip-172-20-32-60.us-east-2.compute.internal"},"pod":{"name":"stream-hatchet-cron-manual-c5r-7cx5d","uid":"3251cc33-48a9-42b1-9359-9f6e345f75b6"}},"log":{"level":"info","message":{"log_type":"cron","status":"start"},"timestamp":"2020-10-08T09:25:36.916Z"},"message":"{"message":{"log_type":"cron","status":"start"},"level":"info","timestamp":"2020-10-08T09:25:36.916Z"}","stream":"stdout"},
Private:file.State{Id:"native::30998361-66306", PrevId:"",
Finished:false, Fileinfo:(*os.fileStat)(0xc001c14dd0),
Source:"/var/log/containers/stream-hatchet-cron-manual-c5r-7cx5d_default_stream-hatchet-cron-4278d956fff8641048efeaec23b383b41f2662773602c3a7daffe7c30f62fe5a.log",
Offset:539, Timestamp:time.Time{wall:0xbfd7d4a1e556bd72,
ext:916563812286, loc:(*time.Location)(0x607c540)}, TTL:-1,
Type:"container", Meta:map[string]string(nil),
FileStateOS:file.StateOS{Inode:0x1d8ff59, Device:0x10302},
IdentifierName:"native"}, TimeSeries:false}, Flags:0x1,
Cache:publisher.EventCache{m:common.MapStr(nil)}} (status=400):
{"type":"mapper_parsing_exception","reason":"failed to parse field
[log.message] of type [keyword] in document with id
'56aHB3UBLgYb8gz801DI'. Preview of field's value: '{log_type=cron,
status=start}'","caused_by":{"type":"illegal_state_exception","reason":"Can't
get text on a START_OBJECT at 1:113"}}
It throws an error because apparently log.message is of type "keyword"; however, this does not exist in the index mapping.
I thought this might be an issue with "target": "log", so I've tried changing it to something arbitrary like "my_parsed_message" or "m_log" or "mlog", and I get the same error for all of them.
{"type":"mapper_parsing_exception","reason":"failed to parse field
[mlog.message] of type [keyword] in document with id
'J5KlDHUB_yo5bfXcn2LE'. Preview of field's value: '{log_type=cron,
status=end}'","caused_by":{"type":"illegal_state_exception","reason":"Can't
get text on a START_OBJECT at 1:217"}}
Elastic version: 7.9.2
The problem is that some of your JSON messages contain a message field that is sometimes a simple string and other times a nested JSON object (like in the case you're showing in your question).
After this index was created, the very first message that was parsed was probably a string and hence the mapping has been modified to add the following field (line 10553):
"mlog": {
"properties": {
...
"message": {
"type": "keyword",
"ignore_above": 1024
},
}
}
You'll find the same pattern for my_parsed_message (line 10902), my_parsed_logs (line 10742), etc...
Hence the next message that comes with message being a JSON object, like
{"message":{"log_type":"cron","status":"start"}, ...
will not work because it's an object, not a string...
Looking at the fields of your custom JSON, it seems you don't really have control over either their taxonomy (i.e. naming) or what they contain...
If you're serious about wanting to search within those custom fields (which I think you are, since you're parsing the field; otherwise you'd just store the stringified JSON), then I can only suggest figuring out a proper taxonomy to make sure they all get a standard type.
If all you care about is logging your data, then I suggest simply disabling the indexing of that message field. Another solution is to set dynamic: false in your mapping to ignore those fields, i.e. not modify your mapping.
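As a hedged sketch of the "disable indexing" option, in Go: create the index with the message sub-field mapped as a non-indexed object. The index name is made up, and in an ECK/Filebeat setup you would more likely put this mapping into the index template you already manage, before the index is created.

package main

import (
    "bytes"
    "fmt"
    "net/http"
)

// "enabled": false on an object field keeps it in _source but does not parse or index it,
// so it can hold either a string or a nested object without mapping conflicts.
const body = `{
  "mappings": {
    "properties": {
      "log": {
        "properties": {
          "message": { "type": "object", "enabled": false }
        }
      }
    }
  }
}`

func main() {
    // Index name is illustrative; apply the same mapping via your index template in practice.
    req, err := http.NewRequest(http.MethodPut, "http://localhost:9200/filebeat-custom", bytes.NewBufferString(body))
    if err != nil {
        panic(err)
    }
    req.Header.Set("Content-Type", "application/json")

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()
    fmt.Println("create index:", resp.Status)
}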

AWS Data Pipeline: Upload CSV file from S3 to DynamoDB

I'm attempting to migrate CSV data from S3 to DynamoDB using Data Pipeline. The data is not in a DynamoDB export format but instead in a normal CSV.
I understand that Data Pipeline is more typically used for import or export in the DynamoDB format rather than standard CSV. I think I've read across my Googling that it is possible to use a normal file, but I haven't been able to put together something that works. The AWS documentation hasn't been terribly helpful either, and I haven't been able to find reference posts that are relatively recent (< 2 years old).
If this is possible, can anyone provide some insight on why my pipeline may not be working? I've pasted the pipeline and error message below. The error seems to indicate an issue plugging data into Dynamo, I'm guessing because it's not in the export format.
I'd do it in Lambda but the data load takes longer than 15 minutes.
Thanks
{
"objects": [
{
"myComment": "Activity used to run the hive script to import CSV data",
"output": {
"ref": "dynamoDataTable"
},
"input": {
"ref": "s3csv"
},
"name": "S3toDynamoLoader",
"hiveScript": "DROP TABLE IF EXISTS tempHiveTable;\n\nDROP TABLE IF EXISTS s3TempTable;\n\nCREATE EXTERNAL TABLE tempHiveTable (#{myDDBColDef}) \nSTORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler' \nTBLPROPERTIES (\"dynamodb.table.name\" = \"#{myDDBTableName}\", \"dynamodb.column.mapping\" = \"#{myDDBTableColMapping}\");\n \nCREATE EXTERNAL TABLE s3TempTable (#{myS3ColDef}) \nROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\\n' LOCATION '#{myInputS3Loc}';\n \nINSERT OVERWRITE TABLE tempHiveTable SELECT * FROM s3TempTable;",
"id": "S3toDynamoLoader",
"runsOn": { "ref": "EmrCluster" },
"stage": "false",
"type": "HiveActivity"
},
{
"myComment": "The DynamoDB table that we are uploading to",
"name": "DynamoDB",
"id": "dynamoDataTable",
"type": "DynamoDBDataNode",
"tableName": "#{myDDBTableName}",
"writeThroughputPercent": "1.0",
"dataFormat": {
"ref": "DDBTableFormat"
}
},
{
"failureAndRerunMode": "CASCADE",
"resourceRole": "DataPipelineDefaultResourceRole",
"role": "DataPipelineDefaultRole",
"pipelineLogUri": "#{myLogUri}",
"scheduleType": "ONDEMAND",
"name": "Default",
"id": "Default"
},
{
"name": "EmrCluster",
"coreInstanceType": "m1.medium",
"coreInstanceCount": "1",
"masterInstanceType": "m1.medium",
"releaseLabel": "emr-5.29.0",
"id": "EmrCluster",
"type": "EmrCluster",
"terminateAfter": "2 Hours"
},
{
"myComment": "The S3 file that contains the data we're importing",
"directoryPath": "#{myInputS3Loc}",
"dataFormat": {
"ref": "csvFormat"
},
"name": "S3DataNode",
"id": "s3csv",
"type": "S3DataNode"
},
{
"myComment": "Format for the S3 Path",
"name": "S3ExportFormat",
"column": "not_used STRING",
"id": "csvFormat",
"type": "CSV"
},
{
"myComment": "Format for the DynamoDB table",
"name": "DDBTableFormat",
"id": "DDBTableFormat",
"column": "not_used STRING",
"type": "DynamoDBExportDataFormat"
}
],
"parameters": [
{
"description": "S3 Column Mappings",
"id": "myS3ColDef",
"default": "phoneNumber string,firstName string,lastName string, spend double",
"type": "String"
},
{
"description": "DynamoDB Column Mappings",
"id": "myDDBColDef",
"default": "phoneNumber String,firstName String,lastName String, spend double",
"type": "String"
},
{
"description": "Input S3 foder",
"id": "myInputS3Loc",
"default": "s3://POCproject-dev1-data/upload/",
"type": "AWS::S3::ObjectKey"
},
{
"description": "DynamoDB table name",
"id": "myDDBTableName",
"default": "POCproject-pipeline-data",
"type": "String"
},
{
"description": "S3 to DynamoDB Column Mapping",
"id": "myDDBTableColMapping",
"default": "phoneNumber:phoneNumber,firstName:firstName,lastName:lastName,spend:spend",
"type": "String"
},
{
"description": "DataPipeline Log Uri",
"id": "myLogUri",
"default": "s3://POCproject-dev1-data/",
"type": "AWS::S3::ObjectKey"
}
]
}
Error
[INFO] (TaskRunnerService-df-09432511OLZUA8VN0NLE_#EmrCluster_2020-03-06T02:52:47-0) df-09432511OLZUA8VN0NLE amazonaws.datapipeline.taskrunner.LogMessageUtil: Returning tail errorMsg :Caused by: java.lang.RuntimeException: com.amazonaws.services.dynamodbv2.model.AmazonDynamoDBException: One or more parameter values were invalid: An AttributeValue may not contain an empty string (Service: AmazonDynamoDBv2; Status Code: 400; Error Code: ValidationException; Request ID: UM56KGVOU511P6LS7LP1N0Q4HRVV4KQNSO5AEMVJF66Q9ASUAAJG)
at org.apache.hadoop.dynamodb.DynamoDBFibonacciRetryer.handleException(DynamoDBFibonacciRetryer.java:108)
at org.apache.hadoop.dynamodb.DynamoDBFibonacciRetryer.runWithRetry(DynamoDBFibonacciRetryer.java:83)
at org.apache.hadoop.dynamodb.DynamoDBClient.writeBatch(DynamoDBClient.java:258)
at org.apache.hadoop.dynamodb.DynamoDBClient.putBatch(DynamoDBClient.java:215)
at org.apache.hadoop.dynamodb.write.AbstractDynamoDBRecordWriter.write(AbstractDynamoDBRecordWriter.java:112)
at org.apache.hadoop.hive.dynamodb.write.HiveDynamoDBRecordWriter.write(HiveDynamoDBRecordWriter.java:42)
at org.apache.hadoop.hive.ql.exec.FileSinkOperator.process(FileSinkOperator.java:762)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:897)
at org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:95)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:897)
at org.apache.hadoop.hive.ql.exec.TableScanOperator.process(TableScanOperator.java:130)
at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.forward(MapOperator.java:148)
at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:550)
... 18 more
Caused by: com.amazonaws.services.dynamodbv2.model.AmazonDynamoDBException: One or more parameter values were invalid: An AttributeValue may not contain an empty string (Service: AmazonDynamoDBv2; Status Code: 400; Error Code: ValidationException; Request ID: UM56KGVOU511P6LS7LP1N0Q4HRVV4KQNSO5AEMVJF66Q9ASUAAJG)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1712)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1367)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1113)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:770)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:744)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:726)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:686)
at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:668)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:532)
Have you tried this sample yet? It uses Hive to import the CSV file into a DynamoDB table:
https://github.com/aws-samples/data-pipeline-samples/tree/master/samples/DynamoDBImportCSV

Convert string field to JSON array or Avro Array in Nifi record

I am currently getting some data in CSV format that has one field which is a string encoding of a JSON array, like this:
CED7B5D9-0378-4A37-B746-D6ED7BB35593,"[{\"a\":1},{\"a\":2}]"
D000C576-112C-45BE-BA0F-5DB0E8AF409E,"[{\"a\":3}]"
With millions of lines per file, I want to use only record-based processors. I would like to parse it with the following Avro schema:
{
"type": "record",
"name": "test",
"fields": [
{"name": "id", "type": "string"},
{
"name": "json_array",
"type": {
"type": "array",
"items": {
"type": "record",
"name": "array_item",
"fields": [
{"name": "a", "type": "int"}
]
}
}
}
]
}
But attempting to parse this file with ConvertRecord gives the error Cannot convert [[{"a":1},{"a":2}]] of type class java.lang.String to Object Array...
I think I want to use an UpdateRecord processor to parse the string as an object array, but I'm not sure which expression language function or record path function to use. Any suggestions?
