AWS Data Pipeline: Upload CSV file from S3 to DynamoDB - amazon-data-pipeline

I'm attempting to migrate CSV data from S3 to DynamoDB using Data Pipeline. The data is not in the DynamoDB export format but is instead a normal CSV file.
I understand that Data Pipeline is more typically used to import or export the DynamoDB format rather than standard CSV. From my Googling I gather that it is possible to use a normal file, but I haven't been able to put together something that works, and the AWS documentation hasn't been terribly helpful either. I also haven't been able to find reference posts that are relatively recent (< 2 years old).
If this is possible, can anyone provide some insight into why my pipeline may not be working? I've pasted the pipeline and error message below. The error seems to indicate a problem writing the data into DynamoDB, I'm guessing because it's not in the export format.
I'd do it in Lambda but the data load takes longer than 15 minutes.
Thanks
{
  "objects": [
    {
      "myComment": "Activity used to run the hive script to import CSV data",
      "output": {
        "ref": "dynamoDataTable"
      },
      "input": {
        "ref": "s3csv"
      },
      "name": "S3toDynamoLoader",
      "hiveScript": "DROP TABLE IF EXISTS tempHiveTable;\n\nDROP TABLE IF EXISTS s3TempTable;\n\nCREATE EXTERNAL TABLE tempHiveTable (#{myDDBColDef}) \nSTORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler' \nTBLPROPERTIES (\"dynamodb.table.name\" = \"#{myDDBTableName}\", \"dynamodb.column.mapping\" = \"#{myDDBTableColMapping}\");\n \nCREATE EXTERNAL TABLE s3TempTable (#{myS3ColDef}) \nROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\\n' LOCATION '#{myInputS3Loc}';\n \nINSERT OVERWRITE TABLE tempHiveTable SELECT * FROM s3TempTable;",
      "id": "S3toDynamoLoader",
      "runsOn": { "ref": "EmrCluster" },
      "stage": "false",
      "type": "HiveActivity"
    },
    {
      "myComment": "The DynamoDB table that we are uploading to",
      "name": "DynamoDB",
      "id": "dynamoDataTable",
      "type": "DynamoDBDataNode",
      "tableName": "#{myDDBTableName}",
      "writeThroughputPercent": "1.0",
      "dataFormat": {
        "ref": "DDBTableFormat"
      }
    },
    {
      "failureAndRerunMode": "CASCADE",
      "resourceRole": "DataPipelineDefaultResourceRole",
      "role": "DataPipelineDefaultRole",
      "pipelineLogUri": "#{myLogUri}",
      "scheduleType": "ONDEMAND",
      "name": "Default",
      "id": "Default"
    },
    {
      "name": "EmrCluster",
      "coreInstanceType": "m1.medium",
      "coreInstanceCount": "1",
      "masterInstanceType": "m1.medium",
      "releaseLabel": "emr-5.29.0",
      "id": "EmrCluster",
      "type": "EmrCluster",
      "terminateAfter": "2 Hours"
    },
    {
      "myComment": "The S3 file that contains the data we're importing",
      "directoryPath": "#{myInputS3Loc}",
      "dataFormat": {
        "ref": "csvFormat"
      },
      "name": "S3DataNode",
      "id": "s3csv",
      "type": "S3DataNode"
    },
    {
      "myComment": "Format for the S3 Path",
      "name": "S3ExportFormat",
      "column": "not_used STRING",
      "id": "csvFormat",
      "type": "CSV"
    },
    {
      "myComment": "Format for the DynamoDB table",
      "name": "DDBTableFormat",
      "id": "DDBTableFormat",
      "column": "not_used STRING",
      "type": "DynamoDBExportDataFormat"
    }
  ],
  "parameters": [
    {
      "description": "S3 Column Mappings",
      "id": "myS3ColDef",
      "default": "phoneNumber string,firstName string,lastName string, spend double",
      "type": "String"
    },
    {
      "description": "DynamoDB Column Mappings",
      "id": "myDDBColDef",
      "default": "phoneNumber String,firstName String,lastName String, spend double",
      "type": "String"
    },
    {
      "description": "Input S3 folder",
      "id": "myInputS3Loc",
      "default": "s3://POCproject-dev1-data/upload/",
      "type": "AWS::S3::ObjectKey"
    },
    {
      "description": "DynamoDB table name",
      "id": "myDDBTableName",
      "default": "POCproject-pipeline-data",
      "type": "String"
    },
    {
      "description": "S3 to DynamoDB Column Mapping",
      "id": "myDDBTableColMapping",
      "default": "phoneNumber:phoneNumber,firstName:firstName,lastName:lastName,spend:spend",
      "type": "String"
    },
    {
      "description": "DataPipeline Log Uri",
      "id": "myLogUri",
      "default": "s3://POCproject-dev1-data/",
      "type": "AWS::S3::ObjectKey"
    }
  ]
}
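For readability, the hiveScript embedded above expands to the following (parameter references left as-is; the doubly escaped line terminator is shown simply as '\n'):

DROP TABLE IF EXISTS tempHiveTable;

DROP TABLE IF EXISTS s3TempTable;

CREATE EXTERNAL TABLE tempHiveTable (#{myDDBColDef})
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ("dynamodb.table.name" = "#{myDDBTableName}", "dynamodb.column.mapping" = "#{myDDBTableColMapping}");

CREATE EXTERNAL TABLE s3TempTable (#{myS3ColDef})
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' LOCATION '#{myInputS3Loc}';

INSERT OVERWRITE TABLE tempHiveTable SELECT * FROM s3TempTable;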
Error
[INFO] (TaskRunnerService-df-09432511OLZUA8VN0NLE_#EmrCluster_2020-03-06T02:52:47-0) df-09432511OLZUA8VN0NLE amazonaws.datapipeline.taskrunner.LogMessageUtil: Returning tail errorMsg :Caused by: java.lang.RuntimeException: com.amazonaws.services.dynamodbv2.model.AmazonDynamoDBException: One or more parameter values were invalid: An AttributeValue may not contain an empty string (Service: AmazonDynamoDBv2; Status Code: 400; Error Code: ValidationException; Request ID: UM56KGVOU511P6LS7LP1N0Q4HRVV4KQNSO5AEMVJF66Q9ASUAAJG)
at org.apache.hadoop.dynamodb.DynamoDBFibonacciRetryer.handleException(DynamoDBFibonacciRetryer.java:108)
at org.apache.hadoop.dynamodb.DynamoDBFibonacciRetryer.runWithRetry(DynamoDBFibonacciRetryer.java:83)
at org.apache.hadoop.dynamodb.DynamoDBClient.writeBatch(DynamoDBClient.java:258)
at org.apache.hadoop.dynamodb.DynamoDBClient.putBatch(DynamoDBClient.java:215)
at org.apache.hadoop.dynamodb.write.AbstractDynamoDBRecordWriter.write(AbstractDynamoDBRecordWriter.java:112)
at org.apache.hadoop.hive.dynamodb.write.HiveDynamoDBRecordWriter.write(HiveDynamoDBRecordWriter.java:42)
at org.apache.hadoop.hive.ql.exec.FileSinkOperator.process(FileSinkOperator.java:762)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:897)
at org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:95)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:897)
at org.apache.hadoop.hive.ql.exec.TableScanOperator.process(TableScanOperator.java:130)
at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.forward(MapOperator.java:148)
at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:550)
... 18 more
Caused by: com.amazonaws.services.dynamodbv2.model.AmazonDynamoDBException: One or more parameter values were invalid: An AttributeValue may not contain an empty string (Service: AmazonDynamoDBv2; Status Code: 400; Error Code: ValidationException; Request ID: UM56KGVOU511P6LS7LP1N0Q4HRVV4KQNSO5AEMVJF66Q9ASUAAJG)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1712)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1367)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1113)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:770)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:744)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:726)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:686)
at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:668)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:532)

Have you tried this sample yet? It uses Hive to import a CSV file into a DynamoDB table:
https://github.com/aws-samples/data-pipeline-samples/tree/master/samples/DynamoDBImportCSV
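As a side note on the error itself: "An AttributeValue may not contain an empty string" usually means some rows in the CSV have empty fields, since DynamoDB rejects empty-string attribute values. If that is what is happening here, one possible adjustment (just a sketch, reusing the table and column names from the pipeline above) is to convert empty strings to NULL in the final INSERT so those attributes are omitted from the item:

-- Sketch only: assumes the tempHiveTable/s3TempTable definitions from the
-- hiveScript above, and Hive 2.3+ for NULLIF (emr-5.29.0 ships Hive 2.3.x).
-- The key column (assumed here to be phoneNumber) must never be empty,
-- so those rows are filtered out instead of nulled.
INSERT OVERWRITE TABLE tempHiveTable
SELECT
  phoneNumber,
  NULLIF(firstName, ''),
  NULLIF(lastName, ''),
  spend
FROM s3TempTable
WHERE phoneNumber IS NOT NULL AND phoneNumber != '';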

Related

elasticsearch filebeat mapper_parsing_exception when using decode_json_fields

I have ECK set up and I'm using Filebeat to ship logs from Kubernetes to Elasticsearch.
I've recently added the decode_json_fields processor to my configuration, so that I'm able to decode the JSON that is usually in the message field:
- decode_json_fields:
    fields: ["message"]
    process_array: false
    max_depth: 10
    target: "log"
    overwrite_keys: true
    add_error_key: true
However, logs have stopped appearing since I added it.
Example log:
{
"_index": "filebeat-7.9.1-2020.10.01-000001",
"_type": "_doc",
"_id": "wF9hB3UBtUOF3QRTBcts",
"_score": 1,
"_source": {
"#timestamp": "2020-10-08T08:43:18.672Z",
"kubernetes": {
"labels": {
"controller-uid": "9f3f9d08-cfd8-454d-954d-24464172fa37",
"job-name": "stream-hatchet-cron-manual-rvd"
},
"container": {
"name": "stream-hatchet-cron",
"image": "<redacted>.dkr.ecr.us-east-2.amazonaws.com/stream-hatchet:v0.1.4"
},
"node": {
"name": "ip-172-20-32-60.us-east-2.compute.internal"
},
"pod": {
"uid": "041cb6d5-5da1-4efa-b8e9-d4120409af4b",
"name": "stream-hatchet-cron-manual-rvd-bh96h"
},
"namespace": "default"
},
"ecs": {
"version": "1.5.0"
},
"host": {
"mac": [],
"hostname": "ip-172-20-32-60",
"architecture": "x86_64",
"name": "ip-172-20-32-60",
"os": {
"codename": "Core",
"platform": "centos",
"version": "7 (Core)",
"family": "redhat",
"name": "CentOS Linux",
"kernel": "4.9.0-11-amd64"
},
"containerized": false,
"ip": []
},
"cloud": {
"instance": {
"id": "i-06c9d23210956ca5c"
},
"machine": {
"type": "m5.large"
},
"region": "us-east-2",
"availability_zone": "us-east-2a",
"account": {
"id": "<redacted>"
},
"image": {
"id": "ami-09d3627b4a09f6c4c"
},
"provider": "aws"
},
"stream": "stdout",
"message": "{\"message\":{\"log_type\":\"cron\",\"status\":\"start\"},\"level\":\"info\",\"timestamp\":\"2020-10-08T08:43:18.670Z\"}",
"input": {
"type": "container"
},
"log": {
"offset": 348,
"file": {
"path": "/var/log/containers/stream-hatchet-cron-manual-rvd-bh96h_default_stream-hatchet-cron-73069980b418e2aa5e5dcfaf1a29839a6d57e697c5072fea4d6e279da0c4e6ba.log"
}
},
"agent": {
"type": "filebeat",
"version": "7.9.1",
"hostname": "ip-172-20-32-60",
"ephemeral_id": "6b3ba0bd-af7f-4946-b9c5-74f0f3e526b1",
"id": "0f7fff14-6b51-45fc-8f41-34bd04dc0bce",
"name": "ip-172-20-32-60"
}
},
"fields": {
"#timestamp": [
"2020-10-08T08:43:18.672Z"
],
"suricata.eve.timestamp": [
"2020-10-08T08:43:18.672Z"
]
}
}
In the Filebeat logs I can see the following error:
2020-10-08T09:25:43.562Z WARN [elasticsearch] elasticsearch/client.go:407 Cannot
index event
publisher.Event{Content:beat.Event{Timestamp:time.Time{wall:0x36b243a0,
ext:63737745936, loc:(*time.Location)(nil)}, Meta:null,
Fields:{"agent":{"ephemeral_id":"5f8afdba-39c3-4fb7-9502-be7ef8f2d982","hostname":"ip-172-20-32-60","id":"0f7fff14-6b51-45fc-8f41-34bd04dc0bce","name":"ip-172-20-32-60","type":"filebeat","version":"7.9.1"},"cloud":{"account":{"id":"700849607999"},"availability_zone":"us-east-2a","image":{"id":"ami-09d3627b4a09f6c4c"},"instance":{"id":"i-06c9d23210956ca5c"},"machine":{"type":"m5.large"},"provider":"aws","region":"us-east-2"},"ecs":{"version":"1.5.0"},"host":{"architecture":"x86_64","containerized":false,"hostname":"ip-172-20-32-60","ip":["172.20.32.60","fe80::af:9fff:febe:dc4","172.17.0.1","100.96.1.1","fe80::6010:94ff:fe17:fbae","fe80::d869:14ff:feb0:81b3","fe80::e4f3:b9ff:fed8:e266","fe80::1c19:bcff:feb3:ce95","fe80::fc68:21ff:fe08:7f24","fe80::1cc2:daff:fe84:2a5a","fe80::3426:78ff:fe22:269a","fe80::b871:52ff:fe15:10ab","fe80::54ff:cbff:fec0:f0f","fe80::cca6:42ff:fe82:53fd","fe80::bc85:e2ff:fe5f:a60d","fe80::e05e:b2ff:fe4d:a9a0","fe80::43a:dcff:fe6a:2307","fe80::581b:20ff:fe5f:b060","fe80::4056:29ff:fe07:edf5","fe80::c8a0:5aff:febd:a1a3","fe80::74e3:feff:fe45:d9d4","fe80::9c91:5cff:fee2:c0b9"],"mac":["02:af:9f:be:0d:c4","02:42:1b:56:ee:d3","62:10:94:17:fb:ae","da:69:14:b0:81:b3","e6:f3:b9:d8:e2:66","1e:19:bc:b3:ce:95","fe:68:21:08:7f:24","1e:c2:da:84:2a:5a","36:26:78:22:26:9a","ba:71:52:15:10:ab","56:ff:cb:c0:0f:0f","ce:a6:42:82:53:fd","be:85:e2:5f:a6:0d","e2:5e:b2:4d:a9:a0","06:3a:dc:6a:23:07","5a:1b:20:5f:b0:60","42:56:29:07:ed:f5","ca:a0:5a:bd:a1:a3","76:e3:fe:45:d9:d4","9e:91:5c:e2:c0:b9"],"name":"ip-172-20-32-60","os":{"codename":"Core","family":"redhat","kernel":"4.9.0-11-amd64","name":"CentOS
Linux","platform":"centos","version":"7
(Core)"}},"input":{"type":"container"},"kubernetes":{"container":{"image":"700849607999.dkr.ecr.us-east-2.amazonaws.com/stream-hatchet:v0.1.4","name":"stream-hatchet-cron"},"labels":{"controller-uid":"a79daeac-b159-4ba7-8cb0-48afbfc0711a","job-name":"stream-hatchet-cron-manual-c5r"},"namespace":"default","node":{"name":"ip-172-20-32-60.us-east-2.compute.internal"},"pod":{"name":"stream-hatchet-cron-manual-c5r-7cx5d","uid":"3251cc33-48a9-42b1-9359-9f6e345f75b6"}},"log":{"level":"info","message":{"log_type":"cron","status":"start"},"timestamp":"2020-10-08T09:25:36.916Z"},"message":"{"message":{"log_type":"cron","status":"start"},"level":"info","timestamp":"2020-10-08T09:25:36.916Z"}","stream":"stdout"},
Private:file.State{Id:"native::30998361-66306", PrevId:"",
Finished:false, Fileinfo:(*os.fileStat)(0xc001c14dd0),
Source:"/var/log/containers/stream-hatchet-cron-manual-c5r-7cx5d_default_stream-hatchet-cron-4278d956fff8641048efeaec23b383b41f2662773602c3a7daffe7c30f62fe5a.log",
Offset:539, Timestamp:time.Time{wall:0xbfd7d4a1e556bd72,
ext:916563812286, loc:(*time.Location)(0x607c540)}, TTL:-1,
Type:"container", Meta:map[string]string(nil),
FileStateOS:file.StateOS{Inode:0x1d8ff59, Device:0x10302},
IdentifierName:"native"}, TimeSeries:false}, Flags:0x1,
Cache:publisher.EventCache{m:common.MapStr(nil)}} (status=400):
{"type":"mapper_parsing_exception","reason":"failed to parse field
[log.message] of type [keyword] in document with id
'56aHB3UBLgYb8gz801DI'. Preview of field's value: '{log_type=cron,
status=start}'","caused_by":{"type":"illegal_state_exception","reason":"Can't
get text on a START_OBJECT at 1:113"}}
It throws an error because apparently log.message is of type "keyword", however this field does not exist in the index mapping.
I thought this might be an issue with the "target": "log" setting, so I've tried changing it to something arbitrary like "my_parsed_message" or "m_log" or "mlog", and I get the same error for all of them.
{"type":"mapper_parsing_exception","reason":"failed to parse field
[mlog.message] of type [keyword] in document with id
'J5KlDHUB_yo5bfXcn2LE'. Preview of field's value: '{log_type=cron,
status=end}'","caused_by":{"type":"illegal_state_exception","reason":"Can't
get text on a START_OBJECT at 1:217"}}
Elastic version: 7.9.2
The problem is that some of your JSON messages contain a message field that is sometimes a simple string and other times a nested JSON object (like in the case you're showing in your question).
After this index was created, the very first message that was parsed was probably a string and hence the mapping has been modified to add the following field (line 10553):
"mlog": {
  "properties": {
    ...
    "message": {
      "type": "keyword",
      "ignore_above": 1024
    }
  }
}
You'll find the same pattern for my_parsed_message (line 10902), my_parsed_logs (line 10742), etc...
Hence the next message that comes with message being a JSON object, like
{"message":{"log_type":"cron","status":"start"}, ...
will not work because it's an object, not a string...
Looking at the fields of your custom JSON, it seems you don't really have control over either their taxonomy (i.e. naming) or what they contain...
If you're serious about searching within those custom fields (which I think you are, since you're parsing the field; otherwise you'd just store the stringified JSON), then I can only suggest figuring out a proper taxonomy in order to make sure that they all get a standard type.
If all you care about is logging your data, then I suggest simply disabling the indexing of that message field. Another solution is to set dynamic: false in your mapping so those fields are ignored, i.e. your mapping is not modified.
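To illustrate that second option (a sketch only: the template name and index pattern are assumptions, and the field name matches the target: "log" from the question), an index template can mark the log object as non-dynamic so unexpected sub-fields such as log.message are simply left out of the mapping instead of failing to parse:

PUT _template/filebeat-log-not-dynamic
{
  "index_patterns": ["filebeat-*"],
  "mappings": {
    "properties": {
      "log": {
        "type": "object",
        "dynamic": false
      }
    }
  }
}

Since the conflicting keyword mapping already exists in the current index, this would only take effect for newly created indices (e.g. after a rollover); the documents themselves still keep those fields in _source, they just aren't indexed.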

Loading Numeric data into BigQuery with Avro files created with goavro

I am trying to figure out how to load dollar values into a NUMERIC column in BigQuery using an Avro file. I am using Go and the goavro package to generate the Avro file.
It appears that the appropriate datatype in Go to handle money is big.Rat.
BigQuery documentation indicates it should be possible to use Avro for this.
I can see from a few goavro test cases that encoding a *big.Rat into a fixed.decimal type is possible.
I am using a goavro.OCFWriter to encode data using a simple avro schema as follows:
{
  "type": "record",
  "name": "MyData",
  "fields": [
    {
      "name": "ID",
      "type": [
        "string"
      ]
    },
    {
      "name": "Cost",
      "type": [
        "null",
        {
          "type": "fixed",
          "size": 12,
          "logicalType": "decimal",
          "precision": 4,
          "scale": 2
        }
      ]
    }
  ]
}
I am attempting to Append data with the "Cost" field as follows:
map[string]interface{}{"fixed.decimal": big.NewRat(617, 50)}
This is successfully encoded, but the resulting avro file fails to load into BigQuery:
Err: load Table MyTable Job: {Location: ""; Message: "Error while reading data, error message: The Apache Avro library failed to parse the header with the following error: Missing Json field \"name\": {\"logicalType\":\"decimal\",\"precision\":4,\"scale\":2,\"size\":12,\"type\":\"fixed\"}"; Reason: "invalid"}
So I am doing something wrong here... hoping someone can point me in the right direction.
I figured it out. I need to use bytes.decimal instead of fixed.decimal:
{
  "type": "record",
  "name": "MyData",
  "fields": [
    {
      "name": "ID",
      "type": [
        "string"
      ]
    },
    {
      "name": "Cost",
      "type": [
        "null",
        {
          "type": "bytes",
          "logicalType": "decimal",
          "precision": 4,
          "scale": 2
        }
      ]
    }
  ]
}
Then encode it similarly:
map[string]interface{}{"bytes.decimal": big.NewRat(617, 50)}
And it works nicely!
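For context, here is a minimal end-to-end sketch of how this could look with goavro's OCF writer (not the asker's actual code; the output file name and record values are placeholders):

package main

import (
	"math/big"
	"os"

	"github.com/linkedin/goavro/v2"
)

// Schema matching the working example above: Cost is a union of null and a
// bytes-backed decimal(4,2).
const schema = `{
  "type": "record",
  "name": "MyData",
  "fields": [
    {"name": "ID", "type": ["string"]},
    {"name": "Cost", "type": ["null", {"type": "bytes", "logicalType": "decimal", "precision": 4, "scale": 2}]}
  ]
}`

func main() {
	f, err := os.Create("mydata.avro") // placeholder output path
	if err != nil {
		panic(err)
	}
	defer f.Close()

	w, err := goavro.NewOCFWriter(goavro.OCFConfig{W: f, Schema: schema})
	if err != nil {
		panic(err)
	}

	// Union values are wrapped in a map keyed by the branch name,
	// hence "string" for ID and "bytes.decimal" for Cost.
	record := map[string]interface{}{
		"ID":   map[string]interface{}{"string": "order-1"},
		"Cost": map[string]interface{}{"bytes.decimal": big.NewRat(617, 50)},
	}
	if err := w.Append([]interface{}{record}); err != nil {
		panic(err)
	}
}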

JSON object stored in a database table. (How to access)

I have a column named payment_data in my database table. The data was stored as a JSON array using json_encode().
I want to parse the JSON array in my DataTable, but I'm failing.
This is the data that I want to access (the payment_data column). I updated my data structure to this:
Array[2][
  {
    "po_id": 43,
    "full_name": "Dawn Zulita",
    "level": "organization",
    "payment_data": {
      "product_id": "184",
      "product_name": "Grading Org Product",
      "student_name": {
        "0": "Eloise Phillips",
        "1": "Lara vel"
      }
    },
    "date_purchase": "2018-08-10 10:38:08"
  },
  {
    "po_id": 42,
    "full_name": "QWerty You",
    "level": "school",
    "payment_data": {
      "product_id": 185,
      "product_name": "School Owner Manual Payment School Owner Manual Payment",
      "student_name": {
        "0": "Jai Who",
        "1": "Charlie Putt",
        "2": "Kevin Young"
      }
    },
    "date_purchase": "2018-08-09 14:53:35"
  }
]
I can now access the payment_data.product_name,
{
data: 'payment_data.product_id'
},
{
data: 'payment_data.product_name'
},
but the problem is that I cannot access payment_data.student_name; the error returned is "Undefined index":
{
data: 'payment_data.student_name'
},
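If the columns are rendered client-side, one possible direction (only a sketch, assuming jQuery DataTables) is a render callback, since student_name in the structure above is an object keyed by "0", "1", ... rather than an array or a plain string:

{
  data: 'payment_data.student_name',
  defaultContent: '',
  render: function (data) {
    // student_name is an object ("0", "1", ...), not an array,
    // so join its values for display
    return data ? Object.values(data).join(', ') : '';
  }
},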

Attempt to index document gives error: "only value lists are allowed in serialized settings"

When attempting to index the following document:
{
"branch": "master",
"classes": [
{
"content_count": 2,
"documentation": "",
"extends": [],
"generic": "",
"implements": [],
"line": 10,
"line_count": 36,
"modifiers": [
"public"
],
"name": "removeDuplicateFromString"
}
],
"commit_hash": "e53249ba2381d2f20f3d4493ad70e2da0abb3b05",
"contributors": [
{
"id": "7676016",
"name": "varunu28",
"url": "https://github.com/varunu28"
}
],
"enums": [],
"fields": [],
"filename": "removeDuplicateFromString.java",
"imports": [
{
"name": "java.io.BufferedReader",
"wildcard": false
},
{
"name": "java.io.InputStreamReader",
"wildcard": false
}
],
"interfaces": [],
"license": "",
"methods": [
{
"cyclomatic_complexity": 1,
"documentation": "",
"generic": "",
"line": 11,
"line_count": 9,
"modifiers": [
"public",
"static"
],
"name": "main",
"params": [
{
"name": "args",
"type": "String[]"
}
],
"parent": "removeDuplicateFromString",
"type_": "void"
},
{
"cyclomatic_complexity": 5,
"documentation": "",
"generic": "",
"line": 29,
"line_count": 16,
"modifiers": [
"public",
"static"
],
"name": "removeDuplicate",
"params": [
{
"name": "s",
"type": "String"
}
],
"parent": "removeDuplicateFromString",
"type_": "String"
}
],
"number_forks": 1695,
"number_stars": 4000,
"number_watchs": 394,
"package": "",
"path": "Others",
"repository": "TheAlgorithms/Java"
}
I get the following error:
{"error":{"root_cause":[{"type":"settings_exception","reason":"Failed to load settings from [{\"interfaces\":[],\"imports\":[{\"name\":\"java.io.BufferedReader\",\"wildcard\":false},{\"name\":\"java.io.InputStreamReader\",\"wildcard\":false}],\"package\":\"\",\"methods\":[{\"parent\":\"removeDuplicateFromString\",\"line_count\":9,\"line\":11,\"documentation\":\"\",\"name\":\"main\",\"cyclomatic_complexity\":1,\"modifiers\":[\"public\",\"static\"],\"params\":[{\"name\":\"args\",\"type\":\"String[]\"}],\"type_\":\"void\",\"generic\":\"\"},{\"parent\":\"removeDuplicateFromString\",\"line_count\":16,\"line\":29,\"documentation\":\"\",\"name\":\"removeDuplicate\",\"cyclomatic_complexity\":5,\"modifiers\":[\"public\",\"static\"],\"params\":[{\"name\":\"s\",\"type\":\"String\"}],\"type_\":\"String\",\"generic\":\"\"}],\"number_forks\":1695,\"classes\":[{\"implements\":[],\"line_count\":36,\"extends\":[],\"line\":10,\"documentation\":\"\",\"name\":\"removeDuplicateFromString\",\"content_count\":2,\"modifiers\":[\"public\"],\"generic\":\"\"}],\"repository\":\"TheAlgorithms/Java\",\"branch\":\"master\",\"commit_hash\":\"e53249ba2381d2f20f3d4493ad70e2da0abb3b05\",\"enums\":[],\"path\":\"Others\",\"license\":\"\",\"filename\":\"removeDuplicateFromString.java\",\"number_watchs\":394,\"contributors\":[{\"name\":\"varunu28\",\"id\":\"7676016\",\"url\":\"https://github.com/varunu28\"}],\"fields\":[],\"number_stars\":4000}]"}],"type":"settings_exception","reason":"Failed to load settings from [{\"interfaces\":[],\"imports\":[{\"name\":\"java.io.BufferedReader\",\"wildcard\":false},{\"name\":\"java.io.InputStreamReader\",\"wildcard\":false}],\"package\":\"\",\"methods\":[{\"parent\":\"removeDuplicateFromString\",\"line_count\":9,\"line\":11,\"documentation\":\"\",\"name\":\"main\",\"cyclomatic_complexity\":1,\"modifiers\":[\"public\",\"static\"],\"params\":[{\"name\":\"args\",\"type\":\"String[]\"}],\"type_\":\"void\",\"generic\":\"\"},{\"parent\":\"removeDuplicateFromString\",\"line_count\":16,\"line\":29,\"documentation\":\"\",\"name\":\"removeDuplicate\",\"cyclomatic_complexity\":5,\"modifiers\":[\"public\",\"static\"],\"params\":[{\"name\":\"s\",\"type\":\"String\"}],\"type_\":\"String\",\"generic\":\"\"}],\"number_forks\":1695,\"classes\":[{\"implements\":[],\"line_count\":36,\"extends\":[],\"line\":10,\"documentation\":\"\",\"name\":\"removeDuplicateFromString\",\"content_count\":2,\"modifiers\":[\"public\"],\"generic\":\"\"}],\"repository\":\"TheAlgorithms/Java\",\"branch\":\"master\",\"commit_hash\":\"e53249ba2381d2f20f3d4493ad70e2da0abb3b05\",\"enums\":[],\"path\":\"Others\",\"license\":\"\",\"filename\":\"removeDuplicateFromString.java\",\"number_watchs\":394,\"contributors\":[{\"name\":\"varunu28\",\"id\":\"7676016\",\"url\":\"https://github.com/varunu28\"}],\"fields\":[],\"number_stars\":4000}]","caused_by":{"type":"illegal_state_exception","reason":"only value lists are allowed in serialized settings"}},"status":500}
From this I've gathered that the main issue is described either in the part saying:
{"type":"illegal_state_exception","reason":"only value lists are allowed in serialized settings"}}
Or:
"error":{"root_cause":[{"type":"settings_exception","reason":"Failed to load settings from [{\"interfaces\":[],\"imports\": ........
But I cannot find any information about this error or what could be causing it. I've tried indexing both into a predefined index with mappings and into a non-existing index. Nothing seems to work.
Why can't I index this document?
It turns out that, as Farid mentioned in the comments section, I was using the wrong command when indexing from the command line.
The correct command to run is
curl -X POST -H 'Content-Type: application/json' [index location] -d [data]
The key is that you use POST and not PUT, which is what I was doing.
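For example (only an illustration; the host, index name, and payload file are placeholders):

curl -X POST -H 'Content-Type: application/json' \
  'http://localhost:9200/code-documents/_doc' \
  -d @document.json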
Adding this for the ones using Kibana Dev Tools.
The key is to use a document type after the index name when adding the document:
POST /{index name}/{document type}
{
request body (document) goes here.
}

Need to execute a ruby file from Amazon web service Data pipeline

I have a Ruby file in my application and I need to call and execute it as a background job from AWS Data Pipeline.
I have given the JSON pipeline definition below.
#json file
{
  "objects": [
    {
      "id": "ScheduleId4",
      "startDateTime": "2013-08-01T00:00:00",
      "name": "schedule",
      "type": "Schedule",
      "period": "15 Minutes"
    },
    {
      "id": "DataNodeId2",
      "schedule": {
        "ref": "ScheduleId4"
      },
      "name": "Input",
      "directoryPath": "s3://pipeline_test/input/",
      "type": "S3DataNode"
    },
    {
      "id": "ActivityId1",
      "input": {
        "ref": "DataNodeId2"
      },
      "schedule": {
        "ref": "ScheduleId4"
      },
      "stdout": "s3://pipeline_test/logs",
      "scriptUri": "s3://pipeline_test/input/sample.sh",
      "name": "Shell",
      "runsOn": {
        "ref": "ResourceId5"
      },
      "stderr": "s3://pipeline_test/logs",
      "type": "ShellCommandActivity",
      "output": {
        "ref": "DataNodeId3"
      },
      "stage": "true"
    },
    {
      "terminateAfter": "1 Hours",
      "id": "ResourceId5",
      "schedule": {
        "ref": "ScheduleId4"
      },
      "name": "Resource1",
      "logUri": "s3://pipeline_test/logs/",
      "type": "Ec2Resource"
    },
    {
      "id": "Default",
      "scheduleType": "timeseries",
      "name": "Default",
      "role": "DataPipelineDefaultRole",
      "resourceRole": "DataPipelineDefaultResourceRole"
    },
    {
      "id": "DataNodeId3",
      "schedule": {
        "ref": "ScheduleId4"
      },
      "directoryPath": "s3://pipeline_test/output1/",
      "name": "Output",
      "type": "S3DataNode"
    }
  ]
}
sample.sh
echo "Hello"
ruby sample.rb
sample.rb
puts "Hello world"
I have given the correct path to the sample.sh file, but I still can't tell whether sample.rb is being called or not.
Can anyone tell me the step-by-step procedure to follow, as I am a newbie to AWS Data Pipeline?
Help me to solve it.
The default image launched by Data Pipeline does not actually have Ruby on it. You'll have to build your own image and install Ruby by hand first. Then reference that image in your Ec2Resource via its imageId field.
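For illustration, the Ec2Resource from the pipeline above could then look something like this (the imageId value is a placeholder for your custom AMI with Ruby preinstalled):

{
  "id": "ResourceId5",
  "name": "Resource1",
  "type": "Ec2Resource",
  "imageId": "ami-0123456789abcdef0",
  "schedule": { "ref": "ScheduleId4" },
  "logUri": "s3://pipeline_test/logs/",
  "terminateAfter": "1 Hours"
}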
