Merge two Apache Avro schemas containing a common array using jq - shell

I've got two Apache Avro schemas (essentially JSON) - one being a "common" part shared across many schemas, and the other one holding the schema-specific fields. Looking for a way to merge them in a shell script.
base.avsc
{
"type": "record",
"fields": [
{
"name": "id",
"type": "string"
}
]
}
schema1.avsc
{
"name": "schema1",
"namespace": "test",
"doc": "Test schema",
"fields": [
{
"name": "property1",
"type": [
"null",
"string"
],
"default": null,
"doc": "Schema 1 specific field"
}
]
}
jq -s '.[0] * .[1]' base.avsc schema1.avsc doesn't merge the array for me:
{
"type": "record",
"fields": [
{
"name": "property1",
"type": [
"null",
"string"
],
"default": null,
"doc": "Schema 1 specific field"
}
],
"name": "schema1",
"namespace": "test",
"doc": "Test schema"
}
I don't expect to have the same keys in the "fields" array. And "type": "record" could be moved into schema1.avsc if that makes it easier.
The expected result should be something like this (the order of the keys doesn't matter):
{
"name": "schema1",
"namespace": "test",
"doc": "Test schema",
"type": "record",
"fields": [
{
"name": "property1",
"type": [
"null",
"string"
],
"default": null,
"doc": "Schema 1 specific field"
},
{
"name": "id",
"type": "string"
}
]
}
Can't figure out how to write an expression in jq for what I want.

You need the addition (+) operator to perform a union of the top-level keys from both files, and then combine the "fields" arrays from both files:
jq -s '.[0] as $o1 | .[1] as $o2 | ($o1 + $o2) |.fields = ($o2.fields + $o1.fields) ' base.avsc schema1.avsc
Answer adapted from pkoppstein's comment on this GitHub post: Merge arrays in two json files.
The jq manual says this under the addition operator +:
Objects are added by merging, that is, inserting all the key-value pairs from both objects into a single combined object. If both objects contain a value for the same key, the object on the right of the + wins. (For recursive merge use the * operator.)
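To see both behaviours in one place, here is a toy run (made-up two-object input, not the Avro files): the right-hand object wins for the scalar key, while the explicit .fields assignment concatenates the arrays.
echo '{"a":1,"fields":[1]} {"a":2,"fields":[2]}' | jq -s '.[0] as $o1 | .[1] as $o2 | ($o1 + $o2) | .fields = ($o2.fields + $o1.fields)'
# => {"a":2,"fields":[2,1]}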

Here's a concise solution that avoids "slurping":
jq --argfile base base.avsc '
$base + .
| .fields += ($base|.fields)
' schema1.avsc
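Note that newer jq releases mark --argfile as deprecated; if yours complains, --slurpfile does the same job but binds an array of the file's contents, so you index into it (a sketch using the same inputs):
jq --slurpfile base base.avsc '
$base[0] + .
| .fields += $base[0].fields
' schema1.avsc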
Or you could go with brevity:
jq -s '
.[0].fields as $f | add | .fields += $f
' base.avsc schema1.avsc

As an alternative solution, you may consider handling hierarchical JSON with jtc, a walk-path based Unix utility.
The task here is merely a recursive merge, which with jtc looks like this:
bash $ <schema1.avsc jtc -mi base.avsc
{
"doc": "Test schema",
"fields": [
{
"default": null,
"doc": "Schema 1 specific field",
"name": "property1",
"type": [
"null",
"string"
]
},
{
"name": "id",
"type": "string"
}
],
"name": "schema1",
"namespace": "test",
"type": "record"
}
bash $
PS. Disclosure: I'm the creator of jtc, a shell CLI tool for JSON operations.

Related

Gaussian constraint in `normfactor`

I would like to understand how to impose a gaussian constraint with central value expected_yield and error expected_y_error on a normfactor modifier. I want to fit observed_data with a single sample MC_derived_sample. My goal is to extract the bu_y modifier such that the integral of MC_derived_sample scaled by bu_y is gaussian-constrained to expected_yield +/- expected_y_error.
My present attempt employs the normsys modifier as follows:
spec = {
"channels": [
{
"name": "singlechannel",
"samples": [
{
"name": "constrained_template",
"data": MC_derived_sample*expected_yield, #expect normalisation around 1
"modifiers": [
{"name": "bu_y", "type": "normfactor", "data": None },
{"name": "bu_y_constr", "type": "normsys",
"data":
{"lo" : 1 - (expected_y_error/expected_yield),
"hi" : 1 + (expected_y_error/expected_yield)}
},
]
},
]
},
],
"observations": [
{
"name": "singlechannel",
"data": observed_data,
}
],
"measurements": [
{
"name": "sig_y_extraction",
"config": {
"poi": "bu_y",
"parameters": [
{"name":"bu_y", "bounds": [[(1 - (5*expected_y_error/expected_yield), 1+(5*expected_y_error/expected_yield)]], "inits":[1.]},
]
}
}
],
"version": "1.0.0"
}
My thinking is that normsys will introduce a Gaussian constraint about unity on the sample scaled by expected_yield.
Can you give me any feedback on whether this approach is correct?
In addition, suppose I wanted to include a staterror modifier for the Barlow-Beeston lite implementation, would this be the correct way of doing so?
"samples": [
{
"name": "constrained_template",
"data": MC_derived_sample*expected_yield, #expect normalisation around 1
"modifiers": [
{"name": "BB_lite_uncty", "type": "staterror", "data": np.sqrt(MC_derived_sample)*expected_yield }, #assume poisson error and scale by central value of constraint
{"name": "bu_y", "type": "normfactor", "data": None },
{"name": "bu_y_constr", "type": "normsys",
"data":
{"lo" : 1 - (expected_y_error/expected_yield),
"hi" : 1 + (expected_y_error/expected_yield)}
},
]
}
Thanks a lot in advance for your help,
Blaise

How do I parse nested JSON with JQ into CSV-aggregated output?

I have a question that is an extension/followup to a previous question I've asked:
How do I concatenate dummy values in JQ based on field value, and then CSV-aggregate these concatenations?
In my bash script, when I run the following jq against my curl result:
curl -u someKey:someSecret someURL 2>/dev/null | jq -r '.schema' | jq -r -c '.fields'
I get back a JSON array as follows:
[
{"name":"id", "type":"int"},
{
"name": "agents",
"type": {
"type": "array",
"items": {
"name": "carSalesAgents",
"type": "record"
"fields": [
{
"name": "agentName",
"type": ["string", "null"],
"default": null
},
{
"name": "agentEmail",
"type": ["string", "null"],
"default": null
},
{
"name": "agentPhones",
"type": {
"type": "array",
"items": {
"name": "SalesAgentPhone",
"type": "record"
"fields": [
{
"name": "phoneNumber",
"type": "string"
}
]
}
},
"default": []
}
]
}
},
"default": []
},
{"name":"description","type":"string"}
]
Note: line breaks and indentation added here for ease of reading. This is all in reality a single blob of text.
My goal is to make a jq call that returns the following, given the example above (again, line breaks and spaces added for readability; it only needs to return a valid JSON blob):
{
"id":1234567890,
"agents": [
{
"agentName": "xxxxxxxxxx",
"agentEmail": "xxxxxxxxxx",
"agentPhones": [
{
"phoneNumber": "xxxxxxxxxx"
},
{
"phoneNumber": "xxxxxxxxxx"
},
{
"phoneNumber": "xxxxxxxxxx"
}
]
},
{
"agentName": "xxxxxxxxxx",
"agentEmail": "xxxxxxxxxx",
"agentPhones": [
{
"phoneNumber": "xxxxxxxxxx"
},
{
"phoneNumber": "xxxxxxxxxx"
},
{
"phoneNumber": "xxxxxxxxxx"
}
]
}
],
"description":"xxxxxxxxxx"
}
To summarise, I am trying to automatically generate templated values that match the "schema" JSON shown above.
So just to clarify, the values for "name" (including their surrounding double-quotes) are concatenated with either:
:1234567890 ...when the "type" for that object is "int"
":xxxxxxxxxx" ...when the "type" for that object is "string"
...and when type is "array" or "record" the appropriate enclosures are added {} or [] with the nested content inside.
If it's an array of records, generate TWO records for the output.
The approach I have started down, to cater for parsing nested content like this, is a series of if-then-elses for every combination of each possible jq type.
But this is fast becoming very hard to manage and painful. From my initial scratch efforts:
echo '[{"name":"id","type":"int"},{"name":"test_string","type":"string"},{"name":"string3ish","type":["string","null"],"default":null}]' | jq -c 'map({(.name): (if .type == "int" then 1234567890 else (if .type == "string" then "xxxxxxxxxx" else (if .type|type == "array" then "xxARRAYxx" else "xxUNKNOWNxx" end) end) end)})|add'
I was wondering if anyone knew of a smarter way to do this in bash/shell with JQ.
PS: I have found alternate solutions for such parsing using Java and Python modules, but JQ is preferable for a unique case of limitations around portability. :)
Thanks!
jq supports functions. Those functions can recurse.
#!/usr/bin/env jq -f
# Ignore all but the first type, in the case of "type": ["string", "null"]
def takeFirstTypeFromArray:
if (.type | type) == "array" then
.type = .type[0]
else
.
end;
def sampleData:
takeFirstTypeFromArray |
if .type == "int" then
1234567890
elif .type == "string" then
"xxxxxxxxxx"
elif .type == "array" then # generate two entries for any test array
[(.items | sampleData), (.items | sampleData)]
elif .type == "record" then
(.fields | map({(.name): sampleData}) | add)
elif (.type | type) == "array" then
(.type[] | sampleData)
elif (.type | type) == "object" then
(.type | sampleData)
else
["UNKNOWN", .]
end;
map({(.name): sampleData}) | add
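Assuming the filter above is saved to a file (say sample.jq - the filename is just for illustration), it can be appended to the pipeline from the question:
curl -u someKey:someSecret someURL 2>/dev/null | jq -r '.schema' | jq -r -c '.fields' | jq -c -f sample.jq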

Data filter works with json data but not with csv data

In this Vega chart, if I download and convert the flare-dependencies.json to CSV using the following jq command,
jq -r '(map(keys) | add | unique) as $cols | map(. as $row | $cols | map($row[.])) as $rows | $cols, $rows[] | @csv' flare-dependencies.json > flare-dependencies.csv
And change the corresponding data property in the edge-bundling.vg.json file from:
{
"name": "dependencies",
"url": "data/flare-dependencies.json",
"transform": [
{
"type": "formula",
"expr": "treePath('tree', datum.source, datum.target)",
"as": "treepath",
"initonly": true
}
]
},
to
{
"name": "dependencies",
"url": "data/flare-dependencies.csv",
"format": { "type": "csv" },
"transform": [
{
"type": "formula",
"expr": "treePath('tree', datum.source, datum.target)",
"as": "treepath",
"initonly": true
}
]
},
The hovering effect won't work (the colors won't change when I hover over edges/nodes).
I suspect that the issue is with this section:
"name": "selected",
"source": "dependencies",
"transform": [
{
"type": "filter",
"expr": "datum.source === active || datum.target === active"
}
]
What am I missing? How can I fix this?
JSON data is typed; that is, the file format distinguishes between string and numerical data. CSV data is untyped: all entries are expressed as strings.
The chart specification above requires some fields to be numerical, so when you convert the input data to CSV you must add a format specification that declares the numerical columns.
In the case of this chart, you can use the following for the nodes data:
"format": {
"type": "tsv",
"parse": { "id": "number", "name": "string", "parent": "number" }
},
And the following for the links data:
"format": {
"type": "tsv",
"parse": { "source": "number", "target": "number" }
},
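Putting that together with the data node from the question, the dependencies entry would look roughly like this (keeping "csv" as the format type, since that is what the file was converted to; swap in "tsv" if you exported tab-separated data):
{
"name": "dependencies",
"url": "data/flare-dependencies.csv",
"format": {
"type": "csv",
"parse": { "source": "number", "target": "number" }
},
"transform": [
{
"type": "formula",
"expr": "treePath('tree', datum.source, datum.target)",
"as": "treepath",
"initonly": true
}
]
},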

AWS Data Pipeline: Upload CSV file from S3 to DynamoDB

I'm attempting to migrate CSV data from S3 to DynamoDB using Data Pipeline. The data is not in a DynamoDB export format but instead in a normal CSV.
I understand that Data Pipeline is more typically used for importing or exporting the DynamoDB format rather than standard CSV. From my Googling, I believe it is possible to use a normal file, but I haven't been able to put together something that works. The AWS documentation hasn't been terribly helpful either, and I haven't been able to find reference posts that are relatively recent (< 2 years old).
If this is possible, can anyone provide some insight on why my pipeline may not be working? I've pasted the pipeline and error message below. The error seems to indicate an issue plugging data into Dynamo, I'm guessing because it's not in the export format.
I'd do it in Lambda but the data load takes longer than 15 minutes.
Thanks
{
"objects": [
{
"myComment": "Activity used to run the hive script to import CSV data",
"output": {
"ref": "dynamoDataTable"
},
"input": {
"ref": "s3csv"
},
"name": "S3toDynamoLoader",
"hiveScript": "DROP TABLE IF EXISTS tempHiveTable;\n\nDROP TABLE IF EXISTS s3TempTable;\n\nCREATE EXTERNAL TABLE tempHiveTable (#{myDDBColDef}) \nSTORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler' \nTBLPROPERTIES (\"dynamodb.table.name\" = \"#{myDDBTableName}\", \"dynamodb.column.mapping\" = \"#{myDDBTableColMapping}\");\n \nCREATE EXTERNAL TABLE s3TempTable (#{myS3ColDef}) \nROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\\n' LOCATION '#{myInputS3Loc}';\n \nINSERT OVERWRITE TABLE tempHiveTable SELECT * FROM s3TempTable;",
"id": "S3toDynamoLoader",
"runsOn": { "ref": "EmrCluster" },
"stage": "false",
"type": "HiveActivity"
},
{
"myComment": "The DynamoDB table that we are uploading to",
"name": "DynamoDB",
"id": "dynamoDataTable",
"type": "DynamoDBDataNode",
"tableName": "#{myDDBTableName}",
"writeThroughputPercent": "1.0",
"dataFormat": {
"ref": "DDBTableFormat"
}
},
{
"failureAndRerunMode": "CASCADE",
"resourceRole": "DataPipelineDefaultResourceRole",
"role": "DataPipelineDefaultRole",
"pipelineLogUri": "#{myLogUri}",
"scheduleType": "ONDEMAND",
"name": "Default",
"id": "Default"
},
{
"name": "EmrCluster",
"coreInstanceType": "m1.medium",
"coreInstanceCount": "1",
"masterInstanceType": "m1.medium",
"releaseLabel": "emr-5.29.0",
"id": "EmrCluster",
"type": "EmrCluster",
"terminateAfter": "2 Hours"
},
{
"myComment": "The S3 file that contains the data we're importing",
"directoryPath": "#{myInputS3Loc}",
"dataFormat": {
"ref": "csvFormat"
},
"name": "S3DataNode",
"id": "s3csv",
"type": "S3DataNode"
},
{
"myComment": "Format for the S3 Path",
"name": "S3ExportFormat",
"column": "not_used STRING",
"id": "csvFormat",
"type": "CSV"
},
{
"myComment": "Format for the DynamoDB table",
"name": "DDBTableFormat",
"id": "DDBTableFormat",
"column": "not_used STRING",
"type": "DynamoDBExportDataFormat"
}
],
"parameters": [
{
"description": "S3 Column Mappings",
"id": "myS3ColDef",
"default": "phoneNumber string,firstName string,lastName string, spend double",
"type": "String"
},
{
"description": "DynamoDB Column Mappings",
"id": "myDDBColDef",
"default": "phoneNumber String,firstName String,lastName String, spend double",
"type": "String"
},
{
"description": "Input S3 foder",
"id": "myInputS3Loc",
"default": "s3://POCproject-dev1-data/upload/",
"type": "AWS::S3::ObjectKey"
},
{
"description": "DynamoDB table name",
"id": "myDDBTableName",
"default": "POCproject-pipeline-data",
"type": "String"
},
{
"description": "S3 to DynamoDB Column Mapping",
"id": "myDDBTableColMapping",
"default": "phoneNumber:phoneNumber,firstName:firstName,lastName:lastName,spend:spend",
"type": "String"
},
{
"description": "DataPipeline Log Uri",
"id": "myLogUri",
"default": "s3://POCproject-dev1-data/",
"type": "AWS::S3::ObjectKey"
}
]
}
Error
[INFO] (TaskRunnerService-df-09432511OLZUA8VN0NLE_#EmrCluster_2020-03-06T02:52:47-0) df-09432511OLZUA8VN0NLE amazonaws.datapipeline.taskrunner.LogMessageUtil: Returning tail errorMsg :Caused by: java.lang.RuntimeException: com.amazonaws.services.dynamodbv2.model.AmazonDynamoDBException: One or more parameter values were invalid: An AttributeValue may not contain an empty string (Service: AmazonDynamoDBv2; Status Code: 400; Error Code: ValidationException; Request ID: UM56KGVOU511P6LS7LP1N0Q4HRVV4KQNSO5AEMVJF66Q9ASUAAJG)
at org.apache.hadoop.dynamodb.DynamoDBFibonacciRetryer.handleException(DynamoDBFibonacciRetryer.java:108)
at org.apache.hadoop.dynamodb.DynamoDBFibonacciRetryer.runWithRetry(DynamoDBFibonacciRetryer.java:83)
at org.apache.hadoop.dynamodb.DynamoDBClient.writeBatch(DynamoDBClient.java:258)
at org.apache.hadoop.dynamodb.DynamoDBClient.putBatch(DynamoDBClient.java:215)
at org.apache.hadoop.dynamodb.write.AbstractDynamoDBRecordWriter.write(AbstractDynamoDBRecordWriter.java:112)
at org.apache.hadoop.hive.dynamodb.write.HiveDynamoDBRecordWriter.write(HiveDynamoDBRecordWriter.java:42)
at org.apache.hadoop.hive.ql.exec.FileSinkOperator.process(FileSinkOperator.java:762)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:897)
at org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:95)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:897)
at org.apache.hadoop.hive.ql.exec.TableScanOperator.process(TableScanOperator.java:130)
at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.forward(MapOperator.java:148)
at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:550)
... 18 more
Caused by: com.amazonaws.services.dynamodbv2.model.AmazonDynamoDBException: One or more parameter values were invalid: An AttributeValue may not contain an empty string (Service: AmazonDynamoDBv2; Status Code: 400; Error Code: ValidationException; Request ID: UM56KGVOU511P6LS7LP1N0Q4HRVV4KQNSO5AEMVJF66Q9ASUAAJG)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1712)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1367)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1113)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:770)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:744)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:726)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:686)
at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:668)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:532)
Have you tried this sample yet? It uses Hive to import the CSV file into a DynamoDB table:
https://github.com/aws-samples/data-pipeline-samples/tree/master/samples/DynamoDBImportCSV

Filter objects in geojson based on a specific key

I'm trying to edit a GeoJSON file to keep only the features that have the key "name".
The filter works, but I can't find a way to keep the rest of each object (specifically the geometry) and redirect the whole thing to a new GeoJSON file. Is there a way to output the whole object after filtering on one of its child objects?
Here is an example of my data. The first feature has the "name" property and the second doesn't:
{
"features": [
{
"type": "Feature",
"id": "way/24824633",
"properties": {
"#id": "way/24824633",
"highway": "tertiary",
"lit": "yes",
"maxspeed": "50",
"name": "Rue de Kleinbettingen",
"surface": "asphalt"
},
"geometry": {
"type": "LineString",
"coordinates": [
[
5.8997935,
49.6467825
],
[
5.8972561,
49.6467445
]
]
}
},
{
"type": "Feature",
"id": "way/474396855",
"properties": {
"#id": "way/474396855",
"highway": "path"
},
"geometry": {
"type": "LineString",
"coordinates": [
[
5.8020608,
49.6907648
],
[
5.8020695,
49.6906054
]
]
}
}
]
}
Here is what I tried, using jq
cat file.geojson | jq '.features[].properties | select(has("name"))'
The "geometry" is also a child of "features" but I can't find a way to make the selection directly from the "features" level. Is there some way to do that? Or a better path to the solution?
So, the required output is:
{
"type": "Feature",
"id": "way/24824633",
"properties": {
"#id": "way/24824633",
"highway": "tertiary",
"lit": "yes",
"maxspeed": "50",
"name": "Rue de Kleinbettingen",
"surface": "asphalt"
},
"geometry": {
"type": "LineString",
"coordinates": [
[
5.8997935,
49.6467825
],
[
5.8972561,
49.6467445
]
]}}
You can assign the filtered list back to .features:
jq '.features |= map(select(.properties|has("name")))'
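To write the result to a new file (the filenames are just for illustration), redirect the output:
jq '.features |= map(select(.properties | has("name")))' file.geojson > filtered.geojson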
