Merge two arrays of objects with a common key using jq - bash

I have two datasets:
data1='[
{ "bookings": 2984, "timestamp": 1675854900 },
{ "bookings": 2967, "timestamp": 1675855200 }
]'
data2='[
{ "errors": 51, "timestamp": 1675854900 },
{ "errors": 90, "timestamp": 1675855200 }
]'
I want the output to be:
combined='[
{ "errors": 51, "bookings": 2984, "timestamp": 1675854900 },
{ "errors": 90, "bookings": 2967, "timestamp": 1675855200 }
]'
Can this be achieved by shell scripting and jq command?
Assume that timestamp will always be present and will always have a common value across the two datasets. Even the order is the same.

This last paragraph just caught my attention:
Assume that timestamp will always be present and will always have a common value across the two datasets. Even the order is the same.
If this is truly the case then it is reasonable to assume that both arrays have the same length and their items are aligned respectively. Thus, there's no need to build up a hash-based INDEX as accessing the items by their numeric keys (positions within the arrays) can already be achieved in constant time.
jq -n --argjson data1 "$data1" --argjson data2 "$data2" '
$data1 | [keys[] | $data2[.] + $data1[.]]
'
[
{
"errors": 51,
"timestamp": 1675854900,
"bookings": 2984
},
{
"errors": 90,
"timestamp": 1675855200,
"bookings": 2967
}
]
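As a cross-check, the same positional merge is a one-liner in Python (my translation, not part of the answer; the arrays are the question's samples):

```python
import json

data1 = json.loads('''[
  {"bookings": 2984, "timestamp": 1675854900},
  {"bookings": 2967, "timestamp": 1675855200}
]''')
data2 = json.loads('''[
  {"errors": 51, "timestamp": 1675854900},
  {"errors": 90, "timestamp": 1675855200}
]''')

# zip pairs the aligned items; the right-hand dict (**a) wins on key
# clashes, just like `$data2[.] + $data1[.]` in jq.
combined = [{**b, **a} for a, b in zip(data1, data2)]
print(json.dumps(combined))
```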

A simple JOIN operation could do:
jq -n --argjson data1 "$data1" --argjson data2 "$data2" '
[JOIN(INDEX($data1[]; .timestamp); $data2[]; .timestamp | @text; add)]
'
[
{
"errors": 51,
"timestamp": 1675854900,
"bookings": 2984
},
{
"errors": 90,
"timestamp": 1675855200,
"bookings": 2967
}
]
I'm getting this error: jq: error: JOIN/4 is not defined at <top-level>, line 2: [JOIN(INDEX($data1[]; .timestamp); $data2[]; .timestamp | @text; add)] jq: 1 compile error
You are probably using an older version of jq. JOIN and INDEX were introduced in jq 1.6. Either define them yourself by taking their definitions from source, or take those definitions and modify them to fit your very use case (both work well with jq 1.5).
Definitions from source:
jq -n --argjson data1 "$data1" --argjson data2 "$data2" '
def INDEX(stream; idx_expr):
reduce stream as $row ({}; .[$row | idx_expr | tostring] = $row);
def JOIN($idx; stream; idx_expr; join_expr):
stream | [., $idx[idx_expr]] | join_expr;
[JOIN(INDEX($data1[]; .timestamp); $data2[]; .timestamp | @text; add)]
'
Adapted to your use case:
jq -n --argjson data1 "$data1" --argjson data2 "$data2" '
($data1 | with_entries(.key = (.value.timestamp | @text))) as $ix
| $data2 | map(. + $ix[.timestamp | @text])
'

In general, if you find JOIN a bit tricky to understand or use, then consider using INDEX for this type of problem. In the present case, you could get away with a trivially simple approach, e.g.:
jq -n --argjson data1 "$data1" --argjson data2 "$data2" '
INDEX($data1[]; .timestamp) as $dict
| $data2 | map( . + $dict[.timestamp|tostring])
'
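If it helps to see the INDEX idea outside jq, here is a rough Python equivalent (my sketch; the index is just a dict keyed by timestamp, and the variable names mirror the filter):

```python
data1 = [{"bookings": 2984, "timestamp": 1675854900},
         {"bookings": 2967, "timestamp": 1675855200}]
data2 = [{"errors": 51, "timestamp": 1675854900},
         {"errors": 90, "timestamp": 1675855200}]

# INDEX($data1[]; .timestamp): a dict keyed by timestamp.
index = {row["timestamp"]: row for row in data1}
# $data2 | map(. + $dict[...]): look each data2 row up and merge.
combined = [{**row, **index[row["timestamp"]]} for row in data2]
```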

Another way to do this is to build a map from timestamps to error counts, and perform a lookup in it.
jq -n '
input as $data1
| input as $data2
| ($data2
| map({ "key": (.timestamp | tostring), "value": .errors })
| from_entries
) as $errors_by_timestamp
| $data1 | map(.errors = $errors_by_timestamp[(.timestamp | tostring)])
' <<<"$data1 $data2"
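For comparison, the same timestamp-to-errors lookup table sketched in Python (my translation, not the answerer's code):

```python
data1 = [{"bookings": 2984, "timestamp": 1675854900},
         {"bookings": 2967, "timestamp": 1675855200}]
data2 = [{"errors": 51, "timestamp": 1675854900},
         {"errors": 90, "timestamp": 1675855200}]

# from_entries in jq builds an object; in Python that's a dict comprehension.
errors_by_timestamp = {d["timestamp"]: d["errors"] for d in data2}
# .errors = $errors_by_timestamp[...]: attach the looked-up count to each row.
combined = [dict(d, errors=errors_by_timestamp[d["timestamp"]]) for d in data1]
```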

By the way, I have been trying to get this answer from an AI since this morning, and this time it finally gave me a correct solution too
#!/bin/bash
data1='[
{ "bookings": 2984, "timestamp": 1675854900 },
{ "bookings": 2967, "timestamp": 1675855200 }
]'
data2='[
{ "errors": 51, "timestamp": 1675854900 },
{ "errors": 90, "timestamp": 1675855200 }
]'
combined=$(jq -n --argjson d1 "$data1" --argjson d2 "$data2" '
[ $d1, $d2 ] | transpose | map(
reduce .[] as $i ({}; . * $i)
)
')
echo "$combined"
Just pasting it here for you guys in case you didn't think of this method
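A rough Python rendering of the transpose-and-merge pipeline (my sketch; jq's `*` is a recursive merge, which for these flat objects reduces to a plain dict update):

```python
data1 = [{"bookings": 2984, "timestamp": 1675854900},
         {"bookings": 2967, "timestamp": 1675855200}]
data2 = [{"errors": 51, "timestamp": 1675854900},
         {"errors": 90, "timestamp": 1675855200}]

combined = []
for pair in zip(data1, data2):   # [ $d1, $d2 ] | transpose
    merged = {}
    for item in pair:            # reduce .[] as $i ({}; . * $i)
        merged.update(item)
    combined.append(merged)
```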

Related

using jq and gnu parallel together

I have a jq command which I am trying to parallelise using GNU parallel but for some reason I am not able to get it to work.
The vanilla jq query is:
jq --raw-output '._id as $id | ._source.CitationTextHeader.Article.AuthorList[]? | .Affiliation.Affiliation | [ $id, .[0:rindex(" Electronic address:")] ] | @csv' results.json > test.out
I have tried to use it with parallel like so:
parallel -j0 --keep-order --spreadstdin "jq --raw-output '._id as $id | ._source.CitationTextHeader.Article.AuthorList[]? | .Affiliation.Affiliation | [ $id, .[0:rindex(" Electronic address:")] ] | @csv'" < results.json > test.json
but I get some bizarre compile error:
jq: error: syntax error, unexpected '|', expecting '$' or '[' or '{' (Unix shell quoting issues?) at <top-level>, line 1:
._id as | ._source.CitationTextHeader.Article.AuthorList[]? | .Affiliation.Affiliation | [ , .[0:rindex( Electronic address:)] ] | @csv
jq: 1 compile error
I think it does not like something re: quoting things in the string, but the error is a bit unhelpful.
UPDATE
Looking at other threads, I managed to construct this:
parallel -a results.json --results test.json -q jq -r '._id as $id | ._source.CitationTextHeader.Article.AuthorList[]? | .Affiliation.Affiliation | [ $id, .[0:rindex(" Electronic address:")] ] | @csv'
but now it complains:
parallel: Error: Command line too long (76224 >= 63664) at input 0:
:(
An example (first line) of the JSON file:
{
"_index": "corpuspm",
"_type": "_doc",
"_id": "6786777",
"_score": 1,
"_source": {
"CitationTextHeader": {
"Article": {
"AuthorList": [
{
"Affiliation": {
"Affiliation": "title, society, American Pediatric Society. address@hotmail.com."
}
}
]
}
}
}
}
results.json is a large file containing one JSON document per line
You could use --spreadstdin and -n1 to linewise spread the input into your jq filter. Without knowing about the structure of your input JSONs, I have just copied over your "vanilla" filter:
< results.json > test.out parallel -j0 -n1 -k --spreadstdin 'jq -r '\''
._id as $id | ._source.CitationTextHeader.Article.AuthorList[]?
| .Affiliation.Affiliation | [$id, .[0:rindex(" Electronic address:")]]
| @csv
'\'
Without more info this will be a guess:
doit() {
jq --raw-output '._id as $id | ._source.CitationTextHeader.Article.AuthorList[]? | .Affiliation.Affiliation | [ $id, .[0:rindex(" Electronic address:")] ] | @csv'
}
export -f doit
cat results.json | parallel --pipe doit > test.out
It reads blocks of ~1 MB from results.json which it passes to doit.
If that works, you may be able to speed up the processing with:
parallel --block -1 -a results.json --pipepart doit > test.out
It will split up results.json on the fly into n parts (where n = number of CPU threads). Each part will be piped into doit. The overhead of this is quite small.
Add --keep-order if you need the output to be in the same order as input.
If your disks are slow and your CPU is fast, this may be even faster:
parallel --lb --block -1 -a results.json --pipepart doit > test.out
It will buffer in RAM instead of in tempfiles. --keep-order will, however, not be useful here because the output from job 2 will only be read after job 1 is done.
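If GNU parallel proves fiddly, the same line-wise extraction can also be sketched with Python's standard library; this is my guess at the record structure based on the sample above, and the naive CSV quoting is a simplification of jq's @csv:

```python
import json
from concurrent.futures import ProcessPoolExecutor

def extract(line):
    """CSV rows for one NDJSON record, mirroring the jq filter."""
    doc = json.loads(line)
    rows = []
    authors = (doc.get("_source", {}).get("CitationTextHeader", {})
                  .get("Article", {}).get("AuthorList") or [])
    for author in authors:                       # .AuthorList[]?
        aff = author["Affiliation"]["Affiliation"]
        cut = aff.rfind(" Electronic address:")  # rindex(...)
        if cut != -1:
            aff = aff[:cut]                      # .[0:rindex(...)]
        rows.append(f'"{doc["_id"]}","{aff}"')   # naive @csv
    return rows

def main(path="results.json"):
    with open(path) as f, ProcessPoolExecutor() as pool:
        # chunksize batches lines per worker; map preserves input order.
        for rows in pool.map(extract, f, chunksize=1000):
            if rows:
                print("\n".join(rows))
```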

How to get the content of a JSON without showing the names of the key values

I am new to Bash and I am currently trying to get the content of a JSON without showing the names of the key values.
This is what the JSON looks like (part of it):
[
{
"V1": 65,
"V2": "Female",
"V3": 0.7,
"V4": 0.1,
"V5": 187,
"V6": 16,
"V7": 18,
"V8": 6.8,
"V9": 3.3,
"V10": 0.9,
"Class": 1
},
{
"V1": 62,
"V2": "Male",
"V3": 10.9,
"V4": 5.5,
"V5": 699,
"V6": 64,
"V7": 100,
"V8": 7.5,
"V9": 3.2,
"V10": 0.74,
"Class": 1
},
{
"V1": 62,
"V2": "Male",
"V3": 7.3,
"V4": 4.1,
"V5": 490,
"V6": 60,
"V7": 68,
"V8": 7,
"V9": 3.3,
"V10": 0.89,
"Class": 1
}
]
This is my script
#!/bin/bash
echo "Albumin =3";
echo "Age Sex Albumin Proteins";
echo "******";
echo " "
echo "Women";
echo "--------------";
cat csvjson.json | jq -c '.[] | {V1, V2, V8, V9} | select(.V9 ==3) | select(.V2 =="Female")';
echo " "
echo "Men";
echo "-------------";
cat csvjson.json | jq -c '.[] | {V1, V2, V8, V9} | select(.V9 ==3) | select(.V2 =="Male")';
This is what the script shows
Women
--------------
{"V1":38,"V2":"Female","V8":5.6,"V9":3}
{"V1":38,"V2":"Female","V8":5.6,"V9":3}
{"V1":32,"V2":"Female","V8":6,"V9":3}
{"V1":31,"V2":"Female","V8":6,"V9":3}
{"V1":19,"V2":"Female","V8":5.5,"V9":3}
{"V1":38,"V2":"Female","V8":7,"V9":3}
{"V1":20,"V2":"Female","V8":6.1,"V9":3}
{"V1":32,"V2":"Female","V8":7,"V9":3}
{"V1":42,"V2":"Female","V8":6.7,"V9":3}
Men
-------------
{"V1":72,"V2":"Male","V8":7.4,"V9":3}
{"V1":60,"V2":"Male","V8":6.3,"V9":3}
{"V1":33,"V2":"Male","V8":5.4,"V9":3}
{"V1":60,"V2":"Male","V8":6.8,"V9":3}
{"V1":60,"V2":"Male","V8":7.4,"V9":3}
{"V1":60,"V2":"Male","V8":7,"V9":3}
{"V1":72,"V2":"Male","V8":6.2,"V9":3}
And this is what I want to show
Women
--------------
38,Female,3, 5.6
38,Female,3, 5.6
32,Female,3, 6
31,Female,3, 6
19,Female,3, 5.5
38,Female,3, 7
20,Female,3, 6.1
32,Female,3, 7
42,Female,3, 6.7
Men
--------------
72,Male,3, 7.4
60,Male,3, 6.3
33,Male,3, 5.4
60,Male,3, 6.8
60,Male,3, 7.4
60,Male,3, 7
72,Male,3, 6.2
So, how can I hide the key values and only show the content of the JSON after doing the filters I did?
This can be accomplished entirely within jq (although some constraints are not entirely clear, so please comment and I will update the code):
jq --raw-output '
group_by(.V2)[]
| if first.V2 == "Male" then "Men" else "Women" end,
"--------------",
(
.[]
| select(.V9 == 3.3) # this filters to matching records
| [.V1, .V2, .V9, .V8]
| join(",")
)
' csvjson.json
A stand-alone jq script version of pmf's answer, with a shebang so it can be run directly (and highlighted as a code block here on Stack sites):
#!/usr/bin/env -S jq --raw-output --from-file
group_by(.V2)[]
| if first.V2 == "Male" then "Men" else "Women" end,
"--------------",
(
.[]
| select(.V9 == 3.3) # this filters to matching records
| [.V1, .V2, .V9, .V8]
| join(",")
)
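The same select-and-join logic, sketched in Python for readers without jq (the records are a trimmed, hypothetical subset of csvjson.json, since the full file isn't shown):

```python
records = [  # trimmed fields from the question's sample data
    {"V1": 65, "V2": "Female", "V8": 6.8, "V9": 3.3},
    {"V1": 62, "V2": "Male",   "V8": 7.5, "V9": 3.2},
    {"V1": 62, "V2": "Male",   "V8": 7.0, "V9": 3.3},
]

def report(records, v9=3.3):
    lines = []
    for sex, label in (("Female", "Women"), ("Male", "Men")):
        lines.append(label)
        lines.append("--------------")
        # select(.V9 == 3.3) | [.V1, .V2, .V9, .V8] | join(",")
        lines.extend(f'{r["V1"]},{r["V2"]},{r["V9"]},{r["V8"]}'
                     for r in records
                     if r["V2"] == sex and r["V9"] == v9)
    return "\n".join(lines)

print(report(records))
```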

How can I pass a variable in aws cli command in bash?

I am unable to pass a variable in the tag-user cli command.
A=$(aws iam list-user-tags --user-name user --query 'Tags[].{Key:Key,Value:Value}' | grep -B2 "Description" | grep Value | awk -F ":" '{print $2}' | tr -d '",'| awk '$1=$1')
aws iam list-user-tags --user-name user --query 'Tags[].{Key:Key,Value:Value}' | grep -B2 "Description" | grep Value
"Value": "Used for SSO",
A="Used for SSO"
Passing the value of A to the CLI below:
aws iam tag-user --user-name azure-sso-user --tags "[{"Key": "own:team","Value": "test@test.com"},{"Key": "security","Value": "Service"},{"Key": "comment","Value": "$A"}]"
This is the error I get:
Error parsing parameter '--tags': Invalid JSON:
[{Key: own:team,Value: test@test.com},{Key: security,Value: Service},{Key: own:comment,Value: Used
This worked:
aws iam tag-user --user-name user --tags '[{"Key": "own:team","Value": "test@test.com"},{"Key": "security","Value": "Service"},{"Key": "own:comment","Value": "'"$A"'"}]'
That is, using the following:
[
{
"Key": "own:team",
"Value": "test@test.com"
},
{
"Key": "security",
"Value": "Service"
},
{
"Key": "own:comment",
"Value": "'"$A"'"
}
]
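An alternative that sidesteps the quote-splicing entirely is to build the payload with a JSON serializer and hand the result to the CLI; a Python sketch (the subprocess call in the comment is illustrative, not tested against AWS):

```python
import json

A = "Used for SSO"  # the value extracted from list-user-tags

tags = [
    {"Key": "own:team",    "Value": "test@test.com"},
    {"Key": "security",    "Value": "Service"},
    {"Key": "own:comment", "Value": A},  # serializer handles quoting/escaping
]
payload = json.dumps(tags)
# e.g. subprocess.run(["aws", "iam", "tag-user", "--user-name", "user",
#                      "--tags", payload])
```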

Shell script - Sorting 'AWS cloudwatch metrics' json array based on the “Timestamp” property with raw output including statistics

I am running the AWS CLI:
aws cloudwatch get-metric-statistics --metric-name CPUUtilization --start-time 2010-02-20T12:00:00 --end-time 2010-02-20T15:00:00 --period 60 --namespace AWS/EC2 --extended-statistics p80 --dimensions Name=InstanceId,Value=i-0b123423423
the output comes as
{
"Label": "CPUUtilization",
"Datapoints": [
{
"Timestamp": "2020-02-20T12:15:00Z",
"Unit": "Percent",
"ExtendedStatistics": {
"p80": 0.16587132264856133
}
},
How do I get the output in the formats below (2 columns)?
19.514049550078127 2020-02-13T20:15:00Z
12.721997782508938 2020-02-13T19:15:00Z
13.318820949213313 2020-02-13T18:15:00Z
15.994192991030545 2020-02-13T17:15:00Z
18.13096421299414 2020-02-13T16:15:00Z
with Heading as CPUUtilization (2 columns)
CPUUtilization
19.514049550078127 2020-02-13T20:15:00Z
12.721997782508938 2020-02-13T19:15:00Z
13.318820949213313 2020-02-13T18:15:00Z
15.994192991030545 2020-02-13T17:15:00Z
18.13096421299414 2020-02-13T16:15:00Z
And in single column
19.514049550078127
12.721997782508938
13.318820949213313
15.994192991030545
18.13096421299414
How can I achieve this?
Assuming the input file is input.json, then:
To output in the 2 columns format:
jq -r '.Datapoints[] | [.ExtendedStatistics.p80, .Timestamp] | @tsv' input.json | sort -nr
With Heading as CPUUtilization (2 columns):
echo CPUUtilization; jq -r '.Datapoints[] | [.ExtendedStatistics.p80, .Timestamp] | @tsv' input.json | sort -nr
And in single column:
jq -r '.Datapoints[] | [.ExtendedStatistics.p80] | @tsv' input.json | sort -nr
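For reference, the same sort-and-reshape in Python (my sketch; the response shape matches the CLI output above, and the datapoint values are taken from the question's expected output):

```python
response = {
    "Label": "CPUUtilization",
    "Datapoints": [
        {"Timestamp": "2020-02-13T19:15:00Z", "Unit": "Percent",
         "ExtendedStatistics": {"p80": 12.721997782508938}},
        {"Timestamp": "2020-02-13T20:15:00Z", "Unit": "Percent",
         "ExtendedStatistics": {"p80": 19.514049550078127}},
    ],
}

# [.ExtendedStatistics.p80, .Timestamp] | @tsv, then `sort -nr`:
rows = sorted(((d["ExtendedStatistics"]["p80"], d["Timestamp"])
               for d in response["Datapoints"]), reverse=True)
print(response["Label"])          # the heading line
for p80, ts in rows:
    print(f"{p80}\t{ts}")
```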

Parsing JSON file-jq [duplicate]

This question already has answers here:
jq not working on tag name with dashes and numbers
(2 answers)
Closed 4 years ago.
Whole file:https://1drv.ms/u/s!AizscpxS0QM4hJpEkp12VPHiKO_gBg
Using this command I get the part below (the latest job):
jq '.|[ .executions[] | select(.job.name != null) | select(.job.name) ]
| sort_by(.id)
| reverse
| .[0] ' 1.json
{
"argstring": null,
"date-ended": {
"date": "2018-04-03T17:43:38Z",
"unixtime": 1522777418397
},
"date-started": {
"date": "2018-04-03T17:43:34Z",
"unixtime": 1522777414646
},
"description": "",
"executionType": "user",
"failedNodes": [
"172.30.61.88"
],
"href": "http://172.30.61.88:4440/api/21/execution/126",
"id": 126,
"job": {
"averageDuration": 4197,
"description": "",
"group": "",
"href": "http://172.30.61.88:4440/api/21/job/271cbcec-5042-4d52-b794-ede2056b2ab8",
"id": "271cbcec-5042-4d52-b794-ede2056b2ab8",
"name": "aa",
"permalink": "http://172.30.61.88:4440/project/demo/job/show/271cbcec-5042-4d52-b794-ede2056b2ab8",
"project": "demo"
},
"permalink": "http://172.30.61.88:4440/project/demo/execution/show/126",
"project": "demo",
"status": "failed",
"user": "administrator"
}
I managed to extract the job name and status; now I want to get date-ended.date.
jq '.|[ .executions[] |select(.job.name != null) | select(.job.name) ]
| sort_by(.id)
| reverse
| .[0]
| "\(.status), \(.job.name)"' 1.json
With the "-r" command-line option, the following filter:
[.executions[] | select(.job.name != null)]
| sort_by(.id)
| reverse
| .[0]
| [.status, .job.name, ."date-ended".date]
| @csv
produces:
"failed","aa","2018-04-03T17:43:38Z"
An important point that you might have missed is that "-" is a "special" character in that it can signify negation or subtraction.
If your jq does not support the syntax ."date-ended".date, then you could fall back to the basic syntax: (.["date-ended"] | .date)
I guess you are having trouble extracting .date-ended.date because the name contains a dash that jq interprets as subtraction.
The solution is listed in the documentation:
If the key contains special characters, you need to surround it with double quotes like this: ."foo$", or else .["foo$"].
This means the last filter of your jq program should be:
"\(.status), \(.job.name), \(."date-ended".date)"
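The dash problem is purely a jq syntax issue; in most languages the key is an ordinary string. A small Python illustration of the same extraction (the record is cut down from the question's output):

```python
execution = {  # cut down from the question's jq output
    "status": "failed",
    "job": {"name": "aa"},
    "date-ended": {"date": "2018-04-03T17:43:38Z"},
}
# Equivalent of "\(.status), \(.job.name), \(."date-ended".date)":
row = (f'{execution["status"]}, {execution["job"]["name"]}, '
       f'{execution["date-ended"]["date"]}')
print(row)
```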
