using jq and gnu parallel together - bash

I have a jq command which I am trying to parallelise using GNU parallel but for some reason I am not able to get it to work.
The vanilla jq query is:
jq --raw-output '._id as $id | ._source.CitationTextHeader.Article.AuthorList[]? | .Affiliation.Affiliation | [ $id, .[0:rindex(" Electronic address:")] ] | @csv' results.json > test.out
I have tried to use it with parallel like so:
parallel -j0 --keep-order --spreadstdin "jq --raw-output '._id as $id | ._source.CitationTextHeader.Article.AuthorList[]? | .Affiliation.Affiliation | [ $id, .[0:rindex(" Electronic address:")] ] | @csv'" < results.json > test.json
but I get some bizarre compile error:
jq: error: syntax error, unexpected '|', expecting '$' or '[' or '{' (Unix shell quoting issues?) at <top-level>, line 1:
._id as | ._source.CitationTextHeader.Article.AuthorList[]? | .Affiliation.Affiliation | [ , .[0:rindex( Electronic address:)] ] | @csv
jq: 1 compile error
I think it does not like something re: quoting things in the string, but the error is a bit unhelpful.
UPDATE
Looking at other threads, I managed to construct this:
parallel -a results.json --results test.json -q jq -r '._id as $id | ._source.CitationTextHeader.Article.AuthorList[]? | .Affiliation.Affiliation | [ $id, .[0:rindex(" Electronic address:")] ] | @csv'
but now it complains:
parallel: Error: Command line too long (76224 >= 63664) at input 0:
:(
An example (first line) of the JSON file:
{
  "_index": "corpuspm",
  "_type": "_doc",
  "_id": "6786777",
  "_score": 1,
  "_source": {
    "CitationTextHeader": {
      "Article": {
        "AuthorList": [
          {
            "Affiliation": {
              "Affiliation": "title, society, American Pediatric Society. address#hotmail.com."
            }
          }
        ]
      }
    }
  }
}

results.json is a large file containing one JSON object per line
You could use --spreadstdin and -n1 to spread the input line by line into your jq filter. Without knowing about the structure of your input JSONs, I have just copied over your "vanilla" filter:
< results.json > test.out parallel -j0 -n1 -k --spreadstdin 'jq -r '\''
._id as $id | ._source.CitationTextHeader.Article.AuthorList[]?
| .Affiliation.Affiliation | [$id, .[0:rindex(" Electronic address:")]]
| @csv
'\'

Without more info this will be a guess:
doit() {
  jq --raw-output '._id as $id | ._source.CitationTextHeader.Article.AuthorList[]? | .Affiliation.Affiliation | [ $id, .[0:rindex(" Electronic address:")] ] | @csv'
}
export -f doit
cat results.json | parallel --pipe doit > test.out
It reads blocks of ~1 MB from results.json which it passes to doit.
If that works, you may be able to speed up the processing with:
parallel --block -1 -a results.json --pipepart doit > test.out
It will split results.json on the fly into n parts (where n = the number of CPU threads). Each part will be piped into doit. The overhead of this is quite small.
Add --keep-order if you need the output to be in the same order as input.
If your disks are slow and your CPU is fast, this may be even faster:
parallel --lb --block -1 -a results.json --pipepart doit > test.out
It will buffer in RAM instead of in tempfiles. --keep-order will, however, not be useful here because the output from job 2 will only be read after job 1 is done.


How to read each cell of a column in csv and take each as input for jq in bash

I am trying to read each cell of CSV and treat it as an input for the JQ command. Below is my code:
line.csv
| Line |
|:---- |
| 11 |
| 22 |
| 33 |
Code to read CSV:
while read line
do
echo "Line is : $line"
done < line.csv
Output:
Line is 11
Line is 22
jq Command
jq 'select(.scan.line == '"$1"') | .scan.line,"|", .scan.service,"|", .scan.comment_1,"|", .scan.comment_2,"|", .scan.comment_3' linescan.json | xargs
I have a linescan.json which has values for line, service, comment_1, comment_2, comment_3.
I want to read each value of the CSV and use it as the input in the jq query where $1 is mentioned.
Given the input files and desired output:
line.csv
22,Test1
3389,Test2
10,Test3
linescan.json
{
  "scan": {
    "line": 3389,
    "service": "Linetest",
    "comment_1": "Line is tested1",
    "comment_2": "Line is tested2",
    "comment_3": "Line is tested3"
  }
}
desired output:
Test2 | 3389 | Linetest | Line is tested1 | Line is tested2 | Line is tested3
Here's a solution with jq:
jq -sr --rawfile lineArr line.csv '
  (
    $lineArr | split("\n") | del(.[-1]) | .[] | split(",")
  ) as [$lineNum, $prefix] |
  .[] | select(.scan.line == ($lineNum | tonumber)) |
  [
    $prefix,
    .scan.line,
    .scan.service,
    .scan.comment_1,
    .scan.comment_2,
    .scan.comment_3
  ] |
  join(" | ")
' linescan.json
Update: with jq 1.5:
#!/bin/bash
jq -sr --slurpfile lineArr <(jq -R 'split(",")' line.csv) '
  ($lineArr | .[]) as [$lineNum, $prefix] |
  .[] | select(.scan.line == ($lineNum | tonumber)) |
  [
    $prefix,
    (.scan.line | tostring),
    .scan.service,
    .scan.comment_1,
    .scan.comment_2,
    .scan.comment_3
  ] |
  join(" | ")
' linescan.json

how to add a sequential number to the output with jq

I'm getting some values with jq command like these:
curl xxxxxx | jq -r '.[] | ["\(.job.Name), \(.atrib.data)"] | @tsv' | column -t -s ","
It gives me:
AAAA PENDING
ZZZ FAILED BAD
What I want is to get a first field with a sequential number (1, 2, ...) like this:
1 AAA PENDING
2 ZZZ FAILED BAD
......
Do you know if it's possible? Thanks!
One way would be to start your pipeline with:
range(0;length) as $i | .[$i]
You then can use $i in the remainder of the program.
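A self-contained sketch of that approach, using a hypothetical two-element input to stand in for the curl output from the question (the names `json` and `numbered` are made up for the example; @tsv is assumed as the output format):

```shell
# Hypothetical stand-in for the API response in the question
json='[{"job":{"Name":"AAAA"},"atrib":{"data":"PENDING"}},
       {"job":{"Name":"ZZZ"},"atrib":{"data":"FAILED BAD"}}]'

# Start the pipeline with range(0; length) to get an index $i,
# then emit a 1-based counter as the first field of each row
numbered=$(echo "$json" | jq -r '
  range(0; length) as $i | .[$i]
  | [($i + 1 | tostring), .job.Name, .atrib.data]
  | @tsv')
echo "$numbered"
```

Because range(0; length) runs on the whole array, $i stays in scope for the rest of the pipeline, which is what makes the counter possible in pure jq.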

Using jq and outputting specific columns with formatting

Can anyone help me to understand how I can print countryCode followed by connectionName and load with a percentage symbol, all on one line, nicely formatted - all using jq, not using sed, column or any other external Unix command? I cannot seem to print anything other than the one column.
curl --silent "https://api.surfshark.com/v3/server/clusters" | jq -r -c "map(select(.countryCode == "US" and .load <= "99")) | sort_by(.load) | limit(20;.[]) | [.countryCode, .connectionName, .load] | (.[1])
Is this what you wanted?
curl --silent "https://api.surfshark.com/v3/server/clusters" |
jq -r -c 'map(select(.countryCode == "US" and .load <= 99)) |
sort_by(.load) |
limit(20;.[]) |
"\(.countryCode) \(.connectionName) \(.load)%"'

JQ - Argument list too long error - Large Input

I use jq to perform some filtering on a large JSON file using:
paths=$(jq '.paths | to_entries | map(select(.value[].tags | index("Filter"))) | from_entries' input.json)
and write the result to a new file using:
jq --argjson prefix "$paths" '.paths=$prefix' input.json > output.json
But this ^ fails as $paths has a very high line count (order of 100,000).
Error :
jq: Argument list too long
I also went through "/usr/bin/jq: Argument list too long error bash" and understood that it is the same problem there, but did not find a solution.
In general, assuming your jq allows it, you could use --argfile or --slurpfile, but in your case you can simply avoid the issue by invoking jq just once instead of twice. For example, to keep things clear:
( .paths | to_entries | map(select(.value[].tags | index("Filter"))) | from_entries ) as $prefix
| .paths=$prefix
Even better, simply use |=:
.paths |= ( to_entries | map(select(.value[].tags | index("Filter"))) | from_entries)
or better yet, use with_entries.
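A minimal sketch of the with_entries variant. The question does not show input.json, so the file below is a made-up stand-in with one entry that carries the "Filter" tag and one that does not:

```shell
# Made-up stand-in for the question's input.json
cat > input.json <<'EOF'
{"paths": {"/keep": [{"tags": ["Filter"]}], "/drop": [{"tags": ["Other"]}]}}
EOF

# One jq invocation: rewrite .paths in place, keeping only entries
# whose value contains the "Filter" tag (no shell variable needed)
jq '.paths |= with_entries(select(.value[].tags | index("Filter")))' input.json > output.json
cat output.json
```

Since the filtered object never leaves jq, the "Argument list too long" limit on shell arguments no longer applies, regardless of how large .paths is.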

Merge multiple jq invocations to sort and limit the content of a stream of objects

I have a json stream of updates for products, and I'm trying to get the last X versions sorted by version (they are sorted by release date currently).
It looks like jq can't sort a stream of objects directly; sort_by only works on arrays, and I couldn't find a way to collect a stream into an array that doesn't involve piping the output of jq -c to jq -s.
My current solution:
< xx \
jq -c '.[] | select(.platform | contains("Unix"))' \
| jq -cs 'sort_by(.version) | reverse | .[]' \
| head -5 \
| jq -C . \
| less
I expected to be able to use
jq '.[] | select(...) | sort_by(.version) | limit(5) | reverse'
but I couldn't find a function that limits, and sort_by doesn't work on non-arrays.
I am testing this on atlassian's json for releases: https://my.atlassian.com/download/feeds/archived/bamboo.json
In jq you can always collect results into an array by wrapping the expression in [..], which puts the results into an array for the subsequent functions to operate on. Your requirement could simply be done as
jq '[.[] | select(.platform | contains("Unix"))] | sort_by(.version) | limit(5;.[])'
See it working on the jq playground, tested on v1.6.
With the reverse function added, introduce another level of array nesting. Use reverse[] instead of reverse to dump the objects alone:
jq '[[.[] | select(.platform | contains("Unix"))] | sort_by(.version) | limit(5;.[]) ] | reverse'
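For comparison, an equivalent sketch that avoids the extra nesting by slicing after reverse. The Atlassian feed isn't reproduced here, so releases.json below is tiny made-up sample data:

```shell
# Tiny stand-in for the releases feed from the question's URL
cat > releases.json <<'EOF'
[{"platform": "Unix", "version": "6.8.0"},
 {"platform": "Windows", "version": "6.9.0"},
 {"platform": "Unix", "version": "6.7.1"},
 {"platform": "Unix", "version": "6.9.1"}]
EOF

# Collect once, sort, reverse, then take at most five with an array slice
top5=$(jq -c '[.[] | select(.platform | contains("Unix"))]
              | sort_by(.version) | reverse | .[:5]' releases.json)
echo "$top5"
```

Note that sort_by compares the version strings lexically, so this is only a sketch; real version ordering may need the string split on dots and compared numerically.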
