My Awk command sorts, but unexpectedly omits duplicates

I'm trying to sort this file by a specific field, and I want to do it all in awk:
"firstName": "gdrgo", "xxxxx": "John", "xxxxx": "John", "xxxxx": "John", "xxxxx": "John", "xxxxx": "John", "lastName": "222",dfg
"xxxxx": "John", "firstName": "beto", "xxxxx": "John", "xxxxx": "John", "xxxxx": "John", "lastName": "111","xxxxx": "John",
"xxxxx": "John", "firstName": "beto", "xxxxx": "John", "xxxxx": "John", "xxxxx": "John", "lastName": "111","xxxxx": "John",
"xxxxx": "John", "xxxxx": "John", "firstName": "beto2", "xxxxx": "John","lastName": "555", "xxxxx": "John","xxxxx": "John",
"xxxxx": "John", "xxxxx": "John", "firstName": "beto2", "xxxxx": "John","lastName": "444", "xxxxx": "John","xxxxx": "John",
"firstName": "gdrgo", "xxxxx": "John", "xxxxx": "John", "xxxxx": "John", "xxxxx": "John", "xxxxx": "John", "lastName": "222",dfg
"xxxxx": "John", "xxxxx": "John", "firstName": "beto2", "xxxxx": "John","lastName": "444", "xxxxx": "John","xxxxx": "John",
I use this command:
awk -F'.*"firstName": "|",.*"lastName": "|",' '{b[$3]=$0} END{for(i in b){print i}}' sumacomando
which outputs:
111
222
444
555
but I expected:
111
111
222
222
444
444
555
That is, while the actual output is seemingly sorted, as desired, it is unexpectedly missing duplicate values.

The ordering of keys/indices in awk's arrays, which are always associative arrays (dictionaries), is an implementation detail - no particular order is guaranteed; in your case the output just happened to be sorted.
Keys are unique, so if $3 has the same value in more than one input row, the b[$3]=... assignments overwrite each other; the last one wins, as the short demo after this list shows.
You therefore:
- have to use a sequentially indexed array to store your 3rd-field values ($3), and
- have to sort the resulting array by its values afterwards.
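A minimal sketch of the key-collision behavior, using ad-hoc sample input:
printf 'a\na\nb\n' | awk '{ b[$0] = $0 } END { for (i in b) print i }'
# prints only "a" and "b" (in some implementation-defined order):
# the two "a" input lines collapsed into a single key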
Per the POSIX Awk spec, Awk has no built-in sort functions, but GNU awk does, enabling the following solution with its asort() function:
awk -F'.*"firstName": "|",.*"lastName": "|",' '
{ b[++n]=$3 } END{ asort(b); for(i=1;i<=n;++i) print b[i] }
' sumacomando
Note that this doesn't include storing the associated full lines ($0).
If you also want to store the associated full lines while still performing the sorting in (GNU) Awk, it gets more complicated:
awk -F'.*"firstName": "|",.*"lastName": "|",' '
# Use a compound key to store the value of $3 plus a sequential index
# to disambiguate, and store the input row ($0) as the value.
{ vals[$3,++n]=$0 }
END{
# Sort by compound key using the helper function defined below.
asorti(vals, names, "cmp_func");
# Output the first half of the compound key, i.e., the value of $3,
# followed by the associated input row.
for(i=1;i<=n;++i) print gensub(SUBSEP ".*$", "", 1, names[i]), vals[names[i]]
}
# Helper sort function that splits the compound key into its components
# - $3 value and sequential index - and compares the $3 values alphabetically
# and the indices numerically.
function cmp_func(i1, v1, i2, v2) {
split(i1, tokens1, SUBSEP)
split(i2, tokens2, SUBSEP)
if (tokens1[1] < tokens2[1]) return -1
if (tokens1[1] > tokens2[1]) return 1
i1 = int(tokens1[2])
i2 = int(tokens2[2])
if (i1 < i2) return -1
if (i1 > i2) return 1
return 0
}
' sumacomando
Piping to sort as an alternative solution greatly simplifies matters:
awk -F'.*"firstName": "|",.*"lastName": "|",' '{ print $3, $0 }' sumacomando | sort -k1,1
Note, however, that the pure Awk solution above preserves the input order among duplicate $3 values, which the sort-assisted solution does not.
Conversely, the pure Awk solution needs to store all input in memory at once, whereas the sort utility is optimized to work with large input sets and uses temporary files on demand.
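If preserving the input order among duplicate $3 values matters and your sort implementation supports it (GNU sort does), the -s (stable) option closes that gap:
awk -F'.*"firstName": "|",.*"lastName": "|",' '{ print $3, $0 }' sumacomando | sort -s -k1,1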

Your choice of field separator is unconventional; perhaps use this instead:
awk -F'[:,]' '{
  for (i=1; i<=NF; i++)
    if ($i ~ "\"lastName\"") {
      gsub(/"/, "", $(i+1))
      print $(i+1)
    }
}' file | sort
If your awk has the asort() function, you can do this instead:
awk -F'[:,]' '{
  for (i=1; i<=NF; i++)
    if ($i ~ "\"lastName\"") {
      gsub(/"/, "", $(i+1))
      a[++c] = $(i+1)
    }
}
END {
  asort(a)
  for (k=1; k in a; k++) print a[k]
}' file

@victorhernandezzero: I tried a different approach that may help you/all too, using only a single awk (no other commands).
awk '/lastName/{getline;while(!$0){getline};A[$0]} END{num=asorti(A, B);for(i=1;i<=num;i++){print B[i]}}' RS='[: ",]' Input_file
EDIT1: The above solution will not print the duplicates that you need (special thanks to mklement0 for letting me know); the following may help with that:
awk '/lastName/{getline;while(!$0){getline};A[++j]=$0} END{num=asort(A, B);for(i=1;i<=num;i++){print B[i]}}' RS='[: ",\n]' Input_file
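For readability, here is the same logic expanded with comments:
awk '
/lastName/ {              # token "lastName" seen ...
  getline                 # ... so read the next RS-delimited token
  while (!$0) getline     # skip empty tokens between adjacent separators
  A[++j] = $0             # store the value; duplicates are kept
}
END {
  num = asort(A, B)       # GNU awk: sort the values of A into B
  for (i=1; i<=num; i++) print B[i]
}' RS='[: ",\n]' Input_file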

Related

Convert JSON file to have each set of curly brackets on a single line

I have a simple JSON file like this:
{
"user_id": 123,
"value": 99
},
{
"user_id": 122,
"value": 100
}
but I need it to look like this:
{"user_id": 123, "value": 99}
{"user_id": 122, "value": 100}
Basically every set of curly brackets should be on its own line. I was hoping it would be simple with jq but I'm quite new to it. Thank you in advance.
jq can be used to wrap the input (as raw text) inside [...], with the result being parsable by its builtin fromjson filter. Split the resulting array into separate objects again, and use -c to output each object on a single line.
$ cat old.txt
{
"user_id": 123,
"value": 99
},
{
"user_id": 122,
"value": 100
}
$ jq -csR '"[\(.)]" | fromjson |.[]' old.txt
{"user_id":123,"value":99}
{"user_id":122,"value":100}
If you get rid of the comma between the two objects (which is definitely invalid JSON), then just use jq's --compact-output (or -c) option and the identity filter ..
jq --compact-output '.'
{"user_id":123,"value":99}
{"user_id":122,"value":100}

JQ filter formatting

I have a JSON file I'd like to filter by the ID field, and show the matching Body and Source fields.
Format of the JSON file to filter
[
{
"timestamp" : 1638550971085,
"id" : "54f",
"body" : "Orange",
"source" : "827261"
},
{
"timestamp" : 1638550971096,
"id" : "54f",
"body" : "Apple",
"source" : "137261"
},
{
"timestamp" : 1638550971126,
"id" : "5da",
"body" : "Pear",
"source" : "1da61"
}
]
In this example I would like to filter where id = 54f and show the Timestamp (Unixtime converted to local time), Body and Source fields that match, ideally as shown below.
[Timestamp] Orange 827261
[Timestamp] Apple 137261
I have tried this command, but it is showing extra body / source fields outside of the select filter. It also adds a line break between printing the body and source; ideally I'd like this printed on one line (tab-separated). I also don't know how to convert the timestamp to a local time string.
more file.json | jq '.[] | select(.Id=="54f").body, .source'
Your JSON input is not proper JSON as it has
- commas after the .source field but no following field
- no commas between the elements of the top-level array
- no quotes around the objects' field names
You'd need to address these issues first before proceeding. This is how it should look:
[
{
"timestamp": 1638550971085,
"id": "54f",
"body": "Orange",
"source": "827261"
},
{
"timestamp": 1638550971096,
"id": "54f",
"body": "Apple",
"source": "137261"
},
{
"timestamp": 1638550971126,
"id": "5da",
"body": "Pear",
"source": "1da61"
}
]
Then you can go with this:
localtime (available since jq 1.6) converts a timestamp of seconds since the Unix epoch (so, divide yours by 1000) into a so-called "broken down time" object (see the manual), which you can either process using strftime (see the answer from David Conrad) or parse manually. With .[:3] | .[1] += 1 | join("-") I provide a rather primitive example for demonstration purposes: it takes the first three items (year, month, day), increments the second one (as the month has a 0-based encoding), and concatenates them with dashes in between. For padding with zeroes, check out one of the answers over here.
@tsv creates tabs between the columns:
jq -r '
  .[]
  | select(.id == "54f")
  | [(.timestamp / 1000 | localtime | .[:3] | .[1] += 1 | join("-")), .body, .source]
  | @tsv
' file.json
2021-12-3 Orange 827261
2021-12-3 Apple 137261
Demo
As the other answer states, your JSON is not correct. After fixing that, you can filter and extract the data as that answer suggests, but use the strftime function to format the dates properly:
jq -r '.[] | select(.id == "54f")
| [(.timestamp / 1000 | localtime | strftime("%Y-%m-%d")), .body, .source]
| @tsv' file.json
The use of strftime("%Y-%m-%d") is critical to both displaying the correct month and formatting the date with leading zeroes on single-digit months and days.
2021-12-03 Orange 827261
2021-12-03 Apple 137261
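If you want the timestamp rendered in UTC rather than local time, jq's gmtime builtin can be dropped in for localtime (a small variation on the above):
jq -r '.[] | select(.id == "54f")
  | [(.timestamp / 1000 | gmtime | strftime("%Y-%m-%d %H:%M:%S")), .body, .source]
  | @tsv' file.json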

Bash looping over returned JSON using curl

So I make a curl request to a URL which returns an array of objects:
response=$(curl --location --request GET "http://.....")
I need to iterate over the returned JSON and extract a single value.
The json is as follows:
{
data:[
{"name": "ABC", "value": 1},
{"name": "EFC", "value": 4},
{"name": "CEC", "value": 3}
]
}
Is there any way in Bash I can extract the second object's value, perhaps by iterating and doing an if?
Use jq
A simple example to extract the 2nd entry of the array would be:
RESPONSE='{"data":[{"name":"ABC","value": 1},{"name":"EFC","value":4},{"name":"CEC","value":3}]}';
EXTRACTED=$(echo -n "$RESPONSE" | jq '.data[1]');
echo $EXTRACTED
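To pull out just the value field of that second object, extend the path and add -r for raw output (same sample variable as above):
VALUE=$(echo -n "$RESPONSE" | jq -r '.data[1].value');
echo $VALUE
# prints: 4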
jq is the most commonly used command/tool to parse JSON data from a shell script. Here is an example with your data:
#!/usr/bin/env sh
# JSON response
response='
{
"data": [
{"name": "ABC", "value": 1},
{"name": "EFC", "value": 4},
{"name": "CEC", "value": 3}]
}'
# Name of entry
name='EFC'
# Get value of entry
value=$( jq --null-input --raw-output --arg aName "$name" \
"$response"' | .data[] | select(.name == $aName) | .value')
# Print it out
printf 'value for %s is: %s\n' "$name" "$value"
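With the sample response above, this prints:
value for EFC is: 4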
Alternatively, jq can be used to transform the whole JSON name/value object array into a Bash associative array declaration:
#!/usr/bin/env bash
# JSON response
response='
{
"data": [
{"name": "ABC", "value": 1},
{"name": "EFC", "value": 4},
{"name": "CEC", "value": 3}]
}'
# Name of entry
name='EFC'
# Convert all JSON array name/value entries into a Bash associative array
# shellcheck disable=SC2155 # safe generated declaration
declare -A entries="($( jq --null-input --raw-output \
"$response"' | .data[] | ( "[" + ( .name | @sh ) + "]=" + (.value | @sh) )'))"
# Print it out
printf 'value for %s is: %s\n' "$name" "${entries[$name]}"
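Once declared, the associative array can be iterated directly as well, for example (a minimal sketch):
# List every name/value pair from the generated array
for key in "${!entries[@]}"; do
  printf '%s=%s\n' "$key" "${entries[$key]}"
done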

curl with variable not working - shell script

When I execute a normal curl via a shell script, it works.
This works:
curl -s -v -X POST --data '{
"zoneConfig": {
"userID": "'$userid'",
"name": "'$myName'",
"id":"'$id'"
},
"delete": [
{
"id": "ID1"
},
{
"id": "ID2"
}
]
}' https://urlToAPI
But as soon as I put "delete" in a variable, I get an undefined error from the API vendor.
This is not working:
deleteValues='{
"id": "ID1"
},
{
"id": "ID2"
}'
curl -s -v -X POST --data '{
"zoneConfig": {
"userID": "'$userid'",
"name": "'$myName'",
"id":"'$id'"
},
"delete": [
'$deleteValues'
]
}' https://urlToAPI
But I don't understand the difference as both configurations are the same?
When interpolating, the value is split on whitespace.[1]
As such,
a='a b c'
prog $a
is equivalent to
prog 'a' 'b' 'c'
This splitting doesn't occur if the interpolation occurs inside of double-quotes.
As such,
a='a b c'
prog "$a"
is equivalent to
prog 'a b c'
Therefore, you need to change
$deleteValues
to
"$deleteValues"
[1] To be precise, the IFS env var controls how the value is split; it's normally set such that splitting occurs on spaces, tabs, and line feeds.

sed: replace single quotes around every word with double quotes

I'm trying to parse a file with single quotes, and want to change them to double quotes.
Sample data:
{'data': ['my',
'my_other',
'my_other',
'my_other',
'This could 'happen' <- and this ones i want to keep',
],
'name': 'MyName'},
'data2': {'type': 'x86_64',
'name': 'This',
'type': 'That',
'others': 'bla bla 'bla' <- keep this ones too',
'version': '21237821972'}}}
Desired output:
{"data": ["my",
"my_other",
"my_other",
"my_other",
"This could 'happen' <- and this ones i want to keep"
],
"name": "MyName"},
"data2": {"type": "x86_64",
"name": "This",
"type": "That",
"others": "bla bla 'bla' <- keep this ones too",
"version": "21237821972"}}}
I've already tried some regex with sed, but no luck. I understand why this is not working for me; I just don't know how to go further to get the data the way I want.
sed -E -e "s/( |\{|\[)'(.*)'(\:|,|\]|\})/\1\"\2\"\3/g"
I am no expert in jq, so as per the OP's question here is an attempt in awk to substitute ' with ".
awk -v s1="\"" '!/This could/ && !/others/{gsub(/\047/,s1) } /This could/ || /others/{gsub(/\047/,s1,$1)} 1' Input_file
Output will be as follows.
{"data": ["my",
"my_other",
"my_other",
"my_other",
"This could 'happen' <- and this ones i want to keep',
],
"name": "MyName"},
"data2": {"type": "x86_64",
"name": "This",
"type": "That",
"others": 'bla bla 'bla' <- keep this ones too',
"version": "21237821972"}}}
We know that the sed command can search for a pattern and replace it with a user-provided new one.
For example: sed "s/pattern1/pattern2/g" filename.txt
Now sed will search for pattern1 and, if found, replace it with pattern2.
For your requirement you just need to apply this rule. See below.
First:
sed "s/^\'/\"/g" yourfile
This will search each line for a leading ' character and replace it with ".
The next requirement is to search for the pattern ': and replace it with ":
So add one more substitution to it, separated by ;
sed "s/^\'/\"/g; s/\':/\":/g" yourfile
Just follow this algorithm till you reach your requirement.
The final command should look like:
sed "s/^\'/\"/g; s/\':/\":/g;s/{\'/{\"/g;s/\[\'/\[\"/g;s/\',/\",/g;s/\'}/\"}/g;s/: \'/: \"/g;" yourfile > newfile
(If the above command gives you an error, just build it up step by step from the beginning.)
Finally:
mv newfile yourfile
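Alternatively, if your sed supports in-place editing (GNU sed does, via -i), you can skip the temporary file:
sed -i "s/^\'/\"/g; s/\':/\":/g;s/{\'/{\"/g;s/\[\'/\[\"/g;s/\',/\",/g;s/\'}/\"}/g;s/: \'/: \"/g;" yourfile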
