Replace a multiline pattern using Perl, sed, awk - bash

I need to concatenate multiple JSON files, so
...
"tag" : "description"
}
]
[
{
"tag" : "description"
...
into this:
...
"tag" : "description"
},
{
"tag" : "description"
...
So I need to replace the pattern ] [ with ,, but the newline character is driving me crazy...
I tried several methods; I list some of them:
sed
sed -i '/]/,/[/{s/./,/g}' file.json
but I get this error:
sed: -e expression #1, char 16: unterminated address regex
I tried to delete all the newlines
following this example
sed -i ':a;N;$!ba;s/\n/ /g' file.json
but the output file contains "^M" characters. Although I edited this file on Unix, I ran the dos2unix command on it, but nothing happened. I then tried to include the special character "^M" in the search, but with worse results.
Perl
(as proposed here)
perl -i -0pe 's/]\n[/\n,/' file.json
but I get this error:
Unmatched [ in regex; marked by <-- HERE in m/]\n[ <-- HERE / at -e line 1.

I would like to concatenate several JSON files.
If I understand correctly, you have something like the following (where letters represent valid JSON values):
to_combine/file1.json: [a,b,c]
to_combine/file2.json: [d,e,f]
And from that, you want the following:
combined.json: [a,b,c,d,e,f]
You can use the following to achieve this:
perl -MJSON::XS -0777ne'
    push @data, @{ decode_json($_) };
    END { print encode_json(\@data); }
' to_combine/*.json >combined.json
As for the problem with your Perl solution:
[ has a special meaning in regex patterns. You need to escape it.
You only perform one replacement.
-0 doesn't actually turn on slurp mode. Use -0777.
You place the comma after the newline, when it would be nicer before the newline.
Fix:
cat to_combine/*.json | perl -0777pe's/\]\n\[/,\n/g' >combined.json

Note that a better way to combine multiple JSON files is to parse them all, combine the parsed data structures, and re-encode the result. Simply changing all occurrences of ][ to a comma , may alter data instead of markup.
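If jq is available, the same parse-and-reencode approach fits in a one-liner (a sketch: -s slurps all input files into one array of arrays, and add concatenates them):
jq -s add to_combine/*.json > combined.json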
sed is a minimal program that operates on only a single line of a file at a time. Perl encompasses everything that sed or awk can do and a huge amount more besides, so I suggest you stick with it.
To change all ]...[ pairs in file.json (possibly separated by whitespace) to a single comma, use this:
perl -0777 -pe "s/\]\s*\[/,/g" file.json > file2.json
The -0 option specifies an octal input record separator, and giving it the value 777 makes perl read the entire file at once.
One-liners are famously unintelligible, and I always prefer a proper program file, which would look like this:
join_brackets.pl
use strict;
use warnings 'all';
my $data = do {
    local $/;
    <>;
};
$data =~ s/ \] \s* \[ /,/gx;
print $data;
and you would run it as
perl join_brackets.pl file.json > joined.json

I tried it with the example in your question.
$ sed -rn '
1{$!N;$!N}
$!N
/\s*}\s*\n\s*]\s*\n\s*\[\s*\n\s*\{\s*/M {
s//\},\n\{/
$!N;$!N
}
P;D
' file
...
"tag" : "description"
},
{
"tag" : "description"
...
...
"tag" : "description"
},
{
"tag" : "description"
...

Related

Replace a pattern with the output of a command in sed

Suppose I have some text file (json in this case):
{
"data": [
{
"timestamp": 1577856103107
},
{
"timestamp": 1577869991302
}
]
}
And I want to replace a pattern (in this case a UNIX millisecond timestamp) with a more readable date format.
I'm trying with this:
$ sed -E 's/(.*)([0-9]{13})/echo "\1\\"$(date --date="@$((\2\/1000))" --iso-8601=seconds)\\""/e' example.json
{
"data": [
{
timestamp: "2020-01-01T00:21:43-05:00"
},
{
timestamp: "2020-01-01T04:13:11-05:00"
}
]
}
This is somewhat OK, but I don't understand why the quotes around timestamp get lost.
This command works:
sed -E 's/(.*)"(timestamp)"(: )([0-9]{13})/echo "\1\\"\2\\"\3\\"$(date --date="@$((\4\/1000))" --iso-8601=seconds)"\\"/e' example.json
{
"data": [
{
"timestamp": "2020-01-01T00:21:43-05:00"
},
{
"timestamp": "2020-01-01T04:13:11-05:00"
}
]
}
I also don't understand why I need double backslashes \\ to output a double-quote " on the right side of this sed command.
Is there a better way (or tool) to solve this?
I'm on sed (GNU sed) 4.8 and zsh 5.8 (x86_64-pc-linux-gnu), thanks!
Is there a better way (or tool) to solve this?
Using sed to manipulate JSON is very crude. You can't parse JSON with regex. I (strongly) suggest using JSON-aware tools, like jq.
jq '.data[].timestamp |= (. / 1000 | strftime("%Y-%m-%dT%H:%M:%SZ"))'
but I don't understand why the quotes around timestamp get lost.
The:
echo "\1\\"$(date --date="@$((\2\/1000))" --iso-8601=seconds)\\""
is "substituted" to:
echo ""timestamp": \"$(date --date="@$((1577856103107/1000))" --iso-8601=seconds)\""
^^
^------------------------------------------------------------------^
Then it is passed to the shell, and quotes are re-evaluated according to shell rules.
Matching (.*) is really doing nothing. You could instead match only the part you want to substitute:
sed -E '/"timestamp":/s/([0-9]{13})/echo "\\"$(date --date="@$((\1\/1000))" --iso-8601=seconds)\\""/e'
why I need double backslashes \\ to output a double-quote " on the right side of this sed command.
First, \\ is interpreted by sed as a single \.
$ echo a | sed 's/.*/single slash: \\/'
single slash: \
Then the result of the sed command is passed to the shell, where all shell parsing rules are applied once again.
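A quick illustration of the two layers (a sketch relying on GNU sed's e flag): four backslashes in the script survive sed as two, and the shell then collapses those to one:
$ echo a | sed 's/.*/echo "double slash: \\\\"/e'
double slash: \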

Bash/*NIX: split a file into multiple files on a substring

Variants of this question have been asked and answered before, but I find that my sed/grep/awk skills are far too rudimentary to work from those to a custom solution since I hardly ever work in shell scripts.
I have a rather large (100K+ lines) text file in which each line defines a GeoJSON object, each such object including a property called "county" (there are, all told, 100 different counties). Here's a snippet:
{"type": "Feature", "properties": {"county":"ALAMANCE", "vBLA": 0, "vWHI": 4, "vDEM": 0, "vREP": 2, "vUNA": 2, "vTOT": 4}, "geometry": {"type":"Polygon","coordinates":[[[-79.537429,35.843303],[-79.542428,35.843303],[-79.542428,35.848302],[-79.537429,35.848302],[-79.537429,35.843303]]]}},
{"type": "Feature", "properties": {"county":"NEW HANOVER", "vBLA": 0, "vWHI": 0, "vDEM": 0, "vREP": 0, "vUNA": 0, "vTOT": 0}, "geometry": {"type":"Polygon","coordinates":[[[-79.532429,35.843303],[-79.537428,35.843303],[-79.537428,35.848302],[-79.532429,35.848302],[-79.532429,35.843303]]]}},
{"type": "Feature", "properties": {"county":"ALAMANCE", "vBLA": 0, "vWHI": 0, "vDEM": 0, "vREP": 0, "vUNA": 0, "vTOT": 0}, "geometry": {"type":"Polygon","coordinates":[[[-79.527429,35.843303],[-79.532428,35.843303],[-79.532428,35.848302],[-79.527429,35.848302],[-79.527429,35.843303]]]}},
I need to split this into 100 separate files, each containing one county's GeoJSONs, and each named xxxx_bins_2016.json (where xxxx is the county's name). I'd also like the final character (comma) at the end of each such file to go away.
I'm doing this in Mac OSX, if that matters. I hope to learn a lot by studying any solutions you could suggest, so if you feel like taking the time to explain the 'why' as well as the 'what' that would be fantastic. Thanks!
EDITED to make clear that there are different county names, some of them two-word names.
jq can kind of do this; it can group the input and output one line of text per group. The shell then takes care of writing each line to an appropriately named file. jq itself doesn't really have the ability to open files for writing that would allow you to do this in a single process.
jq -Rn -c '[inputs[:-1]|fromjson] | group_by(.properties.county)[]' tmp.json |
while IFS= read -r line; do
    county=$(jq -r '.[0].properties.county' <<< "$line")
    jq -r '.[]' <<< "$line" > "$county.txt"
done
[inputs[:-1]|fromjson] reads each line of your file as a string, strips the trailing comma, then parses the line as JSON and wraps the parsed objects into a single array. The resulting array is sorted and grouped by county name, then written to standard output, one group per line.
The shell loop reads each line, extracts the county name from the first element of the group with a call to jq, then uses jq again to write each element of the group to the appropriate file, again one element per line.
(A quick look at https://github.com/stedolan/jq/issues doesn't appear to show any requests yet for an output function that would let you open and write to a file from inside a jq filter. I'm thinking of something like
jq -Rn '... | group_by(.properties.county) | output("\(.properties.county).txt")' tmp.json
without the need for the shell loop.)
If using string parsing rather than proper JSON parsing to extract the county name is acceptable - brittle in general, but would work in this simple case - consider Sam Tolton's GNU awk answer, which has the potential to be by far the simplest and fastest solution.
To complement chepner's excellent answer with a variation that focuses on performance:
jq -Rrn '[inputs[:-1]|fromjson][] | .properties.county + "|" + (.|tostring)' file |
awk -F'|' '{ print $2 > ($1 "_bins_2016.json") }'
Shell loops are avoided altogether, which should speed up the operation.
The general idea is:
Use jq to trim the trailing , from each input line, interpret the trimmed string as JSON, extract the county name, then output the trimmed JSON strings prepended with the county name and a distinct separator, |.
Use an awk command to split each line into the prepended county name and the trimmed JSON string, which allows awk to easily construct the output filename and write the JSON string to it.
Note: The awk command keeps all output files open until the script has finished, which means that, in your case, 100 output files will be open simultaneously - a number that shouldn't be a problem, however.
In cases where it is a problem, you can use the following variation, in which jq first sorts the lines by county name, which then allows awk to immediately close the previous output file whenever the next county is reached in the input:
jq -Rrn '
[inputs[:-1]|fromjson] | sort_by(.properties.county)[] |
.properties.county + "|" + (.|tostring)
' file |
awk -F'|' '
prevCounty != $1 { if (outFile) close(outFile); outFile = $1 "_bins_2016.json" }
{ print $2 > outFile; prevCounty = $1 }
'
A simpler version of chepner's answer:
while IFS= read -r line
do
    countyName=$(jq --raw-output '.properties.county' <<< "${line: : -1}")
    jq <<< "${line: : -1}" >> "$countyName"_bins_2016.json
done < file
The idea is to filter the county name using a jq filter after stripping the , from each line of your input file. Then the line is passed to jq as plain stream to produce a JSON file in prettified format.
If you are on a relatively old version of bash (< 4.2, where negative length offsets in substring expansion are unsupported), use "${line%?}" instead of "${line: : -1}".
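Both expansions simply drop the final character (here, the trailing comma), e.g.:
$ line='{"type": "Feature"},'
$ echo "${line%?}"
{"type": "Feature"}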
For example, with the change above, one of your county files becomes:
cat ALAMANCE_bins_2016.json
{
"type": "Feature",
"properties": {
"county": "ALAMANCE",
"vBLA": 0,
"vWHI": 0,
"vDEM": 0,
"vREP": 0,
"vUNA": 0,
"vTOT": 0
},
"geometry": {
"type": "Polygon",
"coordinates": [
[
[
-79.527429,
35.843303
],
[
-79.532428,
35.843303
],
[
-79.532428,
35.848302
],
[
-79.527429,
35.848302
],
[
-79.527429,
35.843303
]
]
]
}
}
Note: the current solution could be performance-intensive, as reading the file line by line is an expensive operation, and so is invoking jq once per line.
This will do what you want minus getting rid of the last comma:-
gawk 'match($0, /"county":"([^"]+)/, array){ print >array[1]"_bins_2016.json" }' INPUT_FILE
This will output files in the current path with filenames in the format COUNTY NAME_bins_2016.json.
The script goes line by line and uses a regex to match the exact term "county":" followed by 1 or more characters that aren't a ". It captures the characters within the quotes and then uses them as part of the filename to append the current line to.
To remove the trailing comma from all .json files in the current path you could use:-
sed -i '$ s/,$//' *.json
If you were certain that the last char was always a comma, a faster solution would be to use truncate:-
truncate -s-1 *.json
Last part taken from this answer: https://stackoverflow.com/a/40568723/1453798
Here is a quickie script that will do the job. It has the virtue of working on most systems without having to install any other tools.
IFS=$'\n'
counties=( $( sed 's/^.*"county":"//;s/".*$//' counties.txt | sort -u ) )
unset IFS
for i in "${!counties[@]}"
do
    county="${counties[$i]}"
    filename="$county".out.txt
    echo "'$filename'"
    grep "\"$county\"" counties.txt > "$filename"
done
The setting of IFS to \n allows the array elements to contain spaces. The sed command strips off all the text up to the start of the county name and all the text after it, and sort -u reduces the output to one entry per county. The for loop is the form that iterates over the indices of the array. Finally, the grep command needs double quotes around the search string so that counties that are substrings of other counties don't accidentally get put into the wrong file.
See this section of the GNU BASH Reference Manual for more info.

bash replacing character at certain position on a certain line

I have a file that looks like this:
[
{
"ncyc" : 28817,
"icels" : 128,
"jcels" : 128,
"t" : 0.185896E-006,
"dt" : 0.955602E-012,
"dtcour" : 0.100000E+021,
"dti" : 0.100000E+021,
"dtc" : 0.262902E-011,
"dtvol" : 0.239735E-010,
"dthall" : 0.100000E+021,
"dtlaser" : -0.925596E+062,
"dtmax" : 0.200000E-009,
}
]
I want to delete the last comma of this file. It appears at the 14th line at position 34. I could do this manually if it were one file, but I have to do this for 300 files.
sed is your friend:
sed -i.bak '14s/,[[:blank:]]*$//' file ...
This is a bit fragile: it assumes the line to remove is always the 14th, not necessarily the line before the closing brace.
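A sketch that avoids hard-coding the line number, assuming GNU sed (-z treats the whole file as a single record, so $ anchors at the end of the file and the comma before the final } ] is removed wherever it sits):
sed -i.bak -z 's/,\(\s*}\s*]\s*\)$/\1/' file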
Depending on the platform, sed or awk might give varying results; perl might be more flexible:
perl -i.bak -00pe 's/,(?!.*,)//s' file
# , matches a comma.
# (?!.*,) is a negative lookahead asserting there is no comma after the matched one.
# s is the DOTALL modifier, which lets . match newline characters too.
This is a straightforward ed one-liner:
ed foo.json <<EOF
?,?s/,\([^,]*\)$/\1/
wq
EOF
That line can be broken into an address and a command.
The address is ?,?, namely the previous line matching the regular expression ,.
The command is s/re/replacement/, where the regular expression is ,\([^,]*\)$ (a literal ,, a captured group of zero or more characters that are not ,, and the end of the line), and the replacement is \1 (the first captured group).
Technically it's a two-line ed script, wq to save and quit.
You could invoke this in a loop with find, for instance:
find . -name '*.json' | while IFS= read -r name ; do
ed -s "$name" <<EOF
H
[…ed commands…]
wq
EOF
done
I've also added ed -s to suppress the file size message, and H to output verbose errors instead of the infamous ?.
Thanks for the answers. I was easily able to solve the question myself using Python:
# read all lines, drop the comma on the third line from the end, write back
with open(fjson) as f:
    data = f.readlines()
ndx = len(data)
data[ndx-3] = data[ndx-3].replace(',', '')
with open(fjson, 'w') as f:
    f.writelines(data)

remove only *some* fullstops from a csv file

If I have lines like the following:
1,987372,987372,C,T,.,.,.,.,.,.,.,.,1,D,.,.,.,.,.,.,.,1.293,12.23,0.989,0.973,D,.,.,.,.,0.253,0,4.08,0.917,1.048,1.000,1.000,12.998
1,987393,987393,C,T,.,.,.,.,.,.,.,.,1,D,.,.,.,.,.,.,0.152,1.980,16.09,0.999,0.982,D,-0.493,T,0.335,T,0.696,0,5.06,0.871,0.935,0.998,0.997,16.252
how can I replace all instances of ,., with ,?,
I want to preserve actual decimal places in the numbers so I can't just do
sed 's/./?/g' file
however when doing:
sed 's/,.,/,?,/g' file
this only appears to work in some cases. i.e. there are still instances of ,., hanging around.
anyone have any pointers?
Thanks
This should work:
sed ':a;s/,\.,/,?,/g;ta' file
With successive ,., strings, after a substitution succeeds, scanning resumes after the inserted comma, so the very next ,., group has lost its leading , and doesn't match; you need a second pass.
:a is a label for the upcoming loop
,\., will match a dot between commas. Note that the dot must be escaped, because an unescaped . matches any character (,a, would also match the pattern ,.,).
g is for global substitution
ta tests the previous substitution and, if it succeeded, loops back to the :a label for the remaining substitutions.
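You can see the stragglers a single pass leaves behind, and the loop cleaning them up:
$ echo ',.,.,.,' | sed 's/,\.,/,?,/g'
,?,.,?,
$ echo ',.,.,.,' | sed ':a;s/,\.,/,?,/g;ta'
,?,?,?,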
Using sed, it is possible by running a loop as shown in the above answer; however, the problem is easily solved with a perl command line using lookarounds:
perl -pe 's/(?<=,)\.(?=,)/?/g' file
1,987372,987372,C,T,?,?,?,?,?,?,?,?,1,D,?,?,?,?,?,?,?,1.293,12.23,0.989,0.973,D,?,?,?,?,0.253,0,4.08,0.917,1.048,1.000,1.000,12.998
1,987393,987393,C,T,?,?,?,?,?,?,?,?,1,D,?,?,?,?,?,?,0.152,1.980,16.09,0.999,0.982,D,-0.493,T,0.335,T,0.696,0,5.06,0.871,0.935,0.998,0.997,16.252
This command doesn't need a loop because instead of matching surrounding commas we're just asserting their position using a lookbehind and lookahead.
All that's necessary is a single substitution:
$ perl -pe 's/,\.(?=,)/,?/g' dots.csv
1,987372,987372,C,T,?,?,?,?,?,?,?,?,1,D,?,?,?,?,?,?,?,1.293,12.23,0.989,0.973,D,?,?,?,?,0.253,0,4.08,0.917,1.048,1.000,1.000,12.998
1,987393,987393,C,T,?,?,?,?,?,?,?,?,1,D,?,?,?,?,?,?,0.152,1.980,16.09,0.999,0.982,D,-0.493,T,0.335,T,0.696,0,5.06,0.871,0.935,0.998,0.997,16.252
You have an example using sed style regular expressions. I'll offer an alternative - parse the CSV, and then treat each thing as a 'field':
#!/usr/bin/perl
use strict;
use warnings;
#iterate input row by row
while ( <DATA> ) {
    #remove linefeeds
    chomp;
    #split this row on ,
    my @row = split /,/;
    #iterate each field
    foreach my $field ( @row ) {
        #replace this field with "?" if it's "."
        $field = "?" if $field eq ".";
    }
    #stick this row together again.
    print join( ",", @row ), "\n";
}
__DATA__
1,987372,987372,C,T,.,.,.,.,.,.,.,.,1,D,.,.,.,.,.,.,.,1.293,12.23,0.989,0.973,D,.,.,.,.,0.253,0,4.08,0.917,1.048,1.000,1.000,12.998
1,987393,987393,C,T,.,.,.,.,.,.,.,.,1,D,.,.,.,.,.,.,0.152,1.980,16.09,0.999,0.982,D,-0.493,T,0.335,T,0.696,0,5.06,0.871,0.935,0.998,0.997,16.252
This is more verbose than it needs to be, to illustrate the concept. This could be reduced down to:
perl -F, -lane 'print join ",", map { $_ eq "." ? "?" : $_ } @F'
If your CSV also has quoting, then you can break out the Text::CSV module, which handles that neatly.
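A minimal sketch with Text::CSV (assuming the module is installed and the input is file.csv):
perl -MText::CSV -e '
    my $csv = Text::CSV->new({ binary => 1, eol => "\n" });
    while ( my $row = $csv->getline(\*STDIN) ) {
        $csv->print( \*STDOUT, [ map { $_ eq "." ? "?" : $_ } @$row ] );
    }
' < file.csv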
You just need 2 passes since the trailing , found on a ,., match isn't available to match the leading , on the next ,.,:
$ sed 's/,\.,/,?,/g; s/,\.,/,?,/g' file
1,987372,987372,C,T,?,?,?,?,?,?,?,?,1,D,?,?,?,?,?,?,?,1.293,12.23,0.989,0.973,D,?,?,?,?,0.253,0,4.08,0.917,1.048,1.000,1.000,12.998
1,987393,987393,C,T,?,?,?,?,?,?,?,?,1,D,?,?,?,?,?,?,0.152,1.980,16.09,0.999,0.982,D,-0.493,T,0.335,T,0.696,0,5.06,0.871,0.935,0.998,0.997,16.252
The above will work in any sed on any OS.

Replace "\n" with newline in awk

I'm tailing logs and they output \n instead of newlines.
I thought I'd pipe the tail to awk and do a simple replace, however I cannot seem to escape the newline in the regex. Here I'm demonstrating my problem with cat instead of tail:
test.txt:
John\nDoe
Sara\nConnor
cat test.txt | awk -F'\\n' '{ print $1 "\n" $2 }'
Desired output:
John
Doe
Sara
Connor
Actual output:
John\nDoe
Sara\nConnor
So it looks like \\n does not match the \n between the first and last names in test.txt but instead the newline at the end of each line.
It looks like \\n is not the right way of escaping in the terminal, right? This way of escaping works fine in, e.g., Sublime Text.
How about this?
$ cat file
John\nDoe
Sara\nConnor
$ awk '{gsub(/\\n/,"\n")}1' file
John
Doe
Sara
Connor
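For completeness, the field-separator approach from the question can also be made to work, at least with GNU awk (other awks differ in how they process -F escapes): the backslash has to survive both awk's string processing and the regex engine, which takes four backslashes:
$ awk -F'\\\\n' '{ print $1; print $2 }' file
John
Doe
Sara
Connor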
Using GNU sed, the solution is pretty simple, as @hek2mgl already answered (and that IMHO is the way it should work everywhere, but unfortunately doesn't).
But it's a bit trickier when doing it on Mac OS X and other *BSD UNIXes.
The best way looks like this:
sed 's/\\n/\'$'\n''/g' <<< 'ABC\n123'
Then of course there's still AWK; @AvinashRaj has the correct answer if you'd like to use that.
Why use either awk or sed for this? Use perl!
perl -pe 's/\\n/\n/g' file
By using perl you avoid having to think about POSIX compliance; it typically gives better performance and is consistent across all (well, most) platforms.
This will work with any sed on any system as it is THE portable way to use newlines in sed:
$ sed 's/\\n/\
/' file
John
Doe
Sara
Connor
If it is possible for your input to contain a line like foo\\nbar, and the \\ is intended to be an escaped backslash, then you cannot use a simple substitution approach like the one you've asked for.
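For instance, a naive substitution happily consumes half of an escaped backslash:
$ printf 'foo\\\\nbar\n' | perl -pe 's/\\n/\n/g'
foo\
bar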
I have struggled with this problem before, but I discovered the cleanest way is to use the built-in printf:
printf "$(cat file.txt)" | less
Here is a real-world example dealing with an AWS IAM embedded JSON policy in the output; the file file.txt contains:
{
"registryId": "111122223333",
"repositoryName": "awesome-repo",
"policyText": "{\n \"Version\" : \"2008-10-17\",\n \"Statement\" : [ {\n \"Sid\" : \"AllowPushPull\",\n \"Effect\" : \"Allow\",\n \"Principal\" : {\n \"AWS\" : [ \"arn:aws:iam::444455556666:root\", \"arn:aws:iam::444455556666:user/johndoe\" ]\n },\n \"Action\" : [ \"ecr:BatchCheckLayerAvailability\", \"ecr:BatchGetImage\", \"ecr:CompleteLayerUpload\", \"ecr:DescribeImages\", \"ecr:DescribeRepositories\", \"ecr:GetDownloadUrlForLayer\", \"ecr:InitiateLayerUpload\", \"ecr:PutImage\", \"ecr:UploadLayerPart\" ]\n } ]\n}"
}
after applying the above (without the less) you get:
{
"registryId": "111122223333",
"repositoryName": "awesome-repo",
"policyText": "{
"Version" : "2008-10-17",
"Statement" : [ {
"Sid" : "AllowPushPull",
"Effect" : "Allow",
"Principal" : {
"AWS" : [ "arn:aws:iam::444455556666:root", "arn:aws:iam::444455556666:user/johndoe" ]
},
"Action" : [ "ecr:BatchCheckLayerAvailability", "ecr:BatchGetImage", "ecr:CompleteLayerUpload", "ecr:DescribeImages", "ecr:DescribeRepositories", "ecr:GetDownloadUrlForLayer", "ecr:InitiateLayerUpload", "ecr:PutImage", "ecr:UploadLayerPart" ]
} ]
}"
}
Note that the value for "policyText" is itself a string containing json.
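If jq is available, a safer sketch for that particular case is to let jq unescape just the embedded document, since -r prints the string with its \n escapes expanded:
jq -r '.policyText' file.txt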
I would use sed:
sed 's/\\n/\n/g' file
In addition to the accepted answer: the OP asked about tail, and on some Unix variants, e.g. Ubuntu (where the default awk is mawk), you need to add -W interactive to awk so the output isn't buffered:
tail -f error.log | awk -W interactive '{gsub(/\\n/,"\n")}1'
