awk - how to replace semicolon in string in csv file? - bash

I need to manage SMTP logfile handling in my company.
These logfiles need to be imported into MSSQL, so it is my job to provide this data.
One non-delivery message contains a ";" inside a field, and I need to replace it with a comma.
Here is what I have:
Sender;Recipient;Operation;Answer;Error;Servername
bla#bla.com;rockit#sohard.com;RCPT TO;450;+4.2.0+<rockit#sohard.com>:+Recipient+address+rejected:+Policy+restrictions;+try+later;M0641
Note the ";" in the Error field after "restrictions". I don't know why the mail server sends semicolons, maybe to annoy me :P
After a lot of research I tried the following with awk:
awk 'BEGIN{FS=OFS=";"} {for (i=5;i<=NF;i++) gsub (";",",",$i)} 1' myfile.csv
This command runs without errors, but it seems to do nothing to my file: the ";" in the Error field remains. What am I missing here?

Replacing the fifth and later ; with ,
$ awk -F\; '{for (i=1;i<=NF;i++) printf "%s%s",$i,(i==NF?ORS:(i<=4?";":","))}' myfile.csv
Sender;Recipient;Operation;Answer;Error,Servername
bla#bla.com;rockit#sohard.com;RCPT TO;450;+4.2.0+<rockit#sohard.com>:+Recipient+address+rejected:+Policy+restrictions,+try+later,M0641
How it works:
-F\;
This sets the field separator for input to ;.
for (i=1;i<=NF;i++) printf "%s%s",$i,(i==NF?ORS:(i<=4?";":","))
This loops over every field and prints the field followed by (a) ORS if we are on the last field, or (b) ; if we are on one of the first four fields, or (c) , if we are on field 5 or later.
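As a side note on why the original gsub attempt appeared to do nothing: once awk has split a record on ;, no field can contain a ; any more, so the stray separator simply shows up as an extra field. A minimal sketch to check this, using the sample record from the question (the variable names line, nf and hits are only for illustration):

```shell
# Sample record from the question.
line='bla#bla.com;rockit#sohard.com;RCPT TO;450;+4.2.0+<rockit#sohard.com>:+Recipient+address+rejected:+Policy+restrictions;+try+later;M0641'

# Count the fields awk sees when it splits on ";".
nf=$(printf '%s\n' "$line" | awk -F';' '{ print NF }')

# Count the fields that still contain a ";" after splitting.
hits=$(printf '%s\n' "$line" | awk -F';' '{ n = 0; for (i = 1; i <= NF; i++) if (index($i, ";")) n++; print n }')

echo "fields: $nf, fields containing a semicolon: $hits"
```

For this record awk should report 7 fields rather than the 6 in the header, and no field containing a semicolon, which is why gsub on $5 and later has nothing to replace.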
Replacing all ; with ,
Try:
$ awk -F\; '{$1=$1} 1' OFS=, myfile.csv
Sender,Recipient,Operation,Answer,Error,Servername
bla#bla.com,rockit#sohard.com,RCPT TO,450,+4.2.0+<rockit#sohard.com>:+Recipient+address+rejected:+Policy+restrictions,+try+later,M0641
How it works:
-F\;
This sets the field separator on input to a semicolon.
$1=$1
This causes awk to think that the line has been changed, so awk rebuilds the output line using the new field separator.
1
This tells awk to print the line.
OFS=,
This sets the field separator on output to a comma.
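The effect of $1=$1 can be seen by running the same command with and without it. This is a small sketch on a made-up three-field line; the variable names are only for illustration:

```shell
# Without touching a field, awk prints the record exactly as it was read.
unchanged=$(printf 'a;b;c\n' | awk -F';' -v OFS=',' '1')

# Assigning to a field forces awk to rebuild the record with OFS.
rebuilt=$(printf 'a;b;c\n' | awk -F';' -v OFS=',' '{ $1 = $1 } 1')

printf '%s\n%s\n' "$unchanged" "$rebuilt"
```

The first line printed should still use semicolons, the second commas.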
Alternative #1
$ awk '{gsub(/;/, ",")} 1' myfile.csv
Sender,Recipient,Operation,Answer,Error,Servername
bla#bla.com,rockit#sohard.com,RCPT TO,450,+4.2.0+<rockit#sohard.com>:+Recipient+address+rejected:+Policy+restrictions,+try+later,M0641
Alternative #2
$ sed 's/;/,/g' myfile.csv
Sender,Recipient,Operation,Answer,Error,Servername
bla#bla.com,rockit#sohard.com,RCPT TO,450,+4.2.0+<rockit#sohard.com>:+Recipient+address+rejected:+Policy+restrictions,+try+later,M0641

I think your problem is the unquoted delimiters inside the logical 4th field of an input that is five fields wide. Although this script is repetitive, it should be easier to understand:
$ awk '{n=split($0,a,";");
for(i=1; i<4; i++) printf "%s;", a[i];
for(i=4; i<n-1; i++) printf "%s,", a[i];
printf "%s;%s\n", a[n-1], a[n]}' file
A better way to write the same, based on @Ed Morton's comments:
$ awk -F';' '{for(i=1; i<NF-1; i++) printf "%s"(i<4?FS:","), $i;
print $(NF-1) FS $NF}' file
For the input
1;2;3;4a;4b;4c;5
1;2;3;4;5
it generates
1;2;3;4a,4b,4c;5
1;2;3;4;5

If the offending semi-colons only appear in your 5th field then you can do this using GNU awk for the 3rd arg to match():
$ awk 'match($0,/(([^;]+;){4})(.*)(;[^;]+$)/,a){gsub(/;/,",",a[3]); print a[1] a[3] a[4]}' file
bla#bla.com;rockit#sohard.com;RCPT TO;450;+4.2.0+<rockit#sohard.com>:+Recipient+address+rejected:+Policy+restrictions,+try+later;M0641

If the fifth ; should be removed, append $6 to $5 and shift the following fields accordingly. This could be done with a for loop (there are examples on SO), but since the fault is so near the end, we'll just do it in a simpler way:
$ awk 'BEGIN {FS=OFS=";"} NR==1 {nf=NF} NF==(nf+1) {$5=$5 "," $6; $6=$7; NF=nf} 1' file
Explained:
BEGIN {FS=OFS=";"} # set separator
NR==1 {nf=NF} # get field count from the first record (6)
NF==(nf+1) { # if record is one field longer:
$5=$5 "," $6 # append $6 to $5, comma-separated
$6=$7 # copy $7 (the last field, NF) into $6 (position nf)
NF=nf # reset NF
} 1 # output
Testing: Running the program and sending the output to cut -d\; -f 5 outputs:
Error
+4.2.0+<rockit#sohard.com>:+Recipient+address+rejected:+Policy+restrictions,+try+later
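The for-loop variant mentioned above could look roughly like the following sketch. It assumes that the stray semicolons occur in one known field only and that the expected field count is known; the function name fix_field and its two parameters are hypothetical, not taken from any of the answers:

```shell
# fix_field BAD WANT: read ;-separated records on stdin.
#   BAD  = number of the field that may contain stray semicolons (assumed: 5)
#   WANT = number of fields a well-formed record has (assumed: 6)
fix_field() {
  awk -F';' -v bad="$1" -v want="$2" '{
    extra = NF - want            # number of stray separators in this record
    out = ""
    for (i = 1; i <= NF; i++) {
      out = out $i
      if (i < NF)                # separators inside the bad field become commas
        out = out ((i >= bad && i < bad + extra) ? "," : ";")
    }
    print out
  }'
}

printf '%s\n' 'a;b;c;d;e1;e2;f' 'a;b;c;d;e;f' | fix_field 5 6
```

Unlike the NF==(nf+1) version, this also handles records with more than one stray semicolon in the same field, and it leaves well-formed records untouched.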

Related

AWK: search substring in first file against second

I have the following files:
data.txt
Estring|0006|this_is_some_random_text|more_text
Fstring|0010|random_combination_of_characters
Fstring|0028|again_here
allids.txt (here the columns are separated by semicolon; the real input is tab-delimited)
Estring|0006;MAR0593
Fstring|0002;MAR0592
Fstring|0028;MAR1195
Please note: in data.txt, the important part is the first two "columns" (name|number).
Now I want to use awk to search the first part (name|number) of data.txt in allids.txt and output the second column (starting with MAR)
so my expected output would be (again tab-delimited):
Estring|0006|this_is_some_random_text|more_text;MAR0593
Fstring|0010|random_combination_of_characters
Fstring|0028|again_here;MAR1195
I do not know how to match on that conserved first part within awk; the rest should then be:
awk 'BEGIN{FS=OFS="\t"} FNR == NR { a[$1] = $1; next } $1 in a { print a[$0], [$1] }' data.txt allids.txt
I would use a set of field delimiters, like this:
awk -F'[|\t;]' 'NR==FNR{a[$1"|"$2]=$0; next}
$1"|"$2 in a {print a[$1"|"$2]"\t"$NF}' data.txt allids.txt
With your real tab-delimited data you can remove the ; from the set of delimiters. It is only in here to reproduce the example in the question.
Here is another awk that uses a different field separator for both files:
awk -F ';' 'NR==FNR{a[$1]=FS $2; next} {k=$1 FS $2}
k in a{$0=$0 a[k]} 1' allids.txt FS='|' data.txt
Estring|0006|this_is_some_random_text|more_text;MAR0593
Fstring|0010|random_combination_of_characters
Fstring|0028|again_here;MAR1195
This command uses ; as FS for allids.txt and uses | as FS for data.txt.
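Since the real allids.txt is said to be tab-delimited, the same lookup can be sketched for that case as follows. The temporary directory and the variable names are assumptions made for the demonstration; the key is rebuilt with split() instead of switching FS between files:

```shell
tmp=$(mktemp -d)

# Recreate the sample files; allids.txt uses a tab between its two columns.
printf '%s\n' 'Estring|0006|this_is_some_random_text|more_text' \
              'Fstring|0010|random_combination_of_characters' \
              'Fstring|0028|again_here' > "$tmp/data.txt"
printf 'Estring|0006\tMAR0593\nFstring|0002\tMAR0592\nFstring|0028\tMAR1195\n' > "$tmp/allids.txt"

result=$(awk -F'\t' '
  NR == FNR { id[$1] = $2; next }      # first file: map "name|number" to its ID
  {
    split($0, part, "|")               # second file: rebuild the key from the first two parts
    key = part[1] "|" part[2]
    print $0 (key in id ? "\t" id[key] : "")   # append the ID only when there is a match
  }' "$tmp/allids.txt" "$tmp/data.txt")

printf '%s\n' "$result"
rm -rf "$tmp"
```

Lines without a match are printed unchanged, as in the expected output of the question.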

Shell script to add values to a specific column

I have semicolon-separated columns, and I would like to add some characters to a specific column.
aaa;111;bbb
ccc;222;ddd
eee;333;fff
I want to add '#' to the second column, so the output should be:
aaa;#111;bbb
ccc;#222;ddd
eee;#333;fff
I tried
awk -F';' -OFS=';' '{ $2 = "#" $2}1' file
It adds the character but replaces all semicolons with spaces.
You could use sed to do your job:
# replaces just the first occurrence of ';', note the absence of `g` that
# would have made it a global replacement
sed 's/;/;#/' file > file.out
or, to do it in place:
sed -i 's/;/;#/' file
Or, use awk:
awk -F';' '{$2 = "#"$2}1' OFS=';' file
All the above commands result in the same output for your example file:
aaa;#111;bbb
ccc;#222;ddd
eee;#333;fff
@atb: Try:
1st:
awk -F";" '{print $1 FS "#" $2 FS $3}' Input_file
This works only when Input_file has exactly 3 fields.
2nd:
awk -F";" -vfield=2 '{$field="#"$field} 1' OFS=";" Input_file
Here you can pass any field number in the field variable.
The field separator is set to ";", and the variable field holds the number of the field to change. "#" is prepended to that field's value. The trailing 1 is a condition that is always true and has no action, so awk performs the default action and prints the current line.
You just misunderstood how to set variables. Change -OFS to -v OFS:
awk -F';' -v OFS=';' '{ $2 = "#" $2 }1' file
but in reality you should set them both to the same value at one time:
awk 'BEGIN{FS=OFS=";"} { $2 = "#" $2 }1' file
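The difference between setting OFS and leaving it at its default can be checked with a short sketch on one line of the sample input (the variable names are only for illustration):

```shell
input='aaa;111;bbb'

# OFS set with -v: the rebuilt record keeps its semicolons.
with_v=$(printf '%s\n' "$input" | awk -F';' -v OFS=';' '{ $2 = "#" $2 } 1')

# FS and OFS set together in BEGIN: same result.
with_begin=$(printf '%s\n' "$input" | awk 'BEGIN { FS = OFS = ";" } { $2 = "#" $2 } 1')

# OFS left at its default (a single space): the symptom described in the question.
default_ofs=$(printf '%s\n' "$input" | awk -F';' '{ $2 = "#" $2 } 1')

printf '%s\n' "$with_v" "$with_begin" "$default_ofs"
```

The last line shows spaces instead of semicolons because assigning to $2 makes awk rebuild the record with whatever OFS currently holds.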

awk OFS not producing expected value

I have a file
[root@nmk ~]# cat file
abc>
sssd>
were>
I run both these variations of the awk commands
[root@nmk ~]# cat file | awk -F\> ' { print $1}' OFS=','
abc
sssd
were
[root@nmk ~]# cat file | awk -F\> ' BEGIN { OFS=","} { print $1}'
abc
sssd
were
[root@nmk ~]#
But my expected output is
abc,sssd,were
What's missing in my commands?
You're just a bit confused about the meaning/use of FS, OFS, RS and ORS. Take another look at the man page. I think this is what you were trying to do:
$ awk -F'>' -v ORS=',' '{print $1}' file
abc,sssd,were,$
but this is probably closer to the output you really want:
$ awk -F'>' '{rec = rec (NR>1?",":"") $1} END{print rec}' file
abc,sssd,were
or if you don't want to buffer the whole output as a string:
$ awk -F'>' '{printf "%s%s", (NR>1?",":""), $1} END{print ""}' file
abc,sssd,were
awk -F\> -v ORS="" 'NR>1{print ","$1;next}{print $1}' file
to print newline at the end:
awk -F\> -v ORS="" 'NR>1{print ","$1;next}{print $1} END{print "\n"}' file
output:
abc,sssd,were
Each line of input in awk is a record, so what you want to set is the Output Record Separator, ORS. The OFS variable holds the Output Field Separator, which is used to separate different parts of each line.
Since you are setting the input field separator, FS, to >, and OFS to ,, an easy way to see how these work is to add something on each line of your file after the >:
awk 'BEGIN { FS=">"; OFS=","} {$1=$1} 1' <<<$'abc>def\nsssd>dsss\nwere>wolf'
abc,def
sssd,dsss
were,wolf
So you want to set the ORS. The default record separator is newline, so whatever you set ORS to effectively replaces the newlines in the input. But that means that if the last line of input ends with a newline, which is usually the case, that last line will also get a copy of your new ORS:
awk 'BEGIN { FS=">"; ORS=","} 1' <<<$'abc>def\nsssd>dsss\nwere>wolf'
abc>def,sssd>dsss,were>wolf,
It also won't get a newline at all, because that newline was interpreted as an input record separator and turned into the output record separator - it became the final comma.
So you have to be a little more explicit about what you're trying to do:
awk 'BEGIN { FS=">" } # split input on >
(NR>1) { printf "," } # if not the first line, print a ,
{ printf "%s", $1 } # print the first field (everything up to the first >)
END { printf "\n" } # add a newline at the end
' <<<$'abc>\nsssd>\nwere>'
Which outputs this:
abc,sssd,were
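To see the two output separators side by side, here is a small sketch on made-up two-field input. OFS goes between the arguments of print, ORS goes after each printed record:

```shell
# OFS separates $1 and $2; ORS terminates every record, including the last one.
out=$(printf 'abc>def\nsssd>dsss\n' | awk -F'>' -v OFS=',' -v ORS=';' '{ print $1, $2 }')
printf '%s\n' "$out"
```

The result should be a single line in which commas separate the fields and a semicolon follows every record, the last one included.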
Through sed,
$ sed ':a;N;$!ba;s/>\n/,/g;s/>$//' file
abc,sssd,were
Through Perl,
$ perl -00pe 's/>\n(?=.)/,/g;s/>$//' file
abc,sssd,were
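Another option that is not part of the answers above, assuming the standard cut and paste utilities are available: let cut take the first >-delimited field and let paste join the lines with commas. The input is inlined here with printf only to keep the sketch self-contained:

```shell
# cut keeps everything before the first ">", paste -s joins all lines with ",".
joined=$(printf 'abc>\nsssd>\nwere>\n' | cut -d'>' -f1 | paste -sd, -)
printf '%s\n' "$joined"
```

paste adds the separator only between lines, so there is no trailing comma to clean up.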

How to print a range of columns in a CSV in AWK? [duplicate]

With awk, I can print any column within a CSV, e.g., this will print the 10th column in file.csv.
awk -F, '{ print $10 }' file.csv
If I need to print columns 5-10, including the comma, I only know this way:
awk -F, '{ print $5","$6","$7","$8","$9","$10 }' file.csv
This method is not so good if I want to print many columns. Is there a simpler syntax for printing a range of columns in a CSV in awk?
The standard way to do this in awk is using a for loop:
awk -v s=5 -v e=10 'BEGIN{FS=OFS=","}{for (i=s; i<=e; ++i) printf "%s%s", $i, (i<e?OFS:ORS)}' file
However, if your delimiter is simple (as in your example), you may prefer to use cut:
cut -d, -f5-10 file
Perl deserves a mention (using -a to enable autosplit mode):
perl -F, -lane '$"=","; print "@F[4..9]"' file
You can use a loop in awk to print columns from 5 to 10:
awk -F, '{ for (i=5; i<=10; i++) print $i }' file.csv
Keep in mind that print puts each column on a new line. If you want to print them on the same line, separated by OFS, use printf and end each record with ORS:
awk -F, -v OFS=, '{ for (i=5; i<=10; i++) printf("%s%s", $i, (i<10 ? OFS : ORS)) }' file.csv
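The loop can be wrapped in a small shell function so that the range is passed as arguments. This is only a sketch; the function name print_range is hypothetical, and it additionally stops at the last field of short rows:

```shell
# print_range START END: print columns START..END (1-based, inclusive) of CSV on stdin.
print_range() {
  awk -v s="$1" -v e="$2" 'BEGIN { FS = OFS = "," }
  {
    last = (e < NF ? e : NF)               # do not run past the end of a short row
    for (i = s; i <= last; i++)
      printf "%s%s", $i, (i < last ? OFS : ORS)
  }'
}

printf 'a,b,c,d,e,f,g,h,i,j,k,l,m\n' | print_range 5 10
```

Rows with fewer than START fields produce no output at all in this version. Like the other awk answers, it does not understand quoted CSV fields that contain commas.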
With GNU awk for gensub():
$ cat file
a,b,c,d,e,f,g,h,i,j,k,l,m
$
$ awk -v s=5 -v n=6 '{ print gensub("(([^,]+,){"s-1"})(([^,]+,){"n-1"}[^,]+).*","\\3","") }' file
e,f,g,h,i,j
s is the start position and n is the number of fields to print from that point on. Or if you prefer to specify start and end:
$ awk -v s=5 -v e=10 '{ print gensub("(([^,]+,){"s-1"})(([^,]+,){"e-s"}[^,]+).*","\\3","") }' file
e,f,g,h,i,j
Note that this will only work with single-character field separators since it relies on being able to negate the FS in a character class.

creating a ":" delimited list in bash script using awk

I have following lines
380:<CHECKSUM_VALIDATION>
393:</CHECKSUM_VALIDATION>
437:<CHECKSUM_VALIDATION>
441:</CHECKSUM_VALIDATION>
I need to format it as below
CHECKSUM_VALIDATION:380:393
CHECKSUM_VALIDATION:437:441
Is it possible to achieve above output using "awk"? [I'm using bash]
Thank you!
Here you go:
awk -F '[:<>/]+' '{ n = $1; getline; print $2 ":" n ":" $1 }'
Explanation:
Set the field separator with -F to be a sequence of a mix of :<>/ characters, this way the first field will be the number, and the second will be CHECKSUM_VALIDATION
Save the first field in variable n and read the next line (which would overwrite $1)
Print the line: a combination of the number from the previous line, and the fields on the current line
Another approach without using getline:
awk -F '[:<>/]+' 'NR % 2 { n = $1 } NR % 2 == 0 { print $2 ":" n ":" $1 }'
This one uses the record counter NR to determine whether it's time to print: if NR is odd, save the first field in n, if NR is even, then print.
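Both variants above rely on opening and closing tags strictly alternating. If that cannot be guaranteed, a sketch along the same lines can remember the start line per tag name instead. The function name pair_tags is hypothetical, and it assumes that a tag is never nested inside a tag of the same name:

```shell
# pair_tags: read "line:<TAG>" / "line:</TAG>" records on stdin, print TAG:start:end.
pair_tags() {
  awk -F '[:<>/]+' '
    /<\// { print $2 ":" start[$2] ":" $1; next }   # closing tag: emit name, start, end
          { start[$2] = $1 }                        # opening tag: remember its line number
  '
}

printf '%s\n' '380:<CHECKSUM_VALIDATION>' '393:</CHECKSUM_VALIDATION>' \
              '437:<CHECKSUM_VALIDATION>' '441:</CHECKSUM_VALIDATION>' | pair_tags
```

For the input from the question this should give the same two lines as the answers above.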
You can try this sed,
sed 'N; s/\([0-9]\+\):<\(.*\)>\n\([0-9]\+\):<\(.*\)>/\2:\1:\3/' file.txt
Test:
sat:~$ sed 'N; s/\([0-9]\+\):<\(.*\)>\n\([0-9]\+\):<\(.*\)>/\2:\1:\3/' file.txt
CHECKSUM_VALIDATION:380:393
CHECKSUM_VALIDATION:437:441
Another way:
awk -F: '/<C/ {printf "CHECKSUM_VALIDATION:%d:",$1; next} {print $1}'
Here is one gnu awk
awk -F"[:\n<>]" 'NR==1{print $3,$1,$5;f=$3;next} $3{print f,$3,$7}' OFS=":" RS="</CH" file
CHECKSUM_VALIDATION:380:393
CHECKSUM_VALIDATION:437:441
Based on Jonas's post and avoiding getline, this awk should do:
awk -F '[:<>/]+' '/<C/ {f=$1;next} { print $2,f,$1}' OFS=\: file
CHECKSUM_VALIDATION:380:393
CHECKSUM_VALIDATION:437:441

Resources