Use awk to fix a CSV file with commas in unenclosed fields (macOS)

I have a CSV file which looks like:
height, comment, name
152, he was late, for example, on Tuesday, Fred
162, , Sam
I cannot parse this file because it includes a variable number of unenclosed commas in the comment field (but no other fields). I would like to fix the file using awk (which is very new to me) so that the commas in the second field become semi-colons:
height, comment, name
152, he was late; for example; on Tuesday, Fred
162, , Sam
(Enclosing the entire field in quotes will not solve my problem because my CSV parser does not understand quotes.)
So far I am looking at using NF to work out the number of unenclosed commas and then replacing them with gsub and an unpleasant regex, but I feel I should be able to leverage awk to write a more readable program, and I am not sure NF behaves this way.

Essentially just a brute-force solution, but fairly easy to understand. Invoke with
$ awk -F "," -f test.awk test.dat
The awk file:
$ cat test.awk
{
    printf "%s,", $1
    if (NF > 3) {
        for (i = 2; i < NF; i++)
            printf "%s%s", $i, (i < NF - 1 ? ";" : "")
    } else {
        printf "%s", $2
    }
    printf ",%s\n", $NF
}

$ cat file
height, comment, name
152, he was late, for example, on Tuesday, Fred
162, , Sam
$ awk -v OFS=, '{
    height = comment = name = $0
    sub(/,.*$/, "", height)
    sub(/^.*,/, "", name)
    gsub(/^[^,]+,|,[^,]+$/, "", comment)
    gsub(/,/, ";", comment)
    print height, comment, name
}' file
height, comment, name
152, he was late; for example; on Tuesday, Fred
162, , Sam
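Either approach can be sanity-checked end to end. A minimal sketch that feeds the sample lines straight in and joins fields 2..NF-1 with ";", keeping the comma after field 1 so the three-column shape survives:

```shell
# Sanity check: join fields 2..NF-1 with ";" while keeping the comma after field 1
printf '%s\n' 'height, comment, name' \
              '152, he was late, for example, on Tuesday, Fred' \
              '162, , Sam' |
awk -F',' '{
    out = $1
    for (i = 2; i < NF; i++)
        out = out (i == 2 ? "," : ";") $i
    print out "," $NF
}'
# → height, comment, name
# → 152, he was late; for example; on Tuesday, Fred
# → 162, , Sam
```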

Related

How to reorder the elements of each line in a file using sed and/or awk following a dynamic format

I currently have a file with each line containing ordered data. For example:
Peter:Connor:14:40kg
George:Head:56:60kg
I have a listing function that takes as an argument a "format" string.
That string contains abbreviations representing each possible element of the list. In this example, the abbreviations would be:
%N, %S, %A, %W
Those abbreviations can be preceded or followed by any amount of characters.
I want to print the data so that it fits the received string format, replacing each abbreviation with their corresponding element in the list. For example, I might receive:
{%A} [%W] %S %N
or
%S|%N|%A[[%W]]
And I would need to reorder the data so that it fits the demanded format. Since it's an argument to the function, I have no way of knowing what I will receive beforehand. For the first example above, the output would be:
{14} [40kg] Connor Peter
and for the 2nd example
Connor|Peter|14[[40kg]]
How can I use awk to do this?
Since the format arrives as a function argument, you can pass that string straight to awk as a variable:
$ cat tst.awk
BEGIN {
    FS = ":"
    tmp = fmt
    sub(/^[^[:alpha:]]+/, "", tmp)
    split(tmp, flds, /[^[:alpha:]]+/)
    gsub(/[[:alpha:]]+/, "%s", fmt)
    fmt = fmt ORS
}
NR==1 {
    for (i = 1; i <= NF; i++) {
        f[$i] = i
    }
    next
}
{ printf fmt, $(f[flds[1]]), $(f[flds[2]]), $(f[flds[3]]), $(f[flds[4]]) }
$ awk -v fmt='{age} [kilo] surname name' -f tst.awk file
{14} [40kg] Connor Peter
{56} [60kg] Head George
$ awk -v fmt='surname|name|age[[kilo]]' -f tst.awk file
Connor|Peter|14[[40kg]]
Head|George|56[[60kg]]
For the above to work there has to be something that names the columns. You could hard-code that in the script if you like, but I added it as a header line to your file instead:
$ cat file
name:surname:age:kilo
Peter:Connor:14:40kg
George:Head:56:60kg
awk 'BEGIN{FS=":"; OFS=" "}{print "{"$3"}","["$4"]",$2,$1}' inputFile
gives:
{14} [40kg] Connor Peter
{56} [60kg] Head George
and
awk 'BEGIN{FS=":"; OFS="|"}{print $2,$1,$3"[["$4"]]"}' inputFile
yields
Connor|Peter|14[[40kg]]
Head|George|56[[60kg]]
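Note that the weight lives in $4, not $2 (using $2 inside the bracketed term would repeat the surname). A quick standalone check:

```shell
# The weight is in $4; verify the reorder with the bracketed format
printf 'Peter:Connor:14:40kg\nGeorge:Head:56:60kg\n' |
awk 'BEGIN{FS=":"; OFS="|"} {print $2, $1, $3 "[[" $4 "]]"}'
# → Connor|Peter|14[[40kg]]
# → Head|George|56[[60kg]]
```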

AWK to display a column based on Column name and remove header and last delimiter

Id,responseId,name,test1,test2,bcid,stype
213,A_123456,abc,test,zzz,987654321,alpha
412,A_234566,xyz,test,xxx,897564322,gama
125,A_456314,ttt,qa,yyy,786950473,delta
222,A_243445,hds,test,fff,643528290,alpha
456,A_466875,sed,test,hhh,543819101,beta
I want to extract columns responseId and bcid from the above. I found an answer which comes really close:
awk -F ',' -v cols=responseID,bcid '(NR==1){n=split(cols,cs,",");for(c=1;c<=n;c++){for(i=1;i<=NF;i++)if($(i)==cs[c])ci[c]=i}}{for(i=1;i<=n;i++)printf "%s" FS,$(ci[i]);printf "\n"}' <file_name>
however, it prints a "," at the end of each line, and also the header, as shown below.
responseId,bcid,
A_123456,987654321,
A_234566,897564322,
A_456314,786950473,
A_243445,643528290,
A_466875,543819101,
How can I make it not print the header and the trailing "," after bcid?
Input
$ cat infile
Id,responseId,name,test1,test2,bcid,stype
213, A_123456, abc, test, zzz, 987654321, alpha
412, A_234566, xyz, test, xxx, 897564322, gama
125, A_456314, ttt, qa, yyy, 786950473, delta
222, A_243445, hds, test, fff, 643528290, alpha
456, A_466875, sed, test, hhh, 543819101, beta
Script
$ cat byname.awk
FNR==1{
    split(header, h, /,/)
    for (i = 1; i in h; i++) {
        for (j = 1; j <= NF; j++) {
            if (tolower(h[i]) == tolower($j)) { d[i] = j; break }
        }
    }
    next
}
{
    for (i = 1; i in h; i++)
        printf("%s%s", i > 1 ? OFS : "", i in d ? $(d[i]) : "")
    print ""
}
How to execute?
$ awk -v FS=, -v OFS=, -v header="responseID,bcid" -f byname.awk infile
A_123456, 987654321
A_234566, 897564322
A_456314, 786950473
A_243445, 643528290
A_466875, 543819101
One-liner
$ awk -v FS=, -v OFS=, -v header="responseID,bcid" 'FNR==1{split(header,h,/,/);for(i=1; i in h; i++){for(j=1; j<=NF; j++){if(tolower(h[i])==tolower($j)){ d[i]=j; break }}}next}{for(i=1; i in h; i++)printf("%s%s",i>1 ? OFS:"", i in d ?$(d[i]):"");print "";}' infile
A_123456, 987654321
A_234566, 897564322
A_456314, 786950473
A_243445, 643528290
A_466875, 543819101
try:
awk '{NR==1?FS=",":FS=", ";$0=$0} {print $2 OFS $(NF-1)}' OFS=, Input_file
This checks whether the line is the 1st line: if so it sets the delimiter to ",", and for the other lines it sets the field separator to ", ". The "$0=$0" forces awk to re-split the record with the new separator; the script then prints the 2nd field and the 2nd-to-last field, with OFS (the output field separator) set to ",".
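A hedged alternative for the same task: since the data rows use ", " and the header uses ",", a regex field separator that absorbs optional spaces handles both uniformly, and NR > 1 drops the header (column positions 2 and 6 are hard-coded here):

```shell
# Regex FS ', *' absorbs the optional space; NR > 1 skips the header line.
printf '%s\n' 'Id,responseId,name,test1,test2,bcid,stype' \
              '213, A_123456, abc, test, zzz, 987654321, alpha' |
awk -F', *' -v OFS=',' 'NR > 1 { print $2, $6 }'
# → A_123456,987654321
```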

How to convert date with awk

My file temp.txt
ID53,20150918,2015-09-19,,0,CENTER<br>
ID54,20150911,2015-09-14,,0,CENTER<br>
ID55,20150911,2015-09-14,,0,CENTER
I need to convert the 2nd field (yyyymmdd) to seconds since the epoch.
I tried this, but only the first line is converted:
awk -F"," '{ ("date -j -f ""%Y%m%d"" ""20150918"" ""+%s""") | getline $2; print }' OFS="," temp.txt
and tried to like this
awk -F"," '{system("date -j -f ""%Y%m%d"" "$2" ""+%s""") | getline $2; print }' temp.txt
the output is:
1442619474
sh: 0: command not found
ID53,20150918,2015-09-19,,0,CENTER
1442014674
ID54,20150911,2015-09-14,,0,CENTER
1442014674
ID55,20150911,2015-09-14,,0,CENTER
Using gsub also did not work:
awk -F"," '{gsub($2,"system("date -j -f ""%Y%m%d"" "$2" ""+%s""")",$2); print}' OFS="," temp.txt
awk: syntax error at source line 1
context is
{gsub($2,"system("date -j -f ""%Y%m%d"" "$2" >>> ""+% <<< s""")",$2); print}
awk: illegal statement at source line 1
extra )
I need the output to look like this. How can I do it?
ID53,1442619376,2015-09-19,,0,CENTER
ID54,1442014576,2015-09-14,,0,CENTER
ID55,1442014576,2015-09-14,,0,CENTER
This GNU awk script should do it. If gawk is not yet installed on your Mac, I suggest installing MacPorts and then GNU awk. You can also install a decent version of bash, date and other important utilities, for which the defaults are really disappointing on OS X.
BEGIN { FS = ","; OFS = FS }
{
    y = substr($2, 1, 4)
    m = substr($2, 5, 2)
    d = substr($2, 7, 2)
    $2 = mktime(y " " m " " d " 00 00 00")
    print
}
Put it in a file (e.g. txt2ts.awk) and process your file with:
$ awk -f txt2ts.awk data.txt
ID53,1442527200,2015-09-19,,0,CENTER<br>
ID54,1441922400,2015-09-14,,0,CENTER<br>
ID55,1441922400,2015-09-14,,0,CENTER
Note that we do not get the same timestamps as in your expected output. I let you work out where the difference comes from; it is another problem.
Explanations: substr(s, m, n) returns the n-character substring of s that starts at position m (positions start at 1). mktime("YYYY MM DD HH MM SS") converts the date string into a timestamp (seconds since the epoch). FS and OFS are the input and output field separators, respectively. The commands between the curly braces of the BEGIN pattern are executed once at the beginning, while the others are executed on each line of the file.
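A quick way to convince yourself of the substr() indexing (positions start at 1, not 0):

```shell
# substr() is 1-indexed: split yyyymmdd into year, month, day
echo "20150918" | awk '{ print substr($1,1,4), substr($1,5,2), substr($1,7,2) }'
# → 2015 09 18
```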
You could use substr:
printf "%s-%s-%s", substr($6,1,4), substr($6,5,2), substr($6,7,2)
Assuming that the 6th field was 20150914, this would produce 2015-09-14. (Note that substr positions start at 1; substr($6,0,4) would return only three characters.)

Awk splitting a line by spaces where there are spaces in each field

I've got an R summary table like so:
employee salary startdate
John Doe :1 Min. :21000 Min. :2007-03-14
Jolie Hope:1 1st Qu.:22200 1st Qu.:2007-09-18
Peter Gynn:1 Median :23400 Median :2008-03-25
Mean :23733 Mean :2008-10-02
3rd Qu.:25100 3rd Qu.:2009-07-13
Max. :26800 Max. :2010-11-01
and I need to produce an output csv file like so:
employee,,salary,,startdate,,
John Doe,1,Min.,21000,Min.,2007-03-14
Jolie Hope,1,1st Qu.,22200,1st Qu.,2007-09-18
Peter Gynn,1,Median,23400,Median,2008-03-25
,,Mean,23733,Mean,2008-10-02
,,3rd Qu.,25100,3rd Qu.,2009-07-13
,,Max.,26800,Max.,2010-11-01
so that in excel it looks something like this:
However, it doesn't suffice to split the fields on one or more whitespace characters:
awk -F "[ ]+" '{ print $3 }'
It works for the header, but not for the remaining lines:
salary
Doe
Hope:1
Gynn:1
:23733
Qu.:25100
:26800
Is this problem solvable using awk (and maybe sed)?
sed '1 {
s/^[[:space:]]*\([^[:space:]]\{1,\}\)[[:space:]]\{1,\}\([^[:space:]]\{1,\}\)[[:space:]]\{1,\}[[:space:]]\{1,\}\([^[:space:]]\{1,\}\)/\1,,\2,,\3,/
b
}
s/[[:space:]]\{1,\}:/:/g
/^[[:space:]]*\([^:]\{1,\}\):\([^[:space:]]*\)[[:space:]]*\([^:]\{1,\}\):\([^[:space:]]*\)[[:space:]]*\([^:]\{1,\}\):\(.[^[:space:]]*\)/ {
s//\1,\2,\3,\4,\5,\6/
b
}
/^[[:space:]]*\([^:]\{1,\}\):\([^[:space:]]*\)[[:space:]]*\([^:]\{1,\}\):\([^[:space:]]*\)/ {
s//,,\1,\2,\3,\4/
b
}
' YourFile
The sed version is just for fun, in case you need it and can adapt this ArachnoRegEx a bit.
awk is a lot more interesting in this case, mainly because any later adaptation is easier, but if you only have access to sed...
This uses GNU awk for FIELDWIDTHS, etc. and relies on the first line of input after the header always having all fields populated. It includes the positions that are just :s as output fields, I expect you can figure out how to skip those if you do want to use this solution:
$ cat tst.awk
BEGIN { OFS = "," }
NR==1 {
    for (i = 1; i <= NF; i++) {
        printf "%s%s", $i, (i < NF ? OFS OFS OFS : ORS)
    }
    next
}
NR==2 {
    tail = $0
    while ( match(tail, /([^:]+):(\S+(\s+|$))/, a) ) {
        FIELDWIDTHS = FIELDWIDTHS length(a[1]) " 1 " length(a[2]) " "
        tail = substr(tail, RSTART + RLENGTH)
    }
    $0 = $0
}
{
    for (i = 1; i <= NF; i++) {
        gsub(/^\s+|\s+$/, "", $i)
    }
    print
}
$ awk -f tst.awk file
employee,,,salary,,,startdate
John Doe,:,1,Min.,:,21000,Min.,:,2007-03-14
Jolie Hope,:,1,1st Qu.,:,22200,1st Qu.,:,2007-09-18
Peter Gynn,:,1,Median,:,23400,Median,:,2008-03-25
,,,Mean,:,23733,Mean,:,2008-10-02
,,,3rd Qu.,:,25100,3rd Qu.,:,2009-07-13
,,,Max.,:,26800,Max.,:,2010-11-01

Get next field/column width awk

I have a dataset of the following structure:
1234 4334 8677 3753 3453 4554
4564 4834 3244 3656 2644 0474
...
I would like to:
1) search for a specific value, eg 4834
2) return the following field (3244)
I'm quite new to awk, but I realize it is a simple operation. I have created a bash script that asks the user for input and attempts to return the following field.
But I can't seem to get around scoping in awk. How do I pass the input value to awk?
#!/bin/bash
read input
cat data.txt | awk '
for (i=1;i<=NF;i++) {
if ($i==input) {
print $(i+1)
}
}
}'
Cheers and thanks in advance!
UPDATE Sept. 8th 2011
Thanks for all the replies.
1) It will never happen that the last number of a row is picked - still I appreciate you pointing this out.
2) I have a more general problem with awk. Often I want to "do something" with the result found. In this case I would like to output it to xclip - an application which read from standard input and copies it to the clipboard. Eg:
$ echo Hi | xclip
Unfortunately, echo doesn't exist for awk, so I need to return the value and echo it. How would you go about this?
#!/bin/bash
read input
cat data.txt | awk '{
for (i=1;i<=NF;i++) {
if ($i=='$input') {
print $(i+1)
}
}
}'
Don't over think it!
You can create an array in awk with the split command:
split($0, ary)
This will split the line $0 into an array called ary. Now, you can use array syntax to find the particular fields:
awk '{
    size = split($0, ary)
    for (i = 1; i <= size; i++) {
        print ary[i]
    }
    print "---"
}' data.txt
Now, when you find ary[x] as the field, you can print out ary[x+1].
In your example:
awk -v input="$input" '{
    size = split($0, ary)
    for (i = 1; i <= size; i++) {
        if (ary[i] == input) {
            print ary[i+1]
        }
    }
}' data.txt
There is a way of doing this without creating an array, but it's simply much easier to work with arrays in situations like this.
By the way, you can eliminate the cat command by putting the file name after the awk statement and save creating an extraneous process. Everyone knows creating an extraneous process kills a kitten. Please don't kill a kitten.
You pass a shell variable to awk using the -v option. It's cleaner/nicer than having to splice quotes into the script.
awk -v input="$input" '{
    for (i = 1; i <= NF; i++) {
        if ($i == input) {
            print "Next value: " $(i+1)
        }
    }
}' data.txt
And lose the useless cat.
Here is my solution: delete everything up to (and including) the search field; the field you want to print is then field #1 ($1):
awk '/4834/ { sub(/^.* 4834 /, ""); print $1 }' data.txt
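On the xclip part of the update: awk's print goes to standard output, so the whole pipeline can simply be piped into xclip (or any other stdin-reading consumer). A sketch, with cat standing in for xclip so it runs anywhere:

```shell
# awk writes matches to stdout, so pipe the result onward, e.g. `... | xclip`.
# `cat` stands in for xclip here so the sketch runs anywhere.
printf '1234 4334 8677\n4564 4834 3244\n' |
awk -v input=4834 '{ for (i = 1; i < NF; i++) if ($i == input) print $(i+1) }' |
cat
# → 3244
```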
