Parsing a CSV file using gawk - bash

How do you parse a CSV file using gawk? Simply setting FS="," is not enough, as a quoted field with a comma inside will be treated as multiple fields.
Example using FS="," which does not work:
file contents:
one,two,"three, four",five
"six, seven",eight,"nine"
gawk script:
BEGIN { FS="," }
{
for (i=1; i<=NF; i++) printf "field #%d: %s\n", i, $(i)
printf "---------------------------\n"
}
bad output:
field #1: one
field #2: two
field #3: "three
field #4: four"
field #5: five
---------------------------
field #1: "six
field #2: seven"
field #3: eight
field #4: "nine"
---------------------------
desired output:
field #1: one
field #2: two
field #3: "three, four"
field #4: five
---------------------------
field #1: "six, seven"
field #2: eight
field #3: "nine"
---------------------------

The gawk version 4 manual says to use FPAT = "([^,]*)|(\"[^\"]+\")"
When FPAT is defined, it disables FS and specifies fields by content instead of by separator.
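For example, a minimal sketch applying that FPAT to the sample file above (assuming gawk >= 4.0):
gawk '
BEGIN { FPAT = "([^,]*)|(\"[^\"]+\")" }
{
for (i = 1; i <= NF; i++) printf "field #%d: %s\n", i, $i
print "---------------------------"
}' file.csv
This prints the desired output shown above, though note this particular FPAT mishandles fields that combine doubled quotes ("") with embedded commas; an improved pattern appears further down this page.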

The short answer is "I wouldn't use gawk to parse CSV if the CSV contains awkward data", where 'awkward' means things like commas in the CSV field data.
The next question is "What other processing are you going to be doing", since that will influence what alternatives you use.
I'd probably use Perl and the Text::CSV or Text::CSV_XS modules to read and process the data. Remember, Perl was originally written in part as an awk and sed killer - hence the a2p and s2p programs still distributed with Perl which convert awk and sed scripts (respectively) into Perl.

You can use a simple wrapper program called csvquote to sanitize the input and restore it after awk is done processing it. Pipe your data through it at the start and end, and everything should work out ok:
before:
gawk -f myprogram.awk input.csv
after:
csvquote input.csv | gawk -f myprogram.awk | csvquote -u
See https://github.com/dbro/csvquote for code and documentation.

If permissible, I would use the Python csv module, paying special attention to the dialect used and formatting parameters required, to parse the CSV file you have.
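A minimal sketch of that approach (using the default dialect; note that csv.reader strips the surrounding quotes and undoubles "" itself, so its output differs slightly from the quoted desired output above):
python3 -c '
import csv, sys
# csv.reader handles quoted fields, embedded commas and doubled quotes
for row in csv.reader(sys.stdin):
    for i, field in enumerate(row, 1):
        print("field #%d: %s" % (i, field))
    print("---------------------------")
' < file.csv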

csv2delim.awk
# csv2delim.awk converts comma delimited files with optional quotes to delim separated file
# delim can be any character, defaults to tab
# assumes no repl characters in text, any delim in line converts to repl
# repl can be any character, defaults to ~
# changes two consecutive quotes within quotes to '
# usage: gawk -f csv2delim.awk [-v delim=d] [-v repl=r] input-file > output-file
# -v delim delimiter, defaults to tab
# -v repl replacement char, defaults to ~
# e.g. gawk -v delim=; -v repl=` -f csv2delim.awk test.csv > test.txt
# abe 2-28-7
# abe 8-8-8 1.0 fixed empty fields, added replacement option
# abe 8-27-8 1.1 used split
# abe 8-27-8 1.2 inline rpl and "" = '
# abe 8-27-8 1.3 revert to 1.0 as it is much faster, split most of the time
# abe 8-29-8 1.4 better message if delim present
BEGIN {
if (delim == "") delim = "\t"
if (repl == "") repl = "~"
print "csv2delim.awk v.m 1.4 run at " strftime() > "/dev/stderr" ###########################################
}
{
#if ($0 ~ repl) {
# print "Replacement character " repl " is on line " FNR ":" lineIn ";" > "/dev/stderr"
#}
if ($0 ~ delim) {
print "Temp delimiter character " delim " is on line " FNR ":" lineIn ";" > "/dev/stderr"
print " replaced by " repl > "/dev/stderr"
}
gsub(delim, repl)
$0 = gensub(/([^,])\"\"/, "\\1'", "g")
# $0 = gensub(/\"\"([^,])/, "'\\1", "g") # not needed above covers all cases
out = ""
#for (i = 1; i <= length($0); i++)
n = length($0)
for (i = 1; i <= n; i++)
if ((ch = substr($0, i, 1)) == "\"")
inString = (inString) ? 0 : 1 # toggle inString
else
out = out ((ch == "," && ! inString) ? delim : ch)
print out
}
END {
print NR " records processed from " FILENAME " at " strftime() > "/dev/stderr"
}
test.csv
"first","second","third"
"fir,st","second","third"
"first","sec""ond","third"
" first ",sec ond,"third"
"first" , "second","th ird"
"first","sec;ond","third"
"first","second","th;ird"
1,2,3
,2,3
1,2,
,2,
1,,2
1,"2",3
"1",2,"3"
"1",,"3"
1,"",3
"","",""
"","""aiyn","oh"""
"""","""",""""
11,2~2,3
test.bat
rem test csv2delim
rem default is: -v delim={tab} -v repl=~
gawk -f csv2delim.awk test.csv > test.txt
gawk -v delim=; -f csv2delim.awk test.csv > testd.txt
gawk -v delim=; -v repl=` -f csv2delim.awk test.csv > testdr.txt
gawk -v repl=` -f csv2delim.awk test.csv > testr.txt

I am not exactly sure whether this is the right way to do things. I would rather work on a csv file in which either all values are quoted or none are. By the way, awk allows a regex as the field separator; check whether that is useful here.

{
ColumnCount = 0
$0 = $0 "," # Assures all fields end with comma
while($0) # Get fields by pattern, not by delimiter
{
match($0, / *"[^"]*" *,|[^,]*,/) # Find a field with its delimiter suffix
Field = substr($0, RSTART, RLENGTH) # Get the located field with its delimiter
gsub(/^ *"?|"? *,$/, "", Field) # Strip delimiter text: comma/space/quote
Column[++ColumnCount] = Field # Save field without delimiter in an array
$0 = substr($0, RLENGTH + 1) # Remove processed text from the raw data
}
}
Patterns that follow this one can access the fields in Column[]. ColumnCount indicates the number of elements in Column[] that were found. If not all rows contain the same number of columns, Column[] contains extra data after Column[ColumnCount] when processing the shorter rows.
This implementation is slow, but it appears to emulate the FPAT/patsplit() feature found in gawk >= 4.0.0 mentioned in a previous answer.
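For instance, a follow-up rule printing the parsed fields might look like this (a sketch, assuming the block above ran first):
{
for (i = 1; i <= ColumnCount; i++)
printf "field #%d: %s\n", i, Column[i]
print "---------------------------"
}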
Reference

Here's what I came up with. Any comments and/or better solutions would be appreciated.
BEGIN { FS="," }
{
n = 0 # reset the field counter; without this, fields accumulate across records
for (i=1; i<=NF; i++) {
f[++n] = $i
if (substr(f[n],1,1)=="\"") {
while (substr(f[n], length(f[n]))!="\"" || substr(f[n], length(f[n])-1, 1)=="\\") {
f[n] = sprintf("%s,%s", f[n], $(++i))
}
}
}
for (i=1; i<=n; i++) printf "field #%d: %s\n", i, f[i]
print "----------------------------------\n"
}
The basic idea is that I loop through the fields, and any field which starts with a quote but does not end with a quote gets the next field appended to it.

Perl has the Text::CSV_XS module which is purpose-built to handle the quoted-comma weirdness.
Alternately try the Text::CSV module.
perl -MText::CSV_XS -ne 'BEGIN{$csv=Text::CSV_XS->new()} if($csv->parse($_)){@f=$csv->fields();for $n (0..$#f) {print "field #$n: $f[$n]\n"};print "---\n"}' file.csv
Produces this output:
field #0: one
field #1: two
field #2: three, four
field #3: five
---
field #0: six, seven
field #1: eight
field #2: nine
---
Here's a human-readable version.
Save it as parsecsv, chmod +x, and run it as "parsecsv file.csv"
#!/usr/bin/perl
use warnings;
use strict;
use Text::CSV_XS;
my $csv = Text::CSV_XS->new();
open(my $data, '<', $ARGV[0]) or die "Could not open '$ARGV[0]' $!\n";
while (my $line = <$data>) {
if ($csv->parse($line)) {
my @f = $csv->fields();
for my $n (0..$#f) {
print "field #$n: $f[$n]\n";
}
print "---\n";
}
}
You may need to point to a different version of perl on your machine, since the Text::CSV_XS module may not be installed on your default version of perl.
Can't locate Text/CSV_XS.pm in @INC (@INC contains: /home/gnu/lib/perl5/5.6.1/i686-linux /home/gnu/lib/perl5/5.6.1 /home/gnu/lib/perl5/site_perl/5.6.1/i686-linux /home/gnu/lib/perl5/site_perl/5.6.1 /home/gnu/lib/perl5/site_perl .).
BEGIN failed--compilation aborted.
If none of your versions of Perl have Text::CSV_XS installed, you'll need to:
sudo apt-get install cpanminus
sudo cpanm Text::CSV_XS

Printing last column of csv file [duplicate]

The intent of this question is to provide a canonical answer.
Given a CSV as might be generated by Excel or other tools with embedded newlines and/or double quotes and/or commas in fields, and empty fields like:
$ cat file.csv
"rec1, fld1",,"rec1"",""fld3.1
"",
fld3.2","rec1
fld4"
"rec2, fld1.1
fld1.2","rec2 fld2.1""fld2.2""fld2.3","",rec2 fld4
"""""","""rec3,fld2""",
What's the most robust and efficient way, using awk, to identify the separate records and fields:
Record 1:
$1=<rec1, fld1>
$2=<>
$3=<rec1","fld3.1
",
fld3.2>
$4=<rec1
fld4>
----
Record 2:
$1=<rec2, fld1.1
fld1.2>
$2=<rec2 fld2.1"fld2.2"fld2.3>
$3=<>
$4=<rec2 fld4>
----
Record 3:
$1=<"">
$2=<"rec3,fld2">
$3=<>
----
so it can be used as those records and fields internally by the rest of the awk script.
A valid CSV would be one that conforms to RFC 4180 or can be generated by MS-Excel.
The solution must tolerate the end of record just being LF (\n) as is typical for UNIX files rather than CRLF (\r\n) as that standard requires and Excel or other Windows tools would generate. It will also tolerate unquoted fields mixed with quoted fields. It will specifically not need to tolerate escaping "s with a preceding backslash (i.e. \" instead of "") as some other CSV formats allow - if you have that then adding a gsub(/\\"/,"\"\"") up front would handle it and trying to handle both escaping mechanisms automatically in one script would make the script unnecessarily fragile and complicated.
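For example, that pre-pass could be added as a rule ahead of the main processing (a sketch, only needed if your data really uses \" escaping):
{ gsub(/\\"/, "\"\"") }   # normalize \" escapes to the "" style handled by the rest of the script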
If your CSV cannot contain newlines then all you need is (with GNU awk for FPAT):
$ echo 'foo,"field,""with"",commas",bar' |
awk -v FPAT='[^,]*|("([^"]|"")*")' '{for (i=1; i<=NF;i++) print i " <" $i ">"}'
1 <foo>
2 <"field,""with"",commas">
3 <bar>
or the equivalent using any awk:
$ echo 'foo,"field,""with"",commas",bar' |
awk -v fpat='[^,]*|("([^"]|"")*")' -v OFS=',' '{
rec = $0
$0 = ""
i = 0
while ( (rec!="") && match(rec,fpat) ) {
$(++i) = substr(rec,RSTART,RLENGTH)
rec = substr(rec,RSTART+RLENGTH+1)
}
for (i=1; i<=NF;i++) print i " <" $i ">"
}'
1 <foo>
2 <"field,""with"",commas">
3 <bar>
See https://www.gnu.org/software/gawk/manual/gawk.html#More-CSV for info on the specific FPAT setting I use above.
If all you actually want to do is convert your CSV to individual lines by, say, replacing newlines with blanks and commas with semi-colons inside quoted fields then all you need is this, again using GNU awk for multi-char RS and RT:
$ awk -v RS='"([^"]|"")*"' -v ORS= '{gsub(/\n/," ",RT); gsub(/,/,";",RT); print $0 RT}' file.csv
"rec1; fld1",,"rec1"";""fld3.1 ""; fld3.2","rec1 fld4"
"rec2; fld1.1 fld1.2","rec2 fld2.1""fld2.2""fld2.3","",rec2 fld4
"""""","""rec3;fld2""",
Otherwise, though, the general, robust, portable solution to identify the fields that will work with any modern awk* is:
$ cat decsv.awk
function buildRec( fpat,fldNr,fldStr,done) {
CurrRec = CurrRec $0
if ( gsub(/"/,"&",CurrRec) % 2 ) {
# The string built so far in CurrRec has an odd number
# of "s and so is not yet a complete record.
CurrRec = CurrRec RS
done = 0
}
else {
# If CurrRec ended with a null field we would exit the
# loop below before handling it so ensure that cannot happen.
# We use a regexp comparison using a bracket expression here
# and in fpat so it will work even if FS is a regexp metachar
# or a multi-char string like "\\\\" for \-separated fields.
CurrRec = CurrRec ( CurrRec ~ ("[" FS "]$") ? "\"\"" : "" )
$0 = ""
fpat = "([^" FS "]*)|(\"([^\"]|\"\")+\")"
while ( (CurrRec != "") && match(CurrRec,fpat) ) {
fldStr = substr(CurrRec,RSTART,RLENGTH)
# Convert <"foo"> to <foo> and <"foo""bar"> to <foo"bar>
if ( gsub(/^"|"$/,"",fldStr) ) {
gsub(/""/, "\"", fldStr)
}
$(++fldNr) = fldStr
CurrRec = substr(CurrRec,RSTART+RLENGTH+1)
}
CurrRec = ""
done = 1
}
return done
}
# If your input has \-separated fields, use FS="\\\\"; OFS="\\"
BEGIN { FS=OFS="," }
!buildRec() { next }
{
printf "Record %d:\n", ++recNr
for (i=1;i<=NF;i++) {
# To replace newlines with blanks add gsub(/\n/," ",$i) here
printf " $%d=<%s>\n", i, $i
}
print "----"
}
$ awk -f decsv.awk file.csv
Record 1:
$1=<rec1, fld1>
$2=<>
$3=<rec1","fld3.1
",
fld3.2>
$4=<rec1
fld4>
----
Record 2:
$1=<rec2, fld1.1
fld1.2>
$2=<rec2 fld2.1"fld2.2"fld2.3>
$3=<>
$4=<rec2 fld4>
----
Record 3:
$1=<"">
$2=<"rec3,fld2">
$3=<>
----
The above assumes UNIX line endings of \n. With Windows \r\n line endings it's much simpler as the "newlines" within each field will actually just be line feeds (i.e. \ns) and so you can set RS="\r\n" (using GNU awk for multi-char RS) and then the \ns within fields will not be treated as line endings.
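For example, a minimal sketch for such \r\n-terminated input (assuming GNU awk; fields are printed with their quotes intact):
gawk -v RS='\r\n' -v FPAT='[^,]*|("([^"]|"")*")' '{
for (i = 1; i <= NF; i++) print i " <" $i ">"
print "----"
}' file.csv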
It works by simply counting how many "s are present so far in the current record whenever it encounters the RS - if it's an odd number then the RS (presumably \n but doesn't have to be) is mid-field and so we keep building the current record but if it's even then it's the end of the current record and so we can continue with the rest of the script processing the now complete record.
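The counting idea can be seen in isolation in this stripped-down sketch, which only reassembles complete records without splitting fields:
awk '{
rec = rec $0
if (gsub(/"/, "&", rec) % 2) { rec = rec RS; next }   # odd total: a " is still open
print "complete record: <" rec ">"
rec = ""
}' file.csv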
*I say "modern awk" above because there are apparently extremely old (circa 2000) versions of tawk and mawk1 still around which have bugs in their gsub() implementation such that gsub(/^"|"$/,"",fldStr) would not remove the start/end "s from fldStr. If you're using one of those then get a new awk, preferably gawk, as there could be other issues with them too, but if that's not an option then I expect you can work around that particular bug by changing this:
if ( gsub(/^"|"$/,"",fldStr) ) {
to this:
if ( sub(/^"/,"",fldStr) && sub(/"$/,"",fldStr) ) {
Thanks to the following people for identifying and suggesting solutions to the stated issues with the original version of this answer:
@mosvy for escaped double quotes within fields.
@datatraveller1 for multiple contiguous pairs of escaped quotes in a field and null fields at the end of records.
Related: also see How do I use awk under cygwin to print fields from an excel spreadsheet? for how to generate CSVs from Excel spreadsheets.
An improvement upon @EdMorton's FPAT solution, which should be able to handle double quotes (") escaped by doubling (""), as allowed by the CSV standard.
gawk -v FPAT='[^,]*|("[^"]*")+' ...
This STILL:
- isn't able to handle newlines inside quoted fields, which are perfectly legit in standard CSV files.
- assumes GNU awk (gawk); a standard awk won't do.
Example:
$ echo 'a,,"","y""ck","""x,y,z"," ",12' |
gawk -v OFS='|' -v FPAT='[^,]*|("[^"]*")+' '{$1=$1}1'
a||""|"y""ck"|"""x,y,z"|" "|12
$ echo 'a,,"","y""ck","""x,y,z"," ",12' |
gawk -v FPAT='[^,]*|("[^"]*")+' '{
for(i=1; i<=NF;i++){
if($i~/"/){ $i = substr($i, 2, length($i)-2); gsub(/""/,"\"", $i) }
print "<"$i">"
}
}'
<a>
<>
<>
<y"ck>
<"x,y,z>
< >
<12>
This is exactly what csvquote is for - it makes things simple for awk and other command line data processing tools.
Some things are difficult to express in awk. Instead of running a single awk command and trying to get awk to handle the quoted fields with embedded commas and newlines, the data gets prepared for awk by csvquote, so that awk can always interpret the commas and newlines it finds as field separators and record separators. This makes the awk part of the pipeline simpler. Once awk is done with the data, it goes back through csvquote -u to restore the embedded commas and newlines inside quoted fields.
csvquote file.csv | awk -f my_awk_script | csvquote -u
EDIT:
For a complete description of csvquote, see: How it works. This also explains the `` characters which are shown in places where there was a carriage return.
csvquote file.csv | awk -f decsv.awk | csvquote -u
(for the source of decsv.awk see the answer from Ed Morton)
output:
Record 1:
$1=<rec1 fld1>
$2=<>
$3=<rec1","fld3.1",
fld3.2>
$4=<rec1
fld4>
----
Record 2:
$1=<rec2, fld1.1
fld1.2>
$2=<rec2 fld2.1"fld2.2"fld2.3>
$3=<>
$4=<rec2 fld4>
----
Record 3:
$1=<"">
$2=<"rec3fld2">
$3=<>
----
I have found csvkit a really useful toolkit for handling csv files on the command line.
line='test,t2,t3,"t5,"'
echo $line | csvcut -c 4
"t5,"
echo 'foo,"field,""with"",commas",bar' | csvcut -c 3
bar
It also contains csvstat, csvstack, etc., which are also very handy; short sketches of those appear after the examples below.
cat file.csv
"rec1, fld1",,"rec1"",""fld3.1
"",
fld3.2","rec1
fld4"
"rec2, fld1.1
fld1.2","rec2 fld2.1""fld2.2""fld2.3","",rec2 fld4
"""""","""rec3,fld2""",
csvcut -c 1 file.csv
"rec1, fld1"
"rec2, fld1.1
fld1.2"
""""""
csvcut -c 3 file.csv
"rec1"",""fld3.1
"",
fld3.2"
""
""
Awk (gawk) actually provides extensions, one of which is CSV processing; in my opinion that is the most robust way to do it with gawk. The extension takes care of many gotchas and parses the csv for you.
Assuming that extension is installed, you can use awk to show all lines where a specific csv field matches 123.
Assuming test.csv contains the following:
Name,Phone
"Woo, John",425-555-1212
"James T. Kirk",123
The following will print all lines where the Phone (aka the second field) is equal to 123:
gawk -l csv 'csvsplit($0,a) && a[2] == 123 {print a[1]}' test.csv
The output is:
James T. Kirk
How does it work?
-l csv asks gawk to load the csv extension by looking for it in $AWKLIBPATH;
csvsplit($0, a) splits the current line, and stores each field into a new array named a
&& a[2] == 123 checks that the second field is 123
if both conditions are true, it runs { print a[1] }, i.e. prints the first csv field of the line.
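As another sketch relying only on csvsplit() as described above (a hypothetical variant, not part of the original answer), this prints every Phone value while skipping the header:
gawk -l csv 'NR > 1 && csvsplit($0, a) { print a[2] }' test.csv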
If you're using one of the common AWK interpreters (Gawk, onetrueawk, mawk), the other solutions are your best bet. However, if you're able to use a different interpreter, frawk and GoAWK have proper CSV support built-in.
frawk is a very fast AWK implementation written in Rust. Use -i csv to process input in CSV mode. Note that frawk is not quite POSIX compatible (see differences).
GoAWK is a POSIX-compatible AWK implementation written in Go. Also supports -i csv mode, as well as -H (parse header row) with @"named_field" syntax (read more). Disclaimer: I'm the author of GoAWK.
With file.csv as per the question, you can simply use an AWK script with a regular for loop over the fields as follows:
$ cat records.awk
{
printf "Record %d:\n", NR
for (i=1; i<=NF; i++)
printf " $%d=<%s>\n", i, $i
print "----"
}
Then use either frawk -i csv or goawk -i csv to get the expected output. For example:
$ frawk -i csv -f records.awk file.csv
Record 1:
$1=<rec1, fld1>
$2=<>
$3=<rec1","fld3.1
",
fld3.2>
$4=<rec1
fld4>
----
Record 2:
$1=<rec2, fld1.1
fld1.2>
$2=<rec2 fld2.1"fld2.2"fld2.3>
$3=<>
$4=<rec2 fld4>
----
Record 3:
$1=<"">
$2=<"rec3,fld2">
$3=<>
----
$ goawk -i csv -f records.awk file.csv
Record 1:
... same as above ...
----

Convert each integer to a simple ASCII graph

I have a file with a bunch of integers like this:
6
2
3
4
3
The goal is to convert those integers into stats like in a videogame; for example, if the number is 6 the stats must be ******----, and if the number is 4 the result must be ****------.
I tried the following piece of code but it doesn't work:
# Here I put all the int in a variable.
intNumber=`cat /home/intNumbers.txt`
# This for is a loop to print as much * as the number says.
for i in `seq 1 $intNumber`
do
echo -n "*"
# This for loop is for printing - until reacing 10.
for j in `seq $intNumber 10`
do
echo -n "-"
done
done
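For reference, a minimal corrected version of that loop would read one number per line and bound each inner loop (a sketch using the same seq-based style):
#!/bin/bash
while read -r n; do
    for i in $(seq 1 "$n"); do printf '*'; done
    for j in $(seq "$n" 9); do printf '-'; done   # seq n 9 yields 10-n dashes
    printf '\n'
done < /home/intNumbers.txt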
With Perl:
perl -ne 'print("*" x $_, "-" x (10-$_), "\n")' file
$_ contains the current row.
Output:
******----
**--------
***-------
****------
***-------
You may use this awk:
awk '{s = sprintf("%*s", $1, ""); gsub(/ /, "*", s); p = sprintf("%*s", 10-$1, ""); gsub(/ /, "-", p); print s p}' file
******----
**--------
***-------
****------
***-------
A more readable version:
awk '{
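# "%*s" takes its field width from the next argument, so this yields a string of $1 spaces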
s = sprintf("%*s", $1, "")
gsub(/ /, "*", s)
p = sprintf("%*s", 10-$1, "")
gsub(/ /, "-", p)
print s p
}' file
Another awk, keepin' it simple, sir:
$ awk '
BEGIN {
s="**********----------"
}
{
print substr(s,11-$1,10)
}' file
Output:
******----
**--------
***-------
****------
***-------
Similar for bash:
#!/bin/bash
s="**********----------"
while IFS= read -r line
do
echo "${s:((10-$line)):10}"
done < file
A more generic approach for awk could be, for example:
$ awk -v m=10 '{ # desired maximum number of chars
t="" # temp var
for(i=1;i<=m;i++) # loop to max
if(i<=$1) # up to threshold value from file
sub(/^/,"*",t) # prepend a *
else # after threshold
sub(/$/,"-",t) # append a -
print t
}' file
Some input checking could be in order.
Minimizing the work you have to do per input line for efficiency:
$ cat tst.awk
BEGIN {
lgth = 10
curr = base = sprintf("%*s",lgth,"")
gsub(/ /,"*",curr)
gsub(/ /,"-",base)
}
{ print substr(curr,1,$1) substr(base,$1+1) }
$ awk -f tst.awk file
******----
**--------
***-------
****------
***-------
or borrowing @JamesBrown's idea of indexing into a single string:
$ cat tst.awk
BEGIN {
lgth = 10
curr = base = sprintf("%*s",lgth,"")
gsub(/ /,"*",curr)
gsub(/ /,"-",base)
line = curr base
}
{ print substr(line,(lgth-$1)+1,lgth) }
$ awk -f tst.awk file
******----
**--------
***-------
****------
***-------
Don't read the entire input file into memory. Instead, process one line at a time.
The following also demonstrates how to do this more succinctly in Bash.
#!/bin/bash
ten='----------'
while IFS='' read -r num; do
printf -v graph '%10.10s' "${ten:$num}"
echo "${graph// /\*}"
done < intNumbers.txt
printf -v graph places the output in the variable graph, and we then use a Bash parameter substitution to replace the space padding from printf with asterisks.
Demo: https://ideone.com/wE6fpm
Doing this entirely in Bash is attractive if you end up doing this a lot; you generally want to avoid external processes especially in repeated code (though of course don't repeat yourself; put this in a function then.)
If you genuinely want to convert a file of numbers into a bunch of graphs, a single Awk process is still much better; the shell isn't particularly good at that. I'm imagining you have a different application where you occasionally need to format a number as a graph in various places in a Bash script.
A riff on tripleee's answer:
# Repeat a character a specified number of times
#
# parameters
# - character
# - count
#
# usage: str_repeat "*" 60
#
str_repeat() {
local char=$1 count=$2
local result
# string of count spaces
printf -v result "%*s" "$count" ""
# replace spaces with the char
echo "${result// /$char}"
}
while read -r num; do
printf '%s%s\n' "$(str_repeat '*' "$num")" "$(str_repeat '-' $((10-num)))"
done < intNumbers.txt
Another GAWK solution:
awk -v OFS="*" '{NF=($1+1); $1=""; print gensub(/ /,"-","g",sprintf("%-10s",$0))}' file
# -v OFS="*"
# [Sets the output field separator to an asterisk]
# NF=($1+1)
# [Defines the number of fields in the line as 1 more
# than the value in the first (and only) input field.
# If you printed the first line, it would now be "6******"]
# $1=""
# [Deletes the first field in the line, leaving only asterisks]
# sprintf("%-10s",$0)
# [Formats the line as 10 characters, left-justified.
# The missing characters on the right are blank spaces]
# gensub(/ /,"-","g",sprintf("%-10s",$0))
# [Replaces all blank spaces in the sprintf-formatted
# line with hyphens]
# print
# [Prints the line as modified by gensub]
With Python:
python -c "import sys
> for s in sys.stdin: n=int(s); print('*'*n + '-'*(10-n))" < file

Unix Parse Varying Named Value into separate rows

We are getting a varying-length input file, as shown below; the Text field is of varying length.
Input file:
ID|Text
1|name1=value1;name3;name4=value2;name5=value5
2|name1=value1;name2=value2;name6=;name7=value7;name8=value8
The text here contains name=value pairs and is of varying length. Please note that the name in the text column can contain a semicolon. We are trying to parse the input but we are not able to handle it via AWK or BASH.
Desired Output:
1|name1=value1
1|name3;name4=value2
1|name5=value5
2|name1=value1
2|name2=value2
2|name6=
2|name7=value7
2|name8=value8
The below snippet of code works for ID=2, but doesn't for ID=1
echo "2|name1=value1;name2=value2;name6=;name7=value7;name8=value8" | while IFS="|"; read id text;do dsc=`echo $text|tr ';' '\n'`;echo "$dsc" >tmp;done
cat tmp
2|name1=value1
2|name2=value2
2|name6=
2|name7=value7
2|name8=value8
echo "1|name1=value1;name3;name4=value2;name5=value5" | while IFS="|"; read id text;do dsc=`echo $text|tr ';' '\n'`;echo "$dsc" >tmp;sed -i "s/^/${id}\|/g" tmp;done
cat tmp
1|name1=value1
1|name3
1|name4=value2
1|name5=value5
Any help is greatly appreciated.
Could you please try the following, written and tested with the shown samples in a recent version of GNU awk. If your awk version is old and lacks interval expressions ({1,}) by default, try changing awk to awk --re-interval.
awk '
BEGIN{
FS=OFS="|"
}
FNR==1{ next }
{
first=$1
while(match($0,/(name[0-9]+;?){1,}=(value[0-9]+)?/)){
print first,substr($0,RSTART,RLENGTH)
$0=substr($0,RSTART+RLENGTH)
}
}' Input_file
Output will be as follows.
1|name1=value1
1|name3;name4=value2
1|name5=value5
2|name1=value1
2|name2=value2
2|name6=
2|name7=value7
2|name8=value8
Explanation: a detailed explanation of the above (the following is for explanation purposes only).
awk ' ##Starting awk program from here.
BEGIN{ ##Starting BEGIN section from here.
FS=OFS="|" ##Setting FS and OFS with | here.
}
FNR==1{ next } ##Skip the first (header) line without printing anything.
{
first=$1 ##Save the first field in the variable first.
while(match($0,/(name[0-9]+;?){1,}=(value[0-9]+)?/)){
##Loop while match() finds the next name/value pair; the regex allows names that contain semicolons.
print first,substr($0,RSTART,RLENGTH) ##Print first and the currently matched substring.
$0=substr($0,RSTART+RLENGTH) ##Keep only the rest of the line for the next iteration.
}
}' Input_file ##Mentioning Input_file name here.
Sample data:
$ cat name.dat
ID|Text
1|name1=value1;name3;name4=value2;name5=value5
2|name1=value1;name2=value2;name6=;name7=value7;name8=value8
One awk solution:
awk -F"[|;]" ' # use "|" and ";" as input field delimiters
FNR==1 { next } # skip header line
{ pfx=$1 "|" # set output prefix to field 1 + "|"
printpfx=1 # set flag to print prefix
for ( i=2 ; i<=NF ; i++ ) # for fields 2 to NF
{
if ( printpfx) { printf "%s", pfx ; printpfx=0 } # if print flag == 1 then print prefix and clear flag
if ( $(i) ~ /=/ ) { printf "%s\n", $(i) ; printpfx=1 } # if current field contains "=" then print it, end this line of output, reset print flag == 1
if ( $(i) !~ /=/ ) { printf "%s;", $(i) } # if current field does not contain "=" then print it and include a ";" suffix
}
}
' name.dat
The above generates:
1|name1=value1
1|name3;name4=value2
1|name5=value5
2|name1=value1
2|name2=value2
2|name6=
2|name7=value7
2|name8=value8
A Bash solution:
#!/usr/bin/env bash
while IFS=\| read -r id text || [ -n "$id" ]; do
IFS=\; read -r -a kv_arr < <(printf %s "$text")
printf "$id|%s\\n" "${kv_arr[#]}"
done < <(tail -n +2 a.txt)
A plain POSIX shell solution:
#!/usr/bin/env sh
# Chop the header line from the input file
tail -n +2 a.txt |
# While reading id and text Fields Separated by vertical bar
while IFS=\| read -r id text || [ -n "$id" ]; do
# Sets the separator to a semicolon
IFS=\;
# Print each semicolon separated field formatted on
# its own line with the ID
# shellcheck disable=SC2086 # Explicit split on semicolon
printf "$id|%s\\n" $text
done
Input a.txt:
ID|Text
1|name1=value1;name3;name4=value2;name5=value5
2|name1=value1;name2=value2;name6=;name7=value7;name8=value8
Output:
1|name1=value1
1|name3
1|name4=value2
1|name5=value5
2|name1=value1
2|name2=value2
2|name6=
2|name7=value7
2|name8=value8
You have some good answers and an accepted one already. Here is a much shorter GNU awk command that can also do the job (it uses gawk's three-argument match(), which stores the capture groups in the array m):
awk -F '|' 'NR > 1 {
for (s=$2; match(s, /([^=]+=[^;]*)(;|$)/, m); s=substr(s, RLENGTH+1))
print $1 FS m[1]
}' file.txt
1|name1=value1
1|name3;name4=value2
1|name5=value5
2|name1=value1
2|name2=value2
2|name6=
2|name7=value7
2|name8=value8

Awk/sed replace newlines

Intro:
I have been given a CSV file in which the field delimiter is the pipe character (i.e., |).
This file has a pre-defined number of fields (say N). I can discover the value of N by reading the header of the CSV file, which we can assume to be correct.
Problem:
Some of the fields contain a newline character by mistake, which makes the line appear shorter than required (i.e., it has M fields, with M < N).
What I need to create is a sh script (not bash) to fix those lines.
Attempted solution:
I tried creating the following script to try fixing the file:
if [ $# -ne 1 ]
then
echo "Usage: $0 <filename>"
exit
fi
# get first line
first_line=$(head -n 1 $1)
# get number of fields
num_separators=$(echo "$first_line" | tr -d -c '|' | awk '{print length}')
cat $1 | awk -v numFields=$(( num_separators + 1 )) -F '|' '
{
totRecords = NF/numFields
# loop over lines
for (record=0; record < totRecords; record++) {
output = ""
# loop over fields
for (i=0; i<numFields; i++) {
j = (numFields*record)+i+1
# replace newline with question mark
sub("\n", "?", $j)
output = output (i > 0 ? "|" : "") $j
}
print output
}
}
'
However, the newline character is still present.
How can I fix that problem?
Example of the CSV:
FIRST_NAME|LAST_NAME|NOTES
John|Smith|This is a field with a
newline
Foo|Bar|Baz
Expected output:
FIRST_NAME|LAST_NAME|NOTES
John|Smith|This is a field with a * newline
Foo|Bar|Baz
* I don't care about the replacement, it could be a space, a question mark, whatever except a newline or a pipe (which would create a new field)
$ cat tst.awk
BEGIN { FS=OFS="|" }
NR==1 { reqdNF = NF; printf "%s", $0; next }
{ printf "%s%s", (NF < reqdNF ? " " : ORS), $0 }
END { print "" }
$ awk -f tst.awk file.csv
FIRST_NAME|LAST_NAME|NOTES
John|Smith|This is a field with a newline
Foo|Bar|Baz
If that's not what you want then edit your question to provide more truly representative sample input and associated output.
Based on the assumption that the last field may contain one newline. Using tac and sed:
tac file.csv | sed -n '/|/!{h;n;x;H;x;s/\n/ * /p;b};p' | tac
Output:
FIRST_NAME|LAST_NAME|NOTES
John|Smith|This is a field with a * newline
Foo|Bar|Baz
How it works: read the file backwards; sed is easier without forward references. If a line has no '|' separator (/|/!), run the block of code in curly braces {}; otherwise just print the line (p). The block of code:
h; stores the delimiter-less line in sed's hold buffer.
n; fetches another line, since we're reading backwards, this is the line that should be appended to.
x; exchange hold buffer and pattern buffer.
H; append pattern buffer to hold buffer.
x; exchange again, so the two newly joined lines are in the pattern buffer.
s/\n/ * /p; replace the middle linefeed with " * ", leaving one longer line; and print.
b; end this cycle and start over with the next input line.
Re-reverse the file with tac; done.

Modify content inside quotation marks, BASH

Good day to all,
I was wondering how to modify the content inside quotation marks and leave the outside unmodified.
Input line:
,,,"Investigacion,,, desarrollo",,,
Output line:
,,,"Investigacion, desarrollo",,,
Initial try:
sed 's/\"",,,""*/,/g'
But nothing happens. Thanks in advance for any clue.
The idiomatic awk way to do this is simply:
$ awk 'BEGIN{FS=OFS="\""} {sub(/,+/,",",$2)} 1' file
,,,"Investigacion, desarrollo",,,
or if you can have more than one set of quoted strings on each line:
$ cat file
,,,"Investigacion,,, desarrollo",,,"foo,,,,bar",,,
$ awk 'BEGIN{FS=OFS="\""} {for (i=2;i<=NF;i+=2) sub(/,+/,",",$i)} 1' file
,,,"Investigacion, desarrollo",,,"foo,bar",,,
This approach works because everything up to the first " is field 1, and everything from there to the second " is field 2 and so on so everything between "s is the even-numbered fields. It can only fail if you have newlines or escaped double quotes inside your fields but that'd affect every other possible solution too so you'd need to add cases like that to your sample input if you want a solution that handles it.
Using a language that has built-in CSV parsing capabilities like perl will help.
perl -MText::ParseWords -ne '
print join ",", map { $_ =~ s/,,,/,/; $_ } parse_line(",", 1, $_)
' file
,,,"Investigacion, desarrollo",,,
Text::ParseWords is a core module so you don't need to download it from CPAN. Using the parse_line method we set the delimiter and a flag to keep the quotes. Then just do simple substitution and join the line to make your CSV again.
Using egrep, sed and tr:
s=',,,"Investigacion,,, desarrollo",,,'
r=$(egrep -o '"[^"]*"|,' <<< "$s"|sed '/^"/s/,\{2,\}/,/g'|tr -d "\n")
echo "$r"
,,,"Investigacion, desarrollo",,,
Using awk:
awk '{ p = ""; while (match($0, /"[^"]*,{2,}[^"]*"/)) { t = substr($0, RSTART, RLENGTH); gsub(/,+/, ",", t); p = p substr($0, 1, RSTART - 1) t; $0 = substr($0, RSTART + RLENGTH); }; $0 = p $0 } 1'
Test:
$ echo ',,,"Investigacion,,, desarrollo",,,' | awk ...
,,,"Investigacion, desarrollo",,,
$ echo ',,,"Investigacion,,, desarrollo",,,",,, "' | awk ...
,,,"Investigacion, desarrollo",,,", "

Resources