Parsing a CSV file using shell - bash

My shell is a bit rusty so I would greatly appreciate some help in parsing the following data.
Each row in the input file contains data separated by commas.
[name, record_timestamp, action, field_id, field_name, field_value, number_of_fields]
The rows are instructions to create or update information about persons. So for example the first line says that the person John Smith will be created and that the following 6 rows will contain information about him.
The field_id number always represents the same field.
input.csv
John Smith,2017-03-03 11:56:02,create,,,,6
,,,,1,BIRTH_DATE,1985-02-16,,
,,,,2,BIRTH_CITY,Portland,,
,,,,3,SEX,Male,,
,,,,5,CITY,Seattle,,
,,,,7,EMPLOYER,Microsoft,,
,,,,9,MARRIED,Yes,,
Susan Anderson,2017-03-01 12:09:36,create,,,,8
,,,,1,BIRTH_DATE,1981-09-12,,
,,,,2,BIRTH_CITY,San Diego,,
,,,,3,SEX,Female,,
,,,,5,CITY,Palo Alto,,
,,,,7,EMPLOYER,Facebook,,
,,,,8,SALARY,5612,,
,,,,9,MARRIED,No,,
,,,,10,TELEPHONE,5107586290,,
Brad Bradly,2017-02-29 09:15:12,update,,,,3
,,,,3,SEX,Male,,
,,,,7,EMPLOYER,Walmart,,
,,,,9,MARRIED,No,,
Sarah Wilson,2017-02-28 16:21:39,update,,,,5
,,,,2,BIRTH_CITY,Miami,,
,,,,3,SEX,Female,,
,,,,7,EMPLOYER,Disney,,
,,,,8,SALARY,5110,,
,,,,9,MARRIED,Yes,,
I want to parse each of these persons into comma separated strings that looks like this:
name,birth date,birth city,sex,employer,salary,marriage status,record_timestamp
but we should only output such a string if both birth date and birth city, or both employer and salary, are available for that person. An incomplete pair is just left empty, and a person with neither pair complete is skipped (see the example below).
Given our input above the output should then be
John Smith,1985-02-16,Portland,Male,,,Yes,2017-03-03 11:56:02
Susan Anderson,1981-09-12,San Diego,Female,Facebook,5612,No,2017-03-01 12:09:36
Sarah Wilson,,,Female,Disney,5110,Yes,2017-02-28 16:21:39
I've figured out that I should probably do something along the following lines. But then I cannot figure out how to implement an inner loop or if there is some other way to proceed.
#!/bin/bash
IFS=','
while read -ra outer
do
    echo "${outer[0]}"
    #...
done < test.txt
Thanks in advance for any advice!

A UNIX shell is an environment from which to call UNIX tools (and manipulate files and processes) with a language to sequence those calls. It is NOT a tool to manipulate text.
The standard UNIX tool to manipulate text is awk:
$ cat tst.awk
BEGIN {
    numFlds = split("name BIRTH_DATE BIRTH_CITY SEX EMPLOYER SALARY MARRIED timestamp",nr2name)
    FS = OFS = ","
}
$1 != "" {
    prtRec()
    rec["name"] = $1
    rec["timestamp"] = $2
    next
}
{ rec[$6] = $7 }
END { prtRec() }
function prtRec(   fldNr) {
    if ( ((rec["BIRTH_DATE"] != "") && (rec["BIRTH_CITY"] != "")) ||
         ((rec["EMPLOYER"] != "") && (rec["SALARY"] != "")) ) {
        for (fldNr=1; fldNr<=numFlds; fldNr++) {
            printf "%s%s", rec[nr2name[fldNr]], (fldNr<numFlds ? OFS : ORS)
        }
    }
    delete rec
}
$ awk -f tst.awk file
John Smith,1985-02-16,Portland,Male,Microsoft,,Yes,2017-03-03 11:56:02
Susan Anderson,1981-09-12,San Diego,Female,Facebook,5612,No,2017-03-01 12:09:36
Sarah Wilson,,Miami,Female,Disney,5110,Yes,2017-02-28 16:21:39
Any time you have records consisting of name+value data like you do here, by far the simplest, clearest, most robust, and easiest to enhance/debug approach is to first populate an array (rec[] above) containing the values indexed by their names. Once you have that array, it's trivial to print and/or manipulate the contents by name.
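The idea scales down to a two-liner; once each value is stored under its name, printing any subset in any order is just a lookup:

```shell
# Minimal illustration of the name-indexed record idea from above
printf 'EMPLOYER,Facebook\nSALARY,5612\n' |
awk -F, '{ rec[$1] = $2 }                              # index each value by its name
         END { print rec["EMPLOYER"], rec["SALARY"] }'
# prints: Facebook 5612
```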

Use awk or something like
while IFS=, read -r name timestamp action f_id f_name f_value nr_fields; do
    if [ -n "${name}" ]; then
        : # process start record, store the fields you need for the next lines
    else
        : # process a detail record
    fi
done < test.txt
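Fleshed out, that skeleton could look like the sketch below. This is a sketch only, assuming bash 4+ for the associative array; the pair-blanking rules come from the question, the input is a trimmed copy of the sample data, and `out.csv` is a scratch file introduced here. The awk answers are simpler and faster for this job:

```shell
#!/bin/bash
# Trimmed copy of the question's sample input, for the demo
cat > input.csv <<'EOF'
John Smith,2017-03-03 11:56:02,create,,,,6
,,,,1,BIRTH_DATE,1985-02-16,,
,,,,2,BIRTH_CITY,Portland,,
,,,,3,SEX,Male,,
,,,,7,EMPLOYER,Microsoft,,
,,,,9,MARRIED,Yes,,
Sarah Wilson,2017-02-28 16:21:39,update,,,,5
,,,,2,BIRTH_CITY,Miami,,
,,,,3,SEX,Female,,
,,,,7,EMPLOYER,Disney,,
,,,,8,SALARY,5110,,
,,,,9,MARRIED,Yes,,
EOF

declare -A rec

print_rec() {
    [ -n "${rec[name]}" ] || return 0                     # nothing buffered yet
    local bd=${rec[BIRTH_DATE]} bc=${rec[BIRTH_CITY]}
    local em=${rec[EMPLOYER]}   sa=${rec[SALARY]}
    if [ -z "$bd" ] || [ -z "$bc" ]; then bd='' bc=''; fi # blank an incomplete pair
    if [ -z "$em" ] || [ -z "$sa" ]; then em='' sa=''; fi
    if [ -n "$bd" ] || [ -n "$em" ]; then                 # need at least one full pair
        printf '%s,%s,%s,%s,%s,%s,%s,%s\n' \
            "${rec[name]}" "$bd" "$bc" "${rec[SEX]}" "$em" "$sa" \
            "${rec[MARRIED]}" "${rec[timestamp]}"
    fi
    rec=()
}

{
    while IFS=, read -ra f; do
        if [ -n "${f[0]}" ]; then      # header row: a new person starts
            print_rec
            rec[name]=${f[0]}
            rec[timestamp]=${f[1]}
        else                           # detail row: store the value under its name
            rec[${f[5]}]=${f[6]}
        fi
    done < input.csv
    print_rec                          # flush the last person
} > out.csv
cat out.csv
# out.csv now contains:
# John Smith,1985-02-16,Portland,Male,,,Yes,2017-03-03 11:56:02
# Sarah Wilson,,,Female,Disney,5110,Yes,2017-02-28 16:21:39
```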

awk to the rescue!
awk -F, 'function pr(a) {
    if (!(7 in a && 8 in a)) a[7]=a[8]=""
    if (!(1 in a && 2 in a)) a[1]=a[2]=""
    for (i=0; i<=10; i++) printf "%s,", a[i]
    printf "%s\n", a["ts"]
}
NR>1 && $1!="" { pr(a); delete a }
$1!=""         { a[0]=$1; a["ts"]=$2 }
$1==""         { a[$5]=$7 }
END            { pr(a) }' file
This should cover the general case, including the conditional fields. You may want to filter out the fields you don't need.
This will print for your input
John Smith,1985-02-16,Portland,Male,,Seattle,,,,Yes,,2017-03-03 11:56:02
Susan Anderson,1981-09-12,San Diego,Female,,Palo Alto,,Facebook,5612,No,5107586290,2017-03-01 12:09:36
Brad Bradly,,,Male,,,,,,No,,2017-02-29 09:15:12
Sarah Wilson,,,Female,,,,Disney,5110,Yes,,2017-02-28 16:21:39

Avoid IFS hacks like the plague; they are ugly stuff.
Play with the -d option to read to specify the comma as the delimiter.
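For example, each read -d ',' call consumes one comma-terminated token; the || [ -n "$tok" ] guard catches the final token, which has no trailing comma. Note this reads token by token rather than row by row, so awk remains the better fit for the actual question:

```shell
# One token per read call, using the comma as read's delimiter
printf 'one,two,three' |
while IFS= read -r -d ',' tok || [ -n "$tok" ]; do
    echo "token: $tok"
done
# prints:
# token: one
# token: two
# token: three
```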

Related

awk or other shell to convert delimited list into a table

So what I have is a huge csv akin to this:
Pool1,Shard1,Event1,10
Pool1,Shard1,Event2,20
Pool1,Shard2,Event1,30
Pool1,Shard2,Event4,40
Pool2,Shard1,Event3,50
etc
which is not easily readable. With there being only 4 types of events, I'm using spreadsheets to convert this into the following:
Pool1,Shard1,10,20,,
Pool1,Shard2,30,,,40
Pool2,Shard1,,,50,
Only the events are limited to 4; pools and shards can be indefinite, really. But events may be missing from the lines - not all pools/shards have all 4 events every day.
So I tried doing this within an awk in the shell script that gathers the csv in the first place, but I'm failing spectacularly; no working code can even be shown, since it's producing zero results.
Basically I tried sorting the CSV, reading the first two fields of a row, comparing them to the previous row and, if they match, comparing the third field to a set array of event strings and storing the fourth field in a variable respective to the event; once the first two fields no longer match, finally print the whole line including the variables.
Sorry for the one-liner; I was testing and experimenting directly in the command line. It's embarrassing, it does nothing.
awk -F, '{if (a==$1&&b==$2) {if ($3=="Event1") {r=$4} ; if ($3=="Event2") {d=$4} ; if ($3=="Event3") {t=$4} ; if ($3=="Event4") {p=$4}} else {printf $a","$b","$r","$d","$p","$t"\n"; a=$1 ; b=$2 ; if ($3=="Event1") {r=$4} ; if ($3=="Event2") {d=$4} ; if ($3=="Event3") {t=$4} ; if ($3=="Event4") {p=$4} ; a=$1; b=$2}} END {printf "\n"}'
You could simply use an assoc array (arrays of arrays need GNU awk 4+): awk -F, -f parse.awk input.csv with parse.awk being:
{
    sub(/Event/, "", $3);
    res[$1","$2][$3]=$4;
}
END {
    for (name in res) {
        printf("%s,%s,%s,%s,%s\n", name, res[name][1], res[name][2], res[name][3], res[name][4])
    }
}
The output order may vary, because awk's for (name in res) traversal order is unspecified; my test output is:
Pool2,Shard1,,,50,
Pool1,Shard1,10,20,,
Pool1,Shard2,30,,,40
PS: Please use an editor to write awk source code. Your one-liner is really hard to read. Since I used a different approach, I did not even try to get it "right"... ;)
$ cat tst.awk
BEGIN { FS=OFS="," }
{ key = $1 OFS $2 }
key != prev {
    if ( NR>1 ) {
        print prev, f["Event1"], f["Event2"], f["Event3"], f["Event4"]
        delete f
    }
    prev = key
}
{ f[$3] = $4 }
END { print key, f["Event1"], f["Event2"], f["Event3"], f["Event4"] }
$ sort file | awk -f tst.awk
Pool1,Shard1,10,20,,
Pool1,Shard2,30,,,40
Pool2,Shard1,,,50,

Assign the value of awk-for loop variable to a bash variable

content within the tempfile
123 sam moore IT_Team
235 Rob Xavir Management
What I'm trying to do is get input from the user, search for it in the tempfile, and output the column number of the match.
Code I have for that
#!/bin/bash
set -x;
read -p "Enter :" sword6;
awk 'BEGIN{ IGNORECASE = 1 }
{
    for(i=1;i<=NF;i++) {
        if( $i ~ "'$sword6'$" )
            print i;
    }
} ' /root/scripts/pscripts/tempprint.txt;
This prints exactly the column number.
Output
Enter : sam
2
What I need is for the value of the i variable to be assigned to a bash variable, so I can use it as needed in the script.
Any help with this is highly appreciated.
I searched for an existing answer but was not able to find one; if there is one, please let me know.
First of all, you should pass your shell variable to awk this way (e.g. for sword6):
awk -v word="$sword6" '{ ... if ($i ~ word) ... }' ...
To assign a shell variable from the output of another command:
shellVar=$(awk '......')
Then you can continue using $shellVar in your script.
Regarding your awk code:
if the user inputs special characters, your script may fail, e.g. .*
if one column matches the user input multiple times, you may get duplicated output.
if your file has multiple columns matching the user input, you may want to handle that.
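The first caveat can be sidestepped by comparing strings instead of matching a regex, so input like .* is taken literally. A sketch using the question's sample tempfile data inline:

```shell
sword6='SAM'           # pretend user input; case differs on purpose
i=$(printf '123 sam moore IT_Team\n235 Rob Xavir Management\n' |
    awk -v w="$sword6" '
        { for (f = 1; f <= NF; f++)
              if (tolower($f) == tolower(w)) { print f; exit }   # literal comparison, no regex
        }')
echo "$i"
# prints: 2
```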
You just need to capture the output of awk. As an aside, I would pass sword6 as an awk variable, not inject it via string interpolation.
i=$(awk -v w="$sword6" '
    BEGIN { IGNORECASE = 1 }
    { for (i=1; i<=NF; i++) {
          if ($i ~ w"$") { print i; }
      }
    }' /root/scripts/pscripts/tempprint.txt)
The following script may help you with the same, too.
cat script.ksh
echo "Please enter the user name:"
read var
awk -v val="$var" '{for(i=1;i<=NF;i++){if(tolower($i)==tolower(val)){print i,$i}}}' Input_file
If tempprint.txt is big
awk -v w="$sword6" '
    BEGIN { IGNORECASE = 1 }
    $0 ~ ("\\<" w "\\>") {
        for (i=1; i<=NF; i++)
            if ($i == w) print i
    }' tempprint.txt

Using awk to print a new column without apostrophes or spaces

I'm processing a text file and adding a column composed of certain components of other columns. A new requirement to remove spaces and apostrophes was requested, and I'm not sure of the most efficient way to accomplish this task.
The file's content can be created by the following script:
content=(
john smith thomas blank 123 123456 10
jane smith elizabeth blank 456 456123 12
erin "o'brien" margaret blank 789 789123 9
juan "de la cruz" carlos blank 1011 378943 4
)
# put this into a tab-separated file, with the syntactic (double) quotes above removed
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\n' "${content[@]}" >infile
This is what I have now, but it fails to remove spaces and apostrophes:
awk -F "\t" '{OFS="\t"; print $1,$2,$3,$5,$6,$7,$6 tolower(substr($2,0,3)); }' infile > outfile
The following attempt throws the error "sub third parameter is not a changeable object", which makes sense, since I'm trying to process output instead of input, I guess:
awk -F "\t" '{OFS="\t"; print $1,$2,$3,$5,$6,$7,$6 sub("'\''", "",tolower(substr($2,0,3))); }' infile > outfile
Is there a way I can print a combination of column 6 and part of column 2 in lower case, all while removing spaces and apostrophes from the output in the new column? Worst case scenario, I can just create a new file with my first command and process that output with a new awk command, but I'd like to do it in one pass if possible.
The second approach was close, but for order of operations:
awk -F "\t" '
BEGIN { OFS="\t" }
{
    var = $2
    gsub("['\''[:space:]]", "", var)   # gsub, not sub: remove every apostrophe and space
    var = tolower(substr(var, 1, 3))
    print $1,$2,$3,$5,$6,$7,$6 var
}
'
Assigning the contents you want to modify to a variable lets that variable be modified in-place.
Characters you want to remove should be removed before taking the substring, since otherwise you shorten your 3-character substring.
It's a guess since you didn't provide the expected output but is this what you're trying to do?
$ cat tst.awk
BEGIN { FS=OFS="\t" }
{
abbr = $2
gsub(/[\047[:space:]]/,"",abbr)
abbr = tolower(substr(abbr,1,3))
print $1,$2,$3,$5,$6,$7,$6 abbr
}
$ awk -f tst.awk infile
john smith thomas 123 123456 10 123456smi
jane smith elizabeth 456 456123 12 456123smi
erin o'brien margaret 789 789123 9 789123obr
juan de la cruz carlos 1011 378943 4 378943del
Note that the way to represent a ' in a '-enclosed awk script is with the octal escape \047, which will keep working if/when you move your script to a file (unlike "'\''", which only works from the command line). Also note that strings, arrays, and fields in awk start at 1, not 0, so your substr(..,0,3) is wrong; awk quietly clamps the invalid start position of 0 to the first valid start position, which is 1 (and some awk versions shorten the result while doing so).
The "sub third parameter is not a changeable object" error you were getting is because sub() modifies the object you pass as its 3rd argument, and here that argument is a literal string (the output of tolower(substr(...))), which cannot be modified. Try sub(/o/,"","foo") and you'll get the same error, whereas var="foo"; sub(/o/,"",var) is valid, since the content of a variable can be modified.
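Both points are easy to check from the command line:

```shell
awk 'BEGIN { print substr("obrien", 1, 3) }'   # positions start at 1 -> prints: obr
awk 'BEGIN { print "don\047t" }'               # \047 is a literal quote -> prints: don't
```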

Analyze a control table by Shell Script

A shell script is analysing a control table to get the right parameters for its processing.
Currently it is simple: grep points to the correct line, and awk '{print $n}' determines the right column.
Columns are separated by spaces only. No special rules, just values separated by spaces.
All is fine and working; the users like it.
As long as none of the columns is left empty, that is. For the last column it's OK to leave it empty, but if somebody does not fill in a column in the middle, it confuses the awk '{print $n}' logic.
Of course, one could ask the users to fill in every entry, or one could just define the column delimiter as ";".
In case something is skipped, one could then use ";;". However, I would prefer not to change the table style.
So the question is:
How to effectively analyze a table having blanks in column values? The table is like this:
ApplikationService ServerName    PortNumber ControlValue_1 ControlValue_2
Read               chavez.com    3599       john           doe
Write                            3345       johnny         walker
Update             curiosity.org            jerry
What might be of some help:
If there is a value set in a column, it sits (more or less precisely) under its column header description.
Cheers,
Tarik
You don't say what your desired output is but this shows you the right approach:
$ cat tst.awk
NR==1 {
    print
    while ( match($0,/[^[:space:]]+[[:space:]]*/) ) {
        width[++i] = RLENGTH
        $0 = substr($0,RSTART+RLENGTH)
    }
    next
}
{
    i = 0
    while ( (fld = substr($0,1,width[++i])) != "" ) {
        gsub(/^ +| +$/,"",fld)
        printf "%-*s", width[i], (fld == "" ? "[empty]" : fld)
        $0 = substr($0,width[i]+1)
    }
    print ""
}
$
$ awk -f tst.awk file
ApplikationService ServerName    PortNumber ControlValue_1 ControlValue_2
Read               chavez.com    3599       john           doe
Write              [empty]       3345       johnny         walker
Update             curiosity.org [empty]    jerry          [empty]
It uses the width of each field in the title line to determine the width of every field in every line of the file, then replaces empty fields with the string "[empty]" and left-aligns every field, just to pretty it up a bit.

AWK between 2 patterns - first occurence

I have this example of an ini file. I need to extract the names between the two patterns [Name_Z1] and OBJ=Name_Z1 and put each of them on its own line.
The problem is that there is more than one occurrence of Name_Z1 and OBJ=Name_Z1, and I only need the first occurrence.
[Name_Z5]
random;text
Names;Jesus;Tom;Miguel
random;text
OBJ=Name_Z5
[Name_Z1]
random;text
Names;Jhon;Alex;Smith
random;text
OBJ=Name_Z1
[Name_Z2]
random;text
Names;Chris;Mara;Iordana
random;text
OBJ=Name_Z2
[Name_Z1_Phone]
random;text
Names;Bill;Stan;Mike
random;text
OBJ=Name_Z1_Phone
My desired output would be:
Jhon
Alex
Smith
I am currently writing a larger script in bash and I am stuck on this. I would prefer awk to do the job.
I would greatly appreciate anyone who can help. Thank you!
For Wintermute's solution: the [Name_Z1] part looks like this:
[CAB_Z1]
READ_ONLY=false
FilterAttr=CeaseTime;blank|ObjectOfReference;contains;511047;512044;513008;593026;598326;CL5518;CL5521;CL5538;CL5612;CL5620|PerceivedSeverity;=;Critical;Major;Minor|ProbableCause;!=;HOUSE ALARM;IO DEVICE|ProblemText;contains;AIRE;ALIMENTA;BATER;CONVERTIDOR;DISTRIBUCION;FUEGO;HURTO;MAINS;MALLO;MAYOR;MENOR;PANEL;TEMP
NAME=CAB_Z1
And the [Name_Z1_Phone] part looks like this:
[CAB_Z1_FUEGO]
READ_ONLY=false
FilterAttr=CeaseTime;blank|ObjectOfReference;contains;511047;512044;513008;593026;598326;CL5518;CL5521;CL5538;CL5612;CL5620|PerceivedSeverity;=;Critical;Major;Minor|ProbableCause;!=;HOUSE ALARM;IO DEVICE|ProblemText;contains;FUEGO
NAME=CAB_Z1_FUEGO
The fix should be somewhere around the "|PerceivedSeverity"
Expected Output:
511047
512044
513008
593026
598326
CL5518
CL5521
CL5538
CL5612
CL5620
This should work:
sed -n '/^\[Name_Z1/,/^OBJ=Name_Z1/ { /^Names/ { s/^Names;//; s/;/\n/g; p; q } }' foo.txt
Explanation: Written readably, the code is
/^\[Name_Z1/,/^OBJ=Name_Z1/ {
/^Names/ {
s/^Names;//
s/;/\n/g
p
q
}
}
This means: In the pattern range /^\[Name_Z1/,/^OBJ=Name_Z1/, for all lines that match the pattern /^Names/, remove the Names; in the beginning, then replace all remaining ; with newlines, print the whole thing, and then quit. Since it immediately quits, it will only handle the first such line in the first such pattern range.
EDIT: The update made things a bit more complicated. I suggest
sed -n '/^\[CAB_Z1/,/^NAME=CAB_Z1/ { /^FilterAttr=/ { s/^.*contains;\(.*\)|PerceivedSeverity.*$/\1/; s/;/\n/g; p; q } }' foo.txt
The main difference is that instead of removing ^Names from a line, the substitution
s/^.*contains;\(.*\)|PerceivedSeverity.*$/\1/;
is applied. This isolates the part between contains; and |PerceivedSeverity before continuing as before. It assumes that there is only one such part in the line. If the match is ambiguous, it will pick the one that appears last in the line.
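That last-match behaviour comes from .* being greedy; a quick check on a made-up line with two contains; sections:

```shell
# The first greedy .* swallows the earlier "contains;", so the capture starts after the last one
echo 'a contains;FIRST|PerceivedSeverity contains;LAST|PerceivedSeverity;x' |
sed 's/^.*contains;\(.*\)|PerceivedSeverity.*$/\1/'
# prints: LAST
```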
A (g)awk way that doesn't need a set number of fields (although I have assumed that contains; will always be on the line you need the names from):
(g)awk '(x+=/Z1/)&&match($0,/contains;([^|]+)/,a)&&gsub(";","\n",a[1]){print a[1];exit}' f
Explanation
(x+=/Z1/) - Increments x when Z1 is found. Also part of a condition, so x must be nonzero to continue.
match($0,/contains;([^|]+)/,a) - Matches contains; and then captures everything after it up to the |. Stores the capture in a. Again a condition, so it must succeed to continue.
gsub(";","\n",a[1]) - Substitutes all the ; for newlines in the capture group a[1].
{print a[1];exit} - If all conditions are met, then print a[1] and exit.
This way should work in (m)awk
awk '(x+=/Z1/)&&/contains/{split($0,a,"|");y=split(a[2],b,";");for(i=3;i<=y;i++)print b[i];exit}' file
sed -n '/\[Name_Z1\]/,/OBJ=Name_Z1$/ s/Names;//p' file.txt | tr ';' '\n'
That is sed -n to avoid printing anything not explicitly requested. Start from Name_Z1 and finish at OBJ=Name_Z1. Remove Names; and print the rest of the line where it occurs. Finally, replace semicolons with newlines.
An awk solution would be
$ awk -F";" '/Name_Z1/{f=1} f && /Names/{print $2,$3,$4} /OBJ=Name_Z1/{exit}' OFS="\n" input
Jhon
Alex
Smith
OR
$ awk -F";" '/Name_Z1/{f++} f==1 && /Names/{print $2,$3,$4}' OFS="\n" input
Jhon
Alex
Smith
-F";" sets the field separator to ;
/Name_Z1/{f++} matches the line against the pattern /Name_Z1/; if it matches, increment f
f==1 && /Names/{print $2,$3,$4} means: if f == 1 and the line matches the pattern Names, print columns 2, 3 and 4 (delimited by ;)
OFS="\n" sets the output field separator to a newline
EDIT
$ awk -F"[;|]" '/Z1/{f++} f==1 && NF>1{for (i=5; i<15; i++)print $i}' input
511047
512044
513008
593026
598326
CL5518
CL5521
CL5538
CL5612
CL5620
Here is a more generic solution for data in groups of blocks.
This awk does not need the end tag, just the start.
awk -vRS= -F"\n" '/^\[Name_Z1\]/ {n=split($3,a,";");for (i=2;i<=n;i++) print a[i];exit}' file
Jhon
Alex
Smith
How it works:
awk -vRS= -F"\n" '      # Setting RS to nothing makes one record equal one block; FS is set so one line is one field
/^\[Name_Z1\]/ {        # Search for the block with [Name_Z1]
    n=split($3,a,";")   # Split field 3, the names, and store the number of fields in variable n
    for (i=2;i<=n;i++)  # Loop from the second to the last field
        print a[i]      # Print the fields
    exit                # Exit after the first find
}' file
With updated data
cat file
data
[CAB_Z1_FUEGO]
READ_ONLY=false
FilterAttr=CeaseTime;blank|ObjectOfReference;contains;511047;512044;513008;593026;598326;CL5518;CL5521;CL5538;CL5612;CL5620|PerceivedSeverity;=;Critical;Major;Minor|ProbableCause;!=;HOUSE ALARM;IO DEVICE|ProblemText;contains;FUEGO
NAME=CAB_Z1_FUEGO
data
awk -vRS= -F"\n" '/^\[CAB_Z1_FUEGO\]/ {split($3,a,"|");n=split(a[2],b,";");for (i=3;i<=n;i++) print b[i]}' file
511047
512044
513008
593026
598326
CL5518
CL5521
CL5538
CL5612
CL5620
The following awk script will do what you want:
awk 's==1&&/^Names/{gsub("Names;","",$0);gsub(";","\n",$0);print}/^\[Name_Z1\]$/||/^OBJ=Name_Z1$/{s++}' inputFileName
In more detail:
s==1 && /^Names/ {
    gsub("Names;","",$0);
    gsub(";","\n",$0);
    print
}
/^\[Name_Z1\]$/ || /^OBJ=Name_Z1$/ {
    s++
}
The state s starts with a value of zero and is incremented whenever you find one of the two lines:
[Name_Z1]
OBJ=Name_Z1
That means, between the first set of those lines, s will be equal to one. That's where the other condition comes in. When s is one and you find a line starting with Names;, you do two substitutions.
The first is to get rid of the Names; at the front, the second is to replace all ; semi-colon characters with a newline. Then you print it out.
The output for your given test data is, as expected:
Jhon
Alex
Smith
