Using awk to split line with multiple string delimiters - bash

I have a file called pet_owners.txt that looks like:
petOwner:Jane,petName:Fluffy,petType:cat
petOwner:John,petName:Oreo,petType:dog
...
petOwner:Jake,petName:Lucky,petType:dog
I'd like to use awk to split the file using the delimiters: 'petOwner', 'petName', and 'petType' so that I can extract the pet owners and pet types. My desired output is:
Jane,cat
John,dog
...
Jake,dog
So far I've tried:
awk < pet_owners.txt -F'['petOwner''petName''petType']' '{print $1 $3}'
but the result is a bunch of newlines.
Any ideas for how I can achieve this?

$ awk -F'[:,]' -v OFS=',' '{print $2,$6}' file
Jane,cat
John,dog
Jake,dog
As for why your attempt wasn't working: [ and ] in a regular expression are the "bracket expression" delimiters, and what goes inside them is a set of characters (which may be individual characters, ranges, lists, and/or classes). So when you wrote:
-F'['petOwner''petName''petType']'
that set FS to the set of characters p, e, t, etc., not the set of strings petOwner, etc. The multiple internal 's cancel each other out as you jump in and out of shell quoting for no reason, so it's exactly as if you had written -F'[petOwnerpetNamepetType]', given there are no metacharacters in there that the shell would expand.
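To see concretely what that character-set FS does, here is a quick sketch: every letter inside the brackets becomes a single-character separator, so the line dissolves into a pile of mostly empty fields, and $1 and $3 are both empty, hence the blank lines you saw.
$ echo 'petOwner:Jane,petName:Fluffy,petType:cat' | awk -F'[petOwnerpetNamepetType]' '{print NF; print "[" $1 "][" $3 "]"}'
29
[][]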
To set FS to a set of strings (actually regexps, so watch out for metachars), you would use:
-F'petOwner|petName|petType'
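One way to put that to work on this particular file is the sketch below; it folds the colons and commas into the separators so the remaining fields are just the values (the field numbers shift accordingly):
$ awk -F'petOwner:|,petName:|,petType:' -v OFS=',' '{print $2, $4}' pet_owners.txt
Jane,cat
John,dog
Jake,dog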

You can also write the delimiters in this form instead of as a character set:
$ awk -F'pet(Owner|Name|Type):' '{print $2,$4}' file
Jane, cat
John, dog
Jake, dog

You can also define what a field is, instead of defining what a separator is. For that you use the FPAT variable, like this:
~ $ awk '{ print $2,$6 }' FPAT="[^,:]+" OFS="," pet_owners.txt
Jane,cat
John,dog
That way you're defining as a field everything that is not a comma or a colon.
Sometimes it makes programs easier.
OFS sets the output field separator to a comma.

Related

How can I parse CSV files with quoted fields containing commas, in awk?

I have a big CSV file, and I use awk with the field separator set to a comma. However, some fields are quoted and contain a comma, and I'm facing this issue:
Original file:
Downloads $ cat testfile.csv
"aaa","bbb","ccc","dddd"
"aaa","bbb","ccc","d,dd,d"
"aaa","bbb","ccc","dd,d,d"
I am trying this way:
Downloads $ cat testfile.csv | awk -F "," '{ print $2","$3","$4 }'
"bbb","ccc","dddd"
"bbb","ccc","d
"bbb","ccc","dd
Expecting result:
"bbb","ccc","dddd"
"bbb","ccc","d,dd,d"
"bbb","ccc","dd,d,d"
I would use a tool that is able to properly parse CSV, such as xsv. With it, the command would look like
$ xsv select 2-4 testfile.csv
bbb,ccc,dddd
bbb,ccc,"d,dd,d"
bbb,ccc,"dd,d,d"
or, if you really want every value quoted, with a second step:
$ xsv select 2-4 testfile.csv | xsv fmt --quote-always
"bbb","ccc","dddd"
"bbb","ccc","d,dd,d"
"bbb","ccc","dd,d,d"
Include (escaped) quotes in your field separator flag, and add them to your output print fields:
cat testfile.csv | awk -F "\",\"" '{print "\""$2"\",\""$3"\",\""$4}'
output:
"bbb","ccc","dddd"
"bbb","ccc","d,dd,d"
"bbb","ccc","dd,d,d"
If gawk (GNU awk) is available, you can make use of FPAT, which matches the fields, instead of splitting on field separators.
awk -v FPAT='([^,]+)|(\"[^\"]+\")' -v OFS=, '{print $2, $3, $4}' testfile.csv
Result:
"bbb","ccc","dddd"
"bbb","ccc","d,dd,d"
"bbb","ccc","dd,d,d"
The string ([^,]+)|(\"[^\"]+\") is a regex pattern which matches either of:
([^,]+) ... matches a sequence of any characters other than a comma.
(\"[^\"]+\") ... matches a string enclosed by double quotes (which may include commas in between).
The parentheses around the patterns are only there for visual clarity; the regex works without them, e.g. FPAT='[^,]+|\"[^\"]+\"', because the alternation operator | has lower precedence.
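If it helps to see how FPAT carves up one of the tricky lines, here is a quick check (gawk only) that prints every field the pattern produces:
$ echo '"aaa","bbb","ccc","d,dd,d"' | awk -v FPAT='([^,]+)|(\"[^\"]+\")' '{for (i = 1; i <= NF; i++) print i, $i}'
1 "aaa"
2 "bbb"
3 "ccc"
4 "d,dd,d"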

Print part of a comma-separated field using AWK

I have a line containing this string:
$DLOAD , 123 , Loadcase name=SUBCASE_1
I am trying to only print SUBCASE_1. Here is my code, but I get a syntax error.
awk -F, '{n=split($3,a,"="); a[n]} {printf(a[1]}' myfile
How can I fix this?
1st solution: In case you want to get only the last field (which contains the = in it), then with your shown samples please try the following:
awk -F',[[:space:]]+|=' '{print $NF}' Input_file
2nd solution: Or, in case you specifically want the 3rd field's value after the =, then try the following awk code. It simply makes a comma followed by space(s) the field separator, and in the main program splits the 3rd field on =, storing the pieces in the arr array and printing the 2nd element of arr.
awk -F',[[:space:]]+' '{split($3,arr,"=");print arr[2]}' Input_file
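As a quick check without a file, echoing the sample line directly, the 2nd solution should produce:
$ echo '$DLOAD , 123 , Loadcase name=SUBCASE_1' | awk -F',[[:space:]]+' '{split($3,arr,"="); print arr[2]}'
SUBCASE_1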
Possibly the shortest solution would be:
awk -F= '{print $NF}' file
Where you simply use '=' as the field-separator and then print the last field.
Example Use/Output
Using your sample input in a heredoc with the sigil quoted to prevent expansion of $DLOAD, you would have:
$ awk -F= '{print $NF}' << 'eof'
> $DLOAD , 123 , Loadcase name=SUBCASE_1
> eof
SUBCASE_1
(of course in this case it probably doesn't matter whether $DLOAD was expanded or not, but for completeness, in case $DLOAD included another '=' ...)

How to replace text in a specific line between the first and second appearance of a symbol

So, let's say I have some text file like this:
aaaa:bbbb:cccc:dddd
eeee:ffff:gggg:hhhh
iiii:jjjj:kkkk:llll
and I need a command that lets me replace what is between the first and second : on a given line.
I managed to do something like this but it's obviously just adding the text in the middle: sed $lineNumber' s/:/:'$pass'/' users.txt
The result of the command should be something like this, if I want to replace what is between the first and second ":" of the second line with "asd":
aaaa:bbbb:cccc:dddd
eeee:asd:gggg:hhhh
iiii:jjjj:kkkk:llll
A job for awk:
awk -v col="2" -v row="2" -v sep=":" -v new="asd" 'BEGIN{FS=OFS=sep} NR==row{$col=new} {print}' file
or
awk 'NR==row{$col=new}1' col='2' row='2' FS=':' OFS=':' new='asd' file
Output:
aaaa:bbbb:cccc:dddd
eeee:asd:gggg:hhhh
iiii:jjjj:kkkk:llll
See: 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR
Use a regular expression that matches the entire part that you want to replace, e.g.,
sed "$lineNumber s/:[^:]\+/:$pass/" users.txt
# ^^^^^^ = not : one or more times
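For example, with the variables from the question set in the shell (GNU sed, since \+ is a GNU extension to basic regular expressions), this should give the desired output:
$ lineNumber=2 pass=asd
$ sed "$lineNumber s/:[^:]\+/:$pass/" users.txt
aaaa:bbbb:cccc:dddd
eeee:asd:gggg:hhhh
iiii:jjjj:kkkk:llll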
If you want shell variables to be expanded, use double quotes in the sed command:
sed "$var1 s/:/:$var2/" file

How can I change the awk delimiter for a part of my script?

I have an input string that is formatted like this:
string1;string2"string3";string4
I want to parse this file to get the value of string3 using awk. To do this, I can first delimit by ;, print the second segment, and then delimit by " and print the second segment. Example using pipes:
$ echo 'string1;string2"string3";string4' | awk -F\; '{print $2}' | awk -F\" '{print $2}';
string3
I want to combine this into a single awk command, but I do not know how to change the field separator during my command. Is there syntax I can use in awk to change my separator?
You can use the split() function inside awk:
s='string1;string2"string3";string4'
awk -F ';' 'split($2, a, /"/){print a[2]}' <<< "$s"
string3
As per the documentation for split():
split(string, array [, fieldsep [, seps ] ])
Divide string into pieces separated by fieldsep and store the pieces in array and the separator strings in the seps array.
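A quick sketch of why split() works directly as the pattern here: its return value is the number of pieces, which is nonzero (i.e. true) whenever $2 is non-empty, so the print action runs:
$ echo 'string1;string2"string3";string4' | awk -F ';' '{n = split($2, a, /"/); print n, a[1], a[2]}'
3 string2 string3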
Could you please try the following and let me know how it goes.
echo 'string1;string2"string3";string4' | awk -F'[;"]' '{print $3}'
The above creates multiple delimiters via awk's -F option, setting the delimiter characters to ; and ", so string3 becomes the 3rd field, and you can merge your two awk invocations into one this way. I hope this helps you.
EDIT: Adding another alternative for this question. In case you only have two occurrences of " in your string and you want to get rid of this delimiter, the following could help you:
echo 'string1;string2"string3";string4' | awk '{match($0,/\".*\"/);print substr($0,RSTART+1,RLENGTH-2)}'
In the above code I am matching the regex using awk's match function; when the regex matches, the built-in variables RSTART and RLENGTH are set (the start position and length of the match), and substr() then prints the text between the two quotes. I hope this helps too.
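To make RSTART and RLENGTH concrete, here is what match() sets for this input (the quotes sit at columns 16 and 24, so the match starts at 16 and is 9 characters long, and substr() then trims one quote from each end):
$ echo 'string1;string2"string3";string4' | awk '{match($0,/\".*\"/); print RSTART, RLENGTH}'
16 9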

How to get word count of a part of a line

The lines of the file look something like this:
<some character> ||| each line. So far i can get the total number of lines and the text for each on its own line ||| <some text>
Now I want to count the number of words in between the |||.
What I intended to do is
awk -F '|||' '{print $2}' word_file | wc -l
but it gives blank output from the awk part, which suggests it is not treating ||| as I want (i.e. as a delimiter); interestingly, if I use $1 instead of $2, it prints the whole text.
However, if I use ' ||| ' (i.e. with a space before and after), it gives me some output but does not treat the sentence between the two delimiters as one field, i.e. it prints 'each' instead of the whole sentence if I use the following:
awk -F ' ||| ' '{print $2}' word_file
How do I achieve this using a bash command?
FYI: awk version is GNU Awk 4.0.1
Awk's -F option, which sets FS, the input-field separator, expects a regular expression as its value.
Thus, for ||| to be interpreted as a literal, you must \-escape the | chars, which are metacharacters in a regex context.
Given that Awk also accepts \-based escape sequences in string literals, you must double the \ instances:
awk -F '\\|\\|\\|' ...
To properly count the words (defined as whitespace-separated tokens) in field 2, you can try this:
awk -F '\\|\\|\\|' 'BEGIN { orgFS=FS } { FS=" "; $0 = $2; print NF; FS=orgFS }' word_file
This splits each input line into fields by literal |||.
By temporarily setting FS to a single space - which is a magic value that tells Awk to split into fields by any nonempty run of whitespace - we can assign $2, the value of field 2, to $0, the whole input line, which causes the new value of $0 to be split into fields again.
At that point NF reflects the number of fields in what was originally the 2nd field - i.e., the number of words - and we can print that.
Restoring FS to its original value then prepares for parsing the next input line.
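For example, with the sample line shown in the question, the middle section holds 21 whitespace-separated tokens ("line." counts as one token), so the command should print 21 for that record:
$ awk -F '\\|\\|\\|' 'BEGIN { orgFS=FS } { FS=" "; $0 = $2; print NF; FS=orgFS }' word_file
21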
With gawk's multi-char RS support, this might be easier:
$ awk -v RS="\\\|\\\|\\\|" 'NR==2{print NF}' file
Or, if you're not sure how to escape the pipe, it's perhaps cleaner with:
$ awk -v RS='[|]{3}' ...
