How can I change the awk delimiter for a part of my script? - bash

I have an input string that is formatted like this:
string1;string2"string3";string4
I want to parse this file to get the value of string3 using awk. To do this, I can first delimit by ;, print the second segment, and then delimit by " and print the second segment. Example using pipes:
$ echo 'string1;string2"string3";string4' | awk -F\; '{print $2}' | awk -F\" '{print $2}';
string3
I want to combine this into a single awk command, but I do not know how to change the field separator during my command. Is there syntax I can use in awk to change my separator?

You can use the split function inside awk:
s='string1;string2"string3";string4'
awk -F ';' 'split($2, a, /"/){print a[2]}' <<< "$s"
string3
As per the linked doc:
split(string, array [, fieldsep [, seps ] ])
Divide string into pieces separated by fieldsep and store the pieces in array and the separator strings in the seps array.
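As a minimal sketch of how the return value and the array fit together (extract_quoted is a hypothetical helper name, not part of the answer above): split returns the number of pieces it stored, so the script can check that a quoted part really exists before printing it. Only the three-argument form is used here, since the seps argument is a GNU awk extension.

```shell
# extract_quoted: hypothetical helper; prints the text between the first
# pair of double quotes found in the 2nd ;-separated field of its argument.
extract_quoted() {
  printf '%s\n' "$1" | awk -F';' '{
    n = split($2, parts, "\"")      # n = number of pieces stored in parts[]
    if (n >= 3) print parts[2]      # 3 or more pieces means two quotes were present
  }'
}

extract_quoted 'string1;string2"string3";string4'   # prints: string3
```

If the second field contains no quotes, split returns 1 and nothing is printed.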

Could you please try the following and let me know how it goes?
echo 'string1;string2"string3";string4' | awk -F'[;"]' '{print $3}'
Here the -F option sets the field separator to a bracket expression, so either ; or " acts as a delimiter. string3 then becomes the 3rd field, and your two awk commands merge into one. I hope this helps you.
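To see exactly where the fields fall with a two-character bracket expression, a small sketch like the following can help (the variable name fields is only for illustration). Note the empty 4th field, which sits between the closing " and the following ;.

```shell
# Show how the bracket expression [;"] splits the sample line.
fields=$(printf '%s\n' 'string1;string2"string3";string4' |
  awk -F'[;"]' '{ for (i = 1; i <= NF; i++) printf "%d=[%s]\n", i, $i }')
printf '%s\n' "$fields"
# prints:
# 1=[string1]
# 2=[string2]
# 3=[string3]
# 4=[]
# 5=[string4]
```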
EDIT: Apologies MODs/all, I am new to this site, so I am adding another alternative for this question's answer.
Thank you Questionmark, it encourages me. So in case you only have two occurrences of " in your string and you want to get rid of this delimiter, then the following could help you:
echo 'string1;string2"string3";string4' | awk '{match($0,/\".*\"/);print substr($0,RSTART+1,RLENGTH-2)}'
In the above code I am matching the regex with awk's match function. When the match succeeds, match sets the built-in variables RSTART (the position where the match starts) and RLENGTH (the length of the match), so substr can print the text between the two quotes. I hope this will help too.
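A short sketch that prints the two variables next to the extracted text may make this more concrete. It is an illustration only, and it deliberately uses "[^"]*" instead of the ".*" from the command above, so that the match stops at the first closing quote.

```shell
# Print RSTART, RLENGTH and the text between the quotes.
# When nothing matches, match sets RSTART to 0 and RLENGTH to -1,
# so the action is skipped.
info=$(printf '%s\n' 'string1;string2"string3";string4' |
  awk 'match($0, /"[^"]*"/) { print RSTART, RLENGTH, substr($0, RSTART + 1, RLENGTH - 2) }')
printf '%s\n' "$info"    # prints: 16 9 string3
```

The opening quote is the 16th character and the match, quotes included, is 9 characters long, which is why the substr call skips one character at each end.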

Related

How can I parse CSV files with quoted fields containing commas, in awk?

I have a big CSV field, and I use awk with the field separator set to a comma. However, some fields are quoted and contain a comma, and I'm facing this issue:
Original file:
Downloads $ cat testfile.csv
"aaa","bbb","ccc","dddd"
"aaa","bbb","ccc","d,dd,d"
"aaa","bbb","ccc","dd,d,d"
I am trying this way:
Downloads $ cat testfile.csv | awk -F "," '{ print $2","$3","$4 }'
"bbb","ccc","dddd"
"bbb","ccc","d
"bbb","ccc","dd
Expecting result:
"bbb","ccc","dddd"
"bbb","ccc","d,dd,d"
"bbb","ccc","dd,d,d"
I would use a tool that is able to properly parse CSV, such as xsv. With it, the command would look like
$ xsv select 2-4 testfile.csv
bbb,ccc,dddd
bbb,ccc,"d,dd,d"
bbb,ccc,"dd,d,d"
or, if you really want every value quoted, with a second step:
$ xsv select 2-4 testfile.csv | xsv fmt --quote-always
"bbb","ccc","dddd"
"bbb","ccc","d,dd,d"
"bbb","ccc","dd,d,d"
Include (escaped) quotes in your field separator flag, and add them to your output print fields:
awk -F "\",\"" '{print "\""$2"\",\""$3"\",\""$4}' testfile.csv
output:
"bbb","ccc","dddd"
"bbb","ccc","d,dd,d"
"bbb","ccc","dd,d,d"
If GNU awk (gawk) is available, you can make use of FPAT, which describes what the fields look like instead of what separates them.
awk -v FPAT='([^,]+)|(\"[^\"]+\")' -v OFS=, '{print $2, $3, $4}' testfile.csv
Result:
"bbb","ccc","dddd"
"bbb","ccc","d,dd,d"
"bbb","ccc","dd,d,d"
The string ([^,]+)|(\"[^\"]+\") is a regex pattern which matches either of:
([^,]+) ... matches a sequence of any characters other than a comma.
(\"[^\"]+\") ... matches a string enclosed by double quotes (which may include commas in between).
The parentheses around the patterns are there only for visual clarity; the regex works without them, as in FPAT='[^,]+|\"[^\"]+\"', because the alternation operator | has lower precedence.
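Where gawk is not available, the same "match the field, not the separator" idea can be approximated in POSIX awk with a match() loop. This is a rough sketch under stated assumptions, not a CSV parser: select_2_to_4 is a hypothetical helper name, and escaped quotes ("") or quoted fields spanning several lines are not handled.

```shell
# select_2_to_4: hypothetical helper that prints fields 2-4 of simple CSV
# lines, treating "..." (which may contain commas) as one field.
select_2_to_4() {
  awk '{
    split("", field)                 # clear fields left over from the previous line
    n = 0; rest = $0
    # Peel off one field at a time: a quoted string or a run of non-commas.
    while (match(rest, /^("[^"]*"|[^,]*)/)) {
      field[++n] = substr(rest, RSTART, RLENGTH)
      rest = substr(rest, RLENGTH + 1)
      if (rest == "") break          # end of line reached
      rest = substr(rest, 2)         # skip the separating comma
    }
    print field[2] "," field[3] "," field[4]
  }' "$@"
}

printf '%s\n' '"aaa","bbb","ccc","d,dd,d"' | select_2_to_4
# prints: "bbb","ccc","d,dd,d"
```

The helper reads standard input when called without arguments, or the files given as arguments, for example select_2_to_4 testfile.csv.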

Print part of a comma-separated field using AWK

I have a line containing this string:
$DLOAD , 123 , Loadcase name=SUBCASE_1
I am trying to only print SUBCASE_1. Here is my code, but I get a syntax error.
awk -F, '{n=split($3,a,"="); a[n]} {printf(a[1]}' myfile
How can I fix this?
1st solution: If you only want the last field (the text after the =), then with your shown samples please try the following:
awk -F',[[:space:]]+|=' '{print $NF}' Input_file
2nd solution: If you specifically want the value after = in the 3rd field, try the following awk code. It makes a comma followed by space(s) the field separator, splits the 3rd field on = into the array arr in the main program, and prints the 2nd element of arr.
awk -F',[[:space:]]+' '{split($3,arr,"=");print arr[2]}' Input_file
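For comparison, the attempt from the question can also be repaired with minimal changes. The sketch below keeps the original -F, and split call and only replaces the stray a[n] statement and the unbalanced printf( with a single print; last_part is just an illustrative variable name.

```shell
# The attempt from the question with its syntax error repaired:
# split the 3rd comma-separated field on "=" and print the last piece.
last_part=$(printf '%s\n' '$DLOAD , 123 , Loadcase name=SUBCASE_1' |
  awk -F, '{ n = split($3, a, "="); print a[n] }')
printf '%s\n' "$last_part"    # prints: SUBCASE_1
```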
Possibly the shortest solution would be:
awk -F= '{print $NF}' file
Where you simply use '=' as the field-separator and then print the last field.
Example Use/Output
Using your sample input in a heredoc with the sigil quoted to prevent expansion of $DLOAD, you would have:
$ awk -F= '{print $NF}' << 'eof'
> $DLOAD , 123 , Loadcase name=SUBCASE_1
> eof
SUBCASE_1
(of course in this case it probably doesn't matter whether $DLOAD was expanded or not, but for completeness, in case $DLOAD included another '=' ...)

using awk and gensub to remove the part in a string ending with "character+number+S"

My goal is to remove the end "1S" as well as the letter immediately before it, in this case "M". How do I achieve that? My non-working code :
echo "14M3856N61M1S" | gawk '{gensub(/([^(1S)]*)[a-zA-Z](1S$)/, "\\1", "g") ; print $0}'
>14M3856N61M1S
The desired results should be
>14M3856N61
Some additional information here. 1. I do not think substr will work here, since my actual target strings come in various lengths. 2. I prefer not to define a special delimiter, because this will be used together with "if" as part of an awk conditional operation while the delimiter is already defined globally.
Thank you in advance!
Why not use a simple substitution that matches the 1S at the end of the string together with the one character before it?
echo "14M3856N61M1S" | awk '{sub(/[[:alnum:]]{1}1S$/,"")}1'
14M3856N61
Here [[:alnum:]] is the POSIX character class that matches alphanumeric characters (digits and letters), and {1} means match exactly one. If you are sure that only letters can occur before the pattern 1S, replace [[:alnum:]] with [[:alpha:]].
To answer OP's question about putting the result in a separate variable: use match(), because sub() does not return the substituted string, only the number of substitutions made.
echo "14M3856N61M1S" | awk 'match($0,/[[:alnum:]]{1}1S$/){str=substr($0,1,RSTART-1); print str}'
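The difference is easy to check by printing what sub() returns next to the modified record. The sketch below is only an illustration; it uses [a-zA-Z] rather than a POSIX class, on the assumption that some older awk builds lack class support.

```shell
# sub() edits $0 in place and returns the number of substitutions (0 or 1).
out=$(printf '%s\n' '14M3856N61M1S' |
  awk '{ n = sub(/[a-zA-Z]1S$/, ""); print n, $0 }')
printf '%s\n' "$out"    # prints: 1 14M3856N61
```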
EDIT: As per OP's comment I am adding solutions where OP could get the result into a bash variable too as follows.
var=$(echo "14M3856N61M1S" | awk 'match($0,/[a-zA-Z]1S$/){print substr($0,1,RSTART-1)}' )
echo "$var"
14M3856N61
Could you please try the following too.
echo "14M3856N61M1S" | awk 'match($0,/[a-zA-Z]1S$/){$0=substr($0,1,RSTART-1)} 1'
14M3856N61
Explanation of above command:
echo "14M3856N61M1S" | ##Print the sample string with echo and pipe its standard output to awk as standard input.
awk ' ##Start the awk program.
match($0,/[a-zA-Z]1S$/){ ##Use the match function of awk to find 1S at the end of the line, preceded by a letter (lower or upper case).
$0=substr($0,1,RSTART-1) ##If a match is found, rebuild the current line from character 1 up to RSTART-1; RSTART and RLENGTH are built-in variables that match sets.
} ##Close the match block.
1' ##The 1 prints each line, whether it was edited or not.
echo "14M3856N61M1S" | awk -F '.1S$' '{print $1}'
Output:
14M3856N61

Using awk to split line with multiple string delimiters

I have a file called pet_owners.txt that looks like:
petOwner:Jane,petName:Fluffy,petType:cat
petOwner:John,petName:Oreo,petType:dog
...
petOwner:Jake,petName:Lucky,petType:dog
I'd like to use awk to split the file using the delimiters: 'petOwner', 'petName', and 'petType' so that I can extract the pet owners and pet types. My desired output is:
Jane,cat
John,dog
...
Jake,dog
So far I've tried:
awk < pet_owners.txt -F'['petOwner''petName''petType']' '{print $1 $3}'
but the result is a bunch of newlines.
Any ideas for how I can achieve this?
$ awk -F'[:,]' -v OFS=',' '{print $2,$6}' file
Jane,cat
John,dog
Jake,dog
As for why your attempt wasn't working, mainly it's because [ and ] in the context of a regular expression are the "bracket expression" delimiters and what goes inside that is a set of characters (which may be individual characters, ranges, lists, and/or classes) so when you wrote:
-F'['petOwner''petName''petType']'
that would set FS to the set of characters p, e, t, etc., not the set of strings petOwner, etc. The multiple internal 's cancel each other out as you jump in and out of shell quoting for no reason, exactly as if you had written -F'[petOwnerpetNamepetType]', given that there are no metacharacters in there that the shell would expand.
To set FS to a set of strings (actually regexps, so watch out for metachars) you would write:
-F'petOwner|petName|petType'
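Printing every field shows what that separator leaves behind. This is a small sketch for illustration (the variable name fields is arbitrary): the first field is empty because the line starts with a separator, and the colons and commas remain part of the fields, so they still have to be stripped or folded into the separator regexp.

```shell
# Show the fields produced when FS is an alternation of whole strings.
fields=$(printf '%s\n' 'petOwner:Jane,petName:Fluffy,petType:cat' |
  awk -F'petOwner|petName|petType' '{ for (i = 1; i <= NF; i++) printf "%d=[%s]\n", i, $i }')
printf '%s\n' "$fields"
# prints:
# 1=[]
# 2=[:Jane,]
# 3=[:Fluffy,]
# 4=[:cat]
```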
You can also write the delimiters in this form instead of a character set:
$ awk -F'pet(Owner|Name|Type):' '{print $2,$4}' file
Jane, cat
John, dog
Jake, dog
You can also define what a field is, instead of defining what a separator is. For that you use the FPAT variable (a GNU awk extension), like this:
~ $ awk '{ print $2,$6 }' FPAT="[^,:]+" OFS="," pet_owners.txt
Jane,cat
John,dog
That way you're defining as a field everything that is not a comma or a colon.
Sometimes it makes programs easier.
OFS sets the output field separator to a comma.

Trim leading and trailing spaces from a string in awk

I'm trying to remove leading and trailing space in 2nd column of the below input.txt:
Name, Order  
Trim, working
cat,cat1
I have used the below awk to remove leading and trailing space in 2nd column but it is not working. What am I missing?
awk -F, '{$2=$2};1' input.txt
This gives the output as:
Name, Order  
Trim, working
cat,cat1
Leading and trailing spaces are not removed.
If you want to trim all spaces, only in lines that have a comma, and use awk, then the following will work for you:
awk -F, '/,/{gsub(/ /, "", $0); print} ' input.txt
If you only want to remove spaces in the second column, change the expression to
awk -F, '/,/{gsub(/ /, "", $2); print$1","$2} ' input.txt
Note that gsub replaces every match of the regular expression between the slashes with the second argument, inside the variable given as the third argument, and does so in place; in other words, when it's done, $0 (or $2) has been modified.
Full explanation:
-F,            use comma as field separator
               (so the thing before the first comma is $1, etc)
/,/            operate only on lines with a comma
               (this means empty lines are skipped)
gsub(a,b,c)    match the regular expression a, replace it with b,
               and do all this with the contents of c
print$1","$2   print the contents of field 1, a comma, then field 2
input.txt      use input.txt as the source of lines to process
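The in-place behaviour and the return value of gsub can be observed together in a short sketch. It is an illustration only, and it adds -v OFS=, (not used in the commands above) because assigning to a field makes awk rebuild $0 with OFS, which would otherwise be a single space.

```shell
# gsub returns the number of replacements and modifies $2 in place;
# changing $2 makes awk rebuild $0, joining the fields with OFS.
out=$(printf '%s\n' 'Name, Order  ' |
  awk -F, -v OFS=, '{ n = gsub(/ /, "", $2); print n ": " $0 }')
printf '%s\n' "$out"    # prints: 3: Name,Order
```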
EDIT: I want to point out that @BMW's solution is better, as it actually trims only leading and trailing spaces with two successive gsub commands. Whilst giving credit I will give an explanation of how it works.
gsub(/^[ \t]+/,"",$2);   - starting at the beginning (^), replace all (+ = one or more, greedy)
                           consecutive tabs and spaces with an empty string
gsub(/[ \t]+$/,"",$2)}   - do the same, but now for all whitespace up to the end of the string ($)
1                        - ="true". Shorthand for "use default action", which is print $0,
                           that is, print the entire (modified) line
Remove leading and trailing white space in the 2nd column:
awk 'BEGIN{FS=OFS=","}{gsub(/^[ \t]+/,"",$2);gsub(/[ \t]+$/,"",$2)}1' input.txt
Another way, with a single gsub:
awk 'BEGIN{FS=OFS=","} {gsub(/^[ \t]+|[ \t]+$/, "", $2)}1' infile
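The same single-gsub idea can be wrapped in a user-defined function so that it is easy to apply to any column. This is a sketch under that assumption; trim is a hypothetical helper name, and the sample line is made up to show that spaces inside the column survive.

```shell
# trim: hypothetical awk function that strips leading and trailing
# spaces and tabs from its argument and returns the result.
trimmed=$(printf '%s\n' 'Name,  two words  ,x' |
  awk 'function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
       BEGIN { FS = OFS = "," }
       { $2 = trim($2) } 1')
printf '%s\n' "$trimmed"    # prints: Name,two words,x
```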
Warning by @Geoff: see my note below, only one of the suggestions in this answer works (though on both columns).
I would use sed:
sed 's/, /,/' input.txt
This removes one leading space after the first , on each line.
Output:
Name,Order
Trim,working
cat,cat1
More general might be the following, which handles a space or a tab after every , on the line. Note that \? matches at most one such character; to remove several, use * in place of \?:
sed 's/,[ \t]\?/,/g' input.txt
It will also work with more than two columns because of the global modifier /g
@Floris asked in discussion for a solution that removes leading and trailing whitespace in each column (even the first and last) while not removing whitespace in the middle of a column:
sed 's/[ \t]\?,[ \t]\?/,/g; s/^[ \t]\+//g; s/[ \t]\+$//g' input.txt
*EDIT by @Geoff: I've appended the input file name to this one, and now it removes only leading & trailing spaces (though from both columns). The other suggestions within this answer don't work. But try: " Multiple spaces , and 2 spaces before here " *
IMO sed is the optimal tool for this job. However, here comes a solution with awk because you've asked for that:
awk -F', ' '{printf "%s,%s\n", $1, $2}' input.txt
Another simple solution that comes to mind to remove all whitespace is tr -d:
cat input.txt | tr -d ' '
I just came across this. The correct answer is:
awk 'BEGIN{FS=OFS=","} {gsub(/^[[:space:]]+|[[:space:]]+$/,"",$2)} 1'
Just use a regex as a separator:
', *' - for leading spaces
' *,' - for trailing spaces
for both leading and trailing:
awk -F' *,? *' '{print $1","$2}' input.txt
The simplest solution is probably to use tr:
$ cat -A input
^I Name, ^IOrder $
Trim, working $
cat,cat1^I
$ tr -d '[:blank:]' < input | cat -A
Name,Order$
Trim,working$
cat,cat1
The following seems to work:
awk -F',[[:blank:]]*' '{$2=$2}1' OFS="," input.txt
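The reason the $2=$2 assignment matters here is that awk only rebuilds the record with OFS when a field is assigned. A minimal sketch comparing both cases follows; it uses ', *' as the separator instead of the [[:blank:]] class, on the assumption that some older awk builds lack POSIX classes, and the variable names are only for illustration.

```shell
# With the assignment, $0 is rebuilt from the fields joined by OFS;
# without it, the original line is printed untouched.
with=$(printf '%s\n' 'Trim, working' | awk -F', *' -v OFS=',' '{ $2 = $2 } 1')
without=$(printf '%s\n' 'Trim, working' | awk -F', *' -v OFS=',' '1')
printf '%s\n' "$with" "$without"
# prints:
# Trim,working
# Trim, working
```

Since the spaces after the comma are part of the separator, they disappear as soon as the record is rebuilt.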
If it is safe to assume only one set of spaces in column two (which is the original example):
awk '{print $1$2}' /tmp/input.txt
Adding another field, e.g. awk '{print $1$2$3}' /tmp/input.txt will catch two sets of spaces (up to three words in column two), and won't break if there are fewer.
If you have an indeterminate (large) number of space delimited words, I'd use one of the previous suggestions, otherwise this solution is the easiest you'll find using awk.
