Awk consider double quoted string as one token and ignore space in between - bash

Data file - data.txt:
ABC "I am ABC" 35 DESC
DEF "I am not ABC" 42 DESC
cat data.txt | awk '{print $2}'
will result the "I" instead of the string being quoted
How to make awk so that it ignore the space within the quote and think that it is one single token?

Another alternative would be to use the FPAT variable, that defines a regular expression describing the contents of each field.
Save this AWK script as parse.awk:
#!/bin/awk -f
BEGIN {
FPAT = "([^ ]+)|(\"[^\"]+\")"
}
{
print $2
}
Make it executable with chmod +x ./parse.awk and parse your data file as ./parse.awk data.txt:
"I am ABC"
"I am not ABC"

Yes, this can be done nicely in awk. It's easy to get all the fields without any serious hacks.
(This example works in both The One True Awk and in gawk.)
{
split($0, a, "\"")
$2 = a[2]
$3 = $(NF - 1)
$4 = $NF
print "and the fields are ", $1, "+", $2, "+", $3, "+", $4
}

Try this:
$ cat data.txt | awk -F\" '{print $2}'
I am ABC
I am not ABC

The top answer for this question only works for lines with a single quoted field. When I found this question I needed something that could work for an arbitrary number of quoted fields.
Eventually I came upon an answer by Wintermute in another thread, and he provided a good generalized solution to this problem. I've just modified it to remove the quotes. Note that you need to invoke awk with -F\" when running the below program.
BEGIN { OFS = "" } {
for (i = 1; i <= NF; i += 2) {
gsub(/[ \t]+/, ",", $i)
}
print
}
This works by observing that every other element in the array will be inside of the quotes when you separate by the "-character, and so it replaces the whitespace dividing the ones not in quotes with a comma.
You can then easily chain another instance of awk to do whatever processing you need (just use the field separator switch again, -F,).
Note that this might break if the first field is quoted - I haven't tested it. If it does, though, it should be easy to fix by adding an if statement to start at 2 rather than 1 if the first character of the line is a ".

I've scrunched up together a function that re-splits $0 into an array called B. Spaces between double quotes are not acting as field separators. Works with any number of fields, a mix of quoted and unquoted ones. Here goes:
#!/usr/bin/gawk -f
# Resplit $0 into array B. Spaces between double quotes are not separators.
# Single quotes not handled. No escaping of double quotes.
function resplit( a, l, i, j, b, k, BNF) # all are local variables
{
l=split($0, a, "\"")
BNF=0
delete B
for (i=1;i<=l;++i)
{
if (i % 2)
{
k=split(a[i], b)
for (j=1;j<=k;++j)
B[++BNF] = b[j]
}
else
{
B[++BNF] = "\""a[i]"\""
}
}
}
{
resplit()
for (i=1;i<=length(B);++i)
print i ": " B[i]
}
Hope it helps.

Okay, if you really want all three fields, you can get them, but it takes a lot of piping:
$ cat data.txt | awk -F\" '{print $1 "," $2 "," $3}' | awk -F' ,' '{print $1 "," $2}' | awk -F', ' '{print $1 "," $2}' | awk -F, '{print $1 "," $2 "," $3}'
ABC,I am ABC,35
DEF,I am not ABC,42
By the last pipe you've got all three fields to do whatever you'd like with.

Here is something like what I finally got working that is more generic for my project.
Note it doesn't use awk.
someText="ABC \"I am ABC\" 35 DESC '1 23' testing 456"
putItemsInLines() {
local items=""
local firstItem="true"
while test $# -gt 0; do
if [ "$firstItem" == "true" ]; then
items="$1"
firstItem="false"
else
items="$items
$1"
fi
shift
done
echo "$items"
}
count=0
while read -r valueLine; do
echo "$count: $valueLine"
count=$(( $count + 1 ))
done <<< "$(eval putItemsInLines $someText)"
Which outputs:
0: ABC
1: I am ABC
2: 35
3: DESC
4: 1 23
5: testing
6: 456

Related

Processing text with multiple delims in awk

I have a text which looks like -
Application.||dates:[2022-11-12]|models:[MODEL1]|count:1|ids:2320
Application.||dates:[2022-11-12]|models:[MODEL1]|count:5|ids:2320
I want the number from the count:1 columns so 1 and i wish to store these numbers in an array.
nums=($(echo -n "$grepResult" | awk -F ':' '{ print $4 }' | awk -F '|' '{ print $1 }'))
this seems very repetitive and not very efficient, any ideas how to simplify this ?
You can use awk once, set the field separator to |. Then loop all the fields and split on :
If the field starts with count then print the second part of the splitted value.
This way the count: part can occur anywhere in the string and can possibly print this multiple times.
nums=($(echo -n "$grepResult" | awk -F'|' '
{
for(i=1; i<=NF; i++) {
split($i, a, ":")
if (a[1] == "count") {
print a[2]
}
}
}
'))
for i in "${nums[#]}"
do
echo "$i"
done
Output
1
5
If you want to combine the both split values, you can use [|:] as a character class and print field number 8 for a precise match as mentioned in the comments.
Note that it does not check if it starts with count:
nums=($(echo -n "$grepResult" | awk -F '[|:]' '{print $8}'))
With gnu awk you can use a capture group to get a bit more precise match where on the left and right can be either the start/end of string or a pipe char. The 2nd group matches 1 or more digits:
nums=($(echo -n "$grepResult" | awk 'match($0, /(^|\|)count:([0-9]+)(\||$)/, a) {print a[2]}' ))
Try sed
nums=($(sed 's/.*count://;s/|.*//' <<< "$grepResult"))
Explanation:
There are two sed commands separated with ; symbol.
First command 's/.*count://' remove all characters till 'count:' including it.
Second command 's/|.*//' remove all characters starting from '|' including it.
Command order is important here.

Appending result of function on another field into csv using shell script, awk

I have a csv file stored as a temporary variable in a shell script (*.sh).
Let's say the data looks like this:
Account,Symbol,Price
100,AAPL US,200
102,SPY US,500
I want to add a fourth column, "Type", which is the result of a shell function "foobar". Run from the command line or a shell script itself:
$ foobar "AAPL US"
"Stock"
$ foobar "SPY US"
"ETF"
How do I add this column to my csv, and populate it with calls to foobar which take the second column as an argument? To clarify, this is my ideal result post-script:
Account,Symbol,Price,Type
100,AAPL US,200,Common Stock
102,SPY US,500,ETF
I see many examples online involving such a column addition using awk, and populating the new column with fixed values, conditional values, mathematical derivations from other columns, etc. - but nothing that calls a function on another field and stores its output.
You may use this awk:
export -f foobar
awk 'BEGIN{FS=OFS=","} NR==1{print $0, "Type"; next} {
cmd = "foobar \"" $2 "\""; cmd | getline line; close(cmd);
print $0, line
}' file.csv
Account,Symbol,Price,Type
100,AAPL US,200,Common Stock
102,SPY US,500,ETF
#anubhavas answer is a good approach so please don't change the accepted answer as I'm only posting this as an answer as it's too big and in need of formatting to fit in a comment.
FWIW I'd write his awk script as:
awk '
BEGIN { FS=OFS="," }
NR==1 { type = "Type" }
NR > 1 {
cmd = "foobar \047" $2 "\047"
type = ((cmd | getline line) > 0 ? line : "ERROR")
close(cmd)
}
{ print $0, type }
' file.csv
to:
better protect $2 from shell expansion, and
protect from silently printing the previous value if/when cmd | getline fails, and
consolidate the print statements to 1 line so it's easy to change for all output lines if/when necessary
awk to the rescue!
$ echo "Account,Symbol,Price
100,AAPL US,200
102,SPY US,500" |
awk -F, 'NR>1{cmd="foobar "$2; cmd | getline type} {print $0 FS (NR==1?"Type":type)}'
Not sure you need to quote the input to foobar
Another way not using awk:
paste -d, input.csv <({ read; printf "Type\n"; while IFS=, read -r _ s _; do foobar "$s"; done; } < input.csv)

How to split string by a delimiter in unix

I have the following string and I want to split it into 3 parts:
Text:
<http://rdf.freebase.com/ns/american_football.football_player.footballdb_id> <http://www.w3.org/2000/01/rdf-schema#label> "footballdb ID"#en
Output should be
$1 = <http://rdf.freebase.com/ns/american_football.football_player.footballdb_id>
$2 = <http://www.w3.org/2000/01/rdf-schema#label>
$3 = "footballdb ID"#en
basically an splitting a RDF'sh tuple into its parts.
I want to do this via a UNIX script , but I do not know sed or awk.
Please help.
If your input fields are tab-separated, this will produce your posted desired output:
$ awk -F'\t' '{ for (i=1;i<=NF;i++) printf "$%d = %s\n", i, $i }' file
$1 = <http://rdf.freebase.com/ns/american_football.football_player.footballdb_id>
$2 = <http://www.w3.org/2000/01/rdf-schema#label>
$3 = "footballdb ID"#en
Alternatively this might be what you want if your fields are not tab-separated:
$ cat tst.awk
{
gsub(/<[^>]+>/,"&\n")
split($0,a,/[[:space:]]*\n[[:space:]]*/)
for (i=1; i in a; i++)
printf "$%d = %s\n", i, a[i]
}
$
$ awk -f tst.awk file
$1 = <http://rdf.freebase.com/ns/american_football.football_player.footballdb_id>
$2 = <http://www.w3.org/2000/01/rdf-schema#label>
$3 = "footballdb ID"#en
If that's not how your input fields are separated and/or not what you want output, update your question to clarify.
read A B C <<< $string
echo -e "\$1 = $A\n\$2 = $B\n\$3 = $C"
Output:
$1 = <http://rdf.freebase.com/ns/american_football.football_player.footballdb_id>
$2 = <http://www.w3.org/2000/01/rdf-schema#label>
$3 = "footballdb ID"#en
Whatever you use to split the string needs to recognize not only the white space but also the convention that the double quote "protects" the blank space before ID and prevents it from splitting the fields. I fear this computation may be beyond what is possible with sed. You could do it in awk, but awk provides little special advantage here.
You show a space-separated format with quotes. A similar problem is to parse comma-separated format with quotes. Related questions:
Parse CSV with double quote in some cases
How to split csv whose columns may contain ,
echo "your string" |awk -F" " '{ print $1 $2 $3 $4}'
awk '{ print "$1 = " $1 "\n$2 = " $2 "\n$3 = " $3 }' filename

Modify content inside quotation marks, BASH

Good day to all,
I was wondering how to modify the content inside quotation marks and left unmodified the outside.
Input line:
,,,"Investigacion,,, desarrollo",,,
Output line:
,,,"Investigacion, desarrollo",,,
Initial try:
sed 's/\"",,,""*/,/g'
But nothing happens, thanks in advance for any clue
The idiomatic awk way to do this is simply:
$ awk 'BEGIN{FS=OFS="\""} {sub(/,+/,",",$2)} 1' file
,,,"Investigacion, desarrollo",,,
or if you can have more than one set of quoted strings on each line:
$ cat file
,,,"Investigacion,,, desarrollo",,,"foo,,,,bar",,,
$ awk 'BEGIN{FS=OFS="\""} {for (i=2;i<=NF;i+=2) sub(/,+/,",",$i)} 1' file
,,,"Investigacion, desarrollo",,,"foo,bar",,,
This approach works because everything up to the first " is field 1, and everything from there to the second " is field 2 and so on so everything between "s is the even-numbered fields. It can only fail if you have newlines or escaped double quotes inside your fields but that'd affect every other possible solution too so you'd need to add cases like that to your sample input if you want a solution that handles it.
Using a language that has built-in CSV parsing capabilities like perl will help.
perl -MText::ParseWords -ne '
print join ",", map { $_ =~ s/,,,/,/; $_ } parse_line(",", 1, $_)
' file
,,,"Investigacion, desarrollo",,,
Text::ParseWords is a core module so you don't need to download it from CPAN. Using the parse_line method we set the delimiter and a flag to keep the quotes. Then just do simple substitution and join the line to make your CSV again.
Using egrep, sed and tr:
s=',,,"Investigacion,,, desarrollo",,,'
r=$(egrep -o '"[^"]*"|,' <<< "$s"|sed '/^"/s/,\{2,\}/,/g'|tr -d "\n")
echo "$r"
,,,"Investigacion, desarrollo",,,
Using awk:
awk '{ p = ""; while (match($0, /"[^"]*,{2,}[^"]*"/)) { t = substr($0, RSTART, RLENGTH); gsub(/,+/, ",", t); p = p substr($0, 1, RSTART - 1) t; $0 = substr($0, RSTART + RLENGTH); }; $0 = p $0 } 1'
Test:
$ echo ',,,"Investigacion,,, desarrollo",,,' | awk ...
,,,"Investigacion, desarrollo",,,
$ echo ',,,"Investigacion,,, desarrollo",,,",,, "' | awk ...
,,,"Investigacion, desarrollo",,,", "

Print a comma except on the last line in Awk

I have the following script
awk '{printf "%s", $1"-"$2", "}' $a >> positions;
where $a stores the name of the file. I am actually writing multiple column values into one row. However, I would like to print a comma only if I am not on the last line.
Single pass approach:
cat "$a" | # look, I can use this in a pipeline!
awk 'NR > 1 { printf(", ") } { printf("%s-%s", $1, $2) }'
Note that I've also simplified the string formatting.
Enjoy this one:
awk '{printf t $1"-"$2} {t=", "}' $a >> positions
Yeh, looks a bit tricky at first sight. So I'll explain, first of all let's change printf onto print for clarity:
awk '{print t $1"-"$2} {t=", "}' file
and have a look what it does, for example, for file with this simple content:
1 A
2 B
3 C
4 D
so it will produce the following:
1-A
, 2-B
, 3-C
, 4-D
The trick is the preceding t variable which is empty at the beginning. The variable will be set {t=...} only on the next step of processing after it was shown {print t ...}. So if we (awk) continue iterating we will got the desired sequence.
I would do it by finding the number of lines before running the script, e.g. with coreutils and bash:
awk -v nlines=$(wc -l < $a) '{printf "%s", $1"-"$2} NR != nlines { printf ", " }' $a >>positions
If your file only has 2 columns, the following coreutils alternative also works. Example data:
paste <(seq 5) <(seq 5 -1 1) | tee testfile
Output:
1 5
2 4
3 3
4 2
5 1
Now replacing tabs with newlines, paste easily assembles the date into the desired format:
<testfile tr '\t' '\n' | paste -sd-,
Output:
1-5,2-4,3-3,4-2,5-1
You might think that awk's ORS and OFS would be a reasonable way to handle this:
$ awk '{print $1,$2}' OFS="-" ORS=", " input.txt
But this results in a final ORS because the input contains a newline on the last line. The newline is a record separator, so from awk's perspective there is an empty last record in the input. You can work around this with a bit of hackery, but the resultant complexity eliminates the elegance of the one-liner.
So here's my take on this. Since you say you're "writing multiple column values", it's possible that mucking with ORS and OFS would cause problems. So we can achieve the desired output entirely with formatting.
$ cat input.txt
3 2
5 4
1 8
$ awk '{printf "%s%d-%d",t,$1,$2; t=", "} END{print ""}' input.txt
3-2, 5-4, 1-8
This is similar to Michael's and rook's single-pass approaches, but it uses a single printf and correctly uses the format string for formatting.
This will likely perform negligibly better than Michael's solution because an assignment should take less CPU than a test, and noticeably better than any of the multi-pass solutions because the file only needs to be read once.
Here's a better way, without resorting to coreutils:
awk 'FNR==NR { c++; next } { ORS = (FNR==c ? "\n" : ", "); print $1, $2 }' OFS="-" file file
awk '{a[NR]=$1"-"$2;next}END{for(i=1;i<NR;i++){print a[i]", " }}' $a > positions

Resources