What is the optimal way to extract values between parentheses in bash/awk?

I have the output in this format:
Infosome - infotwo: (29333) - data-info-ids: (33389, 94934)
I want to extract the two numbers in the last pair of parentheses. Sometimes there is only a single number in the last pair.
This is the code I used.
echo "Infosome - infotwo: (29333) - data-info-ids: (33389, 94934)" | \
tr "," " " | tr "(" " " | tr ")" " " | awk -F: '{print $4}'
Is there a cleaner way to extract the values? Or a more optimal one?

Try this:
awk -F '[()]' '{print $(NF-1)}' input | tr -d ,
It's essentially a refactoring of your command: split on the parentheses, take the next-to-last field, and delete the comma.
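With the sample line this prints:
33389 94934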

awk -F\( '{gsub("[,)]", " ", $NF); print $NF}' input
will give
33389 94934
I am a bit unclear about the meaning of "optimal"/"clean" in this problem's context, but this uses only one command/tool; I'm not sure if that qualifies.
Or building on @kev's approach (but not needing tr to eliminate the comma):
awk -F'[(,)]' '{print $4, $5}' input
outputs:
33389 94934

This can also be done in pure bash. Assuming the text always looks like the sample in the question, the following should work:
$ text="Infosome - infotwo: (29333) - data-info-ids: (33389, 94934)"
$ result="${text/*(}"
$ echo ${result//[,)]}
33389 94934
This uses shell "parameter expansion" (which you can search for in bash's man page) to strip the string in much the same way you did using tr. Strictly speaking, the quotes in the second line are not necessary, but they help with StackOverflow syntax highlighting. :-)
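As a quick check, the same expansion handles the single-number case mentioned in the question:
$ text="Infosome - infotwo: (29333) - data-info-ids: (94934)"
$ result="${text/*(}"
$ echo ${result//[,)]}
94934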
You could alternatively make this a little more flexible by looking for the actual field you're interested in. If you're using GNU awk, you can specify RS with multiple characters:
$ gawk -vRS=" - " -vFS=": *" '
    { f[$1]=$2; }
    END {
        print f["data-info-ids"];
        # Or you could strip the non-numeric characters to get just numbers.
        #print gensub(/[^0-9 ]/,"","g",f["data-info-ids"]);
    }' <<<"$text"
I prefer this way, because it actually interprets the input data for what it is -- structured text representing some sort of array.
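For reference, with the sample $text the plain print should emit the raw field value:
(33389, 94934)
and the commented-out gensub variant would print just the numbers:
33389 94934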

Related

How do I get the value present in the first double quotes?

I'm currently writing a bash script to get the first value among several comma-separated strings.
I have a file that looks like this -
name
things: "water bottle","40","new phone cover",10
place
I just need to return the value in the first double quotes.
water bottle
The value in the first double quotes can be one word or two words; that is, water bottle could sometimes be pen instead.
I tried -
awk '/:/ {print $2}'
But this just gives
water
I wanted to split it on commas, but there's a colon (:) after things, so I'm not sure how to separate it.
How do I get the value present in the first double quotes?
EDIT:
SOLUTION:
I used the below code since I particularly wanted to use awk -
awk '/:/' test.txt | cut -d\" -f2
A solution using the cut utility could be
cut -d\" -f2 infile > outfile
Using GNU awk you could make use of a capture group, with a negated character class so the match does not cross a , since that is the field delimiter.
awk 'match($0, /^[^",:]*:[^",]*"([^"]*)"/, a) {print a[1]}' file
Output
water bottle
The pattern matches:
^ Start of string
[^",:]*: Match any character except ", , and : (zero or more times), then match :
[^",]* Match any character except " and , (zero or more times)
"([^"]*)" Capture in group 1 the value between the double quotes
If the value is always between double quotes, a shorter option is to set the field separator to " and check whether field 1 contains a colon. Note that, technically, this also prints water bottle when there is only a leading double quote and no closing one.
awk -F'"' '$1 ~ /:/ {print $2}' file
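To illustrate that caveat, a line with only a leading double quote (hypothetical malformed input) still prints a value:
$ awk -F'"' '$1 ~ /:/ {print $2}' <<< 'things: "water bottle'
water bottle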
With your shown samples, please try the following awk code.
awk '/^things:/ && match($0,/"[^"]*/){print substr($0,RSTART+1,RLENGTH-1)}' Input_file
Explanation: the awk program checks whether the line starts with things: AND uses the match function to match everything between the 1st and 2nd ", then prints that with substr.
Solution 1: awk
You can use a single awk command:
awk -F\" 'index($1, ":"){print $2}' test.txt > outfile
The -F\" sets the field separator to a " char, the index($1, ":") condition makes sure field 1 contains a : char (no regex needed), and then {print $2} prints the second field value.
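A self-contained check with the question's sample:
$ s='name
things: "water bottle","40","new phone cover",10
place'
$ awk -F\" 'index($1, ":"){print $2}' <<< "$s"
water bottle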
Solution 2: awk + cut
You can use awk + cut:
awk '/:/' test.txt | cut -d\" -f2 > outfile
With awk '/:/' test.txt you extract the line(s) containing a : char, and the piped cut -d\" -f2 command then splits the string with " as a separator and returns the second item.
Solution 3: sed
Alternatively, you can use sed:
sed -n 's/^[^"]*"\([^"]*\)".*/\1/p' file > outfile
For example:
#!/bin/bash
s='name
things: "water bottle","40","new phone cover",10
place'
sed -n 's/^[^"]*"\([^"]*\)".*/\1/p' <<< "$s"
# => water bottle
The command means:
-n - the option suppresses the default line output
^[^"]*"\([^"]*\)".* - a POSIX BRE regex pattern that matches
^ - start of string
[^"]* - zero or more chars other than "
" - a " char
\([^"]*\) - Group 1 (\1 refers to this value): any zero or more chars other than "
".* - a " char and the rest of the string.
\1 replaces the match with Group 1 value
p - only prints the result of a successful substitution.

Ignore comma after backslash in a line in a text file using awk or sed

I have a text file containing several lines of the following format:
name,list_of_subjects,list_of_sports,school
Eg1: john,science\,social,football,florence_school
Eg2: james,painting,tennis\,ping_pong\,chess,highmount_school
I need to parse the text file and print the fields while ignoring the escaped commas. Here those will be fields 2 and 3, like this:
science, social
tennis, ping_pong, chess
I do not know how to ignore escaped characters. How can I do it with awk or sed in terminal?
Substitute \, with a character that your records do not contain normally (e.g. \n), and restore it before printing. For example:
$ awk -F',' 'NR>1{ if(gsub(/\\,/,"\n")) gsub(/\n/,",",$2); print $2 }' file
science,social
painting
Since the first gsub is performed on the whole record (i.e. $0), awk is forced to recompute the fields. But the second one is performed only on the second field (i.e. $2), so it will not affect the other fields. See: Changing Fields in the GNU awk manual.
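A minimal illustration of that recomputation rule:
$ echo 'one two three' | awk '{ gsub(/two/, "2 2"); print NF }'
4
Because gsub rewrites $0, awk re-splits the record and now sees four fields.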
To extract multiple fields with the escaped commas properly restored, you need to gsub the \ns back in every field with a for loop, as in the following example:
$ awk 'BEGIN{ FS=OFS="," } NR>1{ if(gsub(/\\,/,"\n")) for(i=1;i<=NF;++i) gsub(/\n/,"\\,",$i); print $2,$3 }' file
science\,social,football
painting,tennis\,ping_pong\,chess
See also: What's the most robust way to efficiently parse CSV using awk?.
You could replace the \, sequences with another character that won't appear in your text, split the text around the remaining commas, then replace the chosen character with commas:
sed $'s/\\\,/\31/g' input | awk -F, '{ printf "Name: %s\nSubjects : %s\nSports: %s\nSchool: %s\n\n", $1, $2, $3, $4 }' | tr $'\31' ','
In this case an ASCII control character is used (note that \31 inside $'...' is octal, i.e. byte 0x19), which I'm pretty sure your input won't contain.
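With the question's sample lines the pipeline should print (the header line passes through as well, since nothing filters it out):
Name: name
Subjects : list_of_subjects
Sports: list_of_sports
School: school

Name: Eg1: john
Subjects : science,social
Sports: football
School: florence_school

Name: Eg2: james
Subjects : painting
Sports: tennis,ping_pong,chess
School: highmount_school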
Why awk and sed when bash with coreutils is just enough:
# Sorry my cat. Using `cat` as input pipe
cat <<EOF |
name,list_of_subjects,list_of_sports,school
Eg1: john,science\,social,football,florence_school
Eg2: james,painting,tennis\,ping_pong\,chess,highmount_school
EOF
# remove first line!
tail -n+2 |
# substitute `\,` by an unreadable character:
sed 's/\\\,/\xff/g' |
# read the comma separated list
while IFS=, read -r name list_of_subjects list_of_sports school; do
# read the \xff separated list into an array
IFS=$'\xff' read -r -d '' -a list_of_subjects < <(printf "%s" "$list_of_subjects")
# read the \xff separated list into an array
IFS=$'\xff' read -r -d '' -a list_of_sports < <(printf "%s" "$list_of_sports")
echo "list_of_subjects : ${list_of_subjects[@]}"
echo "list_of_sports : ${list_of_sports[@]}"
done
will output:
list_of_subjects : science social
list_of_sports : football
list_of_subjects : painting
list_of_sports : tennis ping_pong chess
Note that this will most probably be slower than the solutions using awk.
Note that the principle of operation is the same as in the other answers: substitute the \, string with some other unique character, then use that character to iterate over the second and third field elements.
This might work for you (GNU sed):
sed -E 's/\\,/\n/g;y/,\n/\n,/;s/^[^,]*$//Mg;s/\n//g;/^$/d' file
Replace the escaped commas with newlines, then transliterate: newlines back to commas and the real commas to newlines. Empty every embedded line that does not contain a comma, remove the remaining newlines, and delete the pattern space if nothing is left.
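With the sample file this should print:
science,social
tennis,ping_pong,chess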
Using Perl: change the \, to some control char, say \x01, and then replace it with , again.
$ cat laxman.txt
john,science\,social,football,florence_school
james,painting,tennis\,ping_pong\,chess,highmount_school
$ perl -ne ' s/\\,/\x01/g and print ' laxman.txt | perl -F, -lane ' for(@F) { if( /\x01/ ) { s/\x01/,/g ; print } } '
science,social
tennis,ping_pong,chess
You can perhaps join columns with a function.
function joincol(col, i) {
    $col = $col FS $(col+1)
    for (i = col+1; i < NF; i++) {
        $i = $(i+1)
    }
    NF--
}
This might get used thusly (note the while rather than an if, so a field that still ends in a backslash after one join keeps absorbing the next field):
{
    for (col = 1; col <= NF; col++) {
        while ($col ~ /\\$/) {
            joincol(col)
        }
    }
}
Note that decrementing NF is undefined behaviour in POSIX. It may delete the last field, or it may not, and still be POSIX compliant. This works for me in BSDawk and Gawk. YMMV. May contain nuts.
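A complete invocation might look like this (a sketch assuming the sample file above, skipping the header line and unescaping the commas in the subjects field before printing):
$ awk -F, '
    function joincol(col, i) {
        $col = $col FS $(col+1)
        for (i = col+1; i < NF; i++) $i = $(i+1)
        NF--
    }
    NR > 1 {
        # keep joining while the field still ends in a backslash
        for (col = 1; col <= NF; col++)
            while ($col ~ /\\$/) joincol(col)
        gsub(/\\,/, ",", $2)
        print $2
    }' file
science,social
painting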
Use gawk's FPAT:
awk -v FPAT='(\\\\.|[^,\\\\]*)+' '{print $3}' file
#list_of_sports
#football
#tennis\,ping_pong\,chess
then use gensub to replace the backslashes:
awk -v FPAT='(\\\\.|[^,\\\\]*)+' '{print gensub("\\\\", "", "g", $3)}' file
#list_of_sports
#football
#tennis,ping_pong,chess
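The same FPAT works for any field; for example, a sketch printing both lists (header skipped) with the backslashes stripped:
awk -v FPAT='(\\\\.|[^,\\\\]*)+' 'NR>1{print gensub("\\\\", "", "g", $2), gensub("\\\\", "", "g", $3)}' file
#science,social football
#painting tennis,ping_pong,chess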

How to manipulate text to one side of a delimiter while preserving text on the opposite side

I am trying to translate some documents in which every line is of the form:
name1:text to be translated
name2:text to be translated
I am using translate-shell to perform the translations. trans -b :es -input ~/path/to/file
The desired output would be:
name1:texto a traducir
name2:texto a traducir
But instead I am getting this output:
nombre1:texto a traducir
nombre2:texto a traducir
If I had to guess, I would say the answer probably lies in separating the fields with awk, but I'm having difficulty understanding the man pages well enough to figure out how to do it properly. Right now I'm doing this:
awk -F: '/:/ { print $1 ": " $2 }' ~/path/to/file
to separate the fields and then attempting to work with each field separately. But I am confused about awk's pattern-action statements. Can I run another command from within awk? So far all my attempts to do so have resulted in syntax errors.
Here is a recipe involving cut and paste:
cut the names and texts into two separate files:
cut -d: -f1 yourfile > names.txt
cut -d: -f2- yourfile > text.txt
translate text.txt using whatever workflow you are using at the moment
combine the old names.txt with the translated text:
paste -d: names.txt yourtranslated_text
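A concrete version of the recipe (a sketch; file names are placeholders, and it assumes trans reads standard input when no input file is given):
cut -d: -f1 ~/path/to/file > names.txt
cut -d: -f2- ~/path/to/file | trans -b :es > text_es.txt
paste -d: names.txt text_es.txt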
I think @LarsFischer has the best answer so far, but just in case you have some reason to need awk, can pass individual strings to "trans", and the text to be translated cannot contain newlines, this is how you'd do it:
awk '
{
    name = text = $0
    sub(/:.*/,"",name)
    sub(/[^:]+:/,"",text)
    cmd = "trans args \"" text "\""
    if ( (cmd | getline rslt) > 0 ) {
        print name ":" rslt
    }
    close(cmd)
}
' file
Well, I can't get the translate-shell to work but maybe something like this:
awk -v dq='"' -F: '{printf "%s:", $1; gsub(/^.*:/,""); system("trans -b :es "dq""$0""dq)}' test.in
Another alternative is to paste the original and translated files together and cut the needed fields, that is:
paste -d: original translation | cut -d: -f1,4

Bash split string according to string

In Python, I would do something simple like sRet = sOut.split('Word')
In bash, scrounged from other answers, I have the following two methods that are insufficient in my case, but may be useful to someone in the future:
sOut="I want this Point to matter"
1) sRet=( $sOut )
2) IFS="Point " read -r -a sRet <<< ${sOut}
echo ${sRet[-1]}
I want returned: "to matter"
(1) gives: "matter"
(2) gives: "er"
The first only splits by spaces, the second splits by the last character, in this case it would be 't'.
How do I split by a full string, as I would in python?
sOut="I want this Point to matter"
s="Point "
[[ $sOut =~ $s(.*) ]] && echo ${BASH_REMATCH[1]}
Output:
to matter
IFS is a set of single-character delimiters, not a multi-character string, so you will need to deploy another tool. I'd suggest awk in this case:
$ awk -F 'Point' '{print $NF}' <<< "$sOut"
to matter
You can replace 'Point' with a variable holding the delimiter. You can also change which part of the split you get back. The variable $NF means "the last element". You can also use $1 for the first element, $2 for the second, and so on.
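For instance, with the delimiter in a variable (including the trailing space so the result has no leading blank):
$ delim="Point "
$ awk -F "$delim" '{print $NF}' <<< "$sOut"
to matter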
You can use awk for splitting the string:
text="I want this Point to matter"
s='Point'
awk -v s="$s" -v text="$text" 'BEGIN {split(text, a, "[[:blank:]]*" s "[[:blank:]]*");
for (i in a) print a[i]}'
I want this
to matter
To get only the last match:
awk -v s="$s" -v text="$text" 'BEGIN {n=split(text, a, "[[:blank:]]*" s "[[:blank:]]*"); print a[n]}'
to matter
Or:
awk -v s="$s" 'BEGIN{FS="[[:blank:]]*" s "[[:blank:]]*"} {print $NF}' <<< "$text"
to matter
IFS, on the other hand, doesn't work with a multi-character string: IFS='Point' will split the input on each of the characters P, o, i, n, t.
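A quick demonstration of that behaviour (hypothetical sample):
$ IFS='Point' read -r -a parts <<< "abcPointdef"
$ printf '[%s]' "${parts[@]}"; echo
[abc][][][][][def]
Each of the five characters is a separate delimiter, producing empty fields between them.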
sDelim="Point"
sRet1=$(awk -F "${sDelim}" '{print $1}' <<< "${sOut}")
sRet2=$(awk -F "${sDelim}" '{print $NF}' <<< "${sOut}")
Given all the other excellent answers, I prefer this one most for the following reasons:
1) It's short and sweet
2) Everything is fairly explicit when wanting to use variables
3) Any element can be selected: $1, $2, ... from the beginning; $NF, $(NF-1), ... from the end
4) If sDelim is not actually in sOut, the script doesn't freak out
Thanks mainly to @bishop for leading me to this
You could use the parentheses feature of sed (a capture group) to retrieve the matched string.
The below code:
sOut="I want this point to matter"
s="point "
echo $sOut | sed "s/.*$s\(.*\)/\1/"
would give me:
to matter
as output.

cut string in a specific column in bash

How can I cut the leading zeros in the third field so it will only be 6 characters?
xxx,aaa,00000000cc
rrr,ttt,0000000yhh
desired output
xxx,aaa,0000cc
rrr,ttt,000yhh
Here's a solution using awk:
echo " xxx,aaa,00000000cc
rrr,ttt,0000000yhh"|awk -F, -v OFS=, '{sub(/^0000/, "", $3)}1'
output
xxx,aaa,0000cc
rrr,ttt,000yhh
awk uses -F (or FS, the FieldSeparator) and you must use OFS (the OutputFieldSeparator).
sub(/srchtarget/, "replacementstring", stringToFix) uses a regular expression to look for four 0s at the front (^) of the third field ($3).
The 1 is a shorthand for the print statement. A longhand version of the script would be
echo " xxx,aaa,00000000cc
rrr,ttt,0000000yhh"|awk -F, -v OFS=, '{sub(/^0000/, "", $3);print}'
# ---------------------------------------------------------^^^^^^
It's all related to awk's /pattern/{action} idiom.
IHTH
If you can assume there are always three fields and you want to strip off the first four zeros in the third field you could use a monstrosity like this:
$ cat data
xxx,0000aaa,00000000cc
rrr,0000ttt,0000000yhh
$ cat data | sed 's/\([^,]\+\),\([^,]\+\),0000\([^,]\+\)/\1,\2,\3/'
xxx,0000aaa,0000cc
rrr,0000ttt,000yhh
Another more flexible solution if you don't mind piping into Python:
cat data | python -c '
import sys
for line in sys.stdin:
    print(",".join([f[4:] if i == 2 else f for i, f in enumerate(line.strip().split(","))]))
'
This says "remove the first four characters of the third field but leave all other fields unchanged".
Using awks substr should also work:
awk -F, -v OFS=, '{$3=substr($3,5,6)}1' file
xxx,aaa,0000cc
rrr,ttt,000yhh
It just takes 6 characters starting at position 5 of field 3 and assigns the result back to field 3.
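If the third field's length can vary, a sketch that instead keeps only the last six characters:
awk -F, -v OFS=, '{$3=substr($3,length($3)-5)}1' file
xxx,aaa,0000cc
rrr,ttt,000yhh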
