How can I interpret a string that contains decimal escape sequences? - bash

I'm trying to parse the "parsable" output of the avahi-browse command for use in a shell script, e.g.
for i in $(avahi-browse -afkpt | awk -F';' '{print $4}') ; do <do something with $i> ; done
The output looks like:
+;br.vlan150;IPv4;Sonos-7828CAC5D944\064Bedroom;_sonos._tcp;local
I am particularly interested in the value of the 4th field, which is a "service name".
With the -p|--parsable flag, avahi-browse escapes the "service name" values.
For example 7828CAC5D944\064Bedroom, where \064 is a zero-padded decimal representation of the ASCII character '@'.
I just want 7828CAC5D944@Bedroom so I can, for example, use it as an argument to another command.
I can't quite figure out how to do this inside the shell.
I tried using printf, but that only seems to interpret octal escape sequences. e.g.:
# \064 is interpreted as 4
$ printf '%b\n' '7828CAC5D944\064Bedroom'
7828CAC5D9444Bedroom
How can I parse these values, converting any of the decimal escape sequences to their corresponding ASCII characters?

Assumptions:
there's a reason the -p flag cannot be removed (will removing -p generate a @ instead of \064?)
the 4th field is to be further processed by stripping off all text up to and including a hyphen (-)
\064 is the only escaped decimal value we need to worry about (for now)
Since OP is already calling awk to process the raw data I propose we do the rest of the processing in the same awk call.
One awk idea:
awk -F';' '
{ n=split($4,arr,"-") # split field #4 based on a hyphen delimiter
gsub(/\\064/,"@",arr[n]) # perform the string replacement in the last arr[] entry
print arr[n] # print the newly modified string
}'
# or as a one-liner:
awk -F';' '{n=split($4,arr,"-");gsub(/\\064/,"@",arr[n]);print arr[n]}'
Simulating the avahi-browse call feeding into awk:
echo '+;br.vlan150;IPv4;Sonos-7828CAC5D944\064Bedroom;_sonos._tcp;local' |
awk -F';' '{n=split($4,arr,"-");gsub(/\\064/,"@",arr[n]);print arr[n]}'
This generates:
7828CAC5D944@Bedroom
And for the main piece of code I'd probably opt for a while loop, especially if there's a chance the avahi-browse/awk process could generate data with white space:
while read -r i
do
<do something with $i>
done < <(avahi-browse -afkpt | awk -F';' '{n=split($4,arr,"-");gsub(/\\064/,"@",arr[n]);print arr[n]}')

Using perl to do the conversion:
$ perl -pe 's/\\(\d+)/chr $1/ge' <<<"7828CAC5D944\064Bedroom"
7828CAC5D944@Bedroom
As part of your larger script, completely replacing awk:
while read -r i; do
# do something with i
done < <(avahi-browse -afkpt | perl -F';' -lane 'print $F[3] =~ s/\\(\d+)/chr $1/ger')
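For completeness, the conversion can also be sketched in bash alone, without awk or perl. This is only a minimal sketch: the function name decode_decimal_escapes is made up for illustration, and it assumes every escape is a backslash followed by exactly three decimal digits, so an escaped backslash that happens to be followed by three digits would be decoded incorrectly.

```shell
#!/usr/bin/env bash
# decode_decimal_escapes is a hypothetical helper name, not part of avahi.
# Assumption: every escape is a backslash followed by exactly three decimal digits.
decode_decimal_escapes() {
    local s=$1 out='' code char
    local re='^(.*)\\([0-9]{3})(.*)$'   # the greedy (.*) finds the LAST escape first
    while [[ $s =~ $re ]]; do
        code=$((10#${BASH_REMATCH[2]}))               # force base 10, so 064 means 64
        printf -v char "\\$(printf '%03o' "$code")"   # 64 -> \100 (octal) -> @
        out=$char${BASH_REMATCH[3]}$out               # build the result right to left
        s=${BASH_REMATCH[1]}                          # keep decoding the remaining prefix
    done
    printf '%s\n' "$s$out"
}

decode_decimal_escapes '7828CAC5D944\064Bedroom'   # prints 7828CAC5D944@Bedroom
```

The regex is kept in a variable so that bash does not reinterpret the backslashes inside [[ ... =~ ... ]].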

Related

How to extract two pieces of data from a string

I am trying to extract two pieces of data from a string and I am having a bit of trouble. The string is formatted like this:
11111111-2222:3333:4444:555555555555 aaaaaaaa:bbbbbbbb:cccccccc:dddddddd
What I am trying to achieve is to print the first column (11111111-2222:3333:4444:555555555555) and the third section of the colon string (cccccccc), on the same line with a space between the two, as the first column is an identifier. Ideally in a way that can just be run as one-line from the terminal.
I have tried using cut and awk but I have yet to find a good way to make this work.
How about a sed expression like this?
echo "11111111-2222:3333:4444:555555555555 aaaaaaaa:bbbbbbbb:cccccccc:dddddddd" |
sed -e "s/\(.*\) .*:.*:\(.*\):.*/\1 \2/"
Result:
11111111-2222:3333:4444:555555555555 cccccccc
The following awk script does the job without relying on the format of the first column.
awk -F: 'BEGIN {RS=ORS=" "} NR==1; NR==2 {print $3}'
Use it in a pipe or pass the string as a file (simply append the filename as an argument) or as a here-string (append <<< "your string").
Explanation:
Instead of lines, this awk script splits the input into space-separated records (RS=ORS=" "). Each record is subdivided into :-separated fields (-F:). The first record is printed as is (NR==1;, which is the same as NR==1 {print $0}). In the second record, only the 3rd field is printed (NR==2 {print $3}); for the record aaa:bbb:ccc:ddd the 3rd field is ccc.
I think the answer from user803422 is better but here's another option. Maybe it'll help you use cut in the future.
str='11111111-2222:3333:4444:555555555555 aaaaaaaa:bbbbbbbb:cccccccc:dddddddd'
first=$(echo "$str" | cut -d ' ' -f1)
second=$(echo "$str" | cut -d ':' -f6)
echo "$first $second"
With pure Bash Regex:
str='11111111-2222:3333:4444:555555555555 aaaaaaaa:bbbbbbbb:cccccccc:dddddddd'
[[ $str =~ ^([^[:space:]]+[[:space:]])[^:]*:[^:]*:([^:]*) ]] && echo "${BASH_REMATCH[1]}${BASH_REMATCH[2]}"
Explanations:
[[ $str =~ ^([^[:space:]]+[[:space:]])[^:]*:[^:]*:([^:]*) ]]: match $str against the POSIX Extended RegEx ^([^[:space:]]+[[:space:]])[^:]*:[^:]*:([^:]*), which contains two capture groups. Group 1, ([^[:space:]]+[[:space:]]), is the first column followed by a space. Group 2, ([^:]*), is any number of characters that are not :, and because [^:]*: is skipped twice before it, it lands on the third colon-separated section.
The match has to run in the current shell rather than inside $(...): a command substitution is a subshell, and a BASH_REMATCH array set there is not visible to the echo that follows.
${BASH_REMATCH[1]}${BASH_REMATCH[2]}: expand the content of the RegEx captured groups that Bash keeps in the dedicated $BASH_REMATCH array.
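Another way to do this in pure bash, sketched here as an alternative to the regex, is to let read do the splitting: once on whitespace and once on colons. The function name extract_id_and_third is hypothetical, and the sketch assumes the input always has the two-column shape shown in the question.

```shell
#!/usr/bin/env bash
# extract_id_and_third is a hypothetical helper name.
# Assumption: input looks like "<identifier> a:b:c:d" with a single space between columns.
extract_id_and_third() {
    local id rest first second third remainder
    # Split on whitespace: the identifier goes to "id", everything else to "rest".
    read -r id rest <<< "$1"
    # Split "rest" on colons; only the third piece is needed.
    IFS=: read -r first second third remainder <<< "$rest"
    printf '%s %s\n' "$id" "$third"
}

extract_id_and_third '11111111-2222:3333:4444:555555555555 aaaaaaaa:bbbbbbbb:cccccccc:dddddddd'
# prints 11111111-2222:3333:4444:555555555555 cccccccc
```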

Ignore comma after backslash in a line in a text file using awk or sed

I have a text file containing several lines of the following format:
name,list_of_subjects,list_of_sports,school
Eg1: john,science\,social,football,florence_school
Eg2: james,painting,tennis\,ping_pong\,chess,highmount_school
I need to parse the text file and print the output of fields ignoring the escaped commas. Here those will be fields 2 or 3 like this:
science, social
tennis, ping_pong, chess
I do not know how to ignore escaped characters. How can I do it with awk or sed in terminal?
Substitute \, with a character that your records do not contain normally (e.g. \n), and restore it before printing. For example:
$ awk -F',' 'NR>1{ if(gsub(/\\,/,"\n")) gsub(/\n/,",",$2); print $2 }' file
science,social
painting
Since the first gsub is performed on the whole record (i.e. $0), awk is forced to recompute the fields. The second one is performed only on the second field (i.e. $2), so it does not affect the other fields. See: Changing Fields.
To be able to extract multiple fields with properly escaped commas you need to gsub \ns in all fields with a for loop as in the following example:
$ awk 'BEGIN{ FS=OFS="," } NR>1{ if(gsub(/\\,/,"\n")) for(i=1;i<=NF;++i) gsub(/\n/,"\\,",$i); print $2,$3 }' file
science\,social,football
painting,tennis\,ping_pong\,chess
See also: What's the most robust way to efficiently parse CSV using awk?.
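As a self-contained illustration of the same placeholder idea, here is a minimal sketch that can be pasted into a terminal. It differs from the commands above in a few ways that are my own assumptions: the placeholder is the control character \001 instead of \n, the input has no header line, the function name print_fields_2_3 is made up, and the output separator " | " is chosen only to make the field boundary visible.

```shell
#!/usr/bin/env bash
# print_fields_2_3 is a hypothetical helper name.
# Assumption: the input never contains the control character \001.
print_fields_2_3() {
    awk 'BEGIN { FS = ","; OFS = " | " }
    {
        gsub(/\\,/, "\001")            # hide escaped commas; modifying $0 re-splits the fields
        for (i = 1; i <= NF; i++)
            gsub("\001", ",", $i)      # put plain commas back inside each field
        print $2, $3
    }'
}

printf '%s\n' 'john,science\,social,football,florence_school' \
              'james,painting,tennis\,ping_pong\,chess,highmount_school' |
    print_fields_2_3
# prints:
# science,social | football
# painting | tennis,ping_pong,chess
```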
You could replace the \, sequences with another character that won't appear in your text, split the text on the remaining commas, then turn the chosen character back into commas:
sed $'s/\\\,/\037/g' input | awk -F, '{ printf "Name: %s\nSubjects : %s\nSports: %s\nSchool: %s\n\n", $1, $2, $3, $4 }' | tr $'\037' ','
This uses the ASCII control character "Unit Separator" (octal 037, decimal 31), which I'm pretty sure your input won't contain.
Why awk and sed when bash with coreutils is just enough:
# Sorry my cat. Using `cat` as input pipe
cat <<EOF |
name,list_of_subjects,list_of_sports,school
Eg1: john,science\,social,football,florence_school
Eg2: james,painting,tennis\,ping_pong\,chess,highmount_school
EOF
# remove first line!
tail -n+2 |
# substitute `\,` by an unreadable character:
sed 's/\\\,/\xff/g' |
# read the comma separated list
while IFS=, read -r name list_of_subjects list_of_sports school; do
# read the \xff separated list into an array
IFS=$'\xff' read -r -d '' -a list_of_subjects < <(printf "%s" "$list_of_subjects")
# read the \xff separated list into an array
IFS=$'\xff' read -r -d '' -a list_of_sports < <(printf "%s" "$list_of_sports")
echo "list_of_subjects : ${list_of_subjects[@]}"
echo "list_of_sports : ${list_of_sports[@]}"
done
will output:
list_of_subjects : science social
list_of_sports : football
list_of_subjects : painting
list_of_sports : tennis ping_pong chess
Note that this will most probably be slower than a solution using awk.
Note that the principle of operation is the same as in the other answers: substitute the \, string with some other unique character and then use that character to iterate over the elements of the second and third fields.
This might work for you (GNU sed):
sed -E 's/\\,/\n/g;y/,\n/\n,/;s/^[^,]*$//Mg;s/\n//g;/^$/d' file
Replace escaped commas by newlines and then swap newlines and commas, so the escaped commas become plain commas and the field separators become newlines. Remove all lines that do not contain a comma. Delete empty lines.
Using Perl. Change the \, to some control char say \x01 and then replace it again with ,
$ cat laxman.txt
john,science\,social,football,florence_school
james,painting,tennis\,ping_pong\,chess,highmount_school
$ perl -ne ' s/\\,/\x01/g and print ' laxman.txt | perl -F, -lane ' for(@F) { if( /\x01/ ) { s/\x01/,/g ; print } } '
science,social
tennis,ping_pong,chess
You can perhaps join columns with a function.
function joincol(col, i) {
$col=$col FS $(col+1)
for (i=col+1; i<NF; i++) {
$i=$(i+1)
}
NF--
}
This might get used thusly:
{
for (col=1; col<=NF; col++) {
# keep joining while the field still ends in a backslash (handles a\,b\,c)
while (col < NF && $col ~ /\\$/) {
joincol(col)
}
}
}
Note that decrementing NF is undefined behaviour in POSIX. It may delete the last field, or it may not, and still be POSIX compliant. This works for me in BSDawk and Gawk. YMMV. May contain nuts.
Use gawk's FPAT:
awk -v FPAT='(\\\\.|[^,\\\\]*)+' '{print $3}' file
#list_of_sports
#football
#tennis\,ping_pong\,chess
then use gensub to remove the backslashes:
awk -v FPAT='(\\\\.|[^,\\\\]*)+' '{print gensub("\\\\", "", "g", $3)}' file
#list_of_sports
#football
#tennis,ping_pong,chess

Replacing/removing excess white space between columns in a file

I am trying to parse a file with similar contents:
I am a string 12831928
I am another string 41327318
A set of strings 39842938
Another string 3242342
I want the out file to be tab delimited:
I am a string\t12831928
I am another string\t41327318
A set of strings\t39842938
Another string\t3242342
I have tried the following:
sed 's/\s+/\t/g' filename > outfile
I have also tried cut, and awk.
Just use awk:
$ awk -F'  +' -v OFS='\t' '{sub(/ +$/,""); $1=$1}1' file
I am a string 12831928
I am another string 41327318
A set of strings 39842938
Another string 3242342
Breakdown:
-F'  +' # tell awk that input fields (FS) are separated by 2 or more blanks (two spaces followed by +)
-v OFS='\t' # tell awk that output fields are separated by tabs
'{sub(/ +$/,""); # remove all trailing blank spaces from the current record (line)
$1=$1} # recompile the current record (line) replacing FSs by OFSs
1' # idiomatic: any true condition invokes the default action of "print"
I highly recommend the book Effective Awk Programming, 4th Edition, by Arnold Robbins.
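Because runs of spaces are hard to see on a web page, here is a small self-contained check of the same idea. It is only a sketch: the sample line is invented (three spaces before the number, two trailing spaces), and the separator is written as [ ][ ]+ instead of two literal spaces so that the "two or more" stays visible.

```shell
#!/usr/bin/env bash
# Invented sample line: three spaces before the number, two trailing spaces.
line='I am a string   12831928  '

# [ ][ ]+ means "two or more spaces"; single spaces between words do not split fields.
result=$(printf '%s\n' "$line" |
    awk -F'[ ][ ]+' -v OFS='\t' '{ sub(/ +$/, ""); $1 = $1 } 1')

# printf %q (bash-specific) makes the otherwise invisible tab show up as an escape.
printf '%q\n' "$result"
```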
The difficulty comes in the varying number of words per-line. While you can handle this with awk, a simple script reading each word in a line into an array and then tab-delimiting the last word in each line will work as well:
#!/bin/bash
fn="${1:-/dev/stdin}"
while read -r line || test -n "$line"; do
arr=( $(echo "$line") )
nword=${#arr[@]}
for ((i = 0; i < nword - 1; i++)); do
test "$i" -eq '0' && word="${arr[i]}" || word=" ${arr[i]}"
printf "%s" "$word"
done
printf "\t%s\n" "${arr[i]}"
done < "$fn"
Example Use/Output
(using your input file)
$ bash rfmttab.sh < dat/tabfile.txt
I am a string 12831928
I am another string 41327318
A set of strings 39842938
Another string 3242342
Each number is tab-delimited from the rest of the string. Look it over and let me know if you have any questions.
sed -E 's/[ ][ ]+/\\t/g' filename > outfile
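NOTE: [ ] is an opening bracket, a space, and a closing bracket.
-E enables extended regular expression support.
The pattern [ ][ ]+ matches only runs of two or more consecutive spaces, so single spaces between words are left alone. Be aware that \\t in the replacement writes the two literal characters \t, as in the question's sample output, not a real tab character.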
Tested on MacOS and Ubuntu versions of sed.
Your input has spaces at the end of each line, which makes things a little more difficult than without. This sed command would replace the spaces before that last column with a tab:
$ sed 's/[[:blank:]]*\([^[:blank:]]*[[:blank:]]*\)$/\t\1/' infile | cat -A
I am a string^I12831928 $
I am another string^I41327318 $
A set of strings^I39842938 $
Another string^I3242342 $
This matches – anchored at the end of the line – blanks, non-blanks and again blanks, zero or more of each. The last column and the optional blanks after it are captured.
The blanks before the last column are then replaced by a single tab, and the rest stays the same – see output piped to cat -A to show explicit line endings and ^I for tab characters.
If there are no blanks at the end of each line, this simplifies to
sed 's/[[:blank:]]*\([^[:blank:]]*\)$/\t\1/' infile
Notice that some seds, notably BSD sed as found in MacOS, can't use \t for tab in a substitution. In that case, you have to use either '$'\t'' or '"$(printf '\t')"' instead.
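A minimal sketch of the portable variant, which should behave the same with GNU and BSD sed because the tab is produced by the shell and not by sed. The helper name last_column_to_tab is made up, and unlike the command above this version also drops the trailing blanks, since they sit outside the capture group.

```shell
#!/usr/bin/env bash
tab=$(printf '\t')   # a real tab character, produced by the shell instead of sed

# last_column_to_tab is a hypothetical helper name.
last_column_to_tab() {
    # Blanks before the last column become one tab; blanks after it are dropped.
    sed "s/[[:blank:]]*\([^[:blank:]]*\)[[:blank:]]*\$/${tab}\1/"
}

# Invented sample line with three spaces before the number and two trailing spaces.
printf '%s\n' 'A set of strings   39842938  ' | last_column_to_tab
```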
another approach, with gnu sed and rev
$ rev file | sed -r 's/ +/\t/1' | rev
You have trailing spaces on each line. So you can do two sed expressions in one go like so:
$ sed -E -e 's/ +$//' -e $'s/  +/\t/' /tmp/file
I am a string 12831928
I am another string 41327318
A set of strings 39842938
Another string 3242342
Note the $'s/  +/\t/' (two spaces before the +): the $'...' quoting tells bash to replace \t with an actual tab character prior to invoking sed.
To show that these deletions and \t insertions are in the right place you can do:
$ sed -E -e 's/ +$/X/' -e $'s/  +/Y/' /tmp/file
I am a stringY12831928X
I am another stringY41327318X
A set of stringsY39842938X
Another stringY3242342X
Simple and without invisible semantic characters in the code:
perl -lpe 's/\s+$//; s/\s\s+/\t/' filename
Explanation:
Options:
-l: remove LF during processing (in this case)
-p: loop over records (like awk) and print
-e: code follows
Code:
remove trailing whitespace
change two or more whitespace to tab
Tested on OP data. The trailing spaces are removed for consistency.

How to get word count of a part of a line

The lines of the file look something like this:
<some character> ||| each line. So far i can get the total number of lines and the text for each on its own line ||| <some text>
Now I want to count the number of words between the ||| delimiters.
What I intended to do is
awk -F '|||' '{print $2}' word_file | wc -l
but the awk part prints blank lines, which suggests it is not taking ||| as the delimiter as I intended. Interestingly, if I use $1 instead of $2, it prints the whole text.
However, if I use ' ||| ' (i.e. with a space before and after), I get some output, but the sentence between the two delimiters is not treated as one field: it prints only the word each instead of the whole sentence when I use the following:
awk -F ' ||| ' '{print $2}' word_file
How do I achieve this using a bash command
FYI
awk version -GNU Awk 4.0.1
Awk's -F option, which sets FS, the input-field separator, expects a regular expression as its value.
Thus, for ||| to be interpreted as a literal, you must \-escape the | chars, which are metacharacters in a regex context.
Given that Awk also accepts \-based escape sequences in string literals, you must double the \ instances:
awk -F '\\|\\|\\|' ...
To properly count the words (defined as whitespace-separated tokens) in field 2, you can try this:
awk -F '\\|\\|\\|' 'BEGIN { orgFS=FS } { FS=" "; $0 = $2; print NF; FS=orgFS }' word_file
This splits each input line into fields by literal |||.
By temporarily setting FS to a single space - which is a magic value that tells Awk to split into fields by any nonempty run of whitespace - we can assign $2, the value of field 2, to $0, the whole input line, which causes the new value of $0 to be split into fields again.
At that point NF reflects the number of fields in what was originally the 2nd field - i.e., the number of words - and we can print that.
Restoring FS to its original value then prepares for parsing the next input line.
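If juggling FS feels fragile, a variant worth considering is to count the words with split(), which returns the number of pieces and treats a single-space separator as "any run of whitespace". This is only a sketch: the function name count_middle_words and the sample line are invented for illustration.

```shell
#!/usr/bin/env bash
# count_middle_words is a hypothetical helper name.
count_middle_words() {
    # Fields are separated by a literal |||; split() with " " counts
    # whitespace-separated words in field 2, ignoring leading and trailing blanks.
    awk -F '\\|\\|\\|' '{ print split($2, words, " ") }'
}

printf '%s\n' 'x ||| each line has some words ||| trailing text' | count_middle_words
# prints 5
```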
with gawk multi-char RS support, this might be easier
$ awk -v RS="\\\|\\\|\\\|" 'NR==2{print NF}' file
or if not sure how to escape the pipe, perhaps cleaner with
$ awk -v RS='[|]{3}' ...

Escape multiple dots in variable

Suppose I have a variable $email whose value is stack.over@gmail.com. I want to add a \ before every dot except the last one and store the result in a new variable $email_soa.
$email_soa should be stack\.over@gmail.com in this case.
sed -E 's/\./\\\./g;s/(.*)\\\./\1\./'
should do it.
Test
$ var="stack.over@flow.com"
$ echo $var | sed -E 's/\./\\\./g;s/(.*)\\\./\1./'
stack\.over@flow.com
$ var="stack.over@flow.com."
$ echo $var | sed -E 's/\./\\\./g;s/(.*)\\\./\1./'
stack\.over@flow\.com.
Note
The \\ makes a literal backslash and \. makes a literal dot
You can use gawk:
var="stack.over@gmail.com"
gawk -F'.' '{OFS="\\.";a=$NF;NF--;print $0"."a}' <<< "$var"
Output:
stack\.over@gmail.com
Explanation:
-F'.' splits the string by dots
OFS="\\." sets the output field separator to \.
a=$NF saves the portion after the last dot in a variable 'a'. NF is the number of fields.
NF-- decrements the field count, which effectively removes the last field. This also tells awk to reassemble the record using the OFS. This feature works at least with GNU's gawk.
print $0"."a prints the reassembled record along with a dot and the value of a
You could use perl to do this:
perl -pe 's/\.(?=.*\.)/\\./g' <<<'stack.over@gmail.com'
This adds a backslash before any dot that has another dot somewhere after it in the string.
How about this:
temp=${email%.*}
email_soa=${temp//./\\.}.${email##*.}
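The parameter-expansion approach can be wrapped in a small function, sketched below. The name escape_all_but_last_dot is hypothetical, and the guard for strings without any dot is my own addition: without it, ${s%.*} and ${s##*.} would both return the whole string.

```shell
#!/usr/bin/env bash
# escape_all_but_last_dot is a hypothetical helper name.
escape_all_but_last_dot() {
    local s=$1
    # No dot at all: print the string unchanged.
    [[ $s == *.* ]] || { printf '%s\n' "$s"; return; }
    local head=${s%.*}     # everything before the last dot
    local tail=${s##*.}    # everything after the last dot
    head=${head//./\\.}    # // replaces every remaining dot, not just the first
    printf '%s.%s\n' "$head" "$tail"
}

escape_all_but_last_dot 'stack.over@gmail.com'   # prints stack\.over@gmail.com
```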
