hexadecimal literals in awk patterns - macos

awk is capable of parsing fields as hexadecimal numbers:
$ echo "0x14" | awk '{print $1+1}'
21 <-- correct, since 0x14 == 20
However, it does not seem to handle hexadecimal literals in patterns:
$ echo "0x14" | awk '$1+1<=21 {print $1+1}' | wc -l
1 <-- correct
$ echo "0x14" | awk '$1+1<=0x15 {print $1+1}' | wc -l
0 <-- incorrect. awk is not properly handling the 0x15 here
Is there a workaround?

You're dealing with two similar but distinct issues here: non-decimal data in awk input, and non-decimal literals in your awk program.
See the POSIX.1-2004 awk specification, Lexical Conventions:
8. The token NUMBER shall represent a numeric constant. Its form and numeric value [...]
with the following exceptions:
a. An integer constant cannot begin with 0x or include the hexadecimal digits 'a', [...]
So awk (presumably you're using nawk or mawk) behaves "correctly". gawk (since version 3.1) supports non-decimal (octal and hex) literal numbers by default, though using the --posix switch turns that off, as expected.
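For illustration, here is that gawk default in action, and --posix turning it off (in POSIX mode 0x15 lexes as the number 0 concatenated with the empty, uninitialized variable x15):
$ gawk 'BEGIN{print 0x15}'
21
$ gawk --posix 'BEGIN{print 0x15}'
0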
The normal workaround in cases like this is to use the defined numeric string behaviour, where a numeric string is effectively parsed as if by the C standard atof() or strtod() function, which supports 0x-prefixed numbers:
$ echo "0x14" | nawk '$1+1<=0x15 {print $1+1}'
<no output>
$ echo "0x14" | nawk '$1+1<=("0x15"+0) {print $1+1}'
21
The problem here is that this isn't quite correct, as POSIX.1-2004 also states:
A string value shall be considered a numeric string if it comes from one of the following:
1. Field variables
...
and after all the following conversions have been applied, the resulting string would
lexically be recognized as a NUMBER token as described by the lexical conventions in Grammar
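You can see that restriction in gawk, which by default does not give string-to-number conversion strtod() semantics, so the ("0x15"+0) workaround quietly yields 0 there:
$ echo "0x14" | gawk '$1+1<=("0x15"+0) {print $1+1}'
<no output>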
UPDATE: gawk aims for "2008 POSIX.1003.1". Note, however, that since the 2008 edition (see the IEEE Std 1003.1, 2013 Edition awk here) the standard allows strtod() and implementation-dependent behaviour that does not require the number to conform to the lexical conventions. This should (implicitly) support INF and NAN too. The text in Lexical Conventions is similarly amended to optionally allow hexadecimal constants with 0x prefixes.
This won't behave (given the lexical constraint on numbers) quite as hoped in gawk:
$ echo "0x14" | gawk '$1+1<=0x15 {print $1+1}'
1
(note the "wrong" numeric answer, which would have been hidden by |wc -l)
unless you use --non-decimal-data too:
$ echo "0x14" | gawk --non-decimal-data '$1+1<=0x15 {print $1+1}'
21
See also:
https://www.gnu.org/software/gawk/manual/html_node/Nondecimal_002dnumbers.html
http://www.gnu.org/software/gawk/manual/html_node/Variable-Typing.html
The accepted answer to this SE question has a portability workaround.
The options for having both types of support for non-decimal numbers are:
use only gawk, without --posix and with --non-decimal-data
implement a wrapper function to perform hex-to-decimal, and use this both with your literals and on input data
If you search for "awk dec2hex" you can find many instances of the latter; a passable one is here: http://www.tek-tips.com/viewthread.cfm?qid=1352504 . If you want something like gawk's strtonum(), you can get a portable awk-only version here.
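For illustration, a minimal sketch of such a wrapper in plain POSIX awk (the hex2dec name and the code are mine, not taken from the links above):
$ echo "0x14" | awk '
  function hex2dec(s,    n, i, c) {
      # strip an optional 0x/0X prefix, then accumulate base-16 digits
      if (s ~ /^0[xX]/) s = substr(s, 3)
      n = 0
      for (i = 1; i <= length(s); i++) {
          c = index("0123456789abcdef", tolower(substr(s, i, 1)))
          if (c == 0) return 0    # not a hex digit: give up, return 0
          n = n * 16 + c - 1
      }
      return n
  }
  hex2dec($1) + 1 <= hex2dec("0x15") { print hex2dec($1) + 1 }'
21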

Are you stuck with an old awk version? I don't know of a way to do mathematics with hexadecimal numbers in it (you will have to wait for better answers :-). I can contribute an option from gawk:
-n, --non-decimal-data: Recognize octal and hexadecimal values in input data. Use this option with great caution!
So both
echo "0x14" | awk -n '$1+1<=21 {print $1+1}'
and
echo "0x14" | awk -n '$1+1<=0x15 {print $1+1}'
return
21

Whatever awk you're using seems to be broken, or non-POSIX at least:
$ echo '0x14' | /usr/xpg4/bin/awk '{print $1+1}'
1
$ echo '0x14' | nawk '{print $1+1}'
1
$ echo '0x14' | gawk '{print $1+1}'
1
$ echo '0x14' | gawk --posix '{print $1+1}'
1
Get GNU awk and use strtonum() everywhere you could have a hex number:
$ echo '0x14' | gawk '{print strtonum($1)+1}'
21
$ echo '0x14' | gawk 'strtonum($1)+1<=21{print strtonum($1)+1}'
21
$ echo '0x14' | gawk 'strtonum($1)+1<=strtonum(0x15){print strtonum($1)+1}'
21

Related

non-printable character not recognised as field separator

I have a file. Its field separator is the non-printable character \x1c (chr(28) in Python). In vi it looks like a^\b^\c but using cat I just see abc. The field separator ^\ is not visible.
I have a simple awk command:
awk -F $'\x1c' '{print NF}' a
to get the total number of fields. It works on macOS, but on AIX it fails. It seems AIX can't recognize the field separator, so the output is 1, meaning the whole line is considered one field.
How to do this on AIX? Any idea is much appreciated.
I was able to reproduce this on Solaris running ksh.
sol bash $ printf '\034a\034b\034c' | cat -v
^\a^\b^\c$
sol bash $ printf '\034a\034b\034c' | awk -F$'\x1c' '{print NF}'
4
sol bash $ printf '\034a\034b\034c' | awk -F$'\034' '{print NF}'
4
sol ksh $ printf '\034a\034b\034c' | cat -v
^\a^\b^\c$
sol ksh $ printf '\034a\034b\034c' | awk -F$'\x1c' '{print NF}'
1
sol ksh $ printf '\034a\034b\034c' | awk -F$'\034' '{print NF}'
1
I cannot confirm whether this is a ksh issue or an awk issue, as other cases fail on both shells.
sol ksh/bash $ printf '\034a\034b\034c' | awk 'BEGIN{FS="\034"}{print NF}'
1
All the above cases work successfully on any Linux system (which runs GNU awk by default), but on Solaris they fail gloriously.
The following trick is a workaround that cannot fail at all (until the point it fails):
sol ksh/bash $ printf '\034a\034b\034c' | awk 'BEGIN{FS=sprintf("%c",28)}{print NF}'
4
The above works because we let awk set FS itself, using the sprintf function and passing the decimal number 28 (= 0x1c hex = 034 octal).
Well, $'\x1c' is a bashism; the portable format is "$(printf '\034')".
(This answer has already been written as a comment.)
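A quick demonstration of the portable form (printf emits the literal \034 byte in the shell, so awk receives a one-character FS and no escape parsing is involved):
$ printf '\034a\034b\034c' | awk -F"$(printf '\034')" '{print NF}'
4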
When awk has issues, try Perl
$ cat -vT tonyren.txt
a^\b^\c^\d
p^\q^\r^\s
x^\y^\z
$ perl -F"\x1c" -le ' { print scalar @F } ' tonyren.txt
4
4
3
$

Searching a file (grep/awk) for 2 carriage return/line-feed characters

I'm trying to write a script that'll simply count the occurrences of \r\n\r\n in a file. (Opening the sample file in vim binary mode shows me the ^M character in the proper places, and the newline is still read as a newline).
Anyway, I know there are tons of solutions, but they don't seem to get me what I want.
e.g. sed -e '/\r/,/\r/!d' or using $'\n' as part of the grep statement.
However, none of these seem to produce what I need. I can't find the \r\n\r\n pattern with grep's "trick", since that just expands one variable. The sed solution is greedy, and so gets me way more lines than I want/need.
Switching grep to binary/Perl/no-newline mode seems to be closer to what I want,
e.g. grep -UPzo '\x0D', but really what I want then is grep -UPzo '\x0D\x00\x0D\x00', which doesn't produce the output I want.
It seems like such a simple task.
By default, awk treats \n as the record separator. That makes it very hard to count \r\n\r\n. If we choose some other record separator, say a letter, then we can easily count the appearance of this combination. Thus:
awk '{n+=gsub("\r\n\r\n", "")} END{print n}' RS='a' file
Here, gsub returns the number of substitutions made. These are summed and, after the end of the file has been reached, we print the total number.
Example
Here, we use bash's $'...' construct to explicitly add carriage returns and line feeds:
$ echo -n $'\r\n\r\n\r\n\r\na' | awk '{n+=gsub("\r\n\r\n", "")} END{print n}' RS='a'
2
Alternate solution (GNU awk)
We can tell it to treat \r\n\r\n as the record separator and then return the count (minus 1) of the number of records:
cat file <(echo 1) | awk 'END{print NR-1;}' RS='\r\n\r\n'
In awk, RS is the record separator and NR is the count of the number of records. Since we are using a multiple-character record separator, this requires GNU awk.
If the file ends with \r\n\r\n, the above would be off by one. To avoid that, the echo 1 is used to ensure that there is always at least one character after the last \r\n\r\n in the file.
Examples
Here, we use bash's $'...' construct to explicitly add carriage returns and line feeds:
$ echo -n $'abc\r\n\r\n' | cat - <(echo 1) | awk 'END{print NR-1;}' RS='\r\n\r\n'
1
$ echo -n $'abc\r\n\r\ndef' | cat - <(echo 1) | awk 'END{print NR-1;}' RS='\r\n\r\n'
1
$ echo -n $'\r\n\r\n\r\n\r\n' | cat - <(echo 1) | awk 'END{print NR-1;}' RS='\r\n\r\n'
2
$ echo -n $'1\r\n\r\n2\r\n\r\n3' | cat - <(echo 1) | awk 'END{print NR-1;}' RS='\r\n\r\n'
2

How can I extract the 11th and 12th characters of a grep match in Bash?

I have a text file called temp.txt that consists of 3 serial numbers.
AB400-251429-0014
AA200-251429-0028
AD200-251430-0046
The 11th and 12th characters in the serial number correspond to the week. I want to extract this number for each unit and do something with it (but for this example just echo it). I have the following code:
while read line; do
week=` grep S[ABD][42]00 $line | cut -c11-12 `
echo $week
done < temp.txt
Looks like it's not working as cut is expecting a filename called the serial number in each case. Is there an alternative way to do this?
The problem is not with cut but with grep which expects a filename, but gets the line contents. Also, the expression doesn't match the IDs: they don't start with S followed by A, B, or D.
You can process lines in bash without starting a subshell:
while read line ; do
echo 11th and 12th characters are: "${line:10:2}".
done < temp.txt
Your original approach is still possible:
week=$( echo "$line" | grep 'S[ABD][42]00' | cut -c11-12 )
Note that for non-matching lines, $week would be empty.
You can also try:
grep -oP '.{10}\K..' filename
which for your input prints
29
29
30
The \K means a variable-length look-behind. In other words, grep looks for the pattern before \K but does not include it in the result.
More precise selection of the lines:
grep -oP '[ABD][42]00-.{4}\K..' # or more precise
grep -oP '^\w[ABD][42]00-.{4}\K..' # or even more
grep -oP '^[A-Z][ABD][42]00-.{4}\K..' # or
grep -oP '^[A-Z][ABD][42]00-\d{4}\K..' # or
Each prints like the above, but selects the interesting lines more precisely... :)
I would use this simple awk:
awk '{print substr($0,11,2)}' temp.txt
29
29
30
To get it into an array that you can use later:
results=($(awk '{print substr($0,11,2)}' temp.txt))
echo "${results[#]}"
29 29 30
TL;DR
Looping with Bash is pretty inefficient, especially when reading a file a line at a time. You can get what you want faster and more effectively by using grep to select only the interesting lines, or by using awk to avoid having to call cut in a separate pipelined process.
GNU Grep and Cut Solution
$ grep '[[:alpha:]][ABD][42]' temp.txt | cut -c11,12
29
29
30
Awk Solutions
# As far as I know, this will work on most awks. If you find an exception,
# please post a constructive comment!
$ awk -v NF=1 -v FPAT=. '/[[:alpha:]][ABD][42]00/ { print $11 $12 }' temp.txt
29
29
30
# A more elegant solution as noted by @rici that works with GNU awk,
# and possibly others.
$ gawk -v FS= '/[[:alpha:]][ABD][42]00/ { print $11 $12 }' temp.txt
29
29
30
Store the Results in a Bash Array
Either way, you can store the results of your match in a Bash array to use later. For example:
$ results=(`grep '[[:alpha:]][ABD][42]00' temp.txt | cut -c11,12`)
$ echo "${results[#]}"
29 29 30

apply logic condition to decimal values in awk [duplicate]

I want gawk to parse number using comma , as the decimal point character.
So I set LC_NUMERIC to fr_FR.utf-8 but it does not work:
echo 123,2 | LC_NUMERIC=fr_FR.utf-8 gawk '{printf ("%.2f\n", $1 + 0) }'
123.00
The solution is to specify option --posix or export POSIXLY_CORRECT=1 but in this case the GNU awk extensions are not available, for example delete or the gensub function:
echo 123,2 | LC_NUMERIC=fr_FR.utf-8 gawk --posix '{printf ("%.2f\n", $1 + 0) }'
123,20
Is it possible to have gawk parse numbers with , as the decimal point without specifying the POSIX option?
The option you are looking for is:
--use-lc-numeric
This forces gawk to use the locale's decimal point character when parsing input data. Although the POSIX standard requires this behavior, and gawk does so when --posix is in effect, the default is to follow traditional behavior and use a period as the decimal point, even in locales where the period is not the decimal point character. This option overrides the default behavior, without the full draconian strictness of the --posix option.
Demo:
$ echo 123,2 | LC_NUMERIC=fr_FR.utf-8 awk --use-lc-numeric '{printf "%.2f\n",$1}'
123,20
Notes: printf is a statement, not a function, so the parentheses are not required, and I'm not sure why you are adding zero here.

how to grep multiple variables in bash

I need to grep multiple strings, but I don't know the exact number of strings.
My code is:
s2=( $(echo $1 | awk -F"," '{ for (i=1; i<=NF ; i++) {print $i} }') )
for pattern in "${s2[@]}"; do
ssh -q host tail -f /some/path |
grep -w -i --line-buffered "$pattern" > some_file 2>/dev/null &
done
Now, the code is not doing what it's supposed to do. For example, if I run ./script s1,s2,s3,s4,.....
it prints all lines that contain s1,s2,s3....
The script is supposed to do something like grep "$s1" | grep "$s2" | grep "$s3" ....
grep doesn't have an option to match all of a set of patterns. So the best solution is to use another tool, such as awk (or your choice of scripting languages, but awk will work fine).
Note, however, that awk and grep have subtly different regular expression implementations. It's not clear from the question whether the target strings are literal strings or regular expression patterns, and if the latter, what the expectations are. However, since the argument comes delimited with commas, I'm assuming that the pieces are simple strings and should not be interpreted as patterns.
If you want the strings to be interpreted as patterns, you can change index to match in the following little program:
ssh -q host tail -f /some/path |
awk -v STRINGS="$1" -v IGNORECASE=1 \
'BEGIN{split(STRINGS,strings,/,/)}
{for(i in strings)if(!index($0,strings[i]))next}
{print;fflush()}'
Note:
IGNORECASE is only available in GNU awk; in (most) other implementations, it will do nothing. It seems that is what you want, based on the fact that you used -i in your grep invocation.
fflush() is also an extension, although it works with both gawk and mawk. In POSIX awk, fflush requires an argument; if you were using POSIX awk, you'd be better off printing to stderr.
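For a quick local test of the same program without the ssh (sample input and search strings invented for illustration):
$ printf 'foo bar baz\nfoo qux\n' |
    awk -v STRINGS="foo,bar" -v IGNORECASE=1 \
    'BEGIN{split(STRINGS,strings,/,/)}
     {for(i in strings)if(!index($0,strings[i]))next}
     {print;fflush()}'
foo bar baz
Only the first line contains every string, so only it survives.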
You can use extended grep
egrep "$s1|$s2|$s3" fileName
If you don't know how many patterns you need to grep, but you have all of them in an array called s, you can use
egrep $(sed 's/ /|/g' <<< "${s[@]}") fileName
This creates a here-string with all elements of the array; sed replaces bash's field separator (space) with |, and feeding that to egrep matches every line containing any of the strings in the array s.
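For example, with a hypothetical two-element array you can see the alternation being built:
$ s=(foo bar)
$ sed 's/ /|/g' <<< "${s[@]}"
foo|bar
so the egrep call is equivalent to egrep 'foo|bar' fileName, i.e. it prints lines matching any of the strings.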
test.sh:
#!/bin/bash -x
a=" $#"
grep ${a// / -e } .bashrc
It works this way:
$ ./test.sh 1 2 3
+ a=' 1 2 3'
+ grep -e 1 -e 2 -e 3 .bashrc
(here is lots of text that fits all the arguments)
