apply logic condition to decimal values in awk [duplicate] - bash

I want gawk to parse number using comma , as the decimal point character.
So I set LC_NUMERIC to fr_FR.utf-8 but it does not work:
echo 123,2 | LC_NUMERIC=fr_FR.utf-8 gawk '{printf ("%.2f\n", $1 + 0) }'
123.00
The solution is to specify option --posix or export POSIXLY_CORRECT=1 but in this case the GNU awk extensions are not available, for example delete or the gensub function:
echo 123,2 | LC_NUMERIC=fr_FR.utf-8 gawk --posix '{printf ("%.2f\n", $1 + 0) }'
123,20
Is it possible to have gawk parsing number with , as decimal point without specifying
POSIX option?

The option you are looking for is:
--use-lc-numeric
This forces gawk to use the locale's decimal point character when parsing input data. Although the POSIX standard requires this
behavior, and gawk does so when --posix is in effect, the default is
to follow traditional behavior and use a period as the decimal
point, even in locales where the period is not the decimal point
character. This option overrides the default behavior, without the
full draconian strictness of the --posix option.
Demo:
$ echo 123,2 | LC_NUMERIC=fr_FR.utf-8 awk --use-lc-numeric '{printf "%.2f\n",$1}'
123,20
Notes: printf is a statement, not a function, so the parentheses are not required. Also, I'm not sure why you are adding zero here.

Related

print environment variables sorted by name including variables with newlines

I couldn't find an existing answer to this specific case: I would like to simply display all exported environment variables sorted by their name. Normally I can do this simply with:
$ env | sort
However, if some environment variables contain newlines in their values (as is the case on the CI system I'm working with), this does not work because the multi-line values will get mixed up with other variables.
Answering my own question since I couldn't find this elsewhere:
$ env -0 | sort -z | tr '\0' '\n'
env -0 separates each variable by a null character (which is more-or-less how they are already stored internally). sort -z uses null characters instead of newlines as the delimiter for fields to be sorted, and finally tr '\0' '\n' replaces the nulls with newlines again.
Note: env -0 and sort -z are non-standard extensions provided by the GNU coreutils versions of these utilities. I'm open to other ideas for how to do this with POSIX sort; I'm sure it is possible, but it might require a for loop or something, so it's not as easy as a one-liner.
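One possibility, as a rough sketch (printenv is not strictly POSIX but widely available; multi-line values still print across several lines, just in sorted order, and trailing newlines in values are lost to the command substitution):
# Print the variable names via awk's ENVIRON array, sort them,
# then look each value up again by name:
awk 'BEGIN { for (k in ENVIRON) print k }' | sort |
while IFS= read -r name; do
    printf '%s=%s\n' "$name" "$(printenv "$name")"
done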
The bash builtin export prints a sorted list of envars:
export -p | sed 's/declare -x //'
Similarly, to print a sorted list of exported functions (without their definitions):
export -f | grep 'declare -fx' | sed 's/declare -fx //'
In a limited environment where env -0 is not available, e.g. Alpine 3.13 or 3.14 (where the commands are simplified busybox versions), you can use awk:
awk 'BEGIN { for (K in ENVIRON) { printf "%s=%s%c", K, ENVIRON[K], 0; }}' | sort -z | tr '\0' '\n'
This uses awk to print each environment variable terminated with a null, simulating env -0. Note that setting ORS to null (-vORS='\0') does not work in this limited version of awk, and neither does printing \0 directly in the printf format, hence the %c conversion to print a 0 byte.
Busybox awk lacks any sort functions, hence the remainder of the answer is the same as the top one.
env | sort -f
Worked for me.
The -f option makes sort ignore case, which is what you probably want 99% of the time.

Sed command to replace numbers between space and :

I have a file with records like the one below:
FIRST 1: SECOND 2: THREE 4: FIVE 255: SIX 255
I want to remove values between space and :
FIRST:SECOND:THREE:FIVE:SIX
with this code:
awk -F '[[:space:]]*,:*' '{$1=$1}1' OFS=, file
tried on gnu awk:
awk -F' [0-9]*(: *|$)' -vOFS=':' '{print $1,$2,$3,$4,$5}' file
tried on gnu sed:
sed -E 's/\s+[0-9]+(:|$)\s*/\1/g' file
Explanation of the awk command:
The field separator regex matches a space, followed by digits ([0-9]*), followed by either a literal : with optional trailing spaces (: *) or the end of the line ($). Everything that is not matched by this pattern, i.e. FIRST, SECOND, and so on, becomes the fields $1, $2, ..., because the -F option sets the pattern as the field separator (FS). To make the output look right, the fields are re-joined with : by setting the output field separator with -vOFS=':'.
You can also add [[:digit:]] with a trailing asterisk to the field separator, and set OFS to the empty string (nothing after OFS=):
$ awk -F '[[:space:]][[:digit:]]*' '{$1=$1}1' OFS= file
FIRST:SECOND:THREE:FIVE:SIX
To get the output we want in idiomatic awk, we make the input field separator (with -F) contain all the stuff we want to eliminate (anchored with :), and make the output field separator (OFS) what we want it replaced with. The catch is that this won't eliminate the space and numbers at the end of the line, and for this we need to do something more. GNU's implementation of awk will allow us to use a regular expression for the input record separator (RS), but we could just do a simple sub() with POSIX-compliant awk as well. Finally, force recalculation via $1=$1... the side effects of this pattern/statement are that the buffer will be recalculated, doing the FS/RS substitution for us, and that non-blank lines will take the default action -- which is to print.
gawk -F '[[:space:]]*[[:digit:]]*:[[:space:]]*' -v OFS=: -v RS='[[:space:]]*[[:digit:]]*\n' '$1=$1' file
Or:
awk -F '[[:space:]]*[[:digit:]]*:[[:space:]]*' -v OFS=: '{ sub(/[[:space:]]*[[:digit:]]*$/, "") } $1=$1' file
A sed implementation is fun but probably slower (because current versions of awk have better regex implementations).
sed 's/[[:space:]]*[[:digit:]]*:[[:space:]]/:/g; s/[[:space:]]*[[:digit:]]*[[:space:]]*$//' file
Or if POSIX character classes are not available...
sed 's/[\t ]*[0-9]*:[\t ]/:/g; s/[\t ]*[0-9]*[\t ]*$//' file
Something tells me that your “FIRST, SECOND, THIRD...” might be more complicated, and might contain digits... in this case, you might want to experiment with replacing * with + for awk or with \+ for sed.
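For instance, a variant of the sed command above requiring at least one digit might look like this (a sketch; \+ is a GNU sed extension, and [0-9][0-9]* is the portable spelling):
# Same as the sed command above, but each number must contain at least one digit:
sed 's/[\t ]*[0-9]\+:[\t ]/:/g; s/[\t ]*[0-9]\+[\t ]*$//' file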

How can I determine the number of fields in a CSV, from the shell?

I have a well-formed CSV file, which may or may not have a header line; and may or may not have quoted data. I want to determine the number of columns in it, using the shell.
Now, if I can be sure there are no quoted commas in the file, the following seems to work:
x=$(tail -1 00-45-19-tester-trace.csv | grep -o , | wc -l); echo $((x + 1))
but what if I can't make that assumption? That is, what if I can't assume a comma is always a field separator? How do I do it then?
If it helps, you're allowed to make the assumption that there are no quoted quotes (i.e. \"s within quoted strings); but it would be better not to make that assumption either.
If you cannot make any optimistic assumptions about the data, then there won't be a simple solution in Bash. It's not trivial to parse a general CSV format with possible embedded newlines and embedded separators. You're better off not writing that in bash, but using an existing proper CSV parser. For example, Python has one built into its standard library.
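For example, a sketch that calls Python's csv module from the shell (assuming python3 is on the PATH, the file is named input.csv, and the first line is a complete record with no embedded newlines):
# Count the fields of the first record, honouring quoted commas:
head -n 1 input.csv | python3 -c 'import csv, sys; print(len(next(csv.reader(sys.stdin))))'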
If you can assume that there are no embedded newlines and no embedded separators, then it's simple to split on commas using awk:
awk -F, '{ print NF; exit }' input.csv
-F, tells awk to use comma as the field separator, and the automatic NF variable is the number of fields on the current line.
If you want to allow embedded separators, but you can assume no embedded double quotes, then you can eliminate the embedded separators with a simple filter, before piping to the same awk as earlier:
head -n 1 input.csv | sed -e 's/"[^"]*"//g' | awk ...
Note that both of these examples use the first line to decide the number of fields. If the input has a header line, this should work quite well, as the header should not contain embedded newlines.
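For reference, the two pieces above combined into one pipeline:
# Strip the quoted sections (and any commas inside them) from the header line,
# then count the remaining comma-separated fields:
head -n 1 input.csv | sed -e 's/"[^"]*"//g' | awk -F, '{ print NF; exit }'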
Count the fields in the first row, then verify that all rows have the same number:
CNT=$(head -n1 hhdata.csv | awk -F ',' '{print NF}')
cat hhdata.csv | awk -F ',' '{print NF}' | grep -v $CNT
Doesn't cope with embedded commas but will highlight if they exist
If the file has no double quotes, then use the command below:
awk -F"," '{ print NF }' filename| sort -u
If the file has every column enclosed in double quotes, then use the command below:
awk -F, '{gsub(/"[^"]*"/,x);print NF}' filename | sort -u

number of unique words in a document

I have a very large txt file (500GiB), and I want to get the number of its unique words. I tried this, but it seems to be very slow because of the sort:
grep -o -E '\w+' temp | sort -u -f | wc -l
Is there any better way of doing this?
awk to the rescue!
$ awk -v RS=" " '{a[$0]++} END{for(k in a) sum++; print sum}' file
UPDATE:
It's probably better to do preprocessing with tr and let awk do the counting economically. You may want to delimit the words with spaces or newlines.
For example:
$ tr ':;,?!\"' ' ' < file | tr -s ' ' '\n' | awk '!a[$0]++{c++} END{print c}'
You can rely on awk's default behavior to split lines into words by runs of whitespace, and use its associative arrays:
awk '{ for (i=1; i<=NF; ++i) a[tolower($i)]++ } END { print length(a) }' file
Update: As @rici points out in a comment, whitespace-separated tokens may include punctuation and other characters besides _, and are thus not necessarily the same as grep's \w+ construct. @4ae1e1 therefore suggests using a field separator along the lines of '[^[:alnum:]_]'. Note that this will result in each component of a hyphenated word being counted separately; similarly, ' (apostrophe) separates words.
awk -F '[^[:alnum:]_]+' '{ for (i=1; i<=NF; ++i) { a[tolower($i)]++ } }
END { print length(a) - ("" in a) }' file
Associative array a is built in a way that counts the occurrence of each distinct word encountered in the input, converted to lowercase first so as to ignore differences in case - if you do NOT want to ignore case differences, simply remove the tolower() call.
CAVEAT: It seems that Mawk and BSD Awk aren't locale-aware, so tolower() won't work properly with non-ASCII characters.
On having processed all words, the number of elements of a equals the number of unique words.
NOTE: The POSIX-compliant reformulation of print length(a) is: for (k in a) ++count; print count
The above will work with GNU Awk, Mawk (1.3.4+), and BSD Awk, even though it isn't strictly POSIX-compliant (POSIX defines the length function only for strings, not arrays).
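A sketch of the same command with that POSIX-compliant reformulation applied (keeping the - ("" in a) adjustment from the command above):
awk -F '[^[:alnum:]_]+' '{ for (i=1; i<=NF; ++i) a[tolower($i)]++ }
  END { for (k in a) ++count; print count - ("" in a) }' file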
An important feature of sorting is that it is locale-aware, and therefore much more expensive in any locale other than C. Since you don't really care about the order here, you might as well tell sort to ignore the locale by using LC_ALL=C sort -u -f. If your locale is set to something else, that will probably cut your execution time in half.
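For example, applied to the pipeline from the question:
# Same pipeline, but sort runs in the C locale, avoiding expensive
# locale-aware comparisons:
grep -o -E '\w+' temp | LC_ALL=C sort -u -f | wc -l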
The original version of this answer suggested that you should only do this if you don't care about non-ascii characters. However, if you are using Gnu coreutils, it turns out that none of this stuff will work in UTF-8 locales. While gnu sort will do a locale-aware string comparison in any locale (using the strxfrm standard library function), sort -f only does locale-aware case-folding in single-byte locales. Gnu uniq -i has the same problem. And tr only translates single-byte characters (by design, afaik); in theory [:alpha:] is locale-aware, but only for characters representable as single bytes.
In short, if you want to use sort -u -f, you might as well specify the C locale. That is no less broken for non-English letters, but at least the breakage doesn't waste time.
Gnu awk's tolower() function does apparently work on multibyte locales. So check out one of the awk answers if you need this to work in a UTF-8 locale.

hexadecimal literals in awk patterns

awk is capable of parsing fields as hexadecimal numbers:
$ echo "0x14" | awk '{print $1+1}'
21 <-- correct, since 0x14 == 20
However, it does not seem to handle actions with hexadecimal literals:
$ echo "0x14" | awk '$1+1<=21 {print $1+1}' | wc -l
1 <-- correct
$ echo "0x14" | awk '$1+1<=0x15 {print $1+1}' | wc -l
0 <-- incorrect. awk is not properly handling the 0x15 here
Is there a workaround?
You're dealing with two similar but distinct issues here, non-decimal data in awk input, and non-decimal literals in your awk program.
See the POSIX-1.2004 awk specification, Lexical Conventions:
8. The token NUMBER shall represent a numeric constant. Its form and numeric value [...]
with the following exceptions:
a. An integer constant cannot begin with 0x or include the hexadecimal digits 'a', [...]
So awk (presumably you're using nawk or mawk) behaves "correctly". gawk (since version 3.1) supports non-decimal (octal and hex) literal numbers by default, though using the --posix switch turns that off, as expected.
The normal workaround in cases like this is to use the defined numeric string behaviour, where a numeric string is effectively parsed as if by the C standard atof() or strtod() function, which supports 0x-prefixed numbers:
$ echo "0x14" | nawk '$1+1<=0x15 {print $1+1}'
<no output>
$ echo "0x14" | nawk '$1+1<=("0x15"+0) {print $1+1}'
21
The problem here is that that isn't quite correct, as POSIX-1.2004 also states:
A string value shall be considered a numeric string if it comes from one of the following:
1. Field variables
...
and after all the following conversions have been applied, the resulting string would
lexically be recognized as a NUMBER token as described by the lexical conventions in Grammar
UPDATE: gawk aims for "2008 POSIX.1003.1". Note, however, that since the 2008 edition (see the IEEE Std 1003.1 2013 edition awk here), strtod() and implementation-dependent behaviour are allowed, which does not require the number to conform to the lexical conventions. This should (implicitly) support INF and NAN too. The text in Lexical Conventions is similarly amended to optionally allow hexadecimal constants with 0x prefixes.
This won't behave (given the lexical constraint on numbers) quite as hoped in gawk:
$ echo "0x14" | gawk '$1+1<=0x15 {print $1+1}'
1
(note the "wrong" numeric answer, which would have been hidden by |wc -l)
unless you use --non-decimal-data too:
$ echo "0x14" | gawk --non-decimal-data '$1+1<=0x15 {print $1+1}'
21
See also:
https://www.gnu.org/software/gawk/manual/html_node/Nondecimal_002dnumbers.html
http://www.gnu.org/software/gawk/manual/html_node/Variable-Typing.html
This accepted answer to this SE question has a portability workaround.
The options for having the two types of support for non-decimal numbers are:
use only gawk, without --posix and with --non-decimal-data
implement a wrapper function to perform hex-to-decimal, and use this both with your literals and on input data
If you search for "awk dec2hex" you can find many instances of the latter, a passable one is here: http://www.tek-tips.com/viewthread.cfm?qid=1352504 . If you want something like gawk's strtonum(), you can get a portable awk-only version here.
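For illustration, a minimal wrapper of that kind might look like this (a sketch; hex2dec is a hypothetical name, and it handles only plain decimal strings and non-negative 0x-prefixed hex):
echo "0x14" | awk '
function hex2dec(s,   n, i, c, digits) {
    if (s !~ /^0[xX]/) return s + 0          # plain decimal: let awk convert it
    digits = "0123456789abcdef"
    s = tolower(substr(s, 3))                # strip the 0x prefix
    for (i = 1; i <= length(s); i++) {
        c = substr(s, i, 1)
        n = n * 16 + index(digits, c) - 1    # accumulate one hex digit at a time
    }
    return n
}
hex2dec($1) + 1 <= hex2dec("0x15") { print hex2dec($1) + 1 }'
This should print 21 with a POSIX-style awk.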
Are you stuck with an old awk version? I don't know of a way to do mathematics with hexadecimal numbers with it (you will have to wait for better answers :-). I can contribute an option of gawk:
-n, --non-decimal-data: Recognize octal and hexadecimal values in input data. Use this option with great caution!
So both
echo "0x14" | awk -n '$1+1<=21 {print $1+1}'
and
echo "0x14" | awk -n '$1+1<=0x15 {print $1+1}'
return
21
Whatever awk you're using seems to be broken, or non-POSIX at least:
$ echo '0x14' | /usr/xpg4/bin/awk '{print $1+1}'
1
$ echo '0x14' | nawk '{print $1+1}'
1
$ echo '0x14' | gawk '{print $1+1}'
1
$ echo '0x14' | gawk --posix '{print $1+1}'
1
Get GNU awk and use strtonum() everywhere you could have a hex number:
$ echo '0x14' | gawk '{print strtonum($1)+1}'
21
$ echo '0x14' | gawk 'strtonum($1)+1<=21{print strtonum($1)+1}'
21
$ echo '0x14' | gawk 'strtonum($1)+1<=strtonum(0x15){print strtonum($1)+1}'
21
