non-printable character not recognised as field separator - shell

I have a file. Its field separator is the non-printable character \x1c (chr(28) in Python). In vi the file looks like a^\b^\c, but with cat I just see abc; the field separator ^\ is not displayed.
I have a simple awk command:
awk -F $'\x1c' '{print NF}' a
to get the total number of fields. It works on macOS, but on AIX it fails: AIX doesn't seem to recognize the field separator, so the output is 1, meaning the whole line is treated as a single field.
How to do this on AIX? Any idea is much appreciated.

I was able to reproduce this on Solaris running ksh.
sol bash $ printf '\034a\034b\034c' | cat -v
^\a^\b^\c$
sol bash $ printf '\034a\034b\034c' | awk -F$'\x1c' '{print NF}'
4
sol bash $ printf '\034a\034b\034c' | awk -F$'\034' '{print NF}'
4
sol ksh $ printf '\034a\034b\034c' | cat -v
^\a^\b^\c$
sol ksh $ printf '\034a\034b\034c' | awk -F$'\x1c' '{print NF}'
1
sol ksh $ printf '\034a\034b\034c' | awk -F$'\034' '{print NF}'
1
I cannot confirm whether this is a ksh issue or an awk issue, as other cases fail under both shells.
sol ksh/bash $ printf '\034a\034b\034c' | awk 'BEGIN{FS="\034"}{print NF}'
1
All the above cases work on any Linux system (which runs GNU awk by default), but here they fail gloriously.
The following trick is a workaround that cannot fail at all (until the point where it will fail):
sol ksh/bash $ printf '\034a\034b\034c' | awk 'BEGIN{FS=sprintf("%c",28)}{print NF}'
4
The above works because we let awk set FS itself, using the sprintf function, to which we pass the decimal number 28 (= 0x1c hex = 034 octal).

Well, $'\x1c' is a bashism; the portable form is "$(printf '\034')".
(This answer has already been written as a comment.)
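For example, the portable invocation of the original command would be (a minimal sketch, reusing the file a from the question):
awk -F "$(printf '\034')" '{print NF}' a
This works in any POSIX shell, since the separator character is produced by printf rather than by $'...' quoting.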

When awk has issues, try Perl
$ cat -vT tonyren.txt
a^\b^\c^\d
p^\q^\r^\s
x^\y^\z
$ perl -F"\x1c" -lane 'print scalar @F' tonyren.txt
4
4
3
$

Related

Counting Python files with bash and awk always returns zero

I want to count the Python files on my desktop and I have coded a small script for that. But the awk command does not work as I expected.
script
ls -l | awk '{ if($NF=="*.py") print $NF; }' | wc -l
I know that there are other ways to count Python files on a PC, but I just want to know what I am doing wrong here.
ls -l | awk '{ if($NF=="*.py") print $NF; }' | wc -l
Your code counts files literally named *.py. You should use awk's regex-match operator instead; after fixing that, your code becomes
ls -l | awk '{ if($NF~/[.]py$/) print $NF; }' | wc -l
note [.], which denotes a literal ., and $, which anchors the match at the end of the string.
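A quick illustration with made-up file names (script_py does not match, because the dot is literal):
$ printf 'a.py\nscript_py\nb.py\n' | awk '$NF~/[.]py$/{print $NF}'
a.py
b.py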
Your code might be further improved, as there is no need to use if here; a pattern-action pair will do. That is
ls -l | awk '$NF~/[.]py$/{ print $NF; }' | wc -l
Moreover, you might easily implement the counting inside awk itself rather than deploying wc -l, as follows
ls -l | awk '$NF~/[.]py$/{t+=1}END{print t}'
Here, t is increased by 1 for every matching line, and after all input is processed, that is, in the END block, it is printed. Observe that there is no need to declare the variable t in awk.
Don't try to parse the output of ls, see https://mywiki.wooledge.org/ParsingLs.
Beyond that, your awk script is failing because $NF=="*.py" is doing a literal string comparison of the last string of non-spaces against *.py, when you probably wanted a regexp comparison such as $NF~/\.py$/, and your print $NF would fail for any file names containing spaces.
If you really want to involve awk in this for some reason then, assuming the list of python files doesn't exceed ARG_MAX, it'd be:
awk 'BEGIN{print ARGC-1; exit}' *.py
but you could just do it in bash:
shopt -s nullglob
files=(*.py)
echo "${#files[#]}"
or if you want to have a pipe to wc -l for some reason and your files can't have newlines in their names then:
printf '%s\n' *.py | wc -l
gfind . -maxdepth 1 -type f -name "*.py" -print0 |   # NUL-delimited file names
{m,g}awk 'END { print NR }' RS='\0' FS='^$'          # one record per NUL-terminated name, so NR is the count
or
{m,g}awk 'END { print --NF }' RS='^$' FS='\0'        # whole input as one record split on NULs; NF-1 drops the trailing empty field
879

Awk double-slash record separator

I am trying to separate RECORDS of a file based on the string, "//".
What I've tried is:
awk -v RS="//" '{ print "******************************************\n\n"$0 }' myFile.gb
Where the "******" etc, is just a trace to show me that the record is split.
However, the file also contains single / characters, and my trace ****** is printed at those as well, meaning that awk is also interpreting them as my record separator.
How can I get awk to split records only on //?
UPDATE: I am running on Unix (the one that comes with OS X)
I found a temporary solution, being:
sed s/"\/\/"/"*"/g | awk -v RS="*" ...
But there must be a better way, especially with massive files that I am working with.
On a Mac, awk version 20070501 does not support multi-character RS. Here's an illustration using such an awk, and a comparison (on the same machine) with gawk:
$ /usr/bin/awk --version
awk version 20070501
$ /usr/bin/awk -v RS="//" '{print NR ":" $0}' <<< x//y//z
1:x
2:
3:y
4:
5:z
$ gawk -v RS="//" '{print NR ":" $0}' <<< x//y//z
1:x
2:y
3:z
If you cannot find a suitable awk, then pick a better character than *. For example, if tabs are acceptable, and if your shell supports $'...', then you could use this incantation of sed:
sed $'s,//,\t,g'
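Putting it together for the original command, a sketch assuming myFile.gb contains no tabs and that your shell supports $'...':
sed $'s,//,\t,g' myFile.gb |
awk -v RS=$'\t' '{ print "******************************************\n\n" $0 }'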

Reasonably performant hex decoding in native bash?

I have a 2MB file which is a sequence of hex values delimited by spaces. For example:
3F 41 56 00 00
Easy peasy to do this in Bash:
cat hex.txt | tr -s " " $'\n' | while read a; do
  echo $a | xxd -r -p | tee -a ascii
done
or
f=$(cat hex.txt)
for a in $f; do
  echo $a | xxd -r -p | tee -a ascii
done
Both are excruciatingly slow.
I whipped up a C program which converted the file in about two seconds, and later realized that I could have just done this:
cat hex.txt | xxd -r -p
As I've already converted the file and found an optimal solution, my question isn't about the conversion process itself, but rather about how to optimize my first two attempts, as if the third were not possible. Is there anything to be done to speed up these one-liners, or is Bash just too slow for this?
Try the following - unfortunately, the solution varies by awk implementation used:
# BSD/OSX awk
xargs printf '0x%s ' < hex.txt | awk -v RS=' ' '{ printf "%c", $0 }' > ascii
# GNU awk; option -n needed to support hex. numbers
xargs printf '0x%s ' < hex.txt | awk -n -v RS=' ' '{ printf "%c", $0 }' > ascii
# mawk - sadly, printf "%c" only works with letters and numbers if the input is *hex*
awk -v RS=' ' '{ printf "%c", int(sprintf("%d", "0x" $0)) }' < hex.txt
With a 2MB input file, the timings on my late-2012 iMac with 3.2 GHz Intel Core i5 and a Fusion Drive, running OSX 10.10.3 are as follows:
BSD/OSX awk: ca. 1s
GNU awk: ca. 0.6s
mawk: ca. 0.5s
Contrast this with PSkocik's optimized-bash-loop solution: ca. 11s
It's tempting to think that the mawk solution, given that it's a single command without a pipeline, should be the fastest with all awk implementations, but in practice it is not. Here's a version that works with all three implementations, supplying -n on demand when awk is GNU awk:
awk $([[ $(awk --version 2>/dev/null) = GNU* ]] && printf %s -n) -v RS=' ' '{ printf "%c", int(sprintf("%d", "0x" $0)) }' < hex.txt
The speed increase comes from avoiding bash loops altogether and letting utilities do the work:
xargs printf '0x%s ' < hex.txt prefixes all values in hex.txt with 0x so that awk will later recognize them as hexadecimals.
Note that, depending on your platform, the command line that xargs constructs with all stdin input tokens as arguments may exceed the maximum command-line length as reported by getconf ARG_MAX - fortunately, xargs is smart enough to then invoke the command multiple times, fitting as many arguments as possible on the command line each time.
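A rough way to see this batching in action (the exact count is system-dependent, but on typical systems it prints a number greater than 1, one line per echo invocation):
seq 200000 | xargs echo | wc -l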
awk -v RS=' ' '{ printf "%c", $0 }'
awk -v RS=' ' reads each space-separated token - i.e., each hex. value - as a separate input record
printf "%c", $0 then simply converts each record into its ASCII-character equivalent using printf.
Generally speaking:
Bash loops with large iteration counts are intrinsically slow.
It gets much worse if you also call an external utility in every iteration.
For good performance with large iteration counts, avoid bash loops and let external utilities do the iteration work.
It's slow because you're invoking two programs,
xxd and tee,
in each iteration of the loop.
Using the printf builtin should be more loop-friendly, and you only need one instance of tee:
tr -s " " '\n' < hex.txt |
while read seq; do printf "\x$seq"; done |
tee -a ascii
(You might not need the -a switch to tee anymore.)
(If you want to use a scripting language, Ruby is another good choice besides awk:
tr -s " " '\n' < hex.txt | ruby -pe '$_ = $_.to_i(16).chr'
Much faster than the in-bash version.)
Well, you can drop the first cat and replace it by tr < hex.txt. Then you can also build a static conversion table and drop echo and xxd. But the loop will still be slow, and you can't get rid of that, I think.
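A minimal sketch of that conversion-table idea, assuming bash 4+ for associative arrays (note that bash variables cannot hold NUL bytes, so the 00 entry is skipped):
declare -A table                      # hex string -> character
for i in {1..255}; do
    printf -v hex '%02X' "$i"         # key, e.g. "3F"
    printf -v chr "\\x$hex"           # value, e.g. "?"
    table[$hex]=$chr
done
while read -r -a tokens; do           # one line of space-separated hex values
    for h in "${tokens[@]}"; do
        printf '%s' "${table[${h^^}]}"
    done
done < hex.txt > ascii
This still loops in bash, so it only removes the cost of spawning echo and xxd per value; the stream-based solutions above remain much faster.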

How to write a bash script and define awk constants on the command line [duplicate]

This question already has answers here:
Using awk with variables
(3 answers)
Closed 8 years ago.
As part of my bash script, I want to pass some constants from the command line to awk. For example, I want to subtract constant1 from column 1 and constant2 from column 5
$ sh bash.sh infile 0.54 0.32
#!/bin/bash
#infile = $1
#constant1 = $2
#constant2 = $3
cat $1 | awk '{print $1"\t"$2"\t"$3"\t"$4"\t"$5"\t"$6}'
Thank you very much for your help.
As awk is its own language, by default it does not share Bash's variables. To use Bash variables in an awk command, you should pass them to awk using the -v option.
#!/bin/bash
awk -v constant1="$2" -v constant2="$3" '{print($1-constant1),($5-constant2)}' "$1"
You'll notice I removed cat as there is no need to pipe cat into awk since awk can read from files.
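A quick check with hypothetical sample data (the infile contents here are made up; column 1 minus 0.54 and column 5 minus 0.32 are printed):
$ cat infile
1.54 a b c 2.32 d
$ sh bash.sh infile 0.54 0.32
1 2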
You need to remove the spaces when defining variables:
#!/bin/bash
infile=$1
constant1=$2
constant2=$3
cat $1 | awk '{print $1"\t"$2"\t"$3"\t"$4"\t"$5"\t"$6}'

hexadecimal literals in awk patterns

awk is capable of parsing fields as hexadecimal numbers:
$ echo "0x14" | awk '{print $1+1}'
21 <-- correct, since 0x14 == 20
However, it does not seem to handle actions with hexadecimal literals:
$ echo "0x14" | awk '$1+1<=21 {print $1+1}' | wc -l
1 <-- correct
$ echo "0x14" | awk '$1+1<=0x15 {print $1+1}' | wc -l
0 <-- incorrect. awk is not properly handling the 0x15 here
Is there a workaround?
You're dealing with two similar but distinct issues here, non-decimal data in awk input, and non-decimal literals in your awk program.
See the POSIX-1.2004 awk specification, Lexical Conventions:
8. The token NUMBER shall represent a numeric constant. Its form and numeric value [...]
with the following exceptions:
a. An integer constant cannot begin with 0x or include the hexadecimal digits 'a', [...]
So awk (presumably you're using nawk or mawk) behaves "correctly". gawk (since version 3.1) supports non-decimal (octal and hex) literal numbers by default, though using the --posix switch turns that off, as expected.
The normal workaround in cases like this is to use the defined numeric-string behaviour, where a numeric string is effectively parsed as if by the C standard atof() or strtod() function, which supports 0x-prefixed numbers:
$ echo "0x14" | nawk '$1+1<=0x15 {print $1+1}'
<no output>
$ echo "0x14" | nawk '$1+1<=("0x15"+0) {print $1+1}'
21
The problem here is that that isn't quite correct, as POSIX-1.2004 also states:
A string value shall be considered a numeric string if it comes from one of the following:
1. Field variables
...
and after all the following conversions have been applied, the resulting string would
lexically be recognized as a NUMBER token as described by the lexical conventions in Grammar
UPDATE: gawk aims for "2008 POSIX.1003.1". Note, however, that the 2008 edition (see the awk page of the IEEE Std 1003.1, 2013 edition) allows strtod() and implementation-dependent behaviour that does not require the number to conform to the lexical conventions. This should (implicitly) support INF and NAN too. The text in Lexical Conventions is similarly amended to optionally allow hexadecimal constants with 0x prefixes.
This won't behave (given the lexical constraint on numbers) quite as hoped in gawk:
$ echo "0x14" | gawk '$1+1<=0x15 {print $1+1}'
1
(note the "wrong" numeric answer, which would have been hidden by |wc -l)
unless you use --non-decimal-data too:
$ echo "0x14" | gawk --non-decimal-data '$1+1<=0x15 {print $1+1}'
21
See also:
https://www.gnu.org/software/gawk/manual/html_node/Nondecimal_002dnumbers.html
http://www.gnu.org/software/gawk/manual/html_node/Variable-Typing.html
This accepted answer to this SE question has a portability workaround.
The options for having the two types of support for non-decimal numbers are:
use only gawk, without --posix and with --non-decimal-data
implement a wrapper function to perform hex-to-decimal, and use this both with your literals and on input data
If you search for "awk dec2hex" you can find many instances of the latter, a passable one is here: http://www.tek-tips.com/viewthread.cfm?qid=1352504 . If you want something like gawk's strtonum(), you can get a portable awk-only version here.
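For illustration, here is a hand-rolled, portable hex2dec in plain awk, a sketch of the wrapper-function approach (the function is my own, not taken from the linked pages):
echo "0x14" | awk '
  function hex2dec(s,    i, d, n) {
    s = tolower(s); sub(/^0x/, "", s)    # normalize and strip the 0x prefix
    n = 0
    for (i = 1; i <= length(s); i++) {
      d = index("0123456789abcdef", substr(s, i, 1)) - 1
      n = n * 16 + d
    }
    return n
  }
  hex2dec($1) + 1 <= hex2dec("0x15") { print hex2dec($1) + 1 }'
This prints 21 with nawk, mawk, and gawk alike, since it relies only on POSIX awk string functions.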
Are you stuck with an old awk version? I don't know of a way to do mathematics with hexadecimal numbers with it (you will have to wait for better answers :-). I can contribute an option of Gawk:
-n, --non-decimal-data: Recognize octal and hexadecimal values in input data. Use this option with great caution!
So, both
echo "0x14" | awk -n '$1+1<=21 {print $1+1}'
and
echo "0x14" | awk -n '$1+1<=0x15 {print $1+1}'
return
21
Whatever awk you're using seems to be broken, or non-POSIX at least:
$ echo '0x14' | /usr/xpg4/bin/awk '{print $1+1}'
1
$ echo '0x14' | nawk '{print $1+1}'
1
$ echo '0x14' | gawk '{print $1+1}'
1
$ echo '0x14' | gawk --posix '{print $1+1}'
1
Get GNU awk and use strtonum() everywhere you could have a hex number:
$ echo '0x14' | gawk '{print strtonum($1)+1}'
21
$ echo '0x14' | gawk 'strtonum($1)+1<=21{print strtonum($1)+1}'
21
$ echo '0x14' | gawk 'strtonum($1)+1<=strtonum(0x15){print strtonum($1)+1}'
21
