Reasonably performant hex decoding in native bash? - bash

I have a 2MB file which is a sequence of hex values delimited by spaces. For example:
3F 41 56 00 00
Easy peasy to do this in Bash:
cat hex.txt | tr -s " " $'\n' | while read a; do
echo $a | xxd -r -p | tee -a ascii
done
or
f=$(cat hex.txt)
for a in $f; do
echo $a | xxd -r -p | tee -a ascii
done
Both are excruciatingly slow.
I whipped up a C program which converted the file in about two seconds, and later realized that I could have just done this:
cat hex.txt | xxd -r -p
As I've already converted the file and found an optimal solution, my question isn't about the conversion process itself but rather how to optimize my first two attempts as if the third were not possible. Is there anything to be done to speed up these one-liners or is Bash just too slow for this?

Try the following - unfortunately, the solution varies by awk implementation used:
# BSD/OSX awk
xargs printf '0x%s ' < hex.txt | awk -v RS=' ' '{ printf "%c", $0 }' > ascii
# GNU awk; option -n needed to support hex. numbers
xargs printf '0x%s ' < hex.txt | awk -n -v RS=' ' '{ printf "%c", $0 }' > ascii
# mawk - its printf "%c" will not treat a "0x..." string as a number, so convert it to a decimal number explicitly
awk -v RS=' ' '{ printf "%c", int(sprintf("%d", "0x" $0)) }' < hex.txt
With a 2MB input file, the timings on my late-2012 iMac with 3.2 GHz Intel Core i5 and a Fusion Drive, running OSX 10.10.3 are as follows:
BSD/OSX awk: ca. 1s
GNU awk: ca. 0.6s
mawk: ca. 0.5s
Contrast this with PSkocik's optimized-bash-loop solution: ca. 11s
It's tempting to think that the mawk solution, being a single command without a pipeline, should be the fastest with all awk implementations, but in practice it is not. Here's a version that works with all three implementations, supplying -n for GNU awk on demand:
awk $([[ $(awk --version 2>/dev/null) = GNU* ]] && printf %s -n) -v RS=' ' '{ printf "%c", int(sprintf("%d", "0x" $0)) }' < hex.txt
The speed increase comes from avoiding bash loops altogether and letting utilities do the work:
xargs printf '0x%s ' < hex.txt prefixes all values in hex.txt with 0x so that awk will later recognize them as hexadecimals.
Note that, depending on your platform, the command line that xargs constructs with all stdin input tokens as arguments may exceed the maximum command-line length as reported by getconf ARG_MAX - fortunately, xargs is smart enough to then invoke the command multiple times, fitting as many arguments as possible on the command line each time.
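You can watch this chunking behaviour with a toy example (illustrative only, using the sample bytes from the question and forcing tiny batches with -n):
printf '%s\n' 3F 41 56 00 00 | xargs -n 3 echo
# 3F 41 56
# 00 00
getconf ARG_MAX    # the actual per-invocation limit on your system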
awk -v RS=' ' '{ printf "%c", $0 }'
awk -v RS=' ' reads each space-separated token - i.e., each hex. value - as a separate input record
printf "%c", $0 then simply converts each record into its ASCII-character equivalent using printf.
Generally speaking:
Bash loops with large iteration counts are intrinsically slow.
It gets much worse if you also call an external utility in every iteration (see the quick timing illustration after this list).
For good performance with large iteration counts, avoid bash loops and let external utilities do the iteration work.
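A quick way to see the difference (a rough illustration; absolute timings vary by machine):
time for i in {1..10000}; do :; done           # builtin no-op only: typically well under a second
time for i in {1..10000}; do /bin/true; done   # one fork+exec per iteration: dramatically slower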

It's slow because you're invoking two programs, xxd and tee, in each iteration of the loop.
Using the printf builtin should be more loop-friendly, and you only need one instance of tee:
tr -s " " '\n' < hex.txt |
while read seq; do printf "\x$seq"; done |
tee -a ascii
(You might no longer need the -a switch to tee.)
(As an aside, if you want to use a scripting language, Ruby is another good choice besides awk:
tr -s " " '\n' < hex.txt | ruby -pe '$_ = $_.to_i(16).chr'
Much faster than the in-bash version.)

Well, you can drop the first cat and replace it by tr < hex.txt. Then you can also build a static conversion table and drop echo and xxd. But the loop will still be slow, and you can't get rid of that, I think.
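For illustration, here is one way that table idea could look; this is a sketch of my own (assuming bash 4+ for associative arrays), not the answerer's code, and note that bash variables cannot hold NUL bytes, so the value 00 would still need special handling:
declare -A tbl
for i in {1..255}; do
    printf -v hex '%02X' "$i"          # key, e.g. "3F"
    printf -v "tbl[$hex]" "\\x$hex"    # value: the corresponding byte
done
while read -r -a words; do
    for w in "${words[@]}"; do
        printf '%s' "${tbl[${w^^}]}"
    done
done < hex.txt > ascii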

Related

non-printable character not recognised as field separator

I have a file. Its field separator is the non-printable character \x1c (chr(28) in Python). In vi it looks like a^\b^\c, but using cat I just see abc; the field separator ^\ is not shown.
I have a simple awk command:
awk -F $'\x1c' '{print NF}' a
to get the total number of fields. It works on macOS, but on AIX it fails: it seems the awk there doesn't recognize the field separator, so the output is 1, meaning the whole line is treated as one field.
How to do this on AIX? Any idea is much appreciated.
I was able to reproduce this on SOLARIS running ksh.
sol bash $ printf '\034a\034b\034c' | cat -v
^\a^\b^\c$
sol bash $ printf '\034a\034b\034c' | awk -F$'\x1c' '{print NF}'
4
sol bash $ printf '\034a\034b\034c' | awk -F$'\034' '{print NF}'
4
sol ksh $ printf '\034a\034b\034c' | cat -v
^\a^\b^\c$
sol ksh $ printf '\034a\034b\034c' | awk -F$'\x1c' '{print NF}'
1
sol ksh $ printf '\034a\034b\034c' | awk -F$'\034' '{print NF}'
1
I cannot confirm if this is a ksh issue or awk issue, as other cases fail on both.
sol ksh/bash $ printf '\034a\034b\034c' | awk 'BEGIN{FS="\034"}{print NF}'
1
All the above cases work on any Linux system (which runs GNU awk by default), but here they failed gloriously.
The following trick is a workaround that cannot fail at all (until the point where it does fail):
sol ksh/bash $ printf '\034a\034b\034c' | awk 'BEGIN{FS=sprintf("%c",28)}{print NF}'
4
The above works because we let awk build the FS itself, using sprintf with the decimal number 28 (= 0x1c hex = 034 octal).
Well, $'\x1c' is a bashism; the portable form is "$(printf '\034')".
(This answer has already been written as a comment.)
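For example, the earlier test written with the portable form (a minimal sketch; on the awks above that mishandle a literal separator, the sprintf workaround may still be needed):
printf '\034a\034b\034c' | awk -F"$(printf '\034')" '{print NF}'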
When awk has issues, try Perl
$ cat -vT tonyren.txt
a^\b^\c^\d
p^\q^\r^\s
x^\y^\z
$ perl -F"\x1c" -le ' { print scalar @F } ' tonyren.txt
4
4
3
$

Splitting and looping over live command output in Bash

I am archiving and using split to produce several parts, while also printing the names of the output files (split reports them on stderr with --verbose, which I redirect to stdout). However, the loop over that output doesn't start until after the command returns.
Is there any way to actively loop over the output of a command before it returns?
The following is what I currently have, but it only prints the list of filenames after the command returns:
export IFS=$'\n'
for line in `data_producing_command | split -d -b $CHUNK_SIZE --verbose - $ARCHIVE_PREFIX 2>&1`; do
FILENAME=`echo $line | awk '{ print $3 }'`
echo " - $FILENAME"
done
Try this:
data_producing_command | split -d -b $CHUNK_SIZE --verbose - $ARCHIVE_PREFIX 2>&1 | while read -r line
do
FILENAME=`echo $line | awk '{ print $3 }'`
echo " - $FILENAME"
done
Note however that any variables set in the while loop will not preserve their values after the loop (the while loop runs in a subshell).
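If you do need those variables afterwards, a common workaround (a sketch of my own, not part of the original answer) is to feed the loop with process substitution, so the while loop runs in the current shell:
while read -r line
do
    FILENAME=$(awk '{ print $3 }' <<< "$line")
    echo " - $FILENAME"
done < <(data_producing_command | split -d -b "$CHUNK_SIZE" --verbose - "$ARCHIVE_PREFIX" 2>&1)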
There's no reason for the for loop or the read or the echo. Just pipe the stream to awk:
... | split -d -b $CHUNK_SIZE --verbose - test 2>&1 |
awk '{printf " - %s\n", $3 }'
You are going to see some delay from buffering, but unless your system is very slow or you are very perceptive, you're not likely to notice it.
The command substitution needs¹ to run to completion before the for loop can start:
for item in $(command which produces items); do ...
whereas a while read -r can start consuming output as soon as the first line is produced (or, more realistically, as soon as the output buffer is full):
command which produces items |
while read -r item; do ...
¹ Well, it doesn't absolutely need to, from a design point of view, I suppose, but that's how it currently works.
As William Pursell already noted, there is no particular reason to run Awk inside a while read loop, because that's something Awk does quite well on its own, actually.
command which produces items |
awk '{ print " - " $3 }'
Of course, with a reasonably recent GNU Coreutils split, you could simply do
split --filter='printf " - %s\n" "$FILE"; cat >"$FILE"' ... options

how to grep multiple variables in bash

I need to grep multiple strings, but I don't know the exact number of strings.
My code is :
s2=( $(echo $1 | awk -F"," '{ for (i=1; i<=NF ; i++) {print $i} }') )
for pattern in "${s2[@]}"; do
ssh -q host tail -f /some/path |
grep -w -i --line-buffered "$pattern" > some_file 2>/dev/null &
done
Now, the code is not doing what it's supposed to do. For example, if I run ./script s1,s2,s3,s4,...
it prints all lines that contain s1, s2, s3, ...
The script is supposed to do something like grep "$s1" | grep "$s2" | grep "$s3" ....
grep doesn't have an option to match all of a set of patterns. So the best solution is to use another tool, such as awk (or your choice of scripting languages, but awk will work fine).
Note, however, that awk and grep have subtly different regular expression implementations. It's not clear from the question whether the target strings are literal strings or regular expression patterns, and if the latter, what the expectations are. However, since the argument comes delimited with commas, I'm assuming that the pieces are simple strings and should not be interpreted as patterns.
If you want the strings to be interpreted as patterns, you can change index to match in the following little program:
ssh -q host tail -f /some/path |
awk -v STRINGS="$1" -v IGNORECASE=1 \
'BEGIN{split(STRINGS,strings,/,/)}
{for(i in strings)if(!index($0,strings[i]))next}
{print;fflush()}'
Note:
IGNORECASE is only available in GNU awk; in (most) other implementations it will do nothing. That seems to be what you want, based on the fact that you used -i in your grep invocation.
fflush() is also an extension, although it works with both gawk and mawk. In Posix awk, fflush requires an argument; if you were using Posix awk, you'd be better off printing to stderr.
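For instance, if the snippet above were saved as a script that takes the comma-separated list as its first argument (the script name and patterns here are just placeholders):
./matchall.sh error,timeout
# prints, as they arrive, only the lines that contain both "error" and "timeout"
# (case-insensitively under gawk, thanks to IGNORECASE)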
You can use extended grep
egrep "$s1|$s2|$s3" fileName
If you don't know how many pattern you need to grep, but you have all of them in an array called s, you can use
egrep $(sed 's/ /|/g' <<< "${s[@]}") fileName
This creates a herestring from all elements of the array, sed replaces bash's field separator (a space) with |, and feeding that to egrep matches every line that contains any of the strings in the array s.
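A quick illustration of what the command substitution produces (the array contents are examples):
s=(s1 s2 s3)
sed 's/ /|/g' <<< "${s[@]}"    # prints: s1|s2|s3
# so the call above becomes: egrep "s1|s2|s3" fileName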
test.sh:
#!/bin/bash -x
a=" $#"
grep ${a// / -e } .bashrc
it works that way:
$ ./test.sh 1 2 3
+ a=' 1 2 3'
+ grep -e 1 -e 2 -e 3 .bashrc
(followed by every line of .bashrc that matches any of the arguments)

"grep"ing first 12 of last 24 character from a line

I am trying to extract "first 12 of last 24 character" from a line, i.e.,
for a line:
species,subl,cmp= 1 4 1 s1,torque= 0.41207E-09-0.45586E-13
I need to extract "0.41207E-0".
(I did not write the code, so don't curse me for its formatting.)
I have managed to do this via:
var_s=`grep "species,subl,cmp= $3 $4 $5" $tfile |sed -n '$s/.*\(........................\)$/\1/p'|sed -n '$s/\(............\).*$/\1/p'`
but, is there any more readable way of doing this, rather then counting dots?
EDIT
Thanks to both of you;
So, I have sed, awk, grep and bash available.
I will run this in a loop over hundreds of files,
so can you also suggest which one is the most time-efficient?
One way with GNU sed (without counting dots):
$ sed -r 's/.*(.{11}).{12}/\1/' file
0.41207E-09
Similarly with GNU grep:
$ grep -Po '.{11}(?=.{12}$)' file
0.41207E-09
Perhaps a python solution may also be helpful:
python -c 'import sys;print "\n".join([a[-24:-13] for a in sys.stdin])' < file
0.41207E-09
I'm not sure your example data and question match up so just change the values in the {n} quantifier accordingly.
Simplest is using pure bash:
echo "${str:(-24):12}"
OR awk can also do that:
awk '{print substr($0, length($0)-23, 12)}' <<< $str
OUTPUT:
0.41207E-09
EDIT: For using bash solution on a file:
while read l; do echo "${l:(-24):12}"; done < file
Another one, less efficient but has the advantage of making you discover new tools
`echo "$str" | rev | cut -b 1-24 | rev | cut -b 1-12
You can use awk to get first 12 characters of last 24 characters from a line:
awk '{last24=substr($0, length($0)-23); print substr(last24, 1, 12)}' myfile.txt

Awk replace a column with its hash value

How can I replace a column with its hash value (like MD5) in awk or sed?
The original file is super huge, so I need this to be really efficient.
So, you don't really want to be doing this with awk. Any of the popular high-level scripting languages -- Perl, Python, Ruby, etc. -- would do this in a way that was simpler and more robust. Having said that, something like this will work.
Given input like this:
this is a test
(E.g., a row with four columns), we can replace a given column with its md5 checksum like this:
awk '{
tmp="echo " $2 " | openssl md5 | cut -f2 -d\" \""
tmp | getline cksum
$2=cksum
print
}' < sample
This relies on GNU awk (you'll probably have this by default on a Linux system), and it uses openssl to generate the md5 checksum. We first build a shell command line in tmp to pass the selected column to the md5 command. Then we pipe the output into the cksum variable, and replace column 2 with the checksum. Given the sample input above, the output of this awk script would be:
this 7e1b6dbfa824d5d114e96981cededd00 a test
I copy-pasted larsks's response, but added a close() call to avoid the problem described in this post: gawk / awk: piping date to getline *sometimes* won't work
awk '{
tmp="echo " $2 " | openssl md5 | cut -f2 -d\" \""
tmp | getline cksum
close(tmp)
$2=cksum
print
}' < sample
This might work using Bash/GNU sed:
<<<"this is a test" sed -r 's/(\S+\s)(\S+)(.*)/echo "\1 $(md5sum <<<"\2") \3"/e;s/ - //'
this 7e1b6dbfa824d5d114e96981cededd00 a test
or a mostly sed solution:
<<<"this is a test" sed -r 'h;s/^\S+\s(\S+).*/md5sum <<<"\1"/e;G;s/^(\S+).*\n(\S+)\s\S+\s(.*)/\2 \1 \3/'
this 7e1b6dbfa824d5d114e96981cededd00 a test
Both commands replace is in this is a test with its md5sum.
Explanation:
In the first: identify the columns and use back references as parameters in the bash command, which is substituted and evaluated (the e flag), then make a cosmetic change to drop the " - " (the file name md5sum appends for standard input).
In the second: similar to the first, but stash the input string in the hold space; then, after evaluating the md5sum command, append the hold space to the pattern space (G) and rearrange with a final substitution.
You can also do that with perl :
echo "aze qsd wxc" | perl -MDigest::MD5 -ne 'print "$1 ".Digest::MD5::md5_hex($2)." $3" if /([^ ]+) ([^ ]+) ([^ ]+)/'
aze 511e33b4b0fe4bf75aa3bbac63311e5a wxc
If you want to obfuscate large amounts of data, it might be faster than sed and awk, which need to fork an md5sum process for each line.
You might have a better time with read than awk, though I haven't done any benchmarking.
the input (scratch001.txt):
foo|bar|foobar|baz|bang|bazbang
baz|bang|bazbang|foo|bar|foobar
transformed using read:
while IFS="|" read -r one fish twofish red fishy bluefishy; do
twofish=`echo -n $twofish | md5sum | tr -d " -"`
echo "$one|$fish|$twofish|$red|$fishy|$bluefishy"
done < scratch001.txt
produces the output:
foo|bar|3858f62230ac3c915f300c664312c63f|baz|bang|bazbang
baz|bang|19e737ea1f14d36fc0a85fbe0c3e76f9|foo|bar|foobar
