Capturing special characters from stdin to a shell variable - bash

I have a program which prints something that contains null bytes \0 and special characters like \x1f and newlines. For instance:
someprogram
#!/bin/bash
printf "ALICE\0BOB\x1fCHARLIE\n"
Given such a program, I want to read its output in such a way that all those special characters are captured in a shell variable output. So, if I run:
echo $output
because I'm not giving -e, I'd want the output to be:
ALICE\0BOB\x1fCHARLIE\n
How can this be achieved?
My first attempt was:
output=$(someprogram)
But I got this echoed output which doesn't have the special characters:
./myscript.sh: line 2: warning: command substitution: ignored null byte in input
ALICEBOBCHARLIE
I also tried to use read as follows:
output=""
while read -r
do
output="$output$REPLY"
done < <(someprogram)
Then I got rid of the warning but the output is still missing all special characters:
ALICEBOBCHARLIE
So how can I capture the output of someprogram in such a way that I have all the special characters in my resulting string?
EDIT: Note that it is possible to have such strings in bash:
$ x="ALICE\0BOB\x1fCHARLIE\n"
$ echo $x
ALICE\0BOB\x1fCHARLIE\n
So that shouldn't be the problem.
EDIT2: I'll reformulate the question a little bit now that I got an accepted answer and I understood things a little bit better. So, I just needed to be able to store the output of someprogram in some shell variable in such a way that I can print it to stdout without any changes in any special characters as if someprogram was just piped directly to stdout.

You just can't store a NUL (zero) byte in a bash variable. It's impossible.
The usual solution is to convert the stream of bytes into hexadecimal. Then convert it back each time you want to do something with it.
$ x=$(printf "ALICE\0BOB\x1fCHARLIE\n" | xxd -p)
$ echo "$x"
414c49434500424f421f434841524c49450a
$ <<<"$x" xxd -p -r | hexdump -C
00000000 41 4c 49 43 45 00 42 4f 42 1f 43 48 41 52 4c 49 |ALICE.BOB.CHARLI|
00000010 45 0a |E.|
00000012
You can also write your own serialization and deserialization functions for the purpose.
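For instance, here is a minimal sketch of such a pair of functions, built on the same xxd round-trip (the names serialize/deserialize are my own, not standard):
serialize()   { xxd -p | tr -d '\n'; }   # raw bytes on stdin -> hex text on stdout
deserialize() { xxd -p -r; }             # hex text on stdin -> raw bytes on stdout

output=$(someprogram | serialize)        # only [0-9a-f] now, safe to store in a variable
printf '%s' "$output" | deserialize      # reproduces the exact byte stream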
Another idea I have is to for example read the data into an array by using zero byte as a separator (as any other byte is valid). This however will have problems with distinguishing the trailing zero byte:
$ readarray -d '' arr < <(printf "ALICE\0BOB\x1fCHARLIE\n")
$ printf "%s\0" "${arr[@]}" | hexdump -C
00000000 41 4c 49 43 45 00 42 4f 42 1f 43 48 41 52 4c 49 |ALICE.BOB.CHARLI|
00000010 45 0a 00 |E..|
# ^^ additional zero byte if input doesn't contain a trailing zero byte
00000013

Related

UNIX/Linux shell script: Removing variant form emoji from a text

Suppose you are using a Linux/UNIX shell whose default character set is UTF-8:
$ echo $LANG
en_US.UTF-8
You have a text file, emoji.txt, which is coded in UTF-8:
$ file -i ./emoji.txt
./emoji.txt: text/plain; charset=utf-8
This text file contains some emoji and a variant form escape sequence:
$ cat ./emoji.txt
Standard ☁
Variant form ☁️
$ uni2ascii -a B -q ./emoji.txt
Standard \x2601
Variant form \x2601\xFE0F
You want to remove both emoji, including that variant form character (\xFE0F), and so the output should be
Standard
Variant form
How would you do this?
Update. This question is not about how to remove the last word in every line. Imagine an emoji2.txt that contains a large text with many emoji characters, some of which are followed by the variant form sequence.
With GNU sed and bash:
sed -E s/$'\u2601\uFE0F?'//g emoji.txt
You can use awk, like this:
$ cat emo.ascii
Standard \x2601
Variant form \x2601\xFE0F
$ ascii2uni -a B emo.ascii
Standard ☁
Variant form ☁️
3 tokens converted # note: this is stderr
$ ascii2uni -a B emo.ascii | awk -F' ' '{NF--}1' | cat -A
3 tokens converted # note: this is stderr
Standard$
Variant form$
NF-- will decrease the field count in awk, which effectively removes the last field. 1 evaluates to true, which makes awk print the modified line.
(Used cat -A here only to show that there aren't any invisible characters left)
Have awk print all but the last field:
$ awk '/^Standard/ || /^Variant form/ { $(NF)="" }1' emoji.txt
Standard
Variant form
NOTE: This particular solution will leave the field separator (blank) on the end of the output line; if you want to strip the trailing blank you can pipe to sed, tr, etc ... or have awk loop through fields 1 to (NF-1) and output via printf
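A sketch of that printf loop (assuming the same two-line emoji.txt as above):
awk '/^Standard/ || /^Variant form/ {
       # rebuild the line from fields 1..NF-1, without a trailing blank
       for (i = 1; i < NF; i++) printf "%s%s", $i, (i < NF - 1 ? OFS : "")
       printf "\n"; next
     }
     { print }' emoji.txt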
Use nkf command. nkf -s try to convert character encoding to Shift-jis which does not support emojis. Therefore, emojis and escape sequence will be gone. Finally, revert input to UTF-8 with nkf -w.
$ cat emoji.txt | nkf -s | nkf -w
Standard
Variant form
$ cat emoji.txt | nkf -s | nkf -w | od -tx1c
0000000 53 74 61 6e 64 61 72 64 20 0a 56 61 72 69 61 6e
S t a n d a r d \n V a r i a n
0000020 74 20 66 6f 72 6d 20 0a
t f o r m \n
0000030
I thought ruby might work, because \p{Emoji} matches emojis. But it leaves the variant form escape sequence behind:
$ ruby -nle 'puts $_.gsub!(/\p{Emoji}/,"")' emoji.txt
Standard
Variant form ️
$ ruby -nle 'puts $_.gsub!(/\p{Emoji}/,"")' emoji.txt | od -tx1c
0000000 53 74 61 6e 64 61 72 64 20 0a 56 61 72 69 61 6e
S t a n d a r d \n V a r i a n
0000020 74 20 66 6f 72 6d 20 ef b8 8f 0a
t f o r m 357 270 217 \n
0000033
Convert the Unicode text file to ASCII and remove those Unicode characters that are represented by ASCII characters, and convert it to UTF-8 again:
$ uni2ascii -q ./emoji.txt | sed "s/ 0x2601\(0xFE0F\)\?//g" | ascii2uni -q
Standard
Variant form
$

How to write string with octal value

In bash, I would like to write the string "BLA\1"
so that the resulting buffer is 42 4C 41 01, but what I get is 42 4C 41 5C 31 (the backslash and the digit taken literally).
For comparison, in Python, if you write "BLA\1" to a binary file, the "\1" is interpreted as the byte 0x01.
So how can I write the string "BLA\1" correctly in bash?
Use the $'' special quotes:
echo -n $'BLA\1' | xxd
00000000: 424c 4101 BLA.
Use printf, defined by the POSIX standard:
printf 'BLA\1'
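Either form can be verified with od (a quick sketch; od -An -tx1 just dumps the bytes as hex):
$ printf 'BLA\1' | od -An -tx1
 42 4c 41 01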
Some bash-specific options:
# Let echo expand the escape code
echo -ne 'BLA\1'
# Use $'...', as in choroba's answer
echo -n $'BLA\1'

Conversion hex string into ascii in bash command line

I have a lot of strings of this kind and I want to find a command to convert them to ASCII. I tried echo -e and od, but it did not work.
0xA7.0x9B.0x46.0x8D.0x1E.0x52.0xA7.0x9B.0x7B.0x31.0xD2
This worked for me.
$ echo 54657374696e672031203220330 | xxd -r -p
Testing 1 2 3$
-r tells it to convert hex to ascii as opposed to its normal mode of doing the opposite
-p tells it to use a plain format.
This code will convert the text 0xA7.0x9B.0x46.0x8D.0x1E.0x52.0xA7.0x9B.0x7B.0x31.0xD2 into a stream of 11 bytes with equivalent values. These bytes will be written to standard out.
TESTDATA=$(echo '0xA7.0x9B.0x46.0x8D.0x1E.0x52.0xA7.0x9B.0x7B.0x31.0xD2' | tr '.' ' ')
for c in $TESTDATA; do
echo $c | xxd -r
done
As others have pointed out, this will not result in a printable ASCII string for the simple reason that the specified bytes are not ASCII. You need to post more information about how you obtained this string for us to help you with that.
How it works: xxd -r translates hexadecimal data to binary (like a reverse hexdump). xxd requires that each line start off with the index number of the first character on the line (run hexdump on something and see how each line starts off with an index number). In our case we want that number to always be zero, since each execution only has one line. As luck would have it, our data already has zeros before every character as part of the 0x notation. The lower case x is ignored by xxd, so all we have to do is pipe each 0xhh character to xxd and let it do the work.
The tr translates periods to spaces so that for will split it up correctly.
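An alternative sketch that avoids the per-byte loop: strip the 0x prefixes and the dots, then hand the plain hex to xxd -r -p in one go (output shown roughly as hexdump -C prints it):
$ echo '0xA7.0x9B.0x46.0x8D.0x1E.0x52.0xA7.0x9B.0x7B.0x31.0xD2' | sed 's/0x//g; s/\.//g' | xxd -r -p | hexdump -C
00000000  a7 9b 46 8d 1e 52 a7 9b  7b 31 d2                 |..F..R..{1.|
0000000b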
You can use xxd:
$ cat hex.txt
68 65 6c 6c 6f
$ cat hex.txt | xxd -r -p
hello
You can use something like this.
$ cat test_file.txt
54 68 69 73 20 69 73 20 74 65 78 74 20 64 61 74 61 2e 0a 4f 6e 65 20 6d 6f 72 65 20 6c 69 6e 65 20 6f 66 20 74 65 73 74 20 64 61 74 61 2e
$ for c in `cat test_file.txt`; do printf "\x$c"; done;
This is text data.
One more line of test data.
The values you provided are UTF-8 values. When stored in an array:
declare -a ARR=(0xA7 0x9B 0x46 0x8D 0x1E 0x52 0xA7 0x9B 0x7B 0x31 0xD2)
it can be looped over to print the plaintext character for each value:
for ((n=0; n < ${#ARR[*]}; n++)); do echo -e "\u${ARR[$n]//0x/}"; done
The output will yield a few printable characters and some non-printable characters.
For converting hex values to plaintext using the echo command:
echo -e "\x<hex value here>"
And for converting UTF-8 values to plaintext using the echo command:
echo -e "\u<UTF-8 value here>"
And then for converting octal to plaintext using the echo command:
echo -e "\0<octal value here>"
When you have encoding values you aren't familiar with, take the time to check out the ranges in the common encoding schemes to determine what encoding a value belongs to. Then conversion from there is a snap.
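A quick sanity check of the three forms (a sketch; the \u escape needs bash 4.2 or newer):
$ echo -e "\x41"
A
$ echo -e "\u00E9"
é
$ echo -e "\0101"
A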
The echo -e must have been failing for you because of wrong escaping.
The following code works fine for me on a similar output from your_program with arguments:
echo -e $(your_program with arguments | sed -e 's/0x\(..\)\.\?/\\x\1/g')
Please note however that your original hexstring consists of non-printable characters.
Make a script like this:
#!/bin/bash
echo $((0x$1)).$((0x$2)).$((0x$3)).$((0x$4))
Example:
sh converthextoip.sh c0 a8 00 0b
Result:
192.168.0.11

redirect stdout to script, so it can be parsed and then sent to stdout

I have a (java) program that prints a line of hex numbers to stdout every 5ish seconds, until the program is terminated by the user.
I would like to redirect that output to a bash script so I could convert each of those hex numbers independently to decimal, then print the parsed line to stdout.
I tried using myProgram | myScript but that did the piping before any lines were printed, then didn't keep listening to stdout. I then tried myProgram > myScript, and that just overwrote the script.
Ideas?
Edit: adding output from the runs (sorry for the poor formatting, I couldn't get it all into the code highlighting, so the middle of the output is not highlighted).
Here is the script
#!/bin/bash
echo $0
echo $#
echo $1
Here is how my program runs when it goes straight to stdout; this would continue forever if I didn't terminate it.
mmmm@mmmm:~/mmmm/mmmm/mmmmm$ java net.tinyos.tools.Listen -comm
serial@/dev/ttyUSB0:micaz
serial@/dev/ttyUSB0:57600: resynchronising
00 FF FF 00 02 04 22 93 00 02 02 C9
00 FF FF 00 03 04 22 93 00 03 03 0E
00 FF FF 00 02 04 22 93 00 03 03 0E
00 FF FF 00 02 04 22 93 00 02 02 C9
^Z
[5]+ Stopped java net.tinyos.tools.Listen -comm
serial@/dev/ttyUSB0:micaz
Here is where I try to pipe it to my script (which I have set to print the number of command line arguments and the first argument). It just freezes after this...
mmmm@mmmm:~/mmmm/mmmm/mmmmm$ java net.tinyos.tools.Listen -comm serial@/dev/ttyUSB0:micaz | ./parser.sh
./parser.sh
0
serial@/dev/ttyUSB0:57600: resynchronising
Diagnosis
When you use this script like this:
java javaprog | myScript
and myScript contains:
#!/bin/bash
echo $0
echo $#
echo $1
Then the output from the script will be its name (myScript) from the echo $0, the number of arguments it was passed (0) from the echo $#, and the first argument (an empty line is echoed) from the echo $1. The script then exits (successfully). The issue is nothing to do with buffering; it is all to do with the script not reading anything from its standard input. Even a trivial modification would be an improvement:
#!/bin/bash
while read data; do echo $data; done
That's a slower form of cat, except that it normalizes random sequences of spaces and tabs into single spaces, stripping leading and trailing spaces off the line. It would at least demonstrate the script processing the output from the Java program.
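To keep each line intact (apart from NUL bytes, which read cannot deliver), a common fix is to clear IFS and print with printf; a minimal sketch:
#!/bin/bash
while IFS= read -r data; do printf '%s\n' "$data"; done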
Trying awk
To do what you're after, you should probably replace that with an awk program or something similar. This is a first draft, but it stands some chance of working:
awk '{ for (i = 1; i <= NF; i++) { x = "0x" $i + 0; printf(" %d", x); } printf "\n"; }'
This says 'for each line (because there is no pattern before the open brace)', do 'for each of the fields 1..NF, convert the field into an explicit hex string with the 0x prefix and adding 0, then print the value as a decimal number (trusting awk to convert a string such as '0xC9' to a number).
Using Perl
Unfortunately, a little testing shows that this does not work; the problem is getting a value other than 0 for x. So, ... time to fall back on Perl in awk-emulation mode:
$ echo '00 C9 28 13 A0 FF 01' |
> perl -na -e 'for ($i = 0; $i < scalar(@F); $i++) { printf(" %d", hex $F[$i]); }
> printf "\n";'
0 201 40 19 160 255 1
$
That works - it's even fairly easy to understand. The -n option means 'read each line of data and execute the commands in the script on each line (but do not print $_ at the end)'. The -a option combined with either -n (as here, or -p which is like -n except it prints $_ automatically) means 'automatically split the input into the array @F'. The script then processes each element of @F in each line (rather verbosely), using the hex function to convert the string in $F[$i] to a number and then printing that number with printf(). The verbosity can be reduced (this is Perl: There's More Than One Way To Do It, or TMTOWTDI - tim-toady) with:
$ echo '00 C9 28 13 A0 FF 01' |
> perl -na -e 'foreach my $i (@F) { printf(" %d", hex $i); } printf "\n";'
0 201 40 19 160 255 1
$
Same result, less code. There might be more abbreviated techniques; that's compact enough without being wholly illegible.
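For completeness, a pure-bash sketch of the same conversion, using arithmetic expansion with an explicit base ($((16#...))), in case you'd rather avoid Perl (the input is assumed to be whitespace-separated two-digit hex fields as in the sample):
$ echo '00 C9 28 13 A0 FF 01' |
> while read -ra fields; do
>     out=""
>     for f in "${fields[@]}"; do out+=" $((16#$f))"; done
>     printf '%s\n' "${out# }"
> done
0 201 40 19 160 255 1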
1. Check if your system has the unbuffer command installed:
which unbuffer
(typically systems that are using bash are Linux-based, and have unbuffer available)
2. If yes,
unbuffer myProgram | myScript
edit
As you have shown us your shell script as
#!/bin/bash
echo $0
echo $#
echo $1
Please recall that the values you are echoing, $0, $#, $1 are positional parameters to bash related to the command line arguments. Typically options or filenames for processing.
To print the whole line, the # of fields on the line, and the value of the first field, awk is a perfect solution to this problem.
Try changing your script to
cat myScript.awk
#!/bin/awk -f
{
print $0
print NF
print $1
}
chmod 755 myScript.awk
Hmm... seeing ^Z used to stop input makes me wonder: are you using Windows, or bash under Cygwin?
I hope this helps.
This might be a buffering issue. The GNU Coreutils come with a tool called stdbuf. If it is available on your system, try running:
stdbuf -o0 program | stdbuf -i0 script
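For this particular task, the consuming side could itself do the hex-to-decimal conversion; here is a rough sketch (the java command line is the one from the question, and non-hex lines such as the resynchronising message would still need to be filtered out first):
stdbuf -o0 java net.tinyos.tools.Listen -comm serial@/dev/ttyUSB0:micaz |
while read -ra bytes; do
    # convert each two-digit hex field to decimal and print the whole line
    for b in "${bytes[@]}"; do printf '%d ' "$((16#$b))"; done
    printf '\n'
done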

How to get only the first ten bytes of a binary file

I am writing a bash script that needs to get the header (first 10 bytes) of a file and then in another section get everything except the first 10 bytes. These are binary files and will likely have \0's and \n's throughout the first 10 bytes. It seems like most utilities work with ASCII files. What is a good way to achieve this task?
To get the first 10 bytes, as noted already:
head -c 10
To get all but the first 10 bytes (at least with GNU tail):
tail -c+11
head -c 10 does the right thing here.
You can use the dd command to copy an arbitrary number of bytes from a binary file.
dd if=infile of=outfile1 bs=10 count=1
dd if=infile of=outfile2 bs=10 skip=1
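A quick way to convince yourself the two pieces are a lossless split of the original (cmp - compares its standard input against the named file):
cat outfile1 outfile2 | cmp - infile && echo "split is lossless"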
How to split a stream (or a file) under bash
Two answers here!
Reading SO request:
get the header (first 10 bytes) of a file and then in another section get everything except the first 10 bytes.
I understand this as:
How to split a file at a specific point
since all the answers here access the same file twice, instead of just splitting it!
Here is my two cents:
The interesting thing about Un*x is that it encourages treating every job as a filter, so it is easy to split a stream using unbuffered I/O. Most standard un*x tools (cat, grep, awk, sed, python, perl ...) work as filters.
1. Using head or dd but in a single pass
{ head -c 10 >head_part; cat >tail_part;} <file
This is the most efficient approach, as your file is read only once: the first 10 bytes go to head_part and the rest goes to tail_part.
Note: the second redirection >tail_part could be placed outside of the whole list ({ ...;}) as well...
You could do same, using dd:
{ dd count=1 bs=10 of=head_part; cat;} <file >tail_part
This stays more efficient than running two dd processes that open the same file twice.
...and it still lets you use a standard block size for the rest of the file.
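For instance, something like this should work (a sketch; both dd invocations share the file descriptor opened by <file, so the second one starts reading right after the 10-byte header):
{ dd count=1 bs=10 of=head_part; dd bs=64k of=tail_part;} <file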
Another sample, based on reading line by line:
Split an HTTP (or mail) stream on the near-empty line (a line containing only a carriage return, \r):
nc google.com 80 <<<$'GET / HTTP/1.0\r\nHost: google.com\r\n\r' |
{ sed -u '/^\r$/q' >/tmp/so_head.raw; cat;} >/tmp/so_body.raw
or, to drop empty last head line:
nc google.com 80 <<<$'GET / HTTP/1.0\r\nHost: google.com\r\n\r' |
{ sed -nu '/^\r$/q;p' >/tmp/so_head.raw; cat;} >/tmp/so_body.raw
This will produce two files:
ls -l so_*.raw
-rw-r--r-- 1 root root 307 Apr 25 11:40 so_head.raw
-rw-r--r-- 1 root root 219 Apr 25 11:40 so_body.raw
grep www so_*.raw
so_body.raw:here.
so_head.raw:Location: http://www.google.com/
2. Pure bash way:
If the goal is to obtain the values of the first 10 bytes in a usable bash variable, here is a nice and efficient way:
Because ten bytes are few, the fork to head can be avoided. From Read a file by bytes in BASH:
read8() {
local _r8_var=${1:-OUTBIN} _r8_car LANG=C IFS=
read -r -d '' -n 1 _r8_car || { printf -v $_r8_var '';return 1;}
printf -v $_r8_var %02X "'"$_r8_car
}
{
first10=()
for i in {0..9};do
read8 first10[i] || break
done
cat
} < "$infile" >"$outfile"
This will create an array ${first10[@]} containing the hexadecimal values of the first ten bytes of $infile and store the rest of the data into $outfile.
declare -p first10
declare -a first10=([0]="25" [1]="50" [2]="44" [3]="46" [4]="2D" [5]="31" [6]="2E"
[7]="34" [8]="0A" [9]="25")
This was a PDF (%PDF -> 25 50 44 46)... Here's another sample:
{
first10=()
for i in {0..9};do
read8 first10[i] || break
done
cat
} <<<"Hello world!"
d!
As I didn't redirect the output, the string d! is written to the terminal.
echo ${first10[@]}
48 65 6C 6C 6F 20 77 6F 72 6C
printf '%b%b%b%b%b%b%b%b%b%b\n' ${first10[@]/#/\\x}
Hello worl
About binary
You said:
These are binary files and will likely have \0's and \n's throughout the first 10 bytes.
{
first10=()
for i in {0..9};do
read8 first10[i] || break
done
cat
} < <(gzip <<<"Hello world!") >/dev/null
echo ${first10[@]}
1F 8B 08 00 00 00 00 00 00 03
( Sample with a \n at bottom of this ;)
As a function
read8() { local _r8_var=${1:-OUTBIN} _r8_car LANG=C IFS=
read -r -d '' -n 1 _r8_car || { printf -v $_r8_var '';return 1;}
printf -v $_r8_var %02X "'"$_r8_car ;}
get10() {
local -n result=${1:-first10} # 1st arg is array name
local -i _i
result=()
for ((_i=0;_i<${2:-10};_i++));do # 2nd arg is number of bytes
read8 result[_i] || { unset result[_i] ; return 1 ;}
done
cat
}
Then (here, I use the special character ⛶ to mean: there was no newline).
get10 pdf 4 <$infile >$outfile
printf %b ${pdf[@]/#/\\x}
%PDF⛶
echo $(( $(stat -c %s $infile) - $(stat -c %s $outfile) ))
4
get10 test 8 <<<'Hello World!'
rld!
printf %b ${test[@]/#/\\x}
Hello Wo⛶
get10 test 24 <<<'Hello World!'
printf %b ${test[@]/#/\\x}
Hello World!
( And the last character printed is a \n! ;)
Final binary demo:
get10 test 256 < <(gzip <<<'Hello world!')
printf '%b' ${test[@]/#/\\x} | gunzip
Hello world!
printf " %s %s %s %s %s %s %s %s %s %s %s %s %s %s %s %s\n" ${test[#]}
1F 8B 08 00 00 00 00 00 00 03 F3 48 CD C9 C9 57
28 CF 2F CA 49 51 E4 02 00 41 E4 A9 B2 0D 00 00
00
Note!! This works fine and is very quick as long as the number of bytes to read stays low, even when processing large files. It could be used for file type recognition, for example. But for splitting files into larger parts, you have to use split, head, tail and/or dd.
