Which characters need to be escaped when using Bash? - bash

Is there any comprehensive list of characters that need to be escaped in Bash? Can it be checked just with sed?
In particular, I was checking whether % needs to be escaped or not. I tried
echo "h%h" | sed 's/%/i/g'
and it worked fine without escaping %. Does that mean % does not need to be escaped? Was this a good way to check?
And more generally: are the characters that need escaping the same in sh and bash?

There are two easy and safe rules which work not only in sh but also bash.
1. Put the whole string in single quotes
This works for all chars except single quote itself. To escape the single quote, close the quoting before it, insert the single quote, and re-open the quoting.
'I'\''m a s@fe $tring which ends in newline
'
sed command: sed -e "s/'/'\\\\''/g; 1s/^/'/; \$s/\$/'/"
2. Escape every char with a backslash
This works for all characters except newline. For newline characters use single or double quotes. Empty strings must still be handled - replace with ""
\I\'\m\ \a\ \s\@\f\e\ \$\t\r\i\n\g\ \w\h\i\c\h\ \e\n\d\s\ \i\n\ \n\e\w\l\i\n\e"
"
sed command: sed -e 's/./\\&/g; 1{$s/^$/""/}; 1!s/^/"/; $!s/$/"/'.
2b. More readable version of 2
There's an easy safe set of characters, like [a-zA-Z0-9,._+:@%/-], which can be left unescaped to keep it more readable
I\'m\ a\ s@fe\ \$tring\ which\ ends\ in\ newline"
"
sed command: LC_ALL=C sed -e 's/[^a-zA-Z0-9,._+@%/-]/\\&/g; 1{$s/^$/""/}; 1!s/^/"/; $!s/$/"/'.
Note that in a sed program, one can't know whether the last line of input ends with a newline byte (except when it's empty). That's why both above sed commands assume it does not. You can add a quoted newline manually.
Note that shell variables are only defined for text in the POSIX sense. Processing binary data is not defined. For the implementations that matter, binary works with the exception of NUL bytes (because variables are implemented with C strings, and meant to be used as C strings, namely program arguments), but you should switch to a "binary" locale such as latin1.
(You can easily validate the rules by reading the POSIX spec for sh. For bash, check the reference manual linked by @AustinPhillips)
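Rule 1 can also be applied without sed. The following is a minimal sketch, assuming bash; the function name sq_quote and the sample string are made up for illustration and are not part of the answer above. Because it captures output with command substitution, trailing newlines in the quoted result are lost, so treat it as a sketch rather than a complete implementation.

```shell
#!/bin/bash
# sq_quote is a hypothetical helper name, not something bash provides.
sq_quote() {
    local s=$1 q="'" r="'\\''"
    # Rule 1: close the quoting, insert an escaped quote, reopen the quoting.
    s=${s//$q/$r}
    printf "'%s'\n" "$s"
}

original="it's a \$tring with \`ticks\` and \"quotes\""
quoted=$(sq_quote "$original")
echo "$quoted"

# Round trip: evaluating the quoted form gives back the original string.
eval "copy=$quoted"
[ "$copy" = "$original" ] && echo "round trip OK"
```

The eval round trip is the useful check here: if the quoted form were wrong, the shell would expand $tring or run the backticks instead of restoring the text.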

format that can be reused as shell input
Edit February 2021: bash ${var@Q}
Under bash, you could store your variable content with Parameter Expansion's @ command for Parameter transformation:
${parameter@operator}
Parameter transformation. The expansion is either a transformation
of the value of parameter or information about parameter
itself, depending on the value of operator. Each operator is a
single letter:
Q The expansion is a string that is the value of parameter
quoted in a format that can be reused as input.
...
A The expansion is a string in the form of an assignment
statement or declare command that, if evaluated, will
recreate parameter with its attributes and value.
Sample:
$ var=$'Hello\nGood world.\n'
$ echo "$var"
Hello
Good world.
$ echo "${var@Q}"
$'Hello\nGood world.\n'
$ echo "${var@A}"
var=$'Hello\nGood world.\n'
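To check that the Q transformation really produces reusable input, one can feed its result back through eval. This is a small sketch, assuming bash 4.4 or newer (where the @ transformations are available); the variable names quoted and copy are arbitrary.

```shell
#!/bin/bash
# Assumes bash 4.4+; older versions reject the @Q transformation.
var=$'Hello\nGood world.\n'

# Quote the value in a form the shell can read back.
quoted=${var@Q}

# Re-evaluate the quoted form and compare with the original value.
eval "copy=$quoted"
if [[ $copy == "$var" ]]; then
    echo "round trip OK"
fi
```

Assigning the expansion directly, rather than capturing echo output, keeps the trailing newline of the value intact.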
Old answer
There is a special printf format directive (%q) built for this kind of request:
printf [-v var] format [arguments]
%q causes printf to output the corresponding argument
in a format that can be reused as shell input.
Some samples:
read foo
Hello world
printf "%q\n" "$foo"
Hello\ world
printf "%q\n" $'Hello world!\n'
$'Hello world!\n'
This could be used through variables too:
printf -v var "%q" "$foo
"
echo "$var"
$'Hello world\n'
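A common use of %q is building a command line from several arguments. The sketch below assumes bash; requote is a hypothetical helper name, and the sample arguments are invented to include a space, a quote, a dollar sign and a glob character.

```shell
#!/bin/bash
# requote is a hypothetical helper: join all arguments into one string
# that the shell can parse back into the same words.
requote() {
    local out
    printf -v out '%q ' "$@"
    printf '%s\n' "${out% }"
}

line=$(requote 'file with spaces.txt' "it's" '$HOME' '*')
echo "$line"

# Parsing the string again yields the four original words, unexpanded.
eval "words=($line)"
echo "${#words[@]}"
```

Because every argument is quoted separately, neither $HOME nor * is expanded when the line is evaluated again.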
Quick check with all (128) ascii bytes:
Note that all bytes from 128 to 255 have to be escaped.
for i in {0..127} ;do
printf -v var \\%o $i
printf -v var $var
printf -v res "%q" "$var"
esc=E
[ "$var" = "$res" ] && esc=-
printf "%02X %s %-7s\n" $i $esc "$res"
done |
column
This must render something like:
00 E '' 1A E $'\032' 34 - 4 4E - N 68 - h
01 E $'\001' 1B E $'\E' 35 - 5 4F - O 69 - i
02 E $'\002' 1C E $'\034' 36 - 6 50 - P 6A - j
03 E $'\003' 1D E $'\035' 37 - 7 51 - Q 6B - k
04 E $'\004' 1E E $'\036' 38 - 8 52 - R 6C - l
05 E $'\005' 1F E $'\037' 39 - 9 53 - S 6D - m
06 E $'\006' 20 E \ 3A - : 54 - T 6E - n
07 E $'\a' 21 E \! 3B E \; 55 - U 6F - o
08 E $'\b' 22 E \" 3C E \< 56 - V 70 - p
09 E $'\t' 23 E \# 3D - = 57 - W 71 - q
0A E $'\n' 24 E \$ 3E E \> 58 - X 72 - r
0B E $'\v' 25 - % 3F E \? 59 - Y 73 - s
0C E $'\f' 26 E \& 40 - @ 5A - Z 74 - t
0D E $'\r' 27 E \' 41 - A 5B E \[ 75 - u
0E E $'\016' 28 E \( 42 - B 5C E \\ 76 - v
0F E $'\017' 29 E \) 43 - C 5D E \] 77 - w
10 E $'\020' 2A E \* 44 - D 5E E \^ 78 - x
11 E $'\021' 2B - + 45 - E 5F - _ 79 - y
12 E $'\022' 2C E \, 46 - F 60 E \` 7A - z
13 E $'\023' 2D - - 47 - G 61 - a 7B E \{
14 E $'\024' 2E - . 48 - H 62 - b 7C E \|
15 E $'\025' 2F - / 49 - I 63 - c 7D E \}
16 E $'\026' 30 - 0 4A - J 64 - d 7E E \~
17 E $'\027' 31 - 1 4B - K 65 - e 7F E $'\177'
18 E $'\030' 32 - 2 4C - L 66 - f
19 E $'\031' 33 - 3 4D - M 67 - g
The first field is the hexadecimal value of the byte, the second contains E if the character needs to be escaped, and the third shows the escaped representation of the character.
Why ,?
You may notice some characters that don't always need to be escaped, like ,, } and {.
So not always, but sometimes:
echo test 1, 2, 3 and 4,5.
test 1, 2, 3 and 4,5.
or
echo test { 1, 2, 3 }
test { 1, 2, 3 }
but take care:
echo test{1,2,3}
test1 test2 test3
echo test\ {1,2,3}
test 1 test 2 test 3
echo test\ {\ 1,\ 2,\ 3\ }
test 1 test 2 test 3
echo test\ {\ 1\,\ 2,\ 3\ }
test 1, 2 test 3
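One way to see when the comma and braces matter is to count the words the shell produces. This is a minimal sketch, assuming bash (brace expansion is not available in every sh); count_args is a made-up helper name.

```shell
#!/bin/bash
# count_args is a hypothetical helper that prints how many arguments it got.
count_args() { echo "$#"; }

count_args test{1,2,3}      # brace expansion applies: 3 words
count_args 'test{1,2,3}'    # quoted: 1 word
count_args test\{1,2,3}     # escaped opening brace: 1 word
count_args test{1\,2,3}     # escaped comma: 2 words (test1,2 and test3)
```

Escaping or quoting is only needed when an unquoted {, a comma and a } line up to form a valid brace expression.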

To save someone else from having to RTFM... in bash:
Enclosing characters in double quotes preserves the literal value of all characters within the quotes, with the exception of $, `, \, and, when history expansion is enabled, !.
...so if you escape those (and the quote itself, of course) you're probably okay.
If you take a more conservative 'when in doubt, escape it' approach, you can avoid accidentally creating escape sequences with special meaning (such as \n) by not escaping identifier characters (i.e. ASCII letters, numbers, or '_'). It's very unlikely these would ever (i.e. in some weird POSIX-ish shell) have special meaning and thus need to be escaped.
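The double-quote rule quoted above can be sketched as a small helper. This assumes bash and a POSIX sed; dq_escape is a hypothetical name. It leaves ! alone because history expansion is off by default in non-interactive scripts, and it works line by line, so trailing newlines are not preserved.

```shell
#!/bin/bash
# dq_escape is a hypothetical helper: prefix \, ", $ and ` with a backslash
# so the result is safe between double quotes.
dq_escape() {
    printf '%s\n' "$1" | sed 's/[\\"$`]/\\&/g'
}

s='cost: $5 "quoted" `cmd` back\slash'
escaped=$(dq_escape "$s")
echo "\"$escaped\""

# Round trip: the escaped text inside double quotes evaluates to the original.
eval "copy=\"$escaped\""
[ "$copy" = "$s" ] && echo "round trip OK"
```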

Using the printf '%q' technique, we can run a loop to find out which characters are special:
#!/bin/bash
special=$'`!@#$%^&*()-_+={}|[]\\;\':",.<>?/ '
for ((i=0; i < ${#special}; i++)); do
char="${special:i:1}"
printf -v q_char '%q' "$char"
if [[ "$char" != "$q_char" ]]; then
printf 'Yes, character %s needs to be escaped\n' "$char"
else
printf 'No, character %s does not need to be escaped\n' "$char"
fi
done | sort
It gives this output:
No, character % does not need to be escaped
No, character + does not need to be escaped
No, character - does not need to be escaped
No, character . does not need to be escaped
No, character / does not need to be escaped
No, character : does not need to be escaped
No, character = does not need to be escaped
No, character @ does not need to be escaped
No, character _ does not need to be escaped
Yes, character needs to be escaped
Yes, character ! needs to be escaped
Yes, character " needs to be escaped
Yes, character # needs to be escaped
Yes, character $ needs to be escaped
Yes, character & needs to be escaped
Yes, character ' needs to be escaped
Yes, character ( needs to be escaped
Yes, character ) needs to be escaped
Yes, character * needs to be escaped
Yes, character , needs to be escaped
Yes, character ; needs to be escaped
Yes, character < needs to be escaped
Yes, character > needs to be escaped
Yes, character ? needs to be escaped
Yes, character [ needs to be escaped
Yes, character \ needs to be escaped
Yes, character ] needs to be escaped
Yes, character ^ needs to be escaped
Yes, character ` needs to be escaped
Yes, character { needs to be escaped
Yes, character | needs to be escaped
Yes, character } needs to be escaped
Some of the results, like the comma, look a little suspicious. It would be interesting to get @CharlesDuffy's input on this.

The characters that need escaping differ between the Bourne or POSIX shell and Bash. Very broadly, Bash is a superset of those shells, so anything you escape in sh should also be escaped in Bash.
A nice general rule would be "if in doubt, escape it". But escaping some characters gives them a special meaning, like \n. These are listed in the man bash pages under Quoting and echo.
Other than that, escape any character that is not alphanumeric; it is safer. I don't know of a single definitive list.
The man pages list them all somewhere, but not in one place. Learn the language, that is the way to be sure.
One that has caught me out is !. This is a special character (history expansion) in Bash (and csh) but not in Korn shell. Even echo "Hello world!" gives problems. Using single-quotes, as usual, removes the special meaning.

I presume that you're talking about bash strings. There are different types of strings, each with different escaping requirements; e.g. single-quoted strings are different from double-quoted strings.
The best reference is the Quoting section of the bash manual.
It explains which characters need escaping. Note that some characters may need escaping depending on which options are enabled, such as history expansion.

I noticed that bash automatically escapes some characters when using auto-complete.
For example, if you have a directory named dir:A, bash will auto-complete to dir\:A
Using this, I ran some experiments with the characters of the ASCII table and derived the following lists:
Characters that bash escapes on auto-complete: (includes space)
!"$&'()*,:;<=>?@[\]^`{|}
Characters that bash does not escape:
#%+-.0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz~
(I excluded /, as it cannot be used in directory names)

Related

UNIX/Linux shell script: Removing variant form emoji from a text

Consider you are using a Linux/UNIX shell whose default character set is UTF-8:
$ echo $LANG
en_US.UTF-8
You have a text file, emoji.txt, which is coded in UTF-8:
$ file -i ./emoji.txt
./emoji.txt: text/plain; charset=utf-8
This text file contains some emoji and a variant form escape sequence:
$ cat ./emoji.txt
Standard ☁
Variant form ☁️
$ uni2ascii -a B -q ./emoji.txt
Standard \x2601
Variant form \x2601\xFE0F
You want to remove both emoji, including that variant form character (\xFE0F), and so the output should be
Standard
Variant form
How would you do this?
Update. This question is not about how to remove the last word in every line. Imagine emoji2.txt that includes a large text with many emoji characters; and some of these are followed by the variant form sequence.
With GNU sed and bash:
sed -E s/$'\u2601\uFE0F?'//g emoji.txt
You can use awk, like this:
$ cat emo.ascii
Standard \x2601
Variant form \x2601\xFE0F
$ ascii2uni -a B emo.ascii
Standard ☁
Variant form ☁️
3 tokens converted # note: this is stderr
$ ascii2uni -a B emo.ascii | awk -F' ' '{NF--}1' | cat -A
3 tokens converted # note: this is stderr
Standard$
Variant form$
NF-- will decrease the field count in awk, which effectively removes the last field. 1 evaluates to true, which makes awk print the modified line.
(Used cat -A here only to show that there aren't any invisible characters left)
Have awk print all but the last field:
$ awk '/^Standard/ || /^Variant form/ { $(NF)="" }1' emoji.txt
Standard
Variant form
NOTE: This particular solution will leave the field separator (blank) on the end of the output line; if you want to strip the trailing blank you can pipe to sed, tr, etc ... or have awk loop through fields 1 to (NF-1) and output via printf
Use the nkf command. nkf -s tries to convert the character encoding to Shift-JIS, which does not support emoji, so the emoji and the variant form sequence are dropped. Then convert the result back to UTF-8 with nkf -w.
$ cat emoji.txt | nkf -s | nkf -w
Standard
Variant form
$ cat emoji.txt | nkf -s | nkf -w | od -tx1c
0000000 53 74 61 6e 64 61 72 64 20 0a 56 61 72 69 61 6e
S t a n d a r d \n V a r i a n
0000020 74 20 66 6f 72 6d 20 0a
t f o r m \n
0000030
I thought Ruby might work, because \p{Emoji} matches emoji, but it leaves the variant form sequence behind.
$ ruby -nle 'puts $_.gsub!(/\p{Emoji}/,"")' emoji.txt
Standard
Variant form ️
$ ruby -nle 'puts $_.gsub!(/\p{Emoji}/,"")' emoji.txt | od -tx1c
0000000 53 74 61 6e 64 61 72 64 20 0a 56 61 72 69 61 6e
S t a n d a r d \n V a r i a n
0000020 74 20 66 6f 72 6d 20 ef b8 8f 0a
t f o r m 217 \n
0000033
Convert the Unicode text file to ASCII and remove those Unicode characters that are represented by ASCII characters, and convert it to UTF-8 again:
$ uni2ascii -q ./emoji.txt | sed "s/ 0x2601\(0xFE0F\)\?//g" | ascii2uni -q
Standard
Variant form
$
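The od dumps above show that the variant form character is the byte sequence ef b8 8f, and U+2601 encodes as e2 98 81 in UTF-8. Under that assumption, a bash-only sketch can match the bytes directly, which avoids depending on \u support or on the current locale. The function name strip_cloud is made up, and it only handles this one emoji.

```shell
#!/bin/bash
# strip_cloud is a hypothetical helper: remove U+2601, with or without
# a following U+FE0F, from one line of text.
strip_cloud() {
    local cloud=$'\xe2\x98\x81' vs16=$'\xef\xb8\x8f'
    local line=$1
    # Remove the two-character form first, then any remaining bare emoji.
    line=${line//"$cloud$vs16"/}
    line=${line//"$cloud"/}
    printf '%s\n' "$line"
}

while IFS= read -r line; do
    strip_cloud "$line"
done <<'EOF'
Standard ☁
Variant form ☁️
EOF
```

For a text with many different emoji, a list of byte sequences or a tool with Unicode property support would be needed; this sketch does not attempt that.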

Capturing special characters from stdin to a shell variable

I have a program which prints something that contains null bytes \0 and special characters like \x1f and newlines. For instance:
someprogram
#!/bin/bash
printf "ALICE\0BOB\x1fCHARLIE\n"
Given such a program, I want to read its output in such a way that all those special characters are captured in a shell variable output. So, if I run:
echo $output
because I'm not giving -e, I'd want the output to be:
ALICE\0BOB\x1fCHARLIE\n
How can this be achieved?
My first attempt was:
output=$(someprogram)
But I got this echoed output which doesn't have the special characters:
./myscript.sh: line 2: warning: command substitution: ignored null byte in input
ALICEBOBCHARLIE
I also tried to use read as follows:
output=""
while read -r
do
output="$output$REPLY"
done < <(someprogram)
Then I got rid of the warning but the output is still missing all special characters:
ALICEBOBCHARLIE
So how can I capture the output of someprogram in such a way that I have all the special characters in my resulting string?
EDIT: Note that it is possible to have such strings in bash:
$ x="ALICE\0BOB\x1fCHARLIE\n"
$ echo $x
ALICE\0BOB\x1fCHARLIE\n
So that shouldn't be the problem.
EDIT2: I'll reformulate the question a little bit now that I got an accepted answer and I understood things a little bit better. So, I just needed to be able to store the output of someprogram in some shell variable in such a way that I can print it to stdout without any changes in any special characters as if someprogram was just piped directly to stdout.
You just can't store a zero byte in a bash variable. It's impossible.
The usual solution is to convert the stream of bytes into hexadecimal. Then convert it back each time you want to do something with it.
$ x=$(printf "ALICE\0BOB\x1fCHARLIE\n" | xxd -p)
$ echo "$x"
414c49434500424f421f434841524c49450a
$ <<<"$x" xxd -p -r | hexdump -C
00000000 41 4c 49 43 45 00 42 4f 42 1f 43 48 41 52 4c 49 |ALICE.BOB.CHARLI|
00000010 45 0a |E.|
00000012
You can also write your own serialization and deserialization functions for the purpose.
Another idea is to read the data into an array, using the zero byte as a separator (any other byte is valid in a variable). However, this cannot tell whether the input ended with a trailing zero byte:
$ readarray -d '' arr < <(printf "ALICE\0BOB\x1fCHARLIE\n")
$ printf "%s\0" "${arr[@]}" | hexdump -C
00000000 41 4c 49 43 45 00 42 4f 42 1f 43 48 41 52 4c 49 |ALICE.BOB.CHARLI|
00000010 45 0a 00 |E..|
# ^^ additional zero byte if input doesn't contain a trailing zero byte
00000013
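As an illustration of writing your own serialization functions, here is a minimal sketch that swaps xxd for od and printf, assuming bash. The names to_hex and from_hex are hypothetical, and the loop in from_hex is slow for large inputs.

```shell
#!/bin/bash
# to_hex: read bytes on stdin, print them as one lowercase hex string.
to_hex() {
    od -An -v -tx1 | tr -d ' \n'
}

# from_hex: take a hex string as $1, write the original bytes to stdout.
from_hex() {
    local hex=$1 i
    for ((i = 0; i < ${#hex}; i += 2)); do
        printf "\\x${hex:i:2}"
    done
}

x=$(printf 'ALICE\0BOB\x1fCHARLIE\n' | to_hex)
echo "$x"

# The zero byte and the trailing newline survive the round trip.
from_hex "$x" | od -An -c
```

The variable only ever holds hex digits, so the zero byte never has to be stored in it.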

How to move the cursor in the bash shell when echoing emojis?

I am writing a game engine for Bash using the cursor movement feature described here. However, if I echo emojis or other UTF-8 characters that span more than 1 byte, the cursor position seems to get messed up.
For example, the following code is supposed to echo "1🔈3", move the cursor back 3 positions and then echo "abc" in the same place. The result should only be "abc" (ideally). Instead, I see "1abc"
~ $ echo -e "1🔈3\033[3Dabc"
1abc
A similar problem can be illustrated with the carriage return:
~ $ echo -e "1🔈3\rabc"
abc3
Is there any good way of resolving this? I am using the Terminal app on macOS. Is there any portable way of doing this?
Note: not all UTF-8 chars seem to behave this way. Mostly, I have only been able to reproduce this issue with emojis:
~ $ while true; do read -p "Enter emoji: " x; echo $x | hexdump; echo -e "1${x}3\033[3Dabc"; done
Enter emoji: 🔈
0000000 f0 9f 94 88 0a
0000005
1abc
Enter emoji: ♞
0000000 e2 99 9e 0a
0000004
abc
Enter emoji: ☞
0000000 e2 98 9e 0a
0000004
abc
Enter emoji: 😋
0000000 f0 9f 98 8b 0a
0000005
1abc
Enter emoji: 🃘
0000000 f0 9f 83 98 0a
0000005
abc
Enter emoji: 🀖
0000000 f0 9f 80 96 0a
0000005
abc
Enter emoji: 𝕭
0000000 f0 9d 95 ad 0a
0000005
abc
Enter emoji: 🇺🇸
0000000 f0 9f 87 ba f0 9f 87 b8 0a
0000009
1abc
Enter emoji: ✎
0000000 e2 9c 8e 0a
0000004
abc
The problem happens because a 😋 is actually rendered across two columns. On my system, the four emoji and eight digits are equally long:
😋😋😋😋
12345678
It's expected that a single Wide character will require two Narrow characters to overwrite it.
Treating these emoji as wide is recommended by Unicode TR51-16:
Current practice is for emoji to have a square aspect ratio, deriving from their origin in Japanese. For interoperability, it is recommended that this practice be continued with current and future emoji. They will typically have about the same vertical placement and advance width as CJK ideographs.
Given the recommendation, I would be comfortable simply hard coding anything in the "Emoticon" Unicode block as being wide. Your other symbols that work, such as 🀖 and ☞ are not in the Emoticon block (they're in Mahjong and Miscellaneous Symbols respectively).
If you want to determine the width at runtime, you can e.g. ask Python, which helpfully reports their East Asian Width as Full/Wide even though the Unicode tables themselves label it Neutral:
$ python3 -c 'import sys; import unicodedata as u; print(u.east_asian_width(sys.argv[1]))' 😋
W
$ python3 -c 'import sys; import unicodedata as u; print(u.east_asian_width(sys.argv[1]))' ♞
N
🇺🇸 is a bit of a special case since it's composed of two different Regional Indicator Symbols with separate code points, but Python labels each of them as Neutral so if you take that as 1 it'll still add up to 2.
Try this:
s="1🔈3" ; printf "$s"; sleep 2; printf "\033[$((${#s}+1))Dabc%${#s}s\n" ' '
I've put a delay in between the printfs so it's easier to see what happens. First there's:
1🔈 3
Two seconds later the above is overwritten with:
abc
How it works: We put the Unicode text in a string $s. ${#s} returns the length of that string, counted in characters when the locale is UTF-8 (in bytes in the C locale). The length is used in $((${#s}+1)) to calculate how many columns to move back, then %${#s}s tells printf how many spaces it needs (plus a few more) to overwrite any leftover chars.
If "a few more" spaces is too many, counting the overwriting string gives a more precise result:
s="1🔈3" t="abc"
printf "${s}"; sleep 2; printf "\033[$((${#s}+1))D$t%$((1+${#s}-${#t}))s\n" ''

How to write string with octal value

In bash, I would like to write the string "BLA\1"
so it will be a buffer 42 4C 41 01 but the result is 42 4C 41 5C 31
For comparison, in Python, if you write "BLA\1" to a binary file, the "\1" is interpreted as the single byte 01.
So how can I write the string "BLA\1" correctly in bash?
Use the $'' special quotes:
echo -n $'BLA\1' | xxd
00000000: 424c 4101 BLA.
Use printf, defined by the POSIX standard:
printf 'BLA\1'
Some bash-specific options:
# Let echo expand the escape code (bash's echo -e expects octal as \0nnn)
echo -ne 'BLA\01'
# Use $'...', as in choroba's answer
echo -n $'BLA\1'
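To confirm which forms really emit the byte 01, each one can be piped through a hex dump. This is a small sketch, assuming bash and od; the helper name dump is arbitrary. Note that the echo -e form is written here with a leading zero (\01), which is the octal syntax documented for bash's echo.

```shell
#!/bin/bash
# dump is a hypothetical helper: print stdin as one hex string.
dump() { od -An -v -tx1 | tr -d ' \n'; echo; }

printf 'BLA\1' | dump        # printf expands \1 itself
echo -n $'BLA\1' | dump      # the shell expands $'...' before echo runs
echo -ne 'BLA\01' | dump     # echo -e expands \0nnn
```

All three lines should print 424c4101, matching the buffer 42 4C 41 01 asked for in the question.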

Conversion hex string into ascii in bash command line

I have a lot of strings like the one below and want a command to convert them to ASCII. I tried echo -e and od, but neither worked.
0xA7.0x9B.0x46.0x8D.0x1E.0x52.0xA7.0x9B.0x7B.0x31.0xD2
This worked for me.
$ echo 54657374696e672031203220330 | xxd -r -p
Testing 1 2 3$
-r tells it to convert hex to ascii as opposed to its normal mode of doing the opposite
-p tells it to use a plain format.
This code will convert the text 0xA7.0x9B.0x46.0x8D.0x1E.0x52.0xA7.0x9B.0x7B.0x31.0xD2 into a stream of 11 bytes with equivalent values. These bytes will be written to standard out.
TESTDATA=$(echo '0xA7.0x9B.0x46.0x8D.0x1E.0x52.0xA7.0x9B.0x7B.0x31.0xD2' | tr '.' ' ')
for c in $TESTDATA; do
echo $c | xxd -r
done
As others have pointed out, this will not result in a printable ASCII string for the simple reason that the specified bytes are not ASCII. You need to post more information about how you obtained this string for us to help you with that.
How it works: xxd -r translates hexadecimal data to binary (like a reverse hexdump). xxd requires that each line start off with the index number of the first character on the line (run hexdump on something and see how each line starts off with an index number). In our case we want that number to always be zero, since each execution only has one line. As luck would have it, our data already has zeros before every character as part of the 0x notation. The lower case x is ignored by xxd, so all we have to do is pipe each 0xhh character to xxd and let it do the work.
The tr translates periods to spaces so that for will split it up correctly.
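The same conversion can be sketched without xxd, assuming bash, whose printf understands \xHH. The function name hex_dots_to_bytes is invented, and the sample input is a different, printable string chosen so the result is easy to read.

```shell
#!/bin/bash
# hex_dots_to_bytes is a hypothetical helper: turn 0xHH.0xHH... into raw bytes.
hex_dots_to_bytes() {
    local IFS=. tok
    # With IFS set to a dot, the unquoted $1 splits into one token per byte.
    for tok in $1; do
        # Strip the 0x prefix and let printf emit the byte.
        printf "\\x${tok#0x}"
    done
}

hex_dots_to_bytes '0x48.0x65.0x6C.0x6C.0x6F'
echo
```

With the string from the question the output would be mostly non-printable bytes, so piping it to od or hexdump is the practical way to inspect it.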
You can use xxd:
$ cat hex.txt
68 65 6c 6c 6f
$ cat hex.txt | xxd -r -p
hello
You can use something like this.
$ cat test_file.txt
54 68 69 73 20 69 73 20 74 65 78 74 20 64 61 74 61 2e 0a 4f 6e 65 20 6d 6f 72 65 20 6c 69 6e 65 20 6f 66 20 74 65 73 74 20 64 61 74 61 2e
$ for c in `cat test_file.txt`; do printf "\x$c"; done;
This is text data.
One more line of test data.
The values you provided are UTF-8 values. When set, the array of:
declare -a ARR=(0xA7 0x9B 0x46 0x8D 0x1E 0x52 0xA7 0x9B 0x7B 0x31 0xD2)
Will be parsed to print the plaintext characters of each value.
for ((n=0; n < ${#ARR[*]}; n++)); do echo -e "\u${ARR[$n]//0x/}"; done
And the output will yield a few printable characters and some non-printable characters.
For converting hex values to plaintext using the echo command:
echo -e "\x<hex value here>"
And for converting Unicode code points to plaintext using the echo command:
echo -e "\u<code point in hex here>"
And then for converting octal to plaintext using the echo command:
echo -e "\0<octal value here>"
When you have encoding values you aren't familiar with, take the time to check out the ranges in the common encoding schemes to determine what encoding a value belongs to. Then conversion from there is a snap.
The echo -e must have been failing for you because of wrong escaping.
The following code works fine for me on a similar output from your_program with arguments:
echo -e $(your_program with arguments | sed -e 's/0x\(..\)\.\?/\\x\1/g')
Please note however that your original hexstring consists of non-printable characters.
Make a script like this:
#!/bin/bash
echo $((0x$1)).$((0x$2)).$((0x$3)).$((0x$4))
Example:
sh converthextoip.sh c0 a8 00 0b
Result:
192.168.0.11
