How to decode URL-encoded string in shell? - bash

I have a file with a list of user-agents which are encoded.
E.g.:
Mozilla%2F5.0%20%28Macintosh%3B%20U%3B%20Intel%20Mac%20OS%20X%2010.6%3B%20en
I want a shell script which can read this file and write to a new file with decoded strings.
Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en
I have been trying to use this example to get it going but it is not working so far.
$ echo -e "$(echo "%31+%32%0A%33+%34" | sed 'y/+/ /; s/%/\\x/g')"
My script looks like:
#!/bin/bash
for f in *.log; do
echo -e "$(cat $f | sed 'y/+/ /; s/%/\x/g')" > y.log
done

Here is a simple one-line solution.
$ function urldecode() { : "${*//+/ }"; echo -e "${_//%/\\x}"; }
It may look like perl :) but it is just pure bash. No awks, no seds ... no overheads. Using the : builtin, special parameters, pattern substitution and the echo builtin's -e option to translate hex codes into characters. See bash's manpage for further details. You can use this function as separate command
$ urldecode https%3A%2F%2Fgoogle.com%2Fsearch%3Fq%3Durldecode%2Bbash
https://google.com/search?q=urldecode+bash
or in variable assignments, like so:
$ x="http%3A%2F%2Fstackoverflow.com%2Fsearch%3Fq%3Durldecode%2Bbash"
$ y=$(urldecode "$x")
$ echo "$y"
http://stackoverflow.com/search?q=urldecode+bash
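To apply this to the file from the question, the function can be wrapped in a read loop. This is only a minimal sketch: decode_file is a name I made up, and I use printf '%b' instead of echo -e, which is an assumption on my part rather than part of the answer above. Note that %b also interprets any literal backslash sequences already present in the input.

```shell
#!/bin/bash
# Same idea as the one-liner above, but with printf '%b' instead of echo -e.
urldecode() { : "${*//+/ }"; printf '%b\n' "${_//%/\\x}"; }

# decode_file is a hypothetical helper: decode every line of the file in $1.
decode_file() {
    local line
    # IFS= and -r keep leading whitespace and backslashes intact;
    # the || clause handles a last line without a trailing newline.
    while IFS= read -r line || [[ -n $line ]]; do
        urldecode "$line"
    done < "$1"
}
```

Usage would then be something like decode_file encoded.txt > decoded.txt.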

If you are a Python developer, this may be preferable:
For Python 3.x (default):
echo -n "%21%20" | python3 -c "import sys; from urllib.parse import unquote; print(unquote(sys.stdin.read()));"
For Python 2.x (deprecated):
echo -n "%21%20" | python -c "import sys, urllib as ul; print ul.unquote(sys.stdin.read());"
urllib is really good at handling URL parsing

With Bash, to read percent-encoded URLs from standard input and decode them:
while read; do echo -e ${REPLY//%/\\x}; done
Press CTRL-D to signal the end of file(EOF) and quit gracefully.
You can decode the contents of a file by redirecting the file to standard input:
while read; do echo -e ${REPLY//%/\\x}; done < file
You can decode input from a pipe as well, for example:
echo 'a%21b' | while read; do echo -e ${REPLY//%/\\x}; done
The read built in command reads standard in until it sees a Line Feed character. It sets a variable called REPLY equal to the line of text it just read.
${REPLY//%/\\x} replaces all instances of '%' with '\x'.
echo -e interprets \xNN as the ASCII character with hexadecimal value of NN.
while repeats this loop until the read command fails, eg. EOF has been reached.
The above does not change '+' to ' '. To change '+' to ' ' also, like guest's answer:
while read; do : "${REPLY//%/\\x}"; echo -e ${_//+/ }; done
: is a BASH builtin command. Here it just takes in a single argument and does nothing with it.
The double quotes make everything inside one single parameter.
_ is a special parameter that is equal to the last argument of the previous command, after argument expansion. This is the value of REPLY with all instances of '%' replaced with '\x'.
${_//+/ } replaces all instances of '+' with ' '.
This uses only BASH and doesn't start any other process, similar to guest's answer.
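The loops above use a plain read and an unquoted expansion, so backslashes are consumed and runs of spaces collapse. A hedged variant that keeps them is sketched below; decode_stream is a hypothetical name, and swapping echo -e for printf '%b' is my choice, not the answer's.

```shell
#!/bin/bash
# decode_stream (hypothetical name): decode percent-encoded lines from stdin.
decode_stream() {
    local line
    # IFS= keeps leading/trailing whitespace, -r keeps backslashes.
    while IFS= read -r line; do
        line=${line//+/ }                 # '+' means space in query strings
        printf '%b\n' "${line//%/\\x}"    # %XX -> \xXX, expanded by %b
    done
}

echo 'a%21b' | decode_stream
```

Like the original, this still runs entirely in the shell without starting another process for the decoding itself.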

This is what seems to be working for me.
#!/bin/bash
urldecode(){
echo -e "$(sed 's/+/ /g;s/%\(..\)/\\x\1/g;')"
}
for f in /opt/logs/*.log; do
name=${f##/*/}
cat $f | urldecode > /opt/logs/processed/$HOSTNAME.$name
done
Replacing '+'s with spaces, and % signs with '\x' escapes, and letting echo interpret the \x escapes using the '-e' option was not working. For some reason, the cat command was printing the % sign as its own encoded form %25. So sed was simply replacing %25 with \x25. When the -e option was used, it was simply evaluating \x25 as % and the output was same as the original.
Trace:
Original: Mozilla%2F5.0%20%28Macintosh%3B%20U%3B%20Intel%20Mac%20OS%20X%2010.6%3B%20en
sed: Mozilla\x252F5.0\x2520\x2528Macintosh\x253B\x2520U\x253B\x2520Intel\x2520Mac\x2520OS\x2520X\x252010.6\x253B\x2520en
echo -e: Mozilla%2F5.0%20%28Macintosh%3B%20U%3B%20Intel%20Mac%20OS%20X%2010.6%3B%20en
Fix: Basically ignore the 2 characters after the % in sed.
sed: Mozilla\x2F5.0\x20\x28Macintosh\x3B\x20U\x3B\x20Intel\x20Mac\x20OS\x20X\x2010.6\x3B\x20en
echo -e: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en
Not sure what complications this would result in, after extensive testing, but works for now.

Bash script for doing it in native Bash (original source):
LANG=C
urlencode() {
local l=${#1}
for (( i = 0 ; i < l ; i++ )); do
local c=${1:i:1}
case "$c" in
[a-zA-Z0-9.~_-]) printf "$c" ;;
' ') printf + ;;
*) printf '%%%.2X' "'$c"
esac
done
}
urldecode() {
local data=${1//+/ }
printf '%b' "${data//%/\x}"
}
If you want to urldecode file content, just put the file content as an argument.
Here's a test that will halt if the encoded and then decoded file content differs from the original (if it keeps running for a few seconds, the script probably works correctly):
while true
do cat /dev/urandom | tr -d '\0' | head -c1000 > /tmp/tmp;
A="$(cat /tmp/tmp; printf x)"
A=${A%x}
A=$(urlencode "$A")
urldecode "$A" > /tmp/tmp2
cmp /tmp/tmp /tmp/tmp2
if [ $? != 0 ]
then break
fi
done

perl -pi.back -e 'y/+/ /;s/%([\da-f]{2})/pack H2,$1/gie' ./*.log
-i updates the files in place (some sed implementations have borrowed that from perl), with .back as the backup extension.
s/x/y/e substitutes x with the evaluation of the y perl code.
The perl code in this case uses pack to pack the hex number captured in $1 (first parentheses pair in the regexp) as the corresponding character.
An alternative to pack is to use chr(hex($1)):
perl -pi.back -e 'y/+/ /;s/%([\da-f]{2})/chr hex $1/gie' ./*.log
If available, you could also use uri_unescape() from URI::Escape:
perl -pi.back -MURI::Escape -e 'y/+/ /;$_=uri_unescape$_' ./*.log

bash idiom for url-decoding
Here is a bash idiom for url-decoding a string held in variable x and assigning the result to variable y:
: "${x//+/ }"; printf -v y '%b' "${_//%/\\x}"
Unlike the accepted answer, it preserves trailing newlines during assignment. (Try assigning the result of url-decoding v%0A%0A%0A to a variable.)
It also is fast. It is 6700% faster at assigning the result of url-decoding to a variable than the accepted answer.
Caveat: It is not possible for a bash variable to contain a NUL. For example, any bash solution attempting to decode %00 and assign the result to a variable will not work.
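The trailing-newline point can be checked directly. The snippet below is only an illustration; the variable names y_subst and y_idiom are mine.

```shell
#!/bin/bash
x='v%0A%0A%0A'

# Command substitution strips every trailing newline from the result.
y_subst=$(: "${x//+/ }"; printf '%b' "${_//%/\\x}")

# printf -v assigns directly, so the three newlines survive.
: "${x//+/ }"; printf -v y_idiom '%b' "${_//%/\\x}"

echo "command substitution: ${#y_subst} character(s)"
echo "printf -v idiom:      ${#y_idiom} character(s)"
```

The first length is 1 (just the v) and the second is 4 (the v plus three newlines).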
Benchmark details
function.sh
#!/bin/bash
urldecode() { : "${*//+/ }"; echo -e "${_//%/\\x}"; }
x=%21%20
for (( i=0; i<5000; i++ )); do
y=$(urldecode "$x")
done
idiom.sh
#!/bin/bash
x=%21%20
for (( i=0; i<5000; i++ )); do
: "${x//+/ }"; printf -v y '%b' "${_//%/\\x}"
done
$ hyperfine --warmup 5 ./function.sh ./idiom.sh
Benchmark #1: ./function.sh
Time (mean ± σ): 2.844 s ± 0.036 s [User: 1.728 s, System: 1.494 s]
Range (min … max): 2.801 s … 2.907 s 10 runs
Benchmark #2: ./idiom.sh
Time (mean ± σ): 42.4 ms ± 1.0 ms [User: 40.7 ms, System: 1.1 ms]
Range (min … max): 40.5 ms … 44.8 ms 64 runs
Summary
'./idiom.sh' ran
67.06 ± 1.76 times faster than './function.sh'
If you really want a function ...
If you really want a function, say for readability reasons, I suggest the following:
# urldecode [-v var ] argument
#
# Urldecode the argument and print the result.
# It replaces '+' with SPACE and then percent decodes.
# The output is consistent with https://meyerweb.com/eric/tools/dencoder/
#
# Options:
# -v var assign the output to shell variable VAR rather than
# print it to standard output
#
urldecode() {
local assign_to_var=
local OPTIND opt
while getopts ':v:' opt; do
case $opt in
v)
local var=$OPTARG
assign_to_var=Y
;;
\?)
echo "$FUNCNAME: error: -$OPTARG: invalid option" >&2
return 1
;;
:)
echo "$FUNCNAME: error: -$OPTARG: this option requires an argument" >&2
return 1
;;
*)
echo "$FUNCNAME: error: an unexpected execution path has occurred." >&2
return 1
;;
esac
done
shift "$((OPTIND - 1))"
# Convert all '+' to ' '
: "${1//+/ }"
# We exploit that the $_ variable (last argument to the previous command
# after expansion) contains the result of the parameter expansion
if [[ $assign_to_var ]]; then
printf -v "$var" %b "${_//%/\\x}"
else
printf %b "${_//%/\\x}"
fi
}
Example 1: Printing the result to stdout
x='v%0A%0A%0A'
urldecode "$x" | od -An -tx1
Result:
76 0a 0a 0a
Example 2: Assigning the result of decoding to a shell variable:
x='v%0A%0A%0A'
urldecode -v y "$x"
echo -n "$y" | od -An -tx1
(same result)
This function, while not as fast as the idiom above, is still 1300% faster than the accepted answer at doing assignments due to no subshell being involved. In addition, as shown in the example's output, it preserves trailing newlines due to no command substitution being involved.

If you have php installed on your server, you can "cat" or even "tail" any file, with url encoded strings very easily.
tail -f nginx.access.log | php -R 'echo urldecode($argn)."\n";'

As @barti_ddu said in the comments, \x "should be [double-]escaped".
% echo -e "$(echo "Mozilla%2F5.0%20%28Macintosh%3B%20U%3B%20Intel%20Mac%20OS%20X%2010.6%3B%20en" | sed 'y/+/ /; s/%/\\x/g')"
Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en
Rather than mixing up Bash and sed, I would do this all in Python. Here's a rough cut of how:
#!/usr/bin/env python
import glob
import os
import urllib
for logfile in glob.glob(os.path.join('.', '*.log')):
with open(logfile) as current:
new_log_filename = logfile + '.new'
with open(new_log_filename, 'w') as new_log_file:
for url in current:
unquoted = urllib.unquote(url.strip())
new_log_file.write(unquoted + '\n')

Building upon some of the other answers, but for the POSIX world, could use the following function:
url_decode() {
printf '%b\n' "$(sed -E -e 's/\+/ /g' -e 's/%([0-9a-fA-F]{2})/\\x\1/g')"
}
It uses printf '%b\n' because there is no echo -e and breaks the sed call to make it easier to read, forcing -E to be able to use references with \1. It also forces what follows % to look like some hex code.

Just wanted to share this other solution, pure bash:
encoded_string="Mozilla%2F5.0%20%28Macintosh%3B%20U%3B%20Intel%20Mac%20OS%20X%2010.6%3B%20en"
printf -v decoded_string "%b" "${encoded_string//\%/\\x}"
echo $decoded_string
Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en

Updating Jay's answer for Python 3.5+:
echo "%31+%32%0A%33+%34" | python -c "import sys; from urllib.parse import unquote ; print(unquote(sys.stdin.read()))"
Still, brendan's bash solution with explanation seems more direct and elegant.

With GNU awk:
LC_ALL=C gawk -vRS='%[[:xdigit:]]{2}' '
RT {RT = sprintf("%c",strtonum("0x" substr(RT, 2)))}
{gsub(/\+/," ");printf "%s", $0 RT}'
Would take URI-encoded on stdin and print the decoded output on stdout.
We set the record separator to a regexp that matches a %XX sequence. In GNU awk, the input that matched it is stored in the RT special variable. We extract the hex digits from there and append them to "0x" for strtonum() to turn into a number, which is passed in turn to sprintf("%c"); in the C locale this converts it to the corresponding byte value.

With sed:
#!/bin/bash
URL_DECODE="$(echo "$1" | sed -E 's/%([0-9a-fA-F]{2})/\\x\1/g;s/\+/ /g')"
echo -e "$URL_DECODE"
s/%([0-9a-fA-F]{2})/\\x\1/g replaces each %XX with \xXX, turning the URL encoding into hex escapes that echo -e understands
s/\+/ /g replaces + with a space, in case + is used in the query string
Just save it as decodeurl.sh and make it executable with chmod +x decodeurl.sh
If you need a way to encode too, this complete code will help:
#!/bin/bash
#
# URL encoding and decoding with sed
#
# By Daniel Cambría
# daniel.cambria@bureau-it.com
#
# Jul/2021
function url_decode() {
echo "$@" \
| sed -E 's/%([0-9a-fA-F]{2})/\\x\1/g;s/\+/ /g'
}
function url_encode() {
# Per RFC 3986
echo "$@" \
| sed \
-e 's/ /%20/g' \
-e 's/:/%3A/g' \
-e 's/,/%2C/g' \
-e 's/\?/%3F/g' \
-e 's/#/%23/g' \
-e 's/\[/%5B/g' \
-e 's/\]/%5D/g' \
-e 's/@/%40/g' \
-e 's/!/%21/g' \
-e 's/\$/%24/g' \
-e 's/&/%26/g' \
-e "s/'/%27/g" \
-e 's/(/%28/g' \
-e 's/)/%29/g' \
-e 's/\*/%2A/g' \
-e 's/\+/%2B/g' \
-e 's/,/%2C/g' \
-e 's/;/%3B/g' \
-e 's/=/%3D/g'
}
echo -e "URL decode: " $(url_decode "$1")
echo -e "URL encode: " $(url_encode "$1")

$ uenc='H%C3%B6he %C3%BCber%20dem%20Meeresspiegel'
$ utf8=$(echo -e "${uenc//%/\\x}")
$ echo $utf8
Höhe über dem Meeresspiegel
$

With the zsh shell (instead of bash), the only shell whose variables can hold any byte value including NUL (encoded as %00):
set -o extendedglob +o multibyte
string='Mozilla%2F5.0%20%28Macintosh%3B%20U%3B%20Intel%20Mac%20OS%20X%2010.6%3B%20en'
decoded=${${string//+/ }//(#b)%([[:xdigit:]](#c2))/${(#):-0x$match[1]}}
${var//pattern/replacement}: ksh-style parameter expansion operator to expand to the value of $var with every string matching pattern replaced with replacement.
(#b) activate back references so every part inside brackets in the pattern can be accessed as corresponding $match[n] in the replacement.
(#c2): equivalent of ERE {2}
${(#)param-expansion}: parameter expansion where the # flag causes the result to be interpreted as an arithmetic expression and the corresponding byte value to be returned.
${var:-value}: expands to value if $var is empty, here applied to no variable at all, so we can just specify an arbitrary string as the subject of a parameter expansion.
To make it a function that decodes the contents of a variable in-place:
uridecode_var() {
emulate -L zsh
set -o extendedglob +o multibyte
eval $1='${${'$1'//+/ }//(#b)%([[:xdigit:]](#c2))/${(#):-0x$match[1]}}'
}
$ string='Mozilla%2F5.0%20%28Macintosh%3B%20U%3B%20Intel%20Mac%20OS%20X%2010.6%3B%20en'
$ uridecode_var string
$ print -r -- $string
Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en

python, for zshrc
# Usage: decodeUrl %3A%2F%2F
function decodeUrl(){
echo "$1" | python3 -c "import sys; from urllib.parse import unquote; print(unquote(sys.stdin.read()));"
}
# Usage: encodeUrl https://google.com/search?q=urldecode+bash
# return: https://google.com/search\?q\=urldecode+bash
function encodeUrl(){
echo "$1" | python3 -c "import sys; from urllib.parse import quote; print(quote(sys.stdin.read()));"
}

used gridsite-clients
1. yum install gridsite-clients / or apt-get install gridsite-clients
2. grep -a 'http' access.log | xargs urlencode -d

Here is a solution that is done in pure bash where input and output are bash variables. It will decode '+' as a space and handle the '%20' space, as well as other %-encoded characters.
#!/bin/bash
#here is text that contains both '+' for spaces and a %20
text="hello+space+1%202"
decoded=$(echo -e `echo $text | sed 's/+/ /g;s/%/\\\\x/g;'`)
echo decoded=$decoded

Expanding to
https://stackoverflow.com/a/37840948/8142470
to work with HTML entities
$ htmldecode() { : "${*//+/ }"; echo -e "${_//&#x/\x}" | tr -d ';'; }
$ htmldecode "http://google.com/search&?q=urldecode+bash"
http://google.com/search&?q=urldecode+bash
(argument must be quoted)

Just a quick hint for others who are searching for a busybox-compatible solution. In the busybox shell you can use
httpd -d $ENCODED_URL
Example use case for busybox:
Download a file with wget and save it with the original decoded filename:
wget --no-check-certificate $ENCODED_URL -O $(basename $(httpd -d $ENCODED_URL))

If you prefer gawk, there's absolutely no need to force LC_ALL=C or gawk -b just to decode URL-encoded -
here's a fully functional proof-of-concept showcasing how gawk-unicode mode could directly decode purely binary files like MP3-audio or MP4-video files that were URL-encoded,and get back the exact same file, as confirmed by hashing.
It uses FS | OFS to handle the spaces that were set to +, similar to python3's quote-plus in their urllib :
( fg && fg && fg ) 2>/dev/null;
gls8x "${f}"
echo
pvE0 < "${f}" | xxh128sum | lgp3
echo ; echo
pvE0 < "${f}" | urlencodeAWKchk \
\
| gawk -ne '
BEGIN {
RS="[%][[:xdigit:]]{2}";
FS="[+]"
_=(4^5)*54 # if this offset doesn-t
# work, try
# 8^7
# instead
} (NF+="_"*(ORS = sprintf("%.*s", RT != "",
sprintf("%c",\
_+("0x" \
substr( RT, 2 ))))))~""' |pvE9|xxh128sum|lgp3
1 -rwxrwxrwx 1 5555 staff 9290187 May 27 2021 genieaudio_16277926_.lossless.mp3*
in0: 8.86MiB 0:00:00 [3.56GiB/s] [3.56GiB/s][=================>] 100%
5d43c221bf6c85abac80eea8dbb412a1 stdin
in0: 8.86MiB 0:00:00 [3.47GiB/s] [3.47GiB/s] [=================>] 100%
out9: 8.86MiB 0:00:05 [1.72MiB/s] [1.72MiB/s] [ <=> ]
5d43c221bf6c85abac80eea8dbb412a1 stdin
1 -rw-r--r-- 1 5555 staff 215098877 Feb 8 17:30 vg3.mp4
in0: 205MiB 0:00:00 [2.66GiB/s] [2.66GiB/s] [=================>] 100%
2778670450b08cee694dcefc23cd4d93 stdin
in0: 205MiB 0:00:00 [3.31GiB/s] [3.31GiB/s] [=================>] 100%
out9: 205MiB 0:02:01 [1.69MiB/s] [1.69MiB/s] [ <=> ]
2778670450b08cee694dcefc23cd4d93 stdin

Minimalistic uridecode [-v varname] bash function:
Coming late to this SO question (asked 11 years ago), I see:
The first answer suggesting the use of printf -v varname %b... was offered by jamp, nearly 3 years after the question was asked.
The first answer offering a function for doing this was offered 10 years and 6 months after the question, by Robin A. Meade.
Here is my smaller function:
uridecode() {
if [[ $1 == -v ]];then local -n _res="$2"; shift 2; else local _res; fi
: "${*//+/ }"; printf -v _res %b "${_//%/\\x}"
[[ ${_res@A} == _res=* ]] && echo "$_res"
}
Or less condensed:
uridecode() {
if [[ $1 == -v ]];then # If 1st argument is ``-v''
local -n _res="$2" # _res is a nameref to ``$2''
shift 2 # drop 1st two arguments
else
local _res # _res is a local variable
fi
: "${*//+/ }" # _ holds the arguments with ``+'' replaced by spaces
printf -v _res %b "${_//%/\\x}" # store the rendered string in _res
[[ ${_res@A} == _res=* ]] && # print _res if local
echo "$_res"
}
Usage:
uridecode Mozilla%2F5.0%20%28Macintosh%3B%20U%3B%20Intel%20Mac%20OS%20X%2010.6%3B%20en
Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en
uridecode -v myvar Hell%6f w%6Frld%21
echo $myvar
Hello world!
As I use $* instead of $1, and because URI doesn't hold special characters, there is no need to quote arguments.

A slightly modified version of the Python answer that accepts an input and output file in a one liner.
cat inputfile.txt | python -c "import sys, urllib as ul; print ul.unquote(sys.stdin.read());" > outputfile.txt

$ uenc='H%C3%B6he %C3%BCber%20dem%20Meeresspiegel'
$ utf8=$(printf "${uenc//%/\\x}")
$ echo $utf8
Höhe über dem Meeresspiegel
$

Related

make the bash script to be faster

I have a fairly large list of websites in "file.txt" and want to check whether the words "Hello World!" appear on each site in the list, using a loop and curl.
i.e in "file.txt" :
blabla.com
blabla2.com
blabla3.com
then my code :
#!/bin/bash
put() {
printf "list : "
read list
run=$(cat $list)
}
put
scan_list() {
for run in $(cat $list);do
if [[ $(curl -skL ${run}) =~ "Hello World!" ]];then
printf "${run} Hello World! \n"
else
printf "${run} No Hello:( \n"
fi
done
}
scan_list
this takes a lot of time, is there a way to make the checking process faster?
Use xargs:
% tr '\12' '\0' < file.txt | \
xargs -0 -r -n 1 -t -P 3 sh -c '
if curl -skL "$1" | grep -q "Hello World!"; then
echo "$1 Hello World!"
exit
fi
echo "$1 No Hello:("
' _
Use tr to convert the newlines in file.txt to nulls (\0).
Pass through xargs with the -0 option to split on nulls.
The -r option prevents the command from being run if the input is empty. This is only available on Linux, so for macOS or *BSD you will need to check that file.txt is not empty before running.
The -n 1 option passes only one entry per execution.
The -t option is for debugging; it prints the command before it is run.
We allow 3 simultaneous commands in parallel with the -P 3 option.
Using sh -c with a single quoted multi-line command, we substitute $1 for the entries from the file.
The _ fills in the $0 argument, so our entries are $1.
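The same pipeline can be tried without network access. In the sketch below, check_all is a hypothetical stand-in that just echoes each entry instead of calling curl; I dropped -r and -t so it also runs with non-GNU xargs. Output order is not guaranteed with -P, hence the sort.

```shell
#!/bin/bash
# check_all is a hypothetical stand-in for the curl check: it reads one entry
# per line on stdin and runs up to three workers at a time.
check_all() {
    tr '\n' '\0' |
        xargs -0 -n 1 -P 3 sh -c 'echo "checked: $1"' _
}

printf '%s\n' blabla.com blabla2.com blabla3.com | check_all | sort
```

Replacing the echo with the curl | grep test from the answer gives the parallel checker.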

Take multiple (any number of input) input strings and concatenate in shell

I want to input multiple strings.
For example:
abc
xyz
pqr
and I want output like this (including quotes) in a file:
"abc","xyz","pqr"
I tried the following code, but it doesn't give the expected output.
NextEmail=","
until [ "a$NextEmail" = "a" ];do
echo "Enter next E-mail: "
read NextEmail
Emails="\"$Emails\",\"$NextEmail\""
done
echo -e $Emails
This seems to work:
#!/bin/bash
# via https://stackoverflow.com/questions/1527049/join-elements-of-an-array
function join_by { local IFS="$1"; shift; echo "$*"; }
emails=()
while read line
do
if [[ -z $line ]]; then break; fi
emails+=("$line")
done
join_by ',' "${emails[@]}"
$ bash vvuv.sh
my-email
another-email
third-email
my-email,another-email,third-email
$
With sed and paste:
sed 's/.*/"&"/' infile | paste -sd,
The sed command puts "" around each line; paste does serial pasting (-s) and uses , as the delimiter (-d,).
If input is from standard input (and not a file), you can just remove the input filename (infile) from the command; to store in a file, add a redirection at the end (> outfile).
If you can withstand a trailing comma, then printf can convert an array, with no loop required...
$ readarray -t a < <(printf 'abc\nxyx\npqr\n' )
$ declare -p a
declare -a a=([0]="abc" [1]="xyx" [2]="pqr")
$ printf '"%s",' "${a[@]}"; echo
"abc","xyx","pqr",
(To be fair, there's a loop running inside bash, to step through the array, but it's written in C, not bash. :) )
If you wanted, you could replace the final line with:
$ printf -v s '"%s",' "${a[@]}"
$ s="${s%,}"
$ echo "$s"
"abc","xyx","pqr"
This uses printf -v to store the imploded text into a variable, $s, which you can then strip the trailing comma off using Parameter Expansion.
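Putting the pieces together for the original question, a rough sketch could look like this. quote_join is a made-up name, and it assumes Bash 4 or later for readarray.

```shell
#!/bin/bash
# quote_join (hypothetical): read lines from stdin, print them as "a","b","c".
quote_join() {
    local -a items
    local joined
    readarray -t items                   # one array element per input line
    (( ${#items[@]} )) || return 0       # nothing to print for empty input
    printf -v joined '"%s",' "${items[@]}"
    printf '%s\n' "${joined%,}"          # strip the trailing comma
}

printf 'abc\nxyz\npqr\n' | quote_join
```

This prints "abc","xyz","pqr"; redirect with > outfile to store it in a file.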

Remove leading zeros from MAC address

I have a MAC address that looks like this.
01:AA:BB:0C:D0:E1
I want to convert it to lowercase and strip the leading zeros.
1:aa:bb:c:d0:e1
What's the simplest way to do that in a Bash script?
$ echo 01:AA:BB:0C:D0:E1 | sed 's/\(^\|:\)0/\1/g;s/.*/\L\0/'
1:aa:bb:c:d0:e1
\(^\|:\)0 matches either the line start (^) or a :, followed by a 0.
We replace this with the capture (either line start or :), which removes the 0.
Then a second substitution (s/.*/\L\0/) puts the whole line in lowercase.
$ sed --version | head -1
sed (GNU sed) 4.2.2
EDIT: Alternatively:
echo 01:AA:BB:0C:D0:E1 | sed 's/0\([0-9A-Fa-f]\)/\1/g;s/.*/\L\0/'
This replaces 0x (with x any hexa digit) by x.
EDIT: if your sed does not support \L, use tr:
echo 01:AA:BB:0C:D0:E1 | sed 's/0\([0-9A-Fa-f]\)/\1/g' | tr '[:upper:]' '[:lower:]'
Here's a pure Bash≥4 possibility:
mac=01:AA:BB:0C:D0:E1
IFS=: read -r -d '' -a macary < <(printf '%s:\0' "$mac")
macary=( "${macary[@]#0}" )
macary=( "${macary[@],,}" )
IFS=: eval 'newmac="${macary[*]}"'
The line IFS=: read -r -d '' -a macary < <(printf '%s:\0' "$mac") is the canonical way to split a string into an array,
the expansion "${macary[@]#0}" is that of the array macary with leading 0 (if any) removed,
the expansion "${macary[@],,}" is that of the array macary in lowercase,
IFS=: eval 'newmac="${macary[*]}"' is a standard way to join the fields of an array (note that the use of eval is perfectly safe).
After that:
declare -p newmac
yields
declare -- newmac="1:aa:bb:c:d0:e1"
as required.
A more robust way is to validate the MAC address first:
mac=01:AA:BB:0C:D0:E1
a='([[:xdigit:]]{2})' ; regex="^$a:$a:$a:$a:$a:$a$"
[[ $mac =~ $regex ]] || { echo "Invalid MAC address" >&2; exit 1; }
And then, using the valid result of the regex match (BASH_REMATCH):
set -- $(printf '%x ' $(printf '0x%s ' "${BASH_REMATCH[@]:1}" ))
IFS=: eval 'printf "%s\n" "$*"'
Which will print:
1:aa:bb:c:d0:e1
Hex values without leading zeros and in lowercase.
If Uppercase is needed, change the printf '%x ' to printf '%X '.
If Leading zeros are needed change the same to printf '%02x '.
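The validate-then-format approach can be wrapped into one function. The following is a sketch under my own naming (normalize_mac is hypothetical) and returns an error instead of exiting, so it can be used in a larger script.

```shell
#!/bin/bash
# normalize_mac (hypothetical): validate a MAC address, then print it in
# lowercase with leading zeros stripped from each octet.
normalize_mac() {
    local octet='([[:xdigit:]]{2})'
    local regex="^$octet:$octet:$octet:$octet:$octet:$octet\$"
    [[ $1 =~ $regex ]] || { echo "Invalid MAC address: $1" >&2; return 1; }

    local out part
    for part in "${BASH_REMATCH[@]:1}"; do
        printf -v part '%x' "0x$part"    # hex -> number -> lowercase hex
        out+="${out:+:}$part"            # join with ':' after the first octet
    done
    printf '%s\n' "$out"
}

normalize_mac 01:AA:BB:0C:D0:E1
```

This prints 1:aa:bb:c:d0:e1; note that an octet of 00 comes out as 0.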

How to perform a for loop on each character in a string in Bash?

I have a variable like this:
words="这是一条狗。"
I want to make a for loop on each of the characters, one at a time, e.g. first character="这", then character="是", character="一", etc.
The only way I know is to output each character to separate line in a file, then use while read line, but this seems very inefficient.
How can I process each character in a string through a for loop?
You can use a C-style for loop:
foo=string
for (( i=0; i<${#foo}; i++ )); do
echo "${foo:$i:1}"
done
${#foo} expands to the length of foo. ${foo:$i:1} expands to the substring starting at position $i of length 1.
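As a small worked example of the same two expansions, here is a sketch that walks the string backwards to reverse it. The function name reverse is mine.

```shell
#!/bin/bash
# reverse (hypothetical): build the reversed string one character at a time.
reverse() {
    local s=$1 out= i
    for (( i = ${#s} - 1; i >= 0; i-- )); do
        out+=${s:i:1}        # substring of length 1 at offset i
    done
    printf '%s\n' "$out"
}

reverse string
```

This prints gnirts.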
With sed in the dash shell and LANG=en_US.UTF-8, I got the following to work correctly:
$ echo "你好嗎 新年好。全型句號" | sed -e 's/\(.\)/\1\n/g'
你
好
嗎
新
年
好
。
全
型
句
號
and
$ echo "Hello world" | sed -e 's/\(.\)/\1\n/g'
H
e
l
l
o
w
o
r
l
d
Thus, output can be looped with while read ... ; do ... ; done
Edit: the sample text, translated into English:
"你好嗎 新年好。全型句號" is zh_TW.UTF-8 encoded text for:
"你好嗎" = How are you[ doing]
" " = a normal space character
"新年好" = Happy new year
"。全型句號" = a full-width full stop followed by its text description
${#var} returns the length of var
${var:pos:N} returns N characters from pos onwards
Examples:
$ words="abc"
$ echo ${words:0:1}
a
$ echo ${words:1:1}
b
$ echo ${words:2:1}
c
so it is easy to iterate.
another way:
$ grep -o . <<< "abc"
a
b
c
or
$ grep -o . <<< "abc" | while read letter; do echo "my letter is $letter" ; done
my letter is a
my letter is b
my letter is c
I'm surprised no one has mentioned the obvious bash solution utilizing only while and read.
while read -n1 character; do
echo "$character"
done < <(echo -n "$words")
Note the use of echo -n to avoid the extraneous newline at the end. printf is another good option and may be more suitable for your particular needs. If you want to ignore whitespace then replace "$words" with "${words// /}".
Another option is fold. Please note however that it should never be fed into a for loop. Rather, use a while loop as follows:
while read char; do
echo "$char"
done < <(fold -w1 <<<"$words")
The primary benefit to using the external fold command (of the coreutils package) would be brevity. You can feed its output to another command such as xargs (part of the findutils package) as follows:
fold -w1 <<<"$words" | xargs -I% -- echo %
You'll want to replace the echo command used in the example above with the command you'd like to run against each character. Note that xargs will discard whitespace by default. You can use -d '\n' to disable that behavior.
Internationalization
I just tested fold with some of the Asian characters and realized it doesn't have Unicode support. So while it is fine for ASCII needs, it won't work for everyone. In that case there are some alternatives.
I'd probably replace fold -w1 with an awk array:
awk 'BEGIN{FS=""} {for (i=1;i<=NF;i++) print $i}'
Or the grep command mentioned in another answer:
grep -o .
Performance
FYI, I benchmarked the 3 aforementioned options. The first two were fast, nearly tying, with the fold loop slightly faster than the while loop. Unsurprisingly xargs was the slowest... 75x slower.
Here is the (abbreviated) test code:
words=$(python -c 'from string import ascii_letters as l; print(l * 100)')
testrunner(){
for test in test_while_loop test_fold_loop test_fold_xargs test_awk_loop test_grep_loop; do
echo "$test"
(time for (( i=1; i<$((${1:-100} + 1)); i++ )); do "$test"; done >/dev/null) 2>&1 | sed '/^$/d'
echo
done
}
testrunner 100
Here are the results:
test_while_loop
real 0m5.821s
user 0m5.322s
sys 0m0.526s
test_fold_loop
real 0m6.051s
user 0m5.260s
sys 0m0.822s
test_fold_xargs
real 7m13.444s
user 0m24.531s
sys 6m44.704s
test_awk_loop
real 0m6.507s
user 0m5.858s
sys 0m0.788s
test_grep_loop
real 0m6.179s
user 0m5.409s
sys 0m0.921s
I believe there is still no ideal solution that would correctly preserve all whitespace characters and is fast enough, so I'll post my answer. Using ${foo:$i:1} works, but is very slow, which is especially noticeable with large strings, as I will show below.
My idea is an expansion of a method proposed by Six, which involves read -n1, with some changes to keep all characters and work correctly for any string:
while IFS='' read -r -d '' -n 1 char; do
# do something with $char
done < <(printf %s "$string")
How it works:
IFS='' - Redefining internal field separator to empty string prevents stripping of spaces and tabs. Doing it on a same line as read means that it will not affect other shell commands.
-r - Means "raw", which prevents read from treating \ at the end of the line as a special line concatenation character.
-d '' - Passing empty string as a delimiter prevents read from stripping newline characters. Actually means that null byte is used as a delimiter. -d '' is equal to -d $'\0'.
-n 1 - Means that one character at a time will be read.
printf %s "$string" - Using printf instead of echo -n is safer, because echo treats -n and -e as options. If you pass "-e" as a string, echo will not print anything.
< <(...) - Passing string to the loop using process substitution. If you use here-strings instead (done <<< "$string"), an extra newline character is appended at the end. Also, passing string through a pipe (printf %s "$string" | while ...) would make the loop run in a subshell, which means all variable operations are local within the loop.
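To see that whitespace really survives, here is a minimal sketch that counts the spaces, tabs and newlines the loop receives. count_whitespace is a name I invented for the illustration.

```shell
#!/bin/bash
# count_whitespace (hypothetical): count spaces, tabs and newlines in $1
# using the read loop described above.
count_whitespace() {
    local char n=0
    while IFS= read -r -d '' -n 1 char; do
        case $char in
            ' '|$'\t'|$'\n') n=$((n + 1)) ;;
        esac
    done < <(printf %s "$1")
    echo "$n"
}

count_whitespace "$(printf ' a\tb\nc  d')"
```

The sample string holds a leading space, a tab, a newline and two more spaces, so this prints 5. With a plain read -n1 loop those characters would arrive as empty strings.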
Now, let's test the performance with a huge string.
I used the following file as a source:
https://www.kernel.org/doc/Documentation/kbuild/makefiles.txt
The following script was called through time command:
#!/bin/bash
# Saving contents of the file into a variable named `string'.
# This is for test purposes only. In real code, you should use
# `done < "filename"' construct if you wish to read from a file.
# Using `string="$(cat makefiles.txt)"' would strip trailing newlines.
IFS='' read -r -d '' string < makefiles.txt
while IFS='' read -r -d '' -n 1 char; do
# remake the string by adding one character at a time
new_string+="$char"
done < <(printf %s "$string")
# confirm that new string is identical to the original
diff -u makefiles.txt <(printf %s "$new_string")
And the result is:
$ time ./test.sh
real 0m1.161s
user 0m1.036s
sys 0m0.116s
As we can see, it is quite fast.
Next, I replaced the loop with one that uses parameter expansion:
for (( i=0 ; i<${#string}; i++ )); do
new_string+="${string:$i:1}"
done
The output shows exactly how bad the performance loss is:
$ time ./test.sh
real 2m38.540s
user 2m34.916s
sys 0m3.576s
The exact numbers may vary on different systems, but the overall picture should be similar.
I've only tested this with ascii strings, but you could do something like:
while test -n "$words"; do
c=${words:0:1} # Get the first character
echo character is "'$c'"
words=${words:1} # trim the first character
done
It is also possible to split the string into a character array using fold and then iterate over this array:
for char in `echo "这是一条狗。" | fold -w1`; do
echo $char
done
The C style loop in @chepner's answer is in the shell function update_terminal_cwd, and the grep -o . solution is clever, but I was surprised not to see a solution using seq. Here's mine:
read word
for i in $(seq 1 ${#word}); do
echo "${word:i-1:1}"
done
#!/bin/bash
word=$(echo 'Your Message' |fold -w 1)
for letter in ${word} ; do echo "${letter} is a letter"; done
Here is the output:
Y is a letter
o is a letter
u is a letter
r is a letter
M is a letter
e is a letter
s is a letter
s is a letter
a is a letter
g is a letter
e is a letter
To iterate ASCII characters on a POSIX-compliant shell, you can avoid external tools by using the Parameter Expansions:
#!/bin/sh
str="Hello World!"
while [ ${#str} -gt 0 ]; do
next=${str#?}
echo "${str%$next}"
str=$next
done
or
str="Hello World!"
while [ -n "$str" ]; do
next=${str#?}
echo "${str%$next}"
str=$next
done
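One caveat with the loops above: ${str%$next} treats $next as a pattern, so a string containing glob characters such as * can be split incorrectly. A hedged variant that quotes the inner expansion is sketched here; each_char is a made-up name and its variables are global, since POSIX sh has no local.

```shell
#!/bin/sh
# each_char (hypothetical): print each character of $1 on its own line.
each_char() {
    str=$1
    while [ -n "$str" ]; do
        rest=${str#?}                     # everything after the first character
        printf '%s\n' "${str%"$rest"}"    # quoted, so $rest is matched literally
        str=$rest
    done
}

each_char 'a*b'
```

With the quoting in place, this prints a, * and b on separate lines.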
sed works with unicode
IFS=$'\n'
for z in $(sed 's/./&\n/g' <(printf '你好嗎')); do
echo hello: "$z"
done
outputs
hello: 你
hello: 好
hello: 嗎
Another approach, if you don't care about whitespace being ignored:
for char in $(sed -E s/'(.)'/'\1 '/g <<<"$your_string"); do
# Handle $char here
done
Another way is:
Characters="TESTING"
index=1
while [ $index -le ${#Characters} ]
do
echo ${Characters} | cut -c${index}-${index}
index=$(expr $index + 1)
done
fold and while read are great for the job as shown in some answers here. Contrary to those answers, I think it's much more intuitive to pipe in the order of execution:
echo "asdfg" | fold -w 1 | while read c; do
echo -n "$c "
done
Outputs: a s d f g
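One caveat with the pipeline above: a plain `read c` strips whitespace, so a space character comes through as an empty string. A minimal sketch that keeps spaces by using `IFS= read -r`, assuming ASCII input; the function name `chars_bracketed` is only illustrative:

```shell
#!/bin/bash
# Wrap every character of the first argument in brackets, spaces included.
# chars_bracketed is a hypothetical helper name used for this example.
chars_bracketed() {
    printf '%s\n' "$1" | fold -w 1 | while IFS= read -r c; do
        printf '[%s]' "$c"      # IFS= keeps a lone space intact
    done
    printf '\n'
}

chars_bracketed "a b"
```

With the default IFS the middle element would be empty instead of a space.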
I share my solution:
read word
for char in $(grep -o . <<<"$word") ; do
echo $char
done
TEXT="hello world"
for i in {1..${#TEXT}}; do
echo ${TEXT[i]}
done
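where {1..N} is an inclusive range,
${#TEXT} is the number of characters in the string, and
${TEXT[i]} gets a character from the string like an item from an array.
Note that this works in zsh; in bash, brace expansion happens before parameter expansion, so {1..${#TEXT}} is not expanded into a range there.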
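In bash, the same iteration can be sketched with a C-style loop and substring expansion. This is a minimal example, not a drop-in replacement; the function name `each_char` is made up for illustration:

```shell
#!/bin/bash
# Print each character of the first argument on its own line.
# each_char is a hypothetical helper name, not taken from the answers above.
each_char() {
    local text=$1 i
    for (( i = 0; i < ${#text}; i++ )); do
        printf '%s\n' "${text:i:1}"   # substring of length 1 at offset i
    done
}

each_char "hello world"
```

Offsets in `${text:i:1}` start at 0, unlike the 1-based indexing in the snippet above.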

Splitting /proc/cmdline arguments with spaces

Most scripts that parse /proc/cmdline break it up into words and then filter out arguments with a case statement, example:
CMDLINE="quiet union=aufs wlan=FOO"
for x in $CMDLINE
do
»···case $x in
»···»···wlan=*)
»···»···echo "${x//wlan=}"
»···»···;;
»···esac
done
The problem is when the WLAN ESSID has spaces. Users expect to set wlan='FOO BAR' (like a shell variable) and then get the unexpected result of 'FOO with the above code, since the for loop splits on spaces.
Is there a better way of parsing the /proc/cmdline from a shell script falling short of almost evaling it?
Or are there some quoting tricks? I was thinking I could perhaps ask users to entity quote spaces and decode like so: /bin/busybox httpd -d "FOO%20BAR". Or is that a bad solution?
There are some ways:
cat /proc/PID/cmdline | tr '\000' ' '
cat /proc/PID/cmdline | xargs -0 echo
These will work with most cases, but will fail when arguments have spaces in them. However I do think that there would be better approaches than using /proc/PID/cmdline.
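Since /proc/PID/cmdline separates arguments with NUL bytes, bash can read them one at a time without losing embedded spaces. A minimal sketch, assuming bash (for `read -d ''`); the function name `print_args` is hypothetical:

```shell
#!/bin/bash
# Read NUL-delimited arguments from stdin and print each one in brackets.
# print_args is an illustrative name, not part of any existing tool.
print_args() {
    local arg
    while IFS= read -r -d '' arg; do
        printf '[%s]\n' "$arg"
    done
}

# Example: show the arguments of the current shell process (Linux only).
if [ -r "/proc/$$/cmdline" ]; then
    print_args < "/proc/$$/cmdline"
fi
```

This applies to per-process command lines only; the kernel's /proc/cmdline is a single space-separated line.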
set -- $(cat /proc/cmdline)
for x in "$@"; do
case "$x" in
wlan=*)
echo "${x#wlan=}"
;;
esac
done
Most commonly, \0ctal escape sequences are used when spaces are unacceptable.
In Bash, printf can be used to unescape them, e.g.
CMDLINE='quiet union=aufs wlan=FOO\040BAR'
for x in $CMDLINE; do
[[ $x = wlan=* ]] || continue
printf '%b\n' "${x#wlan=}"
done
Since you want the shell to parse the /proc/cmdline contents, it's hard to avoid eval'ing it.
#!/bin/bash
eval "kernel_args=( $(cat /proc/cmdline) )"
for arg in "${kernel_args[@]}" ; do
case "${arg}" in
wlan=*)
echo "${arg#wlan=}"
;;
esac
done
This is obviously dangerous though as it would blindly run anything that was specified on the kernel command-line like quiet union=aufs wlan=FOO ) ; touch EVIL ; q=( q.
Escaping spaces (\x20) sounds like the most straightforward and safe way.
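As a rough sketch of the escaping idea, assume users write spaces as `\x20` on the kernel command line and the script runs under bash, whose `printf '%b'` understands `\xHH`. The function name `wlan_value` and the sample command line are made up for illustration:

```shell
#!/bin/bash
# Print the decoded value of wlan= from a kernel-style command line.
# Assumes spaces inside values were written as \x20 by the user.
wlan_value() {
    local -a words
    local x
    read -r -a words <<< "$1"       # split on whitespace, keep backslashes
    for x in "${words[@]}"; do
        case $x in
            wlan=*) printf '%b\n' "${x#wlan=}" ;;   # %b expands \x20 to a space
        esac
    done
}

wlan_value 'quiet union=aufs wlan=FOO\x20BAR'
```

Nothing is evaluated as shell code here, so a hostile command line can at worst produce an odd string.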
A heavier alternative is to use a parser that understands shell-like syntax.
In this case, you may not even need the shell anymore.
For example, with python:
$ cat /proc/cmdline
quiet union=aufs wlan='FOO BAR' key="val with space" ) ; touch EVIL ; q=( q
$ python -c 'import shlex; print shlex.split(None)' < /proc/cmdline
['quiet', 'union=aufs', 'wlan=FOO BAR', 'key=val with space', ')', ';', 'touch', 'EVIL', ';', 'q=(', 'q']
Use xargs -n1:
[centos@centos7 ~]$ CMDLINE="quiet union=aufs wlan='FOO BAR'"
[centos@centos7 ~]$ echo $CMDLINE
quiet union=aufs wlan='FOO BAR'
[centos@centos7 ~]$ echo $CMDLINE | xargs -n1
quiet
union=aufs
wlan=FOO BAR
[centos@centos7 ~]$ xargs -n1 -a /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-3.10.0-862.14.4.el7.x86_64
root=UUID=3260cdba-e07e-408f-93b3-c4e9ff55ab10
ro
consoleblank=0
crashkernel=auto
rhgb
quiet
LANG=en_US.UTF-8
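Building on the xargs output above, here is a sketch of pulling a single value out of the split arguments. It assumes the quotes on the command line are balanced (xargs reports an error otherwise); the function name `cmdline_value` is hypothetical:

```shell
#!/bin/bash
# Print the value of KEY=... from a command line read on stdin.
# xargs handles the shell-like quotes; printf emits one argument per line.
cmdline_value() {
    local key=$1 arg
    xargs -n1 printf '%s\n' | while IFS= read -r arg; do
        case $arg in
            "$key"=*) printf '%s\n' "${arg#"$key"=}" ;;
        esac
    done
}

echo "quiet union=aufs wlan='FOO BAR'" | cmdline_value wlan
```

To parse the real kernel command line, redirect it in: `cmdline_value wlan < /proc/cmdline`.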
You could do something like the following using bash, which would turn those arguments into variables like $cmdline_union and $cmdline_wlan:
bash -c "for i in $(cat /proc/cmdline); do printf \"cmdline_%q\n\" \"\$i\"; done" | grep = > /tmp/cmdline.sh
. /tmp/cmdline.sh
Then you would quote and/or escape things just like you would in a normal shell.
In posh:
$ f() { echo $1 - $3 - $2 - $4
> }
$ a="quiet union=aufs wlan=FOO"
$ f $a
quiet - wlan=FOO - union=aufs -
You can define a function and give your $CMDLINE unquoted as an argument to the function. Then you'll invoke shell's parsing mechanisms. Note, that you should test this on the shell it will be working in -- zsh does some funny things with quoting ;-).
Then you can just tell the user to do quoting like in shell:
#!/bin/posh
CMDLINE="quiet union=aufs wlan=FOO"
f() {
while test x"$1" != x
do
case $1 in
union=*) echo ${1##union=}; shift;;
*) shift;;
esac
done
}
f $CMDLINE
(posh - Policy-compliant Ordinary SHell, a shell stripped of any features beyond standard POSIX)
I found a nice way to do it with awk; unfortunately it only works with double quotes:
# Replace spaces outside double quotes with newlines
args=`cat /proc/cmdline | tr -d '\n' | awk 'BEGIN {RS="\"";ORS="\"" }{if (NR%2==1){gsub(/ /,"\n",$0);print $0} else {print $0}}'`
IFS='
'
for line in $args; do
key=${line%%=*}
value=${line#*=}
value=`echo $value | sed -e 's/^"//' -e 's/"$//'`
printf "%20s = %s\n" "$key" "$value"
done

Resources