Bash: decode string with url escaped hex codes [duplicate] - bash

I'm looking for a way to turn this:
hello < world
to this:
hello < world
I could use sed, but how can this be accomplished without using cryptic regex?

Try recode (archived page; GitHub mirror; Debian page):
$ echo '<' |recode html..ascii
<
Install on Linux and similar Unix-y systems:
$ sudo apt-get install recode
Install on Mac OS using:
$ brew install recode

With perl:
cat foo.html | perl -MHTML::Entities -pe 'decode_entities($_);'
With php from the command line:
cat foo.html | php -r 'while(($line=fgets(STDIN)) !== FALSE) echo html_entity_decode($line, ENT_QUOTES|ENT_HTML401);'

An alternative is to pipe through a web browser -- such as:
echo '!' | w3m -dump -T text/html
This worked great for me in cygwin, where downloading and installing distributions are difficult.
This answer was found here

Using xmlstarlet:
echo 'hello < world' | xmlstarlet unesc

A python 3.2+ version:
cat foo.html | python3 -c 'import html, sys; [print(html.unescape(l), end="") for l in sys.stdin]'

This answer is based on: Short way to escape HTML in Bash? which works fine for grabbing answers (using wget) on Stack Exchange and converting HTML to regular ASCII characters:
sed 's/ / /g; s/&/\&/g; s/</\</g; s/>/\>/g; s/"/\"/g; s/#'/\'"'"'/g; s/“/\"/g; s/”/\"/g;'
Edit 1: April 7, 2017 - Added left double quote and right double quote conversion. This is part of bash script that web-scrapes SE answers and compares them to local code files here: Ask Ubuntu -
Code Version Control between local files and Ask Ubuntu answers
Edit June 26, 2017
Using sed was taking ~3 seconds to convert HTML to ASCII on a 1K line file from Ask Ubuntu / Stack Exchange. As such I was forced to use Bash built-in search and replace for ~1 second response time.
Here's the function:
LineOut="" # Make global
HTMLtoText () {
LineOut=$1 # Parm 1= Input line
# Replace external command: Line=$(sed 's/&/\&/g; s/</\</g;
# s/>/\>/g; s/"/\"/g; s/'/\'"'"'/g; s/“/\"/g;
# s/”/\"/g;' <<< "$Line") -- With faster builtin commands.
LineOut="${LineOut// / }"
LineOut="${LineOut//&/&}"
LineOut="${LineOut//</<}"
LineOut="${LineOut//>/>}"
LineOut="${LineOut//"/'"'}"
LineOut="${LineOut//'/"'"}"
LineOut="${LineOut//“/'"'}" # TODO: ASCII/ISO for opening quote
LineOut="${LineOut//”/'"'}" # TODO: ASCII/ISO for closing quote
} # HTMLtoText ()

On macOS, you can use the built-in command textutil (which is a handy utility in general):
echo '👋 hello < world 🌐' | textutil -convert txt -format html -stdin -stdout
outputs:
👋 hello < world 🌐

To support the unescaping of all HTML entities only with sed substitutions would require too long a list of commands to be practical, because every Unicode code point has at least two corresponding HTML entities.
But it can be done using only sed, grep, the Bourne shell and basic UNIX utilities (the GNU coreutils or equivalent):
#!/bin/sh
htmlEscDec2Hex() {
file=$1
[ ! -r "$file" ] && file=$(mktemp) && cat >"$file"
printf -- \
"$(sed 's/\\/\\\\/g;s/%/%%/g;s/&#[0-9]\{1,10\};/\&#x%x;/g' "$file")\n" \
$(grep -o '&#[0-9]\{1,10\};' "$file" | tr -d '&#;')
[ x"$1" != x"$file" ] && rm -f -- "$file"
}
htmlHexUnescape() {
printf -- "$(
sed 's/\\/\\\\/g;s/%/%%/g
;s/&#x\([0-9a-fA-F]\{1,8\}\);/\&#x0000000\1;/g
;s/&#x0*\([0-9a-fA-F]\{4\}\);/\\u\1/g
;s/&#x0*\([0-9a-fA-F]\{8\}\);/\\U\1/g' )\n"
}
htmlEscDec2Hex "$1" | htmlHexUnescape \
| sed -f named_entities.sed
Note, however, that a printf implementation supporting \uHHHH and \UHHHHHHHH sequences is required, such as the GNU utility’s. To test, check for example that printf "\u00A7\n" prints §. To call the utility instead of the shell built-in, replace the occurrences of printf with env printf.
This script uses an additional file, named_entities.sed, in order to support the named entities. It can be generated from the specification using the following HTML page:
<!DOCTYPE html>
<head><meta charset="utf-8" /></head>
<body>
<p id="sed-script"></p>
<script type="text/javascript">
const referenceURL = 'https://html.spec.whatwg.org/entities.json';
function writeln(element, text) {
element.appendChild( document.createTextNode(text) );
element.appendChild( document.createElement("br") );
}
(async function(container) {
const json = await (await fetch(referenceURL)).json();
container.innerHTML = "";
writeln(container, "#!/usr/bin/sed -f");
const addLast = [];
for (const name in json) {
const characters = json[name].characters
.replace("\\", "\\\\")
.replace("/", "\\/");
const command = "s/" + name + "/" + characters + "/g";
if ( name.endsWith(";") ) {
writeln(container, command);
} else {
addLast.push(command);
}
}
for (const command of addLast) { writeln(container, command); }
})( document.getElementById("sed-script") );
</script>
</body></html>
Simply open it in a modern browser, and save the resulting page as text as named_entities.sed. This sed script can also be used alone if only named entities are required; in this case it is convenient to give it executable permission so that it can be called directly.
Now the above shell script can be used as ./html_unescape.sh foo.html, or inside a pipeline reading from standard input.
For example, if for some reason it is needed to process the data by chunks (it might be the case if printf is not a shell built-in and the data to process is large), one could use it as:
nLines=20
seq 1 $nLines $(grep -c $ "$inputFile") | while read n
do sed -n "$n,$((n+nLines-1))p" "$inputFile" | ./html_unescape.sh
done
Explanation of the script follows.
There are three types of escape sequences that need to be supported:
&#D; where D is the decimal value of the escaped character’s Unicode code point;
&#xH; where H is the hexadecimal value of the escaped character’s Unicode code point;
&N; where N is the name of one of the named entities for the escaped character.
The &N; escapes are supported by the generated named_entities.sed script which simply performs the list of substitutions.
The central piece of this method for supporting the code point escapes is the printf utility, which is able to:
print numbers in hexadecimal format, and
print characters from their code point’s hexadecimal value (using the escapes \uHHHH or \UHHHHHHHH).
The first feature, with some help from sed and grep, is used to reduce the &#D; escapes into &#xH; escapes. The shell function htmlEscDec2Hex does that.
The function htmlHexUnescape uses sed to transform the &#xH; escapes into printf’s \u/\U escapes, then uses the second feature to print the unescaped characters.

I like the Perl answer given in https://stackoverflow.com/a/13161719/1506477.
cat foo.html | perl -MHTML::Entities -pe 'decode_entities($_);'
But, it produced an unequal number of lines on plain text files. (and I dont know perl enough to debug it.)
I like the python answer given in https://stackoverflow.com/a/42672936/1506477 --
python3 -c 'import html, sys; [print(html.unescape(l), end="") for l in sys.stdin]'
but it creates a list [ ... for l in sys.stdin] in memory, that is forbidden for large files.
Here is another easy pythonic way without buffering in memory: using awkg.
$ echo 'hello < : " world' | \
awkg -b 'from html import unescape' 'print(unescape(R0))'
hello < : " world
awkg is a python based awk-like line processor. You may install it using pip https://pypi.org/project/awkg/:
pip install awkg
-b is awk's BEGIN{} block that runs once in the beginning.
Here we just did from html import unescape.
Each line record is in R0 variable, for which we did
print(unescape(R0))
Disclaimer:
I am the maintainer of awkg

I have created a sed script based on the list of entities so it must handle most of the entities.
sed -f htmlentities.sed < file.html

My original answer got some comments, that recode does not work for UTF-8 encoded HTML files. This is correct. recode supports only HTML 4. The encoding HTML is an alias for HTML_4.0:
$ recode -l | grep -iw html
HTML-i18n 2070 RFC2070
HTML_4.0 h h4 HTML
The default encoding for HTML 4 is Latin-1. This has changed in HTML 5. The default encoding for HTML 5 is UTF-8. This is the reason, why recode does not work for HTML 5 files.
HTML 5 defines the list of entities here:
https://html.spec.whatwg.org/multipage/named-characters.html
The definition includes a machine readable specification in JSON format:
https://html.spec.whatwg.org/entities.json
The JSON file can be used to perform a simple text replacement. The following example is a self modifying Perl script, which caches the JSON specification in its DATA chunk.
Note: For some obscure compatibility reasons, the specification allows entities without a terminating semicolon. Because of that the entities are sorted by length in reverse order to make sure, that the correct entities are replaced first so that they do not get destroyed by entities without the ending semicolon.
#! /usr/bin/perl
use utf8;
use strict;
use warnings;
use open qw(:std :utf8);
use LWP::Simple;
use JSON::Parse qw(parse_json);
my $entities;
INIT {
if (eof DATA) {
my $data = tell DATA;
open DATA, '+<', $0;
seek DATA, $data, 0;
my $entities_json = get 'https://html.spec.whatwg.org/entities.json';
print DATA $entities_json;
truncate DATA, tell DATA;
close DATA;
$entities = parse_json ($entities_json);
} else {
local $/ = undef;
$entities = parse_json (<DATA>);
}
}
local $/ = undef;
my $html = <>;
for my $entity (sort { length $b <=> length $a } keys %$entities) {
my $characters = $entities->{$entity}->{characters};
$html =~ s/$entity/$characters/g;
}
print $html;
__DATA__
Example usage:
$ echo '😊 & ٱلْعَرَبِيَّة' | ./html5-to-utf8.pl
😊 & ٱلْعَرَبِيَّة

With Xidel:
echo 'hello < : " world' | xidel -s - -e 'parse-html($raw)'
hello < : " world

Related

Trying to comment a source file

I am trying to comment lines of source so that something like
LANG = 'ENG';
becomes
// LANG = 'ENG';
There are over a thousand lines in the source file and 'ENG' is not unique but the entire line IS.
I gave up on wild carding the spaces and just tried the entire extant line 'as-is' but no joy.
Something like (commented shell)--
enter code here
#!/bin/bash
#if [ -n "$5" ] ; then
#if [ "$5" == "ENG" ] ; then
sed -i "s/' LANG = '\''ENG'\''/\/\/' LANG'
= '\''ENG'\''/" vc.pas > vc.out
#fi
#fi
So it reduces down to a single line. No joy whatever I try.
TIA !
Howie
This works for me, passing any whitespace through to the output:
$ echo "LANG='ENG';" | sed "s#^\(\s*LANG\s*=\s*'ENG'\s*;\)#// \1#"
// LANG='ENG';
$ echo " LANG = 'ENG' ; " | sed "s#^\(\s*LANG\s*=\s*'ENG'\s*;\)#// \1#"
// LANG = 'ENG' ;
Technically it needs double backslashes inside the double-quoted string, but because none of the sequences form a valid escape sequence, bash doesn't mind.
With GNU sed, use
sed -i "s,.*LANG *= *'ENG';.*,//&," vc.pas
where
-i - enables inline file modification
s - substitution command
, - delimiter
.*LANG *= *'ENG';.* - text containing LANG = 'ENG'; with any amount of spaces around =
//& - replaces the matched line with // and the line itself
Use a regular expression for the line selection, followed by a substitution. Put the replacement commands in a file:
$ cat dt.sed
/^LANG[[:space:]]*=[[:space:]]*'[^']*'/s;^;// ;
$
Then run sed(1) with that script:
$ echo "LANG='FOO'" | sed -f dt.sed
// LANG='FOO'
This works on a Fedora 34 system, but should work on any Linux.

Awk script vs shell command - different results?

First thing - sorry for a bit misleading title, not sure how to describe this yet.
Basically, I have a list of keywords and I want to fetch the number of documents google returns per query. I have created the following awk script:
{
x = ""
for(i=1;i<=NF;i++) {
if(i==NF) {
x = x $i
} else {
x = x $i "+"
}
}
tab = "777" # id of an existing chrome tab as reported by 'chrome-cli list tabs'
system("chrome-cli open http://www.google.com/search?hl=en\\&q="x" -t " tab)
system("chrome-cli source -t " tab " | grep '<div id=\"resultStats\">About .* results<nobr>' | head -1 | sed -e 's/.*>About \(.*\) results<nobr>.*/\1/' | awk '{print $1\"\t"x"\"}' >> freq.log " );
system("cat freq.log" );
system("sleep 0.5");
}
What happens here is that I firstly replace all spaces with + signs, execute chrome-cli command to open chrome at that particular window, download source code and parse the number between "About" and "results" strings out and append the result to freq.log. This, however, outputs the following string into the file (for term alarm):
"})();</script><div alarm"
When I execute the same command from iOS terminal, I get a correct number (returns 127.000.000):
chrome-cli source -t 777 | grep '<div id="resultStats">About .* results<nobr>' | head -1 | sed -e 's/.*>About \(.*\) results<nobr>.*/\1/'
So my problem basically is that while everything works correctly from the terminal, as soon as I move my code to awk and execute it using a call to system, something breaks and regex doesn't work anymore.
You've properly escaped " in your system commands, but it looks like you haven't escaped the \ in your sed command. By the time it reaches sed, \( is being seen as a plain (.
Try changing your system statements to print and you'll see what I mean.
Worst case scenario, you can bundle the series of system commands into a shell script and have awk call it instead... but in that case, you might as well entirely use shell scripting instead of awk.

Different MD5 outputs in shell and script

This is driving me crazy, im trying to do some MD5 calculation based on the fritzbox SPEC for logging in. Basically you have to convert a challenge and the password into UTF-16LE and then hash it by md5, then concat challenge-md5(uft-16le(challenge-password))
To do so i'm using iconv and md5 from mac OSX in a script
echo -n "challenge-password1234" | iconv -f ISO8859-1 -t UTF-16LE | md5
Which outputs to 2f42ad272c7aec4c94f0d9525080e6de
Doing the exact thing by just pasting it in the shell outputs to 1722e126192656712a1d352e550f1317
The latter one is correct (accepted by fritzbox) the first one is wrong.
Calling the script with bash script.sh results in the proper hash, calling it with sh script.sh results in the wrong hash, which leads to the new question: How come the output is any different between sh and bash?
Different versions of echo behave in very different ways. Some take command options (like -n) that modify their behavior (including -n suppressing the trailing linefeed), and some don't. Some interpret escape sequences in the string itself (including \c at the end of the string suppressing the trailing linefeed)... and some don't. Some do both. It appears the version of echo (/bin/echo) on your system doesn't take options, and therefore treats -n as a string to be printed. If you're using bash, its builtin version overrides /bin/echo, and does interpret flags.
Basically, echo is a mess of inconsistency and portability traps. So don't use it, use printf instead. It's a little more complicated because you have to specify a format string, then the actual stuff you want printed, but it can save a ton of headaches.
$ printf "%s" "challenge-password1234" | iconv -f ISO8859-1 -t UTF-16LE | md5
1722e126192656712a1d352e550f1317
And by the way, here's what the echo command was actually printing:
$ printf "%s\n" "-n challenge-password1234" | iconv -f ISO8859-1 -t UTF-16LE | md5
2f42ad272c7aec4c94f0d9525080e6de
You are specifically asking for the difference between script and command line result. Please note that there can be other cases in which the script result will not be useable for your Fritzbox.
The sample code for the session handling in the AVM Fritzbox documentation is written in C# see
https://avm.de/fileadmin/user_upload/Global/Service/Schnittstellen/AVM_Technical_Note_-_Session_ID.pdf
from that code I derived https://github.com/WolfgangFahl/fritz-csharp-api
and added some tests:
https://github.com/WolfgangFahl/fritz-csharp-api/blob/master/fritzsimpletest.cs
which where inspired by the tests in:
https://github.com/WolfgangFahl/fritzbox-java-api/blob/master/src/test/java/com/github/kaklakariada/fritzbox/Md5ServiceTest.java
Basically there were 5 examples:
"" -> "d41d8cd98f00b204e9800998ecf8427e"
"secret", "09433e1853385270b51511571e35eeca"
"test", "c8059e2ec7419f590e79d7f1b774bfe6"
"1234567z-äbc", "9e224a41eeefa284df7bb0f26c2913e2";
"!\"§$%&/()=?ßüäöÜÄÖ-.,;:_`´+*#'<>≤|" -> "ad44a7cb10a95cb0c4d7ae90b0ff118a"
and yours is now example number 6:
"challenge-password1234" -> "1722e126192656712a1d352e550f1317"
and these behave the same in the Java and C# implementation. Now trying out these with the bash script below which has
echo -n "$l_s" | iconv --from-code ISO8859-1 --to-code UTF-16LE | md5sum -b | gawk '{print substr($0,1,32)}'
which is e.g. discussed in https://www.ip-phone-forum.de/threads/fritzbox-challenge-response-in-sh.264639/
as it's getmd5 function gives different results for the Umlaut cases e.g. in my bash on Mac OS Sierra.
There is already some debug output added.
The encoding given for 1234567z-äbc has the byte sequence 2d c3 a4 62 63 while e.g. the java implementation has 2d e4 62 63.
So beware of umlauts in your password - the fritzbox access might fail using this script solution. I am looking for a workaround and will post it here when i find it.
bash script
#!/bin/bash
# WF 2017-10-30
# Fritzbox handling
#
# get the property with the given name
# params
# 1: the property name e.g. fritzbox.url, fritzbox.username, fritzbox.password
#
getprop() {
local l_prop="$1"
cat $HOME/.fritzbox/application.properties | grep "$l_prop" | cut -f2 -d=
}
#
# get a value from the fritzbox login_sid.lua
#
getboxval() {
local l_node="$1"
local l_response="$2"
if [ "$l_response" != "" ]
then
l_data="&response=$l_response"
fi
fxml=/tmp/fxml$$
curl --insecure -s "${box_url}/login_sid.lua?username=${username}$l_response" > $fxml
cat $fxml |
gawk -v node=$l_node 'match($0,"<"node">([0-9a-f]+)</"node">",m) { print m[1] }'
cat $fxml
rm $fxml
}
#
# get the md5 for the given string
#
# see https://avm.de/fileadmin/user_upload/Global/Service/Schnittstellen/AVM_Technical_Note_-_Session_ID.pdf
#
# param
# 1: s - the string
#
# return
# md5
#
getmd5() {
local l_s="$1"
echo -n "$l_s" | iconv -f ISO8859-1 -t UTF-16LE | od -x
echo -n "$l_s" | iconv --from-code ISO8859-1 --to-code UTF-16LE | md5sum -b | gawk '{print substr($0,1,32)}'
}
# get global settings from application properties
box_url=$(getprop fritzbox.url)
username=$(getprop fritzbox.username)
password=$(getprop fritzbox.password)
# uncomment to test
getmd5 ""
# should be d41d8cd98f00b204e9800998ecf8427e
getmd5 secret
# should be 09433e1853385270b51511571e35eeca
getmd5 test
# should be c8059e2ec7419f590e79d7f1b774bfe6
getmd5 1234567z-äbc
# should be 9e224a41eeefa284df7bb0f26c2913e2
getmd5 "!\"§$%&/()=?ßüäöÜÄÖ-.,;:_\`´+*#'<>≤|"
# should be ad44a7cb10a95cb0c4d7ae90b0ff118a
exit
# Login and get SID
challenge=$(getboxval Challenge "")
echo "challenge=$challenge"
md5=$(getmd5 "${challenge}-${password}")
echo "md5=$md5"
response="${challenge}-${md5}"
echo "response=$response"
getboxval SID "$response"

Storing files inside BASH scripts

Is there a way to store binary data inside a BASH script so that it can be piped to a program later in that script?
At the moment (on Mac OS X) I'm doing
play sound.m4a
# do stuff
I'd like to be able to do something like:
SOUND <<< the m4a data, encoded somehow?
END
echo $SOUND | play
#do stuff
Is there a way to do this?
Base64 encode it. For example:
$ openssl base64 < sound.m4a
and then in the script:
S=<<SOUND
YOURBASE64GOESHERE
SOUND
echo $S | openssl base64 -d | play
I know this is like riding a dead horse since this post is rather old, but I'd like to improve Sionide21 answer as his solution stores the binary data in a variable which is not necessary.
openssl base64 -d <<SOUND | play
YOURBASE64DATAHERE
SOUND
Note: HereDoc Syntax requires that you don't indent the last 'SOUND'
and base64 decoding sometimes failed on me when i indented that
'YOURBASE64DATAHERE' section. So it's best practice to keep the Base64
Data as well the end-token unindented.
I've found this looking for a more elegant way to store binary data in shell scripts, but i had already solved it like described here. Only difference is I'm transporting some tar-bzipped files this way. My platform knows a separate base64 binary so I don't have to use openssl.
base64 -d <<EOF | tar xj
BASE64ENCODEDTBZ
EOF
There is a Unix format called shar (shell archive) that allows you to store binary data in a shell script. You create a shar file using the shar command.
When I've done this I've used a shell here document piped through atob.
function emit_binary {
cat << 'EOF' | atob
--junk emitted by btoa here
EOF
}
the single quotes around 'EOF' prevent parameter expansion in the body of the here document.
atob and btoa are very old programs, and for some reason they are often absent from modern Unix distributions. A somewhat less efficient but more ubiquitous alternative is to use mimencode -b instead of btoa. mimencode will encode into base64 ASCII. The corresponding decoding command is mimencode -b -u instead of atob. The openssl command will also do base64 encoding.
Here's some code I wrote a long time ago that packs a choice executable into a bash script. I can't remember exactly how it works, but I suspect you could pretty easily modify it to do what you want.
#!/usr/bin/perl
use strict;
print "Stub Creator 1.0\n";
unless($#ARGV == 1)
{
print "Invalid argument count, usage: ./makestub.pl InputExecutable OutputCompressedExecutable\n";
exit;
}
unless(-r $ARGV[0])
{
die "Unable to read input file $ARGV[0]: $!\n";
}
my $OUTFILE;
open(OUTFILE, ">$ARGV[1]") or die "Unable to create $ARGV[1]: $!\n";
print "\nCreating stub script...";
print OUTFILE "#!/bin/bash\n";
print OUTFILE "a=/tmp/\`date +%s%N\`;tail -n+3 \$0 | zcat > \$a;chmod 700 \$a;\$a \${*};rm -f \$a;exit;\n";
close(OUTFILE);
print "done.\nCompressing input executable and appending...";
`gzip $ARGV[0] -n --best -c >> $ARGV[1]`;
`chmod +x $ARGV[1]`;
my $OrigSize;
$OrigSize = -s $ARGV[0];
my $NewSize;
$NewSize = -s $ARGV[1];
my $Temp;
if($OrigSize == 0)
{
$NewSize = 1;
}
$Temp = ($NewSize / $OrigSize) * 100;
$Temp *= 1000;
$Temp = int($Temp);
$Temp /= 1000;
print "done.\nStub successfully composed!\n\n";
print <<THEEND;
Original size: $OrigSize
New size: $NewSize
Compression: $Temp\%
THEEND
If it's a single block of data to use, the trick I've used is to put a "start of data" marker at the end of the file, then use sed in the script to filter out the leading stuff. For example, create the following as "play-sound.bash":
#!/bin/bash
sed '1,/^START OF DATA/d' $0 | play
exit 0
START OF DATA
Then, you can just append your data to the end of this file:
cat sound.m4a >> play-sound.bash
and now, executing the script should play the sound directly.
Since Python is available on OS X by default, you can do as below:
ENCODED=$(python -m base64 foo.m4a)
Then decode it as below:
echo $ENCODED | python -m base64 -d | play

How to urlencode data for curl command?

I am trying to write a bash script for testing that takes a parameter and sends it through curl to web site. I need to url encode the value to make sure that special characters are processed properly. What is the best way to do this?
Here is my basic script so far:
#!/bin/bash
host=${1:?'bad host'}
value=$2
shift
shift
curl -v -d "param=${value}" http://${host}/somepath $#
Use curl --data-urlencode; from man curl:
This posts data, similar to the other --data options with the exception that this performs URL-encoding. To be CGI-compliant, the <data> part should begin with a name followed by a separator and a content specification.
Example usage:
curl \
--data-urlencode "paramName=value" \
--data-urlencode "secondParam=value" \
http://example.com
See the man page for more info.
This requires curl 7.18.0 or newer (released January 2008). Use curl -V to check which version you have.
You can as well encode the query string:
curl --get \
--data-urlencode "p1=value 1" \
--data-urlencode "p2=value 2" \
http://example.com
# http://example.com?p1=value%201&p2=value%202
Another option is to use jq:
$ printf %s 'input text'|jq -sRr #uri
input%20text
$ jq -rn --arg x 'input text' '$x|#uri'
input%20text
-r (--raw-output) outputs the raw contents of strings instead of JSON string literals. -n (--null-input) doesn't read input from STDIN.
-R (--raw-input) treats input lines as strings instead of parsing them as JSON, and -sR (--slurp --raw-input) reads the input into a single string. You can replace -sRr with -Rr if your input only contains a single line or if you don't want to replace linefeeds with %0A:
$ printf %s\\n multiple\ lines of\ text|jq -Rr #uri
multiple%20lines
of%20text
$ printf %s\\n multiple\ lines of\ text|jq -sRr #uri
multiple%20lines%0Aof%20text%0A
Or this percent-encodes all bytes:
xxd -p|tr -d \\n|sed 's/../%&/g'
Here is the pure BASH answer.
Update: Since many changes have been discussed, I have placed this on https://github.com/sfinktah/bash/blob/master/rawurlencode.inc.sh for anybody to issue a PR against.
Note: This solution is not intended to encode unicode or multi-byte characters - which are quite outside BASH's humble native capabilities. It's only intended to encode symbols that would otherwise ruin argument passing in POST or GET requests, e.g. '&', '=' and so forth.
Very Important Note: DO NOT ATTEMPT TO WRITE YOUR OWN UNICODE CONVERSION FUNCTION, IN ANY LANGUAGE, EVER. See end of answer.
rawurlencode() {
local string="${1}"
local strlen=${#string}
local encoded=""
local pos c o
for (( pos=0 ; pos<strlen ; pos++ )); do
c=${string:$pos:1}
case "$c" in
[-_.~a-zA-Z0-9] ) o="${c}" ;;
* ) printf -v o '%%%02x' "'$c"
esac
encoded+="${o}"
done
echo "${encoded}" # You can either set a return variable (FASTER)
REPLY="${encoded}" #+or echo the result (EASIER)... or both... :p
}
You can use it in two ways:
easier: echo http://url/q?=$( rawurlencode "$args" )
faster: rawurlencode "$args"; echo http://url/q?${REPLY}
[edited]
Here's the matching rawurldecode() function, which - with all modesty - is awesome.
# Returns a string in which the sequences with percent (%) signs followed by
# two hex digits have been replaced with literal characters.
rawurldecode() {
# This is perhaps a risky gambit, but since all escape characters must be
# encoded, we can replace %NN with \xNN and pass the lot to printf -b, which
# will decode hex for us
printf -v REPLY '%b' "${1//%/\\x}" # You can either set a return variable (FASTER)
echo "${REPLY}" #+or echo the result (EASIER)... or both... :p
}
With the matching set, we can now perform some simple tests:
$ diff rawurlencode.inc.sh \
<( rawurldecode "$( rawurlencode "$( cat rawurlencode.inc.sh )" )" ) \
&& echo Matched
Output: Matched
And if you really really feel that you need an external tool (well, it will go a lot faster, and might do binary files and such...) I found this on my OpenWRT router...
replace_value=$(echo $replace_value | sed -f /usr/lib/ddns/url_escape.sed)
Where url_escape.sed was a file that contained these rules:
# sed url escaping
s:%:%25:g
s: :%20:g
s:<:%3C:g
s:>:%3E:g
s:#:%23:g
s:{:%7B:g
s:}:%7D:g
s:|:%7C:g
s:\\:%5C:g
s:\^:%5E:g
s:~:%7E:g
s:\[:%5B:g
s:\]:%5D:g
s:`:%60:g
s:;:%3B:g
s:/:%2F:g
s:?:%3F:g
s^:^%3A^g
s:#:%40:g
s:=:%3D:g
s:&:%26:g
s:\$:%24:g
s:\!:%21:g
s:\*:%2A:g
While it is not impossible to write such a script in BASH (probably using xxd and a very lengthy ruleset) capable of handing UTF-8 input, there are faster and more reliable ways. Attempting to decode UTF-8 into UTF-32 is a non-trivial task to do with accuracy, though very easy to do inaccurately such that you think it works until the day it doesn't.
Even the Unicode Consortium removed their sample code after discovering it was no longer 100% compatible with the actual standard.
The Unicode standard is constantly evolving, and has become extremely nuanced. Any implementation you can whip together will not be properly compliant, and if by some extreme effort you managed it, it wouldn't stay compliant.
Use Perl's URI::Escape module and uri_escape function in the second line of your bash script:
...
value="$(perl -MURI::Escape -e 'print uri_escape($ARGV[0]);' "$2")"
...
Edit: Fix quoting problems, as suggested by Chris Johnsen in the comments. Thanks!
One of variants, may be ugly, but simple:
urlencode() {
local data
if [[ $# != 1 ]]; then
echo "Usage: $0 string-to-urlencode"
return 1
fi
data="$(curl -s -o /dev/null -w %{url_effective} --get --data-urlencode "$1" "")"
if [[ $? != 3 ]]; then
echo "Unexpected error" 1>&2
return 2
fi
echo "${data##/?}"
return 0
}
Here is the one-liner version for example (as suggested by Bruno):
date | curl -Gso /dev/null -w %{url_effective} --data-urlencode #- "" | cut -c 3-
# If you experience the trailing %0A, use
date | curl -Gso /dev/null -w %{url_effective} --data-urlencode #- "" | sed -E 's/..(.*).../\1/'
for the sake of completeness, many solutions using sed or awk only translate a special set of characters and are hence quite large by code size and also dont translate other special characters that should be encoded.
a safe way to urlencode would be to just encode every single byte - even those that would've been allowed.
echo -ne 'some random\nbytes' | xxd -plain | tr -d '\n' | sed 's/\(..\)/%\1/g'
xxd is taking care here that the input is handled as bytes and not characters.
edit:
xxd comes with the vim-common package in Debian and I was just on a system where it was not installed and I didnt want to install it. The altornative is to use hexdump from the bsdmainutils package in Debian. According to the following graph, bsdmainutils and vim-common should have an about equal likelihood to be installed:
http://qa.debian.org/popcon-png.php?packages=vim-common%2Cbsdmainutils&show_installed=1&want_legend=1&want_ticks=1
but nevertheless here a version which uses hexdump instead of xxd and allows to avoid the tr call:
echo -ne 'some random\nbytes' | hexdump -v -e '/1 "%02x"' | sed 's/\(..\)/%\1/g'
I find it more readable in python:
encoded_value=$(python3 -c "import urllib.parse; print urllib.parse.quote('''$value''')")
the triple ' ensures that single quotes in value won't hurt. urllib is in the standard library. It work for example for this crazy (real world) url:
"http://www.rai.it/dl/audio/" "1264165523944Ho servito il re d'Inghilterra - Puntata 7
I've found the following snippet useful to stick it into a chain of program calls, where URI::Escape might not be installed:
perl -p -e 's/([^A-Za-z0-9])/sprintf("%%%02X", ord($1))/seg'
(source)
If you wish to run GET request and use pure curl just add --get to #Jacob's solution.
Here is an example:
curl -v --get --data-urlencode "access_token=$(cat .fb_access_token)" https://graph.facebook.com/me/feed
This may be the best one:
after=$(echo -e "$before" | od -An -tx1 | tr ' ' % | xargs printf "%s")
Direct link to awk version : http://www.shelldorado.com/scripts/cmds/urlencode
I used it for years and it works like a charm
:
##########################################################################
# Title : urlencode - encode URL data
# Author : Heiner Steven (heiner.steven#odn.de)
# Date : 2000-03-15
# Requires : awk
# Categories : File Conversion, WWW, CGI
# SCCS-Id. : #(#) urlencode 1.4 06/10/29
##########################################################################
# Description
# Encode data according to
# RFC 1738: "Uniform Resource Locators (URL)" and
# RFC 1866: "Hypertext Markup Language - 2.0" (HTML)
#
# This encoding is used i.e. for the MIME type
# "application/x-www-form-urlencoded"
#
# Notes
# o The default behaviour is not to encode the line endings. This
# may not be what was intended, because the result will be
# multiple lines of output (which cannot be used in an URL or a
# HTTP "POST" request). If the desired output should be one
# line, use the "-l" option.
#
# o The "-l" option assumes, that the end-of-line is denoted by
# the character LF (ASCII 10). This is not true for Windows or
# Mac systems, where the end of a line is denoted by the two
# characters CR LF (ASCII 13 10).
# We use this for symmetry; data processed in the following way:
# cat | urlencode -l | urldecode -l
# should (and will) result in the original data
#
# o Large lines (or binary files) will break many AWK
# implementations. If you get the message
# awk: record `...' too long
# record number xxx
# consider using GNU AWK (gawk).
#
# o urlencode will always terminate it's output with an EOL
# character
#
# Thanks to Stefan Brozinski for pointing out a bug related to non-standard
# locales.
#
# See also
# urldecode
##########################################################################
PN=`basename "$0"` # Program name
VER='1.4'
: ${AWK=awk}
Usage () {
echo >&2 "$PN - encode URL data, $VER
usage: $PN [-l] [file ...]
-l: encode line endings (result will be one line of output)
The default is to encode each input line on its own."
exit 1
}
Msg () {
for MsgLine
do echo "$PN: $MsgLine" >&2
done
}
Fatal () { Msg "$#"; exit 1; }
set -- `getopt hl "$#" 2>/dev/null` || Usage
[ $# -lt 1 ] && Usage # "getopt" detected an error
EncodeEOL=no
while [ $# -gt 0 ]
do
case "$1" in
-l) EncodeEOL=yes;;
--) shift; break;;
-h) Usage;;
-*) Usage;;
*) break;; # First file name
esac
shift
done
LANG=C export LANG
$AWK '
BEGIN {
# We assume an awk implementation that is just plain dumb.
# We will convert an character to its ASCII value with the
# table ord[], and produce two-digit hexadecimal output
# without the printf("%02X") feature.
EOL = "%0A" # "end of line" string (encoded)
split ("1 2 3 4 5 6 7 8 9 A B C D E F", hextab, " ")
hextab [0] = 0
for ( i=1; i<=255; ++i ) ord [ sprintf ("%c", i) "" ] = i + 0
if ("'"$EncodeEOL"'" == "yes") EncodeEOL = 1; else EncodeEOL = 0
}
{
encoded = ""
for ( i=1; i<=length ($0); ++i ) {
c = substr ($0, i, 1)
if ( c ~ /[a-zA-Z0-9.-]/ ) {
encoded = encoded c # safe character
} else if ( c == " " ) {
encoded = encoded "+" # special handling
} else {
# unsafe character, encode it as a two-digit hex-number
lo = ord [c] % 16
hi = int (ord [c] / 16);
encoded = encoded "%" hextab [hi] hextab [lo]
}
}
if ( EncodeEOL ) {
printf ("%s", encoded EOL)
} else {
print encoded
}
}
END {
#if ( EncodeEOL ) print ""
}
' "$#"
Here's a Bash solution which doesn't invoke any external programs:
uriencode() {
s="${1//'%'/%25}"
s="${s//' '/%20}"
s="${s//'"'/%22}"
s="${s//'#'/%23}"
s="${s//'$'/%24}"
s="${s//'&'/%26}"
s="${s//'+'/%2B}"
s="${s//','/%2C}"
s="${s//'/'/%2F}"
s="${s//':'/%3A}"
s="${s//';'/%3B}"
s="${s//'='/%3D}"
s="${s//'?'/%3F}"
s="${s//'#'/%40}"
s="${s//'['/%5B}"
s="${s//']'/%5D}"
printf %s "$s"
}
url=$(echo "$1" | sed -e 's/%/%25/g' -e 's/ /%20/g' -e 's/!/%21/g' -e 's/"/%22/g' -e 's/#/%23/g' -e 's/\$/%24/g' -e 's/\&/%26/g' -e 's/'\''/%27/g' -e 's/(/%28/g' -e 's/)/%29/g' -e 's/\*/%2a/g' -e 's/+/%2b/g' -e 's/,/%2c/g' -e 's/-/%2d/g' -e 's/\./%2e/g' -e 's/\//%2f/g' -e 's/:/%3a/g' -e 's/;/%3b/g' -e 's//%3e/g' -e 's/?/%3f/g' -e 's/#/%40/g' -e 's/\[/%5b/g' -e 's/\\/%5c/g' -e 's/\]/%5d/g' -e 's/\^/%5e/g' -e 's/_/%5f/g' -e 's/`/%60/g' -e 's/{/%7b/g' -e 's/|/%7c/g' -e 's/}/%7d/g' -e 's/~/%7e/g')
this will encode the string inside of $1 and output it in $url. although you don't have to put it in a var if you want. BTW didn't include the sed for tab thought it would turn it into spaces
Using php from a shell script:
value="http://www.google.com"
encoded=$(php -r "echo rawurlencode('$value');")
# encoded = "http%3A%2F%2Fwww.google.com"
echo $(php -r "echo rawurldecode('$encoded');")
# returns: "http://www.google.com"
http://www.php.net/manual/en/function.rawurlencode.php
http://www.php.net/manual/en/function.rawurldecode.php
If you don't want to depend on Perl you can also use sed. It's a bit messy, as each character has to be escaped individually. Make a file with the following contents and call it urlencode.sed
s/%/%25/g
s/ /%20/g
s/ /%09/g
s/!/%21/g
s/"/%22/g
s/#/%23/g
s/\$/%24/g
s/\&/%26/g
s/'\''/%27/g
s/(/%28/g
s/)/%29/g
s/\*/%2a/g
s/+/%2b/g
s/,/%2c/g
s/-/%2d/g
s/\./%2e/g
s/\//%2f/g
s/:/%3a/g
s/;/%3b/g
s//%3e/g
s/?/%3f/g
s/#/%40/g
s/\[/%5b/g
s/\\/%5c/g
s/\]/%5d/g
s/\^/%5e/g
s/_/%5f/g
s/`/%60/g
s/{/%7b/g
s/|/%7c/g
s/}/%7d/g
s/~/%7e/g
s/ /%09/g
To use it do the following.
STR1=$(echo "https://www.example.com/change&$ ^this to?%checkthe#-functionality" | cut -d\? -f1)
STR2=$(echo "https://www.example.com/change&$ ^this to?%checkthe#-functionality" | cut -d\? -f2)
OUT2=$(echo "$STR2" | sed -f urlencode.sed)
echo "$STR1?$OUT2"
This will split the string into a part that needs encoding, and the part that is fine, encode the part that needs it, then stitches back together.
You can put that into a sh script for convenience, maybe have it take a parameter to encode, put it on your path and then you can just call:
urlencode https://www.exxample.com?isThisFun=HellNo
source
You can emulate javascript's encodeURIComponent in perl. Here's the command:
perl -pe 's/([^a-zA-Z0-9_.!~*()'\''-])/sprintf("%%%02X", ord($1))/ge'
You could set this as a bash alias in .bash_profile:
alias encodeURIComponent='perl -pe '\''s/([^a-zA-Z0-9_.!~*()'\''\'\'''\''-])/sprintf("%%%02X",ord($1))/ge'\'
Now you can pipe into encodeURIComponent:
$ echo -n 'hèllo wôrld!' | encodeURIComponent
h%C3%A8llo%20w%C3%B4rld!
Python 3 based on #sandro's good answer from 2010:
echo "Test & /me" | python -c "import urllib.parse;print (urllib.parse.quote(input()))"
Test%20%26%20/me
This nodejs-based answer will use encodeURIComponent on stdin:
uriencode_stdin() {
node -p 'encodeURIComponent(require("fs").readFileSync(0))'
}
echo -n $'hello\nwörld' | uriencode_stdin
hello%0Aw%C3%B6rld
For those of you looking for a solution that doesn't need perl, here is one that only needs hexdump and awk:
url_encode() {
[ $# -lt 1 ] && { return; }
encodedurl="$1";
# make sure hexdump exists, if not, just give back the url
[ ! -x "/usr/bin/hexdump" ] && { return; }
encodedurl=`
echo $encodedurl | hexdump -v -e '1/1 "%02x\t"' -e '1/1 "%_c\n"' |
LANG=C awk '
$1 == "20" { printf("%s", "+"); next } # space becomes plus
$1 ~ /0[adAD]/ { next } # strip newlines
$2 ~ /^[a-zA-Z0-9.*()\/-]$/ { printf("%s", $2); next } # pass through what we can
{ printf("%%%s", $1) } # take hex value of everything else
'`
}
Stitched together from a couple of places across the net and some local trial and error. It works great!
uni2ascii is very handy:
$ echo -ne '你好世界' | uni2ascii -aJ
%E4%BD%A0%E5%A5%BD%E4%B8%96%E7%95%8C
Simple PHP option:
echo 'part-that-needs-encoding' | php -R 'echo urlencode($argn);'
What would parse URLs better than javascript?
node -p "encodeURIComponent('$url')"
Here is a POSIX function to do that:
url_encode() {
awk 'BEGIN {
for (n = 0; n < 125; n++) {
m[sprintf("%c", n)] = n
}
n = 1
while (1) {
s = substr(ARGV[1], n, 1)
if (s == "") {
break
}
t = s ~ /[[:alnum:]_.!~*\47()-]/ ? t s : t sprintf("%%%02X", m[s])
n++
}
print t
}' "$1"
}
Example:
value=$(url_encode "$2")
The question is about doing this in bash and there's no need for python or perl as there is in fact a single command that does exactly what you want - "urlencode".
value=$(urlencode "${2}")
This is also much better, as the above perl answer, for example, doesn't encode all characters correctly. Try it with the long dash you get from Word and you get the wrong encoding.
Note, you need "gridsite-clients" installed to provide this command:
sudo apt install gridsite-clients
Here's the node version:
uriencode() {
node -p "encodeURIComponent('${1//\'/\\\'}')"
}
Another php approach:
echo "encode me" | php -r "echo urlencode(file_get_contents('php://stdin'));"
Here is my version for busybox ash shell for an embedded system, I originally adopted Orwellophile's variant:
urlencode()
{
local S="${1}"
local encoded=""
local ch
local o
for i in $(seq 0 $((${#S} - 1)) )
do
ch=${S:$i:1}
case "${ch}" in
[-_.~a-zA-Z0-9])
o="${ch}"
;;
*)
o=$(printf '%%%02x' "'$ch")
;;
esac
encoded="${encoded}${o}"
done
echo ${encoded}
}
urldecode()
{
# urldecode <string>
local url_encoded="${1//+/ }"
printf '%b' "${url_encoded//%/\\x}"
}
Ruby, for completeness
value="$(ruby -r cgi -e 'puts CGI.escape(ARGV[0])' "$2")"
Here's a one-line conversion using Lua, similar to blueyed's answer except with all the RFC 3986 Unreserved Characters left unencoded (like this answer):
url=$(echo 'print((arg[1]:gsub("([^%w%-%.%_%~])",function(c)return("%%%02X"):format(c:byte())end)))' | lua - "$1")
Additionally, you may need to ensure that newlines in your string are converted from LF to CRLF, in which case you can insert a gsub("\r?\n", "\r\n") in the chain before the percent-encoding.
Here's a variant that, in the non-standard style of application/x-www-form-urlencoded, does that newline normalization, as well as encoding spaces as '+' instead of '%20' (which could probably be added to the Perl snippet using a similar technique).
url=$(echo 'print((arg[1]:gsub("\r?\n", "\r\n"):gsub("([^%w%-%.%_%~ ]))",function(c)return("%%%02X"):format(c:byte())end):gsub(" ","+"))' | lua - "$1")
In this case, I needed to URL encode the hostname. Don't ask why. Being a minimalist, and a Perl fan, here's what I came up with.
url_encode()
{
echo -n "$1" | perl -pe 's/[^a-zA-Z0-9\/_.~-]/sprintf "%%%02x", ord($&)/ge'
}
Works perfectly for me.

Resources