Replacing HTML ASCII codes via a bash script?

I need a way to replace HTML ASCII codes like &#33; with their correct character (here, !) in bash.
Is there a utility I could run my output through to do this, or something along those lines?

$ echo '&#33;' | recode html/..
!
$ echo '&lt;&infin;&gt;' | recode html/..
<∞>
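If the input is in a file rather than on standard input, recode also works as a plain filter, so a redirection is enough (the file names here are just placeholders):

$ recode html/.. < input.html > output.txt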

I don't know of an easy way; here is what I suppose I would do...
You might be able to script a browser into reading the file and then saving it as text. If lynx supports HTML character entities, then it might be worth looking into. If that doesn't work out...
The general solution to something like this is done with sed. You need a "higher order" edit for this, as you would first start with an entity table and then you would edit that table into an edit script itself with a multiple-step procedure. Something like:
. . .
s/&Dagger;/‡/g
s/&#8221;/”/g
. . .
Then, encapsulate this as HTML, read it into a browser, and save it as text in the character set you are targeting. If you get it to produce lines like:
s/&lt;/</g
then you win. A bash script that calls sed or ex can be driven by the substitute commands in the file.
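Once you have such a file of substitutions (call it entities.sed; the name is just for illustration), driving sed with it is the easy part:

$ sed -f entities.sed input.html > output.txt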

Here is my solution with the standard Linux toolbox.
$ foo="This is a line feed
And e acute:é with a grinning face 😀."
$ echo "$foo"
This is a line feed
And e acute:é with a grinning face 😀.
$ eval "$(printf '%s' "$foo" | sed 's/^/printf "/;s/&#0*\([0-9]*\);/\$( [ \1 -lt 128 ] \&\& printf "\\\\$( printf \"%.3o\\201\" \1)" || \$(which printf) \\\\U\$( printf \"%.8x\" \1) )/g;s/$/\\n"/')" | sed "s/$(printf '\201')//g"
This is a line feed
And e acute:é with a grinning face 😀.
You can see that it works for all kinds of escapes: even a line feed, e acute (é), which is a 2-byte UTF-8 character, and even the newer emoticons, which sit outside the Basic Multilingual Plane (4 bytes in UTF-8).
This command also works with dash, a trimmed-down shell (the default /bin/sh on Ubuntu), and is compatible with bash and with shells like the ash used on Synology devices.
If you don't mind sticking with bash and dropping that compatibility, you can make it much simpler.
The bits used should be on any decent Linux box (or OS X?):
- which
- printf (GNU and builtin)
- GNU sed
- eval (shell builtin)
The bash-only version doesn't need which or the GNU printf; a rough sketch of what it could look like follows.
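For illustration, here is one possible shape of that bash-only version. This is a sketch of my own, not the author's code: it assumes bash >= 4.2 (whose printf understands \U escapes) and only handles decimal references such as &#233;, not named entities.

#!/usr/bin/env bash
# Sketch: decode decimal numeric character references (&#NNN;) in pure bash.
decode_entities() {
    local re='&#0*([0-9]+);' line out char
    while IFS= read -r line; do
        out=
        while [[ $line =~ $re ]]; do
            out+=${line%%"${BASH_REMATCH[0]}"*}    # text before the first entity
            printf -v char "\\U$(printf '%08x' "${BASH_REMATCH[1]}")"
            out+=$char                             # the decoded character
            line=${line#*"${BASH_REMATCH[0]}"}     # the rest of the line
        done
        printf '%s\n' "$out$line"
    done
}

printf '%s\n' 'e acute:&#233; and a grinning face:&#128512;' | decode_entities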

Related

How do I ignore a byte order marker from a while read loop in zsh

I need to verify that all images mentioned in a CSV are present inside a folder. I wrote a small shell script for that:
#!/bin/zsh
red='\033[0;31m'
color_Off='\033[0m'
csvfile=$1
imgpath=$2
cat $csvfile | while IFS=, read -r filename rurl
do
    if [ -f "${imgpath}/${filename}" ]
    then
        echo -n
    else
        echo -e "$filename ${red}MISSING${color_Off}"
    fi
done
My CSV looks something like
Image1.jpg,detail-1
Image2.jpg,detail-1
Image3.jpg,detail-1
The CSV was created by Excel.
Now all 3 images are present in imgpath, but for some reason my output says
Image1.jpg MISSING
Upon using zsh -x to run the script, I found that my CSV file has a BOM at the very beginning, making the image name \ufeffImage1.jpg, which is causing the whole issue.
How can I ignore a BOM (byte-order marker) in a while read operation?
zsh provides a parameter expansion (also available in POSIX shells) to remove a prefix: ${var#prefix} will expand to $var with prefix removed from the front of the string.
zsh also, like ksh93 and bash, supports ANSI C-like string syntax: $'\ufeff' refers to the Unicode sequence for a BOM.
Combining these, ${filename#$'\ufeff'} gives the content of $filename with the Unicode BOM sequence removed if it's present at the front.
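For example (a throwaway illustration, not part of the script below):

f=$'\ufeffImage1.jpg'
printf '%s\n' "${f#$'\ufeff'}"    # prints: Image1.jpg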
The below also makes some changes for better performance, more reliable behavior with odd filenames, and compatibility with non-zsh shells.
#!/bin/zsh
red='\033[0;31m'
color_Off='\033[0m'
csvfile=$1
imgpath=$2
while IFS=, read -r filename rurl; do
    filename=${filename#$'\ufeff'}
    if ! [ -f "${imgpath}/${filename}" ]; then
        printf '%s %bMISSING%b\n' "$filename" "$red" "$color_Off"
    fi
done <"$csvfile"
Notes on changes unrelated to the specific fix:
Replacing echo -e with printf lets us pick which specific variables get escape sequences expanded: %s for filenames means backslashes and other escapes in them are unmodified, whereas %b for $red and $color_Off ensures that we do process highlighting for them.
Replacing cat $csvfile | with < "$csvfile" avoids the overhead of starting up a separate cat process, and ensures that your while read loop is run in the same shell as the rest of your script rather than a subshell (which may or may not be an issue for zsh, but is a problem with bash when run without the non-default lastpipe flag).
echo -n isn't reliable as a noop: some shells print -n as output, and the POSIX echo standard, by marking behavior when -n is present as undefined, permits this. If you need a noop, : or true is a better choice; but in this case we can just invert the test and move the else path into the truth path.

insert text allocated in a variable before the first empty line

I have some text files $f resembling the following
function
%blah
%blah

%blah
code here
I want to insert the following text before the first empty line:
%
%This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike
%3.0 Unported License. See notes at the end of this file for more information.
I tried the following:
top=$(cat ./PATH/text.txt)
top="${top//$'\n'/\\n}"
sed -i.bak 's#^$#'"$top"'\\n#' $f
where the second line (I think) preserves the new line in the text and the third line (I think) substitutes the first empty line with the text plus a new empty line.
Two problems:
1- My code appends the following text:
%n%This work is licensed under the Creative Commons
Attribution-NonCommercial-ShareAlike n%3.0 Unported License. See notes
at the end of this file for more information.\n
2- It appends it at the end of the file.
Can someone please help me understand the problems with my code?
If you are using GNU sed, the following will work.
Use ^$ to find the first empty line and then have sed replace it with the text that you want.
# Define your replacement text in a variable
a="%\n%This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike\n%3.0 Unported License. See notes at the end of this file for more information."
Note that $a should include those \n sequences, which sed will interpret directly as newlines in the replacement.
$ sed "0,/^$/s//$a/" inputfile.txt
In the above syntax, the GNU-specific range 0,/^$/ ends at the very first line that matches /^$/ (even if that is line 1), and the empty s// reuses the same regex, so the substitution is applied only to the first empty line.
Output:
function
%blah
%blah
%
%This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike
%3.0 Unported License. See notes at the end of this file for more information.
%blah
code here
You've included bash and sed tags in your question. Since I can't seem to come up with a way of doing this in sed, here's a bash-only solution. It's likely to perform the worst of all working solutions you might find.
The following works with your sample input:
$ while read -r x; do [[ -z "$x" ]] && cat boilerplate; printf '%s\n' "$x"; done < src
This will however insert the boilerplate before EVERY blank line, which is probably not what you're after. Instead, we should probably make this more than a one-liner:
#!/usr/bin/env bash
y=true
while read -r x; do
    if [[ -z "$x" ]] && $y; then
        cat boilerplate
        y=false
    fi
    printf '%s\n' "$x"
done < src
Note that unlike the code in your question, this doesn't store your boilerplate in a variable, it just cats it "at the right time".
Note that this sends the combined output to stdout. If your goal is to modify the original file, you'll need to wrap this in something that moves around temporary files. (Note that sed's -i option also doesn't really edit files in place, it only hides the moving-around-temp-files from you.)
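If you do want to modify src itself, one way to do that wrapping (my sketch, not part of the answer above; like sed -i, it replaces the file via a temporary copy rather than truly editing it in place) is:

y=true
tmp=$(mktemp) || exit
while read -r x; do
    if [[ -z "$x" ]] && $y; then
        cat boilerplate
        y=false
    fi
    printf '%s\n' "$x"
done < src > "$tmp" && mv "$tmp" src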
The following alternatives are probably a better idea.
A similar solution to the bash one might be achieved with better performance using awk:
awk 'NR==FNR{b=b $0 ORS;next} /^$/&&!y{printf "%s",b;y++} 1' boilerplate src
This awk solution obviously reads your boilerplate into a variable, though it's not a shell variable.
Notwithstanding non-standard platform-specific extensions, awk does not have any facility for editing files "in place" either. A portable solution using awk would still need to push temp files around.
And of course, the old standby ed is great to keep in your back pocket:
printf 'H\n/^$/\n-\n.r boilerplate\nw\nq\n' | ed src
In bash, of course, you could always use a here-string, which might be clearer:
$ ed src <<< $'H\n/^$/\n-\n.r boilerplate\nw\nq\n'
The ed command is the non-stream version of sed. Or rather, sed is the stream version of ed, which has been around since before the dinosaurs and is still going strong.
The commands we're using are separated by newlines and fed to ed's standard input. You can discard stdout if you feel the urge. The commands shown here are:
H - instruct ed to print more useful errors, if it gets any.
/^$/ - search for the first empty line.
- - GO BACK ONE LINE. Awesome, right?
.r boilerplate - Read your boilerplate at the current line,
w - and write the file.
q - Quit.
Note that this does not keep a .bak file. You'll need to do that yourself if you really want one.
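If you do want one, taking a copy first is enough (a sketch):

cp src src.bak && printf 'H\n/^$/\n-\n.r boilerplate\nw\nq\n' | ed src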
And if, as you suggested in comments, the filename you're reading is to be constructed from a variable, note that variable expansion does not happen inside format quoting ($' .. '). You can either switch quoting mechanisms mid-script:
ed "$file" <<< $'H\n/^$/\n-\n.r ./TATTOO_'"$currn"$'/top.txt\nw\nq\n'
Or you could put the ed script in a variable constructed by printf:
printf -v scr 'H\n/^$/\n-\n.r ./TATTOO_%s/top.txt\nw\nq\n' "$currn"
ed "$file" <<< "$scr"`
Adding the text to a variable so you can interpolate the variable is wasteful and an unnecessary complication. sed can easily read the contents of a file by itself.
sed -i.bak '1r./PATH/text.txt' "$f"
Unfortunately, this part of sed is poorly standardized, so you may have to experiment a little bit. Some dialects require a newline (perhaps, or perhaps not, preceded by a backslash) before the filename.
sed -i.bak '1r\
./PATH/text.txt' "$f"
(Notice also the double quotes around the file name. You generally always want double quotes around variables which contain file names.)
Adapting the common recipe for replacing only the first occurrence of a match, we can extend this to apply to the first empty line instead of the first line.
sed -i.bak -e '/^$/!b' -e 'r./PATH/text.txt' -e :a -e '$!{' -e n -e ba -e } "$f"
This adds the boilerplate after the first empty line, but perhaps that's acceptable. Refactoring it to replace the empty line, or to add an empty line after the inserted text, should not be too challenging anyway. (Maybe use sed -n and instead explicitly print everything except the empty line.)
In brief, this skips to the end of the script (which simply prints the line) until we find the first empty line. Then we read and print the boilerplate file, and go into a loop which prints the remainder of the input without returning to the beginning of the script.
A sed approach that I think works. The extra bit to be inserted is kept in the variable $b.
b='##\n## comment piece\n##'
sed --posix -ne '
1,/^$/ {
    /^$/ {
        x;
        /^true$/ !{
            x
            s/^$/true/
            i\
'"$b"'
        };
        x;
        s/^.*$//
    }
}
p
' file1
With the examples using ranges of 1,/^$/, an empty first line would result in the disclaimer being printed twice. To avoid this, I've set it up to put a flag in the hold space (x; s/^$/true/) that I can swap into the pattern space to check whether it's the first blank. Once there's a match for a blank line, i\ inserts the comment ($b) in front of the pattern space.
Thanks to ghoti for the initial plan.

Shell script, escape newlines but emit others?

Given a filename, I want to write a shell-script which emits the following, and pipes it into a process:
Content-Length:<LEN><CR><LF>
<CR><LF>
{ "jsonrpc":"2.0", "params":{ "text":"<ESCAPED-TEXT>" } }
where <ESCAPED-TEXT> is the content of the file but with its CRs, LFs and quotation marks escaped as \r, \n and \" (and I guess all other JSON escapes will eventually be needed as well), and where <LEN> is the length of the final JSON line that includes the escaped text.
Here's my current bash-script solution. It works but is ugly as heck.
(
    TXT=`cat ~/a.py | sed -E -e :a -e '$!N; s/\n/\\\n/g; ta' | sed 's/"/\\\"/g'`
    CMD='{"jsonrpc":"2.0", "params":{ "text":{"'${TXT}'"}} }'
    printf "Content-Length: ${#CMD}\r\n\r\n"
    echo -n "${CMD}"
) | pyls
Can anyone suggest how to do this cleaner, please?
This sed script only replaces LFs, not CRs. It accumulates each line into the buffer and then does a s//g to replace all LFs in it. I couldn't figure out anything cleaner that still worked on both Linux and OSX/BSD.
I used both printf and echo. First printf because I do want to emit the CRLFCRLF after the Content-Length header, and you apparently need printf for that because the behavior of echo with escapes isn't uniform across platforms. Next echo because I don't want the \r and \n literals inside TXT to be unescaped, which printf would do.
Context: there's a standard called "Language Server Protocol". Basically you run something like the pyls I'm running here, and you pipe in JsonRPC to it over stdin, and it pipes back stuff. Different people have written language servers for Python (the pyls I'm using here), and C#, and C++, and Typescript, and PHP, and OCaml, and Go, and Java, and each person tends to write their language server in their own language.
I want to write a test-harness which can send some example JsonRPC packets into any such server.
I figured it'd be better to write my test-harness in just the common basic shell-scripting stuff that's available on all platforms out of the box. That way everyone can use my test-harness against their language server. (If I wrote it on Python instead, say, it'd be easier for me to write, but it would force the C# folks to learn+install python just to run it, and likewise the Typescript, PHP, OCaml, Go and other folks.)
a.py:
print("alfa")
print("bravo")
Awk script:
{
    gsub("\r", "\\r")
    gsub("\42", "\\\42")
    z = z $0 "\\n"
}
END {
    printf "Content-Length: %d\r\n", length(z) + 42
    printf "\r\n"
    printf "{\42jsonrpc\42: \0422.0\42, \42params\42: {\42text\42: \42%s\42}}", z
}
Result:
Content-Length: 81

{"jsonrpc": "2.0", "params": {"text": "print(\"alfa\")\r\nprint(\"bravo\")\r\n"}}
Can anyone suggest how to do this cleaner, please?
I guess all other JSON escapes will eventually be needed as well
If I already had Python at my disposal, I'd try really, really hard to use the standard Python JSON encoder, at least for the string escaping part. Why hack together something that kind of works when you can use something known to work that you already are halfway familiar with?
If I didn't have Python, I'd go with Steve Penny's solution. Rules of thumb:
to process sets of files, use the shell
to process data in a file, use awk
if sed can't do it trivially, see rule #2
If you know a little awk, his solution is easy to understand almost at a glance. I would call that "cleaner". If you don't know awk, this would seem to be an excellent opportunity to become acquainted.
I think the main problem with your script is not using format strings with printf. The usual way that printf is used is with various special characters in the format string (like %s, %b, etc) and a list of additional arguments that are substituted into the format string.
That is, when you say "[I used] echo because I don't want the \r and \n literals to be unescaped, which printf would do", the problem is just not using printf "%s" "$string".
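A tiny illustration of that difference (my example, not from the original answer):

s='ends with \n'
printf '%s\n' "$s"    # prints the two characters \ and n literally
printf '%b\n' "$s"    # %b turns the \n into a real newline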
Anyway, here's an idea of how to use this stuff to get everything done in bash with no external tools:
escapes=('\n' '\r' '\"') # the escapes we want to put into the output
txt=$(< ~/a.py); # read the file into a variable
for esc in "${escapes[@]}"; do
    # escapes are evaluated in a %b string w/ printf
    # using -v puts the result into a variable
    printf -v lit '%b' "$esc"
    # use built-in ${string//pattern/replacement} expansion
    txt=${txt//$lit/$esc}
done
txt='{"jsonrpc":"2.0", "params":{ "text":{"'$txt'"}} }'
# escapes in the format string are expanded
# but escapes in the argument substituted for %s are not
printf 'Content-Length: %s\r\n\r\n%s' "${#txt}" "$txt"

How do I use `sed` to alter a variable in a bash script?

I'm trying to use enscript to print PDFs from Mutt, and hitting character encoding issues. One way around them seems to be to just use sed to replace the problem characters: sed -ir 's/[“”]/"/g' {input}
My test input file is this:
“very dirty”
we’re
I'm hoping to get "very dirty" and we're but instead I'm still getting
â\200\234very dirtyâ\200\235
weâ\200\231re
I found a nice little post on printing to PDFs from Mutt that I used as a starting point. I have a bash script that I point to from my .muttrc with set print_command="$HOME/.mutt/print.sh" -- the script currently reads about like this:
#!/bin/bash
input="$1" pdir="$HOME/Desktop" open_pdf=evince
# Straighten out curly quotes
sed -ir 's/[“”]/"/g' $input
sed -ir "s/[’]/'/g" $input
tmpfile="`mktemp $pdir/mutt_XXXXXXXX.pdf`"
enscript --font=Courier8 $input -2r --word-wrap --fancy-header=mutt -p - 2>/dev/null | ps2pdf - $tmpfile
$open_pdf $tmpfile >/dev/null 2>&1 &
sleep 1
rm $tmpfile
It does a fine job of creating a PDF (and works fine if you give it a file as an argument) but I can't figure out how to fix the curly quotes.
I've tried a bunch of variations on the sed line:
input=sed -r 's/[“”]/"/g' $input
$input=sed -ir "s/[’]/'/g" $input
Per the suggestion at Can I use sed to manipulate a variable in bash? I also tried input=$(sed -r 's/[“”]/"/g' <<< $input) and I get an error: "Syntax error: redirection unexpected"
But none manages to actually change $input -- what is the correct syntax to change $input with sed?
Note: I accepted an answer that resolved the question I asked, but as you can see from the comments there are a couple of other issues here. enscript is taking in a whole file as a variable, not just the text of the file. So trying to tweak the text inside the file is going to take a few extra steps. I'm still learning.
On Editing Variables In General
BashFAQ #21 is a comprehensive reference on performing search-and-replace operations in bash, including within variables, and is thus recommended reading. On this particular case:
Use the shell's native string manipulation instead; this is far higher performance than forking off a subshell, launching an external process inside it, and reading that external process's output. BashFAQ #100 covers this topic in detail, and is well worth reading.
Depending on your version of bash and configured locale, it might be possible to use a bracket expression (i.e. [“”], as your original code did). However, the most portable thing is to treat “ and ” separately, which will work even without multi-byte character support.
input='“hello ’cruel’ world”'
input=${input//'“'/'"'}
input=${input//'”'/'"'}
input=${input//'’'/"'"}
printf '%s\n' "$input"
...correctly outputs:
"hello 'cruel' world"
On Using sed
To provide a literal answer -- you almost had a working sed-based approach in your question.
input=$(sed -r 's/[“”]/"/g' <<<"$input")
...adds the missing syntactic double quotes around the parameter expansion of $input, ensuring that it's treated as a single token regardless of how it might be string-split or glob-expanded.
But All That May Not Help...
The below is mentioned because your test script is manipulating content passed on the command line; if that's not the case in production, you can probably disregard the below.
If your script is invoked as ./yourscript “hello * ’cruel’ * world”, then information about exactly what the user entered is lost before the script is started, and nothing you can do here will fix that.
This is because $1, in that scenario, will only contain “hello; ’cruel’ and world” are in their own argv locations, and the *s will have been replaced with lists of files in the current directory (each such file substituted as a separate argument) before the script was even started. Because the shell responsible for parsing the user's command line (which is not the same shell running your script!) did not recognize the quotes as valid at the time when it ran this parsing, by the time the script is running, there's nothing you can do to recover the original data.
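You can see the splitting half of this with a quick experiment (glob expansion left out for brevity; this is only an illustration):

$ set -- “hello ’cruel’ world”    # simulate the script's positional parameters
$ printf '<%s>\n' "$@"
<“hello>
<’cruel’>
<world”>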
Abstract: The way to use sed to change a variable is explored, but what you really need is a way to use and edit a file. It is covered ahead.
Sed
The (two) sed lines could be handled with this (note that -i is not used: we are working on a value, not a file):
input='“very dirty”
we’re'
sed 's/[“”]/\"/g;s/’/'\''/g' <<<"$input"
But it should be faster (for small strings) to use the internals of the shell:
input='“very dirty”
we’re'
input=${input//[“”]/\"}
input=${input//[’]/\'}
printf '%s\n' "$input"
$1
But there is an underlying problem with your script: you are trying to clean an input received from the command line, using $1 as the source of the string. Once somebody writes:
./script “very dirty”
we’re
That input is lost. It is broken into the shell's tokens, and "$1" will be “very only.
But I do not believe that is what you really have.
file
However, you are also saying that the input comes from a file. If that is the case, then read it in with:
input="$(<infile)" # not $1
sed 's/[“”]/\"/g;s/’/'\''/g' <<<"$input"
Or, if you don't mind editing (changing) the file, do this instead:
sed -i 's/[“”]/\"/g;s/’/'\''/g' infile
input="$(<infile)"
Or, if you are clear and certain that what is being given to the script is a filename, like:
./script infile
You can use:
infile="$1"
sed -i 's/[“”]/\"/g;s/’/'\''/g' "$infile"
input="$(<"$infile")"
Other comments:
Quote your variables.
Do not use the very old `…` syntax, use $(…) instead.
Do not use variables in UPPER case; those are reserved for environment variables.
And (unless you actually meant sh) use a shebang (first line) that targets bash.
The command enscript most definitely requires a file, not a variable.
Maybe you should use evince to open the PS file directly; there is no need for the step that makes a PDF, unless you know you really need it.
I believe it is better to use a file to store the output of enscript and ps2pdf.
Do not hide the errors printed by the commands until everything is working as desired; then, just call the script as:
./script infile 2>/dev/null
Or as required to make it less verbose.
Final script.
If you call the script with the name of the file that enscript is going to use, something like:
./script infile
Then, the whole script will look like this (runs both in bash or sh):
#!/usr/bin/env bash
Usage(){ echo "$0; This script requires a source file"; exit 1; }
[ $# -lt 1 ] && Usage
[ ! -e "$1" ] && Usage
infile="$1"
pdir="$HOME/Desktop"
open_pdf=evince
# Straighten out curly quotes
sed -i 's/[“”]/\"/g;s/’/'\''/g' "$infile"
tmpfile="$(mktemp "$pdir"/mutt_XXXXXXXX.pdf)"
outfile="${tmpfile%.*}.ps"
enscript --font=Courier10 "$infile" -2r \
--word-wrap --fancy-header=mutt -p "$outfile"
ps2pdf "$outfile" "$tmpfile"
"$open_pdf" "$tmpfile" >/dev/null 2>&1 &
sleep 5
rm "$tmpfile" "$outfile"

Reading a very long line from a file without breaking it into multiple lines in a shell script

I have a file which has very long rows of data. When I try to read it using a shell script, the data comes out as multiple lines, i.e. it breaks at certain points.
Example row:
B_18453583||Active|917396140129|405819121107402|Active|7396140129||7396140129|||||||||18-MAY-10|||||18-MAY-10|405819121107402|Outgoing International Calls,Outgoing Calls,WAP,Call Waiting,MMS,Data Service,National Roaming-Voice,Outgoing International Calls except home country,Conference Call,STD,Call Forwarding-Barr,CLIP,Incoming Calls,INTSNS,WAPSNS,International Roaming-Voice,ISD,Incoming Calls When Roaming Internationally,INTERNET||For You Plan||||||||||||||||||
All this is the content of a single line.
I use a normal read like this:
var=`cat pranay.psv`
for i in $var; do
    echo $i
done
The output comes as:
B_18453583||Active|917396140129|405819121107402|Active|7396140129||7396140129|||||||||18- MAY-10|||||18-MAY-10|405819121107402|Outgoing
International
Calls,Outgoing
Calls,WAP,Call
Waiting,MMS,Data
Service,National
Roaming-Voice,Outgoing
International
Calls
except
home
country,Conference
Call,STD,Call
Forwarding-Barr,CLIP,Incoming
Calls,INTSNS,WAPSNS,International
Roaming-Voice,ISD,Incoming
Calls
When
Roaming
Internationally,INTERNET||For
You
Plan||||||||||||||||||
How do I print it all on a single line?
Please help.
Thanks
This is because of word splitting (see the small illustration after the notes below). An easier way to do this (which also dispenses with the useless use of cat) is this:
while IFS= read -r -d $'\n' -u 9
do
echo "$REPLY"
done 9< pranay.psv
To explain in detail:
$'...' can be used to create human readable strings with escape sequences. See man bash.
IFS= is necessary to prevent any characters in IFS from being stripped from the start and end of $REPLY.
-r avoids interpreting backslash in text specially.
-d $'\n' splits lines by the newline character.
Use file descriptor 9 for data storage instead of standard input to avoid greedy commands like cat eating all of it.
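To see the word splitting that the unquoted $var triggers in the original loop, here is a small illustration (not part of the answer's code):

$ var='one|two three|four'
$ for i in $var; do echo "$i"; done
one|two
three|four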
You need proper quoting. In your case, you should use the command read:
while read line ; do
echo "$line"
done < pranay.psv
