bash 'fold' screws up encoding in emacs - bash

Reading lines from 'somefile' and writing them to the 'sample.org' file.
echo "$line" 1>>sample.org gives the correct result, which is 'Субъективная оценка (от 1 до 5): 4 - отличный, понятный и богатый вкусом ..' (Russian letters)
echo "$line" | fold -w 160 1>>sample.org gives this, which is technically correct if you copy-paste it anywhere outside Emacs. But still: why does using fold result in my Emacs displaying the 'sample.org' buffer as 'RAW-TEXT' instead of 'UTF-8'?
To reproduce it, create two files in the same directory: test.sh, which will contain
cat 'test.org' |
while read -r line; do
# echo "$line" 1>'newfile.org' # works fine
# line below writes those weird chars to the output file
echo "$line" | fold -w 160 1>'newfile.org'
done
and test.org, which will contain just 'Среднеферментированный среднепрожаренный улун полусферической скрутки. Содержание ГАМК 200мг/100г.'
Run the script with bash test.sh and hopefully you will see the problem in the output file newfile.org.

I can't repro this on MacOS, but in an Ubuntu Docker image, it happens because fold inserts a newline in the middle of a UTF-8 multibyte sequence.
root@ef177a152b15:/# cat test.org
Среднеферментированный среднепрожаренный улун полусферической скрутки. Содержание ГАМК 200мг/100г.
root@ef177a152b15:/# fold -w 160 test.org >newfile.org
root@ef177a152b15:/# cat newfile.org
Среднеферментированный среднепрожаренный улун полусферической скрутки. Содержание Г?
?МК 200мг/100г.
root@ef177a152b15:/# cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04.2 LTS"
(Perhaps also notice that your demo script can be reduced to a one-liner.)
I would have thought that GNU fold is locale-aware, provided you configure a UTF-8 locale for the support to be active; but that changes nothing for me.
root@ef177a152b15:/# locale -a
C
C.UTF-8
POSIX
root@ef177a152b15:/# LC_ALL=C.UTF-8 fold -w 160 test.org
Среднеферментированный среднепрожаренный улун полусферической скрутки. Содержание Г?
?МК 200мг/100г.
Under these circumstances, the best I can offer is to replace fold with a simple replacement.
#!/usr/bin/python3
from sys import argv
maxlen = int(argv.pop(1))
for file in argv[1:]:
    with open(file) as lines:
        for line in lines:
            while len(line) > maxlen:
                print(line[0:maxlen])
                line = line[maxlen:]
            print(line, end='')
For simplicity, this doesn't have any option processing; just pass in the maximum length as the first argument.
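Usage would look like this (assuming the script above is saved as, say, fold.py; the name is arbitrary):
python3 fold.py 160 test.org > newfile.org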
(Python 3 uses UTF-8 throughout on any sane platform. Unfortunately, that excludes Windows; but I am restating the obvious.)
Bash, of course, is entirely innocent here; the shell does not control external utilities like fold. (But not much help, either; echo "${tekst:48:64}" produces similar mojibake.)

I'm not sure where that image comes from. However, fold and coreutils in general, as well as a huge number of other common CLI utilities, can only be safely used with input consisting of characters from the POSIX Portable Character Set, not with multibyte UTF-8, regardless of what sites such as utf8everywhere.org claim. fold suffers from the common problem: it assumes that each character occupies just a single byte, so multibyte UTF-8 input gets corrupted when it splits the lines.

Related

Bash Version of C64 Code Art: 10 PRINT CHR$(205.5+RND(1)); : GOTO 10

I picked up a copy of the book 10 PRINT CHR$(205.5+RND(1)); : GOTO 10
http://www.amazon.com/10-PRINT-CHR-205-5-RND/dp/0262018462
This book discusses the art produced by the single line of Commodore 64 BASIC:
10 PRINT CHR$(205.5+RND(1)); : GOTO 10
This just repeatedly prints a random choice of character 205 or 206 from the PETSCII set to the screen:
http://en.wikipedia.org/wiki/PETSCII
https://vimeo.com/26472518
I'm not sure why the original uses the characters 205 and 206 instead of the identical 109 and 110. Also, I prefer to add a clear at the beginning. This is what I usually type into the C64:
1?CHR$(147)
2?CHR$(109.5+RND(1));:GOTO2
RUN
You can try this all for yourself in an emulator, such as this one using Flash or JavaScript:
http://codeazur.com.br/stuff/fc64_final/
http://www.kingsquare.nl/jsc64
When inputting the above code into the emulators listed, you'll need to realize that
( is *
) is (
+ is ]
I decided it would be amusing to write a bash line to do something similar.
I currently have:
clear; while :; do [ $(($RANDOM%2)) -eq 0 ] && (printf "\\") || (printf "/"); done;
Two questions:
Any suggestions for making this more concise?
Any suggestions for a better output character? The forward and backward slashes are not nearly as beautiful, since their points don't line up. The characters used from PETSCII are special characters, not slashes. I didn't see anything in ASCII that could work as well, but maybe you can suggest a way to pull in a character from UTF-8 or something else?
Best ANSWERS So Far
Shortest for bash (40 characters):
yes 'c=(╱ ╲);printf ${c[RANDOM%2]}'|bash
Here is a short one for zsh (53 characters):
c=(╱ ╲);clear;while :;do printf ${c[RANDOM%2+1]};done
Here is an alias I like to put in my .bashrc or .profile
alias art='c=(╱ ╲);while :;do printf "%s" ${c[RANDOM%2]};done'
Funny comparing this to the shortest I can do for C64 BASIC (23 characters):
1?C_(109.5+R_(1));:G_1
The underscores are shift+H, shift+N, and shift+O respectively. I can't paste the character here since they are specific to PETSCII. Also, the C64 output looks prettier ;)
You can read about the C64 BASIC abbreviations here:
http://www.commodore.ca/manuals/c64_programmers_reference/c64-programmers_reference_guide-02-basic_language_vocabulary.pdf
How about this?
# The characters you want to use
chars=( $'\xe2\x95\xb1' $'\xe2\x95\xb2' )
# Precompute the size of the array chars
nchars=${#chars[@]}
# clear screen
clear
# The loop that prints it:
while :; do
printf -- "${chars[RANDOM%nchars]}"
done
As a one-liner with shorter variable names to make it more concise:
c=($'\xe2\x95\xb1' $'\xe2\x95\xb2'); n=${#c[@]}; clear; while :; do printf -- "${c[RANDOM%n]}"; done
You can get rid of the loop if you know in advance how many characters to print (here 80*24=1920)
c=($'\xe2\x95\xb1' $'\xe2\x95\xb2'); n=${#c[@]}; clear; printf "%s" "${c[RANDOM%n]"{1..1920}"}"
Or, if you want to include the characters directly instead of their code:
c=(╱ ╲); n=${#c[@]}; clear; while :; do printf "${c[RANDOM%n]}"; done
Finally, with the size of the array c precomputed and removing unnecessary spaces and quotes (and I can't get shorter than this):
c=(╱‬ ╲);clear;while :;do printf ${c[RANDOM%2]};done
Number of bytes used for this line:
$ wc -c <<< 'c=(╱‬ ╲);clear;while :;do printf ${c[RANDOM%2]};done'
59
Edit. A funny way using the command yes:
clear;yes 'c=(╱ ╲);printf ${c[RANDOM%2]}'|bash
It uses 50 bytes:
$ wc -c <<< "clear;yes 'c=(╱ ╲);printf \${c[RANDOM%2]}'|bash"
51
or 46 characters:
$ wc -m <<< "clear;yes 'c=(╱ ╲);printf \${c[RANDOM%2]}'|bash"
47
After looking at some UTF stuff:
2571 BOX DRAWINGS LIGHT DIAGONAL UPPER RIGHT TO LOWER LEFT
2572 BOX DRAWINGS LIGHT DIAGONAL UPPER LEFT TO LOWER RIGHT
(╱‬ and ╲) seem best.
f="╱╲";while :;do print -n ${f[(RANDOM % 2) + 1]};done
also works in zsh (thanks Clint on OFTC for giving me bits of that)
Here is my 39 character command line solution I just posted to #climagic:
grep -ao "[/\\]" /dev/urandom|tr -d \\n
In bash, you can remove the double quotes around the [/\] match expression and make it even shorter than the C64 solution, but I've included them for good measure and cross shell compatibility. If there was a 1 character option to grep to make grep trim newlines, then you could make this 27 characters.
I know this doesn't use the Unicode characters so maybe it doesn't count. It is possible to grep for the Unicode characters in /dev/urandom, but that will take a long time because that sequence comes up less often and if you pipe it the command pipeline will probably "stick" for quite a while before producing anything due to line buffering.
Bash supports Unicode now, so we don't need to use UTF-8 character sequences such as $'\xe2\x95\xb1'.
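For instance (bash 4.2 or later, where printf understands \u/\U escapes in its format string):
printf '\u2571\u2572\n'   # prints ╱╲ directly, no byte sequences needed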
This is my most-correct version: it loops, prints either / or \ based on a random number as others do.
for((;;x=RANDOM%2+2571)){ printf "\U$x";}
41
My previous best was:
while :;do printf "\U257"$((RANDOM%2+1));done
45
And this one 'cheats' using embedded Unicode (I think for obviousness, maintainability, and simplicity, this is my favourite).
Z=╱╲;for((;;)){ printf ${Z:RANDOM&1:1};}
40
My previous best was:
while Z=╱╲;do printf ${Z:RANDOM&1:1};done
41
And here are some more.
while :;do ((RANDOM&1))&&printf "\U2571"||printf "\U2572";done
while printf -v X "\\\U%d" $((2571+RANDOM%2));do printf $X;done
while :;do printf -v X "\\\U%d" $((2571+RANDOM%2));printf $X;done
while printf -v X '\\U%d' $((2571+RANDOM%2));do printf $X;done
c=('\U2571' '\U2572');while :;do printf ${c[RANDOM&1]};done
X="\U257";while :;do printf $X$((RANDOM%2+1));done
Now, this one runs until we get a stack overflow (not another one!) since bash does not seem to support tail-call elimination yet.
f(){ printf "\U257"$((RANDOM%2+1));f;};f
40
And this is my attempt to implement a crude form of tail-process elimination. But when you have had enough and press ctrl-c, your terminal will vanish.
f(){ printf "\U257"$((RANDOM%2+1));exec bash -c f;};export -f f;f
UPDATE:
And a few more.
X=(╱ ╲);echo -e "\b${X[RANDOM&1]"{1..1000}"}" 46
X=("\U2571" "\U2572");echo -e "\b${X[RANDOM&1]"{1..1000}"}" 60
X=(╱ ╲);while :;do echo -n ${X[RANDOM&1]};done 46
Z=╱╲;while :;do echo -n ${Z:RANDOM&1:1};done 44
Sorry for necroposting, but here's a bash version in 38 characters.
yes 'printf \\u$[2571+RANDOM%2]'|bash
using for instead of yes inflates this to 40 characters:
for((;;)){ printf \\u$[2571+RANDOM%2];}
109 characters for Python 3, which was the smallest I could get it.
#!/usr/bin/python3
import random
while True:
    if random.randrange(2)==1:print('\u2572',end='')
    else:print('\u2571',end='')
#!/usr/bin/python3
import random
import sys
while True:
    if random.randrange(2)==1:sys.stdout.write("\u2571")
    else:sys.stdout.write("\u2572")
    sys.stdout.flush()
Here's a version for Batch which fits in 127 characters:
cmd /v:on /c "for /l %a in (0,0,0) do @set /a "a=!random!%2" >nul & if "!a!"=="0" (set /p ".=/" <nul) else (set /p ".=\" <nul)"

Assign string containing null-character (\0) to a variable in Bash

While trying to process a list of file-/foldernames correctly (see my other questions) through the use of a NULL-character as a delimiter I stumbled over a strange behaviour of Bash that I don't understand:
When assigning a string containing one or more NULL-character to a variable, the NULL-characters are lost / ignored / not stored.
For example,
echo -ne "n\0m\0k" | od -c # -> 0000000 n \0 m \0 k
But:
VAR1=`echo -ne "n\0m\0k"`
echo -ne "$VAR1" | od -c # -> 0000000 n m k
This means that I would need to write that string to a file (for example, in /tmp) and read it back from there if piping directly is not desired or feasible.
When executing these scripts in Z shell (zsh) the strings containing \0 are preserved in both cases, but sadly I can't assume that zsh is present in the systems running my script while Bash should be.
How can strings containing \0 chars be stored or handled efficiently without losing any (meta-) characters?
In Bash, you can't store the NULL-character in a variable.
You may, however, store a plain hex dump of the data (and later reverse this operation again) by using the xxd command.
VAR1=`echo -ne "n\0m\0k" | xxd -p | tr -d '\n'`
echo -ne "$VAR1" | xxd -r -p | od -c # -> 0000000 n \0 m \0 k
As others have already stated, you can't store/use NUL char:
in a variable
in an argument of the command line.
However, you can handle any binary data (including NUL char):
in pipes
in files
So to answer your last question:
can anybody give me a hint how strings containing \0 chars can be
stored or handled efficiently without losing any (meta-) characters?
You can use files or pipes to store and handle efficiently any string with any meta-characters.
If you plan to handle data, you should note additionally that:
Only the NUL char is eaten by variables and command-line arguments; you can check this (see the quick check just after this list).
Be wary that command substitution (as $(command..) or `command..`) has an additional twist beyond plain variable assignment: it also eats your trailing newlines.
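A quick check (a minimal sketch, using printf rather than echo -ne; the test string is the one from the question):
printf 'n\0m\0k' | od -c                              # pipe: the NULs survive -> n \0 m \0 k
VAR=$(printf 'n\0m\0k'); printf '%s' "$VAR" | od -c   # variable: the NULs are eaten -> n m k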
Bypassing limitations
If you want to use variables, then you must get rid of the NUL char by encoding it, and various other solutions here give clever ways to do that (an obvious way is to use for example base64 encoding/decoding).
If you are concerned about memory or speed, you'll probably want to use a minimal parser and only quote the NUL character (and the quoting character itself). In this case, this would help you:
quote() { sed 's/\\/\\\\/g;s/\x0/\\x00/g'; }
Then you can secure your data before storing it in variables and command-line arguments by piping your sensitive data through quote, which will output a safe data stream without NUL chars. You can get back the original string (with NUL chars) by using echo -en "$var_quoted", which will send the correct string to standard output.
Example:
## Our example output generator, with NUL chars
ascii_table() { echo -en "$(echo '\'0{0..3}{0..7}{0..7} | tr -d " ")"; }
## store
myvar_quoted=$(ascii_table | quote)
## use
echo -en "$myvar_quoted"
Note: use | hd to get a clean view of your data in hexadecimal and check that you didn't lose any NUL chars.
Changing tools
Remember you can go pretty far with pipes without using variables nor argument in command line, don't forget for instance the <(command ...) construct that will create a named pipe (sort of a temporary file).
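For instance (a tiny illustration, again with the question's test string):
od -c <(printf 'n\0m\0k')   # the data reaches od through the pipe with its NULs intact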
EDIT: the first implementation of quote was incorrect and would not deal correctly with \ special characters interpreted by echo -en. Thanks @xhienne for spotting that.
EDIT2: the second implementation of quote had a bug: using only \0 would actually eat up more zeroes, as \0, \00, \000 and \0000 are equivalent. So \0 was replaced by \x00. Thanks to @MatthijsSteen for spotting this one.
Use uuencode and uudecode for POSIX portability
xxd and base64 are not POSIX 7 but uuencode is.
VAR="$(uuencode -m <(printf "a\0\n") /dev/stdout)"
uudecode -o /dev/stdout <(printf "$VAR") | od -tx1
Output:
0000000 61 00 0a
0000003
Unfortunately I don't see a POSIX 7 alternative to the Bash process substitution extension <() except writing to a file; also, uuencode and uudecode are not installed on Ubuntu 12.04 by default (sharutils package).
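A file-based variant of the same round trip might look like this (a sketch; the /tmp paths are arbitrary):
printf 'a\0\n' > /tmp/nul_in
VAR="$(uuencode -m /tmp/nul_in /dev/stdout)"
printf '%s\n' "$VAR" > /tmp/nul_enc
uudecode -o /dev/stdout /tmp/nul_enc | od -tx1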
So I guess that the real answer is: don't use Bash for this, use Python or some other saner interpreted language.
I love jeff's answer. I would use Base64 encoding instead of xxd. It saves a little space and would be (I think) more recognizable as to what is intended.
VAR=$(echo -ne "foo\0bar" | base64)
echo -n "$VAR" | base64 -d | xargs -0 ...
As for -e, it is needed when echoing a literal string containing an encoded null ('\0'). I also seem to recall that echo -e is unsafe if you're echoing any user input, as they could inject escape sequences that echo will interpret, and bad things result. The -e flag is not needed when echoing the encoded stored string into the decode step.
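A concrete round trip (a small sketch; printf is used instead of echo to sidestep the -e caveats above):
VAR=$(printf 'foo\0bar' | base64)
printf '%s' "$VAR" | base64 -d | od -c   # -> f o o \0 b a r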
Here’s a maximally memory-efficient solution, that just escapes the NULL bytes with an \xFF.
(Since I wasn’t happy with base64 or the like. :)
esc0() { sed 's/\xFF/\xFF\xFF/g; s/\x00/\xFF0/g'; }
cse0() { sed 's/\xFF0/\xFF\x00/g; s/\xFF\(.\)/\1/g'; }
It of course escapes any actual \xFF by doubling it too, so it works exactly like when backslashes are used for escaping. This is also why a simple mapping can’t be used, and referring to the match in the replacement is required.
Here’s an example that paints gradients onto the framebuffer (doesn’t work in X), using variables to pre-render blocks and lines for speed:
width=7680; height=1080; # Set these to your framebuffer’s size.
blocksPerLine=$(( $width / 256 ))
block="$( for i in 0 1 2 3 4 5 6 7 8 9 A B C D E F; do for j in 0 1 2 3 4 5 6 7 8 9 A B C D E F; do echo -ne "\x$i$j"; done; done | esc0 )"
line="$( for ((b=0; b < blocksPerLine; b++)); do echo -en "$block"; done )"
for ((l=0; l <= $height; l++)); do echo -en "$line"; done | cse0 > /dev/fb0
Note how $block contains escaped NULLs (plus \xFFs), and at the end, before writing everything to the framebuffer, cse0 unescapes them.

comments amongst parameters in BASH

I want to include inline comments among a command's parameters, e.g.:
sed -i.bak -r \
# comment 1
-e 'sed_commands' \
# comment 2
-e 'sed_commands' \
# comment 3
-e 'sed_commands' \
/path/to/file
The above code doesn't work. Is there a different way for embedding comments in the parameters line?
If you really want to comment the arguments, you can try this:
ls $(
echo '-l' #for the long list
echo '-F' #show file types too
echo '-t' #sort by time
)
This will be equivalent to:
ls -l -F -t
echo is a shell built-in, so this does not execute external commands and is fast enough. But it is crazy anyway.
or
makeargs() { while read line; do echo ${line//#*/}; done }
ls $(makeargs <<EOF
-l # CDEWDWEls
-F #Dwfwef
EOF
)
I'd recommend using longer text blocks for your sed script, i.e.
sed -i.bak '
# comment 1
sed_commands
# comment 2
sed_commands
# comment 3
sed_commands
' /path/to/file
Unfortunately, embedded comments in sed script blocks are not a universally supported feature. The sun4 version would let you put a comment on the first line, but nowhere else. AIX sed either doesn't allow any comments, or uses a different character besides # for comments. Your results may vary.
I hope this helps.
You could invoke sed multiple times instead of passing all of the arguments to one process:
sed sed_commands | # comment 1
sed sed_commands | # comment 2
sed sed_commands | # comment 3
sed sed_commands # final comment
It's obviously more wasteful, but you may decide that three extra sed processes are a fair tradeoff for readability and portability (to @shellter's point about support for comments within sed commands). Depends on your situation.
UPDATE: you'll also have to adjust if you originally intended to edit the files in place, as your -i argument implies. This approach would require a pipeline.
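For example, a sketch of that adjustment (a temporary file stands in for -i; the .new suffix is arbitrary):
sed 'sed_commands' /path/to/file |        # comment 1
sed 'sed_commands' |                      # comment 2
sed 'sed_commands' > /path/to/file.new && # comment 3
mv /path/to/file.new /path/to/file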
There isn't a way to do what you seek to do in shell plus sed. I put the comments before the sed script, like this:
# This is a remarkably straight-forward SED script
# -- When it encounters an end of here-document followed by
# the start of the next here document, it deletes both lines.
# This cuts down vastly on the number of processes which are run.
# -- It also does a substitution for XXXX, because the script which
# put the XXXX in place was quite hard enough without having to
# worry about whether things were escaped enough times or not.
cat >$tmp.3 <<EOF
/^!\$/N
/^!\\ncat <<'!'\$/d
s%version XXXX%version $SOURCEDIR/%
EOF
# This is another entertaining SED script.
# It takes the output from the shell script generated by running the
# first script through the second script and into the shell, and
# converts it back into an NMD file.
# -- It initialises the hold space with --#, which is a marker.
# -- For lines which start with the marker, it adds the pattern space
# to the hold space and exchanges the hold and pattern space. It
# then replaces a version number followed by a newline, the marker
# and a version number by the just the new version number, but
# replaces a version number followed by a newline and just the
# marker by just the version number. This replaces the old version
# number with the new one (when there is a new version number).
# The line is printed and deleted.
# -- Note that this code allows for an optional single word after the
# version number. At the moment, the only valid value is 'binary' which
# indicates that the file should not be version stamped by mknmd.
# -- On any line which does not start with the marker, the line is
# copied into the hold space, and if the original hold space
# started with the marker, the line is deleted. Otherwise, of
# course, it is printed.
cat >$tmp.2 <<'EOF'
1{
x
s/^/--#/
x
}
/^--# /{
H
x
s/\([ ]\)[0-9.][0-9.]*\n--# \([0-9.]\)/\1\2/
s/\([ ]\)[0-9.][0-9.]*\([ ][ ]*[^ ]*\)\n--# \([0-9.][0-9.]*\)/\1\3\2/
s/\([ ][0-9.][0-9.]*\)\n--# $/\1/
s/\([ ][0-9.][0-9.]*[ ][ ]*[^ ]*\)\n--# $/\1/
p
d
}
/^--#/!{
x
/^--#/d
}
EOF
There's another sed script in the file that is about 40 lines long (marked as 'entertaining'), though about half those lines are simply embedded shell script added to the output. I haven't changed the shell script containing this stuff in 13 years because (a) it works and (b) the sed scripts scare me witless. (The NMD format contains a file name and a version number separated by space and occasionally a tag word 'binary' instead of a version number, plus comment lines and blank lines.)
You don't have to understand what the script does - but commenting before the script is the best way I've found for documenting sed scripts.
No.
If you put the \ before the # it will escape the comment character and you won't have a comment anymore.
If you put the \ after the # it will be part of the comment and you won't escape the newline anymore.
A lack of inline comments is a limitation of bash that you would do better to adapt to than try and work around with some of the baroque suggestions already put forth.
Although the thread is quite old, I did find it while asking the same question, and so will others. Here's my solution to this problem:
You need comments so that when you look at your code much later, you will still get an idea of what you actually did when you wrote it. I ran into the same problem while writing my first rsync script, which has lots of parameters that also have side effects.
Group the parameters that belong together by topic and put them into a variable with a corresponding name. This makes it easy to identify what the parameters control and serves as the short comment. In addition, you can put a comment above the variable declaration describing how to change the behavior; this is the long-version comment.
Call the application with the corresponding parameter variables.
## Options
# Remove --whole-file for delta transfer
sync_filesystem=" --one-file-system \
                  --recursive \
                  --relative \
                  --whole-file "

rsync \
  ${sync_filesystem} \
  ${way_more_to_come} \
  "${SOURCE}" \
  "${DESTIN}"
Good overview, easy to edit, and it works like comments in the parameters. It takes more effort, but therefore has higher quality.
I'll suggest another way that works at least in some instances:
Let's say I have the command:
foo --option1 --option2=blah --option3 option3val /tmp/bar
I can write it this way:
options=(
--option1
--option2=blah
--option3 option3val
)
foo ${options[@]} /tmp/bar
Now let's say I want to temporarily remove the second option. I can just comment it out:
options=(
--option1
# --option2=blah
--option3 option3val
)
Note that this technique may not work when you need extensive escaping or quoting. I have run into some issues with that in the past, but unfortunately I don't recall the details at the moment :(
For most cases, though, this technique works well. If you need embedded blanks in a parameter, just enclose the string in quotes, as normal.
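A small illustration of that last point (foo and its options are the hypothetical command and flags from above):
options=(
    --option1
    "--message=hello world"   # embedded blanks: quote the whole element
    --option3
    option3val
)
foo "${options[@]}" /tmp/bar   # quoting the expansion keeps "hello world" as a single argument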

Replacing HTML ascii codes via a bash script?

I need a way to replace HTML ASCII codes like &#33; with their correct character in bash.
Is there a utility I could run my output through to do this, or something along those lines?
$ echo '&#33;' | recode html/..
!
$ echo '&lt;&infin;&gt;' | recode html/..
<∞>
I don't know of an easy way, here is what I suppose I would do...
You might be able to script a browser into reading the file in and then saving it as text. If lynx supports html character entities then it might be worth looking in to. If that doesn't work out...
The general solution to something like this is done with sed. You need a "higher order" edit for this, as you would first start with an entity table and then you would edit that table into an edit script itself with a multiple-step procedure. Something like:
. . .
s/&Dagger;/‡/g
s/&#8221;/”/g
. . .
Then, encapsulate this as html, read it in to a browser, and save it as text in the character set you are targeting. If you get it to produce lines like:
s/&lt;/</g
then you win. A bash script that calls sed or ex can be driven by the substitute commands in the file.
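For instance, if the generated substitutions were saved to a file (the name entities.sed here is just for illustration), the script could apply them with:
sed -f entities.sed input.html > output.txt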
Here is my solution with the standard Linux toolbox.
$ foo="This is a line feed
And e acute:é with a grinning face 😀."
$ echo "$foo"
This is a line feed
And e acute:é with a grinning face 😀.
$ eval "$(printf '%s' "$foo" | sed 's/^/printf "/;s/&#0*\([0-9]*\);/\$( [ \1 -lt 128 ] \&\& printf "\\\\$( printf \"%.3o\\201\" \1)" || \$(which printf) \\\\U\$( printf \"%.8x\" \1) )/g;s/$/\\n"/')" | sed "s/$(printf '\201')//g"
This is a line feed
And e acute:é with a grinning face 😀.
You see that it works for all kinds of escapes: even Line Feed, e acute (é), which is a 2-byte UTF-8 character, and even the new emoticons, which are in the extended plane (4-byte Unicode).
This command also works with dash, which is a trimmed-down shell (the default /bin/sh on Ubuntu), and is also compatible with bash and shells like the ash used by Synology.
If you don't mind sticking with bash and dropping the compatibility, you can make it much simpler (a sketch of one possible bash-only approach follows at the end of this answer).
Bits used should be in any decent Linux box (or OS X?)
- which
- printf (GNU and builtin)
- GNU sed
- eval (shell builtin)
The bash-only version doesn't need which nor the GNU printf.
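Such a bash-only version is not spelled out above; here is one possible sketch (it assumes bash 4.2+ for the \U escape in printf, and it only handles numeric entities like &#233;, not named ones like &amp;):
decode_entities() {
    local s=$1 re='&#0*([0-9]+);' hex ch
    while [[ $s =~ $re ]]; do
        printf -v hex '%08x' "${BASH_REMATCH[1]}"    # decimal code point -> 8 hex digits
        printf -v ch "\U$hex"                        # turn it into the actual character
        s=${s//"${BASH_REMATCH[0]}"/$ch}             # replace every occurrence of this entity
    done
    printf '%s\n' "$s"
}
decode_entities 'And e acute:&#233; with a grinning face &#128512;.'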

How do I determine file encoding in OS X?

I'm trying to enter some UTF-8 characters into a LaTeX file in TextMate (which says its default encoding is UTF-8), but LaTeX doesn't seem to understand them.
Running cat my_file.tex shows the characters properly in Terminal. Running ls -al shows something I've never seen before: an "#" by the file listing:
-rw-r--r--# 1 me users 2021 Feb 11 18:05 my_file.tex
(And, yes, I'm using \usepackage[utf8]{inputenc} in the LaTeX.)
I've found iconv, but that doesn't seem to be able to tell me what the encoding is -- it'll only convert once I figure it out.
Using the -I (that's a capital i) option on the file command seems to show the file encoding.
file -I {filename}
In Mac OS X the command file -I (capital i) will give you the proper character set so long as the file you are testing contains characters outside of the basic ASCII range.
For instance if you go into Terminal and use vi to create a file eg. vi test.txt
then insert some characters and include an accented character (try ALT-e followed by e)
then save the file.
Then type file -I test.txt and you should get a result like this:
test.txt: text/plain; charset=utf-8
The @ means that the file has extended file attributes associated with it. You can query them using the getxattr() function.
There's no definite way to detect the encoding of a file. Read this answer, it explains why.
There's a command line tool, enca, that attempts to guess the encoding. You might want to check it out.
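Typical usage is something like this (a sketch, assuming enca is installed; -L names the language of the text when the locale's default isn't right):
enca my_file.tex             # guess the encoding, using the current locale's language as a hint
enca -L russian my_file.tex  # or name a language explicitly (Russian here, just as an example)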
vim -c 'execute "silent !echo " . &fileencoding | q' {filename}
aliased somewhere in my bash configuration as
alias vic="vim -c 'execute \"silent \!echo \" . &fileencoding | q'"
so I just type
vic {filename}
On my vanilla OSX Yosemite, it yields more precise results than "file -I":
$ file -I pdfs/udocument0.pdf
pdfs/udocument0.pdf: application/pdf; charset=binary
$ vic pdfs/udocument0.pdf
latin1
$
$ file -I pdfs/t0.pdf
pdfs/t0.pdf: application/pdf; charset=us-ascii
$ vic pdfs/t0.pdf
utf-8
You can also convert from one file type to another using the following command :
iconv -f original_charset -t new_charset originalfile > newfile
e.g.
iconv -f utf-16le -t utf-8 file1.txt > file2.txt
Just use:
file -I <filename>
That's it.
Using file command with the --mime-encoding option (e.g. file --mime-encoding some_file.txt) instead of the -I option works on OS X and has the added benefit of omitting the mime type, "text/plain", which you probably don't care about.
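For example (assuming the file really is UTF-8; the output format is "filename: encoding"):
$ file --mime-encoding my_file.tex
my_file.tex: utf-8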
Classic 8-bit LaTeX is very restricted in which UTF8 characters it can use; it's highly dependent on the encoding of the font you're using and which glyphs that font has available.
Since you don't give a specific example, it's hard to know exactly where the problem is — whether you're attempting to use a glyph that your font doesn't have or whether you're not using the correct font encoding in the first place.
Here's a minimal example showing how a few UTF8 characters can be used in a LaTeX document:
\documentclass{article}
\usepackage[T1]{fontenc}
\usepackage{lmodern}
\usepackage[utf8]{inputenc}
\begin{document}
‘Héllø—thêrè.’
\end{document}
You may have more luck with the [utf8x] encoding, but be slightly warned that it's no longer supported and has some idiosyncrasies compared with [utf8] (as far as I recall; it's been a while since I've looked at it). But if it does the trick, that's all that matters for you.
The @ sign means the file has extended attributes. xattr file shows what attributes it has, xattr -l file shows the attribute values too (which can be large sometimes — try e.g. xattr /System/Library/Fonts/HelveLTMM to see an old-style font that exists in the resource fork).
Typing file myfile.tex in a terminal can sometimes tell you the encoding and type of file using a series of algorithms and magic numbers. It's fairly useful but don't rely on it providing concrete or reliable information.
A Localizable.strings file (found in localised Mac OS X applications) is typically reported to be a UTF-16 C source file.
Synalyze It! lets you compare text or bytes in all encodings the ICU library offers. Using that feature you usually see immediately which code page makes sense for your data.
You can try loading the file into a Firefox window, then go to View - Character Encoding. There should be a check mark next to the file's encoding type.
I implemented the bash script below, it works for me.
It first tries to iconv from the encoding returned by file --mime-encoding to utf-8.
If that fails, it goes through all encodings and shows the diff between the original and re-encoded file. It skips over encodings that produce a large diff output ("large" as defined by the MAX_DIFF_LINES variable or the second input argument), since those are most likely the wrong encoding.
If "bad things" happen as a result of using this script, don't blame me. There's a rm -f in there, so there be monsters. I tried to prevent adverse effects by using it on files with a random suffix, but I'm not making any promises.
Tested on Darwin 15.6.0.
#!/bin/bash

if [[ $# -lt 1 ]]
then
  echo "ERROR: need one input argument: file of which the encoding is to be detected."
  exit 3
fi

if [ ! -e "$1" ]
then
  echo "ERROR: cannot find file '$1'"
  exit 3
fi

if [[ $# -ge 2 ]]
then
  MAX_DIFF_LINES=$2
else
  MAX_DIFF_LINES=10
fi

#try the easy way
ENCOD=$(file --mime-encoding $1 | awk '{print $2}')

#check if this encoding is valid
iconv -f $ENCOD -t utf-8 $1 &> /dev/null
if [ $? -eq 0 ]
then
  echo $ENCOD
  exit 0
fi

#hard way, need the user to visually check the difference between the original and re-encoded files
for i in $(iconv -l | awk '{print $1}')
do
  SINK=$1.$i.$RANDOM
  iconv -f $i -t utf-8 $1 2> /dev/null > $SINK
  if [ $? -eq 0 ]
  then
    DIFF=$(diff $1 $SINK)
    if [ ! -z "$DIFF" ] && [ $(echo "$DIFF" | wc -l) -le $MAX_DIFF_LINES ]
    then
      echo "===== $i ====="
      echo "$DIFF"
      echo "Does that make sense [N/y]"
      read ANSWER
      if [ "$ANSWER" == "y" ] || [ "$ANSWER" == "Y" ]
      then
        echo $i
        exit 0
      fi
    fi
  fi
  #clean up re-encoded file
  rm -f $SINK
done

echo "None of the encodings worked. You're stuck."
exit 3
Which LaTeX are you using? When I was using teTeX, I had to manually download the unicode package and add this to my .tex files:
% UTF-8 stuff
\usepackage[notipa]{ucs}
\usepackage[utf8x]{inputenc}
\usepackage[T1]{fontenc}
Now, I've switched over to XeTeX from the TeXlive 2008 package (here); it is even simpler:
% UTF-8 stuff
\usepackage{fontspec}
\usepackage{xunicode}
As for detection of a file's encoding, you could play with file(1) (but it is rather limited) but like someone else said, it is difficult.
A brute-force way to check the encoding might just be to examine the file in a hex editor or similar (or write a program to check). Look at the binary data in the file. The UTF-8 format is fairly easy to recognize: all ASCII characters are single bytes with values below 128 (0x80), and multibyte sequences follow the pattern shown in the wiki article.
If you can find a simpler way to get a program to verify the encoding for you, that's obviously a shortcut, but if all else fails, this would do the trick.

Resources