How do I determine file encoding in OS X?

I'm trying to enter some UTF-8 characters into a LaTeX file in TextMate (which says its default encoding is UTF-8), but LaTeX doesn't seem to understand them.
Running cat my_file.tex shows the characters properly in Terminal. Running ls -al shows something I've never seen before: an "@" by the file listing:
-rw-r--r--@ 1 me users 2021 Feb 11 18:05 my_file.tex
(And, yes, I'm using \usepackage[utf8]{inputenc} in the LaTeX.)
I've found iconv, but that doesn't seem to be able to tell me what the encoding is -- it'll only convert once I figure it out.

Using the -I (that's a capital i) option on the file command seems to show the file encoding.
file -I {filename}

In Mac OS X, the command file -I (capital i) will give you the proper character set as long as the file you are testing contains characters outside the basic ASCII range.
For instance, go into Terminal and use vi to create a file, e.g. vi test.txt,
then insert some characters, including an accented character (try ALT-e followed by e),
then save the file.
Then type file -I test.txt and you should get a result like this:
test.txt: text/plain; charset=utf-8

The @ means that the file has extended file attributes associated with it. You can query them using the getxattr() function.
There's no definitive way to detect the encoding of a file. Read this answer; it explains why.
There's a command line tool, enca, that attempts to guess the encoding. You might want to check it out.

vim -c 'execute "silent !echo " . &fileencoding | q' {filename}
aliased somewhere in my bash configuration as
alias vic="vim -c 'execute \"silent \!echo \" . &fileencoding | q'"
so I just type
vic {filename}
On my vanilla OSX Yosemite, it yields more precise results than "file -I":
$ file -I pdfs/udocument0.pdf
pdfs/udocument0.pdf: application/pdf; charset=binary
$ vic pdfs/udocument0.pdf
latin1
$
$ file -I pdfs/t0.pdf
pdfs/t0.pdf: application/pdf; charset=us-ascii
$ vic pdfs/t0.pdf
utf-8

You can also convert a file from one character encoding to another using the following command:
iconv -f original_charset -t new_charset originalfile > newfile
e.g.
iconv -f utf-16le -t utf-8 file1.txt > file2.txt
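If you script this, it may be worth guarding against a failed conversion, since the plain redirect creates newfile even when iconv stops halfway. A minimal sketch, assuming bash; convert_encoding is a hypothetical helper name, not part of iconv:

```shell
#!/usr/bin/env bash
# convert_encoding FROM TO SRC DST   (hypothetical helper, not an iconv feature)
# Converts into a temporary file first, so a failed conversion
# never leaves a truncated DST behind.
convert_encoding() {
    local from=$1 to=$2 src=$3 dst=$4 tmp
    tmp=$(mktemp "${TMPDIR:-/tmp}/conv.XXXXXX") || return 1
    if iconv -f "$from" -t "$to" < "$src" > "$tmp"; then
        mv -- "$tmp" "$dst"
    else
        rm -f -- "$tmp"
        return 1
    fi
}
```

Called as convert_encoding utf-16le utf-8 file1.txt file2.txt, it does the same job as the one-liner above but returns non-zero and writes nothing if the input is not valid in the source encoding.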

Just use:
file -I <filename>
That's it.

Using file command with the --mime-encoding option (e.g. file --mime-encoding some_file.txt) instead of the -I option works on OS X and has the added benefit of omitting the mime type, "text/plain", which you probably don't care about.
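If you want only the encoding name in a script, adding -b (brief) also drops the "some_file.txt:" prefix. A small sketch, assuming bash and a file(1) that supports --mime-encoding; guess_encoding is a made-up wrapper name:

```shell
#!/usr/bin/env bash
# guess_encoding FILE: print only file(1)'s guess at the character encoding.
# -b suppresses the "filename:" prefix; -- protects names that start with a dash.
guess_encoding() {
    file -b --mime-encoding -- "$1"
}
```

For a file containing UTF-8 multibyte characters this prints utf-8; for a file with only ASCII characters it prints us-ascii. It is still a guess based on the file's contents, not a guarantee.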

Classic 8-bit LaTeX is very restricted in which UTF8 characters it can use; it's highly dependent on the encoding of the font you're using and which glyphs that font has available.
Since you don't give a specific example, it's hard to know exactly where the problem is — whether you're attempting to use a glyph that your font doesn't have or whether you're not using the correct font encoding in the first place.
Here's a minimal example showing how a few UTF8 characters can be used in a LaTeX document:
\documentclass{article}
\usepackage[T1]{fontenc}
\usepackage{lmodern}
\usepackage[utf8]{inputenc}
\begin{document}
‘Héllø—thêrè.’
\end{document}
You may have more luck with the [utf8x] encoding, but be slightly warned that it's no longer supported and has some idiosyncrasies compared with [utf8] (as far as I recall; it's been a while since I've looked at it). But if it does the trick, that's all that matters for you.

The @ sign means the file has extended attributes. xattr file shows what attributes it has; xattr -l file shows the attribute values too (which can be large sometimes; try e.g. xattr /System/Library/Fonts/HelveLTMM to see an old-style font that exists in the resource fork).

Typing file myfile.tex in a terminal can sometimes tell you the encoding and type of file using a series of algorithms and magic numbers. It's fairly useful but don't rely on it providing concrete or reliable information.
A Localizable.strings file (found in localised Mac OS X applications) is typically reported to be a UTF-16 C source file.

Synalyze It! lets you compare text or bytes in all encodings the ICU library offers. Using that feature, you can usually see immediately which code page makes sense for your data.

You can try loading the file into a Firefox window and then going to View - Character Encoding. There should be a check mark next to the file's encoding type.

I implemented the bash script below, it works for me.
It first tries to iconv from the encoding returned by file --mime-encoding to utf-8.
If that fails, it goes through all encodings and shows the diff between the original and re-encoded file. It skips over encodings that produce a large diff output ("large" as defined by the MAX_DIFF_LINES variable or the second input argument), since those are most likely the wrong encoding.
If "bad things" happen as a result of using this script, don't blame me. There's a rm -f in there, so there be monsters. I tried to prevent adverse effects by using it on files with a random suffix, but I'm not making any promises.
Tested on Darwin 15.6.0.
#!/bin/bash
if [[ $# -lt 1 ]]
then
  echo "ERROR: need one input argument: file of which the encoding is to be detected."
  exit 3
fi
if [ ! -e "$1" ]
then
  echo "ERROR: cannot find file '$1'"
  exit 3
fi
if [[ $# -ge 2 ]]
then
  MAX_DIFF_LINES=$2
else
  MAX_DIFF_LINES=10
fi
#try the easy way (-b prints only the encoding, without the file name)
ENCOD=$(file -b --mime-encoding "$1")
#check if this encoding is valid
if iconv -f "$ENCOD" -t utf-8 "$1" &> /dev/null
then
  echo "$ENCOD"
  exit 0
fi
#hard way, need the user to visually check the difference between the original and re-encoded files
for i in $(iconv -l | awk '{print $1}')
do
  SINK=$1.$i.$RANDOM
  if iconv -f "$i" -t utf-8 "$1" 2> /dev/null > "$SINK"
  then
    DIFF=$(diff "$1" "$SINK")
    if [ -n "$DIFF" ] && [ $(echo "$DIFF" | wc -l) -le "$MAX_DIFF_LINES" ]
    then
      echo "===== $i ====="
      echo "$DIFF"
      echo "Does that make sense [N/y]"
      #read into ANSWER (not $ANSWER, which would expand to an empty name)
      read -r ANSWER
      if [ "$ANSWER" == "y" ] || [ "$ANSWER" == "Y" ]
      then
        #clean up before leaving
        rm -f "$SINK"
        echo "$i"
        exit 0
      fi
    fi
  fi
  #clean up re-encoded file
  rm -f "$SINK"
done
echo "None of the encodings worked. You're stuck."
exit 3

Which LaTeX are you using? When I was using teTeX, I had to manually download the unicode package and add this to my .tex files:
% UTF-8 stuff
\usepackage[notipa]{ucs}
\usepackage[utf8x]{inputenc}
\usepackage[T1]{fontenc}
Now that I've switched over to XeTeX from the TeX Live 2008 package, it is even simpler:
% UTF-8 stuff
\usepackage{fontspec}
\usepackage{xunicode}
As for detection of a file's encoding, you could play with file(1) (but it is rather limited) but like someone else said, it is difficult.

A brute-force way to check the encoding might just be to check the file in a hex editor or similar (or write a program to check). Look at the binary data in the file. The UTF-8 format is fairly easy to recognize. All ASCII characters are single bytes with values below 128 (0x80).
Multibyte sequences follow the pattern shown in the wiki article: every byte has its high bit set, a lead byte starts with the bits 110, 1110, or 11110, and each continuation byte starts with the bits 10.
If you can find a simpler way to get a program to verify the encoding for you, that's obviously a shortcut, but if all else fails, this would do the trick.
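If you don't have a hex editor handy, od can do the byte dump from the terminal. A minimal sketch, assuming bash; hex_head is a hypothetical helper name:

```shell
#!/usr/bin/env bash
# hex_head FILE [COUNT]: print the first COUNT bytes (default 16) of FILE
# as space-separated hex pairs on a single line.
hex_head() {
    head -c "${2:-16}" -- "$1" \
        | od -An -v -tx1 \
        | tr -s ' \n' ' ' \
        | sed 's/^ //; s/ $//'
}
```

For a UTF-8 file that starts with "Hé" this prints 48 c3 a9: the é is the two-byte sequence c3 a9, a lead byte beginning with bits 110 followed by a continuation byte beginning with 10. In Latin-1 the same character would be the single byte e9.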

Related

How to select filetypes in bash script or perl and act based on the type

For the moment, ignore any issues with the specific file type used here as an example only.
Given a directory with multiple types of files, like a README file in markdown format, git version control files, PEM keys, and other random file types, what's the best way, in terms of portability or even readability, to select ONLY the PEM keys in the directory and move those over to an arbitrary location?
One possible option was to use the file tool, which returns the file type. For example:
$ file randomly-named-file
randomly-named-file: PEM RSA private key
Essentially, what's the best way, in terms of portability or readability, to create a test something like the following pseudo "code"?
if file $1 is PEM
mv $1 /some/other/dir
fi
Bash is preferred; though, perl solution is acceptable if necessary.
The original question stemmed from use on a Mac OS X platform, with a strong need for portability to Linux (RHEL-based, Gentoo) platforms.
The file version is 5.31, supporting -print0, for what it's worth.
It might be worth noting, that the file tool is just the obvious first choice I thought of. It does NOT have to be using that tool if other portable solutions are available.

A reasonable implementation (subject to all the usual limitations of libmagic) requiring file -0 might look like the following:
#!/usr/bin/env bash
# ^^^^- NOT /bin/sh; requires bash-specific syntax.
while IFS= read -r -d '' filename && read -r type; do
  type=${type#": "}
  if [[ $type = *"PEM RSA private key"* ]]; then
    printf 'Found file: %q\n' "$filename"
  fi
done < <(file -0 -- *)
The -0 argument tells file to print a NUL after each filename; IFS= read -r -d '' reads up to the first NUL, so it thus consumes only the filename, whereas read -r type consumes the rest of each line.
This makes it possible to disambiguate names from types in the list, even if those names contain colons, newlines, or other surprising characters. If you didn't have -0, you'd need to start a separate copy of file for each file you wanted to test, with the associated performance hit.

If, for some reason, you cannot rely on file, you can try using head to inspect the file for the standard PEM header, like so:
for file in ./*; do
  echo "Examining file: $file"
  first_line=$(head -n 1 "$file")
  if [ "$first_line" = "-----BEGIN ENCRYPTED PRIVATE KEY-----" ]; then
    echo "Found PEM file"
    # move the file that matched, not a hard-coded name
    mv "$file" /my/arbitrary/dir
  fi
done
I created a small directory with the following test files:
$ ls
pretend.pdf example.pem script.sh text_file.txt
Running the script against this directory gives the following results:
$ ./script.sh
Examining file: ./pretend.pdf
Examining file: ./example.pem
Found PEM file
Examining file: ./script.sh
Examining file: ./text_file.txt
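The header comparison above only matches one exact header. Other private-key headers such as "-----BEGIN RSA PRIVATE KEY-----" could be covered with a pattern match instead. A rough sketch, assuming bash; the function names are made up for illustration, and the pattern only looks at the first line:

```shell
#!/usr/bin/env bash
# is_pem_private_key FILE: succeed if the first line looks like a PEM
# private-key header (RSA, EC, ENCRYPTED, or plain PKCS#8).
is_pem_private_key() {
    local first_line
    # keep a first line that has no trailing newline; give up on empty files
    IFS= read -r first_line < "$1" || [[ -n $first_line ]] || return 1
    first_line=${first_line%$'\r'}   # tolerate CRLF line endings
    [[ $first_line == "-----BEGIN "*"PRIVATE KEY-----" ]]
}

# move_pem_keys SRC_DIR DEST_DIR: move every matching regular file.
move_pem_keys() {
    local src=$1 dest=$2 f
    for f in "$src"/*; do
        [[ -f $f ]] || continue
        if is_pem_private_key "$f"; then
            mv -- "$f" "$dest"/
        fi
    done
}
```

Something like move_pem_keys . /my/arbitrary/dir would then move the keys regardless of what the files are named. Certificates and public keys are deliberately left alone because their headers do not end in "PRIVATE KEY-----".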

Script which will move non-ASCII files

I need help. I have to write a script which will move all non-ASCII files from one directory to another. I have this code, but I don't know why it is not working.
#!/bin/bash
for file in "/home/osboxes/Parkhom"/*
do
if [ -eq "$( echo "$(file $file)" | grep -nP '[\x80-\xFF]' )" ];
then
if test -e "$1"; then
mv $file $1
fi
fi
done
exit 0

It's not clear which one you are after, but:
• To test if the variable $file contains a non-ASCII character, you can do:
if [[ $file == *[^[:ascii:]]* ]]; then
• To test if the file $file contains a non-ASCII character, you can do:
if grep -qP '[^[:ascii:]]' "$file"; then
So for example your code would look like:
for file in "/some/path"/*; do
  if grep -qP '[^[:ascii:]]' "$file"; then
    test -d "$1" && mv "$file" "$1"
  fi
done

The first problem is that your first if statement has an invalid test clause. The -eq operator of [ needs to take one argument before and one after; your before argument is gone or empty.
The second problem is that I think the echo is redundant.
The third problem is that the file command always has ASCII output but you're checking for binary output, which you'll never see.
Using file is pretty smart for this application, although there are two ways you can go with it. file reports a variety of things; the ones you're interested in are data and ASCII, but not every file that isn't identified as data is ASCII, and not every file that isn't identified as ASCII is data. You might be better off going with the original idea of using grep, unless you need to support Unicode files. Your grep is a bit strange to me, so I don't know what your environment is, but I might try this:
#!/bin/bash
for file in "/home/osboxes/Parkhom"/*
do
  # LC_ALL=C makes grep match raw bytes instead of decoding the file as UTF-8
  if LC_ALL=C grep -qP '[\x80-\xFF]' "$file"; then
    [ -e "$1" ] && mv "$file" "$1"
  fi
done
The -q option means be quiet, only return a return code, don't show the matches. (It might be -s in your grep.) The return code is tested directly by the if statement (no need to use [ or test). The && in the next line is just a quick way of saying if the left-hand side is true, then execute the right-hand side. You could also form this as an if statement if you find that clearer. [ is a synonym for test. Personally if $1 is a directory and doesn't change, I'd check it once at the beginning of the script instead of on each file, it would be faster.
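Since grep -P is a GNU extension and not every grep has it, here is a sketch of the same idea using only tr and wc, with the destination checked once up front as suggested. It assumes bash; has_non_ascii and move_non_ascii are hypothetical names:

```shell
#!/usr/bin/env bash
# has_non_ascii FILE: succeed if FILE contains any byte in the range 0x80-0xFF.
has_non_ascii() {
    local count
    # delete every ASCII byte (octal 000-177) and count what is left
    count=$(LC_ALL=C tr -d '\000-\177' < "$1" | wc -c)
    (( count > 0 ))
}

# move_non_ascii SRC_DIR DEST_DIR: move regular files that contain non-ASCII bytes.
move_non_ascii() {
    local src=$1 dest=$2 f
    [[ -d $dest ]] || { echo "no such directory: $dest" >&2; return 1; }
    for f in "$src"/*; do
        [[ -f $f ]] || continue
        if has_non_ascii "$f"; then
            mv -- "$f" "$dest"/
        fi
    done
}
```

You would call it as move_non_ascii /home/osboxes/Parkhom "$1". Note that this treats any file with a high byte as non-ASCII, so UTF-8 text and binary files are both moved.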

If you mean you want to know if something is not a plain text file then you can use the file command which returns information about the type of a file.
[[ ! $( file -b "$file" ) =~ (^| )text($| ) ]]
The -b simply tells it not to bother returning the filename.
The returned value will be something like:
ASCII text
HTML document text
POSIX shell script text executable
PNG image data, 21 x 34, 8-bit/color RGBA, non-interlaced
gzip compressed data, from Unix, last modified: Mon Oct 31 14:29:59 2016
The regular expression will check whether the returned file information includes the word "text" that is included for all plain text file types.
You can instead filter for specific file types like "ASCII text" if that is all you need.

How do I use `sed` to alter a variable in a bash script?

I'm trying to use enscript to print PDFs from Mutt, and hitting character encoding issues. One way around them seems to be to just use sed to replace the problem characters: sed -ir 's/[“”]/"/g' {input}
My test input file is this:
“very dirty”
we’re
I'm hoping to get "very dirty" and we're but instead I'm still getting
â\200\234very dirtyâ\200\235
weâ\200\231re
I found a nice little post on printing to PDFs from Mutt that I used as a starting point. I have a bash script that I point to from my .muttrc with set print_command="$HOME/.mutt/print.sh" -- the script currently reads about like this:
#!/bin/bash
input="$1" pdir="$HOME/Desktop" open_pdf=evince
# Straighten out curly quotes
sed -ir 's/[“”]/"/g' $input
sed -ir "s/[’]/'/g" $input
tmpfile="`mktemp $pdir/mutt_XXXXXXXX.pdf`"
enscript --font=Courier8 $input -2r --word-wrap --fancy-header=mutt -p - 2>/dev/null | ps2pdf - $tmpfile
$open_pdf $tmpfile >/dev/null 2>&1 &
sleep 1
rm $tmpfile
It does a fine job of creating a PDF (and works fine if you give it a file as an argument) but I can't figure out how to fix the curly quotes.
I've tried a bunch of variations on the sed line:
input=sed -r 's/[“”]/"/g' $input
$input=sed -ir "s/[’]/'/g" $input
Per the suggestion at Can I use sed to manipulate a variable in bash? I also tried input=$(sed -r 's/[“”]/"/g' <<< $input) and I get an error: "Syntax error: redirection unexpected"
But none manages to actually change $input -- what is the correct syntax to change $input with sed?
Note: I accepted an answer that resolved the question I asked, but as you can see from the comments there are a couple of other issues here. enscript is taking in a whole file as a variable, not just the text of the file. So trying to tweak the text inside the file is going to take a few extra steps. I'm still learning.

On Editing Variables In General
BashFAQ #21 is a comprehensive reference on performing search-and-replace operations in bash, including within variables, and is thus recommended reading. On this particular case:
Use the shell's native string manipulation instead; this is far higher performance than forking off a subshell, launching an external process inside it, and reading that external process's output. BashFAQ #100 covers this topic in detail, and is well worth reading.
Depending on your version of bash and configured locale, it might be possible to use a bracket expression (ie. [“”], as your original code did). However, the most portable thing is to treat “ and ” separately, which will work even without multi-byte character support available.
input='“hello ’cruel’ world”'
input=${input//'“'/'"'}
input=${input//'”'/'"'}
input=${input//'’'/"'"}
printf '%s\n' "$input"
...correctly outputs:
"hello 'cruel' world"
On Using sed
To provide a literal answer -- you almost had a working sed-based approach in your question.
input=$(sed -r 's/[“”]/"/g' <<<"$input")
...adds the missing syntactic double quotes around the parameter expansion of $input, ensuring that it's treated as a single token regardless of how it might be string-split or glob-expanded.
But All That May Not Help...
The below is mentioned because your test script is manipulating content passed on the command line; if that's not the case in production, you can probably disregard the below.
If your script is invoked as ./yourscript “hello * ’cruel’ * world”, then information about exactly what the user entered is lost before the script is started, and nothing you can do here will fix that.
This is because $1, in that scenario, will only contain “hello; ’cruel’ and world” are in their own argv locations, and the *s will have been replaced with lists of files in the current directory (each such file substituted as a separate argument) before the script was even started. Because the shell responsible for parsing the user's command line (which is not the same shell running your script!) did not recognize the quotes as valid at the time when it ran this parsing, by the time the script is running, there's nothing you can do to recover the original data.
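You can see the splitting for yourself with a tiny helper that prints each argument it receives. This is only an illustration, assuming bash; show_args is a made-up name, and the * is left out so the result doesn't depend on your current directory:

```shell
#!/usr/bin/env bash
# show_args: print each argument on its own line, wrapped in brackets.
show_args() {
    local arg
    for arg in "$@"; do
        printf '[%s]\n' "$arg"
    done
}

show_args “hello ’cruel’ world”    # curly quotes are ordinary characters: 3 arguments
show_args "hello 'cruel' world"    # real shell quotes: 1 argument
```

The first call prints [“hello], [’cruel’] and [world”] on separate lines; the second prints the single line [hello 'cruel' world].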

Abstract: The way to use sed to change a variable is explored, but what you really need is a way to use and edit a file. It is covered ahead.
Sed
The (two) sed line(s) could be solved with this (note that -i is not used, it is not a file but a value):
input='“very dirty”
we’re'
sed 's/[“”]/\"/g;s/’/'\''/g' <<<"$input"
But it should be faster (for small strings) to use the internals of the shell:
input='“very dirty”
we’re'
input=${input//[“”]/\"}
input=${input//[’]/\'}
printf '%s\n' "$input"
$1
But there is an underlying problem with your script, you are trying to clean an input received from the command line. You are using $1 as the source of the string. Once somebody writes:
./script “very dirty”
we’re
That input is lost. It is broken into shell's tokens and "$1" will be “very only.
But I do not believe that is what you really have.
file
However, you are also saying that the input comes from a file. If that is the case, then read it in with:
input="$(<infile)" # not $1
sed 's/[“”]/\"/g;s/’/'\''/g' <<<"$input"
Or, if you don't mind editing (changing) the file, do this instead:
sed -i 's/[“”]/\"/g;s/’/'\''/g' infile
input="$(<infile)"
Or, if you are clear and certain that what is being given to the script is a filename, like:
./script infile
You can use:
infile="$1"
sed -i 's/[“”]/\"/g;s/’/'\''/g' "$infile"
input="$(<"$infile")"
Other comments:
Quote your variables.
Do not use the very old `…` syntax, use $(…) instead.
Do not use variables in UPPER case, those are reserved for environment variables.
And (unless you actually meant sh) use a shebang (first line) that targets bash.
The command enscript most definitely requires a file, not a variable.
Maybe you should use evince to open the PS file directly; there is no need for the extra step of making a PDF unless you know you really need it.
I believe it is better to use a file to store the output of enscript and ps2pdf.
Do not hide the errors printed by the commands until everything is working as desired, then, just call the script as:
./script infile 2>/dev/null
Or as required to make it less verbose.
Final script.
If you call the script with the name of the file that enscript is going to use, something like:
./script infile
Then, the whole script will look like this (runs in both bash and sh):
#!/usr/bin/env bash
Usage(){ echo "$0: This script requires a source file"; exit 1; }
[ $# -lt 1 ] && Usage
[ ! -e "$1" ] && Usage
infile="$1"
pdir="$HOME/Desktop"
open_pdf=evince
# Straighten out curly quotes
sed -i 's/[“”]/\"/g;s/’/'\''/g' "$infile"
tmpfile="$(mktemp "$pdir"/mutt_XXXXXXXX.pdf)"
outfile="${tmpfile%.*}.ps"
enscript --font=Courier10 "$infile" -2r \
--word-wrap --fancy-header=mutt -p "$outfile"
ps2pdf "$outfile" "$tmpfile"
"$open_pdf" "$tmpfile" >/dev/null 2>&1 &
sleep 5
rm "$tmpfile" "$outfile"

Rename every unicode file of a linux directory in ascii

I am trying to rename all files with Unicode names to ASCII names.
I wanted to do something like this:
for file in `ls | egrep -v ^[a-z0-9\._-]+$`; do mv "$file" $(echo "$file" | slugify); done
But it doesn't work yet.
First, the regexp ^[a-z0-9\._-]+$ doesn't seem to be enough.
Second, slugify also transforms the file extension, so I have to cut the extension off first and put it back afterwards.
Any idea how to do that?

First things first: don't parse the output of ls. That is, in general, a bad idea, especially if you're expecting files that have any sort of strange characters in their names.
Assuming slugify does what you want with filenames in general, try:
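for file in * ; do
  if [ -f "$file" ] ; then
    ext=${file##*.}
    name=${file%.*}
    new_name=$(echo "$name"|slugify)
    # quote the right-hand side so it is compared literally, not as a pattern
    if [[ $name != "$new_name" ]] ; then
      # dry run: remove "echo" to actually rename
      echo mv -v "$name.$ext" "$new_name.$ext"
    fi
  fi
done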
Warning: this will fail if you have files without an extension (it'll double-up the filename). See this other answer by Doctor J if you need to handle that.
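One possible way to handle the no-extension case is to split the name only when there is a dot that is not the first character. A sketch, assuming bash; split_name is a hypothetical helper, and slugify is still whatever tool you already use:

```shell
#!/usr/bin/env bash
# split_name FILE: print the base name on the first line and the extension
# (including its leading dot, or empty if there is none) on the second.
split_name() {
    local file=$1 base ext
    if [[ $file == ?*.* ]]; then     # a dot exists and is not the first character
        base=${file%.*}
        ext=.${file##*.}
    else                             # README, .bashrc, ...
        base=$file
        ext=
    fi
    printf '%s\n%s\n' "$base" "$ext"
}
```

With this, report.txt splits into report and .txt, archive.tar.gz into archive.tar and .gz, while README and .bashrc come back whole with an empty extension. In the loop above you would slugify only the base and append the extension unchanged.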

How can I be sure of the file encoding?

I have a PHP file that I created with VIM, but I'm not sure which is its encoding.
When I use the terminal and check the encoding with the command file -bi foo (my operating system is Ubuntu 11.04), it gives me the following result:
text/html; charset=us-ascii
But, when I open the file with gedit it says its encoding is UTF-8.
Which one is correct? I want the file to be encoded in UTF-8.
My guess is that there's no BOM in the file and that the command file -bi reads the file and doesn't find any UTF-8 characters, so it assumes that it's ascii, but in reality it's encoded in UTF-8.

$ file --mime my.txt
my.txt: text/plain; charset=iso-8859-1

Well, first of all, note that ASCII is a subset of UTF-8, so if your file contains only ASCII characters, it's correct to say that it's encoded in ASCII and it's correct to say that it's encoded in UTF-8.
That being said, file typically only examines a short segment at the beginning of the file to determine its type, so it might be declaring it us-ascii if there are non-ASCII characters but they are beyond the initial segment of the file. On the other hand, gedit might say that the file is UTF-8 even if it's ASCII because UTF-8 is gedit's preferred character encoding and it intends to save the file with UTF-8 if you were to add any non-ASCII characters during your edit session. Again, if that's what gedit is saying, it wouldn't be wrong.
Now to your question:
Run this command:
tr -d \\000-\\177 < your-file | wc -c
If the output says "0", then the file contains only ASCII characters. It's in ASCII (and it's also valid UTF-8). End of story.
Run this command
iconv -f utf-8 -t ucs-4 < your-file >/dev/null
If you get an error, the file does not contain valid UTF-8 (or at least, some part of it is corrupted).
If you get no error, the file is extremely likely to be UTF-8. That's because UTF-8 has properties that make it very hard to mistake typical text in any other commonly used character encoding for valid UTF-8.

(on Linux)
$ chardet <filename>
it also delivers the confidence level [0-1] of the output.

Based on @Celada's answer and @Arthur Zennig's, I have created this simple script:
#!/bin/bash
if [ "$#" -lt 1 ]
then
  echo "Usage: utf8-check filename"
  exit 1
fi
chardet "$1"
countchars="$(tr -d \\000-\\177 < "$1" | wc -c)"
if [ $countchars -eq 0 ]
then
  echo "Ascii"
  exit 0
fi
# Test iconv's exit status directly; a { ...; } || { ...; } group would report
# the status of the final echo and always print "UTF-8".
if iconv -f utf-8 -t ucs-4 < "$1" >/dev/null 2>&1
then
  echo "UTF-8"
else
  echo "not UTF-8 or corrupted"
fi
