How to check whether a file is valid UTF-8?

I'm processing some data files that are supposed to be valid UTF-8 but aren't, which causes the parser (not under my control) to fail. I'd like to add a stage of pre-validating the data for UTF-8 well-formedness, but I've not yet found a utility to help do this.
There's a web service at W3C which appears to be dead, and I've found a Windows-only validation tool that reports invalid UTF-8 files but doesn't report which lines/characters to fix.
I'd be happy with either a tool I can drop in and use (ideally cross-platform), or a ruby/perl script I can make part of my data loading process.

You can use GNU iconv:
$ iconv -f UTF-8 your_file -o /dev/null; echo $?
Or with older versions of iconv, such as on macOS:
$ iconv -f UTF-8 your_file > /dev/null; echo $?
The command will return 0 if the file could be converted successfully, and 1 if not. Additionally, it will print out the byte offset where the invalid byte sequence occurred.
Edit: The output encoding doesn't have to be specified; it is assumed to be UTF-8.
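If you want to wire this into the pre-validation stage from the question, here is a minimal sketch that checks a batch of files before loading (the data/*.txt glob and the loop body are made up; adapt to your pipeline):
#!/bin/bash
# reject any data file that is not well-formed UTF-8 before it reaches the parser
for f in data/*.txt; do
    if iconv -f UTF-8 "$f" > /dev/null; then
        : # "$f" is well-formed UTF-8; hand it to the loader here
    else
        echo "skipping $f: invalid UTF-8 (see iconv's message above for the offset)" >&2
    fi
done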

You can use isutf8 from the moreutils collection.
$ apt-get install moreutils
$ isutf8 your_file
In a shell script, use the --quiet switch and check the exit status, which is zero for files that are valid UTF-8.
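For example, a small sketch of how the exit status might gate a loading step (the file name is illustrative):
if isutf8 --quiet your_file; then
    echo "valid UTF-8, safe to load"
else
    isutf8 your_file    # without --quiet it reports where the first invalid sequence is
fi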

Use Python and its string encode/decode methods (the session below is Python 2; in Python 3, call .decode("utf-8") on a bytes object in the same way).
>>> a="γεια"
>>> a
'\xce\xb3\xce\xb5\xce\xb9\xce\xb1'
>>> b='\xce\xb3\xce\xb5\xce\xb9\xff\xb1' # note second-to-last char changed
>>> print b.decode("utf_8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.5/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 6: unexpected code byte
The exception thrown has the info requested in its .args property.
>>> try: print b.decode("utf_8")
... except UnicodeDecodeError, exc: pass
...
>>> exc
UnicodeDecodeError('utf8', '\xce\xb3\xce\xb5\xce\xb9\xff\xb1', 6, 7, 'unexpected code byte')
>>> exc.args
('utf8', '\xce\xb3\xce\xb5\xce\xb9\xff\xb1', 6, 7, 'unexpected code byte')

How about the GNU iconv library? Using the iconv() function: "An invalid multibyte sequence is encountered in the input. In this case it sets errno to EILSEQ and returns (size_t)(-1). *inbuf is left pointing to the beginning of the invalid multibyte sequence."
EDIT: oh - I missed the part where you want a scripting language. But for command-line work, the iconv utility should validate for you too.

Here is a bash script to check whether a file is valid UTF-8:
#!/bin/bash
inputFile="./testFile.txt"

iconv -f UTF-8 "$inputFile" -o /dev/null
if [[ $? -eq 0 ]]; then
    echo "Valid UTF-8 file."
else
    echo "Invalid UTF-8 file!"
fi
Description:
--from-code, -f encoding (convert characters from encoding)
--to-code, -t encoding (convert characters to encoding; it doesn't have to be specified, as it is assumed to be UTF-8)
--output, -o file (specify the output file, instead of stdout)

You can also use recode, which will exit with an error if it tries to decode UTF-8 and encounters invalid characters.
if recode utf8/..UCS < "$FILE" >/dev/null 2>&1; then
    echo "Valid utf8 : $FILE"
else
    echo "NOT valid utf8: $FILE"
fi
This tries to recode to the Universal Character Set (UCS), which is always possible from valid UTF-8.

Related

bash 'fold' screws up encoding in emacs

I'm reading lines from 'somefile' and writing them to the file 'sample.org'.
echo "$line" 1>>sample.org gives the correct result, which is 'Субъективная оценка (от 1 до 5): 4 - отличный, понятный и богатый вкусом ..' (Russian letters).
echo "$line" | fold -w 160 1>>sample.org gives output that is technically correct if you copy-paste it anywhere outside Emacs. But still: why does using fold result in Emacs displaying the 'sample.org' buffer as 'RAW-TEXT' instead of 'UTF-8'?
To reproduce it, create two files in the same directory: test.sh, which contains
cat 'test.org' |
while read -r line; do
    # echo "$line" 1>'newfile.org' # works fine
    # the line below writes those weird chars to the output file
    echo "$line" | fold -w 160 1>'newfile.org'
done
and test.org, which contains just 'Среднеферментированный среднепрожаренный улун полусферической скрутки. Содержание ГАМК 200мг/100г.'
Run the script with bash test.sh and hopefully you will see the problem in the output file newfile.org.
I can't reproduce this on macOS, but in an Ubuntu Docker image it happens because fold inserts a newline in the middle of a UTF-8 multibyte sequence.
root@ef177a152b15:/# cat test.org
Среднеферментированный среднепрожаренный улун полусферической скрутки. Содержание ГАМК 200мг/100г.
root@ef177a152b15:/# fold -w 160 test.org >newfile.org
root@ef177a152b15:/# cat newfile.org
Среднеферментированный среднепрожаренный улун полусферической скрутки. Содержание Г?
?МК 200мг/100г.
root@ef177a152b15:/# cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04.2 LTS"
(Perhaps also notice that your demo script can be reduced to a one-liner.)
I would have thought that GNU fold is locale-aware and that you merely have to configure a UTF-8 locale for the support to be active, but that changes nothing for me:
root@ef177a152b15:/# locale -a
C
C.UTF-8
POSIX
root@ef177a152b15:/# LC_ALL=C.UTF-8 fold -w 160 test.org
Среднеферментированный среднепрожаренный улун полусферической скрутки. Содержание Г?
?МК 200мг/100г.
Under these circumstances, the best I can offer is a simple replacement for fold.
#!/usr/bin/python3
from sys import argv

maxlen = int(argv.pop(1))
for file in argv[1:]:
    with open(file) as lines:
        for line in lines:
            while len(line) > maxlen:
                print(line[0:maxlen])
                line = line[maxlen:]
            print(line, end='')
For simplicity, this doesn't have any option processing; just pass in the maximum length as the first argument.
(Python 3 uses UTF-8 throughout on any sane platform. Unfortunately, that excludes Windows; but I am restating the obvious.)
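Invocation might look like this, assuming the script above is saved as myfold.py (the name is made up): the first argument is the maximum length, the rest are input files.
python3 myfold.py 160 test.org > newfile.org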
Bash, of course, is entirely innocent here; the shell does not control external utilities like fold. (But not much help, either; echo "${tekst:48:64}" produces similar mojibake.)
I'm not sure where that image comes from; however, fold, and coreutils in general, as well as a huge number of other common CLI utilities, can only be safely used with inputs consisting of symbols from the POSIX Portable Character Set, not with multibyte UTF-8, regardless of what bullshit websites such as utf8everywhere.org state. fold suffers from the common problem: it assumes that each symbol occupies just a single char, which corrupts multibyte UTF-8 input when it splits lines.

Script which will move non-ASCII files

I need help. I should write a script which will move all non-ASCII files from one directory to another. I got this code, but I don't know why it is not working.
#!/bin/bash
for file in "/home/osboxes/Parkhom"/*
do
    if [ -eq "$( echo "$(file $file)" | grep -nP '[\x80-\xFF]' )" ];
    then
        if test -e "$1"; then
            mv $file $1
        fi
    fi
done
exit 0
It's not clear which one you are after, but:
• To test if the variable $file contains a non-ASCII character, you can do:
if [[ $file == *[^[:ascii:]]* ]]; then
• To test if the file $file contains a non-ASCII character, you can do:
if grep -qP '[^[:ascii:]]' "$file"; then
So for example your code would look like:
for file in "/some/path"/*; do
    if grep -qP '[^[:ascii:]]' "$file"; then
        test -d "$1" && mv "$file" "$1"
    fi
done
The first problem is that your first if statement has an invalid test clause. The -eq operator of [ needs to take one argument before and one after; your before argument is gone or empty.
The second problem is that I think the echo is redundant.
The third problem is that the file command always has ASCII output but you're checking for binary output, which you'll never see.
Using file is pretty smart for this application, although there are two ways you can go on it: file says a variety of things, and what you're interested in are "data" and "ASCII"; but not all files that don't identify as data are ASCII, and not all files that don't identify as ASCII are data. You might be better off going with the original idea of using grep, unless you need to support Unicode files. Your grep looks a bit strange to me, so I don't know what your environment is, but I might try this:
#!/bin/bash
for file in "/home/osboxes/Parkhom"/*
do
    if grep -qP '[\x80-\xFF]' "$file"; then
        [ -e "$1" ] && mv "$file" "$1"
    fi
done
The -q option means be quiet, only return an exit code and don't show the matches. (It might be -s in your grep.) The return code is tested directly by the if statement, with no need for [ or test. The && in the next line is just a quick way of saying: if the left-hand side succeeds, then execute the right-hand side. You could also write this as an if statement if you find that clearer. [ is a synonym for test. Personally, if $1 is a directory and doesn't change, I'd check it once at the beginning of the script instead of on each file; it would be faster, as sketched below.
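A sketch with the destination check hoisted out of the loop (paths as in the question; adapt as needed):
#!/bin/bash
dest=$1
# check the destination once, up front, instead of per file
[ -d "$dest" ] || { echo "destination '$dest' is not a directory" >&2; exit 1; }
for file in "/home/osboxes/Parkhom"/*
do
    if grep -qP '[\x80-\xFF]' "$file"; then
        mv "$file" "$dest"
    fi
done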
If you mean you want to know whether something is not a plain text file, then you can use the file command, which returns information about the type of a file.
[[ ! $( file -b "$file" ) =~ (^| )text($| ) ]]
The -b simply tells it not to bother returning the filename.
The returned value will be something like:
ASCII text
HTML document text
POSIX shell script text executable
PNG image data, 21 x 34, 8-bit/color RGBA, non-interlaced
gzip compressed data, from Unix, last modified: Mon Oct 31 14:29:59 2016
The regular expression will check whether the returned file information includes the word "text" that is included for all plain text file types.
You can instead filter for specific file types like "ASCII text" if that is all you need.
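For instance, a sketch of the same move loop driven by file(1) instead of grep (hypothetical paths, reusing the word-boundary test from above):
for file in "/some/path"/*; do
    # move anything whose file(1) description does not contain the word "text"
    if [[ ! $( file -b "$file" ) =~ (^| )text($| ) ]]; then
        test -d "$1" && mv "$file" "$1"
    fi
done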

How can I be sure of the file encoding?

I have a PHP file that I created with Vim, but I'm not sure what its encoding is.
When I use the terminal and check the encoding with the command file -bi foo (my operating system is Ubuntu 11.04), it gives me the following result:
text/html; charset=us-ascii
But, when I open the file with gedit it says its encoding is UTF-8.
Which one is correct? I want the file to be encoded in UTF-8.
My guess is that there's no BOM in the file and that the command file -bi reads the file and doesn't find any non-ASCII characters, so it assumes that it's ASCII, but in reality it's encoded in UTF-8.
$ file --mime my.txt
my.txt: text/plain; charset=iso-8859-1
Well, first of all, note that ASCII is a subset of UTF-8, so if your file contains only ASCII characters, it's correct to say that it's encoded in ASCII and it's correct to say that it's encoded in UTF-8.
That being said, file typically only examines a short segment at the beginning of the file to determine its type, so it might be declaring it us-ascii if there are non-ASCII characters but they are beyond the initial segment of the file. On the other hand, gedit might say that the file is UTF-8 even if it's ASCII because UTF-8 is gedit's preferred character encoding and it intends to save the file with UTF-8 if you were to add any non-ASCII characters during your edit session. Again, if that's what gedit is saying, it wouldn't be wrong.
Now to your question:
Run this command:
tr -d \\000-\\177 < your-file | wc -c
If the output is "0", then the file contains only ASCII characters. It's in ASCII (and it's also valid UTF-8). End of story.
Run this command:
iconv -f utf-8 -t ucs-4 < your-file >/dev/null
If you get an error, the file does not contain valid UTF-8 (or at least, some part of it is corrupted).
If you get no error, the file is extremely likely to be UTF-8. That's because UTF-8 has properties that make it very hard to mistake typical text in any other commonly used character encoding for valid UTF-8.
On Linux:
$ chardet <filename>
It also reports a confidence level in [0-1] for its guess.
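On current installs the entry point is often spelled chardetect; the output looks something like this (the file name and confidence value are illustrative):
$ chardetect my_file.txt
my_file.txt: utf-8 with confidence 0.99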
Based on Celada's and Arthur Zennig's answers above, I have created this simple script:
#!/bin/bash
if [ "$#" -lt 1 ]
then
    echo "Usage: utf8-check filename"
    exit 1
fi

chardet "$1"
countchars="$(tr -d \\000-\\177 < "$1" | wc -c)"
if [ "$countchars" -eq 0 ]
then
    echo "ASCII"
    exit 0
fi

if iconv -f utf-8 -t ucs-4 < "$1" > /dev/null 2>&1
then
    echo "UTF-8"
else
    echo "not UTF-8 or corrupted"
fi

bash script error possibly related to length of filename (actually I don't know what's wrong) [duplicate]

This question already has answers here: Are shell scripts sensitive to encoding and line endings?
Here's an example of my problematic code:
#!/bin/bash
fileList='fileList.txt'
#IFS=$'\n'
while read filename
do
    echo listing "$filename"
    ls -ligG "$filename"
done < "$fileList"
echo "done."
#unset IFS
exit 0
The output is:
listing /some/long/path/README.TXT
ls: cannot access /some/long/pa
: No such file or directoryDME.TXT
Notice that ls cuts off the path. Also notice that the end of the path/filename is appended to the error message (after "No such file or directory").
I just tested it with a path exactly this long and it still gives the error:
/this/is/an/example/of/shorter/name.txt
Anyone know what's going on? I've been messing with this for hours already :-/
In response to torek's answer, here is more info:
First, here's the modified script based on torek's suggestions:
#!/bin/bash
fileList=/settings/Scripts/fileList.txt
while IFS=$'\n' read -r filename
do
    printf 'listing %q\n' "$filename"
    ls -ligG $filename
done < "$fileList"
echo "done."
exit 0
Here's the output of that:
# ./test.sh
listing $'/example/pathname/myfile.txt\r'
: No such file or directorypathname/myfile.txt
done.
Notice there is some craziness going on still.
Here's the file. It does exist.
ls -ligG /example/pathname/myfile.txt
106828 -rwxrwx--- 1 34 Mar 28 00:55 /example/pathname/myfile.txt
Based on the unusual behavior, I'm going to say the file list has CRLF line terminators. Your file names actually have an invisible carriage return appended to them. In echo, this doesn't show up, since it just jumps to the first column and then prints a newline. However, ls tries to access the file name including the hidden carriage return, and in its error message the carriage return causes the message to partially overwrite your path.
To trim these chars away, you can use tr:
tr -d '\r' < fileList.txt > fileListTrimmed.txt
and try using that file instead.
That embedded newline is a clue: the error message should read ls: cannot access /some/long/path/README.TXT: No such file or directory (no newline after the "a" in "path"). Even if there were some mysterious truncation happening, the colon should come right after the "a" in "path". It doesn't, so the string is not what it seems to be.
Try:
printf 'listing %q\n' "$filename"
for printing the file name before invoking ls. Bash's built-in printf has a %q format that will quote funny characters.
I'm not sure what the intent of the commented-out IFS-setting is. Perhaps you want to prevent read from splitting at whitespace? You can put the IFS= in front of the read, and you might want to use read -r as well:
while IFS=$'\n' read -r filename; do ...; done < "$fileList"
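Putting both answers together, a sketch of the loop that tolerates CRLF line endings in the list file (no separate tr pass needed):
#!/bin/bash
fileList='fileList.txt'
while IFS= read -r filename; do
    filename=${filename%$'\r'}   # strip a trailing carriage return, if present
    ls -ligG "$filename"
done < "$fileList"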

How do I determine file encoding in OS X?

I'm trying to enter some UTF-8 characters into a LaTeX file in TextMate (which says its default encoding is UTF-8), but LaTeX doesn't seem to understand them.
Running cat my_file.tex shows the characters properly in Terminal. Running ls -al shows something I've never seen before: an "@" by the file listing:
-rw-r--r--@ 1 me users 2021 Feb 11 18:05 my_file.tex
(And, yes, I'm using \usepackage[utf8]{inputenc} in the LaTeX.)
I've found iconv, but that doesn't seem to be able to tell me what the encoding is -- it'll only convert once I figure it out.
Using the -I (that's a capital i) option on the file command seems to show the file encoding.
file -I {filename}
In Mac OS X the command file -I (capital i) will give you the proper character set so long as the file you are testing contains characters outside of the basic ASCII range.
For instance, if you go into Terminal and use vi to create a file (e.g. vi test.txt), then insert some characters including an accented character (try ALT-e followed by e), and then save the file.
Then type file -I test.txt and you should get a result like this:
test.txt: text/plain; charset=utf-8
The @ means that the file has extended file attributes associated with it. You can query them using the getxattr() function.
There's no definite way to detect the encoding of a file. Read this answer; it explains why.
There's a command line tool, enca, that attempts to guess the encoding. You might want to check it out.
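Basic usage looks like this (the file name is made up; -L none asks enca to guess without a language hint, as its documentation suggests when the language is unknown):
$ enca -L none my_file.tex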
vim -c 'execute "silent !echo " . &fileencoding | q' {filename}
aliased somewhere in my bash configuration as
alias vic="vim -c 'execute \"silent \!echo \" . &fileencoding | q'"
so I just type
vic {filename}
On my vanilla OSX Yosemite, it yields more precise results than "file -I":
$ file -I pdfs/udocument0.pdf
pdfs/udocument0.pdf: application/pdf; charset=binary
$ vic pdfs/udocument0.pdf
latin1
$
$ file -I pdfs/t0.pdf
pdfs/t0.pdf: application/pdf; charset=us-ascii
$ vic pdfs/t0.pdf
utf-8
You can also convert from one encoding to another using the following command:
iconv -f original_charset -t new_charset originalfile > newfile
e.g.
iconv -f utf-16le -t utf-8 file1.txt > file2.txt
Just use:
file -I <filename>
That's it.
Using the file command with the --mime-encoding option (e.g. file --mime-encoding some_file.txt) instead of the -I option works on OS X and has the added benefit of omitting the mime type, "text/plain", which you probably don't care about.
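For example (the file name is illustrative):
$ file --mime-encoding my_file.tex
my_file.tex: utf-8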
Classic 8-bit LaTeX is very restricted in which UTF8 characters it can use; it's highly dependent on the encoding of the font you're using and which glyphs that font has available.
Since you don't give a specific example, it's hard to know exactly where the problem is — whether you're attempting to use a glyph that your font doesn't have or whether you're not using the correct font encoding in the first place.
Here's a minimal example showing how a few UTF8 characters can be used in a LaTeX document:
\documentclass{article}
\usepackage[T1]{fontenc}
\usepackage{lmodern}
\usepackage[utf8]{inputenc}
\begin{document}
‘Héllø—thêrè.’
\end{document}
You may have more luck with the [utf8x] encoding, but be slightly warned that it's no longer supported and has some idiosyncrasies compared with [utf8] (as far as I recall; it's been a while since I've looked at it). But if it does the trick, that's all that matters for you.
The @ sign means the file has extended attributes. xattr file shows what attributes it has; xattr -l file shows the attribute values too (which can sometimes be large; try e.g. xattr /System/Library/Fonts/HelveLTMM to see an old-style font that exists in the resource fork).
Typing file myfile.tex in a terminal can sometimes tell you the encoding and type of file using a series of algorithms and magic numbers. It's fairly useful but don't rely on it providing concrete or reliable information.
A Localizable.strings file (found in localised Mac OS X applications) is typically reported to be a UTF-16 C source file.
Synalyze It! lets you compare text or bytes in all the encodings the ICU library offers. Using that feature, you usually see immediately which code page makes sense for your data.
You can try loading the file into a Firefox window, then go to View - Character Encoding. There should be a check mark next to the file's encoding type.
I implemented the bash script below, it works for me.
It first tries to iconv from the encoding returned by file --mime-encoding to utf-8.
If that fails, it goes through all encodings and shows the diff between the original and re-encoded file. It skips over encodings that produce a large diff output ("large" as defined by the MAX_DIFF_LINES variable or the second input argument), since those are most likely the wrong encoding.
If "bad things" happen as a result of using this script, don't blame me. There's a rm -f in there, so there be monsters. I tried to prevent adverse effects by using it on files with a random suffix, but I'm not making any promises.
Tested on Darwin 15.6.0.
#!/bin/bash
if [[ $# -lt 1 ]]
then
    echo "ERROR: need one input argument: file of which the encoding is to be detected."
    exit 3
fi
if [ ! -e "$1" ]
then
    echo "ERROR: cannot find file '$1'"
    exit 3
fi
if [[ $# -ge 2 ]]
then
    MAX_DIFF_LINES=$2
else
    MAX_DIFF_LINES=10
fi

# try the easy way
ENCOD=$(file --mime-encoding "$1" | awk '{print $2}')
# check if this encoding is valid
iconv -f "$ENCOD" -t utf-8 "$1" &> /dev/null
if [ $? -eq 0 ]
then
    echo "$ENCOD"
    exit 0
fi

# hard way: the user has to visually check the difference between the original and re-encoded files
for i in $(iconv -l | awk '{print $1}')
do
    SINK="$1.$i.$RANDOM"
    iconv -f "$i" -t utf-8 "$1" 2> /dev/null > "$SINK"
    if [ $? -eq 0 ]
    then
        DIFF=$(diff "$1" "$SINK")
        if [ ! -z "$DIFF" ] && [ $(echo "$DIFF" | wc -l) -le $MAX_DIFF_LINES ]
        then
            echo "===== $i ====="
            echo "$DIFF"
            echo "Does that make sense [N/y]"
            read ANSWER
            if [ "$ANSWER" == "y" ] || [ "$ANSWER" == "Y" ]
            then
                echo "$i"
                exit 0
            fi
        fi
    fi
    # clean up the re-encoded file
    rm -f "$SINK"
done

echo "None of the encodings worked. You're stuck."
exit 3
Which LaTeX are you using? When I was using teTeX, I had to manually download the unicode package and add this to my .tex files:
% UTF-8 stuff
\usepackage[notipa]{ucs}
\usepackage[utf8x]{inputenc}
\usepackage[T1]{fontenc}
Now I've switched over to XeTeX from the TeXlive 2008 package; it is even simpler:
% UTF-8 stuff
\usepackage{fontspec}
\usepackage{xunicode}
As for detection of a file's encoding, you could play with file(1) (but it is rather limited) but like someone else said, it is difficult.
A brute-force way to check the encoding might just be to examine the file in a hex editor or similar (or to write a program to check). Look at the binary data in the file: the UTF-8 format is fairly easy to recognize. All ASCII characters are single bytes with values below 128 (0x80), and multibyte sequences follow the pattern shown in the Wikipedia article.
If you can find a simpler way to get a program to verify the encoding for you, that's obviously a shortcut, but if all else fails, this would do the trick.
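If you do go the manual route, od can serve as a quick hex viewer from the shell; a minimal sketch (the offset, length, and file name are examples):
iconv -f UTF-8 my_file.tex > /dev/null   # reports the byte offset of the first invalid sequence
od -A d -t x1 -j 100 -N 32 my_file.tex   # dump 32 bytes as hex starting at byte 100, with decimal addresses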
