How to create a file with unknown character encoding - utf-8

I would like to test some file character encoding detection functionality, where I input files of type UTF-8, windows-1252, ISO-8859-1, etc.
I also want to input files with unknown character encoding so that the user can be alerted.
I haven't found a good way to create files with an unknown or undetectable character encoding.

head -c1024 /dev/random > /tmp/badencoding
This is almost certainly what you want in practice (1 kB of random data), but there isn't really a good definition of "undetectable character encoding." This random file is legal in any 8-bit encoding that assigns a character to every byte value, such as ISO-8859-1. The fact that it was certainly never meant to be ISO-8859-1 text is just a heuristic. So all you're going to wind up doing is testing that your algorithm works in ways that your users probably want it to; there is no ultimate "correct" here without reading the mind of the person who created the file.
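If the test suite cannot shell out to head, the same fixture can be produced from Python. This is a minimal sketch; the function name write_random_file and its default size are illustrative and not part of the answer above.

```python
import os


def write_random_file(path, size=1024):
    """Write `size` bytes from the OS random source to `path` and return them."""
    data = os.urandom(size)
    with open(path, "wb") as f:
        f.write(data)
    return data
```

Whatever bytes come out, data.decode("iso-8859-1") will succeed, which is why such a file can only be called "undetectable" by heuristic.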

An empty text file has an undetectable character encoding (except if it has a Unicode BOM).
But basically, you either have to require the user to tell you which character encoding their file uses, or tell them which one to use (or both, if you specify a default but allow it to be overridden, which is what many compilers do).
You can then test the contents for validity against the agreed character encoding. This will catch some errors but note that many character encodings allow any sequence of bytes with any value so any content is always valid (even if the character encoding is not what was used to write the file).
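A validity check against the agreed encoding might look like the sketch below. The helper name is_valid_in is made up for illustration; it relies only on Python's built-in codecs.

```python
def is_valid_in(data, encoding):
    """Return True if the bytes `data` decode without error in `encoding`."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


sample = b"\xff\xfe\x80"
print(is_valid_in(sample, "utf-8"))       # UTF-8 rejects these bytes
print(is_valid_in(sample, "iso-8859-1"))  # ISO-8859-1 accepts any byte value
```

The second call illustrates the caveat: for an encoding that accepts every byte value, passing the check says nothing about whether that encoding was actually used.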
You can then test for consistency with expected values, such as some syntax or allowable character or words, to catch more errors (but you wouldn't necessarily be able to say the character encoding didn't match; it could be just the content is incorrect).
To create files with different character encodings, you could write a program or use a 3rd-party program such as iconv or PowerShell.
If you want an unknown character encoding, just generate a random integer map, convert a file, discard the map, and then not even you will know it.
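One way to read "random integer map" is a random permutation of the 256 byte values. The sketch below takes that interpretation; random_byte_map and scramble are hypothetical names, and the optional seed exists only to make tests repeatable.

```python
import random


def random_byte_map(seed=None):
    """Return a 256-byte translation table that is a random permutation."""
    values = list(range(256))
    random.Random(seed).shuffle(values)
    return bytes(values)


def scramble(data, table):
    """Map every byte of `data` through the translation table."""
    return data.translate(table)


scrambled = scramble("some plain text".encode("utf-8"), random_byte_map())
```

Once the table is thrown away, the output is text in an encoding that nobody, including its author, can name.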
Ultimately, text files are too technical for users to deal with. Give them some other option such as an open document or spreadsheet format such as .odt, .docx, .ods, or .xlsx. These are very easy to read by programs.

Related

find reason for automatic encoding detection (UTF-8 vs Windows-1252)

I have a CSV with content that is UTF-8 encoded. However, various applications and systems erroneously detect the encoding of the CSV as Windows-1252, which breaks all the special characters in the file (e.g. umlauts).
I can see that Sublime Text (on Windows) for example also automatically detects the wrong Windows-1252 encoding, when opening the file for the first time, showing garbled text where special characters are supposed to be.
When I choose Reopen with Encoding » UTF-8, everything will look fine, as expected.
Now, to find the source of the error, I thought it might help to figure out why these applications are not automatically detecting the correct encoding in the first place. Maybe there is a stray character somewhere with the wrong encoding, for example.
The CSV in question is actually an automatically generated product export of a Magento 2 installation. Recently the character encodings broke and I am currently trying to figure out what happened - hence my investigation on why this export is detected as Windows-1252.
Is there any reliable way of figuring out why the automatic detection in applications like Sublime Text assumes the wrong character encoding?
This is what I did in the end to find out why the file was not detected as UTF-8, i.e. to find the characters that were not encoded in UTF-8. Since PHP is more readily available to me, I decided to simply use the following script, to force convert anything that is not UTF-8 to UTF-8, using the very handy neitanod/forceutf8 library.
$before = file_get_contents('export.csv');
$after = \ForceUTF8\Encoding::toUTF8($before);
file_put_contents('export.fixed.csv', $after);
Then I used a file comparison tool like Beyond Compare to compare the two resulting CSVs, in order to see more easily which characters were not originally encoded in UTF-8.
This in turn showed me that only one particular column of the export was affected. Upon further investigation I found out that the contents of that column were processed in PHP with the following preg_replace:
$value = preg_replace('/([^\pL0-9 -])+/', '', $value);
Using \p in the regular expression without the u flag had an unexpected side effect: the special characters in that column no longer came out as valid UTF-8. A quick solution to this is to use the u flag on the regex (see regex pattern modifiers reference). This makes PCRE treat both the pattern and the subject string as UTF-8, so multibyte characters are matched as whole characters. See also this answer.
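The offending bytes can also be located directly, without converting and diffing. The following is a sketch in Python rather than PHP; find_invalid_utf8 is a made-up helper and the file name in the usage comment is the one from the question.

```python
def find_invalid_utf8(data):
    """Return (offset, bad_bytes) for every span that is not valid UTF-8."""
    problems = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            break  # the rest of the data is clean
        except UnicodeDecodeError as err:
            start = pos + err.start
            end = pos + err.end
            problems.append((start, data[start:end]))
            pos = end  # resume scanning after the bad span
    return problems


# Usage (assumes export.csv exists in the working directory):
# with open("export.csv", "rb") as f:
#     for offset, bad in find_invalid_utf8(f.read()):
#         print(offset, bad)
```

Each reported offset points at bytes that prevent a detector from accepting the file as UTF-8, which narrows the search to the rows and columns that contain them.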

How to find file encoding type or convert any encoding type to UTF-8 in shell?

I get text file of random encoding format, ucs-2le, ansi, utf-8, ucs-2be etc. I have to convert these files to UTF-8.
For the conversion I am using the following command:
iconv options -f from-encoding -t utf-8 <inputfile > outputfile
But if incorrect from-encoding is provided, then incorrect file is generated.
I want a way to find the input file encoding type.
Thanks in advance
On Linux you could try using file(1) on your unknown input file. Most of the time it would guess the encoding correctly. Or else try several encodings to iconv till you "feel" that the result is acceptable (for example if you know that the file is some Russian poetry, you might try KOI-8, UTF-8, etc.... till you recognize a good Russian poem).
But character encoding is a nightmare and can be ambiguous. The provider of the file should tell you what encoding they used (and there is no way to get that encoding reliably and in all cases: there are some byte sequences which would be valid and interpreted differently with various encodings).
(notice that the HTTP protocol states the encoding explicitly)
In 2017, it is better to use UTF-8 everywhere (and you should follow that http://utf8everywhere.org/ link), so ask your human partners to send you UTF-8 (hopefully most of your files are in UTF-8, since today they all should be).
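The ambiguity is easy to reproduce. In this sketch, try_encodings is a hypothetical helper and the candidate list is only an example; it decodes the same bytes under several encodings and keeps every result that does not raise.

```python
def try_encodings(data, candidates):
    """Return {encoding: decoded_text} for every candidate that decodes cleanly."""
    results = {}
    for name in candidates:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            pass  # not a valid byte sequence in this encoding
    return results


raw = "Привет".encode("koi8-r")
for name, text in try_encodings(raw, ["utf-8", "koi8-r", "cp1251"]).items():
    print(name, text)
```

Both koi8-r and cp1251 accept these bytes, but only one of the two results is readable Russian, and only a human (or a language model of some kind) can tell which.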
(so encoding is more a social issue than a technical one)
I get text file of random encoding format
Notice that "random encoding" doesn't exist. You want and need to find out what character encoding (and file format) has been used by the provider of that file (so you mean "unknown encoding", not a "random" one).
BTW, do you have a formal, unambiguous, sound and precise definition of text file, beyond file without zero bytes, or files with few control characters? LaTeX, C source, Markdown, SQL, UUencoding, shar, XPM, and HTML files are all text files, but very different ones!
You probably want to expect UTF-8, and you might use the file extension as some hint. Knowing the media-type could help.
(so if HTTP has been used to transfer the file, it is important to keep (and trust) the Content-Type...; read about HTTP headers)
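If the Content-Type header value has been kept alongside the file, the declared charset can be pulled out with the standard library. This is a sketch; charset_from_content_type is an illustrative name, and it borrows the header parser from Python's email package instead of an HTTP client.

```python
from email.message import Message


def charset_from_content_type(value):
    """Return the lowercased charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset()


print(charset_from_content_type("text/plain; charset=ISO-8859-1"))
print(charset_from_content_type("text/csv"))
```

A None result means the sender declared nothing, and you are back to guessing.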
[...] then incorrect file is generated.
How do you know that the resulting file is incorrect? You can only know if you have some expectations about that result (e.g. that it contains Russian poetry, not junk characters; but perhaps these junk characters are some bytecode to some secret interpreter, or some music represented in weird fashion, or encrypted, etc....). Raw files are just sequences of bytes, you need some extra knowledge to use them (even if you know that they use UTF-8).
We do file encoding conversion with
vim -c "set encoding=utf8" -c "set fileencoding=utf8" -c "wq" filename
It's working fine; there is no need to give the source encoding.

Can I add a character of UTF-8 on a file encoded in ANSI?

I have a file whose character encoding is set to ANSI, yet I can still paste in a character from the UTF-8 character set. Is the character set defined for a file enforced on the entire file? I am trying to understand how character sets work. Thanks
Files are bytes. They are long sequences of numbers. In most operating systems, that's all they are. There is no "encoding" attached to the file. The file is bytes.
It is up to software to interpret those bytes as having some meaning. For example, there is nothing fundamentally different between a "picture file" and a "text file." Both are just long sequences of numbers. But software interprets the "picture file" using some encoding rules to create a picture. Similarly, software interprets the "text file" using some encoding rules.
Most text file formats do not include their encoding anywhere in the format. It's up to the software to know or infer what it is. Sometimes the operating system assists here and provides additional metadata that's not in the file, like filename extensions. This generally doesn't help for text files, since in most systems text files do not have different extensions based on their encoding.
Many characters are encoded identically in ANSI and UTF-8 (the whole ASCII range, for example). So just looking at a file, it may be impossible to tell which encoding it was written with, since the bytes could be identical in both. There are byte sequences that are illegal in UTF-8, so it is possible to determine that a file is not valid UTF-8, but all byte sequences are valid ANSI (though there are byte sequences that are very rare, and so can be used to guess that it's not ANSI).
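A short sketch makes the overlap concrete. It assumes "ANSI" means Windows-1252, which Python exposes as the cp1252 codec; the variable names are illustrative.

```python
# ASCII-only text produces identical bytes in both encodings.
ascii_same = "hello".encode("cp1252") == "hello".encode("utf-8")

# A non-ASCII character does not: one byte in cp1252, two in UTF-8.
e_acute_cp1252 = "é".encode("cp1252")
e_acute_utf8 = "é".encode("utf-8")

# The cp1252 byte on its own is not a legal UTF-8 sequence.
try:
    e_acute_cp1252.decode("utf-8")
    cp1252_byte_is_valid_utf8 = True
except UnicodeDecodeError:
    cp1252_byte_is_valid_utf8 = False

print(ascii_same, e_acute_cp1252, e_acute_utf8, cp1252_byte_is_valid_utf8)
```

So a file containing only ASCII is simultaneously valid in both encodings, and only the first non-ASCII character gives a detector anything to work with.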
(I assume you mean Windows-1252; there isn't really such a thing as "ANSI" encoding.)

tcl utf-8 characters not displaying properly in ui

Objective: To have multi-language characters in the user ID in Enovia v6
I am using UTF-8 encoding in a Tcl script, and it seems to save multi-language characters properly in the database (after some conversion). But in the UI I see the information literally as it is saved in the database.
When I do the same exercise through Power Web, the saved data somehow gets converted back into proper multi-language characters and displays properly.
Am I missing something in the Tcl approach?
Here is one example to help understand better.
Original Name: Kátai-Pál
Name saved in database as: Kátai-Pál
In UI I see name as: Kátai-Pál
In Tcl I use below syntax
set encoded [encoding convertto utf-8 Kátai-Pál];
Now user name becomes: Kátai-Pál
In UI I see name as “Kátai-Pál”
The trick is to think in terms of characters, not bytes. They're different things. Encodings are ways of representing characters as byte sequences (internally, Tcl's really quite complicated, but you shouldn't ever have to care about that if you're not developing Tcl's implementation itself; suffice to say it's Unicode). Thus, when you use:
encoding convertto utf-8 "Kátai-Pál"
You're taking a sequence of characters and asking for the sequence of bytes (one per result character) that is the encoding of those characters in the given encoding (UTF-8).
What you need to do is to get the database integration layer to understand what encoding the database is using so it can convert back into characters for you (you can only ever communicate using bytes; everything else is just a simplification). There are two ways that can happen: either the information is correctly shared (via metadata or defined convention), or both sides make assumptions which come unstuck occasionally. It sounds like the latter is what's happening, alas.
If you can't handle it any other way, you can take the bytes produced out of the database layer and convert into characters:
encoding convertfrom $theEncoding $theBytes
Working out what $theEncoding should be is in general very tricky, but it sounds like it's utf-8 for you. Once you've got characters, Tcl/Tk will be able to display them correctly; it knows how to transfer them correctly into the guts of the platform's GUI. (And in scripts that you actually write, you're best off replacing non-ASCII characters with their \uXXXX escapes, because platforms don't agree on what encoding is right to use for scripts. Alas.)
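The garbled name in the question is what you get when UTF-8 bytes are read back as a single-byte encoding. The sketch below reproduces that in Python rather than Tcl, assuming the single-byte encoding is Windows-1252 (cp1252); mojibake and repair are made-up names.

```python
def mojibake(text):
    """Encode as UTF-8, then wrongly decode the bytes as cp1252."""
    return text.encode("utf-8").decode("cp1252")


def repair(text):
    """Undo one round of the mistake: back to bytes, then decode as UTF-8."""
    return text.encode("cp1252").decode("utf-8")


once = mojibake("Kátai-Pál")
twice = mojibake(once)
print(once)
print(twice)
print(repair(repair(twice)))
```

Applying the mistake twice yields the longer garbled form, which suggests that a layer is converting bytes that were already UTF-8. The Tcl counterpart of repair is the encoding convertfrom call shown above.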

Finding files ISO-8859-1 encoded?

I have a bunch of files with a mixture of encodings, mainly ISO-8859-1 and UTF-8.
I would like to make all files UTF-8, but when trying to batch-convert these files using iconv, some problems arise (files cut in half, etc.).
I suppose the reason is that iconv needs to know the 'from' encoding, so if the command looks like this
iconv -f ISO-8859-1 -t UTF-8 in.php -o out.php
but 'in.php' is already UTF-8 encoded, that causes problems (correct me if I'm wrong).
Is there a way that I can list all the files whose encoding is not UTF-8?
You can't find files that are definitely ISO-8859-1, but you can find files that are valid UTF-8 (which unlike most multibyte encodings give you a reasonable assurance that they are in fact UTF-8). moreutils has a tool isutf8 which can do this for you. Or you can write your own, it would be fairly simple.
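Writing your own check could look like the following sketch. The names is_utf8 and non_utf8_files are illustrative and unrelated to the moreutils tool; each file is read fully into memory, which is fine for source files but not for very large inputs.

```python
import os


def is_utf8(path):
    """Return True if the whole file decodes as UTF-8."""
    with open(path, "rb") as f:
        data = f.read()
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def non_utf8_files(directory):
    """Walk `directory` and return the sorted paths that are not valid UTF-8."""
    found = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            if not is_utf8(path):
                found.append(path)
    return sorted(found)
```

Pure-ASCII files count as valid UTF-8 here, which is what you want: converting them from ISO-8859-1 would change nothing anyway.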
It's often hard to tell just by reading a text file whether it's in UTF-8 encoding or not. You could scan the file for certain indicator bytes which can never occur in UTF-8, and if you find them, you know the file is in ISO-8859-1. If you find a byte with its high-order bit set, where the bytes both immediately before and immediately after it don't have their high-order bit set, you know it's ISO encoded (because bytes >127 always occur in sequences in UTF-8). Beyond that, it's basically guesswork - you'll have to look at the sequences of bytes with that high bit set and see whether it would make sense for them to occur in ISO-8859-1 or not.
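The isolated-byte test described above can be sketched as follows. The function name has_isolated_high_byte is made up, and a byte at the very start or end of the data is treated as having a low-bit neighbour on the missing side.

```python
def has_isolated_high_byte(data):
    """Return True if some byte >= 0x80 has no high-bit neighbour on either side."""
    for i, b in enumerate(data):
        if b < 0x80:
            continue
        prev_high = i > 0 and data[i - 1] >= 0x80
        next_high = i + 1 < len(data) and data[i + 1] >= 0x80
        if not prev_high and not next_high:
            return True  # cannot be part of a UTF-8 multibyte sequence
    return False


print(has_isolated_high_byte("café au lait".encode("iso-8859-1")))
print(has_isolated_high_byte("café au lait".encode("utf-8")))
```

A True result proves the data is not UTF-8; a False result proves nothing, since two accented ISO-8859-1 characters in a row also form a run of high bytes.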
The file program will make an attempt to guess the encoding of a text file it's processing, you could try that.
With find it's quite simple:
find . -type f -print0 | xargs -0 file | grep 8859
Is there a way, that I can list all the files whose encoding is not UTF-8?
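Perhaps not so easily in bash alone, but it's a trivial task from e.g. Python:
import os
TARGETDIR = "."  # directory to process; adjust as needed
for name in os.listdir(TARGETDIR):
    child = os.path.join(TARGETDIR, name)
    if os.path.isfile(child):
        with open(child, "rb") as f:
            content = f.read()
        try:
            content.decode("utf-8")  # already valid UTF-8: leave the file alone
        except UnicodeDecodeError:
            # Not valid UTF-8: assume ISO-8859-1 and rewrite the file as UTF-8
            with open(child, "wb") as f:
                f.write(content.decode("iso-8859-1").encode("utf-8"))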
This assumes that any file that can be interpreted as a valid UTF-8 sequence is one (and so can be left alone), whilst anything that isn't must be ISO-8859-1.
This is a reasonable assumption if those two are the only possible encodings, because valid UTF-8 sequences (of at least two top-bit-set characters in a particular order) are relatively rare in real Latin text, where we tend only to use the odd single accented characters here and there.
What kind of content? XML? Then yes, if properly tagged at the top. Generic text files? I don't know of any a priori way to know what encoding is used, although it might be possible, sometimes, with clever code. "Tagged" UTF-8 text files, by which I mean UTF-8 text files with a byte order mark (for UTF-8, the three-byte sequence EF BB BF, which displays as "ï»¿" in ISO-8859-1)? Probably. The byte order mark characters will not commonly appear as the first three characters in an ISO-8859-1 encoded file. (Which bobince pointed out in a comment to this post, so I'm correcting my post.)
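Checking for the UTF-8 byte order mark takes one comparison. This sketch uses the BOM_UTF8 constant from Python's codecs module; has_utf8_bom is an illustrative name.

```python
import codecs


def has_utf8_bom(data):
    """Return True if the bytes start with the UTF-8 byte order mark (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


tagged = codecs.BOM_UTF8 + "hello".encode("utf-8")
print(has_utf8_bom(tagged))
print(has_utf8_bom(b"hello"))
print(tagged.decode("utf-8-sig"))  # this codec drops the mark while decoding
```

A missing mark proves nothing, since most UTF-8 files are written without one.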
For your purposes, tools exist that can probably solve most of your question. Logan Capaldo pointed out one in his answer.
But after all, if it were always possible to figure out, unambiguously, what character encoding was used in a file, then the iconv utility wouldn't need you to provide the "from" encoding. :)

Resources