What does rb:bom|utf-8 mean in CSV.open in Ruby? - ruby

What does the 'rb:bom|utf-8' mean in:
CSV.open(csv_name, 'rb:bom|utf-8', headers: true, return_headers: true) do |csv|
I can understand that:
r means read
bom is a file format with \xEF\xBB\xBF at the start of a file to
indicate endianness.
utf-8 is a file format
But:
I don't know how they fit together, or why it is necessary to write all of this just to read a CSV file.
I'm struggling to find the documentation for
this. It doesn't seem to be documented in
https://ruby-doc.org/stdlib-2.6.1/libdoc/csv/rdoc/CSV.html
Update:
Found some very useful documentation:
https://ruby-doc.org/core-2.6.3/IO.html#method-c-new-label-Open+Mode

(The accepted answer is not incorrect but incomplete)
rb:bom|utf-8 converted to a human readable sentence means:
Open the file for reading (r) in binary mode (b) and look for a Unicode BOM (bom) to detect the encoding; if no BOM is found, assume UTF-8 (utf-8).
A BOM (byte order mark) can be used to detect whether a file is UTF-8 or UTF-16 and, if it is UTF-16, whether it is little or big endian. There are also BOMs for UTF-32, and Ruby's BOM detection recognizes those as well. A BOM is a reserved byte sequence in the Unicode standard that is used only for detecting the encoding of a file, and it must be the first "character" of that file. It is recommended and typically used for UTF-16, which exists in two byte-order variants. It is optional for UTF-8, and a Unicode file without a BOM is usually assumed to be UTF-8.
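A minimal sketch of the behaviour described above, assuming a POSIX-like system where a temp file can be reopened by path. The helper name read_with_mode is made up for this illustration; File.open, external_encoding and the mode strings are standard Ruby.

```ruby
require "tempfile"

# Hypothetical helper: writes raw bytes to a temp file, reopens the file with
# the given mode string, and returns [external encoding name, decoded content].
def read_with_mode(bytes, mode)
  Tempfile.create("bom_demo") do |tmp|
    tmp.binmode
    tmp.write(bytes)
    tmp.flush
    File.open(tmp.path, mode) do |f|
      content = f.read
      [f.external_encoding.name, content]
    end
  end
end

utf8_with_bom = "\xEF\xBB\xBFname,age\n".b
utf16_with_bom = "\xFF\xFE".b + "hi".encode("UTF-16LE").b

p read_with_mode(utf8_with_bom, "rb:bom|utf-8")   # BOM consumed, content starts with "name"
p read_with_mode(utf8_with_bom, "rb:utf-8")       # BOM kept as a leading U+FEFF character
p read_with_mode(utf16_with_bom, "rb:bom|utf-8")  # encoding taken from the BOM, not from "utf-8"
p read_with_mode("name\n".b, "rb:bom|utf-8")      # no BOM, so the utf-8 fallback applies
```

The encoding name after the pipe is only the fallback; when a BOM is present, the BOM decides.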

When reading a text file in Ruby you need to specify the encoding, or Ruby falls back to the default external encoding, which might be wrong.
If you're reading CSV files that start with a BOM, you need to open them this way.
Opening the file as plain UTF-8 does not strip the BOM: it stays in the data as the character U+FEFF and ends up attached to the first field. The bom| prefix tells Ruby to check for a BOM and skip past it before treating the rest of the data as UTF-8. That notation is how Ruby expresses that requirement.
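To make the effect on CSV headers concrete, here is a small sketch. The helper header_names and the sample data are invented for this example, and it assumes the csv library is available (it ships as a gem with recent Ruby versions).

```ruby
require "csv"
require "tempfile"

# Hypothetical helper: writes raw bytes to a temp CSV file, opens it with the
# given mode string, and returns the header names CSV sees.
def header_names(bytes, mode)
  Tempfile.create(["people", ".csv"]) do |tmp|
    tmp.binmode
    tmp.write(bytes)
    tmp.flush
    CSV.open(tmp.path, mode, headers: true) do |csv|
      csv.read.headers
    end
  end
end

csv_bytes = "\xEF\xBB\xBFname,age\nAda,36\n".b

p header_names(csv_bytes, "rb:bom|utf-8")  # first header is plain "name"
p header_names(csv_bytes, "rb:utf-8")      # first header carries a leading U+FEFF
```

With the second mode, row["name"] would return nil for every row, because the real header is "\uFEFFname". That is the usual symptom of a BOM that was not stripped.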

Related

How to find file encoding type or convert any encoding type to UTF-8 in shell?

I get text file of random encoding format, ucs-2le, ansi, utf-8, ucs-2be etc. I have to convert these files to UTF-8.
For the conversion I am using the following command:
iconv options -f from-encoding -t utf-8 <inputfile > outputfile
But if incorrect from-encoding is provided, then incorrect file is generated.
I want a way to find the input file encoding type.
Thanks in advance
On Linux you could try using file(1) on your unknown input file. Most of the time it would guess the encoding correctly. Or else try several encodings to iconv till you "feel" that the result is acceptable (for example if you know that the file is some Russian poetry, you might try KOI-8, UTF-8, etc.... till you recognize a good Russian poem).
But character encoding is a nightmare and can be ambiguous. The provider of the file should tell you what encoding they used (there is no way to determine that encoding reliably in all cases: some byte sequences are valid in several encodings and are interpreted differently in each).
(Note that HTTP states the encoding explicitly.)
In 2017, better use UTF-8 everywhere (and you should follow that http://utf8everywhere.org/ link) so ask your human partners to send you UTF-8 (hopefully most of your files are in UTF-8, since today they all should be).
(so encoding is more a social issue than a technical one)
I get text file of random encoding format
Notice that a "random encoding" does not exist. You want and need to find out what character encoding (and file format) has been used by the provider of that file (so you mean an "unknown encoding", not a "random" one).
BTW, do you have a formal, unambiguous, sound and precise definition of text file, beyond file without zero bytes, or files with few control characters? LaTeX, C source, Markdown, SQL, UUencoding, shar, XPM, and HTML files are all text files, but very different ones!
You probably want to expect UTF-8, and you might use the file extension as some hint. Knowing the media-type could help.
(so if HTTP has been used to transfer the file, it is important to keep (and trust) the Content-Type...; read about HTTP headers)
[...] then incorrect file is generated.
How do you know that the resulting file is incorrect? You can only know if you have some expectations about that result (e.g. that it contains Russian poetry, not junk characters; but perhaps these junk characters are some bytecode to some secret interpreter, or some music represented in weird fashion, or encrypted, etc....). Raw files are just sequences of bytes, you need some extra knowledge to use them (even if you know that they use UTF-8).
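The ambiguity can be shown with a short Ruby sketch. The helper plausible_encodings and its candidate list are assumptions made for this illustration; it only checks whether the bytes are structurally valid in each encoding, which says nothing about whether the decoded text makes sense.

```ruby
# Hypothetical helper: returns every candidate encoding under which the raw
# bytes form a structurally valid string.
def plausible_encodings(bytes, candidates = %w[UTF-8 UTF-16LE ISO-8859-1 KOI8-R])
  candidates.select do |name|
    bytes.dup.force_encoding(name).valid_encoding?
  end
end

russian = "\xF0\xD2\xC9\xD7\xC5\xD4".b  # a Russian word saved as KOI8-R
p plausible_encodings(russian)           # several candidates remain; UTF-8 is ruled out

# Only a human (or extra knowledge about the content) can pick the right one:
plausible_encodings(russian).each do |name|
  puts "#{name}: #{russian.dup.force_encoding(name).encode('UTF-8')}"
end
```

Validity narrows the list but rarely reduces it to one entry, which is why the answer calls this a social problem as much as a technical one.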
We do file encoding conversion with
vim -c "set encoding=utf8" -c "set fileencoding=utf8" -c "wq" filename
It works fine, and there is no need to give the source encoding.

Can I add a character of UTF-8 on a file encoded in ANSI?

I have a file whose character encoding is set to ANSI, yet I can still paste a UTF-8 character into it. Is the character set defined on a file enforced for the entire file? I am trying to understand how character sets work. Thanks
Files are bytes. They are long sequences of numbers. In most operating systems, that's all they are. There is no "encoding" attached to the file. The file is bytes.
It is up to software to interpret those bytes as having some meaning. For example, there is nothing fundamentally different between a "picture file" and a "text file." Both are just long sequences of numbers. But software interprets the "picture file" using some encoding rules to create a picture. Similarly, software interprets the "text file" using some encoding rules.
Most text file formats do not include their encoding anywhere in the format. It's up to the software to know or infer what it is. Sometimes the operating system assists here and provides additional metadata that's not in the file, like filename extensions. This generally doesn't help for text files, since in most systems text files do not have different extensions based on their encoding.
Many characters (the whole ASCII range) are encoded identically in ANSI and in UTF-8. So just looking at a file, it may be impossible to tell which encoding it was written with, since the bytes could be identical in both. There are byte sequences that are illegal in UTF-8, so it is possible to determine that a file is not valid UTF-8, but practically every byte sequence is valid ANSI (though some byte sequences are very rare there, and so can be used to guess that the file is not ANSI).
(I assume you mean Windows-1252; there isn't really such a thing as "ANSI" encoding.)
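A small Ruby sketch of "files are bytes": the same byte string is read once as UTF-8 and once as Windows-1252. The helper interpretations is invented for this example.

```ruby
# Hypothetical helper: shows how one byte sequence looks under two readings.
def interpretations(bytes)
  {
    utf8_valid: bytes.dup.force_encoding("UTF-8").valid_encoding?,
    # Decode as Windows-1252 and re-encode so the result can be printed.
    # (encode raises for the few byte values Windows-1252 leaves undefined.)
    as_windows_1252: bytes.dup.force_encoding("Windows-1252").encode("UTF-8"),
  }
end

p interpretations("cafe".b)          # pure ASCII: identical in both encodings
p interpretations("caf\xE9".b)       # Windows-1252 bytes: not valid UTF-8
p interpretations("caf\xC3\xA9".b)   # UTF-8 bytes: also decodable as Windows-1252, but as mojibake
```

Nothing in the bytes themselves records which reading was intended; the software doing the reading decides.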

How does visual studio resolve unicode string from different encoding source file ?

I know that if I use the Unicode charset in VS, I can write L"There is a string" to represent a Unicode string. I think There is a string is read from the source file when VS does lexical parsing, and is then decoded to Unicode from the source file's encoding.
I have changed the source file to several different encodings, but I always got the correct Unicode data from the L macro. Does VS detect the encoding of the source file in order to convert There is a string to the correct Unicode? If not, how does VS achieve this?
I'm not sure whether this question can be asked on SO; if not, where should I ask? Thanks in advance.
VS won't detect the encoding without a BOM [1] signature at the start of a source file. It will just assume the localized ANSI encoding if no BOM is present.
A BOM signature identifies the UTF8/16/32 encoding used. So if you save something as UTF-8 (VS will add a BOM) and remove the first 3 bytes (EF BB BF), then the file will be interpreted as CP1252 on US Windows, but GB2312 on Chinese Windows, etc.
You are on Chinese Windows, so either save as GB2312 (without BOM) or UTF8 (with BOM) for VS to decode your source code correctly.
[1] https://en.wikipedia.org/wiki/Byte_order_mark
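The rule described in this answer (use the BOM if there is one, otherwise fall back to the local ANSI code page) can be sketched in Ruby as follows. This is an illustration of the rule only, not Visual Studio's actual implementation; sniff_bom, source_encoding and the ansi_fallback parameter are names made up for the example.

```ruby
# BOM signatures from the Unicode standard. Longer signatures come first,
# because the UTF-32LE BOM begins with the UTF-16LE BOM.
BOMS = [
  ["\x00\x00\xFE\xFF".b, "UTF-32BE"],
  ["\xFF\xFE\x00\x00".b, "UTF-32LE"],
  ["\xEF\xBB\xBF".b, "UTF-8"],
  ["\xFE\xFF".b, "UTF-16BE"],
  ["\xFF\xFE".b, "UTF-16LE"],
].freeze

# Hypothetical helper: returns the encoding named by a leading BOM, or nil.
def sniff_bom(bytes)
  head = bytes.b
  match = BOMS.find { |signature, _name| head.start_with?(signature) }
  match && match[1]
end

# Hypothetical helper: BOM wins, otherwise assume the machine's ANSI code page.
def source_encoding(bytes, ansi_fallback)
  sniff_bom(bytes) || ansi_fallback
end

p source_encoding("\xEF\xBB\xBFint main() {}".b, "GB2312")  # BOM present
p source_encoding("int main() {}".b, "GB2312")               # no BOM: fallback is used
```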

Batch convert to UTF8 using Ruby

I'm encountering a little problem with my file encodings.
Sadly, as of yet I still am not on good terms with everything where encoding matters; although I have learned plenty since I began using Ruby 1.9.
My problem at hand: I have several files to be processed, which are expected to be in UTF-8 format. But I do not know how to batch convert those files properly; e.g. when in Ruby, I open the file, encode the string to utf8 and save it in another place.
Unfortunately that's not how it is done - the file is still in ANSI.
At least that's what my Notepad++ says.
I find it odd though, because the string was clearly encoded to UTF-8, and I even set the File.open parameter :encoding to 'UTF-8'. My shell is set to CP65001, which I believe also corresponds to UTF-8.
Any suggestions?
Many thanks!
/e: What's more, when in Notepad++, I can convert manually as such:
Selecting everything,
copy,
setting encoding to UTF-8 (here, \x-escape-sequences can be seen)
pasting everything from clipboard
Done! Escape-characters vanish, file can be processed.
Unfortunately that's not how it is done - the file is still in ANSI. At least that's what my Notepad++ says.
UTF-8 was designed to be a superset of ASCII, which means that every ASCII character is encoded as the same single byte in UTF-8. For this reason it's not possible to distinguish between ASCII (or ANSI) and UTF-8 unless the file contains "special" (non-ASCII) characters. These special characters are represented using multiple bytes in UTF-8.
It's quite possible that your conversion is actually working, but you can double-check by trying your program with special characters.
Also, one of the best utilities for converting between encodings is iconv, which also has ruby bindings.
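For the conversion itself, here is a sketch using only Ruby's built-in String#encode. It assumes the source files are Windows-1252 (what Notepad++ usually means by "ANSI" on Western systems); the helper name convert_to_utf8 and the from: parameter are invented for this example.

```ruby
# Hypothetical helper: reads raw bytes, declares what encoding they are in,
# transcodes to UTF-8, and writes the result without any newline conversion.
def convert_to_utf8(src_path, dest_path, from: "Windows-1252")
  text = File.binread(src_path).force_encoding(from)  # label the bytes, no conversion yet
  File.binwrite(dest_path, text.encode("UTF-8"))      # the actual transcoding happens here
end

# Batch usage (paths are placeholders):
# Dir.glob("input/*.txt").each do |path|
#   convert_to_utf8(path, File.join("output", File.basename(path)))
# end
```

The key distinction is between force_encoding, which only relabels bytes, and encode, which rewrites them. For a file containing only ASCII characters the output bytes are identical to the input, so an editor has no way to report anything but "ANSI".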

Finding files ISO-8859-1 encoded?

I have a bunch of files with a mixture of encodings, mainly ISO-8859-1 and UTF-8.
I would like to make all files UTF-8, but when trying to batch encode these files using
iconv some problems arise. (Files cut in half, etc.)
I suppose the reason is that iconv needs to know the 'from' encoding, so if the command looks like this
iconv -f ISO-8859-1 -t UTF-8 in.php -o out.php
but 'in.php' is already UTF-8 encoded, that causes problems (correct me if I'm wrong)
Is there a way, that I can list all the files whose encoding is not UTF-8?
You can't find files that are definitely ISO-8859-1, but you can find files that are valid UTF-8 (which unlike most multibyte encodings give you a reasonable assurance that they are in fact UTF-8). moreutils has a tool isutf8 which can do this for you. Or you can write your own, it would be fairly simple.
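Writing your own check is indeed short. A rough Ruby equivalent might look like this; the helper name non_utf8_files is made up, and the sketch reads each file fully into memory, which is fine for source files but not for very large ones.

```ruby
# Hypothetical helper: lists regular files under dir whose bytes are not
# valid UTF-8.
def non_utf8_files(dir)
  Dir.glob(File.join(dir, "**", "*")).select do |path|
    File.file?(path) &&
      !File.binread(path).force_encoding("UTF-8").valid_encoding?
  end
end

puts non_utf8_files(".")
```

Pure ASCII files count as valid UTF-8 here, which is what you want: they need no conversion.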
It's often hard to tell just by reading a text file whether it's in UTF-8 encoding or not. You could scan the file for certain indicator bytes which can never occur in UTF-8, and if you find them, you know the file is in ISO-8859-1. If you find a byte with its high-order bit set, where the bytes both immediately before and immediately after it don't have their high-order bit set, you know it's ISO encoded (because bytes >127 always occur in sequences in UTF-8). Beyond that, it's basically guesswork - you'll have to look at the sequences of bytes with that high bit set and see whether it would make sense for them to occur in ISO-8859-1 or not.
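The "lone high byte" test described above can be sketched in Ruby like this. The helper name lone_high_byte? is invented; a true result proves the data is not UTF-8, while false proves nothing either way.

```ruby
# Hypothetical helper: true if some byte >= 0x80 has no neighbouring byte
# >= 0x80. In UTF-8, high bytes only ever occur in runs of two or more.
def lone_high_byte?(bytes)
  values = bytes.bytes
  values.each_with_index.any? do |byte, i|
    next false if byte < 0x80
    prev_high = i > 0 && values[i - 1] >= 0x80
    next_high = i + 1 < values.length && values[i + 1] >= 0x80
    !prev_high && !next_high
  end
end

p lone_high_byte?("caf\xE9 au lait".b)      # ISO-8859-1 style: a single high byte
p lone_high_byte?("caf\xC3\xA9 au lait".b)  # UTF-8 style: high bytes come as a pair
```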
The file program will make an attempt to guess the encoding of a text file it's processing; you could try that.
With find it's quite simple:
find . -print0 | xargs -0 file | grep 8859
Is there a way, that I can list all the files whose encoding is not UTF-8?
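Perhaps not so easily in bash alone, but it's a trivial task from e.g. Python:
import os

TARGETDIR = "."  # directory to scan; adjust as needed

for name in os.listdir(TARGETDIR):
    child = os.path.join(TARGETDIR, name)
    if os.path.isfile(child):
        with open(child, 'rb') as f:
            content = f.read()
        try:
            content.decode('utf-8')  # already valid UTF-8: leave the file alone
        except UnicodeDecodeError:
            # not valid UTF-8: assume ISO-8859-1 and rewrite the file as UTF-8
            with open(child, 'wb') as f:
                f.write(content.decode('iso-8859-1').encode('utf-8'))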
This assumes that any file that can be interpreted as a valid UTF-8 sequence is one (and so can be left alone), whilst anything that isn't must be ISO-8859-1.
This is a reasonable assumption if those two are the only possible encodings, because valid UTF-8 sequences (of at least two top-bit-set characters in a particular order) are relatively rare in real Latin text, where we tend only to use the odd single accented characters here and there.
What kind of content? XML? Then yes, if properly tagged at the top. Generic text files? I don't know of any a-priori way to know what encoding is used, although it might be possible, sometimes, with clever code. "Tagged" UTF-8 text files, by which I mean UTF-8 text files with a Byte-Order mark? (For UTF-8, the three byte sequence EF BB BF, which displays as "ï»¿" when read as ISO-8859-1.) Probably. The Byte Order Mark characters will not commonly appear as the first three characters in an ISO-8859-1 encoded file. (Which bobince pointed out in a comment to this post, so I'm correcting my post.)
For your purposes, tools exist that can probably solve most of your question. Logan Capaldo pointed out one in his answer.
But after all, if it were always possible to figure out, unambiguously, what character encoding was used in a file, then the iconv utility wouldn't need you to provide the "from" encoding. :)
