What is the difference between utf8 and utf-8?

Just wondering, what is the difference between utf8 and utf-8? Or are they just the same? I read in an answer here at SO saying to use utf8 for PDO, as utf-8 sometimes generates errors.
EDIT:
Where would utf-8 be invalid to use? (as answered in another similar question)

There is no difference between "utf8" and "utf-8"; they are simply two names for UTF-8, the most common Unicode encoding. The reason SO answers tell you to use utf8 with PDO is that MySQL spells its character-set names without the hyphen (utf8, utf8mb4), so a MySQL DSN needs charset=utf8, not charset=utf-8; outside such APIs the two spellings are interchangeable.
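Most encoding APIs treat the two spellings as aliases. For example, a quick check with Python's standard codecs module (a minimal sketch; nothing here is PDO-specific):
import codecs

# "utf8" and "UTF-8" resolve to the same codec; the canonical name
# Python reports is "utf-8".
print(codecs.lookup("utf8").name)   # utf-8
print(codecs.lookup("UTF-8").name)  # utf-8

# Encoding with either spelling yields identical bytes.
assert "héllo".encode("utf8") == "héllo".encode("utf-8")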

Related

What does rb:bom|utf-8 mean in CSV.open in Ruby?

What does the 'rb:bom|utf-8' mean in:
CSV.open(csv_name, 'rb:bom|utf-8', headers: true, return_headers: true) do |csv|
I can understand that:
r means read
bom is a file format with \xEF\xBB\xBF at the start of a file to indicate endianness.
utf-8 is a file format
But:
I don't know how they fit together, and why it is necessary to write all of this just to read a CSV.
I'm struggling to find the documentation for this. It doesn't seem to be documented in https://ruby-doc.org/stdlib-2.6.1/libdoc/csv/rdoc/CSV.html
Update:
Found some very useful documentation:
https://ruby-doc.org/core-2.6.3/IO.html#method-c-new-label-Open+Mode
(The accepted answer is not incorrect but incomplete)
rb:bom|utf-8, converted into a human-readable sentence, means:
Open the file for reading (r) in binary mode (b) and look for a Unicode BOM marker (bom) to detect the encoding or, in case no BOM marker is found, assume UTF-8 encoding (utf-8).
A BOM marker can be used to detect whether a file is UTF-8 or UTF-16 and, if it is UTF-16, whether it is little- or big-endian. There is also a BOM marker for UTF-32, but Ruby doesn't support UTF-32 as of today. A BOM marker is just a special reserved byte sequence in the Unicode standard that is used only to detect the encoding of a file, and it must be the very first "character" of that file. It is recommended and typically used for UTF-16, since UTF-16 exists in two different variants; it is optional for UTF-8, and a Unicode file with no BOM marker is usually assumed to be UTF-8.
When reading a text file in Ruby you need to specify the encoding or it will revert to the default, which might be wrong.
If you're reading CSV files that start with a BOM, you need to open them this way.
Decoding as plain UTF-8 does not strip the BOM, so it would end up as an extra character at the start of your data; you have to detect it and skip past it before treating the rest as UTF-8 text. That notation is how Ruby expresses this requirement.
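Ruby's bom|utf-8 mode is specific to Ruby's IO layer, but the idea exists elsewhere too. For comparison, Python's standard utf-8-sig codec does the UTF-8 half of the job, decoding a file and silently dropping a leading BOM if present (a minimal sketch with a hypothetical file name; unlike Ruby's mode, it does not fall through to UTF-16 detection):
import csv

# utf-8-sig strips a leading EF BB BF if present; plain utf-8 would
# leave it attached to the first header name.
with open("data.csv", encoding="utf-8-sig", newline="") as f:
    for row in csv.DictReader(f):
        print(row)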

How does Visual Studio resolve Unicode strings from source files in different encodings?

I know that if I use the Unicode charset in VS, I can use L"There is a string" to represent a Unicode string. I think "There is a string" will be read from the source file while VS is doing lexical parsing, and decoded to Unicode from the source file's encoding.
I have changed the source file to several different encodings, but I always get the correct Unicode data from the L prefix. Does VS detect the encoding of the source file in order to convert "There is a string" correctly? If not, how does VS achieve this?
I'm not sure whether this question should be asked on SO; if not, where should I ask? Thanks in advance.
VS won't detect the encoding without a BOM [1] signature at the start of a source file. It will just assume the localized ANSI encoding if no BOM is present.
A BOM signature identifies the UTF8/16/32 encoding used. So if you save something as UTF-8 (VS will add a BOM) and remove the first 3 bytes (EF BB BF), then the file will be interpreted as CP1252 on US Windows, but GB2312 on Chinese Windows, etc.
You are on Chinese Windows, so either save as GB2312 (without BOM) or UTF8 (with BOM) for VS to decode your source code correctly.
[1] https://en.wikipedia.org/wiki/Byte_order_mark
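The check VS performs can be reproduced by sniffing the first bytes of the file yourself; Python's codecs module exposes the standard BOM byte sequences (a rough sketch, with a hypothetical file name):
import codecs

# Longer BOMs first: the UTF-32-LE BOM begins with the UTF-16-LE one.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF32_LE, "utf-32-le"),  # FF FE 00 00
    (codecs.BOM_UTF32_BE, "utf-32-be"),  # 00 00 FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
]

def sniff_bom(path):
    """Return the encoding named by the file's BOM, or None."""
    with open(path, "rb") as f:
        head = f.read(4)
    for bom, name in BOMS:
        if head.startswith(bom):
            return name
    return None  # no BOM: VS falls back to the localized ANSI code page

print(sniff_bom("main.cpp"))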

Smarty - Shift-jis encoding without using PHP

I want to show Shift-JIS characters, but only when displaying: store in UTF-8 and show in Shift-JIS. What is the solution to do that in Smarty?
You cannot mix different charsets/encodings in the output to the browser, so you can send either UTF-8 or Shift-JIS, but not both.
You can use UTF-8 internally and, in an outputfilter, convert the complete output from UTF-8 to Shift-JIS (using mb_convert_encoding).
Smarty is not (really) equipped to deal internally with charsets other than ASCII supersets (like Latin1 or UTF-8).
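In PHP the conversion step would be mb_convert_encoding($output, 'SJIS', 'UTF-8') inside that outputfilter. As a language-neutral illustration of the same single conversion at the output boundary (a sketch in Python, with hypothetical page content):
# Render everything in UTF-8, then convert once before sending.
page_utf8 = "<html><body>こんにちは</body></html>".encode("utf-8")

# errors="replace" turns characters with no Shift-JIS mapping into "?"
# instead of raising an exception.
page_sjis = page_utf8.decode("utf-8").encode("shift_jis", errors="replace")

# The Content-Type header must match the bytes actually sent:
# Content-Type: text/html; charset=Shift_JIS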

Converting Multibyte characters to UTF-8

My application has to write data to an XML file which will be read by a SWF file. The SWF expects the data in the XML to be in UTF-8 encoding. I have to convert some multibyte characters in my app (Chinese Simplified, Japanese, Korean, etc.) to UTF-8.
Are there any API calls which could allow me to do this? I would prefer not to use any 3rd-party DLLs. I need to do it both on Windows and on Mac, and would prefer system APIs if available.
Thanks
jbsp72
UTF-8 is a multibyte encoding (well, a variable byte-length encoding, to be precise). Stating that you need to convert from a multibyte encoding is not enough; you need to specify which multibyte encoding your source is in.
I have to convert some Multibyte characters in my app (Chinese simplified, Japanese, Korean etc..) to UTF-8.
If your original string is in a multibyte encoding (Chinese/Arabic/Thai/etc.) and you need to convert it to another multibyte encoding (UTF-8), one way is to convert to wide characters (UTF-16) first, then convert back to multibyte:
multibyte (Chinese/Arabic/Thai/etc.) -> widechar (UTF-16) -> multibyte (UTF-8)
If your original string is already in Unicode (UTF-16), you can skip the first conversion in the illustration above.
You can refer to the code page identifiers on MSDN.
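On Windows those two hops correspond to the MultiByteToWideChar and WideCharToMultiByte API calls; on the Mac, CFString can do the equivalent. As a compact, platform-neutral sketch of the same pipeline (Python, assuming the source is in code page 932, i.e. Shift-JIS):
# Bytes in a legacy multibyte encoding (cp932 = Windows Japanese).
legacy = "日本語".encode("cp932")

# Step 1: multibyte -> Unicode (the widechar/UTF-16 hop on Windows).
text = legacy.decode("cp932")

# Step 2: Unicode -> UTF-8.
utf8 = text.encode("utf-8")

print(utf8)  # b'\xe6\x97\xa5\xe6\x9c\xac\xe8\xaa\x9e'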
Google Chrome has string conversion implementations for Windows, Linux, and Mac; the files are under src/base:
+ sys_string_conversions.h
+ sys_string_conversions_linux.cc
+ sys_string_conversions_win.cc
+ sys_string_conversions_mac.mm
The code is under a BSD license, so you can use it for commercial projects.

Finding ISO-8859-1 encoded files?

I have a bunch of files with a mixture of encodings, mainly ISO-8859-1 and UTF-8.
I would like to make all of them UTF-8, but when trying to batch-encode the files with iconv, some problems arise (files cut in half, etc.).
I suppose the reason is that iconv needs to know the 'from' encoding, so if the command looks like this:
iconv -f ISO-8859-1 -t UTF-8 in.php -o out.php
but 'in.php' is already UTF-8 encoded, that causes problems (correct me if I'm wrong).
Is there a way, that I can list all the files whose encoding is not UTF-8?
You can't find files that are definitely ISO-8859-1, but you can find files that are valid UTF-8 (which, unlike with most multibyte encodings, gives you reasonable assurance that they really are UTF-8). moreutils has a tool, isutf8, which can do this for you. Or you can write your own; it would be fairly simple.
It's often hard to tell just by reading a text file whether it's in UTF-8 encoding or not. You could scan the file for certain indicator bytes which can never occur in UTF-8, and if you find them, you know the file is in ISO-8859-1. If you find a byte with its high-order bit set, where the bytes both immediately before and immediately after it don't have their high-order bit set, you know it's ISO encoded (because bytes >127 always occur in sequences in UTF-8). Beyond that, it's basically guesswork - you'll have to look at the sequences of bytes with that high bit set and see whether it would make sense for them to occur in ISO-8859-1 or not.
The file program will make an attempt to guess the encoding of a text file it's processing, you could try that.
With find it's quite simple:
find . -print0 | xargs -0 file | grep 8859
Is there a way, that I can list all the files whose encoding is not UTF-8?
Perhaps not so easily in bash alone, but it's a trivial task from e.g. Python:
import os

for child in os.listdir(TARGETDIR):
    child = os.path.join(TARGETDIR, child)
    if os.path.isfile(child):
        content = open(child, 'rb').read()
        try:
            content.decode('utf-8')
        except UnicodeDecodeError:
            # Not valid UTF-8: assume ISO-8859-1 and re-encode as UTF-8.
            open(child, 'wb').write(content.decode('iso-8859-1').encode('utf-8'))
This assumes that any file that can be interpreted as a valid UTF-8 sequence is one (and so can be left alone), whilst anything that isn't must be ISO-8859-1.
This is a reasonable assumption if those two are the only possible encodings, because valid UTF-8 sequences (of at least two top-bit-set bytes in a particular order) are relatively rare in real Latin text, where we tend only to use the odd single accented character here and there.
What kind of content? XML? Then yes, if properly tagged at the top. Generic text files? I don't know of any a-priori way to tell what encoding is used, although it might sometimes be possible with clever code. "Tagged" UTF-8 text files, by which I mean UTF-8 text files with a byte-order mark (for UTF-8, the three-byte sequence EF BB BF)? Probably: the byte-order-mark characters will not commonly appear as the first three characters of an ISO-8859-1 encoded file. (Which bobince pointed out in a comment to this post, so I'm correcting my post.)
For your purposes, tools exist that can probably solve most of your question. Logan Capaldo pointed out one in his answer.
But after all, if it were always possible to figure out, unambiguously, what character encoding was used in a file, then the iconv utility wouldn't need you to provide the "from" encoding. :)
