What's the default encoding for System.IO.File.ReadAllText - windows

If we don't specify the encoding, which encoding will it use?
I do not think it's System.Text.Encoding.Default. Things work well if I EXPLICITLY pass System.Text.Encoding.Default, but things go wrong when I leave that empty.
So this doesn't work well
Dim b = System.IO.File.ReadAllText("test.txt")
System.IO.File.WriteAllText("test4.txt", b)
but this works well
Dim b = System.IO.File.ReadAllText("test.txt", System.Text.Encoding.Default)
System.IO.File.WriteAllText("test4.txt", b, System.Text.Encoding.Default)
If we do not specify encoding will vb.net try to figure out the encoding from the text file?
Also what is System.Text.Encoding.Default?
It's the system default. What is my system default and how can I change it?
How do I know encoding used in a text file?
If I create a new text file and open it with SciTE, I see that the encoding is "Code Page Property". What is the code page property?

Look here, "This method attempts to automatically detect the encoding of a file based on the presence of byte order marks. Encoding formats UTF-8 and UTF-32 (both big-endian and little-endian) can be detected."
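The BOM-based detection described in that quote can be sketched in Python. This is only an illustration of the idea, not the .NET implementation: the function name decode_with_bom_sniffing and the UTF-8 fallback for data without a BOM are assumptions made for the example.

```python
import codecs

# Checked longest-first: the UTF-32 LE BOM (FF FE 00 00) begins with
# the UTF-16 LE BOM (FF FE), so UTF-32 has to be tested before UTF-16.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def decode_with_bom_sniffing(data: bytes, fallback: str = "utf-8") -> str:
    """Decode bytes using the BOM if one is present, else the fallback (assumed UTF-8)."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            # Strip the signature and decode the rest with the matching codec.
            return data[len(bom):].decode(name)
    # No BOM: undecodable bytes become U+FFFD instead of raising an error.
    return data.decode(fallback, errors="replace")
```

A file saved in an ANSI code page has no BOM, so a reader like this falls back to UTF-8, and a byte such as 0xFC (ü in Windows-1252) cannot be decoded. That may be why the first snippet in the question garbles the text while the explicit Encoding.Default version does not.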

See also http://msdn.microsoft.com/en-us/library/ms143375(v=vs.110).aspx, which says of WriteAllText:
"This method uses UTF-8 encoding without a Byte-Order Mark (BOM)"

Related

find reason for automatic encoding detection (UTF-8 vs Windows-1252)

I have a CSV with content that is UTF-8 encoded. However, various applications and systems erroneously detect the encoding of the CSV as Windows-1252, which breaks all the special characters in the file (e.g. umlauts).
I can see that Sublime Text (on Windows) for example also automatically detects the wrong Windows-1252 encoding, when opening the file for the first time, showing garbled text where special characters are supposed to be.
When I choose Reopen with Encoding » UTF-8, everything will look fine, as expected.
Now, to find the source of the error, I thought it might help to figure out why these applications are not automatically detecting the correct encoding in the first place. Maybe there is a stray character somewhere with the wrong encoding, for example.
The CSV in question is actually an automatically generated product export of a Magento 2 installation. Recently the character encodings broke and I am currently trying to figure out what happened - hence my investigation on why this export is detected as Windows-1252.
Is there any reliable way of figuring out why the automatic detection of applications like Sublime Text assumes the wrong character encoding?
This is what I did in the end to find out why the file was not detected as UTF-8, i.e. to find the characters that were not encoded in UTF-8. Since PHP is more readily available to me, I decided to simply use the following script, to force convert anything that is not UTF-8 to UTF-8, using the very handy neitanod/forceutf8 library.
$before = file_get_contents('export.csv');
$after = \ForceUTF8\Encoding::toUTF8($before);
file_put_contents('export.fixed.csv', $after);
Then I used a file comparison tool like Beyond Compare to compare the two resulting CSVs, in order to see more easily which characters were not originally encoded in UTF-8.
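As an alternative to converting and diffing, the offending bytes can be located directly. The following Python sketch reports the offset of every byte sequence that is not valid UTF-8. The function name find_invalid_utf8 is hypothetical, and in practice the bytes would come from reading export.csv in binary mode.

```python
def find_invalid_utf8(data: bytes):
    """Return (offset, offending_bytes) for each span that is not valid UTF-8."""
    problems = []
    pos = 0
    while pos < len(data):
        try:
            # Strict decoding of the remainder; success means nothing else is broken.
            data[pos:].decode("utf-8")
            break
        except UnicodeDecodeError as err:
            # err.start and err.end are relative to the slice, so shift them by pos.
            start, end = pos + err.start, pos + err.end
            problems.append((start, data[start:end]))
            pos = end  # resume scanning right after the bad bytes
    return problems
```

Mapping the reported offsets back to CSV rows and columns would show which fields are affected, much like the file comparison did.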
This in turn showed me that only one particular column of the export was affected. Upon further investigation I found out that the contents of that column were processed in PHP with the following preg_replace:
$value = preg_replace('/([^\pL0-9 -])+/', '', $value);
Using \p in the regular expression had a side effect I had not expected: the special characters came out in a broken encoding. Without the u modifier, PCRE matches the pattern against single bytes rather than UTF-8 characters, so a multi-byte character can be cut apart. A quick solution is to add the u flag to the regex (see regex pattern modifiers reference), which makes preg_replace treat both the pattern and the subject as UTF-8. See also this answer.
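The difference between filtering bytes and filtering characters can be reproduced in Python. This is a sketch of the likely mechanism, not of PHP's internals: Python's re module has no \pL, so str.isalpha() stands in for it, and both function names are made up for the example.

```python
ALLOWED = "0123456789 -"


def strip_bytewise(raw: bytes) -> bytes:
    """Filter one byte at a time, roughly what a regex without UTF-8 mode does."""
    kept = bytearray()
    for byte in raw:
        ch = chr(byte)  # each byte is judged as its own code point 0-255
        if ch.isalpha() or ch in ALLOWED:
            kept.append(byte)
    return bytes(kept)


def strip_textwise(raw: bytes) -> bytes:
    """Decode first, filter whole characters, then encode again (like the u flag)."""
    text = raw.decode("utf-8")
    kept = "".join(ch for ch in text if ch.isalpha() or ch in ALLOWED)
    return kept.encode("utf-8")
```

For the UTF-8 bytes of "Müller!", the byte-wise version keeps 0xC3 (a letter when read as a single code point) but drops 0xBC, leaving a sequence that is no longer valid UTF-8. The text-wise version removes only the exclamation mark.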

How to find file encoding type or convert any encoding type to UTF-8 in shell?

I get text files of random encoding format (UCS-2LE, ANSI, UTF-8, UCS-2BE, etc.). I have to convert these files to UTF-8.
For the conversion I am using the following command:
iconv options -f from-encoding -t utf-8 <inputfile > outputfile
But if an incorrect from-encoding is provided, then an incorrect file is generated.
I want a way to find the encoding of the input file.
Thanks in advance
On Linux you could try using file(1) on your unknown input file. Most of the time it would guess the encoding correctly. Or else try several encodings to iconv till you "feel" that the result is acceptable (for example if you know that the file is some Russian poetry, you might try KOI-8, UTF-8, etc.... till you recognize a good Russian poem).
But character encoding is a nightmare and can be ambiguous. The provider of the file should tell you what encoding he used (and there is no way to get that encoding reliably and in all cases : there are some byte sequences which would be valid and interpreted differently with various encodings).
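The "try several encodings" approach, and the ambiguity it runs into, can be sketched in Python. The function name and the default list of candidates are assumptions chosen for illustration; a human still has to look at the results and decide which one reads correctly.

```python
def candidate_decodings(data: bytes, encodings=("utf-8", "utf-16", "koi8-r", "cp1252")):
    """Return {encoding: decoded text} for every candidate that decodes without error."""
    results = {}
    for name in encodings:
        try:
            results[name] = data.decode(name)
        except UnicodeError:
            # This candidate cannot be right; the bytes are invalid in it.
            pass
    return results
```

Bytes produced by encoding a Russian word as KOI8-R are rejected as UTF-8, but they also decode without error as Windows-1252, only to different (accented Latin) characters. The code can rule encodings out; it cannot tell which of the remaining ones was intended.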
(Notice that the HTTP protocol states the encoding explicitly.)
In 2017, better use UTF-8 everywhere (see http://utf8everywhere.org/), so ask your human partners to send you UTF-8 (hopefully most of your files are in UTF-8, since today they all should be).
(So encoding is more a social issue than a technical one.)
I get text file of random encoding format
Notice that a "random encoding" doesn't exist. You want and need to find out what character encoding (and file format) has been used by the provider of that file (so you mean an "unknown encoding", not a "random" one).
BTW, do you have a formal, unambiguous, sound and precise definition of text file, beyond file without zero bytes, or files with few control characters? LaTeX, C source, Markdown, SQL, UUencoding, shar, XPM, and HTML files are all text files, but very different ones!
You probably want to expect UTF-8, and you might use the file extension as some hint. Knowing the media-type could help.
(so if HTTP has been used to transfer the file, it is important to keep (and trust) the Content-Type...; read about HTTP headers)
[...] then incorrect file is generated.
How do you know that the resulting file is incorrect? You can only know if you have some expectations about that result (e.g. that it contains Russian poetry, not junk characters; but perhaps these junk characters are some bytecode to some secret interpreter, or some music represented in weird fashion, or encrypted, etc....). Raw files are just sequences of bytes, you need some extra knowledge to use them (even if you know that they use UTF-8).
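We do file encoding conversion with
vim -c "set encoding=utf8" -c "set fileencoding=utf8" -c "wq" filename
It's working fine; there is no need to give the source encoding. (Vim guesses the source encoding from its 'fileencodings' list, so the guess can still be wrong.)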

How does visual studio resolve unicode string from different encoding source file ?

I know that if I use the Unicode character set in VS, I can write L"There is a string" to represent a Unicode string. I think There is a string is read from the source file when VS does lexical parsing, and VS decodes it to Unicode from the source file's encoding.
I have changed the source file to several different encodings, but I always get the correct Unicode data from the L prefix. Does VS detect the encoding of the source file in order to convert There is a string to the correct Unicode? If not, how does VS achieve this?
I'm not sure whether this question can be asked on SO; if not, where should I ask? Thanks in advance.
VS won't detect the encoding without a BOM [1] signature at the start of a source file. It will just assume the localized ANSI encoding if no BOM is present.
A BOM signature identifies the UTF8/16/32 encoding used. So if you save something as UTF-8 (VS will add a BOM) and remove the first 3 bytes (EF BB BF), then the file will be interpreted as CP1252 on US Windows, but GB2312 on Chinese Windows, etc.
You are on Chinese Windows, so either save as GB2312 (without BOM) or UTF8 (with BOM) for VS to decode your source code correctly.
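The rule described in this answer (BOM first, otherwise the system's ANSI code page) can be illustrated with a small Python sketch. It is not how the compiler is implemented; interpret_source is a hypothetical helper, and the Python codec names cp1252 and gb2312 stand in for the US and Chinese Windows code pages.

```python
import codecs


def interpret_source(raw: bytes, ansi_codepage: str) -> str:
    """Decode source bytes: honour a UTF-8 BOM, else assume the ANSI code page."""
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):].decode("utf-8")
    # No signature: the result depends entirely on the machine's code page.
    return raw.decode(ansi_codepage, errors="replace")


literal = "中文"
with_bom = literal.encode("utf-8-sig")  # EF BB BF followed by the UTF-8 bytes
without_bom = with_bom[3:]              # the same file with the signature removed

same_everywhere = interpret_source(with_bom, "cp1252")  # BOM wins, code page ignored
on_us_windows = interpret_source(without_bom, "cp1252")
on_cn_windows = interpret_source(without_bom, "gb2312")
```

With the BOM the literal decodes to the same text whatever the code page is. Without it, the very same bytes turn into different and wrong text on each system.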
[1] https://en.wikipedia.org/wiki/Byte_order_mark

Change file (editor) encoding in jgrasp to UTF-8

So... I was testing jGrasp, and when I opened my test file I saw something like this:
¿Khà?
instead of this:
¿Khà?
When I compiled it, at first I got the weird characters in the output (the encoding was wrong). So I changed the encoding under Workspace > Charset (the default, I/O and Cygwin settings) to UTF-8 and got the correct output (like in the second image)... but the file still looks the same in jGrasp.
If I change it in jGrasp so that it looks "good", it will look different in other text editors (and also to the compiler).
EDIT
I have found a few other encodings that work, but they aren't UTF-8, and I also don't want to keep changing the encoding all the time.
I'm not clear on exactly what the problem is, but if you need to open and/or edit a single file with a specific encoding different from the default, use "File" > "Open" and specify the charset on the dialog. The charset choice will be remembered.

How does Windows Notepad interpret characters?

I was wondering how Windows interprets characters.
I made a file with a hex editor with the 3 bytes E3 81 81.
Those bytes are the ぁ character in UTF-8.
I opened the file in Notepad and it displayed ぁ. I didn't specify the encoding of the file; I just created the bytes, and Notepad interpreted them correctly.
Is notepad somehow guessing the encoding?
Or is the hex editor saving those bytes with a specific encoding?
If the file only contains these three bytes, then there is no information at all about which encoding to use.
A byte is just a byte, and there is no way to include any encoding information in it. Besides, the hex editor doesn't even know that you intended to decode the data as text.
Notepad normally uses ANSI encoding, so if it reads the file as UTF-8 then it has to guess the encoding based on the data in the file.
If you save a file as UTF-8, Notepad will put the BOM (byte order mark) EF BB BF at the beginning of the file.
Notepad makes an educated guess. I don't know the details, but loading the first few kilobytes and trying to convert them from UTF-8 is very simple, so it probably does something similar to that.
...and sometimes it gets it wrong...
https://ychittaranjan.wordpress.com/2006/06/20/buggy-notepad/
There is an easy and efficient way to check whether a file is in UTF-8. See Wikipedia: http://en.wikipedia.org/w/index.php?title=UTF-8&oldid=581360767#Advantages, fourth bullet point. Notepad probably uses this.
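The validity check these answers allude to can be sketched in Python. This is a guess at the general idea, not Notepad's actual algorithm: guess_decoding is a hypothetical name, and cp1252 is assumed as the ANSI code page.

```python
def guess_decoding(data: bytes, ansi_codepage: str = "cp1252"):
    """Return (encoding, text): UTF-8 if the bytes are valid UTF-8, else the ANSI code page."""
    try:
        # Text in a legacy code page rarely happens to form valid UTF-8 sequences.
        return "utf-8", data.decode("utf-8")
    except UnicodeDecodeError:
        return ansi_codepage, data.decode(ansi_codepage, errors="replace")
```

The three bytes E3 81 81 from the question pass the UTF-8 check and come back as ぁ. The guess can still be wrong in the other direction: the two bytes C3 A9 are valid UTF-8 (é), but they are also a legal two-character string in Windows-1252.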
Wikipedia claims that Notepad used the IsTextUnicode function, which checks whether a particular text is written in UTF-16 (it may have stopped using it in Windows Vista, which fixed the "Bush hid the facts" bug): http://en.wikipedia.org/wiki/Bush_hid_the_facts.
How can I identify which encoding a file is in?
Go to the file and try to Save As... and you can see the default (current) encoding of the file (in which encoding it is saved).

Resources