So... I was testing jGRASP, and when I opened my test file I saw something like this:
Â¿KhÃ ?
instead of this:
¿Khà?
but when I compiled it, I first got the weird characters (the encoding was wrong). So I changed the encoding under Workspace > Charset (the default, I/O, and Cygwin ones) to UTF-8 and got the correct output (like in the second image)... but the file still looks the same in jGRASP.
If I change it in jGRASP so it looks "good", it will look different in other text editors (and also to the compiler).
EDIT
I have found a few other encodings that work, but they aren't UTF-8, and I don't want to keep changing the encoding all the time.
I'm not clear on exactly what the problem is, but if you need to open and/or edit a single file with a specific encoding different from the default, use "File" > "Open" and specify the charset in the dialog. The charset choice will be remembered.
I have a CSV with content that is UTF-8 encoded. However, various applications and systems erroneously detect the encoding of the CSV as Windows-1252, which breaks all the special characters in the file (e.g. umlauts).
I can see that Sublime Text (on Windows), for example, also automatically detects the wrong Windows-1252 encoding when opening the file for the first time, showing garbled text where special characters are supposed to be.
When I choose Reopen with Encoding » UTF-8, everything looks fine, as expected.
Now, to find the source of the error, I thought it might help to figure out why these applications are not automatically detecting the correct encoding in the first place. Maybe there is a stray character somewhere with the wrong encoding, for example.
The CSV in question is actually an automatically generated product export of a Magento 2 installation. Recently the character encodings broke and I am currently trying to figure out what happened - hence my investigation on why this export is detected as Windows-1252.
Is there any reliable way of figuring out why the automatic detection in applications like Sublime Text settles on the wrong character encoding?
This is what I did in the end to find out why the file was not detected as UTF-8, i.e. to find the characters that were not encoded in UTF-8. Since PHP is more readily available to me, I decided to use the following script to force-convert anything that is not UTF-8 to UTF-8, using the very handy neitanod/forceutf8 library.
<?php
require 'vendor/autoload.php'; // neitanod/forceutf8, assuming it was installed via Composer

$before = file_get_contents('export.csv');
$after = \ForceUTF8\Encoding::toUTF8($before); // re-encodes any non-UTF-8 bytes as UTF-8
file_put_contents('export.fixed.csv', $after);
Then I used a file comparison tool like Beyond Compare to compare the two resulting CSVs, in order to see more easily which characters were not originally encoded in UTF-8.
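If you don't have a comparison tool at hand, you can also locate the offending bytes directly. Here is a minimal sketch in Python (using the same example filename export.csv) that reports each position where the file stops being valid UTF-8:
data = open('export.csv', 'rb').read()
pos = 0
while pos < len(data):
    try:
        data[pos:].decode('utf-8')
        break  # everything from here on is valid UTF-8
    except UnicodeDecodeError as e:
        bad = pos + e.start  # e.start is relative to the slice
        print('invalid UTF-8 at byte %d: %r' % (bad, data[bad:bad + 8]))
        pos = bad + 1  # skip the offending byte and keep scanning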
This in turn showed me that only one particular column of the export was affected. Upon further investigation I found out that the contents of that column were processed in PHP with the following preg_replace:
$value = preg_replace('/([^\pL0-9 -])+/', '', $value);
Using \p in the regular expression without the u (Unicode) modifier had an unintended side effect: the pattern is matched byte by byte, so it can strip individual bytes out of multi-byte UTF-8 sequences, leaving the special characters with a broken encoding. A quick solution is to add the u flag to the regex (see the regex pattern modifiers reference), which forces preg_replace to treat both the pattern and the subject as UTF-8:
$value = preg_replace('/([^\pL0-9 -])+/u', '', $value);
See also this answer.
I wrote a script with German special characters e.g. ü.
However, whenever I close R and reopen the script the characters are substituted:
Before "für"; "hinzufügen"; "Ø" - After "fÃ¼r"; "hinzufÃ¼gen"; "Ã".
I tried to remedy it by using "Save with Encoding..." and choosing UTF-8, as stated here, but it did not work.
What am I missing?
You don't say what OS you're using, but this kind of thing really only happens on Windows nowadays, so I'll assume that.
The problem is that Windows has a local encoding that is not UTF-8. It is commonly something like Latin1 in English-speaking countries. I'm not sure what encoding people use in German-speaking countries, if that's where you are. From the junk you saw, it looks as though you saved the file in UTF-8, then read it using your local encoding. The encodings for writing and reading have to match if you want things to work.
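For what it's worth, the junk in the question can be reproduced exactly by that write/read mismatch. A quick sketch in Python, with cp1252 standing in for the Windows local encoding:
raw = 'für'.encode('utf-8')  # the bytes actually written to disk: b'f\xc3\xbcr'
print(raw.decode('cp1252'))  # read back with the local encoding: prints fÃ¼r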
In RStudio you can try "Reopen with encoding..." and specify UTF-8, and you'll probably get your original back, as long as you haven't saved it after the bad read. If you did that, you've got a much harder cleanup to do.
I have a solution in Visual Studio, and my program's language is Brazilian Portuguese.
Every time I compile and run it, it simply doesn't show the characters I wrote.
Example:
#include <stdio.h>

int main(void) {
    printf("áéíóúàèìòù\n");
    return 0;
}
It simply shows something really strange.
However, when I tested another time by redirecting the output to a file, it showed the right text, so I think the problem might be in cmd.
Then I searched for what might be causing the problem, and the results basically pointed to the code page cmd uses.
I finally tried chcp 1252, but it seems it doesn't work for me, so here I am. Does anyone know what code page I should use, or what I can do to the source file so that it shows the right output? Thanks in advance.
I'm assuming C or C++.
The reason is that the file is saved with UTF-8 encoding, and the string literals are treated as a sequence of bytes.
So if you have "é" in your source code, it's stored as the two bytes "\xC3\xA9", and those get displayed in CP-437 (the default code page of the Windows Command Prompt on US-English systems) as ├⌐.
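You can reproduce this effect outside the compiler. A small sketch in Python, with cp437 playing the role of the console code page:
raw = 'é'.encode('utf-8')    # the bytes in the UTF-8 source file: b'\xc3\xa9'
print(raw.decode('cp437'))   # what the cmd window shows: ├⌐
print(raw.decode('cp1252'))  # why chcp 1252 doesn't help either: Ã©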
Solution: either:
save your source files in some 8-bit encoding (for example CP-1252), change the default encoding in VS, and set the terminal to use the same encoding,
or change your terminal to something that supports UTF-8, like Cygwin.
I created an ordinary text file on Windows 7 64-bit using GNU Emacs 23.3.1. I can edit the file with other programs such as LinqPad (the file happens to be a LinqPad script, extension .linq). Everything is fine until I put a Unicode character in the file, such as the Greek letter λ (lambda). I can input the letter in Emacs and it displays correctly. However, Emacs refuses to save the file, reporting the following error:
Failure in loading charset map: 8859-7
If I input the λ in LinqPad, Emacs will read and display it, but will not save the file.
I just noticed that Notepad++ has other unexpected behavior with this file: it does not display the λ's, but instead shows pairs of odd characters such as Î». That fits an untuition (pun intended) that the Unicode chars are being stored as pairs of bytes. So it looks like this is a kind of ambiguous situation (storing Unicode in text files), but it also looks like LinqPad and Visual Studio "do the obvious thing."
I want to use emacs because it's the only program that I have that reflows sequences of commented lines (lines after //, reflows them with Alt-Q), and I want to use greek characters in my comments because I'm describing a mathematical program.
I'll be grateful for advice and answers.
UPDATE: some advice in other questions said to try M-x describe-char, also bound to C-x = ; both of those give me the same failure message as above, so they're on the right track, just not answers.
This once happened to me when I had upgraded all packages (including Emacs) without realising I still had an Emacs session open during the upgrade. Next time I asked it to save some Unicode, it tried to load 8859-7 and failed because the path was different in the upgraded version. I had to redo the edit after restarting Emacs.
I just noticed that Notepad++ has other unexpected behavior with this file: it does not display the λs, but instead pairs of odd characters such as Î».
Î» is what you get when you interpret the byte sequence 0xCE, 0xBB using the encoding ISO-8859-1, or Windows code page 1252 (Western European). Code page 1252 is probably the default (‘ANSI’) code page on your machine.
0xCE, 0xBB is the UTF-8 encoding of the character λ (U+03BB Greek small letter lambda). So to display it correctly you need to tell your text editor that the file is saved in UTF-8 and not ANSI.
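The mismatch is easy to demonstrate. A short sketch in Python:
raw = b'\xce\xbb'            # the two bytes stored in the file
print(raw.decode('utf-8'))   # λ - the intended character
print(raw.decode('cp1252'))  # Î» - what you see under the ANSI code page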
In Notepad++, choose UTF-8 from the menu bar ‘Encoding’ entry.
In Emacs, C-x C-m c utf-8-dos (or unix or whatever) as a prefix to opening or saving the file. Hopefully by saving in UTF-8 you'll avoid whatever the problem is with the ISO 8859-7 (Greek) map; you certainly don't want to be saving any files in 8859-7, or indeed anything but UTF-8, if you can help it.
If we don't specify the encoding, which encoding will these methods use?
I do not think it's System.Text.Encoding.Default. Things work well if I EXPLICITLY pass System.Text.Encoding.Default, but things go wrong when I leave it out.
So this doesn't work well:
Dim b = System.IO.File.ReadAllText("test.txt")
System.IO.File.WriteAllText("test4.txt", b)
but this works well:
Dim b = System.IO.File.ReadAllText("test.txt", System.Text.Encoding.Default)
System.IO.File.WriteAllText("test4.txt", b, System.Text.Encoding.Default)
If we do not specify an encoding, will VB.NET try to figure out the encoding from the text file?
Also, what is System.Text.Encoding.Default? It's the system default, but what is my system default, and how can I change it?
How do I know the encoding used in a text file?
If I create a new text file and open it with SciTE, I see that the encoding is given as a code page property. What is that code page property?
Look here, in the documentation for File.ReadAllText: "This method attempts to automatically detect the encoding of a file based on the presence of byte order marks. Encoding formats UTF-8 and UTF-32 (both big-endian and little-endian) can be detected."
see also http://msdn.microsoft.com/en-us/library/ms143375(v=vs.110).aspx
For writing, the overload without an encoding parameter "uses UTF-8 encoding without a Byte-Order Mark (BOM)".
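To see what that BOM sniffing keys on, you can inspect the first bytes of the file yourself. A rough sketch in Python (test.txt is the file from the question; the fallback note reflects ReadAllText's documented behavior):
boms = [
    (b'\xff\xfe\x00\x00', 'UTF-32 LE'),
    (b'\x00\x00\xfe\xff', 'UTF-32 BE'),
    (b'\xef\xbb\xbf', 'UTF-8'),
    (b'\xff\xfe', 'UTF-16 LE'),
    (b'\xfe\xff', 'UTF-16 BE'),
]
head = open('test.txt', 'rb').read(4)
found = next((name for bom, name in boms if head.startswith(bom)), None)
print(found or 'no BOM - ReadAllText then falls back to UTF-8 without a BOM')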