Codemirror (AjaXplorer/Linux Webserver): ANSI-files with special characters - utf-8

I have a AjaXplorer installation on a Linux Webserver.
One of the plugins of AjaXplorer is Codemirror - to view and edit text files.
Now I have the following situation: If I create a txt-File on Windows (ANSI) and upload it into Ajaxplorer (UTF-8), Codemirror shows every special character as a question mark. Consequently the whole file will be saved with question marks instead of the special characters.
But if a file once is saved in UTF-8, the special characters will be saved correctly.
So the problem exists in opening the ANSI-File. This is the point where I have to implement the solution, for example convert ANSI to UTF-8.
The 'funny' thing is, that if I open a fresh uploaded ANSI-File and a saved UTF-8 file with VIM on the Linux console, they look exactly the same, but the output in codemirror is different.
This is a uploaded ANSI-File with special characters like ä and ö and ü
Output in Codemirror: 'like ? and ? and ?'
This is a saved UTF8-File with special characters like ä and ö and ü
Output in Codemirror 'like ä and ö and ü'
This is the CodeMirror-Class of AjaXplorer and I think here must be the point where I could intervene:
https://github.com/mattleff/AjaXplorer/blob/master/plugins/editor.codemirror/class.CodeMirrorEditor.js
As you may see, I'm not a pro and I already tried some code pieces - otherwise I already had the solution ;-) I would be happy if someone gives me a hint! Thank you!!!

Related

find reason for automatic encoding detection (UTF-8 vs Windows-1252)

I have a CSV with content that is UTF-8 encoded. However, various applications and systems errorneously detect the encoding of the CSV as Windows-1252, which breaks all the special characters in the file (e.g. Umlauts).
I can see that Sublime Text (on Windows) for example also automatically detects the wrong Windows-1252 encoding, when opening the file for the first time, showing garbled text where special characters are supposed to be.
When I choose Reopen with Encoding » UTF-8, everything will look fine, as expected.
Now, to find the source of the error I thought it might help to figure out, why these applications are not automatically detecting the correct encoding in the first place. May be there is a stray character somewhere with the wrong encoding for example.
The CSV in question is actually an automatically generated product export of a Magento 2 installation. Recently the character encodings broke and I am currently trying to figure out what happened - hence my investigation on why this export is detected as Windows-1252.
Is there any reliable way of figuring out why the automatic detection of applications like Sublime Text assume the wrong character encoding?
This is what I did in the end to find out why the file was not detected as UTF-8, i.e. to find the characters that were not encoded in UTF-8. Since PHP is more readily available to me, I decided to simply use the following script, to force convert anything that is not UTF-8 to UTF-8, using the very handy neitanod/forceutf8 library.
$before = file_get_contents('export.csv');
$after = \ForceUTF8\Encoding::toUTF8($before);
file_put_contents('export.fixed.csv', $after);
Then I used a file comparison tool like Beyond Compare to compare the two resulting CSVs, in order to see more easily which characters were not originally encoded in UTF-8.
This in turn showed me that only one particular column of the export was affected. Upon further investigation I found out that the contents of that column were processed in PHP with the following preg_replace:
$value = preg_replace('/([^\pL0-9 -])+/', '', $value);
Using \p in the regular expression had an unknown side effect: all the special characters were converted to another encoding. A quick solution to this is to use the u flag on the regex (see regex pattern modifiers reference). This forces the resulting encoding of this preg_replace to be UTF-8. See also this answer.

RStudio: keeping special characters in a script

I wrote a script with German special characters e.g. ü.
However, whenever I close R and reopen the script the characters are substituted:
Before "für"; "hinzufügen"; "Ø" - After "für"; "hinzufügen"; "Ã".
I tried to remedy it using save with encoding and choosing UTF-8 as it is stated here but it did not work.
What am I missing?
You don't say what OS you're using, but this kind of thing really only happens on Windows nowadays, so I'll assume that.
The problem is that Windows has a local encoding that is not UTF-8. It is commonly something like Latin1 in English-speaking countries. I'm not sure what encoding people use in German-speaking countries, if that's where you are. From the junk you saw, it looks as though you saved the file in UTF-8, then read it using your local encoding. The encodings for writing and reading have to match if you want things to work.
In RStudio you can try "Reopen with encoding..." and specify UTF-8, and you'll probably get your original back, as long as you haven't saved it after the bad read. If you did that, you've got a much harder cleanup to do.

Some Chinese, Japanese characters appear garbled on IE/FF

We have a content rendering site that displays content in multiple languages. The site is built using JSP and content fetched from Oracle DB. All our pages are UTF-8 compliant.
When displaying zh/jp content, out of the complete content, only some of the character appear garbled (square boxes on IE and diamond question marks on FF). The data in DB does not have any garbled character. Since we dont understand the language, we dont know what characters are problematic. Would appreciate some pointers to the solution please. Could it be that some characters may be appearing invalid to the browsers?
Example in FF:
ネット犯���者 がアプ
脆弱性保護機能 - ネット犯���者 がアプリケーションのセキュリティホール (脆弱性) を突いて、パソコンに脅威を侵入させることを阻止します。
In doubt on the sanity of a UTF-8 encoding, you can always re-encode it either with a good text editor or with a specialized tool like iconv :
iconv -f UTF-8 -t UTF-8 yourfile > youfile2
If your file is indeed invalid, iconv will also give you some information on the problem.
But, another way you might want to explore is installing new fonts for Far East languages…
Indeed, not knowing the actual bytes used in your file, it is hard to say why their are replaced with a replacement character � (U+FFFD). Thus, you might want to post a hex dump of the parts of your file that do not work.

Emacs failure in loading charset map when saving file with unicode

I created an ordinary text file on Windows 7 64-bit using gnu emacs 23.3.1. I can edit the file with other programs such as LinqPad (the file happens to be a linqpad script, extension .linq). Everything is fine until I put a Unicode character in the file, a character such as the greek letter λ (lambda). I can input the letter in emacs and it displays correctly. However, emacs refuses to save the file, reporting the following error
Failure in loading charset map: 8859-7
If I input the λ in LinqPad, emacs will read and display them, but will not save the file.
I just noticed that Notepad++ has other unexpected behavior with this file: it does not display the λ's, but instead pairs of odd characters such as λ. That is fitting to an untuition (pun intended) that the unicode chars are being stored as pairs. So it looks like this is a kind of ambiguous situation (storing unicode in text files), but it also looks like linqPad and visual studio "do the obvious thing."
I want to use emacs because it's the only program that I have that reflows sequences of commented lines (lines after //, reflows them with Alt-Q), and I want to use greek characters in my comments because I'm describing a mathematical program.
I'll be grateful for advice and answers.
UPDATE: some advice in other questions said to try M-x describe-char, also bound to C-x = ; both of those give me the same failure message as above, so they're on the right track, just not answers.
This once happened to me when I had upgraded all packages (including Emacs) without realising I still had an Emacs session open during the upgrade. Next time I asked it to save some Unicode, it tried to load 8859-7 and failed because the path was different in the upgraded version. I had to redo the edit after restarting Emacs.
I just noticed that Notepad++ has other unexpected behavior with this file: it does not display the λs, but instead pairs of odd characters such as λ.
λ is what you get when you interpret the byte sequence 0xCE, 0xBB using the encoding ISO-8859-1, or Windows code page 1252 (Western European). Code page 1252 is probably the default (‘ANSI’) code page on your machine.
0xCE, 0xBB is the UTF-8 encoding of the character λ (U+03BB Greek small letter lambda). So to display it correctly you need to tell your text editor that the file is saved in UTF-8 and not ANSI.
In Notepad++, choose UTF-8 from the menu bar ‘Encoding’ entry.
In Emacs, C-x C-m c utf-8-dos (or unix or whatever) as a prefix to opening or saving the file. Hopefully by saving in UTF-8 you'll avoid whatever the problem is with the ISO 8859-7 (Greek) map; you certainly don't want to be saving any files in 8859-7, or indeed anything but UTF-8, if you can help it.

How does Windows Notepad interpret characters?

I was wondering how Windows interprets characters.
I made a file with a hex editor with the 3 bytes E3 81 81.
Those bytes are the ぁ character in UTF-8.
I opened the notepad and it displayed ぁ. I didn't specify the encoding of the file, I just created the bytes and the notepad interpreted it correctly.
Is notepad somehow guessing the encoding?
Or is the hex editor saving those bytes with a specific encoding?
If the file only contains these three bytes, then there is no information at all about which encoding to use.
A byte is just a byte, and there is no way to include any encoding information in it. Besides, the hex editor doesn't even know that you intended to decode the data as text.
Notepad normally uses ANSI encoding, so if it reads the file as UTF-8 then it has to guess the encoding based on the data in the file.
If you save a file as UTF-8, Notepad will put the BOM (byte order mark) EF BB BF at the beginning of the file.
Notepad makes an educated guess. I don't know the details, but loading the first few kilobytes and trying to convert them from UTF-8 is very simple, so it probably does something similar to that.
...and sometimes it gets it wrong...
https://ychittaranjan.wordpress.com/2006/06/20/buggy-notepad/
There is an easy and efficient way to check whether a file is in UTF-8. See Wikipedia: http://en.wikipedia.org/w/index.php?title=UTF-8&oldid=581360767#Advantages, fourth bullet point. Notepad probably uses this.
Wikipedia claims that Notepad used the IsTextUnicode function, which checks whether a patricular text is written in UTF-16 (it may have stopped using it in Windows Vista, which fixed the "Bush hid the facts" bug): http://en.wikipedia.org/wiki/Bush_hid_the_facts.
how to identify the file is in which encoding ....?
Go to the file and try to Save As... and you can see the default (current) encoding of the file (in which encoding it is saved).

Resources