Internationalizing (i18n) my extension, translation file upload error - UTF-8

Getting the following error when uploading my extension:
An error occurred: Message JSON file must be in UTF-8 encoding.
I have about 19 translation files.
When I run the following command locally:
file extension/_locales/[locale]/messages.json
I get:
extension/_locales/[locale]/messages.json: UTF-8 Unicode English text
On a few locale translations (Polish, Catalan, Portuguese, French, etc.) I get the following message:
extension/_locales/[locale]/messages.json: UTF-8 Unicode English text, with very long lines
I have tracked the upload error from the Chrome Web Store down to the locale translation files that produce 'with very long lines' in the output of the 'file' command.
I'm not really sure how to fix this problem. Any advice?
Oh, I should mention the translation files:
Don't have a BOM
Contain no comments
UPDATE:
This error was caused by two problems:
Forgot to remove a comment in one of the locale json files.
There was a bad character in a few locale files.
Really makes me frustrated that I didn't run into this problem locally during development. C'mon Chrome...
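For anyone else hitting this: a small pre-upload check would have caught both problems for me. Below is a rough sketch in PHP (any scripting language would do; it only assumes the extension/_locales layout shown above) that flags files with a BOM, invalid UTF-8 bytes, or strict-JSON parse errors such as leftover comments:
// Rough pre-upload check for Chrome i18n locale files.
foreach (glob('extension/_locales/*/messages.json') as $path) {
    $raw = file_get_contents($path);

    // A UTF-8 BOM is the byte sequence EF BB BF at the start of the file.
    if (substr($raw, 0, 3) === "\xEF\xBB\xBF") {
        echo "$path: has a UTF-8 BOM\n";
    }
    // Any byte sequence that is not valid UTF-8 ("bad characters").
    if (!mb_check_encoding($raw, 'UTF-8')) {
        echo "$path: contains bytes that are not valid UTF-8\n";
    }
    // Strict JSON parsing rejects comments and trailing commas.
    json_decode($raw);
    if (json_last_error() !== JSON_ERROR_NONE) {
        echo "$path: " . json_last_error_msg() . "\n";
    }
}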

If the issue is really that the key/values are exceeding some size limit, then you might try breaking up the lines that go over 300 characters onto multiple lines:
{
  "longMessage": {
    "message": "This is a really long message
                over 300 characters that has
                been put on multiple lines"
  }
}
See this question for more details on the 300 character limit:
https://superuser.com/questions/91660/how-long-is-long-for-the-unix-file-command
It would be useful to narrow the error down to just one key/value pair, so that you can run this test on that one string instead of all of them. Also, if you are able to narrow it down to just one, then edit your question with the string so I can try it locally.
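To help with that narrowing down, a rough sketch along these lines would list the message keys whose text exceeds roughly 300 characters (the locale path is a placeholder; adjust as needed):
// List message keys longer than ~300 characters, the point at which the
// 'file' heuristic starts reporting "with very long lines".
$messages = json_decode(file_get_contents('extension/_locales/pl/messages.json'), true);
foreach ($messages as $key => $entry) {
    $length = strlen($entry['message']);
    if ($length > 300) {
        echo "$key: $length bytes\n";
    }
}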

Related

How to archive an entire FTP server where many of the filenames seem to include illegal characters

I am trying to use wget -m <address> to download the contents of an FTP server. A lot of the content is Icelandic and so contains a bunch of weird characters that I think are causing issues, as I keep seeing:
Incomplete or invalid multibyte sequence encountered
I have tried adding flags such as --restrict-file-names=nocontrol but to no avail.
I have also tried using lftp, but it doesn't seem to make any difference.
According to the wget manual:
If you specify ‘nocontrol’, then the escaping of the control
characters is also switched off.
so 'nocontrol' is actually more permissive than the default. A bunch of weird characters suggests you have problems getting the encoding right, so the 'ascii' mode seems to be the best fit for your use case:
The ‘ascii’ mode is used to specify that any bytes whose values are
outside the range of ASCII characters (that is, greater than 127)
shall be escaped. This can be useful when saving filenames whose
encoding does not match the one used locally.
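Concretely, that would be an invocation along these lines (untested; substitute the real server address):
wget -m --restrict-file-names=ascii ftp://example.is/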
As I have no way to test this, please try it and write back about the result.

System date format causing date in file to change when copied using IO.File.Copy()

A user ran into an exception the other day when my code tried to parse a date from a line of text in a configuration file. The customer was using the Chinese date localization, so I figured that the issue was the parsing.
However, when reproducing the problem, I found the text in the file itself was in the Chinese format. This file is copied from a backup location, which I verified to not be in the Chinese format.
It turns out when the file was previously copied from that location by a call to IO.File.Copy(), the text changed from:
22/JUN/2016,00:00:00
to
22/6月/2016,00:00:00
The size of the file even changed.
Due to this, an exception was thrown when trying to parse that text on this call:
DateTime.ParseExact(timeString, datetimeFormat, CultureInfo.InvariantCulture)
The data doesn't have to be localized, so we always use CultureInfo.InvariantCulture. However, if the file changes the date format, this breaks.
When I copy and paste the file as usual, the file doesn't change, which is even more curious.
I verified this problem to occur on Windows 10, after changing the Regional Format to Chinese (Simplified, China).
Any ideas as to why IO.File.Copy() causes this change?
It's always good to use TryParseExact when parsing a DateTime.
To answer your question, there are a few things you should look into:
Look into the code used to copy the file.
Try using a StreamReader/StreamWriter to copy the file, and specify the encoding explicitly.

find reason for automatic encoding detection (UTF-8 vs Windows-1252)

I have a CSV with content that is UTF-8 encoded. However, various applications and systems erroneously detect the encoding of the CSV as Windows-1252, which breaks all the special characters in the file (e.g. umlauts).
I can see that Sublime Text (on Windows) for example also automatically detects the wrong Windows-1252 encoding, when opening the file for the first time, showing garbled text where special characters are supposed to be.
When I choose Reopen with Encoding » UTF-8, everything will look fine, as expected.
Now, to find the source of the error, I thought it might help to figure out why these applications are not automatically detecting the correct encoding in the first place. Maybe there is a stray character somewhere with the wrong encoding, for example.
The CSV in question is actually an automatically generated product export of a Magento 2 installation. Recently the character encodings broke and I am currently trying to figure out what happened - hence my investigation on why this export is detected as Windows-1252.
Is there any reliable way of figuring out why the automatic detection of applications like Sublime Text assumes the wrong character encoding?
This is what I did in the end to find out why the file was not detected as UTF-8, i.e. to find the characters that were not encoded in UTF-8. Since PHP is more readily available to me, I decided to simply use the following script to force-convert anything that is not UTF-8 to UTF-8, using the very handy neitanod/forceutf8 library.
$before = file_get_contents('export.csv');
$after = \ForceUTF8\Encoding::toUTF8($before);
file_put_contents('export.fixed.csv', $after);
Then I used a file comparison tool like Beyond Compare to compare the two resulting CSVs, in order to see more easily which characters were not originally encoded in UTF-8.
This in turn showed me that only one particular column of the export was affected. Upon further investigation I found out that the contents of that column were processed in PHP with the following preg_replace:
$value = preg_replace('/([^\pL0-9 -])+/', '', $value);
Using \pL in the regular expression without the u modifier had an unintended side effect: the pattern was applied to the raw bytes, which mangled the multibyte UTF-8 sequences of the special characters. A quick solution is to add the u flag to the regex (see the regex pattern modifiers reference). This forces preg_replace to treat the pattern and subject as UTF-8, so the result stays UTF-8. See also this answer.
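For reference, the corrected call then looks like this (assuming $value already holds UTF-8 text, as in the export):
// The "u" modifier makes PCRE treat the pattern and subject as UTF-8, so
// \pL matches multibyte letters (umlauts etc.) as whole characters instead
// of operating on individual bytes.
$value = preg_replace('/([^\pL0-9 -])+/u', '', $value);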

Having trouble with Japanese text in PHPUnit

I have a ZF2 application that uses a lot of Japanese. I'm trying to test the output of things such as people's names and addresses. But when I try to run $this->assertQueryContentContains() with DOM elements containing Japanese characters, the tests fail. What's more, the output in the console shows characters that are completely different from the ones I used. For example, I ran the following test:
$this->assertQueryContentContains('span#address', '<strong>Address:</strong> 〒300-1234 茨城県つくば市上郷1-2-3');
The output of the console showed this:
Failed asserting node denoted by span#address CONTAINS content "<strong>Address:</strong> 縲・00-1234縲闌ィ蝓守恁縺、縺上・蟶ゆク企・・托シ搾
As you can see (assuming your browser can properly display Japanese), the characters being output are completely different from the ones that I actually entered, which leads me to believe there's some kind of setting I need to set to allow testing utf-8. The test unit's file is encoded in utf-8 without BOM (via Notepad++).
You might try using the mb_convert_encoding function on the string in your assertion.
E.g.:
$this->assertQueryContentContains('span#address', mb_convert_encoding('<strong>Address:</strong> 〒300-1234 茨城県つくば市上郷1-2-3',"SJIS"));
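If that works, it may also be worth passing the source encoding explicitly, so the result does not depend on mbstring's internal default; a variation on the line above (not tested against ZF2):
// Convert the UTF-8 test string to Shift-JIS, stating the source encoding.
$expected = mb_convert_encoding(
    '<strong>Address:</strong> 〒300-1234 茨城県つくば市上郷1-2-3',
    'SJIS',
    'UTF-8'
);
$this->assertQueryContentContains('span#address', $expected);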

Codeigniter black diamond characters

This is more of a curiosity than an actual problem, as there is an easy and probably preferable workaround. When using CodeIgniter's form validation and displaying error messages, the CI user guide gives two ways to set one's own validation messages: through the set_message method, or by editing the language file located in the system folder.
However, when editing the language file to contain error messages in my native language (which contains special characters like 'Ä' and 'Ö'), the special characters are replaced with a black diamond. When using the set_message method from form_validation it works without a problem and the characters are properly encoded as UTF-8.
I am wondering where the problem lies when using the file instead of the method, and how to solve it?
It sounds like the file is not saved by your editor as UTF-8. Make sure that it is.
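A quick way to verify is to read the file back and check its bytes; a rough sketch (the path is hypothetical and depends on your install and language):
// Black diamonds (U+FFFD) usually mean bytes saved as ISO-8859-1 /
// Windows-1252 are being decoded as UTF-8. Check the raw bytes:
$contents = file_get_contents('system/language/finnish/form_validation_lang.php'); // hypothetical path
if (!mb_check_encoding($contents, 'UTF-8')) {
    echo "Not valid UTF-8 - re-save the file as UTF-8 (without BOM).\n";
}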
