I have a ZF2 application that uses a lot of Japanese. I'm trying to test the output of things such as people's names and addresses. But when I try to run $this->assertQueryContentContains() with DOM elements containing Japanese characters, the tests fail. What's more, the output in the console shows characters that are completely different from the ones I used. For example, I ran the following test:
$this->assertQueryContentContains('span#address', '<strong>Address:</strong> 〒300-1234 茨城県つくば市上郷1-2-3');
The output of the console showed this:
Failed asserting node denoted by span#address CONTAINS content "<strong>Address:</strong> 縲・00-1234縲闌ィ蝓守恁縺、縺上・蟶ゆク企・・托シ搾
As you can see (assuming your browser can properly display Japanese), the characters being output are completely different from the ones that I actually entered, which leads me to believe there's some kind of setting I need to change to allow testing UTF-8. The unit test file is encoded in UTF-8 without BOM (via Notepad++).
You could try using the mb_convert_encoding function in your assertion.
For example:
$this->assertQueryContentContains('span#address', mb_convert_encoding('<strong>Address:</strong> 〒300-1234 茨城県つくば市上郷1-2-3',"SJIS"));
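The garbled console output is consistent with UTF-8 bytes being displayed as Shift_JIS. Here is a minimal sketch of that effect, written in Python purely for illustration; using the cp932 codec to stand in for the console's code page is an assumption, not something confirmed in the question.

```python
# Illustration only: the same text pushed through two different encodings.
text = "茨城県つくば市"

utf8_bytes = text.encode("utf-8")

# Assumption: the console decodes program output as cp932 (Windows Shift_JIS).
# errors="replace" substitutes a marker for byte sequences invalid in cp932.
as_seen_in_console = utf8_bytes.decode("cp932", errors="replace")

# Converting to the console's encoding first lets the text survive the trip.
sjis_bytes = text.encode("cp932")
round_trip = sjis_bytes.decode("cp932")

print(as_seen_in_console != text)  # the console view is mojibake
print(round_trip == text)          # the converted text displays intact
```

This only explains why the message looks wrong in the console; whether the assertion itself compares the right bytes depends on how the page under test is encoded.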
I am trying to use wget -m <address> to download the contents of an FTP server. A lot of the content is Icelandic and so contains a bunch of non-ASCII characters that I think are causing issues, as I keep seeing:
Incomplete or invalid multibyte sequence encountered
I have tried adding flags such as --restrict-file-names=nocontrol but to no avail.
I have also tried using lftp, but it doesn't seem to make any difference.
According to the wget manual:
If you specify ‘nocontrol’, then the escaping of the control
characters is also switched off.
In other words, nocontrol is actually more permissive than the default. The weird characters suggest that the encoding is not being handled correctly, so the ascii mode seems to be the best fit for your use case:
The ‘ascii’ mode is used to specify that any bytes whose values are
outside the range of ASCII characters (that is, greater than 127)
shall be escaped. This can be useful when saving filenames whose
encoding does not match the one used locally.
I am not able to test this myself, so please try it and report the result.
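As a rough model of what the quoted ascii mode describes (this is a sketch in Python, not wget's actual implementation, and the file name and the uppercase hex digits are assumptions), every byte above 127 is replaced by a percent escape while ASCII bytes are left alone:

```python
def escape_non_ascii(name: bytes) -> str:
    """Percent-escape every byte above 127; leave ASCII bytes untouched."""
    return "".join(chr(b) if b < 128 else "%{:02X}".format(b) for b in name)


# Hypothetical remote file name, as UTF-8 bytes.
remote_name = "\u00de\u00f3r.txt".encode("utf-8")  # "Þór.txt"

print(escape_non_ascii(remote_name))  # %C3%9E%C3%B3r.txt
```

The escaped name is plain ASCII, so it can be written to disk regardless of which encoding the local system expects.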
I have a CSV with content that is UTF-8 encoded. However, various applications and systems erroneously detect the encoding of the CSV as Windows-1252, which breaks all the special characters in the file (e.g. umlauts).
I can see that Sublime Text (on Windows) for example also automatically detects the wrong Windows-1252 encoding, when opening the file for the first time, showing garbled text where special characters are supposed to be.
When I choose Reopen with Encoding » UTF-8, everything will look fine, as expected.
Now, to find the source of the error, I thought it might help to figure out why these applications are not automatically detecting the correct encoding in the first place. Maybe there is a stray character somewhere with the wrong encoding, for example.
The CSV in question is actually an automatically generated product export of a Magento 2 installation. Recently the character encodings broke and I am currently trying to figure out what happened - hence my investigation on why this export is detected as Windows-1252.
Is there any reliable way of figuring out why the automatic detection in applications like Sublime Text assumes the wrong character encoding?
This is what I did in the end to find out why the file was not detected as UTF-8, i.e. to find the characters that were not encoded in UTF-8. Since PHP is more readily available to me, I decided to simply use the following script, to force convert anything that is not UTF-8 to UTF-8, using the very handy neitanod/forceutf8 library.
$before = file_get_contents('export.csv');
$after = \ForceUTF8\Encoding::toUTF8($before);
file_put_contents('export.fixed.csv', $after);
Then I used a file comparison tool like Beyond Compare to compare the two resulting CSVs, in order to see more easily which characters were not originally encoded in UTF-8.
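An alternative to diffing two files is to scan the bytes directly and report where UTF-8 decoding fails. The following is a minimal sketch in Python rather than PHP; the function name and the sample data are made up for illustration.

```python
def find_invalid_utf8(data: bytes):
    """Return (offset, offending_bytes) for each span that is not valid UTF-8."""
    bad = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            break  # the rest of the data decodes cleanly
        except UnicodeDecodeError as err:
            # err.start and err.end are relative to the slice we decoded.
            start, end = pos + err.start, pos + err.end
            bad.append((start, data[start:end]))
            pos = end  # resume scanning after the bad span
    return bad


# Made-up sample: one field in UTF-8, one field in Windows-1252.
sample = "Gr\u00f6\u00dfe".encode("utf-8") + b";" + "Gr\u00f6\u00dfe".encode("cp1252")

for offset, raw in find_invalid_utf8(sample):
    print(offset, raw)
```

For a real export you would read the file in binary mode, e.g. `open('export.csv', 'rb').read()`, and map the reported offsets back to rows and columns.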
This in turn showed me that only one particular column of the export was affected. Upon further investigation I found out that the contents of that column were processed in PHP with the following preg_replace:
$value = preg_replace('/([^\pL0-9 -])+/', '', $value);
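Using \p in the regular expression without the u flag had a side effect I had not expected: the pattern is then matched against single bytes rather than UTF-8 characters, so multibyte special characters were partially stripped and the result was no longer valid UTF-8. A quick solution to this is to use the u flag on the regex (see regex pattern modifiers reference). This makes preg_replace treat both the pattern and the subject as UTF-8, so the result stays UTF-8. See also this answer.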
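To see why byte-wise matching damages the text, here is an approximate model in Python (not PHP, and not PCRE itself). It assumes that without the u flag each byte is classified as if it were the code point of the same value, which is what decoding as Latin-1 does; the helper names and the sample value are invented.

```python
ALLOWED_ASCII = "0123456789 -"


def strip_bytewise(raw: bytes) -> bytes:
    # Model of matching without the u flag: one byte = one character (0-255).
    chars = raw.decode("latin-1")
    kept = "".join(c for c in chars if c.isalpha() or c in ALLOWED_ASCII)
    return kept.encode("latin-1")


def strip_unicode_aware(raw: bytes) -> bytes:
    # Model of matching with the u flag: decode UTF-8 first, then classify.
    chars = raw.decode("utf-8")
    kept = "".join(c for c in chars if c.isalpha() or c in ALLOWED_ASCII)
    return kept.encode("utf-8")


value = "M\u00fcller & S\u00f6hne".encode("utf-8")  # "Müller & Söhne"

broken = strip_bytewise(value)
intact = strip_unicode_aware(value)

print(broken)  # lead bytes survive, continuation bytes are stripped
print(intact.decode("utf-8") == "M\u00fcller  S\u00f6hne")
```

In the byte-wise version the lead byte 0xC3 counts as a letter while the following byte does not, so each umlaut is cut in half, which matches the symptom of a single column no longer being valid UTF-8.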
Objective : To have multi language characters in the user id in Enovia v6
I am using UTF-8 encoding in a Tcl script, and it seems to save multi-language characters properly in the database (after some conversion). But in the UI I see the saved information exactly as it is stored in the database.
When I do the same exercise through Power Web, the saved data somehow gets converted back into the proper multi-language characters and displays properly.
Am I missing something with the Tcl approach?
Pasting one example to help understand better.
Original Name: Kátai-Pál
Name saved in database as: Kátai-Pál
In UI I see name as: Kátai-Pál
In Tcl I use the syntax below:
set encoded [encoding convertto utf-8 Kátai-Pál];
Now user name becomes: Kátai-Pál
In UI I see name as “Kátai-Pál”
The trick is to think in terms of characters, not bytes. They're different things. Encodings are ways of representing characters as byte sequences (internally, Tcl's really quite complicated, but you shouldn't ever have to care about that if you're not developing Tcl's implementation itself; suffice to say it's Unicode). Thus, when you use:
encoding convertto utf-8 "Kátai-Pál"
You're taking a sequence of characters and asking for the sequence of bytes (one per result character) that is the encoding of those characters in the given encoding (UTF-8).
What you need to do is to get the database integration layer to understand what encoding the database is using so it can convert back into characters for you (you can only ever communicate using bytes; everything else is just a simplification). There are two ways that can happen: either the information is correctly shared (via metadata or defined convention), or both sides make assumptions which come unstuck occasionally. It sounds like the latter is what's happening, alas.
If you can't handle it any other way, you can take the bytes produced out of the database layer and convert into characters:
encoding convertfrom $theEncoding $theBytes
Working out what $theEncoding should be is in general very tricky, but it sounds like it's utf-8 for you. Once you've got characters, Tcl/Tk will be able to display them correctly; it knows how to transfer them correctly into the guts of the platform's GUI. (And in scripts that you actually write, you're best off replacing non-ASCII characters with their \uXXXX escapes, because platforms don't agree on what encoding is right to use for scripts. Alas.)
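The character/byte distinction can be demonstrated outside Tcl as well. Here is a small Python sketch; the assumption that the misreading layer uses Latin-1 is mine and is not confirmed for Enovia.

```python
# Non-ASCII characters are written as \u escapes, as recommended above.
name = "K\u00e1tai-P\u00e1l"          # 9 characters

stored_bytes = name.encode("utf-8")   # like: encoding convertto utf-8 $name

# Assumption: some layer treats those bytes as Latin-1 characters.
misread = stored_bytes.decode("latin-1")

# Encoding the misread text again produces double-encoded data.
double_encoded = misread.encode("utf-8")

# Decoding the original bytes with the right encoding recovers the name,
# like: encoding convertfrom utf-8 $theBytes
repaired = stored_bytes.decode("utf-8")

print(len(name), len(stored_bytes), len(double_encoded))  # 9 11 15
print(repaired == name)
```

Each á becomes two bytes in UTF-8, and reading those two bytes as two separate characters is what produces the familiar "Ã¡" pattern.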
I have an IzPack installer that takes in a lot of User Inputs and substitutes them in an XML file. This XML file is actually the configuration file for my application.
There is a major problem that I have hit and I can't move on from it.
In the input fields (in the installer) the user can enter any text, including special characters like & # % ' etc. These special characters mess up my XML file, as some of them are not allowed in XML syntax and need to be escaped. For example, for & one would need &amp;
So far I have been asking users to escape the special characters themselves, but that's not working either.
Is there a way to have this done automatically? I really need a solution fast.
I am using IzPack V 4.1
You should use a proper XML API (SAX, DOM) to generate the XML file; this will apply the correct escaping automatically. This may look more complicated at first, but it guarantees that a well-formed, syntactically correct file is written.
Searching for JAXP should give you a proper starting point.
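The same idea, sketched in Python with the standard library's ElementTree instead of JAXP (the element names and the sample input are hypothetical): the values are handed to the XML API as plain text, and the serializer escapes them.

```python
import xml.etree.ElementTree as ET


def build_config(values: dict) -> str:
    """Build a small config document; the serializer escapes the text."""
    root = ET.Element("config")
    for key, value in values.items():
        ET.SubElement(root, key).text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical installer input containing characters special to XML.
user_input = {"password": "p&ss<w>rd'#%"}

xml_text = build_config(user_input)
print(xml_text)

# Parsing the result gives back exactly what the user typed.
parsed = ET.fromstring(xml_text)
print(parsed.find("password").text == user_input["password"])
```

Because the escaping happens at serialization time, the user never has to type entities, and the round trip through a parser returns the original input.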
So, I'm trying to do some screen scraping off of a certain site using nokogiri, but the site owners failed to specify the proper encoding of the page in a <meta> tag. The upshot of this is that I'm trying to deal with strings that think they're utf-8, but really aren't.
(If you care, here are the files I was using to test this:
main file: http://dpaste.de/nif5/
ann.html: http://dpaste.de/YsLM/
ann2.html: http://dpaste.de/Lofi/
ann3.html: http://dpaste.de/R21j/
a-p.html: http://dpaste.de/O9dy/
output: http://dpaste.de/WdXc/
)
After doing a lot of searching around (this SO question was particularly useful), I found that calling encode('iso-8859-1', 'utf-8') on that test string "works", in that I get a proper © symbol. The issue now is that there are other characters in some other strings I want that really do not work at being converted to latin encoding (Shōta, for instance, turns into Sh�\x8Dta).
Now, I'm probably going to bother the appropriate webmasters and try and get them to fix their damn encodings, but in the meantime, I'd like to be able to use the bytes that I've got. I'm fairly certain that there is a way, but I just can't for the life of me figure out what it is.
Those pages appear to be correctly encoded as UTF-8. That's how my browser sees them, and when I view their source and tell the editor to decode them as UTF-8, they look fine. The only problem I see is that some copyright symbols seem to have been corrupted before (or as) they were added to the content. The o-macron and other non-ASCII letters come through just fine.
I don't know if you're aware of this, but the proper way to notify clients of a page's encoding is through a header. Pages may include that information in <meta> tags, but that's neither required nor expected; browsers typically ignore such tags if the header is present.
Since your pages are XHTML, they could also embed the encoding information in an XML processing instruction, but again, they're not required to. It also means you could have Nokogiri treat them as XML instead of HTML, in which case I would expect it to use UTF-8 by default. But I'm not familiar with Nokogiri, so I can't be sure. And anyway, the header is still the final authority.
So, the issue is that ANN only specifies encoding via headers, and Nokogiri doesn't receive the headers from the open() function. So, Nokogiri guesses that the page is latin-encoded, and produces strings that we really can't reverse to get back the original characters from.
You can specify the encoding to Nokogiri as the 3rd parameter to Nokogiri::HTML(), which solves the issue I was initially trying to solve. So, I'll accept this answer, even though the more specific question I asked (how to get those non-latin characters out of a latin string) is unanswerable.
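The general principle, taking the charset from the HTTP header and passing it to the decoder explicitly rather than letting the parser guess, can be sketched in Python (not Ruby). The header value, the helper name, and the sample body below are assumptions for illustration.

```python
from email.message import Message


def charset_from_content_type(header_value: str, default: str = "iso-8859-1") -> str:
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset(default)


# Hypothetical response: UTF-8 body, charset declared only in the header.
body = "Sh\u014dta \u00a9".encode("utf-8")
charset = charset_from_content_type("text/html; charset=UTF-8")

decoded = body.decode(charset)        # uses the declared encoding
guessed = body.decode("iso-8859-1")   # what a Latin-1 guess would produce

print(charset)             # utf-8
print(decoded == "Sh\u014dta \u00a9")
print(guessed == decoded)  # False: the guess yields mojibake
```

Once the bytes have been decoded with the wrong guess and re-encoded, the original characters may no longer be recoverable, which is why supplying the encoding up front is the more reliable fix.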