XPath accents in the source

I'm using a Google spreadsheet to extract a few book descriptions from an HTML page.
A1 contains the ISBN, and in another cell I have =importXML("http://www.ibs.it/code/"&A1&"/scheda/libro.html","(//span[@class='tcorpotesto'])[1]").
It works, but something is wrong with the accents. For example, on http://www.ibs.it/code/9788823503298/hornby-nick/febbre-90ordm.html one of the words is 'Perché', but the scraped text in the cell is 'Perch?'.
How can I fix this? The same problem occurs with all the accented characters.

The document at http://www.ibs.it/code/9788823503298/hornby-nick/febbre-90ordm.html uses ISO-8859-1 encoding.
Google uses UTF-8.
It appears that Google's implementation of importXML() doesn't translate between the two character sets, so the accented characters aren't preserved correctly. You could provide your own proxy / web service that performs the conversion inline, or file a ticket requesting a fix from Google.
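If you go the proxy route, the core of it is just re-encoding the fetched bytes before serving them back. A minimal Ruby sketch of that conversion step (the surrounding web-service plumbing is up to you):

require 'net/http'

# Fetch the page, interpret its bytes as ISO-8859-1, and re-encode to UTF-8.
# A proxy would serve this converted body back with a UTF-8 charset header.
uri  = URI('http://www.ibs.it/code/9788823503298/scheda/libro.html')
body = Net::HTTP.get(uri)
utf8 = body.force_encoding('ISO-8859-1').encode('UTF-8')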

Related

Several occurrences of the same anchor tag string

I have several occurrences of the same anchor tag string in one document: /sn1/. But when signing, I can see only one generated field, near the first match. The documentation says that a sign tab is created at every place a match is found in the document. What am I doing wrong, or do the strings need to be unique?
Update: the document is in Hebrew (RTL), which is probably connected with the problem; I tested another document in English and got multiple fields at the anchor string instances with no problem.
It turned out that an incorrect conversion via the libreconv gem was preventing the additional sign tabs from appearing. The problem was in the initial conversion from .docx to .pdf using the above-mentioned gem. I do not recommend it, at least for RTL documents.

FTP batch report with Chinese characters to Excel

We have a requirement to FTP a batch report to an Excel sheet in .csv format. The batch report contains both single-byte and double-byte characters, for example English and Chinese. The data on the mainframe is in Base64 format, and when it is FTP'ed in either BINARY or ASCII mode, the resulting .csv spreadsheet shows only junk characters. We need a way to FTP the batch report file so that the transferred report is readable.
I would appreciate your help in resolving this issue.
I'm not familiar with Chinese character sets, but if you're not restricted to CSV, you might try formatting an XML document for Excel, which lets you specify the fonts as part of the spreadsheet definition.
Assuming that isn't an option, I would think the Base64 data might need to be translated from EBCDIC to ASCII before transmission and then delivered in BINARY mode. Otherwise you risk having the data translated into something you didn't expect.
Another way to see what is really happening is to send the data as ASCII, retrieve it as BINARY, and then compare the before and after results to see which characters were changed en route. I recall having to do something similar once to resolve differences between European and U.S. code sets.
I'm not sure any of these suggestions represents a "solution" to your problem, but they are the ideas I would explore. I would be interested in hearing how you resolve this.
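If you try that comparison, here is a minimal Ruby sketch (the file names are made up) that decodes each transferred copy and prints its leading bytes in hex, so the ASCII-mode and BINARY-mode transfers can be diffed:

require 'base64'

# Decode each transferred copy of the Base64 report and dump the leading bytes
# in hex, so the two transfer modes can be compared byte for byte.
['report_ascii.b64', 'report_binary.b64'].each do |name|
  bytes = Base64.decode64(File.binread(name))
  puts "#{name}: " + bytes.byteslice(0, 32).unpack1('H*').scan(/../).join(' ')
end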

Some Chinese, Japanese characters appear garbled on IE/FF

We have a content-rendering site that displays content in multiple languages. The site is built using JSP, with the content fetched from an Oracle DB. All our pages are UTF-8 compliant.
When displaying zh/jp content, only some of the characters appear garbled (square boxes on IE and diamond question marks on FF). The data in the DB does not contain any garbled characters. Since we don't understand the language, we don't know which characters are problematic. I would appreciate some pointers to a solution. Could it be that some characters appear invalid to the browsers?
Example in FF:
ネット犯���者 がアプ
脆弱性保護機能 - ネット犯���者 がアプリケーションのセキュリティホール (脆弱性) を突いて、パソコンに脅威を侵入させることを阻止します。
If in doubt about the sanity of a UTF-8 file, you can always re-encode it, either with a good text editor or with a specialized tool like iconv:
iconv -f UTF-8 -t UTF-8 yourfile > yourfile2
If your file is indeed invalid, iconv will also give you some information on the problem.
Another avenue you might want to explore is installing new fonts for Far East languages.
Indeed, not knowing the actual bytes used in your file, it is hard to say why they are replaced with the replacement character � (U+FFFD). You might want to post a hex dump of the parts of your file that do not work.
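If you want to run the same sanity check from code rather than iconv, a small Ruby sketch (the file name is a placeholder for a saved copy of the rendered page):

# Read the served bytes and check whether they form valid UTF-8.
bytes = File.binread('page.html').force_encoding('UTF-8')
puts bytes.valid_encoding? ? 'valid UTF-8' : 'contains invalid byte sequences'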

Ruby string encoding

So, I'm trying to do some screen scraping off of a certain site using Nokogiri, but the site owners failed to specify the proper encoding of the page in a <meta> tag. The upshot of this is that I'm trying to deal with strings that think they're UTF-8 but really aren't.
(If you care, here are the files I was using to test this:
main file: http://dpaste.de/nif5/
ann.html: http://dpaste.de/YsLM/
ann2.html: http://dpaste.de/Lofi/
ann3.html: http://dpaste.de/R21j/
a-p.html: http://dpaste.de/O9dy/
output: http://dpaste.de/WdXc/
)
After doing a lot of searching around (this SO question was particularly useful), I found that calling encode('iso-8859-1', 'utf-8') on that test string "works", in that I get a proper © symbol. The issue now is that other strings I want contain characters that really do not survive conversion to the Latin encoding (Shōta, for instance, turns into Sh�\x8Dta).
Now, I'm probably going to bother the appropriate webmasters and try to get them to fix their damn encodings, but in the meantime I'd like to be able to use the bytes that I've got. I'm fairly certain there is a way, but I just can't for the life of me figure out what it is.
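For reference, the round trip described above looks like this in Ruby 1.9+ (a minimal sketch; it can only recover characters that actually exist in Latin-1, which is why Shōta breaks):

# Re-encode the mis-labelled string to Latin-1 bytes, then re-interpret
# those bytes as UTF-8.
garbled = "Â©"   # what the mis-decoded scrape produces
fixed   = garbled.encode('ISO-8859-1').force_encoding('UTF-8')
puts fixed       # => "©"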
Those pages appear to be correctly encoded as UTF-8. That's how my browser sees them, and when I view their source and tell the editor to decode them as UTF-8, they look fine. The only problem I see is that some copyright symbols seem to have been corrupted before (or as) they were added to the content. The o-macron and other non-ASCII letters come through just fine.
I don't know if you're aware of this, but the proper way to notify clients of a page's encoding is through a header. Pages may include that information in <meta> tags, but that's neither required nor expected; browsers typically ignore such tags if the header is present.
Since your pages are XHTML, they could also embed the encoding information in an XML processing instruction, but again, they're not required to. It also means you could have Nokogiri treat them as XML instead of HTML, in which case I would expect it to use UTF-8 by default. But I'm not familiar with Nokogiri, so I can't be sure. In any case, the header is still the final authority.
So, the issue is that ANN only specifies the encoding via the headers, and Nokogiri doesn't receive the headers from the open() function. So Nokogiri guesses that the page is Latin-encoded and produces strings that we can't really reverse to get the original characters back.
You can specify the encoding to Nokogiri as the third parameter to Nokogiri::HTML(), which solves the issue I was initially trying to solve. So I'll accept this answer, even though the more specific question I asked (how to get those non-Latin characters out of a Latin string) is unanswerable.
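Concretely, that call looks something like this (a sketch; the URL is a stand-in for the actual ANN page):

require 'nokogiri'
require 'open-uri'

# The page declares its charset only in the HTTP header, which Nokogiri never
# sees here, so pass the encoding explicitly as the third argument.
url = 'http://www.example.com/encyclopedia/anime.php'   # placeholder URL
doc = Nokogiri::HTML(URI.open(url).read, nil, 'UTF-8')
puts doc.encoding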

Is it possible to create INTERNATIONAL permalinks?

I was wondering how you deal with permalinks on international sites. By permalink I mean a link which is unique and human-readable.
For English phrases it's no problem, e.g. /product/some-title/,
but what do you do if the product title is in, say, Chinese?
How do you deal with this problem?
I am implementing an international site, and one requirement is to have human-readable URLs.
Thanks for every comment.
Characters outside the ISO Latin-1 set are not permitted in URLs according to this spec, so Chinese strings would be out immediately.
Where the product name can be localised, you can use urls like <DOMAIN>/<LANGUAGE>/DIR/<PRODUCT_TRANSLATED>, e.g.:
http://www.example.com/en/products/cat/
http://www.example.com/fr/products/chat/
accompanied by a mod_rewrite rule to the effect of:
RewriteRule ^([a-z]+)/products/([a-z]+)/?$ product_lookup.php?lang=$1&product=$2
For the first example above, this rule will call product_lookup.php?lang=en&product=cat. Inside this script is where you would access the internal translation engine (from the lang parameter, en in this case) to do the same translation you do on the user-facing side, translating, say, "chat" on the French page to "cat" on the English one, etc.
Using an external translation API would be a good idea, but it's tricky to find a reliable one that works correctly in your business domain. Google has opened up a translation API, but it currently only supports a limited number of languages.
English <=> Arabic
English <=> Chinese
English <=> Russian
Take a look at Wikipedia.
They use national characters in URLs.
For example, the Russian home page URL is http://ru.wikipedia.org/wiki/Заглавная_страница. The browser transparently encodes all non-ASCII characters, replacing them with their percent-encoded byte values when sending the URL to the server.
But on the web page all URLs are human-readable.
So you don't need to do anything special -- just put your product names into URLs as is.
The webserver should be able to decode them for your application automatically.
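What the browser does is plain percent-encoding of the UTF-8 bytes, which you can also do yourself when generating links. A small Ruby sketch:

require 'erb'

# Percent-encode a non-ASCII path segment the same way a browser would.
title = 'Заглавная_страница'
puts ERB::Util.url_encode(title)
# => "%D0%97%D0%B0%D0%B3%D0%BB%D0%B0%D0%B2%D0%BD%D0%B0%D1%8F_%D1%81..."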
I usually transliterate the non-ASCII characters. For example, "täst" would become "taest". GNU iconv can do this for you (I'm sure there are other libraries):
$ echo täst | iconv -t 'ascii//translit'
taest
Alas, these transliterations are locale-dependent: in languages other than German, 'ä' could be transliterated as simply 'a', for example. On the other hand, there should be a transliteration into ASCII for every commonly used character set.
How about a scheme like /productid/{product-id-number}/some-title/,
where the site looks at the {product-id-number} part and ignores the 'some-title' part entirely? You can put the title in whatever language or encoding you like, because it's not being used.
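A minimal Ruby sketch of that idea (the route and ids are made up):

# Extract the numeric id from a path like /productid/42/whatever-title/ and
# ignore the trailing slug entirely.
def product_id(path)
  m = path.match(%r{\A/productid/(\d+)(?:/|\z)})
  m && m[1].to_i
end

puts product_id('/productid/42/some-title/')  # => 42
puts product_id('/productid/42/某种产品/')     # => 42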
If memory serves, you're only able to use English letters in URLs. There's a discussion about changing that, but I'm fairly sure it hasn't been implemented yet.
That said, you'd need a lookup table where you map the translations of products/titles to whatever word they'll be in the other language. For example:
foo.com/cat will need a translation lookup for "cat", "gato", "neko", etc.
Then the HTTP module that parses those human-readable slugs into an exact URL will know which page to serve based on the translations.
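A tiny sketch of such a table (the slugs are invented for illustration):

# Map each localized slug back to one canonical product key before routing.
SLUGS = {
  'cat'  => :cat,
  'gato' => :cat,
  'neko' => :cat,
}

def product_for(slug)
  SLUGS[slug.downcase]
end

puts product_for('gato')  # => cat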
Creating a lookup for such a thing seems like overkill to me. I cannot create a lookup for all the different words in all languages. Maybe accessing a translation API would be a good idea.
So, as far as I can see, it's not possible to use foreign characters in the permalink, as the URL specs don't allow it.
What do you think of encoding the special characters? Are those URLs then recognized by Google?
