RSS reader error: "Input is not proper UTF-8" when using simplexml_load_file()

I'm using the simplexml_load_file() function to parse a feed from an external source.
My code looks like this:
$rssFeed['DAILYSTAR'] = 'http://www.thedailystar.net/latest/rss/rss.xml';
$url = $rssFeed['DAILYSTAR']; // feed URL to load
$rssParser = simplexml_load_file($url);
The output is as follows:
Warning: simplexml_load_file() [function.simplexml-load-file]: http://www.thedailystar.net/latest/rss/rss.xml:12: parser error : Input is not proper UTF-8, indicate encoding ! Bytes: 0x92 0x73 0x20 0x48 in C:\xampp\htdocs\googlebd\index.php on line 39
The script ultimately stops with a fatal error. The main problem is that the site's character encoding is ISO-8859-1, not UTF-8.
Can I read this feed using this method (the SimpleXML API)?
If not, is any other method available?
I've searched through Google but found no answer. Every method I tried returns this error.
Thanks,
Rashed

Well, well, when I retrieve this content using Python, I get the following:
'\n<rss version="2.0" encoding="ISO-8859-1">\n [...]
<description>The results of this year\x92s Higher Secondary Certificate
The feed says it is ISO-8859-1, but \x92 is not a printable character in that character set. In Windows-1252, it is the closing curly single quote, used here as an apostrophe. The encoding is also declared as an attribute on the <rss> element rather than in an XML declaration, so the parser ignores it and assumes UTF-8, where a lone 0x92 byte is invalid. So the page throws an encoding error, and as per the XML spec, clients should be "strict" and not fix errors.
You can retrieve the feed yourself and filter out the non-ISO-8859-1 characters in some fashion or, better, convert the encoding using mb_convert_encoding() before passing the result to your RSS parser.
Oh, and if you want to incorporate the result into a UTF-8 page, you may have to convert everything to UTF-8. Then again, this is English text, which might not need any conversion if everything turns out to be ASCII after all.
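Since I fetched the feed with Python anyway, here is a minimal sketch of the "convert first, then parse" idea in Python rather than PHP. The raw bytes are a shortened, hypothetical stand-in for the real feed, and treating them as Windows-1252 is an assumption based on the \x92 byte above.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the bytes fetched from the feed. Note the raw
# 0x92 byte and the missing XML declaration, as in the output above.
raw = (
    b'<rss version="2.0" encoding="ISO-8859-1"><channel>'
    b"<description>The results of this year\x92s Higher Secondary "
    b"Certificate</description></channel></rss>"
)

# Decode with the encoding the bytes actually appear to use (assumed to be
# Windows-1252), not the one the feed claims.
text = raw.decode("cp1252")

# Parsing an already-decoded str avoids byte-level encoding detection.
root = ET.fromstring(text)
description = root.find("channel/description").text

# ascii() shows the non-ASCII apostrophe as an escape (\u2019).
print(ascii(description))
```

The same two steps (decode with the real encoding, then hand text to the parser) are what mb_convert_encoding() followed by simplexml_load_string() would do on the PHP side.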

We ran into the same issue and used utf8_encode to change the encoding from ISO-8859-1/latin-1 to UTF-8 and get past the error.
$contents = file_get_contents($url);
$xml = simplexml_load_string(utf8_encode($contents));
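One caveat, sketched in Python under the assumption that the feed bytes are really Windows-1252: a conversion that assumes ISO-8859-1 (which is what utf8_encode does) gets past the parser error, but it maps byte 0x92 to a control character rather than to the apostrophe. The sample bytes are hypothetical.

```python
# Hypothetical fragment of the feed containing the problem byte.
raw = b"this year\x92s"

# What an ISO-8859-1 -> UTF-8 conversion assumes about the input.
as_latin1 = raw.decode("latin-1")

# What the bytes most likely are, given that 0x92 is used as an apostrophe.
as_cp1252 = raw.decode("cp1252")

# ascii() makes the difference visible: \x92 (control code) vs \u2019 (quote).
print(ascii(as_latin1))
print(ascii(as_cp1252))
```

If the curly quotes matter in the output, converting from Windows-1252 instead of ISO-8859-1 is probably the safer choice.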

How to handle the JavaScript decoding error?

When I run the code below, some characters are not handled correctly and I get question marks like ?? instead. How can I fix this?
HtmlInput inputBox2 = (HtmlInput)currentPage.getHtmlElementById("classNo");
inputBox2.setValueAttribute("2016同學15");
ScriptResult result = currentPage.executeJavaScript("javascript:Search(2)");
I found this in the compiler: ScriptResult[result=net.sourceforge.htmlunit.corejs.javascript.Undefined#24d7aac3 page=HtmlPage(http://www.xx.org/classNo=2016??15)#1330510442]
You might try URL-encoding (percent-encoding) the value. Some ASCII characters and all non-ASCII characters are replaced by escape sequences, e.g. a space becomes %20.
The HTML URL Encoding Reference explains this, and you can also encode strings interactively there.
Your "2016同學15" would be encoded as:
"2016%E5%90%8C%E5%AD%B815"

Ruby character encoding issue with scraped HTML

I'm having a character encoding issue with a Ruby script that does some HTML scraping and parsing with the Nokogiri gem. At one point in the script, I call join("\n") on an array of strings that have been pulled from some HTML, which causes this error:
./script.rb:333:in `join': incompatible character encodings: UTF-8 and ASCII-8BIT (Encoding::CompatibilityError)
In my logs, I can see CafÃ© showing up for some of the strings that would be included in the join operation.
Is it that some of the strings in my array to be joined are ASCII-8BIT and some are UTF-8, and Ruby can't combine them? Do I need to convert or sanitize my strings (into UTF-8) after parsing them with Nokogiri?
I tried force_encoding('UTF-8') and encode('UTF-8') on the scraped HTML content before I do anything else with it, but it didn't help. In fact, after I tried encode('UTF-8'), my script crashed even earlier when it called to_s on a string containing CafÃ©.
Character encoding always really confuses me. Is there something else I can do to sanitize the strings to avoid this error?
Edit:
I was doing something similar in Perl recently and used a module called Text::Unidecode, which let me pass my strings to a function that translates any problematic characters, e.g. the letter a with an acute accent to the plain letter a. Is there anything similar for Ruby? (This isn't necessarily what I'm aiming for though; if I can keep the a with the acute accent, that's preferable, I think.)
Edit2:
I'm really confused by this and it's proving difficult to reproduce reliably. Here's some code:
[CODE REMOVED]
Edit3:
I removed the previously posted code example because it wasn't correct. But the bottom line is, whenever I try to print or call to_s on the string that was scraped, I get the encoding error.
Edit4:
It turned out in the end that the scraped html input was not what was causing the problem. I got the encoding error whenever I tried to print or call to_s on a hash containing, among other things, the scraped html text. The 'other things' were values from database queries, and they were being returned in ASCII-8BIT. To fix the issue, I explicitly had to call force_encoding('UTF-8') on each database value that I use (although I hear that the mysql2 gem does this automatically so I should switch to that).
I hate character encoding.
Presumably, CafÃ© is supposed to be Café. If we start out with Café in UTF-8 but treat the bytes as though they were encoded in ISO-8859-1 (AKA Latin-1) and then re-encode them as UTF-8, we get the CafÃ© that you're seeing; for example:
> s = 'Café'
=> "Café"
> s.encoding
=> #<Encoding:UTF-8>
> s.force_encoding('iso-8859-1').encode('utf-8')
=> "CafÃ©"
So somewhere you're reading a UTF-8 string but treating it as Latin-1 and re-encoding it as UTF-8. I'd guess that Nokogiri is reading the page and thinking that it is Latin-1 or being told by your user agent that it is getting Latin-1 text. Perhaps you have a bad default encoding somewhere, or the HTTP headers are lying about the encoding, or the page itself is lying about its encoding.
You need to get everything into UTF-8 at the edges of your scraper. Figure out who is lying about the encoding and sort it out right there.
Don't feel bad, scraping and encoding is a nightmare of confusion, stupidity, guesswork, and hard liquor. Servers lie, pages lie, browsers lie, no one is happy.
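For anyone who wants to see the same misreading outside of Ruby, here is a minimal Python sketch of it. It only reproduces the mechanism described above and does not tell you where in your scraper it happens.

```python
original = "Café"

# UTF-8 bytes misread as ISO-8859-1: the two bytes of "é" become two characters.
garbled = original.encode("utf-8").decode("iso-8859-1")
print(ascii(garbled))  # 'Caf\xc3\xa9', i.e. the mojibake from the logs

# Undoing the misreading works only while no bytes were lost on the way.
repaired = garbled.encode("iso-8859-1").decode("utf-8")
print(ascii(repaired))
```

Repairing after the fact like this is a last resort. Fixing the encoding at the point where the text enters the program is more reliable.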

Extended charset characters not recognized and converted to a ? mark

I have a string containing a special character such as "\u2012", i.e. FIGURE DASH. When I try to print it on the console, I get a '?' mark instead of the symbol. I have an editor in which I can insert the symbol using Alt+numpad, like Alt+2012. In the editor I can see the symbol, but when I save it in an XML file and get the value using nodevalue, I get a '?' mark.
To summarize, I am having trouble reading characters from the Latin Extended-A character set. What I need is that when I insert such symbols and read them back, I get something like &#xXXXX;.
Please help!
TIA :)
Put simply, I have a String inpath = "À"; and I want to get its Unicode value, like &#xXXXX;.
The default console encoding in Windows is an MS-DOS code page, and those code pages don't support the character. You can try running chcp 65001 before running the program, but you might need to change the console font as well.
You don't need to do anything you wouldn't do with any other character, as long as you use UTF-8. You aren't doing that in many places. You need to explicitly write in your code to save and read the file in UTF-8, and not rely on the platform default encoding.
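As a rough illustration of both points (getting &#xXXXX; references, and naming the encoding explicitly when saving and reading), here is a short sketch in Python rather than Java. The helper name to_hex_refs and the file name are made up for the example.

```python
import os
import tempfile


def to_hex_refs(text: str) -> str:
    """Replace every non-ASCII character with an &#xXXXX; reference."""
    return "".join(
        ch if ord(ch) < 128 else f"&#x{ord(ch):04X};" for ch in text
    )


print(to_hex_refs("À"))
print(to_hex_refs("a\u2012b"))

# Save and read the file with the encoding spelled out, instead of relying
# on the platform default.
path = os.path.join(tempfile.mkdtemp(), "sample.xml")
with open(path, "w", encoding="utf-8") as f:
    f.write("<v>\u2012</v>")
with open(path, encoding="utf-8") as f:
    read_back = f.read()
```

Because the references are plain ASCII, they print correctly even on a console that cannot display the original characters.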

Nokogiri - Encoding Issue - Invalid UTF8 characters

Can someone take a look at this? I think there are invalid UTF-8 characters when making this call.
Nokogiri::HTML(open("http://www.next.co.uk/x502062s2"))
Is there a way around this? And is this the issue? I am writing a new open source screen scraper designed for product information capture (when a site does not supply a feed), before anyone says I am doing something a little shifty :-)
Before passing anything to Nokogiri, you can re-encode the content of the page and drop all invalid UTF-8 characters using Iconv.
I was using it like this:
require 'iconv'     # Iconv shipped with Ruby before 2.0
require 'open-uri'  # lets open() fetch URLs
ic = Iconv.new('UTF-8//IGNORE', 'UTF-8')
valid_string = ic.iconv(open('http://example.com').read)
You can also check "Fixing invalid UTF-8 in Ruby, revisited."
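The same "drop whatever is not valid UTF-8" idea can be sketched in Python, shown here only to illustrate the technique. The sample bytes are invented, with a stray 0xFF byte playing the role of the invalid input.

```python
# Hypothetical page content: valid UTF-8 for "café", then a stray 0xFF byte.
raw = b"caf\xc3\xa9 \xff ok"

# errors="ignore" silently drops byte sequences that are not valid UTF-8.
cleaned = raw.decode("utf-8", errors="ignore")

# errors="replace" keeps a visible marker (U+FFFD) where data was dropped.
marked = raw.decode("utf-8", errors="replace")

print(ascii(cleaned))
print(ascii(marked))
```

Ignoring errors loses data silently, so the "replace" variant can be easier to debug when the page turns out to be in a different encoding altogether.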

UnicodeEncodeError: ascii when wishing to use unicode

I'm trying something like this:
outFile = open("file.txt", "wt",encoding='utf-8')
outFile.write(str(sentence))
outFile.close()
and getting the error:
UnicodeEncodeError: 'ascii' codec can't encode character '/x4e'.
why is an ascii encoder being used?
Am I right in saying that my string (str(sentence)) is in Unicode? Then why is it not simply encoded as UTF-8 when written to the file? This code gives no exception when run on Ubuntu and Windows; the exception occurs only on Mac OS X.
It seems to me that ASCII is being used by default somewhere on my Mac even though I explicitly state the use of UTF-8.
Please help,
Barry
str() returns a string, yes. And a str will be encoded when written, yes.
I'm not entirely sure why the ascii encoding is being used (it is the default encoding in Python 2, but not in Python 3), but I'm even less sure why you do str(sentence). If you want to decode bytes, you don't use str(); you use .decode(). So start by removing the str() call.
You don't give a full traceback, but I'm guessing that it's the str(sentence) that gives the error.
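To show the difference between the two calls, here is a small Python 3 sketch. The bytes in data are a made-up stand-in for sentence, on the assumption that it holds UTF-8 encoded bytes, and the file goes to a temporary directory.

```python
import os
import tempfile

# Hypothetical stand-in for `sentence`, assuming it holds UTF-8 bytes.
data = "中文".encode("utf-8")

wrong = str(data)             # the repr of the bytes object, not its text
right = data.decode("utf-8")  # the actual text

print(ascii(wrong))
print(ascii(right))

# Writing with an explicit encoding, so the platform default is never used.
path = os.path.join(tempfile.mkdtemp(), "file.txt")
with open(path, "wt", encoding="utf-8") as out_file:
    out_file.write(right)
with open(path, "rt", encoding="utf-8") as in_file:
    read_back = in_file.read()
```

If sentence is already a str, neither call is needed and it can be passed to write() directly.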
