How to solve "The byte stream was erroneous according to the character encoding that was declared"?

In my Mozilla console, I get the following error:
The byte stream was erroneous according to the character encoding that was declared. The character encoding declaration may be incorrect.
However, the UTF-8 charset is declared in a meta tag right after my doctype:
<!DOCTYPE html><html lang="en"><head prefix="og: http://ogp.me/ns# article: http://ogp.me/ns/article# fb: http://ogp.me/ns/fb# website: http://ogp.me/ns/website#"><meta charset="utf-8"><meta name="viewport" content="width=device-width, height=device-height, initial-scale=1.0"><meta name="msvalidate.01" content="232BB6672CFDF39D90402F9473F59D51"><title>What are the Terms of the Covenant of Settlement ? :. Bishop David Oyedepo, Questions and Answers, + Pdf</title>
I am using <meta charset="utf-8">. Why am I getting this error, and how can I solve it?

In my case, in Mozilla, a single accented character (á) triggered the same message.
It appeared inside a comment (after //).
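If you suspect something similar, it can help to scan the served file for bytes that are not valid UTF-8. The following is only a rough sketch in Python: the function name find_invalid_utf8 and the sample bytes are made up for illustration, and it assumes the stray character was saved in a single-byte encoding such as Latin-1.

```python
def find_invalid_utf8(data: bytes):
    """Return (offset, byte) pairs for bytes that cannot be decoded as UTF-8."""
    bad = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            break  # the rest of the data is valid UTF-8
        except UnicodeDecodeError as exc:
            # exc.start is relative to the slice, so add pos back
            offset = pos + exc.start
            bad.append((offset, data[offset]))
            pos = offset + 1  # skip the offending byte and keep scanning
    return bad


# Hypothetical page: "á" saved as the Latin-1 byte 0xE1 inside a // comment
sample = b'<meta charset="utf-8">\n// coment\xe1rio\n'
for offset, byte in find_invalid_utf8(sample):
    print(f"invalid byte 0x{byte:02X} at offset {offset}")
```

Re-saving the file as UTF-8 in the editor (or fixing the reported byte) should make the console message go away if this is indeed the cause.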

This error also shows up in my Mozilla console. What should I do to resolve it?
Error:
"The byte stream was erroneous according to the character encoding that was declared. The character encoding declaration may be incorrect."

Related

Unmarshalling fixed-length UTF-8 strings with BeanIO and Camel

When a message contains no characters with diacritics (which are represented with two bytes), unmarshalling works; otherwise it fails, complaining about the length. I tried converting the body to a string with the charset set to UTF-8
<convertBodyTo type="java.lang.String" charset="UTF-8" />
before unmarshalling using BeanIO in a Camel route, but it doesn't help. What is the right way to solve the problem?
In fact, I suspect the purpose of convertBodyTo is not to tell the class that does the unmarshalling that a string declared as fixed-length may actually vary in length, but to perform an actual conversion. That would require declaring somewhere first that the source is UTF-8, probably on the from endpoint. Could I then convert it temporarily to a charset with a single-byte representation before unmarshalling, and back to UTF-8 afterwards?
After a suggestion that the point is to tell BeanIO which charset to use, I came up with:
<dataFormats>
<beanio id="parseTransactions464" mapping="mapping.xml" streamName="Transactions464" encoding="UTF-8"/>
</dataFormats>
but this gives me:
Exhausted after delivery attempt: 1 caught: java.lang.NullPointerException: charset
I basically copied the usage of encoding with the beanio dataFormat from the question "Cannot find data format in registry - Camel"; I don't know if it is OK.
This is a defect in camel-beanio; see:
http://camel.465427.n5.nabble.com/Re-Exhausted-after-delivery-attempt-1-caught-java-lang-NullPointerException-charset-tc5817807.html
http://camel.465427.n5.nabble.com/Exhausted-after-delivery-attempt-1-caught-java-lang-NullPointerException-charset-tc5817815.html
https://issues.apache.org/jira/browse/CAMEL-12284
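Independently of the camel-beanio defect, the length complaint in the question comes down to character count versus byte count. Here is a small Python sketch of that difference; it is not how BeanIO works internally, and the record layout (a 10-character name followed by a 4-character number) and the helper split_fixed are invented for illustration.

```python
def split_fixed(line, widths):
    """Cut a str (or bytes) into consecutive fields of the given widths."""
    fields, pos = [], 0
    for width in widths:
        fields.append(line[pos:pos + width])
        pos += width
    return fields


record = "Müller    0042"      # hypothetical record: 10-char name + 4-char number
raw = record.encode("utf-8")   # what arrives on the wire

print(len(record))  # 14 characters
print(len(raw))     # 15 bytes, because "ü" takes two bytes in UTF-8

# Cutting the raw bytes at character widths shifts every later field by one
print(split_fixed(raw, [10, 4]))

# Decoding first, then cutting by characters, keeps the fields aligned
print(split_fixed(raw.decode("utf-8"), [10, 4]))
```

So the parser has to know the charset before it counts widths, which is what the encoding attribute on the data format is meant to provide.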

XMLParser in Pharo Claims U+00A0 is "Invalid UTF-8"

Given the input:
<?xml version='1.0' encoding='UTF-8' standalone='yes' ?>
<sms body=". what" />
where the character after the "." in the body attribute of the sms tag is U+00A0, I get the error:
XMLEncodingException: Invalid UTF-8 character encoding (line 2) (column 13)
IIUC, the UTF-8 representation of that character is 0xC2 0xA0 per Wikipedia. Sure enough, bytes 72 and 73 of the input are 194 and 160 respectively.
This seems like a bug in XMLParser, or am I missing something?
Thanks to Monty for coming to the rescue on the Pharo User's list:
You're double decoding. Use onFileNamed:/parseFileNamed: instead (and the DOM printToFileNamed: family of messages when writing) and let XMLParser take care of this for you, or disable XMLParser decoding before parsing with #decodesCharacters:.
Longer explanation:
The class #on:/#parse: messages take either a string or a stream (read the definitions). You gave it a FileReference, but because the argument is tested with isString and sent #readStream otherwise, it didn't blow up then.
File refs sent #readStream return file streams that do automatic decoding. But XMLParser automatically attempts its own decoding too, if:
The input starts with a BOM or it can be inferred by null bytes before or after the first non-null byte.
There is an encoding declaration with a non-UTF-8 encoding.
There is a UTF-8 encoding declaration but the stream is not a normal ReadStream (your case).
So it gets decoded twice, and the decoded value of the char causes the error. I'll consider changing the heuristic to make it less eager to decode.
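The double decoding is easy to reproduce outside Pharo. This Python sketch is only an analogy: re-encoding with latin-1 stands in for "treating each already-decoded character as if it were a byte", which is my assumption about what the second decoder effectively sees, not a description of XMLParser's code.

```python
raw = b'<sms body=".\xc2\xa0what" />'  # U+00A0 correctly encoded as C2 A0

# First decode (what the file stream already did): C2 A0 becomes U+00A0
once = raw.decode("utf-8")
print(repr(once))

# A second decoder now sees the code point 0xA0 as if it were a raw byte
as_bytes = once.encode("latin-1")

# A lone 0xA0 is a continuation byte with no lead byte, so UTF-8 rejects it
second_error = None
try:
    as_bytes.decode("utf-8")
except UnicodeDecodeError as exc:
    second_error = exc
    print("second decode failed at byte offset", exc.start)
```

The input was valid all along; the error only appears because the second pass is fed text that has already been decoded.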

mime.quotedprintable having trouble decoding this message

I am trying to decode a message which doesn't completely conform to the quoted-printable format.
One of the snippets, shown below, has an = where there should be an =3D; this occurs in a number of places. In fact, there are two offences here:
------=_Part_7575500_2105086112.1449628640342
Content-Type: text/html; charset="UTF-8"
I'm decoding it as follows:
qpr := quotedprintable.NewReader(msg.Body)
cleanBody, err := ioutil.ReadAll(qpr)
The resulting error (complaining about the _ after the first =) is:
quotedprintable: invalid hex byte 0x5f
How can I get this to work? Thank you.
You don't just have quoted-printable data; it's part of a MIME multipart message. The =_ pattern is specifically used in the boundary because it can never occur in a quoted-printable message.
Use a multipart.Reader (from mime/multipart) to get the contents of each part.
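To illustrate the idea of splitting on the boundary first and only then decoding each part, here is a sketch using Python's standard email package rather than Go. The message below is invented (including the boundary string and the HTML body); only the overall shape matches the question.

```python
from email import message_from_bytes
from email.policy import default

# Hypothetical multipart message with one quoted-printable HTML part
raw_message = b"\n".join([
    b"MIME-Version: 1.0",
    b'Content-Type: multipart/alternative; boundary="----=_Part_1_2.3"',
    b"",
    b"------=_Part_1_2.3",
    b'Content-Type: text/html; charset="UTF-8"',
    b"Content-Transfer-Encoding: quoted-printable",
    b"",
    b'<a href=3D"http://example.com">caf=C3=A9</a>',
    b"------=_Part_1_2.3--",
    b"",
])

msg = message_from_bytes(raw_message, policy=default)

bodies = []
for part in msg.walk():
    if part.is_multipart():
        continue  # the container itself has no body to decode
    # get_content() undoes the transfer encoding and the charset for text parts
    bodies.append(part.get_content())

print(bodies)
```

The boundary lines never reach the quoted-printable decoder here, which is why the =_ sequence stops being a problem.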

Ruby: reading a web page with encoding `GB2312`, how to check if the content contains some keyword?

I use Ruby to read a web page, and its content is:
<HTML>
<HEAD>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=GB2312" />
</HEAD>
<BODY>
中文
</BODY>
</HTML>
From the meta tag, we can see it uses the GB2312 encoding.
My code is:
res = Net::HTTP.post_form(URI.parse("http://xxx/check"),
{:query=>'xxx'})
Then I use:
res.include?("中文")
to check if the content has that word. But it returns false.
I don't know why it is false, and what should I do? What encoding does Ruby 1.8.7 use? If I need to convert the encoding, how do I do it?
Ruby 1.8 doesn't use encodings, it uses plain byte strings. If you want the byte string in your program to match the byte string in the web page, you'd have to save the .rb file in the same encoding the web page uses (GB2312) so that Ruby will see the same bytes.
Probably better would be to write the byte string explicitly, avoiding issues to do with the encoding of the .rb file:
res.include?("\xD6\xD0\xCE\xC4")
However, matching byte strings doesn't match characters reliably when multibyte encodings are in use (except for UTF-8, which is deliberately designed to allow it). If the web page had the string:
兄形男
in it, that would be encoded as "\xD0\xD6\xD0\xCE\xC4\xD0", which contains the byte sequence "\xD6\xD0\xCE\xC4", so the include? would be true even though the characters 中文 are not present.
If you need to handle non-ASCII characters fully reliably, you'd need a language with Unicode support.
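The false positive described above can be checked with a few lines of Python (used here only because it has a built-in gb2312 codec; the byte values are the ones quoted in the answer):

```python
needle = "中文".encode("gb2312")        # the bytes being searched for
page = b"\xD0\xD6\xD0\xCE\xC4\xD0"      # bytes quoted above for the other string

# Byte-level search: matches, because the needle straddles character boundaries
print(needle in page)

# Character-level search: decode first, then compare text with text
text = page.decode("gb2312")
print(repr(text))
print("中文" in text)
```

Decoding the response to text before searching avoids the problem, at the cost of needing to know the page's encoding.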

RSS reader error: Input is not proper UTF-8 when using simplexml_load_file()

I'm using the simplexml_load_file method to parse a feed from an external source.
My code looks like this:
$rssFeed['DAILYSTAR'] = 'http://www.thedailystar.net/latest/rss/rss.xml';
$rssParser = simplexml_load_file($rssFeed['DAILYSTAR']);
The output is as follows :
Warning: simplexml_load_file() [function.simplexml-load-file]: http://www.thedailystar.net/latest/rss/rss.xml:12: parser error : Input is not proper UTF-8, indicate encoding ! Bytes: 0x92 0x73 0x20 0x48 in C:\xampp\htdocs\googlebd\index.php on line 39
It ultimately stops with a fatal error. The main problem is that the site's character encoding is ISO-8859-1, not UTF-8.
Is it possible to read this feed using this method (the SimpleXML API)?
If not, is any other method available?
I've searched through Google but found no answer. Every method I tried fails with this error.
Thanks,
Rashed
Well, well, when I retrieve this content using Python, I get the following:
'\n<rss version="2.0" encoding="ISO-8859-1">\n [...]
<description>The results of this year\x92s Higher Secondary Certificate
The feed says it's ISO-8859-1, but \x92 is not a printable character in that character set. In Windows-1252, it is the closing curly single quote, used here as an apostrophe. So the page throws an encoding error, and as per the XML spec, clients should be "strict" and not fix errors.
You can retrieve it and filter out the non-ISO-8859-1 characters in some fashion, or better, convert the encoding using mb_convert_encoding() before passing the result to your RSS parser.
Oh, and if you want to incorporate the result into a UTF-8 page, you may have to convert everything to UTF-8, though this is English, which might not even require any different character encodings, if all turns out to be ASCII after all.
We ran into the same issue and used utf8_encode to change the encoding from ISO-8859-1/latin-1 to UTF-8 and get past the error.
$contents = file_get_contents($url);
simplexml_load_string(utf8_encode($contents));
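One caveat, as far as I know: utf8_encode assumes the input is ISO-8859-1, so the 0x92 byte becomes the control character U+0092 rather than a curly apostrophe. The parse error goes away, but the text is still slightly off. A small Python sketch of the difference (the sample bytes are taken from the feed excerpt above):

```python
raw = b"this year\x92s Higher Secondary Certificate"

as_latin1 = raw.decode("iso-8859-1")  # what an ISO-8859-1 based conversion sees
as_cp1252 = raw.decode("cp1252")      # what the feed most likely meant

print(repr(as_latin1[9]))  # an invisible C1 control character
print(repr(as_cp1252[9]))  # the right single quotation mark

# Re-encode the Windows-1252 interpretation as UTF-8 for the XML parser
utf8_bytes = as_cp1252.encode("utf-8")
print(utf8_bytes)
```

In PHP, the equivalent would be converting from Windows-1252 to UTF-8 with mb_convert_encoding() or iconv() before calling simplexml_load_string().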
