unmarshalling fixedlength utf-8 strings with beanio and camel - utf-8

When there are no diacritic signs that are represented with two bytes, unmarshalling of a message is OK, otherwise it fails complaining about the length. I tried to converty body to type string and set charset utf-8
<convertBodyTo type="java.lang.String" charset="UTF-8" />
before unmarshalling using BeanIO in a Camel route, but it doesn't help. What is the right way to solve the problem?
In fact, I think that purpose of convertBodyTo might be not to tell some class that is supposed to do unmarshalling that the actual string although declared fixedlength, might be variable length, but to do actual conversion? But that requires that I tell somewhere first that the actual source is utf-8, probably in from endpoint. Then I can convert it temporarily to some charset that has single byte charset representation before unmarshalling, and back to utf-8 afterwards?
After having a suggestion that the point is to give BeanIO information which charset to use, I came up with:
<dataFormats>
<beanio id="parseTransactions464" mapping="mapping.xml" streamName="Transactions464" encoding="UTF-8"/>
</dataFormats>
but this gives me:
Exhausted after delivery attempt: 1 caught: java.lang.NullPointerException: charset
I basically copied the usage of encoding with beanio dataFormat from here, I don't know if it is OK:
Cannot find data format in registry - Camel

This is a defect in camel-beanio, see this:
http://camel.465427.n5.nabble.com/Re-Exhausted-after-delivery-attempt-1-caught-java-lang-NullPointerException-charset-tc5817807.html
http://camel.465427.n5.nabble.com/Exhausted-after-delivery-attempt-1-caught-java-lang-NullPointerException-charset-tc5817815.html
https://issues.apache.org/jira/browse/CAMEL-12284

Related

MSG Clarification on PidTagInternetCodePage, PidTagMessageCodepage, PidTagStoreSupportMask

The official documentation for the MSG format states
PidTagStoreSupportMask
indicates whether string properties within the .msg file are Unicode-encoded or not. STORE_UNICODE_OK Set if the string properties are Unicode-encoded.
PidTagMessageCodepage
specifies the code page used to encode the non-Unicode string properties on this Message object
PidTagInternetCodepage
indicates the code page used for the PidTagBody property or the PidTagBodyHtml property
Based on the above my understanding is that if the unicode mask is set then all String properties are unicode encoded i.e UTF-16LE
If the mask is not set then PidTagMessageCodepage is used to decode all String properties in the message.
Based on the documentation non-unicode and unicode properties cannot exist together.
So, what is the purpose of the PidTagInternetCodepage ? It is used to decode the body or bodyhtml which have types ptystring.
If a message has the unicode storemask then
Q1. Do we decode the PidTagBody/PidTagBodyHtml using unicode or PidTagInternetCodepage ?
If a message is non-unicode then
Q2. Do we decode PidTagBody/PidTagBodyHtml using PidTagMessageCodepage or PidTagInternetCodepage ?
Q3. Do we use unicode when storemask is set, and when it is not first attempt PidTagInternetCodepage then PidTagMessageCodepage for PidTagBody/PidTagBodyHtmlit ?
Q4. What do we do if none are present .. default to 1252 ?
PR_BODY is not different from any other string property (such as PR_SUBJECT) - it comes in both PT_STRING8 and PT_UNICODE flavors.
PR_HTML, on the other hand, is PT_BINARY and it stores the data in a binary byte blob. Most HTML bodies includes the charset as a part of the HTML headers, but if it is not present, you will need to use PR_INTERNET_CODEPAGE.

Convert to unicode from UTF-8 [duplicate]

I try to convert a UTF8 string to a Java Unicode string.
String question = request.getParameter("searchWord");
byte[] bytes = question.getBytes();
question = new String(bytes, "UTF-8");
The input are Chinese Characters and when I compare the hex code of each caracter it is the same Chinses character. So I'm pretty sure that the charset is UTF8.
Where do I go wrong?
There's no such thing as a "UTF-8 string" in Java. Everything is in Unicode.
When you call String.getBytes() without specifying an encoding, that uses the platform default encoding - that's almost always a bad idea.
You shouldn't have to do anything to get the right characters here - the request should be handling it all for you. If it's not doing so, then chances are it's lost data already.
Could you give an example of what's actually going wrong? Specify the Unicode values of the characters in the string you're receiving (e.g. by using toCharArray() and then converting each char to an int) and what you expected to receive.
EDIT: To diagnose this, use something like this:
public static void dumpString(String text) {
for (int i = 0; i < text.length(); i++) {
System.out.println(i + ": " + (int) text.charAt(i));
}
}
Note that that will give the decimal value of each Unicode character. If you have a handy hex library method around, you may want to use that to give you the hex value. The main point is that it will dump the Unicode characters in the string.
First make sure that the data is actually encoded as UTF-8.
There are some inconsistency between browsers regarding the encoding used when sending HTML form data. The safest way to send UTF-8 encoded data from a web form is to put that form on a page that is served with the Content-Type: text/html; charset=utf-8 header or contains a <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /> meta tag.
Now to properly decode the data call request.setCharacterEncoding("UTF-8") in your servlet before the first call to request.getParameter().
The servlet container takes care of the encoding for you. If you use setCharacterEncoding() properly you can expect getParameter() to return normal Java strings.
Also you may need a special filter which will take care of encoding of your requests. For example such filter exists in spring framework org.springframework.web.filter.CharacterEncodingFilter
String question = request.getParameter("searchWord");
is all you have to do in your servlet code. At this point you have not to deal with encodings, charsets etc. This is all handled by the servlet-infrastucture. When you notice problems like displaying �, ?, ü somewhere, there is maybe something wrong with request the client sent. But without knowing something of the infrastructure or the logged HTTP-traffic, it is hard to tell what is wrong.
possibly.
question = new String(bytes, "UNICODE");

Character reference "&#1" is an invalid XML character

I set the property mapred.textoutputformat.separator with value \001. But when I run the MR Job, it's throwing exception:
Character reference "&#1" is an invalid XML character.
Please help me.
I got the solution. The reason was that when using "\001" character sequence or other Unicode characters, during the object serialization it was getting transformed to some invalid formats.
So the solution was to encode the character using Base64, override the getRecordWriter method of TextOutputFormat class and then decode it there.(Base64.decodeBase64)
This will work.

Convert base64 byte array to an image

I have a form bean with attributes id, desc and imageByteArray. Struts action gets executed and it redirects to a JSP where i want to access these bean attributes like id, desc and convert the imageByteArray and display it as an image. I tried this post, but that's not working for me.
I encode the bytearray using Base64 - where this.bean.imageByteArray refers to the form bean
this.bean.setImageByteArray(new org.apache.commons.codec.binary.Base64().encode(imageInByteArr));
I tried this, but not working
<img src="data:image/jpg;base64,<c:out value='${bean.imageByteArray}'/>" />
Byte array (byte[] imageByteArray) refers a base64 encoded JPG image and I'm getting the following img tag as output and obviously nothing gets displayed,
<img src="data:image/jpg;base64,[B#2e200e">
Any idea how to convert base64 byte array and display as an image in JSP?
What you get is just the toString output of an array. You need however the byte array converted to a String.
You should create a method in bean
public String getByteArrayString()
{
return new String(this.imageByteArray);
}
and reference this in your JSP.
While technically you should define which encoding to use for an array of base64 bytes this is not necessary as all characters are in the standard 7bit ASCII range.
DoubleMalt's answer (accepted at the time of writing) is unfortunate, because it's sort of using two wrongs to make a right. It doesn't help that Apache Commons Codec makes it so easy to do the wrong thing :(
Base64 is fundamentally an encoding from binary data to text - as such, it should almost always be used to convert a byte[] to a String. Your issue is that you're converting a byte[] to another byte[] - but you later want to use that data as a string. It would be better to convert once, in the right way.
Now you can choose exactly when you convert to base64 (and a string). You could do it early, in your Java code, in which case I'd use:
// Obviously you'd need to introduce a new method for this, replacing
// setImageByteArray
this.bean.setImageBase64(new Base64().encodeToString(imageInByteArr));
<img src="data:image/jpg;base64,<c:out value='${bean.imageBase64}'/>" />
Alternatively, you could keep just the binary data in your bean, and the perform the encoding in the JSP. It's been a long time since I've written any JSPs, so I'm not going to try to write the code for that here.
But basically, you need to decide whether your bean should keep the original binary data as a byte[], or the base64-encoded data as a String. Anything else is misleading, IMO.

RSS reader Error : Input is not proper UTF-8 when use simplexml_load_file()

I'm using simplexml_load_file method for parsing feed from external source.
My code like this
$rssFeed['DAILYSTAR'] = 'http://www.thedailystar.net/latest/rss/rss.xml';
$rssParser = simplexml_load_file($url);
The output is as follows :
Warning: simplexml_load_file() [function.simplexml-load-file]: http://www.thedailystar.net/latest/rss/rss.xml:12: parser error : Input is not proper UTF-8, indicate encoding ! Bytes: 0x92 0x73 0x20 0x48 in C:\xampp\htdocs\googlebd\index.php on line 39
Ultimately stop with a fatal error. Main problem is the site's character encoding is ISO-8859-1, not UTF-8.
Can i be able to read this using this method(SimpleXML API)?
If no then any other method is available?
I've searched through Google but no answer. Every method I applied returns with this error.
Thanks,
Rashed
Well, well, when I retrieve this content using Python, I get the following:
'\n<rss version="2.0" encoding="ISO-8859-1">\n [...]
<description>The results of this year\x92s Higher Secondary Certificate
Now it says it's ISO-8859-1, but \x92 is not in that character set, but instead is the closing curly single quote, used as an apostrophe, in Windows-1252. So the page throws an encoding error, and as per the XML spec, clients should be "strict" and not fix errors.
You can retrieve it, and filter out the non-ISO-8859-1 characters in some fashion, or better, convert the encoding using mb-convert-encoding() before passing the result to your RSS parser.
Oh, and if you want to incorporate the result into a UTF-8 page, you may have convert everything to UTF-8, though this is English, which might not even require any different character encodings, if all turns out to be ASCII after all.
We ran into the same issue and used utf8_encode to change the encoding from ISO-8859-1/latin-1 to UTF-8 and get past the error.
$contents = file_get_contents($url);
simplexml_load_string(utf8_encode($contents));

Resources