I am trying to convert a UTF-8 string to a Java Unicode string.
String question = request.getParameter("searchWord");
byte[] bytes = question.getBytes();
question = new String(bytes, "UTF-8");
The input consists of Chinese characters, and when I compare the hex code of each character, it matches the same Chinese character. So I'm pretty sure that the charset is UTF-8.
Where am I going wrong?
There's no such thing as a "UTF-8 string" in Java. Everything is in Unicode.
When you call String.getBytes() without specifying an encoding, that uses the platform default encoding - that's almost always a bad idea.
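To illustrate why the charset should always be named explicitly, here is a minimal sketch. It assumes a scenario the question does not confirm: that UTF-8 bytes were decoded as ISO-8859-1 somewhere upstream. The class and method names are made up for the example.

```java
import java.nio.charset.StandardCharsets;

class ExplicitCharsetDemo {
    // Simulates the assumed mistake: UTF-8 bytes decoded as ISO-8859-1.
    static String decodeUtf8BytesAsLatin1(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Undoes that specific mistake by naming both charsets explicitly.
    // ISO-8859-1 maps every byte to exactly one char, so no data is lost.
    static String repair(String garbled) {
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u540C\u5B78"; // two Chinese characters
        String garbled = decodeUtf8BytesAsLatin1(original);
        System.out.println(garbled.length());               // one char per UTF-8 byte
        System.out.println(repair(garbled).equals(original));
    }
}
```

Calling getBytes() with no argument in repair() would only work if the platform default happened to be ISO-8859-1, which is why the code in the question behaves differently from machine to machine.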
You shouldn't have to do anything to get the right characters here - the request should be handling it all for you. If it's not doing so, then chances are it's lost data already.
Could you give an example of what's actually going wrong? Specify the Unicode values of the characters in the string you're receiving (e.g. by using toCharArray() and then converting each char to an int) and what you expected to receive.
EDIT: To diagnose this, use something like this:
public static void dumpString(String text) {
for (int i = 0; i < text.length(); i++) {
System.out.println(i + ": " + (int) text.charAt(i));
}
}
Note that that will give the decimal value of each Unicode character. If you have a handy hex library method around, you may want to use that to give you the hex value. The main point is that it will dump the Unicode characters in the string.
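If no hex helper is at hand, String.format can do the job. The following is a small sketch of a hex variant of the dump; the class name and the "U+" output format are choices made for this example, not part of the original answer.

```java
import java.util.ArrayList;
import java.util.List;

class HexDumpDemo {
    // Returns one line per UTF-16 code unit, formatted as "index: U+XXXX".
    static List<String> dumpHex(String text) {
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            lines.add(i + ": U+" + String.format("%04X", (int) text.charAt(i)));
        }
        return lines;
    }

    public static void main(String[] args) {
        // 'A' followed by the Chinese character U+540C
        for (String line : dumpHex("A\u540C")) {
            System.out.println(line);
        }
    }
}
```

Like the decimal version, this prints UTF-16 code units, so a character outside the Basic Multilingual Plane shows up as two surrogate values.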
First make sure that the data is actually encoded as UTF-8.
There is some inconsistency between browsers regarding the encoding used when sending HTML form data. The safest way to send UTF-8 encoded data from a web form is to put that form on a page that is served with the Content-Type: text/html; charset=utf-8 header or that contains a <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /> meta tag.
Now, to properly decode the data, call request.setCharacterEncoding("UTF-8") in your servlet before the first call to request.getParameter().
The servlet container takes care of the encoding for you. If you use setCharacterEncoding() properly you can expect getParameter() to return normal Java strings.
You may also need a filter that takes care of the encoding of your requests. For example, the Spring Framework provides one: org.springframework.web.filter.CharacterEncodingFilter
String question = request.getParameter("searchWord");
is all you have to do in your servlet code. At this point you do not have to deal with encodings, charsets, etc. This is all handled by the servlet infrastructure. When you notice problems like �, ?, or ü being displayed somewhere, there may be something wrong with the request the client sent. But without knowing more about the infrastructure or seeing the logged HTTP traffic, it is hard to tell what is wrong.
Possibly:
question = new String(bytes, "UNICODE");
Related
I wanted to ask how it is possible to change the URL encoding (from ISO-8859-1 to UTF-8) for a POST request.
Now my code is as follows:
cl_http_client=>create_by_url(
EXPORTING
url = lc_url
IMPORTING
client = lr_client
EXCEPTIONS
argument_not_found = 1
plugin_not_active = 2
internal_error = 3
OTHERS = 4 ).
IF sy-subrc <> 0.
MESSAGE e020(rest_core_texts).
EXIT.
ENDIF.
lr_client->request->set_method( method = if_http_entity=>co_request_method_post ).
lr_client->request->set_content_type( content_type = 'text/plain; charset=utf-8' ).
lr_client->request->set_form_field( name = 'sUsername' value = lc_uname ).
Etc..
CALL METHOD lr_client->send
EXCEPTIONS
http_communication_failure = 1
http_invalid_state = 2
http_invalid_timeout = 3
http_processing_failed = 4
OTHERS = 5.
IF sy-subrc NE 0.
MESSAGE i400(sclnt_http).
EXIT.
ENDIF.
One of the form fields in the POST request contains the name of the person, and those names might contain German umlauts (ä, ö, ü). The resulting URL is then encoded using an ISO code page instead of UTF-8, but the external system expects it to be encoded in UTF-8.
The result is that the external system stores the name the wrong way (e.g. Gr%e4%df) because the URL is encoded using ISO-8859-1 instead of UTF-8.
This is most likely caused by the fact that the system uses ISO-8859-1 by default (Table: TCP0C).
Now, I have tried transforming the variable holding the name using the class CL_ABAP_CODEPAGE from string to xstring and then vice versa as there is no method to directly transform a string variable to a different code page.
Unfortunately this has not yielded any success.
My second guess was to try to transform the http request body into UTF but I didn't find any suitable method nor function module which I could use.
Any suggestion would be much appreciated!
EDIT:
The system is a non-Unicode system.
The system codepage is ISO-8859-1.
I have a solution for properly escaping the URL in UTF-8 for the POST request even though the system's default code page is not Unicode.
Use the following method to escape the name and address, or any other variables that might contain non-ASCII characters.
CALL METHOD cl_http_utility=>escape_url
EXPORTING
unescaped = lv_pname
options = 1
RECEIVING
escaped = lv_pname.
Do not forget to pass the options parameter as otherwise the default system code page is used for escaping the string variable.
You can subsequently pass the escaped variable via set_form_field() as usual.
I cannot reproduce your problem in a Unicode ABAP 7.52 system. With your code + lc_url = 'http://dummy/dummy'. and lc_uname = 'ändern'., the ICF trace gives:
POST /dummy?sUsername=%c3%a4ndern HTTP/1.0
Content-Type: text/plain; charset=utf-8
Content-Length: 0
user-agent: SAP NetWeaver Application Server (1.0;752)
host: dummy
accept-encoding: gzip
Concerning the query string, according to the note 1228903 - CL_HTTP_CLIENT: Escaping of special characters in URL:
Example of incorrect source code: CL_HTTP_UTILITY=>SET_REQUEST_URI( '/test?name=M%FCller' )
...
In Unicode systems, UTF-8 is used as the character set; in NON-Unicode systems, the character set of the session is used.
...
When escaping, Unicode systems use UTF8 as the code page and non-Unicode systems use the system code page.
...
Workaround: You can use the method SET_HEADER_FIELD of the REQUEST object to set the URL via the pseudo header field "~request_uri".
In your case, that would correspond to lr_client->request->SET_HEADER_FIELD( name = '~request_uri' value = '/dummy?sUsername=%c3%a4ndern' ).
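The difference between the two escapings is easy to see outside ABAP. Here is a minimal Java sketch (not ABAP, and not how CL_HTTP_CLIENT works internally) that escapes the same value once per code page. It assumes Java 10 or later for the URLEncoder overload that takes a Charset; the class and method names are made up.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class CharsetEscapeDemo {
    // Escapes using UTF-8, which is what the external system expects.
    static String escapeUtf8(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    // Escapes using ISO-8859-1, which is what a non-Unicode system produces.
    static String escapeLatin1(String value) {
        return URLEncoder.encode(value, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String name = "\u00E4ndern"; // "ändern", written with an escape to avoid source-encoding issues
        System.out.println(escapeUtf8(name));   // %C3%A4ndern
        System.out.println(escapeLatin1(name)); // %E4ndern
    }
}
```

The single byte %E4 is what the receiving system sees in the broken case, while a UTF-8 decoder expects the two-byte sequence %C3%A4 for the same character.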
Given this piece of code:
<%
Response.Write Server.URLEncode("a doc file.asp")
%>
It used to output this (like JavaScript's encodeURI):
a%20doc%20file.asp
Now, for an unknown reason, I get:
a+doc+file%2Easp
I'm not sure what I touched to make this happen (maybe the file content encoding, ANSI vs. UTF-8). Why did this happen, and how can I get the first behavior of Server.URLEncode back, i.e. percent-encoding?
Classic ASP hasn't been updated in nearly 20 years, so Server.URLEncode still follows RFC 1866, which specifies that spaces be encoded as + symbols (a holdover from the old application/x-www-form-urlencoded media type). You must be mistaken in thinking it was encoding spaces as %20 at some point, unless there's an IIS setting you can change that I'm unaware of.
More modern languages use RFC 3986 for encoding URLs, which is why JavaScript's encodeURI function returns spaces encoded as %20.
Both + and %20 should be treated exactly the same when decoded by any browser, thanks to RFC backwards compatibility. Still, it's generally considered best to use %20 when encoding spaces in a URL, as it's the more modern standard. Some decoding functions (such as JavaScript's decodeURIComponent) won't recognise + symbols as spaces and will fail to properly decode URLs that use them instead of %20.
You can always use a custom function to encode spaces as %20:
function URL_encode(ByVal url)
url = Server.URLEncode(url)
url = replace(url,"+","%20")
URL_encode = url
end function
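The same replace-after-encode trick works in other form-style encoders. As a sketch, here it is with Java's URLEncoder, which also emits + for spaces. This assumes Java 10 or later for the Charset overload; the class and method names are made up.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class SpaceAsPercent20Demo {
    // Form-encodes the value, then rewrites '+' (an encoded space) as "%20".
    // A literal '+' in the input is already encoded as "%2B" at that point,
    // so the replacement cannot corrupt it.
    static String encode(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8).replace("+", "%20");
    }

    public static void main(String[] args) {
        System.out.println(encode("a doc file.asp")); // a%20doc%20file.asp
        System.out.println(encode("a+b c"));          // a%2Bb%20c
    }
}
```

Note that URLEncoder leaves the dot unescaped, unlike the %2E seen in the Server.URLEncode output; both forms decode to the same character.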
When the code is implemented, some characters cannot be decoded. I am getting a bunch of question marks like ??. How can I fix this?
HtmlInput inputBox2 = (HtmlInput)currentPage.getHtmlElementById("classNo");
inputBox2.setValueAttribute("2016同學15");
ScriptResult result = currentPage.executeJavaScript("javascript:Search(2)");
I found this in the compiler: ScriptResult[result=net.sourceforge.htmlunit.corejs.javascript.Undefined#24d7aac3 page=HtmlPage(http://www.xx.org/classNo=2016??15)#1330510442]
You might try URL-encoding the reserved ASCII characters and all non-ASCII characters,
e.g. replacing a space with %20.
The HTML URL Encoding Reference web site explains this.
You can also encode strings interactively there.
Your "2016同學15" would be encoded as:
"2016%E5%90%8C%E5%AD%B815"
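To show where those %XX values come from, here is a minimal hand-rolled sketch of percent-encoding: take the UTF-8 bytes of the string and escape every byte that is not an RFC 3986 unreserved character. The class and method names are made up, and whether HtmlUnit needs this for your page is an assumption to verify.

```java
import java.nio.charset.StandardCharsets;

class PercentEncodeDemo {
    // Percent-encodes every UTF-8 byte except letters, digits, '-', '.', '_' and '~'.
    static String percentEncode(String input) {
        StringBuilder out = new StringBuilder();
        for (byte b : input.getBytes(StandardCharsets.UTF_8)) {
            int v = b & 0xFF; // treat the byte as unsigned
            boolean unreserved = (v >= 'a' && v <= 'z') || (v >= 'A' && v <= 'Z')
                    || (v >= '0' && v <= '9')
                    || v == '-' || v == '.' || v == '_' || v == '~';
            if (unreserved) {
                out.append((char) v);
            } else {
                out.append('%').append(String.format("%02X", v));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "2016" + U+540C + U+5B78 + "15", written with escapes to avoid source-encoding issues
        System.out.println(percentEncode("2016\u540C\u5B7815")); // 2016%E5%90%8C%E5%AD%B815
    }
}
```

Each Chinese character here takes three bytes in UTF-8, which is why each one becomes three %XX groups.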
I would like to receive a long string that contains spaces in a method of my Web API.
To my understanding, I can't send a parameter with white space in it. Does it have to be encoded in some way?
EDIT:
My content type is:
Content-Type: application/x-www-form-urlencoded
I've changed it to several other types, but none of them allows me to receive a parameter with + instead of spaces.
My POST method signature is:
public HttpResponseMessage EditCommentForExtension(string did, string extention, string comment)
Usually, parameters to an HTTP GET request are URL-encoded. This means (among other things) that spaces are replaced by "+".
Using + to mean "space" in a URL is an internal convention used by some web sites, but it's not part of the URL encoding standard. If you want to use + to mean spaces, you are going to have to convert them yourself.
As you discovered, spaces (like everything else that needs encoding) should be encoded as %XX, where each X stands for a hex digit.
http://www.w3.org/Addressing/rfc1738.txt
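The two conventions discussed above can be compared with a decoder that follows the application/x-www-form-urlencoded rules. As a sketch, Java's URLDecoder is one such decoder (this says nothing about how ASP.NET Web API binds parameters). It assumes Java 10 or later for the Charset overload; the class and method names are made up.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

class FormDecodeDemo {
    // Decodes a form-encoded value: both '+' and "%20" become a space.
    static String decode(String encoded) {
        return URLDecoder.decode(encoded, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(decode("a+long%20comment")); // a long comment
        System.out.println(decode("1%2B1"));            // 1+1
    }
}
```

A decoder that only implements plain percent-decoding would leave the + untouched, which matches the behaviour described in the question.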
The only thing that worked for me was to use %20 instead of the spaces.
I am trying to get some data from the server via an AJAX call and then display the result using responseDiv.innerHTML. The data from the server comes partially encoded with Unicode escapes, like: za\u010Dat test. When I set the innerHTML of the response div, this is just displayed as is. That is, the Unicode escape is not converted to the actual character in the browser.
The charset of the containing page is set to UTF-8. I have tried most other things, like converting the unicode representation to HTML entities, but that doesn't seem to work either.
I should also mention that the text coming from the server has HTML tags intermixed as well. The HTML tags are honored as they should be. For example, if the text from the server comes as <b>Bold this!</b>, the text is bolded.
Any help appreciated.
Vikram
Can you replace '\u010D' with 'č'?
AFAIK the HTML tags coming from the server should work if you are setting the innerHTML.
This works for me:
document.getElementById('info').innerHTML = "č <b>Bold this</b>";
BTW - you can use something like Fiddler or Firebug to ensure you are getting what you expect from the server.
Update: use regular expressions to find and replace the unicode characters with HTML entities:
$.get('data.txt', function(data) {
data = data.replace(/\\u([0-9A-Fa-f]{4})/g, '&#x$1;');
document.getElementById('info').innerHTML = data;
});
Just convert the unicode literals to characters directly:
'H\\u0065\\u006Clo, world!'.replace(/\\u([0-9a-fA-F]{4})/g, function(match, hex) {
return String.fromCharCode(parseInt(hex, 16));
});
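If the escapes can be converted on the server before the response is sent, the same idea can be sketched in Java. This assumes a Java backend, which the question does not state; the class and method names are made up.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeUnescapeDemo {
    // Matches a literal backslash, the letter u, and exactly four hex digits.
    private static final Pattern ESCAPE = Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    // Replaces every such escape with the character it denotes.
    static String unescape(String input) {
        Matcher m = ESCAPE.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            char c = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement guards against '$' or '\' being treated specially
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(c)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // The Java literal below contains a real backslash followed by u010D
        System.out.println(unescape("za\\u010Dat <b>test</b>"));
    }
}
```

HTML tags in the input pass through unchanged, so the result can still be assigned to innerHTML as before.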