I am working on an application that uses Hibernate, JPA, and Spring.
I have a process that uses Freemarker to send emails at the end of the process.
The data in the FreeMarker template is filled from the database. The data has a £ (pound) symbol in the database, but when FreeMarker processes it, it is converted into a ? symbol.
The code is as follows:
final StringWriter writer = new StringWriter();
Template template = freemarkerConfiguration.getTemplate("sampleReport.ftl");
template.process(this, writer);
System.out.println(writer.toString());
Printing the data retrieved from the database gives the following result:
INFO 05-22 10:08:18 Printing ?-- Hex :a3
a3 is the hex code for the pound symbol, but it is printed as ? as well.
I cannot spot the error.
FreeMarker works with Unicode everywhere, so it doesn't generate ?-s itself. Print the character codes of that String rather than the String itself, and see if it already contains 0x3F (the UCS code of ?). If it doesn't, then System.out.println is using the wrong encoding, and so is whatever encodes the e-mail for sending. If it does contain 0x3F, then maybe something is messed up with JDBC (again, you can check the String coming from JDBC by printing its character codes), so FreeMarker already receives ?.
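To illustrate the first case, here is a minimal sketch showing that a String can hold 0xA3 correctly while the stream it is printed through turns it into 0x3F. The class and method names are made up for the example, and the US-ASCII stream is only a stand-in for a console with an unsuitable encoding.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

class PoundSignCheck {

    // Prints the text through a PrintStream that uses the given encoding
    // and returns the bytes that actually reached the underlying stream.
    static byte[] printWith(String text, String encoding) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            PrintStream ps = new PrintStream(out, true, encoding);
            ps.print(text);
            ps.flush();
            return out.toByteArray();
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown encoding: " + encoding, e);
        }
    }

    // Formats bytes as space-separated lowercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(Integer.toHexString(b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String value = "\u00A3"; // stand-in for writer.toString()
        // The String itself holds the right character code.
        System.out.println("char code: " + Integer.toHexString(value.charAt(0)));      // a3
        // A stream that cannot represent the character substitutes '?' (0x3f).
        System.out.println("US-ASCII bytes: " + hex(printWith(value, "US-ASCII")));    // 3f
        System.out.println("UTF-8 bytes:    " + hex(printWith(value, "UTF-8")));       // c2 a3
    }
}
```

If the char code is a3 but the console shows ?, the data is intact and only the output encoding needs attention.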
The official documentation for the MSG format states
PidTagStoreSupportMask
indicates whether string properties within the .msg file are Unicode-encoded or not. STORE_UNICODE_OK Set if the string properties are Unicode-encoded.
PidTagMessageCodepage
specifies the code page used to encode the non-Unicode string properties on this Message object
PidTagInternetCodepage
indicates the code page used for the PidTagBody property or the PidTagBodyHtml property
Based on the above, my understanding is that if the Unicode mask is set, then all string properties are Unicode-encoded, i.e. UTF-16LE.
If the mask is not set, then PidTagMessageCodepage is used to decode all string properties in the message.
Based on the documentation, non-Unicode and Unicode properties cannot exist together.
So what is the purpose of PidTagInternetCodepage? It is used to decode the body or the HTML body, which are of type PtypString.
If a message has the Unicode store mask, then:
Q1. Do we decode PidTagBody/PidTagBodyHtml using Unicode or PidTagInternetCodepage?
If a message is non-Unicode, then:
Q2. Do we decode PidTagBody/PidTagBodyHtml using PidTagMessageCodepage or PidTagInternetCodepage?
Q3. Do we use Unicode when the store mask is set, and when it is not, first attempt PidTagInternetCodepage and then PidTagMessageCodepage for PidTagBody/PidTagBodyHtml?
Q4. What do we do if none are present? Default to 1252?
PR_BODY is not different from any other string property (such as PR_SUBJECT) - it comes in both PT_STRING8 and PT_UNICODE flavors.
PR_HTML, on the other hand, is PT_BINARY and stores the data as a binary byte blob. Most HTML bodies include the charset as part of the HTML headers, but if it is not present, you will need to use PR_INTERNET_CODEPAGE.
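The rule described above can be sketched roughly as follows. This is only an illustration, not the official algorithm: the class, the method names, and the code page mapping are hypothetical, the declared HTML charset is assumed to have been extracted by the caller, and falling back to windows-1252 for unknown code pages is an assumption rather than something the format mandates.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MsgStringDecoder {

    // Hypothetical mapping from a Windows code page number to a Java charset.
    // Only a few code pages are handled; anything else falls back to windows-1252.
    static Charset charsetForCodePage(int codePage) {
        if (codePage == 65001) return StandardCharsets.UTF_8;
        if (codePage == 20127) return StandardCharsets.US_ASCII;
        if (codePage == 28591) return StandardCharsets.ISO_8859_1;
        if (codePage >= 1250 && codePage <= 1258) {
            String name = "windows-" + codePage;
            if (Charset.isSupported(name)) return Charset.forName(name);
        }
        return Charset.forName("windows-1252"); // assumed default
    }

    // PT_UNICODE string properties: UTF-16LE bytes.
    static String decodeUnicode(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_16LE);
    }

    // PT_STRING8 string properties: bytes in the message code page.
    static String decodeString8(byte[] raw, int messageCodePage) {
        return new String(raw, charsetForCodePage(messageCodePage));
    }

    // PR_HTML (binary): prefer the charset declared in the HTML itself,
    // otherwise use the internet code page.
    static String decodeHtmlBody(byte[] raw, String declaredCharset, int internetCodePage) {
        Charset cs = charsetForCodePage(internetCodePage);
        if (declaredCharset != null) {
            try {
                cs = Charset.forName(declaredCharset);
            } catch (IllegalArgumentException e) {
                // Unknown or malformed charset name: keep the code page fallback.
            }
        }
        return new String(raw, cs);
    }

    public static void main(String[] args) {
        byte[] utf16 = {(byte) 0xA3, 0x00};
        byte[] cp1252 = {(byte) 0xA3};
        // Both decode to the same single character, U+00A3.
        System.out.println(decodeUnicode(utf16).equals(decodeString8(cp1252, 1252)));
    }
}
```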
What I want is to generate a GS1 datamatrix using the bwip-js API with a FNC1 passed in.
I have tried the example provided on their website (Online Barcode API documentation) through Postman, and it returns the correct value (i.e. without the FNC1 character in the scanned result).
Their example request (parses FNC1 correctly)
http://bwipjs-api.metafloor.com/?bcid=code128&text=%5EFNC1011234567890&parsefnc&alttext=%2801%291234567890
However, when I use my example for the GS1 Data Matrix with the FNC1 value, I get the FNC1 in the scanned result, so it is not parsing the FNC1 value correctly.
My request (does not parse FNC1 correctly):
http://bwipjs-api.metafloor.com/?bcid=gs1datamatrix&text=%5EFNC1(01)03453120000011(17)120508(10)ABCD1234(410)9501101020917&parsefnc&alttext=%2801%291234567890
I have read all the documentation and articles I can find about their generator and the FNC1 character, but they didn't give me any clues.
Am I doing anything wrong here?
UPDATE:
The input to BWIP-JS:
(01)99312650999998(91)111JD507496002000960300(420)2164(8008)181102113732
Image generated:
The code in bwip-js is PostScript and I'm no expert in that language. But try taking the 'FNC1' out of your request and see if that works.
I think it's trying to automatically add FNC1 to any GS1 Datamatrix (see the section starting at line 23903) when it sees an AI, whereas for Data Matrix it has to be explicitly requested.
The FNC1 character is invisible to the console, so it can be tricky to see, but I've managed to parse it out of raw strings using the following:
var decoded = decodedString.split(decodeURI("%1D"));
If you're getting the FNC codes in parentheses, you could probably use a regular expression to remove them.
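For comparison, here is the same idea as the split above in Java, as a small sketch. It assumes the scanner reports FNC1 separators as the ASCII group separator (0x1D), which is what the %1D in the snippet above decodes to. The sample string is made up for illustration, and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

class Gs1Split {

    // The ASCII group separator, invisible when printed to most consoles.
    static final String GS = "\u001D";

    // Splits a scanned string into the chunks that were separated by GS.
    static List<String> splitOnGs(String scanned) {
        return Arrays.asList(scanned.split(GS));
    }

    // Replaces the invisible separator with a visible marker for logging.
    static String makeVisible(String scanned) {
        return scanned.replace(GS, "<GS>");
    }

    public static void main(String[] args) {
        // Illustrative scan result only, not real scanner output.
        String scanned = "0199312650999998" + "91111JD507496002000960300" + GS
                + "4202164" + GS + "8008181102113732";
        System.out.println(makeVisible(scanned));
        System.out.println(splitOnGs(scanned));
    }
}
```

Logging the visible form first makes it easier to tell whether the separator is really in the data or only missing from the display.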
In using the Page Object gem, I'm trying to pull text from a page to verify error messages. One of these error messages contains double-quotes, but when the page object pulls the text from the page, it pulls some other characters.
expected ["Please select a category other than the Default â?oEMSâ?? before saving."]
to include "Please select a category other than the Default \"EMS\" before saving."
(RSpec::Expectations::ExpectationNotMetError)
I'm not quite sure how to escape these, or where I could use a regex to deal with these odd characters.
Honestly, you are overcomplicating your validation.
I would recommend simplifying what you are trying to do, start by asking yourself: Is the part in quotes a critical part of your validation?
If it is, isolate it with a substring check (here error_str holds the text pulled from the page):
error_str.include?("EMS")
If it is not, then you are probably doing too much work. Only check for exactly what you need in validation:
error_str.start_with?("Please select a category other than the Default")
With respect to the actual issue you are having, on a technical level you have an encoding issue. Encode your result string with utf-8 before you pass it to your validation and you will be fine.
Good luck
It's pretty likely that something along the line encoded the string improperly. (A tipoff is the accented characters followed by ?.) It also seems likely that the quotes were converted to "smart quotes" somewhere. This table compares Windows-1252 to UTF-8:
Code Point               Characters          UTF-8 Bytes
Unicode   Windows-1252   Expected   Actual
-------   ------------   --------   ------   -----------
U+201C    0x93           “          â€œ      %E2 %80 %9C
U+201D    0x94           ”          â€       %E2 %80 %9D
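The rows of the table can be reproduced with a few lines of code. The sketch below is in Java rather than Ruby purely for illustration, and the class and method names are made up. It encodes a smart quote as UTF-8, prints the bytes in the table's %XX form, and then reads those same bytes back as Windows-1252 to get the "Actual" column.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class SmartQuoteTable {

    static final Charset CP1252 = Charset.forName("windows-1252");

    // UTF-8 bytes of the text, written as %XX groups like the table's last column.
    static String utf8Percent(String text) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%%%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // What the text turns into when its UTF-8 bytes are read as Windows-1252.
    static String misread(String text) {
        return new String(text.getBytes(StandardCharsets.UTF_8), CP1252);
    }

    public static void main(String[] args) {
        String leftQuote = "\u201C";
        System.out.println(utf8Percent(leftQuote)); // %E2 %80 %9C
        // Print the code units of the misread text rather than the text itself,
        // so the result does not depend on the console encoding.
        for (char c : misread(leftQuote).toCharArray()) {
            System.out.println(String.format("U+%04X", (int) c));
        }
    }
}
```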
What you'll want to do is spot check various places in the code to find the first place the string is encoded in something other than UTF-8:
puts error_str.encoding
(For clarity, error_str is the variable that holds the string you are testing. I'm using puts, but you might want to have another way to log diagnostic messages.)
Once you find the string that's not encoded UTF-8, you can convert it:
error_str.encode('UTF-8')
Or, if the string is hardcoded somewhere, just replace the string.
For more debugging advice, see "3 Steps to Fix Encoding Problems in Ruby" and "How to Get From Theyâ€™re to They’re".
I have a DB using windows-1252 character encoding and dynamic SQL that does simple single quote escaping like this...
l_str := REPLACE(TRIM(someUserInput),'''','''''');
Because the DB is windows-1252, the notorious Unicode character 'MODIFIER LETTER APOSTROPHE' (U+02BC) gets converted when it is sent.
Example: The front end app submits this...
TESTʼEND
But ends up searching on this...
and someColumn like '%TESTÊ¼END%'
What I want to know is this: since the ʼ was converted into Ê¼ (which luckily is safe and just yields wrong search results), is there any scenario where a non-windows-1252 character can be converted into something that WILL break this, thus making SQL injection possible?
I know about bind variables, and I know the DB should be Unicode as well; that's not what I'm asking here. I need proof that what you see above is not safe. I have searched for days and cannot find a way to cause SQL injection when doing simple single-quote escaping like this when the DB is windows-1252. Thanks!
Oh, and always assume the column being searched is a varchar, not a number. I am aware of the issues and how things change when dealing with numbers. So assume this is always the case:
l_str := REPLACE(TRIM(someUserInput),'''','''''');
...
... and someVarcharColumn like '%'||l_str||'%'
Setting aside the argument for using bind variables, since you said you wanted proof that it could break without them, here's what's going on in your example:
The Unicode character 'MODIFIER LETTER APOSTROPHE' (U+02BC) in UTF-8 is made up of 2 bytes: 0xCA 0xBC.
Read as Windows-1252, 0xCA is 'LATIN CAPITAL LETTER E WITH CIRCUMFLEX', which looks like Ê,
and 0xBC is 'VULGAR FRACTION ONE QUARTER', which looks like ¼.
This happens because your client probably uses an encoding that supports multi-byte characters but your DB doesn't. You would want to make sure that the encoding in both database and client is the same to avoid these issues.
Coming back to the question - is it possible that dynamic SQL without bind variables can be injected into because of these special unicode characters - The answer is probably yes.
All you need to break that dynamic sql using this encoding difference is a multibyte character, one of whose bytes is 0x27 which is an apostrophe.
I said 'probably' because a quick search on fileformat.info for 0x27 didn't give me anything back. Not sure if I'm using that site right. However that doesn't mean that it isn't possible, maybe a different client could use a different encoding.
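For UTF-8 specifically, the search can be done by brute force instead of on a website. The following sketch (class and method names are made up) encodes every non-ASCII code point as UTF-8 and counts how many encodings contain the byte 0x27. It says nothing about other client encodings, which would have to be checked separately.

```java
import java.nio.charset.StandardCharsets;

class ApostropheByteScan {

    // Space-separated lowercase hex of the UTF-8 bytes of a single code point.
    static String utf8Hex(int codePoint) {
        StringBuilder sb = new StringBuilder();
        byte[] bytes = new String(Character.toChars(codePoint)).getBytes(StandardCharsets.UTF_8);
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(Integer.toHexString(b & 0xFF));
        }
        return sb.toString();
    }

    // Counts non-ASCII code points whose UTF-8 encoding contains the byte 0x27.
    static int countMultibyteWithApostropheByte() {
        int hits = 0;
        for (int cp = 0x80; cp <= Character.MAX_CODE_POINT; cp++) {
            // Surrogates are not valid code points to encode on their own.
            if (cp >= Character.MIN_SURROGATE && cp <= Character.MAX_SURROGATE) continue;
            byte[] bytes = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
            for (byte b : bytes) {
                if (b == 0x27) {
                    hits++;
                    break;
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println("U+02BC as UTF-8: " + utf8Hex(0x02BC));                 // ca bc
        System.out.println("multibyte sequences containing 0x27: "
                + countMultibyteWithApostropheByte());                              // 0
    }
}
```

The count is zero because every byte of a multibyte UTF-8 sequence has its high bit set, so none of them can equal 0x27.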
I would recommend to never use dynamic SQL where input parameter values are used without bind variables, irrespective of whatever encoding you choose. You're just setting yourself up for so many problems going forward, apart from the performance penalty you have to pay to do a hard parse every single time.
Edit: And of course, most importantly, there is nothing stopping your client from sending an actual apostrophe instead of the Unicode multibyte character, and that would be your definitive proof that the SQL is not safe and can be injected into.
Edit2: I missed your first part where you replace one apostrophe with 2. That should technically take care of the multibyte characters too. I'd still be against this approach.
Your problem is not about SQL Injection, the problem is the character set of your front end app.
Your front end app sends the text in UTF-8, however the database "thinks" it is a Windows-1252 string.
Set your client NLS_LANG value to AMERICAN_AMERICA.AL32UTF8 (you may choose a different territory and/or language), then it should look better.
Then your front end app sends the string in UTF-8 and the database recognizes it as UTF-8. It will be converted to Windows-1252 internally. In case you enter a character that is not supported by CP1252 (e.g. Cyrillic Capital Letter Ж), it will end up as a replacement character such as ¿, which should be fine in terms of SQL injection.
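If you want to see in advance which characters of an input have no Windows-1252 representation, a check along these lines is one option. It is a sketch using Java's windows-1252 charset (the class and method names are made up), and the database's own conversion tables may differ in detail, so treat the result as indicative only.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

class Cp1252Check {

    // Returns the characters of the input that Java's windows-1252 charset cannot encode.
    static String unrepresentable(String input) {
        CharsetEncoder encoder = Charset.forName("windows-1252").newEncoder();
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (!encoder.canEncode(c)) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+02BC (modifier letter apostrophe) and U+0416 (Cyrillic Zhe) are reported,
        // while the ASCII letters and the pound sign are not.
        String problem = unrepresentable("TEST\u02BCEND \u00A3 \u0416");
        for (char c : problem.toCharArray()) {
            System.out.println(String.format("U+%04X", (int) c));
        }
    }
}
```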
See this answer to get more information about database and client character sets.
I am trying to convert a UTF-8 string to a Java Unicode string.
String question = request.getParameter("searchWord");
byte[] bytes = question.getBytes();
question = new String(bytes, "UTF-8");
The input is Chinese characters, and when I compare the hex code of each character, it is the same Chinese character. So I'm pretty sure that the charset is UTF-8.
Where do I go wrong?
There's no such thing as a "UTF-8 string" in Java. Everything is in Unicode.
When you call String.getBytes() without specifying an encoding, that uses the platform default encoding - that's almost always a bad idea.
You shouldn't have to do anything to get the right characters here - the request should be handling it all for you. If it's not doing so, then chances are it's lost data already.
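To show why the getBytes() round trip is risky, here is a small sketch that mimics the code from the question with the platform default charset passed in explicitly. The class and method names are made up, and ISO-8859-1 is only an example of a default that cannot represent Chinese characters.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DefaultCharsetRoundTrip {

    // Mimics question.getBytes() followed by new String(bytes, "UTF-8"),
    // with the platform default charset supplied as a parameter.
    static String roundTrip(String text, Charset platformDefault) {
        byte[] bytes = text.getBytes(platformDefault);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String chinese = "\u4E2D\u6587";
        // With a default charset that cannot represent the characters,
        // each one is replaced by '?' before the UTF-8 decode even starts.
        System.out.println(roundTrip(chinese, StandardCharsets.ISO_8859_1)); // ??
        // With UTF-8 as the default the round trip is a no-op.
        System.out.println(roundTrip(chinese, StandardCharsets.UTF_8).equals(chinese)); // true
    }
}
```

Either way the round trip adds nothing: it is a no-op at best and destroys data at worst.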
Could you give an example of what's actually going wrong? Specify the Unicode values of the characters in the string you're receiving (e.g. by using toCharArray() and then converting each char to an int) and what you expected to receive.
EDIT: To diagnose this, use something like this:
public static void dumpString(String text) {
    for (int i = 0; i < text.length(); i++) {
        System.out.println(i + ": " + (int) text.charAt(i));
    }
}
Note that that will give the decimal value of each Unicode character. If you have a handy hex library method around, you may want to use that to give you the hex value. The main point is that it will dump the Unicode characters in the string.
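A hex variant needs nothing beyond the JDK. The sketch below (hypothetical class and method names) walks the string by code point, so characters outside the Basic Multilingual Plane show up as one entry instead of two surrogates.

```java
class HexDump {

    // Returns one line per code point: the char index and the value as U+XXXX.
    static String dumpHex(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            sb.append(String.format("%d: U+%04X\n", i, codePoint));
            i += Character.charCount(codePoint); // 2 for surrogate pairs, else 1
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Prints:
        // 0: U+0061
        // 1: U+4E2D
        System.out.print(dumpHex("a\u4E2D"));
    }
}
```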
First make sure that the data is actually encoded as UTF-8.
There is some inconsistency between browsers regarding the encoding used when sending HTML form data. The safest way to send UTF-8 encoded data from a web form is to put that form on a page that is served with the Content-Type: text/html; charset=utf-8 header or contains a <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /> meta tag.
Now, to properly decode the data, call request.setCharacterEncoding("UTF-8") in your servlet before the first call to request.getParameter().
The servlet container takes care of the encoding for you. If you use setCharacterEncoding() properly you can expect getParameter() to return normal Java strings.
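To see what goes wrong when the request encoding is not set, here is a sketch that simulates the effect without a servlet container. It assumes the container fell back to ISO-8859-1, which is a common default, and the class and method names are made up. The repair step only reverses that one specific mistake and is no substitute for calling setCharacterEncoding().

```java
import java.nio.charset.StandardCharsets;

class RequestEncodingDemo {

    // Simulates what getParameter() would return if the container
    // decoded UTF-8 request bytes as ISO-8859-1.
    static String misdecoded(String original) {
        return new String(original.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    // Reverses that specific mistake. This is lossless because ISO-8859-1
    // maps every byte value to exactly one char and back.
    static String repair(String misdecoded) {
        return new String(misdecoded.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String sent = "\u4E2D\u6587";
        String received = misdecoded(sent);
        System.out.println(received.length());            // 6: one char per UTF-8 byte
        System.out.println(repair(received).equals(sent)); // true
    }
}
```

Note that this differs from the code in the question, which calls getBytes() without a charset and therefore depends on the platform default.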
You may also need a filter that takes care of the encoding of your requests. For example, the Spring framework provides one: org.springframework.web.filter.CharacterEncodingFilter.
String question = request.getParameter("searchWord");
is all you have to do in your servlet code. At this point you do not have to deal with encodings, charsets, etc.; this is all handled by the servlet infrastructure. When you notice problems such as �, ?, or ü being displayed somewhere, there may be something wrong with the request the client sent. But without knowing more about the infrastructure or seeing the logged HTTP traffic, it is hard to tell what is wrong.
possibly.
question = new String(bytes, "UNICODE");