PHP - detecting the user supplied character's char set - utf-8

Is it possible to detect the user's string's char set?
If not, how about the next question..
Are there reliable built-in PHP functions that can accurately tell if the user supplied string ( be it supplied thru get/post/cookie etc), are in a UTF-8 or not? In other words, can I do something like
is_utf8($_GET['first_name'])
Is there anyway this function could produce a TRUE where in reality the first_name was not in UTF-8?

Regarding 1:
You can give mb_detect_encoding a try, but it's pretty much a shot in the dark. An "encoded" string is just a bunch of bytes. Such byte sequences are often equally valid in any number of different encodings. It's therefore by definition not possible to detect an unknown encoding reliably, you can only guess. For this reason there exist meta information such as HTTP headers which should communicate the encoding of the transferred content. Check those if available.
Regarding 2:
mb_check_encoding($var, 'UTF-8') will tell you whether the string is a valid UTF-8 string. As far as I've seen, in recent versions of PHP it does what it says on the tin. That still doesn't mean the string is necessarily really a UTF-8 string, it just means the byte sequence is in an order that is valid in UTF-8.

Related

Oracle PL/SQL SQL Injection Test from Unicode to Windows-1252

I have a DB using windows-1252 character encoding and dynamic SQL that does simple single quote escaping like this...
l_str := REPLACE(TRIM(someUserInput),'''','''''');
Because the DB is windows-1252 when the notorious Unicode Character 'MODIFIER LETTER APOSTROPHE' (U+02BC) is sent it gets converted.
Example: The front end app submits this...
TESTʼEND
But ends up searching on this...
and someColumn like '%TESTʼEND%'
What I want to know is, since the ʼ was converted into ʼ (which luckily is safe just yields wrong search results) is there any scenario where a non-windows-1252 characters can be converted into something that WILL break this thus making SQL injection possible?
I know about bind variables, I know the DB should be unicode as well, that's not what I'm asking here. I am needing proof that what you see above is not safe. I have searched for days and cannot find a way to cause SQL injection when doing simple single quote escaping like this when the DB is windows-1252. Thanks!
Oh, and always assuming the column being search is a varchar, not number. I am aware of the issues and how things change when dealing with numbers. So assume this is always the case:
l_str := REPLACE(TRIM(someUserInput),'''','''''');
...
... and someVarcharColumn like '%'||l_str||'%'
Putting the argument of using bind variables aside, since you said you wanted proof that it could break without bind variables.
Here's what's going on in your example -
The Unicode character 'MODIFIER LETTER APOSTROPHE' (U+02BC) in UTF-8 is made up of 2 bytes - 0xCA 0xBC.
Of that 0xCA is 'LATIN CAPITAL LETTER E WITH CIRCUMFLEX' which looks like - Ê
and 0xBC is 'VULGAR FRACTION ONE QUARTER' which looks like ¼.
This happens because your client probably uses an encoding that supports multi-byte characters but your DB doesn't. You would want to make sure that the encoding in both database and client is the same to avoid these issues.
Coming back to the question - is it possible that dynamic SQL without bind variables can be injected into because of these special unicode characters - The answer is probably yes.
All you need to break that dynamic sql using this encoding difference is a multibyte character, one of whose bytes is 0x27 which is an apostrophe.
I said 'probably' because a quick search on fileformat.info for 0x27 didn't give me anything back. Not sure if I'm using that site right. However that doesn't mean that it isn't possible, maybe a different client could use a different encoding.
I would recommend to never use dynamic SQL where input parameter values are used without bind variables, irrespective of whatever encoding you choose. You're just setting yourself up for so many problems going forward, apart from the performance penalty you have to pay to do a hard parse every single time.
Edit: And of course, most importantly, there is nothing stopping your client to send an actual apostrophe instead of the unicode multibyte character and that would be your definitive proof that the SQL is not safe and can be injected into.
Edit2: I missed your first part where you replace one apostrophe with 2. That should technically take care of the multibyte characters too. I'd still be against this approach.
Your problem is not about SQL Injection, the problem is the character set of your front end app.
Your front end app sends the text in UTF-8, however the database "thinks" it is a Windows-1252 string.
Set your client NLS_LANG value to AMERICAN_AMERICA.AL32UTF8 (you may choose a different territory and/or language), then it should look better.
Then your front end app sends the string in UTF-8 and the database recognize it as UTF-8. It will be converted to Windows-1252 internally. I case you enter a string which is not supported by CP1252 (e.g. Cyrillic Capital Letter Ж) it will end up to something like Cyrillic Capital Letter ¿ - which should be fine in terms of SQL injection.
See this answer to get more information about database and client character sets.

Julia: Strange characters in my string

I scraped some text from the internet, which I put in an UTF8String. I can use this string normally, but when I select some specific characters (strange character with accents, like in my case ú), which are not part of the UTF8 standard, I get an error, saying that I used invalid indexes. This only happens when the string contains strange characters; my code works with normal string that do not contain strange characters.
Any way to solve this?
EDIT:
I have a variable word of type SubString{UTF8String}
When I use do method(word), no problems occur. When I do method(word[2:end]) (assuming length of at least 2), I get an error in case the second character is strange (not in UTF8).
Julia does indexing on byte positions instead of character position. It is way more efficient for a variable length encoding like UTF-8, but it makes some operations use some more boilerplate.
The problem is that some codepoints is encoded as multiple bytes and when you slice the string from 2:end you would have got half of the first character (witch is invalid and you get an error).
The solution is to get the second valid index instead of 2 in the slice. I think that is something like str[nextind(str, 1):end]
PS. Sorry for a less than clear answer on my phone.
EDIT:
I tried this, and it seems like SubString{UTF8String} and UTF8String has different behaviour on slicing. I've reported it as bug #7811 on GitHub.

CredentialProvider does not accept passwords with special (German) letters

I wrote a CredentialProvider that allows to log in to Windows. But today I found out this strange error that GetSerialization() seems not to accept passwords which contain the German 'umlaut' letters like 'ä' or 'ü'. Does anyone know the solution?
Thanks in advance
It'll depend on the details of the format in which GetSerialization() returns the password.
If your CredentialProvider returns a serialized KERB_INTERACTIVE_UNLOCK_LOGON structure, as the standard password provider does, then the username, password, and domain name values in the structure must all be passed as UNICODE_STRING values. Note that UNICODE_STRING is a structure that contains current length and maximum length values and a buffer of 16-bit Unicode (UTF-16LE) characters. As they're Unicode they can certainly hold letters with umlauts.
However, if your CredntialProvider handles the password in a narrow character buffer you may be handling your umlaut characters as 8-bit Windows CP1252 characters. You'll need to convert those to 16-bit Unicode before placing them in the KERB_INTERACTIVE_UNLOCK_LOGON structure and serializing it.

What does #encoding BINARY mean in the Ruby code?

There is a line #encoding BINARY in the beginning of the code, what does it mean?
http://ruby.runpaint.org/encoding
Ruby defines an encoding named ASCII-8BIT, with an alias of BINARY, which does not correspond to any known encoding. It is intended to be associated with binary data, such as the bytes that make up a PNG image, so has no restrictions on content. One byte always corresponds with one character. This allows a String, for instance, to be treated as bag of bytes rather than a sequence of characters. ASCII-8BIT, then, effectively corresponds to the absence of an encoding, so methods that expect an encoding name recognise nil as a synonym.
That line is how we tell the Ruby interpreter to expect a certain character set in the source file.
James Grey has a great series on dealing with character encodings in Ruby. In particular, "Ruby 1.9's Three Default Encodings " might be good reading if you want to understand the details.

UTF-8 string delimiter

I am parsing a binary protocol which has UTF-8 strings interspersed among raw bytes. This particular protocol prefaces each UTF-8 string with a short (two bytes) indicating the length of the following UTF-8 string. This gives a maximum string length 2^16 > 65 000 which is more than adequate for the particular application.
My question is, is this a standard way of delimiting UTF-8 strings?
I wouldn't call that delimiting, more like "length prefixing". Some people call them Pascal strings since in the early days the language Pascal was one of the popular ones that stored strings that way in memory.
I don't think there's a formal standard specifically for just that, as it's a rather obvious way of storing UTF-8 strings (or any strings of bytes for that matter). It's defined over and over as a part of many standards that deal with messages that contain strings, though.
UTF8 is not normally de-limited, you should be able to spot the multibyte characters in there by using the rules mentioned here: http://en.wikipedia.org/wiki/UTF-8#Description
i would use a delimiter which starts with 0x11......
but if you send raw bytes you will have to exclude this delimiter from the data\messages processed ,this means that if there is a user input similar to that delimiter, you will have to convert it.
if the user inputs any utf8 represented char you may simply send it as is.

Resources