PLSQL - convert UTF-8 NVARCHAR2 to VARCHAR2

I have a table with a column configured as NVARCHAR2, and I'm able to save the string in UTF-8 without any issues.
But the application that calls the value does not fully support UTF-8.
This means that the string is converted into HTML character codes (one code per letter) before it is passed to the database and back.
I'm looking for an easier solution.
I've considered converting it to Base64, but the encoded output contains various characters that are considered illegal in the application.
In addition, I tried using HEXTORAW & RAWTOHEX.
None of the above helped.
If the column contains 'κόσμε' I need to find a way to convert/encode it to something else, but it must be possible to decode it from the HTML running the application.

Try using the ASCIISTR function; it converts the string to something similar to the way JSON encodes Unicode strings (it's essentially the same, except that "\" is used instead of "\u"). Then, when you receive it back from the front end, use UNISTR to convert it back to Unicode.
ASCIISTR: https://docs.oracle.com/cd/B28359_01/server.111/b28286/functions006.htm
UNISTR: https://docs.oracle.com/cd/B19306_01/server.102/b14200/functions204.htm
SELECT ASCIISTR(N'κόσμε') FROM DUAL;
SELECT UNISTR('\03BA\1F79\03C3\03BC\03B5') FROM DUAL;
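As a quick sanity check, the two functions round-trip cleanly (a minimal sketch against DUAL; no table needed):
SELECT UNISTR(ASCIISTR(N'κόσμε')) AS round_trip FROM DUAL;
-- returns the original 'κόσμε', assuming the client/session can display the national character set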

Related

How do I add unprintable ascii control characters to a VARCHAR2 in Oracle PL/SQL?

I am supporting a label printing system which uses PL/SQL and an ORACLE database to fill in values that are then encoded in various barcodes (among other things). For a specific 2 dimensional data matrix barcode I'm attempting to create, I need to include some unprintable characters as control characters for the system that is scanning the barcode.
The problem is that I have no idea how to encode those characters.
Example:
v_string VARCHAR2(1000);
...
v_string := '[)><RS>06<GS>12SC<GS>16S2'||'<GS>V'||:(IN:-SPLR-)||'<GS>3S'||v_serial||'<GS>P'||:(IN:-CUSTITEM-)||'<GS>Q'||v_boxqty||'<GS>1T'||:(IN:-LOT-)||'<GS>15K123456789123'||:(IN:-PRODSEQ-)||'<RS><EOT>';
Then v_string is passed as a parameter to a function that actually populates the barcode on the printed label. The problem is that every <GS>, <RS>, and <EOT> in the string is supposed to be a control character. I have the ASCII decimal and hex values for those control characters, but no idea how to add them into the above string in place of the placeholders.
Any help would be appreciated.
You can use the chr() function to supply individual ASCII characters:
CHR returns the character having the binary equivalent to n as a VARCHAR2 value in either the database character set or, if you specify USING NCHAR_CS, the national character set.
(You can also use unistr() for Unicode characters, but that doesn't seem to be necessary in this case; but note the ASCII/EBCDIC message in the chr() document...)
For those control characters you can use chr(4), chr(29) and chr(30):
v_string := '[)>'||chr(30)||'06'||chr(29)||'12SC'||chr(29)||'16S2'||chr(29)||'V'||:(IN:-SPLR-)||chr(29)||'3S'||v_serial||chr(29)||'P'||:(IN:-CUSTITEM-)||chr(29)||'Q'||v_boxqty||chr(29)||'1T'||:(IN:-LOT-)||chr(29)||'15K123456789123'||:(IN:-PRODSEQ-)||chr(30)||chr(4);
db<>fiddle showing the generated string - the printable parts, anyway - and its dumped value, so you can see the 4/29/30 characters are actually there.
You could also build your string as you have it, then pass it through replace() to replace the <GS> etc. placeholders with the chr() values.
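For example, a replace()-based version could look roughly like this (a sketch only; the placeholder spellings must match whatever your string actually contains):
v_string := replace(replace(replace(v_string, '<GS>', chr(29)), '<RS>', chr(30)), '<EOT>', chr(4));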
If you have the ASCII value, you can use the CHR function to construct your string, as in:
v_string:='xx'||chr(10).....

Oracle convert from utf-16 hex to utf-8 character

My database character set is AL32UTF8 and the national character set is AL16UTF16. I need to store in a table the numeric values of some characters according to the database character set, and later display a specific character using its numeric value. I had some problems understanding how this encoding works (the differences between the unistr, chr, ascii functions and so on), but eventually I found a website where the following code was used:
chr(ascii(convert(unistr(hex), AL32UTF8)))
And it works fine when the hex code is smaller than 1000, but when I use, for example:
chr(ascii(convert(unistr('\1555'), AL32UTF8)))
chr(ascii(convert(unistr('\1556'), AL32UTF8)))
both return the same ascii value (i.e. ascii(convert(unistr('\hex >= 1000'), AL32UTF8)) is the same for both). Could anyone look at this and try to explain the reason? I really thought I understood how it works, but now I'm a bit confused.

Oracle PL/SQL SQL Injection Test from Unicode to Windows-1252

I have a DB using windows-1252 character encoding and dynamic SQL that does simple single quote escaping like this...
l_str := REPLACE(TRIM(someUserInput),'''','''''');
Because the DB is Windows-1252, when the notorious Unicode character 'MODIFIER LETTER APOSTROPHE' (U+02BC) is sent, it gets converted.
Example: The front end app submits this...
TESTʼEND
But ends up searching on this...
and someColumn like '%TESTÊ¼END%'
What I want to know is, since the ʼ was converted into Ê¼ (which luckily is safe and just yields wrong search results), is there any scenario where a non-Windows-1252 character can be converted into something that WILL break this, thus making SQL injection possible?
I know about bind variables, I know the DB should be unicode as well, that's not what I'm asking here. I am needing proof that what you see above is not safe. I have searched for days and cannot find a way to cause SQL injection when doing simple single quote escaping like this when the DB is windows-1252. Thanks!
Oh, and always assume the column being searched is a varchar, not a number. I am aware of the issues and how things change when dealing with numbers. So assume this is always the case:
l_str := REPLACE(TRIM(someUserInput),'''','''''');
...
... and someVarcharColumn like '%'||l_str||'%'
Putting the argument of using bind variables aside, since you said you wanted proof that it could break without bind variables.
Here's what's going on in your example -
The Unicode character 'MODIFIER LETTER APOSTROPHE' (U+02BC) in UTF-8 is made up of 2 bytes - 0xCA 0xBC.
Of that 0xCA is 'LATIN CAPITAL LETTER E WITH CIRCUMFLEX' which looks like - Ê
and 0xBC is 'VULGAR FRACTION ONE QUARTER' which looks like ¼.
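You can see those two bytes directly with DUMP (a minimal check, assuming an AL32UTF8 database character set):
SELECT DUMP(TO_CHAR(UNISTR('\02BC')), 16) FROM DUAL;
-- on an AL32UTF8 database this shows something like: Typ=1 Len=2: ca,bc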
This happens because your client probably uses an encoding that supports multi-byte characters but your DB doesn't. You would want to make sure that the encoding in both database and client is the same to avoid these issues.
Coming back to the question - is it possible that dynamic SQL without bind variables can be injected into because of these special unicode characters - The answer is probably yes.
All you need to break that dynamic sql using this encoding difference is a multibyte character, one of whose bytes is 0x27 which is an apostrophe.
I said 'probably' because a quick search on fileformat.info for 0x27 didn't give me anything back. Not sure if I'm using that site right. However that doesn't mean that it isn't possible, maybe a different client could use a different encoding.
I would recommend to never use dynamic SQL where input parameter values are used without bind variables, irrespective of whatever encoding you choose. You're just setting yourself up for so many problems going forward, apart from the performance penalty you have to pay to do a hard parse every single time.
Edit: And of course, most importantly, there is nothing stopping your client from sending an actual apostrophe instead of the unicode multibyte character, and that would be your definitive proof that the SQL is not safe and can be injected into.
Edit2: I missed your first part where you replace one apostrophe with 2. That should technically take care of the multibyte characters too. I'd still be against this approach.
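For reference, the bind-variable version of that kind of lookup could look roughly like this (a minimal sketch; the table, column and variable names are illustrative):
DECLARE
  l_input VARCHAR2(4000) := 'TESTʼEND';  -- whatever the front end sent
  l_cur   SYS_REFCURSOR;
BEGIN
  OPEN l_cur FOR
    'SELECT * FROM some_table WHERE some_varchar_column LIKE ''%'' || :pattern || ''%'''
    USING TRIM(l_input);
  -- fetch from l_cur as usual; the input is bound, so no quote escaping is needed
  CLOSE l_cur;
END;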
Your problem is not about SQL Injection, the problem is the character set of your front end app.
Your front end app sends the text in UTF-8, however the database "thinks" it is a Windows-1252 string.
Set your client NLS_LANG value to AMERICAN_AMERICA.AL32UTF8 (you may choose a different territory and/or language), then it should look better.
Then your front end app sends the string in UTF-8 and the database recognizes it as UTF-8. It will be converted to Windows-1252 internally. In case you enter a string which is not supported by CP1252 (e.g. Cyrillic Capital Letter Ж), it will end up as something like '¿' - which should be fine in terms of SQL injection.
See this answer to get more information about database and client character sets.
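To check what the database side is configured with, you can query the standard NLS data dictionary view:
SELECT parameter, value
  FROM nls_database_parameters
 WHERE parameter IN ('NLS_CHARACTERSET', 'NLS_NCHAR_CHARACTERSET');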

PL/SQL Apply Same Functions More Than Once

There is an encoding problem in an existing Oracle database. From the Java side, I apply these replacements and fix it:
textToEscape = textToEscape.replace(/Ã¶/g, 'ö');
textToEscape = textToEscape.replace(/Ã§/g, 'ç');
textToEscape = textToEscape.replace(/Ã¼/g, 'ü');
textToEscape = textToEscape.replace(/ÅŸ/g, 'ş');
textToEscape = textToEscape.replace(/ÄŸ/g, 'ğ');
There is a procedure which retrieves data from the database. I want to write a function and apply that replace sequence inside it. I found this link:
https://docs.oracle.com/cd/B19306_01/server.102/b14200/functions134.htm
However, I want to apply consecutive replaces. How can I chain them?
You can use the Oracle CONVERT function to convert the data into the correct character set (compatible with your Java charset) inside the database procedure itself.
That should handle all cases for you.
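A minimal sketch of that idea (the character set, column and table names here are illustrative; pick the ones matching your actual Java charset and database setup):
SELECT CONVERT(some_column, 'TR8MSWIN1254', 'AL32UTF8') AS converted_text
  FROM some_table;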
Assuming your database character set is AL32UTF8, the malformed characters that you see stem from a repeated conversion of an 8-bit character set encoding (presumably ISO-8859-9 [Turkish]) to Unicode in its UTF-8 representation. The second of these conversions, of course, has been applied erroneously to the byte sequence that constituted the valid UTF-8 representation of your data.
You can reverse this within the database using the utl_raw package. Say tab.col contains your data, the following statement rectifies it.
update tab set col = utl_raw.cast_to_varchar2 ( utl_raw.convert ( utl_raw.cast_to_raw ( col ), 'WE8ISO8859P9', 'AL32UTF8' ) );
The casts retag the type of the character data, which effectively allows operating on the underlying octet (byte) sequence. On this level, the erroneous UTF-8 mapping is inverted. Since the result is still a valid representation in the database character set, a simple re-cast delivers the result.
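If you want to preview the effect before running the update, the same expression works in a plain SELECT (same tab/col placeholders as in the statement above):
select col,
       utl_raw.cast_to_varchar2(
         utl_raw.convert(utl_raw.cast_to_raw(col), 'WE8ISO8859P9', 'AL32UTF8')
       ) as fixed_col
  from tab;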

Read a CSV file with special characters in Ruby and store into SQL Server

I'm trying to import a CSV file (UTF-8 encoding) in Ruby (2.0.0) into my database (MSSQL 2008R2, COLLATION French_CI_AS), but the special characters (French accents on vowels) are not stored properly: éèçôü becomes Ã©Ã¨Ã§Ã´Ã¼ (or other similar gibberish).
I use this piece of code to read the file :
CSV.foreach(file, col_sep: ';', encoding: "utf-8") do |row|
# ...
end
I tried various encodings in the CSV options (utf-8, iso-8859-1, windows-1252), but none would store the special characters correctly.
Before you ask, my database collation supports those characters, since we have successfully imported data containing them using PHP importers. If I dump the data using puts or a file logger, everything is correct.
Is something wrong with my code, or do I need to specify something else (like the Ruby class file encoding, for example)?
Thanks
EDIT: The data saving is done by a PHP REST API that works fine with accented characters. It stores data as it is received.
In Ruby, I parse my data, store it in an object and then send the JSON-encoded object in the body of my PUT request. But if I use an SQL query directly from Ruby, the problem remains:
query = <<-SQL
UPDATE MyTable SET MyTable_title = '#{row_data['title']}' WHERE MyTable_id = '#{row_data['id']}'
SQL
res = db.execute query
I was thinking that this had something to do with the encoding type of your CSV file, so I started digging around on that. I did find that Windows-1252 encoding will insert control characters.
You can read more about it here: Converting special charactes such as ü and à back to their original, latin alphbet counterparts in C#
