characters from notepad file getting converted into special characters while reading using utl_file.get_line procedure - oracle

I have written a program to read data from a text file and load it into a table using the UTL_FILE package in Oracle. While reading, some characters in a few lines get converted into special characters. For example:
string in file = 63268982_GHC –EXH PALOMARES EVA
value entered into database = 63268982_GHC âEXH PALOMARES EVA
I tried using the CONVERT function but it did not help.
My Oracle version is 11gR2 and the database character set is WE8ISO8859P1. Because these strings represent physical file names, I get a mismatch when I try to match them against the actual filenames.
I tried re-converting the value stored in Oracle in the WE8ISO8859P1 charset back to ASCII like below:
convert('63268989_GHC âEXH PALOMARES','us7ascii','WE8ISO8859P1')
but the outcome is different from what was in the text file. Can anyone please suggest how this problem can be overcome?

The – character in the file is not a regular hyphen (-, chr(45)) but an En Dash, U+2013, stored in UTF-8 as three bytes: decimal 226, 128, 147 or hex e2, 80, 93. Interpreted individually rather than as a single multibyte character, these bytes correspond to â€" in Windows-1252; in ISO-8859-1 the last two are invisible control characters, which is why only the â shows up in the stored value.
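You can verify the byte sequence with DUMP (a quick probe; the column alias is mine):
SELECT dump(convert(unistr('\2013'), 'AL32UTF8'), 16) AS en_dash_bytes FROM dual;
-- expected output: Typ=1 Len=3: e2,80,93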
Try opening the file with utl_file.fopen_nchar and reading lines with utl_file.get_line_nchar.
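A minimal sketch of that approach, assuming a directory object DATA_DIR and a file names.txt (both names are hypothetical):

DECLARE
  f   utl_file.file_type;
  buf nvarchar2(32767);
BEGIN
  -- DATA_DIR and names.txt are placeholders for your own directory object and file
  f := utl_file.fopen_nchar('DATA_DIR', 'names.txt', 'r', 32767);
  BEGIN
    LOOP
      utl_file.get_line_nchar(f, buf);  -- reads each line as Unicode (NVARCHAR2)
      dbms_output.put_line(buf);        -- or insert into an NVARCHAR2 column here
    END LOOP;
  EXCEPTION
    WHEN no_data_found THEN NULL;       -- end of file reached
  END;
  utl_file.fclose(f);
END;
/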
Oracle 11gR2 Database Globalization Support Guide: Programming with Unicode.

Related

Oracle convert from utf-16 hex to utf-8 character

My database character set is AL32UTF8 and the national character set is AL16UTF16. I need to store in a table the numeric values of some characters according to the database character set, and later display a specific character using its numeric value. I had some problems understanding how this encoding works (the differences between the unistr, chr, and ascii functions and so on), but eventually I found a website where the following code was used:
chr(ascii(convert(unistr(hex), 'AL32UTF8')))
And it works fine when the hex code is smaller than 1000. But when I use, for example:
chr(ascii(convert(unistr('\1555'), 'AL32UTF8')))
chr(ascii(convert(unistr('\1556'), 'AL32UTF8')))
it returns the same ascii value (ascii(convert(unistr('\hex >= 1000'), 'AL32UTF8'))). Could anyone look at this and try to explain the reason? I really thought I understood how it works, but now I'm a bit confused.
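One way to investigate is to inspect the raw bytes with DUMP (a hypothetical probe; the column aliases are mine). Above U+007F the AL32UTF8 encoding is multibyte, which is where single-byte intuition breaks down:

SELECT dump(convert(unistr('\1555'), 'AL32UTF8'), 16) AS bytes_1555,
       dump(convert(unistr('\1556'), 'AL32UTF8'), 16) AS bytes_1556
FROM dual;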

Play framework JDBC ebean mysql exception with characters řů but accepts áõ

Trying to save models and I get a:
java.sql.SQLException: Incorrect string value: ...
Saving a text like "jedna dva tři kachna dům a kachní maso"
I'm using default.url="jdbc:mysql://[url]/[database]?characterEncoding=UTF-8"
řů have no encoding in latin1; áõ do. That suggests that CHARACTER SET latin1 is involved somewhere. Let's see SHOW CREATE TABLE.
C5 99, etc., are valid utf8 encodings for the corresponding characters (C5 99 is the UTF-8 encoding of ř).
? occurs when the destination character set cannot represent the character. Again, this points to the column/table being latin1, when it should be utf8 (or utf8mb4).
More discussion, and for debugging similar situations: Trouble with utf8 characters; what I see is not what I stored
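A sketch of that check and the usual fix, assuming the table is named my_table (a placeholder name):

-- See which character set the table/columns actually use (MySQL)
SHOW CREATE TABLE my_table;
-- If it reports latin1, convert the table to utf8mb4
ALTER TABLE my_table CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;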
The text probably has some special character, and the UTF-8 encoding that you are forcing may cause an error.
The string:
jedna dva tři kachna dům a kachní maso
has the following UTF-8 byte representation (shown ASCII-escaped):
'jedna dva t\xc5\x99i kachna d\xc5\xafm a kachn\xc3\xad maso'

special character was lost when saving excel into csv file

I have an Excel file containing a Latin character, which is shown as follows:
abcón
After saving it into a CSV file, the Latin character was lost:
abc??n
What causes this problem and how can it be solved? Thanks.
It's likely that the ó you're using in the Excel file isn't supported in ASCII text. There are a couple of different symbols that look almost, if not entirely, identical. From the Insert->Symbol character map, 00F3 is supported and comes from the Latin extended alphabet. However, 1F79 from the Greek extended alphabet is not supported, and from my casual inspection it looks identical. Try replacing the character in question with the one from the character map.
Alternatively, you can use Alt codes: 0243 for this character should work.

Parsing out abnormal characters

I have to work with text that was previously copy/pasted from an Excel document into a .txt file. There are a few characters that I assume mean something to Excel but that show up as an unrecognised character (i.e. the '?' symbol in gedit, or one of those rectangles in some other text editors). I want to parse those out somehow, but I'm unsure how to do so. I know regular expressions can be helpful, but there really isn't a pattern that matches unrecognisable characters. How should I set about doing this?
You could work with http://spreadsheet.rubyforge.org/ to read and parse the data.
I suppose you're getting these characters because the text file contains invalid Unicode characters; that means your '?'s and triangles could actually be unrecognized multi-byte sequences.
If you want to handle the spreadsheet contents properly, I recommend first exporting the data to CSV using (Open|Libre)Office and choosing UTF-8 as the file encoding.
https://en.wikipedia.org/wiki/Comma-separated_values
If you are not worried about multi-byte sequences, I find this regex handy:
line.gsub( /[^0-9a-zA-Z\-_]/, '*' )
Note that this replaces every character outside 0-9, a-z, A-Z, hyphen, and underscore (including spaces and punctuation) with *.

Importing extended ASCII into Oracle

I have a procedure that imports a binary file containing some strings. The strings can contain extended ASCII, e.g. CHR(224), 'à'. The procedure is taking a RAW and converting the BCD bytes into characters in a string one by one.
The problem is that the extended ASCII characters are getting lost. I suspect this is due to their values meaning something else in UTF8.
I think what I need is a function that takes an ASCII character index and returns the appropriate UTF8 character.
Update: If I happen to know the equivalent Oracle character set for the incoming text can I then convert the raw bytes to UTF8? The source text will always be single byte.
There's no such thing as "extended ASCII." Or, to be more precise, so many encodings are supersets of ASCII, sharing the same first 128 code points, that the term is too vague to be meaningful. You need to find out if the strings in this file are encoded using UTF-8, ISO-8859-whatever, MacRoman, etc.
The answer to the second part of your question is the same. UTF-8 is, by design, a superset of ASCII. Any ASCII character (i.e. 0 through 127) is also a UTF-8 character. To translate some non-ASCII character (i.e. >= 128) into UTF-8, you first need to find out what encoding it's in.
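For the update in the question: if the single-byte source character set is known, one option is UTL_I18N.RAW_TO_CHAR, which decodes raw bytes from a named character set into the database character set. A sketch assuming the source is WE8ISO8859P1:

-- 0xE0 is 'à' in WE8ISO8859P1; decode it into the database character set
SELECT utl_i18n.raw_to_char(hextoraw('E0'), 'WE8ISO8859P1') AS decoded FROM dual;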
