MSSQL-Server/ruby-gem sequel: How to read UTF-8 values? - ruby

I use the ruby gem sequel to read UTF-8-encoded data from an MSSQL Server table.
The fields of the table are defined as nvarchar, and they look correct in Microsoft SQL Server Management Studio (Cyrillic is Cyrillic, Chinese looks Chinese).
I connect to the database with
db = Sequel.connect(
  :adapter  => 'ado',
  :host     => connectiondata[:server],
  :database => connectiondata[:dsn],
  # Login via SSO
)
sel = db[:TEXTE].filter(:language => 'EN')
sel.each{|data|
  data.each{|key, val|
    puts "#{val.encoding}: #{val.inspect}"  #-> CP850: ....
    puts val.encode('utf-8')
  }
}
This works fine for English, and German also returns a usable result:
CP850: "(2 St\x81ck) f\x81r ..."
(2 Stück) für ...
But the result has been converted to CP850; it is not the original UTF-8.
Cyrillic languages (I tested with Bulgarian) and Chinese produce only '?'
(reasonable, because CP850 contains neither Chinese nor Bulgarian characters).
I also connected via an ODBC connection:
db = Sequel.odbc(odbckey,
  :db_type => 'mssql',     # necessary
  #:encoding => 'utf-8',   # only supported by the MySQL adapter
)
The result is ASCII-8BIT; I have to convert the data with force_encoding to CP1252 (not CP850!), roughly as sketched below.
But Cyrillic and Chinese are still not possible.
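A sketch of that conversion (the column name :text is just an example, not my real schema):
val = row[:text]                              # comes back tagged ASCII-8BIT by the odbc adapter
val.force_encoding('CP1252').encode('UTF-8')  # readable, but only for Latin-based languages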
What I have tried already:
The MySQL adapter seems to have an :encoding option, but with MSSQL I saw no effect.
I did similar tests with sqlite and sequel and had no problem with Unicode.
I installed SQLNCLI10.dll and used it as the provider, but I get an "Invalid connection string attribute" error (same with sqlncli).
So my closing question: How can I read UTF-8 data in MS-SQL via ruby and sequel?
My environment:
Client:
Windows 7
Ruby 1.9.2
sequel-3.33.0
Database:
SQL Server 2005
Database has collation Latin1_General_CI_AS
After preparing my question I found a solution. I will post it as an answer.
But I still hope there is a better way.

If you can avoid it, you really don't want to use the ado adapter (it's OK for read-only workloads, but I wouldn't recommend it for other workloads). I would try the tinytds adapter, as I believe that will handle encodings properly, and it defaults to UTF-8.
Sequel itself does not do any transcoding, it leaves the handling of encodings to the lower level driver.
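A minimal connection sketch (untested here; host, database, and credentials are placeholders, and it requires the tiny_tds gem plus FreeTDS to be installed):
db = Sequel.connect(
  :adapter  => 'tinytds',
  :host     => 'myserver',
  :database => 'mydb',
  :user     => 'myuser',
  :password => 'mypassword'
)
db[:TEXTE].filter(:language => 'EN').each { |row| puts row }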

After preparing my question I found a solution on my own.
When I add a
Encoding.default_external='utf-8'
to my code, I get the correct results.
As a side effect, every File.open now also expects UTF-8-encoded files (this can be overridden with an explicit encoding argument to File.open).
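An individual File.open can of course still pin its own encoding, for example (the filename is just an illustration):
File.open('legacy.txt', 'r:iso-8859-1') { |f| f.read }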
As an alternative to default_external, this also works:
Encoding.default_internal='utf-8'
As I mentioned in my question, I don't like changing global settings just to change the behaviour of one interface.
So I still hope for a better solution.
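In the meantime, the best I have is to limit the scope of the global change with a small helper of my own (just a sketch, not something Sequel provides; it assumes the ADO driver picks up default_external at fetch time, as it seemed to in my tests):
def with_default_external(enc)
  old = Encoding.default_external
  Encoding.default_external = enc
  yield
ensure
  Encoding.default_external = old
end

with_default_external('utf-8') do
  db[:TEXTE].filter(:language => 'EN').each { |row| puts row }
end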

Related

Does FlameRobin have problems with the insertion of umlauts?

I have a table with a field
VALCONTENT BLOB SUB_TYPE TEXT SEGMENT SIZE 80
When I browse the table, right-click an entry and select "Edit blob", the content is shown.
If I enter "normal" text ("Hello world"), I can click "Save" and things work.
If I use umlauts ("Hällö Wörld"), I get an error message:
IBPP::SQLException, Content: Statement: Execute ("Update MyTable set
foo = ? where ...") Message: isc_dsql_execute2 failed, -303, incompatible
column, malformed string
Am I doing something wrong or is FlameRobin not able to handle UTF8?
I am using Firebird 4.0 64bit, FlameRobin 0.9.3 Unicode x64 (all just downloaded).
Extracting the DDL with "iSQL -o" shows in the first line
/* CREATE DATABASE 'E:\foo.fdb' PAGE_SIZE 16384 DEFAULT CHARACTER SET
UTF8; */
I can reproduce the issue (with blob character set UTF8 and connection character set UTF8), which suggests this is a bug in FlameRobin. I recommend reporting it on https://github.com/mariuz/flamerobin/issues. I'm not sure what the problem is. Updating does seem to work fine when using connection character set WIN1252.
Consider using a different tool, such as DBeaver or IBExpert.

Autocommit ODBC api not working through IBM iAccess to unixODBC to ruby-odbc

I am currently using ODBC to access an IBM AS/400 machine through Rails -> (small AS/400 ODBC adapter) -> odbc_adapter (gem 4.2.4) -> ruby-odbc (gem 0.999991) -> unixODBC (2.3.4, Ubuntu 18.04) -> IBM iAccess (latest). By some miracle this all works pretty well, except that recently we were having problems with strings containing specific characters, which caused an error to be raised in ruby-odbc.
Retrieving a record with the special character '¬' fails with:
ActiveRecord::StatementInvalid (ArgumentError: negative string size (or size too big): SELECT * from data WHERE id = 4220130.0)
Seems the special character ends up taking up 2 bytes and whatever conversions are happening don't handle this correctly.
Strings without special characters are being returned with encoding Encoding:ASCII-8BIT.
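(For reference, that character really does take two bytes in UTF-8: '¬'.bytes # => [194, 172].)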
There is a utf8 version of ruby-odbc, which I was able to load by requiring it in our iSeries adapter, and then requiring odbc_adapter:
require 'odbc_utf8' # force odbc_adapter to use the utf8 version
require 'odbc_adapter'
This allows the utf8 version of odbc-ruby to occupy the ODBC module name, which the odbc_adapter will just use. Though there was a problem:
odbc_adapter calls .get_info for a number of fields on raw_connection (odbc-ruby), and these strings come back wrong, for example the string "DB2/400 SQL" which is from ODBC::SQL_DBMS_NAME looks like: "D\x00B\x002\x00/\x004\x000\x000\x00 \x00S\x00Q\x00L\x00", with an encoding of Encoding:ASCII-8BIT. odbc_adapter uses a regex to map dbms to our adapter, which doesn't match: /DB2\/400 SQL/ =~ (this_string) => null.
I'm not super familiar with string encodings, but was able to hack in a .gsub("\0", "") here to fix this detection. After this, I can return records with special characters in their strings. They are returned without error, with visible special characters in Encoding:UTF-8.
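For what it's worth, those bytes look like UTF-16LE that has been tagged as ASCII-8BIT, so instead of stripping the NUL bytes they could probably be transcoded (a sketch, assuming the driver really does hand back UTF-16LE here):
raw  = "D\x00B\x002\x00/\x004\x000\x000\x00 \x00S\x00Q\x00L\x00"
name = raw.force_encoding('UTF-16LE').encode('UTF-8')
# => "DB2/400 SQL"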
Of course, now querying on special characters fails:
ActiveRecord::StatementInvalid (ODBC::Error: HY001 (30027) [IBM][System i Access ODBC Driver]Memory allocation error.: SELECT * from data WHERE (mystring like '%¬%'))
but I'm not too concerned with that. The problem now is that it seems the UTF8 version of ruby-odbc sets the ODBC version to 3, where on the non-utf8 version it was 2:
Base.connection.raw_connection.odbc_version => 3
And this seems to prevent autocommit from working (works on version 2):
Base.connection.raw_connection.autocommit => true
Base.connection.raw_connection.autocommit = false
ODBC::Error (IM001 (0) [unixODBC][Driver Manager]Driver does not support this function)
This function is used to start/end transactions in the odbc_adapter, and seems to be a standard feature of odbc:
https://github.com/localytics/odbc_adapter/blob/master/lib/odbc_adapter/database_statements.rb#L51
I poked around in the IBMiAccess documentation, and found something about transaction levels and a new "trueautocommit" option, but I can't seem to figure out if this trueautocommit replaces autocommit, or even if autocommit is no longer supported in this way.
https://www.ibm.com/support/pages/ibm-i-access-odbc-commit-mode-data-source-setting-isolation-level-and-autocommit
Of course I have no idea how to set this new 'trueautocommit' connection setting via the ruby-odbc gem. It does support .set_option on the raw_connection, so I can call something like .set_option(ODBC::SQL_AUTOCOMMIT, false), which fails in exactly the same way. ODBC::SQL_AUTOCOMMIT is just a constant for 102, which I've found referenced in a few places regarding ODBC, so if I could find the constant for TRUEAUTOCOMMIT, I might be able to set it the same way, but I can't find any documentation for it.
Any advice for getting .autocommit working in this configuration?
Edit: Apparently you can use a DSN for odbc, so I've also tried creating one in /etc/odbc.ini, along with the option for "TrueAutoCommit = 1" but this hasn't changed anything as far as getting .autocommit to work.

Hibernate Insert Greek Characters into Oracle

I have a requirement to insert Greek characters, such as 'ϕ', into Oracle. My existing DB structure didn't support it. On investigating I found various solutions and adopted the one of using NCLOB instead of CLOB. It works perfectly fine when I use the Unicode code point 03A6 for 'ϕ' and the UNISTR function in a SQL editor, like the one below.
UPDATE config set CLOB = UNISTR('\03A6')
However, it fails when I try to insert the character through my application using Hibernate. On debugging, I find that the string before inserting is '\u03A6'. After the insert, I see it as ¿.
Can someone please help me resolve this? How do I make use of UNISTR?
Please note: I don't use any native SQL or HQL; I use the entity object of the table.
Edit:
The Hibernate version used is 3.5.6. I cannot change the version as there are so many other plugins dependent on it. Thus, I cannot use @Nationalized or @Type(type="org.hibernate.type.NClobType") on my field in the Hibernate entity.
After racking my brain on different articles and trying so many options, I finally decided to tweak my code a bit and handle this in Java, not through Hibernate or the Oracle DB.
Before inserting into the DB, I identify non-ASCII characters in the string and encode each of them as a numeric character reference (&#x appended at the beginning, ; at the end). This encoded string is displayed with its actual Unicode character in a UI (JSP, HTML) that is UTF-8 compliant. Below is the code.
Formatter formatter = new Formatter();
for (char c : text.toCharArray()) {
    if (CharUtils.isAscii(c)) {                 // CharUtils is from Apache Commons Lang
        formatter.format("%c", c);              // keep plain ASCII as-is
    } else {
        formatter.format("&#x%04x;", (int) c);  // escape as numeric character reference
    }
}
String result = formatter.toString();
Ex:
String test = "ABC \uf06c DEF \uf06cGHI";
will be formatted to
test = "ABC &#xf06c; DEF &#xf06c;GHI";
When this string is rendered in the UI (and even in a Word doc), it displays as
ABC  DEF GHI
I tried it with various unicode characters and it works fine.

Yaml encoding issues after upgrading Ruby to 1.9.3 from 1.8.7

Maybe you can help with a Yaml encoding issue I have.
We have an application that stores some settings serialized in a database as a Yaml string, for example:
---
quantity_units: Stunden,Tage, Monate, Pauschal, Jahre, GB, MB, Stück, Seite, SMS
categories: Shirts
number_schema: P-[Y4]-[CY3]
We are in the process of moving from Ruby 1.8.7 to Ruby 1.9.3, and the Yaml parsing library has changed between the versions, leaving us with badly decoded strings: Stück, for example, comes back as StÃ¼ck.
I only want to know how to properly convert these strings back to Unicode; I'll take care of the rest.
I don't know which encoding the Yaml parser was using in 1.8.7.
This looks like utf8 read as iso-8859-1, and interpreted as utf-8 by
the ruby adapter. You might want to check your current locale and the
locale of the database server. Also see what happens if you access the
data directly via console, and check the encoding there as well. It
looks like utf-8 on the database, but gets interpreted as iso-8859-1
somewhere in between.
If nothing helps, there's a snippet to pass your data through (and
write it back).
"Stück".encode('iso-8859-1').force_encoding('utf-8') # I've no idea what I'm doing.
# => "Stück"
Thank you @Tass, I wrote a strange method like your "# I've no idea what I'm doing."
I have an application in Rails 2.3 under Ruby 1.8 which shares a MySQL database with Rails 3.2 and Ruby 1.9.
On Rails 2.2:
When I save a serialized Array, I sometimes see "binary!" in MySQL, or my string in the wrong format, and so when I display the text with Rails 3.2 I get strange behaviour.
I wrote a method to handle this problem (I hope we will migrate the Rails 2.3 app):
def self.decode(words)
  temp_name = words || ''
  temp_name_encoding = temp_name.encoding
  if temp_name_encoding == Encoding::ASCII_8BIT
    return temp_name.encode('ASCII-8BIT').force_encoding('utf-8')
  elsif temp_name_encoding == Encoding::UTF_8
    return temp_name.encode('iso-8859-1').force_encoding('utf-8')
  else
    return temp_name
  end
rescue Encoding::UndefinedConversionError
  temp_name
end
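For example (calling it from the class it is defined on; the inputs mirror the two cases above):
decode("StÃ¼ck")                               # UTF-8-tagged mojibake   => "Stück"
decode("Stück".force_encoding('ASCII-8BIT'))   # binary-tagged raw bytes => "Stück"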

UTF-8 formatting in SPARQL

How can I "say" to SPARQL that ?churchname is in UTF-8 formatting? because response is like:Pražský hrad
PREFIX lgv: <http://linkedgeodata.org/vocabulary#>
PREFIX abc: <http://dbpedia.org/class/yago/>
SELECT ?churchname
WHERE
{
  <http://dbpedia.org/resource/Prague> geo:geometry ?gm .
  ?church a lgv:castle .
  ?church geo:geometry ?churchgeo .
  ?church lgv:name ?churchname .
  FILTER ( bif:st_intersects (?churchgeo, ?gm, 10) )
}
GROUP BY ?churchname
ORDER BY ?churchname
Bit of a non-answer, I'm afraid: there is no way to do this in SPARQL. SPARQL works on character data (not bytes), so encoding isn't something it's concerned with.
There are a couple of reasons you might have this problem. Firstly, you might be handling the results incorrectly. Check whether the raw results do indeed have the encoding problem.
If you're doing the right thing, then the problem you are seeing is that the data is broken, and it certainly looks like an encoding issue has crept in upstream.
Your options are:
See if the SPARQL endpoint supports an extension function to change the encoding. I'm fairly sure this doesn't exist, but perhaps someone from Virtuoso (which it looks like you're using) knows better.
Fix the results when you get them. Suboptimal, I know (a sketch of what that could look like is below).
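If you go with the second option and your client happens to be Ruby, the usual repair for this kind of double encoding looks something like this (just a sketch, assuming the labels are UTF-8 that was mis-read as ISO-8859-1 somewhere upstream):
"PraÅ¾skÃ½ hrad".encode('iso-8859-1').force_encoding('utf-8')  # => "Pražský hrad"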
