Failed to compare UTF-8 chars in Ruby - ruby

I'm using Ruby with Cucumber for automation.
I'm trying to pass Japanese characters as a parameter to a user-defined function that verifies them against the database.
This is the statement I used:
x=$objDB.run_select_query_verifyText('select name from xxxx where id=1','ごせり槎ゃぱ')
The run_select_query_verifyText() function connects to the database, fetches the records, and verifies the text passed as a parameter (the Japanese characters) against them.
It returns true if the string matches the table data in the database, and false otherwise.
However, I always get false, and I found that the Japanese string is converted to "??????" when the data is compared.
Note: My program works fine with English characters.

Your problem is most likely with character encodings. The database returns the content in a different encoding than the Ruby string you are working with. You need to figure out what the db encoding is and make sure both are the same.
If you are using Ruby 1.9, you can check the current encoding with yourstring.encoding and change it to e.g. UTF-8 with yourstring.encode("UTF-8").
If you are on Ruby 1.8, things are a bit more tricky, as the String class doesn't natively support encodings. You can use e.g. the character-encodings gem to work around this.
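As a minimal sketch of that advice on Ruby 1.9 or later: the helper name same_text? and the sample values below are hypothetical, and the sketch assumes the driver hands back the raw bytes tagged as binary, which may not match your setup.

```ruby
# Hypothetical helper (not part of the original code): bring the database
# value to UTF-8 before comparing it with the expected text.
def same_text?(expected, actual, db_encoding = Encoding::UTF_8)
  # Assumption: the bytes in `actual` really are in db_encoding, even if the
  # driver tagged the string as binary (ASCII-8BIT).
  actual_utf8 = actual.dup.force_encoding(db_encoding).encode(Encoding::UTF_8)
  expected.encode(Encoding::UTF_8) == actual_utf8
end

expected = "ごせり槎ゃぱ"
raw_utf8 = expected.b                      # same bytes, tagged ASCII-8BIT
raw_sjis = "ごせり".encode("Shift_JIS").b   # Shift_JIS bytes, tagged ASCII-8BIT

# A plain == fails because the two strings carry different encodings.
naive_match = (expected == raw_utf8)

puts naive_match
puts same_text?(expected, raw_utf8)
puts same_text?("ごせり", raw_sjis, Encoding::Shift_JIS)
```

Printing yourstring.encoding for both sides first is the quickest way to see which case you are in.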

Related

AWS SAM throws UnicodeEncodeError when invoking NodeJS 12.x lambda function [duplicate]

What could be causing this error when I try to insert a foreign character into the database?
>>UnicodeEncodeError: 'latin-1' codec can't encode character u'\u201c' in position 0: ordinal not in range(256)
And how do I resolve it?
Thanks!
I ran into this same issue when using the Python MySQLdb module. Since MySQL will let you store just about any binary data you want in a text field regardless of character set, I found my solution here:
Using UTF8 with Python MySQLdb
Edit: Quote from the above URL to satisfy the request in the first comment...
"UnicodeEncodeError:'latin-1' codec can't encode character ..."
This is because MySQLdb normally tries to encode everything to latin-1.
This can be fixed by executing the following commands right after
you've established the connection:
db.set_character_set('utf8')
dbc.execute('SET NAMES utf8;')
dbc.execute('SET CHARACTER SET utf8;')
dbc.execute('SET character_set_connection=utf8;')
"db" is the result of MySQLdb.connect(), and "dbc" is the result of
db.cursor().
Character U+201C Left Double Quotation Mark is not present in the Latin-1 (ISO-8859-1) encoding.
It is present in code page 1252 (Western European). This is a Windows-specific encoding that is based on ISO-8859-1 but which puts extra characters into the range 0x80-0x9F. Code page 1252 is often confused with ISO-8859-1, and it's an annoying but now-standard web browser behaviour that if you serve your pages as ISO-8859-1, the browser will treat them as cp1252 instead. However, they really are two distinct encodings:
>>> u'He said \u201CHello\u201D'.encode('iso-8859-1')
UnicodeEncodeError
>>> u'He said \u201CHello\u201D'.encode('cp1252')
'He said \x93Hello\x94'
If you are using your database only as a byte store, you can use cp1252 to encode “ and other characters present in the Windows Western code page. But still other Unicode characters which are not present in cp1252 will cause errors.
You can use encode(..., 'ignore') to suppress the errors by getting rid of the characters, but really in this century you should be using UTF-8 in both your database and your pages. This encoding allows any character to be used. You should also ideally tell MySQL you are using UTF-8 strings (by setting the database connection and the collation on string columns), so it can get case-insensitive comparison and sorting right.
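For readers following along in Ruby rather than Python, the same distinction between ISO-8859-1 and Windows-1252 can be sketched as follows. This is only an illustration; the variable names are made up for the example.

```ruby
text = "He said \u201CHello\u201D"

# ISO-8859-1 has no curly quotes, so a strict conversion raises.
latin1_error = begin
  text.encode("ISO-8859-1")
  nil
rescue Encoding::UndefinedConversionError => e
  e
end

# Windows-1252 places the curly quotes at 0x93 and 0x94.
cp1252_bytes = text.encode("Windows-1252").bytes

# Rough counterpart of Python's encode(..., 'ignore'):
# drop whatever the target encoding cannot represent.
stripped = text.encode("ISO-8859-1", undef: :replace, replace: "")

puts latin1_error.class
puts cp1252_bytes.map { |b| format("%02x", b) }.join(" ")
puts stripped.encode("UTF-8")
```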
The best solution is to:
set MySQL's charset to 'utf8'
connect as this comment suggests (add use_unicode=True and charset="utf8"):
db = MySQLdb.connect(host="localhost", user = "root", passwd = "", db = "testdb", use_unicode=True, charset="utf8") – KyungHoon Kim, Mar 13 '14 at 17:04
For details, see the docstring of the Connection class:
class Connection(_mysql.connection):
    """MySQL Database Connection Object"""

    default_cursor = cursors.Cursor

    def __init__(self, *args, **kwargs):
        """
        Create a connection to the database. It is strongly recommended
        that you only use keyword parameters. Consult the MySQL C API
        documentation for more information.

        host
          string, host to connect
        user
          string, user to connect as
        passwd
          string, password to use
        db
          string, database to use
        port
          integer, TCP/IP port to connect to
        unix_socket
          string, location of unix_socket to use
        conv
          conversion dictionary, see MySQLdb.converters
        connect_timeout
          number of seconds to wait before the connection attempt
          fails.
        compress
          if set, compression is enabled
        named_pipe
          if set, a named pipe is used to connect (Windows only)
        init_command
          command which is run once the connection is created
        read_default_file
          file from which default client values are read
        read_default_group
          configuration group to use from the default file
        cursorclass
          class object, used to create cursors (keyword only)
        use_unicode
          If True, text-like columns are returned as unicode objects
          using the connection's character set. Otherwise, text-like
          columns are returned as strings. columns are returned as
          normal strings. Unicode objects will always be encoded to
          the connection's character set regardless of this setting.
        charset
          If supplied, the connection character set will be changed
          to this character set (MySQL-4.1 and newer). This implies
          use_unicode=True.
        sql_mode
          If supplied, the session SQL mode will be changed to this
          setting (MySQL-4.1 and newer). For more details and legal
          values, see the MySQL documentation.
        client_flag
          integer, flags to use or 0
          (see MySQL docs or constants/CLIENTS.py)
        ssl
          dictionary or mapping, contains SSL connection parameters;
          see the MySQL documentation for more details
          (mysql_ssl_set()). If this is set, and the client does not
          support SSL, NotSupportedError will be raised.
        local_infile
          integer, non-zero enables LOAD LOCAL INFILE; zero disables
        autocommit
          If False (default), autocommit is disabled.
          If True, autocommit is enabled.
          If None, autocommit isn't set and server default is used.

        There are a number of undocumented, non-standard methods. See the
        documentation for the MySQL C API for some hints on what they do.
        """
I hope your database is at least UTF-8. Then you will need to run yourstring.encode('utf-8') before you try putting it into the database.
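Use the snippet below to strip the accents from Latin characters, reducing them to their unaccented base letters:
import unicodedata

def strip_accents(text):
    # Decompose each character (NFKD), then drop the combining marks (category Mn).
    return "".join(char for char in
                   unicodedata.normalize('NFKD', text)
                   if unicodedata.category(char) != 'Mn')

strip_accents('áéíñóúü')
output:
'aeinouu'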
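A rough Ruby counterpart of the same accent-stripping idea, assuming Ruby 2.2 or later (where String#unicode_normalize is available), might look like this. Note that it discards information rather than fixing the encoding problem itself.

```ruby
# Decompose to NFKD, then remove combining marks (Unicode category Mn).
def strip_accents(text)
  text.unicode_normalize(:nfkd).gsub(/\p{Mn}/, "")
end

puts strip_accents("áéíñóúü")
```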
You are trying to store a Unicode codepoint \u201c using an encoding ISO-8859-1 / Latin-1 that can't describe that codepoint. Either you might need to alter the database to use utf-8, and store the string data using an appropriate encoding, or you might want to sanitise your inputs prior to storing the content; i.e. using something like Sam Ruby's excellent i18n guide. That talks about the issues that windows-1252 can cause, and suggests how to process it, plus links to sample code!
SQLAlchemy users can simply specify their field as convert_unicode=True.
Example:
sqlalchemy.String(1000, convert_unicode=True)
SQLAlchemy will simply accept unicode objects and return them back, handling the encoding itself.
Docs
Latin-1 (aka ISO 8859-1) is a single octet character encoding scheme, and you can't fit \u201c (“) into a byte.
Did you mean to use UTF-8 encoding?
UnicodeEncodeError: 'latin-1' codec can't encode character '\u2013' in position 106: ordinal not in range(256)
Solution 1:
Look up \u2013 to identify which character is actually causing the error (it is the en dash). You can then replace that specific character in the string with another character that is part of the encoding you are using.
Solution 2:
Change the string's encoding to one that includes all the characters in your string. You can then print the string and it will work just fine.
The code below changes the encoding of the string (borrowed from @bobince):
u'He said \u201CHello\u201D'.encode('cp1252')
The latest version of mysql.connector has only
db.set_charset_collation('utf8', 'utf8_general_ci')
and NOT
db.set_character_set('utf8') //This feature is not available
I ran into the same problem when I was using PyMySQL. I checked the package version and it was 0.7.9.
I then uninstalled it and installed PyMySQL 1.0.2, which solved the issue.
pip uninstall PyMySQL
pip install PyMySQL
Python: You will need to add
# -*- coding: UTF-8 -*-
to the first line of the Python file, and then append the following to the text to encode: .encode('ascii', 'xmlcharrefreplace'). This replaces every non-ASCII character with its XML numeric character reference (for example, “ becomes &#8220;).
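In Ruby, a comparable effect can be sketched with the fallback: option of String#encode. The helper name ascii_with_char_refs is hypothetical, and the decimal reference format is chosen here to mirror what Python's xmlcharrefreplace produces.

```ruby
# Hypothetical helper: encode to ASCII, replacing each character that ASCII
# cannot represent with a decimal XML character reference.
def ascii_with_char_refs(text)
  text.encode("US-ASCII", fallback: ->(char) { format("&#%d;", char.ord) })
end

puts ascii_with_char_refs("\u201CHi\u201D")
```

This only makes sense when the output is consumed as HTML or XML; stored as plain text, the references show up literally.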

Read a CSV file with special characters in Ruby and store into SQL Server

I'm trying to import a CSV file (UTF-8 encoding) in Ruby (2.0.0) into my database (MSSQL 2008R2, COLLATION French_CI_AS), but the special characters (French accents on vowels) are not stored properly: éèçôü becomes éèçôü (or other similar gibberish).
I use this piece of code to read the file :
CSV.foreach(file, col_sep: ';', encoding: "utf-8") do |row|
# ...
end
I tried various encoding in the CSV options (utf-8, iso-8859-1, windows-1252), but none would store the special characters correctly.
Before you ask, my database collation supports those characters, since we have successfully imported data containing those using PHP importers. If I dump the data using puts or a file logger, everything is correct.
Is something wrong with my code, or do I need to specify something else (like the ruby class file encoding for example) ?
Thanks
EDIT : The data saving is done by a PHP REST API that works fine with accented characters. It stores data as it is received.
In Ruby, I parse my data, store it in an object and then send the JSON-encoded object in the body of my PUT request. But if I use an SQL query directly from Ruby, the problem remains :
query = <<-SQL
UPDATE MyTable SET MyTable_title = '#{row_data['title']}' WHERE MyTable_id = '#{row_data['id']}'
SQL
res = db.execute query
I suspected that this had something to do with the encoding of your CSV file, so I started digging around on that. I did find that windows-1252 encoding will insert control characters.
You can read more about it here: Converting special charactes such as ü and à back to their original, latin alphbet counterparts in C#
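One common cause of this kind of garbling, which may or may not be what happened in this case, is UTF-8 bytes being interpreted as Windows-1252 somewhere between the script and the database. The sketch below reproduces that symptom so it can be compared with what ends up stored; the variable names are made up for the example.

```ruby
original = "\u00E9\u00E8\u00E7\u00F4\u00FC"   # "éèçôü"

# Simulate the misreading: take the UTF-8 bytes and decode them as Windows-1252.
garbled = original.b.force_encoding("Windows-1252").encode("UTF-8")

# If that is what happened, the damage is reversible as long as no byte was lost.
repaired = garbled.encode("Windows-1252").force_encoding("UTF-8")

puts garbled
puts repaired == original
```

If the stored value matches garbled, the connection or driver encoding is the place to look, rather than the CSV options.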

String not valid UTF-8 (BSON::InvalidStringEncoding) when saving a UTF8 compatible string to MongoDB through Mongoid ORM

I am importing data from a MySQL table into MongoDB using Mongoid for my ORM. I am getting an error when trying to save an email address as a string. The error is:
/Library/Ruby/Gems/1.8/gems/bson-1.2.4/lib/../lib/bson/bson_c.rb:24:in `serialize': String not valid UTF-8 (BSON::InvalidStringEncoding)
from /Library/Ruby/Gems/1.8/gems/bson-1.2.4/lib/../lib/bson/bson_c.rb:24:in `serialize'
From my GUI - this is a screenshot of the table info. You can see it's encoded in UTF8.
Also from my GUI - this is a screen shot of the field in my MySQL table that I am importing
This is what happens when I grab the data from MySQL CLI.
And finally, when I inspect the data in my ruby object, I get something that looks like this:
I'm a bit confused here, because my table is in UTF-8 and that funky character is apparently a valid double-byte UTF-8 character. Does anyone know why I'm getting this error?
Try using this helper:
http://snippets.dzone.com/posts/show/4527
It puts a method utf8? on the String. So you can grab the String from mysql and see if it is utf8:
my_string.utf8?
If it is not, you can try changing the encoding of your String using other methods like:
my_string.asciify_utf8
my_string.latin1_to_utf8
my_string.cp1252_to_utf8
my_string.utf16le_to_utf8
Maybe this String is saved on mysql in one of these encodings.
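On Ruby 1.9 or later (the question itself is on 1.8, where this is not available), the built-in String#valid_encoding? covers the same check. The sketch below uses a hypothetical helper, to_valid_utf8, and assumes ISO-8859-1 as the fallback, which is only a guess at what the source data might be.

```ruby
# Hypothetical helper: keep the bytes if they are already valid UTF-8,
# otherwise reinterpret them in a fallback encoding and transcode.
def to_valid_utf8(bytes, fallback_encoding = Encoding::ISO_8859_1)
  as_utf8 = bytes.dup.force_encoding(Encoding::UTF_8)
  return as_utf8 if as_utf8.valid_encoding?

  bytes.dup.force_encoding(fallback_encoding).encode(Encoding::UTF_8)
end

latin1_email = "jos\xE9@example.com".b    # Latin-1 bytes, invalid as UTF-8
utf8_email   = "jos\u00E9@example.com".b  # UTF-8 bytes, tagged as binary

puts to_valid_utf8(latin1_email)
puts to_valid_utf8(utf8_email)
```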

Ruby hexacode to unicode conversion

I crawled a website that contains Unicode, and the results look something like this in code:
a = "\\u2665 \\uc624 \\ube60! \\uc8fd \\uae30 \\uc804 \\uc5d0"
How can I convert it back in Ruby to the original Unicode text, in UTF-8?
If you have ruby 1.9, you can try:
a.force_encoding('UTF-8')
Otherwise if you have < 1.9, I'd suggest reading this article on converting to UTF-8 in Ruby 1.8.
short answer: you should be able to 'puts a', and see the string printed out. for me, at least, I can print out that string in both 1.8.7 and 1.9.2
long answer:
First thing: it depends on if you're using ruby 1.8.7, or 1.9.2, since the way strings and encodings were handled changed.
in 1.8.7:
strings are just lists of bytes. when you print them out, if your OS can handle it, you can just 'puts a' and it should work correctly. if you do a[0], you'll get the first byte. if you want to get each character, things are pretty darn tricky.
in 1.9.2
strings are lists of bytes, with an encoding. If the webpage was sent with the correct encoding, your string should already be encoded correctly. if not, you'll have to set it (as per Mike Lewis's answer). if you do a[0], you'll get the first character (the heart). if you want each byte, you can do a.bytes.
If your OS, for whatever reason, is giving you those literal ASCII characters, my previous answer is obviously invalid; disregard it. :P
Here's what you can do:
a.gsub(/\\u([0-9a-fA-F]{4})/) { [$1.hex].pack("U") }
This scans for the ASCII string '\u' followed by four hexadecimal digits, and replaces it with the corresponding Unicode character.
You can also specify the encoding when you open a new IO object: http://www.ruby-doc.org/core/classes/IO.html#M000889
Compared to Mike's solution, this may prevent troubles if you forget to force the encoding before exposing the string to the rest of your application, if there are multiple mechanisms for retrieving strings from your module or class. However, if you begin crawling SJIS or KOI-8 encoded websites, then Mike's solution will be easier to adapt for the character encoding name returned by the web server in its headers.
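As a small sketch of specifying the encoding when opening an IO object: the temporary file and its Shift_JIS content below are made up purely for illustration. The mode string "r:Shift_JIS:UTF-8" names the external encoding first and the internal one second.

```ruby
require "tempfile"

# Sample data for the illustration: Shift_JIS bytes written to a temp file.
sjis_bytes = "こんにちは".encode("Shift_JIS").b

text = Tempfile.create("sjis-sample") do |file|
  file.binmode
  file.write(sjis_bytes)
  file.flush
  # Read as Shift_JIS and transcode to UTF-8 on the way in.
  File.open(file.path, "r:Shift_JIS:UTF-8") { |io| io.read }
end

puts text.encoding
puts text
```

Every string read this way arrives already transcoded, so nothing downstream has to remember to call force_encoding.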

What options do exist now to implement UTF8 in Ruby and RoR?

Following the development of Ruby very closely I learned that detailed character encoding is implemented in Ruby 1.9. My question for now is: How may Ruby be used at the moment to talk to a database that stores all data in UTF8?
Background: I am involved in a new project where Ruby/RoR is at least an option. But the project needs to rely on an internationalized character set (it's spread over many countries), preferably UTF8.
So how do you deal with that? Thanks in advance.
Ruby 1.8 works fine with UTF-8 strings for basic operations with the strings. Depending on your application's need, some operations will either not work or not work as expected.
E.g.:
1) The size of a string is given in bytes, not characters, since multi-byte support is not there yet. But do you need to know the size of your strings in characters?
2) No splitting a string at a character boundary. But do you need this?
3) Sorting order will be funky if sorted in Ruby. The suggestion of using the db to sort is a good idea.
etc.
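For comparison, on Ruby 1.9 and later the byte/character distinction from points 1 and 2 is explicit. A short sketch with a made-up sample string:

```ruby
s = "\u00BFComo est\u00E1s?"   # "¿Como estás?"

puts s.length      # counts characters
puts s.bytesize    # counts bytes in the UTF-8 representation
puts s[0]          # first character, not first byte
puts s.reverse     # reverses by character, so multi-byte characters stay intact
```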
Re poster's comment about sorting data after reading from db: As noted, results will probably not match users' expectations. So the solution is to sort on the db. And it will usually be faster, anyhow--databases are designed to sort data.
Summary: My Ruby 1.8.6 RoR app works fine with international Unicode characters processed and stored as UTF-8 on modern browsers. Right to left languages work fine too. Main issues: be sure that your db and all web pages are set to use UTF-8. If you already have some data in your db, then you'll need to go through a conversion process to change it to UTF-8.
Regards,
Larry
"Unicode ahoy! While Rails has always been able to store and display unicode with no beef, it’s been a little more complicated to truncate, reverse, or get the exact length of a UTF-8 string. You needed to fool around with KCODE yourself and while plenty of people made it work, it wasn’t as plug’n’play easy as you could have hoped (or perhaps even expected).
So since Ruby won’t be multibyte-aware until this time next year, Rails 1.2 introduces ActiveSupport::Multibyte for working with Unicode strings. Call the chars method on your string to start working with characters instead of bytes."
Although I haven't tested it, the character-encodings library (currently in alpha) adds methods to the String class to handle UTF-8 and others. Its page on RubyForge is here. It is designed for Ruby 1.8.
It is my experience, however, that, using Ruby 1.8, if you store data in your database as UTF-8, Ruby will not get in the way as long as your character encoding in the HTTP header is UTF-8. It may not be able to operate on the strings, but it won't break anything. Example:
file.txt:
¡Hola! ¿Como estás? Leí el artículo. ¡Fue muy excellente!
Pardon my poor Spanish; it was the best example of Unicode I could come up with.
in irb:
str = File.read("file.txt")
=> "\302\241Hola! \302\277Como est\303\241s? Le\303\255 el art\303\255culo. \302\241Fue muy excellente!\n"
str += "Foo is equal to bar."
=> "\302\241Hola! \302\277Como est\303\241s? Le\303\255 el art\303\255culo. \302\241Fue muy excellente!\nFoo is equal to bar."
str = " " + str + " "
=> " \302\241Hola! \302\277Como est\303\241s? Le\303\255 el art\303\255culo. \302\241Fue muy excellente!\nFoo is equal to bar. "
str.strip
=> "\302\241Hola! \302\277Como est\303\241s? Le\303\255 el art\303\255culo. \302\241Fue muy excellente!\nFoo is equal to bar."
Basically, it will just treat the UTF-8 as ASCII with odd characters in it. It will not sort alphabetically where code-point order differs from alphabetical order; it sorts by byte value, which for UTF-8 is the same as code-point order. Example:
"\302" <=> "\301"
=> 1
How much are you planning on operating on the data in the Rails app, anyway? Most sorting etc. is usually done by your database engine.
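To make the sorting caveat concrete, here is a sketch on Ruby 2.2 or later with made-up sample words. The second sort key is only a crude stand-in for a real collation, which the database would normally provide.

```ruby
words = ["zebra", "\u00E9clair", "apple"]   # "\u00E9clair" is "éclair"

# Default sort compares bytes, so "é" (U+00E9) lands after "z".
by_code_point = words.sort

# Crude approximation of an accent-insensitive order: sort on the
# accent-stripped form. Not a substitute for a proper collation.
by_base_letter = words.sort_by { |w| w.unicode_normalize(:nfkd).gsub(/\p{Mn}/, "") }

puts by_code_point.inspect
puts by_base_letter.inspect
```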
