Encoding utf-8 string with MySQL and Rails 3 not working - ruby

Yet another encoding problem with MySQL, UTF-8 and a Rails 3 application.
We recently migrated our code from Rails 2 to Rails 3. We use MySQL and the mysql2 gem. Our old database has content that includes raw UTF-8 characters instead of their corresponding HTML entities, such as \xC3\x9F for ß (sharp s).
Those strings are stored as a YAML serialization of text that goes into the website. The problem is that when the records are loaded from the database into ActiveRecord objects, the strings come out with strange characters, which look really nasty on the web. For example, ß is shown as Ã and so on.
I played a bit with the new encoding magic of Rails 3, trying various combinations of force_encoding and encode methods with no luck.
For the record, MySQL is started with these two lines:
character-set-server=utf8
collation-server=utf8_unicode_ci
Any idea what we are doing wrong, why YAML is not reading those escaped characters correctly, and what we could do to solve the issue?
Cheers

OK, I just found out what the problem was: the YAML text was written with Syck, and Psych now does not like it much.
I found the answer here https://stackoverflow.com/a/8570162/196708
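A minimal sketch of how this mismatch can show up, assuming the legacy YAML stores the UTF-8 bytes of ß as the escapes \xC3\x9F: Psych reads each \xNN escape as a separate code point, so the loaded string holds two characters instead of one. The helper name repair_syck_escapes and the sample document are hypothetical, and the linked answer may well take a different approach.

```ruby
require 'yaml'

# Hypothetical helper (not from the original post): turn every code point
# below U+0100 back into a single byte, then read those bytes as UTF-8.
# Falls back to the input when the result would not be valid UTF-8 or when
# the string contains characters outside ISO-8859-1.
def repair_syck_escapes(str)
  repaired = str.encode('ISO-8859-1').force_encoding('UTF-8')
  repaired.valid_encoding? ? repaired : str
rescue EncodingError
  str
end

# Assumed shape of a Syck-era dump: UTF-8 bytes written as \xNN escapes.
legacy_yaml = '--- "Stra\xC3\x9Fe"'

loaded = YAML.load(legacy_yaml)       # Psych decodes \xC3 and \x9F as U+00C3 and U+009F
fixed  = repair_syck_escapes(loaded)  # bytes C3 9F reinterpreted as UTF-8
```

The repair is only safe if every affected string went through the same mangling; a string that legitimately contains "Ã" followed by another Latin-1 character would be rewritten too.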

Related

php wrong utf8 characters from mysql using Twig

I'm building a web app in PHP, using my own MVC pattern, including ActiveRecord and Twig templates.
I have some problems with the charset; here are some details about my encoding setup.
I'm using Polish characters
MySQL collation is set to utf8_unicode_ci (I also tried utf8_general_ci)
The Twig template has a standard HTML5 header with UTF-8 encoding
I'm not sure about the file encoding (I'm using NetBeans), but in the Sublime Text 2 console view.encoding() returns u'Undefined'; I haven't tried to change it yet.
Problem description:
When I use Polish characters like ółąćź directly in the Twig template file, everything looks good. I also tried:
echo $twig->render('hello.tpl', array('locations'=>"óóśąłłąś"));
In this case there is no problem either.
But when I get my data from the database, the Polish characters are shown as "�".
I tried fetching the data with procedural PHP mysql calls and with ActiveRecord, e.g. Model::all().
Characters from the database always come out broken in the Twig template.
And yes, I set my ActiveRecord config like this: dbname?charset=utf8
The answer is funny.
I tried the procedural approach again and ran this query first:
mysql_query("SET NAMES 'utf8'", $dbLink);
It works; all characters are visible now.
With ActiveRecord the problem still appeared, so I updated ActiveRecord to the nightly build, and everything works now!

Nokogiri producing different results on heroku?

I'm having a very strange problem and I'd appreciate help tracking it down.
I'm using the nokogiri gem to parse some HTML, and the file I am parsing has a weird character in it. I'm not entirely sure what this character is; in vim it shows as ^Q.
On my own computer everything works fine, but on Heroku Nokogiri inserts a </body></html><html> when it hits the character, and selectors only return the elements before the weird character.
To illustrate:
Nokogiri::HTML( open("http://thoms.net.nz/e2.html")).css("body div").count is 1 on Heroku and 2 on my computer. The file containing this character can be downloaded from http://thoms.net.nz/e2.html.
Both my computer and Heroku are running nokogiri 1.5.5 with Ruby 1.9.3.
The ^Q is a software control character (XON), which isn't supposed to be in HTML. I suspect its unexpected presence is confusing both Nokogiri and Heroku, but in different ways.
HTML documents from the wilds of the internet can be corrupted in any number of ways. I've seen all sorts of garbage in them, and if I couldn't make sense of it using iconv or a Unicode transliteration, I'd resort to a quick global search and replace to remove anything not in the normal ASCII range before further processing.
In Ruby, global search and replace uses String#gsub.
doc = Nokogiri::HTML(html.gsub("\u0011", ''))
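As a rough sketch of the broader cleanup described above, the following strips all C0 control characters except tab, newline and carriage return before the markup is handed to a parser. The method name strip_control_chars is made up for illustration, and the sample string merely stands in for the downloaded page.

```ruby
# Hypothetical helper: drop C0 control characters (and DEL) but keep
# tab (\x09), newline (\x0A) and carriage return (\x0D).
CONTROL_CHARS = /[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/

def strip_control_chars(html)
  html.gsub(CONTROL_CHARS, '')
end

# Stand-in for the fetched document: a stray XON (U+0011) between two divs.
sample  = "<div>one</div>\u0011\n<div>zw\u00F6lf</div>"
cleaned = strip_control_chars(sample)
```

Unlike a blanket "ASCII only" filter, this keeps legitimate non-ASCII text such as umlauts. It assumes the input is already a validly encoded string; gsub raises on invalid byte sequences.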

convert ascii characters to ruby encoding

I'm testing a feature with Watir and running into an issue with validating non-ASCII characters in the HTML.
I'm grabbing the product description from a database, e.g. 'Company® Some Product', and using it as the string I validate against.
It shows up that way in the HTML. However, Ruby is looking for Company\u00AE Some Product, so my test is failing.
Does anyone have a solution for handling these special characters when they turn up?
HTML Entities gem may help:
http://htmlentities.rubyforge.org/
http://htmlentities.rubyforge.org/doc/

How to get 'è' (and not 'e') with activerecord and ruby 1.8.7

I am writing a simple script to update a table data.
I am unable to find a record by a field whose value is "Agliè"; the problem is the "è".
c = Comune.find_by_denominazione_italiano_tedesco('Agliè')
I realised that the problem can be patched by using "Aglie", but I need to preserve the accent difference (these are town names, and some are identical except for the accent).
My db character set is UTF-8, the collation is latin1_swedish_ci; however, changing it to utf8_general_ci makes no difference. My ruby script is in utf-8; I tried changing it to latin1 as well, no difference again.
Any suggestion?
Cheers,
Davide
Looks like it was a file encoding problem after all, grr.
Thanks anyway folks.
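A small sketch of why a file-encoding mismatch makes such a lookup miss: the same visible name is a different byte sequence in UTF-8 and in Latin-1, so a literal saved in one encoding does not equal a value stored in the other. This uses the String#encode API from Ruby 1.9 and later, not the 1.8.7 setup from the question, and is only an illustration.

```ruby
name_utf8   = "Agli\u00E8"                    # "Agliè" as UTF-8
name_latin1 = name_utf8.encode('ISO-8859-1')  # same text, Latin-1 bytes

utf8_bytes   = name_utf8.bytes    # è is the two bytes 0xC3 0xA8
latin1_bytes = name_latin1.bytes  # è is the single byte 0xE8

# A byte-wise comparison (which is what an encoding-unaware lookup does) fails.
same_bytes = (utf8_bytes == latin1_bytes)
```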

clean up strange encoding in ruby

I'm currently playing a bit with couchdb.
I'm trying to migrate some blog data from redis (key value store) to couchdb (key value store).
Seeing as I probably migrated this data a gazillion times from and to different blogging engines (everybody has got to have a hobby :) ), there seem to be some encoding snafus.
I'm using CouchREST to access CouchDB from ruby and I'm getting this:
<JSON::GeneratorError: source sequence is illegal/malformed>
The problem seems to be the body_html part of the object:
<Post:0x00000000e9ee18 #body_html="[.....]Wie Sie bereits wissen, m\xF6chte EUserv k\xFCnftig seine [...]
Those are supposed to be Umlauts ("möchte" and "künftig").
Any idea how to get rid of those problems? I tried some conversions using the Ruby 1.9 encoding features or iconv before inserting, but haven't had any luck yet :(
If I try to e.g. convert that stuff to ISO-8859-1 using the .encode() method of ruby 1.9, this is what happens (different text, same problem):
#<Encoding::UndefinedConversionError: "\xC6\x92" from UTF-8 to ISO-8859-1>
"I try to e.g. convert that stuff to ISO-8859-1"
Close. You actually want to do it the other way around: you've got ISO-8859-1(*), you want UTF-8(**). So str.encode('utf-8', 'iso-8859-1') would be more likely to do the trick.
*: actually you might well have Windows code page 1252, which is like ISO-8859-1, but with extra smart-quotes and things in the range 0x80-0x9F which ISO-8859-1 uses for control codes. If so, use 'cp1252' instead.
**: well, you probably do. Working with UTF-8 is the best way forward so you can store all possible characters. If you really want to keep working in ISO-8859-1/cp1252, then presumably the problem is just that Ruby has mis-guessed the character set in use and you can fix it by calling str.force_encoding('iso-8859-1').
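The conversion direction described above can be sketched as follows. The helper name to_utf8 and the sample byte strings are assumptions for illustration; the first sample reuses the bytes shown in the question.

```ruby
# Hypothetical helper: interpret the raw bytes as `source` and transcode
# them to UTF-8. The explicit source encoding overrides whatever encoding
# Ruby has tagged the string with.
def to_utf8(raw, source = 'ISO-8859-1')
  raw.encode('UTF-8', source)
end

latin1_bytes = "m\xF6chte EUserv k\xFCnftig".b  # bytes as shown in the post
cp1252_bytes = "\x93smart quotes\x94".b         # 0x93/0x94 are quotes only in cp1252

german = to_utf8(latin1_bytes)
quoted = to_utf8(cp1252_bytes, 'Windows-1252')
```

If the data is really cp1252, passing 'ISO-8859-1' would not fail, but bytes such as 0x93 would come out as invisible control characters rather than quotes, so it is worth checking a sample containing such bytes before converting everything.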

Resources