Ruby CSV parsing from Excel with multilingual document - ruby

I have a CSV document exported from Excel that contains both English and non-English (Russian) letters.
I've managed to open it with
CSV.open #tmp, "rb:ISO-8859-1", {col_sep: ";"}
but it reads the Russian symbols as \xCE\xF1\xF2\xE0\xEB\xFC\xED\xFB\xE5 \xE7\xE0\xEF\xF7 etc.
I've tried "rb:ISO-8859-1:UTF-8" but I get "ArgumentError: invalid byte sequence in UTF-8", the same as when CSV.open is run without a mode.
How can this be fixed? Also, where can I find the options for the 'mode' argument? I couldn't work out from the docs where it is described.
The main environment is an Ubuntu server, if it matters.

Try using this mode string:
r:ISO-8859-15:UTF-8
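
One thing worth checking: the escaped bytes quoted in the question decode to readable Russian when treated as Windows-1251, a common Excel code page for Cyrillic text. The sketch below assumes that is the file's real encoding (the question does not confirm it). The helper name read_cp1251_csv is hypothetical, and a temporary file stands in for the Excel export.

```ruby
require "csv"
require "tempfile"

# Hypothetical helper: read a semicolon-separated file that is ASSUMED to be
# Windows-1251 and return its rows transcoded to UTF-8.
def read_cp1251_csv(path)
  CSV.read(path, encoding: "Windows-1251:UTF-8", col_sep: ";")
end

# The first bytes quoted in the question, followed by an ASCII column.
sample = "\xCE\xF1\xF2\xE0\xEB\xFC\xED\xFB\xE5;abc\n".b

Tempfile.create(["export", ".csv"]) do |file|
  file.binmode
  file.write(sample)
  file.flush
  rows = read_cp1251_csv(file.path)
  puts rows.first.inspect
end
```

The "external:internal" pair in the encoding string works the same way as in a File.open mode such as "r:Windows-1251:UTF-8": the first name says how the bytes on disk are encoded, the second is what Ruby hands back.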

Related

Read a CSV file with special characters in Ruby and store into SQL Server

I'm trying to import a CSV file (UTF-8 encoding) in Ruby (2.0.0) into my database (MSSQL 2008R2, collation French_CI_AS), but the special characters (French accents on vowels) are not stored properly: éèçôü becomes éèçôü (or other similar gibberish).
I use this piece of code to read the file :
CSV.foreach(file, col_sep: ';', encoding: "utf-8") do |row|
# ...
end
I tried various encodings in the CSV options (utf-8, iso-8859-1, windows-1252), but none stored the special characters correctly.
Before you ask, my database collation supports those characters, since we have successfully imported data containing them using PHP importers. If I dump the data using puts or a file logger, everything is correct.
Is something wrong with my code, or do I need to specify something else (like the Ruby class file encoding, for example)?
Thanks
EDIT: The data saving is done by a PHP REST API that works fine with accented characters. It stores data as it is received.
In Ruby, I parse my data, store it in an object and then send the JSON-encoded object in the body of my PUT request. But if I use an SQL query directly from Ruby, the problem remains:
query = <<-SQL
UPDATE MyTable SET MyTable_title = '#{row_data['title']}' WHERE MyTable_id = '#{row_data['id']}'
SQL
res = db.execute query
I was thinking that this had something to do with the encoding of your CSV file, so I started digging around on that. I did find that Windows-1252 encoding will insert control characters.
You can read more about it here: Converting special charactes such as ü and à back to their original, latin alphbet counterparts in C#
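
Since puts shows the data correctly, one way to narrow this down is to inspect what Ruby actually holds just before it is sent. This is a minimal sketch, assuming the symptom comes from UTF-8 bytes being reinterpreted as Windows-1252 somewhere between Ruby and the database (the question does not confirm this). The helper names describe and misread_as_cp1252 are hypothetical.

```ruby
# Hypothetical helper: summarize how Ruby sees a string.
def describe(str)
  { encoding: str.encoding.name, valid: str.valid_encoding?, bytes: str.bytesize }
end

# Hypothetical helper: show what UTF-8 bytes look like when they are wrongly
# treated as Windows-1252, one classic source of garbled accents.
def misread_as_cp1252(str)
  str.dup.force_encoding("Windows-1252").encode("UTF-8")
end

title = "é"
p describe(title)
puts misread_as_cp1252(title)
```

If the stored value matches what misread_as_cp1252 prints, the string is fine inside Ruby and the misinterpretation is more likely happening in the driver or connection settings.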

The simplest way to puts sterling-pound in ruby from a yaml file

I have a YAML file with a pound sterling sign in it:
amount: "£50"
When I access the value it returns the following:
"£50"
I am using Hashie::Mash to load and access my YAML. Ideas are welcome; I haven't found anything on the web that gives a straightforward solution (or at least one that works for me).
The external encoding is your issue: Ruby is assuming that any data read from external files is CP-850 rather than UTF-8.
You can solve this in a few ways:
Set Encoding.default_external = 'utf-8'. This tells Ruby to read files as UTF-8 by default.
Explicitly read your file as UTF-8, via open('file.yml', 'r:utf-8').
Convert your string to UTF-8 before you pass it to your YAML parser. You can do this via String#force_encoding, which tells Ruby to reinterpret the raw bytes with a different encoding:
require 'yaml'
# Read the raw bytes, then relabel them as UTF-8 without changing them.
text = File.read("file.yml")
text.force_encoding("utf-8")
YAML.load text
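
The second option (reading the file explicitly as UTF-8) can be sketched as follows. The helper name load_utf8_yaml is hypothetical, the temporary file stands in for the real YAML file, and YAML.safe_load is used here as a cautious substitute for YAML.load.

```ruby
require "yaml"
require "tempfile"

# Hypothetical helper: read a YAML file as UTF-8 regardless of the
# platform's default external encoding.
def load_utf8_yaml(path)
  YAML.safe_load(File.read(path, encoding: "UTF-8"))
end

Tempfile.create(["amount", ".yml"]) do |file|
  file.binmode
  file.write(%(amount: "£50"\n))
  file.flush
  config = load_utf8_yaml(file.path)
  puts config["amount"]
end
```

Passing the encoding at read time keeps the fix local to this one file, whereas changing Encoding.default_external affects every read in the process.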

Extended charset characters not recognized and converted to a ? mark

I have a string containing a special character such as "\u2012", i.e. FIGURE DASH. When I try to print it on the console I get a '?' mark instead of the symbol. I have an editor in which I can insert the symbol using Alt+numpad, e.g. Alt+2012. In the editor I can see the symbol, but when I save it in an XML file and read the value back using nodevalue, I get a '?' mark.
To summarize, I am having trouble reading the Latin Extended-A charset. What I need is that when I insert such symbols and read them back, I get something like &#xXXXX;.
Please help!
TIA :)
Put simply, I have a String inpath = "À"; and I want to get its Unicode value, like &#xXXXX;
The default console encoding in Windows is an MS-DOS code page, and those code pages don't support the character. You can try running chcp 65001 before running the program, but you might also need to change the console font.
You don't need to do anything you wouldn't do with any other character, as long as you use UTF-8. You aren't doing that in many places. You need to explicitly write in your code to save and read the file in UTF-8, and not rely on the platform default encoding.
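
For the &#xXXXX; form the question asks about, the conversion itself is a lookup of each character's code point. The question appears to be about Java; the sketch below is in Ruby, purely to illustrate the idea, and the helper name to_numeric_refs is hypothetical. It assumes the input string is already correctly decoded as UTF-8.

```ruby
# Hypothetical helper: replace every non-ASCII character with a hexadecimal
# numeric character reference such as &#x00C0;.
def to_numeric_refs(text)
  text.each_char.map { |ch|
    ch.ascii_only? ? ch : format("&#x%04X;", ch.ord)
  }.join
end

puts to_numeric_refs("À")
puts to_numeric_refs("a\u2012b")
```

This only gives the right references if the character survived reading intact. Once it has been replaced by '?', the original code point is gone, so the file still has to be read and written as UTF-8.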

How to reformat CSV file to match proper CSV format

I have a web application that parses CSV files uploaded by users.
Some users upload CSV files that don't match the proper CSV format mentioned here
For example:
abc,hello mahmoud,this is" description, bad
This should be
abc,hello mahmoud,"this is"" description", bad
When I use the Ruby FasterCSV library to parse the invalid CSV, it fails. However, it succeeds when I open the file in Excel or OpenOffice.
Is there a Ruby library that can reformat the CSV text into a proper format?
From the docs:
What you don't want to do is feed FasterCSV invalid CSV. Because of
the way the CSV format works, it's common for a parser to need to read
until the end of the file to be sure a field is invalid. This eats a
lot of time and memory.
Luckily, when working with invalid CSV, Ruby's built-in methods will
almost always be superior in every way. For example, parsing
non-quoted fields is as easy as:
data.split(",")
This would give you an array. If you really want valid CSV (e.g. because you rescued the MalformedCSVError) then there is... FasterCSV!
require 'csv'
str= %q{abc,hello mahmoud,this is" description, bad}
puts str.split(',').to_csv
#=> abc,hello mahmoud,"this is"" description", bad
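
The "rescue, then rebuild" idea mentioned above can be sketched like this, using the csv standard library. The helper name repair_csv_line is hypothetical, and the fallback assumes the broken line contains no commas that are meant to be part of a field.

```ruby
require "csv"

# Hypothetical helper: return the line untouched when it already parses,
# otherwise split on commas and let CSV re-quote each field.
def repair_csv_line(line)
  CSV.parse_line(line)
  line
rescue CSV::MalformedCSVError
  line.chomp.split(",").to_csv
end

bad = %q{abc,hello mahmoud,this is" description, bad}
puts repair_csv_line(bad)
```

Only lines that fail to parse are rewritten, so valid rows with quoted commas pass through unchanged. A malformed row that also contains a comma inside a field would still be split in the wrong place.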

InstallShield 2011 error 7185 importing Japanese strings in the string table of basic MSI project

I am trying to import Japanese strings into my "Basic MSI" project. It used to work without any issues, but now when I try to import some Japanese strings from a text file it throws the following error (I have changed some of the personal data in the error message):
ISDEV : error -7185: The Japanese: 日本語 translation for string identifier IDS_XXXX_1111 includes characters that are not available on code page 932.
I think some of the characters in IDS_XXXX_1111 are not part of code page 932. How can I detect those characters using some tool?
The documentation also mentions changing some encoding settings to UTF-8 in InstallShield 2011; if you know how, please guide me.
Thanks in advance
Rahul
My favorite way to detect such characters is with Python. For example, reading a file like the InstallShield string tables in Python (2.x or 3.x):
import codecs

# Open the string table as UTF-16 text.
strings = codecs.open("strings.txt", "r", "UTF-16")
for line in strings.readlines():
    line = line.strip()
    try:
        line.encode("cp932")
    except UnicodeError:
        # Show the line with unencodable characters replaced by '?'.
        print("Can't encode: " + line.encode("cp932", "replace").decode("cp932"))
strings.close()
Your alternatives are to pinpoint the characters that cannot be represented on the relevant code page and replace them with ones that can, or to go to the Releases view and select yes for the Build UTF-8 Database setting.
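
To pinpoint the individual characters rather than whole lines, the same check can be applied one character at a time. The sketch below is in Ruby, where code page 932 is available under the encoding name Windows-31J. The helper name chars_outside_cp932 is hypothetical, and the input is assumed to be a UTF-8 string already read from the string table.

```ruby
# Hypothetical helper: list the distinct characters in a string that cannot
# be converted to code page 932 (named Windows-31J in Ruby).
def chars_outside_cp932(text)
  text.each_char.reject { |ch|
    begin
      ch.encode("Windows-31J")
      true
    rescue Encoding::UndefinedConversionError
      false
    end
  }.uniq
end

# The Hangul syllable is used here only as an example of a character
# that code page 932 does not contain.
p chars_outside_cp932("日本語한")
```

Each character returned is a candidate for replacement with an equivalent that code page 932 does contain.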
