Uploaded file char-set conversion with Ruby - ruby

I have an application where our clients upload a CSV file to our server. We then process it and load the data into our database. We're running into issues with character sets, especially when we're dealing with JSON: some characters that were never converted to UTF-8 are breaking IE on JSON responses.
Is there a way to convert the uploaded CSV file to UTF-8 before we start processing it? Is there a way to determine the character encoding of an uploaded file? I've played with iconv a bit, but we're not always sure what encoding the uploaded file will have. Thanks.

This solution might not be ideal, but it should do the job.
First, the ingredients:
chardet (sudo gem install chardet)
fastercsv (sudo gem install fastercsv)
Now the actual code (not tested):
require 'rubygems'
require 'UniversalDetector'
require 'fastercsv'
require 'iconv'
file_to_import = File.open("path/to/your.csv")
# determine the encoding based on the first 100 bytes
chardet = UniversalDetector::chardet(file_to_import.read(100))
if chardet['confidence'] > 0.7
  charset = chardet['encoding']
else
  raise 'You better check this file manually.'
end
# the detection read moved the file position, so go back to the start
file_to_import.rewind
file_to_import.each_line do |l|
  converted_line = Iconv.conv('utf-8', charset, l)
  row = FasterCSV.parse(converted_line)[0]
  # do the business here
end
file_to_import.close
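On newer Rubies, Iconv is no longer part of the standard library (it was removed in Ruby 2.0) and FasterCSV became the standard CSV library in Ruby 1.9, so the same idea can be sketched with String#encode and CSV. This sketch swaps the chardet detection for a simpler technique: keep the bytes if they are already valid UTF-8, otherwise assume a fallback encoding. The helper names to_utf8 and parse_upload and the Windows-1252 fallback are assumptions for illustration, not part of the answer above.

```ruby
require 'csv'

# Hypothetical helper: convert raw uploaded bytes to UTF-8.
# fallback_encoding is an assumption; Windows-1252 is a common guess for
# Excel exports, but it cannot be verified from the bytes alone.
def to_utf8(raw, fallback_encoding = 'Windows-1252')
  utf8 = raw.dup.force_encoding('UTF-8')
  return utf8 if utf8.valid_encoding? # already valid UTF-8, keep as is

  # Relabel the bytes with the assumed encoding, then transcode.
  # Bytes with no mapping are replaced instead of raising.
  raw.dup.force_encoding(fallback_encoding)
     .encode('UTF-8', invalid: :replace, undef: :replace)
end

# Hypothetical helper: parse the converted text into rows.
def parse_upload(raw)
  CSV.parse(to_utf8(raw))
end

# Usage with bytes as they might arrive from an upload (Latin-style encoding).
rows = parse_upload("name,city\nJos\xE9,M\xE1laga\n".b)
p rows
```

The validity check will never misread real UTF-8, but any other single-byte encoding is treated as the fallback, so a detector such as chardet is still the better choice when uploads come from many sources.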

Related

Why is Ruby failing to convert CP-1252 to UTF-8?

I have CSV files saved from Excel, which are encoded as CP-1252/Windows-1252. I tried the following, but the output still comes out corrupted. Why?
csv_text = File.read(arg[:file], encoding: 'cp1252').encode('utf-8')
# csv_text = File.read(arg[:file], encoding: 'cp1252')
csv = CSV.parse csv_text, :headers => true
csv.each do |row|
  # create model
  p model
end
The result
>rake import:csv["../file.csv"] | grep Brien
... name: "Oâ?TBrien ...
However it works in the console
> "O\x92Brien".force_encoding("cp1252").encode("utf-8")
=> "O’Brien"
I can open the CSV file in Notepad++, choose Encoding > Character Sets > Western European > Windows-1252, see the correct characters, and then choose Encoding > Convert to UTF-8. However, there are many files and I want Ruby to handle this.
Similar: How to change the encoding during CSV parsing in Rails. But this doesn't explain why this is failing.
Ruby 2.4, Reference: https://ruby-doc.org/core-2.4.3/IO.html#method-c-read
Wow, it was caused by the shitty grep in DevKit.
>rake import:csv["../file.csv"]
... name: "O’Brien ...
>where grep
C:\DevKit2\bin\grep.exe
I also did not need the .encode('utf-8').
Let that be a lesson kids. Never take anything for granted. Trust no one!
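To rule out the terminal or a pipe when checking a conversion like this, it can help to inspect the code points rather than the printed text. The following is a minimal sketch under stated assumptions: read_cp1252_as_utf8 is a hypothetical helper name, and the sample file is generated on the spot rather than taken from the question.

```ruby
require 'tempfile'

# Hypothetical helper: read a Windows-1252 file and return UTF-8 text.
def read_cp1252_as_utf8(path)
  raw = File.binread(path) # raw bytes, tagged ASCII-8BIT
  raw.force_encoding('Windows-1252').encode('UTF-8')
end

# Build a small sample file containing byte 0x92,
# which is the right single quotation mark in CP-1252.
sample = Tempfile.new(['sample', '.csv'])
sample.binmode
sample.write("name\nO\x92Brien\n".b)
sample.close

text = read_cp1252_as_utf8(sample.path)

# Check the data itself instead of how a console tool renders it.
p text.encoding
p text.codepoints.include?(0x2019) # U+2019 is the character 0x92 maps to
```

If the code points are right but the screen shows garbage, the problem is in whatever displays or filters the output, as it was with grep here.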

How to create, read and transform an XML file with Ruby

I am downloading an XML record from Musicbrainz.org, applying an XSLT transformation and outputting a new and different XML record.
I am running into one issue, and I wonder whether it comes from my approach, from the XSLT transformation, or from how Ruby handles the text.
I download the record:
require 'open-uri'
mb_metadata = open('http://musicbrainz.org/ws/2/release/?query=barcode:744861082927', 'User-Agent' => 'MarcBrainz marc4brainz#gmail.com').read
File.open('mb_record.xml', 'w').write(mb_metadata)
This works fine.
Then I want to transform that record. First I tried using Nokogiri:
# mb_metadata to transformed record
mb_record = Nokogiri::XML(File.read('mb_record.xml'))
#if we have the xslt document locally this introduces it
template = Nokogiri::XSLT(File.read('mb_to_marc.xsl'))
# this transforms the input document with the template.xslt
puts template.transform(mb_record)
If I run this on its own, it works. However, if I download the record and then run this in the same script, it doesn't: it produces a transformed record that contains only some inserts, and no element from the original XML file is transformed.
So I thought this might be an issue with Nokogiri and then I tried using the Ruby/XSLT gem:
xslt = XML::XSLT.new()
xslt.xml = 'mb_record.xml'
xslt.xsl = 'mb_to_marc.xsl'
out = xslt.serve()
print out;
Again, if I'm running this on a local file it works, but if I download it and try to transform it it doesn't work - it produces the following error:
xslt.xml = 'mb_record.xml'
Both methods work fine if I just run them on a file which has been downloaded already.
So what's the issue? Is it a naming problem, an XSLT issue, or something else?
Here's the whole script:
#!/usr/bin/env ruby
# encoding: UTF-8
require 'rubygems' if RUBY_VERSION >= '1.9'
require 'pathname'
require 'httpclient'
require 'xml/xslt'
require 'nokogiri'
require 'open-uri'
# DOWNLOAD RECORD FROM MusicBrainz.org - this works
mb_metadata = open('http://musicbrainz.org/ws/2/release/?query=barcode:744861082927', 'User-Agent' => 'MarcBrainz marc4brainz#gmail.com').read
#puts record
File.open('mb_record.xml', 'w').write(mb_metadata)
# mb_metadata to transformed record - this works on a saved file but not if the file is created earlier in this script.
#
#mb_record = Nokogiri::XML(File.read('mb_record.xml'))
#if we have the xslt document locally this introduces it
#template = Nokogiri::XSLT(File.read('mb_to_marc.xsl'))
# this is supposed to transform the input document with the template.xslt
#puts template.transform(mb_record)
# TRYING ANOTHER TACK
# This works if acting on a saved file, i.e. if I comment out the nokogiri lines above and just run the lines below: 'print out' shows the xml correctly transformed by the xslt to produce more xml.
# I added 'sleep 3' to see if that would help but it doesn't make a difference.
xslt = XML::XSLT.new()
xslt.xml = 'mb_record.xml'
xslt.xsl = 'mb_to_marc.xsl'
out = xslt.serve()
print out;
File.open('mb_record.xml', 'w').write(mb_metadata)
is better written as
File.write('mb_record.xml', mb_metadata)
The first will result in a file that hasn't been closed, and possibly not flushed to the disk, which can mean the file has no contents, or only partial contents.
The second writes the file and immediately flushes and closes it.
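The difference can be sketched as follows. save_record is a hypothetical helper name, and the payload is placeholder XML rather than real MusicBrainz output. The block form of File.open behaves like File.write in that the file is closed before the next statement runs.

```ruby
require 'tmpdir'

# Hypothetical helper: write the data and return only after the file is closed.
def save_record(path, data)
  # Block form: Ruby flushes and closes the file when the block exits,
  # so later code (for example an XSLT step) sees the complete contents.
  File.open(path, 'w') { |f| f.write(data) }
  File.size(path)
end

Dir.mktmpdir do |dir|
  path = File.join(dir, 'mb_record.xml')
  payload = '<metadata><release-list/></metadata>' # placeholder XML
  save_record(path, payload)
  # Reading the file back right away returns everything that was written.
  puts File.read(path) == payload
end
```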

Problems with Ruby encoding in Windows

I wrote a simple script that reads emails from MS Outlook using 'win32ole' and then saves their subjects to a CSV file. Everything works except the encoding. When I open my CSV file, words such as "André" are printed as "Andr\x82". I want my output to match my input.
# encoding: 'CP850'
require 'win32ole'
require 'csv'
Encoding.default_external = 'CP850'
ol = WIN32OLE.new('Outlook.Application')
inbox = ol.GetNamespace("MAPI").GetDefaultFolder(6)
email_subjects = []
inbox.Items.each do |m|
  email_subjects << m.Subject
end
CSV.open('MyFile.csv',"w") do |csv|
  csv << email_subjects
end
O.S: Windows 7 64bit
Encoding.default_external -> CP850
Language -> PT
ruby -v -> 1.9.2p290 (2011-07-09) [i386-mingw32]
It seems to be a simple problem related to the external Windows encoding. I tried many solutions posted here, but I really can't solve this.
1) Your file name is missing a closing quote.
2) The default open mode for CSV.open() is 'rb', so you can't possibly write to a file with the code you posted.
3) You didn't post the encoding of the text you are trying to write to the file.
4) You didn't post the encoding that you want the data to be written in.
5) You wrote:
When I open my CSV file the words such as "é" are printed as "\x82"
Tell your viewing device not to do that.
The magic comment only sets the encoding the current (.rb) file should be read as. It does not set default_external. Try set RUBYOPT=-E utf-8, open your file with CSV.open('MyFile.csv', encoding: 'UTF-8'), or set Encoding.default_external at the top of your file (discouraged).
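As a rough sketch of the explicit-encoding approach: the byte 0x82 is "é" in CP850, so one option is to declare that encoding on each string, transcode to UTF-8, and open the CSV for writing as UTF-8. write_subjects is a hypothetical helper name, and the assumption that the subjects arrive as CP850 bytes is inferred from the "\x82" in the question, not confirmed by it.

```ruby
require 'csv'
require 'tmpdir'

# Hypothetical helper: write one row of subjects to a UTF-8 CSV file.
# source_encoding is an assumption about what the bytes really are.
def write_subjects(path, subjects, source_encoding)
  utf8_subjects = subjects.map do |s|
    # Relabel the bytes with their real encoding, then transcode.
    s.dup.force_encoding(source_encoding).encode('UTF-8')
  end
  # 'w:UTF-8' opens the file for writing with UTF-8 as the external encoding.
  CSV.open(path, 'w:UTF-8') { |csv| csv << utf8_subjects }
end

Dir.mktmpdir do |dir|
  path = File.join(dir, 'MyFile.csv')
  write_subjects(path, ["Andr\x82".b, 'Plain subject'], 'CP850')
  puts File.read(path, encoding: 'UTF-8')
end
```

Whatever opens the file afterwards still has to read it as UTF-8, otherwise the same characters will look wrong again for a different reason.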

Encoding error when saving a document through a rake task on rails

I have a rake task that downloads an XML document over HTTP and writes it to a file. The downloaded XML has a rather nasty encoding: it is 8-bit text, and the XML declares the code page "windows-1254".
require 'net/http'
require 'uri'
require 'date'
url = URI("http://report.paragaranti.com/rasyonet_xml_fund_data.asp")
http = Net::HTTP.new url.host
http.read_timeout = 120
response = http.get url.path
response.error! unless response.instance_of? Net::HTTPOK
filename = "#{Date.today}.xml"
File.open(filename, 'w') {|f| f.write(response.body)}
The code above works without errors when I run it as a simple script. However, when I do the same thing in a rake task through Rails, I get the following exception:
"\xF0" from ASCII-8BIT to UTF-8
It must be something to do with the encoding of the string, but I'm not sure why it happens or why the code has different behavior in a rails environment and outside it.
I managed to resolve the problem by doing:
File.open(filename, 'wb') {|f| f.write(response.body)}
That is, writing the file as binary. Still, an explanation of what's going on here would be much appreciated (especially the part about why it does not work in a rails environment..)
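The thread does not include an explanation, so the following is only a sketch of one likely cause, stated as an assumption: the response body is tagged ASCII-8BIT (binary), and if the environment sets Encoding.default_internal to UTF-8, as Rails is generally understood to do, a text-mode write tries to transcode binary bytes to UTF-8 and fails on any byte above 0x7F. The helper names and the sample bytes below are hypothetical.

```ruby
# Hypothetical helper: does a direct binary-to-UTF-8 conversion succeed?
def converts_directly?(body)
  body.encode('UTF-8')
  true
rescue Encoding::UndefinedConversionError
  # Same family of error as in the question: a binary byte has no UTF-8 mapping.
  false
end

# Hypothetical helper: declare the real source encoding, then transcode.
def from_windows_1254(body)
  body.dup.force_encoding('Windows-1254').encode('UTF-8')
end

# Sample bytes standing in for response.body; 0xF0 is one Turkish letter in Windows-1254.
body = "de\xF0er".b

p converts_directly?(body)      # binary data cannot be transcoded as is
p from_windows_1254(body)       # conversion works once the encoding is declared
```

Writing with 'wb' sidesteps the problem because no transcoding is attempted; the file then keeps the original windows-1254 bytes, which matches what the XML declaration says.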

Ruby: How to determine if file being read is binary or text

I am writing a program in Ruby which will search for strings in text files within a directory - similar to Grep.
I don't want it to attempt to search in binary files but I can't find a way in Ruby to determine whether a file is binary or text.
The program needs to work on both Windows and Linux.
If anyone could point me in the right direction that would be great.
Thanks,
Xanthalas
libmagic is a library which detects file types. For this solution I assume that all MIME types which start with text/ represent text files. Everything else is a binary file. This assumption is not correct for all MIME types (e.g. application/x-latex, application/json), but libmagic detects these as text/plain.
require "filemagic"
def binary?(filename)
  fm = FileMagic.new(FileMagic::MAGIC_MIME)
  # anything whose MIME type does not start with text/ counts as binary
  !(fm.file(filename) =~ /^text\//)
ensure
  fm.close if fm
end
gem install ptools
require 'ptools'
File.binary?(file)
An alternative to using the ruby-filemagic gem is to rely on the file command that ships with most Unix-like operating systems. I believe it uses the same libmagic library under the hood but you don't need the development files required to compile the ruby-filemagic gem. This is helpful if you're in an environment where it's a bit of work to install additional libraries (e.g. Heroku).
According to man file, text files will usually contain the word text in their description:
$ file Gemfile
Gemfile: ASCII text
You can run the file command through Ruby and capture the output:
require "open3"
def text_file?(filename)
  file_type, status = Open3.capture2e("file", filename)
  status.success? && file_type.include?("text")
end
An update to the answer above, for the case where the file name itself includes "text":
$ file /tmp/ball-texture.png
/tmp/ball-texture.png: PNG image data, 11 x 18, 8-bit/color RGBA, non-interlaced
So the updated code looks like this:
def text_file?(filename)
  file_type, status = Open3.capture2e('file', filename)
  status.success? && file_type.split(':').last.include?('text')
end
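Since the question asks for something that works on both Windows and Linux, where neither libmagic nor the file command is guaranteed to be present, a pure-Ruby heuristic is another option. This is a different and cruder technique than the answers above: it samples the start of the file and looks for NUL bytes or a high share of control characters. The name probably_binary?, the 4096-byte sample and the 30% threshold are all assumptions chosen for illustration.

```ruby
require 'tmpdir'

# Heuristic sketch (not libmagic): guess whether a file is binary
# by inspecting its first sample_size bytes.
def probably_binary?(filename, sample_size = 4096)
  sample = File.binread(filename, sample_size).to_s
  return false if sample.empty?             # treat empty files as text
  return true if sample.include?("\x00".b)  # NUL bytes are rare in text files

  # Count control bytes other than tab, line feed and carriage return.
  control = sample.each_byte.count { |b| b < 32 && ![9, 10, 13].include?(b) }
  control.fdiv(sample.bytesize) > 0.3
end

Dir.mktmpdir do |dir|
  text_path = File.join(dir, 'notes.txt')
  png_path  = File.join(dir, 'ball-texture.png')
  File.binwrite(text_path, "hello\nworld\n")
  File.binwrite(png_path, "\x89PNG\r\n\x1A\n\x00\x00\x00\rIHDR".b) # start of a PNG header
  p probably_binary?(text_path)
  p probably_binary?(png_path)
end
```

Text stored as UTF-16 contains NUL bytes and would be reported as binary by this check, so it suits grep-like tools that only care about byte-oriented text.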

Resources