From the libxml-ruby API docs (http://libxml.rubyforge.org/rdoc/index.html), under LibXML::XML::Document, I tried the following:
filename = 'something.xml'
stats_doc = XML::Document.new()
stats_doc.root = XML::Node.new('root_node')
stats_doc.root << XML::Node.new('elem1')
stats_doc.save(filename, :indent => true, :encoding => 'utf-8')
... which resulted in this error:
parse_stats.rb:26:in `save': can't convert String into Integer (TypeError)
(where the last line in the block above was line 26).
I tried changing the file name to an integer, which gave me this:
parse_stats.rb:26:in `save': wrong argument type Fixnum (expected String) (TypeError)
So I gathered that I need to use a string, but strings don't seem to work. Since I can't find any examples of libxml-ruby in action via Google, I'm at a loss. Any help would be much appreciated, as would links to any online example showing how libxml-ruby is used to create XML documents.
libxml-ruby 1.1.3
rubygems 1.3.1
ruby 1.8.7
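The problem seems to be the :encoding option: it apparently expects one of the XML::Encoding constants rather than a string, which would explain the "can't convert String into Integer" error. Try XML::Encoding::UTF_8 instead of "utf-8".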
require 'rubygems'
require 'libxml'
filename = 'something.xml'
stats_doc = LibXML::XML::Document.new()
stats_doc.root = LibXML::XML::Node.new('root_node')
stats_doc.root << LibXML::XML::Node.new('elem1')
stats_doc.save(filename, :indent => true, :encoding => LibXML::XML::Encoding::UTF_8)
I have the following header:
From: =?iso-8859-1?Q?Marta_Falc=E3o?= <marta.falcao#example.com.br>
I can easily split out the stuff before the <, which leaves me with
"=?iso-8859-1?Q?Marta_Falc=E3o?="
What can I use to turn this into "Marta Falcão"?
Using the newer Mail gem:
Mail::Encodings.value_decode(str) or
Mail::Encodings.unquote_and_convert_to(str, to_encoding)
Thanks to Roland Illig for his comment, which led me to two options:
install rfc2047-ruby and call Rfc2047.decode(header)
install TMail and call TMail::Unquoter.unquote_and_convert_to(header, 'utf-8') or better yet TMail::Address.parse(header).friendly, the latter of which strips out the <email address> part
Implementing RFC 2047 in Ruby isn't hard:
module Rfc2047
  TOKEN = /[\041\043-\047\052\053\055\060-\071\101-\132\134\136\137\141-\176]+/.freeze
  ENCODED_TEXT = /[\041-\076\100-\176]+/.freeze
  ENCODED_WORD = /=\?(?<charset>#{TOKEN})\?(?<encoding>[QB])\?(?<encoded_text>#{ENCODED_TEXT})\?=/i.freeze
  class << self
    # Encode a string as a single base64 ("B") encoded word.
    def encode(input)
      "=?#{input.encoding}?B?#{[input].pack('m0')}?="
    end
    # Decode the first encoded word found in the input.
    def decode(input)
      match_data = ENCODED_WORD.match(input)
      raise ArgumentError if match_data.nil?
      charset, encoding, encoded_text = match_data.captures
      # In Q encoding, "_" stands for a space, so translate it before decoding.
      decoded =
        case encoding
        when 'Q', 'q' then encoded_text.tr('_', ' ').unpack1('M')
        when 'B', 'b' then encoded_text.unpack1('m')
        end
      # The result is tagged with the original charset; call encode('UTF-8') to transcode.
      decoded.force_encoding(charset)
    end
  end
end
Rfc2047.decode('=?iso-8859-1?Q?Marta_Falc=E3o?=').encode('UTF-8') # => "Marta Falcão"
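If you want to decode a whole header line that may contain more than one encoded word, and get UTF-8 back, something along these lines should work. It is only a minimal sketch using the standard library: decode_header and ENCODED_WORD_RE are names made up for this example, the pattern is looser than the RFC grammar, and it does not drop the whitespace between adjacent encoded words as RFC 2047 asks.

```ruby
# Loose pattern for =?charset?Q-or-B?text?= (an assumption, not the full RFC grammar).
ENCODED_WORD_RE = /=\?([^?\s]+)\?([QqBb])\?([^?\s]*)\?=/

# Replace every encoded word in the header with its UTF-8 text.
def decode_header(header)
  header.gsub(ENCODED_WORD_RE) do
    charset  = Regexp.last_match(1)
    encoding = Regexp.last_match(2)
    text     = Regexp.last_match(3)
    bytes =
      if encoding.upcase == 'Q'
        # In Q encoding an underscore stands for a space.
        text.tr('_', ' ').unpack1('M')
      else
        text.unpack1('m')
      end
    # Tag the raw bytes with the declared charset, then transcode.
    bytes.force_encoding(charset).encode('UTF-8')
  end
end

puts decode_header('From: =?iso-8859-1?Q?Marta_Falc=E3o?= <marta.falcao#example.com.br>')
```

Text outside the encoded words is passed through untouched, so the address part of the header survives as-is.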
Update
mikel/mail currently has an encoding issue and might not decode the string correctly.
If that really bothers you, you can try new_rfc_2047:
$ gem install new_rfc_2047
$ ruby -rrfc_2047 -e 'puts Rfc2047.decode "From: =?iso-8859-1?Q?Marta_Falc=E3o?= <marta.falcao#example.com.br>"'
From: Marta Falcão <marta.falcao#example.com.br>
Since the source code of mikel/mail is a little too complicated for me to do the modification, I just made my own gem for this.
Gem source is here: https://github.com/tonytonyjan/rfc_2047/
Is there a way to read files encoded in UTF-8 with BOM (Byte order marks) on Ruby v2.5.0?
On Ruby 2.3.1 this used to work:
csv = CSV.open(file_path, encoding: 'bom|utf-8')
However, on 2.5.0 the following error occurs:
ArgumentError:
unknown encoding name - bom|utf-8
You can try this as well:
File.open(file_path, "r:bom|utf-8")
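To see what the BOM prefix in the mode string does, here is a small self-contained check. It is a sketch, not taken from the original post: read_without_bom and the temp file are made up for the demo, and the mode is written in the documented uppercase form "r:BOM|UTF-8".

```ruby
require 'tempfile'

# Read a UTF-8 file, letting Ruby skip a leading byte order mark if present.
def read_without_bom(path)
  File.open(path, 'r:BOM|UTF-8') { |f| f.read }
end

# Build a tiny file that starts with the UTF-8 BOM bytes EF BB BF.
demo = Tempfile.new(['bom_demo', '.csv'])
demo.binmode
demo.write("\xEF\xBB\xBFname,city\n".b)
demo.close

plain = File.read(demo.path, encoding: 'UTF-8') # BOM stays as the first character
clean = read_without_bom(demo.path)             # BOM is skipped

puts plain.start_with?("\uFEFF")
puts clean.start_with?('name')
```

The string returned this way can then be handed to whatever CSV parsing you use, without the stray U+FEFF glued to the first header.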
You can try this:
require 'file_with_bom'
File.open(file_path, "w:utf-8", :bom => true ) do |csv|
end
It works well.
I'm working with a file that appears to have UTF-16LE encoding. If I run
File.read(file, :encoding => 'utf-16le')
the first line of the file is:
"<U+FEFF>=\"25/09/2013\"\t18:39:17\t=\"Unknown\"\t=\"+15168608203\"\t\"Message.\"\r\n
If I read the file using something like
csv_text = File.read(file, :encoding => 'utf-16le')
I get an error stating
ASCII incompatible encoding needs binmode (ArgumentError)
If I switch the encoding in the above to
csv_text = File.read(file, :encoding => 'utf-8')
I make it to the SmarterCSV section of the code, but get an error that states
`=~': invalid byte sequence in UTF-8 (ArgumentError)
The full code is below. If I run this in the Rails console, it works just fine, but if I run it using ruby test.rb, it gives me the first error:
require 'smarter_csv'
headers = ["date_of_message", "timestamp_of_message", "sender", "phone_number", "message"]
path = '/path/'
Dir.glob("#{path}*.CSV").each do |file|
csv_text = File.read(file, :encoding => 'utf-16le')
File.open('/tmp/tmp_file', 'w') { |tmp_file| tmp_file.write(csv_text) }
puts 'made it here'
SmarterCSV.process('/tmp/tmp_file', {
:col_sep => "\t",
:force_simple_split => true,
:headers_in_file => false,
:user_provided_headers => headers
}).each do |row|
converted_row = {}
converted_row[:date_of_message] = row[:date_of_message][2..-2].to_date
converted_row[:timestamp] = row[:timestamp]
converted_row[:sender] = row[:sender][2..-2]
converted_row[:phone_number] = row[:phone_number][2..-2]
converted_row[:message] = row[:message][1..-2]
converted_row[:room] = file.gsub(path, '')
end
end
Update - 05/13/15
Ultimately, I decided to encode the file string as UTF-8 rather than diving deeper into the SmarterCSV code. The first problem in the SmarterCSV code is that it does not allow a user to specify binary mode when reading in a file, but after adjusting the source to handle that, a myriad of other encoding-related issues popped up, many of which related to the handling of various parameters on files that were not UTF-8 encoded. It may have been the easy way out, but encoding everything as UTF-8 before feeding it into SmarterCSV solved my issue.
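For anyone landing here, the conversion step can look roughly like this. It is a sketch of one way to do it with the standard library only, not the exact code I ended up with: utf16le_file_to_utf8 is a name made up for this example, and it assumes the whole file fits in memory and really is UTF-16LE.

```ruby
require 'tempfile'

# Read the raw bytes, label them as UTF-16LE, and transcode to UTF-8.
def utf16le_file_to_utf8(path)
  raw  = File.binread(path)                            # bytes, no newline or encoding conversion
  text = raw.force_encoding('UTF-16LE').encode('UTF-8')
  text.delete_prefix("\uFEFF")                         # drop the BOM if the file starts with one
end

# Demo input: a UTF-16LE file with a BOM and one tab-separated line.
sample = Tempfile.new('utf16_demo')
sample.binmode
sample.write("\uFEFFa\tb\r\n".encode('UTF-16LE'))
sample.close

puts utf16le_file_to_utf8(sample.path).inspect
```

The returned string can be written to the temporary file and processed with the same SmarterCSV options as in the question.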
Add binmode to the File.read call.
File.read(file, :encoding => 'utf-16le', mode: "rb")
"b" Binary file mode
Suppresses EOL <-> CRLF conversion on Windows. And
sets external encoding to ASCII-8BIT unless explicitly
specified.
ref: http://ruby-doc.org/core-2.0.0/IO.html#method-c-read
Now pass the correct encoding to SmarterCSV
SmarterCSV.process('/tmp/tmp_file', {
:file_encoding => "utf-16le", ...
Update
It turned out that SmarterCSV does not support binary mode. After the OP tried to modify the code without success, the simple solution was to convert the input to UTF-8, which SmarterCSV supports.
Unfortunately, you're using a 'flat-file' style of storage and character encoding is going to be an issue on both ends (reading or writing).
I would suggest using something along the lines of str = str.force_encoding("UTF-8") and seeing if you can get that to work.
I am working through one of the examples on the mechanize doc site and I want to parse the results using nokogiri.
My problem is that when the following line gets executed:
doc = Nokogiri::HTML(search_results, 'UTF-8' )
the following error occurs:
C:/Ruby192/lib/ruby/gems/1.9.1/gems/nokogiri-1.4.4.1-x86-mingw32/lib/nokogiri/html/document.rb:71:in `parse': undefined method `name' for "UTF-8":String (NoMethodError)
from C:/Ruby192/lib/ruby/gems/1.9.1/gems/nokogiri-1.4.4.1-x86-mingw32/lib/nokogiri/html.rb:13:in `HTML'
from mechanize_test.rb:16:in `<main>'
I have installed Ruby 1.9 on a Windows Vista machine.
The results returned by mechanize are non-Latin (UTF-8).
The code sample follows.
# encoding: UTF-8
require 'rubygems'
require 'mechanize'
require 'nokogiri'
agent = Mechanize.new
agent.user_agent_alias = 'Mac Safari'
page = agent.get("http://www.google.com/")
search_form = page.form_with(:name => "f")
search_form.field_with(:name => "q").value = "invitations"
search_results = agent.submit(search_form)
puts search_results.body
doc = Nokogiri::HTML(search_results, 'UTF-8')
#Douglas Drouillard
Thanks for looking into this. I found out I made a mistake. The call to nokogiri should have been:
doc = Nokogiri::HTML(search_results.body, 'UTF-8')
Note that search_results is different from search_results.body.
search_results is the page object that mechanize returns,
while search_results.body contains the UTF-8 HTML that nokogiri can parse with no problem.
This appears to be an issue with what Nokogiri expects as parameters to the parse method that is being called. The first issue I see is that you are passing the encoding option in the wrong parameter slot.
A parsing example from Nokogiri project page that specifies encoding
Nokogiri.XML('<foo><bar /><foo>', nil, 'EUC-JP')
Notice the encoding is the third parameter, not the second. But that still does not fully explain the behavior you are seeing, as the encoding should simply be ignored.
Per the Nokogiri documentation a call to Nokogiri::HTML() is a convenience method for the parse method.
Code for Nokogiri::HTML::parse
def parse thing, url = nil, encoding = nil, options = XML::ParseOptions::DEFAULT_HTML, &block
Document.parse(thing, url, encoding, options, &block)
end
The source for the Nokogiri::HTML::Document parse method is a bit long, but here is the relevant part though:
if string_or_io.respond_to?(:encoding)
unless string_or_io.encoding.name == "ASCII-8BIT"
encoding ||= string_or_io.encoding.name
end
end
Notice string_or_io.encoding.name; this matches the error you saw, undefined method 'name' for "UTF-8":String (NoMethodError).
Does your search_results object have an attribute with a key value pair of {:encoding => 'UTF-8'}? It appears Nokogiri expects encoding to return an object that in turn has a name attribute of 'UTF-8'.
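The failure can be reproduced without Nokogiri or mechanize at all. The sketch below is only an illustration under assumptions: FakePage is a hypothetical stand-in for a page object whose encoding method returns a plain String, and detect_encoding is a made-up helper that copies the check quoted above.

```ruby
# Hypothetical stand-in for a page object: #encoding returns a String,
# not an Encoding object.
class FakePage
  def encoding
    'UTF-8'
  end

  def body
    '<html><body>ok</body></html>'
  end
end

# Same shape as the check in the quoted parse code.
def detect_encoding(string_or_io, encoding = nil)
  if string_or_io.respond_to?(:encoding)
    unless string_or_io.encoding.name == 'ASCII-8BIT'
      encoding ||= string_or_io.encoding.name
    end
  end
  encoding
end

page = FakePage.new

begin
  detect_encoding(page)
rescue NoMethodError => e
  # A String has no #name method, so the page object trips the check.
  puts "page object: #{e.class}"
end

# A real String answers #encoding with an Encoding object, so this works.
puts "page body: #{detect_encoding(page.body)}"
```

That would be consistent with the fix of passing search_results.body instead of search_results.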
I'm trying to read this ATOM Feed (http://ffffound.com/feed), but I'm unable to get to any of the values which are defined as part of the namespace e.g. media:content and media:thumbnail.
Do I need to make the parser aware of the namespaces?
Here's what I've got:
require 'rss/2.0'
require 'open-uri'
source = "http://ffffound.com/feed"
content = ""
open(source) do |s| content = s.read end
rss = RSS::Parser.parse(content, false)
I believe you would have to use libxml-ruby for that.
gem 'libxml-ruby', '>= 0.8.3'
require 'xml'
require 'open-uri'
# XML::Parser.string expects a String, so read the response body first
xml = open("http://ffffound.com/feed").read
parser = XML::Parser.string(xml, :options => XML::Parser::Options::RECOVER)
doc = parser.parse
# RSS entries are <item> elements inside <channel>
doc.find("channel").first.find("item").each do |item|
  puts item.find("media:content").first
  # and just guessing you want that url thingy
  puts item.find("media:content").first.attributes.get_attribute("url").value
end
I hope that points you in the right direction.