How to create, read and transform an XML file with Ruby

I am downloading an XML record from Musicbrainz.org, applying an XSLT transformation and outputting a new and different XML record.
I am running into one issue, and I wonder whether it is a limitation of my approach, of XSLT transformations, or of applying Ruby to text.
I download the record:
require 'open-uri'
mb_metadata = open('http://musicbrainz.org/ws/2/release/?query=barcode:744861082927', 'User-Agent' => 'MarcBrainz marc4brainz#gmail.com').read
File.open('mb_record.xml', 'w').write(mb_metadata)
This works fine.
Then I want to transform that record. First I tried using Nokogiri:
# mb_metadata to transformed record
mb_record = Nokogiri::XML(File.read('mb_record.xml'))
#if we have the xslt document locally this introduces it
template = Nokogiri::XSLT(File.read('mb_to_marc.xsl'))
# this transforms the input document with the template.xslt
puts template.transform(mb_record)
If I run this on its own it works. However, if I download the record and then run it in the same script, it doesn't: the transformed record contains only some inserts, and no element from the original XML file is transformed.
So I thought this might be an issue with Nokogiri, and I then tried the Ruby/XSLT gem:
xslt = XML::XSLT.new()
xslt.xml = 'mb_record.xml'
xslt.xsl = 'mb_to_marc.xsl'
out = xslt.serve()
print out;
Again, if I'm running this on a local file it works, but if I download it and then try to transform it, it doesn't work - it produces the following error:
xslt.xml = 'mb_record.xml'
Both methods work fine if I just run them on a file which has been downloaded already.
So what's the issue? Is it a naming problem, an XSLT issue, or something else?
Here's the whole script:
#!/usr/bin/env ruby
# encoding: UTF-8
require 'rubygems' if RUBY_VERSION >= '1.9'
require 'pathname'
require 'httpclient'
require 'xml/xslt'
require 'nokogiri'
require 'open-uri'
# DOWNLOAD RECORD FROM MusicBrainz.org - this works
mb_metadata = open('http://musicbrainz.org/ws/2/release/?query=barcode:744861082927', 'User-Agent' => 'MarcBrainz marc4brainz#gmail.com').read
#puts record
File.open('mb_record.xml', 'w').write(mb_metadata)
# mb_metadata to transformed record - this works on a saved file but not if the file was created earlier in this script.
#
#mb_record = Nokogiri::XML(File.read('mb_record.xml'))
#if we have the xslt document locally this introduces it
#template = Nokogiri::XSLT(File.read('mb_to_marc.xsl'))
# this is supposed to transform the input document with the template.xslt
#puts template.transform(mb_record)
# TRYING ANOTHER TACK
# This works if acting on a saved file, i.e. if I comment out the Nokogiri lines above and just run the lines below: 'print out' shows the XML correctly transformed by the XSLT into more XML.
# I added 'sleep 3' to see if that would help but it doesn't make a difference.
xslt = XML::XSLT.new()
xslt.xml = 'mb_record.xml'
xslt.xsl = 'mb_to_marc.xsl'
out = xslt.serve()
print out;

File.open('mb_record.xml', 'w').write(mb_metadata)
is better written as
File.write('mb_record.xml', mb_metadata)
The first will result in a file that hasn't been closed, and possibly not flushed to the disk, which can mean the file has no contents, or only partial contents.
The second writes the file and immediately flushes and closes it.
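Applied to the script above, a minimal sketch of the corrected flow (same URL, file names and stylesheet; on current Rubies open-uri exposes URI.open instead of the bare open used in the question) might look like this:
require 'open-uri'
require 'nokogiri'

url = 'http://musicbrainz.org/ws/2/release/?query=barcode:744861082927'
mb_metadata = URI.open(url, 'User-Agent' => 'MarcBrainz marc4brainz#gmail.com').read

# File.write opens, writes, flushes and closes in one step, so mb_record.xml
# is complete on disk before anything tries to read it back
File.write('mb_record.xml', mb_metadata)

mb_record = Nokogiri::XML(File.read('mb_record.xml'))
template  = Nokogiri::XSLT(File.read('mb_to_marc.xsl'))
puts template.transform(mb_record)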

Related

How to replace the first few bytes of a file in Ruby without opening the whole file?

I have a 30MB XML file that contains some gibberish in the beginning, and so typically I have to remove that in order for Nokogiri to be able to parse the XML document properly.
Here's what I currently have:
contents = File.open(file_path).read
if contents[0..123].include? 'authenticate_response'
  fixed_contents = File.open(file_path).read[123..-1]
  File.open(file_path, 'w') { |f| f.write(fixed_contents) }
end
However, this actually causes the ruby script to open up the large XML file twice. Once to read the first 123 characters, and another time to read everything but the first 123 characters.
To solve the first issue, I was able to accomplish this:
contents = File.open(file_path).read(123)
However, now I need to remove these characters from the file without reading the entire file. How can I "trim" the beginning of this file without having to open the entire thing in memory?
You can open the file once, then read and check the "garbage", and finally pass the opened file directly to Nokogiri for parsing. That way, you only need to read the file once and don't need to write it at all.
File.open(file_path) do |xml_file|
  if xml_file.read(123).include? 'authenticate_response'
    # header found, nothing to do
  else
    # no header found. We rewind and let Nokogiri parse the whole file
    xml_file.rewind
  end

  xml = Nokogiri::XML.parse(xml_file)
  # Now do whatever you want with the parsed XML document
end
Please refer to the documentation of IO#read, IO#rewind and Nokogiri::XML::Document.parse for details about those methods.
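If the prefix really does have to disappear from the file on disk as well, one way to do that without slurping the 30 MB into memory is to stream everything after the prefix into a temporary file and swap it into place. A sketch, assuming the prefix is always exactly 123 bytes:
header_length = 123
tmp_path = "#{file_path}.trimmed"

File.open(file_path) do |xml_file|
  if xml_file.read(header_length).include?('authenticate_response')
    # copy everything after the current offset (i.e. after the prefix) in chunks
    File.open(tmp_path, 'w') { |tmp| IO.copy_stream(xml_file, tmp) }
    # then replace the original file with the trimmed copy
    File.rename(tmp_path, file_path)
  end
end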

Load gemspec from stdin

I'm trying to adapt some existing code to also handle gems. This existing code needs the version number of the thing in question (here: the gem), does some git stuff to get the relevant file (here I take the gemspec) in the right version, and then passes it on stdin to another script that extracts the version number (and other stuff).
To avoid having to write code to parse a gemspec, I was trying to do:
spec = Gem::Specification::load('-')
puts spec.name
puts spec.version
But I can't make it read from stdin (it works fine if I hardcode a file name, but that won't work in my use case). Can I do that, or is there another (easy) way to do it?
Gem::Specification.load expects either a File instance or a path to a file as the first argument so the easiest way to solve this would be to simply create a Tempfile instance and write the data from stdin to it.
require 'tempfile'

file = Tempfile.new
begin
  file.write(data_from_stdin)   # e.g. data_from_stdin = $stdin.read
  file.rewind

  spec = Gem::Specification.load(file)
  puts spec.name
  puts spec.version
ensure
  file.close
  file.unlink
end
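If you want to read the spec straight from stdin, a compact sketch (using Tempfile.create for the cleanup and passing the path rather than the handle to Gem::Specification.load) could look like this:
require 'rubygems'
require 'tempfile'

Tempfile.create(['spec-from-stdin', '.gemspec']) do |file|
  file.write($stdin.read)
  file.flush

  spec = Gem::Specification.load(file.path)
  puts spec.name
  puts spec.version
end
The existing git-based script could then pipe the gemspec into it, e.g. git show <ref>:your.gemspec | ruby extract_version.rb (the script name here is hypothetical).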

RubyZip Unzipping .docx, modifying, and zipping back up throws Errno::EACCESS error

So, I'm using Nokogiri and Rubyzip to unzip a .docx file, modify the word/document.xml file inside it (in this case, just change every element wrapped in <w:t> to say "Dreams!"), and then zip it back up.
require 'nokogiri'
require 'zip'
zip = Zip::File.open("apple.docx")
doc = zip.find_entry("word/document.xml")
xml = Nokogiri::XML.parse(doc.get_input_stream)
inputs = xml.root.xpath("//w:t")
inputs.each{|element| element.content = "DREAMS!"}
zip.get_output_stream("word/document.xml", "w") {|f| f.write(xml.to_s)}
zip.close
Running the code through IRB line by line works perfectly and makes the changes to the .docx file as I needed, but if I run the script from the command line
ruby xmltodoc.rb
I receive the following error:
C:/Ruby193/lib/ruby/gems/1.9.1/gems/rubyzip-1.1.7/lib/zip/file.rb:416:in `rename': Permission denied - (C:/Users/Bane/Desktop/apple.docx20150326-6016-k9ff1n, apple.docx) (Errno::EACCES)
from C:/Ruby193/lib/ruby/gems/1.9.1/gems/rubyzip-1.1.7/lib/zip/file.rb:416:in `on_success_replace'
from C:/Ruby193/lib/ruby/gems/1.9.1/gems/rubyzip-1.1.7/lib/zip/file.rb:308:in `commit'
from C:/Ruby193/lib/ruby/gems/1.9.1/gems/rubyzip-1.1.7/lib/zip/file.rb:332:in `close'
from ./xmltodoc.rb:15:in `<main>'
All users on my computer have all permissions for that .docx file. The file also doesn't have any special settings--just a new file with a paragraph. This error only shows up on Windows, but the script works perfectly on Mac and Ubuntu. Running Powershell as Admin throws the same error. Any ideas?
On my Windows 7 system the following works.
require 'nokogiri'
require 'zip'
Zip::File.open("#{File.dirname(__FILE__)}/apple.docx") do |zipfile|
  doc = zipfile.read("word/document.xml")
  xml = Nokogiri::XML.parse(doc)
  inputs = xml.root.xpath("//w:t")
  inputs.each { |element| element.content = "DREAMS!" }
  zipfile.get_output_stream("word/document.xml") { |f| f.write(xml.to_s) }
end
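The rename in that backtrace is Rubyzip swapping its temporary archive over apple.docx when the zip is committed, and Windows refuses to rename over a file that some process still holds open - your own script, Word, or an antivirus scanner are the usual suspects - which is a plausible reason the script fails where the step-by-step IRB session does not. The block form above also guarantees the archive is committed and closed in one step.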
Alternatively, you could also use the docx gem. Here is an example; the names of the bookmarks are in Dutch because, well, that's the language my MS Office is in.
require 'docx'
# Create a Docx::Document object for our existing docx file
doc = Docx::Document.open('C:\Users\Gebruiker\test.docx'.gsub(/\\/,'/'))
# Insert a single line of text after one of our bookmarks
# p doc.bookmarks['bladwijzer1'].methods
doc.bookmarks['bladwijzer1'].insert_text_after("Hello world.")
# Insert multiple lines of text at our bookmark
doc.bookmarks['bladwijzer3'].insert_multiple_lines(['Hello', 'World', 'foo'])
# Save document to specified path
doc.save('example-edited.docx')

My file is getting shorter and I don't know why

I have a requirement where I need to edit part of an XML file and save it, but in my code some part of the XML file is not being saved. I want to modify <mtn:ttl>4</mtn:ttl> to <mtn:ttl>9</mtn:ttl>. That part is getting modified in the code below, but while writing/saving, only part of the file is saved or the format of the file is changed. Can anyone tell me how to solve this? The original XML file is 79 kB, but after editing and saving it becomes 78 kB...
require "rexml/text"
require "rexml/document"
include REXML
File.open("c://conf//cad-mtn-config.xml") do |config_file|
# Open the document and edit the file
config = Document.new(config_file)
if testField.to_s.match(/<mtn:ttl>/)
config.root.elements[4].elements[11].elements[1].elements[1].elements[1].elements[8].text="9"
# Write the result to a new file.
formatter = REXML::Formatters::Default.new
File.open("c://mtn-3//mtn-2.2//conf//cad-mtn-config.xml", 'w') do |result|
formatter.write(config, result)
end
end
end
It looks like you're trying to use regular expressions; why not just use REXML? The only requirement is that you need to know the namespace URI. Note that if it were not mtn:ttl but just ttl, you would not need the namespace.
require 'rexml/document'
file_path="path to file"
contents=File.new(file_path).read
xml_doc=REXML::Document.new(contents)
xml_doc.add_namespace('mtn',"http://url to mtn namespace")
xml_doc.root.elements.each('mtn:ttl') do |element|
element.text="9"
end
File.open(file_path,"w") do |data|
data<<xml_doc
end
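As for the size difference: REXML re-serializes the whole document when it writes, so whitespace, quoting and entity normalisation can easily account for a kilobyte without any data actually being lost; diffing the old and new files is the quickest way to confirm that. If the mtn:ttl elements sit deeper in the tree than the root's direct children, an alternative sketch (assuming the real namespace URI) is to let XPath search the whole document with an explicit namespace mapping:
require 'rexml/document'

file_path = "path to file"
doc = REXML::Document.new(File.read(file_path))

# map the mtn prefix to its namespace URI (assumed here) and update every ttl element
REXML::XPath.each(doc, '//mtn:ttl', 'mtn' => 'http://url to mtn namespace') do |element|
  element.text = '9'
end

File.open(file_path, 'w') { |f| doc.write(f) }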

Uploaded file char-set conversion with Ruby

I have an application where our clients upload a CSV file to our server. We then process the data from the CSV and put it into our database. We're running into some issues with character sets, especially when we're dealing with JSON; in particular, some non-converted UTF-8 characters are breaking IE on JSON responses.
Is there a way to convert the uploaded csv file to UTF-8 before we start processing it? Is there a way to determine the character encoding of an uploaded file? I've played with iconv a bit but we're not always sure what encoding the uploaded file will have. Thanks.
This solution might not be ideal, but it should do the job.
First, the ingredients:
chardet (sudo gem install chardet)
fastercsv (sudo gem install fastercsv)
Now the actual code (not tested):
require 'rubygems'
require 'UniversalDetector'
require 'fastercsv'
require 'iconv'
file_to_import = File.open("path/to/your.csv")

# determine the encoding based on the first 100 characters
chardet = UniversalDetector::chardet(file_to_import.read[0..100])
if chardet['confidence'] > 0.7
  charset = chardet['encoding']
else
  raise 'You better check this file manually.'
end

# the detection step read the whole file, so rewind before iterating its lines
file_to_import.rewind
file_to_import.each_line do |l|
  converted_line = Iconv.conv('utf-8', charset, l)
  row = FasterCSV.parse(converted_line)[0]
  # do the business here
end
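Note that Iconv was deprecated in Ruby 1.9 and later removed from the standard library, and FasterCSV became the built-in CSV library. A roughly equivalent sketch on a current Ruby (assuming the rchardet gem, a maintained port of chardet) would be:
require 'csv'
require 'rchardet'   # gem install rchardet

raw = File.binread("path/to/your.csv")

# detect the encoding from a sample of the raw bytes
detection = CharDet.detect(raw[0, 1000])
raise 'You better check this file manually.' unless detection['confidence'] > 0.7

# re-encode the whole file to UTF-8, replacing anything that cannot be mapped
utf8 = raw.encode('UTF-8', detection['encoding'], invalid: :replace, undef: :replace)

CSV.parse(utf8).each do |row|
  # do the business here
end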
