Edit docx using nokogiri and rubyzip - ruby

I'm using rubyzip and nokogiri to modify a .docx file.
RubyZip -> unzip the .docx file
Nokogiri -> parse and change the body content of word/document.xml
The sample code below modifies the file, but the other files in the archive get disturbed. In other words, the updated file no longer opens: the word processor shows an error and crashes. How can I resolve this issue?
require 'zip/zipfilesystem'
require 'nokogiri'
zip = Zip::ZipFile.open("SecurityForms.docx")
doc = zip.find_entry("word/document.xml")
xml = Nokogiri::XML.parse(doc.get_input_stream)
wt = xml.root.xpath("//w:t", {"w" => "http://schemas.openxmlformats.org/wordprocessingml/2006/main"}).first
wt.content = "FinalStatement"
zip.get_output_stream("word/document.xml") {|f| f << xml.to_s}
zip.close

According to the official GitHub documentation, you should use write_buffer instead of open. There's also a code example at the link.

The following code edits the content of a .docx template file. It first creates a new copy of template.docx; you need to create this template file yourself and keep it in the same folder as your Ruby class (for example, create My_Class.rb and copy the following code into it). It works perfectly for my case. Remember that you need to install the rubyzip and nokogiri gems in a gemset. Thanks.
require 'rubygems'
require 'fileutils'
require 'zip/zipfilesystem'
require 'nokogiri'
class Edit_docx
  def initialize
    # Random 50-letter file name for the working copy of the template
    coupling = [('a'..'z'),('A'..'Z')].map{|i| i.to_a}.flatten
    secure_string = (0...50).map{ coupling[rand(coupling.length)] }.join
    FileUtils.cp 'template.docx', "#{secure_string}.docx"
    zip = Zip::ZipFile.open("#{secure_string}.docx")
    doc = zip.find_entry("word/document.xml")
    xml = Nokogiri::XML.parse(doc.get_input_stream)
    wt = xml.root.xpath("//w:t", {"w"=>"http://schemas.openxmlformats.org/wordprocessingml/2006/main"})
    #puts wt
    # Replace the text of every w:t node with its index
    wt.each_with_index do |tag,i|
      tag.content = i.to_s + ""
    end
    zip.get_output_stream("word/document.xml") {|f| f << xml.to_s}
    zip.close
    puts secure_string
    #FileUtils.rm("#{secure_string}.docx")
  end
end
Edit_docx.new

Related

Read a Zip::Entry object after unzipping an xml file

I have an external XML file download that needs to be unzipped and parsed. I have downloaded and unzipped it, but now it is stuck as a Zip::Entry object and I am unable to parse it with Nokogiri.
require 'open-uri'
require 'zip'
require 'nokogiri'
url = 'https://download.api.bingads.microsoft.com/ReportDownload/Download.aspx?xmlfile'
zip_file = open(url)
# file pulled down successfully => tmp/localpath
unzippedxml = Zip::File.open(zip_file.path) do |z|
xml_file = z.first
end
#output is my xml file => myxml.xml
unzippedxml.class => Zip::Entry
Nokogiri::XML("unzippedxml")
=> #<Nokogiri::XML::Document:0x212b2c0 name="document">
How do I parse this file? I've created a dummy XML file that didn't need to be unzipped and I've been able to parse it in the console, but I am unable to get this one open.
Any help would be greatly appreciated!
Zip::ZipFile represents the entire Zip container; what you need instead is inside this container, an object of class Zip::ZipEntry. You could for example use Zip::ZipFile.read to get a file with a specific name:
require 'zip/zip'
zip = Zip::ZipFile.open('some.zip') # open zip
xml_source = zip.read('filename_inside_zip.xml') # read file contents
# now use the contents of xml_source with Nokogiri
Or, if you don't know the name but there's always only one file in the Zip, you can just take the first one:
require 'zip/zip'
zip = Zip::ZipFile.open('some.zip') # open zip
entry = zip.entries.reject(&:directory?).first # take first non-directory
xml_source = entry.get_input_stream{|is| is.read } # read file contents
# now use the contents of xml_source with Nokogiri

How to properly automate xml to xls

I have been getting a lot of XML files recently that I want to analyse in Excel. Instead of using the XML conversion built into (newer versions of) Excel, I want to use Ruby code that does it for a number of files automatically.
I am not very familiar with REXML, however. After half a day's work I got the code to convert just one(!) XML node. This is how it looks:
require 'rexml/document'
Dir.glob("FILES/archive/*.xml") do |eksemel|
puts "converting #{eksemel}"
filename = (/\d+/.match(eksemel)).to_s
xml_file = File.open("#{eksemel}", "r")
csv_file = File.new("#{filename}.csv", "w")
xml = REXML::Document.new( xml_file )
counter = 0
xml.elements.each("RESULTS") do |e|
e.elements.each("component") do |f|
f.elements.each("paragraph") do |g|
counter = counter + 1
csv_file.puts g.text
end
end
end
end
Is there a way to a) let Ruby work out the element names and their number automatically instead of defining them myself, and b) save all of these as separate columns in a CSV file?
It isn't clear what you are using counter for. It would also help if you clarified what kind of structure the XML file has (for instance, are there many <paragraph> elements within each <component> element?). But here is a cleaner way to write what I think you are shooting for:
require 'rexml/document'
require 'csv'
Dir.glob('FILES/archive/*.xml') do |eksemel|
puts "converting #{eksemel}"
# I assume you are creating a .csv file with the same name as your .xml file
xml_file = File.new(eksemel)
csv_file = CSV.open(eksemel.sub(/\.xml$/, '.csv'), 'w')
xml = REXML::Document.new(xml_file)
counter = xml.elements.to_a('RESULTS//component//paragraph').length
xml.elements.each('RESULTS//component') do |component|
csv_file << component.elements.to_a('paragraph').map(&:text) # write the text, not the Element objects
end
[xml_file, csv_file].each {|f| f.close}
end
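For part (a) of the question, one possible approach is to walk the tree instead of naming elements. The sketch below treats each child of the root as one CSV row and every leaf element beneath it as one column. That row/column mapping is an assumption about the file layout, and leaf_texts and xml_to_csv are hypothetical names.

```ruby
require 'rexml/document'
require 'csv'

# Collects the text of every leaf element below (or at) the given element,
# in document order, without needing to know any element names.
def leaf_texts(element)
  children = element.elements.to_a
  return [element.text.to_s.strip] if children.empty?

  children.flat_map { |child| leaf_texts(child) }
end

# One CSV row per child of the root, one column per leaf element.
def xml_to_csv(xml_source)
  doc = REXML::Document.new(xml_source)
  CSV.generate do |csv|
    doc.root.elements.each { |record| csv << leaf_texts(record) }
  end
end
```

Inside the Dir.glob loop, the result could be written with File.write(eksemel.sub(/\.xml$/, '.csv'), xml_to_csv(File.read(eksemel))).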

xpath search using libxml + ruby

I am trying to search for a specific node in an XML file using XPath. This search worked just fine under REXML, but REXML was too slow for large XML docs, so I moved over to LibXML.
My simple example is processing a Yum repomd.xml file, an example can be found here: http://mirror.san.fastserv.com/pub/linux/centos/6/os/x86_64/repodata/repomd.xml
My test script is as follows:
require 'rubygems'
require 'libxml'
p = LibXML::XML::Parser.file( "/tmp/dr.xml")
repomd = p.parse
filelist = repomd.find_first("/repomd/data[@type='filelists']/location@href")
puts "Length: " + filelist.length.to_s
filelist.each do |f|
puts f.attributes['href']
end
I get this error:
Error: Invalid expression.
/usr/lib/ruby/gems/1.8/gems/libxml-ruby-2.7.0/lib/libxml/document.rb:123:in `find': Error: Invalid expression. (LibXML::XML::Error)
from /usr/lib/ruby/gems/1.8/gems/libxml-ruby-2.7.0/lib/libxml/document.rb:123:in `find'
from /usr/lib/ruby/gems/1.8/gems/libxml-ruby-2.7.0/lib/libxml/document.rb:130:in `find_first'
from /tmp/scripty.rb:6
I have also tried simpler examples like below, but still no dice.
p = LibXML::XML::Parser.file( "/tmp/dr.xml")
repomd = p.parse
filelist = repomd.root.find(".//location")
puts "Length: " + filelist.length.to_s
In the above case I get the output:
Length: 0
Your inspired guidance would be greatly appreciated, and I have searched for what I am doing wrong, and I just can't figure it out...
Here is some code that will fetch the file and process it, still doesn't work...
require 'rubygems'
require 'open-uri'
require 'libxml'
raw_xml = open('http://mirror.san.fastserv.com/pub/linux/centos/6/os/x86_64/repodata/repomd.xml').read
p = LibXML::XML::Parser.string(raw_xml)
repomd = p.parse
filelist = repomd.find_first("//data[@type='filelists']/location[@href]")
puts "First: " + filelist
In the end I reverted back to REXML and used stream processing. Much faster and much easier XPath syntax implementation.
Looking at your code, it seems you want to collect only those location elements that have an href attribute. If that's the case, the expression below should work:
"//data[@type='filelists']/location[@href]"

Xpath content not saved

It might just be an idiotic bug in the code that I haven't yet discovered, but it's been taking me quite some time: When parsing websites using nokogiri and xpath, and trying to save the content of the xpaths to a .csv file, the csv file has empty cells.
Basically, the content of the xpath returns empty OR my code doesn't properly read the websites.
This is what I'm doing:
require 'open-uri'
require 'nokogiri'
require 'csv'
CSV.open("neverend.csv", "w") do |csv|
csv << ["kuk","date","name"]
#first, open the urls from a document. The urls are correct.
File.foreach("neverendurls.txt") do |line|
#second, the loop for each url
searchablefile = Nokogiri::HTML(open(line))
#third, the xpaths. These work when I try them on the website.
kuk = searchablefile.at_xpath("(//tbody/tr/td[contains(@style,'60px')])[1]")
date = searchablefile.at_xpath("(//tbody/tr/td[contains(@style,'60px')])[1]/following-sibling::*[1]")
name = searchablefile.at_xpath("(//tbody/tr/td[contains(@style, '60px')])[1]/following-sibling::*[2]")
#fourth, saving the xpaths
csv << [kuk,date,name]
end
end
What am I missing here?
It's impossible to tell from what you posted, but let's clean that hot mess up with css:
kuk = searchablefile.at 'td[style*="60px"]'
date = searchablefile.at 'td[style*="60px"] + *'
name = searchablefile.at 'td[style*="60px"] + * + *'

Parse ATOM in Ruby with custom namespaces

I'm trying to read this ATOM Feed (http://ffffound.com/feed), but I'm unable to get to any of the values which are defined as part of the namespace e.g. media:content and media:thumbnail.
Do I need to make the parser aware of the namespaces?
Here's what I've got:
require 'rss/2.0'
require 'open-uri'
source = "http://ffffound.com/feed"
content = ""
open(source) do |s| content = s.read end
rss = RSS::Parser.parse(content, false)
I believe you would have to use libxml-ruby for that.
gem 'libxml-ruby', '>= 0.8.3'
require 'xml'
require 'open-uri'
# read the feed body; the parser needs a String, not an IO object
xml = open("http://ffffound.com/feed").read
parser = XML::Parser.string(xml, :options =>XML::Parser::Options::RECOVER)
doc = parser.parse
doc.find("//channel/item").each do |item|
  puts item.find("media:content").first
  #and just guessing you want that url thingy
  puts item.find("media:content").first.attributes.get_attribute("url").value
end
I hope that points you in the right direction.
