Editing a Word document (docx) with Nokogiri - Ruby

I could not find an existing question with a solution that works for me (apologies if this is a duplicate).
I have a Word template file that needs to change depending on the data I am given.
I have been struggling to find a solution: I need to edit the Word file and add new lines of text at a specific place.
Right now the code opens the Word document using zip/zip:
def generate_docx(xml)
  require 'rubygems'
  require 'nokogiri'
  require 'zip/zip'
  require 'pp'
  zip = Zip::ZipFile.open(xml)
  doc = zip.find_entry("word/document.xml")
  xml = Nokogiri::XML.parse(doc.get_input_stream)
  wt = xml.root.xpath("//w:t", {"w" => "http://schemas.openxmlformats.org/wordprocessingml/2006/main"})
  count = 0
  wt.each do |row|
    pp "#{count}: #{row.text}"
    count += 1
  end
  zip.get_output_stream("word/document.xml") {|f| f << xml.to_s}
  zip.close
end
output:
1: some text 1
2: some text 2
3: some text 3
Now I need to add a new line of text under the first row, like so:
1: some text 1
2: NEW LINE of text
3: some text 2
4: some text 3
I did find this snippet, which adds a text node to my XML, but the text does not appear in my Word document:
wt[1].add_next_sibling "some text"

Related

Ruby Nokogiri converting KML to CSV

I'm trying to extract two different elements from a KML file and turn them into a CSV. I'm starting from the great write-up at http://ckdake.com/content/2012/highgroove-hack-night-kml-heatmaps.html, which generates a CSV of coordinates. All I want to do now is add the name tag to the start of each line. I'm a Ruby/Nokogiri n00b, so the best I've managed is the bit of code below, which gets me a) a list of all names followed by b) a list of all coordinates. But again - I'd like each name and its coordinates on the same line.
require 'rubygems'
require 'nokogiri' # gem install nokogiri
@doc = Nokogiri::XML(File.open("WashingtonDC2013-01-04 12h09m01s.kml"))
@doc.css('name').each do |name|
  puts name.content
end
@doc.css('coordinates').each do |coordinates|
  coordinates.text.split(' ').each do |coordinate|
    (lat,lon,elevation) = coordinate.split(',')
    puts "#{lat},#{lon}\n"
  end
end
How about this:
@doc.css('Placemark').each do |placemark|
  # at_css returns nil when nothing matches, so the check below is meaningful
  name = placemark.at_css('name')
  coordinates = placemark.at_css('coordinates')
  if name && coordinates
    print name.text + ","
    coordinates.text.split(' ').each do |coordinate|
      # KML stores each tuple as longitude,latitude,elevation
      (lon,lat,elevation) = coordinate.split(',')
      print "#{lat},#{lon}"
    end
    puts "\n"
  end
end
I'm assuming here that there is one coordinates pair in the <coordinates> tags for each <Placemark>. If there are more, they'll all get appended onto the same line.
If that doesn't work, you'll need to post some of the KML file itself so I can test on it. I'm just guessing based on this sample KML file.

Script that saves a series of pages then tries to combine them but only combines one?

Here's my code:
require "open-uri"
base_url = "http://en.wikipedia.org/wiki"
(1..5).each do |x|
  # sets up the url
  full_url = base_url + "/" + x.to_s
  # reads the url
  read_page = open(full_url).read
  # saves the contents to a file and closes it
  local_file = "my_copy_of-" + x.to_s + ".html"
  file = open(local_file,"w")
  file.write(read_page)
  file.close
  # open a file to store all entries in
  combined_numbers = open("numbers.html", "w")
  entrys = open(local_file, "r")
  combined_numbers.write(entrys.read)
  entrys.close
  combined_numbers.close
end
As you can see, it basically scrapes the contents of the Wikipedia articles 1 through 5 and then attempts to combine them into a single file called numbers.html.
It does the first bit right, but when it gets to the second, it only seems to write the contents of the fifth article in the loop.
I can't see where I'm going wrong, though. Any help?
You chose the wrong mode when opening your summary file: "w" overwrites an existing file, while "a" appends to it.
So use this to get your code working:
combined_numbers = open("numbers.html", "a")
Otherwise, with each pass of the loop, the contents of numbers.html are overwritten with the current article.
Besides, I think you should write the contents of read_page to numbers.html directly instead of reading them back in from your freshly written file:
require "open-uri"
(1..5).each do |x|
  # set up and read url
  url = "http://en.wikipedia.org/wiki/#{x}"
  article = open(url).read
  # saves current article to a file
  # (IO.write is only available on 1.9.x; use open here too if on 1.8.x)
  IO.write("my_copy_of-#{x}.html", article)
  # add current article to summary file
  open("numbers.html", "a") do |f|
    f.write(article)
  end
end
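A variation on the same idea, as a sketch: open the summary file once, before the loop. Then "w" truncates it a single time per run, and re-running the script does not keep appending to an old numbers.html. To stay runnable offline, the article bodies here are placeholder strings rather than downloaded pages, and everything is written to a temporary directory; both are assumptions for illustration only.

```ruby
require 'tmpdir'

# Placeholder article bodies standing in for the downloaded pages.
articles = (1..5).map { |x| "<p>article #{x}</p>\n" }

combined = nil
Dir.mktmpdir do |dir|
  summary_path = File.join(dir, 'numbers.html')

  # "w" truncates the file once, here; each write inside the block
  # then lands after the previous one.
  File.open(summary_path, 'w') do |summary|
    articles.each_with_index do |article, i|
      File.write(File.join(dir, "my_copy_of-#{i + 1}.html"), article)
      summary.write(article)
    end
  end

  combined = File.read(summary_path)
end

puts combined
```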

Get text of a paragraph with all the markup (and their content) removed

How can I get only the text of a <p> node that has other tags in it, like:
<p>hello my website is <a>click here</a> <b>test</b></p>
I only want "hello my website is"
This is what I tried:
begin
  node = html_doc.css('p')
  node.each do |node|
    node.children.remove
  end
  return (node.nil?) ? '' : node.text
rescue
  return ''
end
Update 2: All right. You are removing all children with node.children.remove, including the text nodes. A possible solution might look like:
# 1. select all <p> nodes
doc.css('p').
# 2. map children, and flatten
map { |node| node.children }.flatten.
# 3. select text nodes only
select { |node| node.text? }.
# 4. get text and join
map { |node| node.text }.join(' ').strip
This sample returns "hello my website is", but note that doc.css('p') also finds <p> tags within <p> tags.
Update: Sorry, I misread your question; you only want "hello my website is", so see the solution above. Original answer:
Not directly with nokogiri, but the sanitize gem might be an option: https://github.com/rgrove/sanitize/
Sanitize.clean(html, {}) # => " hello my website is click here test "
FYI, it uses nokogiri internally.
Your test case did not include any interesting text interleaved with the markup.
If you want to turn <p>Hello <b>World</b>!</p> into "Hello !", then removing the children is one way to do it. Simpler (and less destructive) is to just find all the text nodes and join them:
require 'nokogiri'
html = Nokogiri::HTML('<p>Hello <b>World</b>!</p>')
# Find the first paragraph (in this case the only one)
para = html.at('p')
# Find all the text nodes that are children (not descendants),
# change them from nodes into the strings of text they contain,
# and then smush the results together into one big string.
p para.search('text()').map(&:text).join
#=> "Hello !"
If you want to turn <p>Hello <b>World</b>!</p> into "Hello " (no exclamation point) then you can simply do:
p para.children.first.text # if you know that text is the first child
p para.at('text()').text # if you want to find the first text node
As @Iwe showed, you can use the String#strip method to remove leading/trailing whitespace from the result, if you like.
There's a different way to go about this. Rather than bother with removing nodes, remove the text that those nodes contain:
require 'nokogiri'
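doc = Nokogiri::HTML('<p>hello my website is <a>click here</a> <b>test</b></p>')
text = doc.search('p').map{ |p|
  p_text = p.text
  # text of the unwanted <a> node
  a_text = p.at('a').text
  # delete the first occurrence of that text from the paragraph's text
  p_text[a_text] = ''
  p_text
}
puts text
>>hello my website is  test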
This is a simple example, but the idea is to find the <p> tags, then scan inside those for the tags that contain the text you don't want. For each of those unwanted tags, grab their text and delete it from the surrounding text.
In the sample code, you'd have a list of undesirable nodes at the a_text assignment, loop over them, and iteratively remove the text, like so:
text = doc.search('p').map{ |p|
  p_text = p.text
  %w[a].each do |bad_nodes|
    bad_nodes_text = p.at(bad_nodes).text
    p_text[bad_nodes_text] = ''
  end
  p_text
}
You get back text which is an array of the tweaked text contents of the <p> nodes.

trying to get the delta between columns using FasterCSV

A bit of a noob here, so apologies in advance.
I am trying to read a CSV file that has a number of columns. I would like to see if the string "foo" exists anywhere in the file and, if so, grab the string one cell over (same row, one column over) and write it to a file.
My file c.csv:
foo,bar,yip
12,apple,yap
23,orange,yop
foo,tom,yum
So in this case, I would want "bar" and "tom" in a new CSV file.
Here's what I have so far:
#!/usr/local/bin/ruby -w
require 'rubygems'
require 'fastercsv'
rows = FasterCSV.read("c.csv")
acolumn = rows.collect{|row| row[0]}
if acolumn.select{|v| v =~ /foo/} == 1
i = 0
for z in i..(acolumn).count
puts rows[1][i]
end
I've looked at https://github.com/circle/fastercsv/blob/master/examples/csv_table.rb, but I am obviously not understanding it. My best guess is that I'd have to use Table to do what I want, but after banging my head against the wall for a bit, I decided to ask the experienced folks for advice. Help, please?
Given your input file c.csv
foo,bar,yip
12,apple,yap
23,orange,yop
foo,tom,yum
then this script:
#!/usr/bin/ruby1.8
require 'fastercsv'
FasterCSV.open('output.csv', 'w') do |output|
  FasterCSV.foreach('c.csv') do |row|
    foo_index = row.index('foo')
    if foo_index
      value_to_the_right_of_foo = row[foo_index + 1]
      output << value_to_the_right_of_foo
    end
  end
end
will create the file output.csv
bar
tom
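On Ruby 1.9 and later, FasterCSV became the standard library's csv, so roughly the same approach can be sketched without the gem. One difference worth flagging: each value is wrapped in an array before being appended, since CSV#<< expects a row there. The data is inlined and the result is built as a string only to keep the sketch self-contained; the variable names are illustrative.

```ruby
require 'csv'

input = <<~DATA
  foo,bar,yip
  12,apple,yap
  23,orange,yop
  foo,tom,yum
DATA

output = CSV.generate do |csv|
  CSV.parse(input) do |row|
    foo_index = row.index('foo')
    next unless foo_index

    neighbour = row[foo_index + 1]
    # Skip rows where "foo" sits in the last column and has no neighbour.
    csv << [neighbour] unless neighbour.nil?
  end
end

puts output
```

For files on disk, CSV.foreach('c.csv') and CSV.open('output.csv', 'w') take the place of CSV.parse and CSV.generate.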

How do I tell the line number for a node using the Nokogiri reader interface?

I'm trying to write a Nokogiri script that will grep XML for text nodes containing ASCII double-quotes («"»). Since I want grep-like output, I need the line number and the contents of each line. However, I can't see how to determine the line number where an element starts. Here is my code:
require 'rubygems'
require 'nokogiri'
ARGV.each do |filename|
  xml_stream = File.open(filename)
  reader = Nokogiri::XML::Reader(xml_stream)
  titles = []
  text = ''
  grab_text = false
  reader.each do |elem|
    if elem.node_type == Nokogiri::XML::Node::TEXT_NODE
      data = elem.value
      lines = data.split(/\n/, -1);
      lines.each_with_index do |line, idx|
        if (line =~ /"/) then
          STDOUT.printf "%s:%d:%s\n", filename, elem.line()+idx, line
        end
      end
    end
  end
end
elem.line() does not work.
XML and parsers don't really have a concept of line numbers. You're talking about the physical layout of the file.
You can play a game with the parser using accessors looking for text nodes containing linefeeds and/or carriage returns but that can be thrown off because XML allows nested nodes.
require 'nokogiri'
xml =<<EOT_XML
<atag>
<btag>
<ctag
id="another_node">
other text
</ctag>
</btag>
<btag>
<ctag id="another_node2">yet
another
text</ctag>
</btag>
<btag>
<ctag id="this_node">this text</ctag>
</btag>
</atag>
EOT_XML
doc = Nokogiri::XML(xml)
# find a particular node via CSS accessor
doc.at('ctag#this_node').text # => "this text"
# count how many "lines" there are in the document
doc.search('*/text()').select{ |t| t.text[/[\r\n]/] }.size # => 12
# walk the nodes looking for a particular string, counting lines as you go
content_at = []
doc.search('*/text()').each do |n|
  content_at << [n.line, n.text] if (n.text['this text'])
end
content_at # => [[14, "this text"]]
This works because of the parser's ability to figure out what is a text node and cleanly return it, without relying on regex or text matches.
EDIT: I went through some old code, snooped around in Nokogiri's docs some, and came up with the above edited changes. It's working correctly, including working with some pathological cases. Nokogiri FTW!
As of 1.2.0 (released 2009-02-22), Nokogiri supports Node#line, which returns the line number in the source where that node is defined.
It appears to use the libxml2 function xmlGetLineNo().
require 'nokogiri'
doc = Nokogiri::XML(open 'tmpfile.xml')
doc.xpath('//xmlns:package[@arch="x86_64"]').each do |node|
  puts '%4d %s' % [node.line, node['name']]
end
NOTE: If you are working with large XML files (> 65535 lines), be sure to use Nokogiri 1.13.0 or newer (released 2022-01-06), or your Node#line results will not be accurate for large line numbers. See PR 2309 for an explanation.
