How to parse XML nodes to CSV with Ruby and Nokogiri - ruby

I have an XML file:
<?xml version="1.0" encoding="iso-8859-1"?>
<Offers xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="http://ssc.channeladvisor.com/files/cageneric.xsd">
<Offer>
<Model><![CDATA[11016001]]></Model>
<Manufacturer><![CDATA[Crocs, Inc.]]></Manufacturer>
<ManufacturerModel><![CDATA[11016-001]]></ManufacturerModel>
...lots more nodes
<Custom6><![CDATA[<li>Bold midsole stripe for a sporty look.</li>
<li>Odor-resistant, easy to clean, and quick to dry.</li>
<li>Ventilation ports for enhanced breathability.</li>
<li>Lightweight, non-marking soles.</li>
<li>Water-friendly and buoyant; weighs only ounces.</li>
<li>Fully molded Croslite™ material for lightweight cushioning and comfort.</li>
<li>Heel strap swings back for snug fit, forward for wear as a clog.</li>]]></Custom6>
</Offer>
....lots lots more <Offer> entries
</Offers>
I want to parse each instance of 'Offer' into its own row in a CSV file:
require 'csv'
require 'nokogiri'
file = File.read('input.xml')
doc = Nokogiri::XML(file)
a = []
csv = CSV.open('output.csv', 'wb')
doc.css('Offer').each do |node|
a.push << node.content.split
end
a.each { |a| csv << a }
This runs nicely except I'm splitting on whitespace rather than each element of the Offer node so every word is going into its own column in the CSV file.
Is there a way to pick up the content of each node and how do I use the node names as headers in the CSV file?

This assumes that each Offer element always has the same child nodes (though they can be empty):
CSV.open('output.csv', 'wb') do |csv|
doc.search('Offer').each do |x|
csv << x.search('*').map(&:text)
end
end
And to get headers (from the first Offer element):
CSV.open('output.csv', 'wb') do |csv|
csv << doc.at('Offer').search('*').map(&:name)
doc.search('Offer').each do |x|
csv << x.search('*').map(&:text)
end
end
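search and at are Nokogiri methods that accept either XPath or CSS selector strings. at returns the first matching element (or nil if there is none); search returns a NodeSet of all matching elements (empty if nothing matches). Called on an Offer node, the * selector matches every element beneath that node, not only its direct children; the two coincide here because the children of each Offer hold only text or CDATA. To restrict the match to direct children, use x.elements or x.xpath('./*').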
Both name and text are also Nokogiri functions (for an element). name provides the element's name; text provides the text or CDATA content of a node.

Try this, and modify it to push into your CSV:
doc.css('Offer').first.elements.each do |n|
puts "#{n.name}: #{n.content}"
end

Related

How to use Nokogiri to parse an XML file for specific values?

I have an xml file from which I need to extract all values that contain https://www.example.com/a/b:
<xml>
<url><loc>https://www.example.com/a/b</loc></url>
<url><loc>https://www.example.com/b/c</loc></url>
<url><loc>https://www.example.com/a/b/c</loc></url>
<url><loc>https://www.example.com/c/d</loc></url>
</xml>
Given the above, this should return two results. I've opened the file and parsed it with Nokogiri, but I do not understand how to access the values of the //loc key.
require 'nokogiri'
require 'open-uri'
doc = File.open('./sitemap-en.xml') { |f| Nokogiri::XML(f) }
puts doc.xpath('//loc')
The above code prints every loc element in the XML file, but I want it pared down so that I get only the URLs under the /a/b subdirectory. How can I do this?
Both of the following solutions assume the following:
require 'nokogiri'
xml = <<-XML
<xml>
<url><loc>https://www.example.com/a/b</loc></url>
<url><loc>https://www.example.com/b/c</loc></url>
<url><loc>https://www.example.com/a/b/c</loc></url>
<url><loc>https://www.example.com/c/d</loc></url>
</xml>
XML
doc = Nokogiri::XML(xml)
To return a list of all loc elements, select only those whose inner text begins with https://www.example.com/a/b, and print the URL text:
elements = doc.xpath("//loc")
filtered_elements = elements.select do |element|
element.text.start_with? 'https://www.example.com/a/b'
end
filtered_elements.each do |element|
puts element.text
end
To capture a list of loc elements whose inner text contains the string https://www.example.com/a/b and print each URL:
elements = doc.xpath("//loc[contains(text(), 'https://www.example.com/a/b')]")
elements.each do |element|
puts element.text
end
To quickly print the URLs using a slightly modified version of the previous XPath query:
puts doc.xpath("//loc[contains(text(), 'https://www.example.com/a/b')]/text()")

How to properly automate xml to xls

I have been getting a lot of XML files recently that I want to analyse in Excel. Instead of using the XML conversion built into (newer versions of) Excel, I want to use Ruby code that converts a number of files automatically.
I am not very familiar with REXML, however. After half a day's work I got the code to convert just one(!) XML node. This is how it looks:
require 'rexml/document'
Dir.glob("FILES/archive/*.xml") do |eksemel|
puts "converting #{eksemel}"
filename = (/\d+/.match(eksemel)).to_s
xml_file = File.open("#{eksemel}", "r")
csv_file = File.new("#{filename}.csv", "w")
xml = REXML::Document.new( xml_file )
counter = 0
xml.elements.each("RESULTS") do |e|
e.elements.each("component") do |f|
f.elements.each("paragraph") do |g|
counter = counter + 1
csv_file.puts g.text
end
end
end
end
Is there a way to a) have Ruby discover the element names and their number automatically instead of defining them by hand, and b) save all of these as separate columns in a CSV file?
It isn't clear what you are using counter for. It would also help if you clarified what kind of structure the XML file has (for instance, are there many <paragraph> elements within each <component> element?). But here is a cleaner way to write what I think you are shooting for:
require 'rexml/document'
require 'csv'
Dir.glob('FILES/archive/*.xml') do |eksemel|
puts "converting #{eksemel}"
# I assume you are creating a .csv file with the same name as your .xml file
xml_file = File.new(eksemel)
csv_file = CSV.open(eksemel.sub(/\.xml$/, '.csv'), 'w')
xml = REXML::Document.new(xml_file)
counter = xml.elements.to_a('RESULTS//component//paragraph').length
xml.elements.each('RESULTS//component') do |component|
# write the text of each paragraph, not the element objects themselves
csv_file << component.elements.to_a('paragraph').map(&:text)
end
[xml_file, csv_file].each {|f| f.close}
end
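For part a) of the question, one possible approach is to let REXML report the element names instead of hard-coding them. This is only a sketch under an assumed layout: each child of the root element is treated as one record, and each distinct child name of a record becomes a column (records_to_csv is a hypothetical helper name):

```ruby
require 'rexml/document'
require 'csv'

# Hypothetical helper: each child of the root becomes a row,
# each distinct grandchild name becomes a column.
def records_to_csv(xml)
  root = REXML::Document.new(xml).root
  records = root.elements.to_a # direct child elements of the root

  # Discover the column names from the data, in first-seen order.
  headers = records.flat_map { |record| record.elements.to_a.map(&:name) }.uniq

  CSV.generate do |csv|
    csv << headers
    records.each do |record|
      # elements[name] returns the first child with that name, or nil if absent.
      csv << headers.map { |name| record.elements[name]&.text }
    end
  end
end

sample = <<~XML
  <RESULTS>
    <component><title>First</title><paragraph>one</paragraph></component>
    <component><title>Second</title></component>
  </RESULTS>
XML

puts records_to_csv(sample)
```

If a record repeats the same child name several times (many <paragraph> elements in one <component>, say), only the first is kept by this sketch; in that case one cell per repeated element, as in the loop above, is the better fit.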

Ruby Nokogiri converting KML to CSV

I'm trying to extract two different elements from a KML file and turn them into a CSV. I'm starting with the great site here: http://ckdake.com/content/2012/highgroove-hack-night-kml-heatmaps.html that generates a csv of coordinates. All I want to do now is add the name tag to the start of each line. I'm a ruby/nokogiri n00b so I can stick this bit of code in which gets me a) a list of all names followed by b) a list of all coordinates. But again - I'd like them on the same line.
require 'rubygems'
require 'nokogiri' # gem install nokogiri
@doc = Nokogiri::XML(File.open("WashingtonDC2013-01-04 12h09m01s.kml"))
@doc.css('name').each do |name|
puts name.content
end
@doc.css('coordinates').each do |coordinates|
coordinates.text.split(' ').each do |coordinate|
(lat,lon,elevation) = coordinate.split(',')
puts "#{lat},#{lon}\n"
end
end
How about this:
@doc.css('Placemark').each do |placemark|
name = placemark.at_css('name') # at_css returns nil when there is no match, so the check below is meaningful
coordinates = placemark.at_css('coordinates')
if name && coordinates
print name.text + ","
coordinates.text.split(' ').each do |coordinate|
(lon,lat,elevation) = coordinate.split(',')
print "#{lat},#{lon}"
end
puts "\n"
end
end
I'm assuming here that there is one coordinates pair in the <coordinates> tags for each <Placemark>. If there are more, they'll all get appended onto the same line.
If that doesn't work, you'll need to post some of the KML file itself so I can test on it. I'm just guessing based on this sample KML file.

How do I tell CSV to make a column instead of a row?

I wrote a very simple program with Nokogiri to scrape a website and create a CSV file. It is getting the correct data and making the CSV, but the data is pushed into one cell (A1), and I would rather it come out as a column, with each value separated by a comma.
How do I tell CSV to make a column for each value separated by a comma instead of putting all the info into a single cell?
require 'open-uri'
require 'nokogiri'
require 'csv'
doc = Nokogiri::HTML(open('somewebpage.com'))
CSV.open("webpagedata.csv", "wb") do |csv|
data = doc.css('.information h3 a').map { |link| link['href'] }
puts data
csv << [data]
end
The result from doc.css('.information h3 a').map { |link| link['href'] } is already an array, so when you add it to your CSV file you don’t need to wrap it in [...].
Change the line
csv << [data]
to
csv << data
The CSV library deals mainly in rows, so if you want to create a column rather than a row, then you need to add a (single entry) row for each entry of the column:
CSV.open("webpagedata.csv", "wb") do |csv|
data = doc.css('.information h3 a').map { |link| link['href'] }
data.each do |entry|
csv << [entry]
end
end
Note that in this case you do need the [...] around the entry, as you need to add an array not a single item.
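The difference between the two layouts is easy to see without scraping anything. A small sketch, with a hard-coded array standing in for the scraped hrefs (row_csv and column_csv are hypothetical helper names):

```ruby
require 'csv'

# One row: every value becomes its own cell on a single line.
def row_csv(values)
  CSV.generate { |csv| csv << values }
end

# One column: every value becomes its own single-cell row.
def column_csv(values)
  CSV.generate { |csv| values.each { |value| csv << [value] } }
end

links = ['/a', '/b', '/c'] # stand-in for the scraped hrefs

puts row_csv(links)    # prints "/a,/b,/c"
puts column_csv(links) # prints "/a", "/b" and "/c" on separate lines
```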

How do I tell the line number for a node using the Nokogiri reader interface?

I'm trying to write a Nokogiri script that will grep XML for text nodes containing ASCII double-quotes («"»). Since I want a grep-like output I need the line number, and the contents of each line. However, I am unable to see how to tell the line number where the element starts at. Here is my code:
require 'rubygems'
require 'nokogiri'
ARGV.each do |filename|
xml_stream = File.open(filename)
reader = Nokogiri::XML::Reader(xml_stream)
titles = []
text = ''
grab_text = false
reader.each do |elem|
if elem.node_type == Nokogiri::XML::Node::TEXT_NODE
data = elem.value
lines = data.split(/\n/, -1);
lines.each_with_index do |line, idx|
if (line =~ /"/) then
STDOUT.printf "%s:%d:%s\n", filename, elem.line()+idx, line
end
end
end
end
end
elem.line() does not work.
XML and parsers don't really have a concept of line numbers. You're talking about the physical layout of the file.
You can play a game with the parser using accessors looking for text nodes containing linefeeds and/or carriage returns but that can be thrown off because XML allows nested nodes.
require 'nokogiri'
xml =<<EOT_XML
<atag>
<btag>
<ctag
id="another_node">
other text
</ctag>
</btag>
<btag>
<ctag id="another_node2">yet
another
text</ctag>
</btag>
<btag>
<ctag id="this_node">this text</ctag>
</btag>
</atag>
EOT_XML
doc = Nokogiri::XML(xml)
# find a particular node via CSS accessor
doc.at('ctag#this_node').text # => "this text"
# count how many "lines" there are in the document
doc.search('*/text()').select{ |t| t.text[/[\r\n]/] }.size # => 12
# walk the nodes looking for a particular string, counting lines as you go
content_at = []
doc.search('*/text()').each do |n|
content_at << [n.line, n.text] if (n.text['this text'])
end
content_at # => [[14, "this text"]]
This works because of the parser's ability to figure out what is a text node and cleanly return it, without relying on regex or text matches.
EDIT: I went through some old code, snooped around in Nokogiri's docs some, and came up with the above edited changes. It's working correctly, including working with some pathological cases. Nokogiri FTW!
As of 1.2.0 (released 2009-02-22), Nokogiri supports Node#line, which returns the line number in the source where that node is defined.
It appears to use the libxml2 function xmlGetLineNo().
require 'nokogiri'
doc = Nokogiri::XML(open 'tmpfile.xml')
doc.xpath('//xmlns:package[@arch="x86_64"]').each do |node|
puts '%4d %s' % [node.line, node['name']]
end
NOTE: if you are working with large XML files (> 65535 lines), be sure to use Nokogiri 1.13.0 or newer (released 2022-01-06), or your Node#line results will not be accurate for large line numbers. See PR 2309 for an explanation.
