How to use Nokogiri to parse an XML file for specific values? - ruby

I have an xml file from which I need to extract all values that contain https://www.example.com/a/b:
<xml>
<url><loc>https://www.example.com/a/b</loc></url>
<url><loc>https://www.example.com/b/c</loc></url>
<url><loc>https://www.example.com/a/b/c</loc></url>
<url><loc>https://www.example.com/c/d</loc></url>
</xml>
Given the above, this should return two results. I've opened the file and parsed it with Nokogiri, but I do not understand how to access the values of the //loc key.
require 'nokogiri'
require 'open-uri'
doc = File.open('./sitemap-en.xml') { |f| Nokogiri::XML(f) }
puts doc.xpath('//loc')
The above code prints the entire XML file, but I want it pared down so that I get everything under the /a/b subdirectories. How can I do this?

Both of the following solutions assume the following:
require 'nokogiri'
xml = <<-XML
<xml>
<url><loc>https://www.example.com/a/b</loc></url>
<url><loc>https://www.example.com/b/c</loc></url>
<url><loc>https://www.example.com/a/b/c</loc></url>
<url><loc>https://www.example.com/c/d</loc></url>
</xml>
XML
doc = Nokogiri::XML(xml)
To return a list of all loc elements, select only those whose inner text begins with https://www.example.com/a/b, and print the URL text:
elements = doc.xpath("//loc")
filtered_elements = elements.select do |element|
element.text.start_with? 'https://www.example.com/a/b'
end
filtered_elements.each do |element|
puts element.text
end
To capture a list of loc elements whose inner text contains the string https://www.example.com/a/b and print each URL:
elements = doc.xpath("//loc[contains(text(), 'https://www.example.com/a/b')]")
elements.each do |element|
puts element.text
end
To quickly print the URLs using a slightly modified version of the previous XPath query:
puts doc.xpath("//loc[contains(text(), 'https://www.example.com/a/b')]/text()")

Related

XSLT, RUBY, how to output the next element name from root?

I am working on a Ruby script that uses XSLT to convert XML to CSV. Part of my code's logic is to dynamically grab the element directly under the root so it can be treated as a row of records in the CSV file. I was able to get what I want by using Oxygen to convert the XML, but I am running into this error when using Nokogiri:
/Library/Ruby/Gems/2.3.0/gems/nokogiri-1.10.3/lib/nokogiri/xslt.rb:32:in `parse_stylesheet_doc': compilation error: file selectXMLelement.xsl line 5 element stylesheet (RuntimeError)
xsl:version: only 1.1 features are supported
compilation error: file selectXMLelement.xsl line 8 element value-of
xsl:value-of : could not compile select expression 'concat(':',/data:root/*/local-name())'
from /Library/Ruby/Gems/2.3.0/gems/nokogiri-1.10.3/lib/nokogiri/xslt.rb:32:in `parse'
from /Library/Ruby/Gems/2.3.0/gems/nokogiri-1.10.3/lib/nokogiri/xslt.rb:13:in `XSLT'
from EXTC-v1.rb:37:in `api_component'
from EXTC-v1.rb:43:in `block in <main>'
from EXTC-v1.rb:43:in `each'
from EXTC-v1.rb:43:in `<main>'
I would like to know if there is a way to use Nokogiri to get what I want instead of the XSLT, and how to feed into my Ruby script logic.
I have tried to use this XSLT:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:data="urn:com.sample/bsvc"
exclude-result-prefixes="data"
version="2.0">
<xsl:output method="text"/>
<xsl:template match="/">
<xsl:value-of select="concat(':',/data:root/*[1]/local-name())"/>
</xsl:template>
</xsl:stylesheet>
Example XML. With Oxygen, the XSLT successfully outputs what I want, ":Data_Request":
<data:root>
<data:Data_Request>
<data:name>John Doe</data:name>
<data:phone>123456776</data:phone>
</data:Data_Request>
</data:root>
My Ruby script:
def xslt_transform(filename)
#dir = File.join(Dir.pwd,'/input/')
xml_str = File.read(filename)
doc = Nokogiri::XML xml_str
template = Nokogiri::XSLT(File.open('Remove-CDATA.xsl'))
transformed_doc = template.transform(doc)
File.write(filename, transformed_doc)
end
Dir.glob('*.xml').each {|filename| xslt_transform(filename)}
# this is where I am trying to use the XSLT
def api_component(filename)
xml_str = File.read(filename)
doc = Nokogiri::XML xml_str
template = Nokogiri::XSLT(File.open('selectXMLelement.xsl'))
transformed_doc = template.transform(doc)
puts filename
end
api_name = Dir.glob('*xml').each {|filename| api_component(filename)}
puts api_name
def xml_to_csv(filename)
dir = File.join(Dir.pwd,'/input/')
xml_str = File.read(filename)
doc = Nokogiri::XML xml_str
csv_filename = filename.gsub('.xml','.csv')
record = {} # hashes
keys = Set.new
records = [] # array
csv = ""
# Returns a new hash created by traversing the hash and its subhashes,
# executing the given block on the key and value. The block should return a 2-element array of the form [key, value].
doc.traverse do |node|
value = node.text.gsub(/\n +/, '')
if node.name != "text" # skip these nodes: if class isnt text then skip
if value.length > 0 # skip empty nodes
key = node.name.gsub(/wd:/,'').to_sym
#api_component = doc.xpath('/*/*[1]')
# if a new and not empty record, add to our records collection
if key == :Data_Request && !record.empty? #for regular XML parsng, use the request data. For example :Location_Data
records << record
record = {}
elsif key[/^root$|^document$/]
# neglect these keys
else
key = node.name.gsub(/data:/,'').to_sym
# in case our value is html instead of text
record[key] = Nokogiri::HTML.parse(value).text
# add to our key set only if not already in the set
keys << key
end
end
end
end
# build our csv
dir = File.join(Dir.pwd,'/output/')
File.open('../output/'+csv_filename, 'wb') do |file|
file.puts %Q{"#{keys.to_a.join('","')}"}
records.each do |record|
keys.each do |key|
file.write %Q{"#{record[key]}",}
end
file.write "\n"
end
print ''
print filename+ " is ready!\n"
print ''
end
end
Dir.glob('*.xml').each { |filename| xml_to_csv(filename) }
As you can see, right now the node element is hard-coded: if key == :Data_Request && !record.empty?
Is there a way to do this with Nokogiri so that it dynamically detects the element in every XML file in the read path? If not, how can I achieve it with XSLT embedded in my script?
Side question: is there a way to make my script format all the data as text so that leading zeros are kept? :)

Find and replace content in all tags

I want to find and replace all nodes in XML files. I tried this:
def modify_xml_content(request_body, node, content)
doc = Nokogiri::XML(request_body)
node = doc.search(node).first
node.content = content
puts "Modifying #{node}"
doc.to_xml
rescue
request_body
end
Example XML
<billing_address>
<first_name>Max</first_name>
<last_name>Mustermann</last_name>
<address1>Muster Str. 12</address1>
<zip_code>10178</zip_code>
<city>New York</city>
<state>WA</state>
<country>US</country>
</billing_address>
<shipping_address>
<first_name>Max</first_name>
<last_name>Mustermann</last_name>
<address1>Muster Str. 12</address1>
<zip_code>10178</zip_code>
<city>New York</city>
<state>WA</state>
<country>US</country>
</shipping_address>
How can I find and replace the content of all matching tags, not only the first match?
Do each instead of first:
doc.search(node).each do |n|
n.content = content
end

How to replace XML node contents using Nokogiri

I'm using Ruby to read an XML document and update a single node, if it exists, with a new value.
From http://www.nokogiri.org/tutorials/modifying_an_html_xml_document.html it is not obvious to me how to change the node data, let alone how to save it back to the file.
def ammend_parent_xml(folder, target_file, new_file)
# open parent XML file that contains file reference
get_xml_files = Dir.glob("#{@target_folder}/#{folder}/*.xml").sort.select {|f| !File.directory? f}
get_xml_files.each { |xml|
f = File.open(xml)
# Use Nokogiri to read the file into an XML object
doc = Nokogiri::XML(f)
filename = doc.xpath('//Route//To//Node//FileName')
filename.each_with_index {
|fl, i|
if target_file == fl.text
# we found the file, now rename it to new_file
# ???????
end
}
}
end
This is some example XML:
<?xml version="1.0" encoding="utf-8">
<my_id>123</my_id>
<Route>
<To>
<Node>
<Filename>file1.txt</Filename>
<Filename>file2.mp3</Filename>
<Filename>file3.doc</Filename>
<Filename>file4.php</Filename>
<Filename>file5.jpg</Filename>
</Node>
</To>
</Route>
</xml>
I want to change "file3.doc" to "file3_new.html".
I would call:
ammend_parent_xml("folder_location", "file3.doc", "file3_new.html")
To change an element in the XML:
@doc = Nokogiri::XML::DocumentFragment.parse <<-EOXML
<body>
<h1>OLD_CONTENT</h1>
<div>blah</div>
</body>
EOXML
h1 = @doc.at_xpath "body/h1"
h1.content = "NEW_CONTENT"
puts @doc.to_xml # h1 will be NEW_CONTENT
To save the XML:
file = File.new("xml_file.xml", "wb")
file.write(@doc)
file.close
There are a few things wrong with your sample XML:
There are two root elements my_id and Route
There is a missing ? in the first tag
Do you need the last line </xml>?
After fixing the sample I was able to get the element by using the example by Phrogz:
element = @doc.xpath("Route//To//Node//Filename[.='#{target_file}']").first
Note .first since it will return a NodeSet.
Then I would update the content with:
element.content = "foobar"
def amend_parent_xml(folder, target_file, new_file)
Dir["#{@target_folder}/#{folder}/*.xml"]
.sort.select{|f| !File.directory? f }
.each do |xml_file|
doc = Nokogiri.XML( File.read(xml_file) )
if file = doc.at("//Route//To//Node//Filename[.='#{target_file}']")
file.content = new_file # set the text of the node
File.open(xml_file,'w'){ |f| f<<doc }
break
end
end
end
Improvements:
Uses File.read instead of File.open so that no file handle is left open.
Uses an XPath expression to find the SINGLE matching node by looking for a node with the correct text value.
Alternatively you could find all the files and then if file=files.find{ |f| f.text==target_file }
Shows how to serialize a Nokogiri::XML::Document back to disk.
Breaks out of processing the files as soon as it finds a matching XML file.

How to parse XML nodes to CSV with Ruby and Nokogiri

I have an XML file:
<?xml version="1.0" encoding="iso-8859-1"?>
<Offers xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="http://ssc.channeladvisor.com/files/cageneric.xsd">
<Offer>
<Model><![CDATA[11016001]]></Model>
<Manufacturer><![CDATA[Crocs, Inc.]]></Manufacturer>
<ManufacturerModel><![CDATA[11016-001]]></ManufacturerModel>
...lots more nodes
<Custom6><![CDATA[<li>Bold midsole stripe for a sporty look.</li>
<li>Odor-resistant, easy to clean, and quick to dry.</li>
<li>Ventilation ports for enhanced breathability.</li>
<li>Lightweight, non-marking soles.</li>
<li>Water-friendly and buoyant; weighs only ounces.</li>
<li>Fully molded Croslite™ material for lightweight cushioning and comfort.</li>
<li>Heel strap swings back for snug fit, forward for wear as a clog.</li>]]></Custom6>
</Offer>
....lots lots more <Offer> entries
</Offers>
I want to parse each instance of 'Offer' into its own row in a CSV file:
require 'csv'
require 'nokogiri'
file = File.read('input.xml')
doc = Nokogiri::XML(file)
a = []
csv = CSV.open('output.csv', 'wb')
doc.css('Offer').each do |node|
a.push << node.content.split
end
a.each { |a| csv << a }
This runs nicely except I'm splitting on whitespace rather than each element of the Offer node so every word is going into its own column in the CSV file.
Is there a way to pick up the content of each node and how do I use the node names as headers in the CSV file?
This assumes that each Offer element always has the same child nodes (though they can be empty):
CSV.open('output.csv', 'wb') do |csv|
doc.search('Offer').each do |x|
csv << x.search('*').map(&:text)
end
end
And to get headers (from the first Offer element):
CSV.open('output.csv', 'wb') do |csv|
csv << doc.at('Offer').search('*').map(&:name)
doc.search('Offer').each do |x|
csv << x.search('*').map(&:text)
end
end
search and at are Nokogiri methods that accept either XPath or CSS selector strings. at returns the first matching element (or nil if there is none); search returns a NodeSet of all matching elements (empty if nothing matches). The * here matches every descendant element of the current node, which for a flat Offer like this one means its direct children.
Both name and text are also Nokogiri methods (for an element). name returns the element's name; text returns the text or CDATA content of a node.
Try this, and modify it to push into your CSV:
doc.css('Offer').first.elements.each do |n|
puts "#{n.name}: #{n.content}"
end

How to search an XML when parsing it using SAX in nokogiri

I have a simple but huge xml file like below. I want to parse it using SAX and only print out text between the title tag.
<root>
<site>some site</site>
<title>good title</title>
</root>
I have the following code:
require 'rubygems'
require 'nokogiri'
include Nokogiri
class PostCallbacks < XML::SAX::Document
def start_element(element, attributes)
if element == 'title'
puts "found title"
end
end
def characters(text)
puts text
end
end
parser = XML::SAX::Parser.new(PostCallbacks.new)
parser.parse_file("myfile.xml")
The problem is that it prints the text between all the tags. How can I print only the text between the title tags?
You just need to keep track of when you're inside a <title> so that characters knows when it should pay attention. Something like this (untested code) perhaps:
class PostCallbacks < XML::SAX::Document
  def initialize
    @in_title = false
  end
  def start_element(element, attributes)
    if element == 'title'
      puts "found title"
      @in_title = true
    end
  end
  def end_element(element)
    # Doesn't really matter what element we're closing unless there is nesting,
    # then you'd want "@in_title = false if element == 'title'"
    @in_title = false
  end
  def characters(text)
    puts text if @in_title
  end
end
The accepted answer above is correct, however it has a drawback that it will go through the whole XML file even if it finds <title> right at the beginning.
I had similar needs and ended up writing saxy, a Ruby gem that aims to be efficient in such situations. Under the hood it implements Nokogiri's SAX API.
Here's how you'd use it:
require 'saxy'
title = Saxy.parse(path_to_your_file, 'title').first
It will stop as soon as it finds the first occurrence of the <title> tag.

Resources