XSLT, RUBY, how to output the next element name from root? - ruby

I am working on a Ruby script that uses XSLT to convert XML to CSV. Part of the logic is to dynamically grab the element directly under the root (the parent node of each record) so it can be treated as a row of records in the CSV file. I was able to get what I want by using Oxygen to convert the XML, but I am running into this error when using Nokogiri:
/Library/Ruby/Gems/2.3.0/gems/nokogiri-1.10.3/lib/nokogiri/xslt.rb:32:in `parse_stylesheet_doc': compilation error: file selectXMLelement.xsl line 5 element stylesheet (RuntimeError)
xsl:version: only 1.1 features are supported
compilation error: file selectXMLelement.xsl line 8 element value-of
xsl:value-of : could not compile select expression 'concat(':',/data:root/*/local-name())'
from /Library/Ruby/Gems/2.3.0/gems/nokogiri-1.10.3/lib/nokogiri/xslt.rb:32:in `parse'
from /Library/Ruby/Gems/2.3.0/gems/nokogiri-1.10.3/lib/nokogiri/xslt.rb:13:in `XSLT'
from EXTC-v1.rb:37:in `api_component'
from EXTC-v1.rb:43:in `block in <main>'
from EXTC-v1.rb:43:in `each'
from EXTC-v1.rb:43:in `<main>'
I would like to know if there is a way to use Nokogiri to get what I want instead of the XSLT, and how to feed the result into my Ruby script logic.
I have tried to use this XSLT:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:data="urn:com.sample/bsvc"
exclude-result-prefixes="data"
version="2.0">
<xsl:output method="text"/>
<xsl:template match="/">
<xsl:value-of select="concat(':',/data:root/*[1]/local-name())"/>
</xsl:template>
</xsl:stylesheet>
Example XML. With Oxygen, the XSLT successfully outputs what I want, ":Data_Request":
<data:root>
<data:Data_Request>
<data:name>John Doe</data:name>
<data:phone>123456776</data:phone>
</data:Data_Request>
</data:root>
My Ruby script:
def xslt_transform(filename)
#dir = File.join(Dir.pwd,'/input/')
xml_str = File.read(filename)
doc = Nokogiri::XML xml_str
template = Nokogiri::XSLT(File.open('Remove-CDATA.xsl'))
transformed_doc = template.transform(doc)
File.write(filename, transformed_doc)
end
Dir.glob('*.xml').each {|filename| xslt_transform(filename)}
# this is where I am trying to use the XSLT
def api_component(filename)
xml_str = File.read(filename)
doc = Nokogiri::XML xml_str
template = Nokogiri::XSLT(File.open('selectXMLelement.xsl'))
transformed_doc = template.transform(doc)
puts filename
end
api_name = Dir.glob('*xml').each {|filename| api_component(filename)}
puts api_name
def xml_to_csv(filename)
dir = File.join(Dir.pwd,'/input/')
xml_str = File.read(filename)
doc = Nokogiri::XML xml_str
csv_filename = filename.gsub('.xml','.csv')
record = {} # hashes
keys = Set.new
records = [] # array
csv = ""
# Returns a new hash created by traversing the hash and its subhashes,
# executing the given block on the key and value. The block should return a 2-element array of the form [key, value].
doc.traverse do |node|
value = node.text.gsub(/\n +/, '')
if node.name != "text" # skip these nodes: if class isnt text then skip
if value.length > 0 # skip empty nodes
key = node.name.gsub(/wd:/,'').to_sym
#api_component = doc.xpath('/*/*[1]')
# if a new and not empty record, add to our records collection
if key == :Data_Request && !record.empty? #for regular XML parsng, use the request data. For example :Location_Data
records << record
record = {}
elsif key[/^root$|^document$/]
# neglect these keys
else
key = node.name.gsub(/data:/,'').to_sym
# in case our value is html instead of text
record[key] = Nokogiri::HTML.parse(value).text
# add to our key set only if not already in the set
keys << key
end
end
end
end
# build our csv
dir = File.join(Dir.pwd,'/output/')
File.open('../output/'+csv_filename, 'wb') do |file|
file.puts %Q{"#{keys.to_a.join('","')}"}
records.each do |record|
keys.each do |key|
file.write %Q{"#{record[key]}",}
end
file.write "\n"
end
print ''
print filename+ " is ready!\n"
print ''
end
end
Dir.glob('*.xml').each { |filename| xml_to_csv(filename) }
As you can see, right now I have the node element hard-coded: if key == :Data_Request && !record.empty?
Is there a way to do this with Nokogiri, so that it dynamically detects the element in every XML file in the read path? If not, how can I achieve it with XSLT embedded in my script?
Side question: is there a way to make my script write all the data as text so that leading zeros are kept? :)

Related

how to use nokogiri to parse xml file for specific values?

I have an xml file from which I need to extract all values that contain https://www.example.com/a/b:
<xml>
<url><loc>https://www.example.com/a/b</loc></url>
<url><loc>https://www.example.com/b/c</loc></url>
<url><loc>https://www.example.com/a/b/c</loc></url>
<url><loc>https://www.example.com/c/d</loc></url>
</xml>
Given the above, this should return two results. I've opened the file and parsed it with Nokogiri, but I do not understand how to access the values of the //loc key.
require 'nokogiri'
require 'open-uri'
doc = File.open('./sitemap-en.xml') { |f| Nokogiri::XML(f) }
puts doc.xpath('//loc')
The above code prints the entire XML file, but I want it pared down so that I get everything under the /a/b subdirectories. How can I do this?
Both of the following solutions assume the following:
require 'nokogiri'
xml = <<-XML
<xml>
<url><loc>https://www.example.com/a/b</loc></url>
<url><loc>https://www.example.com/b/c</loc></url>
<url><loc>https://www.example.com/a/b/c</loc></url>
<url><loc>https://www.example.com/c/d</loc></url>
</xml>
XML
doc = Nokogiri::XML(xml)
To return a list of all loc elements, select only those whose inner text begins with https://www.example.com/a/b, and print the URL text:
elements = doc.xpath("//loc")
filtered_elements = elements.select do |element|
element.text.start_with? 'https://www.example.com/a/b'
end
filtered_elements.each do |element|
puts element.text
end
To capture a list of loc elements whose inner text contains the string https://www.example.com/a/b and print each URL:
elements = doc.xpath("//loc[contains(text(), 'https://www.example.com/a/b')]")
elements.each do |element|
puts element.text
end
To quickly print URLs using a slightly modified version of the previous XPath query:
puts doc.xpath("//loc[contains(text(), 'https://www.example.com/a/b')]/text()")

How to read multiple XML files then output to multiple CSV files with the same XML filenames

I am trying to parse multiple XML files then output them into CSV files to list out the proper rows and columns.
I was able to do so by processing one file at a time by defining the filename, and specifically output them into a defined output file name:
File.open('H:/output/xmloutput.csv','w')
I would like to write into multiple files and make their name the same as the XML filenames without hard coding it. I tried doing it multiple ways but have had no luck so far.
Sample XML:
<?xml version="1.0" encoding="UTF-8"?>
<record:root>
<record:Dataload_Request>
<record:name>Bob Chuck</record:name>
<record:Address_Data>
<record:Street_Address>123 Main St</record:Street_Address>
<record:Postal_Code>12345</record:Postal_Code>
</record:Address_Data>
<record:Age>45</record:Age>
</record:Dataload_Request>
</record:root>
Here is what I've tried:
require 'nokogiri'
require 'set'
files = ''
input_folder = "H:/input"
output_folder = "H:/output"
if input_folder[input_folder.length-1,1] == '/'
input_folder = input_folder[0,input_folder.length-1]
end
if output_folder[output_folder.length-1,1] != '/'
output_folder = output_folder + '/'
end
files = Dir[input_folder + '/*.xml'].sort_by{ |f| File.mtime(f)}
file = File.read(input_folder + '/' + files)
doc = Nokogiri::XML(file)
record = {} # hashes
keys = Set.new
records = [] # array
csv = ""
doc.traverse do |node|
value = node.text.gsub(/\n +/, '')
if node.name != "text" # skip these nodes: if class isnt text then skip
if value.length > 0 # skip empty nodes
key = node.name.gsub(/wd:/,'').to_sym
if key == :Dataload_Request && !record.empty?
records << record
record = {}
elsif key[/^root$|^document$/]
# neglect these keys
else
key = node.name.gsub(/wd:/,'').to_sym
# in case our value is html instead of text
record[key] = Nokogiri::HTML.parse(value).text
# add to our key set only if not already in the set
keys << key
end
end
end
end
# build our csv
File.open('H:/output/.*csv', 'w') do |file|
file.puts %Q{"#{keys.to_a.join('","')}"}
records.each do |record|
keys.each do |key|
file.write %Q{"#{record[key]}",}
end
file.write "\n"
end
print ''
print 'output files ready!'
print ''
end
I have been getting 'read memory': no implicit conversion of Array into String (TypeError) and other errors.
Here's a quick peer-review of your code, something like you'd get in a corporate environment...
Instead of writing:
input_folder = "H:/input"
input_folder[input_folder.length-1,1] == '/' # => false
Consider doing it using the -1 offset from the end of the string to access the character:
input_folder[-1] # => "t"
That simplifies your logic making it more readable because it's lacking unnecessary visual noise:
input_folder[-1] == '/' # => false
See [] and []= in the String documentation.
This looks like a bug to me:
files = Dir[input_folder + '/*.xml'].sort_by{ |f| File.mtime(f)}
file = File.read(input_folder + '/' + files)
files is an array of filenames. input_folder + '/' + files is appending an array to a string:
foo = ['1', '2'] # => ["1", "2"]
'/parent/' + foo # =>
# ~> -:9:in `+': no implicit conversion of Array into String (TypeError)
# ~> from -:9:in `<main>'
How you want to deal with that is left as an exercise for the programmer.
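One possible way to handle it, as a sketch rather than the only answer: iterate over the array and derive each output name from the input name with File.basename. The helper name csv_targets is made up for illustration.

```ruby
# Hypothetical helper: pairs every XML file in input_folder with a CSV path
# of the same base name in output_folder, oldest file first.
def csv_targets(input_folder, output_folder)
  xml_paths = Dir[File.join(input_folder, '*.xml')].sort_by { |f| File.mtime(f) }
  xml_paths.map do |xml_path|
    csv_name = File.basename(xml_path, '.xml') + '.csv'
    [xml_path, File.join(output_folder, csv_name)]
  end
end

# Usage with the folders from the question; each pair is one file to convert.
csv_targets('H:/input', 'H:/output').each do |xml_path, csv_path|
  puts "#{xml_path} -> #{csv_path}"
end
```

File.join collapses duplicate separators, so the trailing-slash checks on input_folder and output_folder are no longer needed.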
doc.traverse do |node|
is icky because it sidesteps the power of Nokogiri being able to search for a particular tag using accessors. Very rarely do we need to iterate over a document tag by tag, usually only when we're peeking at its structure and layout. traverse is slower so use it as a very last resort.
length is nice but isn't needed when checking whether a string has content:
value = 'foo'
value.length > 0 # => true
value > '' # => true
value = ''
value.length > 0 # => false
value > '' # => false
Programmers coming from Java like to use the accessors but I like being lazy, probably because of my C and Perl backgrounds.
Be careful with sub and gsub as they don't do what you're thinking they do. Both accept a regular expression, but will also take a string, which they escape before beginning their scan.
You're passing in a regular expression, which is OK in this case, but it could cause unexpected problems if you don't remember all the rules for pattern matching and that gsub scans until the end of the string:
foo = 'wd:barwd:' # => "wd:barwd:"
key = foo.gsub(/wd:/,'') # => "bar"
In general I recommend people think a couple times before using regular expressions. I've seen some gaping holes opened up in logic written by fairly advanced programmers because they didn't know what the engine was going to do. They're wonderfully powerful, but need to be used surgically, not as a universal solution.
The same thing happens with a string, because gsub doesn't know when to quit:
key = foo.gsub('wd:','') # => "bar"
So, if you're looking to change just the first instance use sub:
key = foo.sub('wd:','') # => "barwd:"
I'd do it a little differently though.
foo = 'wd:bar'
I can check to see what the first three characters are:
foo[0,3] # => "wd:"
Or I can replace them with something else using string indexing:
foo[0,3] = ''
foo # => "bar"
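If the prefix might not always be present, a guarded variant avoids chopping off three characters blindly. This is a small sketch assuming Ruby 2.5 or newer, where String#delete_prefix is available; strip_prefix is a made-up helper name.

```ruby
# Hypothetical helper: removes the prefix only when the string starts with it,
# and returns a new string instead of modifying the original.
def strip_prefix(name, prefix = 'wd:')
  name.delete_prefix(prefix)
end

puts strip_prefix('wd:barwd:')
puts strip_prefix('bar')
```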
There's more but I think that's enough for now.
You should use Ruby's CSV class. Also, you don't need to do any string matching or regex stuff. Use Nokogiri to target elements. If you know the node names in the XML will be consistent it should be pretty simple. I'm not exactly sure if this is the output you want, but this should get you in the right direction:
require 'nokogiri'
require 'csv'
def xml_to_csv(filename)
xml_str = File.read(filename)
xml_str.gsub!('record:','') # remove the record: namespace
doc = Nokogiri::XML xml_str
csv_filename = filename.gsub('.xml', '.csv')
CSV.open(csv_filename, 'wb' ) do |row|
row << ['name', 'street_address', 'postal_code', 'age']
row << [
doc.xpath('//name').text,
doc.xpath('//Street_Address').text,
doc.xpath('//Postal_Code').text,
doc.xpath('//Age').text,
]
end
end
# iterate over all xml files
Dir.glob('*.xml').each { |filename| xml_to_csv(filename) }

How to replace XML node contents using Nokogiri

I'm using Ruby to read an XML document and update a single node, if it exists, with a new value.
The tutorial at http://www.nokogiri.org/tutorials/modifying_an_html_xml_document.html does not make it obvious to me how to change the node data, let alone how to save it back to the file.
def ammend_parent_xml(folder, target_file, new_file)
# open parent XML file that contains file reference
get_xml_files = Dir.glob("#{@target_folder}/#{folder}/*.xml").sort.select {|f| !File.directory? f}
get_xml_files.each { |xml|
f = File.open(xml)
# Use Nokogiri to read the file into an XML object
doc = Nokogiri::XML(f)
filename = doc.xpath('//Route//To//Node//FileName')
filename.each_with_index {
|fl, i|
if target_file == fl.text
# we found the file, now rename it to new_file
# ???????
end
}
}
end
This is some example XML:
<?xml version="1.0" encoding="utf-8">
<my_id>123</my_id>
<Route>
<To>
<Node>
<Filename>file1.txt</Filename>
<Filename>file2.mp3</Filename>
<Filename>file3.doc</Filename>
<Filename>file4.php</Filename>
<Filename>file5.jpg</Filename>
</Node>
</To>
</Route>
</xml>
I want to change "file3.doc" to "file3_new.html".
I would call:
ammend_parent_xml("folder_location", "file3.doc", "file3_new.html")
To change an element in the XML:
@doc = Nokogiri::XML::DocumentFragment.parse <<-EOXML
<body>
<h1>OLD_CONTENT</h1>
<div>blah</div>
</body>
EOXML
h1 = @doc.at_xpath "body/h1"
h1.content = "NEW_CONTENT"
puts @doc.to_xml # h1 will be NEW_CONTENT
To save the XML:
file = File.new("xml_file.xml", "wb")
file.write(@doc)
file.close
There's a few things wrong with your sample XML.
There are two root elements my_id and Route
There is a missing ? in the first tag
Do you need the last line </xml>?
After fixing the sample I was able to get the element by using the example by Phrogz:
element = @doc.xpath("Route//To//Node//Filename[.='#{target_file}']").first
Note .first since it will return a NodeSet.
Then I would update the content with:
element.content = "foobar"
def amend_parent_xml(folder, target_file, new_file)
Dir["#{@target_folder}/#{folder}/*.xml"]
.sort.select{|f| !File.directory? f }
.each do |xml_file|
doc = Nokogiri.XML( File.read(xml_file) )
if file = doc.at("//Route//To//Node//Filename[.='#{target_file}']")
file.content = new_file # set the text of the node
File.open(xml_file,'w'){ |f| f<<doc }
break
end
end
end
Improvements:
Use File.read instead of File.open so that you don't leave a file handle open.
Uses an XPath expression to find the SINGLE matching node by looking for a node with the correct text value.
Alternatively you could find all the files and then if file=files.find{ |f| f.text==target_file }
Shows how to serialize a Nokogiri::XML::Document back to disk.
Breaks out of processing the files as soon as it finds a matching XML file.

Missing parts after parsing and processing a very large XML file in Ruby

I have to parse and modify a 22.2MB XML file (a wordpress export).
The problem is after parsing, the last part of the file is always missing, but I can't really figure out why.
I've tried using the saxerator gem, but it does not seem to solve my problem
Here I'm just trying to get all the <item> from the input file and display them in an output file:
class SaxImport
def initialize input_file, output_file
f = File.read(input_file, File.size(input_file))
xml_data = Saxerator.parser(f) do |config|
config.output_type = :xml
end
category_fr_list = {}
items = []
output = File.open output_file, "w"
xml_data.for_tag(:item).reverse_each do |item|
output << item.to_xml
end
output.close
end
end
import_en = SaxImport.new 'weekly.xml', 'weekly.processed.xml'

How do I traverse an inner node using SAX in Nokogiri?

I'm quite new to Nokogiri and Ruby and seeking a little help.
I am parsing a very large XML file using class MyDoc < Nokogiri::XML::SAX::Document. Now I want to traverse the inner part of a block.
Here's the format of my XML file:
<Content id="83087">
<Title></Title>
<PublisherEntity id="1067">eBooksLib</PublisherEntity>
<Publisher>eBooksLib</Publisher>
......
</Content>
I can already tell if the "Content" tag is found, now I want to know how to traverse inside of it. Here's my shortened code:
class MyDoc < Nokogiri::XML::SAX::Document
#check the start element. set flag for each element
def start_element name, attrs = []
if(name == 'Content')
#get the <Title>
#get the <PublisherEntity>
#get the Publisher
end
end
def cdata_block(string)
characters(string)
end
def characters(str)
puts str
end
end
Purists may disagree with me, but the way I've been doing it is to use Nokogiri to traverse the huge file, and then use XmlSimple to work with a smaller object in the file. Here's a snippet of my code:
require 'nokogiri'
require 'xmlsimple'
def isend(node)
return (node.node_type == Nokogiri::XML::Reader::TYPE_END_ELEMENT)
end
reader = Nokogiri::XML::Reader(File.open('database.xml', 'r'))
# traverse the file looking for tag "content"
reader.each do |node|
next if node.name != 'content' || isend(node)
# if we get here, then we found start of node 'content',
# so read it into an array and work with the array:
content = XmlSimple.xml_in(node.outer_xml())
title = content['title'][0]
# ...etc.
end
This works very well for me. Some may object to mixing SAX and non-SAX (nokogiri and XmlSimple) in the same code, but for my purposes, it gets the job done with minimal hassle.
It's trickier to do with SAX. I think the solution will need to look something like this:
class MyDoc < Nokogiri::XML::SAX::Document
def start_element name, attrs = []
@inside_content = true if name == 'Content'
@current_element = name
end
def end_element name
@inside_content = false if name == 'Content'
@current_element = nil
end
def characters str
puts "#{@current_element} - #{str}" if @inside_content && %w{Title PublisherEntity Publisher}.include?(@current_element)
end
end

Resources