How to replace XML node contents using Nokogiri - ruby

I'm using Ruby to read an XML document and update a single node, if it exists, with a new value.
From http://www.nokogiri.org/tutorials/modifying_an_html_xml_document.html it is not obvious to me how to change the node data, let alone how to save it back to the file.
def ammend_parent_xml(folder, target_file, new_file)
# open parent XML file that contains file reference
get_xml_files = Dir.glob("#{@target_folder}/#{folder}/*.xml").sort.select {|f| !File.directory? f}
get_xml_files.each { |xml|
f = File.open(xml)
# Use Nokogiri to read the file into an XML object
doc = Nokogiri::XML(f)
filename = doc.xpath('//Route//To//Node//FileName')
filename.each_with_index {
|fl, i|
if target_file == fl.text
# we found the file, now rename it to new_file
# ???????
end
}
}
end
This is some example XML:
<?xml version="1.0" encoding="utf-8">
<my_id>123</my_id>
<Route>
<To>
<Node>
<Filename>file1.txt</Filename>
<Filename>file2.mp3</Filename>
<Filename>file3.doc</Filename>
<Filename>file4.php</Filename>
<Filename>file5.jpg</Filename>
</Node>
</To>
</Route>
</xml>
I want to change "file3.doc" to "file3_new.html".
I would call:
ammend_parent_xml("folder_location", "file3.doc", "file3_new.html")

To change an element in the XML:
@doc = Nokogiri::XML::DocumentFragment.parse <<-EOXML
<body>
<h1>OLD_CONTENT</h1>
<div>blah</div>
</body>
EOXML
h1 = @doc.at_xpath "body/h1"
h1.content = "NEW_CONTENT"
puts @doc.to_xml # h1 will be NEW_CONTENT
To save the XML:
file = File.new("xml_file.xml", "wb")
file.write(@doc)
file.close
There are a few things wrong with your sample XML:
There are two root elements, my_id and Route.
There is a missing ? in the first tag.
Do you need the last line, </xml>?
After fixing the sample I was able to get the element by using the example by Phrogz:
element = @doc.xpath("Route//To//Node//Filename[.='#{target_file}']").first
Note the .first, since xpath returns a NodeSet.
Then I would update the content with:
element.content = "foobar"

def amend_parent_xml(folder, target_file, new_file)
Dir["#{@target_folder}/#{folder}/*.xml"]
.sort.select{|f| !File.directory? f }
.each do |xml_file|
doc = Nokogiri.XML( File.read(xml_file) )
if file = doc.at("//Route//To//Node//Filename[.='#{target_file}']")
file.content = new_file # set the text of the node
File.open(xml_file,'w'){ |f| f<<doc }
break
end
end
end
Improvements:
Uses File.read instead of File.open so that no file handle is left open.
Uses an XPath expression to find the SINGLE matching node by looking for a node with the correct text value.
Alternatively you could find all the files and then if file=files.find{ |f| f.text==target_file }
Shows how to serialize a Nokogiri::XML::Document back to disk.
Breaks out of processing the files as soon as it finds a matching XML file.

Related

how to use nokogiri to parse xml file for specific values?

I have an xml file from which I need to extract all values that contain https://www.example.com/a/b:
<xml>
<url><loc>https://www.example.com/a/b</loc></url>
<url><loc>https://www.example.com/b/c</loc></url>
<url><loc>https://www.example.com/a/b/c</loc></url>
<url><loc>https://www.example.com/c/d</loc></url>
</xml>
Given the above, this should return two results. I've opened the file and parsed it with Nokogiri, but I do not understand how to access the values of the //loc key.
require 'nokogiri'
require 'open-uri'
doc = File.open('./sitemap-en.xml') { |f| Nokogiri::XML(f) }
puts doc.xpath('//loc')
The above code prints every loc element in the file, but I want it pared down so that I get only the entries under the /a/b subdirectory. How can I do this?
Both of the following solutions assume the following:
require 'nokogiri'
xml = <<-XML
<xml>
<url><loc>https://www.example.com/a/b</loc></url>
<url><loc>https://www.example.com/b/c</loc></url>
<url><loc>https://www.example.com/a/b/c</loc></url>
<url><loc>https://www.example.com/c/d</loc></url>
</xml>
XML
doc = Nokogiri::XML(xml)
To return a list of all loc elements, select only those whose inner text begins with https://www.example.com/a/b, and print the URL text:
elements = doc.xpath("//loc")
filtered_elements = elements.select do |element|
element.text.start_with? 'https://www.example.com/a/b'
end
filtered_elements.each do |element|
puts element.text
end
To capture a list of loc elements whose inner text contains the string https://www.example.com/a/b and print each URL:
elements = doc.xpath("//loc[contains(text(), 'https://www.example.com/a/b')]")
elements.each do |element|
puts element.text
end
To quickly print the URLs using a slightly modified version of the previous XPath query:
puts doc.xpath("//loc[contains(text(), 'https://www.example.com/a/b')]/text()")

XSLT, RUBY, how to output the next element name from root?

I am working on a Ruby script that uses XSLT to convert XML to CSV. Part of its logic is to dynamically grab the first element below the root so that it can be treated as a row of records in the CSV file. I was able to get what I want by using Oxygen to run the transformation, but I get this error when using Nokogiri:
/Library/Ruby/Gems/2.3.0/gems/nokogiri-1.10.3/lib/nokogiri/xslt.rb:32:in `parse_stylesheet_doc': compilation error: file selectXMLelement.xsl line 5 element stylesheet (RuntimeError)
xsl:version: only 1.1 features are supported
compilation error: file selectXMLelement.xsl line 8 element value-of
xsl:value-of : could not compile select expression 'concat(':',/data:root/*/local-name())'
from /Library/Ruby/Gems/2.3.0/gems/nokogiri-1.10.3/lib/nokogiri/xslt.rb:32:in `parse'
from /Library/Ruby/Gems/2.3.0/gems/nokogiri-1.10.3/lib/nokogiri/xslt.rb:13:in `XSLT'
from EXTC-v1.rb:37:in `api_component'
from EXTC-v1.rb:43:in `block in <main>'
from EXTC-v1.rb:43:in `each'
from EXTC-v1.rb:43:in `<main>'
I would like to know if there is a way to get what I want with Nokogiri instead of XSLT, and how to feed the result into my Ruby script.
I have tried to use this XSLT:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:data="urn:com.sample/bsvc"
exclude-result-prefixes="data"
version="2.0">
<xsl:output method="text"/>
<xsl:template match="/">
<xsl:value-of select="concat(':',/data:root/*[1]/local-name())"/>
</xsl:template>
</xsl:stylesheet>
Example XML. With Oxygen, the XSLT successfully outputs what I want, ":Data_Request":
<data:root>
<data:Data_Request>
<data:name>John Doe</data:name>
<data:phone>123456776</data:phone>
</data:Data_Request>
</data:root>
My Ruby script:
def xslt_transform(filename)
#dir = File.join(Dir.pwd,'/input/')
xml_str = File.read(filename)
doc = Nokogiri::XML xml_str
template = Nokogiri::XSLT(File.open('Remove-CDATA.xsl'))
transformed_doc = template.transform(doc)
File.write(filename, transformed_doc)
end
Dir.glob('*.xml').each {|filename| xslt_transform(filename)}
# this is where I am trying to use the XSLT
def api_component(filename)
xml_str = File.read(filename)
doc = Nokogiri::XML xml_str
template = Nokogiri::XSLT(File.open('selectXMLelement.xsl'))
transformed_doc = template.transform(doc)
puts filename
end
api_name = Dir.glob('*xml').each {|filename| api_component(filename)}
puts api_name
def xml_to_csv(filename)
dir = File.join(Dir.pwd,'/input/')
xml_str = File.read(filename)
doc = Nokogiri::XML xml_str
csv_filename = filename.gsub('.xml','.csv')
record = {} # hashes
keys = Set.new
records = [] # array
csv = ""
# Returns a new hash created by traversing the hash and its subhashes,
# executing the given block on the key and value. The block should return a 2-element array of the form [key, value].
doc.traverse do |node|
value = node.text.gsub(/\n +/, '')
if node.name != "text" # skip these nodes: if class isnt text then skip
if value.length > 0 # skip empty nodes
key = node.name.gsub(/wd:/,'').to_sym
#api_component = doc.xpath('/*/*[1]')
# if a new and not empty record, add to our records collection
if key == :Data_Request && !record.empty? # for regular XML parsing, use the request element, for example :Location_Data
records << record
record = {}
elsif key[/^root$|^document$/]
# neglect these keys
else
key = node.name.gsub(/data:/,'').to_sym
# in case our value is html instead of text
record[key] = Nokogiri::HTML.parse(value).text
# add to our key set only if not already in the set
keys << key
end
end
end
end
# build our csv
dir = File.join(Dir.pwd,'/output/')
File.open('../output/'+csv_filename, 'wb') do |file|
file.puts %Q{"#{keys.to_a.join('","')}"}
records.each do |record|
keys.each do |key|
file.write %Q{"#{record[key]}",}
end
file.write "\n"
end
print ''
print filename+ " is ready!\n"
print ''
end
end
Dir.glob('*.xml').each { |filename| xml_to_csv(filename) }
As you can see, right now the node element is hard-coded: if key == :Data_Request && !record.empty?
Is there a way to do this with Nokogiri, so that it dynamically detects the element in every XML file in the read path? If not, how can I achieve it with XSLT embedded in my script?
Side question: is there a way to make my script write all the data as text so that it keeps the leading zeros? :)

Find and replace content in all tags

I want to find and replace all nodes in XML files. I tried this:
def modify_xml_content(request_body, node, content)
doc = Nokogiri::XML(request_body)
node = doc.search(node).first
node.content = content
puts "Modifying #{node}"
doc.to_xml
rescue
request_body
end
Example XML
<billing_address>
<first_name>Max</first_name>
<last_name>Mustermann</last_name>
<address1>Muster Str. 12</address1>
<zip_code>10178</zip_code>
<city>New York</city>
<state>WA</state>
<country>US</country>
</billing_address>
<shipping_address>
<first_name>Max</first_name>
<last_name>Mustermann</last_name>
<address1>Muster Str. 12</address1>
<zip_code>10178</zip_code>
<city>New York</city>
<state>WA</state>
<country>US</country>
</shipping_address>
How can I find and replace the content of all matching tags, not only the first one found?
Use each instead of first:
doc.search(node).each do |n|
n.content = content
end

Missing parts after parsing and processing a very large XML file in Ruby

I have to parse and modify a 22.2 MB XML file (a WordPress export).
The problem is that after parsing, the last part of the file is always missing, and I can't figure out why.
I've tried using the saxerator gem, but it does not seem to solve my problem.
Here I'm just trying to get all the <item> elements from the input file and write them to an output file:
class SaxImport
def initialize input_file, output_file
f = File.read(input_file, File.size(input_file))
xml_data = Saxerator.parser(f) do |config|
config.output_type = :xml
end
category_fr_list = {}
items = []
output = File.open output_file, "w"
xml_data.for_tag(:item).reverse_each do |item|
output << item.to_xml
end
output.close
end
end
import_en = SaxImport.new 'weekly.xml', 'weekly.processed.xml'

Nokogiri Builder: Replace RegEx match with XML

While using Nokogiri::XML::Builder I need to be able to generate a node that also replaces a regex match on the text with some other XML.
Currently I'm able to add additional XML inside the node. Here's an example:
def xml
Nokogiri::XML::Builder.new do |xml|
xml.chapter {
xml.para {
xml.parent.add_child("Testing[1] footnote paragraph.")
add_footnotes(xml, 'An Entry')
}
}
end.to_xml
end
# further child nodes WILL be added to footnote
def add_footnotes(xml, text)
xml.footnote text
end
which produces:
<chapter>
<para>Testing[1] footnote paragraph.<footnote>An Entry</footnote></para>
</chapter>
But I need to be able to run a regex replace on the reference [1], replacing it with the <footnote> XML, producing output like the following:
<chapter>
<para>Testing<footnote>An Entry</footnote> footnote paragraph.</para>
</chapter>
I'm making the assumption here that the add_footnotes method would receive the reference match (e.g. as $1), which would be used to pull the appropriate footnote from a collection.
That method would also be adding additional child nodes, such as the following:
<footnote>
<para>Words.</para>
<para>More words.</para>
</footnote>
Can anyone help?
Here's a spin on your code that shows how to generate the output. You'll need to refit it to your own code....
require 'nokogiri'
FOOTNOTES = {
'1' => 'An Entry'
}
child_text = "Testing[1] footnote paragraph."
pre_footnote, footnote_id, post_footnote = /^(.+)\[(\d+)\](.+)/.match(child_text).captures
doc = Nokogiri::XML::Builder.new do |xml|
xml.chapter {
xml.para {
xml.text(pre_footnote)
xml.footnote FOOTNOTES[footnote_id]
xml.text(post_footnote)
}
}
end
puts doc.to_xml
Which outputs:
<?xml version="1.0"?>
<chapter>
<para>Testing<footnote>An Entry</footnote> footnote paragraph.</para>
</chapter>
The trick is you have to grab the text preceding and following your target so you can insert those as text nodes. Then you can figure out what needs to be added. For clarity in your code you should preprocess all the text, get your variables figured out, then fall into the XML generator. Don't try to do any calculations inside the Builder block, instead just reference variables. Think of Builder like a view in an MVC-type application if that helps.
FOOTNOTES could actually be a database lookup, a hash or some other data container.
You should also look at the << method, which lets you inject XML source, so you could pre-build the footnote XML, then loop over an array containing the various footnotes and inject them. Often it's easier to pre-process, then use gsub to treat things like [1] as placeholders. See "gsub(pattern, hash) → new_str" in the documentation, along with this example:
'hello'.gsub(/[eo]/, 'e' => 3, 'o' => '*') #=> "h3ll*"
For instance:
require 'nokogiri'
text = 'this is[1] text and[2] text'
footnotes = {
'[1]' => 'some',
'[2]' => 'more'
}
footnotes.keys.each do |k|
v = footnotes[k]
footnotes[k] = "<footnote>#{ v }</footnote>"
end
replacement_xml = text.gsub(/\[\d+\]/, footnotes) # => "this is<footnote>some</footnote> text and<footnote>more</footnote> text"
doc = Nokogiri::XML::Builder.new do |xml|
xml.chapter {
xml.para { xml.<<(replacement_xml) }
}
end
puts doc.to_xml
# >> <?xml version="1.0"?>
# >> <chapter>
# >> <para>this is<footnote>some</footnote> text and<footnote>more</footnote> text</para>
# >> </chapter>
You could try the following:
require 'nokogiri'
def xml
Nokogiri::XML::Builder.new do |xml|
xml.chapter {
xml.para {
xml.parent.add_child("Testing[1] footnote paragraph.")
add_footnotes(xml, 'add text',"[1]")
}
}
end.to_xml
end
def add_footnotes(xml, text,ref)
string = xml.parent.child.content
xml.parent.child.content = ""
string.partition(ref).each do |txt|
next xml.text(txt) if txt != ref
xml.footnote text
end
end
puts xml
# >> <?xml version="1.0"?>
# >> <chapter>
# >> <para>Testing<footnote>add text</footnote> footnote paragraph.</para>
# >> </chapter>
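The add_footnotes method above relies on String#partition, which is plain Ruby and can be checked in isolation. A short sketch using the sentence from the example:

```ruby
string = 'Testing[1] footnote paragraph.'
ref = '[1]'

# String#partition returns [text before, the separator, text after].
before, marker, after = string.partition(ref)
p before  # => "Testing"
p marker  # => "[1]"
p after   # => " footnote paragraph."

# When the separator is absent, the whole string comes back first
# and the other two parts are empty strings.
missing = 'no marker here'.partition(ref)
p missing  # => ["no marker here", "", ""]
```

partition only splits at the first occurrence, so a paragraph with the same reference twice, or with several different references, would need a loop or a different splitting strategy.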

Resources