How can I use Nokogiri with Ruby to replace values in existing XML?

I am using Ruby 1.9.3 with the latest Nokogiri gem. I have worked out how to extract values from an XML document using xpath and specifying the path(?) to the element. Here is the XML file I have:
<?xml version="1.0" encoding="utf-8"?>
<File xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<Houses>
<Ranch>
<Roof>Black</Roof>
<Street>Markham</Street>
<Number>34</Number>
</Ranch>
</Houses>
</File>
I use this code to print a value:
doc = Nokogiri::XML(File.open("C:\\myfile.xml"))
puts doc.xpath("//Ranch//Street")
Which outputs:
<Street>Markham</Street>
This is all working fine but what I need is to write/replace the value. I want to use the same kind of path-style lookup to pass in a value to replace the one that is there. So I want to pass a street name to this path and overwrite the street name that is there. I've been all over the internet but can only find ways to create a new XML or insert a completely new node in the file. Is there a way to replace values by line like this? Thanks.

You want the content= method:
Set the Node’s content to a Text node containing string. The string gets XML escaped, not interpreted as markup.
Note that xpath returns a NodeSet, not a single Node, so you need to use at_xpath or pick out the single node some other way:
require 'nokogiri'
doc = Nokogiri::XML(File.open("C:\\myfile.xml"))
node = doc.xpath("//Ranch//Street")[0] # use [0] to select the first result
node.content = "New value for this node"
puts doc # prints the XML document with the new value for the node

Related

Parsing a non-XML document with Nokogiri when the node names are/contain integers

When I run:
#!/usr/bin/env ruby
require 'nokogiri'
xml = <<-EOXML
<pajamas>
<bananas>
<foo>bar</foo>
<bar>bar</bar>
<1>bar</1>
</bananas>
</pajamas>
EOXML
doc = Nokogiri::XML(xml)
puts doc.at('/pajamas/bananas/foo')
puts doc.at('/pajamas/bananas/bar')
puts doc.at('/pajamas/bananas/1')
I get an ERROR: Invalid expression: /pajamas/bananas/1 (Nokogiri::XML::XPath::SyntaxError)
Is this a case of Nokogiri not liking ints as node names and/or is there a work around?
Looking at the documentation, I did not see a workaround to this. Removing the last line eliminates the error and prints the first two nodes as expected.
An XML element with a name that starts with a number is invalid XML.
XML elements must follow these naming rules:
Names can contain letters, numbers, and other characters
Names cannot start with a number or punctuation character
Names cannot start with the letters xml (or XML, or Xml, etc)
Names cannot contain spaces
Any name can be used, no words are reserved.
You're trying to parse invalid XML with an XML parser, and that's just not going to work. If you're really getting <1> as a tag and can't control that somehow, I'd suggest replacing the tags using a regex before the input gets to Nokogiri.
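If pre-processing really is the only option, the regex idea might look roughly like this. The helper name prefix_numeric_tags and the "n" prefix are made up for illustration, and the pattern assumes every literal < followed by a digit starts a tag (so no CDATA sections or comments containing something like <1).

```ruby
# Hypothetical helper: prepend a letter to element names that start with a digit
# so the result is well-formed XML. The "n" prefix is an arbitrary choice.
def prefix_numeric_tags(raw, prefix = 'n')
  # Group 1 is the optional "/" of a closing tag; group 2 is the digit-leading name.
  raw.gsub(%r{<(/?)(\d[^\s>/]*)}) { "<#{$1}#{prefix}#{$2}" }
end

raw = "<bananas><foo>bar</foo><1>bar</1></bananas>"
fixed = prefix_numeric_tags(raw)
puts fixed

# The fixed string can then be handed to Nokogiri::XML(fixed), after which
# '/bananas/n1' is a valid XPath expression.
```

Fixing whatever produces the <1> tags is still the better option if you have any say over it.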

Parsing an XML file using nokogiri to create \index fields for LaTeX

I'm a professional indexer new to Ruby and nokogiri and I am in need of some assistance.
I'm working on a set of macros that will allow me to take an XML file, output from my indexing software, and parse it into valid \index{} commands for inclusion in a LaTeX source file. Each XML <record> contains at least two <field> tags, so I will have to iterate over the multiple <field> tags to build my \index{} entry.
The following is an example of an index record from the xml file.
<record time="2022-08-27T17:25:12" id="30">
<field><text style="i"/><hide>SS </hide>Titanic<text/></field>
<field>passengers</field>
<field class="locator"><text style="b"/>5<text/></field>
</record>
I will produce intermediate output of this record in the form of:
\index{Titanic#\textit{SS Titanic}!passengers|textbf} 5
(The numeric locator is used to place the \index{} entry at the correct spot in the LaTeX file and won't be included in the LaTeX source file)
I am using nokogiri to manipulate the xml file and have been able to reach the point where I return a nodelist that contains just the <field> tags for each <record>, but I need to be able to retrieve all the text in the <field>, including the formatting information (if I use the text method on a <field>, it returns "SS Titanic" for example, with all formatting information stripped away).
I'm stuck on how to access the entire text string in the <field> tag. Once I can get that, I have a good idea of how to structure my parser.
Any help will be greatly appreciated.
Does this help?
require 'nokogiri'
# A heredoc avoids having to escape the double quotes inside the XML.
xml = <<-XML
<record time="2022-08-27T17:25:12" id="30">
<field><text style="i"/><hide>SS </hide>Titanic<text/></field>
<field>passengers</field>
<field class="locator"><text style="b"/>5<text/></field>
</record>
XML
fields = Nokogiri::XML(xml).xpath(".//field")
puts fields.first.text # prints SS Titanic
p fields.map(&:text) #=> ["SS Titanic", "passengers", "5"]

How to add a new node without prefix

I'm working with a SOAP API that requires some XML nodes without prefixes. Is it even possible to do with Nokogiri? Simply omitting the prefix from the node name makes Nokogiri use the default prefix "env".
node = Nokogiri::XML::Node.new('WageReportsToIR', envelope)
envelope.xpath('//env:Body').first.add_child(node)
This results in:
<env:Body>
<env:WageReportsToIR/>
</env:Body>
Do I have any other option but to write a regex to remove the prefixes after I'm done editing the XML with Nokogiri?

Substituting text in a file with Ruby

I need to read in a file which will be in xml format but all crammed into a single line, and I need to parse that line to find a specific property and replace its value with something I have specified.
The file might contain:
<?xml version="1.0" encoding="UTF-8" standalone="no"?><VerificationPoint type="Screenshot" version="2"><Description/><Verification object=":qP1B11_QLabel" type="PNG">
I need to search through this line, find the property "Verification object=" and replace the :qP1B11 with my own string. Please note that I don't want to replace the _QLabel" type="PNG"> part of the string if possible.
I can't use sub as I don't know the value of the property, which could be anything. I believe I should be able to do this with regular expressions, but I have never had to use them before and all the examples I've seen just make me more confused than I was before.
If anyone can present me with an elegant answer (and an explanation if using regexp) it would be a huge help!
Thanks
You have XML so use an XML parser. Nokogiri will make short work of that:
require 'nokogiri'
doc = Nokogiri::XML(that_string)
doc.search('Verification').each do |node|
  node['object'] = node['object'].sub(/:qP1B11/, 'PANCAKES')
end
new_string = doc.to_xml
# "<?xml version="1.0" encoding="UTF-8" standalone="no"?>\n<VerificationPoint type="Screenshot" version="2">\n <Description/>\n <Verification object="PANCAKES_QLabel" type="PNG">\n</Verification>\n</VerificationPoint>\n"
You can adjust the output format using the options for to_xml.
If you only have one <Verification> then you could do it like this:
node = doc.at('Verification')
node['object'] = node['object'].sub(/:qP1B11/, 'PANCAKES')
new_string = doc.to_xml
In either case you'd adjust your regex and replacement to suit your needs.
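Because the current value isn't known in advance, the pattern probably shouldn't mention :qP1B11 at all. Here is one possible sketch, assuming the part to keep always starts at the first underscore (that assumption comes from the single sample line, not from any spec). The helper name swap_object_prefix is invented for the example.

```ruby
# Hypothetical helper: replace everything before the first "_" in an
# attribute value, whatever it currently is, and keep the rest.
def swap_object_prefix(value, replacement)
  # \A anchors at the start; [^_]* matches up to (not including) the first "_".
  # The block form avoids backslash interpretation in the replacement string.
  value.sub(/\A[^_]*/) { replacement }
end

puts swap_object_prefix(':qP1B11_QLabel', 'PANCAKES')
puts swap_object_prefix(':somethingElse_QLabel', 'PANCAKES')

# Wired into the Nokogiri code above, this would be:
# node['object'] = swap_object_prefix(node['object'], 'PANCAKES')
```

If a value contains no underscore, the whole value is replaced, so check for that case if it can occur in your files.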

How can I make empty tags self-closing with Nokogiri?

I've created an XML template in ERB. I fill it in with data from a database during an export process.
In some cases, there is a null value, in which case an element may be empty, like this:
<someitem>
</someitem>
In that case, the client receiving the export wants it to be converted into a self-closing tag:
<someitem/>
I'm trying to see how to get Nokogiri to do this, but I don't see it yet. Does anybody know how to make empty XML tags self-closing with Nokogiri?
Update
A regex was sufficient to do what I specified above, but the client now also wants tags whose children are all empty to be self-closing. So this:
<someitem>
<subitem>
</subitem>
<subitem>
</subitem>
</someitem>
... should also be
<someitem/>
I think that this will require using Nokogiri.
Search for
<([^>]+)>\s*</\1>
and replace with
<\1/>
In Ruby:
result = subject.gsub(/<([^>]+)>\s*<\/\1>/, '<\1/>')
Explanation:
< # Match opening bracket
( # Match and remember...
[^>]+ # One or more characters except >
) # End of capturing group
> # Match closing bracket
\s* # Match optional whitespace & newlines
< # Match opening bracket
/ # Match /
\1 # Match the contents of the opening tag
> # Match closing bracket
A couple questions:
<foo></foo> is the same as <foo />, so why worry about such a tiny detail? If it is syntactically significant because the text node between the two is a "\n", then put a test in your ERB template that checks for the value that would go there and, if it's not initialized, outputs the self-closing tag instead. See "Yak shaving".
Why involve Nokogiri? You should be able to generate correct XML in ERB since you're in control of the template.
EDIT - Nokogiri's behavior is to not rewrite parsed XML unless it has to. I suspect you'd have to remove the node in question, then reinsert it as an empty node, to get Nokogiri to output what you want.
