How can I make empty tags self-closing with Nokogiri? - ruby

I've created an XML template in ERB. I fill it in with data from a database during an export process.
In some cases, there is a null value, in which case an element may be empty, like this:
<someitem>
</someitem>
In that case, the client receiving the export wants it to be converted into a self-closing tag:
<someitem/>
I'm trying to see how to get Nokogiri to do this, but I don't see it yet. Does anybody know how to make empty XML tags self-closing with Nokogiri?
Update
A regex was sufficient to do what I specified above, but the client now also wants tags whose children are all empty to be self-closing. So this:
<someitem>
<subitem>
</subitem>
<subitem>
</subitem>
</someitem>
... should also be
<someitem/>
I think that this will require using Nokogiri.

Search for
<([^>]+)>\s*</\1>
and replace with
<\1/>
In Ruby:
result = subject.gsub(/<([^>]+)>\s*<\/\1>/, '<\1/>')
Explanation:
< # Match opening bracket
( # Match and remember...
[^>]+ # One or more characters except >
) # End of capturing group
> # Match closing bracket
\s* # Match optional whitespace & newlines
< # Match opening bracket
/ # Match /
\1 # Match the contents of the opening tag
> # Match closing bracket
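As a quick check, the substitution above can be wrapped in a small method and run against the sample from the question. This is only a sketch: the constant `EMPTY_ELEMENT`, the method name `self_close_empty`, and the sample data are made up for illustration. Note that the pattern only handles tags without attributes (the closing tag has to repeat the opening tag's contents exactly), and a single pass does not collapse a parent whose children are all empty.

```ruby
# Matches an opening tag, optional whitespace, and the matching closing tag.
EMPTY_ELEMENT = /<([^>]+)>\s*<\/\1>/

# Hypothetical helper: rewrites whitespace-only elements as self-closing tags.
def self_close_empty(xml)
  xml.gsub(EMPTY_ELEMENT, '<\1/>')
end

sample = "<export>\n<someitem>\n</someitem>\n<name>Ada</name>\n</export>"
puts self_close_empty(sample)
```

With this input, only `someitem` is rewritten; `name` keeps its text and `export` keeps its children.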

A couple of questions:
<foo></foo> is the same as <foo />, so why worry about such a tiny detail? If it is syntactically significant because the text node between the two is a "\n", then put a test in your ERB template that checks for the value that would go there and, if it is not initialized, outputs the self-closing tag instead. See "Yak shaving".
Why involve Nokogiri? You should be able to generate correct XML in ERB since you're in control of the template.
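One way to act on that suggestion is to decide between the two forms inside the template itself. The following is a minimal sketch, not the asker's actual template: the helper name `xml_element` and the `item` local are assumptions made for illustration.

```ruby
require 'erb'

# Hypothetical helper: emits a self-closing tag when the value is nil or blank,
# otherwise an element with the escaped value as its text.
def xml_element(name, value)
  text = value.to_s
  return "<#{name}/>" if text.strip.empty?

  "<#{name}>#{ERB::Util.html_escape(text)}</#{name}>"
end

template = ERB.new("<export>\n<%= xml_element('someitem', item) %>\n</export>\n")

puts template.result_with_hash(item: nil)    # someitem is rendered self-closing
puts template.result_with_hash(item: 'bar')  # someitem is rendered with its text
```

Keeping the decision in a helper means the template never emits the empty open/close pair in the first place, so no post-processing step is needed for the simple case.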
EDIT - Nokogiri's behavior is to not rewrite parsed XML unless it has to. I suspect you'd have to remove the node in question, then reinsert it as an empty node to get Nokogiri to output what you want.
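The idea of emptying a node so that it serializes as a self-closing tag can be sketched on a parsed tree, which also covers the "all children are empty" case from the update. This sketch uses REXML, which ships with Ruby, instead of Nokogiri so that it runs without extra gems; that swap is an assumption of the example, and the method names `blank_element?` and `collapse_blank!` are made up. The same traversal should be expressible with Nokogiri's node API, but that has not been verified here.

```ruby
require 'rexml/document'

# An element is "blank" when it holds no non-whitespace text and
# every child element is itself blank.
def blank_element?(element)
  element.texts.all? { |text| text.value.strip.empty? } &&
    element.elements.all? { |child| blank_element?(child) }
end

# Removes all children of blank elements so they serialize as <name/>;
# non-blank elements are left alone and their children are visited.
def collapse_blank!(element)
  if blank_element?(element)
    element.children.each(&:remove)
  else
    element.elements.each { |child| collapse_blank!(child) }
  end
  element
end

xml = "<export><someitem>\n<subitem>\n</subitem>\n<subitem>\n</subitem>\n</someitem><name>Ada</name></export>"
doc = REXML::Document.new(xml)
collapse_blank!(doc.root)
puts doc.root
```

Because the check is recursive, `someitem` is emptied as a whole rather than leaving self-closing `subitem` children behind.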

Related

Parsing a non-XML document with Nokogiri when the node names are/contain integers

When I run:
#!/usr/bin/env ruby
require 'nokogiri'
xml = <<-EOXML
<pajamas>
<bananas>
<foo>bar</foo>
<bar>bar</bar>
<1>bar</1>
</bananas>
</pajamas>
EOXML
doc = Nokogiri::XML(xml)
puts doc.at('/pajamas/bananas/foo')
puts doc.at('/pajamas/bananas/bar')
puts doc.at('/pajamas/bananas/1')
I get an ERROR: Invalid expression: /pajamas/bananas/1 (Nokogiri::XML::XPath::SyntaxError)
Is this a case of Nokogiri not liking integers as node names, and is there a workaround?
Looking at the documentation, I did not see a workaround to this. Removing the last line eliminates the error and prints the first two nodes as expected.
An XML element with a name that starts with a number is invalid XML.
XML elements must follow these naming rules:
Names can contain letters, numbers, and other characters
Names cannot start with a number or punctuation character
Names cannot start with the letters xml (or XML, or Xml, etc)
Names cannot contain spaces
Any name can be used, no words are reserved.
You're trying to parse invalid XML with an XML parser; it's just not going to work. If you're really getting <1> as a tag and can't control that somehow, I'd suggest replacing the tags using a regex before the document gets to Nokogiri.
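The regex pre-processing step could look roughly like this. It is a sketch under assumptions: the helper name `prefix_numeric_tags` and the default prefix "n" are arbitrary choices, and the pattern would also rewrite a "<" followed by a digit inside comments or CDATA sections, should the input contain any.

```ruby
# Hypothetical helper: prefixes tag names that start with a digit so the
# document becomes well-formed. The default prefix "n" is an arbitrary choice.
def prefix_numeric_tags(xml, prefix = 'n')
  xml.gsub(%r{<(/?)(\d)}) { "<#{$1}#{prefix}#{$2}" }
end

puts prefix_numeric_tags("<bananas><foo>bar</foo><1>bar</1></bananas>")
```

After this step the element from the question would be addressed as /pajamas/bananas/n1 rather than /pajamas/bananas/1.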

ruby regex - how to match everything up until specific html tag

I want to extract just "Description" from a string that looks like this:
Something<p class=text>Description</p>Something
I've tried p class=text>[^<\/p]* and p class=text>[^<]\/p*, but neither works. How can I achieve that?
You don't need to match the entire class="text":
<p.*?>(.*?)<\/p>
The match group is the tag content. It's simply a lazy quantifier so it doesn't capture the next < (and what follows after that).
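In Ruby, the capture group can be pulled out directly with String#[] and a group index. A small sketch, with the method name `first_paragraph_text` chosen only for illustration:

```ruby
# Returns the content of the first <p ...> element, or nil when none is found.
def first_paragraph_text(html)
  html[/<p.*?>(.*?)<\/p>/, 1]
end

puts first_paragraph_text('Something<p class=text>Description</p>Something')
```

Keep in mind that "." does not match newlines unless the /m flag is added, so this only works as written when the paragraph sits on one line.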

How to match br tag in XPath text() function

I have the following element.
driver = Selenium::WebDriver.for :phantomjs
driver.xpath("/html/body/form/table/tbody/tr[14]/td/table/tbody/tr/td/table/tbody/tr[1]/td[1]/font").text
=> "unique\ntext"
But I don't want to rely on unstable table layout, so I decided to use text() function in xpath like:
driver.xpath("//font[text()='unique\ntext']")
=> nil
But as you see, I couldn't find the element by the text() function. The original text is unique<br>text.
How can I match the <br> tag by using XPath?
There is no id or name attributes that I can use.
The text() test selects any text nodes. In this example there are two such nodes, before and after the <br>. It is not the same as the text method or the string value of the parent node.
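The difference between the individual text nodes and the element's string value can be seen on a parsed tree. This sketch uses REXML from Ruby's standard distribution rather than Selenium, and it assumes the markup has a newline right after the <br/>; both are assumptions of the example, as is the helper name `direct_text_nodes`.

```ruby
require 'rexml/document'

# Hypothetical helper: returns the values of the text nodes that are direct
# children of the root element, which is what text() would select there.
def direct_text_nodes(xml)
  REXML::Document.new(xml).root.texts.map(&:value)
end

# Assumed markup: a <br/> followed by a newline inside the <font> element.
nodes = direct_text_nodes("<font>unique<br/>\ntext</font>")
string_value = nodes.join  # equals the string value here, since all text sits directly under <font>

p nodes
p string_value
p nodes.include?(string_value)
```

Neither text node on its own equals the concatenated value, which is why comparing text() against the full string finds nothing.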
One way of selecting what you want could be like this:
driver.xpath("//font[ . ='unique\ntext']")
You might need to add extra newlines before or after the text. Note that this relies on Ruby converting \n into an actual newline character before passing the query to the XPath processor, so you need to be careful about getting your quotes right. This compares the string-value of the node, which for an element is the concatenation of all the descendant text nodes, which is what you want.
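The quoting point is easy to check in plain Ruby, without any XPath processor involved. The constant names here are only illustrative:

```ruby
# Double quotes: Ruby turns \n into a real newline before XPath ever sees it.
DOUBLE_QUOTED = "//font[.='unique\ntext']"
# Single quotes: the query contains a literal backslash followed by "n".
SINGLE_QUOTED = '//font[.="unique\ntext"]'

p DOUBLE_QUOTED.include?("\n")  # newline character present?
p SINGLE_QUOTED.include?("\n")  # newline character present?
p SINGLE_QUOTED.include?('\n')  # backslash + n present?
```

Only the double-quoted form carries an actual newline into the query, so the outer Ruby string should be double-quoted and the XPath literal inside it single-quoted.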
A better solution might be to use the normalize-space() function here (as long as the unique aspect of the text doesn’t depend on the newlines).
Try:
driver.xpath("//font[normalize-space()='unique text']")
Note that all leading and trailing whitespace in the target text has been removed, and any internal whitespace is changed to a single space character.

Why are there blank nodes/attributes when using LibXML Ruby?

Using the Gem libxml-ruby, when we parse XML like so:
document = LibXML::XML::Parser.string( xmlData ).parse
for n in document.root.children
# Do something
end
What we actually get is something like this:
root
-node empty
-node with data
-node empty
Same thing with attributes, there's a blank one padding between those we actually care about. What we end up needing to use is :options => LibXML::XML::Parser::Options::NOBLANKS
Why? :(
(Not necessarily an answer, but need formatting.)
What does the XML look like?
This XML:
<baz>
<plugh>ohai</plugh>
</baz>
may contain whitespace text nodes for the CR/LF and indentation between the <baz> and <plugh> opening tags, and the same for between the closing tags. This may or may not be significant whitespace depending on the nature of the XML. Structurally, it's different than:
<baz><plugh>ohai</plugh></baz>
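The structural difference shows up as soon as the children of the root are listed. The sketch below uses REXML (bundled with Ruby) as a stand-in for libxml-ruby, which is an assumption of the example; the helper name `child_summary` and its `skip_blank` flag are made up, and the manual filtering only imitates the effect the NOBLANKS option has in the question.

```ruby
require 'rexml/document'

# Hypothetical helper: labels each child of the root element, optionally
# skipping text nodes that contain only whitespace.
def child_summary(xml, skip_blank: false)
  children = REXML::Document.new(xml).root.children
  if skip_blank
    children = children.reject { |node| node.is_a?(REXML::Text) && node.value.strip.empty? }
  end
  children.map do |node|
    case node
    when REXML::Text then 'text'
    when REXML::Element then node.name
    else node.class.name
    end
  end
end

p child_summary("<baz>\n<plugh>ohai</plugh>\n</baz>")
p child_summary("<baz><plugh>ohai</plugh></baz>")
p child_summary("<baz>\n<plugh>ohai</plugh>\n</baz>", skip_blank: true)
```

The first call reports a text node on either side of plugh, while the compact form and the filtered form report only the element.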

Locating the node by value containing whitespaces using XPath

I need to locate the node within an xml file by its value using XPath.
The problem arises when the node to find contains a value with whitespace inside.
For example:
<Root>
<Child>value</Child>
<Child>value with spaces</Child>
</Root>
I can not construct the XPath locating the second Child node.
Simple XPath /Root/Child perfectly works for both children, but /Root[Child=value with spaces] returns an empty collection.
I have already tried masking spaces with %20, &#20;, and &nbsp;, and using single and double quotes.
Still no luck.
Does anybody have an idea?
Depending on your exact situation, there are different XPath expressions that will select the node, whose value contains some whitespace.
First, let us recall that any one of these characters is "whitespace":
&#x9; -- the Tab
&#xA; -- newline
&#xD; -- carriage return
' ' or &#x20; -- the space
If you know the exact value of the node, say it is "Hello World" with a space, then the most direct XPath expression:
/top/aChild[. = 'Hello World']
will select this node.
The difficulties with specifying a value that contains whitespace, however, come from the fact that we see all whitespace characters just as ... well, whitespace, and don't know if it is a group of spaces or a single tab.
In XPath 2.0 one may use regular expressions and they provide a simple and convenient solution. Thus we can use an XPath 2.0 expression as the one below:
/*/aChild[matches(., "Hello\sWorld")]
to select any child of the top node, whose value is the string "Hello" followed by whitespace followed by the string "World". Note the use of the matches() function and of the "\s" pattern that matches whitespace.
In XPath 1.0 a convenient test if a given string contains any whitespace characters is:
not(string-length(.) = string-length(translate(., ' &#x9;&#xA;&#xD;', '')))
Here we use the translate() function to eliminate any of the four whitespace characters, and compare the length of the resulting string to that of the original string.
So, if in a text editor a node's value is displayed as
"Hello World",
we can safely select this node with the XPath expression:
/*/aChild[translate(., ' &#x9;&#xA;&#xD;', '') = 'HelloWorld']
In many cases we can also use the XPath function normalize-space(), which from its string argument produces another string in which leading and trailing whitespace is removed and every run of whitespace within the string is replaced by a single space.
In the above case, we will simply use the following XPath expression:
/*/aChild[normalize-space() = 'Hello World']
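To make concrete what these two functions compute, here is a plain-Ruby emulation of them. It is not an XPath evaluation, just the same string operations written out; the constant and method names are made up for illustration.

```ruby
# The four characters XPath treats as whitespace: space, tab, newline, carriage return.
XPATH_WHITESPACE = " \t\n\r"

# Mirrors the translate() call that removes the four whitespace characters.
def strip_all_whitespace(value)
  value.delete(XPATH_WHITESPACE)
end

# Mirrors normalize-space(): collapses each whitespace run to one space,
# then drops the single space left at either end.
def normalize_space(value)
  value.gsub(/[ \t\n\r]+/, ' ').sub(/\A /, '').sub(/ \z/, '')
end

value = "  Hello \t\n World "
p strip_all_whitespace(value)
p normalize_space(value)
p strip_all_whitespace(value).length != value.length  # the "contains whitespace" test
```

The translate()-style version loses the word boundary entirely, whereas the normalize-space()-style version keeps exactly one space between the words, which is usually the more convenient value to compare against.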
Try either this:
/Root/Child[normalize-space(text())='value without spaces']
or
/Root/Child[contains(text(),'value without spaces')]
or (since it looks like your test value may be the issue)
/Root/Child[normalize-space(text())=normalize-space('value with spaces')]
Haven't actually executed any of these so the syntax may be wonky.
Locating the Attribute by value containing whitespaces using XPath
I have an input element whose value contains whitespace.
e.g.:
<input type="button" value="Import Selected File">
I solved this by using this XPath expression:
//input[contains(@value,'Import') and contains(@value,'Selected') and contains(@value,'File')]
Hope this will help you guys.
"_x0020_" worked for me on a Jackrabbit-based CQ5/AEM repository in which the property names had spaces. The following would work for a property named "Record ID":
[(jcr:contains(jcr:content/@Record_x0020_ID, 'test'))]
Did you try &#x20;?
I found this through a search (it was around the second result):
try replacing the space with "x0020".
It seems to have worked for the person who posted it.
None of the above solutions really worked for me.
However, there's a much simpler solution.
When you create the XmlDocument, make sure you set the PreserveWhitespace property to true:
XmlDocument xmldoc = new XmlDocument();
xmldoc.PreserveWhitespace = true;
xmldoc.Load(xmlCollection);
