Encode content as CDATA in generated RSS feed - ruby

I'm generating an RSS feed using Ruby's built-in RSS library, which seems to escape HTML when generating feeds. For certain elements I'd prefer that it preserve the original HTML by wrapping it in a CDATA block.
A minimal working example:
require 'rss/2.0'
feed = RSS::Rss.new("2.0")
feed.channel = RSS::Rss::Channel.new
feed.channel.title = "Title & Show"
feed.channel.link = "http://foo.net"
feed.channel.description = "<strong>Description</strong>"
item = RSS::Rss::Channel::Item.new
item.title = "Foo & Bar"
item.description = "<strong>About</strong>"
feed.channel.items << item
puts feed
...which generates the following RSS:
<?xml version="1.0"?>
<rss version="2.0">
<channel>
<title>Title &amp; Show</title>
<link>http://foo.net</link>
<description>&lt;strong&gt;Description&lt;/strong&gt;</description>
<item>
<title>Foo &amp; Bar</title>
<description>&lt;strong&gt;About&lt;/strong&gt;</description>
</item>
</channel>
</rss>
Instead of HTML-encoding the channel and item descriptions, I'd like to keep the original HTML and wrap them in CDATA blocks, e.g.:
<description><![CDATA[<strong>Description</strong>]]></description>

Monkey-patching the element-generating method works for me:
require 'rss/2.0'
class RSS::Rss::Channel
def description_element need_convert, indent
markup = "#{indent}<description>"
markup << "<![CDATA[#{@description}]]>"
markup << "</description>"
markup
end
end
# ...
This prevents the call to Utils.html_escape, which escapes a few special characters.
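One thing to watch with this approach: a CDATA section ends at the first "]]>", so content that happens to contain that sequence would break the feed. A minimal sketch of a wrapper that guards against it follows. The helper name cdata_wrap is hypothetical and not part of the RSS library.

```ruby
# Hypothetical helper (not part of Ruby's RSS library): wrap text in a
# CDATA section, splitting any "]]>" in the content across two sections
# so the content cannot terminate the section early.
def cdata_wrap(text)
  safe = text.to_s.gsub("]]>", "]]]]><![CDATA[>")
  "<![CDATA[#{safe}]]>"
end

puts cdata_wrap("<strong>About</strong>")
# <![CDATA[<strong>About</strong>]]>
puts cdata_wrap("a ]]> b")
# <![CDATA[a ]]]]><![CDATA[> b]]>
```

In the monkey-patch, the interpolated CDATA line could call such a helper on @description instead of embedding the value directly.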

Related

Create non-self-closed empty tag with Nokogiri

When I try to create an XML document with Nokogiri::XML::Builder:
builder = Nokogiri::XML::Builder.new do |xml|
xml.my_tag({key: :value})
end
I get the following XML tag:
<my_tag key="value"/>
It is self-closed, but I need the full form:
<my_tag key="value"></my_tag>
When I pass a value inside the node (or even a space):
xml.my_tag("content", key: :value)
xml.my_tag(" ", key: :value)
It generates the full tag:
<my_tag key="value">content</my_tag>
<my_tag key="value"> </my_tag>
But if I pass either an empty string or nil, or even an empty block:
xml.my_tag("", key: :value)
It generates a self-closed tag:
<my_tag key="value"/>
I believe there should be some option or attribute that controls this, but a quick search didn't turn up an answer.
I found a possible solution in "Building blank XML tags with Nokogiri?" but it saves all tags as non-self-closed.
You can use Nokogiri's NO_EMPTY_TAGS save option. (XML calls self-closing tags empty-element tags.)
builder = Nokogiri::XML::Builder.new do |xml|
xml.my_tag({key: :value})
end
puts builder.to_xml(save_with: Nokogiri::XML::Node::SaveOptions::NO_EMPTY_TAGS)
<?xml version="1.0"?>
<my_tag key="value"></my_tag>
Each of the options is represented in a bit, so you can mix and match the ones you want. For example, setting NO_EMPTY_TAGS by itself will leave your XML on one line without spacing or indentation. If you still want it formatted for humans, you can bitwise or (|) it with the FORMAT option.
builder = Nokogiri::XML::Builder.new do |xml|
xml.my_tag({key: :value}) do |my_tag|
my_tag.nested({another: :value})
end
end
puts builder.to_xml(
save_with: Nokogiri::XML::Node::SaveOptions::NO_EMPTY_TAGS
)
puts
puts builder.to_xml(
save_with: Nokogiri::XML::Node::SaveOptions::NO_EMPTY_TAGS |
Nokogiri::XML::Node::SaveOptions::FORMAT
)
<?xml version="1.0"?>
<my_tag key="value"><nested another="value"></nested></my_tag>
<?xml version="1.0"?>
<my_tag key="value">
<nested another="value"></nested>
</my_tag>
There are also a handful of DEFAULT_* options at the end of the list that already combine options into common uses.
Your update mentions "it saves all tags as non-self-closed", as if perhaps you only want this single tag instance to be non-self-closed, and the rest to self close. Nokogiri won't produce an inconsistent document like that, but if you must, you can concatenate some XML strings together that you built with different options.

Nokogiri Builder: Replace RegEx match with XML

While using Nokogiri::XML::Builder I need to be able to generate a node that also replaces a regex match on the text with some other XML.
Currently I'm able to add additional XML inside the node. Here's an example:
def xml
Nokogiri::XML::Builder.new do |xml|
xml.chapter {
xml.para {
xml.parent.add_child("Testing[1] footnote paragraph.")
add_footnotes(xml, 'An Entry')
}
}
end.to_xml
end
# further child nodes WILL be added to footnote
def add_footnotes(xml, text)
xml.footnote text
end
which produces:
<chapter>
<para>Testing[1] footnote paragraph.<footnote>An Entry</footnote></para>
</chapter>
But I need to be able to run a regex replace on the reference [1], replacing it with the <footnote> XML, producing output like the following:
<chapter>
<para>Testing<footnote>An Entry</footnote> footnote paragraph.</para>
</chapter>
I'm making the assumption here that the add_footnotes method would receive the reference match (e.g. as $1), which would be used to pull the appropriate footnote from a collection.
That method would also be adding additional child nodes, such as the following:
<footnote>
<para>Words.</para>
<para>More words.</para>
</footnote>
Can anyone help?
Here's a spin on your code that shows how to generate the output. You'll need to refit it to your own code.
require 'nokogiri'
FOOTNOTES = {
'1' => 'An Entry'
}
child_text = "Testing[1] footnote paragraph."
pre_footnote, footnote_id, post_footnote = /^(.+)\[(\d+)\](.+)/.match(child_text).captures
doc = Nokogiri::XML::Builder.new do |xml|
xml.chapter {
xml.para {
xml.text(pre_footnote)
xml.footnote FOOTNOTES[footnote_id]
xml.text(post_footnote)
}
}
end
puts doc.to_xml
Which outputs:
<?xml version="1.0"?>
<chapter>
<para>Testing<footnote>An Entry</footnote> footnote paragraph.</para>
</chapter>
The trick is you have to grab the text preceding and following your target so you can insert those as text nodes. Then you can figure out what needs to be added. For clarity in your code you should preprocess all the text, get your variables figured out, then fall into the XML generator. Don't try to do any calculations inside the Builder block; instead, just reference variables. Think of Builder like a view in an MVC-type application if that helps.
FOOTNOTES could actually be a database lookup, a hash or some other data container.
You should also look at the << method, which lets you inject XML source, so you could pre-build the footnote XML, then loop over an array containing the various footnotes and inject them. Often it's easier to pre-process, then use gsub to treat things like [1] as placeholders. See "gsub(pattern, hash) → new_str" in the documentation, along with this example:
'hello'.gsub(/[eo]/, 'e' => 3, 'o' => '*') #=> "h3ll*"
For instance:
require 'nokogiri'
text = 'this is[1] text and[2] text'
footnotes = {
'[1]' => 'some',
'[2]' => 'more'
}
footnotes.keys.each do |k|
v = footnotes[k]
footnotes[k] = "<footnote>#{ v }</footnote>"
end
replacement_xml = text.gsub(/\[\d+\]/, footnotes) # => "this is<footnote>some</footnote> text and<footnote>more</footnote> text"
doc = Nokogiri::XML::Builder.new do |xml|
xml.chapter {
xml.para { xml.<<(replacement_xml) }
}
end
puts doc.to_xml
# >> <?xml version="1.0"?>
# >> <chapter>
# >> <para>this is<footnote>some</footnote> text and<footnote>more</footnote> text</para>
# >> </chapter>
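One caveat with gsub(pattern, hash): a match that has no key in the hash is replaced with an empty string, so an unknown reference like [9] would silently vanish. A small sketch using the block form of gsub keeps unknown markers in place. The method name expand_footnotes is hypothetical, and the footnote text is assumed to be XML-safe already.

```ruby
FOOTNOTES = { '1' => 'some', '2' => 'more' }.freeze

# Hypothetical helper: replace "[n]" markers with <footnote> markup.
# Markers with no entry in the lookup are left untouched.
# Assumes the footnote text needs no XML escaping.
def expand_footnotes(text, footnotes)
  text.gsub(/\[(\d+)\]/) do |marker|
    entry = footnotes[Regexp.last_match(1)]
    entry ? "<footnote>#{entry}</footnote>" : marker
  end
end

puts expand_footnotes('this is[1] text and[2] text, see[9]', FOOTNOTES)
# this is<footnote>some</footnote> text and<footnote>more</footnote> text, see[9]
```

The resulting string can be passed to the Builder's << method the same way as replacement_xml.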
You could also try the following:
require 'nokogiri'
def xml
Nokogiri::XML::Builder.new do |xml|
xml.chapter {
xml.para {
xml.parent.add_child("Testing[1] footnote paragraph.")
add_footnotes(xml, 'add text',"[1]")
}
}
end.to_xml
end
def add_footnotes(xml, text,ref)
string = xml.parent.child.content
xml.parent.child.content = ""
string.partition(ref).each do |txt|
next xml.text(txt) if txt != ref
xml.footnote text
end
end
puts xml
# >> <?xml version="1.0"?>
# >> <chapter>
# >> <para>Testing<footnote>add text</footnote> footnote paragraph.</para>
# >> </chapter>
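This relies on String#partition, which splits a string around the first occurrence of a separator and always returns three parts. A quick illustration in plain Ruby, with no Nokogiri involved:

```ruby
# partition returns [text before, the separator itself, text after].
before, marker, after = "Testing[1] footnote paragraph.".partition("[1]")

p before # => "Testing"
p marker # => "[1]"
p after  # => " footnote paragraph."

# When the separator is missing, the whole string comes back first,
# followed by two empty strings.
p "no marker here".partition("[1]") # => ["no marker here", "", ""]
```

Note that only the first occurrence is handled, so a paragraph with several references would need a loop or a different approach.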

Can't find element in cloned document

I am using Nokogiri (1.5.9 - java) in JRuby (1.6.7.2) to copy an XML template and edit it. I'm having problems finding elements in the cloned document.
lblock = doc.xpath(".//lblock[@blockName='WINDOW_LIST']").first
lblock.children = new_children # kind of NodeSet or Node
copy_doc = doc.dup( 1 ) # or dup(0)
lblock = copy_doc.xpath(".//lblock[@blockName='WINDOW_LIST']").first # nil
When I print the document with to_s or to_xml, the lblock is there with new_children.
Where is my mistake?
I can't duplicate the problem:
require 'nokogiri'
new_children = Nokogiri::XML::DocumentFragment.parse('<foo>bar</foo>')
doc = Nokogiri::XML(<<EOF)
<xml>
<lblock blockName="WINDOW_LIST" />
</xml>
EOF
lblock = doc.xpath(".//lblock[@blockName='WINDOW_LIST']").first
lblock.children = new_children # kind of NodeSet or Node
copy_doc = doc.dup(1) # or dup(0)
lblock = copy_doc.xpath(".//lblock[@blockName='WINDOW_LIST']").first # nil
puts lblock.to_xml
puts
puts doc.to_xml
Running that outputs:
<lblock blockName="WINDOW_LIST">
<foo>bar</foo>
</lblock>
<?xml version="1.0"?>
<xml>
<lblock blockName="WINDOW_LIST"><foo>bar</foo></lblock>
</xml>
That said, here's code that is cleaned up to show you some simpler ways to write it:
require 'nokogiri'
new_children = '<foo>bar</foo>'
doc = Nokogiri::XML(<<EOF)
<xml>
<lblock blockName="WINDOW_LIST" />
</xml>
EOF
lblock = doc.at_xpath('//lblock')
lblock.children = new_children
copy_doc = doc.dup(1)
lblock = copy_doc.at_css('lblock')
puts lblock.to_xml
puts
puts doc.to_xml
Running it produces the same output:
<lblock blockName="WINDOW_LIST">
<foo>bar</foo>
</lblock>
<?xml version="1.0"?>
<xml>
<lblock blockName="WINDOW_LIST"><foo>bar</foo></lblock>
</xml>
Dissecting the code:
lblock = doc.at_xpath('//lblock')
lblock = copy_doc.at_css('lblock')
These are two different ways of finding the same node. Because the sample XML is simple, I used the at_* methods, which return only the first matching node. at_xpath and at_css work with XPath and CSS selectors respectively. The generic at tries to figure out whether the string is CSS or XPath, and normally gets it right, though I have seen it fooled.
lblock.children = new_children
In this case, new_children is a String. Nokogiri is smart enough to know it should convert the string into an XML fragment before using it. This makes it very easy to modify XML or HTML documents with strings, instead of having to create DocumentFragments.

Parsing an XML file with Nokogiri to determine the path (Ruby)

My code is supposed to "guess" the path(s) that lies before the relevant text nodes in my XML file. Relevant in this case means: text nodes nested within the recurring product/person/something tag, but not text nodes that are used outside of it.
This code:
@doc, items = Nokogiri.XML(@file), []
path = []
@doc.traverse do |node|
if node.class.to_s == "Nokogiri::XML::Element"
is_path_element = false
node.children.each do |child|
is_path_element = true if child.class.to_s == "Nokogiri::XML::Element"
end
path.push(node.name) if is_path_element == true && !path.include?(node.name)
end
end
final_path = "/"+path.reverse.join("/")
works for simple XML files, for example:
<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0">
<channel>
<title>Some XML file title</title>
<description>Some XML file description</description>
<item>
<title>Some product title</title>
<brand>Some product brand</brand>
</item>
<item>
<title>Some product title</title>
<brand>Some product brand</brand>
</item>
</channel>
</rss>
puts final_path # => "/rss/channel/item"
But when it gets more complicated, how should I then approach the challenge? For example with this one:
<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0">
<channel>
<title>Some XML file title</title>
<description>Some XML file description</description>
<item>
<titles>
<title>Some product title</title>
</titles>
<brands>
<brand>Some product brand</brand>
</brands>
</item>
<item>
<titles>
<title>Some product title</title>
</titles>
<brands>
<brand>Some product brand</brand>
</brands>
</item>
</channel>
</rss>
If you are looking for a list of deepest "parent" paths in the XML, there is more than one way to view that.
Although I think your own code could be adjusted to achieve the same output, I was convinced the same thing could be achieved by using xpath. And my motivation is to get my XML skills unrusty (not used Nokogiri yet, but I will need to do so professionally soon). So here is how to get all parent paths that have just one child level beneath them, using xpath:
xml.xpath('//*[child::* and not(child::*/*)]').each { |node| puts node.path }
The output of this for your second example file is:
/rss/channel/item[1]/titles
/rss/channel/item[1]/brands
/rss/channel/item[2]/titles
/rss/channel/item[2]/brands
. . . if you took this list and gsub out the indexes, then make the array unique, then this looks a lot like the output of your loop . . .
paths = xml.xpath('//*[child::* and not(child::*/*)]').map { |node| node.path }
paths.map! { |path| path.gsub(/\[[0-9]+\]/,'') }.uniq!
=> ["/rss/channel/item/titles", "/rss/channel/item/brands"]
Or in one line:
paths = xml.xpath('//*[* and not(*/*)]').map { |node| node.path.gsub(/\[[0-9]+\]/,'') }.uniq
=> ["/rss/channel/item/titles", "/rss/channel/item/brands"]
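The normalization step can be tried without any XML at all. This sketch starts from the four path strings printed above and also shows why uniq is the safer choice over uniq! when the return value is used directly.

```ruby
# The paths as reported by node.path for the second example file.
paths = [
  '/rss/channel/item[1]/titles',
  '/rss/channel/item[1]/brands',
  '/rss/channel/item[2]/titles',
  '/rss/channel/item[2]/brands'
]

# Strip positional indexes such as "[1]", then drop duplicates.
normalized = paths.map { |path| path.gsub(/\[\d+\]/, '') }.uniq
p normalized
# => ["/rss/channel/item/titles", "/rss/channel/item/brands"]

# uniq! returns nil when there was nothing to remove, so chaining on
# its result can fail for inputs that are already unique.
p %w[/a /b].uniq! # => nil
```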
I created a library to build XPath expressions:
xpath = Jini.new
.add_path('parent')
.add_path('child')
.add_all('toys')
.add_attr('name', 'plane')
.to_s
puts xpath # -> /parent/child//toys[@name="plane"]

Nokogiri and XML Formatting When Inserting Tags

I'd like to use Nokogiri to insert nodes into an XML document. Nokogiri uses the Nokogiri::XML::Builder class to insert or create new XML.
If I create XML using the new method, I'm able to create nice, formatted XML:
builder = Nokogiri::XML::Builder.new do |xml|
xml.product {
xml.test "hi"
}
end
puts builder
outputs the following:
<?xml version="1.0"?>
<product>
<test>hi</test>
</product>
That's great, but what I want to do is add the above XML to an existing document, not create a new document. According to the Nokogiri documentation, this can be done by using the Builder's with method, like so:
builder = Nokogiri::XML::Builder.with(document.at('products')) do |xml|
xml.product {
xml.test "hi"
}
end
puts builder
When I do this, however, the XML all gets put into a single line with no indentation. It looks like this:
<products><product><test>hi</test></product></products>
Am I missing something to get it to format correctly?
Found the answer in the Nokogiri mailing list:
In XML, whitespace can be considered meaningful. If you parse a document that contains whitespace nodes, libxml2 will assume that whitespace nodes are meaningful and will not insert them for you.
You can tell libxml2 that whitespace is not meaningful by passing the "noblanks" flag to the parser. To demonstrate, here is an example that reproduces your error, then does what you want:
require 'nokogiri'
def build_from node
builder = Nokogiri::XML::Builder.with(node) do|xml|
xml.hello do
xml.world
end
end
end
xml = DATA.read
doc = Nokogiri::XML(xml)
puts build_from(doc.at('bar')).to_xml
doc = Nokogiri::XML(xml) { |x| x.noblanks }
puts build_from(doc.at('bar')).to_xml
__END__
<root>
<foo>
<bar>
<baz />
</bar>
</foo>
</root>

Resources