REXML: Equivalent of javascript-DOM's .innerHTML= - ruby

Is there a way to pass a string to an REXML::Element in such a way that the string will be parsed as XML, and the elements so found inserted into the target?

You can extend the REXML::Element class to include innerHTML as shown below.
require "rexml/element"
class REXML::Element
  def innerHTML=(xml)
    require "rexml/document"
    self.to_a.each do |e|
      self.delete e
    end
    d = REXML::Document.new "<root>#{xml}</root>"
    d.root.to_a.each do |e|
      case e
      when REXML::Text
        self.add_text e
      when REXML::Element
        self.add_element e
      else
        puts "ERROR"
      end
    end
    xml
  end
  def innerHTML
    ret = ''
    self.to_a.each do |e|
      ret += e.to_s
    end
    ret
  end
end
You can then use innerHTML as you would in javascript (more or less).
require "rexml/document"
doc = REXML::Document.new "<xml><alice><b>bob</b><chuck>ch<u>u</u>ck</chuck></alice><alice/></xml>"
c = doc.root.get_elements('//chuck').first
t = c.innerHTML
c.innerHTML = "#{t}<david>#{t}</david>"
c = doc.root.get_elements('//alice').last
c.innerHTML = "<david>#{t}</david>"
doc.write( $stdout, 2 )
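If reopening REXML::Element is undesirable, the same idea can be sketched as a standalone helper. The name replace_inner_xml is hypothetical (it is not part of REXML), and the sketch assumes the fragment is well-formed XML once it is wrapped in a temporary root element:

```ruby
require "rexml/document"

# Hypothetical helper (not part of REXML): replace the children of +element+
# with the nodes parsed from +xml_fragment+.
def replace_inner_xml(element, xml_fragment)
  # Wrap the fragment so it parses even when it has several top-level nodes.
  wrapper = REXML::Document.new("<wrapper>#{xml_fragment}</wrapper>").root
  # Iterate over copies of the child lists, because both loops mutate them.
  element.to_a.each { |child| element.delete(child) }
  wrapper.to_a.each { |child| element.add(child) }
  element
end

doc = REXML::Document.new("<list><item>old</item></list>")
replace_inner_xml(doc.root, "<item>one</item>text<item>two</item>")
puts doc.root.to_s
```

Unlike the monkey patch, this keeps every child node type the parser produced (comments included) instead of printing "ERROR" for anything that is not text or an element.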

It would help if you could provide an example to further illustrate exactly what you had in mind.
With JS innerHTML you can insert text or HTML in one shot and changes are immediately displayed in the HTML document. The only way I know how to do this in REXML is with separate steps for inserting content/elements and saving/reloading the document.
To modify the text of a specific REXML Element you can use the text=() method.
# e represents a REXML Element
e.text = "blah"
If you want to insert another element you have to use the add_element() method.
# e represents a REXML Element
e.add_element('blah') # adds <blah></blah> to the existing element
b = e.get_elements('blah').first # get_elements returns an Array, so take the first match
b.text = 'some text' # add some text to Element blah
Then of course save the XML document with the changes. ruby-doc.org/REXML/Element

text() will return the element's first text node as a string (or nil if the element has no text).
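One caveat with text(): in REXML it returns only the first text child of the element, not the text of nested elements. A minimal sketch of the difference follows; the helper all_text is hypothetical and not part of REXML:

```ruby
require "rexml/document"

# Hypothetical helper (not part of REXML): concatenate every text node
# below +node+, in document order.
def all_text(node)
  node.to_a.map do |child|
    case child
    when REXML::Text    then child.value
    when REXML::Element then all_text(child)
    else ""             # comments, processing instructions, ...
    end
  end.join
end

doc = REXML::Document.new("<xml><chuck>ch<u>u</u>ck</chuck></xml>")
chuck = doc.root.elements["chuck"]

first_text = chuck.text       # only the first text child
full_text  = all_text(chuck)  # all text, including nested elements
puts first_text
puts full_text
```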

Related

how to use nokogiri to parse xml file for specific values?

I have an xml file from which I need to extract all values that contain https://www.example.com/a/b:
<xml>
<url><loc>https://www.example.com/a/b</loc></url>
<url><loc>https://www.example.com/b/c</loc></url>
<url><loc>https://www.example.com/a/b/c</loc></url>
<url><loc>https://www.example.com/c/d</loc></url>
</xml>
Given the above, this should return two results. I've opened the file and parsed it with Nokogiri, but I do not understand how to access the values of the //loc key.
require 'nokogiri'
require 'open-uri'
doc = File.open('./sitemap-en.xml') { |f| Nokogiri::XML(f) }
puts doc.xpath('//loc')
The above code prints every loc element in the file, but I want it pared down so that I get only the URLs under the /a/b subdirectory. How can I do this?
Both of the following solutions assume the following:
require 'nokogiri'
xml = <<-XML
<xml>
<url><loc>https://www.example.com/a/b</loc></url>
<url><loc>https://www.example.com/b/c</loc></url>
<url><loc>https://www.example.com/a/b/c</loc></url>
<url><loc>https://www.example.com/c/d</loc></url>
</xml>
XML
doc = Nokogiri::XML(xml)
To return a list of all loc elements, select only those whose inner text begins with https://www.example.com/a/b, and print the URL text:
elements = doc.xpath("//loc")
filtered_elements = elements.select do |element|
element.text.start_with? 'https://www.example.com/a/b'
end
filtered_elements.each do |element|
puts element.text
end
To capture a list of loc elements whose inner text contains the string https://www.example.com/a/b and print each URL:
elements = doc.xpath("//loc[contains(text(), 'https://www.example.com/a/b')]")
elements.each do |element|
puts element.text
end
To quickly print the URLs using a slightly modified version of the previous XPath query:
puts doc.xpath("//loc[contains(text(), 'https://www.example.com/a/b')]/text()")

Using Nokogiri, how to convert html to text respecting block elements (ensuring they result in line breaks)

The Nokogiri #content method does not convert block elements into paragraphs, for example:
fragment = 'hell<span>o</span><p>world<p>I am Josh</p></p>'
Nokogiri::HTML(fragment).content
=> "helloworldI am Josh"
I would expect output:
=> "hello\n\nworld\n\nI am Josh"
How can I convert HTML to text so that block elements result in line breaks and inline elements are joined without any added space?
You can use #before and #after to add newlines:
doc.search('p,div,br').each{ |e| e.after "\n" }
This is my solution:
fragment = 'hell<span>o</span><p>world<p>I am Josh</p></p>'
HtmlToText.process(fragment)
=> "hello\n\nworld\n\nI am Josh"
I traverse the nokogiri tree, building a text string as I go, wrap the text in "\n\n" for block elements and "" for inline elements. Then gsub to clean up the abundance of \n chars at the end. It's hacky but works.
require 'nokogiri'
class HtmlToText
  class << self
    def process html
      nokogiri = Nokogiri::HTML(html)
      text = ''
      nokogiri.traverse do |el|
        if el.class == Nokogiri::XML::Element
          sep = inline_element?(el) ? "" : "\n"
          if el.children.length <= 0
            text += "#{sep}"
          else
            text = "#{sep}#{sep}#{text}#{sep}#{sep}"
          end
        elsif el.class == Nokogiri::XML::Text
          text += el.text
        end
      end
      text.gsub(/\n{3,}/, "\n\n").gsub(/(\A\n+)|(\n+\z)/, "")
    end

    private

    def inline_element? el
      # respond_to? instead of ActiveSupport's try, so this runs in plain Ruby
      el && el.respond_to?(:name) && inline_elements.include?(el.name)
    end

    def inline_elements
      %w(
        a abbr acronym b bdo big br button cite code dfn em i img input
        kbd label map object q samp script select small span strong sub
        sup textarea time tt var
      )
    end
  end
end

Adding a XML Element to a Nokogiri::XML::Builder document

How can I add a Nokogiri::XML::Element to an XML document that is being created with Nokogiri::XML::Builder?
My current solution is to serialize the element and use the << method to have the Builder reinterpret it.
orig_doc = Nokogiri::XML('<root xmlns="foobar"><a>test</a></root>')
node = orig_doc.at('/*/*[1]')
puts Nokogiri::XML::Builder.new do |doc|
doc.another {
# FIXME: this is the round-trip I would like to avoid
xml_text = node.to_xml(:skip_instruct => true).to_s
doc << xml_text
doc.second("hi")
}
end.to_xml
# The expected result is
#
# <another>
# <a xmlns="foobar">test</a>
# <second>hi</second>
# </another>
However, the Nokogiri::XML::Element is quite a big node (on the order of kilobytes and thousands of nodes) and this code is in the hot path. Profiling shows that the serialization/parsing round trip is very expensive.
How can I instruct the Nokogiri Builder to add the existing XML element node in the "current" position?
Without using a private method you can get a handle on the current parent element using the parent method of the Builder instance. Then you can append an element to that (even from another document). For example:
require 'nokogiri'
doc1 = Nokogiri.XML('<r><a>success!</a></r>')
a = doc1.at('a')
# note that `xml` is not a Nokogiri::XML::Document,
# but rather a Nokogiri::XML::Builder instance.
doc2 = Nokogiri::XML::Builder.new do |xml|
xml.some do
xml.more do
xml.parent << a
end
end
end.doc
puts doc2
#=> <?xml version="1.0"?>
#=> <some>
#=> <more>
#=> <a>success!</a>
#=> </more>
#=> </some>
After looking at the Nokogiri source I found this fragile solution: calling the Builder's private #insert(node) method.
The code, modified to use that private method, looks like this:
doc.another {
  # pass the node itself, so no serialization round trip is needed
  doc.send(:insert, node) # <= use `#insert` instead of `<<`
  doc.second("hi")
}

How do I traverse an inner node using SAX in Nokogiri?

I'm quite new to Nokogiri and Ruby and seeking a little help.
I am parsing a very large XML file using class MyDoc < Nokogiri::XML::SAX::Document. Now I want to traverse the inner part of a block.
Here's the format of my XML file:
<Content id="83087">
<Title></Title>
<PublisherEntity id="1067">eBooksLib</PublisherEntity>
<Publisher>eBooksLib</Publisher>
......
</Content>
I can already tell when the "Content" tag is found; now I want to know how to traverse inside it. Here's my shortened code:
class MyDoc < Nokogiri::XML::SAX::Document
#check the start element. set flag for each element
def start_element name, attrs = []
if(name == 'Content')
#get the <Title>
#get the <PublisherEntity>
#get the Publisher
end
end
def cdata_block(string)
characters(string)
end
def characters(str)
puts str
end
end
Purists may disagree with me, but the way I've been doing it is to use Nokogiri to traverse the huge file, and then use XmlSimple to work with a smaller object in the file. Here's a snippet of my code:
require 'nokogiri'
require 'xmlsimple'
def isend(node)
return (node.node_type == Nokogiri::XML::Reader::TYPE_END_ELEMENT)
end
reader = Nokogiri::XML::Reader(File.open('database.xml', 'r'))
# traverse the file looking for tag "content"
reader.each do |node|
next if node.name != 'content' || isend(node)
# if we get here, then we found start of node 'content',
# so read it into an array and work with the array:
content = XmlSimple.xml_in(node.outer_xml())
title = content['title'][0]
# ...etc.
end
This works very well for me. Some may object to mixing SAX and non-SAX (nokogiri and XmlSimple) in the same code, but for my purposes, it gets the job done with minimal hassle.
It's trickier to do with SAX. I think the solution will need to look something like this:
class MyDoc < Nokogiri::XML::SAX::Document
  def start_element name, attrs = []
    @inside_content = true if name == 'Content'
    @current_element = name
  end
  def end_element name
    @inside_content = false if name == 'Content'
    @current_element = nil
  end
  def characters str
    puts "#{@current_element} - #{str}" if @inside_content && %w{Title PublisherEntity Publisher}.include?(@current_element)
  end
end

How to search an XML when parsing it using SAX in nokogiri

I have a simple but huge xml file like below. I want to parse it using SAX and only print out text between the title tag.
<root>
<site>some site</site>
<title>good title</title>
</root>
I have the following code:
require 'rubygems'
require 'nokogiri'
include Nokogiri
class PostCallbacks < XML::SAX::Document
def start_element(element, attributes)
if element == 'title'
puts "found title"
end
end
def characters(text)
puts text
end
end
parser = XML::SAX::Parser.new(PostCallbacks.new)
parser.parse_file("myfile.xml")
The problem is that it prints the text between all the tags. How can I print only the text between the title tags?
You just need to keep track of when you're inside a <title> so that characters knows when it should pay attention. Something like this (untested code) perhaps:
class PostCallbacks < XML::SAX::Document
  def initialize
    @in_title = false
  end
  def start_element(element, attributes)
    if element == 'title'
      puts "found title"
      @in_title = true
    end
  end
  def end_element(element)
    # Doesn't really matter what element we're closing unless there is nesting,
    # then you'd want "@in_title = false if element == 'title'"
    @in_title = false
  end
  def characters(text)
    puts text if @in_title
  end
end
The accepted answer above is correct, but it has a drawback: it goes through the whole XML file even if it finds <title> right at the beginning.
I had similar needs and ended up writing the saxy Ruby gem, which aims to be efficient in such situations. Under the hood it implements Nokogiri's SAX API.
Here's how you'd use it:
require 'saxy'
title = Saxy.parse(path_to_your_file, 'title').first
It stops as soon as it finds the first occurrence of the <title> tag.
