Ruby HTMLish tokenizer - ruby

I'm looking for a resource for tokenizing HTMLish markup. I'm creating a markup language that is a lot like (but isn't) HTML. All I want is something that can parse it up into tags, text, comments, etc. I don't need the tokens to be arranged into a tree structure or checked if they're valid tags or whatever - I'll do that myself.
So, for example, if given this string:
hello <x> dude <whatever></x>
it would return an array something like this:
hello
<x>
dude
<whatever>
</x>
It could also return objects representing those strings. Either would be cool.
I've looked into Nokogiri and Oga, but they seem to just want to parse and tree HTML. Suggestions?

If you're willing to do much of the validation yourself, could a regular expression work? Something like:
html = 'hello <x> dude <whatever></x>'
html.split(/(<[^<>]+>)/)
#=> ["hello ", "<x>", " dude ", "<whatever>", "", "</x>"]
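If the empty strings and the lack of token types get in the way, the same idea can be expressed with scan instead of split. The following is only a sketch: the TOKEN_PATTERN constant, the tokenize helper, and the type symbols are made-up names for illustration, and the pattern assumes tags never contain a literal < or > inside attribute values.

```ruby
# Hypothetical pattern: a comment, or a tag, or a run of text up to the next "<".
TOKEN_PATTERN = /<!--.*?-->|<[^<>]+>|[^<]+/m

# Hypothetical helper: returns [type, text] pairs in document order.
def tokenize(markup)
  markup.scan(TOKEN_PATTERN).map do |text|
    type =
      if text.start_with?('<!--')
        :comment
      elsif text.start_with?('</')
        :close_tag
      elsif text.start_with?('<')
        :open_tag
      else
        :text
      end
    [type, text]
  end
end

p tokenize('hello <x> dude <whatever></x>')
#=> [[:text, "hello "], [:open_tag, "<x>"], [:text, " dude "], [:open_tag, "<whatever>"], [:close_tag, "</x>"]]
```

Because scan only returns what matched, there are no empty strings to filter out. A stray < that never closes is silently skipped, so validation would still be up to you.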
Otherwise, I wonder: could your markup be XMLish rather than HTMLish? For example, do you need to support void elements like <whatever>, or would it be enough to support self-closing tags like <whatever />? That is, are you committed to supporting markup like hello <x> dude <whatever></x>, or would supporting hello <x> dude <whatever /></x> (with the self-closing <whatever />) be enough?
If self-closing tags are enough, it sounds like an XML parser could do the trick. Even if the parser builds a tree, you can usually flatten that into an array.
If you need custom void elements, you may need to find an HTML parser that supports those. I don't know any offhand, but it should be possible to modify Oga to do that. You could also modify Oga to support flattening a tree into an array. Something like:
module Oga
  module XML
    # Redefine the list of void elements.
    remove_const :HTML_VOID_ELEMENTS
    const_set :HTML_VOID_ELEMENTS, Whitelist.new(%w{
      whatever
    })
    class TokenGenerator < Generator
      def initialize(*args)
        super
        @tokens = []
      end
      %i[
        on_element on_text on_cdata on_comment on_xml_declaration
        on_processing_instruction on_doctype on_document
        after_element
      ].each do |method|
        define_method method do |content, output|
          token = super(content, '')
          @tokens << token if token
          super(content, output)
        end
      end
      def to_tokens
        @tokens = []
        to_xml
        @tokens
      end
    end
  end
end
html = Oga.parse_html('hello <x> dude <whatever></x>')
Oga::XML::TokenGenerator.new(html).to_tokens
=> ["hello ", "<x>", " dude ", "<whatever>", "</x>"]

Related

Parsing XML with Ruby

I'm way new to working with XML but just had a need dropped in my lap. I have been given an unusual (to me) XML format. There are colons within the tags.
<THING1:things type="Container">
<PART1:Id type="Property">1234</PART1:Id>
<PART1:Name type="Property">The Name</PART1:Name>
</THING1:things>
It is a large file and there is much more to it than this but I hope this format will be familiar to someone. Does anyone know a way to approach an XML document of this sort?
I'd rather not just write a brute-force way of parsing the text but I can't seem to make any headway with REXML or Hpricot and I suspect it is due to these unusual tags.
my ruby code:
require 'hpricot'
xml = File.open( "myfile.xml" )
doc = Hpricot::XML( xml )
(doc/:things).each do |thg|
  [ 'Id', 'Name' ].each do |el|
    puts "#{el}: #{thg.at(el).innerHTML}"
  end
end
...which is just lifted from: http://railstips.org/blog/archives/2006/12/09/parsing-xml-with-hpricot/
And I figured I would be able to figure some stuff out from here but this code returns nothing. It doesn't error. It just returns.
As @pguardiario mentioned, Nokogiri is the de facto XML and HTML parsing library. If you wanted to print out the Id and Name values in your example, here is how you would do it:
require 'nokogiri'
xml_str = <<EOF
<THING1:things type="Container">
<PART1:Id type="Property">1234</PART1:Id>
<PART1:Name type="Property">The Name</PART1:Name>
</THING1:things>
EOF
doc = Nokogiri::XML(xml_str)
thing = doc.at_xpath('//things')
puts "ID = " + thing.at_xpath('//Id').content
puts "Name = " + thing.at_xpath('//Name').content
A few notes:
at_xpath is for matching one thing. If you know you have multiple items, you want to use xpath instead.
Depending on your document, namespaces can be problematic, so calling doc.remove_namespaces! can help (see this answer for a brief discussion).
You can use the css methods instead of xpath if you're more comfortable with those.
Definitely play around with this in irb or pry to investigate methods.
Resources
Parsing an HTML/XML document
Getting started with Nokogiri
Update
To handle multiple items, you need a root element, and you need to remove the // in the xpath query.
require 'nokogiri'
xml_str = <<EOF
<root>
<THING1:things type="Container">
<PART1:Id type="Property">1234</PART1:Id>
<PART1:Name type="Property">The Name1</PART1:Name>
</THING1:things>
<THING2:things type="Container">
<PART2:Id type="Property">2234</PART2:Id>
<PART2:Name type="Property">The Name2</PART2:Name>
</THING2:things>
</root>
EOF
doc = Nokogiri::XML(xml_str)
doc.xpath('//things').each do |thing|
  puts "ID = " + thing.at_xpath('Id').content
  puts "Name = " + thing.at_xpath('Name').content
end
This will give you:
ID = 1234
Name = The Name1
ID = 2234
Name = The Name2
If you are more familiar with CSS selectors, you can use this nearly identical bit of code:
doc.css('things').each do |thing|
  puts "ID = " + thing.at_css('Id').content
  puts "Name = " + thing.at_css('Name').content
end
If you are in a Rails environment, the Hash class is extended and you can take advantage of the from_xml method:
xml = File.open("myfile.xml")
data = Hash.from_xml(xml)

Nokogiri leaving HTML entities untouched

I want Nokogiri to leave HTML entities untouched, but it seems to be converting the entities into the actual symbol. For example:
Nokogiri::HTML.fragment('<p>&reg;</p>').to_s
results in: "<p>®</p>"
Nothing seems to return the original HTML back to me.
The .inner_html, .text, .content methods all return '®' instead of '&reg;'
Is there a way for Nokogiri to leave these HTML entities untouched?
I've already searched stackoverflow and found similar questions, but nothing exactly like this one.
Not an ideal answer, but you can force it to generate entities (if not nice names) by setting the allowed encoding:
#encoding: UTF-8
require 'nokogiri'
html = Nokogiri::HTML.fragment('<p>®</p>')
puts html.to_html #=> <p>®</p>
puts html.to_html( encoding:'US-ASCII' ) #=> <p>&#xAE;</p>
It would be nice if Nokogiri used 'nice' names of entities where defined, instead of always using the terse hexadecimal entity, but even that wouldn't be 'preserving' the original.
The root of the problem is that, in HTML, the following all describe the exact same content:
<p>&reg;</p>
<p>&#xAE;</p>
<p>&#174;</p>
<p>®</p>
If you wanted the to_s representation of a text node to be actually &reg; then the markup describing that would really be: <p>&amp;reg;</p>.
If Nokogiri was to always return the same encoding per character as was used to enter the document it would need to store each character as a custom node recording the entity reference. There exists a class that might be used for this (Nokogiri::XML::EntityReference):
require 'nokogiri'
html = Nokogiri::HTML.fragment("<p>Foo</p>")
html.at('p') << Nokogiri::XML::EntityReference.new( html.document, 'reg' )
puts html
#=> <p>Foo&reg;</p>
However, I can't find a way to cause these to be created during parsing using Nokogiri v1.4.4 or v1.5.0. Specifically, the presence or absence of Nokogiri::XML::ParseOptions::NOENT during parsing does not appear to cause one to be created:
require 'nokogiri'
html = "<p>Foo&reg;</p>"
[ Nokogiri::XML::ParseOptions::NOENT,
  Nokogiri::XML::ParseOptions::DEFAULT_HTML,
  Nokogiri::XML::ParseOptions::DEFAULT_XML,
  Nokogiri::XML::ParseOptions::STRICT
].each do |parse_option|
  p Nokogiri::HTML(html,nil,'utf-8',parse_option).at('//text()')
end
#=> #<Nokogiri::XML::Text:0x810cca48 "Foo\u00AE">
#=> #<Nokogiri::XML::Text:0x810cc624 "Foo\u00AE">
#=> #<Nokogiri::XML::Text:0x810cc228 "Foo\u00AE">
#=> #<Nokogiri::XML::Text:0x810cbe04 "Foo\u00AE">

How to parse XML to CSV where data is in attributes only

The XML file I am trying to parse has all the data contained in attributes. I found how to build the string to insert into the text file.
I have this XML file:
<ig:prescribed_item class_ref="0161-1#01-765557#1">
<ig:prescribed_property property_ref="0161-1#02-016058#1" is_required="false" combination_allowed="false" one_of_allowed="false">
<dt:measure_number_type representation_ref="0161-1#04-000005#1">
<dt:real_type>
<dt:real_format pattern="\d(1,)\.\d(1,)"/>
</dt:real_type>
<dt:prescribed_unit_of_measure UOM_ref="0161-1#05-003260#1"/>
</dt:measure_number_type>
</ig:prescribed_property>
<ig:prescribed_property property_ref="0161-1#02-016059#1" is_required="false" combination_allowed="false" one_of_allowed="false">
<dt:measure_number_type representation_ref="0161-1#04-000005#1">
<dt:real_type>
<dt:real_format pattern="\d(1,)\.\d(1,)"/>
</dt:real_type>
<dt:prescribed_unit_of_measure UOM_ref="0161-1#05-003260#1"/>
</dt:measure_number_type>
</ig:prescribed_property>
</ig:prescribed_item>
</ig:identification_guide>
And I want to parse it into a text file like this with the class ref duplicated for each property:
class_ref|property_ref|is_required|UOM_ref
0161-1#01-765557#1|0161-1#02-016058#1|false|0161-1#05-003260#1
0161-1#01-765557#1|0161-1#02-016059#1|false|0161-1#05-003260#1
This is the code I have so far:
require 'nokogiri'
doc = Nokogiri::XML(File.open("file.xml"), 'UTF-8') do |config|
  config.strict
end
content = doc.xpath("//ig:prescribed_item/@class_ref").map {|i|
  i.search("//ig:prescribed_item/ig:prescribed_property/@property_ref").map { |d| d.text }
}
puts content.inspect
content.each do |c|
  puts c.join('|')
end
I'd simplify it a bit using CSS accessors:
xml = <<EOT
<ig:prescribed_item class_ref="0161-1#01-765557#1">
<ig:prescribed_property property_ref="0161-1#02-016058#1" is_required="false" combination_allowed="false" one_of_allowed="false">
<dt:measure_number_type representation_ref="0161-1#04-000005#1">
<dt:real_type>
<dt:real_format pattern="\d(1,)\.\d(1,)"/>
</dt:real_type>
<dt:prescribed_unit_of_measure UOM_ref="0161-1#05-003260#1"/>
</dt:measure_number_type>
</ig:prescribed_property>
<ig:prescribed_property property_ref="0161-1#02-016059#1" is_required="false" combination_allowed="false" one_of_allowed="false">
<dt:measure_number_type representation_ref="0161-1#04-000005#1">
<dt:real_type>
<dt:real_format pattern="\d(1,)\.\d(1,)"/>
</dt:real_type>
<dt:prescribed_unit_of_measure UOM_ref="0161-1#05-003260#1"/>
</dt:measure_number_type>
</ig:prescribed_property>
</ig:prescribed_item>
</ig:identification_guide>
EOT
require 'nokogiri'
doc = Nokogiri::XML(xml)
data = [ %w[ class_ref property_ref is_required UOM_ref] ]
doc.css('|prescribed_item').each do |pi|
  pi.css('|prescribed_property').each do |pp|
    data << [
      pi['class_ref'],
      pp['property_ref'],
      pp['is_required'],
      pp.at_css('|prescribed_unit_of_measure')['UOM_ref']
    ]
  end
end
puts data.map{ |row| row.join('|') }
Which outputs:
class_ref|property_ref|is_required|UOM_ref
0161-1#01-765557#1|0161-1#02-016058#1|false|0161-1#05-003260#1
0161-1#01-765557#1|0161-1#02-016059#1|false|0161-1#05-003260#1
Could you explain this line in greater detail "pp.at_css('|prescribed_unit_of_measure')['UOM_ref']"
In Nokogiri, there are two types of "find a node" methods: The "search" methods return all nodes that match a particular accessor as a NodeSet, and the "at" methods return the first Node of the NodeSet which will be the first encountered Node that matched the accessor.
The "search" methods are things like search, css, xpath and /. The "at" methods are things like at, at_css, at_xpath and %. Both search and at accept either XPath or CSS accessors.
Back to pp.at_css('|prescribed_unit_of_measure')['UOM_ref']: At that point in the code pp is a local variable containing a "prescribed_property" Node. So, I'm telling the code to find the first node under pp that matches the CSS |prescribed_unit_of_measure accessor, in other words the first <dt:prescribed_unit_of_measure> tag contained by the pp node. When Nokogiri finds that node, it returns the value of the UOM_ref attribute of the node.
As an FYI, the / and % operators are aliased to search and at respectively in Nokogiri. They're part of its "Hpricot" compatibility; we used to use them a lot when Hpricot was the XML/HTML parser of choice, but they're not idiomatic for most Nokogiri developers. I suspect it's to avoid confusion with the regular use of the operators; at least it is in my case.
Also, Nokogiri's CSS accessors have some extra-special juiciness; They support namespaces, like the XPath accessors do, only they use |. Nokogiri will let us ignore the namespaces, which is what I did. You'll want to nose around in the Nokogiri docs for CSS and namespaces for more information.
There are definitely ways of parsing based on attributes.
The Engine Yard article "Getting started with Nokogiri" has a full description.
But quickly, the examples they give are:
To match "h3" tags that have a class attribute, we write:
h3[@class]
To match "h3" tags whose class attribute is equal to the string "r", we write:
h3[@class = "r"]
Using the attribute matching construct, we can modify our previous query to:
//h3[@class = "r"]/a[@class = "l"]

Rails 3 and html_safe confusion (allow pictures (smiles) in chat but deny everything else)

I have a module that replaces smilies (like ":-)") with icons:
module Smileize
  PATH = "/images/smiles"
  SMILES = [/\;\-?p/i, /\$\-?\)/, /8\-?\)/, /\>\:\-?\(/, /\:\-?\*/, /\:\-?o/i, /\:\-?c/i, /\;\-?\)/,
            /\:\-?s/i, /\:\-?\|/, /\:\-?p/i, /\:\-?D/i, /\:\-?\?/, /\:\-?\(/, /\:\-?\)/]
  def to_icon(key)
    return "<img class='smiley' src='#{PATH}/smile#{SMILES.index(key) + 1}.png'/>"
  end
  module_function :to_icon
end
class String
  def to_smile
    Smileize::SMILES.each do |smile|
      if self =~ smile
        self.gsub!(smile, Smileize.to_icon(smile))
      end
    end
    self
  end
end
To make the pictures show up, I'm using html_safe, like this:
<%= @message.text.to_smile.html_safe %>
But this doesn't suit me, because other tags are rendered as well as the pictures.
My question is: how can I display only my smilies and ignore the other tags?
I think you'll need to do it like this:
HTML encode the string.
Perform your substitution.
Mark the final result as HTML safe.
Add a helper something like this:
def expand_smilies(s)
  s = ERB::Util::html_escape(s)
  Smileize::SMILES.each do |smile|
    s.gsub!(smile, Smileize.to_icon(smile))
  end
  s.html_safe
end
And then in your ERB:
<%= expand_smilies some_text %>
ERB uses ERB::Util::html_escape to encode HTML so using it yourself makes sense if you're targeting ERB. Calling html_safe on a string returns you something that ERB will leave alone when it is HTML encoding things.
Note that there is no usable html_safe! on strings and html_safe returns an ActiveSupport::SafeBuffer rather than a String so you'll have to use a helper rather than monkey patching a new method into String. ActiveSupport does patch an html_safe! method into String but all it does is raise an exception saying "don't do that":
def html_safe!
raise "You can't call html_safe! on a String"
end
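One thing to watch with escaping first is that the escaped text can itself interact with the smiley patterns: > becomes &gt;, so a pattern that starts with > no longer matches, while the ; at the end of an entity can pair up with a following ). A variation that sidesteps this (a sketch only, not the helper above) is to split the raw text on the smilies first and escape only the pieces in between. It uses CGI.escapeHTML from the standard library so it runs outside Rails; the reduced SMILEYS table, the file names, and the expand_smilies_plain name are all hypothetical, and in a Rails helper you would still call html_safe on the result.

```ruby
require 'cgi'

# Hypothetical, reduced smiley table for illustration: pattern => image file.
SMILEYS = {
  />:-?\(/ => 'angry.png',
  /;-?\)/  => 'wink.png',
  /:-?\)/  => 'smile.png'
}.freeze

ANY_SMILEY = Regexp.union(SMILEYS.keys)

def expand_smilies_plain(text)
  # The capture group makes split keep the smilies as their own segments.
  text.split(/(#{ANY_SMILEY})/).map do |segment|
    # Look up the image for a segment that is exactly one smiley.
    file = SMILEYS.find { |pattern, _| segment.match?(/\A#{pattern}\z/) }&.last
    if file
      "<img class='smiley' src='/images/smiles/#{file}'/>"
    else
      CGI.escapeHTML(segment) # everything else is escaped, tags included
    end
  end.join
end

puts expand_smilies_plain('<b>hi</b> :-)')
```

User-supplied tags come out as harmless text, and only the img tags generated by the helper reach the page as markup.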

How to search an XML when parsing it using SAX in nokogiri

I have a simple but huge xml file like below. I want to parse it using SAX and only print out text between the title tag.
<root>
<site>some site</site>
<title>good title</title>
</root>
I have the following code:
require 'rubygems'
require 'nokogiri'
include Nokogiri
class PostCallbacks < XML::SAX::Document
  def start_element(element, attributes)
    if element == 'title'
      puts "found title"
    end
  end
  def characters(text)
    puts text
  end
end
parser = XML::SAX::Parser.new(PostCallbacks.new)
parser.parse_file("myfile.xml")
The problem is that it prints the text between all the tags. How can I print only the text between the title tags?
You just need to keep track of when you're inside a <title> so that characters knows when it should pay attention. Something like this (untested code) perhaps:
class PostCallbacks < XML::SAX::Document
  def initialize
    @in_title = false
  end
  def start_element(element, attributes)
    if element == 'title'
      puts "found title"
      @in_title = true
    end
  end
  def end_element(element)
    # Doesn't really matter what element we're closing unless there is nesting,
    # then you'd want "@in_title = false if element == 'title'"
    @in_title = false
  end
  def characters(text)
    puts text if @in_title
  end
end
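For a version of the same flag-tracking idea that can be run without installing any gems, here is a sketch that swaps Nokogiri for REXML's stream parser, which ships with Ruby. This is a substitution, not the Nokogiri API: the callbacks are named tag_start, tag_end and text rather than start_element, end_element and characters. The TitleListener class name and the inline sample document are assumptions for illustration.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical listener: collects the text of every <title> element.
class TitleListener
  include REXML::StreamListener

  attr_reader :titles

  def initialize
    @titles = []
    @in_title = false
  end

  def tag_start(name, _attrs)
    return unless name == 'title'
    @in_title = true
    @titles << +'' # start a fresh, mutable buffer for this title
  end

  def tag_end(name)
    # Only leave "title mode" when the title itself closes.
    @in_title = false if name == 'title'
  end

  def text(data)
    @titles.last << data if @in_title
  end
end

xml = '<root><site>some site</site><title>good title</title></root>'
listener = TitleListener.new
REXML::Document.parse_stream(xml, listener)
puts listener.titles
```

Collecting the text into an array instead of printing it from the callback keeps the listener reusable, and appending to a buffer covers the case where the parser delivers one element's text in several chunks.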
The accepted answer above is correct, however it has a drawback that it will go through the whole XML file even if it finds <title> right at the beginning.
I had similar needs and ended up writing the saxy Ruby gem, which aims to be efficient in such situations. Under the hood it implements Nokogiri's SAX API.
Here's how you'd use it:
require 'saxy'
title = Saxy.parse(path_to_your_file, 'title').first
It will stop as soon as it finds the first occurrence of the <title> tag.
