Add a class to an element with Nokogiri - ruby

Apparently Nokogiri's add_class method only works on a NodeSet, which makes this code invalid:
doc.search('a').each do |anchor|
  anchor.inner_text = "hello!"
  anchor.add_class("whatever") # WHOOPS!
end
What can I do to make this code work? I figured it'd be something like
doc.search('a').each do |anchor|
  anchor.inner_text = "hello!"
  Nokogiri::XML::NodeSet.new(anchor).add_class("whatever")
end
but this doesn't work either. Please tell me I don't have to implement my own add_class for single nodes!

A CSS class is just another attribute on an element:
doc.search('a').each do |anchor|
  anchor.inner_text = "hello!"
  anchor['class'] = "whatever"
end
Since CSS classes are space-delimited within the attribute, if the element might already have one or more classes you'll need something like
anchor['class'] ||= ""
anchor['class'] = anchor['class'] << " whatever"
You need to set the attribute explicitly with = rather than just mutating the string returned for the attribute. This, for example, will not change the DOM:
anchor['class'] ||= ""
anchor['class'] << " whatever"
Even though it results in more work being done, I'd probably do this like so:
class Nokogiri::XML::Node
  def add_css_class( *classes )
    existing = (self['class'] || "").split(/\s+/)
    self['class'] = existing.concat(classes).uniq.join(" ")
  end
end
If you don't want to monkey-patch the class, you could alternatively:
module ClassMutator
  def add_css_class( *classes )
    existing = (self['class'] || "").split(/\s+/)
    self['class'] = existing.concat(classes).uniq.join(" ")
  end
end
anchor.extend ClassMutator
anchor.add_css_class "whatever"
Edit: This is basically what Nokogiri does internally for the add_class method you found, as its source shows:
# File lib/nokogiri/xml/node_set.rb, line 136
def add_class name
  each do |el|
    next unless el.respond_to? :get_attribute
    classes = el.get_attribute('class').to_s.split(" ")
    el.set_attribute('class', classes.push(name).uniq.join(" "))
  end
  self
end
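The split/uniq/join step shared by the snippets above can be tried without Nokogiri at all. Here is a minimal sketch on plain strings; merge_css_classes is a hypothetical helper name, not part of Nokogiri, and it uses split with no arguments so that leading or repeated whitespace doesn't leave empty entries:

```ruby
# Hypothetical helper (not a Nokogiri method): merge new class names into
# the current value of a class attribute, which may be nil.
def merge_css_classes(existing, *new_classes)
  # String#split with no arguments splits on runs of whitespace and
  # drops leading whitespace, so no empty strings end up in the list.
  current = existing.to_s.split
  (current + new_classes).uniq.join(" ")
end

puts merge_css_classes(nil, "whatever")
puts merge_css_classes("  btn  primary", "whatever", "btn")
```

With a real node you would assign the result back, e.g. anchor['class'] = merge_css_classes(anchor['class'], "whatever").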

Nokogiri's add_class works on a NodeSet, as you found. Calling it inside the each block doesn't work, though, because at that point you are working with an individual node.
Instead:
require 'nokogiri'
html = '<p>one</p><p>two</p>'
doc = Nokogiri::HTML(html)
doc.search('p').tap{ |ns| ns.add_class('boo') }.each do |n|
  puts n.text
end
puts doc.to_html
Which outputs:
# >> one
# >> two
# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html><body>
# >> <p class="boo">one</p>
# >> <p class="boo">two</p>
# >> </body></html>
The tap method, available in Ruby 1.9+, yields the NodeSet itself to the block and then returns it, which lets add_class add the "boo" class to the <p> tags before each iterates over them.
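If tap is unfamiliar, its behaviour is easy to see on a plain array. This is just an illustrative sketch; shout_all is a made-up method name:

```ruby
# tap yields the receiver to the block, ignores the block's result,
# and returns the receiver so the method chain can continue.
def shout_all(words)
  words.tap { |list| list.map!(&:upcase) }.each { |word| puts word }
end

shout_all(%w[one two])
```

The block modifies the whole collection first, and each then walks the same (already modified) object, which mirrors the add_class call above.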

Old thread, but it's the top Google hit. You can now do this with the append_class method without having to mess with space-delimiters:
doc.search('a').each do |anchor|
  anchor.inner_text = "hello!"
  anchor.append_class('whatever')
end

Related

Getting page title with Ruby

I am trying to get what's inside the title tag, but I can't get it to work. I am following some answers on Stack Overflow that are supposed to work, but for me they don't.
This is what I am doing:
require "open-uri"
require "uri"
def browse startpage, depth, block
  if depth > 0
    begin
      open(startpage){ |f|
        block.call startpage, f
      }
    rescue
      return
    end
  end
end
browse("https://www.ruby-lang.org/es/", 2, lambda { |page_name, web|
  puts "Header information:"
  puts "Title: #{web.to_s.scan(/<title>(.*?)<\/title>/)}"
  puts "Base URI: #{web.base_uri}"
  puts "Content Type: #{web.content_type}"
  puts "Charset: #{web.charset}"
  puts "-----------------------------"
})
The title output is just [], why?
open returns a File object or passes it to the block (actually a Tempfile but that doesn't matter). Calling to_s just returns a string containing the object's class and its id:
open('https://www.ruby-lang.org/es/') do |f|
  f.to_s
end
#=> "#<File:0x007ff8e23bfb68>"
Scanning that string for a title is obviously useless:
"#<File:0x007ff8e23bfb68>".scan(/<title>(.*?)<\/title>/)
#=> []
Instead, you have to read the file's content:
open('https://www.ruby-lang.org/es/') do |f|
  f.read
end
#=> "<!DOCTYPE html>\n<html>\n...</html>\n"
You can now scan the content for a <title> tag:
open('https://www.ruby-lang.org/es/') do |f|
  str = f.read
  str.scan(/<title>(.*?)<\/title>/)
end
#=> [["Lenguaje de Programaci\xC3\xB3n Ruby"]]
or, using Nokogiri (because you can't parse [X]HTML with regex):
require 'nokogiri'
open('https://www.ruby-lang.org/es/') do |f|
  doc = Nokogiri::HTML(f)
  doc.at_css('title').text
end
#=> "Lenguaje de Programación Ruby"
If you insist on using only open-uri, this one-liner will get you the page title:
2.1.4 :008 > puts open('https://www.ruby-lang.org/es/').read.scan(/<title>(.*?)<\/title>/)
Lenguaje de Programación Ruby
=> nil
If you want to do anything more complicated than this, please use Nokogiri or Mechanize. Thanks
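For completeness, here is a small sketch of the regex route applied to a string that has already been read, so it runs without a network connection. extract_title is a hypothetical helper name, and the re-tagging step assumes the body is really UTF-8 that arrived labelled as binary (which would explain the \xC3\xB3 in the output above):

```ruby
# Hypothetical helper: return the text of the first <title> in an HTML
# string, or nil if there is none.
def extract_title(html)
  # Assumption: a body tagged as binary is actually UTF-8.
  html = html.dup.force_encoding(Encoding::UTF_8) if html.encoding == Encoding::BINARY
  # /m lets . match newlines, /i tolerates <TITLE>, [^>]* tolerates attributes.
  match = html.match(/<title[^>]*>(.*?)<\/title>/mi)
  match && match[1].strip
end

sample = "<html><head><title>\n  Lenguaje de Programación Ruby\n</title></head></html>"
puts extract_title(sample)
```

This is still regex-on-HTML, so it only makes sense for quick one-off checks; a real parser remains the safer choice.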

How do I traverse an inner node using SAX in Nokogiri?

I'm quite new to Nokogiri and Ruby and seeking a little help.
I am parsing a very large XML file using class MyDoc < Nokogiri::XML::SAX::Document. Now I want to traverse the inner part of a block.
Here's the format of my XML file:
<Content id="83087">
<Title></Title>
<PublisherEntity id="1067">eBooksLib</PublisherEntity>
<Publisher>eBooksLib</Publisher>
......
</Content>
I can already tell when the "Content" tag is found; now I want to know how to traverse inside it. Here's my shortened code:
class MyDoc < Nokogiri::XML::SAX::Document
  #check the start element. set flag for each element
  def start_element name, attrs = []
    if(name == 'Content')
      #get the <Title>
      #get the <PublisherEntity>
      #get the Publisher
    end
  end
  def cdata_block(string)
    characters(string)
  end
  def characters(str)
    puts str
  end
end
Purists may disagree with me, but the way I've been doing it is to use Nokogiri to traverse the huge file, and then use XmlSimple to work with a smaller object in the file. Here's a snippet of my code:
require 'nokogiri'
require 'xmlsimple'
def isend(node)
  return (node.node_type == Nokogiri::XML::Reader::TYPE_END_ELEMENT)
end
reader = Nokogiri::XML::Reader(File.open('database.xml', 'r'))
# traverse the file looking for tag "content"
# (element names are case-sensitive, so match the exact name used in your file)
reader.each do |node|
  next if node.name != 'content' || isend(node)
  # if we get here, then we found start of node 'content',
  # so read it into a hash and work with the hash:
  content = XmlSimple.xml_in(node.outer_xml())
  title = content['title'][0]
  # ...etc.
end
This works very well for me. Some may object to mixing streaming and non-streaming parsing (Nokogiri's Reader and XmlSimple) in the same code, but for my purposes, it gets the job done with minimal hassle.
It's trickier to do with SAX. I think the solution will need to look something like this:
class MyDoc < Nokogiri::XML::SAX::Document
  def start_element name, attrs = []
    @inside_content = true if name == 'Content'
    @current_element = name
  end
  def end_element name
    @inside_content = false if name == 'Content'
    @current_element = nil
  end
  def characters str
    puts "#{@current_element} - #{str}" if @inside_content && %w{Title PublisherEntity Publisher}.include?(@current_element)
  end
end
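To take that idea one step further and collect values instead of printing them, something like the following might work. It is only a sketch: ContentCollector and demo_records are made-up names, the events are fed in by hand so it runs without any gem, and it assumes attributes arrive as an array of [name, value] pairs. With Nokogiri the class would subclass Nokogiri::XML::SAX::Document and be passed to a SAX parser instead.

```ruby
# Hypothetical handler: gathers selected child elements of each <Content>.
class ContentCollector
  WANTED = %w[Title PublisherEntity Publisher].freeze
  attr_reader :records

  def initialize
    @records = []
    @current = nil # hash for the <Content> being read, nil outside one
    @field = nil   # name of the wanted child we are inside, if any
  end

  def start_element(name, attrs = [])
    if name == 'Content'
      @current = { 'id' => attrs.to_h['id'] }
    elsif @current && WANTED.include?(name)
      @field = name
      @current[name] = +''
    end
  end

  # A SAX parser may deliver one text node in several chunks, so append.
  def characters(str)
    @current[@field] << str if @current && @field
  end

  def end_element(name)
    @field = nil if name == @field
    return unless name == 'Content'
    @records << @current
    @current = nil
  end
end

# Hand-fed events standing in for a real parse of one <Content> block.
def demo_records
  handler = ContentCollector.new
  handler.start_element('Content', [['id', '83087']])
  handler.start_element('Title')
  handler.end_element('Title')
  handler.start_element('Publisher')
  handler.characters('eBooks')
  handler.characters('Lib')
  handler.end_element('Publisher')
  handler.end_element('Content')
  handler.records
end

p demo_records
```

Each finished <Content> becomes one hash, so memory use stays proportional to a single record rather than the whole file, as long as you process and discard the records as you go.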

Parse REXML Document, ignoring whitespace

Should REXML ignore indentation or whitespace?
I am debugging an issue with a simple HTML-to-Markdown converter. For some reason it fails on
<blockquote><p>foo</p></blockquote>
But not on
<blockquote>
<p>foo</p>
</blockquote>
The reason is that in the first case type.children.first.value is not set, whereas in the latter case it is.
The original code can be found at the link above, but a condensed snippet that shows the problem is below:
require 'rexml/document'
include REXML
def parse_string(string)
  doc = Document.new("<root>\n"+string+"\n</root>")
  root = doc.root
  root.elements.each do |element|
    parse_element(element, :root)
  end
end
def parse_element(element, parent)
  @output = ''
  # ...
  @output << opening(element, parent)
  #...
end
def opening(type, parent)
  case type.name.to_sym
  #...
  when :blockquote
    # remove leading newline
    type.children.first.value = ""
    "> "
  end
end
#Parses just fine
puts parse_string("<blockquote>\n<p>foo</p>\n</blockquote>")
# Fails with undefined method `value=' for <p> ... </>:REXML::Element (NoMethodError)
puts parse_string("<blockquote><p>foo</p></blockquote>")
I am quite certain this is due to some parameter that makes REXML require whitespace and indentation: why else would it parse the first XML differently from the latter?
Can I force REXML to parse both the same? Or am I looking at a whole different kind of bug?
Try passing the option :ignore_whitespace_nodes=>:all to Document.new().
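A quick way to see what that option changes is to compare the first child of the <blockquote> with and without it. This is a rough sketch; first_child_class is a made-up helper, and on Ruby 3.0+ rexml is a bundled gem, so it may need installing separately in some setups:

```ruby
require 'rexml/document'

# Hypothetical helper: parse the XML and report the class of the root
# element's first child node.
def first_child_class(xml, context = {})
  REXML::Document.new(xml, context).root.children.first.class
end

compact  = "<blockquote><p>foo</p></blockquote>"
indented = "<blockquote>\n<p>foo</p>\n</blockquote>"
ignore   = { ignore_whitespace_nodes: :all }

puts first_child_class(indented)          # the "\n" text node comes first
puts first_child_class(indented, ignore)  # whitespace-only nodes are dropped
puts first_child_class(compact, ignore)
```

Note that once both inputs parse the same way, the first child is the <p> element in both cases, so the "remove leading newline" line in opening would no longer be needed (and value= would not exist on it).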

How to search an XML when parsing it using SAX in nokogiri

I have a simple but huge xml file like below. I want to parse it using SAX and only print out text between the title tag.
<root>
<site>some site</site>
<title>good title</title>
</root>
I have the following code:
require 'rubygems'
require 'nokogiri'
include Nokogiri
class PostCallbacks < XML::SAX::Document
  def start_element(element, attributes)
    if element == 'title'
      puts "found title"
    end
  end
  def characters(text)
    puts text
  end
end
parser = XML::SAX::Parser.new(PostCallbacks.new)
parser.parse_file("myfile.xml")
The problem is that it prints the text between all the tags. How can I print only the text inside the title tag?
You just need to keep track of when you're inside a <title> so that characters knows when it should pay attention. Something like this (untested code) perhaps:
class PostCallbacks < XML::SAX::Document
  def initialize
    @in_title = false
  end
  def start_element(element, attributes)
    if element == 'title'
      puts "found title"
      @in_title = true
    end
  end
  def end_element(element)
    # Doesn't really matter what element we're closing unless there is nesting,
    # then you'd want "@in_title = false if element == 'title'"
    @in_title = false
  end
  def characters(text)
    puts text if @in_title
  end
end
The accepted answer above is correct, but it has a drawback: it goes through the whole XML file even if it finds <title> right at the beginning.
I had similar needs and ended up writing the saxy Ruby gem, which aims to be efficient in such situations. Under the hood it implements Nokogiri's SAX API.
Here's how you'd use it:
require 'saxy'
title = Saxy.parse(path_to_your_file, 'title').first
It stops as soon as it finds the first occurrence of the <title> tag.
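If you'd rather not add a gem, one possible way to get the same early exit is to throw out of the handler once the first title is complete. The sketch below is not how saxy works internally; it swaps in REXML's stream parser (a bundled gem on Ruby 3.0+) in place of Nokogiri so it has fewer dependencies, and FirstTitleListener and first_title are made-up names:

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical listener: buffers the text of the first <title> and then
# aborts the parse by throwing.
class FirstTitleListener
  include REXML::StreamListener

  def initialize
    @in_title = false
    @buffer = +''
  end

  def tag_start(name, _attrs)
    @in_title = true if name == 'title'
  end

  def text(chunk)
    @buffer << chunk if @in_title
  end

  def tag_end(name)
    # throw unwinds out of the parser; nothing after this point is read.
    throw :title_found, @buffer if name == 'title'
  end
end

# Returns the first title's text, or nil if the document has no <title>.
def first_title(xml)
  catch(:title_found) do
    REXML::Document.parse_stream(xml, FirstTitleListener.new)
    nil
  end
end

puts first_title("<root><site>some site</site><title>good title</title></root>")
```

The same catch/throw pattern should carry over to a Nokogiri SAX document class, with start_element, characters and end_element in place of the REXML callbacks.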

Ruby parameterize if ... then blocks

I am parsing a text file and want to be able to extend the sets of tokens that can be recognized easily. Currently I have the following:
if line =~ /!DOCTYPE/
  puts "token doctype " + line[0,20]
  @ast[:doctype] << line
elsif line =~ /<html/
  puts "token main HTML start " + line[0,20]
  html_scanner_off = false
elsif line =~ /<head/ and not html_scanner_off
  puts "token HTML header starts " + line[0,20]
  html_header_scanner_on = true
elsif line =~ /<title/
  puts "token HTML title " + line[0,20]
  @ast[:HTML_header_title] << line
end
Is there a way to write this with a yield block, e.g. something like:
scanLine("title", :HTML_header_title, line)
?
Don't parse HTML with regexes.
That aside, there are several ways to do what you're talking about. One:
class Parser
  class Token
    attr_reader :name, :pattern, :block
    def initialize(name, pattern, block)
      @name = name
      @pattern = pattern
      @block = block
    end
    def process(line)
      @block.call(self, line)
    end
  end
  def initialize
    @tokens = []
  end
  def scanLine(line)
    # find returns nil when no token matches, so guard the call with &.
    @tokens.find {|t| line =~ t.pattern}&.process(line)
  end
  def addToken(name, pattern, &block)
    @tokens << Token.new(name, pattern, block)
  end
end
p = Parser.new
p.addToken("title", /<title/) {|token, line| puts "token #{token.name}: #{line}"}
p.scanLine('<title>This is the title</title>')
This has some limitations (like not checking for duplicate tokens), but works:
$ ruby parser.rb
token title: <title>This is the title</title>
$
If you're intending to parse HTML content, you might want to use one of the HTML parsers like Nokogiri (http://nokogiri.org/) or Hpricot (http://hpricot.com/), which are really high-quality. A roll-your-own approach will probably take longer to perfect than figuring out how to use one of these parsers.
On the other hand, if you're dealing with something that's not quite HTML and can't be parsed that way, then you'll need to roll your own somehow. There are a few Ruby parser frameworks out there that may help, but for simple tasks where performance isn't a critical factor, you can get by with a pile of regexps like you have here.
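If you do go the pile-of-regexps route, keeping the rules in a table makes it easy to extend. Here is a rough sketch of that idea; RULES, scan_line and scan_lines are names I made up, and the keys just mirror the ones in the question:

```ruby
# Hypothetical rule table: [token name, pattern, key in the ast hash].
# Adding a new token means adding one row here.
RULES = [
  [:doctype, /!DOCTYPE/i, :doctype],
  [:title,   /<title/i,   :HTML_header_title]
].freeze

# Stores the line under the first matching rule's key and returns the
# token name, or nil when no rule matches.
def scan_line(line, ast)
  name, _pattern, key = RULES.find { |_name, pattern, _key| line =~ pattern }
  return nil unless name
  ast[key] << line
  name
end

def scan_lines(lines)
  ast = Hash.new { |hash, key| hash[key] = [] }
  lines.each { |line| scan_line(line, ast) }
  ast
end

p scan_lines(["<!DOCTYPE html>", "<p>ignored</p>", "<title>T</title>"])
```

Stateful rules such as the html_scanner_off flag in the question would need a bit more than a flat table, for example a block per rule as in the Parser class above.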
