Add Nokogiri parse result to variable - ruby

I have an XML document:
<cred>
<login>Tove</login>
<pass>Jani</pass>
</cred>
My code is:
require 'nokogiri'
require 'selwet'
context "parse xml" do
  doc = Nokogiri::XML(File.open("test.xml"))
  doc.xpath("cred/login").each do |char_element|
    puts char_element.text
  end
  should "check" do
    Unit.go_to "http://www.ya.ru/"
    Unit.click '.b-inline'
    Unit.fill '[name="login"]', @login
  end
end
When I run my test I get:
Tove
0
But I want to insert the parse result into @login. How can I get variables holding the parsing result? I need to insert the login and pass values from the XML into fields on the web page.

You can get the value of login from your XML with:
@login = doc.xpath('//cred/login').text

I'd use something like this to get the values:
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<cred>
<login>Tove</login>
<pass>Jani</pass>
</cred>
EOT
login = doc.at('login').text # => "Tove"
pass = doc.at('pass').text # => "Jani"
Nokogiri makes it really easy to access values using CSS, so use it for readability when possible. The same thing can be done using XPath:
login = doc.at('//login').text # => "Tove"
pass = doc.at('//pass').text # => "Jani"
but having to add // twice to accomplish the same thing is usually wasted effort.
The important part is at, which returns the first occurrence of the target. at allows us to use either CSS or XPath, but CSS is usually less visually noisy.

Related

Nokogiri - Checking if the value of an xpath exists and is blank or not in Ruby

I have an XML file, and before I process it I need to make sure that a certain element exists and is not blank.
Here is the code I have:
CSV.open("#{csv_dir}/products.csv","w",{:force_quotes => true}) do |out|
out << headers
Dir.glob("#{xml_dir}/*.xml").each do |xml_file|
gdsn_doc = GDSNDoc.new(xml_file)
logger.info("Processing xml file #{xml_file}")
:x
@desc_exists = @gdsn_doc.xpath("//productData/description")
if !@desc_exists.empty?
row = []
headers.each do |col|
row << product[col]
end
out << row
end
end
end
The following code is not working to find the "description" element and to check whether it is blank or not:
@desc_exists = @gdsn_doc.xpath("//productData/description")
if !@desc_exists.empty?
Here is a sample of the XML file:
<productData>
<description>Chocolate biscuits </description>
</productData>
This is how I have defined the class and Nokogiri:
class GDSNDoc
def initialize(xml_file)
@doc = File.open(xml_file) {|f| Nokogiri::XML(f)}
@doc.remove_namespaces!
The code had to be moved up to an earlier stage, where Nokogiri was initialised. It doesn't get runtime errors, but it does let XML files with blank descriptions get through and it shouldn't.
class GDSNDoc
def initialize(xml_file)
@doc = File.open(xml_file) {|f| Nokogiri::XML(f)}
@doc.remove_namespaces!
desc_exists = @doc.xpath("//productData/descriptions")
if !desc_exists.empty?
You are creating your instance like this:
gdsn_doc = GDSNDoc.new(xml_file)
then use it like this:
@desc_exists = @gdsn_doc.xpath("//productData/description")
@gdsn_doc and gdsn_doc are two different things in Ruby. Try using the version without the @:
@desc_exists = gdsn_doc.xpath("//productData/description")
The basic test is to use:
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<productData>
<description>Chocolate biscuits </description>
</productData>
EOT
# using XPath selectors...
doc.xpath('//productData/description').to_html # => "<description>Chocolate biscuits </description>"
doc.xpath('//description').to_html # => "<description>Chocolate biscuits </description>"
xpath works fine when the document is parsed correctly.
I get an error "undefined method 'xpath' for nil:NilClass (NoMethodError)"
Usually this means you didn't parse the document correctly. In your case it's because you're not using the right variable:
gdsn_doc = GDSNDoc.new(xml_file)
...
@desc_exists = @gdsn_doc.xpath("//productData/description")
Note that gdsn_doc is not the same as @gdsn_doc. The latter doesn't appear to have been initialized.
@doc = File.open(xml_file) {|f| Nokogiri::XML(f)}
While that should work, it's idiomatic to write it as:
@doc = Nokogiri::XML(File.read(xml_file))
File.open(...) do ... end is preferred if you're processing inside the block and want Ruby to automatically close the file. That isn't necessary when you're simply reading and then passing the content to something else for processing, hence the use of File.read(...), which slurps the file. (Slurping isn't necessarily a good practice because it can have scalability problems, but for reasonably sized XML/HTML it's OK because it's easier to use DOM-based parsing than SAX.)
If Nokogiri doesn't raise an exception it was able to parse the content, however that still doesn't mean the content was valid. It's a good idea to check
@doc.errors
to see whether Nokogiri/libXML had to do some fix-ups on the content just to be able to parse it. Fixing the markup can change the DOM from what you expect, making it impossible to find a tag based on your assumptions for the selector. You could use xmllint or one of the XML validators to check, but Nokogiri will still have to be happy.
Nokogiri includes a command-line tool, nokogiri, that accepts a URL to the document you want to parse:
nokogiri http://example.com
It'll open IRB with the content loaded and ready for you to poke at it. It's very convenient when debugging and testing. It's also a decent way to make sure the content actually exists if you're dealing with HTML containing DHTML that loads parts of the page dynamically.

Render span-level string using Kramdown

I know that I can parse and render an HTML document with Kramdown in ruby using something like
require 'kramdown'
s = 'This is a _document_'
Kramdown::Document.new(s).to_html
# "<p>This is a <em>document</em></p>\n"
In this case, the string s may contain a full document in markdown syntax.
What I want to do, however, is to parse s assuming that it only contains span-level markdown syntax, and obtain the rendered html. In particular there should be no <p>, <blockquote>, or, e.g., <table> in the rendered html.
s = 'This is **only** a span-level string'
# .. ??? ...
# 'This is <strong>only</strong> a span-level string'
How can I do this?
I would post-process the output with the sanitize gem.
require 'kramdown'
require 'sanitize'
html = Kramdown::Document.new(s).to_html
output = Sanitize.fragment(html, elements: ['b', 'i', 'em', 'strong'])
The elements are a whitelist of allowed tags, just add all the tags you want. The gem has a set of predefined whitelists, but none match exactly what you're looking for. (BTW, if you want a list of all the HTML5 elements allowed in a span, see the WHATWG's list of "phrasing content").
I know this wasn't tagged rails, but for the benefit of readers using Rails: use the built-in sanitize helper.
You can create a custom parser, and empty its internal list of block-level parsers.
class Kramdown::Parser::SpanKramdown < Kramdown::Parser::Kramdown
def initialize(source, options)
super
@block_parsers = []
end
end
Then you can use it like this:
text = Kramdown::Document.new(text, :input => 'SpanKramdown').to_html
This should do what you want "the right way".

How do I print XPath value?

I want to print the contents of an XPath node. Here is what I have:
require "mechanize"
agent = Mechanize.new
agent.get("http://store.steampowered.com/promotion/snowglobefaq")
puts agent.xpath("//*[@id='item_52b3985a70d58']/div[4]")
This returns: <main>: undefined method xpath for #<Mechanize:0x2fa18c0> (NoMethodError).
I just started using Mechanize and have no idea what I'm doing, however, I've used Watir and thought this would work but it didn't.
You can use Nokogiri to parse the page after retrieving it. Here is the example code:
m = Mechanize.new
result = m.get("http://google.com")
html = Nokogiri::HTML(result.body)
divs = html.xpath('//div').map { |div| div.content } # here you can do whatever is needed with the divs
# I've mapped their content into an array
There are two things wrong:
The ID doesn't exist on that page. Try this to see the list of tag IDs available:
require "open-uri"
require 'nokogiri'
doc = Nokogiri::HTML(URI.open("http://store.steampowered.com/promotion/snowglobefaq"))
puts doc.search('[id*="item"]').map{ |n| n['id'] }.sort
The correct chain of methods is agent.page.xpath.
Because there is no sample HTML showing exactly which tag you want, we can't help you much.

Screen scraping with Nokogiri and Each method returning zero

I'm running the following code:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
url = "http://sfbay.craigslist.org/search/sss?query=bike&catAbb=sss&srchType=A&minAsk=&maxAsk="
doc = Nokogiri::HTML(URI.open(url))
doc.css(".row").each do |row|
row.css("a").text
end
The only thing I get returned is 0. However, when I just run doc.css(".row"), I get the entire list of rows from the CL. Why is it returning zero when I use the each method and how do I fix it?
.each doesn't return the results of the block; it's a simple iterator that runs the block for its side effects. Perhaps you are looking for .map?
This will return an array of the anchor element text:
doc.css(".row").map {|row| row.css("a").text }
You don't need to issue two different css queries; you can combine them:
doc.css(".row > a").map(&:text)

How to visit a URL with Ruby via http and read the output?

So far I have been able to stitch this together :)
begin
  URI.open("http://www.somemain.com/" + path + "/" + blah)
rescue OpenURI::HTTPError
  @failure += painting.permalink
else
  @success += painting.permalink
end
But how do I read the output of the service that I would be calling?
Open-URI provides URI.open (on Ruby versions before 3.0 it also extended Kernel#open), so you'll get a type of IO stream returned:
URI.open('http://www.example.com') #=> #<StringIO:0x00000100977420>
You have to read that to get content:
URI.open('http://www.example.com').read[0 .. 10] #=> "<!DOCTYPE h"
A lot of times a method will let you pass different types as a parameter. They check to see what it is and either use the contents directly, in the case of a string, or read the handle if it's a stream.
For HTML and XML, such as RSS feeds, we'll typically pass the handle to a parser and let it grab the content, parse it, and return an object suitable for searching further:
require 'nokogiri'
doc = Nokogiri::HTML(URI.open('http://www.example.com'))
doc.class #=> Nokogiri::HTML::Document
doc.to_html[0 .. 10] #=> "<!DOCTYPE h"
doc.at('h1').text #=> "Example Domain"
doc = URI.open("http://etc..")
content = doc.read
More often people want to parse the returned document; for this, use something like Hpricot or Nokogiri.
I'm not sure whether you want to do this yourself for the hell of it, but if you don't, Mechanize is a really nice gem for doing this.
It will visit the page you want and automatically wrap the page with Nokogiri so that you can access its elements with CSS selectors such as "div#header h1". Ryan Bates has a video tutorial on it which will teach you everything you need to know to use it.
Basically you can just
require 'rubygems'
require 'mechanize'
agent = Mechanize.new
agent.get("http://www.google.com")
agent.page.at("some css selector").text
It's that simple.
