I'm new to programming, so please excuse my inexperience. I'm using Nokogiri to scrape a police crime log. Here is the code:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
url = "http://www.sfsu.edu/~upd/crimelog/index.html"
doc = Nokogiri::HTML(open(url))
puts doc.at_css("title").text
doc.css(".brief").each do |brief|
puts brief.at_css("h3").text
end
I used the SelectorGadget bookmarklet to find the CSS selector for the log (.brief). When I pass "h3" to brief.at_css, I get all of the h3 tags with the content inside.
However, if I add the .text method to remove the tags, I get a NoMethodError.
Is there any reason why this is happening? What am I missing? Thanks!
To clarify: if you look at the structure of the HTML source, you will see that the very first occurrence of <div class="brief"> does not have a child h3 tag (it only has a child <p> tag).
The Nokogiri Docs say that
at_css(*rules)
Search this node for the first occurrence of CSS rules. Equivalent to css(rules).first See Node#css for more information.
As the docs state, at_css(*rules) is equivalent to css(rules).first. When a .brief contains an h3, a Nokogiri::XML::Element object is returned, which responds to text. When a .brief does not contain an h3, nil is returned, and NilClass does not respond to text, hence the NoMethodError.
If we call css(rules) instead (not at_css as you have), we get a Nokogiri::XML::NodeSet object, which defines text as an alias of inner_text:
# Get the inner text of all contained Node objects
def inner_text
collect{|j| j.inner_text}.join('')
end
alias :text :inner_text
Because NodeSet is Enumerable, it iterates over its nodes, calls inner_text on each, and joins the results. For an empty NodeSet the result is an empty string rather than an error.
Therefore you can either perform a nil? check or, as @floatless correctly stated, just use the css method.
You just need to replace at_css with css and everything should be okay.
I tried this:
xml_parser = Nori.new
xml_parser.parse "<FareReference ResBookDesigCode='Q'>Value</FareReference>"
And the result is:
{"FareReference"=>"Value"}
I wanted to retrieve the ResBookDesigCode value also.
Nokogiri is my recommended tool, since Nori doesn't appear to be actively supported.
require 'nokogiri'
doc = Nokogiri::XML("<FareReference ResBookDesigCode='Q'>Value</FareReference>")
doc now contains the DOM for the XML.
We can easily access the content of the FareReference node, along with its attributes:
doc.at('FareReference').text # => "Value"
doc.at('FareReference')['ResBookDesigCode'] # => "Q"
at basically means "find the first node matching that selector". The documentation and tutorials describe the sibling methods.
Using Cheezy's page-object gem, I've come across the ability to use dynamic element locators (noted in this GitHub issue: https://github.com/cheezy/page-object/issues/203).
So, for example, I can do span_element(id: 'some id'), div_element(class: 'some_class'), etc. However, what can I do if I need to locate a generic element? For example, I could be working on a page that uses Angular, so the elements are not traditional (instead of a traditional HTML select control with options, it is a custom Angular dropdown). I've tried element_element(class: 'class_name') and just element(class: 'class_name'), but Ruby reports that the method is missing.
The generic dynamic element locator is defined in PageObject::ElementLocators#element as:
def element(tag, identifier={:index => 0})
platform.element_for(tag, identifier.clone)
end
The first argument is the element's tag name. If you don't know the tag name, you can specify "element" for any tag. For example:
class MyPage
include PageObject
def do_stuff
element('element', class: 'class_name').text
end
end
I have set up a Capybara and Cucumber project with SitePrism for POM, and for the most part it works. When I use:
Then('Find object name') do
expect(@page).to have_object_name
end
it works just fine but when I come to use:
Then('Assign object names text to a variable') do
expect(@page).to have_object_name
valueA = @page.find('object_name').text
end
this doesn't work and throws an error:
Unable to find css "object_name" (Capybara::ElementNotFound)
However, if I use:
Then('Assign object names text to a variable') do
expect(@page).to have_object_name
valueA = @page.find(:xpath, 'object_name_XPath').text
end
this works fine as well, but it defeats the point of POM, as it would greatly increase maintenance.
I assume I must be missing something to get page.find to locate object_name from my page. I have searched high and low but can't figure the problem out.
Help? :)
@page.find takes a CSS selector by default (it's actually just the Capybara find method), not the name of a SitePrism element.
If @page is a SitePrism::Page or SitePrism::Section object and you have defined elements on it, then you access that element as a method on @page. See https://github.com/natritmeyer/site_prism#accessing-the-individual-element
@page.object_name
I'm still new(ish) to POM, but I've found the syntax and general structure quite strong, so now I'm looking into advanced techniques.
I have a dynamic page, and for each of the sections I am running the following code/pseudocode:
if has_SECTVAR1?
$LOG.info("Stuff")
end
if has_SECTVAR2?
$LOG.info("Stuff")
end
What I want to do is something like this.
ALLSECTIONARRAYS.each do |var|
if has_var?
$LOG.info("Stuff")
end
end
Any thoughts?
You can get an array of element names using #mapped_items. The more interesting part is checking if those exist on the page by calling #has_element?.
The abstract version of what you want to do is call a method on an object given its name as a string. To do this, use #send:
MyObject.send("method_name", *args)
Or in your case:
MyPage.send("has_element?")
Finally, to iterate over all elements:
MyPage.mapped_items.each do |item|
if MyPage.send("has_#{item}?")
$LOG.info("Stuff")
end
end
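The dynamic-dispatch idea can be tried without a browser. The sketch below uses a hypothetical FakePage class with hand-written has_*? methods and an ELEMENTS list in place of a real SitePrism page and mapped_items; it also uses public_send, a stricter variant of send that will not call private methods.

```ruby
# Stand-in for a page object (not SitePrism): the has_*? predicates are
# hard-coded here, whereas SitePrism would generate them per element.
class FakePage
  ELEMENTS = %w[header footer sidebar].freeze

  def has_header?
    true
  end

  def has_footer?
    false
  end

  def has_sidebar?
    true
  end
end

page = FakePage.new

# Build each predicate name from a string and call it dynamically.
present = FakePage::ELEMENTS.select { |name| page.public_send("has_#{name}?") }
puts present.inspect # => ["header", "sidebar"]
```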
I run the following, and it's supposed to return the company name. The XPath works in Firefox, where it returns the company name. However, in Nokogiri this isn't happening; it just returns an empty string!
require 'rubygems'
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(open('http://www.careerbuilder.com/JobSeeker/Jobs/JobDetails.aspx?IPath=QHKCV&ff=21&APath=2.21.0.0.0&job_did=J3G71D73BM9HCK1M84Z&cbRecursionCnt=1&cbsid=6d2aee1515ed404b8306d1a583592cd4-314600403-JQ-5'))
companyname = doc.xpath("/html[1]/body[1]/div[1]/div[1]/form[1]/div[1]/table[1]/tbody[1]/tr[2]/td[1]/div[1]/table[1]/tbody[1]/tr[1]/td[1]/div[1]/div[2]/table[1]/tbody[1]/tr[1]/td[2]").to_s
puts companyname
Your XPath is not correct :)
You should omit the tbody parts. They are generated by the browser, but not by Nokogiri!
doc.xpath("/html[1]/body[1]/div[1]/div[1]/form[1]/div[1]/table[1]/tr[2]/td[1]/div[1]/table[1]/tr[1]/td[1]/div[1]/div[2]/table[1]/tr[1]/td[2]").to_s
NB: Your XPath will also be more stable against changes to the HTML page if you use class or id attributes to select nodes, rather than the full path. For example, you could use
doc.xpath("//div[@class='job_desc'][1]/table[1]/tr[1]/td[2]")
or, even more simply, just use a CSS selector:
doc.css("div.job_desc td")[1]
doc.css("div.job_desc td")[1]